while preserving the value of the loss. To that end, define
$$
Q_\delta := \begin{bmatrix} \sqrt{1-\delta^2} & -\delta & 0 \\ \delta & \sqrt{1-\delta^2} & 0 \\ 0 & 0 & I_{r-2} \end{bmatrix},
$$
and let $\hat{U} := U Q_\delta$. It is easy to verify that $Q_\delta$ is indeed a rotation.
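Indeed, the top-left $2 \times 2$ block of $Q_\delta$ is a Givens rotation with $\cos\theta = \sqrt{1-\delta^2}$ and $\sin\theta = \delta$, so a direct computation gives
$$
Q_\delta^\top Q_\delta = \begin{bmatrix} (1-\delta^2) + \delta^2 & 0 & 0 \\ 0 & \delta^2 + (1-\delta^2) & 0 \\ 0 & 0 & I_{r-2} \end{bmatrix} = I_r .
$$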
First, we show that for any $\varepsilon > 0$, as long as $\delta^2 \le \frac{\varepsilon^2}{2\,\mathrm{Tr}(M)}$, we have $\hat{U} \in B_\varepsilon(U)$:
[Figure 9.1: Optimization landscape (top) and contour plot (bottom) for a single hidden-layer linear autoencoder network with one-dimensional input and output and a hidden layer of width $r = 2$ with dropout, for different values of the regularization parameter $\lambda$ ($\lambda = 0$, $\lambda = 0.6$, $\lambda = 2$). Left: for $\lambda = 0$ the problem reduces to squared-loss minimization, which is rotation invariant, as suggested by the level sets. Middle: for $\lambda > 0$ the global optima shrink toward the origin; all local minima are global and equalized, i.e. the weights are parallel to the vector $(\pm 1, \pm 1)$. Right: as $\lambda$ increases, the global optima shrink further.]
$$
\begin{aligned}
\|U - \hat{U}\|_F^2 &= \sum_{i=1}^{r} \|u_i - \hat{u}_i\|^2 \\
&= \big\|u_1 - \sqrt{1-\delta^2}\, u_1 - \delta u_2\big\|^2 + \big\|u_2 - \sqrt{1-\delta^2}\, u_2 + \delta u_1\big\|^2 \\
&= 2\big(1 - \sqrt{1-\delta^2}\big)\big(\|u_1\|^2 + \|u_2\|^2\big) \\
&\le 2\delta^2\, \mathrm{Tr}(M) \le \varepsilon^2,
\end{aligned}
$$
where the second-to-last inequality follows from Lemma 9.3.2, because $\|u_1\|^2 + \|u_2\|^2 \le \|U\|_F^2 = \mathrm{Tr}(UU^\top) \le \mathrm{Tr}(M)$, and also from the fact that
$$
1 - \sqrt{1-\delta^2} = \frac{1 - (1-\delta^2)}{1 + \sqrt{1-\delta^2}} = \frac{\delta^2}{1 + \sqrt{1-\delta^2}} \le \delta^2 .
$$
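The third equality above is a one-line expansion: the cross terms $\mp 2\delta\big(1 - \sqrt{1-\delta^2}\big)\, u_1^\top u_2$ in the two squared norms cancel, and the remaining coefficients combine as
$$
\big(1 - \sqrt{1-\delta^2}\big)^2 + \delta^2 = 2 - 2\sqrt{1-\delta^2} = 2\big(1 - \sqrt{1-\delta^2}\big).
$$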
Next, we show that for small enough $\delta$, the value of $L_\theta$ at $\hat{U}$ is strictly smaller than its value at $U$. Observe that the first two columns of $\hat{U}$ are $\hat{u}_1 = \sqrt{1-\delta^2}\, u_1 + \delta u_2$ and $\hat{u}_2 = -\delta u_1 + \sqrt{1-\delta^2}\, u_2$, so that
$$
\begin{aligned}
\|\hat{u}_1\|^2 &= (1-\delta^2)\|u_1\|^2 + \delta^2\|u_2\|^2 + 2\delta\sqrt{1-\delta^2}\, u_1^\top u_2, \\
\|\hat{u}_2\|^2 &= (1-\delta^2)\|u_2\|^2 + \delta^2\|u_1\|^2 - 2\delta\sqrt{1-\delta^2}\, u_1^\top u_2,
\end{aligned}
$$
and the remaining columns do not change, i.e. $\hat{u}_i = u_i$ for $i = 3, \ldots, r$. Together with the fact that $Q_\delta$ preserves the Frobenius norm, i.e. $\|U\|_F = \|U Q_\delta\|_F$, we get
$$
\|\hat{u}_1\|^2 + \|\hat{u}_2\|^2 = \|u_1\|^2 + \|u_2\|^2. \tag{9.13}
$$
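As a quick numerical sanity check (a sketch, not from the text; the dimensions $d$, $r$, the value of $\delta$, and the random factor $U$ below are arbitrary choices), the following verifies that $Q_\delta$ is orthogonal, that columns $3, \ldots, r$ are untouched, and that both the distance identity and (9.13) hold:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, delta = 5, 4, 0.3          # arbitrary dimensions and rotation angle
c = np.sqrt(1 - delta**2)

U = rng.standard_normal((d, r))  # a random rank-r factor

# Build Q_delta: a Givens rotation acting on the first two coordinates.
Q = np.eye(r)
Q[:2, :2] = [[c, -delta], [delta, c]]

U_hat = U @ Q

# Q_delta is a rotation, hence preserves the Frobenius norm.
assert np.allclose(Q.T @ Q, np.eye(r))
assert np.allclose(np.linalg.norm(U_hat), np.linalg.norm(U))

# Columns 3, ..., r are unchanged.
assert np.allclose(U_hat[:, 2:], U[:, 2:])

# Distance identity: ||U - U_hat||_F^2 = 2(1 - sqrt(1 - delta^2)) (||u1||^2 + ||u2||^2).
mass12 = np.linalg.norm(U[:, 0])**2 + np.linalg.norm(U[:, 1])**2
assert np.allclose(np.linalg.norm(U - U_hat)**2, 2 * (1 - c) * mass12)

# Eq. (9.13): the rotation redistributes norm between the first two
# columns but preserves their total squared norm.
lhs = np.linalg.norm(U_hat[:, 0])**2 + np.linalg.norm(U_hat[:, 1])**2
assert np.allclose(lhs, mass12)

print("All identities verified.")
```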