Theory of Deep Learning, 2022
98 theory of deep learning
rotations. Observe that
\[
\begin{aligned}
g(t) := L_\theta(U + \Delta(t)) &= L_\theta(U) + \|\sqrt{1-t^2}\,u_1 + t u_2\|^4 - \|u_1\|^4 + \|\sqrt{1-t^2}\,u_2 - t u_1\|^4 - \|u_2\|^4 \\
&= L_\theta(U) - 2t^2(\|u_1\|^4 + \|u_2\|^4) + 8t^2(u_1^\top u_2)^2 + 4t^2\|u_1\|^2\|u_2\|^2 \\
&\qquad + 4t\sqrt{1-t^2}\,(u_1^\top u_2)(\|u_1\|^2 - \|u_2\|^2) + O(t^3).
\end{aligned}
\]
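Because the rotation \(\Delta(t)\) leaves \(UU^\top\) unchanged, only the two fourth-power column-norm terms of the loss move, and the expansion above can be checked numerically. A minimal sketch (the vectors and the \(O(t^3)\) tolerance constant are illustrative, not from the text):

```python
import numpy as np

# Illustrative column pair (not from the text): a = ||u1||^2, b = ||u2||^2, c = u1^T u2.
u1 = np.array([1.0, 2.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0])
a, b, c = u1 @ u1, u2 @ u2, u1 @ u2

def exact(t):
    # Exact change of the two ||.||^4 terms under the rotation Delta(t).
    s = np.sqrt(1 - t**2)
    return (np.linalg.norm(s*u1 + t*u2)**4 - a**2
            + np.linalg.norm(s*u2 - t*u1)**4 - b**2)

def expansion(t):
    # Second-order expansion from the text; the remainder is O(t^3).
    s = np.sqrt(1 - t**2)
    return (-2*t**2*(a**2 + b**2) + 8*t**2*c**2
            + 4*t**2*a*b + 4*t*s*c*(a - b))

# The gap between exact value and expansion shrinks like t^3.
for t in (1e-2, 1e-3):
    assert abs(exact(t) - expansion(t)) < 100 * t**3
```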
The derivative of \(g\) is then
\[
\begin{aligned}
g'(t) &= -4t(\|u_1\|^4 + \|u_2\|^4) + 16t(u_1^\top u_2)^2 + 8t\|u_1\|^2\|u_2\|^2 \\
&\qquad + 4\Big(\sqrt{1-t^2} - \frac{t^2}{\sqrt{1-t^2}}\Big)(u_1^\top u_2)(\|u_1\|^2 - \|u_2\|^2) + O(t^2).
\end{aligned}
\]
Since \(U\) is a critical point and \(L_\theta\) is continuously differentiable, it must hold that
\[
g'(0) = 4(u_1^\top u_2)(\|u_1\|^2 - \|u_2\|^2) = 0.
\]
Since by assumption \(\|u_1\|^2 - \|u_2\|^2 > 0\), it must be the case that \(u_1^\top u_2 = 0\). We now consider the second-order directional derivative:
\[
\begin{aligned}
g''(0) &= -4(\|u_1\|^4 + \|u_2\|^4) + 16(u_1^\top u_2)^2 + 8\|u_1\|^2\|u_2\|^2 \\
&= -4(\|u_1\|^2 - \|u_2\|^2)^2 < 0,
\end{aligned}
\]
which completes the proof.
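Both derivative values can be confirmed by finite differences on the exact curve \(g\). A small sketch under illustrative choices (orthogonal columns with \(\|u_1\|^2 = 4 > 1 = \|u_2\|^2\); the step size and tolerances are mine, not the text's):

```python
import numpy as np

# Orthogonal columns with different norms: u1^T u2 = 0, ||u1||^2 - ||u2||^2 > 0.
u1 = np.array([2.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])
a, b = u1 @ u1, u2 @ u2   # a = 4, b = 1

def g(t):
    # Change of the two ||.||^4 terms under the rotation; the rest of
    # L_theta is unchanged because the rotation preserves UU^T.
    s = np.sqrt(1 - t**2)
    return (np.linalg.norm(s*u1 + t*u2)**4 - a**2
            + np.linalg.norm(s*u2 - t*u1)**4 - b**2)

h = 1e-4
g1 = (g(h) - g(-h)) / (2*h)           # central difference for g'(0)
g2 = (g(h) - 2*g(0) + g(-h)) / h**2   # central difference for g''(0)

assert abs(g1) < 1e-6                    # g'(0) = 0 since u1^T u2 = 0
assert abs(g2 - (-4*(a - b)**2)) < 1e-3  # g''(0) = -4(a-b)^2 = -36 < 0
```

The strictly negative second derivative is what makes such a critical point a strict saddle in the rotation direction.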
We now focus on the critical points that are equalized, i.e., points \(U\) such that \(\nabla L_\theta(U) = 0\) and \(\operatorname{diag}(U^\top U) = \frac{\|U\|_F^2}{d_1} I\).
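Concretely, a matrix is equalized exactly when all its columns have the same norm, since then each diagonal entry of \(U^\top U\) equals the average \(\|U\|_F^2 / d_1\). A quick numerical illustration (the matrix is my own example):

```python
import numpy as np

# Every column of this U has squared norm 2, so U is "equalized":
# diag(U^T U) = (||U||_F^2 / d1) * I.
U = np.array([[1.0,  0.0, 1.0],
              [1.0,  1.0, 0.0],
              [0.0, -1.0, 1.0]])
d1 = U.shape[1]

col_sq_norms = np.diag(U.T @ U)                    # diag(U^T U)
target = (np.linalg.norm(U, 'fro')**2 / d1) * np.ones(d1)
assert np.allclose(col_sq_norms, target)           # ||U||_F^2 / d1 = 6/3 = 2
```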
Lemma 9.3.7. Let \(r := \operatorname{rank}(M)\). Assume that \(d_1 \le d_0\) and \(\lambda < \frac{r\lambda_r}{\sum_{i=1}^r (\lambda_i - \lambda_r)}\). Then all equalized local minima are global. All other equalized critical points are strict saddle points.
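To make the assumption on \(\lambda\) concrete, the bound can be evaluated for a hypothetical spectrum of \(M\) (the eigenvalues below are chosen for illustration, not taken from the text):

```python
# Hypothetical eigenvalues lambda_1 >= lambda_2 >= lambda_3 of M:
lam = [3.0, 2.0, 1.0]
r, lam_r = len(lam), lam[-1]

# The bound of Lemma 9.3.7: lambda must be below r*lam_r / sum_i (lam_i - lam_r).
threshold = r * lam_r / sum(l - lam_r for l in lam)
assert threshold == 1.0   # 3*1 / ((3-1) + (2-1) + (1-1)) = 3/3
```

Note the bound loosens as the spectrum flattens (each \(\lambda_i - \lambda_r\) shrinks) and tightens when the top eigenvalues dominate \(\lambda_r\).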
Proof of Lemma 9.3.7. Let \(U\) be a critical point that is equalized. Furthermore, let \(r'\) be the rank of \(U\), and let \(U = W\Sigma V^\top\) be its rank-\(r'\) SVD, i.e. \(W \in \mathbb{R}^{d_0 \times r'}\) and \(V \in \mathbb{R}^{d_1 \times r'}\) are such that \(W^\top W = V^\top V = I_{r'}\), and \(\Sigma \in \mathbb{R}^{r' \times r'}\) is a positive definite diagonal matrix whose diagonal entries are sorted in descending order. We have:
\[
\begin{aligned}
\nabla L_\theta(U) = 4(UU^\top - M)U + 4\lambda\, U \operatorname{diag}(U^\top U) &= 0 \\
\implies UU^\top U + \lambda \frac{\|U\|_F^2}{d_1} U &= MU \\
\implies W\Sigma^3 V^\top + \lambda \frac{\|\Sigma\|_F^2}{d_1} W\Sigma V^\top &= MW\Sigma V^\top \\
\implies \Sigma^2 + \lambda \frac{\|\Sigma\|_F^2}{d_1} I &= W^\top M W
\end{aligned}
\]
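The stated gradient can be checked against finite differences, assuming the loss takes the form \(L_\theta(U) = \|UU^\top - M\|_F^2 + \lambda \sum_i \|u_i\|^4\), which is consistent with the gradient above but is an assumption here, not a formula quoted from this excerpt:

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d1, lam = 4, 3, 0.1
A = rng.standard_normal((d0, d0))
M = A + A.T                               # symmetric M
U = rng.standard_normal((d0, d1))

def loss(U):
    # Assumed loss, consistent with the stated gradient:
    # L_theta(U) = ||UU^T - M||_F^2 + lambda * sum_i ||u_i||^4
    return (np.linalg.norm(U @ U.T - M, 'fro')**2
            + lam * np.sum(np.sum(U**2, axis=0)**2))

# Closed-form gradient from the text: 4(UU^T - M)U + 4*lambda*U*diag(U^T U).
grad = 4*(U @ U.T - M) @ U + 4*lam * U @ np.diag(np.diag(U.T @ U))

# Entry-wise central finite differences of the loss.
eps = 1e-6
num = np.zeros_like(U)
for i in range(d0):
    for j in range(d1):
        E = np.zeros_like(U)
        E[i, j] = eps
        num[i, j] = (loss(U + E) - loss(U - E)) / (2*eps)

assert np.allclose(grad, num, atol=1e-4)
```

Multiplying the first-order condition on the right by \(V\Sigma^{-1}\) and then on the left by \(W^\top\) gives exactly the last displayed line, since \(W^\top W = V^\top V = I_{r'}\) and \(\|U\|_F = \|\Sigma\|_F\).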