TheoryofDeepLearning.2022
\[
U(t) = W_{d_0} \Sigma' V_{d_1}^\top,
\]
where $\Sigma' \in \mathbb{R}^{d_0 \times d_1}$ is diagonal with non-zero diagonal elements given as $\sigma_i' = \sqrt{\sigma_i^2 + t^2}$ for $i \le d_1$. Observe that
\[
U(t)^\top U(t) = V \Sigma^2 V^\top + t^2 V_{d_1}^\top V_{d_1} = U^\top U + t^2 I_{d_1}.
\]
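The Gram identity above can be checked numerically on a small instance. The following is a minimal sketch using only the Python standard library; the dimensions ($d_0 = 3$, $d_1 = 2$), the rotation angles, the singular values, and all function names are illustrative assumptions, not taken from the text.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def rotation2(theta):
    """2x2 rotation matrix (orthogonal)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def rotation3_z(theta):
    """3x3 rotation about the z-axis (orthogonal)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def curve_point(W, sigmas, V, t):
    """Return W Sigma' V^T with sigma'_i = sqrt(sigma_i^2 + t^2)."""
    d0, d1 = len(W), len(V)
    sigma_prime = [[0.0] * d1 for _ in range(d0)]
    for i, s in enumerate(sigmas):
        sigma_prime[i][i] = math.sqrt(s * s + t * t)
    return matmul(matmul(W, sigma_prime), transpose(V))

def gram(U):
    """Return U^T U."""
    return matmul(transpose(U), U)

W = rotation3_z(0.7)   # hypothetical left singular vectors (d0 = 3)
V = rotation2(-0.4)    # hypothetical right singular vectors (d1 = 2)
sigmas = [1.5, 0.0]    # one zero singular value, so U is rank deficient
t = 0.3

G0 = gram(curve_point(W, sigmas, V, 0.0))  # U^T U
Gt = gram(curve_point(W, sigmas, V, t))    # U(t)^T U(t)
# The difference should equal t^2 times the 2x2 identity, up to rounding.
gram_gap = [[Gt[i][j] - G0[i][j] for j in range(2)] for i in range(2)]
```

Because the difference is a multiple of the identity, every diagonal entry of the Gram matrix grows by the same amount $t^2$, which is why equal column norms are preserved along the curve.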
Thus, the parametric curve U(t) is equalized for all t. The population
risk at U(t) equals:
\[
L(U(t)) = \sum_{i=1}^{d_1} \left(\lambda_i - \sigma_i^2 - t^2\right)^2 + \sum_{i=d_1+1}^{d_0} \lambda_i^2
= L(U) + d_1 t^4 - 2t^2 \sum_{i=1}^{d_1} \left(\lambda_i - \sigma_i^2\right).
\]
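The expansion of the population risk follows from $(a - t^2)^2 = a^2 - 2at^2 + t^4$ summed over $i \le d_1$. As a sanity check, here is a small sketch comparing the direct and expanded forms; the eigenvalues `lams`, the squared singular values `sig2`, and the function names are made-up illustrations, not values from the text.

```python
def risk_direct(lams, sig2, t):
    """Population risk on the curve, written as the two sums."""
    d1 = len(sig2)
    head = sum((lams[i] - sig2[i] - t**2) ** 2 for i in range(d1))
    tail = sum(lam**2 for lam in lams[d1:])
    return head + tail

def risk_expanded(lams, sig2, t):
    """Same quantity as L(U) + d1 t^4 - 2 t^2 sum_i (lambda_i - sigma_i^2)."""
    d1 = len(sig2)
    base = risk_direct(lams, sig2, 0.0)  # L(U), the risk at t = 0
    gap = sum(lams[i] - sig2[i] for i in range(d1))
    return base + d1 * t**4 - 2 * t**2 * gap

lams = [5.0, 3.0, 2.0, 1.0]  # hypothetical eigenvalues, d0 = 4
sig2 = [4.0, 2.5, 0.0]       # hypothetical sigma_i^2, d1 = 3

for t in (0.0, 0.3, 1.7):
    # The two expressions should agree up to floating-point rounding.
    print(t, risk_direct(lams, sig2, t), risk_expanded(lams, sig2, t))
```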
Furthermore, since U(t) is equalized, we obtain the following form
for the regularizer:
\[
R(U(t)) = \frac{\lambda}{d_1} \|U(t)\|_F^4 = \frac{\lambda}{d_1} \left(\|U\|_F^2 + d_1 t^2\right)^2
= R(U) + \lambda d_1 t^4 + 2\lambda t^2 \|U\|_F^2.
\]
Define $g(t) := L(U(t)) + R(U(t))$. We have that
\[
g(t) = L(U) + R(U) + d_1 t^4 - 2t^2 \sum_{i=1}^{d_1} \left(\lambda_i - \sigma_i^2\right) + \lambda d_1 t^4 + 2\lambda t^2 \|U\|_F^2.
\]
It is easy to verify that $g'(0) = 0$. Moreover, the second derivative of $g$ at $t = 0$ is given as:
\[
g''(0) = -4 \sum_{i=1}^{d_1} \left(\lambda_i - \sigma_i^2\right) + 4\lambda \|U\|_F^2
= -4 \sum_{i=1}^{d_1} \lambda_i + 4(1 + \lambda) \|U\|_F^2. \tag{9.15}
\]
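Both claims about $g$ at $t = 0$ can be sanity-checked with central finite differences. The sketch below assumes $U$ is equalized, so that $R(U(t)) = \frac{\lambda}{d_1}(\|U\|_F^2 + d_1 t^2)^2$ as in the text; the numerical values, the step size `h`, and the names `objective_on_curve`, `curvature_closed_form`, and `reg` (standing in for $\lambda$) are illustrative assumptions.

```python
def objective_on_curve(lams, sig2, reg, t):
    """g(t) = L(U(t)) + R(U(t)) for an equalized U with squared singular values sig2."""
    d1 = len(sig2)
    risk = (sum((lams[i] - sig2[i] - t**2) ** 2 for i in range(d1))
            + sum(lam**2 for lam in lams[d1:]))
    frob_sq = sum(sig2) + d1 * t**2          # ||U(t)||_F^2
    return risk + (reg / d1) * frob_sq**2

def curvature_closed_form(lams, sig2, reg):
    """Right-hand side of (9.15): -4 sum_i lambda_i + 4 (1 + reg) ||U||_F^2."""
    d1 = len(sig2)
    return -4 * sum(lams[:d1]) + 4 * (1 + reg) * sum(sig2)

def central_differences(f, h=1e-3):
    """Approximate f'(0) and f''(0) with central differences of step h."""
    first = (f(h) - f(-h)) / (2 * h)
    second = (f(h) - 2 * f(0.0) + f(-h)) / h**2
    return first, second

lams = [5.0, 3.0, 2.0, 1.0]  # hypothetical eigenvalues, d0 = 4
sig2 = [4.0, 2.5, 0.0]       # hypothetical sigma_i^2, d1 = 3
reg = 0.5                    # hypothetical regularization weight (lambda)

g = lambda t: objective_on_curve(lams, sig2, reg, t)
slope, curvature = central_differences(g)
print(slope, curvature, curvature_closed_form(lams, sig2, reg))
```

Since $g$ depends on $t$ only through $t^2$, the first difference vanishes, and the second difference should match the closed form up to an $O(h^2)$ error.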
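We use $\|U\|_F^2 = \sum_{i=1}^{r'} \sigma_i^2$ and Equation (9.14) to arrive at
\[
\|U\|_F^2 = \operatorname{tr} \Sigma^2
= \sum_{i=1}^{r'} \left(\lambda_i - \frac{\lambda \sum_{j=1}^{r'} \lambda_j}{d_1 + \lambda r'}\right)
= \left(\sum_{i=1}^{r'} \lambda_i\right) \left(1 - \frac{\lambda r'}{d_1 + \lambda r'}\right)
= \frac{d_1 \sum_{i=1}^{r'} \lambda_i}{d_1 + \lambda r'}.
\]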
Plugging back the above equality in Equation (9.15), we get
\[
g''(0) = -4 \sum_{i=1}^{d_1} \lambda_i + 4\, \frac{d_1 + d_1 \lambda}{d_1 + \lambda r'} \sum_{i=1}^{r'} \lambda_i
= -4 \sum_{i=r'+1}^{d_1} \lambda_i + 4\, \frac{(d_1 - r') \lambda}{d_1 + \lambda r'} \sum_{i=1}^{r'} \lambda_i.
\]
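The two simplifications above can be verified exactly with rational arithmetic. The sketch below takes the form of $\sigma_i^2$ from how Equation (9.14) is used in the derivation, namely $\sigma_i^2 = \lambda_i - \lambda \sum_{j \le r'} \lambda_j / (d_1 + \lambda r')$ for $i \le r'$; that reading, the example numbers, and the names `sigma_squared`, `frob_closed_form`, `curvature_from_9_15`, `curvature_simplified`, and `reg` (standing in for $\lambda$) are assumptions made for illustration.

```python
from fractions import Fraction

def sigma_squared(lams, r, d1, reg):
    """Assumed form of (9.14): sigma_i^2 for i <= r, as used in the derivation."""
    shift = reg * sum(lams[:r]) / (d1 + reg * r)
    return [lam - shift for lam in lams[:r]]

def frob_closed_form(lams, r, d1, reg):
    """||U||_F^2 = d1 * sum_{i<=r} lambda_i / (d1 + reg * r)."""
    return d1 * sum(lams[:r]) / (d1 + reg * r)

def curvature_from_9_15(lams, r, d1, reg):
    """g''(0) via (9.15), with ||U||_F^2 computed as the sum of sigma_i^2."""
    frob_sq = sum(sigma_squared(lams, r, d1, reg))
    return -4 * sum(lams[:d1]) + 4 * (1 + reg) * frob_sq

def curvature_simplified(lams, r, d1, reg):
    """g''(0) in the final simplified form."""
    return (-4 * sum(lams[r:d1])
            + 4 * (d1 - r) * reg / (d1 + reg * r) * sum(lams[:r]))

lams = [Fraction(5), Fraction(3), Fraction(2), Fraction(1)]  # hypothetical eigenvalues
r, d1, reg = 2, 3, Fraction(1, 2)                            # hypothetical r', d_1, lambda

print(sum(sigma_squared(lams, r, d1, reg)), frob_closed_form(lams, r, d1, reg))  # 6 6
print(curvature_from_9_15(lams, r, d1, reg), curvature_simplified(lams, r, d1, reg))  # -4 -4
```

In this example $g''(0) = -4 < 0$, so the objective decreases to second order along the curve $U(t)$ at $t = 0$.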
To get a sufficient condition for U to be a strict saddle point, it suf-