TheoryofDeepLearning.2022
inductive biases due to algorithmic regularization 97
at most $d_0$, and letting $d_1 > d_0$ does not increase the expressivity of
the function class represented by the network. Second, Theorem 9.3.5
guarantees that any critical point U that is not a global optimum is
a strict saddle point, i.e., the Hessian $\nabla^2 L(U, U)$ has a negative eigenvalue. This
property allows first-order methods, such as dropout, to escape such
saddle points. Third, note that the guarantees in Theorem 9.3.5 hold
when the regularization parameter λ is sufficiently small. Assumptions
of this kind are common in the literature (see, for example [? ]).
While this is a sufficient condition for the result in Theorem 9.3.5, it is
not clear if it is necessary.
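To make the strict-saddle property concrete, the following is a minimal sketch on a toy function, not on the objective of Problem 9.11. The function f(x, y) = x^2 + y^4/4 - y^2/2, the step size, and the starting points are all assumptions chosen for illustration, and plain gradient descent with a tiny offset stands in for a generic first-order method. The origin is a critical point whose Hessian diag(2, -1) has a negative eigenvalue; an iterate started exactly there never moves, while one started with a small component along the negative-curvature direction drifts away and reaches a global minimum.

```python
def grad(x, y):
    """Gradient of the toy function f(x, y) = x**2 + y**4 / 4 - y**2 / 2."""
    return 2.0 * x, y**3 - y


def gradient_descent(x, y, step=0.1, iters=500):
    """Run plain gradient descent from (x, y) and return the final iterate."""
    for _ in range(iters):
        gx, gy = grad(x, y)
        x, y = x - step * gx, y - step * gy
    return x, y


# (0, 0) is a critical point with Hessian diag(2, -1): a strict saddle.
# The global minima of this toy function are (0, 1) and (0, -1).
stuck = gradient_descent(0.0, 0.0)     # started exactly at the saddle
escaped = gradient_descent(0.5, 1e-6)  # tiny offset along the negative-curvature direction

print("started at the saddle:  ", stuck)
print("started with an offset: ", escaped)
```

The offset along y grows geometrically because the curvature in that direction is negative, which is the mechanism a negative Hessian eigenvalue provides to first-order methods.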
Proof of Theorem 9.3.5. We outline the main steps of the proof.
1. In Lemma 9.3.3, we show that the set of non-equalized critical
points does not include any local optima. Furthermore,
Lemma 9.3.6 shows that all such points are strict saddles.
2. In Lemma 9.3.7, we give a closed-form characterization of all the
equalized critical points in terms of the eigendecomposition of M.
We then show that if λ is chosen appropriately, all such critical
points that are not global optima are strict saddle points.
3. It follows from Item 1 and Item 2 that if λ is chosen appropriately,
then all critical points that are not global optima are strict
saddle points.
Lemma 9.3.6. All critical points of Problem 9.11 that are not equalized are
strict saddle points.
Proof of Lemma 9.3.6. By Lemma 9.3.3, the set of non-equalized critical
points does not include any local optima. We show that all such
points are strict saddles. Let U be a critical point that is not equalized.
To show that U is a strict saddle point, it suffices to show that the
Hessian has a negative eigenvalue. To this end, we exhibit a curve along
which the second directional derivative is negative. Assume, without
loss of generality, that $\|u_1\| > \|u_2\|$, and consider the curve
$$\Delta(t) := \left[ \left(\sqrt{1-t^2} - 1\right) u_1 + t\,u_2, \;\; \left(\sqrt{1-t^2} - 1\right) u_2 - t\,u_1, \;\; 0_{d,\,r-2} \right].$$
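The curve above can be checked numerically. The sketch below builds the first two columns of U + ∆(t), namely √(1−t²) u_1 + t u_2 and √(1−t²) u_2 − t u_1, and confirms that the sum of outer products u_1 u_1ᵀ + u_2 u_2ᵀ is unchanged. This is what acting on U by a rotation of its first two columns means. The dimension, the random vectors, the value of t, and the helper names `perturbed_columns` and `outer_sum` are assumptions made for illustration. The remaining r − 2 columns are untouched by ∆(t) and are therefore omitted.

```python
import math
import random


def perturbed_columns(u1, u2, t):
    """Return the first two columns of U + Delta(t), for |t| <= 1."""
    c = math.sqrt(1.0 - t * t)
    # u1 + Delta_1(t) = sqrt(1 - t^2) * u1 + t * u2
    v1 = [c * a + t * b for a, b in zip(u1, u2)]
    # u2 + Delta_2(t) = sqrt(1 - t^2) * u2 - t * u1
    v2 = [c * b - t * a for a, b in zip(u1, u2)]
    return v1, v2


def outer_sum(cols):
    """Return sum_i u_i u_i^T as a nested list (d x d)."""
    d = len(cols[0])
    return [[sum(u[i] * u[j] for u in cols) for j in range(d)] for i in range(d)]


random.seed(0)
d = 4    # illustrative dimension
t = 0.3  # any value with |t| <= 1 works
u1 = [random.gauss(0.0, 1.0) for _ in range(d)]
u2 = [random.gauss(0.0, 1.0) for _ in range(d)]

v1, v2 = perturbed_columns(u1, u2, t)
before = outer_sum([u1, u2])
after = outer_sum([v1, v2])

# Largest entrywise difference between U U^T before and after the perturbation.
max_gap = max(abs(before[i][j] - after[i][j]) for i in range(d) for j in range(d))
print("max entrywise change in U U^T:", max_gap)
```

The cross terms cancel and the coefficients satisfy (1 − t²) + t² = 1, so the difference is zero up to floating-point rounding. Any quantity that depends on U only through U Uᵀ is therefore constant along the curve.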
It is easy to check that for any $t \in [-1, 1]$, $L(U + \Delta(t)) = L(U)$ since
$U + \Delta(t)$ is essentially a rotation of $U$ and $L$ is invariant under