26.12.2022 Views

TheoryofDeepLearning.2022


inductive biases due to algorithmic regularization 97

at most $d_0$, and letting $d_1 > d_0$ does not increase the expressivity of

the function class represented by the network. Second, Theorem 9.3.5

guarantees that any critical point U that is not a global optimum is

a strict saddle point, i.e., $\nabla^2 L(U, U)$ has a negative eigenvalue. This

property allows first order methods, such as dropout, to escape such

saddle points. Third, note that the guarantees in Theorem 9.3.5 hold

when the regularization parameter λ is sufficiently small. Assumptions

of this kind are common in the literature (see, for example [? ]).

While this is a sufficient condition for the result in Theorem 9.3.5, it is

not clear if it is necessary.
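
The strict saddle property can be illustrated on a toy problem. The sketch below is not Problem 9.11 and does not implement dropout: it assumes the unregularized objective f(u) = ‖M − uuᵀ‖² with the hypothetical choice M = diag(2, 1), and uses plain gradient descent as a stand-in first-order method. The point u = (0, 1) is a critical point that is not a global optimum, the second directional derivative there is negative along (1, 0), and a tiny perturbation is enough for gradient descent to leave it.

```python
import math

# Toy objective (an assumption, not Problem 9.11):
# f(u) = ||M - u u^T||_F^2 with M = diag(2, 1) and u in R^2.
M_DIAG = (2.0, 1.0)


def loss(u):
    x, y = u
    # Entries of the symmetric 2x2 residual M - u u^T.
    r11 = M_DIAG[0] - x * x
    r22 = M_DIAG[1] - y * y
    r12 = -x * y
    return r11 * r11 + r22 * r22 + 2.0 * r12 * r12


def grad(u):
    # Gradient of f: -4 M u + 4 ||u||^2 u.
    x, y = u
    sq = x * x + y * y
    return (-4.0 * M_DIAG[0] * x + 4.0 * sq * x,
            -4.0 * M_DIAG[1] * y + 4.0 * sq * y)


def second_directional_derivative(u, d, h=1e-3):
    # Central finite difference of f at u along direction d.
    up = (u[0] + h * d[0], u[1] + h * d[1])
    um = (u[0] - h * d[0], u[1] - h * d[1])
    return (loss(up) - 2.0 * loss(u) + loss(um)) / (h * h)


def gradient_descent(u, step=0.02, iters=5000):
    # Plain gradient descent; step size and iteration count are illustrative.
    for _ in range(iters):
        g = grad(u)
        u = (u[0] - step * g[0], u[1] - step * g[1])
    return u


saddle = (0.0, 1.0)

# Negative curvature along (1, 0) certifies a negative Hessian eigenvalue.
curvature = second_directional_derivative(saddle, (1.0, 0.0))

# Started slightly off the saddle, gradient descent moves away from it.
escaped = gradient_descent((1e-6, 1.0))

print("gradient at saddle:", grad(saddle))
print("curvature along (1, 0):", curvature)
print("loss at saddle:", loss(saddle), "loss after escape:", loss(escaped))
```

Started exactly at the saddle, the gradient is zero and the iterate never moves; the escape in this sketch relies on the small initial perturbation, which plays the role that noise plays in stochastic first-order methods.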

Proof of Theorem 9.3.5. Here we outline the main steps in the proof of

Theorem 9.3.5.

1. In Lemma 9.3.3, we show that the set of non-equalized critical

points does not include any local optima. Furthermore,

Lemma 9.3.6 shows that all such points are strict saddles.

2. In Lemma 9.3.7, we give a closed-form characterization of all the

equalized critical points in terms of the eigendecomposition of M.

We then show that if λ is chosen appropriately, all such critical

points that are not global optima are strict saddle points.

3. It follows from Item 1 and Item 2 that if λ is chosen appropriately,

then all critical points that are not global optima are strict

saddle points.

Lemma 9.3.6. All critical points of Problem 9.11 that are not equalized are

strict saddle points.

Proof of Lemma 9.3.6. By Lemma 9.3.3, the set of non-equalized critical

points does not include any local optima. We show that all such

points are strict saddles. Let U be a critical point that is not equalized.

To show that U is a strict saddle point, it suffices to show that the

Hessian has a negative eigenvalue. To this end, we exhibit a curve along

which the second directional derivative is negative. Assume, without

loss of generality, that $\|u_1\| > \|u_2\|$ and consider the curve

$$\Delta(t) := \left[ \left(\sqrt{1-t^2}-1\right) u_1 + t\,u_2,\ \left(\sqrt{1-t^2}-1\right) u_2 - t\,u_1,\ 0_{d,r-2} \right]$$
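
The effect of this curve on the first two columns can be checked numerically. The sketch below assumes |t| ≤ 1, so that the square root is real, and uses hypothetical vectors u_1 and u_2 chosen only for illustration. It confirms that the perturbed columns leave u_1 u_1ᵀ + u_2 u_2ᵀ unchanged (and hence UUᵀ, since the remaining columns are untouched), while the individual column norms move closer together.

```python
import math


def perturbed_columns(u1, u2, t):
    # Columns 1 and 2 of U + Delta(t); requires |t| <= 1 for a real root.
    c = math.sqrt(1.0 - t * t)
    v1 = [c * a + t * b for a, b in zip(u1, u2)]
    v2 = [c * b - t * a for a, b in zip(u1, u2)]
    return v1, v2


def outer_sum(cols):
    # Sum of outer products of the given columns, as a d x d nested list.
    d = len(cols[0])
    return [[sum(col[i] * col[j] for col in cols) for j in range(d)]
            for i in range(d)]


def sq_norm(v):
    return sum(a * a for a in v)


# Hypothetical example columns with ||u1|| > ||u2||.
u1 = [2.0, 0.0, 1.0]
u2 = [0.0, 1.0, 0.0]
t = 0.3

v1, v2 = perturbed_columns(u1, u2, t)

before = outer_sum([u1, u2])
after = outer_sum([v1, v2])
max_diff = max(abs(before[i][j] - after[i][j])
               for i in range(3) for j in range(3))

print("max change in u1 u1^T + u2 u2^T:", max_diff)
print("squared norms before:", sq_norm(u1), sq_norm(u2))
print("squared norms after: ", sq_norm(v1), sq_norm(v2))
```

The total squared norm of the two columns is preserved, but it is redistributed between them, so any quantity that depends on the individual column norms can change along the curve even though UUᵀ does not.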

It is easy to check that for any t ∈ R, L(U + ∆(t)) = L(U) since

U + ∆(t) is essentially a rotation on U and L is invariant under
