
Let $\delta = -c \cdot \mathrm{sgn}(u_1^\top u_2)$ for a small enough $c > 0$ such that $\|u_2\| < \|\hat{u}_2\| \le \|\hat{u}_1\| < \|u_1\|$. Using Equation (9.13), this implies that $\|\hat{u}_1\|^4 + \|\hat{u}_2\|^4 < \|u_1\|^4 + \|u_2\|^4$, which in turn gives us $R(\hat{U}) < R(U)$ and hence $L_\theta(\hat{U}) < L_\theta(U)$. Therefore, a non-equalized critical point cannot be a local minimum, which proves the first claim of the lemma.
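One way to see the inequality on the fourth powers, assuming the perturbation preserves $\|u_1\|^2 + \|u_2\|^2$ (as is the case, for instance, when $\hat{U}$ is obtained from $U$ by rotating the two columns), is via the elementary identity
$$a^4 + b^4 = \tfrac{1}{2}\left[(a^2 + b^2)^2 + (a^2 - b^2)^2\right].$$
With $a^2 + b^2$ held fixed, the right-hand side is strictly increasing in $|a^2 - b^2|$; since $\|u_2\| < \|\hat{u}_2\| \le \|\hat{u}_1\| < \|u_1\|$ means the perturbed norms are strictly closer together, the inequality $\|\hat{u}_1\|^4 + \|\hat{u}_2\|^4 < \|u_1\|^4 + \|u_2\|^4$ follows.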

9.3.2 Landscape properties

Next, we characterize the solutions to which dropout converges. We do so by understanding the optimization landscape of Problem 9.11. Central to our analysis is the following notion of the strict saddle property.

Definition 9.3.4 (Strict saddle point/property). Let $f : \mathcal{U} \to \mathbb{R}$ be a twice differentiable function and let $U \in \mathcal{U}$ be a critical point of $f$. Then, $U$ is a strict saddle point of $f$ if the Hessian of $f$ at $U$ has at least one negative eigenvalue, i.e. $\lambda_{\min}(\nabla^2 f(U)) < 0$. Furthermore, $f$ satisfies the strict saddle property if all saddle points of $f$ are strict saddle points.
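For a concrete toy example of this definition (purely illustrative and not from the analysis above; NumPy is assumed to be available), consider $f(x, y) = x^2 - y^2$, whose only critical point is the origin. Its Hessian there has eigenvalues $2$ and $-2$, so $\lambda_{\min}(\nabla^2 f(0, 0)) = -2 < 0$ and the origin is a strict saddle point:

import numpy as np

# Hessian of f(x, y) = x**2 - y**2; it is constant in (x, y)
def hessian_f(x, y):
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

H = hessian_f(0.0, 0.0)
lam_min = np.linalg.eigvalsh(H).min()
print(lam_min)   # -2.0: a negative eigenvalue, so the origin is a strict saddle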

The strict saddle property ensures that for any critical point $U$ that is not a local optimum, the Hessian has a significant negative eigenvalue, which allows first-order methods such as gradient descent (GD) and stochastic gradient descent (SGD) to escape saddle points and converge to a local minimum [? ? ]. Following this idea, there has been a flurry of work studying the landscape of different machine learning problems, including low-rank matrix recovery [? ], the generalized phase retrieval problem [? ], matrix completion [? ], deep linear networks [? ], matrix sensing and robust PCA [? ], and tensor decomposition [? ], making a case for the global optimality of first-order methods.
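This escape behavior can be seen in a small simulation (illustrative only, assuming NumPy; the toy objective is again $f(x, y) = x^2 - y^2$, not a quantity from the analysis above): gradient descent initialized exactly at the strict saddle $(0, 0)$ never moves, while adding small SGD-style noise to the gradient pushes the iterate into the negative-curvature direction and the objective value drops below that of the saddle.

import numpy as np

rng = np.random.default_rng(0)

def grad_f(z):
    # gradient of f(x, y) = x**2 - y**2
    return np.array([2.0 * z[0], -2.0 * z[1]])

z = np.zeros(2)              # start exactly at the strict saddle point
eta = 0.1
for _ in range(50):
    noise = 1e-3 * rng.standard_normal(2)    # SGD-style gradient perturbation
    z = z - eta * (grad_f(z) + noise)

print(z[0]**2 - z[1]**2)     # typically negative: the iterate has escaped the saddle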

For the special case of no regularization (i.e. $\lambda = 0$; equivalently, no dropout), Problem 9.11 reduces to standard squared-loss minimization, which has been shown to have no spurious local minima and to satisfy the strict saddle property (see, e.g., [? ? ]). However, the regularizer induced by dropout can potentially introduce new spurious local minima as well as degenerate saddle points. Our next result establishes that this is not the case, at least when the dropout rate is sufficiently small.

Theorem 9.3.5. Let $r := \mathrm{Rank}(M)$. Assume that $d_1 \le d_0$ and that the regularization parameter satisfies
$$\lambda < \frac{\lambda_r(M)}{\left(\sum_{i=1}^{r} \lambda_i(M)\right) - r\,\lambda_r(M)}.$$
Then it holds for Problem 9.11 that

1. all local minima are global,

2. all saddle points are strict saddle points.

A few remarks are in order. First, the assumption $d_1 \le d_0$ is by no means restrictive, since the network map $UU^\top \in \mathbb{R}^{d_0 \times d_0}$ has rank at most $\min\{d_0, d_1\}$.
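To make the eigenvalue condition of Theorem 9.3.5 concrete, the threshold on $\lambda$ can be computed directly from the top $r$ eigenvalues of $M$. The snippet below does this for a randomly generated rank-$r$ positive semidefinite matrix (an illustration only; the matrix, dimensions, and variable names are not taken from the text):

import numpy as np

rng = np.random.default_rng(1)
d0, r = 6, 3
A = rng.standard_normal((d0, r))
M = A @ A.T                                        # a rank-r PSD matrix

eigs = np.sort(np.linalg.eigvalsh(M))[::-1][:r]    # lambda_1(M) >= ... >= lambda_r(M) > 0
threshold = eigs[r - 1] / (eigs.sum() - r * eigs[r - 1])
print(threshold)   # Theorem 9.3.5 applies to any regularization parameter lambda below this value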
