Let $\delta = -c \cdot \mathrm{sgn}(u_1^\top u_2)$ for a small enough $c > 0$ such that $\|u_2\| < \|\hat{u}_2\| \le \|\hat{u}_1\| < \|u_1\|$. Using Equation (9.13), this implies that $\|\hat{u}_1\|^4 + \|\hat{u}_2\|^4 < \|u_1\|^4 + \|u_2\|^4$, which in turn gives us $R(\hat{U}) < R(U)$ and hence $L_\theta(\hat{U}) < L_\theta(U)$. Therefore, a non-equalized critical point cannot be a local minimum, which proves the first claim of the lemma.
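To see the mechanism numerically, note that mixing two columns of $U$ by a rotation leaves the network map $UU^\top$ unchanged while redistributing their norms, so equalizing the norms strictly decreases $R(U) = \sum_i \|u_i\|^4$ without affecting the loss term. The following minimal sketch illustrates this; the specific matrices and the mixing angle are illustrative assumptions, not the exact construction used in the proof.

```python
import numpy as np

# Two columns with unequal norms: a "non-equalized" point.
u1 = np.array([2.0, 0.0, 0.0])
u2 = np.array([0.3, 0.4, 0.0])
U = np.column_stack([u1, u2])

# Mix the columns by a rotation G; this preserves the network map,
# since (UG)(UG)^T = U G G^T U^T = U U^T.
c = 0.3
delta = -c * np.sign(u1 @ u2)        # sign choice mirroring the proof
G = np.array([[np.cos(delta), -np.sin(delta)],
              [np.sin(delta),  np.cos(delta)]])
U_hat = U @ G

def R(W):
    # Regularizer R(W) = sum_i ||w_i||^4 over the columns w_i of W.
    return np.sum(np.sum(W**2, axis=0) ** 2)

print(np.allclose(U @ U.T, U_hat @ U_hat.T))  # True: same network map
print(R(U_hat) < R(U))                        # True: equalizing lowers R
```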
9.3.2 Landscape properties
Next, we characterize the solutions to which dropout converges. We
do so by understanding the optimization landscape of Problem 9.11.
Central to our analysis is the following notion of the strict saddle property.
Definition 9.3.4 (Strict saddle point/property). Let $f : \mathcal{U} \to \mathbb{R}$ be a twice differentiable function and let $U \in \mathcal{U}$ be a critical point of $f$. Then, $U$ is a strict saddle point of $f$ if the Hessian of $f$ at $U$ has at least one negative eigenvalue, i.e. $\lambda_{\min}(\nabla^2 f(U)) < 0$. Furthermore, $f$ satisfies the strict saddle property if all saddle points of $f$ are strict saddle points.
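As a standard concrete example (not from the text): $f(u) = u_1^2 - u_2^2$ has a critical point at the origin, where the Hessian is $\mathrm{diag}(2, -2)$, so $\lambda_{\min} = -2 < 0$ and the origin is a strict saddle. A quick numerical check:

```python
import numpy as np

# f(u) = u1^2 - u2^2: the gradient (2*u1, -2*u2) vanishes at the origin.
hessian = np.diag([2.0, -2.0])               # Hessian of f (constant in u)
lam_min = np.linalg.eigvalsh(hessian).min()
print(lam_min)       # -2.0 < 0, so the origin is a strict saddle point
```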
The strict saddle property ensures that for any critical point $U$ that is not a local optimum, the Hessian has a significant negative eigenvalue, which allows first-order methods such as gradient descent (GD) and stochastic gradient descent (SGD) to escape saddle points and converge to a local minimum [? ? ]. Following this idea, there has been a flurry of work studying the landscape of different machine learning problems, including low-rank matrix recovery [? ], the generalized phase retrieval problem [? ], matrix completion [? ], deep linear networks [? ], matrix sensing and robust PCA [? ], and tensor decomposition [? ], making a case for the global optimality of first-order methods.
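The escape mechanism can be seen on a toy strict-saddle function. The sketch below (the function, step size, and perturbation scale are illustrative choices) runs gradient descent with a small random perturbation injected whenever the gradient vanishes, in the spirit of the perturbed first-order methods cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(u):
    # Gradient of f(u) = (u1^2 - 1)^2 + u2^2, which has a strict saddle
    # at the origin (Hessian diag(-4, 2)) and global minima at (+-1, 0).
    return np.array([4.0 * u[0] * (u[0] ** 2 - 1.0), 2.0 * u[1]])

u = np.zeros(2)                  # start exactly at the saddle point
eta = 0.05                       # step size
for _ in range(500):
    g = grad(u)
    if np.linalg.norm(g) < 1e-8:               # stuck at a critical point:
        u += 1e-3 * rng.standard_normal(2)     # tiny random perturbation
    else:
        u -= eta * g

print(u)   # close to (+1, 0) or (-1, 0): GD escaped the saddle
```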
For the special case of no regularization (i.e. $\lambda = 0$; equivalently, no dropout), Problem 9.11 reduces to standard squared-loss minimization, which has been shown to have no spurious local minima and to satisfy the strict saddle property (see, e.g., [? ? ]). However, the regularizer induced by dropout can potentially introduce new spurious local minima as well as degenerate saddle points. Our next result establishes that this is not the case, at least when the dropout rate is sufficiently small.
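For orientation, the objective under study can be written down directly. The sketch below assumes, consistent with the proof fragment above, that Problem 9.11 has the form $\min_U \|M - UU^\top\|_F^2 + \lambda \sum_i \|u_i\|^4$; the exact statement appears earlier in the chapter.

```python
import numpy as np

def dropout_objective(U, M, lam):
    # Assumed form of Problem 9.11: squared loss plus the
    # dropout-induced regularizer R(U) = sum_i ||u_i||^4.
    loss = np.linalg.norm(M - U @ U.T, "fro") ** 2
    reg = np.sum(np.sum(U**2, axis=0) ** 2)
    return loss + lam * reg

# lam = 0 recovers plain squared-loss factorization, whose landscape
# has no spurious local minima; lam > 0 adds the dropout penalty.
```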
Theorem 9.3.5. Let $r := \mathrm{Rank}(M)$. Assume that $d_1 \le d_0$ and that the regularization parameter satisfies
$$\lambda < \frac{r\,\lambda_r(M)}{\left(\sum_{i=1}^{r} \lambda_i(M)\right) - r\,\lambda_r(M)}.$$
Then it holds for Problem 9.11 that
1. all local minima are global,
2. all saddle points are strict saddle points.
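The eigenvalue condition of the theorem is easy to evaluate for a given $M$; here is a minimal sketch, where the construction of $M$ is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-r positive semidefinite M (illustrative construction).
d0, r = 6, 3
A = rng.standard_normal((d0, r))
M = A @ A.T

eig = np.sort(np.linalg.eigvalsh(M))[::-1]     # eigenvalues, descending
lam_r = eig[r - 1]                             # r-th largest eigenvalue
bound = r * lam_r / (eig[:r].sum() - r * lam_r)

print(bound)   # Theorem 9.3.5 applies to any regularization lambda below this
```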
A few remarks are in order. First, the assumption $d_1 \le d_0$ is by no means restrictive, since the network map $UU^\top \in \mathbb{R}^{d_0 \times d_0}$ has rank at most $\min(d_0, d_1)$.