equal to $\frac{1}{k}\sum_{i=1}^{k} w_i$ (where $w_i$ is the weight of the $i$-th neuron in $\theta^*$), so $h_{\bar{\theta}}(x) = k\,\sigma\big(\langle \frac{1}{k}\sum_{i=1}^{k} w_i, x\rangle\big)$ is equivalent to a neural network with a single neuron. In most cases a single-neuron network cannot achieve the global minimum, so by contradiction we conclude that $f$ cannot be convex.
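To spell out the Jensen step behind this contradiction (a sketch assuming, as in the argument above, that $\theta_1, \dots, \theta_{k!}$ are the copies of $\theta^*$ with its $k$ neurons permuted and that $\bar{\theta}$ is their average, so that $f(\theta_j) = f(\theta^*)$ for every $j$ by symmetry): if $f$ were convex, then

$$f(\bar{\theta}) = f\Big(\frac{1}{k!}\sum_{j=1}^{k!} \theta_j\Big) \;\le\; \frac{1}{k!}\sum_{j=1}^{k!} f(\theta_j) = f(\theta^*),$$

so the single-neuron network $h_{\bar{\theta}}$ would have to attain the global minimum as well, contradicting the observation above.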
It is also possible to show that functions with symmetry must have saddle points.⁶ Therefore, to optimize such a function, the algorithm needs to be able to either avoid or escape from saddle points. More concretely, one would like to find a second order stationary point.

⁶ Except for some degenerate cases such as constant functions.
Definition 7.3.1 (Second order stationary point (SOSP)). For an objective function $f(w): \mathbb{R}^d \to \mathbb{R}$, a point $w$ is a second order stationary point if $\nabla f(w) = 0$ and $\nabla^2 f(w) \succeq 0$.
The conditions for a second order stationary point are known as the second order necessary conditions for a local minimum. Of course, in general an optimization algorithm will not be able to find an exact second order stationary point (just as in Section ?? we only show that gradient descent finds a point with small gradient, not zero gradient). Instead, optimization algorithms can be used to find an approximate second order stationary point:
Definition 7.3.2 (Approximate second order stationary point). For an objective function $f(w): \mathbb{R}^d \to \mathbb{R}$, a point $w$ is an $(\epsilon, \gamma)$-second order stationary point (later abbreviated as $(\epsilon, \gamma)$-SOSP) if $\|\nabla f(w)\|_2 \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(w)) \ge -\gamma$.
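As an illustration (not taken from the text), the two conditions of Definition 7.3.2 can be checked numerically whenever the gradient and Hessian are available. The quartic objective $f(w) = (\|w\|^2 - 1)^2/4$ below is a toy choice used only for this purpose: it has a strict saddle at $w = 0$ and global minima on the unit sphere.

    import numpy as np

    def is_approx_sosp(grad, hess, eps, gamma):
        """Check the (eps, gamma)-SOSP conditions of Definition 7.3.2:
        gradient norm at most eps and smallest Hessian eigenvalue at least -gamma."""
        grad_ok = np.linalg.norm(grad) <= eps
        hess_ok = np.linalg.eigvalsh(hess).min() >= -gamma
        return grad_ok and hess_ok

    # Toy objective f(w) = (||w||^2 - 1)^2 / 4: gradient (||w||^2 - 1) w,
    # Hessian (||w||^2 - 1) I + 2 w w^T.
    grad_f = lambda w: (w @ w - 1.0) * w
    hess_f = lambda w: (w @ w - 1.0) * np.eye(len(w)) + 2.0 * np.outer(w, w)

    w_saddle = np.zeros(3)               # strict saddle: gradient 0, Hessian -I
    w_min = np.array([1.0, 0.0, 0.0])    # a global minimum on the unit sphere
    print(is_approx_sosp(grad_f(w_saddle), hess_f(w_saddle), eps=1e-3, gamma=1e-3))  # False
    print(is_approx_sosp(grad_f(w_min), hess_f(w_min), eps=1e-3, gamma=1e-3))        # True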
Later in Chapter ?? we will show that simple variants of gradient descent can in fact find $(\epsilon, \gamma)$-SOSPs efficiently.
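For concreteness, one well-known variant of this kind simply adds a small random perturbation whenever the gradient becomes small, which lets the iterates move off strict saddle points. The sketch below is only a simplified illustration of that idea; the step size, perturbation radius, and iteration count are ad hoc choices rather than the tuned parameters used in the formal analysis.

    import numpy as np

    def perturbed_gradient_descent(grad_f, w0, eta=1e-2, eps=1e-3,
                                   radius=1e-2, n_iters=10_000, seed=0):
        """Gradient descent that perturbs the iterate whenever the gradient is
        small, so it does not get stuck at strict saddle points."""
        rng = np.random.default_rng(seed)
        w = np.array(w0, dtype=float)
        for _ in range(n_iters):
            g = grad_f(w)
            if np.linalg.norm(g) <= eps:
                # Near a first order stationary point: random kick to escape a saddle.
                w = w + rng.uniform(-radius, radius, size=w.shape)
            else:
                w = w - eta * g
        return w

    # Started exactly at the saddle w = 0 of f(w) = (||w||^2 - 1)^2 / 4,
    # the perturbations let the iterates slide down to the unit sphere.
    grad_f = lambda w: (w @ w - 1.0) * w
    w_final = perturbed_gradient_descent(grad_f, np.zeros(3))
    print(np.linalg.norm(w_final))  # approximately 1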
Now we are ready to define a class of functions that can be optimized efficiently while still allowing symmetry and saddle points.
Definition 7.3.3 (Locally optimizable functions). An objective function $f(w)$ is locally optimizable if for every $\tau > 0$ there exist $\epsilon, \gamma = \mathrm{poly}(\tau)$ such that every $(\epsilon, \gamma)$-SOSP $w$ of $f$ satisfies $f(w) \le f(w^*) + \tau$, where $w^*$ is a global minimizer of $f$.
Roughly speaking, an objective function is locally optimizable
if every local minimum of the function is also a global minimum,
and the Hessian of every saddle point has a negative eigenvalue.
Similar classes of functions were called “strict saddle” or “ridable” in previous works. Many nonconvex objectives, including matrix sensing [? ? ? ], matrix completion [? ? ], dictionary learning [? ], phase retrieval [? ], tensor decomposition [? ], synchronization problems [? ], and certain objectives for two-layer neural networks [? ], are known to be locally optimizable.
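As a small numerical illustration of this phenomenon (a toy experiment in the spirit of, but not taken from, the cited works), plain gradient descent on a rank-one matrix factorization objective, a simplified relative of the matrix sensing and matrix completion problems above, reaches a global minimum from a small random initialization even though the objective is nonconvex and has a symmetric pair of minima at $\pm u^*$.

    import numpy as np

    rng = np.random.default_rng(0)

    # Ground truth M = u* u*^T and objective f(u) = ||u u^T - M||_F^2 / 4.
    d = 20
    u_star = rng.standard_normal(d)
    M = np.outer(u_star, u_star)

    def grad_f(u):
        # Gradient of f(u) = ||u u^T - M||_F^2 / 4 is (u u^T - M) u.
        return (np.outer(u, u) - M) @ u

    u = 0.01 * rng.standard_normal(d)   # small random initialization
    eta = 0.01
    for _ in range(20_000):
        u = u - eta * grad_f(u)

    # The global minima are exactly +/- u*; check we converged to one of them.
    print(min(np.linalg.norm(u - u_star), np.linalg.norm(u + u_star)))  # close to 0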