
equal to $\frac{1}{k}\sum_{i=1}^k w_i$ (where $w_i$ is the weight of the $i$-th neuron in $\theta^*$), so $h_{\bar{\theta}}(x) = k\sigma(\langle \frac{1}{k}\sum_{i=1}^k w_i, x\rangle)$ is equivalent to a neural network with a single neuron. In most cases a single-neuron network cannot achieve the global minimum, so by contradiction $f$ cannot be convex.
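As a quick numerical illustration of this symmetry argument, here is a minimal sketch. It assumes a two-layer network of the form $h_\theta(x) = \sum_{i=1}^k \sigma(\langle w_i, x\rangle)$ with ReLU activation; the network size, weights, and input below are made up purely for illustration and are not the book's exact construction.

import numpy as np
from itertools import permutations

# Illustrative two-layer network h_theta(x) = sum_i sigma(<w_i, x>) with ReLU sigma.
relu = lambda z: np.maximum(z, 0.0)

def h(W, x):                       # W: (k, d) matrix whose rows are neuron weights
    return relu(W @ x).sum()

rng = np.random.default_rng(0)
k, d = 3, 4
W_star = rng.normal(size=(k, d))   # pretend this is a global minimizer theta*
x = rng.normal(size=d)

# Permuting the neurons leaves the function (and hence the loss) unchanged.
vals = [h(W_star[list(p)], x) for p in permutations(range(k))]
print(np.allclose(vals, vals[0]))  # True: permutation symmetry

# Averaging all permuted copies collapses every neuron to w_bar = mean of the w_i,
# so the averaged network computes k * sigma(<w_bar, x>), i.e. a single neuron.
W_avg = np.mean([W_star[list(p)] for p in permutations(range(k))], axis=0)
w_bar = W_star.mean(axis=0)
print(np.allclose(W_avg, np.tile(w_bar, (k, 1))))    # True
print(np.isclose(h(W_avg, x), k * relu(w_bar @ x)))  # True

If $f$ were convex, the average of these permuted global minimizers would itself have to be a global minimizer, which is exactly what the collapse to a single neuron rules out.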

It’s also possible to show that functions with symmetry must have saddle points$^6$. Therefore, to optimize such a function, the algorithm needs to be able to either avoid or escape from saddle points. More concretely, one would like to find a second order stationary point.

6. Except for some degenerate cases, such as constant functions.

Definition 7.3.1 (Second order stationary point (SOSP)). For an objective function $f(w): \mathbb{R}^d \to \mathbb{R}$, a point $w$ is a second order stationary point if $\nabla f(w) = 0$ and $\nabla^2 f(w) \succeq 0$.

The conditions for a second order stationary point are known as the second order necessary conditions for a local minimum. Of course, an optimization algorithm will generally not be able to find an exact second order stationary point (just as in Section ?? we only show that gradient descent finds a point with small gradient, not zero gradient). Instead, optimization algorithms can be used to find an approximate second order stationary point:

Definition 7.3.2 (Approximate second order stationary point). For an objective function $f(w): \mathbb{R}^d \to \mathbb{R}$, a point $w$ is an $(\epsilon, \gamma)$-second order stationary point (later abbreviated as $(\epsilon, \gamma)$-SOSP) if $\|\nabla f(w)\|_2 \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(w)) \ge -\gamma$.
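When the gradient and Hessian are available, Definition 7.3.2 is straightforward to check numerically. Below is a minimal sketch; the helper name, the toy objective $f(x, y) = (x^2 - 1)^2 + y^2$, and the chosen $\epsilon, \gamma$ are all illustrative assumptions rather than anything from the text.

import numpy as np

def is_approx_sosp(grad, hess, eps, gamma):
    """Check the (eps, gamma)-SOSP condition of Definition 7.3.2:
    ||grad||_2 <= eps and lambda_min(hess) >= -gamma."""
    return (np.linalg.norm(grad) <= eps
            and np.linalg.eigvalsh(hess).min() >= -gamma)

# Toy objective f(x, y) = (x^2 - 1)^2 + y^2 (illustrative only).
def grad_f(w):
    x, y = w
    return np.array([4.0 * x * (x**2 - 1.0), 2.0 * y])

def hess_f(w):
    x, _ = w
    return np.array([[12.0 * x**2 - 4.0, 0.0],
                     [0.0,               2.0]])

w_saddle = np.array([0.0, 0.0])   # gradient is 0, but the Hessian has eigenvalue -4
w_min    = np.array([1.0, 0.0])   # a global minimum: gradient 0, Hessian PSD
print(is_approx_sosp(grad_f(w_saddle), hess_f(w_saddle), 1e-3, 1e-3))  # False
print(is_approx_sosp(grad_f(w_min),    hess_f(w_min),    1e-3, 1e-3))  # True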

Later in Chapter ?? we will show that simple variants of gradient descent can in fact find $(\epsilon, \gamma)$-SOSPs efficiently.
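The analysis is deferred to that chapter, but the basic idea can be sketched on the toy function above: take gradient steps while the gradient is large, and add a small random perturbation when the gradient is small but the Hessian still has a markedly negative eigenvalue. This sketch uses an explicit Hessian eigenvalue check purely for clarity (the gradient-only variants analyzed later avoid Hessian computations), it reuses grad_f and hess_f from the previous sketch, and the step size and perturbation radius are arbitrary choices.

def find_sosp(w0, grad, hess, eta=0.05, eps=1e-3, gamma=1e-3,
              radius=0.1, max_steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(max_steps):
        g = grad(w)
        if np.linalg.norm(g) > eps:
            w = w - eta * g                              # ordinary gradient step
        elif np.linalg.eigvalsh(hess(w)).min() < -gamma:
            w = w + radius * rng.normal(size=w.shape)    # escape the saddle
        else:
            return w                                     # (eps, gamma)-SOSP found
    return w

w = find_sosp([0.0, 0.0], grad_f, hess_f)   # start exactly at the saddle point
print(w)                                    # ends near (1, 0), a global minimum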

Now we are ready to define a class of functions that can be optimized efficiently while still allowing symmetry and saddle points.

Definition 7.3.3 (Locally optimizable functions). An objective function $f(w)$ is locally optimizable if for every $\tau > 0$ there exist $\epsilon, \gamma = \mathrm{poly}(\tau)$ such that every $(\epsilon, \gamma)$-SOSP $w$ of $f$ satisfies $f(w) \le f(w^*) + \tau$.
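As a concrete instance (using the same illustrative toy function as in the sketches above, not an example from the original text), $f(x, y) = (x^2 - 1)^2 + y^2$ is locally optimizable:
\[
\nabla f(x,y) = \begin{pmatrix} 4x(x^2 - 1) \\ 2y \end{pmatrix}, \qquad
\nabla^2 f(x,y) = \begin{pmatrix} 12x^2 - 4 & 0 \\ 0 & 2 \end{pmatrix}.
\]
The condition $\lambda_{\min}(\nabla^2 f) \ge -\gamma$ forces $12x^2 - 4 \ge -\gamma$, so $|x|$ is bounded away from $0$; combined with the small-gradient condition $\|\nabla f\|_2 \le \epsilon$, this places $(x, y)$ close to $(\pm 1, 0)$, the global minima. Hence for small enough $\epsilon, \gamma = \mathrm{poly}(\tau)$, every $(\epsilon, \gamma)$-SOSP has value within $\tau$ of the optimum.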

Roughly speaking, an objective function is locally optimizable if every local minimum of the function is also a global minimum, and the Hessian of every saddle point has a negative eigenvalue. A similar class of functions was called “strict saddle” or “ridable” in some previous results. Many nonconvex objectives, including matrix sensing [? ? ? ], matrix completion [? ? ], dictionary learning [? ], phase retrieval [? ], tensor decomposition [? ], synchronization problems [? ] and certain objectives for two-layer neural networks [? ] are known to be locally optimizable.
