Theory of Deep Learning, 2022
7
Tractable Landscapes for Nonconvex Optimization
Deep learning relies on optimizing complicated, nonconvex loss
functions. Finding the global minimum of a nonconvex objective is
NP-hard in the worst case. However, in deep learning, simple algorithms
such as stochastic gradient descent often drive the objective value to
zero or near zero by the end of training. This chapter focuses on the optimization
landscape defined by a nonconvex objective and identifies properties
of these landscapes that allow simple optimization algorithms to find
global minima (or near-minima). So far these properties have been established
for nonconvex problems simpler than deep learning, and it remains open how
to analyse deep learning with such landscape analysis.
Warm-up: Convex Optimization
To understand optimization landscapes, one can first
look at optimizing a convex function. If a function
f (w) is convex, then it satisfies many nice properties, including
∀α ∈ [0, 1], ∀w, w ′ , f (αw + (1 − α)w ′ ) ≤ α f (w) + (1 − α) f (w ′ ). (7.1)
∀w, w ′ , f (w ′ ) ≥ f (w) + 〈∇ f (w), w ′ − w〉. (7.2)
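As a quick sanity check, both inequalities can be verified numerically for a simple convex function, say f (w) = ‖w‖², whose gradient is 2w. (This is an illustrative example, not one from the text.)

```python
import numpy as np

def f(w):
    # A simple convex function: f(w) = ||w||^2
    return np.dot(w, w)

def grad_f(w):
    # Gradient of f: 2w
    return 2 * w

rng = np.random.default_rng(0)
w, w_prime = rng.standard_normal(3), rng.standard_normal(3)
alpha = 0.3

# Equation (7.1): f evaluated on the segment lies below the chord
lhs = f(alpha * w + (1 - alpha) * w_prime)
rhs = alpha * f(w) + (1 - alpha) * f(w_prime)
assert lhs <= rhs + 1e-12

# Equation (7.2): f lies above its tangent plane at w
assert f(w_prime) >= f(w) + grad_f(w) @ (w_prime - w) - 1e-12
```

Both assertions pass for any choice of w, w ′ and α, since ‖w‖² is convex.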
These equations characterize important geometric properties of
the objective function f (w). In particular, Equation (7.1) shows that
all the global minima of f (w) must be connected, because if w, w ′
are both globally optimal, anything on the segment αw + (1 − α)w ′
must also be optimal. Such properties are important because they give
a characterization of all the global minima. Equation (7.2) shows that
every point with ∇ f (w) = 0 must be a global minimum, because
for every w ′ we have f (w ′ ) ≥ f (w) + 〈∇ f (w), w ′ − w〉 = f (w).
Such properties are important because they connect a local property
(the gradient being 0) to global optimality.
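The local-to-global implication is exactly what makes gradient descent succeed on convex objectives: driving the gradient to zero suffices for global optimality. A minimal sketch on a convex quadratic f (w) = ½ wᵀAw − bᵀw (a hypothetical example; A and the step size are chosen for illustration):

```python
import numpy as np

# Convex quadratic f(w) = 0.5 * w^T A w - b^T w with A positive definite.
# Its unique stationary point solves A w = b and, by Equation (7.2),
# is the global minimum.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite
b = np.array([1.0, -1.0])

def grad(w):
    # Gradient of the quadratic: A w - b
    return A @ w - b

w = np.zeros(2)
lr = 0.1                                  # step size below 2 / lambda_max(A)
for _ in range(500):
    w = w - lr * grad(w)                  # plain gradient descent

w_star = np.linalg.solve(A, b)            # exact global minimizer
print(np.allclose(w, w_star, atol=1e-6))  # prints True
```

Gradient descent converges to the point where ∇ f (w) = 0, and convexity guarantees that this point is globally optimal rather than merely a local stationary point.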
In general, optimization landscape analysis looks for properties of the
objective function that characterize its local/global optimal points
(such as Equation (7.1)) or connect local properties with global
optimality (such as Equation (7.2)).