2 Basics of Optimization
This chapter sets up the basic analysis framework for gradient-based
optimization algorithms and discusses how it applies to deep learning.
≪Tengyu notes: Sanjeev notes:
Suggestion: when introducing the usual abstractions like the Lipschitz constant, Hessian norm, etc., let's relate them concretely to what they mean in the context of deep learning (noting that the Lipschitz constant is with respect to the vector of parameters). Be frank about what these numbers might be for deep learning, or even how feasible it is to estimate them. (Maybe that discussion can go in the side bar.)
BTW, it may be useful to give some numbers for the empirical Lipschitz constant encountered in training.
One suspects that the optimization speed analysis is rather pessimistic.≫
≪Suriya notes: To ground optimization in our case, we can also mention that $f$ is often of either the ERM or stochastic optimization form $L(w) = \sum_i \ell(w; x_i, y_i)$; it might also be useful to mention that outside of this chapter, we typically use $f$ as an alternative to $h$ to denote a function computed by a network.≫
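To make the ERM form in the note above concrete, here is a minimal sketch in Python; the squared-error per-example loss and the toy dataset are illustrative assumptions, not from the text:

```python
import numpy as np

# Hypothetical per-example loss: squared error of a linear model,
# l(w; x, y) = (<w, x> - y)^2.
def loss(w, x, y):
    return (w @ x - y) ** 2

# ERM objective: L(w) = sum_i l(w; x_i, y_i) over a toy dataset.
def L(w, xs, ys):
    return sum(loss(w, x, y) for x, y in zip(xs, ys))

xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ys = [1.0, 2.0]
print(L(np.array([0.0, 0.0]), xs, ys))  # 5.0
print(L(np.array([1.0, 2.0]), xs, ys))  # 0.0 at the interpolating w
```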
≪Tengyu notes: should we use $w$ or $\theta$ in this section?≫ ≪Suriya notes: I remember that we agreed on $w$ for parameters a long time back; did we go back to theta?≫
2.1 Gradient descent
Suppose we would like to optimize a continuous function $f(w)$ over $\mathbb{R}^d$:
$$\min_{w \in \mathbb{R}^d} f(w).$$
The gradient descent (GD) algorithm is
$$w_0 = \text{initialization}$$
$$w_{t+1} = w_t - \eta \nabla f(w_t),$$
where $\eta$ is the step size or learning rate.
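As a minimal sketch (not from the text), the GD iteration in Python on an illustrative least-squares objective; the matrix $A$, vector $b$, step size, and iteration count are all hypothetical choices:

```python
import numpy as np

# Illustrative objective: f(w) = 1/2 ||A w - b||^2, a simple least-squares problem.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])

def f(w):
    return 0.5 * np.sum((A @ w - b) ** 2)

def grad_f(w):
    # Gradient of the least-squares objective: A^T (A w - b).
    return A.T @ (A @ w - b)

eta = 0.1          # step size (learning rate)
w = np.zeros(2)    # w_0 = initialization
for t in range(100):
    w = w - eta * grad_f(w)   # w_{t+1} = w_t - eta * grad f(w_t)

print(w, f(w))     # w approaches the minimizer A^{-1} b = [0.5, 1.0]
```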
One motivation or justification for GD is that the update direction $-\nabla f(w_t)$ is locally the steepest descent direction. Consider the Taylor expansion at a point $w_t$:
$$f(w) = f(w_t) + \underbrace{\langle \nabla f(w_t), w - w_t \rangle}_{\text{linear in } w} + \cdots$$
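Since the linear term satisfies $\langle \nabla f(w_t), u \rangle \geq -\|\nabla f(w_t)\|$ for every unit vector $u$, with equality at $u = -\nabla f(w_t)/\|\nabla f(w_t)\|$, no unit direction decreases the first-order approximation faster than the negative gradient. A small numerical check of this claim, on an illustrative objective and base point of our choosing (not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(w):
    # Illustrative smooth objective (an assumption, not from the text).
    return np.sum(w ** 4) + np.sin(w[0])

def grad_f(w):
    return np.array([4 * w[0] ** 3 + np.cos(w[0]), 4 * w[1] ** 3])

w_t = np.array([1.0, -0.5])
g = grad_f(w_t)
eps = 1e-4  # small radius, so the linear Taylor term dominates

# Value after a tiny step along the negative gradient direction.
best = f(w_t - eps * g / np.linalg.norm(g))

for _ in range(1000):
    u = rng.standard_normal(2)
    u /= np.linalg.norm(u)  # random unit direction
    # Allow O(eps^2) slack for the higher-order Taylor terms.
    assert best <= f(w_t + eps * u) + 1e-6
print("no unit direction beats -grad f(w_t), up to O(eps^2)")
```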