2 Basics of Optimization

This chapter sets up the basic analysis framework for gradient-based optimization algorithms and discusses how it applies to deep learning.

≪Tengyu notes: Sanjeev notes: Suggestion: when introducing the usual abstractions like the Lipschitz constant, Hessian norm, etc., let's relate them concretely to what they mean in the context of deep learning (noting that the Lipschitz constant is with respect to the vector of parameters). Be frank about what these numbers might be for deep learning, or even how feasible it is to estimate them. (Maybe that discussion can go in the sidebar.) BTW, it may be useful to give some numbers for the empirical Lipschitz constant encountered in training. One suspects that the optimization speed analysis is rather pessimistic.≫

≪Suriya notes: To ground optimization in our case, we can also mention that $f$ is often of either the ERM or stochastic optimization form $L(w) = \sum \ell(w; x, y)$; it might also be useful to mention that outside of this chapter, we typically use $f$ as an alternative for $h$ to denote a function computed by the network.≫

≪Tengyu notes: should we use w or θ in this section?≫ ≪Suriya notes: I remember that we agreed on w for parameters a long time back; did we go back to theta?≫

2.1 Gradient descent

Suppose we would like to optimize a continuous function $f(w)$ over $\mathbb{R}^d$:

$$\min_{w \in \mathbb{R}^d} f(w).$$

The gradient descent (GD) algorithm is

$$w_0 = \text{initialization},$$
$$w_{t+1} = w_t - \eta \nabla f(w_t),$$

where $\eta$ is the step size or learning rate.
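As a concrete illustration (not from the original text), here is a minimal Python sketch of the GD update; the quadratic objective, step size, and iteration count below are hypothetical choices made only to keep the example self-contained:

```python
import numpy as np

def gradient_descent(grad_f, w0, eta, num_steps):
    """Iterate w_{t+1} = w_t - eta * grad_f(w_t) and return the final iterate."""
    w = w0
    for _ in range(num_steps):
        w = w - eta * grad_f(w)
    return w

# Hypothetical example: f(w) = ||w - 1||^2 / 2 has gradient w - 1,
# so GD should converge to the all-ones vector.
w_star = gradient_descent(lambda w: w - 1.0, w0=np.zeros(3), eta=0.1, num_steps=200)
print(w_star)  # approximately [1. 1. 1.]
```

In the deep learning setting flagged in the notes above, `grad_f` would be the gradient of a training loss with respect to the parameter vector $w$, typically estimated on mini-batches rather than computed exactly.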

One motivation for GD is that the update direction $-\nabla f(w_t)$ is locally the direction of steepest descent. Consider the Taylor expansion at a point $w_t$:

$$f(w) = f(w_t) + \underbrace{\langle \nabla f(w_t),\, w - w_t \rangle}_{\text{linear in } w} + \cdots$$
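To make the steepest-descent claim precise (a standard one-line argument, added here for completeness): minimizing the linear term over a small ball of radius $\epsilon$ around $w_t$, the Cauchy–Schwarz inequality gives

$$\min_{\|w - w_t\| \le \epsilon} \langle \nabla f(w_t),\, w - w_t \rangle = -\epsilon \|\nabla f(w_t)\|,$$

attained at $w - w_t = -\epsilon\, \nabla f(w_t) / \|\nabla f(w_t)\|$. That is, among all moves of a given small length, the direction $-\nabla f(w_t)$ decreases the linear approximation of $f$ the most.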
