
Suppose we drop the higher-order term and only optimize the first-order approximation within a neighborhood of w_t:

\[
\arg\min_{w \in \mathbb{R}^d} \; f(w_t) + \langle \nabla f(w_t),\, w - w_t \rangle
\quad \text{s.t.} \quad \|w - w_t\|_2 \le \epsilon
\]

Then, the minimizer of the program above is equal to w_t + δ, where

\[
\delta = -\alpha \nabla f(w_t)
\]

for some positive scalar α. ≪Tengyu notes: this fact can be an exercise≫ In other words, to locally minimize the first-order approximation of f(·) around w_t, we should move in the direction −∇f(w_t).
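Since this fact is left as an exercise, here is a minimal derivation sketch (not from the text), assuming ∇f(w_t) ≠ 0. Writing δ = w − w_t, the Cauchy-Schwarz inequality bounds the linear objective over the ball ‖δ‖_2 ≤ ε:

\[
\langle \nabla f(w_t),\, \delta \rangle \;\ge\; -\|\nabla f(w_t)\|_2\,\|\delta\|_2 \;\ge\; -\epsilon\,\|\nabla f(w_t)\|_2,
\]

with equality attained at δ = −(ε/‖∇f(w_t)‖_2)·∇f(w_t), i.e., δ = −α∇f(w_t) with the positive scalar α = ε/‖∇f(w_t)‖_2.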

2.1.1 Formalizing the Taylor Expansion

We will state a lemma that characterizes the descent of function values under GD. We assume that the eigenvalues of ∇²f(w) are bounded in [−L, L] for all w; we call functions satisfying this condition L-smooth. ≪Tengyu notes: missing definition of ∇²f but perhaps it should belong to somewhere else.≫ This allows us to approximate the function accurately using a Taylor expansion, in the following sense:

\[
f(w) \le f(w_t) + \langle \nabla f(w_t),\, w - w_t \rangle + \frac{L}{2}\,\|w - w_t\|_2^2 \tag{2.1}
\]

≪Tengyu notes: another exercise≫
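A short proof sketch filling in this exercise (not from the text), assuming f is twice continuously differentiable: by Taylor's theorem with the Lagrange form of the remainder, for some ξ on the segment between w_t and w,

\[
f(w) = f(w_t) + \langle \nabla f(w_t),\, w - w_t \rangle + \tfrac{1}{2}\,(w - w_t)^\top \nabla^2 f(\xi)\,(w - w_t),
\]

and since every eigenvalue of ∇²f(ξ) is at most L, the quadratic term is at most (L/2)‖w − w_t‖_2^2, which gives (2.1).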

2.1.2 Descent lemma for gradient descent

The following lemma says that with gradient descent and a small enough learning rate, the function value always decreases unless the gradient at the iterate is zero.

Lemma 2.1.1 (Descent Lemma). Suppose f is L-smooth. Then, if η < 1/(2L), we have

\[
f(w_{t+1}) \le f(w_t) - \frac{\eta}{2}\,\|\nabla f(w_t)\|_2^2.
\]

The proof uses the Taylor expansion. The main idea is that even using the upper bound provided by equation (2.1) suffices.

Proof. We have that

\[
\begin{aligned}
f(w_{t+1}) &= f\big(w_t - \eta \nabla f(w_t)\big) \\
&\le f(w_t) + \langle \nabla f(w_t),\, -\eta \nabla f(w_t) \rangle + \frac{L}{2}\,\|\eta \nabla f(w_t)\|_2^2 \\
&= f(w_t) - \big(\eta - \eta^2 L/2\big)\,\|\nabla f(w_t)\|_2^2 \\
&\le f(w_t) - \frac{\eta}{2}\,\|\nabla f(w_t)\|_2^2,
\end{aligned}
\]

where the first inequality applies equation (2.1) with w = w_{t+1}, and the last step uses that η < 1/(2L) implies η − η²L/2 ≥ η/2.
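As a quick sanity check, here is a minimal numerical sketch (not from the text, using NumPy) that runs gradient descent on a convex quadratic f(w) = ½ wᵀAw, whose smoothness constant L is the largest eigenvalue of A, and verifies the guarantee of Lemma 2.1.1 at every iterate; the dimension, random seed, and step count are arbitrary choices.

import numpy as np

# Check the descent lemma f(w_{t+1}) <= f(w_t) - (eta/2) * ||grad f(w_t)||_2^2
# on the quadratic f(w) = 0.5 * w^T A w, which is L-smooth with L = lambda_max(A).
rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
A = B @ B.T                           # symmetric PSD Hessian
L = float(np.linalg.eigvalsh(A)[-1])  # smoothness constant L = largest eigenvalue
eta = 0.4 / L                         # learning rate satisfying eta < 1/(2L)

f = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w = rng.standard_normal(d)
for t in range(50):
    g = grad(w)
    w_next = w - eta * g              # gradient descent step w_{t+1} = w_t - eta * grad f(w_t)
    # guarantee of Lemma 2.1.1 (small tolerance for floating-point error)
    assert f(w_next) <= f(w) - 0.5 * eta * (g @ g) + 1e-12
    w = w_next

print("descent lemma held at every step; final f(w) =", f(w))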
