
Theory of Deep Learning, 2022


basics of optimization 19

where the second step follows from Eq. (2.1), and the last step follows from $\eta \leq 2/L$.

≪Tengyu notes: perhaps add a corollary saying that GD "converges" to stationary points≫

2.2 Stochastic gradient descent

Motivation: Computing the gradient of a loss function can be expensive. Recall that
$$\hat{L}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\left( (x^{(i)}, y^{(i)}), h \right).$$

Computing the gradient $\nabla \hat{L}(h)$ scales linearly in $n$. Stochastic gradient descent (SGD) estimates the gradient by sampling a mini-batch of examples and averaging their gradients. Especially when the gradients of the examples are similar, this estimator can be reasonably accurate. (And even if the estimator is not accurate, as long as the learning rate is small enough, the noise averages out across iterations.)
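To make the cost comparison concrete, here is a small sketch (a hypothetical example, not from the text: per-example losses $f_i(w) = \frac{1}{2}(w - x_i)^2$, so $\nabla f_i(w) = w - x_i$) showing that a mini-batch average of per-example gradients approximates the full gradient at $O(B)$ rather than $O(n)$ cost:

```python
import numpy as np

# Hypothetical per-example losses f_i(w) = 0.5 * (w - x_i)^2,
# so grad f_i(w) = w - x_i and the full gradient is w - mean(x).
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
w = 2.0

full_grad = np.mean(w - x)            # exact gradient, costs O(n)

B = 64                                # mini-batch size
idx = rng.integers(0, n, size=B)      # sample i_1, ..., i_B uniformly from [n]
minibatch_grad = np.mean(w - x[idx])  # mini-batch estimate, costs O(B)

print(full_grad, minibatch_grad)
```

With $B = 64$ the estimate typically deviates from the full gradient on the order of $1/\sqrt{B}$ here, which a small learning rate averages out over iterations.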

The updates: We simplify the notation a bit for ease of exposition. We consider optimizing the function
$$\frac{1}{n} \sum_{i=1}^{n} f_i(w).$$

So here $f_i$ corresponds to $\ell\left( (x^{(i)}, y^{(i)}), h \right)$ in the statistical learning setting. At each iteration $t$, the SGD algorithm first samples $i_1, \dots, i_B$ uniformly from $[n]$, and then computes the estimated gradient using the samples:
$$g_S(w_t) = \frac{1}{B} \sum_{k=1}^{B} \nabla f_{i_k}(w_t).$$

Here $S$ is shorthand for $\{i_1, \dots, i_B\}$. The SGD algorithm updates the iterate by
$$w_{t+1} = w_t - \eta \cdot g_S(w_t).$$
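The full SGD loop can be sketched in a few lines (an illustrative toy problem of our choosing, not from the text: $f_i(w) = \frac{1}{2}(w - x_i)^2$, whose average $\frac{1}{n}\sum_i f_i(w)$ is minimized at the mean of the $x_i$):

```python
import numpy as np

# Toy objective (hypothetical): f_i(w) = 0.5 * (w - x_i)^2, so the
# minimizer of (1/n) sum_i f_i(w) is the mean of the x_i.
rng = np.random.default_rng(1)
n, B, eta = 1000, 32, 0.1
x = rng.normal(loc=3.0, size=n)

w = 0.0
for t in range(500):
    idx = rng.integers(0, n, size=B)  # sample the mini-batch S
    g_S = np.mean(w - x[idx])         # g_S(w_t) = (1/B) sum_k grad f_{i_k}(w_t)
    w = w - eta * g_S                 # w_{t+1} = w_t - eta * g_S(w_t)

print(w, x.mean())
```

The iterate hovers near the minimizer rather than converging exactly: the remaining fluctuation shrinks as the learning rate or the gradient noise gets smaller.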

2.3 Accelerated Gradient Descent

The basic version of accelerated gradient descent is the heavy-ball algorithm. It has the following update rule:
$$w_{t+1} = w_t - \eta \nabla f(w_t) + \beta (w_t - w_{t-1}).$$
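A minimal sketch of this update (an illustrative quadratic objective chosen by us, not from the text), where the momentum term $\beta(w_t - w_{t-1})$ accelerates progress along the poorly conditioned direction:

```python
import numpy as np

# Illustrative ill-conditioned quadratic f(w) = 0.5 * w^T A w,
# with minimizer w = 0; gradient is A @ w.
A = np.diag([1.0, 100.0])
grad = lambda w: A @ w

eta, beta = 0.009, 0.9                # step size and momentum (hand-picked)
w = np.array([1.0, 1.0])
w_prev = w.copy()                     # w_{t-1}; first step is plain GD
for t in range(300):
    w_next = w - eta * grad(w) + beta * (w - w_prev)  # heavy-ball update
    w_prev, w = w, w_next

print(np.linalg.norm(w))
```

Setting $\beta = 0$ recovers plain gradient descent, which on this quadratic converges far more slowly at a comparable stable step size.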
