Theory of Deep Learning, 2022
basics of optimization
where the second step follows from Eq. (2.1), and the last step follows from $\eta \le 1/L$.
≪Tengyu notes: perhaps add a corollary saying that GD "converges" to stationary points≫
2.2 Stochastic gradient descent
Motivation: Computing the gradient of a loss function could be
expensive. Recall that
$$\hat{L}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell\left( (x^{(i)}, y^{(i)}), h \right).$$
Computing the gradient $\nabla \hat{L}(h)$ scales linearly in $n$. Stochastic gradient
descent (SGD) estimates the gradient by sampling a mini-batch of
examples and averaging their gradients. Especially when the gradients of the
examples are similar, the estimator can be reasonably accurate. (And even if the
estimator is not accurate enough, as long as the learning rate is small enough,
the noise averages out across iterations.)
The updates: We simplify the notation slightly for ease of exposition.
We consider optimizing the function
$$f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w).$$
Here $f_i(w)$ corresponds to $\ell\left( (x^{(i)}, y^{(i)}), h \right)$ in the statistical learning
setting. At each iteration $t$, the SGD algorithm first samples $i_1, \ldots, i_B$
uniformly from $[n]$, and then computes the estimated gradient using
the samples:
$$g_S(w) = \frac{1}{B} \sum_{k=1}^{B} \nabla f_{i_k}(w).$$
Here $S$ is a shorthand for $\{i_1, \ldots, i_B\}$. The SGD algorithm updates the
iterate by
$$w_{t+1} = w_t - \eta \, g_S(w_t).$$
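Putting the pieces together, a minimal SGD loop might look as follows. This is a sketch in numpy; the least-squares $f_i$, the noiseless data, and all hyperparameters are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: f_i(w) = 0.5 * (x_i @ w - y_i)^2, so that
# (1/n) sum_i f_i(w) is a least-squares objective. The data is noiseless
# (y = X @ w_star), so every f_i is minimized at w_star and SGD with a
# constant step size converges to it.
n, d = 500, 3
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star

def sgd(w0, eta=0.05, B=16, T=2000):
    w = w0.copy()
    for t in range(T):
        # Sample i_1, ..., i_B uniformly from [n].
        S = rng.integers(0, n, size=B)
        # g_S(w) = (1/B) * sum_k grad f_{i_k}(w).
        g = X[S].T @ (X[S] @ w - y[S]) / B
        # w_{t+1} = w_t - eta * g_S(w_t).
        w = w - eta * g
    return w

w_hat = sgd(np.zeros(d))
assert np.linalg.norm(w_hat - w_star) < 1e-2
```

With $B = n$ (and sampling without replacement) this reduces to full-batch gradient descent; the interesting regime is $B \ll n$, where each step is cheap but noisy.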
2.3 Accelerated Gradient Descent
The basic version of the accelerated gradient descent algorithm is the
heavy-ball algorithm. It has the following update rule:
$$w_{t+1} = w_t - \eta \nabla f(w_t) + \beta (w_t - w_{t-1}).$$
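As a sketch, here is the heavy-ball update on a toy ill-conditioned quadratic (the objective and the choices of $\eta$ and $\beta$ are illustrative assumptions, not from the text):

```python
import numpy as np

# Toy quadratic f(w) = 0.5 * w @ A @ w, minimized at w = 0.
A = np.diag([1.0, 10.0])

def grad(w):
    return A @ w

def heavy_ball(w0, eta=0.05, beta=0.9, T=500):
    w_prev = w0.copy()
    w = w0.copy()
    for t in range(T):
        # w_{t+1} = w_t - eta * grad f(w_t) + beta * (w_t - w_{t-1}).
        w, w_prev = w - eta * grad(w) + beta * (w - w_prev), w
    return w

w_final = heavy_ball(np.array([5.0, 5.0]))
assert np.linalg.norm(w_final) < 1e-6
```

Setting $\beta = 0$ recovers plain gradient descent; the momentum term $\beta(w_t - w_{t-1})$ lets the iterate keep moving along directions of consistent descent, which is what speeds up convergence on poorly conditioned problems.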