Here $\beta(w_t - w_{t-1})$ is the so-called momentum term. The motivation, and the origin of the algorithm's name, comes from the fact that the update can be viewed as a discretization of the second-order ODE
$$\ddot{w} + a\dot{w} + b\nabla f(w) = 0.$$
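To see the connection, discretize the ODE with a finite-difference step size $\delta$ (a sketch; the correspondence below is one natural choice of constants):
$$\frac{w_{t+1} - 2w_t + w_{t-1}}{\delta^2} + a\,\frac{w_t - w_{t-1}}{\delta} + b\,\nabla f(w_t) = 0.$$
Solving for $w_{t+1}$ gives
$$w_{t+1} = w_t - b\delta^2\,\nabla f(w_t) + (1 - a\delta)(w_t - w_{t-1}),$$
which is exactly the heavy-ball update with $\eta = b\delta^2$ and $\beta = 1 - a\delta$.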
Another equivalent way to write the algorithm is
$$u_t = -\nabla f(w_t) + \beta u_{t-1},$$
$$w_{t+1} = w_t + \eta u_t.$$
Exercise: verify the two forms of the algorithm are indeed equivalent.
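As a quick numerical sanity check, here is a minimal Python sketch (on a hypothetical two-dimensional quadratic, with arbitrarily chosen $\eta$ and $\beta$) that runs the heavy-ball form and the $u_t$-form side by side and confirms the iterates coincide.

```python
import numpy as np

# Quadratic test objective f(w) = 0.5 * w^T A w (A is an arbitrary PSD choice).
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w

eta, beta = 0.05, 0.9
w0 = np.array([1.0, 1.0])

# Form 1: w_{t+1} = w_t - eta * grad(w_t) + beta * (w_t - w_{t-1})
w_prev, w = w0.copy(), w0.copy()
# Form 2: u_t = -grad(w_t) + beta * u_{t-1};  w_{t+1} = w_t + eta * u_t
v, u = w0.copy(), np.zeros_like(w0)

for _ in range(100):
    w, w_prev = w - eta * grad(w) + beta * (w - w_prev), w
    u = -grad(v) + beta * u
    v = v + eta * u

print(np.allclose(w, v))  # True: the two forms generate identical iterates
```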
Another variant of the heavy-ball algorithm is due to Nesterov:
$$u_t = -\nabla f\big(w_t + \beta(w_t - w_{t-1})\big) + \beta u_{t-1},$$
$$w_{t+1} = w_t + \eta u_t.$$
One can see that $u_t$ stores a weighted sum of all the historical gradients, and the update of $w_t$ therefore uses all past gradients: unrolling the heavy-ball recursion gives $u_t = -\sum_{i=0}^{t} \beta^{t-i} \nabla f(w_i)$. This is another interpretation of the accelerated gradient descent algorithm.
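For concreteness, here is a minimal Python sketch of the Nesterov variant above (same hypothetical quadratic and step sizes as before); the only change from heavy ball is that the gradient is evaluated at the look-ahead point $w_t + \beta(w_t - w_{t-1})$ rather than at $w_t$ itself.

```python
import numpy as np

# Same quadratic test objective as in the previous sketch.
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w

eta, beta = 0.05, 0.9
w_prev = w = np.array([1.0, 1.0])
u = np.zeros(2)

for _ in range(200):
    lookahead = w + beta * (w - w_prev)   # gradient evaluated ahead of w_t
    u = -grad(lookahead) + beta * u
    w_prev, w = w, w + eta * u

print(w)  # close to the minimizer at the origin
```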
Nesterov's gradient descent behaves similarly to the heavy-ball algorithm empirically when training deep neural networks, and it has the advantage of stronger worst-case guarantees on convex functions. Both algorithms can be used with stochastic gradients, but little is known about the theoretical guarantees of stochastic accelerated gradient descent.
2.4 Local Runtime Analysis of GD
When the iterate is near a local minimum, the behavior of gradient
descent is clearer because the function can be locally approximated
by a quadratic function. In this section, we assume for simplicity that we are optimizing a convex quadratic function, and gain some insight into how the curvature of the function influences the convergence of the algorithm.
We use gradient descent to optimize
$$\min_{w} \; \frac{1}{2} w^\top A w,$$
where $A \in \mathbb{R}^{d \times d}$ is a positive semidefinite matrix and $w \in \mathbb{R}^d$.
Remark: w.l.o.g., we can assume that $A$ is a diagonal matrix. Diagonalization is a fundamental idea in linear algebra. Suppose $A$ has singular value decomposition $A = U\Sigma U^\top$, where $\Sigma$ is a diagonal matrix (for a symmetric PSD matrix this is also its eigendecomposition). We can verify that $w^\top A w = \hat{w}^\top \Sigma \hat{w}$ with $\hat{w} = U^\top w$. In other words, in a different coordinate system defined by $U$, we are dealing with a quadratic form with a diagonal matrix $\Sigma$ as the coefficient. Note that the diagonalization technique here is only used for analysis.
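To make the role of curvature concrete, the following minimal Python sketch (with an arbitrarily chosen diagonal $\Sigma$) shows that gradient descent decouples across coordinates: the $i$-th coordinate contracts by exactly a factor of $(1 - \eta \sigma_i)$ per step, so directions of small curvature converge slowest.

```python
import numpy as np

# Gradient descent on min_w 0.5 * w^T Sigma w with diagonal Sigma.
# The update w_{t+1} = (I - eta * Sigma) w_t acts independently on each
# coordinate, shrinking the i-th one by (1 - eta * sigma_i) per step.
# (The sigma values below are arbitrary illustrative choices.)
sigma = np.array([0.5, 1.0, 10.0])
eta = 0.05
w = np.ones(3)

T = 50
for _ in range(T):
    w = w - eta * sigma * w  # gradient of the quadratic is Sigma @ w

print(w)                       # iterate after T steps
print((1 - eta * sigma) ** T)  # predicted per-coordinate contraction: identical
```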