
Here β(w_t − w_{t−1}) is the so-called momentum term. The motivation and the origin of the name of the algorithm come from the fact that it can be viewed as a discretization of the second-order ODE:

ẅ + aẇ + b∇f(w) = 0
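For instance, one natural discretization with step size δ replaces ẅ by (w_{t+1} − 2w_t + w_{t−1})/δ² and ẇ by (w_t − w_{t−1})/δ; rearranging yields

w_{t+1} = w_t − bδ²·∇f(w_t) + (1 − aδ)·(w_t − w_{t−1}),

which is exactly the heavy-ball update with η = bδ² and β = 1 − aδ.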

Another equivalent way to write the algorithm is

u_t = −∇f(w_t) + β·u_{t−1},

w_{t+1} = w_t + η·u_t

Exercise: verify the two forms of the algorithm are indeed equivalent.
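As a quick numerical sanity check (a sketch, not from the text; the matrix, η, β, and initialization below are arbitrary choices), one can run both forms side by side and confirm they produce the same iterates:

import numpy as np

# Toy quadratic f(w) = 0.5 * w^T A w; A, eta, beta, and w0 are arbitrary choices.
A = np.diag([1.0, 10.0])

def grad(w):
    # Gradient of f(w) = 0.5 * w^T A w.
    return A @ w

eta, beta, T = 0.05, 0.9, 50
w0 = np.array([1.0, 1.0])

# Form 1: w_{t+1} = w_t - eta * grad(w_t) + beta * (w_t - w_{t-1}), with w_{-1} = w_0.
w, w_prev, iterates1 = w0, w0, []
for _ in range(T):
    w, w_prev = w - eta * grad(w) + beta * (w - w_prev), w
    iterates1.append(w)

# Form 2: u_t = -grad(w_t) + beta * u_{t-1},  w_{t+1} = w_t + eta * u_t, with u_{-1} = 0.
w, u, iterates2 = w0, np.zeros(2), []
for _ in range(T):
    u = -grad(w) + beta * u
    w = w + eta * u
    iterates2.append(w)

# The two trajectories agree up to floating-point rounding error.
print(np.max(np.abs(np.array(iterates1) - np.array(iterates2))))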

Another variant of the heavy-ball algorithm is due to Nesterov:

u_t = −∇f(w_t + β·(u_t − u_{t−1})) + β·u_{t−1},

w_{t+1} = w_t + η·u_t.

One can see that u_t stores a weighted sum of all the historical gradients, and the update of w_t uses all past gradients. This is another interpretation of the accelerated gradient descent algorithm.
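For concreteness, here is a short Python sketch (not from the text) of Nesterov's method in the u-form, using the standard explicit look-ahead point w_t + η·β·u_{t−1} = w_t + β·(w_t − w_{t−1}) for the gradient evaluation; the step size, momentum, and test problem are arbitrary choices.

import numpy as np

def nesterov_gd(grad, w0, eta=0.05, beta=0.9, steps=100):
    # u-form of Nesterov's method: the gradient is evaluated at the look-ahead
    # point w_t + eta * beta * u_{t-1} (equivalently w_t + beta * (w_t - w_{t-1})).
    w = np.asarray(w0, dtype=float)
    u = np.zeros_like(w)
    for _ in range(steps):
        u = -grad(w + eta * beta * u) + beta * u
        w = w + eta * u
    return w

# Example: minimize f(w) = 0.5 * w^T A w with an ill-conditioned diagonal A.
A = np.diag([1.0, 50.0])
w_final = nesterov_gd(lambda w: A @ w, w0=np.array([1.0, 1.0]), eta=0.01)
print(w_final)   # close to the minimizer at the origin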

Empirically, Nesterov's gradient descent works similarly to the heavy-ball algorithm for training deep neural networks. It has the advantage of stronger worst-case guarantees on convex functions.

Both algorithms can be used with stochastic gradients, but little is known about the theoretical guarantees for stochastic accelerated gradient descent.

2.4 Local Runtime Analysis of GD

When the iterate is near a local minimum, the behavior of gradient

descent is clearer because the function can be locally approximated

by a quadratic function. In this section, we assume for simplicity that

we are optimizing a convex quadratic function, and gain some insight into how the curvature of the function influences the convergence of

the algorithm.

We use gradient descent to optimize

min_w (1/2)·w^⊤ A w

where A ∈ R^{d×d} is a positive semidefinite matrix and w ∈ R^d.

Remark: w.l.o.g., we can assume that A is a diagonal matrix. Diagonalization is a fundamental idea in linear algebra. Suppose A has singular value decomposition A = UΣU^⊤, where Σ is a diagonal matrix. We can verify that w^⊤Aw = ŵ^⊤Σŵ with ŵ = U^⊤w. In other words, in a different coordinate system defined by U, we are dealing with a quadratic form with a diagonal coefficient matrix Σ. Note that the diagonalization technique here is only used for analysis.
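As an illustration (a minimal sketch; the eigenvalues, step size, and iteration counts below are arbitrary choices), note that for this objective the gradient is ∇f(w) = Aw, so gradient descent reads w_{t+1} = (I − ηA)w_t and, with a diagonal A, each coordinate contracts independently by a factor (1 − ησ_i) per step:

import numpy as np

# GD on f(w) = 0.5 * w^T A w with a diagonal A; eigenvalues chosen arbitrarily.
# Coordinate i evolves independently: w_{t+1}[i] = (1 - eta * sigma_i) * w_t[i].
sigmas = np.array([100.0, 1.0])   # one high-curvature and one low-curvature direction
eta = 1.0 / sigmas.max()          # a step size that keeps the steepest direction stable
w = np.array([1.0, 1.0])

for t in range(1, 201):
    w = w - eta * (sigmas * w)    # gradient of the diagonal quadratic is sigma_i * w_i
    if t in (1, 10, 100, 200):
        print(t, w)

# The curvature-100 coordinate converges in one step, while the curvature-1
# coordinate shrinks only by a factor 1 - 1/100 = 0.99 per step: the ratio
# sigma_max / sigma_min (the condition number) governs the local convergence rate.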
