
Therefore, we assume that $A = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_1 \ge \cdots \ge \lambda_d$. The function can be simplified to
$$f(w) = \frac{1}{2} \sum_{i=1}^{d} \lambda_i w_i^2.$$

The gradient descent update can be written as
$$w \leftarrow w - \eta \nabla f(w) = w - \eta A w.$$
Here we omit the subscript $t$ for the time step and use the subscript for the coordinate. Equivalently, we can write the per-coordinate update rule
$$w_i \leftarrow w_i - \eta \lambda_i w_i = (1 - \eta \lambda_i) w_i.$$

Now we see that if $\eta > 2/\lambda_i$ for some $i$, then the absolute value of $w_i$ will blow up exponentially and lead to unstable behavior. Thus, we need $\eta \lesssim 1/\max_i \lambda_i$. Note that $\max_i \lambda_i$ corresponds to the smoothness parameter of $f$, because $\lambda_1$ is the largest eigenvalue of $\nabla^2 f = A$. This is consistent with the condition in Lemma 2.1.1 that $\eta$ needs to be small.
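To make the stability threshold concrete, here is a minimal numerical sketch (not from the text) that runs the per-coordinate update $w_i \leftarrow (1 - \eta \lambda_i) w_i$ on a small diagonal quadratic; the eigenvalues and step sizes are illustrative choices.

```python
import numpy as np

# Diagonal quadratic f(w) = 0.5 * sum_i lambda_i * w_i^2 (illustrative eigenvalues)
lam = np.array([10.0, 1.0, 0.1])   # lambda_1 >= ... >= lambda_d
w0 = np.ones_like(lam)

def run_gd(eta, steps=50):
    """Per-coordinate gradient descent: w_i <- (1 - eta * lambda_i) * w_i."""
    w = w0.copy()
    for _ in range(steps):
        w = (1.0 - eta * lam) * w
    return w

# Stable: eta is below 2 / lambda_1 = 0.2, so every coordinate shrinks.
print(run_gd(eta=0.1))

# Unstable: eta = 0.3 > 2 / lambda_1, so |w_1| grows as |1 - 0.3 * 10|^t = 2^t.
print(run_gd(eta=0.3))
```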

Suppose for simplicity we set $\eta = 1/(2\lambda_1)$; then the convergence of the $w_1$ coordinate is very fast: the coordinate $w_1$ is halved every iteration. However, the convergence of the coordinate $w_d$ is slower, because it is only reduced by a factor of $(1 - \lambda_d/(2\lambda_1))$ every iteration. Therefore, it takes $O(\lambda_1/\lambda_d \cdot \log(1/\epsilon))$ iterations to converge to an error $\epsilon$. The analysis here can be extended to general convex functions, which also reflects the principle that:

The condition number is defined as $\kappa = \sigma_{\max}(A)/\sigma_{\min}(A) = \lambda_1/\lambda_d$. It governs the convergence rate of GD.
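As a quick sanity check of this principle, the following sketch (illustrative, not from the text) counts how many iterations per-coordinate GD with $\eta = 1/(2\lambda_1)$ needs before every coordinate drops below a target error, for a well-conditioned and an ill-conditioned diagonal $A$.

```python
import numpy as np

def iterations_to_error(lam, eps=1e-3):
    """Run w_i <- (1 - eta * lambda_i) * w_i with eta = 1/(2 * lambda_1)
    until max_i |w_i| < eps, starting from w = (1, ..., 1)."""
    eta = 1.0 / (2.0 * lam.max())
    w = np.ones_like(lam)
    t = 0
    while np.abs(w).max() >= eps:
        w = (1.0 - eta * lam) * w
        t += 1
    return t

well_conditioned = np.array([2.0, 1.0])     # kappa = 2
ill_conditioned  = np.array([200.0, 1.0])   # kappa = 200

print(iterations_to_error(well_conditioned))  # a few dozen iterations
print(iterations_to_error(ill_conditioned))   # roughly kappa times more
```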

≪Tengyu notes: add figure≫

2.4.1 Pre-conditioners

From the toy quadratic example above, we can see that it would be better if we could use a different learning rate for different coordinates. In other words, if we introduce a learning rate $\eta_i = 1/\lambda_i$ for each coordinate, then we can achieve faster convergence. In the more general setting where $A$ is not diagonal, we don't know the coordinate system in advance, and the algorithm corresponds to
$$w \leftarrow w - A^{-1} \nabla f(w).$$
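A minimal sketch of this preconditioned update, assuming a small symmetric positive definite $A$ chosen only for illustration: for the quadratic $f(w) = \frac{1}{2} w^\top A w$, the update $w \leftarrow w - A^{-1}\nabla f(w)$ reaches the minimizer $w = 0$ in a single step.

```python
import numpy as np

# Illustrative non-diagonal, symmetric positive definite A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w = np.array([5.0, -4.0])

grad = A @ w                      # gradient of f(w) = 0.5 * w^T A w
w = w - np.linalg.solve(A, grad)  # preconditioned step: w <- w - A^{-1} grad

print(w)  # [0. 0.] up to floating-point error: one step suffices
```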

In the even more general setting where $f$ is not quadratic, this corresponds to Newton's algorithm
$$w \leftarrow w - \nabla^2 f(w)^{-1} \nabla f(w).$$
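As an illustration (a hypothetical one-dimensional example, not from the text), the sketch below applies the Newton update to the non-quadratic convex function $f(w) = e^w + e^{-2w}$, using its hand-computed first and second derivatives.

```python
import numpy as np

# Illustrative non-quadratic objective: f(w) = exp(w) + exp(-2w)
f    = lambda w: np.exp(w) + np.exp(-2 * w)
grad = lambda w: np.exp(w) - 2 * np.exp(-2 * w)
hess = lambda w: np.exp(w) + 4 * np.exp(-2 * w)   # always positive, so f is convex

w = 2.0
for t in range(6):
    w = w - grad(w) / hess(w)     # Newton step: w <- w - f''(w)^{-1} f'(w)
    print(t, w, grad(w))          # gradient shrinks rapidly toward the minimizer
```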
