Therefore, we assume that $A = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ with $\lambda_1 \ge \cdots \ge \lambda_d$. The function can be simplified to
$$f(w) = \frac{1}{2} \sum_{i=1}^{d} \lambda_i w_i^2.$$
The gradient descent update can be written as
$$w \leftarrow w - \eta \nabla f(w) = w - \eta A w.$$
Here we omit the subscript $t$ for the time step and use the subscript for the coordinate. Equivalently, we can write the per-coordinate update rule
$$w_i \leftarrow w_i - \eta \lambda_i w_i = (1 - \eta \lambda_i) w_i.$$
Now we see that if $\eta > 2/\lambda_i$ for some $i$, then $|1 - \eta \lambda_i| > 1$, so the absolute value of $w_i$ will blow up exponentially and lead to unstable behavior. Thus, we need $\eta \lesssim \frac{1}{\max_i \lambda_i}$. Note that $\max_i \lambda_i$ corresponds to the smoothness parameter of $f$, because $\lambda_1$ is the largest eigenvalue of $\nabla^2 f = A$. This is consistent with the condition in Lemma 2.1.1 that $\eta$ needs to be small.
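As a quick numerical illustration (the eigenvalues and step size below are hypothetical choices, not from the text), running the per-coordinate update with $\eta > 2/\lambda_1$ shows the blow-up:

```python
import numpy as np

# Hypothetical eigenvalues and step size (not from the text), chosen so that
# eta > 2 / lambda_1: the first coordinate then oscillates with growing
# magnitude, while the better-conditioned coordinates still shrink.
lam = np.array([10.0, 1.0, 0.1])   # lambda_1 >= ... >= lambda_d
eta = 0.25                          # violates eta < 2 / lambda_1 = 0.2
w = np.ones(3)

for _ in range(20):
    w = (1.0 - eta * lam) * w       # per-coordinate GD update

print(w)   # |w_1| has grown to ~3e3, while w_2 and w_3 have decayed
```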
Suppose for simplicity we set $\eta = 1/(2\lambda_1)$. Then the convergence for the $w_1$ coordinate is very fast: the coordinate $w_1$ is halved every iteration. However, the convergence of the coordinate $w_d$ is slower, because it's only reduced by a factor of $(1 - \lambda_d/(2\lambda_1))$ every iteration. Therefore, it takes $O(\lambda_1/\lambda_d \cdot \log(1/\epsilon))$ iterations to converge to an error $\epsilon$. The analysis here can be extended to general convex functions, which also reflects the principle that:
The condition number is defined as $\kappa = \sigma_{\max}(A)/\sigma_{\min}(A) = \lambda_1/\lambda_d$. It governs the convergence rate of GD.
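A rough numerical sketch of this rate (the eigenvalues are assumed for illustration, giving $\kappa = 100$): under $\eta = 1/(2\lambda_1)$, counting the iterations each coordinate needs to fall below $\epsilon$ exposes the gap between the best- and worst-conditioned directions:

```python
import numpy as np

# Assumed eigenvalues for illustration (kappa = lambda_1 / lambda_d = 100).
# With eta = 1/(2 * lambda_1), count iterations until each coordinate,
# started at 1, drops below eps; the slowest needs ~ 2 * kappa * log(1/eps).
lam = np.array([100.0, 1.0])
eta = 1.0 / (2.0 * lam[0])
eps = 1e-3

for i, l in enumerate(lam, start=1):
    w, t = 1.0, 0
    while abs(w) > eps:
        w *= (1.0 - eta * l)        # contraction factor 1 - lambda_i/(2*lambda_1)
        t += 1
    print(f"coordinate {i}: {t} iterations")
# prints roughly 10 iterations for coordinate 1 and ~1400 for coordinate 2
```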
2.4.1 Pre-conditioners
From the toy quadratic example above, we can see that it would be better to use a different learning rate for each coordinate. In other words, if we introduce a learning rate $\eta_i = 1/\lambda_i$ for each coordinate, then we can achieve faster convergence; in fact, for the quadratic above, each coordinate converges in a single step, since $(1 - \eta_i \lambda_i) w_i = 0$. In the more general setting where $A$ is not diagonal, we don't know the coordinate system in advance, and the algorithm corresponds to
$$w \leftarrow w - A^{-1} \nabla f(w).$$
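As a sanity check (using a randomly generated positive definite $A$, purely for illustration), a single step of this preconditioned update exactly minimizes the quadratic $f(w) = \frac{1}{2} w^\top A w$, whatever its condition number:

```python
import numpy as np

# Illustrative only: one preconditioned step w <- w - A^{-1} grad f(w)
# lands exactly at the minimizer of f(w) = (1/2) w^T A w.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)             # non-diagonal, positive definite
w = rng.standard_normal(3)

grad = A @ w                        # gradient of the quadratic
w = w - np.linalg.solve(A, grad)    # preconditioned step (solve, don't invert)
print(np.linalg.norm(w))            # ~0 up to floating-point error
```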
In the even more general setting where $f$ is not quadratic, this corresponds to Newton's algorithm
$$w \leftarrow w - \nabla^2 f(w)^{-1} \nabla f(w).$$
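A minimal one-dimensional sketch of this update, assuming the toy objective $f(w) = \log\cosh(w)$ (our choice for illustration, not from the text):

```python
import numpy as np

# Toy non-quadratic objective f(w) = log(cosh(w)), so f'(w) = tanh(w) and
# f''(w) = 1/cosh(w)^2. Newton's update divides the gradient by the local
# curvature instead of using a fixed learning rate.
w = 0.8
for t in range(5):
    grad = np.tanh(w)
    hess = 1.0 / np.cosh(w) ** 2
    w -= grad / hess                # w <- w - f''(w)^{-1} f'(w)
    print(t, w)
# converges to the minimizer w = 0 very rapidly from this start; note that
# Newton's method can diverge on this f from far-away initializations
```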