Suppose we drop the higher-order term and only optimize the first-order approximation within a neighborhood of $w_t$:
$$\operatorname*{arg\,min}_{w \in \mathbb{R}^d} \;\; f(w_t) + \langle \nabla f(w_t), w - w_t \rangle \qquad \text{s.t. } \|w - w_t\|_2 \le \epsilon.$$
Then, the minimizer of the program above is equal to $w_t + \delta$ where
$$\delta = -\alpha \nabla f(w_t)$$
for some positive scalar $\alpha$. ≪Tengyu notes: this fact can be an exercise≫ In
other words, to locally minimize the first-order approximation of $f(\cdot)$
around $w_t$, we should move in the direction $-\nabla f(w_t)$.
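As a minimal sketch of the exercise suggested in the note above (using only the Cauchy–Schwarz inequality, in the notation of the program above): for any feasible $w$,
$$\langle \nabla f(w_t), w - w_t \rangle \;\ge\; -\|\nabla f(w_t)\|_2 \, \|w - w_t\|_2 \;\ge\; -\epsilon \, \|\nabla f(w_t)\|_2,$$
with equality exactly when $w - w_t = -\epsilon \, \nabla f(w_t)/\|\nabla f(w_t)\|_2$ (assuming $\nabla f(w_t) \neq 0$). Hence the program is minimized at $w_t + \delta$ with $\delta = -\alpha \nabla f(w_t)$ for $\alpha = \epsilon / \|\nabla f(w_t)\|_2 > 0$.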
2.1.1 Formalizing the Taylor Expansion
We will state a lemma that characterizes the descent of the function
value under GD. We assume that the eigenvalues
of $\nabla^2 f(w)$ are bounded between $-L$ and $L$ for all $w$. We call functions
satisfying this assumption L-smooth. ≪Tengyu notes: missing definition of $\nabla^2 f$, but
perhaps it belongs somewhere else.≫ This allows us to approximate the
function accurately by its Taylor expansion in the following sense:
$$f(w) \le f(w_t) + \langle \nabla f(w_t), w - w_t \rangle + \frac{L}{2} \|w - w_t\|_2^2 \qquad (2.1)$$
≪Tengyu notes: another exercise≫
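As a minimal sketch of this exercise, assuming $f$ is twice continuously differentiable: by the second-order mean value form of Taylor's theorem, there is a point $\xi$ on the segment between $w_t$ and $w$ such that
$$f(w) = f(w_t) + \langle \nabla f(w_t), w - w_t \rangle + \frac{1}{2} (w - w_t)^\top \nabla^2 f(\xi) (w - w_t).$$
Since every eigenvalue of $\nabla^2 f(\xi)$ is at most $L$, the quadratic term is at most $\frac{L}{2} \|w - w_t\|_2^2$, which gives equation (2.1).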
2.1.2 Descent lemma for gradient descent
The following lemma says that with gradient descent and a small enough
learning rate, the function value always decreases unless the gradient
at the iterate is zero.
Lemma 2.1.1 (Descent Lemma). Suppose $f$ is L-smooth. Then, if $\eta < 1/(2L)$, we have
$$f(w_{t+1}) \le f(w_t) - \frac{\eta}{2} \cdot \|\nabla f(w_t)\|_2^2.$$
The proof uses the Taylor expansion. The main idea is that even
using the upper bound provided by equation (2.1) suffices.
Proof. We have that
$$\begin{aligned}
f(w_{t+1}) &= f(w_t - \eta \nabla f(w_t)) \\
&\le f(w_t) + \langle \nabla f(w_t), -\eta \nabla f(w_t) \rangle + \frac{L}{2} \|\eta \nabla f(w_t)\|_2^2 \\
&= f(w_t) - (\eta - \eta^2 L/2) \|\nabla f(w_t)\|_2^2 \\
&\le f(w_t) - \frac{\eta}{2} \cdot \|\nabla f(w_t)\|_2^2,
\end{aligned}$$
where the first inequality uses equation (2.1) and the last step uses that $\eta < 1/(2L)$ implies $\eta - \eta^2 L/2 \ge \eta/2$.
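To make the lemma concrete, the following is a minimal numerical sanity check (an illustrative sketch, not part of the text). It instantiates a quadratic $f(w) = \frac{1}{2} w^\top A w$, whose Hessian is the constant matrix $A$, so $f$ is L-smooth with $L$ equal to the largest eigenvalue of $A$ in absolute value, and verifies the bound of Lemma 2.1.1 along a gradient descent trajectory.

import numpy as np

# Quadratic test function f(w) = 0.5 * w^T A w with constant Hessian A,
# so f is L-smooth with L = max |eigenvalue of A|.  (Illustrative only.)
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                                  # symmetric PSD Hessian
L = np.max(np.abs(np.linalg.eigvalsh(A)))

def f(w):
    return 0.5 * w @ A @ w

def grad_f(w):
    return A @ w

eta = 1.0 / (4 * L)                          # any eta < 1/(2L) satisfies Lemma 2.1.1
w = rng.standard_normal(5)

for t in range(20):
    g = grad_f(w)
    w_next = w - eta * g
    # Descent lemma bound: f(w_{t+1}) <= f(w_t) - (eta/2) * ||grad f(w_t)||^2
    assert f(w_next) <= f(w) - 0.5 * eta * np.dot(g, g) + 1e-12
    w = w_next

print("descent lemma bound held at every iterate; final f(w) =", f(w))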