coordinate descent with exact line search (AdaBoost) can result in infinite step-sizes, causing the iterates to converge in a direction that is not a max-$\ell_1$-margin direction [?], hence the bounded step-sizes rule in Theorem 6.3.2.
Theorem 6.3.2 generalizes the result of [?] to steepest descent with respect to other norms, and our proof follows the same strategy as [?]. We first prove a generalization of the duality result of [?]: if there is a unit-norm linear separator that achieves margin $\gamma$, then $\|\nabla L(w)\|_\star \ge \gamma L(w)$ for all $w$. Using this lower bound on the dual norm of the gradient, we show that the loss decreases faster than the norm of the iterates grows, which establishes convergence in a margin-maximizing direction.
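To see why this suffices, note (a sketch, assuming the exponential loss $L(w) = \sum_i e^{-y_i w^\top x_i}$ of (6.6)) that $\max_i e^{-y_i w^\top x_i} \le L(w)$, so taking logarithms yields
\[ \frac{\min_i y_i w_t^\top x_i}{\|w_t\|} \;\ge\; \frac{-\log L(w_t)}{\|w_t\|}. \]
Thus, if $-\log L(w_t) \ge \gamma \|w_t\| - o(\|w_t\|)$, the normalized margin of $w_t$ converges to the maximum margin $\gamma$.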
In the rest of this section, we discuss the proof of Theorem 6.3.2.
The proof is divided into three steps:
1. Gradient domination condition: for any norm and all $w$, $\|\nabla L(w)\|_\star \ge \gamma L(w)$.
2. Optimization properties of steepest descent, such as the decrease of the loss function and the convergence of the dual norm of the gradient to zero.
3. Establishing sufficiently fast convergence of $L(w_t)$ relative to the growth of $\|w_t\|$, which proves the theorem.
Proposition 6.3.3 (Gradient domination condition, Lemma 10 of [?]). Let $\gamma = \max_{\|w\| \le 1} \min_i y_i x_i^\top w$. For all $w$,
\[ \|\nabla L(w)\|_\star \ge \gamma L(w). \]
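This follows from a short duality argument (a sketch, again assuming the exponential loss $L(w) = \sum_i e^{-y_i w^\top x_i}$). Let $\bar{w}$ with $\|\bar{w}\| \le 1$ achieve margin $\gamma$. Since $\|v\|_\star = \max_{\|u\| \le 1} \langle u, v \rangle$,
\[ \|\nabla L(w)\|_\star \;\ge\; \langle \bar{w}, -\nabla L(w) \rangle \;=\; \sum_i e^{-y_i w^\top x_i} \, y_i x_i^\top \bar{w} \;\ge\; \gamma \sum_i e^{-y_i w^\top x_i} \;=\; \gamma L(w). \]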
Next, we establish some optimization properties of the steepest descent algorithm, including convergence of the gradient norms and of the loss value.
Proposition 6.3.4 (Lemmas 11 and 12 of [?]). Consider the steepest descent iterates $w_t$ on (6.6) with step-size $\eta \le \frac{1}{B^2 L(w_0)}$, where $B = \max_i \|x_i\|_\star$. The following hold:
1. $L(w_{t+1}) \le L(w_t)$.
2. $\sum_{t=0}^{\infty} \|\nabla L(w_t)\|_\star^2 < \infty$, and hence $\|\nabla L(w_t)\|_\star \to 0$.
3. $L(w_t) \to 0$, and hence $y_i w_t^\top x_i \to \infty$ for all $i$.
4. $\sum_{t=0}^{\infty} \|\nabla L(w_t)\|_\star = \infty$.
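These properties are easy to check numerically. Below is a minimal sketch (illustrative, not the authors' code) of steepest descent with respect to the $\ell_1$ norm, i.e. coordinate descent with the bounded step-size of Proposition 6.3.4, on the exponential loss; the random data, iteration budget, and printing schedule are assumptions for the demo. It tracks the decreasing loss, the vanishing dual ($\ell_\infty$) norm of the gradient, and the growing normalized $\ell_1$ margin.

import numpy as np

# Steepest descent w.r.t. the l1 norm (coordinate descent) on the
# exponential loss L(w) = sum_i exp(-y_i x_i^T w). Illustrative sketch:
# the data and iteration budget are arbitrary assumptions.
rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))  # linearly separable labels by construction

def loss(w):
    return np.exp(-y * (X @ w)).sum()

def grad(w):
    return -(np.exp(-y * (X @ w)) * y) @ X

B = np.abs(X).max()                  # B = max_i ||x_i||_star; dual of l1 is l_inf
w = np.zeros(d)
eta = 1.0 / (B ** 2 * loss(w))       # bounded step-size: eta <= 1 / (B^2 L(w_0))

for t in range(5001):
    g = grad(w)
    j = np.argmax(np.abs(g))         # steepest l1 direction moves one coordinate
    w[j] -= eta * np.sign(g[j]) * np.abs(g).max()  # ||step||_1 = ||g||_inf
    if t % 1000 == 0:
        margin = (y * (X @ w)).min() / np.abs(w).sum()  # normalized l1 margin
        print(f"t={t:4d}  L={loss(w):.3e}  "
              f"||grad||_*={np.abs(g).max():.3e}  l1-margin={margin:.4f}")

Running the sketch, the loss and the dual norm of the gradient decay monotonically while the normalized $\ell_1$ margin climbs toward its maximum, matching items 1-3 above.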