
6.3.2 Steepest Descent

Recall that gradient descent is a special case of steepest descent (SD) w.r.t. a generic norm ‖·‖ with updates given by eq. (6.5). The optimality condition of ∆w_t in eq. (6.5) requires

⟨∆w_t, −∇L(w_t)⟩ = ‖∆w_t‖^2 = ‖∇L(w_t)‖_⋆^2,    (6.9)

where ‖x‖_⋆ = sup_{‖y‖≤1} x^⊤ y is the dual norm of ‖·‖. Examples of steepest descent include gradient descent, which is steepest descent w.r.t. the ℓ_2 norm, and greedy coordinate descent (Gauss–Southwell selection rule), which is steepest descent w.r.t. the ℓ_1 norm. In general, the update ∆w_t in eq. (6.5) is not uniquely defined, and there could be multiple directions ∆w_t that minimize eq. (6.5). In such cases, any minimizer of eq. (6.5) is a valid steepest descent update and satisfies eq. (6.9).
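
To make the condition concrete, here is a minimal NumPy sketch of the two examples just mentioned (the function names and the test vector g are ours, purely illustrative): the ℓ_2 steepest descent direction is the negative gradient, while the ℓ_1 direction moves only along the coordinate with the largest partial derivative; both satisfy eq. (6.9), with dual norms ℓ_2 and ℓ_∞ respectively.

```python
import numpy as np

def sd_update_l2(grad):
    # Steepest descent w.r.t. the l2 norm: Delta_w = -grad, so
    # <Delta_w, -grad> = ||Delta_w||_2^2 = ||grad||_2^2 (the dual of l2 is l2).
    return -grad

def sd_update_l1(grad):
    # Steepest descent w.r.t. the l1 norm (Gauss-Southwell / greedy coordinate
    # descent): move only along the coordinate with the largest |partial
    # derivative|, scaled so that ||Delta_w||_1 = ||grad||_inf (the dual norm of l1).
    j = np.argmax(np.abs(grad))
    delta = np.zeros_like(grad)
    delta[j] = -np.sign(grad[j]) * np.abs(grad).max()
    return delta

g = np.array([0.3, -1.2, 0.7])        # a stand-in gradient vector
for name, delta, norm_ord, dual_ord in [("l2", sd_update_l2(g), 2, 2),
                                        ("l1", sd_update_l1(g), 1, np.inf)]:
    # The three quantities in eq. (6.9) should coincide for each norm.
    print(name,
          np.dot(delta, -g),
          np.linalg.norm(delta, norm_ord) ** 2,
          np.linalg.norm(g, dual_ord) ** 2)
```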

In the preliminary result in Theorem 6.3.1, we proved that the limit direction of gradient flow on the exponential loss is the ℓ_2 max-margin solution. In the following theorem, we show the natural extension of this result to all steepest descent algorithms.

Theorem 6.3.2. For any separable dataset {x_i, y_i}_{i=1}^n and any norm ‖·‖, consider the steepest descent updates from eq. (6.9) for minimizing L(w) in eq. (6.6) with the exponential loss ℓ(u, y) = exp(−uy). For all initializations w_0, and all bounded step-sizes satisfying η_t ≤ min{η_+, 1/(B^2 L(w_t))}, where B := max_i ‖x_i‖_⋆ and η_+ < ∞ is any finite number, the iterates w_t satisfy the following,

lim_{t→∞} min_i y_i⟨w_t, x_i⟩ / ‖w_t‖ = max_{w : ‖w‖≤1} min_i y_i⟨w, x_i⟩ =: γ.

In particular, if there is a unique maximum-‖·‖ margin solution w^∗ = arg max_{‖w‖≤1} min_i y_i⟨w, x_i⟩, then the limit direction satisfies lim_{t→∞} w_t/‖w_t‖ = w^∗.
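
The statement can be checked numerically. The sketch below runs ℓ_2 steepest descent (i.e. gradient descent) with the step-size rule of the theorem on a small synthetic dataset; the dataset, the cap η_+, and the use of an averaged exponential loss for L are our assumptions, since eq. (6.6) is not reproduced in this excerpt. The printed normalized ℓ_2 margin approaches its maximum value γ as the number of iterations grows.

```python
import numpy as np

# Toy linearly separable dataset (our choice, purely illustrative).
X = np.array([[ 2.0,  1.0],
              [ 0.5,  2.0],
              [-1.0, -1.5],
              [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def exp_loss_and_grad(w):
    # L(w) = (1/n) * sum_i exp(-y_i <w, x_i>)  (averaged exponential loss; an
    # assumption standing in for eq. (6.6)).
    losses = np.exp(-y * (X @ w))
    L = losses.mean()
    grad = -(X * (y * losses)[:, None]).mean(axis=0)
    return L, grad

B = np.linalg.norm(X, axis=1).max()   # B = max_i ||x_i||_*; for l2 the dual norm is l2
eta_plus = 1.0                        # any finite upper bound on the step size
w = np.zeros(2)                       # the theorem allows an arbitrary initialization

for t in range(50000):
    L, grad = exp_loss_and_grad(w)
    eta = min(eta_plus, 1.0 / (B ** 2 * L))   # step-size rule of Theorem 6.3.2
    w = w - eta * grad                        # l2 steepest descent = gradient descent

print("normalized l2 margin:", (y * (X @ w)).min() / np.linalg.norm(w))
```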

A special case of Theorem 6.3.2 is steepest descent w.r.t. the ℓ_1 norm, which, as we already saw, corresponds to greedy coordinate descent. More specifically, coordinate descent on the exponential loss with exact line search is equivalent to AdaBoost [?], where each coordinate represents the output of one “weak learner”. Indeed, the initially mysterious generalization properties of boosting have been understood in terms of implicit ℓ_1 regularization [?], and later on AdaBoost with small enough step-size was shown to converge in direction precisely to the maximum ℓ_1 margin solution [? ? ?], just as guaranteed by Theorem 6.3.2. In fact, [?] generalized the result to a richer variety of exponentially tailed loss functions, including the logistic loss, and a broad class of non-constant step-size rules.
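
The sketch below mirrors that equivalence on synthetic weak-learner outputs (the matrix H, the labels, and the number of rounds are our illustrative assumptions; this is not a reference AdaBoost implementation): greedy coordinate descent with exact line search on the exponential loss reduces to AdaBoost's reweighting of the examples and its closed-form step α = ½ log((1 − ε)/ε), and the normalized ℓ_1 margin of the resulting w grows toward the maximum ℓ_1 margin.

```python
import numpy as np

# H[i, j] in {-1, +1}: output of weak learner j on example i (illustrative data).
rng = np.random.default_rng(0)
H = rng.choice([-1.0, 1.0], size=(40, 8))
y = np.sign(H[:, :3].sum(axis=1))      # labels realizable by the first three learners,
                                       # so the data are separable in the span of H

w = np.zeros(H.shape[1])               # coordinate j = weight of weak learner j
for t in range(300):
    # Exponential-loss weights over examples: AdaBoost's distribution over the data.
    D = np.exp(-y * (H @ w))
    D /= D.sum()
    # Greedy (Gauss-Southwell) coordinate choice = weak learner with largest |edge|;
    # the edge of learner j is proportional to the j-th partial derivative of the loss.
    edges = (D * y) @ H
    j = np.argmax(np.abs(edges))
    # Exact line search along coordinate j gives AdaBoost's closed-form step
    # alpha = (1/2) log((1 - eps)/eps), with eps the weighted error of learner j.
    eps = np.clip(0.5 * (1.0 - edges[j]), 1e-12, 1 - 1e-12)
    w[j] += 0.5 * np.log((1.0 - eps) / eps)

print("normalized l1 margin:", (y * (H @ w)).min() / np.abs(w).sum())
```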
