
Steepest descent w.r.t. a norm $\|\cdot\|$ proceeds with updates given by

$$w_{t+1} = w_t + \eta_t \Delta w_t, \quad \text{where} \quad \Delta w_t = \arg\min_{v}\, \langle \nabla L(w_t), v \rangle + \frac{1}{2}\|v\|^2. \tag{6.5}$$

Examples of steepest descent include gradient descent, which is steepest descent w.r.t. the $\ell_2$ norm, and coordinate descent, which is steepest descent w.r.t. the $\ell_1$ norm. In general, the update $\Delta w_t$ in eq. (6.5) is not uniquely defined: there can be multiple directions $\Delta w_t$ that minimize eq. (6.5). In such cases, any minimizer of eq. (6.5) is a valid steepest descent update.
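To make the two special cases concrete, here is a minimal numpy sketch of the closed-form minimizers of eq. (6.5) for the $\ell_2$ and $\ell_1$ norms. The function name, the tie-breaking rule in the $\ell_1$ case, and the squared loss in the usage loop are illustrative assumptions, not taken from the text.

```python
import numpy as np

def steepest_descent_step(grad, norm="l2"):
    # Closed-form minimizer of  <grad, v> + (1/2)||v||^2  from eq. (6.5).
    if norm == "l2":
        # l2 norm: the unique minimizer is the negative gradient.
        return -grad
    if norm == "l1":
        # l1 norm: any coordinate of maximal |grad_j| yields a minimizer;
        # ties make the update non-unique, so this returns just one element
        # of the minimizing set (the first maximizing coordinate).
        j = int(np.argmax(np.abs(grad)))
        v = np.zeros_like(grad)
        v[j] = -grad[j]
        return v
    raise ValueError(f"unsupported norm: {norm}")

# Usage on an assumed squared loss L(w) = 0.5 ||Xw - y||^2 (illustrative).
X = np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 0.0]])
y = np.array([1.0, 10.0])
w, eta = np.zeros(3), 0.01
for _ in range(5000):
    grad = X.T @ (X @ w - y)
    w = w + eta * steepest_descent_step(grad, norm="l1")
```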

Generalizing gradient descent and mirror descent, we might expect the steepest descent iterates to converge to the solution closest to initialization in the corresponding norm, $\arg\min_{w \in \mathcal{G}} \|w - w_0\|$. This is indeed the case for quadratic norms $\|v\|_D = \sqrt{v^\top D v}$, when eq. (6.5) is equivalent to mirror descent with $\psi(w) = \frac{1}{2}\|w\|_D^2$. Unfortunately, this does not hold for general norms, as shown by the following results.
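The quadratic-norm equivalence is a short check, following directly from eq. (6.5). For $D \succ 0$ the objective is smooth and strictly convex, so setting its gradient to zero gives

$$\Delta w_t = \arg\min_{v}\, \langle \nabla L(w_t), v \rangle + \tfrac{1}{2} v^\top D v = -D^{-1} \nabla L(w_t),$$

and hence $w_{t+1} = w_t - \eta_t D^{-1} \nabla L(w_t)$. This is exactly the mirror descent update with potential $\psi(w) = \tfrac{1}{2} w^\top D w$: since $\nabla \psi(w) = Dw$, the iterate can be rewritten as $\nabla \psi(w_{t+1}) = \nabla \psi(w_t) - \eta_t \nabla L(w_t)$.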

Example 1. In the case of coordinate descent, which is a special case of steepest descent w.r.t. the $\ell_1$ norm, [?] studied this phenomenon in the context of gradient boosting, observing that sometimes, but not always, the optimization path of coordinate descent, given by

$$\Delta w_{t+1} \in \mathrm{conv}\left\{ -\eta_t \frac{\partial L(w_t)}{\partial w[j_t]}\, e_{j_t} \;:\; j_t = \arg\max_j \left| \frac{\partial L(w_t)}{\partial w[j]} \right| \right\},$$

coincides with the $\ell_1$ regularization path given by $\hat{w}(\lambda) = \arg\min_w L(w) + \lambda \|w\|_1$. The specific coordinate descent path where updates average all the optimal coordinates and the step-sizes are infinitesimal is equivalent to forward stage-wise selection, a.k.a. $\epsilon$-boosting [?].

When the $\ell_1$ regularization path $\hat{w}(\lambda)$ is monotone in each of the coordinates, it is identical to this stage-wise selection path, i.e., to a coordinate descent optimization path (and also to the related LARS path) [?]. In this case, in the limit of $\lambda \to 0$ and $t \to \infty$, the optimization and regularization paths both converge to the minimum $\ell_1$ norm solution. However, when the regularization path $\hat{w}(\lambda)$ is not monotone, which can and does happen, the optimization and regularization paths diverge, and forward stage-wise selection can converge to solutions with sub-optimal $\ell_1$ norm.
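To illustrate the two paths in Example 1, the following numpy sketch traces forward stage-wise selection ($\epsilon$-boosting) alongside a proximal gradient (ISTA) solver for the $\ell_1$-regularized objective. The squared loss, function names, and parameter values are assumptions made for the sketch, not from the text.

```python
import numpy as np

def grad(w, X, y):
    # Gradient of the assumed squared loss L(w) = 0.5 ||Xw - y||^2.
    return X.T @ (X @ w - y)

def eps_boosting(X, y, eps=1e-3, steps=20000):
    # Forward stage-wise selection: nudge the single coordinate with the
    # largest |gradient| by an infinitesimal amount eps at each step.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = grad(w, X, y)
        j = int(np.argmax(np.abs(g)))
        w[j] -= eps * np.sign(g[j])
    return w

def lasso_ista(X, y, lam, steps=20000):
    # ISTA for the regularization path point
    # w_hat(lam) = argmin_w 0.5 ||Xw - y||^2 + lam ||w||_1.
    w = np.zeros(X.shape[1])
    s = 1.0 / np.linalg.norm(X, 2) ** 2  # step = 1/L, L = Lipschitz constant
    for _ in range(steps):
        z = w - s * grad(w, X, y)
        w = np.sign(z) * np.maximum(np.abs(z) - s * lam, 0.0)  # soft-threshold
    return w

# The two paths can then be compared by tracing eps_boosting iterates
# against lasso_ista(X, y, lam) as lam decreases toward 0.
```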

Example 2. The following example shows that even for $\ell_p$ norms, for which $\|\cdot\|_p^2$ is smooth and strongly convex, the global minimum returned by steepest descent depends on the step-size. Consider minimizing $L(w)$ with the dataset $\{(x^{(1)} = [1,1,1],\, y^{(1)} = 1),\ (x^{(2)} = [1,2,0],\, y^{(2)} = 10)\}$ using steepest descent updates w.r.t. the $\ell_{4/3}$ norm. The empirical results for this problem in Figure 6.1 clearly show that steepest descent converges to a global minimum that depends on the step-size, and even in the continuous step-size limit of $\eta \to 0$, $w_t$ does not converge to the expected solution $\arg\min_{w \in \mathcal{G}} \|w - w_0\|_{4/3}$.
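A small numpy experiment in the spirit of Figure 6.1 can be sketched as follows, assuming a squared loss and illustrative step-sizes (neither is specified in the text). It uses the closed-form $\ell_p$ steepest descent direction for eq. (6.5), $v_j = -\mathrm{sign}(g_j)\,|g_j|^{q-1}\,\|g\|_q^{2-q}$ with $1/p + 1/q = 1$, which reduces to $v = -g$ when $p = q = 2$.

```python
import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
y = np.array([1.0, 10.0])

def lp_steepest_step(g, p=4/3):
    # Closed-form minimizer of <g, v> + (1/2)||v||_p^2:
    #   v_j = -sign(g_j) |g_j|^(q-1) ||g||_q^(2-q),  1/p + 1/q = 1.
    q = p / (p - 1.0)                          # dual exponent; q = 4 for p = 4/3
    norm_q = np.sum(np.abs(g) ** q) ** (1.0 / q)
    if norm_q == 0.0:                          # already at a stationary point
        return np.zeros_like(g)
    return -np.sign(g) * np.abs(g) ** (q - 1) * norm_q ** (2 - q)

def run(eta, steps=200000):
    # Steepest descent w.r.t. the l_{4/3} norm on the assumed squared loss
    # L(w) = 0.5 ||Xw - y||^2; the problem is underdetermined, so the set of
    # global minima is a line and the limit point is what we inspect.
    w = np.zeros(3)
    for _ in range(steps):
        g = X.T @ (X @ w - y)
        w = w + eta * lp_steepest_step(g)
    return w

print(run(eta=0.01))   # different (illustrative) step-sizes reach
print(run(eta=0.1))    # different global minima of the same loss
```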
