updates given by,
\[
w_{t+1} = w_t + \eta_t \Delta w_t, \quad \text{where } \Delta w_t = \operatorname*{arg\,min}_{v}\; \langle \nabla L(w_t), v \rangle + \frac{1}{2}\|v\|^2. \tag{6.5}
\]
Examples of steepest descent include gradient descent, which is steepest descent w.r.t. the $\ell_2$ norm, and coordinate descent, which is steepest descent w.r.t. the $\ell_1$ norm. In general, the update $\Delta w_t$ in eq. (6.5) is not uniquely defined: there can be multiple directions $\Delta w_t$ that minimize eq. (6.5). In such cases, any minimizer of eq. (6.5) is a valid steepest descent update.
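For $\ell_p$ norms with $1 < p < \infty$, the minimizer of eq. (6.5) has a closed form via the dual exponent $q = p/(p-1)$: writing $v = t\,u$ with $\|u\|_p = 1$, the inner product is minimized by the $u$ attaining Hölder's inequality, and optimizing over $t$ gives $t = \|\nabla L(w_t)\|_q$. A minimal NumPy sketch of the resulting update (the function names are ours, for illustration; a constant step-size stands in for $\eta_t$):

    import numpy as np

    def lp_steepest_direction(g, p):
        # Minimizer of <g, v> + 0.5 * ||v||_p^2 for 1 < p < inf:
        #   v_j = -||g||_q^(2-q) * sign(g_j) * |g_j|^(q-1),  with q = p/(p-1).
        q = p / (p - 1.0)
        dual = np.sum(np.abs(g) ** q) ** (1.0 / q)   # dual norm ||g||_q
        if dual == 0.0:
            return np.zeros_like(g)                  # already stationary
        return -(dual ** (2.0 - q)) * np.sign(g) * np.abs(g) ** (q - 1.0)

    def steepest_descent(L_grad, w0, p, eta=1e-2, steps=10000):
        # Eq. (6.5) iteration w_{t+1} = w_t + eta * dw_t, constant step-size.
        w = np.array(w0, dtype=float)
        for _ in range(steps):
            w = w + eta * lp_steepest_direction(L_grad(w), p)
        return w

With $p = 2$ this reduces to $\Delta w_t = -\nabla L(w_t)$, recovering gradient descent; $p = 4/3$ gives the updates studied in Example 2 below.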
Generalizing gradient descent and mirror descent, we might expect the steepest descent iterates to converge to the solution closest to initialization in the corresponding norm, $\operatorname*{arg\,min}_{w \in \mathcal{G}} \|w - w_0\|$. This is indeed the case for quadratic norms $\|v\|_D = \sqrt{v^\top D v}$, for which eq. (6.5) is equivalent to mirror descent with $\psi(w) = \frac{1}{2}\|w\|_D^2$.
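To see this equivalence concretely (a one-step derivation we add for completeness): with $\|v\|_D$, the first-order condition of eq. (6.5) is $D v = -\nabla L(w_t)$, so the steepest descent iterate is
\[
w_{t+1} = w_t - \eta_t D^{-1} \nabla L(w_t),
\]
while mirror descent with $\psi(w) = \frac{1}{2}\|w\|_D^2$ updates $\nabla \psi(w_{t+1}) = \nabla \psi(w_t) - \eta_t \nabla L(w_t)$, i.e., $D w_{t+1} = D w_t - \eta_t \nabla L(w_t)$, which is the same iterate.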
Unfortunately, this does not hold for general norms, as the following examples show.
Example 1. In the case of coordinate descent, which is a special case of steepest descent w.r.t. the $\ell_1$ norm, [?] studied this phenomenon in the context of gradient boosting, observing that the optimization path of coordinate descent, given by
\[
\Delta w_{t+1} \in \operatorname{conv}\left\{ -\eta_t \frac{\partial L(w_t)}{\partial w[j_t]}\, e_{j_t} \;:\; j_t = \operatorname*{arg\,max}_j \left| \frac{\partial L(w_t)}{\partial w[j]} \right| \right\},
\]
sometimes, but not always, coincides with the $\ell_1$ regularization path given by $\hat{w}(\lambda) = \operatorname*{arg\,min}_w L(w) + \lambda \|w\|_1$. The specific coordinate descent path where updates average all the optimal coordinates and the step-sizes are infinitesimal is equivalent to forward stage-wise selection, a.k.a. $\varepsilon$-boosting [?]. When the $\ell_1$ regularization path $\hat{w}(\lambda)$ is monotone in each of the coordinates, it is identical to this stage-wise selection path, i.e., to a coordinate descent optimization path (and also to the related LARS path) [?]. In this case, in the limit $\lambda \to 0$ and $t \to \infty$, the optimization and regularization paths both converge to the minimum $\ell_1$ norm solution. However, when the regularization path $\hat{w}(\lambda)$ is not monotone, which can and does happen, the optimization and regularization paths diverge, and forward stage-wise selection can converge to solutions with a sub-optimal $\ell_1$ norm.
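The stage-wise path above admits a short sketch, assuming a generic gradient oracle L_grad (our name, for illustration); for simplicity it steps a single maximal coordinate, breaking ties arbitrarily, rather than averaging all optimal coordinates as in the exact equivalence:

    import numpy as np

    def epsilon_boosting(L_grad, dim, eps=1e-3, steps=100000):
        # Forward stage-wise selection (eps-boosting): repeatedly take a tiny
        # step on the coordinate with the largest absolute gradient, against
        # the gradient's sign -- a special case of steepest descent w.r.t. l1.
        w = np.zeros(dim)
        for _ in range(steps):
            g = L_grad(w)
            j = np.argmax(np.abs(g))       # steepest coordinate
            if g[j] == 0.0:                # stationary point reached
                break
            w[j] -= eps * np.sign(g[j])    # "infinitesimal" step of size eps
        return w

Running this with $\varepsilon \to 0$ traces the stage-wise selection path; as discussed above, it matches the regularization path $\hat{w}(\lambda)$ only when the latter is monotone in each coordinate.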
Example 2. The following example shows that even for $\ell_p$ norms, where $\|\cdot\|_p^2$ is smooth and strongly convex, the global minimum returned by steepest descent depends on the step-size. Consider minimizing $L(w)$ with the dataset $\{(x^{(1)} = [1,1,1],\, y^{(1)} = 1),\ (x^{(2)} = [1,2,0],\, y^{(2)} = 10)\}$ using steepest descent updates w.r.t. the $\ell_{4/3}$ norm. The empirical results for this problem in Figure 6.1 clearly show that steepest descent converges to a global minimum that depends on the step-size, and even in the continuous step-size limit of $\eta \to 0$, $w_t$ does not converge to the expected solution of