descent converges in the direction of the hard margin support vector
machine solution (Theorem 6.3.2), even though the norm or margin
is not explicitly specified in the optimization problem. In fact, such
analyses showing an implicit inductive bias from the optimization algorithm
leading to generalization are not new. In the context of boosting algorithms,
[?] and [?] established connections of the gradient boosting
algorithm (coordinate descent) to $\ell_1$-norm minimization and $\ell_1$-margin
maximization, respectively. Such minimum norm or maximum margin solutions
are of course very special among all solutions or separators that fit the
training data, and in particular can ensure generalization [? ?].
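As a concrete, purely illustrative preview of this phenomenon, the following is a minimal numerical sketch (not from the text): plain gradient descent on an unregularized logistic loss over synthetic linearly separable data. The data, step size, and iteration counts are assumptions for this demo, and the logistic loss is assumed here as the standard setting for such results. Empirically, the normalized iterate $w_t/\|w_t\|$ stabilizes and the normalized margin keeps improving, consistent with convergence toward the hard margin direction.

```python
import numpy as np

# Sketch: gradient descent on the unregularized logistic loss over separable
# data. We only check that the normalized iterate stabilizes and that the
# normalized margin improves with training time.
rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(n, d))
w_star = np.array([1.0, -1.0])            # assumed ground-truth separator
y = np.sign(X @ w_star)                   # separable labels in {-1, +1}

w = np.zeros(d)
lr = 0.1
for t in range(1, 100001):
    margins = y * (X @ w)
    # gradient of (1/n) * sum_i log(1 + exp(-margins_i))
    grad = -(X.T @ (y / (1.0 + np.exp(np.clip(margins, None, 50.0))))) / n
    w -= lr * grad
    if t in (100, 10000, 100000):
        w_norm = np.linalg.norm(w)
        print(t, w / w_norm, (y * (X @ w)).min() / w_norm)
```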
In this chapter, we largely present results on the algorithmic regularization
of vanilla gradient descent when minimizing the unregularized
training loss in regression and classification problems over various
simple and complex model classes. We also briefly discuss more general
algorithmic families like steepest descent and mirror descent.
6.1 Linear models in regression: squared loss
We first demonstrate algorithmic regularization in a simple linear
regression setting where the prediction function is a linear function
of the inputs, $f_w(x) = w^\top x$, and we have the following
empirical risk minimization objective:
$$L(w) = \sum_{i=1}^{n} \left( w^\top x^{(i)} - y^{(i)} \right)^2. \tag{6.1}$$
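For reference, here is a minimal sketch of the objective in eq. (6.1) and its gradient in code; the data matrix X (with rows $x^{(i)}$) and the label vector y are assumptions for illustration only.

```python
import numpy as np

def loss(w, X, y):
    """Squared loss of eq. (6.1); X has rows x^(i), y holds the labels y^(i)."""
    r = X @ w - y                      # residuals w^T x^(i) - y^(i)
    return np.sum(r ** 2)

def grad(w, X, y):
    """Gradient of eq. (6.1): 2 * sum_i (w^T x^(i) - y^(i)) x^(i)."""
    return 2.0 * X.T @ (X @ w - y)

# One plain gradient descent step with step size eta:
#   w_next = w - eta * grad(w, X, y)
```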
Such simple models are natural starting points for building analytical
tools that extend to complex models, and the results provide intuitions
for understanding and improving upon empirical practices
in neural networks. Although the results in this section are stated
for the squared loss, the results and proof techniques extend to any
smooth loss with a unique finite root, i.e., any loss $\ell(\hat{y}, y)$ between a prediction $\hat{y}$
and a label $y$ that is minimized at a unique and finite value of $\hat{y}$ [?].
We are particularly interested in the case where $n < d$ and the observations
are realizable, i.e., $\min_w L(w) = 0$. Under these conditions,
the optimization problem in eq. (6.1) is underdetermined and has
multiple global minima, denoted by $G = \{w : \forall i,\; w^\top x^{(i)} = y^{(i)}\}$. In
this and all the following problems we consider, the goal is to answer:
Which specific global minima do different optimization algorithms reach
when minimizing $L(w)$?
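The following sketch previews the answer for gradient descent on eq. (6.1) initialized at zero in the underdetermined case $n < d$. The synthetic Gaussian data, step size, and iteration budget are assumptions for the demo; empirically, and consistent with classical facts about gradient descent on least squares, the iterates converge to the interpolating solution of minimum $\ell_2$ norm, i.e., the pseudoinverse solution.

```python
import numpy as np

# Sketch: underdetermined least squares (n < d), so eq. (6.1) has many global
# minima. Gradient descent from w_0 = 0 reaches one of them; numerically it
# matches the minimum l2-norm interpolant given by the pseudoinverse.
rng = np.random.default_rng(1)
n, d = 20, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                       # realizable since rank(X) = n < d

w = np.zeros(d)
eta = 0.5 / np.linalg.norm(X, 2) ** 2        # conservative step size
for _ in range(50000):
    w -= eta * 2.0 * X.T @ (X @ w - y)       # gradient step on eq. (6.1)

w_min_norm = np.linalg.pinv(X) @ y           # minimum-norm global minimum
print(np.max(np.abs(X @ w - y)))             # ~0: w interpolates the data
print(np.linalg.norm(w - w_min_norm))        # ~0: same solution as pinv
```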
The following proposition is the simplest illustration of the algorithmic
regularization phenomenon.
Proposition 6.1.1. Consider gradient descent updates $w_t$ for the loss in
eq. (6.1) starting with initialization $w_0$. For any step size schedule that