
descent converges in the direction of the hard margin support vector machine solution (Theorem 6.3.2), even though the norm or margin is not explicitly specified in the optimization problem. In fact, such analyses showing that the implicit inductive bias of the optimization algorithm leads to generalization are not new. In the context of boosting algorithms, [? ] and [? ] established connections of the gradient boosting algorithm (coordinate descent) to ℓ1 norm minimization and ℓ1 margin maximization, respectively. Such minimum norm or maximum margin solutions are of course very special among all solutions or separators that fit the training data, and in particular can ensure generalization [? ? ].

In this chapter, we largely present results on the algorithmic regularization of vanilla gradient descent when minimizing an unregularized training loss in regression and classification problems over various simple and complex model classes. We also briefly discuss general algorithmic families like steepest descent and mirror descent.

6.1 Linear models in regression: squared loss

We first demonstrate algorithmic regularization in a simple linear regression setting where the prediction function is specified by a linear function of the inputs, f_w(x) = w⊤x, and we have the following empirical risk minimization objective:

L(w) = \sum_{i=1}^{n} \left( w^\top x^{(i)} - y^{(i)} \right)^2 . \qquad (6.1)
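For concreteness, here is a minimal numpy sketch of this objective and its gradient 2X⊤(Xw − y), with the rows of an (n × d) array X holding the samples x^(i); the function and variable names are illustrative, not from the text.

```python
import numpy as np

def squared_loss(w, X, y):
    """Empirical risk of eq. (6.1): sum of squared residuals over the n samples."""
    residuals = X @ w - y              # residual w^T x^(i) - y^(i) for each i
    return np.sum(residuals ** 2)

def squared_loss_grad(w, X, y):
    """Gradient of eq. (6.1): 2 X^T (X w - y)."""
    return 2.0 * X.T @ (X @ w - y)
```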

Such simple models are natural starting points for building analytical tools that extend to complex models, and such results provide intuitions for understanding and improving upon empirical practices in neural networks. Although the results in this section are stated for the squared loss, the results and proof techniques extend to any smooth loss ℓ(ŷ, y) between a prediction ŷ and a label y that is minimized at a unique and finite value of ŷ [? ].

We are particularly interested in the case where n < d and the observations are realizable, i.e., min_w L(w) = 0. Under these conditions, the optimization problem in eq. (6.1) is underdetermined and has multiple global minima, denoted by G = {w : ∀i, w⊤x^(i) = y^(i)}. In this and all the following problems we consider, the goal is to answer: Which specific global minima do different optimization algorithms reach when minimizing L(w)?
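As a quick numerical illustration of this underdetermined regime, the following sketch (randomly generated data with n < d; all names are illustrative) exhibits two different members of G: the minimum-norm interpolator and the same point shifted along a null-space direction of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                             # n < d: underdetermined, realizable problem
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum Euclidean norm interpolator: w = X^+ y
w_min_norm = np.linalg.pinv(X) @ y

# Any direction in the null space of X can be added without changing the fit
_, _, Vt = np.linalg.svd(X)
null_dir = Vt[-1]                        # satisfies X @ null_dir ~ 0 since rank(X) <= n < d
w_other = w_min_norm + 3.0 * null_dir

print(np.allclose(X @ w_min_norm, y))    # True: in G
print(np.allclose(X @ w_other, y))       # True: also in G
print(np.linalg.norm(w_min_norm), np.linalg.norm(w_other))   # the second has larger norm
```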

The following proposition is the simplest illustration of the algorithmic regularization phenomenon.

Proposition 6.1.1. Consider gradient descent updates w_t for the loss in eq. (6.1) starting with initialization w_0. For any step size schedule that drives the loss to its minimum, i.e., L(w_t) → 0, the iterates converge to the global minimum closest to the initialization in Euclidean norm: w_t → argmin_{w ∈ G} ‖w − w_0‖_2.
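The statement can be checked numerically; the sketch below (random realizable data, an illustrative constant step size chosen below the stability threshold, names not from the text) compares the gradient descent limit with the Euclidean projection of w_0 onto G.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50                            # underdetermined, realizable problem
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w0 = rng.standard_normal(d)              # arbitrary initialization
w = w0.copy()
# Constant step size safely below 1 / sigma_max(X)^2, the stability limit for this quadratic
eta = 1.0 / (4.0 * np.linalg.norm(X, ord=2) ** 2)
for _ in range(5000):
    w -= eta * 2.0 * X.T @ (X @ w - y)   # gradient of eq. (6.1)

# Closest global minimum to w0: Euclidean projection of w0 onto {w : Xw = y}
w_star = w0 + np.linalg.pinv(X) @ (y - X @ w0)

print(np.allclose(X @ w, y, atol=1e-6))  # True: training loss driven to ~0
print(np.linalg.norm(w - w_star))        # ~0: gradient descent reached the minimum closest to w0
```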
