
6 Algorithmic Regularization

Large scale neural networks used in practice are highly over-parameterized, with far more trainable model parameters than training examples. Consequently, the optimization objectives for learning such high capacity models have many global minima that fit the training data perfectly. However, minimizing the training loss using a specific optimization algorithm takes us not to just any global minimum, but to special global minima – in this sense the choice of optimization algorithm introduces an implicit form of inductive bias into learning, which can aid generalization.

In over-parameterized models, especially deep neural networks, much, if not most, of the inductive bias of the learned model comes from this implicit regularization by the optimization algorithm. For example, early empirical work on this topic shows that deep models often generalize well even when trained purely by minimizing the training error without any explicit regularization, and even when the networks are highly over-parameterized, to the extent of being able to fit random labels. Consequently, there are many zero training error solutions, all global minima of the training objective, most of which generalize horribly. Nevertheless, our choice of optimization algorithm, typically a variant of gradient descent, seems to prefer solutions that do generalize well. This generalization ability cannot be explained by the capacity of the explicitly specified model class (namely, the functions representable by the chosen architecture). Instead, the bias of the optimization algorithm toward a “simple” model, one minimizing some implicit “regularization measure”, say R(w), is key for generalization.

Understanding the implicit inductive bias, e.g. via characterizing R(w), is thus essential for understanding how and what the model learns. For example, in linear regression it can be shown that fitting an under-determined model (with more parameters than samples) using gradient descent yields the minimum ℓ2 norm solution (see Proposition 6.1.1), and for linear logistic regression trained on linearly separable data, gradient descent converges in direction to the maximum margin separator.
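
As a concrete illustration of the linear-regression case, the following is a minimal NumPy sketch (not part of the original text) of the claim behind Proposition 6.1.1: running gradient descent from zero initialization on an under-determined least-squares problem recovers the minimum ℓ2 norm interpolating solution. The problem sizes, step size, and iteration count below are arbitrary choices made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # n samples, d parameters: under-determined (d > n)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on the squared loss (1/2n)||Xw - y||^2, initialized at w = 0.
w = np.zeros(d)
lr = 0.01
for _ in range(20000):
    grad = X.T @ (X @ w - y) / n     # gradient is a linear combination of the rows of X
    w -= lr * grad

# Minimum l2 norm interpolating solution, computed via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual:", np.linalg.norm(X @ w - y))                    # ~ 0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))   # ~ 0
```

The sketch reflects the reasoning behind the proposition: every gradient step is a linear combination of the rows of X, so iterates started at zero never leave the row span of X, and the unique interpolating solution within that span is exactly the minimum ℓ2 norm one, which np.linalg.pinv(X) @ y computes.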
