6 Algorithmic Regularization
Large-scale neural networks used in practice are highly over-parameterized,
with far more trainable parameters than training examples. Consequently,
the optimization objectives for learning such high-capacity models have
many global minima that fit the training data perfectly. However,
minimizing the training loss with a specific optimization algorithm takes
us not to an arbitrary global minimum, but to special global minima – in
this sense the choice of optimization algorithm introduces an implicit
form of inductive bias into learning, which can aid generalization.
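To make this concrete, consider the following minimal numerical sketch (the toy problem sizes, variable names, and constants below are illustrative assumptions, not taken from any particular experiment): in an under-determined linear regression problem every interpolating solution attains zero training error, yet different interpolants can generalize very differently.

    import numpy as np

    # Toy under-determined regression: n = 20 samples, d = 200 parameters,
    # noise-free labels generated by a simple ground-truth vector.
    rng = np.random.default_rng(0)
    n, d = 20, 200
    X = rng.standard_normal((n, d))
    w_true = np.zeros(d)
    w_true[:5] = 1.0
    y = X @ w_true

    # Two solutions with zero training error:
    # (1) the minimum-l2-norm interpolant,
    # (2) an arbitrary interpolant obtained by adding a large component
    #     from the null space of X.
    w_min_norm = np.linalg.pinv(X) @ y
    null_basis = np.linalg.svd(X)[2][n:]      # rows spanning the null space of X
    w_arbitrary = w_min_norm + 5.0 * null_basis.T @ rng.standard_normal(d - n)

    X_test = rng.standard_normal((1000, d))
    y_test = X_test @ w_true
    for name, w in [("min-norm", w_min_norm), ("arbitrary", w_arbitrary)]:
        print(name,
              "train MSE:", np.mean((X @ w - y) ** 2),
              "test MSE:", np.mean((X_test @ w - y_test) ** 2))

Both solutions fit the training set perfectly, but only the minimum-norm one predicts well on fresh data; which of the many global minima the training procedure selects is therefore what matters.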
In over-parameterized models, especially deep neural networks,
much, if not most, of the inductive bias of the learned model comes
from this implicit regularization by the optimization algorithm. For
example, early empirical work on this topic shows that deep models
often generalize well even when trained purely by minimizing the
training error without any explicit regularization, and even when the
networks are over-parameterized to the extent of being able to fit
random labels. Consequently, there are
many zero training error solutions, all global minima of the training
objective, most of which generalize horribly. Nevertheless, our choice
of optimization algorithm, typically a variant of gradient descent,
seems to prefer solutions that do generalize well. This generalization
ability cannot be explained by the capacity of the explicitly specified
model class (namely, the functions representable in the chosen
architecture). Instead, the optimization algorithm's bias toward
a “simple” model, one minimizing some implicit “regularization measure”,
say R(w), is key for generalization. Understanding the implicit
inductive bias, e.g. via characterizing R(w), is thus essential for
understanding how and what the model learns. For example, in linear
regression it can be shown that minimizing an under-determined
least-squares objective (with more parameters than samples) using gradient
descent yields the minimum ℓ2-norm solution (see Proposition 6.1.1), and for
linear logistic regression trained on linearly separable data, gradient
descent converges in direction to the maximum-margin separator.
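As an illustration of the first claim, here is a minimal numerical sketch (the problem sizes, step size, and iteration count are illustrative choices, not part of Proposition 6.1.1): gradient descent on an under-determined least-squares objective, initialized at zero, converges to the same point as the pseudoinverse (minimum-ℓ2-norm) solution.

    import numpy as np

    # Gradient descent on (1/2)||Xw - y||^2 with d > n, started from w = 0.
    rng = np.random.default_rng(1)
    n, d = 30, 300
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    w = np.zeros(d)                        # zero initialization keeps w in the row space of X
    lr = 1.0 / np.linalg.norm(X, 2) ** 2   # step size 1/L, L = squared spectral norm of X
    for _ in range(2000):
        w -= lr * X.T @ (X @ w - y)        # gradient step on the squared loss

    w_min_norm = np.linalg.pinv(X) @ y     # closed-form minimum-norm interpolant
    print("training residual          :", np.linalg.norm(X @ w - y))
    print("distance to min-norm point :", np.linalg.norm(w - w_min_norm))

The iterates never leave the row space of X, and within that subspace the interpolant is unique, which is why the limit is the minimum-norm solution; starting from a nonzero point would instead yield the interpolant closest to the initialization.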