9
Inductive Biases due to Algorithmic Regularization
Many successful modern machine learning systems based on deep
neural networks are over-parametrized, i.e., the number of parameters
is typically much larger than the sample size. In other words,
there exist (infinitely) many (approximate) minimizers of the empirical
risk, many of which would not generalize well on unseen
data. For learning to succeed, then, it is crucial to bias the learning
algorithm towards “simpler” hypotheses by trading off the empirical loss
against a complexity term that keeps the empirical and population
risks close. Several explicit regularization strategies have
been used in practice to help these systems generalize, including $\ell_1$
and $\ell_2$ regularization of the parameters [? ].
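As a concrete formulation (the notation below is ours and is meant only as a standard illustration, not a definition fixed by this chapter), the $\ell_2$-regularized empirical risk minimization problem over training examples $(x_1, y_1), \dots, (x_n, y_n)$, with model $f_w$ and loss $\ell$, reads
\[
\min_{w}\ \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_w(x_i),\, y_i\big) \;+\; \lambda\,\|w\|_2^2 ,
\]
where the hyperparameter $\lambda > 0$ trades off data fit against the complexity penalty; replacing $\|w\|_2^2$ with $\|w\|_1$ gives the $\ell_1$ variant.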
Besides explicit regularization techniques, practitioners have used
a spectrum of algorithmic approaches to improve the generalization
ability of over-parametrized models, including early stopping
of back-propagation [? ], batch normalization [? ], dropout [? ], and
more.¹ While these heuristics have enjoyed tremendous success in
training deep networks, a theoretical understanding of how they
provide regularization in deep learning remains somewhat limited.
In this chapter, we investigate regularization due to Dropout,
an algorithmic heuristic recently proposed by [? ]. The basic idea
when training a neural network with dropout is that, during a
forward pass, we randomly drop neurons in the network,
independently and identically according to a Bernoulli distribution.
Specifically, at each round of the back-propagation algorithm, for
each neuron independently, with probability p we “drop” the neuron,
so that it does not participate in making a prediction for the given
data point, and with probability 1 − p we retain that neuron.²
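As a minimal sketch of this mechanism in Python/NumPy (the function name dropout_forward, the rescaling by 1/(1 − p) in the style of “inverted dropout”, and all other details are our own assumptions, not part of the text), each activation is zeroed independently with probability p during a forward pass:

import numpy as np

def dropout_forward(activations, p, rng):
    # Each neuron is dropped (zeroed) independently with probability p and
    # retained with probability 1 - p, as described above.
    mask = rng.random(activations.shape) >= p
    # Rescaling by 1/(1 - p) ("inverted dropout") is a common implementation
    # choice, assumed here, that keeps the expected activation unchanged.
    return activations * mask / (1.0 - p)

# Hypothetical usage on one hidden layer's activations during training:
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))          # activations of a hidden layer
h_train = dropout_forward(h, p=0.5, rng=rng)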
Deep learning is a field where key innovations have been driven
by practitioners, with several techniques motivated by drawing insights
from other fields. For instance, Dropout was introduced as
a way of breaking up “co-adaptation” among neurons, drawing in-
¹ We refer the reader to [? ] for an excellent exposition of over 50 such proposals.
² The parameter p is treated as a hyperparameter, which we typically tune based on a validation set.