Theory of Deep Learning, 2022
The notion of a “shorter description” will be formalized in a variety
of ways using a complexity measure for the class $\mathcal{H}$, denoted $C(\mathcal{H})$,
which we use to upper bound the generalization error.
Let $S$ be a sample of $m$ datapoints. Empirical risk minimization
gives us $\hat{h} = \arg\min_{h} \hat{L}(h)$, where $\hat{L}$ denotes the training loss. For
this chapter we will write $\hat{L}_S$ to emphasize the training set. Let $L_D(h)$
denote the expected loss of $h$ on the full data distribution $D$. Then
the generalization error is defined as $\Delta_S(h) = L_D(h) - \hat{L}_S(h)$. Intuitively,
if the generalization error is large then the hypothesis's performance on
the training sample $S$ does not accurately reflect its performance on the
full distribution of examples, so we say it overfitted to the sample $S$.
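As a sanity check, $\Delta_S(h)$ can be computed exactly in a toy setting. The sketch below (every number in it, including the noise rate and the thresholds, is invented for illustration) fixes a single threshold hypothesis, works out its population risk $L_D(h)$ in closed form, and compares it with its empirical risk on a small sample:

```python
import random

random.seed(0)

# Toy setup (all values invented for illustration): x ~ Uniform[0, 1],
# true label 1[x > 0.5], flipped with probability NOISE. The fixed
# hypothesis h predicts 1[x > THRESH], a slightly misplaced threshold.
NOISE, THRESH = 0.1, 0.6

def h(x):
    return int(x > THRESH)

def sample(m):
    data = []
    for _ in range(m):
        x = random.random()
        y = int((x > 0.5) != (random.random() < NOISE))  # noisy label
        data.append((x, y))
    return data

def empirical_loss(data):
    # \hat{L}_S(h): average 0-1 loss on the sample
    return sum(h(x) != y for x, y in data) / len(data)

# L_D(h) in closed form: on x in (0.5, 0.6] (probability 0.1) h disagrees
# with the clean rule, so it errs unless the label flipped; elsewhere it
# errs exactly when the label flipped.
L_D = 0.1 * (1 - NOISE) + 0.9 * NOISE

S = sample(50)
gap = L_D - empirical_loss(S)  # generalization error \Delta_S(h)
print(L_D, empirical_loss(S), round(gap, 3))
```

For a single fixed hypothesis the gap can land on either side of zero; the point of the theory is to control it uniformly once $h$ is *chosen using* $S$.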
The typical upper bound on generalization error\footnote{This is the format of a typical generalization bound!} shows that with
probability at least $1 - \delta$ over the choice of training data,
$$\Delta_S(h) \le \frac{C(\mathcal{H}) + O(\log 1/\delta)}{m} + \text{Sampling error term}. \tag{4.1}$$
Thus to drive the generalization error down it suffices to make $m$
significantly larger than the “complexity measure.” Hence classes
with lower complexity require fewer training samples, in line with
Occam's intuition.
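A quick numeric illustration of this point: treating the complexity measure and $\delta$ as fixed numbers (both made up here) and ignoring the sampling error term and hidden constants, the right-hand side of (4.1) shrinks as $1/m$:

```python
import math

# Schematic evaluation of the complexity part of the RHS of (4.1).
# C_H = 50 and delta = 0.05 are made-up values; the sampling error
# term and the constants inside O(.) are ignored.
def complexity_term(C_H, delta, m):
    return (C_H + math.log(1 / delta)) / m

for m in (100, 1_000, 10_000):
    print(m, round(complexity_term(50.0, 0.05, m), 5))
```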
4.0.2 Motivation for generalization theory
If the experimenter has already decided on the architecture, algorithm,
etc. to use, then generalization theory is of very limited use: they can
use a held-out dataset which is never seen during training. At the
end of training, evaluating the average loss on this held-out set yields a
valid estimate of $L_D(h)$ for the trained hypothesis $h$.
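Why the held-out estimate is valid: since $h$ never saw those points, each held-out example contributes an i.i.d. loss with mean $L_D(h)$, so the average is unbiased and concentrates. The simulation below (the risk value $0.18$ is a made-up stand-in for $L_D(h)$) illustrates this for a 0-1 loss:

```python
import random
import statistics

random.seed(1)

# Hypothetical: the trained hypothesis h errs on a fresh example with
# probability TRUE_RISK (a made-up stand-in for L_D(h)). Each held-out
# point then contributes an i.i.d. Bernoulli(TRUE_RISK) loss, so the
# average held-out loss is an unbiased estimator of L_D(h).
TRUE_RISK = 0.18

def heldout_estimate(n):
    return sum(random.random() < TRUE_RISK for _ in range(n)) / n

# Repeat the experiment to see the estimates concentrate near TRUE_RISK.
estimates = [heldout_estimate(2_000) for _ in range(200)]
print(round(statistics.mean(estimates), 3))
```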
Thus the hope in developing generalization theory is that it provides
insights that suggest architectures and algorithms which lead
to good generalization.
4.1 Some simple bounds on generalization error
The first bound we prove is trivial but, as we shall see, it is also at the heart
of most other generalization bounds (albeit often hidden inside the
proof). The bound shows that if a hypothesis class contains at most
$N$ distinct hypotheses, then $\log N$ (i.e., the number of bits needed to
describe a single hypothesis in this class) functions as a complexity
measure.
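Before the formal statement, here is a rough simulation of the idea (the class size, sample size, and $\delta$ are all made up). Each hypothesis's empirical risk deviates from its true risk by a Hoeffding-sized amount, and a union bound over all $N$ hypotheses controls the worst deviation by $\sqrt{(\log N + \log(1/\delta))/2m}$:

```python
import math
import random

random.seed(2)

# Made-up sizes: N hypotheses, m i.i.d. samples, confidence 1 - delta.
N, m, delta = 500, 1_000, 0.05

# Each hypothesis has some true risk p; its empirical risk on m points
# deviates from p by a Hoeffding-sized amount. The union bound controls
# the *worst* deviation over all N hypotheses simultaneously.
worst = 0.0
for _ in range(N):
    p = random.random()  # true risk of this hypothesis
    emp = sum(random.random() < p for _ in range(m)) / m
    worst = max(worst, abs(emp - p))

bound = math.sqrt((math.log(N) + math.log(1 / delta)) / (2 * m))
print(round(worst, 4), "<=", round(bound, 4))
```

Note that the bound grows only like $\sqrt{\log N}$, which is why even a very large finite class can generalize from a modest sample.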
Theorem 4.1.1 (Simple union bound). If the loss function takes values
in $[0, 1]$ and the hypothesis class $\mathcal{H}$ contains $N$ distinct hypotheses, then with