
Theory of Deep Learning, 2022


The notion of "shorter description" will be formalized in a variety of ways using a complexity measure for the class $\mathcal{H}$, denoted $C(\mathcal{H})$, which is then used to upper bound the generalization error.

Let $S$ be a sample of $m$ datapoints. Empirical risk minimization gives us $\hat{h} = \arg\min_{h} \hat{L}(h)$, where $\hat{L}$ denotes the training loss. For this chapter we will write $\hat{L}_S$ to emphasize the training set. Let $L_D(h)$ denote the expected loss of $h$ on the full data distribution $D$. Then the generalization error is defined as $\Delta_S(h) = L_D(h) - \hat{L}_S(h)$. Intuitively, if the generalization error is large then the hypothesis's performance on the training sample $S$ does not accurately reflect its performance on the full distribution of examples, so we say it overfitted to the sample $S$.
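As a concrete sketch of these quantities (an illustrative toy setup, not from the text: the threshold classifiers, the noisy one-dimensional distribution, and the Monte Carlo estimate of $L_D$ are all assumptions), one can run ERM over a small finite class and measure the gap between training loss and population loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution D: x ~ Uniform[0, 1], label y = 1[x > 0.5] with 10% label noise.
def sample(n):
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    flip = rng.random(n) < 0.1
    return x, np.where(flip, 1 - y, y)

def zero_one_loss(threshold, x, y):
    pred = (x > threshold).astype(int)
    return np.mean(pred != y)

m = 50
x_train, y_train = sample(m)  # training sample S

# ERM over a small grid of threshold hypotheses (a finite class H).
thresholds = np.linspace(0, 1, 21)
train_losses = [zero_one_loss(t, x_train, y_train) for t in thresholds]
h_hat = thresholds[int(np.argmin(train_losses))]

# Approximate L_D(h_hat) with a large fresh sample; Delta_S = L_D - L_S.
x_big, y_big = sample(200_000)
L_S = zero_one_loss(h_hat, x_train, y_train)
L_D = zero_one_loss(h_hat, x_big, y_big)
print(f"train loss {L_S:.3f}, population loss {L_D:.3f}, gap {L_D - L_S:.3f}")
```

Because of the 10% label noise, no hypothesis can achieve population loss below $0.1$, so any training loss much below that is a sign of overfitting to $S$.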

The typical upper bound on generalization error shows that with probability at least $1 - \delta$ over the choice of training data, the following holds:¹

$$\Delta_S(h) \;\le\; \frac{C(\mathcal{H}) + O(\log 1/\delta)}{m} \;+\; \text{(sampling error term)}. \qquad (4.1)$$

¹ This is the format of a typical generalization bound!

Thus to drive the generalization error down it suffices to make $m$ significantly larger than the "complexity measure." Hence classes with lower complexity require fewer training samples, in line with Occam's intuition.
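For intuition, one can plug numbers into the format of (4.1). This is purely illustrative (not from the text): it takes $C(\mathcal{H}) = \log N$ for a class of $N$ hypotheses, sets the hidden $O(\cdot)$ constant to 1, and ignores the sampling error term.

```python
import math

# Illustrative leading term of a bound in the format of (4.1),
# with C(H) = log N and the O(.) constant taken to be 1 (assumptions).
def bound_leading_term(N, m, delta=0.05):
    return (math.log(N) + math.log(1 / delta)) / m

# The leading term shrinks as m grows past the complexity measure...
for m in [100, 1_000, 10_000]:
    print(f"N = 2^20, m = {m:>6}: {bound_leading_term(N=2**20, m=m):.4f}")

# ...and, at fixed m, a lower-complexity class gives a smaller bound.
print(bound_leading_term(N=2, m=100), bound_leading_term(N=2**20, m=100))
```

The two trends visible here are exactly Occam's intuition: more samples, or a simpler class, both shrink the bound.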

4.0.2 Motivation for generalization theory

If the experimenter has already decided on the architecture, algorithm, etc. to use, then generalization theory is of very limited use: they can use a held-out dataset which is never seen during training. At the end of training, evaluating the average loss on this held-out set yields a valid estimate of $L_D(h)$ for the trained hypothesis $h$.
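The held-out estimate described above can be sketched as follows. This is a minimal illustration under assumed specifics (a noisy linear-regression task, a least-squares fit, and a 1000/200 split, none of which come from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy task: noisy linear labels; the "trained hypothesis" is a
# least-squares fit on the training portion only.
X = rng.normal(size=(1200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=1200)

# Hold out 200 points that the training procedure never sees.
X_train, y_train = X[:1000], y[:1000]
X_held, y_held = X[1000:], y[1000:]

w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Average squared loss on the held-out set: since these points are fresh
# draws from D, this is an unbiased estimate of L_D(h) for the fitted h.
held_out_loss = np.mean((X_held @ w_hat - y_held) ** 2)
print(f"held-out estimate of L_D: {held_out_loss:.3f}")
```

The key point is that the held-out points never influence $\hat{h}$, so their average loss is an honest estimate of $L_D(\hat{h})$; reusing them during training would break this.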

Thus the hope in developing generalization theory is that it provides insight into designing architectures and algorithms that lead to good generalization.

4.1 Some simple bounds on generalization error

The first bound we prove is trivial, but as we shall see it is also at the heart of most other generalization bounds (albeit often hidden inside the proof). The bound shows that if a hypothesis class contains at most $N$ distinct hypotheses, then $\log N$ (i.e., the number of bits needed to describe a single hypothesis in this class) functions as a complexity measure.
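Before the formal statement, the union-bound idea can be checked numerically. In this illustrative simulation (all specifics assumed, not from the text), each of $N$ hypotheses has true loss exactly $1/2$ on random labels, its empirical loss on $m$ samples fluctuates, and the worst fluctuation over the whole class stays below a Hoeffding-plus-union-bound threshold of order $\sqrt{\log N / m}$:

```python
import numpy as np

rng = np.random.default_rng(2)

# N hypotheses, each with true 0-1 loss exactly 1/2 (random guessing on
# random labels); empirical loss on m samples is Binomial(m, 1/2) / m.
N, m = 1000, 500  # illustrative sizes
emp_losses = rng.binomial(m, 0.5, size=N) / m

# Worst deviation of empirical from true loss over the whole class.
worst_gap = np.max(np.abs(emp_losses - 0.5))

# Hoeffding + union bound at confidence delta = 0.05:
# P(exists h with gap > t) <= 2N exp(-2mt^2), solved for t.
delta = 0.05
threshold = np.sqrt(np.log(2 * N / delta) / (2 * m))
print(f"worst gap {worst_gap:.3f} vs union-bound threshold {threshold:.3f}")
```

Even with a thousand hypotheses sharing the same $m$ samples, the worst gap stays under the threshold, which is exactly what the union bound guarantees with probability $1 - \delta$.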

Theorem 4.1.1 (Simple union bound). If the loss function takes values in $[0, 1]$ and the hypothesis class $\mathcal{H}$ contains $N$ distinct hypotheses, then with
