are descended from an old philosophical tradition of considering
the logical foundations for belief systems, which often uses Bayes’
Theorem. For example, in the 18th century, Laplace sought to give
meaning to questions like “What is the probability that the sun will rise
tomorrow?” The answer to this question depends upon the person’s
prior beliefs as well as their empirical observation that the sun has
risen every day in their lifetime.
Coming back to ML, PAC-Bayes bounds assume that the experimenter (i.e., the machine learning expert) has some prior distribution P over the hypothesis class H. If asked to classify without seeing any concrete training data, the experimenter would pick a hypothesis h according to P (denoted h ∼ P) and classify using it. After seeing the training data and running computations, the experimenter's distribution changes⁶ to the posterior Q, meaning that if now asked to classify they would pick h ∼ Q and use that. Thus the expected test loss is $\mathop{\mathbb{E}}_{h\sim Q}[L_D(h)]$.

⁶ To illustrate the PAC-Bayes chain of thought for deep learning, P could be the uniform distribution on all deep nets with a certain architecture, and the posterior Q is the distribution on deep nets obtained by random initialization followed by training on m randomly sampled datapoints using SGD.
Theorem 4.3.1 (PAC-Bayes bound). Consider a distribution D on the data. Let P be a prior distribution over the hypothesis class H and let δ > 0. Then with probability ≥ 1 − δ over an i.i.d. sample S of size m from D, for all distributions Q over H (which could possibly depend on S), we have that

$$\Delta_S(Q(H)) = \mathop{\mathbb{E}}_{h\sim Q}[L_D(h)] - \mathop{\mathbb{E}}_{h\sim Q}[L_S(h)] \le \sqrt{\frac{D(Q\|P) + \ln(m/\delta)}{2(m-1)}},$$

where $D(Q\|P) = \mathop{\mathbb{E}}_{h\sim Q}\left[\ln \frac{Q(h)}{P(h)}\right]$ is the so-called KL-divergence.⁷

⁷ This is a measure of distance between distributions, meaningful when P dominates Q, in the sense that every h with nonzero probability in Q also has nonzero probability in P.
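To make the quantities in Theorem 4.3.1 concrete before interpreting it, here is a minimal numerical sketch (ours, not the text's) that evaluates the right-hand side for a finite hypothesis class with a uniform prior and a posterior concentrated on a few hypotheses after training. The function names and all numbers are illustrative assumptions.

```python
import numpy as np

def kl_divergence(q, p):
    """D(Q||P) = E_{h~Q}[ln(Q(h)/P(h))] for discrete distributions.
    Meaningful only when P dominates Q (p > 0 wherever q > 0)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0  # hypotheses with q(h) = 0 contribute nothing
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def pac_bayes_gap_bound(q, p, m, delta):
    """Right-hand side of Theorem 4.3.1:
    sqrt((D(Q||P) + ln(m/delta)) / (2(m-1)))."""
    return np.sqrt((kl_divergence(q, p) + np.log(m / delta)) / (2 * (m - 1)))

# Illustrative setup: 1000 hypotheses, uniform prior,
# posterior concentrated on 10 of them after "training".
n = 1000
prior = np.full(n, 1.0 / n)
posterior = np.zeros(n)
posterior[:10] = 1.0 / 10

for m in [100, 1000, 10000]:
    print(m, pac_bayes_gap_bound(posterior, prior, m, delta=0.05))
# The bound shrinks as m grows and grows with D(Q||P),
# i.e., with how far the posterior moves from the prior.
```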
In other words, generalization error is upper bounded by the
square root of the KL-divergence of the distributions (plus some
terms that arise from concentration bounds). Thus, in order to minimize
the error on the real distribution, we should try to simultaneously
minimize the empirical error as well as the KL-divergence
between the posterior and the prior. First, let's observe that for a fixed h, using a standard Hoeffding inequality, we have that

$$\Pr_S\left[\Delta_S(h) > \epsilon\right] \le e^{-2m\epsilon^2}. \qquad (4.3)$$
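As a sanity check, the following simulation sketch (again ours, not the book's) verifies (4.3) empirically for a fixed h under 0-1 loss: $L_D(h)$ is the true error rate and $L_S(h)$ is the mean of m Bernoulli draws. The values of true_err, m, and trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_err = 0.30   # L_D(h) for a fixed hypothesis h (assumed value)
m, trials = 200, 100_000

# Each row is one i.i.d. training sample S; Delta_S(h) = L_D(h) - L_S(h).
losses = rng.random((trials, m)) < true_err   # 0-1 losses with mean true_err
delta_s = true_err - losses.mean(axis=1)      # generalization gap per sample

for eps in [0.05, 0.10, 0.15]:
    empirical = (delta_s > eps).mean()
    hoeffding = np.exp(-2 * m * eps**2)
    print(f"eps={eps:.2f}  empirical tail={empirical:.5f}  bound={hoeffding:.5f}")
# The empirical tail probabilities stay below exp(-2 m eps^2), as (4.3) predicts.
```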
Roughly, inequality (4.3) says that $\sqrt{m}\,\Delta_S(h)$ concentrates at least as strongly as a univariate Gaussian.⁸ By direct integration over the Gaussian distribution, this also implies that

$$\mathop{\mathbb{E}}_S\left[e^{2(m-1)\Delta_S(h)^2}\right] \le m,$$

and therefore, with high probability over S,

$$e^{2(m-1)\Delta_S(h)^2} = O(m). \qquad (4.4)$$

⁸ Low generalization error alone does not imply that h is any good! For example, h can have terrible loss on D, which is faithfully captured in the training set!
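For completeness, here is one way to fill in these two steps; this reconstruction is ours, not the text's, and it treats only the upper tail as in (4.3) (handling the two-sided tail of $\Delta_S(h)$ changes the constant but not the $O(m)$ conclusion). For the moment bound, substitute $t = e^{2(m-1)\epsilon^2}$ in the tail-integration formula for expectations and apply (4.3):

$$\mathop{\mathbb{E}}_S\left[e^{2(m-1)\Delta_S(h)^2}\right] \le 1 + \int_0^\infty \Pr_S\left[\Delta_S(h) > \epsilon\right] \cdot 4(m-1)\,\epsilon\, e^{2(m-1)\epsilon^2}\, d\epsilon \le 1 + 4(m-1)\int_0^\infty \epsilon\, e^{-2\epsilon^2}\, d\epsilon = m,$$

using $\int_0^\infty \epsilon\, e^{-2\epsilon^2}\, d\epsilon = 1/4$. Then (4.4) follows from Markov's inequality: for any $\delta > 0$, $\Pr_S\left[e^{2(m-1)\Delta_S(h)^2} \ge m/\delta\right] \le \delta$.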