
PAC-Bayes bounds are descended from an old philosophical tradition of considering the logical foundations for belief systems, which often uses Bayes' Theorem. For example, in the 18th century, Laplace sought to give meaning to questions like "What is the probability that the sun will rise tomorrow?" The answer to this question depends upon the person's prior beliefs as well as their empirical observation that the sun has risen every day in their lifetime.

Coming back to ML, PAC-Bayes bounds assume that the experimenter (i.e., the machine learning expert) has some prior distribution P over the hypothesis class H. If asked to classify without seeing any concrete training data, the experimenter would pick a hypothesis h according to P (denoted h ∼ P) and classify using it. After seeing the training data and running computations, the experimenter's distribution changes⁶ to the posterior Q, meaning now if asked to classify they would pick h ∼ Q and use that. Thus the expected loss is

\[ \mathbb{E}_{h\sim Q}\big[L_D(h)\big]. \]
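In practice this expectation over the posterior is typically estimated by sampling. Below is a minimal Python sketch of that idea (my own illustration, not from the text); the toy posterior, threshold classifier, and names like sample_h and loss are hypothetical stand-ins for "draw h ∼ Q and evaluate its loss".

import random

def expected_loss(sample_h, loss, n_draws=1000):
    # Monte Carlo estimate of E_{h ~ Q}[loss(h)]: draw hypotheses from the
    # posterior Q and average their losses.
    return sum(loss(sample_h()) for _ in range(n_draws)) / n_draws

# Toy usage: Q is a Gaussian over a 1-d threshold classifier, and loss is its
# empirical error on a small labelled sample S.
S = [(-0.9, 0), (-0.2, 0), (0.3, 1), (1.1, 1)]
sample_h = lambda: random.gauss(0.0, 0.3)                        # h ~ Q
loss = lambda h: sum(int((x > h) != y) for x, y in S) / len(S)   # L_S(h)
print(expected_loss(sample_h, loss))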

Theorem 4.3.1 (PAC-Bayes bound). Consider a distribution D on the data. Let P be a prior distribution over the hypothesis class H and δ > 0. Then with probability ≥ 1 − δ over an i.i.d. sample S of size m from D, for all distributions Q over H (which could possibly depend on S), we have that

\[ \Delta_S(Q) \;=\; \mathbb{E}_{h\sim Q}\big[L_D(h)\big] - \mathbb{E}_{h\sim Q}\big[L_S(h)\big] \;\le\; \sqrt{\frac{D(Q\,\|\,P) + \ln(m/\delta)}{2(m-1)}}\,, \]

where $D(Q\,\|\,P) = \mathbb{E}_{h\sim Q}\big[\ln \tfrac{Q(h)}{P(h)}\big]$ is the so-called KL-divergence⁷.
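To get a feel for the numbers, here is a minimal sketch of evaluating the right-hand side of Theorem 4.3.1, under the illustrative assumption (mine, not from the text) that P and Q are diagonal Gaussians over the parameters, so that D(Q‖P) has a closed form; the function names kl_diag_gaussians and pac_bayes_rhs are hypothetical.

import math

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # Closed-form D(Q||P) when Q = N(mu_q, diag(sigma_q^2)) and
    # P = N(mu_p, diag(sigma_p^2)).
    return sum(math.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2 * sp**2) - 0.5
               for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p))

def pac_bayes_rhs(kl, m, delta):
    # Right-hand side of Theorem 4.3.1: sqrt((D(Q||P) + ln(m/delta)) / (2(m-1))).
    return math.sqrt((kl + math.log(m / delta)) / (2 * (m - 1)))

# Toy numbers: a 3-parameter "network", m = 50000 samples, delta = 0.05.
kl = kl_diag_gaussians([0.3, -0.1, 0.7], [0.1] * 3, [0.0] * 3, [1.0] * 3)
print(pac_bayes_rhs(kl, m=50000, delta=0.05))   # bound on E_Q[L_D] - E_Q[L_S]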

In other words, the generalization error is upper bounded by the square root of the KL-divergence between the two distributions (plus some terms that arise from concentration bounds). Thus, in order to minimize the error on the real distribution, we should try to simultaneously minimize the empirical error as well as the KL-divergence between the posterior and the prior.

⁶ To illustrate the PAC-Bayes chain of thought for deep learning, P could be the uniform distribution on all deep nets with a certain architecture, and the posterior Q is the distribution on deep nets obtained by random initialization followed by training on m randomly sampled datapoints using SGD.

⁷ This is a measure of distance between distributions, meaningful when P dominates Q, in the sense that every h with nonzero probability in Q also has nonzero probability in P.

First, let us observe that for a fixed h, using a standard Hoeffding inequality, we have that

\[ \Pr_S\big[\Delta_S(h) > \epsilon\big] \;\le\; e^{-2m\epsilon^2}. \tag{4.3} \]
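For a sense of scale (the numbers here are chosen purely for illustration): with m = 10,000 samples and ε = 0.02, the right-hand side of (4.3) is

\[ e^{-2\cdot 10000\cdot (0.02)^2} \;=\; e^{-8} \;\approx\; 3.4\times 10^{-4}, \]

so a single fixed hypothesis is very unlikely to have a generalization gap larger than 0.02.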

Roughly, this says that √m · Δ_S(h) concentrates at least as strongly as a univariate Gaussian.⁸ By direct integration over the Gaussian distribution (a sketch of this step is given after (4.4) below), this also implies that

\[ \mathbb{E}_S\Big[e^{2(m-1)\Delta_S(h)^2}\Big] \;\le\; m \]

and therefore, with high probability over S,

\[ e^{2(m-1)\Delta_S(h)^2} = O(m). \tag{4.4} \]

⁸ Low generalization error alone does not imply that h is any good! For example, h can have terrible loss on D, which is faithfully captured in the training set!
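Here is a sketch of the integration step behind the bound $\mathbb{E}_S[e^{2(m-1)\Delta_S(h)^2}] \le m$ above (my own filling-in of the argument; the text only asserts the result). Assume, as in (4.3), that $\Pr_S[\Delta_S(h)^2 > \epsilon^2] \le e^{-2m\epsilon^2}$ for every $\epsilon > 0$ (handling the negative tail of $\Delta_S(h)$ as well only changes the constant). Writing the expectation of the nonnegative random variable $e^{2(m-1)\Delta_S(h)^2}$ as an integral of its tail probabilities,

\begin{align*}
\mathbb{E}_S\Big[e^{2(m-1)\Delta_S(h)^2}\Big]
  &= \int_0^\infty \Pr_S\Big[e^{2(m-1)\Delta_S(h)^2} > t\Big]\, dt
   \;\le\; 1 + \int_1^\infty \Pr_S\Big[\Delta_S(h)^2 > \tfrac{\ln t}{2(m-1)}\Big]\, dt \\
  &\le 1 + \int_1^\infty e^{-\frac{m}{m-1}\ln t}\, dt
   \;=\; 1 + \int_1^\infty t^{-\frac{m}{m-1}}\, dt
   \;=\; 1 + (m-1) \;=\; m.
\end{align*}

Equation (4.4) then follows by applying Markov's inequality to the nonnegative random variable $e^{2(m-1)\Delta_S(h)^2}$.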
