
Basics of generalization theory

Now consider the expression (derived by working backwards from the statement of the claim)
\[
2(m-1)\Bigl(\mathbb{E}_{h\sim Q}[\Delta(h)]\Bigr)^{2} - D(Q\,\|\,P)
\;\le\;
2(m-1)\,\mathbb{E}_{h\sim Q}\bigl[\Delta(h)^{2}\bigr] - D(Q\,\|\,P),
\]

where the inequality is by convexity of squares, i.e., \(\bigl(\mathbb{E}[X]\bigr)^{2}\le \mathbb{E}[X^{2}]\). Expanding the KL term via \(D(Q\,\|\,P)=\mathbb{E}_{h\sim Q}[\ln(Q(h)/P(h))]\), this in turn is
\[
\begin{aligned}
2(m-1)\,\mathbb{E}_{h\sim Q}\bigl[\Delta(h)^{2}\bigr] - D(Q\,\|\,P)
&= \mathbb{E}_{h\sim Q}\left[\,2(m-1)\Delta(h)^{2} - \ln\frac{Q(h)}{P(h)}\,\right]\\
&= \mathbb{E}_{h\sim Q}\left[\,\ln\!\left(e^{2(m-1)\Delta(h)^{2}}\,\frac{P(h)}{Q(h)}\right)\right]\\
&\le \ln\,\mathbb{E}_{h\sim Q}\left[\,e^{2(m-1)\Delta(h)^{2}}\,\frac{P(h)}{Q(h)}\,\right],
\end{aligned}
\]

where the last inequality uses Jensen’s inequality⁹ along with the concavity of ln. Also, since taking the expectation over h ∼ Q is effectively like summing with a weighting by Q(h), we have¹⁰
\[
\ln\,\mathbb{E}_{h\sim Q}\left[\,e^{2(m-1)\Delta(h)^{2}}\,\frac{P(h)}{Q(h)}\,\right]
= \ln\,\mathbb{E}_{h\sim P}\left[\,e^{2(m-1)\Delta(h)^{2}}\,\right].
\]
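Spelled out for a discrete hypothesis class (this expansion is my own addition, not in the original text; the continuous case replaces sums with integrals over densities), the change-of-measure step is just
\[
\mathbb{E}_{h\sim Q}\left[g(h)\,\frac{P(h)}{Q(h)}\right]
= \sum_{h} Q(h)\,g(h)\,\frac{P(h)}{Q(h)}
= \sum_{h} P(h)\,g(h)
= \mathbb{E}_{h\sim P}\bigl[g(h)\bigr],
\qquad \text{with } g(h) := e^{2(m-1)\Delta(h)^{2}}.
\]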

Recapping, we thus have that
\[
2(m-1)\Bigl(\mathbb{E}_{h\sim Q}[\Delta(h)]\Bigr)^{2} - D(Q\,\|\,P)
\;\le\;
\ln\left(\mathbb{E}_{h\sim P}\left[\,e^{2(m-1)\Delta(h)^{2}}\,\right]\right). \tag{4.5}
\]

Now using the fact that the belief P was fixed before seeing S (i.e., is independent of S):
\[
\mathbb{E}_{S}\,\mathbb{E}_{h\sim P}\left[\,e^{2(m-1)\Delta(h)^{2}}\,\right]
= \mathbb{E}_{h\sim P}\,\mathbb{E}_{S}\left[\,e^{2(m-1)\Delta(h)^{2}}\,\right]
\;\le\; m.
\]
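As a quick numerical sanity check of the inner bound \(\mathbb{E}_{S}[e^{2(m-1)\Delta(h)^{2}}]\le m\) (this snippet is my own illustration; the function name `moment`, the choice of 0/1 losses, and the true error p = 0.5 are assumptions, not part of the text), one can compute the expectation exactly for a fixed hypothesis whose per-example losses are i.i.d. Bernoulli:

```python
import math

# Exact computation of E_S[exp(2(m-1) * Delta(h)^2)] for a fixed hypothesis h
# whose per-example 0/1 losses are i.i.d. Bernoulli(p); Delta(h) is the
# empirical error on a sample S of size m minus the true error p.
# (Illustrative sketch only; "moment" and p = 0.5 are made-up names/values.)
def moment(m, p=0.5):
    total = 0.0
    for k in range(m + 1):                                   # k = mistakes in S
        prob = math.comb(m, k) * p**k * (1 - p)**(m - k)     # binomial pmf
        delta = k / m - p                                    # empirical - true error
        total += prob * math.exp(2 * (m - 1) * delta**2)
    return total

for m in (10, 50, 200, 1000):
    print(m, round(moment(m), 2))   # stays well below m (grows roughly like sqrt(m))
```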

Thus, (1) implies that with high probability over S,
\[
\mathbb{E}_{h\sim P}\left[\,e^{2(m-1)\Delta(h)^{2}}\,\right] = O(m). \tag{4.6}
\]
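The statement labelled (1) is not reproduced in this excerpt, but the step is the standard Markov-inequality argument (my own reconstruction): since the quantity above is nonnegative and has expectation at most m over S,
\[
\Pr_{S}\!\left[\,\mathbb{E}_{h\sim P}\left[e^{2(m-1)\Delta(h)^{2}}\right] \ge \frac{m}{\delta}\,\right]
\;\le\; \frac{\delta}{m}\;\mathbb{E}_{S}\,\mathbb{E}_{h\sim P}\left[e^{2(m-1)\Delta(h)^{2}}\right]
\;\le\; \delta,
\]
so with probability at least 1 − δ over S the expectation is at most m/δ = O(m) for any fixed confidence level δ.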

⁹ Jensen’s inequality: for a concave function f and a random variable X, \(\mathbb{E}[f(X)] \le f(\mathbb{E}[X])\).

¹⁰ Often when you see a KL-divergence in machine learning, you will see this trick being used to switch the distribution over which the expectation is taken!

Thus, combining the above, we get
\[
2(m-1)\Bigl(\mathbb{E}_{h\sim Q}[\Delta(h)]\Bigr)^{2} - D(Q\,\|\,P) \;\le\; O(\ln m),
\]
which implies
\[
\Bigl(\mathbb{E}_{h\sim Q}[\Delta(h)]\Bigr)^{2} \;\le\; \frac{O(\ln m) + D(Q\,\|\,P)}{2(m-1)}.
\]

Taking the square root on both sides of the above inequality, we get
\[
\mathbb{E}_{h\sim Q}[\Delta(h)] \;\le\; \sqrt{\frac{O(\ln m) + D(Q\,\|\,P)}{2(m-1)}}\,,
\]
which completes our proof sketch.
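For a rough feel of the rate (again my own illustration, not from the text: the hidden constant in O(ln m) is simply taken to be 1, and D(Q‖P) = 10 nats is an arbitrary placeholder), the final bound shrinks roughly like \(\sqrt{\ln m / m}\):

```python
import math

# Rough scaling of the final PAC-Bayes-style bound on E_{h~Q}[Delta(h)].
# Illustrative only: the O(ln m) constant is taken to be 1 and
# kl = D(Q||P) = 10 nats is an arbitrary placeholder value.
def bound(m, kl=10.0):
    return math.sqrt((math.log(m) + kl) / (2 * (m - 1)))

for m in (100, 1_000, 10_000, 100_000):
    print(m, round(bound(m), 4))   # shrinks roughly like sqrt(log(m)/m)
```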
