Basics of generalization theory
Now consider the expression (derived by working backwards from the
statement of the claim):
\[
2(m-1)\left(\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)]\right)^2 - D(Q\|P) \;\le\; 2(m-1)\mathop{\mathbb{E}}_{h\sim Q}\left[\Delta(h)^2\right] - D(Q\|P),
\]
where the inequality is by convexity of the square function. The right-hand side in turn satisfies
\[
\begin{aligned}
2(m-1)\mathop{\mathbb{E}}_{h\sim Q}\left[\Delta(h)^2\right] - D(Q\|P)
&= \mathop{\mathbb{E}}_{h\sim Q}\left[2(m-1)\Delta(h)^2 - \ln\frac{Q(h)}{P(h)}\right]\\
&= \mathop{\mathbb{E}}_{h\sim Q}\left[\ln\left(e^{2(m-1)\Delta(h)^2}\,\frac{P(h)}{Q(h)}\right)\right]\\
&\le \ln\mathop{\mathbb{E}}_{h\sim Q}\left[e^{2(m-1)\Delta(h)^2}\,\frac{P(h)}{Q(h)}\right],
\end{aligned}
\]
where the last inequality uses Jensen's inequality (for a concave function $f$ and a random variable $X$, $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$) together with the concavity of $\ln$. Also, since taking the expectation over $h \sim Q$ is effectively summing with a weighting by $Q(h)$, we have
\[
\ln\mathop{\mathbb{E}}_{h\sim Q}\left[e^{2(m-1)\Delta(h)^2}\,\frac{P(h)}{Q(h)}\right] = \ln\mathop{\mathbb{E}}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right].
\]
(Often when you see KL divergence in machine learning, you will see this trick used to switch the distribution over which the expectation is taken!)
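To see this distribution-switching trick concretely, here is a minimal numerical sketch (the toy hypothesis-class size, the distributions $P$ and $Q$, and the function `f` standing in for $e^{2(m-1)\Delta(h)^2}$ are all made-up illustrative choices):

```python
import numpy as np

# Check: E_{h~Q}[f(h) * P(h)/Q(h)] = E_{h~P}[f(h)] on a toy finite class.
rng = np.random.default_rng(0)

n = 5                               # size of a toy finite hypothesis class
P = rng.random(n); P /= P.sum()     # "prior" belief, fixed before seeing data
Q = rng.random(n); Q /= Q.sum()     # "posterior" over hypotheses
f = rng.random(n)                   # any function of h

lhs = np.sum(Q * f * (P / Q))       # expectation under Q with importance weights
rhs = np.sum(P * f)                 # expectation directly under P
assert np.isclose(lhs, rhs)         # identical up to floating-point error
```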
Recapping, we thus have that
\[
2(m-1)\left(\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)]\right)^2 - D(Q\|P) \le \ln\left(\mathop{\mathbb{E}}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right]\right). \tag{4.5}
\]
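Since (4.5) holds for every pair of distributions $P$, $Q$ and every gap function $\Delta$, it can be spot-checked numerically. The sketch below draws random instances on a toy finite class (the class size, $m$, and the range of the gaps are arbitrary illustrative choices) and verifies that the left side never exceeds the right:

```python
import numpy as np

# Spot-check inequality (4.5) on random toy instances.
rng = np.random.default_rng(1)
m, n = 50, 8                                   # sample size and class size (arbitrary)

for _ in range(1000):
    P = rng.random(n); P /= P.sum()
    Q = rng.random(n); Q /= Q.sum()
    delta = rng.uniform(-0.1, 0.1, n)          # stand-in values for the gaps Δ(h)
    kl = np.sum(Q * np.log(Q / P))             # D(Q||P)
    lhs = 2*(m-1) * np.sum(Q * delta)**2 - kl  # left side of (4.5)
    rhs = np.log(np.sum(P * np.exp(2*(m-1) * delta**2)))  # right side of (4.5)
    assert lhs <= rhs + 1e-12
```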
Now using the fact that the belief $P$ was fixed before seeing $S$ (i.e., is independent of $S$), we may swap the order of the two expectations:
\[
\mathop{\mathbb{E}}_{S}\left[\mathop{\mathbb{E}}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right]\right] = \mathop{\mathbb{E}}_{h\sim P}\left[\mathop{\mathbb{E}}_{S}\left[e^{2(m-1)\Delta(h)^2}\right]\right] \le m.
\]
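As a sanity check of the inner bound, the Monte Carlo sketch below estimates $\mathbb{E}_S\left[e^{2(m-1)\Delta(h)^2}\right]$ for a single fixed hypothesis, assuming $\Delta(h)$ is the gap between the true error and the empirical error on the $m$ samples (the true-error value $0.5$, the sample size, and the trial count are arbitrary choices):

```python
import numpy as np

# Estimate E_S[exp(2(m-1)Δ(h)^2)] for one fixed h by simulating draws of S.
rng = np.random.default_rng(2)
m, p, trials = 100, 0.5, 200_000            # sample size, true error, Monte Carlo trials

emp = rng.binomial(m, p, size=trials) / m   # empirical error on each simulated S
delta = p - emp                             # gap Δ(h) for each S
est = np.mean(np.exp(2*(m-1) * delta**2))   # ≈ E_S[exp(2(m-1)Δ(h)^2)]
print(est, "<=", m)                         # noisy estimate, but well below m
```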
Thus, (1) implies that with high probability over $S$,
\[
\mathop{\mathbb{E}}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right] = O(m). \tag{4.6}
\]
Combining the above, we get
\[
2(m-1)\left(\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)]\right)^2 - D(Q\|P) \le O(\ln m),
\]
which implies
\[
\left(\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)]\right)^2 \le \frac{O(\ln m) + D(Q\|P)}{2(m-1)}.
\]
Taking the square root of both sides of the above inequality, we get
\[
\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)] \le \sqrt{\frac{O(\ln m) + D(Q\|P)}{2(m-1)}}.
\]
This completes our proof sketch.
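To get a feel for how this final bound behaves, the short sketch below evaluates it for a few sample sizes, treating the unspecified $O(\ln m)$ term as exactly $\ln m$ (an assumed constant of $1$) and using an arbitrary illustrative value for $D(Q\|P)$:

```python
import numpy as np

def gap_bound(m, kl):
    """Bound on E_{h~Q}[Δ(h)] from the sketch, with the O(ln m) constant taken as 1."""
    return np.sqrt((np.log(m) + kl) / (2 * (m - 1)))

for m in (100, 1_000, 10_000):
    print(m, gap_bound(m, kl=5.0))  # kl=5.0 is an arbitrary example value
```

As expected, the bound decays roughly like $1/\sqrt{m}$ once $m$ dominates the KL term.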