Basics of generalization theory
Now consider the expression (derived by working backwards from the
statement of the claim):
\[
2(m-1)\left(\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)]\right)^2 - D(Q\|P) \;\le\; 2(m-1)\mathop{\mathbb{E}}_{h\sim Q}\left[\Delta(h)^2\right] - D(Q\|P),
\]
where the inequality is by convexity of the square function. The right-hand side in turn satisfies
\[
\begin{aligned}
2(m-1)\mathop{\mathbb{E}}_{h\sim Q}\left[\Delta(h)^2\right] - D(Q\|P)
&= \mathop{\mathbb{E}}_{h\sim Q}\left[2(m-1)\Delta(h)^2 - \ln\frac{Q(h)}{P(h)}\right]\\
&= \mathop{\mathbb{E}}_{h\sim Q}\left[\ln\left(e^{2(m-1)\Delta(h)^2}\,\frac{P(h)}{Q(h)}\right)\right]\\
&\le \ln\mathop{\mathbb{E}}_{h\sim Q}\left[e^{2(m-1)\Delta(h)^2}\,\frac{P(h)}{Q(h)}\right],
\end{aligned}
\]
where the last inequality uses Jensen's inequality (for a concave function $f$ and a random variable $X$, $\mathbb{E}[f(X)] \le f(\mathbb{E}[X])$) together with the concavity of $\ln$. Also, since taking the expectation over $h \sim Q$ is effectively summing with a weighting by $Q(h)$, we have
\[
\ln\mathop{\mathbb{E}}_{h\sim Q}\left[e^{2(m-1)\Delta(h)^2}\,\frac{P(h)}{Q(h)}\right] = \ln\mathop{\mathbb{E}}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right].
\]
(Often when you see KL divergence in machine learning, you will see this trick used to switch the distribution over which the expectation is taken!)
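To see this distribution-switching trick concretely, here is a minimal numerical sketch (the toy hypothesis-class size, the distributions $P$ and $Q$, and the function `f` standing in for $e^{2(m-1)\Delta(h)^2}$ are all made-up illustrative choices):

```python
import numpy as np

# Check: E_{h~Q}[f(h) * P(h)/Q(h)] = E_{h~P}[f(h)] on a toy finite class.
rng = np.random.default_rng(0)

n = 5                               # size of a toy finite hypothesis class
P = rng.random(n); P /= P.sum()     # "prior" belief, fixed before seeing data
Q = rng.random(n); Q /= Q.sum()     # "posterior" over hypotheses
f = rng.random(n)                   # any function of h

lhs = np.sum(Q * f * (P / Q))       # expectation under Q with importance weights
rhs = np.sum(P * f)                 # expectation directly under P
assert np.isclose(lhs, rhs)         # identical up to floating-point error
```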
Recapping, we thus have that
\[
2(m-1)\left(\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)]\right)^2 - D(Q\|P) \le \ln\left(\mathop{\mathbb{E}}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right]\right). \tag{4.5}
\]
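Since (4.5) holds for every pair of distributions $P$, $Q$ and every gap function $\Delta$, it can be spot-checked numerically. The sketch below draws random instances on a toy finite class (the class size, $m$, and the range of the gaps are arbitrary illustrative choices) and verifies that the left side never exceeds the right:

```python
import numpy as np

# Spot-check inequality (4.5) on random toy instances.
rng = np.random.default_rng(1)
m, n = 50, 8                                   # sample size and class size (arbitrary)

for _ in range(1000):
    P = rng.random(n); P /= P.sum()
    Q = rng.random(n); Q /= Q.sum()
    delta = rng.uniform(-0.1, 0.1, n)          # stand-in values for the gaps Δ(h)
    kl = np.sum(Q * np.log(Q / P))             # D(Q||P)
    lhs = 2*(m-1) * np.sum(Q * delta)**2 - kl  # left side of (4.5)
    rhs = np.log(np.sum(P * np.exp(2*(m-1) * delta**2)))  # right side of (4.5)
    assert lhs <= rhs + 1e-12
```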
Now using the fact that the belief $P$ was fixed before seeing $S$ (i.e., is independent of $S$), we may swap the order of the two expectations:
\[
\mathop{\mathbb{E}}_{S}\left[\mathop{\mathbb{E}}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right]\right] = \mathop{\mathbb{E}}_{h\sim P}\left[\mathop{\mathbb{E}}_{S}\left[e^{2(m-1)\Delta(h)^2}\right]\right] \le m.
\]
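As a sanity check of the inner bound, the Monte Carlo sketch below estimates $\mathbb{E}_S\left[e^{2(m-1)\Delta(h)^2}\right]$ for a single fixed hypothesis, assuming $\Delta(h)$ is the gap between the true error and the empirical error on the $m$ samples (the true-error value $0.5$, the sample size, and the trial count are arbitrary choices):

```python
import numpy as np

# Estimate E_S[exp(2(m-1)Δ(h)^2)] for one fixed h by simulating draws of S.
rng = np.random.default_rng(2)
m, p, trials = 100, 0.5, 200_000            # sample size, true error, Monte Carlo trials

emp = rng.binomial(m, p, size=trials) / m   # empirical error on each simulated S
delta = p - emp                             # gap Δ(h) for each S
est = np.mean(np.exp(2*(m-1) * delta**2))   # ≈ E_S[exp(2(m-1)Δ(h)^2)]
print(est, "<=", m)                         # noisy estimate, but well below m
```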
Thus, (1) implies that with high probability over $S$,
\[
\mathop{\mathbb{E}}_{h\sim P}\left[e^{2(m-1)\Delta(h)^2}\right] = O(m). \tag{4.6}
\]
Combining the above, we get
\[
2(m-1)\left(\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)]\right)^2 - D(Q\|P) \le O(\ln m),
\]
which implies
\[
\left(\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)]\right)^2 \le \frac{O(\ln m) + D(Q\|P)}{2(m-1)}.
\]
Taking the square root of both sides of the above inequality, we get
\[
\mathop{\mathbb{E}}_{h\sim Q}[\Delta(h)] \le \sqrt{\frac{O(\ln m) + D(Q\|P)}{2(m-1)}}.
\]
This completes our proof sketch.
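To get a feel for how this final bound behaves, the short sketch below evaluates it for a few sample sizes, treating the unspecified $O(\ln m)$ term as exactly $\ln m$ (an assumed constant of $1$) and using an arbitrary illustrative value for $D(Q\|P)$:

```python
import numpy as np

def gap_bound(m, kl):
    """Bound on E_{h~Q}[Δ(h)] from the sketch, with the O(ln m) constant taken as 1."""
    return np.sqrt((np.log(m) + kl) / (2 * (m - 1)))

for m in (100, 1_000, 10_000):
    print(m, gap_bound(m, kl=5.0))  # kl=5.0 is an arbitrary example value
```

As expected, the bound decays roughly like $1/\sqrt{m}$ once $m$ dominates the KL term.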