iterations, which is at most $(2/\rho)^k$ since ball volume in $\mathbb{R}^k$ scales as the $k$th power of the radius.
Finally, the sequence of $w_i$'s at the end must be a $\rho$-cover because the process stops only when no point can be found outside $\cup_j \text{Ball}(w_j, \rho)$.
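The greedy construction is short enough to run. The following Python sketch (the helper name `greedy_rho_cover` and the sample sizes are our own, for illustration) builds a cover of a finite set of unit vectors exactly as in the proof:

```python
import numpy as np

def greedy_rho_cover(points, rho):
    # Greedy process from the proof: repeatedly pick any point lying
    # outside the union of Ball(w_j, rho) over the centers chosen so far,
    # and make it a new center.
    centers = []
    for x in points:
        if not any(np.linalg.norm(x - w) <= rho for w in centers):
            centers.append(x)
    return centers

# Unit vectors in R^k, as in Theorem 4.1.4; the volume argument caps the
# number of centers at (2/rho)^k.
rng = np.random.default_rng(0)
k, rho = 3, 0.5
pts = rng.normal(size=(2000, k))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
print(len(greedy_rho_cover(pts, rho)), "centers; volume bound:", (2 / rho) ** k)
```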
Theorem 4.1.4 (Generalization bound for normed spaces).³ If (i) hypotheses are unit vectors in $\mathbb{R}^k$ and (ii) every two hypotheses $h_1, h_2$ with $\|h_1 - h_2\|_2 \le \rho$ differ in terms of loss on every datapoint by at most $\gamma$, then
$$\Delta_S(h) \le \gamma + 2\sqrt{\frac{k \log(2/\rho)}{m}}.$$
Proof. Apply the union bound over the at most $(2/\rho)^k$ hypotheses in the $\rho$-cover: by concentration, each of them has $\Delta_S(h) \le 2\sqrt{k \log(2/\rho)/m}$ with high probability. Every other net has loss within $\gamma$ of some net in the $\rho$-cover, by assumption (ii).
³ As you might imagine, this generalization bound via γ-covers is too loose, and gives very pessimistic estimates of what $m$ needs to be.
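To see just how pessimistic, one can plug illustrative numbers into the bound. The following Python sketch (the parameter values are made up for illustration, not from the text) solves for the $m$ that makes the concentration term small:

```python
import math

def bound(k, rho, gamma, m):
    # Right-hand side of Theorem 4.1.4.
    return gamma + 2 * math.sqrt(k * math.log(2 / rho) / m)

def samples_needed(k, rho, eps):
    # Smallest m for which the sqrt term is at most eps:
    # 2*sqrt(k*log(2/rho)/m) <= eps  <=>  m >= 4*k*log(2/rho)/eps**2.
    return math.ceil(4 * k * math.log(2 / rho) / eps ** 2)

# Hypothetical numbers: a net with a million parameters, rho = 0.1.
print(samples_needed(k=10 ** 6, rho=0.1, eps=0.1))  # about 1.2 billion samples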
4.2 Data-dependent complexity measures
A complexity measure for hypothesis classes is a way to quantify their
“complicatedness.” It is defined to let us prove an upper bound on
the number of training samples needed to keep down the generalization
error. Above we implicitly defined two complexity measures:
the size of the hypothesis class (assuming it is finite) and the size of
a γ-cover in it. Of course, the resulting bounds on sample complexity
were still loose.
Theorists then realized that the above simple bounds hold for every data distribution $D$. In practice, it seems clear that deep nets (or any learning method) work by exploiting properties of the input distribution; e.g., convolutional structure exploits the fact that all subpatches of images can be processed very similarly. Thus one should try to define a measure of complicatedness that depends on the data distribution.
4.2.1 Rademacher Complexity
Rademacher complexity is a complexity measure that depends on the data distribution. For simplicity we will assume the loss function takes values in $[0, 1]$.
The definition concerns the following thought experiment. Recall
that the distribution D is on labeled datapoints (x, y). For simplicity
we denote the labeled datapoint as z.
Now the Rademacher complexity⁴ of hypothesis class $H$ on a distribution $D$ is defined as follows, where $l(z, h)$ is the loss of hypothesis $h$ on labeled datapoint $z$, and $S_1, S_2$ are two samples of $m$ datapoints each drawn i.i.d. from $D$:
$$R_{m,D}(H) = \mathop{\mathbb{E}}_{S_1, S_2}\left[\frac{1}{2m} \sup_{h \in H} \left|\sum_{z \in S_1} l(z, h) - \sum_{z \in S_2} l(z, h)\right|\right]. \qquad (4.2)$$
⁴ Standard accounts of this often confuse students, or falsely impress them with a complicated proof of Thm 4.2.1. In the standard definition, loss terms are weighted by i.i.d. $\pm 1$ random variables. Its value is within $\pm O(1/\sqrt{m})$ of the one in our definition.
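The thought experiment in Eq. (4.2) is easy to simulate for a toy setup. The following Python sketch is ours, not the book's: it assumes a made-up distribution $D$ (Gaussian inputs labeled by one coordinate) and a class of unit-vector linear-threshold hypotheses with 0-1 loss, and it approximates the sup over $H$ by a max over finitely many sampled hypotheses, so it only lower-bounds the true value:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m, k=5):
    # Toy distribution D (an assumption for illustration): Gaussian inputs
    # labeled by the sign of their first coordinate.
    X = rng.normal(size=(m, k))
    return X, np.sign(X[:, 0])

def empirical_complexity(m, k=5, n_trials=200, n_hyp=500):
    # Monte Carlo estimate of Eq. (4.2): draw S1, S2 of size m, and replace
    # the sup over H by a max over a finite random set of unit-vector
    # linear-threshold hypotheses.
    H = rng.normal(size=(n_hyp, k))
    H /= np.linalg.norm(H, axis=1, keepdims=True)
    vals = []
    for _ in range(n_trials):
        X1, y1 = sample(m, k)
        X2, y2 = sample(m, k)
        L1 = (np.sign(X1 @ H.T) != y1[:, None]).sum(axis=0)  # summed 0-1 loss on S1, per h
        L2 = (np.sign(X2 @ H.T) != y2[:, None]).sum(axis=0)  # summed 0-1 loss on S2, per h
        vals.append(np.abs(L1 - L2).max() / (2 * m))
    return float(np.mean(vals))

print(empirical_complexity(m=100))  # shrinks roughly like 1/sqrt(m) as m grows
```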