probability at least $1 - \delta$,
$$\Delta_S(h) \;\le\; 2\sqrt{\frac{\log N + \log\frac{1}{\delta}}{m}}.$$
Proof. For any fixed hypothesis $g$, imagine drawing a training sample of size $m$. Then $\widehat{L}_S(g)$ is an average of i.i.d. variables and its expectation is $L_{\mathcal{D}}(g)$. Concentration bounds imply that $L_{\mathcal{D}}(g) - \widehat{L}_S(g)$ has a concentration property at least as strong as a univariate Gaussian $\mathcal{N}(0, 1/m)$. The previous statement is true for every hypothesis $g$ in the class, so the union bound implies that the probability is at most $N \exp(-\epsilon^2 m/4)$ that this quantity exceeds $\epsilon$ for some hypothesis in the class. Since $h$ is the solution to ERM, we conclude that whenever $N \exp(-\epsilon^2 m/4) \le \delta$, with probability at least $1 - \delta$ we have $\Delta_S(h) \le \epsilon$. Simplifying and eliminating $\epsilon$, we obtain the theorem.
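The last step, eliminating $\epsilon$, is the following routine rearrangement, spelled out here for completeness:
$$N \exp(-\epsilon^2 m/4) \le \delta \;\iff\; \epsilon^2 \ge \frac{4\left(\log N + \log\frac{1}{\delta}\right)}{m} \;\iff\; \epsilon \ge 2\sqrt{\frac{\log N + \log\frac{1}{\delta}}{m}},$$
so taking $\epsilon$ equal to the right-hand side yields exactly the bound in the theorem.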
Of course, the union bound doesn't apply to deep nets per se, because the set of hypotheses, even after we have fixed the architecture, consists of all vectors in $\mathbb{R}^k$, where $k$ is the number of real-valued parameters. This is an uncountable set! However, we show it is possible to reason about the set of all nets as a finite set after suitable discretization. Suppose we assume that the $\ell_2$ norm of the parameter vectors is at most $1$, meaning the set of all deep nets has been identified with $\mathrm{Ball}(0, 1)$. (Here $\mathrm{Ball}(w, r)$ refers to the set of all points in $\mathbb{R}^k$ within distance $r$ of $w$.) We assume there is a $\rho > 0$ such that if $w_1, w_2 \in \mathbb{R}^k$ satisfy $\|w_1 - w_2\|_2 \le \rho$, then the nets with these two parameter vectors have essentially the same loss on every input, meaning the losses differ by at most $\gamma$ for some $\gamma > 0$. (Another way to phrase this assumption, in a somewhat stronger form, is that the loss on any datapoint is a Lipschitz function of the parameter vector, with Lipschitz constant at most $\gamma/\rho$.) It makes intuitive sense that such a $\rho$ must exist for every $\gamma > 0$, since as we let $\rho \to 0$ the two nets become equal.
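Under the Lipschitz phrasing, the closeness of losses is immediate; writing $\ell(w, x)$ for the loss of the net with parameters $w$ on datapoint $x$ (notation introduced here for illustration, not in the text above):
$$\|w_1 - w_2\|_2 \le \rho \;\implies\; |\ell(w_1, x) - \ell(w_2, x)| \le \frac{\gamma}{\rho}\,\|w_1 - w_2\|_2 \le \gamma.$$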
Definition 4.1.2 ($\rho$-cover). A set of points $w_1, w_2, \ldots \in \mathbb{R}^k$ is a $\rho$-cover in $\mathbb{R}^k$ if for every $w \in \mathrm{Ball}(0, 1)$ there is some $w_i$ such that $w \in \mathrm{Ball}(w_i, \rho)$.
Lemma 4.1.3 (Existence of $\rho$-cover). There exists a $\rho$-cover of size at most $(2/\rho)^k$.
Proof. The proof is simple but ingenious. Let us pick $w_1$ arbitrarily in $\mathrm{Ball}(0, 1)$. For $i = 1, 2, 3, \ldots$ do the following: arbitrarily pick any point in $\mathrm{Ball}(0, 1)$ outside $\cup_{j \le i} \mathrm{Ball}(w_j, \rho)$ and designate it as $w_{i+1}$.
A priori it is unclear if this process will ever terminate. We now show it does after at most $(2/\rho)^k$ steps. To see this, it suffices to note that $\mathrm{Ball}(w_i, \rho/2) \cap \mathrm{Ball}(w_j, \rho/2) = \emptyset$ for all $i < j$. (Because if not, then $w_j \in \mathrm{Ball}(w_i, \rho)$, which means that $w_j$ could not have been picked during the above process.) Thus we conclude that the process must have stopped after at most
$$\frac{\mathrm{volume}(\mathrm{Ball}(0, 1))}{\mathrm{volume}(\mathrm{Ball}(0, \rho/2))} = \left(\frac{2}{\rho}\right)^k$$
steps.
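The process in the proof is effectively a greedy packing algorithm, and it is easy to simulate. Below is a minimal numerical sketch; the function name `greedy_cover` and the finite candidate-sampling approximation of "pick any point outside the union" are choices made here for illustration, not part of the text:

```python
import numpy as np

def greedy_cover(k, rho, n_candidates=20000, seed=0):
    """Greedy construction from the proof: repeatedly pick a point of
    Ball(0, 1) lying outside the union of the current Ball(w_j, rho),
    approximated here by scanning a finite random sample of candidates."""
    rng = np.random.default_rng(seed)
    # Sample candidates uniformly from Ball(0, 1): uniform direction
    # (normalized Gaussian), radius distributed as U^(1/k).
    pts = rng.normal(size=(n_candidates, k))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    pts *= rng.uniform(size=(n_candidates, 1)) ** (1.0 / k)

    centers = []
    for p in pts:
        # Keep p as a new center only if it avoids every Ball(w_j, rho);
        # kept centers are then pairwise more than rho apart, so the
        # balls of radius rho/2 around them are disjoint, as in the proof.
        if all(np.linalg.norm(p - c) > rho for c in centers):
            centers.append(p)
    return centers

centers = greedy_cover(k=3, rho=0.5)
print(len(centers), "centers; lemma's bound (2/rho)^k =", (2 / 0.5) ** 3)
```

On termination every sampled point lies within distance $\rho$ of some kept center, and the printed count typically lands well below the lemma's $(2/\rho)^k$ bound (here $64$ for $k = 3$, $\rho = 1/2$).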