direction to the maximum margin separator with unit $\ell_2$ norm, i.e., the hard margin support vector machine classifier.
This characterization of the implicit bias is independent of both the step size and the initialization. We already see a fundamental difference from the implicit bias of gradient descent for losses with a unique finite root (Section ??), where the characterization depended on the initialization. The above result is rigorously proved as part of a more general result in Theorem 6.3.2. Below is a simpler statement with a heuristic proof sketch intended to convey the intuition for such results.
Theorem 6.3.1. For almost every linearly separable dataset, consider gradient descent updates with any initialization $w_0$ and any step size that minimizes the exponential loss in eq. (6.6), i.e., $L(w_t) \to 0$. The gradient descent iterates then converge in direction to the $\ell_2$ max-margin vector, i.e., $\lim_{t\to\infty} \frac{w_t}{\|w_t\|_2} = \frac{\hat{w}}{\|\hat{w}\|_2}$, where

$$\hat{w} = \operatorname*{argmin}_{w} \|w\|_2 \quad \text{s.t.} \quad \forall i,\; w^\top x^{(i)} y^{(i)} \geq 1. \tag{6.7}$$
Without loss of generality, assume that $\forall i,\, y^{(i)} = 1$, as for linear models the sign can be absorbed into $x^{(i)}$.
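To make the theorem concrete, here is a minimal numerical sketch (not from the original text; the toy dataset, initialization, step size, and iteration count are illustrative choices). It runs gradient descent on the exponential loss for a two-sample dataset whose $\ell_2$ max-margin vector is $(1, 0)$ by inspection, and tracks how the normalized iterate aligns with that direction while the norm of the iterate diverges.

```python
import numpy as np

# Toy linearly separable data with the labels already absorbed, i.e. each row
# is z_i = y_i x_i. The constraints w^T z_i >= 1 read w1 + w2 >= 1 and
# w1 - w2 >= 1, so the l2 max-margin vector of eq. (6.7) is w_hat = (1, 0).
Z = np.array([[1.0, 1.0],
              [1.0, -1.0]])
w_hat_dir = np.array([1.0, 0.0])      # known max-margin direction

w = np.array([0.3, 0.8])              # arbitrary initialization w_0
lr = 0.1                              # arbitrary step size

for t in range(1, 200001):
    # L(w) = sum_i exp(-w^T z_i); its negative gradient is
    # sum_i exp(-w^T z_i) z_i, cf. eq. (6.8).
    coeffs = np.exp(-Z @ w)
    w += lr * (coeffs[:, None] * Z).sum(axis=0)
    if t % 50000 == 0:
        direction = w / np.linalg.norm(w)
        print(t, round(np.linalg.norm(w), 2), direction, direction @ w_hat_dir)

# ||w_t|| diverges (roughly like log t), while the direction w_t / ||w_t||
# approaches (1, 0), independently of the initialization and step size.
```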
Proof Sketch We first understand intuitively why an exponential tail of the loss entails asymptotic convergence to the max-margin vector: consider the asymptotic regime of gradient descent when the exponential loss is minimized. As we argued earlier, this requires that $\forall i: w_t^\top x^{(i)} \to \infty$. Suppose $w_t / \|w_t\|_2$ converges to some limit $w_\infty$, so we can write $w_t = g(t)\, w_\infty + \rho(t)$ such that $g(t) \to \infty$, $\forall i,\, w_\infty^\top x^{(i)} > 0$, and $\lim_{t\to\infty} \rho(t)/g(t) = 0$. The gradients at $w_t$ are given by:
$$-\nabla L(w_t) = \sum_{i=1}^{n} \exp\left(-w_t^\top x^{(i)}\right) x^{(i)} = \sum_{i=1}^{n} \exp\left(-g(t)\, w_\infty^\top x^{(i)}\right) \exp\left(-\rho(t)^\top x^{(i)}\right) x^{(i)}. \tag{6.8}$$
As $g(t) \to \infty$ and the exponents become more negative, only those samples with the largest (i.e., least negative) exponents will contribute to the gradient. These are precisely the samples with the smallest margin $\operatorname*{argmin}_i w_\infty^\top x^{(i)}$, aka the “support vectors”. The accumulated negative gradient, and hence $w_t$, would then asymptotically be dominated by a non-negative linear combination of support vectors. These are precisely the KKT conditions for the SVM problem in eq. (6.7).
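To spell out that connection, note that eq. (6.7) with the labels absorbed (so the constraints read $w^\top x^{(i)} \geq 1$) is equivalent to $\min_w \frac{1}{2}\|w\|_2^2$ subject to the same constraints, and its KKT conditions are

$$\hat{w} = \sum_{i=1}^{n} \alpha_i x^{(i)}, \qquad \alpha_i \geq 0, \qquad \alpha_i\left(\hat{w}^\top x^{(i)} - 1\right) = 0 \quad \forall i,$$

i.e., $\hat{w}$ is a non-negative combination of exactly those samples that attain the minimum margin, matching the asymptotic form of $w_t$ described above.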
Making these intuitions rigorous constitutes the bulk of the proof in [?], which uses a proof technique very different from that in the following section (Section 6.3.2).
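The dominance of the support vectors in eq. (6.8) can also be checked numerically. The following sketch (again illustrative, with an assumed toy dataset that adds two large-margin samples to the previous one) prints the per-sample coefficients $\exp(-w_t^\top x^{(i)})$ after many gradient descent steps: essentially all of the gradient mass sits on the two minimum-margin samples, and the iterate's direction approaches the max-margin vector.

```python
import numpy as np

# Labels absorbed as before (each row is z_i = y_i x_i). Under the max-margin
# vector w_hat = (1, 0), the first two samples have margin 1 while the last
# two have margins 3 and 2, so only the first two are support vectors.
Z = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [3.0, 0.0],
              [2.0, 2.0]])

w = np.zeros(2)
lr = 0.1
for _ in range(200000):
    coeffs = np.exp(-Z @ w)           # per-sample coefficients from eq. (6.8)
    w += lr * (coeffs[:, None] * Z).sum(axis=0)

coeffs = np.exp(-Z @ w)
print("relative gradient weights:", coeffs / coeffs.sum())
print("direction:", w / np.linalg.norm(w))
# The weight mass concentrates on the two minimum-margin samples, so the
# accumulated negative gradient (and hence w) is asymptotically a
# non-negative combination of the support vectors, pointing towards (1, 0).
```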