Theory of Deep Learning (2022)
approximation of the loss while constraining the step length in
Euclidean norm:

    w_{t+1} = arg min_w ⟨w, ∇L(w_t)⟩ + (1/2η) ‖w − w_t‖₂².    (6.2)
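Minimizing the quadratic objective in eq. (6.2) by setting its gradient with respect to w to zero makes the equivalence explicit, recovering the standard gradient descent step:

    ∇L(w_t) + (1/η)(w_{t+1} − w_t) = 0  ⟹  w_{t+1} = w_t − η ∇L(w_t).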
Motivated by the above connection, we can study other families of
algorithms that work under different and non-Euclidean geometries.
Two convenient families are mirror descent w.r.t. potential ψ [? ? ]
and steepest descent w.r.t. general norms [? ].
Mirror descent w.r.t. potential ψ  Mirror descent updates are defined
for any strongly convex and differentiable potential ψ as

    w_{t+1} = arg min_w η ⟨w, ∇L(w_t)⟩ + D_ψ(w, w_t)
    ⟹ ∇ψ(w_{t+1}) = ∇ψ(w_t) − η ∇L(w_t),    (6.3)
where D_ψ(w, w′) = ψ(w) − ψ(w′) − ⟨∇ψ(w′), w − w′⟩ is the Bregman
divergence [? ] w.r.t. ψ. This family captures updates where the
geometry is specified by the Bregman divergence D_ψ. Examples of
potentials ψ for mirror descent include: the squared ℓ₂ norm
ψ(w) = ½‖w‖₂², which leads to gradient descent; the entropy potential
ψ(w) = ∑_i w[i] log w[i] − w[i]; the spectral entropy for matrix-valued
w, where ψ(w) is the entropy potential on the singular values of w;
general quadratic potentials ψ(w) = ½‖w‖²_D = ½ w⊤Dw for any positive
definite matrix D; and the squared ℓ_p norms for p ∈ (1, 2].
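As a concrete illustration of the dual update in eq. (6.3), the entropy potential gives ∇ψ(w) = log w elementwise, so the primal update is multiplicative (the exponentiated-gradient rule). The following is a minimal sketch; the squared loss, toy data, step size, and function names are assumptions made for the example, not part of the text:

```python
import numpy as np

def mirror_descent_entropy(grad, w0, eta=0.02, steps=2000):
    """Mirror descent (6.3) with the entropy potential
    psi(w) = sum_i w[i] log w[i] - w[i], so that grad psi(w) = log w.
    The dual update log w_{t+1} = log w_t - eta * grad L(w_t)
    becomes a multiplicative step on the primal iterates."""
    w = w0.copy()
    for _ in range(steps):
        w = w * np.exp(-eta * grad(w))  # exponentiated-gradient step
    return w

# Toy problem (assumed for illustration): squared loss on one linear equation.
x, y = np.array([1.0, 2.0, 3.0]), 3.0
grad = lambda w: (w @ x - y) * x        # gradient of 1/2 (w.x - y)^2
w_inf = mirror_descent_entropy(grad, w0=np.ones(3))
print(w_inf, w_inf @ x)                 # iterates stay positive; w.x -> y
```

Note that the iterates remain entrywise positive for any positive w_0, since each step multiplies by a positive factor; the dual iterates log w_t move only along span({x}), in line with the discussion below eq. (6.3).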
From eq. (6.3), we see that rather than the w_t (called primal iterates),
it is the ∇ψ(w_t) (called dual iterates) that are constrained to the
low-dimensional data manifold ∇ψ(w_0) + span({x^{(i)}}). The arguments for
gradient descent can now be generalized to get the following result.
Theorem 6.1.2. For any realizable dataset {x^{(i)}, y^{(i)}}_{i=1}^n and any
strongly convex potential ψ, consider the mirror descent iterates w_t from
eq. (6.3) for minimizing the empirical loss L(w) in eq. (6.1). For all
initializations w_0, if the step-size schedule minimizes L(w), i.e.,
L(w_t) → 0, then the asymptotic solution of the algorithm is given by

    w_t → arg min_{w : ∀i, w⊤x^{(i)} = y^{(i)}} D_ψ(w, w_0).    (6.4)
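For the special case ψ(w) = ½‖w‖₂² (plain gradient descent), eq. (6.4) can be checked numerically: on an underdetermined linear regression problem, the gradient descent limit coincides with the Euclidean projection of w_0 onto the solution set. This is a minimal sketch; the Gaussian data, dimensions, step size, and iteration count are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))   # n=3 equations, d=8 unknowns: realizable
y = rng.standard_normal(3)
w0 = rng.standard_normal(8)

# Gradient descent on L(w) = 1/(2n) * ||X w - y||^2, started from w0.
w = w0.copy()
for _ in range(20000):
    w -= 0.05 * X.T @ (X @ w - y) / X.shape[0]

# Closed form for arg min_{w : X w = y} ||w - w0||_2:
# the Euclidean projection of w0 onto the affine solution set.
w_star = w0 + X.T @ np.linalg.solve(X @ X.T, y - X @ w0)

print(np.abs(X @ w - y).max(), np.abs(w - w_star).max())  # both near 0
```

The agreement reflects the mechanism in the theorem: the iterates never leave w_0 + span({x^{(i)}}), and within that affine set there is a unique interpolating solution, which is exactly the projection w_star.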
In particular, if we start at w_0 = arg min_w ψ(w) (so that ∇ψ(w_0) = 0),
then we get to arg min_{w∈G} ψ(w).¹

¹ The analysis of Theorem 6.1.2 and Proposition 6.1.1 also holds when
instancewise stochastic gradients are used in place of ∇L(w_t).

Steepest descent w.r.t. general norms  Gradient descent is also a special
case of steepest descent (SD) w.r.t. a generic norm ‖·‖ [? ] with