By applying $-\log$,
\[
\min_{n\in[N]} \langle w_{t+1}, x_n\rangle \;\ge\; \sum_{u\le t}\frac{\eta\gamma_u^2}{L(w_u)} \;-\; \sum_{u\le t}\frac{\eta^2 B^2\gamma_u^2}{2} \;-\; \log L(w_0). \tag{6.13}
\]
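To see where (6.13) comes from without the earlier steps of the argument, one convenient reading (a reconstruction consistent with the statement; the exact per-step inequality is derived before this excerpt) is a per-step multiplicative decrease of the loss,
\[
L(w_{u+1}) \;\le\; L(w_u)\,\exp\!\Big(-\frac{\eta\gamma_u^2}{L(w_u)} + \frac{\eta^2 B^2\gamma_u^2}{2}\Big),
\]
which telescopes to $L(w_{t+1}) \le L(w_0)\exp\big(-\sum_{u\le t}\eta\gamma_u^2/L(w_u) + \sum_{u\le t}\eta^2 B^2\gamma_u^2/2\big)$. Combining this with $\exp(-\min_{n\in[N]}\langle w_{t+1}, x_n\rangle) \le L(w_{t+1})$ and applying $-\log$ gives (6.13).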
Step 2: Upper bound $\|w_{t+1}\|$. Using $\|\Delta w_u\| = \|\nabla L(w_u)\|_\star = \gamma_u$ and the triangle inequality applied to the telescoped updates $w_{t+1} = w_0 + \sum_{u\le t}\eta\,\Delta w_u$, we have
\[
\|w_{t+1}\| \;\le\; \|w_0\| + \sum_{u\le t}\eta\,\|\Delta w_u\| \;\le\; \|w_0\| + \sum_{u\le t}\eta\gamma_u. \tag{6.14}
\]
To complete the proof, we simply combine Equations (6.13) and
(6.14) to lower bound the normalized margin.
\[
\frac{\min_{n\in[N]}\langle w_{t+1}, x_n\rangle}{\|w_{t+1}\|}
\;\ge\;
\frac{\sum_{u\le t}\eta\gamma_u^2/L(w_u)}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;-\;
\frac{\sum_{u\le t}\eta^2 B^2\gamma_u^2/2 + \log L(w_0)}{\|w_{t+1}\|}
\;=:\; (\mathrm{I}) + (\mathrm{II}). \tag{6.15}
\]
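Spelling out how (6.14) enters (an expansion of the step above, not an additional claim): divide (6.13) by $\|w_{t+1}\|$ and note that the first term is nonnegative, so replacing $\|w_{t+1}\|$ in its denominator by the upper bound (6.14) can only decrease it,
\[
\frac{\min_{n\in[N]}\langle w_{t+1}, x_n\rangle}{\|w_{t+1}\|}
\;\ge\;
\frac{\sum_{u\le t}\eta\gamma_u^2/L(w_u)}{\|w_{t+1}\|}
\;-\;
\frac{\sum_{u\le t}\eta^2 B^2\gamma_u^2/2 + \log L(w_0)}{\|w_{t+1}\|}
\;\ge\;
\frac{\sum_{u\le t}\eta\gamma_u^2/L(w_u)}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;-\;
\frac{\sum_{u\le t}\eta^2 B^2\gamma_u^2/2 + \log L(w_0)}{\|w_{t+1}\|}.
\]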
For term (I), from Proposition 6.3.3, we have $\gamma_u = \|\nabla L(w_u)\|_\star \ge \gamma L(w_u)$. Hence the numerator is lower bounded as $\sum_{u\le t}\eta\gamma_u^2/L(w_u) \ge \gamma\sum_{u\le t}\eta\gamma_u$. We have
\[
\frac{\sum_{u\le t}\eta\gamma_u^2/L(w_u)}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;\ge\;
\frac{\gamma\sum_{u\le t}\eta\gamma_u}{\sum_{u\le t}\eta\gamma_u + \|w_0\|}
\;\to\; \gamma, \tag{6.16}
\]
using $\sum_{u\le t}\eta\gamma_u \to \infty$ and $\|w_0\| < \infty$ from Proposition 6.3.4.
For term (II), $\log L(w_0) < \infty$ and $\sum_{u\le t}\eta^2 B^2\gamma_u^2/2 < \infty$ using Proposition 6.3.3. Thus $(\mathrm{II}) \to 0$.
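Implicit in this step is that $\|w_{t+1}\| \to \infty$, since the numerator of (II) stays bounded. A short justification (assuming, as in the smoothness argument, that $B$ bounds $\|x_n\|_\star$): by (6.13) and Proposition 6.3.3,
\[
B\,\|w_{t+1}\| \;\ge\; \min_{n\in[N]}\langle w_{t+1}, x_n\rangle \;\ge\; \gamma\sum_{u\le t}\eta\gamma_u \;-\; \sum_{u\le t}\frac{\eta^2 B^2\gamma_u^2}{2} \;-\; \log L(w_0) \;\to\; \infty,
\]
because $\sum_{u\le t}\eta\gamma_u \to \infty$ while the subtracted terms remain finite.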
Using the above in Equation (6.15), we obtain
\[
\lim_{t\to\infty} \min_i \frac{w_{t+1}^\top x_i}{\|w_{t+1}\|} \;\ge\; \gamma := \max_{\|w\|\le 1}\min_i \frac{w^\top x_i}{\|w\|}.
\]
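Before moving on, here is a small numerical illustration of this result (not from the text): plain gradient descent, i.e., steepest descent for the $\ell_2$ norm, on the exponential loss over a toy separable dataset, with the labels absorbed into the $x_n$ as in this section. The dataset, step size, and iteration counts are illustrative choices; the normalized margin should climb toward the $\ell_2$ max margin $\gamma$, which the script estimates by scanning unit directions.

    import numpy as np

    # Toy separable data with labels absorbed, so separability means <w, x_n> > 0
    # for some w. These points are an arbitrary illustrative choice.
    X = np.array([[1.0, 3.0],
                  [2.0, -1.0],
                  [3.0, 1.0]])

    # Estimate gamma = max_{||w|| <= 1} min_n <w, x_n> by scanning unit
    # directions (adequate in two dimensions for an illustration).
    thetas = np.linspace(0.0, 2.0 * np.pi, 100_000, endpoint=False)
    dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
    gamma_hat = (dirs @ X.T).min(axis=1).max()

    eta = 0.1            # constant step size, small enough to be stable here
    w = np.zeros(2)      # w_0 = 0

    for t in range(1, 200_001):
        losses = np.exp(-X @ w)                       # exp(-<w, x_n>) per example
        grad = -(losses[:, None] * X).sum(axis=0)     # gradient of L(w)
        w = w - eta * grad
        if t in (10, 100, 1_000, 10_000, 100_000, 200_000):
            margin = (X @ w).min() / np.linalg.norm(w)
            print(f"t={t:>7d}  normalized margin={margin:.4f}  gamma_hat={gamma_hat:.4f}")

Convergence of the direction is logarithmically slow, so even after $2\times 10^5$ steps the printed margin is typically close to, but not equal to, the estimated $\gamma$.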
6.4 Homogeneous Models with Exponential Tailed Loss
≪Suriya notes: Jason: I think we should give Kaifeng's proof here. It's more general and concurrent work.≫ In this section, we consider the asymptotic behavior of
gradient descent when the prediction function is homogeneous in the
parameters. Consider the loss
\[
L(w) = \sum_{i=1}^{n} \exp(-y_i f_i(w)), \tag{6.17}
\]
where $f_i(cw) = c^\alpha f_i(w)$, i.e., $f_i$ is $\alpha$-homogeneous. Typically, $f_i(w)$ is the output of the prediction function, such as a deep network, on the $i$-th example. Similar