descent converges in the direction of the hard margin support vector
machine solution (Theorem 6.3.2), even though the norm or margin
is not explicitly specified in the optimization problem. In fact, such
analyses showing an implicit inductive bias from the optimization algorithm
leading to generalization are not new. In the context of boosting algorithms,
[?] and [?] established connections of the gradient boosting
algorithm (coordinate descent) to $\ell_1$-norm minimization and $\ell_1$-margin
maximization, respectively. Such minimum norm or maximum margin solutions
are of course very special among all solutions or separators that fit the
training data, and in particular can ensure generalization [? ?].
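As a concrete, purely illustrative preview of this phenomenon, the following is a minimal numerical sketch (not from the text): plain gradient descent on an unregularized logistic loss over synthetic linearly separable data. The data, step size, and iteration counts are assumptions for this demo, and the logistic loss is assumed here as the standard setting for such results. Empirically, the normalized iterate $w_t/\|w_t\|$ stabilizes and the normalized margin keeps improving, consistent with convergence toward the hard margin direction.

```python
import numpy as np

# Sketch: gradient descent on the unregularized logistic loss over separable
# data. We only check that the normalized iterate stabilizes and that the
# normalized margin improves with training time.
rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(n, d))
w_star = np.array([1.0, -1.0])            # assumed ground-truth separator
y = np.sign(X @ w_star)                   # separable labels in {-1, +1}

w = np.zeros(d)
lr = 0.1
for t in range(1, 100001):
    margins = y * (X @ w)
    # gradient of (1/n) * sum_i log(1 + exp(-margins_i))
    grad = -(X.T @ (y / (1.0 + np.exp(np.clip(margins, None, 50.0))))) / n
    w -= lr * grad
    if t in (100, 10000, 100000):
        w_norm = np.linalg.norm(w)
        print(t, w / w_norm, (y * (X @ w)).min() / w_norm)
```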
In this chapter, we largely present results on the algorithmic regularization
of vanilla gradient descent when minimizing the unregularized
training loss in regression and classification problems over various
simple and complex model classes. We also briefly discuss more general
algorithmic families like steepest descent and mirror descent.
6.1 Linear models in regression: squared loss
We first demonstrate algorithmic regularization in a simple linear
regression setting where the prediction function is a linear function
of the inputs, $f_w(x) = w^\top x$, and we have the following
empirical risk minimization objective:
$$L(w) = \sum_{i=1}^{n} \left( w^\top x^{(i)} - y^{(i)} \right)^2. \tag{6.1}$$
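For reference, here is a minimal sketch of the objective in eq. (6.1) and its gradient in code; the data matrix X (with rows $x^{(i)}$) and the label vector y are assumptions for illustration only.

```python
import numpy as np

def loss(w, X, y):
    """Squared loss of eq. (6.1); X has rows x^(i), y holds the labels y^(i)."""
    r = X @ w - y                      # residuals w^T x^(i) - y^(i)
    return np.sum(r ** 2)

def grad(w, X, y):
    """Gradient of eq. (6.1): 2 * sum_i (w^T x^(i) - y^(i)) x^(i)."""
    return 2.0 * X.T @ (X @ w - y)

# One plain gradient descent step with step size eta:
#   w_next = w - eta * grad(w, X, y)
```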
Such simple models are natural starting points for building analytical
tools that extend to complex models, and the results provide intuitions
for understanding and improving upon empirical practices
in neural networks. Although the results in this section are stated
for the squared loss, the results and proof techniques extend to any
smooth loss with a unique finite root, i.e., any loss $\ell(\hat{y}, y)$ between a prediction $\hat{y}$
and a label $y$ that is minimized at a unique and finite value of $\hat{y}$ [?].
We are particularly interested in the case where $n < d$ and the observations
are realizable, i.e., $\min_w L(w) = 0$. Under these conditions,
the optimization problem in eq. (6.1) is underdetermined and has
multiple global minima, denoted by $G = \{w : \forall i,\; w^\top x^{(i)} = y^{(i)}\}$. In
this and all the following problems we consider, the goal is to answer:
Which specific global minima do different optimization algorithms reach
when minimizing $L(w)$?
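The following sketch previews the answer for gradient descent on eq. (6.1) initialized at zero in the underdetermined case $n < d$. The synthetic Gaussian data, step size, and iteration budget are assumptions for the demo; empirically, and consistent with classical facts about gradient descent on least squares, the iterates converge to the interpolating solution of minimum $\ell_2$ norm, i.e., the pseudoinverse solution.

```python
import numpy as np

# Sketch: underdetermined least squares (n < d), so eq. (6.1) has many global
# minima. Gradient descent from w_0 = 0 reaches one of them; numerically it
# matches the minimum l2-norm interpolant given by the pseudoinverse.
rng = np.random.default_rng(1)
n, d = 20, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                       # realizable since rank(X) = n < d

w = np.zeros(d)
eta = 0.5 / np.linalg.norm(X, 2) ** 2        # conservative step size
for _ in range(50000):
    w -= eta * 2.0 * X.T @ (X @ w - y)       # gradient step on eq. (6.1)

w_min_norm = np.linalg.pinv(X) @ y           # minimum-norm global minimum
print(np.max(np.abs(X @ w - y)))             # ~0: w interpolates the data
print(np.linalg.norm(w - w_min_norm))        # ~0: same solution as pinv
```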
The following proposition is the simplest illustration of the algorithmic
regularization phenomenon.
Proposition 6.1.1. Consider gradient descent updates $w_t$ for the loss in
eq. (6.1) starting with initialization $w_0$. For any step size schedule that