arg min_{w∈G} ‖w − w_0‖.
[Figure 6.1 plot: steepest descent trajectories from w_init = [0, 0, 0] for step sizes η = 0.01, 0.1, 0.25, 0.3, together with the minimum-norm solution w^*_{‖·‖} and the infinitesimal-step-size limit w^{η→0}_∞.]
Figure 6.1: Steepest descent w.r.t. ‖·‖_{4/3}: the global minimum to which steepest descent converges depends on η. Here w_0 = [0, 0, 0], w^*_{‖·‖} = arg min_{w∈G} ‖w‖_{4/3} denotes the minimum-norm global minimum, and w^{η→0}_∞ denotes the solution of infinitesimal steepest descent with η → 0. Note that even as η → 0, the expected characterization does not hold, i.e., w^{η→0}_∞ ≠ w^*_{‖·‖}.
In summary, for the squared loss, we characterized the implicit bias of the generic mirror descent algorithm in terms of the potential function and the initialization. However, even in simple linear regression, for steepest descent with general norms, we were unable to obtain a useful characterization. In contrast, in Section 6.3.2, we study logistic-like, strictly monotonic losses used in classification, where we can obtain a characterization for steepest descent.
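As a concrete illustration of this failure, the following is a minimal numpy sketch (not the book's code) of steepest descent with respect to the ‖·‖_{4/3} norm on a small underdetermined least-squares problem, in the spirit of Figure 6.1. The data matrix, step sizes, and iteration counts are arbitrary choices made for illustration; the point is only that the interpolating solution reached can vary with η and need not coincide with the minimum-‖·‖_{4/3}-norm global minimum.

import numpy as np
from scipy.optimize import minimize

# Underdetermined problem: 2 equations, 3 unknowns, so the set G of
# global minimizers (interpolating solutions) is a line in R^3.
X = np.array([[0.5, 1.0, 0.5],
              [1.0, 0.5, 1.5]])
y = np.array([1.0, 2.0])

def grad(w):
    # Gradient of the squared loss 0.5 * ||Xw - y||^2
    return X.T @ (X @ w - y)

def steepest_descent(eta, p=4/3, steps=100000):
    # Steepest descent w.r.t. ||.||_p: each step minimizes <g, v> + 0.5*||v||_p^2,
    # whose solution is v_i = -sign(g_i)|g_i|^(q-1) * ||g||_q^(2-q), with 1/p + 1/q = 1.
    q = p / (p - 1.0)                 # dual exponent; q = 4 for p = 4/3
    w = np.zeros(X.shape[1])          # w_0 = [0, 0, 0], as in Figure 6.1
    for _ in range(steps):
        g = grad(w)
        gq = np.sum(np.abs(g) ** q) ** (1.0 / q)
        if gq < 1e-12:                # numerically at a global minimum
            break
        w = w - eta * np.sign(g) * np.abs(g) ** (q - 1.0) * gq ** (2.0 - q)
    return w

# Minimum-||.||_{4/3} interpolator, found with a generic constrained solver
# (minimizing sum_i |w_i|^{4/3} is equivalent to minimizing the norm).
w_min = minimize(lambda w: np.sum(np.abs(w) ** (4 / 3)), x0=np.ones(3),
                 constraints=[{"type": "eq", "fun": lambda w: X @ w - y}],
                 method="SLSQP").x

for eta in [0.3, 0.1, 0.01]:
    w_inf = steepest_descent(eta)
    print(f"eta = {eta:<4}: limit = {np.round(w_inf, 4)}, "
          f"residual = {np.linalg.norm(X @ w_inf - y):.1e}")
print("min-norm solution:", np.round(w_min, 4))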
6.1.2 Geometry induced by parameterization of model class
In many learning problems, the same model class can be parameterized in multiple ways. For example, the set of linear functions in R^d can be parameterized in a canonical way as w ∈ R^d with f_w(x) = w^⊤x, but also equivalently by u, v ∈ R^d with f_{u,v}(x) = (u ⊙ v)^⊤x or f_{u,v}(x) = (u² − v²)^⊤x (with the products and squares taken entry-wise). All such equivalent parameterizations lead to equivalent training objectives; however, in overparameterized models, using gradient descent on different parameterizations leads to different induced biases in the function space. For example, [?, ?] demonstrated this phenomenon in matrix factorization and linear convolutional networks, where these parameterizations were shown to introduce interesting and unusual biases towards minimizing the nuclear norm and the ℓ_p norm (for p = 2/depth) in the Fourier domain, respectively. In general, these results are suggestive of the role of architecture choice in different neural network models, and they show how, even while using the same gradient descent algorithm, different parameterizations can induce different geometries in the function space.
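To make this concrete, here is a minimal numpy sketch (not taken from the referenced papers) comparing gradient descent on two parameterizations of the same linear model class on an underdetermined regression problem: directly on w, and on (u, v) with w = u² − v² and a small initialization. The data, initialization scale, step size, and iteration counts are arbitrary choices for illustration. Both runs fit the training data, but the direct parameterization converges to the minimum-ℓ2-norm interpolator, while the factored parameterization in the small-initialization regime tends to reach a sparser solution with smaller ℓ1 norm.

import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                          # fewer examples than parameters
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:2] = [1.0, -0.5]              # planted sparse predictor
y = X @ w_true

def grad(w):
    # Gradient of the squared loss 0.5/n * ||Xw - y||^2 with respect to w
    return X.T @ (X @ w - y) / n

# (a) Gradient descent directly on w, initialized at zero.
w = np.zeros(d)
for _ in range(50000):
    w -= 0.01 * grad(w)

# (b) Gradient descent on (u, v) with w = u**2 - v**2, small initialization.
u = 1e-3 * np.ones(d)
v = 1e-3 * np.ones(d)
for _ in range(50000):
    g = grad(u**2 - v**2)             # chain rule: dL/du = 2u*g, dL/dv = -2v*g
    u, v = u - 0.01 * 2 * u * g, v + 0.01 * 2 * v * g
w_uv = u**2 - v**2

for name, sol in [("direct w", w), ("u^2 - v^2", w_uv)]:
    print(f"{name:>10}: train residual {np.linalg.norm(X @ sol - y):.1e}, "
          f"l1 norm {np.linalg.norm(sol, 1):.2f}, l2 norm {np.linalg.norm(sol):.2f}")

The two runs implement the same function class and reach essentially the same (zero) training loss, yet the learned coefficient vectors differ, which is exactly the sense in which the parameterization, rather than the objective, determines the induced bias.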