decreasing as we run simple optimization algorithms such as gradient descent. Many such properties have been used in the previous literature, including the following.
Definition 7.2.1. Let $f(w)$ be an objective function with a unique global minimum $w^*$, then

Polyak-Lojasiewicz $f$ satisfies Polyak-Lojasiewicz if there exists a value $\mu > 0$ such that for every $w$, $\|\nabla f(w)\|_2^2 \geq \mu(f(w) - f(w^*))$.

weakly-quasi-convex $f$ is weakly-quasi-convex if there exists a value $\tau > 0$ such that for every $w$, $\langle \nabla f(w), w - w^* \rangle \geq \tau(f(w) - f(w^*))$.

Restricted Secant Inequality (RSI) $f$ satisfies RSI if there exists a value $\mu > 0$ such that for every $w$, $\langle \nabla f(w), w - w^* \rangle \geq \mu \|w - w^*\|_2^2$.
Together with some smoothness of $f$, any one of these three properties implies fast convergence.
Claim 7.2.2. If an objective function $f$ satisfies one of Polyak-Lojasiewicz, weakly-quasi-convex, or RSI, and $f$ is smooth, then gradient descent converges to the global minimum at a geometric rate. Here Polyak-Lojasiewicz and RSI require the standard smoothness definition as in Equation (2.1), while weakly-quasi-convex requires a special smoothness property detailed in [? ]. In the proofs, the potential function for Polyak-Lojasiewicz and weakly-quasi-convex is the function value $f$; the potential function for RSI is the squared distance $\|w - w^*\|_2^2$.
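To see where the geometric rate comes from in the Polyak-Lojasiewicz case (a standard one-step sketch, assuming the $L$-smoothness of Equation (2.1) and step size $1/L$), one step of gradient descent satisfies
$$f(w_{t+1}) \leq f(w_t) - \frac{1}{2L}\|\nabla f(w_t)\|_2^2 \leq f(w_t) - \frac{\mu}{2L}\big(f(w_t) - f(w^*)\big),$$
and subtracting $f(w^*)$ from both sides and unrolling over $t$ steps gives
$$f(w_t) - f(w^*) \leq \Big(1 - \frac{\mu}{2L}\Big)^t \big(f(w_0) - f(w^*)\big).$$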
Intuitively, the Polyak-Lojasiewicz condition requires the gradient to be nonzero at any point that is not a global minimum; therefore one can always follow the gradient and further decrease the function value. This condition can also work in some settings where the global minimum is not unique. Weakly-quasi-convex and RSI are similar in the sense that they both require the (negative) gradient to be correlated with the correct direction: the direction from the current point $w$ to the global minimum $w^*$.
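For concreteness, here is a minimal numerical sanity check (an illustrative sketch, not from the text; the least-squares instance and all constants are our own choices): for $f(w) = \frac{1}{2}\|Aw - b\|_2^2$ with full-column-rank $A$ and $b = Aw^*$, all three conditions hold with constants governed by the smallest singular value of $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares objective f(w) = 1/2 ||A w - b||^2 with unique minimizer w*.
# (Hypothetical instance; A is full column rank almost surely.)
d = 5
A = rng.standard_normal((20, d))
w_star = rng.standard_normal(d)
b = A @ w_star                     # noiseless, so f(w*) = 0

f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
grad = lambda w: A.T @ (A @ w - b)

sigma_min_sq = np.linalg.svd(A, compute_uv=False).min() ** 2

for _ in range(1000):
    w = rng.standard_normal(d)
    gap = f(w) - f(w_star)         # f(w) - f(w*)
    g, diff = grad(w), w - w_star
    # Polyak-Lojasiewicz with mu = 2 sigma_min(A)^2
    assert g @ g >= 2 * sigma_min_sq * gap - 1e-8
    # weakly-quasi-convex with tau = 2 (holds with equality here)
    assert g @ diff >= 2 * gap - 1e-8
    # RSI with mu = sigma_min(A)^2
    assert g @ diff >= sigma_min_sq * (diff @ diff) - 1e-8
```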
In this section we use the generalized linear model as an example to show how some of these properties can be applied.
7.2.1 Generalized linear model
In the generalized linear model (also known as isotonic regression) [? ? ], the input consists of samples $\{x^{(i)}, y^{(i)}\}$ drawn from a distribution $\mathcal{D}$, where $(x, y) \sim \mathcal{D}$ satisfies
$$y = \sigma(w_*^\top x) + \epsilon. \qquad (7.3)$$
Here $\sigma: \mathbb{R} \to \mathbb{R}$ is a known monotone function, $\epsilon$ is a noise term satisfying $\mathbb{E}[\epsilon \mid x] = 0$, and $w_*$ is the unknown parameter that we are trying to learn.
In this case, it is natural to consider the following expected loss
$$L(w) = \frac{1}{2} \mathop{\mathbb{E}}_{(x,y) \sim \mathcal{D}}\left[\big(y - \sigma(w^\top x)\big)^2\right]. \qquad (7.4)$$
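As a quick illustration (a minimal sketch under assumptions not in the text: $\sigma$ is taken to be the logistic sigmoid, and the dimensions, noise level, and step size are arbitrary choices), one can sample data from the model (7.3) and run gradient descent on the empirical version of the loss (7.4):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))       # known monotone link (assumed sigmoid)
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # its derivative

# Draw samples {x^(i), y^(i)} from y = sigma(w_*^T x) + eps with E[eps | x] = 0.
n, d = 2000, 5
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = sigma(X @ w_star) + 0.1 * rng.standard_normal(n)

def loss(w):
    # Empirical version of L(w) = 1/2 E[(y - sigma(w^T x))^2]
    return 0.5 * np.mean((y - sigma(X @ w)) ** 2)

def grad(w):
    z = X @ w
    return -np.mean(((y - sigma(z)) * dsigma(z))[:, None] * X, axis=0)

w = np.zeros(d)
for t in range(500):
    w -= 2.0 * grad(w)   # constant step size, chosen for illustration

print(loss(w), np.linalg.norm(w - w_star))
```

In this well-conditioned regime the empirical loss decreases to roughly the noise floor and $w$ approaches $w_*$, even though $L(w)$ is not convex in general.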