Therefore, we assume that $A = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ with $\lambda_1 \ge \cdots \ge \lambda_d$. The function can be simplified to
$$f(w) = \frac{1}{2} \sum_{i=1}^{d} \lambda_i w_i^2.$$
The gradient descent update can be written as
$$w \leftarrow w - \eta \nabla f(w) = w - \eta A w.$$
Here we omit the subscript $t$ for the time step and use the subscript for the coordinate. Equivalently, we can write the per-coordinate update rule
$$w_i \leftarrow w_i - \eta \lambda_i w_i = (1 - \eta \lambda_i) w_i.$$
Now we see that if $\eta > 2/\lambda_i$ for some $i$, then $|1 - \eta \lambda_i| > 1$, so the absolute value of $w_i$ will blow up exponentially and lead to unstable behavior. Thus, we need $\eta \lesssim \frac{1}{\max_i \lambda_i}$. Note that $\max_i \lambda_i$ corresponds to the smoothness parameter of $f$, because $\lambda_1$ is the largest eigenvalue of $\nabla^2 f = A$. This is consistent with the condition in Lemma 2.1.1 that $\eta$ needs to be small.
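As a quick numerical illustration (the eigenvalues and step size below are hypothetical choices, not from the text), running the per-coordinate update with $\eta > 2/\lambda_1$ shows the blow-up:

```python
import numpy as np

# Hypothetical eigenvalues and step size (not from the text), chosen so that
# eta > 2 / lambda_1: the first coordinate then oscillates with growing
# magnitude, while the better-conditioned coordinates still shrink.
lam = np.array([10.0, 1.0, 0.1])   # lambda_1 >= ... >= lambda_d
eta = 0.25                          # violates eta < 2 / lambda_1 = 0.2
w = np.ones(3)

for _ in range(20):
    w = (1.0 - eta * lam) * w       # per-coordinate GD update

print(w)   # |w_1| has grown to ~3e3, while w_2 and w_3 have decayed
```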
Suppose for simplicity we set $\eta = 1/(2\lambda_1)$. Then the convergence for the $w_1$ coordinate is very fast: the coordinate $w_1$ is halved every iteration. However, the convergence of the coordinate $w_d$ is slower, because it's only reduced by a factor of $(1 - \lambda_d/(2\lambda_1))$ every iteration. Therefore, it takes $O(\lambda_1/\lambda_d \cdot \log(1/\epsilon))$ iterations to converge to an error $\epsilon$. The analysis here can be extended to general convex functions, which also reflects the principle that:
The condition number is defined as $\kappa = \sigma_{\max}(A)/\sigma_{\min}(A) = \lambda_1/\lambda_d$. It governs the convergence rate of GD.
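A rough numerical sketch of this rate (the eigenvalues are assumed for illustration, giving $\kappa = 100$): under $\eta = 1/(2\lambda_1)$, counting the iterations each coordinate needs to fall below $\epsilon$ exposes the gap between the best- and worst-conditioned directions:

```python
import numpy as np

# Assumed eigenvalues for illustration (kappa = lambda_1 / lambda_d = 100).
# With eta = 1/(2 * lambda_1), count iterations until each coordinate,
# started at 1, drops below eps; the slowest needs ~ 2 * kappa * log(1/eps).
lam = np.array([100.0, 1.0])
eta = 1.0 / (2.0 * lam[0])
eps = 1e-3

for i, l in enumerate(lam, start=1):
    w, t = 1.0, 0
    while abs(w) > eps:
        w *= (1.0 - eta * l)        # contraction factor 1 - lambda_i/(2*lambda_1)
        t += 1
    print(f"coordinate {i}: {t} iterations")
# prints roughly 10 iterations for coordinate 1 and ~1400 for coordinate 2
```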
2.4.1 Pre-conditioners
From the toy quadratic example above, we can see that it would be better to use a different learning rate for each coordinate. In other words, if we introduce a learning rate $\eta_i = 1/\lambda_i$ for each coordinate, then we can achieve faster convergence; in fact, for the quadratic above, each coordinate converges in a single step, since $(1 - \eta_i \lambda_i) w_i = 0$. In the more general setting where $A$ is not diagonal, we don't know the coordinate system in advance, and the algorithm corresponds to
$$w \leftarrow w - A^{-1} \nabla f(w).$$
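As a sanity check (using a randomly generated positive definite $A$, purely for illustration), a single step of this preconditioned update exactly minimizes the quadratic $f(w) = \frac{1}{2} w^\top A w$, whatever its condition number:

```python
import numpy as np

# Illustrative only: one preconditioned step w <- w - A^{-1} grad f(w)
# lands exactly at the minimizer of f(w) = (1/2) w^T A w.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)             # non-diagonal, positive definite
w = rng.standard_normal(3)

grad = A @ w                        # gradient of the quadratic
w = w - np.linalg.solve(A, grad)    # preconditioned step (solve, don't invert)
print(np.linalg.norm(w))            # ~0 up to floating-point error
```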
In the even more general setting where $f$ is not quadratic, this corresponds to Newton's algorithm
$$w \leftarrow w - \nabla^2 f(w)^{-1} \nabla f(w).$$
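A minimal one-dimensional sketch of this update, assuming the toy objective $f(w) = \log\cosh(w)$ (our choice for illustration, not from the text):

```python
import numpy as np

# Toy non-quadratic objective f(w) = log(cosh(w)), so f'(w) = tanh(w) and
# f''(w) = 1/cosh(w)^2. Newton's update divides the gradient by the local
# curvature instead of using a fixed learning rate.
w = 0.8
for t in range(5):
    grad = np.tanh(w)
    hess = 1.0 / np.cosh(w) ** 2
    w -= grad / hess                # w <- w - f''(w)^{-1} f'(w)
    print(t, w)
# converges to the minimizer w = 0 very rapidly from this start; note that
# Newton's method can diverge on this f from far-away initializations
```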