When $x = \pm\sqrt{\lambda_i}\, v_i$ for some $i \geq 2$ and $\delta = v_1$, we have
$$\delta^\top [\nabla^2 f(x)]\, \delta = v_1^\top \left[ \big\|\sqrt{\lambda_i}\, v_i\big\|_2^2\, I + 2\lambda_i v_i v_i^\top - M \right] v_1 = \lambda_i - \lambda_1 < 0.$$
Here the last step uses the fact that the $v_i$'s are orthonormal vectors and $v_1^\top M v_1 = \lambda_1$. The proof for $x = 0$ is very similar. Combining all the steps above, we have proved the following claim:
Claim 7.4.1 (Properties of critical points). The only critical points of $f(x)$ are of the form $x = \pm\sqrt{\lambda_i}\, v_i$ or $x = 0$. For all critical points except $x = \pm\sqrt{\lambda_1}\, v_1$, $\nabla^2 f(x)$ has a negative eigenvalue.
This claim directly implies that the only second order stationary points are $x = \pm\sqrt{\lambda_1}\, v_1$, so all second order stationary points are also global minima.
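To see the claim concretely, here is a quick numerical sanity check. This is a minimal sketch that assumes the objective studied in this section is $f(x) = \frac{1}{4}\|M - xx^\top\|_F^2$, whose Hessian $\|x\|_2^2 I + 2xx^\top - M$ matches the expression used above; the random matrix and variable names are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    A = rng.standard_normal((d, d))
    M = A @ A.T                                # random symmetric PSD matrix
    lam, V = np.linalg.eigh(M)
    lam, V = lam[::-1], V[:, ::-1]             # eigenpairs sorted so lam[0] is largest

    def hessian(x):
        # Hessian of f(x) = (1/4) ||M - x x^T||_F^2
        return (x @ x) * np.eye(d) + 2.0 * np.outer(x, x) - M

    # At a saddle x = sqrt(lam_2) v_2, delta = v_1 certifies negative curvature:
    # delta^T [Hessian] delta = lam_2 - lam_1 < 0, as computed above.
    x_saddle = np.sqrt(lam[1]) * V[:, 1]
    print(V[:, 0] @ hessian(x_saddle) @ V[:, 0], lam[1] - lam[0])  # these agree

    # At the global minimum x = sqrt(lam_1) v_1 the Hessian is positive semidefinite.
    x_opt = np.sqrt(lam[0]) * V[:, 0]
    print(np.linalg.eigvalsh(hessian(x_opt)).min())  # nonnegative, up to roundoff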
7.4.2 Finding directions of improvement
The approach in Section 7.4.1 is straightforward. However, in more complicated problems it is often infeasible to enumerate all the solutions of $\nabla f(x) = 0$. What we proved in Section 7.4.1 is also not strong enough to show that $f(x)$ is locally optimizable, because we only proved that every exact SOSP is a global minimum, while a locally optimizable function requires every approximate SOSP to be close to a global minimum. We will now give an alternative approach that is often more flexible and robust.
For every point $x$ that is not a global minimum, we define its direction of improvement as follows:
Definition 7.4.2 (Direction of improvement). For an objective function $f$ and a point $x$, we say $\delta$ is a direction of improvement (of $f$ at $x$) if $|\langle \nabla f(x), \delta \rangle| > 0$ or $\delta^\top [\nabla^2 f(x)]\, \delta < 0$. We say $\delta$ is an $(\epsilon, \gamma)$-direction of improvement (of $f$ at $x$) if $|\langle \nabla f(x), \delta \rangle| > \epsilon \|\delta\|_2$ or $\delta^\top [\nabla^2 f(x)]\, \delta < -\gamma \|\delta\|_2^2$.
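To make the definition concrete, the following sketch tests whether a given $\delta$ is an $(\epsilon, \gamma)$-direction of improvement, given the gradient and Hessian of $f$ at $x$; the function name and signature are illustrative, not from the text.

    import numpy as np

    def is_improvement_direction(grad, hess, delta, eps, gamma):
        # The two conditions of Definition 7.4.2: large correlation with the
        # gradient, or sufficiently negative curvature along delta.
        norm = np.linalg.norm(delta)
        large_gradient = abs(grad @ delta) > eps * norm
        negative_curvature = delta @ hess @ delta < -gamma * norm**2
        return large_gradient or negative_curvature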
Intuitively, if $\delta$ is a direction of improvement for $f$ at $x$, then moving along one of $\delta$ or $-\delta$ for a small enough step can decrease the objective function. In fact, if a point $x$ has a direction of improvement, it cannot be a second order stationary point; if a point $x$ has an $(\epsilon, \gamma)$-direction of improvement, then it cannot be an $(\epsilon, \gamma)$-SOSP.
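This intuition follows from a second-order Taylor expansion (a standard argument, spelled out here for completeness):
$$f(x + t\delta) = f(x) + t\,\langle \nabla f(x), \delta \rangle + \frac{t^2}{2}\,\delta^\top [\nabla^2 f(x)]\, \delta + o(t^2).$$
If $|\langle \nabla f(x), \delta \rangle| > 0$, choosing the sign of $t$ so that the first-order term is negative decreases $f$ for small enough $|t|$; if instead $\delta^\top [\nabla^2 f(x)]\, \delta < 0$, the quadratic term is negative and dominates near a point where the gradient term vanishes. Either way, $f(x + t\delta) < f(x)$ for some small step $t$.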
Now we can look at the contrapositive of what we were trying to prove in the definition of locally optimizable functions: if every point $x$ with $f(x) > f(x^*) + \tau$ has an $(\epsilon, \gamma)$-direction of improvement, then every $(\epsilon, \gamma)$-second order stationary point must satisfy $f(x) \le f(x^*) + \tau$. Therefore, our goal in this part is to find a direction of improvement for every point that is not globally optimal.
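Two generic candidates are always worth checking for this purpose: the gradient itself, and the bottom eigenvector of the Hessian. The sketch below is an illustrative helper (not the problem-specific construction used in the analysis) that returns whichever candidate certifies an $(\epsilon, \gamma)$-direction of improvement.

    import numpy as np

    def candidate_direction(grad, hess, eps, gamma):
        g = np.linalg.norm(grad)
        if g > eps:
            # For delta = grad / ||grad||, |<grad, delta>| = ||grad|| > eps * ||delta||_2.
            return grad / g
        w, U = np.linalg.eigh(hess)        # eigenvalues in ascending order
        if w[0] < -gamma:
            # Unit eigenvector u satisfies u^T [hess] u = w[0] < -gamma * ||u||_2^2.
            return U[:, 0]
        return None                        # x is an (eps, gamma)-SOSP

By definition, an $(\epsilon, \gamma)$-SOSP admits neither candidate, which matches the contrapositive argument above.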