7.4 Case study: top eigenvector of a matrix
In this section we look at a simple example of a locally optimizable
function. Given a symmetric PSD matrix M ∈ ℝ^{d×d}, our goal is to
find its top eigenvector (the eigenvector corresponding to the largest
eigenvalue). More precisely, using SVD we can write M as
M = \sum_{i=1}^{d} \lambda_i v_i v_i^\top .
Here the v_i's are orthonormal vectors that are eigenvectors of M, and the λ_i's
are the eigenvalues. For simplicity we assume λ_1 > λ_2 ≥ λ_3 ≥ ⋯ ≥ λ_d ≥ 0.
(Note that the only real assumption here is λ_1 > λ_2, so that the top eigenvector
is unique; the other inequalities are without loss of generality.)
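To make the setup concrete, here is a small numpy sketch (not from the text; the dimension, eigenvalues, and random seed are arbitrary choices) that builds such a matrix M from an orthonormal basis and recovers its top eigenvector with a standard eigensolver.

```python
import numpy as np

# Build a symmetric PSD matrix M = sum_i lambda_i v_i v_i^T with lambda_1 > lambda_2.
rng = np.random.default_rng(0)
d = 5
lambdas = np.array([4.0, 2.0, 1.0, 0.5, 0.1])     # lambda_1 > lambda_2 >= ... >= lambda_d >= 0
V, _ = np.linalg.qr(rng.standard_normal((d, d)))  # columns v_1, ..., v_d are orthonormal
M = (V * lambdas) @ V.T                           # equals V @ diag(lambdas) @ V.T

# Recover the decomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(M)
top_vec = eigvecs[:, -1]

# The recovered top eigenvector matches v_1 up to sign.
print(np.allclose(abs(top_vec @ V[:, 0]), 1.0))
```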
There are many objective functions whose global optima give the
top eigenvector. For example, using the basic definition of the spectral norm,
we know that for a PSD matrix M the global optimum of

\max_{\|x\|_2 = 1} x^\top M x

is the top eigenvector of M. However, this formulation requires a
constraint. We instead work with an unconstrained version whose
correctness follows from the Eckart-Young theorem:
\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{4} \| M - x x^\top \|_F^2 . \qquad (7.6)
Note that this function does have a symmetry in the sense that
f(x) = f(−x). Under our assumptions, the only global minima of
this function are x = ±√λ_1 v_1. We are going to show that these are
also the only second-order stationary points. We will give two proof
strategies that are commonly used to prove the locally optimizable
property.
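Continuing the sketch above (again just an illustration; the perturbation size 1e-3 is an arbitrary choice), one can evaluate the objective of Equation (7.6) directly and check numerically that x = ±√λ_1 v_1 give the same value and that small random perturbations around this point do not decrease it.

```python
def f(x, M):
    """Objective from Equation (7.6): f(x) = (1/4) * ||M - x x^T||_F^2."""
    R = M - np.outer(x, x)
    return 0.25 * np.sum(R * R)

x_star = np.sqrt(lambdas[0]) * V[:, 0]   # candidate global minimum sqrt(lambda_1) * v_1

# Symmetry f(x) = f(-x): both signs give the same objective value.
print(np.isclose(f(x_star, M), f(-x_star, M)))

# Small random perturbations around x_star should not decrease the objective.
vals = [f(x_star + 1e-3 * rng.standard_normal(d), M) for _ in range(1000)]
print(min(vals) >= f(x_star, M))
```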
7.4.1 Characterizing all critical points
The first idea is simple: we will just try to solve the equation
∇f(x) = 0 to get the positions of all critical points; then, for the
critical points that are not the desired global minima, we try to prove
that they are local maxima or saddle points.
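As a numerical preview of this strategy (a sketch only, reusing f, M, V, and lambdas from the snippets above, and anticipating the characterization worked out below, namely that the critical points are x = 0 and x = ±√λ_i v_i), a finite-difference gradient and Hessian suffice to check that the gradient vanishes at each of these candidates and that the Hessian has a strictly negative eigenvalue at every candidate except x = ±√λ_1 v_1, so the other candidates are saddle points or local maxima.

```python
def grad_f(x, M, eps=1e-5):
    """Finite-difference gradient of f (no closed-form formula assumed yet)."""
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e, M) - f(x - e, M)) / (2 * eps)
    return g

def hess_f(x, M, eps=1e-4):
    """Finite-difference Hessian of f, symmetrized to suppress numerical noise."""
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        H[:, j] = (grad_f(x + e, M) - grad_f(x - e, M)) / (2 * eps)
    return 0.5 * (H + H.T)

# Candidate critical points: x = 0 and x = sqrt(lambda_i) * v_i for each i.
candidates = [np.zeros(d)] + [np.sqrt(lambdas[i]) * V[:, i] for i in range(d)]
for x in candidates:
    grad_norm = np.linalg.norm(grad_f(x, M))
    min_hess_eig = np.linalg.eigvalsh(hess_f(x, M)).min()
    print(grad_norm < 1e-4, min_hess_eig)
# Expected: the gradient is ~0 at every candidate, but the smallest Hessian
# eigenvalue is positive only at x = sqrt(lambda_1) * v_1.
```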
Computing gradient and Hessian. Before we solve the equation
∇f(x) = 0 for the objective function f(x) defined in Equation (7.6),
we first give a simple way of computing the gradient and Hessian.
We will first expand f(x + δ) (where δ should be thought of as a small