Theory of Deep Learning, 2022
tractable landscapes for nonconvex optimization 59
Claim 7.1.4. For an objective function f(w) : ℝ^d → ℝ and a critical point w (∇f(w) = 0), we know
• If ∇²f(w) ≻ 0, w is a local minimum.
• If ∇²f(w) ≺ 0, w is a local maximum.
• If ∇²f(w) has both a positive and a negative eigenvalue, w is a saddle point.
These criteria are known as second-order sufficient conditions
in optimization. Intuitively, one can prove this claim by looking at
the second-order Taylor expansion. The three cases in the claim do
not cover all possible Hessian matrices. The remaining cases are
considered degenerate, and such a point can be a local minimum,
a local maximum, or a saddle point 2 .
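To see why, write the second-order Taylor expansion around a critical point w (where ∇f(w) = 0):

f(w + δ) = f(w) + ½ δ⊤∇²f(w)δ + o(‖δ‖²).

If ∇²f(w) ≻ 0, the quadratic term is positive for every small δ ≠ 0, so f increases in every direction and w is a local minimum; if ∇²f(w) ≺ 0, the term is negative in every direction and w is a local maximum; and if the Hessian has eigenvalues of both signs, f increases along an eigenvector with positive eigenvalue and decreases along one with negative eigenvalue, so w is a saddle point.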
Flat regions Even if a function does not have any spurious local
minima or saddle points, it can still be nonconvex; see Figure 7.1. In
high dimensions such functions can still be very hard to optimize.
The main difficulty here is that even if the norm ‖∇f(w)‖₂ is small,
unlike for convex functions one cannot conclude that f(w) is close to
f(w∗). However, in such cases one can often hope for the function f(w)
to satisfy some relaxed notion of convexity, and design efficient
algorithms accordingly. We discuss one such case in Section 7.2.
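As a toy one-dimensional illustration (our own example, not from the text): take f(w) = 1 − exp(−w²), which has a unique global minimum at w∗ = 0 but flattens out away from it, so a tiny gradient norm does not certify that the value is near-optimal.

```python
import numpy as np

# Toy nonconvex function with a flat region: f(w) = 1 - exp(-w^2).
# Unique global minimum at w* = 0 with f(w*) = 0, but for large |w|
# the gradient is exponentially small while f(w) stays close to 1.
f = lambda w: 1.0 - np.exp(-w**2)
grad = lambda w: 2.0 * w * np.exp(-w**2)

w = 10.0
print(abs(grad(w)))     # exponentially small gradient norm...
print(f(w) - f(0.0))    # ...yet the suboptimality gap is still near 1
```

A gradient-norm stopping criterion would halt here while the iterate is far from optimal in value, which is exactly the obstacle flat regions pose.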
2 One can consider the point w = 0 for the functions w⁴, −w⁴, and w³: it is a local minimum, a local maximum, and a saddle point, respectively.
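The case analysis in Claim 7.1.4, including the degenerate fallthrough discussed above, can be sketched as a small numerical check; the function name and the tolerance below are our own choices, not from the text.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point by the eigenvalues of its Hessian
    (second-order sufficient conditions)."""
    # The Hessian is symmetric, so eigvalsh returns real eigenvalues.
    eigs = np.linalg.eigvalsh(hessian)
    if np.all(eigs > tol):
        return "local minimum"      # positive definite
    if np.all(eigs < -tol):
        return "local maximum"      # negative definite
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"       # indefinite
    return "degenerate"             # some eigenvalue ~ 0: test inconclusive

# Example: f(w1, w2) = w1^2 - w2^2 has a saddle at the origin,
# with Hessian diag(2, -2).
print(classify_critical_point(np.diag([2.0, -2.0])))  # saddle point
```

Note that the degenerate branch mirrors the footnote: for w⁴, −w⁴, and w³ the second derivative at w = 0 vanishes, so the eigenvalue test alone cannot decide the type.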
Figure 7.1: Obstacles for nonconvex optimization. From left to right: local minimum, saddle point, and flat region.
7.2 Cases with a unique global minimum
We first consider the case that is most similar to convex objectives.
In this section, the objective functions we look at have no spurious
local minima or saddle points. In fact, in our example the objective
will have a unique global minimum. The only obstacle to
optimizing these functions is that points with small gradients may
not be near-optimal.
The main idea here is to identify properties of the objective and
also a potential function, such that the potential function keeps de-