TheoryofDeepLearning.2022

Let $\delta = -c \cdot \operatorname{sgn}(u_1^\top u_2)$ for a small enough $c > 0$ such that $\|u_2\| < \|\hat{u}_2\| \le \|\hat{u}_1\| < \|u_1\|$. Using Equation (9.13), this implies that $\|\hat{u}_1\|^4 + \|\hat{u}_2\|^4 < \|u_1\|^4 + \|u_2\|^4$, which in turn gives us $R(\hat{U}) < R(U)$ and hence $L_\theta(\hat{U}) < L_\theta(U)$. Therefore, a non-equalized critical point cannot be a local minimum, which proves the first claim of the lemma.

9.3.2 Landscape properties

Next, we characterize the solutions to which dropout converges. We do so by understanding the optimization landscape of Problem 9.11. Central to our analysis is the following notion of the strict saddle property.

Definition 9.3.4 (Strict saddle point/property). Let $f : \mathcal{U} \to \mathbb{R}$ be a twice differentiable function and let $U \in \mathcal{U}$ be a critical point of $f$. Then $U$ is a strict saddle point of $f$ if the Hessian of $f$ at $U$ has at least one negative eigenvalue, i.e. $\lambda_{\min}(\nabla^2 f(U)) < 0$. Furthermore, $f$ satisfies the strict saddle property if all saddle points of $f$ are strict saddles.

The strict saddle property ensures that for any critical point $U$ that is not a local optimum, the Hessian has a significant negative eigenvalue, which allows first order methods such as gradient descent (GD) and stochastic gradient descent (SGD) to escape saddle points and converge to a local minimum [? ?]. Following this idea, there has been a flurry of work studying the landscape of different machine learning problems, including low rank matrix recovery [?], the generalized phase retrieval problem [?], matrix completion [?], deep linear networks [?], matrix sensing and robust PCA [?], and tensor decomposition [?], making a case for global optimality of first order methods.

For the special case of no regularization (i.e. $\lambda = 0$; equivalently, no dropout), Problem 9.11 reduces to standard squared loss minimization, which has been shown to have no spurious local minima and to satisfy the strict saddle property (see, e.g. [? ?]).
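To make Definition 9.3.4 concrete, the following minimal sketch checks the sign of the smallest Hessian eigenvalue at a critical point. The two toy functions and the helper names (`min_hessian_eigenvalue`, `is_strict_saddle`) are illustrative assumptions and are not taken from the text; the Hessians are written out by hand rather than derived from Problem 9.11.

```python
import numpy as np

def min_hessian_eigenvalue(hessian):
    """Smallest eigenvalue of a symmetric Hessian (eigvalsh sorts ascending)."""
    return float(np.linalg.eigvalsh(hessian)[0])

def is_strict_saddle(hessian, tol=1e-12):
    """True if the Hessian at a critical point has a negative eigenvalue.

    The tolerance `tol` is a hypothetical numerical safeguard, not part of
    the definition.
    """
    return min_hessian_eigenvalue(hessian) < -tol

# Toy example 1 (assumed): f(x, y) = x**2 - y**2.
# The origin is a critical point with Hessian diag(2, -2): a strict saddle.
H_strict = np.array([[2.0, 0.0], [0.0, -2.0]])

# Toy example 2 (assumed): g(x, y) = x**2 + y**3.
# The origin is a critical point with Hessian diag(2, 0). It is a saddle
# (g takes negative values along y < 0) but a degenerate one: no negative
# eigenvalue, so it is not a strict saddle.
H_degenerate = np.array([[2.0, 0.0], [0.0, 0.0]])

print(is_strict_saddle(H_strict))
print(is_strict_saddle(H_degenerate))
```

The second example is the kind of degenerate saddle that the strict saddle property rules out: second-order information alone gives a first order method no descent direction there.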
However, the regularizer induced by dropout can potentially introduce new spurious local minima as well as degenerate saddle points. Our next result establishes that this is not the case, at least when the dropout rate is sufficiently small.

Theorem 9.3.5. Let $r := \operatorname{Rank}(M)$. Assume that $d_1 \le d_0$ and that the regularization parameter satisfies
$$\lambda < \frac{r \lambda_r(M)}{\sum_{i=1}^{r} \lambda_i(M) - r \lambda_r(M)}.$$
Then it holds for Problem 9.11 that
1. all local minima are global,
2. all saddle points are strict saddle points.

A few remarks are in order. First, the assumption $d_1 \le d_0$ is by no means restrictive, since the network map $UU^\top \in \mathbb{R}^{d_0 \times d_0}$ has rank at most $d_0$, and letting $d_1 > d_0$ does not increase the expressivity of the function class represented by the network. Second, Theorem 9.3.5 guarantees that any critical point $U$ that is not a global optimum is a strict saddle point, i.e. $\nabla^2 L(U, U)$ has a negative eigenvalue. This property allows first order methods, such as dropout, to escape such saddle points. Third, note that the guarantees in Theorem 9.3.5 hold when the regularization parameter $\lambda$ is sufficiently small. Assumptions of this kind are common in the literature (see, for example [?]). While this is a sufficient condition for the result in Theorem 9.3.5, it is not clear if it is necessary.

Proof of Theorem 9.3.5. Here we outline the main steps in the proof of Theorem 9.3.5.
1. In Lemma 9.3.3, we show that the set of non-equalized critical points does not include any local optima. Furthermore, Lemma 9.3.6 shows that all such points are strict saddles.
2. In Lemma 9.3.7, we give a closed-form characterization of all the equalized critical points in terms of the eigendecomposition of $M$. We then show that if $\lambda$ is chosen appropriately, all such critical points that are not global optima are strict saddle points.
3. It follows from Item 1 and Item 2 that if $\lambda$ is chosen appropriately, then all critical points that are not global optima are strict saddle points.

Lemma 9.3.6. All critical points of Problem 9.11 that are not equalized are strict saddle points.

Proof of Lemma 9.3.6. By Lemma 9.3.3, the set of non-equalized critical points does not include any local optima. We show that all such points are strict saddles. Let $U$ be a critical point that is not equalized. To show that $U$ is a strict saddle point, it suffices to show that the Hessian has a negative eigenvalue. Here, we exhibit a curve along which the second directional derivative is negative.
Assume, without loss of generality, that $\|u_1\| > \|u_2\|$ and consider the curve
$$\Delta(t) := \left[ (\sqrt{1-t^2} - 1) u_1 + t u_2, \; (\sqrt{1-t^2} - 1) u_2 - t u_1, \; 0_{d, r-2} \right].$$
It is easy to check that for any $t \in [-1, 1]$, $L(U + \Delta(t)) = L(U)$ since $U + \Delta(t)$ is essentially a rotation on $U$ and $L$ is invariant under
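The rotation claim for the curve $\Delta(t)$ can be checked numerically. The sketch below builds $\Delta(t)$ for a random matrix and confirms that $(U + \Delta(t))(U + \Delta(t))^\top = UU^\top$ up to floating-point error, so any quantity that depends on $U$ only through $UU^\top$ is unchanged along the curve. The dimensions, the random seed, and the helper name `delta` are arbitrary assumptions for illustration. The sum of fourth powers of column norms is printed only as an example of a quantity that is not rotation invariant in general.

```python
import numpy as np

def delta(U, t):
    """Perturbation Delta(t): acts on the first two columns of U, zero elsewhere."""
    c = np.sqrt(1.0 - t**2)  # real-valued only for |t| <= 1
    D = np.zeros_like(U)
    D[:, 0] = (c - 1.0) * U[:, 0] + t * U[:, 1]
    D[:, 1] = (c - 1.0) * U[:, 1] - t * U[:, 0]
    return D

def fourth_power_norm_sum(U):
    """Sum over columns of the fourth power of the Euclidean column norm."""
    return float(np.sum(np.linalg.norm(U, axis=0) ** 4))

rng = np.random.default_rng(0)       # arbitrary seed (assumption)
U = rng.standard_normal((5, 4))      # d = 5, r = 4 chosen arbitrarily
t = 0.3
U_t = U + delta(U, t)

# U_t equals U times a rotation of the first two columns, so the product
# U_t U_t^T coincides with U U^T.
gram_gap = float(np.max(np.abs(U_t @ U_t.T - U @ U.T)))
print("max |U_t U_t^T - U U^T| =", gram_gap)

# Column norms, by contrast, do change along the curve in general.
print("fourth-power norm sum before:", fourth_power_norm_sum(U))
print("fourth-power norm sum after: ", fourth_power_norm_sum(U_t))
```

At $t = 0$ the perturbation vanishes, so the curve passes through $U$ itself.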