Theory of Deep Learning, 2022
98 theory of deep learning
rotations. Observe that
\[
\begin{aligned}
g(t) := L_\theta(U + \Delta(t)) &= L_\theta(U) + \|\sqrt{1-t^2}\,u_1 + t u_2\|^4 - \|u_1\|^4 + \|\sqrt{1-t^2}\,u_2 - t u_1\|^4 - \|u_2\|^4 \\
&= L_\theta(U) - 2t^2(\|u_1\|^4 + \|u_2\|^4) + 8t^2(u_1^\top u_2)^2 + 4t^2\|u_1\|^2\|u_2\|^2 \\
&\qquad + 4t\sqrt{1-t^2}\,(u_1^\top u_2)(\|u_1\|^2 - \|u_2\|^2) + O(t^3).
\end{aligned}
\]
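Because the rotation \(\Delta(t)\) leaves \(UU^\top\) unchanged, only the two fourth-power column-norm terms of the loss move, and the expansion above can be checked numerically. A minimal sketch (the vectors and the \(O(t^3)\) tolerance constant are illustrative, not from the text):

```python
import numpy as np

# Illustrative column pair (not from the text): a = ||u1||^2, b = ||u2||^2, c = u1^T u2.
u1 = np.array([1.0, 2.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0])
a, b, c = u1 @ u1, u2 @ u2, u1 @ u2

def exact(t):
    # Exact change of the two ||.||^4 terms under the rotation Delta(t).
    s = np.sqrt(1 - t**2)
    return (np.linalg.norm(s*u1 + t*u2)**4 - a**2
            + np.linalg.norm(s*u2 - t*u1)**4 - b**2)

def expansion(t):
    # Second-order expansion from the text; the remainder is O(t^3).
    s = np.sqrt(1 - t**2)
    return (-2*t**2*(a**2 + b**2) + 8*t**2*c**2
            + 4*t**2*a*b + 4*t*s*c*(a - b))

# The gap between exact value and expansion shrinks like t^3.
for t in (1e-2, 1e-3):
    assert abs(exact(t) - expansion(t)) < 100 * t**3
```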
The derivative of \(g\) is then
\[
\begin{aligned}
g'(t) &= -4t(\|u_1\|^4 + \|u_2\|^4) + 16t(u_1^\top u_2)^2 + 8t\|u_1\|^2\|u_2\|^2 \\
&\qquad + 4\Big(\sqrt{1-t^2} - \frac{t^2}{\sqrt{1-t^2}}\Big)(u_1^\top u_2)(\|u_1\|^2 - \|u_2\|^2) + O(t^2).
\end{aligned}
\]
Since \(U\) is a critical point and \(L_\theta\) is continuously differentiable, it must hold that
\[
g'(0) = 4(u_1^\top u_2)(\|u_1\|^2 - \|u_2\|^2) = 0.
\]
Since by assumption \(\|u_1\|^2 - \|u_2\|^2 > 0\), it must be the case that \(u_1^\top u_2 = 0\). We now consider the second-order directional derivative:
\[
\begin{aligned}
g''(0) &= -4(\|u_1\|^4 + \|u_2\|^4) + 16(u_1^\top u_2)^2 + 8\|u_1\|^2\|u_2\|^2 \\
&= -4(\|u_1\|^2 - \|u_2\|^2)^2 < 0,
\end{aligned}
\]
which completes the proof.
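Both derivative values can be confirmed by finite differences on the exact curve \(g\). A small sketch under illustrative choices (orthogonal columns with \(\|u_1\|^2 = 4 > 1 = \|u_2\|^2\); the step size and tolerances are mine, not the text's):

```python
import numpy as np

# Orthogonal columns with different norms: u1^T u2 = 0, ||u1||^2 - ||u2||^2 > 0.
u1 = np.array([2.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 0.0])
a, b = u1 @ u1, u2 @ u2   # a = 4, b = 1

def g(t):
    # Change of the two ||.||^4 terms under the rotation; the rest of
    # L_theta is unchanged because the rotation preserves UU^T.
    s = np.sqrt(1 - t**2)
    return (np.linalg.norm(s*u1 + t*u2)**4 - a**2
            + np.linalg.norm(s*u2 - t*u1)**4 - b**2)

h = 1e-4
g1 = (g(h) - g(-h)) / (2*h)           # central difference for g'(0)
g2 = (g(h) - 2*g(0) + g(-h)) / h**2   # central difference for g''(0)

assert abs(g1) < 1e-6                    # g'(0) = 0 since u1^T u2 = 0
assert abs(g2 - (-4*(a - b)**2)) < 1e-3  # g''(0) = -4(a-b)^2 = -36 < 0
```

The strictly negative second derivative is what makes such a critical point a strict saddle in the rotation direction.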
We now focus on the critical points that are equalized, i.e., points \(U\) such that \(\nabla L_\theta(U) = 0\) and \(\operatorname{diag}(U^\top U) = \frac{\|U\|_F^2}{d_1} I\).
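Concretely, a matrix is equalized exactly when all its columns have the same norm, since then each diagonal entry of \(U^\top U\) equals the average \(\|U\|_F^2 / d_1\). A quick numerical illustration (the matrix is my own example):

```python
import numpy as np

# Every column of this U has squared norm 2, so U is "equalized":
# diag(U^T U) = (||U||_F^2 / d1) * I.
U = np.array([[1.0,  0.0, 1.0],
              [1.0,  1.0, 0.0],
              [0.0, -1.0, 1.0]])
d1 = U.shape[1]

col_sq_norms = np.diag(U.T @ U)                    # diag(U^T U)
target = (np.linalg.norm(U, 'fro')**2 / d1) * np.ones(d1)
assert np.allclose(col_sq_norms, target)           # ||U||_F^2 / d1 = 6/3 = 2
```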
Lemma 9.3.7. Let \(r := \operatorname{rank}(M)\). Assume that \(d_1 \le d_0\) and \(\lambda < \frac{r\lambda_r}{\sum_{i=1}^r (\lambda_i - \lambda_r)}\). Then all equalized local minima are global. All other equalized critical points are strict saddle points.
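To make the assumption on \(\lambda\) concrete, the bound can be evaluated for a hypothetical spectrum of \(M\) (the eigenvalues below are chosen for illustration, not taken from the text):

```python
# Hypothetical eigenvalues lambda_1 >= lambda_2 >= lambda_3 of M:
lam = [3.0, 2.0, 1.0]
r, lam_r = len(lam), lam[-1]

# The bound of Lemma 9.3.7: lambda must be below r*lam_r / sum_i (lam_i - lam_r).
threshold = r * lam_r / sum(l - lam_r for l in lam)
assert threshold == 1.0   # 3*1 / ((3-1) + (2-1) + (1-1)) = 3/3
```

Note the bound loosens as the spectrum flattens (each \(\lambda_i - \lambda_r\) shrinks) and tightens when the top eigenvalues dominate \(\lambda_r\).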
Proof of Lemma 9.3.7. Let \(U\) be a critical point that is equalized. Furthermore, let \(r'\) be the rank of \(U\), and let \(U = W\Sigma V^\top\) be its rank-\(r'\) SVD, i.e. \(W \in \mathbb{R}^{d_0 \times r'}\) and \(V \in \mathbb{R}^{d_1 \times r'}\) are such that \(W^\top W = V^\top V = I_{r'}\), and \(\Sigma \in \mathbb{R}^{r' \times r'}\) is a positive definite diagonal matrix whose diagonal entries are sorted in descending order. We have:
\[
\begin{aligned}
\nabla L_\theta(U) = 4(UU^\top - M)U + 4\lambda\, U \operatorname{diag}(U^\top U) &= 0 \\
\implies UU^\top U + \lambda \frac{\|U\|_F^2}{d_1} U &= MU \\
\implies W\Sigma^3 V^\top + \lambda \frac{\|\Sigma\|_F^2}{d_1} W\Sigma V^\top &= MW\Sigma V^\top \\
\implies \Sigma^2 + \lambda \frac{\|\Sigma\|_F^2}{d_1} I &= W^\top M W
\end{aligned}
\]
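The stated gradient can be checked against finite differences, assuming the loss takes the form \(L_\theta(U) = \|UU^\top - M\|_F^2 + \lambda \sum_i \|u_i\|^4\), which is consistent with the gradient above but is an assumption here, not a formula quoted from this excerpt:

```python
import numpy as np

rng = np.random.default_rng(1)
d0, d1, lam = 4, 3, 0.1
A = rng.standard_normal((d0, d0))
M = A + A.T                               # symmetric M
U = rng.standard_normal((d0, d1))

def loss(U):
    # Assumed loss, consistent with the stated gradient:
    # L_theta(U) = ||UU^T - M||_F^2 + lambda * sum_i ||u_i||^4
    return (np.linalg.norm(U @ U.T - M, 'fro')**2
            + lam * np.sum(np.sum(U**2, axis=0)**2))

# Closed-form gradient from the text: 4(UU^T - M)U + 4*lambda*U*diag(U^T U).
grad = 4*(U @ U.T - M) @ U + 4*lam * U @ np.diag(np.diag(U.T @ U))

# Entry-wise central finite differences of the loss.
eps = 1e-6
num = np.zeros_like(U)
for i in range(d0):
    for j in range(d1):
        E = np.zeros_like(U)
        E[i, j] = eps
        num[i, j] = (loss(U + E) - loss(U - E)) / (2*eps)

assert np.allclose(grad, num, atol=1e-4)
```

Multiplying the first-order condition on the right by \(V\Sigma^{-1}\) and then on the left by \(W^\top\) gives exactly the last displayed line, since \(W^\top W = V^\top V = I_{r'}\) and \(\|U\|_F = \|\Sigma\|_F\).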