TheoryofDeepLearning.2022

$$U(t) = W_{d_0} \Sigma' V_{d_1}^\top,$$

where $\Sigma' \in \mathbb{R}^{d_0 \times d_1}$ is diagonal with non-zero diagonal elements given as $\sigma_i' = \sqrt{\sigma_i^2 + t^2}$ for $i \le d_1$. Observe that

$$U(t)^\top U(t) = V \Sigma^2 V^\top + t^2 V_{d_1}^\top V_{d_1} = U^\top U + t^2 I_{d_1}.$$

Thus, the parametric curve $U(t)$ is equalized for all $t$. The population risk at $U(t)$ equals:

$$L(U(t)) = \sum_{i=1}^{d_1} (\lambda_i - \sigma_i^2 - t^2)^2 + \sum_{i=d_1+1}^{d_0} \lambda_i^2 = L(U) + d_1 t^4 - 2t^2 \sum_{i=1}^{d_1} (\lambda_i - \sigma_i^2).$$

Furthermore, since $U(t)$ is equalized, we obtain the following form for the regularizer:

$$R(U(t)) = \frac{\lambda}{d_1} \|U(t)\|_F^4 = \frac{\lambda}{d_1} \left( \|U\|_F^2 + d_1 t^2 \right)^2 = R(U) + \lambda d_1 t^4 + 2\lambda t^2 \|U\|_F^2.$$

Define $g(t) := L(U(t)) + R(U(t))$. We have that

$$g(t) = L(U) + R(U) + d_1 t^4 - 2t^2 \sum_{i=1}^{d_1} (\lambda_i - \sigma_i^2) + \lambda d_1 t^4 + 2\lambda t^2 \|U\|_F^2.$$

It is easy to verify that $g'(0) = 0$. Moreover, the second derivative of $g$ at $t = 0$ is given as:

$$g''(0) = -4 \sum_{i=1}^{d_1} (\lambda_i - \sigma_i^2) + 4\lambda \|U\|_F^2 = -4 \sum_{i=1}^{d_1} \lambda_i + 4(1 + \lambda) \|U\|_F^2. \tag{9.15}$$

We use $\|U\|_F^2 = \sum_{i=1}^{r'} \sigma_i^2$ and Equation (9.14) to arrive at

$$\|U\|_F^2 = \operatorname{tr} \Sigma^2 = \sum_{i=1}^{r'} \left( \lambda_i - \frac{\lambda \sum_{j=1}^{r'} \lambda_j}{d_1 + \lambda r'} \right) = \left( \sum_{i=1}^{r'} \lambda_i \right) \left( 1 - \frac{\lambda r'}{d_1 + \lambda r'} \right) = \frac{d_1 \sum_{i=1}^{r'} \lambda_i}{d_1 + \lambda r'}.$$

Plugging the above equality back into Equation (9.15), we get

$$g''(0) = -4 \sum_{i=1}^{d_1} \lambda_i + 4\, \frac{d_1 + d_1 \lambda}{d_1 + \lambda r'} \sum_{i=1}^{r'} \lambda_i = -4 \sum_{i=r'+1}^{d_1} \lambda_i + 4\, \frac{(d_1 - r') \lambda}{d_1 + \lambda r'} \sum_{i=1}^{r'} \lambda_i.$$

To get a sufficient condition for $U$ to be a strict saddle point, it suffices that $g''(t)$ be negative at $t = 0$, i.e.

$$\begin{aligned}
g''(0) < 0 &\implies \frac{(d_1 - r') \lambda}{d_1 + \lambda r'} \sum_{i=1}^{r'} \lambda_i < \sum_{i=r'+1}^{d_1} \lambda_i \\
&\implies \lambda < \frac{(d_1 + \lambda r') \sum_{i=r'+1}^{d_1} \lambda_i}{(d_1 - r') \sum_{i=1}^{r'} \lambda_i} \\
&\implies \lambda \left( 1 - \frac{r' \sum_{i=r'+1}^{d_1} \lambda_i}{(d_1 - r') \sum_{i=1}^{r'} \lambda_i} \right) < \frac{d_1 \sum_{i=r'+1}^{d_1} \lambda_i}{(d_1 - r') \sum_{i=1}^{r'} \lambda_i} \\
&\implies \lambda < \frac{d_1 \sum_{i=r'+1}^{d_1} \lambda_i}{(d_1 - r') \sum_{i=1}^{r'} \lambda_i - r' \sum_{i=r'+1}^{d_1} \lambda_i} \\
&\implies \lambda < \frac{d_1 h(r')}{\sum_{i=1}^{r'} (\lambda_i - h(r'))},
\end{aligned}$$

where $h(r') := \frac{\sum_{i=r'+1}^{d_1} \lambda_i}{d_1 - r'}$ is the average of the tail eigenvalues $\lambda_{r'+1}, \ldots, \lambda_{d_1}$. It is easy to see that the right-hand side is monotonically decreasing with $r'$, since $h(r')$ monotonically decreases with $r'$. Hence, it suffices to make sure that $\lambda$ is smaller than the right-hand side for the choice of $r' = r - 1$, where $r := \operatorname{Rank}(M)$. That is,

$$\lambda < \frac{r \lambda_r}{\sum_{i=1}^{r} (\lambda_i - \lambda_r)}.$$

Case 3. [$E \neq [r']$] We show that all such critical points are strict saddle points. Let $w'$ be one of the top $r'$ eigenvectors that are missing in $W$. Let $j \in E$ be such that $w_j$ is not among the top $r'$ eigenvectors of $M$. For any $t \in [0, 1]$, let $W(t)$ be identical to $W$ in all the columns but the $j$-th one, where $w_j(t) = \sqrt{1 - t^2}\, w_j + t w'$. Note that $W(t)$ is still an orthogonal matrix for all values of $t$. Define the parametrized curve $U(t) := W(t) \Sigma V^\top$ for $t \in [0, 1]$ and observe that:

$$\|U - U(t)\|_F^2 = \sigma_j^2 \|w_j - w_j(t)\|^2 = 2\sigma_j^2 \left( 1 - \sqrt{1 - t^2} \right) \le t^2 \operatorname{Tr} M.$$

That is, for any $\epsilon > 0$, there exists a $t > 0$ such that $U(t)$ belongs to the $\epsilon$-ball around $U$. We show that $L_\theta(U(t))$ is strictly smaller than $L_\theta(U)$, which means $U$ cannot be a local minimum. Note that this construction of $U(t)$ guarantees that $R(U(t)) = R(U)$. In particular, it is easy to see that $U(t)^\top U(t) = U^\top U$, so that $U(t)$ remains equalized for all values of $t$.
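The threshold on $\lambda$ and the rotated column $w_j(t)$ can be sanity-checked numerically. The following is a minimal sketch, not part of the original text: the function names are hypothetical, the eigenvalues are illustrative, and $M$ is assumed diagonal so that its eigenvectors are the standard basis vectors. It checks that the simplified form of $g''(0)$ changes sign exactly at $d_1 h(r') / \sum_{i \le r'} (\lambda_i - h(r'))$, and that $w_j(t)$ stays unit-norm while its Rayleigh quotient with $M$ increases.

```python
import math


def tail_average(eigs, r_prime, d1):
    """h(r'): average of lambda_{r'+1}, ..., lambda_{d1} (1-indexed in the text)."""
    return sum(eigs[r_prime:d1]) / (d1 - r_prime)


def g_second_derivative_at_zero(eigs, r_prime, d1, lam):
    """Simplified g''(0): -4 * tail + 4 * (d1 - r') * lam / (d1 + lam * r') * head."""
    head = sum(eigs[:r_prime])    # sum of lambda_1 .. lambda_{r'}
    tail = sum(eigs[r_prime:d1])  # sum of lambda_{r'+1} .. lambda_{d1}
    return -4.0 * tail + 4.0 * (d1 - r_prime) * lam / (d1 + lam * r_prime) * head


def lambda_threshold(eigs, r_prime, d1):
    """Right-hand side d1 * h(r') / sum_{i <= r'} (lambda_i - h(r'))."""
    h = tail_average(eigs, r_prime, d1)
    return d1 * h / sum(x - h for x in eigs[:r_prime])


def rotated_column(w_j, w_new, t):
    """w_j(t) = sqrt(1 - t^2) * w_j + t * w', for orthonormal w_j and w'."""
    s = math.sqrt(1.0 - t * t)
    return [s * a + t * b for a, b in zip(w_j, w_new)]


def rayleigh(diag_m, w):
    """w^T M w for a diagonal M given by its diagonal entries (an assumption)."""
    return sum(m * x * x for m, x in zip(diag_m, w))


if __name__ == "__main__":
    eigs = [4.0, 2.0, 1.0, 0.5]  # illustrative eigenvalues, sorted decreasingly
    d1, r_prime = 4, 2
    thr = lambda_threshold(eigs, r_prime, d1)
    print("threshold:", thr)
    for lam in (0.5, 1.0):
        print(lam, g_second_derivative_at_zero(eigs, r_prime, d1, lam))

    w_j = [0.0, 0.0, 1.0, 0.0]    # eigenvector with eigenvalue 1.0
    w_new = [0.0, 1.0, 0.0, 0.0]  # eigenvector with eigenvalue 2.0
    w_t = rotated_column(w_j, w_new, 0.3)
    print("norm^2:", sum(x * x for x in w_t))
    print("Rayleigh:", rayleigh(eigs, w_j), "->", rayleigh(eigs, w_t))
```

The check assumes the denominator $\sum_{i \le r'} (\lambda_i - h(r'))$ is positive, which holds whenever the leading eigenvalues are not all equal to the tail average.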
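Moreover, we have that

$$\begin{aligned}
L_\theta(U(t)) - L_\theta(U) &= \|M - U(t) U(t)^\top\|_F^2 - \|M - U U^\top\|_F^2 \\
&= -2 \operatorname{Tr}\left( \Sigma^2 W(t)^\top M W(t) \right) + 2 \operatorname{Tr}\left( \Sigma^2 W^\top M W \right) \\
&= -2 \sigma_j^2 t^2 \left( w_j(t)^\top M w_j(t) - w_j^\top M w_j \right) < 0,
\end{aligned}$$

where the last inequality follows because by construction $w_j(t)^\top M w_j(t) > w_j^\top M w_j$. Define $g(t) := L_\theta(U(t)) = L(U(t)) + R(U(t))$. To see that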
