inductive biases due to algorithmic regularization 87
9.1.2 Matrix Completion
Next, we consider the problem of matrix completion, which can be formulated as a special case of matrix sensing with sensing matrices that are random indicator matrices. Formally, we assume that for each j ∈ [n], A^{(j)} is an indicator matrix whose (i, k)-th element equals one with probability p(i)q(k), where p(i) and q(k) denote the probability of choosing the i-th row and the k-th column, respectively.
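This sampling model can be sketched numerically. Below is a minimal NumPy sketch; the dimensions and the distributions p, q are made-up placeholders, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d2, d0 = 4, 5                    # hypothetical ambient dimensions
p = rng.dirichlet(np.ones(d2))   # row-sampling probabilities p(i)
q = rng.dirichlet(np.ones(d0))   # column-sampling probabilities q(k)

def sample_indicator(rng, p, q):
    """Draw a sensing matrix A = e_i e_k^T with P[(i, k)] = p(i) q(k)."""
    i = rng.choice(len(p), p=p)
    k = rng.choice(len(q), p=q)
    A = np.zeros((len(p), len(q)))
    A[i, k] = 1.0
    return A

A = sample_indicator(rng, p, q)  # exactly one entry is nonzero, equal to 1
```

Each draw reveals a single entry of the underlying matrix, which is why this sensing model corresponds to matrix completion.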
We will show next that in this setup Dropout induces the weighted trace-norm studied by [? ] and [? ]. Formally, we show that
\[
\Theta(M) = \frac{1}{d_1}\,\big\|\operatorname{diag}(\sqrt{p})\, U V^\top \operatorname{diag}(\sqrt{q})\big\|_*^2. \tag{9.7}
\]
Proof. For any pair of factors (U, V) it holds that
\begin{align*}
R(U, V) &= \sum_{j=1}^{d_1} \mathbb{E}\big(u_j^\top A v_j\big)^2 \\
&= \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} \sum_{l=1}^{d_0} p(k)\, q(l)\, \big(u_j^\top e_k e_l^\top v_j\big)^2 \\
&= \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} \sum_{l=1}^{d_0} p(k)\, q(l)\, U(k, j)^2\, V(l, j)^2 \\
&= \sum_{j=1}^{d_1} \big\|\operatorname{diag}(\sqrt{p})\, u_j\big\|^2\, \big\|\operatorname{diag}(\sqrt{q})\, v_j\big\|^2 \\
&\ge \frac{1}{d_1} \Big( \sum_{j=1}^{d_1} \big\|\operatorname{diag}(\sqrt{p})\, u_j\big\|\, \big\|\operatorname{diag}(\sqrt{q})\, v_j\big\| \Big)^2 \\
&= \frac{1}{d_1} \Big( \sum_{j=1}^{d_1} \big\|\operatorname{diag}(\sqrt{p})\, u_j v_j^\top \operatorname{diag}(\sqrt{q})\big\|_* \Big)^2 \\
&\ge \frac{1}{d_1} \Big\| \operatorname{diag}(\sqrt{p}) \Big( \sum_{j=1}^{d_1} u_j v_j^\top \Big) \operatorname{diag}(\sqrt{q}) \Big\|_*^2 \\
&= \frac{1}{d_1} \big\| \operatorname{diag}(\sqrt{p})\, U V^\top \operatorname{diag}(\sqrt{q}) \big\|_*^2,
\end{align*}
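The step collapsing the expectation into a product of weighted column norms can be checked numerically. The following is a sanity-check sketch with NumPy; the dimensions and random factors are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
d2, d0, d1 = 4, 5, 3                   # hypothetical dimensions
p = rng.dirichlet(np.ones(d2))         # p(k), row distribution
q = rng.dirichlet(np.ones(d0))         # q(l), column distribution
U = rng.standard_normal((d2, d1))
V = rng.standard_normal((d0, d1))

# Expectation expanded exactly over the finite support A = e_k e_l^T,
# which occurs with probability p(k) q(l):
lhs = sum(p[k] * q[l] * (U[k, j] * V[l, j]) ** 2
          for j in range(d1) for k in range(d2) for l in range(d0))

# Closed form: sum_j ||diag(sqrt(p)) u_j||^2 * ||diag(sqrt(q)) v_j||^2
rhs = sum((p * U[:, j] ** 2).sum() * (q * V[:, j] ** 2).sum()
          for j in range(d1))
```

Both quantities agree, since the expectation is just a weighted sum over the finitely many possible indicator matrices.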
where the first inequality is due to Cauchy-Schwarz and the second inequality follows from the triangle inequality. The equality right after the first inequality follows from the fact that for any two vectors a, b, we have \|ab^\top\|_* = \|ab^\top\|_F = \|a\|\,\|b\|. Since these inequalities hold for any pair of factors U, V, it implies that
\[
\Theta(UV^\top) \ge \frac{1}{d_1}\,\big\|\operatorname{diag}(\sqrt{p})\, U V^\top \operatorname{diag}(\sqrt{q})\big\|_*^2.
\]
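The rank-one norm identity used above is easy to confirm numerically; a small sketch with NumPy, using arbitrary random vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.standard_normal(4)
b = rng.standard_normal(5)

M = np.outer(a, b)                    # rank-one matrix a b^T
nuc = np.linalg.norm(M, ord='nuc')    # nuclear norm: sum of singular values
fro = np.linalg.norm(M, ord='fro')    # Frobenius norm
prod = np.linalg.norm(a) * np.linalg.norm(b)
```

A rank-one matrix has a single nonzero singular value, so its nuclear and Frobenius norms coincide and both equal \|a\|\,\|b\|.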
Applying Theorem 9.1.1 on (\operatorname{diag}(\sqrt{p})\,U, \operatorname{diag}(\sqrt{q})\,V), there exists