TheoryofDeepLearning.2022

Recommendations

Info

82 theory of deep learningsights from the success of the sexual reproduction model in the evolutionof advanced organisms. Another motivation that was cited by [?] was in terms of “balancing networks”. Despite several theoreticalworks aimed at explaining Dropout 3 , it remains unclear what kind3of regularization does Dropout provide or what kinds of networksdoes Dropout prefer and how that helps with generalization. In thischapter, we work towards that goal by instantiating explicit forms ofregularizers due to Dropout and how they provide capacity controlin various machine learning including linear regression (Section 9.4),matrix sensing (Section 9.1.1), matrix completion (Section 9.1.2), anddeep learning (Section 9.2).9.1 Matrix SensingWe begin with understanding dropout for matrix sensing, a problemwhich arguably is an important instance of a matrix learning problemwith lots of applications, and is well understood from a theoreticalperspective. Here is the problem setup.Let M ∗ ∈ R d 2×d 0 be a matrix with rank r ∗ := Rank(M ∗ ). LetA (1) , . . . , A (n) be a set of measurement matrices of the same size asM ∗ . The goal of matrix sensing is to recover the matrix M ∗ from nobservations of the form y i = 〈M ∗ , A (i) 〉 such that n ≪ d 2 d 0 . Thelearning algorithm we consider is empirical risk minimization, andwe choose to represent the parameter matrix M ∈ R d 2×d 0 in terms ofproduct of its factors U ∈ R d 2×d 1 , V ∈ R d 0×d 1 :min ̂L(U, V) := 1U∈R d 2 ×d 1 ,V∈R d 0 ×d 1nn∑i=1(y i − 〈UV ⊤ , A (i) 〉) 2 . (9.1)When d 1 ≫ r ∗ , there exist many “bad” empirical minimizers,i.e., those with a large true risk. However, recently, [? ] showed thatunder restricted isometry property, despite the existence of such poorERM solutions, gradient descent with proper initialization is implicitlybiased towards finding solutions with minimum nuclear norm – thisis an important result which was first conjectured and empiricallyverified by [? ].We propose solving the ERM problem (9.1) with algorithmic regularizationdue to dropout, where at training time, the correspondingcolumns of U and V are dropped independently and identically accordingto a Bernoulli random variable. As opposed to the impliciteffect of gradient descent, this dropout heuristic explicitly regularizesthe empirical objective. It is then natural to ask, in the case of matrixsensing, if dropout also biases the ERM towards certain low normsolutions. To answer this question, we begin with the observationthat dropout can be viewed as an instance of SGD on the following
inductive biases due to algorithmic regularization 83objective:̂L drop (U, V) = 1 nn∑ E B (y i − 〈UBV ⊤ , A (i) 〉) 2 , (9.2)i=1where B ∈ R d 1×d 1 is a diagonal matrix whose diagonal elements areBernoulli random variables distributed as B jj ∼1−p 1 Ber(1 − p), forj ∈ [d 1 ]. In this case, we can show that for any p ∈ [0, 1):wherêL drop (U, V) = ̂L(U, V) +̂R(U, V) =d 11∑nj=1n∑i=1p1 − p ̂R(U, V), (9.3)(u ⊤ j A (i) v j ) 2 (9.4)is a data-dependent term that captures the explicit regularizer due todropout.Proof. Consider one of the summands in the Dropout objective inEquation 9.2. Then, we can writeE B [(y i − 〈UBV ⊤ , A (i) 〉) 2 ] =() 2E B [y i − 〈UBV ⊤ , A (i) 〉]+ Var(y i − 〈UBV ⊤ , A (i) 〉) (9.5)For Bernoulli random variable B jj , we have that E[B jj ] = 1 andVar(B jj ) =p1−p. Thus, the first term on right hand side is equal to(y i − 〈UV ⊤ , A (i) 〉) 2 . For the second term we haveVar(y i − 〈UBV ⊤ , A (i) 〉) = Var(〈UBV ⊤ , A (i) 〉)= Var(〈B, U ⊤ A (i) V〉)= Var(=d 1∑j=1d 1∑j=1= p1 − pB jj u ⊤ j A (i) v j )(u ⊤ j A (i) v j ) 2 Var(B jj )d 1∑j=1Using the facts above in Equation (9.2), we get̂L drop (U, V) = 1 nn∑i=1(y i − 〈UV ⊤ , A (i) 〉) 2 + 1 n= ̂L(U, V) + p1 − p ̂R(U, V).which completes the proof.(u ⊤ j A (i) v j ) 2n∑i=1p1 − pd 1∑j=1(u ⊤ j A (i) v j ) 2
Page 1:
C O N T R I B U T O R S : R A M A N
Page 4 and 5:
44 Basics of generalization theory
Page 6 and 7:
612 Representation Learning 11113 E
Page 8 and 9:
810.2 Autoencoder defined using a d
Page 11:
IntroductionThis monograph discusse
Page 14 and 15:
14 theory of deep learning• Train
Page 17 and 18:
2Basics of OptimizationThis chapter
Page 19 and 20:
basics of optimization 19where the
Page 21 and 22:
basics of optimization 21Therefore,
Page 23 and 24:
3Backpropagation and its VariantsTh
Page 25 and 26:
backpropagation and its variants 25
Page 27 and 28:
Page 29 and 30:
Page 31 and 32: 4Basics of generalization theoryGen
Page 33 and 34: basics of generalization theory 33p
Page 35 and 36: basics of generalization theory 35w
Page 37: basics of generalization theory 37N
Page 41 and 42: 6Algorithmic RegularizationLarge sc
Page 43 and 44: algorithmic regularization 43minimi
Page 45 and 46: algorithmic regularization 45update
Page 47 and 48: algorithmic regularization 476.2 Ma
Page 49 and 50: algorithmic regularization 496.3.2
Page 51 and 52: algorithmic regularization 51Given
Page 53 and 54: algorithmic regularization 53to the
Page 55: algorithmic regularization 55Since
Page 58 and 59: 58 theory of deep learning7.1 Preli
Page 60 and 61: 60 theory of deep learningcreasing
Page 62 and 63: 62 theory of deep learning“gradie
Page 64 and 65: 64 theory of deep learning7.4 Case
Page 66 and 67: 66 theory of deep learningWhen x =
Page 68 and 69: 68 theory of deep learningproof is
Page 70 and 71: 70 theory of deep learningthe input
Page 72 and 73: 72 theory of deep learningas the av
Page 74 and 75: 74 theory of deep learningSeveral r
Page 76 and 77: 76 theory of deep learningFigure 8.
Page 78 and 79: 78 theory of deep learningNote the
Page 81: 9Inductive Biases due to Algorithmi
Page 85 and 86: inductive biases due to algorithmic
Page 103 and 104: 10Unsupervised learning: OverviewMu
Page 105 and 106: unsupervised learning: overview 105
Page 111 and 112: 11Generative Adversarial NetsChapte
Page 113: 12Representation Learning
Page 116 and 117: 116 theory of deep learning13.3 Exa
Page 118: 118 theory of deep learning13.5 Exa
show all

TheoryofDeepLearning.2022

Create successful ePaper yourself

Delete template?

Save as template?