the input. Given a training dataset $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$, consider training the neural network by minimizing the squared loss over the training data:
$$\ell(w) = \frac{1}{2} \sum_{i=1}^n \left( f(w, x_i) - y_i \right)^2.$$
For simplicity, in this chapter we study gradient flow, i.e., gradient descent with an infinitesimally small learning rate. In this case, the training dynamics can be described by an ordinary differential equation (ODE):
$$\frac{dw(t)}{dt} = -\nabla \ell(w(t)).$$
Note that this ODE describes the dynamics of the parameters; a minimal numerical sketch of how to simulate it appears below. The lemma that follows describes the induced dynamics of the predictions on the training data points.
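Gradient flow can be thought of as the limit of gradient descent as the step size goes to zero, so it can be approximated by forward Euler steps $w(t + dt) \approx w(t) - dt \cdot \nabla \ell(w(t))$ with a tiny $dt$. The sketch below illustrates this in JAX; the quadratic loss is a toy stand-in for $\ell$, and the step size and iteration count are arbitrary illustrative choices.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for the loss l(w); for this choice the exact gradient flow
# dw/dt = -grad l(w) = -w has closed-form solution w(t) = w(0) * exp(-t).
def l(w):
    return 0.5 * jnp.sum(w ** 2)

grad_l = jax.grad(l)
w = jnp.array([1.0, -2.0])
dt = 1e-3
for _ in range(10_000):          # integrates the flow up to time t = 10
    w = w - dt * grad_l(w)       # forward Euler step on dw/dt = -grad l(w)
print(w)                         # approx. w(0) * exp(-10), matching the exact flow
```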
Lemma 8.1.1. Let $u(t) = (f(w(t), x_i))_{i \in [n]} \in \mathbb{R}^n$ be the vector of network outputs on all the $x_i$'s at time $t$, and let $y = (y_i)_{i \in [n]}$ be the labels. Then $u(t)$ evolves according to
$$\frac{du(t)}{dt} = -H(t) \, (u(t) - y), \tag{8.1}$$
where $H(t)$ is an $n \times n$ positive semidefinite matrix whose $(i, j)$-th entry is $\left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle$.
Proof of Lemma 8.1.1. The parameters $w$ evolve according to the differential equation
$$\frac{dw(t)}{dt} = -\nabla \ell(w(t)) = -\sum_{i=1}^n \left( f(w(t), x_i) - y_i \right) \frac{\partial f(w(t), x_i)}{\partial w}, \tag{8.2}$$
where $t \ge 0$ is a continuous time index. By the chain rule, the network output on $x_i$ evolves as $\frac{d f(w(t), x_i)}{dt} = \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{dw(t)}{dt} \right\rangle$, and substituting Equation (8.2) gives
$$\frac{d f(w(t), x_i)}{dt} = -\sum_{j=1}^n \left( f(w(t), x_j) - y_j \right) \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle. \tag{8.3}$$
Since $u(t) = (f(w(t), x_i))_{i \in [n]} \in \mathbb{R}^n$ is the vector of network outputs on all the $x_i$'s at time $t$, and $y = (y_i)_{i \in [n]}$ is the vector of desired outputs, Equation (8.3) can be written more compactly as
$$\frac{du(t)}{dt} = -H(t) \, (u(t) - y), \tag{8.4}$$
where $H(t) \in \mathbb{R}^{n \times n}$ is a kernel matrix defined by $[H(t)]_{i,j} = \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle$ for all $i, j \in [n]$.
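As a sanity check, Lemma 8.1.1 can be verified numerically. The JAX sketch below computes $H(t)$ as the Gram matrix of the per-example gradients $\partial f(w, x_i)/\partial w$ and compares one tiny Euler step of gradient descent against the dynamics predicted by Equation (8.1). The two-layer tanh network, its initialization scales, and the step size `dt` are illustrative assumptions, not taken from the text.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

# Hypothetical two-layer network f(w, x) = a . tanh(W x).
def f(w, x):
    W, a = w
    return jnp.dot(a, jnp.tanh(W @ x))

def loss(w, X, y):
    u = jax.vmap(lambda x: f(w, x))(X)
    return 0.5 * jnp.sum((u - y) ** 2)

def ntk_matrix(w, X):
    # Stack the flattened per-example gradients df(w, x_i)/dw into G, so that
    # H = G G^T has entries H_ij = <df/dw (x_i), df/dw (x_j)>.
    flat_grad = lambda x: ravel_pytree(jax.grad(f)(w, x))[0]
    G = jax.vmap(flat_grad)(X)
    return G @ G.T

d, m, n = 3, 64, 5
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
w = (jax.random.normal(k1, (m, d)) / jnp.sqrt(d),
     jax.random.normal(k2, (m,)) / jnp.sqrt(m))
X = jax.random.normal(k3, (n, d))
y = jax.random.normal(k4, (n,))

u = jax.vmap(lambda x: f(w, x))(X)
H = ntk_matrix(w, X)

# One tiny Euler step of gradient flow on the parameters ...
dt = 1e-4
g = jax.grad(loss)(w, X, y)
w_next = jax.tree_util.tree_map(lambda p, gp: p - dt * gp, w, g)
u_next = jax.vmap(lambda x: f(w_next, x))(X)

# ... should match the predicted dynamics du/dt = -H (u - y) to first order.
print(jnp.max(jnp.abs((u_next - u) / dt + H @ (u - y))))  # close to zero
```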
The statement of Lemma 8.1.1 involves a matrix $H(t)$. Below we define a neural network architecture whose width is allowed to go to infinity, while fixing the training data as above. In this limit, it can be shown that the matrix $H(t)$ remains constant during training, i.e., $H(t) = H(0)$ for all $t$.
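Before that formal definition, the claim can be illustrated empirically: for a hypothetical two-layer tanh network (not the architecture defined next), the relative change of $H$ over a fixed amount of training should shrink as the width $m$ grows. All sizes, learning rates, and step counts below are arbitrary illustrative choices.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

# Hypothetical two-layer network with 1/sqrt(m)-scaled output weights.
def make_net(m, key, d=3):
    kW, ka = jax.random.split(key)
    return (jax.random.normal(kW, (m, d)) / jnp.sqrt(d),
            jax.random.normal(ka, (m,)) / jnp.sqrt(m))

def f(w, x):
    W, a = w
    return jnp.dot(a, jnp.tanh(W @ x))

def ntk(w, X):
    flat_grad = lambda x: ravel_pytree(jax.grad(f)(w, x))[0]
    G = jax.vmap(flat_grad)(X)
    return G @ G.T

kx, ky, kw = jax.random.split(jax.random.PRNGKey(1), 3)
X = jax.random.normal(kx, (5, 3))
y = jax.random.normal(ky, (5,))

def loss(w):
    u = jax.vmap(lambda x: f(w, x))(X)
    return 0.5 * jnp.sum((u - y) ** 2)

for m in (10, 100, 1000):
    w = make_net(m, kw)
    H0 = ntk(w, X)
    for _ in range(200):  # Euler steps of gradient flow, total time 2
        g = jax.grad(loss)(w)
        w = jax.tree_util.tree_map(lambda p, gp: p - 1e-2 * gp, w, g)
    Ht = ntk(w, X)
    # Relative movement of the kernel matrix; expected to shrink with m.
    print(m, jnp.linalg.norm(Ht - H0) / jnp.linalg.norm(H0))
```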