Several remarks are in order.
Remark 1: The assumption that $y_i = O(1)$ is mild because in practice most labels are bounded by an absolute constant.
Remark 2: The assumption that $u_i(\tau) = O(1)$ for all $\tau \le t$, as well as $m$'s dependency on $t$, can be relaxed; this requires a more refined analysis. See [?].
Remark 3: One can generalize the proof to multi-layer neural networks. See [?] for more details.
Remark 4: While we only prove the continuous-time limit, it is not hard to show that with a small learning rate, (discrete-time) gradient descent also keeps $H(t)$ close to $H^*$. See [?].
8.3 Explaining Optimization and Generalization of Ultra-wide Neural Networks via NTK
Now we have established the following approximation:
$$\frac{du(t)}{dt} \approx -H^* \cdot (u(t) - y), \tag{8.8}$$
where $H^*$ is the NTK matrix. We next use this approximation to analyze the optimization and generalization behavior of ultra-wide neural networks.
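To see the approximation in action, here is a minimal numerical sketch (the synthetic $H^*$, the step size, and all variable names are illustrative assumptions, not from the text) that integrates the linearized dynamics (8.8) with forward Euler:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Synthetic positive semi-definite stand-in for the NTK matrix H*.
A = rng.standard_normal((n, n))
H_star = A @ A.T / n

y = rng.standard_normal(n)   # training targets
u = np.zeros(n)              # network predictions at initialization
dt = 1e-2                    # Euler step size (needs dt * lambda_max < 2 for stability)

# Forward-Euler integration of du/dt = -H* (u - y).
for _ in range(10_000):
    u -= dt * H_star @ (u - y)

print(np.linalg.norm(u - y))  # residual decays along the large-eigenvalue directions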
Understanding Optimization
The dynamics of $u(t)$, which follows
$$\frac{du(t)}{dt} = -H^* \cdot (u(t) - y),$$
is actually a linear dynamical system, for which there is a standard analysis. We denote the eigenvalue decomposition of $H^*$ by
$$H^* = \sum_{i=1}^{n} \lambda_i v_i v_i^\top,$$
where $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$ are the eigenvalues and $v_1, \ldots, v_n$ are the corresponding eigenvectors.
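Numerically, such a decomposition can be computed with a symmetric eigensolver. Below is a small sketch reusing the synthetic `H_star` from the previous snippet; note that `np.linalg.eigh` returns eigenvalues in ascending order, the reverse of the convention above.

```python
# Eigendecomposition H* = sum_i lambda_i v_i v_i^T (H* is symmetric PSD).
lams, V = np.linalg.eigh(H_star)     # eigenvalues in ascending order
lams, V = lams[::-1], V[:, ::-1]     # reorder so lambda_1 >= ... >= lambda_n

# Sanity check: the outer-product expansion reconstructs H*.
H_rebuilt = sum(lam * np.outer(v, v) for lam, v in zip(lams, V.T))
assert np.allclose(H_rebuilt, H_star)
```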
With this decomposition, we can consider the dynamics of $u(t)$ along each eigenvector separately. Formally, fixing an eigenvector $v_i$ and multiplying both sides by $v_i^\top$, we obtain
$$\frac{d\, v_i^\top u(t)}{dt} = -v_i^\top H^* \cdot (u(t) - y) = -\lambda_i \left( v_i^\top (u(t) - y) \right).$$
Observe that the dynamics of $v_i^\top u(t)$ depends only on itself and $\lambda_i$, so this is actually a one-dimensional ODE. Moreover, this ODE admits an analytical solution:
$$v_i^\top (u(t) - y) = \exp(-\lambda_i t) \, v_i^\top (u(0) - y). \tag{8.9}$$
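Equation (8.9) can be checked numerically by projecting the integrated trajectory onto each eigenvector and comparing against the closed form. A sketch continuing the snippets above (the time horizon and step size are arbitrary choices):

```python
# Compare v_i^T (u(t) - y) from Euler integration against
# exp(-lambda_i t) * v_i^T (u(0) - y) from equation (8.9).
t_end, dt = 5.0, 1e-4
u0 = np.zeros(n)

u = u0.copy()
for _ in range(int(t_end / dt)):
    u -= dt * H_star @ (u - y)

numeric = V.T @ (u - y)                          # projections at time t_end
exact = np.exp(-lams * t_end) * (V.T @ (u0 - y))
print(np.max(np.abs(numeric - exact)))           # small; discrepancy is O(dt)
```

As (8.9) shows, the projection along $v_i$ decays at rate $\lambda_i$, so the components of the residual aligned with the top eigenvectors of $H^*$ are fit first.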