Note the first term $\left\langle g^{(h-1)}(x), g^{(h-1)}(x')\right\rangle$ is the covariance between $x$ and $x'$ at the $h$-th layer. When the width goes to infinity, $\left\langle g^{(h-1)}(x), g^{(h-1)}(x')\right\rangle$ converges to a fixed number, which we denote as $\Sigma^{(h-1)}(x, x')$. This covariance admits a recursive formula: for $h \in [L]$,
\begin{align}
\Sigma^{(0)}(x, x') &= x^\top x', \nonumber\\
\Lambda^{(h)}(x, x') &= \begin{pmatrix}
\Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, x') \\
\Sigma^{(h-1)}(x', x) & \Sigma^{(h-1)}(x', x')
\end{pmatrix} \in \mathbb{R}^{2 \times 2}, \tag{8.14}\\
\Sigma^{(h)}(x, x') &= c_\sigma\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}\left[\sigma(u)\, \sigma(v)\right]. \nonumber
\end{align}
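To make the recursion concrete, here is a minimal numerical sketch of Equation (8.14). It assumes a ReLU activation with $c_\sigma = 2$ (so that $\Sigma^{(h)}(x, x) = 1$ for unit-norm inputs); the Gaussian expectation is estimated by Monte Carlo, and the function names and sample size are illustrative, not from the text.

```python
import numpy as np

def sigma(u):
    return np.maximum(u, 0.0)  # ReLU; one possible choice of activation

C_SIGMA = 2.0  # normalizing constant c_sigma; equals 2 for ReLU

def next_sigma(s_xx, s_xy, s_yy, n_samples=200_000, seed=None):
    """One step of Equation (8.14): given the entries of Lambda^(h), estimate
    Sigma^(h)(x, x') = c_sigma * E_{(u,v) ~ N(0, Lambda^(h))}[sigma(u) sigma(v)]."""
    rng = np.random.default_rng(seed)
    lam = np.array([[s_xx, s_xy], [s_xy, s_yy]])
    uv = rng.multivariate_normal(np.zeros(2), lam, size=n_samples)
    return C_SIGMA * np.mean(sigma(uv[:, 0]) * sigma(uv[:, 1]))

def sigma_depth(x, xp, depth):
    """Run the recursion from Sigma^(0)(x, x') = x^T x' up to the given depth."""
    s_xx, s_xy, s_yy = x @ x, x @ xp, xp @ xp
    for _ in range(depth):
        # The right-hand side is evaluated before assignment, so all three
        # entries are updated from the same Lambda^(h).
        s_xx, s_xy, s_yy = (next_sigma(s_xx, s_xx, s_xx),
                            next_sigma(s_xx, s_xy, s_yy),
                            next_sigma(s_yy, s_yy, s_yy))
    return s_xy

print(sigma_depth(np.array([1.0, 0.0]), np.array([0.6, 0.8]), depth=3))
```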
Now we proceed to derive this formula. The intuition is that $\left[f^{(h+1)}(x)\right]_i = \sum_{j=1}^{d_h} \left[W^{(h+1)}\right]_{i,j} \left[g^{(h)}(x)\right]_j$ is a centered Gaussian process conditioned on $f^{(h)}$ ($\forall i \in [d_{h+1}]$), with covariance
\begin{align}
\mathbb{E}\left[\left[f^{(h+1)}(x)\right]_i \cdot \left[f^{(h+1)}(x')\right]_i \,\middle|\, f^{(h)}\right]
&= \left\langle g^{(h)}(x), g^{(h)}(x')\right\rangle \nonumber\\
&= \frac{c_\sigma}{d_h} \sum_{j=1}^{d_h} \sigma\left(\left[f^{(h)}(x)\right]_j\right) \sigma\left(\left[f^{(h)}(x')\right]_j\right), \tag{8.15}
\end{align}
which converges to $\Sigma^{(h)}(x, x')$ as $d_h \to \infty$, given that each $\left[f^{(h)}\right]_j$ is a centered Gaussian process with covariance $\Sigma^{(h-1)}$. This yields the inductive definition in Equation (8.14).
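The convergence in Equation (8.15) is also easy to observe numerically. The sketch below (a ReLU network with $c_\sigma = 2$ again; widths, depth, and inputs are arbitrary choices) propagates a pair of inputs through the same random network and prints the empirical covariance $\left\langle g^{(h)}(x), g^{(h)}(x')\right\rangle$, which stabilizes as the width grows:

```python
import numpy as np

def forward_pair(x, xp, widths, seed=0):
    """Propagate x and x' jointly through one random ReLU network:
    f^(h+1) = W^(h+1) g^(h) and g^(h) = sqrt(c_sigma / d_h) * sigma(f^(h)),
    with i.i.d. N(0, 1) weights shared between the two inputs."""
    rng = np.random.default_rng(seed)
    gs = np.stack([x, xp], axis=1)  # columns hold g^(h)(x) and g^(h)(x')
    for d in widths:
        W = rng.standard_normal((d, gs.shape[0]))
        gs = np.sqrt(2.0 / d) * np.maximum(W @ gs, 0.0)  # c_sigma = 2 for ReLU
    return gs[:, 0], gs[:, 1]

x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])
for d in [10, 100, 1_000, 10_000]:
    gx, gxp = forward_pair(x, xp, widths=[d, d])
    print(d, gx @ gxp)  # empirical <g^(2)(x), g^(2)(x')>; concentrates as d grows
```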
Next we deal with the second term $\left\langle b^{(h)}(x), b^{(h)}(x')\right\rangle$. From Equation (8.12) we get
\begin{align}
\left\langle b^{(h)}(x), b^{(h)}(x')\right\rangle
= \left\langle \sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x) \left(W^{(h+1)}\right)^\top b^{(h+1)}(x),\;
\sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x') \left(W^{(h+1)}\right)^\top b^{(h+1)}(x') \right\rangle. \tag{8.16}
\end{align}
Although $W^{(h+1)}$ and $b^{(h+1)}(x)$ are dependent, the Gaussian initialization of $W^{(h+1)}$ allows us to replace $W^{(h+1)}$ with a fresh sample $\widetilde{W}^{(h+1)}$ without changing the limit (see [?] for the precise proof):
\begin{align*}
&\left\langle \sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x) \left(W^{(h+1)}\right)^\top b^{(h+1)}(x),\;
\sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x') \left(W^{(h+1)}\right)^\top b^{(h+1)}(x') \right\rangle \\
&\qquad \approx \left\langle \sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x) \left(\widetilde{W}^{(h+1)}\right)^\top b^{(h+1)}(x),\;
\sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x') \left(\widetilde{W}^{(h+1)}\right)^\top b^{(h+1)}(x') \right\rangle \\
&\qquad \to \frac{c_\sigma}{d_h} \operatorname{tr}\!\left(D^{(h)}(x)\, D^{(h)}(x')\right) \left\langle b^{(h+1)}(x), b^{(h+1)}(x')\right\rangle \\
&\qquad \to \dot{\Sigma}^{(h)}(x, x') \left\langle b^{(h+1)}(x), b^{(h+1)}(x')\right\rangle.
\end{align*}
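Here the last step uses that $D^{(h)}(x)$ is the diagonal matrix of activation derivatives $\sigma'\!\left(\left[f^{(h)}(x)\right]_j\right)$, so by the law of large numbers $\frac{c_\sigma}{d_h} \operatorname{tr}\!\left(D^{(h)}(x)\, D^{(h)}(x')\right) \to c_\sigma\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}\left[\sigma'(u)\, \sigma'(v)\right] = \dot{\Sigma}^{(h)}(x, x')$. A Monte Carlo sketch of this backward factor, under the same illustrative ReLU and $c_\sigma = 2$ assumptions as above:

```python
import numpy as np

def sigma_prime(u):
    return (u > 0).astype(float)  # derivative of ReLU (almost everywhere)

def dot_sigma(s_xx, s_xy, s_yy, n_samples=200_000, seed=None):
    """Monte Carlo estimate of dotSigma^(h)(x, x') =
    c_sigma * E_{(u,v) ~ N(0, Lambda^(h))}[sigma'(u) sigma'(v)], with c_sigma = 2."""
    rng = np.random.default_rng(seed)
    lam = np.array([[s_xx, s_xy], [s_xy, s_yy]])
    uv = rng.multivariate_normal(np.zeros(2), lam, size=n_samples)
    return 2.0 * np.mean(sigma_prime(uv[:, 0]) * sigma_prime(uv[:, 1]))

# Unrolling <b^(h)(x), b^(h)(x')> -> dotSigma^(h)(x, x') <b^(h+1)(x), b^(h+1)(x')>
# layer by layer expresses the backward term as a product of dotSigma factors,
# one per layer above h.
print(dot_sigma(1.0, 0.6, 1.0))  # e.g. the factor for unit-norm inputs with x^T x' = 0.6
```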