We first bound the movement of a single weight vector $w_r$:
\begin{align*}
\|w_r(t) - w_r(0)\|_2
&= \left\| \int_0^t \frac{\mathrm{d} w_r(\tau)}{\mathrm{d}\tau} \,\mathrm{d}\tau \right\|_2 \\
&= \left\| \int_0^t \frac{1}{\sqrt{m}} \sum_{i=1}^n \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \,\mathrm{d}\tau \right\|_2 \\
&\le \frac{1}{\sqrt{m}} \int_0^t \left\| \sum_{i=1}^n \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \right\|_2 \mathrm{d}\tau \\
&\le \frac{1}{\sqrt{m}} \sum_{i=1}^n \int_0^t \left\| \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \right\|_2 \mathrm{d}\tau \\
&\le \frac{1}{\sqrt{m}} \sum_{i=1}^n \int_0^t O(1)\,\mathrm{d}\tau \\
&= O\!\left(\frac{tn}{\sqrt{m}}\right).
\end{align*}
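As a quick numerical sanity check of this bound, the following minimal sketch (in Python; the tanh activation, the synthetic data, the helper name, and the optimization settings are illustrative assumptions rather than the chapter's exact setup) trains $f(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\,\sigma(w_r^\top x)$ by full-batch gradient descent on the squared loss and records $\max_r \|w_r(t) - w_r(0)\|_2$ for several widths $m$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize so ||x_i||_2 = 1
y = rng.normal(size=n)

def max_weight_movement(m, steps=200, lr=0.1):
    W0 = rng.normal(size=(m, d))                # w_r(0) drawn from N(0, I)
    a = rng.choice([-1.0, 1.0], size=m)         # a_r fixed at +/-1; only W is trained
    W = W0.copy()
    for _ in range(steps):
        pre = X @ W.T                           # pre[i, r] = w_r^T x_i
        u = np.tanh(pre) @ a / np.sqrt(m)       # predictions u_i
        sig_dot = 1.0 - np.tanh(pre) ** 2       # sigma'(w_r^T x_i) for tanh
        # dL/dw_r = (1/sqrt(m)) * sum_i (u_i - y_i) * a_r * sigma'(w_r^T x_i) * x_i
        grad = (sig_dot * (u - y)[:, None]).T @ X * a[:, None] / np.sqrt(m)
        W -= lr * grad
    return np.max(np.linalg.norm(W - W0, axis=1))

for m in [100, 1_000, 10_000]:
    print(m, max_weight_movement(m))            # movement shrinks roughly like 1/sqrt(m)
\end{verbatim}

For a fixed number of steps, increasing $m$ by a factor of $100$ should shrink the recorded movement by roughly a factor of $10$, consistent with the $1/\sqrt{m}$ dependence above.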
This bound shows that, at any given time $t$, $w_r(t)$ stays close to $w_r(0)$ as long as $m$ is large. Next, we show that this implies the kernel matrix $H(t)$ stays close to $H(0)$. We calculate the difference on a single entry:
\begin{align*}
\big|[H(t)]_{ij} - [H(0)]_{ij}\big|
&= \frac{1}{m} \left| \sum_{r=1}^m \Big( \dot\sigma\big(w_r(t)^\top x_i\big)\, \dot\sigma\big(w_r(t)^\top x_j\big) - \dot\sigma\big(w_r(0)^\top x_i\big)\, \dot\sigma\big(w_r(0)^\top x_j\big) \Big) \right| \\
&\le \frac{1}{m} \sum_{r=1}^m \Big| \dot\sigma\big(w_r(t)^\top x_i\big) \Big( \dot\sigma\big(w_r(t)^\top x_j\big) - \dot\sigma\big(w_r(0)^\top x_j\big) \Big) \Big| \\
&\quad + \frac{1}{m} \sum_{r=1}^m \Big| \dot\sigma\big(w_r(0)^\top x_j\big) \Big( \dot\sigma\big(w_r(t)^\top x_i\big) - \dot\sigma\big(w_r(0)^\top x_i\big) \Big) \Big| \\
&\le \frac{1}{m} \max_r \big| \dot\sigma\big(w_r(t)^\top x_i\big) \big| \sum_{r=1}^m \|x_j\|_2\, \|w_r(t) - w_r(0)\|_2 \\
&\quad + \frac{1}{m} \max_r \big| \dot\sigma\big(w_r(0)^\top x_j\big) \big| \sum_{r=1}^m \|x_i\|_2\, \|w_r(t) - w_r(0)\|_2 \\
&= \frac{1}{m} \sum_{r=1}^m O\!\left(\frac{tn}{\sqrt{m}}\right) \\
&= O\!\left(\frac{tn}{\sqrt{m}}\right).
\end{align*}
Here the second inequality uses that $\dot\sigma$ is Lipschitz, and the last two steps use that $|\dot\sigma(\cdot)|$ and $\|x_i\|_2$ are $O(1)$ together with the bound $\|w_r(t) - w_r(0)\|_2 = O\!\left(tn/\sqrt{m}\right)$ established above.
Therefore, using the same argument as in Lemma 8.2.1, we have
\[
\|H(t) - H(0)\|_2 \le \sum_{i,j} \big|[H(t)]_{ij} - [H(0)]_{ij}\big| = O\!\left(\frac{tn^3}{\sqrt{m}}\right).
\]
Plugging in our assumption on $m$, we finish the proof.
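As a complementary check (again a minimal sketch under assumed settings, not the chapter's exact setup), the snippet below forms the kernel entrywise as $[H]_{ij} = \frac{1}{m} \sum_{r=1}^{m} \dot\sigma(w_r^\top x_i)\, \dot\sigma(w_r^\top x_j)$, matching the per-entry calculation above (if the chapter's definition of $H$ also carries a factor $x_i^\top x_j$, the width dependence is unchanged), and prints $\|H(t) - H(0)\|_2$ after training for a few widths $m$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i||_2 = 1
y = rng.normal(size=n)

def kernel(W):
    S = 1.0 - np.tanh(X @ W.T) ** 2              # S[i, r] = sigma'(w_r^T x_i) for tanh
    return S @ S.T / W.shape[0]                  # [H]_ij = (1/m) sum_r S[i, r] S[j, r]

def kernel_drift(m, steps=200, lr=0.1):
    W = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    H0 = kernel(W)                               # H(0) at initialization
    for _ in range(steps):                       # full-batch gradient descent on squared loss
        pre = X @ W.T
        u = np.tanh(pre) @ a / np.sqrt(m)
        sig_dot = 1.0 - np.tanh(pre) ** 2
        grad = (sig_dot * (u - y)[:, None]).T @ X * a[:, None] / np.sqrt(m)
        W -= lr * grad
    return np.linalg.norm(kernel(W) - H0, 2)     # spectral norm ||H(t) - H(0)||_2

for m in [100, 1_000, 10_000]:
    print(m, kernel_drift(m))                    # drift decreases as m grows
\end{verbatim}

The printed drift should decrease as $m$ grows, in line with the $O(tn^3/\sqrt{m})$ bound.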