
We first bound the movement of a single weight vector $w_r$:

\begin{align*}
\|w_r(t) - w_r(0)\|_2
&= \left\| \int_0^t \frac{\mathrm{d}w_r(\tau)}{\mathrm{d}\tau} \,\mathrm{d}\tau \right\|_2 \\
&= \left\| \int_0^t \frac{1}{\sqrt{m}} \sum_{i=1}^n \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \,\mathrm{d}\tau \right\|_2 \\
&\le \frac{1}{\sqrt{m}} \int_0^t \left\| \sum_{i=1}^n \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \right\|_2 \mathrm{d}\tau \\
&\le \frac{1}{\sqrt{m}} \sum_{i=1}^n \int_0^t \left\| \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \right\|_2 \mathrm{d}\tau \\
&\le \frac{1}{\sqrt{m}} \sum_{i=1}^n \int_0^t O(1) \,\mathrm{d}\tau \\
&= O\left(\frac{tn}{\sqrt{m}}\right).
\end{align*}

This calculation shows that, at any given time $t$, $w_r(t)$ is close to $w_r(0)$ as long as $m$ is large. The short numerical sketch below illustrates this $1/\sqrt{m}$ scaling.
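What follows is a minimal numerical sketch, not from the text: it approximates gradient flow by gradient descent with a small step size, uses $\tanh$ as a concrete smooth activation, and picks the data, widths, and time horizon arbitrarily. The printed movement $\max_r \|w_r(t) - w_r(0)\|_2$ should shrink roughly like $1/\sqrt{m}$ as the width grows.

```python
# Sketch: measure how far individual weight vectors move during training
# for a two-layer network f(x) = (1/sqrt(m)) * sum_r a_r * sigma(w_r^T x).
# Hyperparameters and data are illustrative, not from the text.
import numpy as np

rng = np.random.default_rng(0)
n, d, t, lr = 10, 5, 1.0, 1e-2                 # samples, input dim, time horizon, step size
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize so ||x_i||_2 = 1
y = rng.standard_normal(n)

def sigma(z):  return np.tanh(z)               # smooth activation, |sigma'| <= 1
def dsigma(z): return 1.0 - np.tanh(z) ** 2    # its (Lipschitz) derivative

for m in [100, 1_000, 10_000]:
    W = rng.standard_normal((m, d))            # rows are w_r(0)
    a = rng.choice([-1.0, 1.0], size=m)        # top-layer signs a_r, held fixed
    W0 = W.copy()
    for _ in range(int(t / lr)):               # Euler steps approximating flow up to time t
        Z = X @ W.T                            # Z[i, r] = w_r^T x_i
        u = sigma(Z) @ a / np.sqrt(m)          # predictions u_i(tau)
        # dL/dw_r = (1/sqrt(m)) sum_i (u_i - y_i) a_r dsigma(w_r^T x_i) x_i
        G = ((u - y)[:, None] * dsigma(Z) * a[None, :]).T @ X / np.sqrt(m)
        W -= lr * G
    move = np.linalg.norm(W - W0, axis=1).max()
    print(f"m = {m:6d}   max_r ||w_r(t) - w_r(0)||_2 = {move:.5f}")
```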

Next, we show this implies that the kernel matrix $H(t)$ is close to $H(0)$. We calculate the difference for a single entry:
\begin{align*}
\big|[H(t)]_{ij} - [H(0)]_{ij}\big|
&= \frac{1}{m} \left| \sum_{r=1}^m \Big( \dot\sigma\big(w_r(t)^\top x_i\big)\, \dot\sigma\big(w_r(t)^\top x_j\big) - \dot\sigma\big(w_r(0)^\top x_i\big)\, \dot\sigma\big(w_r(0)^\top x_j\big) \Big) \right| \\
&\le \frac{1}{m} \sum_{r=1}^m \Big| \dot\sigma\big(w_r(t)^\top x_i\big) \Big( \dot\sigma\big(w_r(t)^\top x_j\big) - \dot\sigma\big(w_r(0)^\top x_j\big) \Big) \Big| \\
&\quad + \frac{1}{m} \sum_{r=1}^m \Big| \dot\sigma\big(w_r(0)^\top x_j\big) \Big( \dot\sigma\big(w_r(t)^\top x_i\big) - \dot\sigma\big(w_r(0)^\top x_i\big) \Big) \Big| \\
&\le \frac{1}{m} \max_r \Big| \dot\sigma\big(w_r(t)^\top x_i\big) \Big| \sum_{r=1}^m \|x_j\|_2\, \|w_r(t) - w_r(0)\|_2 \\
&\quad + \frac{1}{m} \max_r \Big| \dot\sigma\big(w_r(0)^\top x_j\big) \Big| \sum_{r=1}^m \|x_i\|_2\, \|w_r(t) - w_r(0)\|_2 \\
&= O\left(\frac{tn}{\sqrt{m}}\right),
\end{align*}
where the first inequality adds and subtracts $\dot\sigma\big(w_r(t)^\top x_i\big)\,\dot\sigma\big(w_r(0)^\top x_j\big)$ and applies the triangle inequality, the second uses that $\dot\sigma$ is Lipschitz, and the last step uses $\|x_i\|_2 = 1$, the boundedness of $\dot\sigma$, and the bound on $\|w_r(t) - w_r(0)\|_2$ derived above.

Therefore, using the same argument as in Lemma 8.2.1, we have

\[
\|H(t) - H(0)\|_2 \le \sum_{i,j} \big| [H(t)]_{ij} - [H(0)]_{ij} \big| = O\left(\frac{tn^3}{\sqrt{m}}\right).
\]

Plugging in our assumption on $m$, we finish the proof.
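As a companion check, here is a similar sketch (same caveats: illustrative hyperparameters, $\tanh$ activation, gradient descent as a stand-in for gradient flow) that tracks the drift $\|H(t) - H(0)\|_2$ of the kernel matrix as the width grows. The kernel below is formed from the $\dot\sigma$ products appearing in the calculation above; if the text's definition of $H$ carries an extra $x_i^\top x_j$ factor, the $1/\sqrt{m}$ scaling is unaffected since $\|x_i\|_2 = 1$.

```python
# Sketch: drift of the empirical kernel matrix H during training,
# H_ij = (1/m) * sum_r dsigma(w_r^T x_i) * dsigma(w_r^T x_j).
# Hyperparameters and data are illustrative, not from the text.
import numpy as np

rng = np.random.default_rng(1)
n, d, t, lr = 10, 5, 1.0, 1e-2
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # ||x_i||_2 = 1
y = rng.standard_normal(n)

def sigma(z):  return np.tanh(z)
def dsigma(z): return 1.0 - np.tanh(z) ** 2

def kernel(W):
    D = dsigma(X @ W.T)            # D[i, r] = dsigma(w_r^T x_i)
    return (D @ D.T) / W.shape[0]  # (n, n) kernel matrix

for m in [100, 1_000, 10_000]:
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    H0 = kernel(W)                 # H(0) at initialization
    for _ in range(int(t / lr)):   # train as in the previous sketch
        Z = X @ W.T
        u = sigma(Z) @ a / np.sqrt(m)
        G = ((u - y)[:, None] * dsigma(Z) * a[None, :]).T @ X / np.sqrt(m)
        W -= lr * G
    drift = np.linalg.norm(kernel(W) - H0, 2)  # spectral norm of the drift
    print(f"m = {m:6d}   ||H(t) - H(0)||_2 = {drift:.5f}")
```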
