We first bound the movement of a single weight vector $w_r$:
\begin{align*}
\|w_r(t) - w_r(0)\|_2
&= \left\| \int_0^t \frac{\mathrm{d} w_r(\tau)}{\mathrm{d}\tau} \,\mathrm{d}\tau \right\|_2 \\
&= \left\| \int_0^t \frac{1}{\sqrt{m}} \sum_{i=1}^n \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \,\mathrm{d}\tau \right\|_2 \\
&\le \frac{1}{\sqrt{m}} \int_0^t \left\| \sum_{i=1}^n \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \right\|_2 \mathrm{d}\tau \\
&\le \frac{1}{\sqrt{m}} \sum_{i=1}^n \int_0^t \left\| \big(u_i(\tau) - y_i\big)\, a_r\, x_i\, \dot\sigma\big(w_r(\tau)^\top x_i\big) \right\|_2 \mathrm{d}\tau \\
&\le \frac{1}{\sqrt{m}} \sum_{i=1}^n \int_0^t O(1)\,\mathrm{d}\tau \\
&= O\!\left(\frac{tn}{\sqrt{m}}\right).
\end{align*}
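As a quick numerical sanity check of this bound, the following minimal sketch (in Python; the tanh activation, the synthetic data, the helper name, and the optimization settings are illustrative assumptions rather than the chapter's exact setup) trains $f(x) = \frac{1}{\sqrt{m}} \sum_{r=1}^{m} a_r\,\sigma(w_r^\top x)$ by full-batch gradient descent on the squared loss and records $\max_r \|w_r(t) - w_r(0)\|_2$ for several widths $m$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize so ||x_i||_2 = 1
y = rng.normal(size=n)

def max_weight_movement(m, steps=200, lr=0.1):
    W0 = rng.normal(size=(m, d))                # w_r(0) drawn from N(0, I)
    a = rng.choice([-1.0, 1.0], size=m)         # a_r fixed at +/-1; only W is trained
    W = W0.copy()
    for _ in range(steps):
        pre = X @ W.T                           # pre[i, r] = w_r^T x_i
        u = np.tanh(pre) @ a / np.sqrt(m)       # predictions u_i
        sig_dot = 1.0 - np.tanh(pre) ** 2       # sigma'(w_r^T x_i) for tanh
        # dL/dw_r = (1/sqrt(m)) * sum_i (u_i - y_i) * a_r * sigma'(w_r^T x_i) * x_i
        grad = (sig_dot * (u - y)[:, None]).T @ X * a[:, None] / np.sqrt(m)
        W -= lr * grad
    return np.max(np.linalg.norm(W - W0, axis=1))

for m in [100, 1_000, 10_000]:
    print(m, max_weight_movement(m))            # movement shrinks roughly like 1/sqrt(m)
\end{verbatim}

For a fixed number of steps, increasing $m$ by a factor of $100$ should shrink the recorded movement by roughly a factor of $10$, consistent with the $1/\sqrt{m}$ dependence above.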
This bound shows that, at any given time $t$, $w_r(t)$ stays close to $w_r(0)$ as long as $m$ is large. Next, we show that this implies the kernel matrix $H(t)$ stays close to $H(0)$. We calculate the difference on a single entry:
\begin{align*}
\big|[H(t)]_{ij} - [H(0)]_{ij}\big|
&= \frac{1}{m} \left| \sum_{r=1}^m \Big( \dot\sigma\big(w_r(t)^\top x_i\big)\, \dot\sigma\big(w_r(t)^\top x_j\big) - \dot\sigma\big(w_r(0)^\top x_i\big)\, \dot\sigma\big(w_r(0)^\top x_j\big) \Big) \right| \\
&\le \frac{1}{m} \sum_{r=1}^m \Big| \dot\sigma\big(w_r(t)^\top x_i\big) \Big( \dot\sigma\big(w_r(t)^\top x_j\big) - \dot\sigma\big(w_r(0)^\top x_j\big) \Big) \Big| \\
&\quad + \frac{1}{m} \sum_{r=1}^m \Big| \dot\sigma\big(w_r(0)^\top x_j\big) \Big( \dot\sigma\big(w_r(t)^\top x_i\big) - \dot\sigma\big(w_r(0)^\top x_i\big) \Big) \Big| \\
&\le \frac{1}{m} \max_r \big| \dot\sigma\big(w_r(t)^\top x_i\big) \big| \sum_{r=1}^m \|x_j\|_2\, \|w_r(t) - w_r(0)\|_2 \\
&\quad + \frac{1}{m} \max_r \big| \dot\sigma\big(w_r(0)^\top x_j\big) \big| \sum_{r=1}^m \|x_i\|_2\, \|w_r(t) - w_r(0)\|_2 \\
&= \frac{1}{m} \sum_{r=1}^m O\!\left(\frac{tn}{\sqrt{m}}\right) \\
&= O\!\left(\frac{tn}{\sqrt{m}}\right).
\end{align*}
Here the second inequality uses that $\dot\sigma$ is Lipschitz, and the last two steps use that $|\dot\sigma(\cdot)|$ and $\|x_i\|_2$ are $O(1)$ together with the bound $\|w_r(t) - w_r(0)\|_2 = O\!\left(tn/\sqrt{m}\right)$ established above.
Therefore, using the same argument as in Lemma 8.2.1, we have
\[
\|H(t) - H(0)\|_2 \le \sum_{i,j} \big|[H(t)]_{ij} - [H(0)]_{ij}\big| = O\!\left(\frac{tn^3}{\sqrt{m}}\right).
\]
Plugging in our assumption on $m$, we finish the proof.
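As a complementary check (again a minimal sketch under assumed settings, not the chapter's exact setup), the snippet below forms the kernel entrywise as $[H]_{ij} = \frac{1}{m} \sum_{r=1}^{m} \dot\sigma(w_r^\top x_i)\, \dot\sigma(w_r^\top x_j)$, matching the per-entry calculation above (if the chapter's definition of $H$ also carries a factor $x_i^\top x_j$, the width dependence is unchanged), and prints $\|H(t) - H(0)\|_2$ after training for a few widths $m$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # ||x_i||_2 = 1
y = rng.normal(size=n)

def kernel(W):
    S = 1.0 - np.tanh(X @ W.T) ** 2              # S[i, r] = sigma'(w_r^T x_i) for tanh
    return S @ S.T / W.shape[0]                  # [H]_ij = (1/m) sum_r S[i, r] S[j, r]

def kernel_drift(m, steps=200, lr=0.1):
    W = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    H0 = kernel(W)                               # H(0) at initialization
    for _ in range(steps):                       # full-batch gradient descent on squared loss
        pre = X @ W.T
        u = np.tanh(pre) @ a / np.sqrt(m)
        sig_dot = 1.0 - np.tanh(pre) ** 2
        grad = (sig_dot * (u - y)[:, None]).T @ X * a[:, None] / np.sqrt(m)
        W -= lr * grad
    return np.linalg.norm(kernel(W) - H0, 2)     # spectral norm ||H(t) - H(0)||_2

for m in [100, 1_000, 10_000]:
    print(m, kernel_drift(m))                    # drift decreases as m grows
\end{verbatim}

The printed drift should decrease as $m$ grows, in line with the $O(tn^3/\sqrt{m})$ bound.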