Several remarks are in order.
Remark 1: The assumption that $y_i = O(1)$ is mild because in practice most labels are bounded by an absolute constant.
Remark 2: The assumption that $u_i(\tau) = O(1)$ for all $\tau \le t$, as well as $m$'s dependency on $t$, can be relaxed; this requires a more refined analysis. See [?].
Remark 3: One can generalize the proof to multi-layer neural networks. See [?] for more details.
Remark 4: While we only prove the continuous-time limit, it is not hard to show that with a small learning rate, (discrete-time) gradient descent also keeps $H(t)$ close to $H^*$. See [?].
8.3 Explaining Optimization and Generalization of Ultra-wide Neural Networks via NTK
Now we have established the following approximation:
$$\frac{du(t)}{dt} \approx -H^* \cdot (u(t) - y), \tag{8.8}$$
where $H^*$ is the NTK matrix. We next use this approximation to analyze the optimization and generalization behavior of ultra-wide neural networks.
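To see the approximation in action, here is a minimal numerical sketch (the synthetic $H^*$, the step size, and all variable names are illustrative assumptions, not from the text) that integrates the linearized dynamics (8.8) with forward Euler:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Synthetic positive semi-definite stand-in for the NTK matrix H*.
A = rng.standard_normal((n, n))
H_star = A @ A.T / n

y = rng.standard_normal(n)   # training targets
u = np.zeros(n)              # network predictions at initialization
dt = 1e-2                    # Euler step size (needs dt * lambda_max < 2 for stability)

# Forward-Euler integration of du/dt = -H* (u - y).
for _ in range(10_000):
    u -= dt * H_star @ (u - y)

print(np.linalg.norm(u - y))  # residual decays along the large-eigenvalue directions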
Understanding Optimization
The dynamics of $u(t)$, which follows
$$\frac{du(t)}{dt} = -H^* \cdot (u(t) - y),$$
is actually a linear dynamical system, for which there is a standard analysis. We denote the eigenvalue decomposition of $H^*$ by
$$H^* = \sum_{i=1}^{n} \lambda_i v_i v_i^\top,$$
where $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$ are the eigenvalues and $v_1, \ldots, v_n$ are the corresponding eigenvectors.
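Numerically, such a decomposition can be computed with a symmetric eigensolver. Below is a small sketch reusing the synthetic `H_star` from the previous snippet; note that `np.linalg.eigh` returns eigenvalues in ascending order, the reverse of the convention above.

```python
# Eigendecomposition H* = sum_i lambda_i v_i v_i^T (H* is symmetric PSD).
lams, V = np.linalg.eigh(H_star)     # eigenvalues in ascending order
lams, V = lams[::-1], V[:, ::-1]     # reorder so lambda_1 >= ... >= lambda_n

# Sanity check: the outer-product expansion reconstructs H*.
H_rebuilt = sum(lam * np.outer(v, v) for lam, v in zip(lams, V.T))
assert np.allclose(H_rebuilt, H_star)
```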
With this decomposition, we can consider the dynamics of $u(t)$ along each eigenvector separately. Formally, fixing an eigenvector $v_i$ and multiplying both sides by $v_i^\top$, we obtain
$$\frac{d\, v_i^\top u(t)}{dt} = -v_i^\top H^* \cdot (u(t) - y) = -\lambda_i \left( v_i^\top (u(t) - y) \right).$$
Observe that the dynamics of $v_i^\top u(t)$ depends only on itself and $\lambda_i$, so this is actually a one-dimensional ODE. Moreover, this ODE admits an analytical solution:
$$v_i^\top (u(t) - y) = \exp(-\lambda_i t) \, v_i^\top (u(0) - y). \tag{8.9}$$
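Equation (8.9) can be checked numerically by projecting the integrated trajectory onto each eigenvector and comparing against the closed form. A sketch continuing the snippets above (the time horizon and step size are arbitrary choices):

```python
# Compare v_i^T (u(t) - y) from Euler integration against
# exp(-lambda_i t) * v_i^T (u(0) - y) from equation (8.9).
t_end, dt = 5.0, 1e-4
u0 = np.zeros(n)

u = u0.copy()
for _ in range(int(t_end / dt)):
    u -= dt * H_star @ (u - y)

numeric = V.T @ (u - y)                          # projections at time t_end
exact = np.exp(-lams * t_end) * (V.T @ (u0 - y))
print(np.max(np.abs(numeric - exact)))           # small; discrepancy is O(dt)
```

As (8.9) shows, the projection along $v_i$ decays at rate $\lambda_i$, so the components of the residual aligned with the top eigenvectors of $H^*$ are fit first.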