the input. Given a training dataset $\{(x_i, y_i)\}_{i=1}^n \subset \mathbb{R}^d \times \mathbb{R}$, consider training the neural network by minimizing the squared loss over the training data:
$$\ell(w) = \frac{1}{2} \sum_{i=1}^n \left( f(w, x_i) - y_i \right)^2.$$
For simplicity, in this chapter we study gradient flow, i.e., gradient descent with an infinitesimally small learning rate. In this case, the training dynamics can be described by an ordinary differential equation (ODE):
$$\frac{dw(t)}{dt} = -\nabla \ell(w(t)).$$
Note that this ODE describes the dynamics of the parameters; a minimal numerical sketch of how to simulate it appears below. The lemma that follows describes the induced dynamics of the predictions on the training data points.
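Gradient flow can be thought of as the limit of gradient descent as the step size goes to zero, so it can be approximated by forward Euler steps $w(t + dt) \approx w(t) - dt \cdot \nabla \ell(w(t))$ with a tiny $dt$. The sketch below illustrates this in JAX; the quadratic loss is a toy stand-in for $\ell$, and the step size and iteration count are arbitrary illustrative choices.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for the loss l(w); for this choice the exact gradient flow
# dw/dt = -grad l(w) = -w has closed-form solution w(t) = w(0) * exp(-t).
def l(w):
    return 0.5 * jnp.sum(w ** 2)

grad_l = jax.grad(l)
w = jnp.array([1.0, -2.0])
dt = 1e-3
for _ in range(10_000):          # integrates the flow up to time t = 10
    w = w - dt * grad_l(w)       # forward Euler step on dw/dt = -grad l(w)
print(w)                         # approx. w(0) * exp(-10), matching the exact flow
```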
Lemma 8.1.1. Let $u(t) = (f(w(t), x_i))_{i \in [n]} \in \mathbb{R}^n$ be the vector of network outputs on all the $x_i$'s at time $t$, and let $y = (y_i)_{i \in [n]}$ be the labels. Then $u(t)$ evolves according to
$$\frac{du(t)}{dt} = -H(t) \, (u(t) - y), \tag{8.1}$$
where $H(t)$ is an $n \times n$ positive semidefinite matrix whose $(i, j)$-th entry is $\left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle$.
Proof of Lemma 8.1.1. The parameters $w$ evolve according to the differential equation
$$\frac{dw(t)}{dt} = -\nabla \ell(w(t)) = -\sum_{i=1}^n \left( f(w(t), x_i) - y_i \right) \frac{\partial f(w(t), x_i)}{\partial w}, \tag{8.2}$$
where $t \ge 0$ is a continuous time index. By the chain rule, the network output on $x_i$ evolves as $\frac{d f(w(t), x_i)}{dt} = \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{dw(t)}{dt} \right\rangle$, and substituting Equation (8.2) gives
$$\frac{d f(w(t), x_i)}{dt} = -\sum_{j=1}^n \left( f(w(t), x_j) - y_j \right) \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle. \tag{8.3}$$
Since $u(t) = (f(w(t), x_i))_{i \in [n]} \in \mathbb{R}^n$ is the vector of network outputs on all the $x_i$'s at time $t$, and $y = (y_i)_{i \in [n]}$ is the vector of desired outputs, Equation (8.3) can be written more compactly as
$$\frac{du(t)}{dt} = -H(t) \, (u(t) - y), \tag{8.4}$$
where $H(t) \in \mathbb{R}^{n \times n}$ is a kernel matrix defined by $[H(t)]_{i,j} = \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle$ for all $i, j \in [n]$.
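As a sanity check, Lemma 8.1.1 can be verified numerically. The JAX sketch below computes $H(t)$ as the Gram matrix of the per-example gradients $\partial f(w, x_i)/\partial w$ and compares one tiny Euler step of gradient descent against the dynamics predicted by Equation (8.1). The two-layer tanh network, its initialization scales, and the step size `dt` are illustrative assumptions, not taken from the text.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

# Hypothetical two-layer network f(w, x) = a . tanh(W x).
def f(w, x):
    W, a = w
    return jnp.dot(a, jnp.tanh(W @ x))

def loss(w, X, y):
    u = jax.vmap(lambda x: f(w, x))(X)
    return 0.5 * jnp.sum((u - y) ** 2)

def ntk_matrix(w, X):
    # Stack the flattened per-example gradients df(w, x_i)/dw into G, so that
    # H = G G^T has entries H_ij = <df/dw (x_i), df/dw (x_j)>.
    flat_grad = lambda x: ravel_pytree(jax.grad(f)(w, x))[0]
    G = jax.vmap(flat_grad)(X)
    return G @ G.T

d, m, n = 3, 64, 5
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
w = (jax.random.normal(k1, (m, d)) / jnp.sqrt(d),
     jax.random.normal(k2, (m,)) / jnp.sqrt(m))
X = jax.random.normal(k3, (n, d))
y = jax.random.normal(k4, (n,))

u = jax.vmap(lambda x: f(w, x))(X)
H = ntk_matrix(w, X)

# One tiny Euler step of gradient flow on the parameters ...
dt = 1e-4
g = jax.grad(loss)(w, X, y)
w_next = jax.tree_util.tree_map(lambda p, gp: p - dt * gp, w, g)
u_next = jax.vmap(lambda x: f(w_next, x))(X)

# ... should match the predicted dynamics du/dt = -H (u - y) to first order.
print(jnp.max(jnp.abs((u_next - u) / dt + H @ (u - y))))  # close to zero
```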
The statement of Lemma 8.1.1 involves a matrix $H(t)$. Below we define a neural network architecture whose width is allowed to go to infinity, while fixing the training data as above. In this limit, it can be shown that the matrix $H(t)$ remains constant during training, i.e., $H(t) = H(0)$ for all $t$.
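Before that formal definition, the claim can be illustrated empirically: for a hypothetical two-layer tanh network (not the architecture defined next), the relative change of $H$ over a fixed amount of training should shrink as the width $m$ grows. All sizes, learning rates, and step counts below are arbitrary illustrative choices.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

# Hypothetical two-layer network with 1/sqrt(m)-scaled output weights.
def make_net(m, key, d=3):
    kW, ka = jax.random.split(key)
    return (jax.random.normal(kW, (m, d)) / jnp.sqrt(d),
            jax.random.normal(ka, (m,)) / jnp.sqrt(m))

def f(w, x):
    W, a = w
    return jnp.dot(a, jnp.tanh(W @ x))

def ntk(w, X):
    flat_grad = lambda x: ravel_pytree(jax.grad(f)(w, x))[0]
    G = jax.vmap(flat_grad)(X)
    return G @ G.T

kx, ky, kw = jax.random.split(jax.random.PRNGKey(1), 3)
X = jax.random.normal(kx, (5, 3))
y = jax.random.normal(ky, (5,))

def loss(w):
    u = jax.vmap(lambda x: f(w, x))(X)
    return 0.5 * jnp.sum((u - y) ** 2)

for m in (10, 100, 1000):
    w = make_net(m, kw)
    H0 = ntk(w, X)
    for _ in range(200):  # Euler steps of gradient flow, total time 2
        g = jax.grad(loss)(w)
        w = jax.tree_util.tree_map(lambda p, gp: p - 1e-2 * gp, w, g)
    Ht = ntk(w, X)
    # Relative movement of the kernel matrix; expected to shrink with m.
    print(m, jnp.linalg.norm(Ht - H0) / jnp.linalg.norm(H0))
```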