
the input. Given a training dataset $\{(x_i, y_i)\}_{i=1}^{n} \subset \mathbb{R}^d \times \mathbb{R}$, consider
training the neural network by minimizing the squared loss over
training data:
\[
\ell(w) = \frac{1}{2} \sum_{i=1}^{n} \left( f(w, x_i) - y_i \right)^2 .
\]

For simplicity, in this chapter we study gradient flow, a.k.a. gradient
descent with an infinitesimally small learning rate. In this case, the
dynamics can be described by an ordinary differential equation
(ODE):

\[
\frac{dw(t)}{dt} = -\nabla \ell(w(t)).
\]
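As a concrete illustration, here is a minimal sketch that simulates this gradient flow by explicit Euler steps with a small step size. The two-layer ReLU network f, the random data, the width m, and the step size eta are illustrative assumptions of this sketch, not the architecture defined later in the chapter.

import jax
import jax.numpy as jnp

def f(w, x):
    # toy two-layer ReLU network f(w, x) -> scalar (illustrative choice)
    W1, W2 = w
    return jnp.dot(W2, jax.nn.relu(W1 @ x))

def loss(w, xs, ys):
    # l(w) = 1/2 * sum_i (f(w, x_i) - y_i)^2
    preds = jax.vmap(lambda x: f(w, x))(xs)
    return 0.5 * jnp.sum((preds - ys) ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
n, d, m = 8, 5, 512                      # n training points, input dimension d, width m
xs = jax.random.normal(k1, (n, d))
ys = jax.random.normal(k2, (n,))
w0 = (jax.random.normal(k3, (m, d)) / jnp.sqrt(d),   # W1 at initialization
      jax.random.normal(k4, (m,)) / jnp.sqrt(m))     # W2 at initialization

eta = 1e-3                               # small step size standing in for "infinitesimal"
grad_loss = jax.grad(loss)
w = w0
for _ in range(500):
    # Euler step for dw/dt = -grad l(w):  w <- w - eta * grad l(w)
    w = jax.tree_util.tree_map(lambda p, g: p - eta * g, w, grad_loss(w, xs, ys))

As eta tends to zero, the iterates trace out the gradient-flow trajectory w(t).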

Note that this describes the dynamics of the parameters. The following
lemma describes the dynamics of the predictions on the training data points.

Lemma 8.1.1. Let $u(t) = \left( f(w(t), x_i) \right)_{i \in [n]} \in \mathbb{R}^n$ be the network outputs
on all $x_i$'s at time $t$, and let $y = (y_i)_{i \in [n]}$ be the labels. Then $u(t)$ follows the
evolution
\[
\frac{du(t)}{dt} = -H(t) \cdot (u(t) - y), \tag{8.1}
\]
where $H(t)$ is an $n \times n$ positive semidefinite matrix whose $(i, j)$-th entry is
$\left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle$.
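Continuing the illustrative setup sketched above (f, w, xs, ys are the sketch's assumptions, not the chapter's architecture), $H(t)$ is simply the Gram matrix of the per-example gradients, so the right-hand side of Equation (8.1) can be formed directly.

def flat_grad(w, x):
    # d f(w, x) / d w, with all parameter blocks flattened into one vector
    g = jax.grad(f)(w, x)
    return jnp.concatenate([gi.ravel() for gi in jax.tree_util.tree_leaves(g)])

def kernel(w, xs):
    # H with H[i, j] = < d f(w, x_i)/dw , d f(w, x_j)/dw >, an n x n PSD matrix
    J = jax.vmap(lambda x: flat_grad(w, x))(xs)   # rows are per-example gradients
    return J @ J.T

u = jax.vmap(lambda x: f(w, x))(xs)               # u(t): predictions on all x_i
du_dt = -kernel(w, xs) @ (u - ys)                 # right-hand side of Equation (8.1)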

Proof of Lemma 8.1.1. The parameters $w$ evolve according to the differential
equation
\[
\frac{dw(t)}{dt} = -\nabla \ell(w(t)) = -\sum_{i=1}^{n} \left( f(w(t), x_i) - y_i \right) \frac{\partial f(w(t), x_i)}{\partial w}, \tag{8.2}
\]
where $t \ge 0$ is a continuous time index. By the chain rule,
$\frac{d f(w(t), x_i)}{dt} = \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{dw(t)}{dt} \right\rangle$, so under Equation (8.2) the
evolution of the network output $f(w(t), x_i)$ can be written as
\[
\frac{d f(w(t), x_i)}{dt} = -\sum_{j=1}^{n} \left( f(w(t), x_j) - y_j \right) \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle. \tag{8.3}
\]
Since $u(t) = \left( f(w(t), x_i) \right)_{i \in [n]} \in \mathbb{R}^n$ is the vector of network outputs on all $x_i$'s
at time $t$ and $y = (y_i)_{i \in [n]}$ is the vector of desired outputs, Equation (8.3) can
be written more compactly as
\[
\frac{du(t)}{dt} = -H(t) \cdot (u(t) - y), \tag{8.4}
\]
where $H(t) \in \mathbb{R}^{n \times n}$ is the kernel matrix defined by $[H(t)]_{i,j} = \left\langle \frac{\partial f(w(t), x_i)}{\partial w}, \frac{\partial f(w(t), x_j)}{\partial w} \right\rangle$ for all $i, j \in [n]$.
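The identity in the proof can also be checked numerically in the same illustrative setup (f, kernel, grad_loss, u, eta as defined in the sketches above): after one small Euler step on the parameters, the change in the predictions should match $-\eta\, H(t)(u(t) - y)$ up to $O(\eta^2)$ error.

w_next = jax.tree_util.tree_map(lambda p, g: p - eta * g, w, grad_loss(w, xs, ys))
u_next = jax.vmap(lambda x: f(w_next, x))(xs)
lhs = (u_next - u) / eta                  # finite-difference approximation of du/dt
rhs = -kernel(w, xs) @ (u - ys)           # -H(t)(u(t) - y), as in Equation (8.4)
print(jnp.max(jnp.abs(lhs - rhs)))        # small when eta is small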

The statement of Lemma 8.1.1 involves a matrix $H(t)$. Below we
define a neural network architecture whose width is allowed to go
to infinity, while fixing the training data as above. In the limit, it can
be shown that the matrix $H(t)$ remains constant during training, i.e.,
it stays equal to its value at initialization, $H(0)$.
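As an informal illustration of this claim (not the infinite-width argument itself), one can compare, in the toy sketch above, the kernel at initialization with the kernel after training; the relative change tends to shrink as the width m grows.

H0 = kernel(w0, xs)                       # H(0), the kernel at initialization
Ht = kernel(w, xs)                        # the kernel after the Euler-discretized training above
rel_change = jnp.linalg.norm(Ht - H0) / jnp.linalg.norm(H0)
print(rel_change)                         # decreases as the width m is increased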
