
Ultra-wide Neural Networks and Neural Tangent Kernel

The neural network is defined recursively: for $h = 1, 2, \ldots, L$,
$$
f^{(h)}(x) = W^{(h)} g^{(h-1)}(x) \in \mathbb{R}^{d_h}, \qquad
g^{(h)}(x) = \sqrt{\frac{c_\sigma}{d_h}}\, \sigma\!\left(f^{(h)}(x)\right) \in \mathbb{R}^{d_h},
\tag{8.11}
$$

where $W^{(h)} \in \mathbb{R}^{d_h \times d_{h-1}}$ is the weight matrix in the $h$-th layer ($h \in [L]$), $\sigma : \mathbb{R} \to \mathbb{R}$ is a coordinate-wise activation function, and $c_\sigma = \left( \mathbb{E}_{z \sim \mathcal{N}(0,1)}\!\left[ \sigma(z)^2 \right] \right)^{-1}$. The last layer of the neural network is

$$
\begin{aligned}
f(w, x) = f^{(L+1)}(x) &= W^{(L+1)} \cdot g^{(L)}(x) \\
&= W^{(L+1)} \cdot \sqrt{\frac{c_\sigma}{d_L}}\, \sigma\!\left( W^{(L)} \cdot \sqrt{\frac{c_\sigma}{d_{L-1}}}\, \sigma\!\left( W^{(L-1)} \cdots \sqrt{\frac{c_\sigma}{d_1}}\, \sigma\!\left( W^{(1)} x \right) \right) \right),
\end{aligned}
$$

where $W^{(L+1)} \in \mathbb{R}^{1 \times d_L}$ contains the weights in the final layer, and $w = \left( W^{(1)}, \ldots, W^{(L+1)} \right)$ represents all the parameters in the network.
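To make the recursion concrete, here is a minimal NumPy sketch of the forward pass in Equation (8.11). The widths, the choice of ReLU (for which $c_\sigma = 2$), and the helper name `forward` are illustrative assumptions, not from the text.

```python
import numpy as np

def forward(weights, x, sigma=lambda z: np.maximum(z, 0.0), c_sigma=2.0):
    """Forward pass of Eq. (8.11).

    weights -- list [W^(1), ..., W^(L+1)]; the last entry is the 1 x d_L output layer.
    c_sigma = 2 equals (E_{z~N(0,1)}[sigma(z)^2])^{-1} for the ReLU activation.
    """
    g = x                                                # g^(0)(x) = x
    for W in weights[:-1]:                               # hidden layers h = 1, ..., L
        f = W @ g                                        # pre-activation f^(h)(x)
        g = np.sqrt(c_sigma / f.shape[0]) * sigma(f)     # g^(h)(x), scaled by sqrt(c_sigma/d_h)
    return weights[-1] @ g                               # f(w, x) = W^(L+1) g^(L)(x)

# i.i.d. N(0,1) initialization, widths d_0 = 10, d_1 = d_2 = 1000
rng = np.random.default_rng(0)
dims = [10, 1000, 1000, 1]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
x = rng.standard_normal(dims[0])
print(forward(weights, x))
```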

We initialize all the weights to be i.i.d. $\mathcal{N}(0,1)$ random variables, and consider the limit of large hidden widths: $d_1, d_2, \ldots, d_L \to \infty$. The scaling factor $\sqrt{c_\sigma / d_h}$ in Equation (8.11) ensures that the norm of $g^{(h)}(x)$ for each $h \in [L]$ is approximately preserved at initialization (see [?]). In particular, for ReLU activation, we have
$$
\mathbb{E}\left[ \left\| g^{(h)}(x) \right\|_2^2 \right] = \| x \|_2^2 \qquad (\forall h \in [L]).
$$
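A quick numerical sanity check of one layer of this norm-preservation property for ReLU ($c_\sigma = 2$): conditional on $g^{(h-1)}(x)$, the expected squared norm of $g^{(h)}(x)$ equals $\|g^{(h-1)}(x)\|_2^2$. The widths and trial count below are arbitrary.

```python
import numpy as np

# Empirical check: for ReLU with c_sigma = 2, E||g^(h)(x)||_2^2 = ||g^(h-1)(x)||_2^2.
rng = np.random.default_rng(1)
d_prev, d_h, trials = 500, 500, 200
x = rng.standard_normal(d_prev)                  # stand-in for g^(h-1)(x)
sq_norms = []
for _ in range(trials):
    W = rng.standard_normal((d_h, d_prev))       # fresh N(0,1) layer
    g_h = np.sqrt(2.0 / d_h) * np.maximum(W @ x, 0.0)
    sq_norms.append(np.sum(g_h ** 2))
print(np.mean(sq_norms), np.sum(x ** 2))         # the two values should be close
```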

Recall from Lemma 8.1.1 that we need to compute the value that $\left\langle \frac{\partial f(w,x)}{\partial w}, \frac{\partial f(w,x')}{\partial w} \right\rangle$ converges to at random initialization in the infinite width limit. We can write the partial derivative with respect to a particular weight matrix $W^{(h)}$ in a compact form:

$$
\frac{\partial f(w, x)}{\partial W^{(h)}} = b^{(h)}(x) \cdot \left( g^{(h-1)}(x) \right)^\top, \qquad h = 1, 2, \ldots, L+1,
$$
where
$$
b^{(h)}(x) =
\begin{cases}
1 \in \mathbb{R}, & h = L+1, \\
\sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x) \left( W^{(h+1)} \right)^\top b^{(h+1)}(x) \in \mathbb{R}^{d_h}, & h = 1, \ldots, L,
\end{cases}
\tag{8.12}
$$
$$
D^{(h)}(x) = \mathrm{diag}\!\left( \dot{\sigma}\!\left( f^{(h)}(x) \right) \right) \in \mathbb{R}^{d_h \times d_h}, \qquad h = 1, \ldots, L. \tag{8.13}
$$
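The recursion (8.12)–(8.13) is backpropagation written out for this parameterization. Below is a minimal NumPy sketch, assuming ReLU, that computes the vectors $b^{(h)}(x)$ and checks the compact form $\frac{\partial f(w,x)}{\partial W^{(h)}} = b^{(h)}(x)\left(g^{(h-1)}(x)\right)^\top$ against a finite-difference gradient at one weight entry; all function names and dimensions are illustrative.

```python
import numpy as np

def forward_with_cache(weights, x, c_sigma=2.0):
    """Forward pass that also stores g^(h)(x) and the diagonal of D^(h)(x) = diag(sigma'(f^(h)(x)))."""
    gs, Ddiags, g = [x], [], x
    for W in weights[:-1]:
        f = W @ g
        Ddiags.append((f > 0).astype(float))             # ReLU derivative, Eq. (8.13)
        g = np.sqrt(c_sigma / f.shape[0]) * np.maximum(f, 0.0)
        gs.append(g)
    return (weights[-1] @ g).item(), gs, Ddiags

def backward_vectors(weights, Ddiags, c_sigma=2.0):
    """Backward recursion of Eq. (8.12): b^(L+1) = 1, b^(h) = sqrt(c_sigma/d_h) D^(h) (W^(h+1))^T b^(h+1)."""
    L = len(weights) - 1
    bs = [np.ones(1)]                                    # b^(L+1)(x) = 1
    for h in range(L, 0, -1):                            # h = L, ..., 1; weights[h] is W^(h+1)
        d_h = Ddiags[h - 1].shape[0]
        bs.append(np.sqrt(c_sigma / d_h) * Ddiags[h - 1] * (weights[h].T @ bs[-1]))
    return bs[::-1]                                      # [b^(1), ..., b^(L+1)]

# Check the compact form against a finite-difference gradient.
rng = np.random.default_rng(2)
dims = [5, 40, 40, 1]
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
x = rng.standard_normal(dims[0])
out, gs, Ddiags = forward_with_cache(weights, x)
bs = backward_vectors(weights, Ddiags)

h, i, j = 2, 3, 7                                        # layer h = 2, weight entry (i, j)
grad = np.outer(bs[h - 1], gs[h - 1])                    # b^(h)(x) (g^(h-1)(x))^T
eps = 1e-6
weights[h - 1][i, j] += eps
out_pert, _, _ = forward_with_cache(weights, x)
print(grad[i, j], (out_pert - out) / eps)                # the two values should agree
```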

Then, for any two inputs $x$ and $x'$, and any $h \in [L+1]$, we can compute
$$
\begin{aligned}
\left\langle \frac{\partial f(w,x)}{\partial W^{(h)}}, \frac{\partial f(w,x')}{\partial W^{(h)}} \right\rangle
&= \left\langle b^{(h)}(x) \cdot \left( g^{(h-1)}(x) \right)^\top, \; b^{(h)}(x') \cdot \left( g^{(h-1)}(x') \right)^\top \right\rangle \\
&= \left\langle g^{(h-1)}(x), g^{(h-1)}(x') \right\rangle \cdot \left\langle b^{(h)}(x), b^{(h)}(x') \right\rangle.
\end{aligned}
$$
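The second equality is just the rank-one structure of the gradients: for any vectors, $\langle u v^\top, u' v'^\top \rangle_F = \langle u, u' \rangle \langle v, v' \rangle$. A self-contained numerical check with stand-in vectors (the dimensions are arbitrary):

```python
import numpy as np

# Rank-one factorization of the Frobenius inner product: <u v^T, u' v'^T>_F = <u,u'> <v,v'>.
rng = np.random.default_rng(3)
b, g = rng.standard_normal(40), rng.standard_normal(30)      # stand-ins for b^(h)(x), g^(h-1)(x)
b2, g2 = rng.standard_normal(40), rng.standard_normal(30)    # stand-ins for b^(h)(x'), g^(h-1)(x')

lhs = np.sum(np.outer(b, g) * np.outer(b2, g2))              # Frobenius inner product of the gradients
rhs = (g @ g2) * (b @ b2)                                    # factorized form
print(lhs, rhs)                                              # equal up to floating-point error
```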
