neural network recursively, for $h = 1, 2, \ldots, L$:
$$
f^{(h)}(x) = W^{(h)}\, g^{(h-1)}(x) \in \mathbb{R}^{d_h}, \qquad g^{(h)}(x) = \sqrt{\frac{c_\sigma}{d_h}}\, \sigma\!\left(f^{(h)}(x)\right) \in \mathbb{R}^{d_h}, \tag{8.11}
$$
where $W^{(h)} \in \mathbb{R}^{d_h \times d_{h-1}}$ is the weight matrix in the $h$-th layer ($h \in [L]$), $\sigma : \mathbb{R} \to \mathbb{R}$ is a coordinate-wise activation function, and $c_\sigma = \left( \mathbb{E}_{z \sim \mathcal{N}(0,1)}\left[ \sigma(z)^2 \right] \right)^{-1}$. The last layer of the neural network is
$$
\begin{aligned}
f(w, x) = f^{(L+1)}(x) &= W^{(L+1)} \cdot g^{(L)}(x) \\
&= \sqrt{\frac{c_\sigma}{d_L}}\, W^{(L+1)} \cdot \sigma\!\left( W^{(L)} \cdot \sqrt{\frac{c_\sigma}{d_{L-1}}}\, \sigma\!\left( W^{(L-1)} \cdots \sqrt{\frac{c_\sigma}{d_1}}\, \sigma\!\left( W^{(1)} x \right) \right) \right),
\end{aligned}
$$
where $W^{(L+1)} \in \mathbb{R}^{1 \times d_L}$ is the weight matrix in the final layer, and $w = \left( W^{(1)}, \ldots, W^{(L+1)} \right)$ represents all the parameters in the network.
We initialize all the weights to be i.i.d. $\mathcal{N}(0, 1)$ random variables, and consider the limit of large hidden widths: $d_1, d_2, \ldots, d_L \to \infty$. The scaling factor $\sqrt{c_\sigma / d_h}$ in Equation (8.11) ensures that the norm of $g^{(h)}(x)$ for each $h \in [L]$ is approximately preserved at initialization (see [?]). In particular, for ReLU activation, we have $\mathbb{E}\left[ \left\| g^{(h)}(x) \right\|_2^2 \right] = \|x\|_2^2$ for all $h \in [L]$.
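As a concrete, purely illustrative sketch of the recursion in Equation (8.11), the following NumPy snippet initializes a ReLU network with i.i.d. $\mathcal{N}(0,1)$ weights and checks numerically that, at large widths, $\|g^{(h)}(x)\|_2^2$ stays close to $\|x\|_2^2$ at every layer; for ReLU, $c_\sigma = 2$. The function and variable names (e.g. `forward`) are our own, not from the text.

```python
import numpy as np

C_SIGMA = 2.0  # for ReLU: c_sigma = (E_{z~N(0,1)}[sigma(z)^2])^{-1} = 2

def forward(Ws, x):
    """Forward recursion of Eq. (8.11) with ReLU; returns [g^(0), ..., g^(L)]."""
    g, gs = x, [x]
    for W in Ws:                                   # W^(h) has shape (d_h, d_{h-1})
        f = W @ g                                  # f^(h)(x) = W^(h) g^(h-1)(x)
        g = np.sqrt(C_SIGMA / W.shape[0]) * np.maximum(f, 0.0)  # g^(h)(x)
        gs.append(g)
    return gs

rng = np.random.default_rng(0)
d = [10, 4096, 4096, 4096]                         # d_0 (input dim), d_1, ..., d_L
Ws = [rng.standard_normal((d[h], d[h - 1])) for h in range(1, len(d))]

x = rng.standard_normal(d[0])
gs = forward(Ws, x)
# At large widths each squared norm should be close to ||x||_2^2.
print([round(float(g @ g), 2) for g in gs])
```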
Recall from Lemma 8.1.1 that we need to compute the value that $\left\langle \frac{\partial f(w,x)}{\partial w}, \frac{\partial f(w,x')}{\partial w} \right\rangle$ converges to at random initialization in the infinite width limit. We can write the partial derivative with respect to a particular weight matrix $W^{(h)}$ in a compact form:
$$
\frac{\partial f(w, x)}{\partial W^{(h)}} = b^{(h)}(x) \cdot \left( g^{(h-1)}(x) \right)^{\!\top}, \qquad h = 1, 2, \ldots, L + 1,
$$
where
$$
b^{(h)}(x) =
\begin{cases}
1 \in \mathbb{R}, & h = L + 1, \\[4pt]
\sqrt{\dfrac{c_\sigma}{d_h}}\, D^{(h)}(x) \left( W^{(h+1)} \right)^{\!\top} b^{(h+1)}(x) \in \mathbb{R}^{d_h}, & h = 1, \ldots, L,
\end{cases}
\tag{8.12}
$$
$$
D^{(h)}(x) = \mathrm{diag}\!\left( \dot\sigma\!\left( f^{(h)}(x) \right) \right) \in \mathbb{R}^{d_h \times d_h}, \qquad h = 1, \ldots, L. \tag{8.13}
$$
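To make the backward recursion (8.12)-(8.13) concrete, here is a hypothetical NumPy sketch (our own names and setup, using ReLU, so that $\dot\sigma(u) = \mathbf{1}\{u > 0\}$ and $D^{(h)}(x)$ is a 0/1 diagonal matrix). It computes the vectors $b^{(h)}(x)$ from the top layer down, assembles $\partial f / \partial W^{(1)} = b^{(1)}(x)\,(g^{(0)}(x))^\top$, and compares one entry against a finite-difference approximation.

```python
import numpy as np

C_SIGMA = 2.0  # ReLU

def forward(Ws, W_out, x):
    """Eq. (8.11) with ReLU; returns f^(1..L), g^(0..L), and the output f(w, x)."""
    g, gs, fs = x, [x], []
    for W in Ws:
        f = W @ g
        g = np.sqrt(C_SIGMA / W.shape[0]) * np.maximum(f, 0.0)
        fs.append(f)
        gs.append(g)
    return fs, gs, float(W_out @ g)

def backward(Ws, W_out, fs):
    """Backward recursion (8.12)-(8.13); returns [b^(1), ..., b^(L)] (b^(L+1) = 1)."""
    L = len(Ws)
    bs = [None] * L
    v = W_out.reshape(-1)                          # (W^(L+1))^T b^(L+1), since b^(L+1) = 1
    for h in range(L, 0, -1):                      # h = L, L-1, ..., 1
        d_h = Ws[h - 1].shape[0]
        b = np.sqrt(C_SIGMA / d_h) * (fs[h - 1] > 0) * v   # b^(h) = sqrt(c/d_h) D^(h) v
        bs[h - 1] = b
        v = Ws[h - 1].T @ b                        # (W^(h))^T b^(h), used at layer h-1
    return bs

rng = np.random.default_rng(1)
d = [8, 64, 64]                                    # d_0, d_1, d_2  (L = 2)
Ws = [rng.standard_normal((d[h], d[h - 1])) for h in range(1, len(d))]
W_out = rng.standard_normal((1, d[-1]))            # W^(L+1)
x = rng.standard_normal(d[0])

fs, gs, fx = forward(Ws, W_out, x)
bs = backward(Ws, W_out, fs)
grad_W1 = np.outer(bs[0], gs[0])                   # df/dW^(1) = b^(1) (g^(0))^T

eps = 1e-5                                         # finite-difference check of one entry
Ws_pert = [W.copy() for W in Ws]
Ws_pert[0][3, 2] += eps
_, _, fx_pert = forward(Ws_pert, W_out, x)
print(grad_W1[3, 2], (fx_pert - fx) / eps)         # the two numbers should agree closely
```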
Then, for any two inputs $x$ and $x'$, and any $h \in [L + 1]$, we can compute
$$
\begin{aligned}
\left\langle \frac{\partial f(w, x)}{\partial W^{(h)}}, \frac{\partial f(w, x')}{\partial W^{(h)}} \right\rangle
&= \left\langle b^{(h)}(x) \cdot \left( g^{(h-1)}(x) \right)^{\!\top},\; b^{(h)}(x') \cdot \left( g^{(h-1)}(x') \right)^{\!\top} \right\rangle \\
&= \left\langle g^{(h-1)}(x), g^{(h-1)}(x') \right\rangle \cdot \left\langle b^{(h)}(x), b^{(h)}(x') \right\rangle.
\end{aligned}
$$
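The last step uses the fact that the Frobenius inner product of two rank-one matrices factorizes: $\langle u v^\top, u' v'^\top \rangle = \langle u, u' \rangle \langle v, v' \rangle$. A tiny, purely illustrative NumPy check of this identity (the vectors below are arbitrary stand-ins, not an actual network's $b^{(h)}$ and $g^{(h-1)}$):

```python
import numpy as np

rng = np.random.default_rng(2)
b, b2 = rng.standard_normal(5), rng.standard_normal(5)   # stand-ins for b^(h)(x), b^(h)(x')
g, g2 = rng.standard_normal(7), rng.standard_normal(7)   # stand-ins for g^(h-1)(x), g^(h-1)(x')

lhs = np.sum(np.outer(b, g) * np.outer(b2, g2))   # Frobenius inner product of the two gradients
rhs = (g @ g2) * (b @ b2)                         # <g, g'> * <b, b'>
print(np.isclose(lhs, rhs))                       # True
```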