Note the first term $\left\langle g^{(h-1)}(x), g^{(h-1)}(x')\right\rangle$ is the covariance between $x$ and $x'$ at the $h$-th layer. When the width goes to infinity, $\left\langle g^{(h-1)}(x), g^{(h-1)}(x')\right\rangle$ converges to a fixed number, which we denote as $\Sigma^{(h-1)}(x, x')$. This covariance admits a recursive formula: for $h \in [L]$,
\begin{align}
\Sigma^{(0)}(x, x') &= x^\top x', \nonumber\\
\Lambda^{(h)}(x, x') &= \begin{pmatrix}
\Sigma^{(h-1)}(x, x) & \Sigma^{(h-1)}(x, x') \\
\Sigma^{(h-1)}(x', x) & \Sigma^{(h-1)}(x', x')
\end{pmatrix} \in \mathbb{R}^{2 \times 2}, \tag{8.14}\\
\Sigma^{(h)}(x, x') &= c_\sigma\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}\left[\sigma(u)\, \sigma(v)\right]. \nonumber
\end{align}
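To make the recursion concrete, here is a minimal numerical sketch of Equation (8.14). It assumes a ReLU activation with $c_\sigma = 2$ (so that $\Sigma^{(h)}(x, x) = 1$ for unit-norm inputs); the Gaussian expectation is estimated by Monte Carlo, and the function names and sample size are illustrative, not from the text.

```python
import numpy as np

def sigma(u):
    return np.maximum(u, 0.0)  # ReLU; one possible choice of activation

C_SIGMA = 2.0  # normalizing constant c_sigma; equals 2 for ReLU

def next_sigma(s_xx, s_xy, s_yy, n_samples=200_000, seed=None):
    """One step of Equation (8.14): given the entries of Lambda^(h), estimate
    Sigma^(h)(x, x') = c_sigma * E_{(u,v) ~ N(0, Lambda^(h))}[sigma(u) sigma(v)]."""
    rng = np.random.default_rng(seed)
    lam = np.array([[s_xx, s_xy], [s_xy, s_yy]])
    uv = rng.multivariate_normal(np.zeros(2), lam, size=n_samples)
    return C_SIGMA * np.mean(sigma(uv[:, 0]) * sigma(uv[:, 1]))

def sigma_depth(x, xp, depth):
    """Run the recursion from Sigma^(0)(x, x') = x^T x' up to the given depth."""
    s_xx, s_xy, s_yy = x @ x, x @ xp, xp @ xp
    for _ in range(depth):
        # The right-hand side is evaluated before assignment, so all three
        # entries are updated from the same Lambda^(h).
        s_xx, s_xy, s_yy = (next_sigma(s_xx, s_xx, s_xx),
                            next_sigma(s_xx, s_xy, s_yy),
                            next_sigma(s_yy, s_yy, s_yy))
    return s_xy

print(sigma_depth(np.array([1.0, 0.0]), np.array([0.6, 0.8]), depth=3))
```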
Now we proceed to derive this formula. The intuition is that $\left[f^{(h+1)}(x)\right]_i = \sum_{j=1}^{d_h} \left[W^{(h+1)}\right]_{i,j} \left[g^{(h)}(x)\right]_j$ is a centered Gaussian process conditioned on $f^{(h)}$ ($\forall i \in [d_{h+1}]$), with covariance
\begin{align}
\mathbb{E}\left[\left[f^{(h+1)}(x)\right]_i \cdot \left[f^{(h+1)}(x')\right]_i \,\middle|\, f^{(h)}\right]
&= \left\langle g^{(h)}(x), g^{(h)}(x')\right\rangle \nonumber\\
&= \frac{c_\sigma}{d_h} \sum_{j=1}^{d_h} \sigma\left(\left[f^{(h)}(x)\right]_j\right) \sigma\left(\left[f^{(h)}(x')\right]_j\right), \tag{8.15}
\end{align}
which converges to $\Sigma^{(h)}(x, x')$ as $d_h \to \infty$, given that each $\left[f^{(h)}\right]_j$ is a centered Gaussian process with covariance $\Sigma^{(h-1)}$. This yields the inductive definition in Equation (8.14).
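The convergence in Equation (8.15) is also easy to observe numerically. The sketch below (a ReLU network with $c_\sigma = 2$ again; widths, depth, and inputs are arbitrary choices) propagates a pair of inputs through the same random network and prints the empirical covariance $\left\langle g^{(h)}(x), g^{(h)}(x')\right\rangle$, which stabilizes as the width grows:

```python
import numpy as np

def forward_pair(x, xp, widths, seed=0):
    """Propagate x and x' jointly through one random ReLU network:
    f^(h+1) = W^(h+1) g^(h) and g^(h) = sqrt(c_sigma / d_h) * sigma(f^(h)),
    with i.i.d. N(0, 1) weights shared between the two inputs."""
    rng = np.random.default_rng(seed)
    gs = np.stack([x, xp], axis=1)  # columns hold g^(h)(x) and g^(h)(x')
    for d in widths:
        W = rng.standard_normal((d, gs.shape[0]))
        gs = np.sqrt(2.0 / d) * np.maximum(W @ gs, 0.0)  # c_sigma = 2 for ReLU
    return gs[:, 0], gs[:, 1]

x, xp = np.array([1.0, 0.0]), np.array([0.6, 0.8])
for d in [10, 100, 1_000, 10_000]:
    gx, gxp = forward_pair(x, xp, widths=[d, d])
    print(d, gx @ gxp)  # empirical <g^(2)(x), g^(2)(x')>; concentrates as d grows
```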
Next we deal with the second term $\left\langle b^{(h)}(x), b^{(h)}(x')\right\rangle$. From Equation (8.12) we get
\begin{align}
\left\langle b^{(h)}(x), b^{(h)}(x')\right\rangle
= \left\langle \sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x) \left(W^{(h+1)}\right)^\top b^{(h+1)}(x),\;
\sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x') \left(W^{(h+1)}\right)^\top b^{(h+1)}(x') \right\rangle. \tag{8.16}
\end{align}
Although $W^{(h+1)}$ and $b^{(h+1)}(x)$ are dependent, the Gaussian initialization of $W^{(h+1)}$ allows us to replace $W^{(h+1)}$ with a fresh sample $\widetilde{W}^{(h+1)}$ without changing the limit (see [?] for the precise proof):
\begin{align*}
&\left\langle \sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x) \left(W^{(h+1)}\right)^\top b^{(h+1)}(x),\;
\sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x') \left(W^{(h+1)}\right)^\top b^{(h+1)}(x') \right\rangle \\
&\qquad \approx \left\langle \sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x) \left(\widetilde{W}^{(h+1)}\right)^\top b^{(h+1)}(x),\;
\sqrt{\frac{c_\sigma}{d_h}}\, D^{(h)}(x') \left(\widetilde{W}^{(h+1)}\right)^\top b^{(h+1)}(x') \right\rangle \\
&\qquad \to \frac{c_\sigma}{d_h} \operatorname{tr}\!\left(D^{(h)}(x)\, D^{(h)}(x')\right) \left\langle b^{(h+1)}(x), b^{(h+1)}(x')\right\rangle \\
&\qquad \to \dot{\Sigma}^{(h)}(x, x') \left\langle b^{(h+1)}(x), b^{(h+1)}(x')\right\rangle.
\end{align*}
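Here the last step uses that $D^{(h)}(x)$ is the diagonal matrix of activation derivatives $\sigma'\!\left(\left[f^{(h)}(x)\right]_j\right)$, so by the law of large numbers $\frac{c_\sigma}{d_h} \operatorname{tr}\!\left(D^{(h)}(x)\, D^{(h)}(x')\right) \to c_\sigma\, \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Lambda^{(h)})}\left[\sigma'(u)\, \sigma'(v)\right] = \dot{\Sigma}^{(h)}(x, x')$. A Monte Carlo sketch of this backward factor, under the same illustrative ReLU and $c_\sigma = 2$ assumptions as above:

```python
import numpy as np

def sigma_prime(u):
    return (u > 0).astype(float)  # derivative of ReLU (almost everywhere)

def dot_sigma(s_xx, s_xy, s_yy, n_samples=200_000, seed=None):
    """Monte Carlo estimate of dotSigma^(h)(x, x') =
    c_sigma * E_{(u,v) ~ N(0, Lambda^(h))}[sigma'(u) sigma'(v)], with c_sigma = 2."""
    rng = np.random.default_rng(seed)
    lam = np.array([[s_xx, s_xy], [s_xy, s_yy]])
    uv = rng.multivariate_normal(np.zeros(2), lam, size=n_samples)
    return 2.0 * np.mean(sigma_prime(uv[:, 0]) * sigma_prime(uv[:, 1]))

# Unrolling <b^(h)(x), b^(h)(x')> -> dotSigma^(h)(x, x') <b^(h+1)(x), b^(h+1)(x')>
# layer by layer expresses the backward term as a product of dotSigma factors,
# one per layer above h.
print(dot_sigma(1.0, 0.6, 1.0))  # e.g. the factor for unit-norm inputs with x^T x' = 0.6
```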