Theory of Deep Learning, 2022
Extension to vector messages: In fact, (b) can be done efficiently in more general settings where we allow the output of each node in the network to be a vector (or even a matrix or tensor) instead of only a real number. Here we need to replace $\frac{\partial u}{\partial z_j} \cdot S$ by $\frac{\partial u}{\partial z_j}[S]$, which denotes the result of applying the operator $\frac{\partial u}{\partial z_j}$ to $S$. We note that, to be consistent with the convention in the usual exposition of backpropagation, when $y \in \mathbb{R}^p$ is a function of $x \in \mathbb{R}^q$, we use $\frac{\partial y}{\partial x}$ to denote the $q \times p$ dimensional matrix with $\partial y_j / \partial x_i$ as the $(i,j)$-th entry. Readers might notice that this is the transpose of the usual Jacobian matrix defined in mathematics. Thus $\frac{\partial y}{\partial x}$ is an operator that maps $\mathbb{R}^p$ to $\mathbb{R}^q$, and we can verify that $S$ has the same dimension as $u$ and $\frac{\partial u}{\partial z_j}[S]$ has the same dimension as $z_j$.
For example, as illustrated in Figure 3.3, suppose the node $U \in \mathbb{R}^{d_1 \times d_3}$ is the product of two matrices $W \in \mathbb{R}^{d_1 \times d_2}$ and $Z \in \mathbb{R}^{d_2 \times d_3}$, that is, $U = WZ$. Then $\frac{\partial U}{\partial Z}$ is a linear operator that maps $\mathbb{R}^{d_1 \times d_3}$ to $\mathbb{R}^{d_2 \times d_3}$, which naively requires a matrix representation of dimension $d_2 d_3 \times d_1 d_3$. However, the computation (b) can be done efficiently because
$$\frac{\partial U}{\partial Z}[S] = W^\top S.$$
Such vector operations can also be implemented efficiently using
today’s GPUs.
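To see the savings concretely, here is a small NumPy sketch (an illustration, not the book's code) that compares the operator form $W^\top S$ against the naively materialized Jacobian for $U = WZ$ with illustrative dimensions $d_1, d_2, d_3$:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d3 = 3, 4, 5  # illustrative dimensions

W = rng.standard_normal((d1, d2))
Z = rng.standard_normal((d2, d3))
S = rng.standard_normal((d1, d3))  # message with the same shape as U = W @ Z

# Efficient operator form: dU/dZ applied to S is just one matrix product.
efficient = W.T @ S

# Naive form: materialize dU/dZ as a (d2*d3) x (d1*d3) matrix whose
# entry at row (k, l), column (i, j) is dU[i, j]/dZ[k, l] = W[i, k] if j == l else 0.
J = np.zeros((d2 * d3, d1 * d3))
for k in range(d2):
    for l in range(d3):
        for i in range(d1):
            J[k * d3 + l, i * d3 + l] = W[i, k]

naive = (J @ S.reshape(-1)).reshape(d2, d3)
assert np.allclose(efficient, naive)
```

The operator form costs $O(d_1 d_2 d_3)$ arithmetic and never stores the $d_2 d_3 \times d_1 d_3$ matrix.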
Figure 3.3: Vector version of the example above.
3.4 Notable Extensions
Allowing weight tying: In many neural architectures, the designer wants to force many network units, such as edges or nodes, to share the same parameter. For example, in the ubiquitous convolutional net, the same filter has to be applied all over the image, which implies reusing the same parameter for a large set of edges between two layers of the net.
For simplicity, suppose two parameters $a$ and $b$ are constrained to share the same value. This is equivalent to adding a new node $u$ and connecting $u$ to both $a$ and $b$ with the operations $a = u$ and $b = u$. Thus, by the chain rule,
$$\frac{\partial f}{\partial u} = \frac{\partial f}{\partial a} \cdot \frac{\partial a}{\partial u} + \frac{\partial f}{\partial b} \cdot \frac{\partial b}{\partial u} = \frac{\partial f}{\partial a} + \frac{\partial f}{\partial b}.$$
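In other words, the gradient with respect to a tied parameter is the sum of the gradients of its copies. The following sketch (with a hypothetical function $f(a, b) = ab + b^2$ chosen only for illustration) checks this against a finite-difference estimate of $\frac{d}{du} f(u, u)$:

```python
# Hypothetical example function of two parameters: f(a, b) = a*b + b^2.
f = lambda a, b: a * b + b ** 2
df_da = lambda a, b: b           # partial f / partial a
df_db = lambda a, b: a + 2 * b   # partial f / partial b

u = 1.5  # shared value: a = u and b = u

# Weight tying: df/du is the sum of the two partial derivatives.
tied_grad = df_da(u, u) + df_db(u, u)

# Central-difference check of d f(u, u) / d u.
eps = 1e-6
numeric = (f(u + eps, u + eps) - f(u - eps, u - eps)) / (2 * eps)
assert abs(tied_grad - numeric) < 1e-5
```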