Theory of Deep Learning, 2022

Extension to vector messages: In fact (b) can be done efficiently in more general settings where we allow the output of each node in the network to be a vector (or even a matrix/tensor) instead of only a real number. Here we need to replace ∂u/∂z_j · S by ∂u/∂z_j [S], which denotes the result of applying the operator ∂u/∂z_j to S. We note that, to be consistent with the convention in the usual exposition of backpropagation, when y ∈ R^p is a function of x ∈ R^q, we use ∂y/∂x to denote the q × p dimensional matrix with ∂y_j/∂x_i as the (i, j)-th entry. Readers might notice that this is the transpose of the usual Jacobian matrix defined in mathematics. Thus ∂y/∂x is an operator that maps R^p to R^q, and we can verify that S has the same dimension as u and that ∂u/∂z_j [S] has the same dimension as z_j.

For example, as illustrated below, suppose the node U ∈ R^{d1×d3} is a product of two matrices W ∈ R^{d2×d3} and Z ∈ R^{d1×d2}. Then ∂U/∂Z is a linear operator that maps R^{d1×d3} to R^{d1×d2}, which naively requires a matrix representation of dimension d1d2 × d1d3. However, the computation (b) can be done efficiently because

∂U/∂Z [S] = S W⊤.

Such vector operations can also be implemented efficiently using

today’s GPUs.
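As a concrete sketch of this point (the variable names and dimensions below are illustrative choices, not taken from the text), the efficient rule ∂U/∂Z [S] = S W⊤ can be checked in NumPy against the naively materialized Jacobian, which is quadratically larger:

```python
import numpy as np

# Illustrative dimensions (assumed for this sketch)
rng = np.random.default_rng(0)
d1, d2, d3 = 4, 5, 3
Z = rng.standard_normal((d1, d2))
W = rng.standard_normal((d2, d3))
S = rng.standard_normal((d1, d3))   # same shape as U = Z @ W

# Efficient application of the operator: dU/dZ [S] = S W^T
msg_fast = S @ W.T                  # shape (d1, d2), same as Z

# Naive alternative: materialize the (d1*d2) x (d1*d3) Jacobian,
# whose entries are dU_{i',k}/dZ_{i,j} = delta_{i,i'} * W[j, k]
J = np.zeros((d1 * d2, d1 * d3))
for i in range(d1):
    for j in range(d2):
        for k in range(d3):
            J[i * d2 + j, i * d3 + k] = W[j, k]
msg_naive = (J @ S.reshape(-1)).reshape(d1, d2)

assert np.allclose(msg_fast, msg_naive)
```

The fast path costs O(d1·d2·d3) time and never forms the d1d2 × d1d3 matrix, which is exactly why backpropagation on matrix-valued nodes stays cheap.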

Figure 3.3: Vector version of the above.

3.4 Notable Extensions

Allowing weight tying: In many neural architectures, the designer wants to force many network units such as edges or nodes to share the same parameter. For example, in the ubiquitous convolutional net, the same filter has to be applied all over the image, which implies reusing the same parameter for a large set of edges between two layers of the net.

For simplicity, suppose two parameters a and b are supposed to share the same value. This is equivalent to adding a new node u and connecting u to both a and b with the operations a = u and b = u. Thus, by the chain rule,

∂f/∂u = ∂f/∂a · ∂a/∂u + ∂f/∂b · ∂b/∂u = ∂f/∂a + ∂f/∂b.
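A minimal numerical sanity check of this identity (using a hypothetical function f chosen for illustration, not one from the text): tying a = b = u and differentiating f(u, u) directly recovers exactly the sum ∂f/∂a + ∂f/∂b.

```python
# Hypothetical example: f(a, b) = a*b + a, with the tie a = b = u,
# so f(u, u) = u**2 + u and df/du should equal df/da + df/db = (b + 1) + a.

def f(a, b):
    return a * b + a

def grad_tied(u, eps=1e-6):
    # Central finite difference of f(u, u) with respect to the shared u
    return (f(u + eps, u + eps) - f(u - eps, u - eps)) / (2 * eps)

u = 1.5
df_da = u + 1.0   # partial of f w.r.t. a, evaluated at a = b = u
df_db = u         # partial of f w.r.t. b, evaluated at a = b = u
assert abs(grad_tied(u) - (df_da + df_db)) < 1e-4
```

This is why automatic-differentiation frameworks simply accumulate (add) gradients into a shared parameter each time it is used, which handles convolutional weight sharing with no special-casing.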
