
Theory of Deep Learning, 2022


backpropagation and its variants 27

Base Case: At the output layer this is true, since ∂ f /∂ f = 1.

Inductive step: Suppose the claim is true for layers t + 1 and higher, and z is at layer t, with outgoing edges to some nodes u_1, u_2, . . . , u_m at levels t + 1 or higher. By the inductive hypothesis, node z indeed receives

∂f/∂u_j × ∂u_j/∂z

from each of the u_j. Thus, by the chain rule,

S = ∑_{j=1}^{m} ∂f/∂u_j × ∂u_j/∂z = ∂f/∂z.

This completes the induction and proves the Main Claim.
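The chain-rule sum in the inductive step can be checked numerically on a toy graph. The following sketch (an illustration, not from the text; the functions z², sin, and the product output are arbitrary choices) lets one node z feed two nodes u_1, u_2, forms the two messages ∂f/∂u_j × ∂u_j/∂z, and verifies that their sum S equals ∂f/∂z:

```python
import math

# Toy DAG: z feeds u1 = z**2 and u2 = sin(z), which feed the output
# f = u1 * u2. (An arbitrary example chosen for illustration.)
z = 1.3
u1, u2 = z**2, math.sin(z)

# Messages that z receives from u1 and u2, per the Main Claim:
#   (∂f/∂u_j) × (∂u_j/∂z)
msg_from_u1 = u2 * (2 * z)        # ∂f/∂u1 = u2,   ∂u1/∂z = 2z
msg_from_u2 = u1 * math.cos(z)    # ∂f/∂u2 = u1,   ∂u2/∂z = cos z
S = msg_from_u1 + msg_from_u2

# Direct derivative of f(z) = z**2 * sin(z), for comparison.
df_dz = 2 * z * math.sin(z) + z**2 * math.cos(z)

assert abs(S - df_dz) < 1e-12     # the messages sum to ∂f/∂z
```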

3.3 Auto-differentiation

Since the exposition above used almost no details about the network and the operations that the nodes perform, it extends to every computation that can be organized as an acyclic graph, each of whose nodes computes a differentiable function of its incoming neighbors. This observation underlies many auto-differentiation packages found in deep learning environments: they allow computing the gradient of the output of such a computation with respect to the network parameters.

We first observe that Claim 3.1.1 continues to hold in this very general setting. This is without loss of generality because we can view the parameters associated with the edges as also sitting on the nodes (specifically, leaf nodes). This can be done via a simple transformation of the network; for a single node it is shown in the picture below, and one would need to continue this transformation in the rest of the network feeding into u_1, u_2, etc. from below.
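As a concrete instance of this transformation (the picture itself is not reproduced here; the function name `mult` and the dictionary of local partials are illustrative assumptions, not notation from the text), an edge weight w on the edge z → u, with u = w · z, becomes a leaf node feeding an ordinary two-input product node:

```python
def mult(w, z):
    # After the transformation, the edge weight w is an ordinary (leaf)
    # input node, and u = w * z is a plain two-input product node.
    value = w * z
    # Local partials the messaging protocol will need at this node:
    return value, {"w": z, "z": w}   # ∂u/∂w = z,  ∂u/∂z = w

u, partials = mult(3.0, 2.0)
assert u == 6.0 and partials == {"w": 2.0, "z": 3.0}
```

Claim 3.1.1 then delivers ∂f/∂w to the leaf node w just as it does to any other node.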

Then we can use the messaging protocol to compute the derivatives with respect to the nodes, as long as the local partial derivatives can be computed efficiently. We note that the algorithm can be implemented in a fairly modular manner: for every node u, it suffices to specify (a) how u depends on the incoming nodes, say z_1, . . . , z_n, and (b) how to compute the partial derivative times S, that is, S · ∂u/∂z_j.
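This modular interface can be sketched as follows (a minimal illustration; the class names `Node`, `Sigmoid`, `Mult` and the `forward`/`backward` method names are assumptions for this sketch, not an API from the text). Each node supplies (a) its forward dependence on its inputs and (b) the products S · ∂u/∂z_j:

```python
import math

class Node:
    """A node u specifies (a) how u depends on its inputs z_1..z_n and
    (b) how to form S * ∂u/∂z_j for each input z_j."""
    def forward(self, *zs): ...
    def backward(self, S, *zs): ...   # returns [S * ∂u/∂z_j for each j]

class Sigmoid(Node):
    def forward(self, z):
        return 1.0 / (1.0 + math.exp(-z))
    def backward(self, S, z):
        u = self.forward(z)
        return [S * u * (1.0 - u)]    # S * ∂u/∂z, since σ' = σ(1 - σ)

class Mult(Node):
    def forward(self, z1, z2):
        return z1 * z2
    def backward(self, S, z1, z2):
        return [S * z2, S * z1]       # S * ∂u/∂z1,  S * ∂u/∂z2
```

Composing f = σ(z_1 · z_2), the output node passes S = ∂f/∂u into the product node's `backward`, which returns ∂f/∂z_1 and ∂f/∂z_2 without either node knowing anything about the rest of the graph.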
