Base Case: At the output layer this is true, since $\partial f/\partial f = 1$.
Inductive step: Suppose the claim is true for layers $t+1$ and higher, and suppose $z$ is at layer $t$, with outgoing edges going to some nodes $u_1, u_2, \ldots, u_m$ at layers $t+1$ or higher. By the inductive hypothesis, node $z$ receives $\frac{\partial f}{\partial u_j} \times \frac{\partial u_j}{\partial z}$ from each $u_j$. Thus by the chain rule,
$$S \;=\; \sum_{j=1}^{m} \frac{\partial f}{\partial u_j} \cdot \frac{\partial u_j}{\partial z} \;=\; \frac{\partial f}{\partial z}.$$
This completes the induction and proves the Main Claim.
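To see the claim in action, here is a small example of our own (not from the text): let $f = u_1 \cdot u_2$ with $u_1 = z + 1$ and $u_2 = z^2$, so that $f = z^3 + z^2$. Node $z$ receives one message from each of $u_1$ and $u_2$, and their sum is exactly $\partial f/\partial z$:

$$\frac{\partial f}{\partial u_1}\cdot\frac{\partial u_1}{\partial z} \;+\; \frac{\partial f}{\partial u_2}\cdot\frac{\partial u_2}{\partial z} \;=\; u_2 \cdot 1 \;+\; u_1 \cdot 2z \;=\; z^2 + 2z(z+1) \;=\; 3z^2 + 2z.$$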
3.3 Auto-differentiation
Since the exposition above used almost no details about the network
and the operations that the nodes perform, it extends to any computation
that can be organized as an acyclic graph in which each node
computes a differentiable function of its incoming neighbors. This
observation underlies many auto-differentiation packages found
in deep learning environments: they allow computing the gradient
of the output of such a computation with respect to the network
parameters.
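For instance, in PyTorch (one such package; the graph and variable names below are our own illustration, not from the text), the protocol runs when one calls backward() on the output:

    import torch

    # A tiny computation graph: the output f depends on parameters
    # w1, w2 through two intermediate nodes.
    w1 = torch.tensor(0.5, requires_grad=True)
    w2 = torch.tensor(-1.3, requires_grad=True)
    x = torch.tensor(2.0)

    u1 = torch.tanh(w1 * x)  # node at layer 1
    u2 = w2 * u1             # node at layer 2
    f = u2 ** 2              # output node

    f.backward()             # runs the message-passing protocol
    print(w1.grad, w2.grad)  # the gradients df/dw1 and df/dw2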
We first observe that Claim 3.1.1 continues to hold in this very general
setting. This is without loss of generality because we can view
the parameters associated with the edges as also sitting on nodes
(specifically, leaf nodes). This can be done via a simple transformation of
the network; for a single node it is shown in the picture below, and
one would need to continue this transformation in the rest of
the network feeding into $u_1, u_2$, etc. from below.
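Concretely (our own instance of the transformation, standing in for the picture): if the edge from $z_j$ into $u$ carries a weight $w_j$ and $u = \sigma\big(\sum_j w_j z_j\big)$, we redraw $u$ as having incoming neighbors $w_1, \ldots, w_n$ (leaf nodes) as well as $z_1, \ldots, z_n$. Each $w_j$ then has an ordinary local partial derivative,

$$\frac{\partial u}{\partial w_j} \;=\; \sigma'\Big(\sum_i w_i z_i\Big)\, z_j,$$

and the messaging protocol delivers $\partial f/\partial w_j$ to it just as it would to any other node.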
Then we can use the messaging protocol to compute the derivatives
with respect to the nodes, as long as the local partial derivatives
can be computed efficiently. We note that the algorithm can be implemented
in a fairly modular manner: for every node $u$, it suffices to
specify (a) how it depends on the incoming nodes, say $z_1, \ldots, z_n$, and
(b) how to compute the partial derivative times $S$, that is, $S \cdot \frac{\partial u}{\partial z_j}$.
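The following minimal sketch (all names are ours; it assumes nothing beyond plain Python) implements this modular interface: each node stores its parents, a forward rule, and a backward rule that turns an incoming message $S$ into the outgoing messages $S \cdot \partial u/\partial z_j$.

    class Node:
        def __init__(self, parents, forward, backward):
            self.parents = parents    # incoming nodes z_1, ..., z_n
            self.forward = forward    # parent values -> value of u
            self.backward = backward  # (S, parent values) -> [S * du/dz_j]
            self.value = None
            self.grad = 0.0           # accumulates received messages

    def run_backprop(topo_order):
        """topo_order: nodes listed so that parents precede children."""
        for node in topo_order:                    # forward pass
            node.value = node.forward([p.value for p in node.parents])
        topo_order[-1].grad = 1.0                  # base case: df/df = 1
        for node in reversed(topo_order):          # backward pass
            vals = [p.value for p in node.parents]
            msgs = node.backward(node.grad, vals)
            for parent, msg in zip(node.parents, msgs):
                parent.grad += msg                 # sum of messages = df/dz

    # The worked example from above: f = (z + 1) * z**2 at z = 2,
    # where df/dz = 3z^2 + 2z = 16.
    z = Node([], lambda vs: 2.0, lambda S, vs: [])
    u1 = Node([z], lambda vs: vs[0] + 1, lambda S, vs: [S])
    u2 = Node([z], lambda vs: vs[0] ** 2, lambda S, vs: [S * 2 * vs[0]])
    f = Node([u1, u2], lambda vs: vs[0] * vs[1],
             lambda S, vs: [S * vs[1], S * vs[0]])
    run_backprop([z, u1, u2, f])
    print(z.grad)  # 16.0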