Theory of Deep Learning (2022)
visualized as being structured in numbered layers, with nodes in the
(t + 1)-th layer getting all their inputs from the outputs of nodes in layers
t and earlier. We use f ∈ R to denote the output of the network.
In all our figures, the input of the network is at the bottom and the
output on the top.
Our exposition uses the notation ∂ f /∂u, where f is the output and u
is a node in the net. This means the following: suppose we cut off all
the incoming edges of the node u, and fix/clamp the current values
of all network parameters. Now imagine changing u from its current
value. This change may affect values of nodes at higher levels that
are connected to u, and the final output f is one such node. Then
∂ f /∂u denotes the rate at which f will change as we vary u. (Aside:
Readers familiar with the usual exposition of back-propagation
should note that there f is the training error, and this ∂ f /∂u turns out
to be exactly the "error" propagated back to the node u.)
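This clamp-and-perturb definition of ∂ f /∂u can be checked with a finite difference on a toy network. In the sketch below, the network shape, weights, and the name `forward_from_u` are all hypothetical choices made for illustration:

```python
import math

# Toy network: hidden node u = w1*x, output f = sigma(w2 * sigma(u)),
# with sigma = tanh. All values here are hypothetical.
def sigma(t):
    return math.tanh(t)

w1, w2, x = 0.5, -1.3, 2.0

def forward_from_u(u):
    # Everything downstream of u, with all network parameters clamped.
    return sigma(w2 * sigma(u))

u = w1 * x  # current value of node u

# Finite-difference estimate of df/du: vary u, keep parameters fixed.
eps = 1e-6
df_du = (forward_from_u(u + eps) - forward_from_u(u - eps)) / (2 * eps)

# Analytic derivative of sigma(w2*sigma(u)) with respect to u, for comparison.
analytic = (1 - sigma(w2 * sigma(u)) ** 2) * w2 * (1 - sigma(u) ** 2)
print(df_du, analytic)  # the two estimates should agree closely
```

The point of `forward_from_u` is exactly the "cut off the incoming edges" picture: u is treated as a free input, and everything upstream of u is irrelevant to ∂ f /∂u.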
Claim 3.1.1. To compute the desired gradient with respect to the parameters,
it suffices to compute ∂ f /∂u for every node u.
Proof. This follows from a direct application of the chain rule; we prove
it by picture, namely Figure 3.1. Suppose node u is a weighted sum
of the nodes z 1 , . . . , z m (which will be passed through a non-linear
activation σ afterwards). That is, we have u = w 1 z 1 + · · · + w m z m . By
the chain rule, we have
∂ f /∂w 1 = (∂ f /∂u) · (∂u/∂w 1 ) = (∂ f /∂u) · z 1 .
Figure 3.1: Why it suffices to compute derivatives with respect to nodes.
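The identity ∂ f /∂w 1 = (∂ f /∂u) · z 1 can also be verified numerically on a single weighted-sum node. The concrete values and the choice f = σ(u) below are hypothetical, used only to instantiate the claim:

```python
import math

def sigma(t):
    return math.tanh(t)

# Hypothetical node: u = w1*z1 + w2*z2, with output f = sigma(u).
z1, z2 = 0.7, -0.2
w1, w2 = 1.1, 0.4

def f_of_w1(w):
    # f as a function of the weight w1, everything else clamped.
    return sigma(w * z1 + w2 * z2)

def f_of_u(u):
    # f as a function of the node value u, everything else clamped.
    return sigma(u)

u = w1 * z1 + w2 * z2
eps = 1e-6
df_dw1 = (f_of_w1(w1 + eps) - f_of_w1(w1 - eps)) / (2 * eps)
df_du = (f_of_u(u + eps) - f_of_u(u - eps)) / (2 * eps)

# Chain rule: df/dw1 should equal (df/du) * z1.
print(df_dw1, df_du * z1)
```

This is the "local" computation the text describes: once ∂ f /∂u is known, recovering ∂ f /∂w 1 needs only z 1, a value already sitting at one endpoint of the edge.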
Hence, we see that having computed ∂ f /∂u we can compute
∂ f /∂w 1 , and moreover this can be done locally by the endpoints of