
The network is visualized as being structured in numbered layers, with nodes in the (t+1)-th layer getting all their inputs from the outputs of nodes in layers t and earlier. We use f ∈ R to denote the output of the network. In all our figures, the input of the network is at the bottom and the output at the top.

Our exposition uses the notation ∂f/∂u, where f is the output and u is a node in the net. This means the following: suppose we cut off all the incoming edges of the node u, and fix/clamp the current values of all network parameters. Now imagine changing u from its current value. This change may affect values of nodes at higher levels that are connected to u, and the final output f is one such node. Then ∂f/∂u denotes the rate at which f changes as we vary u. (Aside: Readers familiar with the usual exposition of back-propagation should note that there f is the training error, and this ∂f/∂u turns out to be exactly the "error" propagated back to the node u.)
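As a sanity check on this definition, the sketch below (Python; the toy network and all names such as sigma, w1, v are illustrative assumptions, not from the text) estimates ∂f/∂u by clamping all parameters and inputs, treating u as a free variable, and measuring how f responds to a small perturbation of u via a central finite difference.

```python
import numpy as np

# Toy network (an illustrative assumption, not from the text):
#   u = w1*z1 + w2*z2,  h = sigma(u),  f = v*h.
def sigma(x):
    return np.tanh(x)

w1, w2, v = 0.5, -0.3, 2.0   # clamped parameters
z1, z2 = 1.0, -1.0           # clamped values of the nodes feeding into u

u = w1 * z1 + w2 * z2        # current value of node u

def f_given_u(u_val):
    """Output f with u's incoming edges cut and everything else clamped."""
    return v * sigma(u_val)

# Estimate ∂f/∂u by a central finite difference around the current value of u.
eps = 1e-6
df_du = (f_given_u(u + eps) - f_given_u(u - eps)) / (2 * eps)

# Analytic value for this toy net: ∂f/∂u = v * sigma'(u) = v * (1 - tanh(u)^2).
print(df_du, v * (1 - np.tanh(u) ** 2))
```

For this toy net the finite-difference estimate matches the analytic value v · (1 − tanh(u)²), which is exactly the "rate at which f changes as we vary u" described above.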

Claim 3.1.1. To compute the desired gradient with respect to the parameters, it suffices to compute ∂f/∂u for every node u.

Proof. This follows from a direct application of the chain rule, and we prove it by picture, namely Figure 3.1. Suppose node u is a weighted sum of the nodes z_1, . . . , z_n (which will be passed through a non-linear activation σ afterwards). That is, we have u = w_1 z_1 + · · · + w_n z_n. By the chain rule, we have

∂f/∂w_1 = ∂f/∂u · ∂u/∂w_1 = ∂f/∂u · z_1.

Figure 3.1: Why it suffices to compute derivatives with respect to nodes.

Hence, we see that having computed ∂f/∂u we can compute ∂f/∂w_1, and moreover this can be done locally by the endpoints of the edge on which w_1 sits.
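To make this locality concrete, here is a minimal sketch (again in Python, with illustrative names) of how each edge gradient ∂f/∂w_i is obtained from the node gradient ∂f/∂u together with the value z_i stored at the other endpoint of the edge, exactly as in the computation above.

```python
import numpy as np

def local_weight_grads(df_du, z):
    """Given ∂f/∂u for a node u = sum_i w_i * z_i and the clamped values z
    of its input nodes, return the edge gradients ∂f/∂w_i = ∂f/∂u * z_i.
    Each entry depends only on the two endpoints of the corresponding edge."""
    return df_du * np.asarray(z)

# Example: node u has three inputs with values z; ∂f/∂u = 0.25 is assumed
# to have been computed already by the backward pass (the value is made up).
z = [1.0, -2.0, 0.5]
df_du = 0.25
print(local_weight_grads(df_du, z))   # [ 0.25  -0.5    0.125]
```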
