26.12.2022 Views

TheoryofDeepLearning.2022



3.1.2 Naive feedforward algorithm (not efficient!)

It is useful to first point out the naive quadratic-time algorithm
implied by the chain rule. Most authors skip this trivial version,
which we think is analogous to teaching sorting using only quicksort
and skipping over the less efficient bubblesort.

The naive algorithm is to compute $\partial u_i/\partial u_j$ for every pair of nodes
where $u_i$ is at a higher level than $u_j$. Of course, among these $V^2$
values (where $V$ is the number of nodes) are also the desired $\partial f/\partial u_i$
for all $i$, since $f$ is itself the value of the output node.

This computation can be done in feedforward fashion. Suppose the
values $\partial u_i/\partial u_j$ have been obtained for all nodes $u_i$ at levels up to
and including level $t$. Then one can express (by inspecting the
multivariate chain rule) the value $\partial u_l/\partial u_j$ for any $u_l$ at level $t + 1$
as a weighted combination of the values $\partial u_i/\partial u_j$ for each $u_i$ that is a
direct input to $u_l$.
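Written out, the weighted combination referred to here is the multivariate chain rule applied at the node at level t + 1. The following is a sketch under the assumption that the weights are the local partial derivatives of that node with respect to its direct inputs; the index set notation is ours, not the text's:

```latex
\[
\frac{\partial u_l}{\partial u_j}
  \;=\; \sum_{i \,:\, u_i \text{ is a direct input to } u_l}
        \frac{\partial u_l}{\partial u_i} \cdot \frac{\partial u_i}{\partial u_j}
\]
```

Inside the sum, the first factor is the local derivative of the node's own operation, and the second factor is a value already computed at an earlier level.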

This description shows that the amount of computation for a fixed
$j$ is proportional to the number of edges $E$. This amount of work
is repeated for each of the $V$ nodes $u_j$, letting us conclude that the
total work in the algorithm is $O(VE)$.
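The naive feedforward procedure can be sketched as follows. This is a minimal illustration, not an implementation from the text: the toy network (a = x * y, b = x + a, f = a * b, evaluated at x = 2, y = 3), the names ORDER, LOCAL and naive_all_pairs, and the choice to store local partial derivatives in a dictionary are all assumptions made for the example.

```python
# Hypothetical toy network, already evaluated at x = 2, y = 3:
#   a = x * y,  b = x + a,  f = a * b
# LOCAL[u][z] is the local partial derivative du/dz for each direct input z of u.
# ORDER lists the nodes in topological order (inputs first, output last).
ORDER = ["x", "y", "a", "b", "f"]
LOCAL = {
    "a": {"x": 3.0, "y": 2.0},  # da/dx = y, da/dy = x
    "b": {"x": 1.0, "a": 1.0},  # db/dx = 1, db/da = 1
    "f": {"a": 8.0, "b": 6.0},  # df/da = b, df/db = a
}


def naive_all_pairs(order, local):
    """Return d[(l, j)] = derivative of node l with respect to node j, for all pairs."""
    d = {}
    for j in order:          # V choices of the node we differentiate with respect to
        for l in order:      # one feedforward sweep; touches every edge once per j
            if l == j:
                d[l, j] = 1.0
            else:
                # Chain rule: weighted combination over the direct inputs i of l.
                # Every d[i, j] is already known because i precedes l in ORDER.
                d[l, j] = sum(w * d[i, j] for i, w in local.get(l, {}).items())
    return d


d = naive_all_pairs(ORDER, LOCAL)
print(d["f", "x"], d["f", "y"])
```

The outer loop runs V times and each inner sweep visits every edge once, which matches the O(VE) count above. Only the entries d["f", j] are actually wanted, which is the waste that backpropagation avoids.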

3.2 Backpropagation (Linear Time)

The more efficient backpropagation, as the name suggests, computes
the partial derivatives in the reverse direction. Messages are passed
in one wave backwards from higher-numbered layers to lower-numbered
layers. (Some presentations of the algorithm describe it as dynamic
programming.)

Algorithm 1 Backpropagation

The node $u$ receives a message along each outgoing edge from the
node at the other end of that edge. It sums these messages to get a
number $S$ (if $u$ is the output of the entire net, then define $S = 1$) and
then it sends the following message to any node $z$ adjacent to it at a
lower level:

$$S \cdot \frac{\partial u}{\partial z}$$
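The message-passing rule can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the text's own code: the toy network (a = x * y, b = x + a, f = a * b, evaluated at x = 2, y = 3) and the names ORDER, LOCAL and backprop are hypothetical, and the local partial derivatives are assumed to be available already from a forward pass.

```python
# Hypothetical toy network, already evaluated at x = 2, y = 3:
#   a = x * y,  b = x + a,  f = a * b
# LOCAL[u][z] is the local partial derivative du/dz for each direct input z of u.
# ORDER lists the nodes in topological order (inputs first, output last).
ORDER = ["x", "y", "a", "b", "f"]
LOCAL = {
    "a": {"x": 3.0, "y": 2.0},  # da/dx = y, da/dy = x
    "b": {"x": 1.0, "a": 1.0},  # db/dx = 1, db/da = 1
    "f": {"a": 8.0, "b": 6.0},  # df/da = b, df/db = a
}


def backprop(order, local, output):
    """Return S[z], the sum of messages received by each node z."""
    S = {u: 0.0 for u in order}
    S[output] = 1.0  # the output node defines S = 1
    # Visiting nodes in reverse topological order guarantees that u has
    # received all of its incoming messages before it sends any.
    for u in reversed(order):
        for z, du_dz in local.get(u, {}).items():
            S[z] += S[u] * du_dz  # message from u to z: S * du/dz
    return S


S = backprop(ORDER, LOCAL, "f")
print(S["x"], S["y"])
```

Each edge carries exactly one message, so a single backward sweep suffices, in contrast to one forward sweep per node in the naive approach.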

Clearly, the amount of work done by each node is proportional
to its degree, and thus the overall work is the sum of the node degrees.
Summing all node degrees ends up double-counting each edge, and
thus the overall work is O(Network Size).

To prove correctness, we prove the following:

Lemma 3.2.1. At each node $z$, the value $S$ is exactly $\partial f/\partial z$.

Proof. Follows from simple induction on depth.
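One way the induction might be spelled out is sketched below. It assumes depth is measured as distance from the output node, and writes $S_u$ for the sum computed at node $u$ (a subscript the text does not use). The base case is the output node, where $S = 1 = \partial f/\partial f$. For the inductive step, let $u_1, \dots, u_k$ be the nodes that take $z$ as a direct input:

```latex
\[
S_z \;=\; \sum_{m=1}^{k} S_{u_m} \cdot \frac{\partial u_m}{\partial z}
    \;=\; \sum_{m=1}^{k} \frac{\partial f}{\partial u_m} \cdot \frac{\partial u_m}{\partial z}
    \;=\; \frac{\partial f}{\partial z}
\]
```

The first equality is the definition of the messages, the second is the induction hypothesis applied to each $u_m$, and the third is the multivariate chain rule.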
