TheoryofDeepLearning.2022
3.1.2 Naive feedforward algorithm (not efficient!)
It is useful to first point out the naive quadratic-time algorithm
implied by the chain rule. Most authors skip this trivial version,
which we think is analogous to teaching sorting using only quicksort,
and skipping over the less efficient bubblesort.
The naive algorithm is to compute $\partial u_i / \partial u_j$ for every pair of nodes
where $u_i$ is at a higher level than $u_j$. Of course, among these $V^2$
values (where $V$ is the number of nodes) are also the desired $\partial f / \partial u_i$
for all $i$, since $f$ is itself the value of the output node.
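This computation can be done in feedforward fashion. If these
values have been obtained for every pair of nodes on levels up to and including
level $t$, then one can express (by inspecting the multivariate chain
rule) the value $\partial u_\ell / \partial u_j$ for some $u_\ell$ at level $t + 1$ as a weighted
combination of the values $\partial u_i / \partial u_j$ for each $u_i$ that is a direct input to $u_\ell$,
where the weight on $\partial u_i / \partial u_j$ is the local derivative $\partial u_\ell / \partial u_i$.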
This description shows that the amount of computation for a fixed
$j$ is proportional to the number of edges $E$. This amount of work
is repeated for each of the $V$ choices of $j$, letting us conclude that the
total work in the algorithm is $O(VE)$.
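The feedforward pass described above can be sketched in Python on a toy graph. This is a minimal illustration under stated assumptions, not the authors' code: the function name `naive_all_pairs`, the edge-dictionary format, and the convention that nodes are numbered in topological order (every input of a node has a smaller index) are hypothetical choices made for this example. The toy network computes f = x*y + x at x = 2, y = 3.

```python
from collections import defaultdict


def naive_all_pairs(num_nodes, edges):
    """Compute d[l][j] = d u_l / d u_j for all pairs of nodes.

    Assumptions (hypothetical format for this sketch):
      - nodes are numbered 0..num_nodes-1 in topological order;
      - edges maps (i, l) to the local partial d u_l / d u_i for edge i -> l.
    """
    # preds[l] lists the direct inputs of node l with their local partials.
    preds = defaultdict(list)
    for (i, l), local in edges.items():
        preds[l].append((i, local))

    d = [[0.0] * num_nodes for _ in range(num_nodes)]
    for j in range(num_nodes):            # V choices of j
        d[j][j] = 1.0                     # d u_j / d u_j = 1
        for l in range(j + 1, num_nodes): # every edge into l is visited once: O(E) per j
            # Multivariate chain rule: weighted combination over direct inputs of l.
            d[l][j] = sum(local * d[i][j] for i, local in preds[l])
    return d


# Toy graph: node 0 = x, node 1 = y, node 2 = p = x * y, node 3 = f = p + x.
# Local partials evaluated at x = 2, y = 3.
edges = {
    (0, 2): 3.0,  # dp/dx = y
    (1, 2): 2.0,  # dp/dy = x
    (2, 3): 1.0,  # df/dp
    (0, 3): 1.0,  # df/dx through the direct edge
}
d = naive_all_pairs(4, edges)
print(d[3][0], d[3][1])  # 4.0 2.0, i.e. df/dx = y + 1 and df/dy = x
```

The two nested loops mirror the counting argument: for each fixed j, the inner loop touches each edge at most once, which gives the O(VE) total.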
3.2 Backpropagation (Linear Time)
The more efficient backpropagation, as the name suggests, computes
the partial derivatives in the reverse direction. Messages are passed
in one wave backwards from higher-numbered layers to lower-numbered
layers. (Some presentations of the algorithm describe it as dynamic
programming.)
Algorithm 1 Backpropagation
The node u receives a message along each outgoing edge from the
node at the other end of that edge. It sums these messages to get a
number S (if u is the output of the entire net, then define S = 1) and
then it sends the following message to any node z adjacent to it at a
lower level:
\[ S \cdot \frac{\partial u}{\partial z} \]
Clearly, the amount of work done by each node is proportional
to its degree, and thus overall work is the sum of the node degrees.
Summing all node degrees ends up double-counting each edge, and
thus the overall work is O(Network Size).
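The message-passing rule of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name `backprop`, the edge-dictionary format, and the assumption that nodes are numbered in topological order with the output node last are hypothetical choices for this example. The toy network again computes f = x*y + x at x = 2, y = 3, and the code also counts messages to show that one message is sent per edge.

```python
from collections import defaultdict


def backprop(num_nodes, edges):
    """Return (S, num_messages), where S[z] is the sum of messages received by z.

    Assumptions (hypothetical format for this sketch):
      - nodes are numbered 0..num_nodes-1 in topological order, output node last;
      - edges maps (z, u) to the local partial d u / d z for edge z -> u.
    """
    # preds[u] lists the lower-level neighbours z of u with the local partial du/dz.
    preds = defaultdict(list)
    for (z, u), local in edges.items():
        preds[u].append((z, local))

    S = [0.0] * num_nodes
    S[num_nodes - 1] = 1.0  # the output node defines S = 1
    num_messages = 0

    # Reverse topological order: S[u] is complete before u sends anything.
    for u in reversed(range(num_nodes)):
        for z, local in preds[u]:
            S[z] += S[u] * local  # message S * du/dz sent from u down to z
            num_messages += 1
    return S, num_messages


# Toy graph: node 0 = x, node 1 = y, node 2 = p = x * y, node 3 = f = p + x.
edges = {
    (0, 2): 3.0,  # dp/dx = y
    (1, 2): 2.0,  # dp/dy = x
    (2, 3): 1.0,  # df/dp
    (0, 3): 1.0,  # df/dx through the direct edge
}
grads, num_messages = backprop(4, edges)
print(grads)         # [4.0, 2.0, 1.0, 1.0]
print(num_messages)  # 4, one message per edge
```

A single backward sweep yields the derivative of f with respect to every node, whereas the naive algorithm repeats an edge sweep for each of the V choices of j.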
To prove correctness, we prove the following:
Lemma 3.2.1. At each node $z$, the value $S$ is exactly $\partial f / \partial z$.
Proof. Follows from simple induction on depth.
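One way the induction might be spelled out is sketched below. This is an illustration, not the authors' own write-up: the subscripted notation $S_u$ for the sum computed at node $u$, and the notation $z \to u$ for an edge from $z$ to a higher-level node $u$, are labels introduced here. The induction runs from the output node downward, so every $u$ that sends a message to $z$ already satisfies the claim.

```latex
% S_u: the sum S computed at node u (notation assumed for this sketch).
\begin{align*}
\text{Base case: } & S_f = 1 = \frac{\partial f}{\partial f}. \\
\text{Inductive step: } & S_z
  = \sum_{u \,:\, z \to u} S_u \cdot \frac{\partial u}{\partial z}
  = \sum_{u \,:\, z \to u} \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial z}
  = \frac{\partial f}{\partial z}.
\end{align*}
```

The second equality uses the inductive hypothesis $S_u = \partial f / \partial u$, and the third is the multivariate chain rule applied over the nodes that take $z$ as a direct input.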