Hence, equivalently, the gradient with respect to a shared parameter is the sum of the gradients with respect to its individual occurrences.
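As a tiny sanity check of this rule, the following sketch in JAX compares the tied gradient with the sum over untied occurrences; the function $f(w) = \sin w + w^2$ and the evaluation point are illustrative choices, not from the text.

```python
import jax
import jax.numpy as jnp

# f uses the shared parameter w at two occurrences: sin(w) and w**2.
f = lambda w: jnp.sin(w) + w ** 2
g_shared = jax.grad(f)(1.0)

# Untie the occurrences into g(u, v) and differentiate each one.
g = lambda u, v: jnp.sin(u) + v ** 2
g_u, g_v = jax.grad(g, argnums=(0, 1))(1.0, 1.0)

# The gradient w.r.t. the shared parameter is the sum over occurrences.
assert jnp.allclose(g_shared, g_u + g_v)
```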
Backpropagation on networks with loops. The above exposition assumed the network is acyclic. Many cutting-edge applications such as machine translation and language understanding use networks with directed loops (e.g., recurrent neural networks). These architectures, all examples of the "differentiable computing" paradigm discussed below, can get complicated and may involve operations on a separate memory, as well as mechanisms to shift attention to different parts of the data and memory.
Networks with loops are also trained using gradient descent, via back-propagation through time, which consists of unrolling the network through a finite number of time steps into an acyclic graph, with replicated copies of the same network. These replicas share the weights (weight tying!), so the gradient can be computed as for shared parameters above. In practice, exploding or vanishing gradients may arise and impede convergence. Such issues can be addressed by clipping the gradient or by re-parameterization techniques such as long short-term memory (LSTM). Recent work suggests that careful initialization of parameters can ameliorate some of the vanishing-gradient problems.
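To make the unrolling concrete, here is a minimal JAX sketch of back-propagation through time for a one-layer tanh RNN with squared loss at the last step; the architecture, dimensions, and the clipping threshold of 1.0 are illustrative assumptions, not a prescription from the text.

```python
import jax
import jax.numpy as jnp

def rnn_loss(W, xs, y):
    # Unroll the recurrence for len(xs) steps; every step reuses the
    # same matrix W (weight tying), so the unrolled graph is acyclic
    # and ordinary backpropagation applies.
    h = jnp.zeros(W.shape[0])
    for x in xs:
        h = jnp.tanh(W @ jnp.concatenate([h, x]))
    return jnp.sum((h - y) ** 2)

hidden, inp, T = 4, 3, 10
W = jax.random.normal(jax.random.PRNGKey(0), (hidden, hidden + inp)) * 0.1
xs, y = jnp.ones((T, inp)), jnp.zeros(hidden)

grad_W = jax.grad(rnn_loss)(W, xs, y)   # back-propagation through time

# Clip the gradient's global norm, a common guard against explosion.
norm = jnp.linalg.norm(grad_W)
grad_W = jnp.where(norm > 1.0, grad_W / norm, grad_W)
```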
The fact that the gradient can be computed efficiently for such general networks with loops has motivated neural net models with memory or even data structures (see, for example, the neural Turing machine and the differentiable neural computer). Using gradient descent, one can optimize over a family of parameterized networks with loops to find the best one that solves a certain computational task (on the training examples). The limits of these ideas are still being explored.
3.4.1 Hessian-vector product in linear time: Pearlmutter’s trick
It is possible to generalize backpropagation to work with second-order derivatives, specifically with the Hessian $H$, the symmetric matrix whose $(i, j)$ entry is $\partial^2 f / \partial w_i \partial w_j$. Sometimes $H$ is also denoted $\nabla^2 f$. Just writing down this matrix takes quadratic time and memory, which is infeasible for today's deep nets. Surprisingly, using backpropagation it is possible to compute the matrix-vector product $Hx$ for any vector $x$ in linear time.
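A minimal sketch of one common realization of this in JAX: since $Hx = \nabla_w \langle \nabla f(w), x \rangle$, two applications of automatic differentiation suffice, and $H$ is never materialized. The toy objective below is an illustrative assumption used only to check the output.

```python
import jax
import jax.numpy as jnp

def hvp(f, w, x):
    # Hessian-vector product without forming H: differentiate the
    # directional derivative <grad f(w), x> once more with respect to w.
    return jax.grad(lambda w: jnp.vdot(jax.grad(f)(w), x))(w)

f = lambda w: jnp.sum(w ** 4)      # toy objective; Hessian is diag(12 w^2)
w = jnp.arange(3.0)
x = jnp.ones(3)
print(hvp(f, w, x))                # [ 0. 12. 48.] = diag(12 w^2) @ x
```

JAX also offers a forward-over-reverse variant, `jax.jvp(jax.grad(f), (w,), (x,))`, which is closer to Pearlmutter's original formulation; both cost a constant factor times one evaluation of $f$.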
Claim 3.4.1. Suppose an acyclic network with $V$ nodes and $E$ edges has output $f$ and leaves $z_1, \ldots, z_m$. Then there exists a network of size $O(V + E)$ that has $z_1, \ldots, z_m$ as input nodes and $\partial f / \partial z_1, \ldots, \partial f / \partial z_m$ as output nodes.