
Hence, equivalently, the gradient with respect to a shared parameter is the sum of the gradients with respect to individual occurrences.
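A minimal sketch of this fact (not from the text; the toy function g below is an arbitrary example): tying two occurrences of a parameter together and differentiating recovers the sum of the per-occurrence gradients.

```python
import jax
import jax.numpy as jnp

# A toy function with two "occurrences" of what will become one shared parameter.
def g(w1, w2):
    return jnp.sin(w1) * jnp.exp(w2)

# Tie both occurrences to a single shared parameter w.
def f(w):
    return g(w, w)

w = 0.7
grad_shared = jax.grad(f)(w)              # gradient w.r.t. the shared parameter
grad_occ1 = jax.grad(g, argnums=0)(w, w)  # gradient w.r.t. the first occurrence
grad_occ2 = jax.grad(g, argnums=1)(w, w)  # gradient w.r.t. the second occurrence

# The shared-parameter gradient equals the sum over occurrences.
print(jnp.allclose(grad_shared, grad_occ1 + grad_occ2))  # True
```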

Backpropagation on networks with loops. The above exposition assumed the network is acyclic. Many cutting-edge applications such as machine translation and language understanding use networks with directed loops (e.g., recurrent neural networks). These architectures—all examples of the "differentiable computing" paradigm described below—can get complicated and may involve operations on a separate memory as well as mechanisms to shift attention to different parts of data and memory.

Networks with loops are trained using gradient descent as well, using backpropagation through time, which consists of expanding the network through a finite number of time steps into an acyclic graph, with replicated copies of the same network. These replicas share the weights (weight tying!) so the gradient can be computed. In practice, an issue may arise with exploding or vanishing gradients, which impact convergence. Such issues can be addressed by clipping the gradient or by re-parameterization techniques such as long short-term memory (LSTM). Recent work suggests that careful initialization of parameters can ameliorate some of the vanishing-gradient problems.
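As a minimal sketch (not from the text; the recurrence and loss are arbitrary toys), the unrolled computation below reuses one weight matrix at every time step, jax.grad sums the contributions of all replicas, and the resulting gradient is clipped by its global norm.

```python
import jax
import jax.numpy as jnp

def unrolled_loss(W, xs, h0):
    # Unroll the recurrence for len(xs) steps; every step reuses the same W
    # (weight tying across the replicated copies of the cell).
    h = h0
    for x in xs:
        h = jnp.tanh(W @ h + x)
    return jnp.sum(h ** 2)   # arbitrary scalar loss on the final state

key = jax.random.PRNGKey(0)
W = 0.1 * jax.random.normal(key, (4, 4))
xs = [jnp.ones(4) for _ in range(20)]
h0 = jnp.zeros(4)

# Backpropagation through time: gradient w.r.t. the single shared W.
g = jax.grad(unrolled_loss)(W, xs, h0)

# Simple global-norm clipping to guard against exploding gradients.
max_norm = 1.0
norm = jnp.linalg.norm(g)
g = jnp.where(norm > max_norm, g * (max_norm / norm), g)
```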

The fact that the gradient can be computed efficiently for such general networks with loops has motivated neural net models with memory or even data structures (see, for example, neural Turing machines and the differentiable neural computer). Using gradient descent, one can optimize over a family of parameterized networks with loops to find the best one that solves a certain computational task (on the training examples). The limits of these ideas are still being explored.

3.4.1 Hessian-vector product in linear time: Pearlmutter’s trick

It is possible to generalize backpropagation to work with second-order derivatives, specifically with the Hessian $H$, which is the symmetric matrix whose $(i, j)$ entry is $\partial^2 f / \partial w_i \partial w_j$. Sometimes $H$ is also denoted $\nabla^2 f$. Just writing down this matrix takes quadratic time and memory, which is infeasible for today's deep nets. Surprisingly, using backpropagation it is possible to compute in linear time the matrix-vector product $Hx$ for any vector $x$.
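A minimal sketch of how this can be exercised (not part of the text; the function f below is an arbitrary example): in JAX, Pearlmutter's trick amounts to a forward-mode derivative of the reverse-mode gradient, whose cost is a small constant multiple of one evaluation of $f$.

```python
import jax
import jax.numpy as jnp

# An arbitrary smooth scalar function of the parameter vector w.
def f(w):
    return jnp.sum(jnp.tanh(w) ** 2) + jnp.dot(w, w)

def hvp(f, w, v):
    # Differentiate the gradient map w -> grad f(w) in the direction v:
    # this returns H v without ever forming the Hessian H.
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

w = jnp.linspace(-1.0, 1.0, 5)
v = jnp.ones(5)

fast = hvp(f, w, v)                # linear-time Hessian-vector product
slow = jax.hessian(f)(w) @ v       # explicit Hessian: quadratic time/memory
print(jnp.allclose(fast, slow))    # True
```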

Claim 3.4.1. Suppose an acyclic network with $V$ nodes and $E$ edges has output $f$ and leaves $z_1, \ldots, z_m$. Then there exists a network of size $O(V + E)$ that has $z_1, \ldots, z_m$ as input nodes and $\frac{\partial f}{\partial z_1}, \ldots, \frac{\partial f}{\partial z_m}$ as output nodes.
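In brief, the claim yields the linear-time Hessian-vector product as follows (a sketch of the standard argument). Take the leaves to be the parameters $w_1, \ldots, w_m$ and fix a vector $v$. Then
$$ Hv \;=\; \nabla_w \,\langle \nabla f(w),\, v\rangle, $$
i.e., $Hv$ is the gradient of the scalar function $g(w) = \sum_i v_i\, \partial f/\partial w_i$. By the claim, there is a network of size $O(V + E)$ whose outputs are the partial derivatives $\partial f/\partial w_i$; appending the inner product with $v$ gives a network of size $O(V + E)$ computing $g$, and applying the claim once more to this new network produces its gradient, which is exactly $Hv$, with only a constant-factor increase in size.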
