Hence, equivalently, the gradient with respect to a shared parameter is the sum of the gradients with respect to its individual occurrences.
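As a tiny sanity check of this rule, the following sketch in JAX compares the tied gradient with the sum over untied occurrences; the function $f(w) = \sin w + w^2$ and the evaluation point are illustrative choices, not from the text.

```python
import jax
import jax.numpy as jnp

# f uses the shared parameter w at two occurrences: sin(w) and w**2.
f = lambda w: jnp.sin(w) + w ** 2
g_shared = jax.grad(f)(1.0)

# Untie the occurrences into g(u, v) and differentiate each one.
g = lambda u, v: jnp.sin(u) + v ** 2
g_u, g_v = jax.grad(g, argnums=(0, 1))(1.0, 1.0)

# The gradient w.r.t. the shared parameter is the sum over occurrences.
assert jnp.allclose(g_shared, g_u + g_v)
```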
Backpropagation on networks with loops. The above exposition assumed the network is acyclic. Many cutting-edge applications such as machine translation and language understanding use networks with directed loops (e.g., recurrent neural networks). These architectures, all examples of the "differentiable computing" paradigm discussed below, can get complicated and may involve operations on a separate memory, as well as mechanisms to shift attention to different parts of the data and memory.
Networks with loops are also trained using gradient descent, via back-propagation through time, which consists of unrolling the network through a finite number of time steps into an acyclic graph, with replicated copies of the same network. These replicas share the weights (weight tying!), so the gradient can be computed as for shared parameters above. In practice, exploding or vanishing gradients may arise and impede convergence. Such issues can be addressed by clipping the gradient or by re-parameterization techniques such as long short-term memory (LSTM). Recent work suggests that careful initialization of parameters can ameliorate some of the vanishing-gradient problems.
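To make the unrolling concrete, here is a minimal JAX sketch of back-propagation through time for a one-layer tanh RNN with squared loss at the last step; the architecture, dimensions, and the clipping threshold of 1.0 are illustrative assumptions, not a prescription from the text.

```python
import jax
import jax.numpy as jnp

def rnn_loss(W, xs, y):
    # Unroll the recurrence for len(xs) steps; every step reuses the
    # same matrix W (weight tying), so the unrolled graph is acyclic
    # and ordinary backpropagation applies.
    h = jnp.zeros(W.shape[0])
    for x in xs:
        h = jnp.tanh(W @ jnp.concatenate([h, x]))
    return jnp.sum((h - y) ** 2)

hidden, inp, T = 4, 3, 10
W = jax.random.normal(jax.random.PRNGKey(0), (hidden, hidden + inp)) * 0.1
xs, y = jnp.ones((T, inp)), jnp.zeros(hidden)

grad_W = jax.grad(rnn_loss)(W, xs, y)   # back-propagation through time

# Clip the gradient's global norm, a common guard against explosion.
norm = jnp.linalg.norm(grad_W)
grad_W = jnp.where(norm > 1.0, grad_W / norm, grad_W)
```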
The fact that the gradient can be computed efficiently for such general networks with loops has motivated neural net models with memory or even data structures (see, for example, the neural Turing machine and the differentiable neural computer). Using gradient descent, one can optimize over a family of parameterized networks with loops to find the best one that solves a certain computational task (on the training examples). The limits of these ideas are still being explored.
3.4.1 Hessian-vector product in linear time: Pearlmutter’s trick
It is possible to generalize backpropagation to work with second-order derivatives, specifically with the Hessian $H$, the symmetric matrix whose $(i, j)$ entry is $\partial^2 f / \partial w_i \partial w_j$. Sometimes $H$ is also denoted $\nabla^2 f$. Just writing down this matrix takes quadratic time and memory, which is infeasible for today's deep nets. Surprisingly, using backpropagation it is possible to compute the matrix-vector product $Hx$ for any vector $x$ in linear time.
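A minimal sketch of one common realization of this in JAX: since $Hx = \nabla_w \langle \nabla f(w), x \rangle$, two applications of automatic differentiation suffice, and $H$ is never materialized. The toy objective below is an illustrative assumption used only to check the output.

```python
import jax
import jax.numpy as jnp

def hvp(f, w, x):
    # Hessian-vector product without forming H: differentiate the
    # directional derivative <grad f(w), x> once more with respect to w.
    return jax.grad(lambda w: jnp.vdot(jax.grad(f)(w), x))(w)

f = lambda w: jnp.sum(w ** 4)      # toy objective; Hessian is diag(12 w^2)
w = jnp.arange(3.0)
x = jnp.ones(3)
print(hvp(f, w, x))                # [ 0. 12. 48.] = diag(12 w^2) @ x
```

JAX also offers a forward-over-reverse variant, `jax.jvp(jax.grad(f), (w,), (x,))`, which is closer to Pearlmutter's original formulation; both cost a constant factor times one evaluation of $f$.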
Claim 3.4.1. Suppose an acyclic network with $V$ nodes and $E$ edges has output $f$ and leaves $z_1, \ldots, z_m$. Then there exists a network of size $O(V + E)$ that has $z_1, \ldots, z_m$ as input nodes and $\partial f / \partial z_1, \ldots, \partial f / \partial z_m$ as output nodes.