Theory of Deep Learning (2022)
visualized as being structured in numbered layers, with nodes in the
(t + 1)-th layer getting all their inputs from the outputs of nodes in layers
t and earlier. We use f ∈ R to denote the output of the network.
In all our figures, the input of the network is at the bottom and the
output on the top.
Our exposition uses the notation ∂ f /∂u, where f is the output and u
is a node in the net. This means the following: suppose we cut off all
the incoming edges of the node u, and fix/clamp the current values
of all network parameters. Now imagine changing u from its current
value. This change may affect values of nodes at higher levels that
are connected to u, and the final output f is one such node. Then
∂ f /∂u denotes the rate at which f will change as we vary u. (Aside:
Readers familiar with the usual exposition of back-propagation
should note that there f is the training error, and this ∂ f /∂u turns out
to be exactly the "error" propagated back to the node u.)
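This clamp-and-perturb definition of ∂ f /∂u can be checked with a finite difference on a toy network. In the sketch below, the network shape, weights, and the name `forward_from_u` are all hypothetical choices made for illustration:

```python
import math

# Toy network: hidden node u = w1*x, output f = sigma(w2 * sigma(u)),
# with sigma = tanh. All values here are hypothetical.
def sigma(t):
    return math.tanh(t)

w1, w2, x = 0.5, -1.3, 2.0

def forward_from_u(u):
    # Everything downstream of u, with all network parameters clamped.
    return sigma(w2 * sigma(u))

u = w1 * x  # current value of node u

# Finite-difference estimate of df/du: vary u, keep parameters fixed.
eps = 1e-6
df_du = (forward_from_u(u + eps) - forward_from_u(u - eps)) / (2 * eps)

# Analytic derivative of sigma(w2*sigma(u)) with respect to u, for comparison.
analytic = (1 - sigma(w2 * sigma(u)) ** 2) * w2 * (1 - sigma(u) ** 2)
print(df_du, analytic)  # the two estimates should agree closely
```

The point of `forward_from_u` is exactly the "cut off the incoming edges" picture: u is treated as a free input, and everything upstream of u is irrelevant to ∂ f /∂u.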
Claim 3.1.1. To compute the desired gradient with respect to the parameters,
it suffices to compute ∂ f /∂u for every node u.
Proof. This follows from a direct application of the chain rule; we prove
it by picture, namely Figure 3.1. Suppose node u is a weighted sum
of the nodes z 1 , . . . , z m (which will be passed through a non-linear
activation σ afterwards). That is, we have u = w 1 z 1 + · · · + w m z m . By
the chain rule, we have
∂ f /∂w 1 = (∂ f /∂u) · (∂u/∂w 1 ) = (∂ f /∂u) · z 1 .
Figure 3.1: Why it suffices to compute derivatives with respect to nodes.
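The identity ∂ f /∂w 1 = (∂ f /∂u) · z 1 can also be verified numerically on a single weighted-sum node. The concrete values and the choice f = σ(u) below are hypothetical, used only to instantiate the claim:

```python
import math

def sigma(t):
    return math.tanh(t)

# Hypothetical node: u = w1*z1 + w2*z2, with output f = sigma(u).
z1, z2 = 0.7, -0.2
w1, w2 = 1.1, 0.4

def f_of_w1(w):
    # f as a function of the weight w1, everything else clamped.
    return sigma(w * z1 + w2 * z2)

def f_of_u(u):
    # f as a function of the node value u, everything else clamped.
    return sigma(u)

u = w1 * z1 + w2 * z2
eps = 1e-6
df_dw1 = (f_of_w1(w1 + eps) - f_of_w1(w1 - eps)) / (2 * eps)
df_du = (f_of_u(u + eps) - f_of_u(u - eps)) / (2 * eps)

# Chain rule: df/dw1 should equal (df/du) * z1.
print(df_dw1, df_du * z1)
```

This is the "local" computation the text describes: once ∂ f /∂u is known, recovering ∂ f /∂w 1 needs only z 1, a value already sitting at one endpoint of the edge.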
Hence, we see that having computed ∂ f /∂u we can compute
∂ f /∂w 1 , and moreover this can be done locally by the endpoints of