Base Case: At the output layer this is true, since $\partial f/\partial f = 1$.
Inductive step: Suppose the claim is true for layers $t+1$ and higher, and suppose $z$ is at layer $t$, with outgoing edges going to some nodes $u_1, u_2, \ldots, u_m$ at layers $t+1$ or higher. By the inductive hypothesis, node $z$ receives $\frac{\partial f}{\partial u_j} \times \frac{\partial u_j}{\partial z}$ from each $u_j$. Thus by the chain rule,
$$S \;=\; \sum_{j=1}^{m} \frac{\partial f}{\partial u_j} \cdot \frac{\partial u_j}{\partial z} \;=\; \frac{\partial f}{\partial z}.$$
This completes the induction and proves the Main Claim.
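To see the claim in action, here is a small example of our own (not from the text): let $f = u_1 \cdot u_2$ with $u_1 = z + 1$ and $u_2 = z^2$, so that $f = z^3 + z^2$. Node $z$ receives one message from each of $u_1$ and $u_2$, and their sum is exactly $\partial f/\partial z$:

$$\frac{\partial f}{\partial u_1}\cdot\frac{\partial u_1}{\partial z} \;+\; \frac{\partial f}{\partial u_2}\cdot\frac{\partial u_2}{\partial z} \;=\; u_2 \cdot 1 \;+\; u_1 \cdot 2z \;=\; z^2 + 2z(z+1) \;=\; 3z^2 + 2z.$$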
3.3 Auto-differentiation
Since the exposition above used almost no details about the network
and the operations that the nodes perform, it extends to any computation
that can be organized as an acyclic graph in which each node
computes a differentiable function of its incoming neighbors. This
observation underlies many auto-differentiation packages found
in deep learning environments: they allow computing the gradient
of the output of such a computation with respect to the network
parameters.
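For instance, in PyTorch (one such package; the graph and variable names below are our own illustration, not from the text), the protocol runs when one calls backward() on the output:

    import torch

    # A tiny computation graph: the output f depends on parameters
    # w1, w2 through two intermediate nodes.
    w1 = torch.tensor(0.5, requires_grad=True)
    w2 = torch.tensor(-1.3, requires_grad=True)
    x = torch.tensor(2.0)

    u1 = torch.tanh(w1 * x)  # node at layer 1
    u2 = w2 * u1             # node at layer 2
    f = u2 ** 2              # output node

    f.backward()             # runs the message-passing protocol
    print(w1.grad, w2.grad)  # the gradients df/dw1 and df/dw2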
We first observe that Claim 3.1.1 continues to hold in this very general
setting. This is without loss of generality because we can view
the parameters associated with the edges as also sitting on nodes
(specifically, leaf nodes). This can be done via a simple transformation of
the network; for a single node it is shown in the picture below, and
one would need to continue this transformation in the rest of
the network feeding into $u_1, u_2$, etc. from below.
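Concretely (our own instance of the transformation, standing in for the picture): if the edge from $z_j$ into $u$ carries a weight $w_j$ and $u = \sigma\big(\sum_j w_j z_j\big)$, we redraw $u$ as having incoming neighbors $w_1, \ldots, w_n$ (leaf nodes) as well as $z_1, \ldots, z_n$. Each $w_j$ then has an ordinary local partial derivative,

$$\frac{\partial u}{\partial w_j} \;=\; \sigma'\Big(\sum_i w_i z_i\Big)\, z_j,$$

and the messaging protocol delivers $\partial f/\partial w_j$ to it just as it would to any other node.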
Then we can use the messaging protocol to compute the derivatives
with respect to the nodes, as long as the local partial derivatives
can be computed efficiently. We note that the algorithm can be implemented
in a fairly modular manner: for every node $u$, it suffices to
specify (a) how it depends on the incoming nodes, say $z_1, \ldots, z_n$, and
(b) how to compute the partial derivative times $S$, that is, $S \cdot \frac{\partial u}{\partial z_j}$.
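The following minimal sketch (all names are ours; it assumes nothing beyond plain Python) implements this modular interface: each node stores its parents, a forward rule, and a backward rule that turns an incoming message $S$ into the outgoing messages $S \cdot \partial u/\partial z_j$.

    class Node:
        def __init__(self, parents, forward, backward):
            self.parents = parents    # incoming nodes z_1, ..., z_n
            self.forward = forward    # parent values -> value of u
            self.backward = backward  # (S, parent values) -> [S * du/dz_j]
            self.value = None
            self.grad = 0.0           # accumulates received messages

    def run_backprop(topo_order):
        """topo_order: nodes listed so that parents precede children."""
        for node in topo_order:                    # forward pass
            node.value = node.forward([p.value for p in node.parents])
        topo_order[-1].grad = 1.0                  # base case: df/df = 1
        for node in reversed(topo_order):          # backward pass
            vals = [p.value for p in node.parents]
            msgs = node.backward(node.grad, vals)
            for parent, msg in zip(node.parents, msgs):
                parent.grad += msg                 # sum of messages = df/dz

    # The worked example from above: f = (z + 1) * z**2 at z = 2,
    # where df/dz = 3z^2 + 2z = 16.
    z = Node([], lambda vs: 2.0, lambda S, vs: [])
    u1 = Node([z], lambda vs: vs[0] + 1, lambda S, vs: [S])
    u2 = Node([z], lambda vs: vs[0] ** 2, lambda S, vs: [S * 2 * vs[0]])
    f = Node([u1, u2], lambda vs: vs[0] * vs[1],
             lambda S, vs: [S * vs[1], S * vs[0]])
    run_backprop([z, u1, u2, f])
    print(z.grad)  # 16.0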