constraints on the network. In particular, we could impose sparsity constraints on the mapped value. Essentially, if $x = \sum_i \alpha_i u_i + \sigma$ as above, we could enforce $\alpha$ to be $r$-sparse, i.e., allowing only $r$ nonzero values. Examples of "sparse coding" include [? ? ].
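The following is a minimal sketch of what the $r$-sparsity constraint on the code $\alpha$ looks like in practice: keep only the $r$ largest-magnitude coefficients and zero out the rest (the hard-thresholding projection used by many sparse-coding algorithms). The dictionary $U$, the code, and all dimensions below are illustrative assumptions, not values from the text.

```python
import numpy as np

# Illustrative sketch of an r-sparse code: project a dense code alpha
# onto the set of r-sparse vectors by hard thresholding.
rng = np.random.default_rng(0)
n, k, r = 50, 20, 3                # assumed data dim, dictionary size, sparsity

U = rng.normal(size=(n, k))        # dictionary of atoms u_1, ..., u_k (made up)
alpha = rng.normal(size=k)         # dense code

# Keep the r largest-magnitude entries of alpha, zero the rest.
keep = np.argsort(np.abs(alpha))[-r:]
alpha_sparse = np.zeros_like(alpha)
alpha_sparse[keep] = alpha[keep]

x_hat = U @ alpha_sparse           # reconstruction from the r-sparse code
print(np.count_nonzero(alpha_sparse))   # 3
```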
10.3.2 Topic models
Topic models are used to learn the abstract "topics" that occur in a collection of documents and to uncover hidden semantic structures. They, too, fit the autoencoder view. Assume the documents are given in a bag-of-words representation. As defined above, we have vectors $u_1, \ldots, u_k \in \mathbb{R}^n$, with $k < n$. We enforce that $\sum_i \alpha_i = 1$, so that $\alpha_i$ is the coefficient of the $i$-th topic. For example, a news article might be represented as a mixture of 0.7 of the topic politics and 0.3 of the topic economy. The goal in topic modeling is, given a large enough collection of documents, to discover the underlying set of topics used to generate them, both efficiently and accurately. A practical algorithm for topic modeling with provable guarantees is given by [? ]. A minimal generative sketch of this view follows.
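The sketch below illustrates the decoder side of the topic-model view: a document's word distribution is a convex combination $U\alpha$ of topic vectors, from which a bag-of-words document is sampled. The vocabulary size, number of topics, topic vectors, and mixture weights are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch: a document as a mixture of topics.
n, k = 1000, 5                      # assumed vocabulary size, number of topics
rng = np.random.default_rng(0)

# Each topic u_i is a distribution over the n words (randomly made up here).
U = rng.dirichlet(np.ones(n), size=k).T   # shape (n, k); each column sums to 1

# The code alpha lies on the simplex: alpha_i >= 0 and sum_i alpha_i = 1,
# e.g. 0.7 "politics" + 0.3 "economy" as in the running example.
alpha = np.array([0.7, 0.3, 0.0, 0.0, 0.0])

# Decoding: the document's word distribution is the mixture U @ alpha.
doc_distribution = U @ alpha              # shape (n,), sums to 1

# A bag-of-words document is a sample of word counts from that distribution.
doc = rng.multinomial(500, doc_distribution)   # a 500-word document
print(doc.shape, doc.sum())                    # (1000,) 500
```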
10.4 Variational Autoencoder (VAE)
In the context of deep learning, Variational Autoencoders (VAEs) are among the earliest models to have demonstrated promising qualitative performance in learning complicated distributions. As the name suggests, two classical ideas underlie the design of VAEs: autoencoders, in that the original data $x \in \mathbb{R}^n$ is mapped into a high-level descriptor $z \in \mathbb{R}^d$ on a low-dimensional (hopefully meaningful) manifold; and variational inference, in that the objective to maximize is a lower bound on the log-likelihood instead of the log-likelihood itself.
Recall that in density estimation we are given a data sample $x_1, \ldots, x_m$ and a parametric model $p_\theta(x)$, and our goal is to maximize the log-likelihood of the data: $\max_\theta \sum_{i=1}^m \log p_\theta(x_i)$. As a variational method, VAEs use the evidence lower bound (ELBO) as a training objective instead. For any distribution $p$ on $(x, z)$ and any conditional distribution $q$ on $z \mid x$, the ELBO is derived from the fact that $\mathrm{KL}(q(z|x) \,\|\, p(z|x)) \geq 0$: expanding the KL divergence gives $\mathrm{KL}(q(z|x) \,\|\, p(z|x)) = \log p(x) - \mathbb{E}_{q(z|x)}[\log p(x,z)] + \mathbb{E}_{q(z|x)}[\log q(z|x)]$, so nonnegativity yields
$$\log p(x) \;\geq\; \mathbb{E}_{q(z|x)}[\log p(x, z)] \;-\; \mathbb{E}_{q(z|x)}[\log q(z|x)] \;=\; \mathrm{ELBO} \tag{10.8}$$
where equality holds if and only if $q(z|x) \equiv p(z|x)$. In the VAE setting, the distribution $q(z|x)$ acts as the encoder, mapping a given data point $x$ to a distribution over high-level descriptors, while $p(x, z) = p(z)p(x|z)$ acts as the decoder, reconstructing a distribution over data $x$ from a random seed $z \sim p(z)$. Deep learning comes into play for VAEs when constructing the aforementioned encoder $q$ and decoder $p$ as neural networks.
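To make Eq. (10.8) concrete, here is a single-sample Monte Carlo estimate of the ELBO under simple illustrative assumptions: a standard normal prior $p(z)$, a Gaussian decoder $p(x|z)$ with unit variance, and a Gaussian encoder $q(z|x)$ with unit variance. The linear maps standing in for the encoder and decoder networks are hypothetical placeholders, not an architecture from the text.

```python
import numpy as np

# Illustrative one-sample Monte Carlo estimate of the ELBO in Eq. (10.8),
# assuming p(z) = N(0, I_d), decoder p(x|z) = N(W_dec z, I_n), and
# encoder q(z|x) = N(W_enc x, I_d). The linear maps are stand-ins for
# neural networks.
rng = np.random.default_rng(0)
n, d = 10, 2                       # assumed data and latent dimensions

def log_gaussian(v, mean, var):
    """Log density of N(mean, var * I) evaluated at v."""
    return -0.5 * np.sum((v - mean) ** 2 / var + np.log(2 * np.pi * var))

W_enc = rng.normal(size=(d, n))    # "encoder" parameters (placeholder)
W_dec = rng.normal(size=(n, d))    # "decoder" parameters (placeholder)

def elbo(x):
    mu = W_enc @ x                          # mean of q(z|x)
    z = mu + rng.normal(size=d)             # sample z ~ q(z|x)
    # log p(x, z) = log p(z) + log p(x|z)
    log_p_xz = (log_gaussian(z, 0.0, 1.0)
                + log_gaussian(x, W_dec @ z, 1.0))
    log_q = log_gaussian(z, mu, 1.0)        # log q(z|x)
    return log_p_xz - log_q                 # one-sample estimate of (10.8)

x = rng.normal(size=n)
print(elbo(x))
```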