correspond to the average variance per pixel. Thus a random sample from
the learned distribution will look like a noisy version of the average image.
This example also shows that matching the log-likelihood of the average
training and held-out sample is insufficient for good distribution learning.
The Gaussian model has only d + 1 parameters, and simple concentration
bounds show, under fairly general conditions (such as the coordinates of the x_i's
being bounded), that once the number of training samples is moderately large,
the log-likelihood of the average test sample is close to that of the
average training sample. However, the learned distribution may be nothing
like the true distribution.
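To make this concrete, here is a small sketch, on synthetic placeholder data rather than any particular image dataset, of fitting such a (d + 1)-parameter Gaussian: a per-pixel mean vector plus one shared variance, the average variance per pixel. It compares the average train and held-out log-likelihoods and draws one sample, which is just the mean image plus isotropic noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "images": placeholder data standing in for a real image dataset.
d = 64                                  # number of pixels per image
n_train, n_test = 5000, 5000
pixel_means = rng.uniform(0.2, 0.8, size=d)
X_train = pixel_means + 0.3 * rng.standard_normal((n_train, d))
X_test = pixel_means + 0.3 * rng.standard_normal((n_test, d))

# Fit the (d + 1)-parameter model: mean image mu plus one shared variance
# sigma^2, taken as the average variance per pixel.
mu = X_train.mean(axis=0)
sigma2 = X_train.var(axis=0).mean()

def avg_loglik(X, mu, sigma2):
    """Average log-likelihood per sample under N(mu, sigma2 * I)."""
    diff = X - mu
    return np.mean(-0.5 * (d * np.log(2 * np.pi * sigma2)
                           + (diff ** 2).sum(axis=1) / sigma2))

print("avg train log-likelihood:", avg_loglik(X_train, mu, sigma2))
print("avg test  log-likelihood:", avg_loglik(X_test, mu, sigma2))

# A sample from the learned distribution: the mean image plus isotropic noise.
sample = mu + np.sqrt(sigma2) * rng.standard_normal(d)
```

On real image data the same phenomenon appears: the two average log-likelihoods stay close, yet every sample is only the mean image corrupted by pixel-wise noise.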
This is reminiscent of the situation in supervised learning, where
a nonsensical model (one that outputs random labels) also has excellent
generalization.
As in supervised learning, one has to keep track of the training log-likelihood
in addition to generalization, and choose among the models
that maximize it. In general this is computationally intractable even in
simple settings.
Theorem 10.1.2. The θ maximizing (10.2) minimizes the KL divergence
KL(P||Q), where P is the true distribution and Q is the learned distribution.
Proof. TBD
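A short sketch of the standard argument, under the assumption that (10.2) denotes the average log-likelihood of the training sample, which for a large sample concentrates around E_{x∼P} [log Q_θ(x)], writing Q_θ for the learned distribution with parameters θ:

E_{x∼P} [log Q_θ(x)] = E_{x∼P} [log P(x)] − E_{x∼P} [log (P(x)/Q_θ(x))] = −H(P) − KL(P || Q_θ).

Since the entropy H(P) does not depend on θ, maximizing the expected log-likelihood is equivalent to minimizing KL(P || Q_θ).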
10.2 Variational methods
The variational method leverages duality, a widespread principle
in mathematics. You may have seen LP duality in an algorithms class. The
name “variational” in the title refers to the calculus of variations, the part
of mathematics where such principles are studied.
This method maintains some estimate q(h|x) of p(h|x) and improves
it. One useful fact is that:
log p(x) ≥ E_{q(h|x)} [log p(x, h)] + H[q(h|x)],   ∀ q(h|x)    (10.3)
where H is the Shannon entropy.
We would like to prove this bound on log p(x) and then resort to maximizing
the lower bound given in (10.3), referred to as the evidence
lower bound (ELBO). Towards this end we introduce the Kullback-Leibler
(KL) divergence between two distributions, given by

KL[q(h|x) || p(h|x)] = E_{q(h|x)} [log (q(h|x) / p(h|x))].
Moreover, p(x) p(h|x) = p(x, h) holds by Bayes’ rule. Then we can
see that

KL[q(h|x) || p(h|x)] = E_{q(h|x)} [log q(h|x) − log p(x, h) + log p(x)]
                     = −H[q(h|x)] − E_{q(h|x)} [log p(x, h)] + log p(x).    (10.4)

Since the KL divergence is nonnegative, rearranging (10.4) gives the bound (10.3).
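As a quick numerical sanity check of (10.3) and (10.4), here is a small sketch with a made-up discrete latent-variable model; the joint probabilities and the variational distribution q are arbitrary placeholder numbers, not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary joint distribution p(x, h) over a small discrete space
# (4 values of x, 3 values of h); purely illustrative numbers.
p_xh = rng.random((4, 3))
p_xh /= p_xh.sum()

x = 2                                   # condition on one observed value of x
p_x = p_xh[x].sum()                     # p(x), marginalizing over h
p_h_given_x = p_xh[x] / p_x             # true posterior p(h|x)

# An arbitrary variational estimate q(h|x); any distribution over h works.
q = rng.random(3)
q /= q.sum()

elbo = np.sum(q * np.log(p_xh[x])) - np.sum(q * np.log(q))   # E_q[log p(x,h)] + H[q]
kl = np.sum(q * np.log(q / p_h_given_x))                     # KL[q(h|x) || p(h|x)]

print("log p(x) :", np.log(p_x))
print("ELBO + KL:", elbo + kl)          # equals log p(x), as in (10.4)
print("ELBO     :", elbo)               # never exceeds log p(x), as in (10.3)
```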