Theory of Deep Learning, 2022
104 theory of deep learning
Figure 10.1: Visualization of Pearson's Crab Data as a mixture of two Gaussians. (Credit: MIX homepage at McMaster University.)
as well as ρ1. The visible part x consists of the attribute vector for a crab. The hidden vector h consists of a bit indicating which of the two Gaussians this x was generated from, as well as the value of the Gaussian random variable that generated x.
The goal is to learn this description of the distribution given i.i.d.
samples of x.
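To make the generative process concrete, here is a minimal sketch of sampling (h, x) pairs from a mixture of two one-dimensional Gaussians. The mixing weight ρ1 and the component means and standard deviations are hypothetical values chosen purely for illustration; the crab data itself is multivariate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: rho_1 is the probability of component 0.
rho_1 = 0.6
means = np.array([0.0, 4.0])
stds = np.array([1.0, 0.5])

def sample(n):
    """Draw n samples of (h, x): h is the hidden component bit,
    x is the visible observation generated from that component."""
    h = (rng.random(n) >= rho_1).astype(int)   # h = 0 with prob rho_1
    x = rng.normal(means[h], stds[h])          # x ~ N(mean_h, std_h)
    return h, x

h, x = sample(1000)
```

The learning problem is the reverse direction: given only the samples x (with h discarded), recover ρ1 and the component parameters.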
Learning a good representation/featurization of data. For example, the pixel representation of images may not be very useful in other tasks, and one may desire a more "high level" representation that allows downstream tasks to be solved in a data-efficient way. One would hope to learn such a featurization using unlabeled data.
In some settings, featurization is done via distribution learning: one assumes a data distribution p θ (h, x) as above, and the featurization of the visible sample point x is assumed to be the hidden variable h that was used to generate it. More precisely, the hidden variable is a sample from the conditional distribution p(h|x). This view of representation learning is sometimes called an autoencoder in deep learning.
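For the two-Gaussian mixture above, the conditional p(h|x) is available in closed form via Bayes' rule, so "encoding" a point x amounts to sampling its component bit from the posterior. The parameters below are the same hypothetical values used for illustration, not fitted quantities.

```python
import math
import random

# Hypothetical 1-D mixture: h in {0, 1}, prior (rho, 1 - rho).
rho = 0.6
mu = (0.0, 4.0)
sigma = (1.0, 0.5)

def gauss_pdf(x, m, s):
    """Density of N(m, s^2) at x."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def posterior(x):
    """p(h = 0 | x) by Bayes' rule: prior times likelihood, normalized."""
    p0 = rho * gauss_pdf(x, mu[0], sigma[0])
    p1 = (1 - rho) * gauss_pdf(x, mu[1], sigma[1])
    return p0 / (p0 + p1)

def encode(x):
    """Featurize x by sampling the latent h from p(h | x)."""
    return 0 if random.random() < posterior(x) else 1
```

Note that `encode` is stochastic: decoding (sampling x' from p(x|h)) and re-encoding need not return the original x, which is the point made in Figure 10.2.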
For example, topic models are a simple probabilistic model of
text generation, where x is some piece of text, and h is the proportion
of specific topics (“sports,” “politics” etc.). Then one could
imagine that h is some short and more high-level descriptor of x.
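The topic-model generative process can be sketched in a few lines: draw topic proportions h (here from a Dirichlet prior, as in LDA-style models), then draw each word of the document from the resulting mixture over the vocabulary. The toy vocabulary and topic-word probabilities below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy vocabulary and two topics ("sports", "politics").
vocab = ["game", "team", "vote", "law", "win", "senate"]
topic_word = np.array([
    [0.35, 0.35, 0.02, 0.03, 0.23, 0.02],   # sports: word probabilities
    [0.02, 0.03, 0.35, 0.30, 0.05, 0.25],   # politics: word probabilities
])

def generate_doc(n_words=10, alpha=0.5):
    """Sample latent topic proportions h, then draw words from the
    induced mixture distribution over the vocabulary."""
    h = rng.dirichlet([alpha, alpha])        # h = (p_sports, p_politics)
    word_dist = h @ topic_word               # mix the topics' word distributions
    words = rng.choice(vocab, size=n_words, p=word_dist)
    return h, list(words)

h, doc = generate_doc()
```

Here h is exactly the short, high-level descriptor of the document x: two numbers summing to one, versus the full bag of words.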
Many techniques for density estimation —such as variational methods, described later —also give a notion of a representation: the
Figure 10.2: Autoencoder defined using a density distribution p(h, x), where h is the latent feature vector corresponding to visible vector x. The process of computing h given x is called "encoding" and the reverse is called "decoding." In general, applying the encoder on x followed by the decoder would not give x again, since the composed transformation is a sample from a distribution.