such saddle points are non-degenerate, it suffices to show $g''(0) < 0$. It is easy to check that the second directional derivative at the origin is given by
\[
g''(0) = -4\sigma_j^2 \left( w_{j(t)}^\top M w_{j(t)} - w_j^\top M w_j \right) < 0,
\]
which completes the proof.
9.4 Role of Parametrization
For least squares linear regression (i.e., for $k = 1$ and $u = W_1^\top \in \mathbb{R}^{d_0}$ in Problem 9.8), we can show that using dropout amounts to solving the following regularized problem:
\[
\min_{u \in \mathbb{R}^{d_0}} \; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - u^\top x_i \right)^2 + \lambda\, u^\top \widehat{C} u.
\]
All the minimizers of the above problem are solutions to the system of linear equations $(1 + \lambda) X^\top X u = X^\top y$, where $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d_0}$ and $y = [y_1, \dots, y_n]^\top \in \mathbb{R}^{n \times 1}$ are the design matrix and the response vector, respectively. Unlike Tikhonov regularization, which yields solutions to the system of linear equations $(X^\top X + \lambda I) u = X^\top y$ (a useful prior that discounts directions accounting for small variance in the data, even when they exhibit good discriminability), the dropout regularizer manifests merely as a scaling of the parameters. This suggests that parametrization plays an important role in determining the nature of the resulting regularizer.
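As a quick numerical sanity check (a minimal sketch, assuming $\widehat{C}$ is the empirical second-moment matrix $\frac{1}{n} X^\top X$, which is what the stated normal equations suggest), the NumPy snippet below verifies that the dropout-regularized solution is exactly a rescaling of the ordinary least-squares solution, whereas the Tikhonov (ridge) solution is not:

```python
import numpy as np

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 0.5
XtX, Xty = X.T @ X, X.T @ y

u_ols = np.linalg.solve(XtX, Xty)                       # ordinary least squares
u_dropout = np.linalg.solve((1 + lam) * XtX, Xty)       # (1 + lam) X^T X u = X^T y
u_ridge = np.linalg.solve(XtX + lam * np.eye(d), Xty)   # (X^T X + lam I) u = X^T y

# The dropout solution is exactly u_ols / (1 + lam): a pure rescaling.
print(np.allclose(u_dropout, u_ols / (1 + lam)))  # True
# The ridge solution is not such a rescaling (in general).
print(np.allclose(u_ridge, u_ols / (1 + lam)))    # False (in general)
```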
However, a similar result was shown for deep linear networks [?]: the data-dependent regularization due to dropout again amounts merely to a scaling of the parameters. In the case of matrix sensing, by contrast, we see a richer class of regularizers. One potential explanation is that, for linear networks, a convolutional structure in the network is required to yield rich inductive biases. For instance, matrix sensing can be written as a two-layer network in the following convolutional form:
\[
\langle UV^\top, A \rangle = \langle U^\top, V^\top A^\top \rangle = \langle U^\top, (I \otimes V^\top) A^\top \rangle.
\]
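The identity can be checked numerically; a minimal NumPy sketch is given below, where the last expression is read with $U^\top$ and $A^\top$ vectorized column-wise (an assumption about the intended $\mathrm{vec}$ convention):

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r = 4, 3, 2
U = rng.normal(size=(d1, r))
V = rng.normal(size=(d2, r))
A = rng.normal(size=(d1, d2))

# Frobenius inner product <M, N> = trace(M^T N).
inner = lambda M, N: np.trace(M.T @ N)

lhs = inner(U @ V.T, A)        # <U V^T, A>
mid = inner(U.T, V.T @ A.T)    # <U^T, V^T A^T>

# Kronecker form: vec(V^T A^T) = (I ⊗ V^T) vec(A^T), with column-major vec.
vec = lambda M: M.reshape(-1, order="F")
rhs = vec(U.T) @ (np.kron(np.eye(d1), V.T) @ vec(A.T))

print(np.allclose(lhs, mid), np.allclose(mid, rhs))  # True True
```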