
$$
\begin{aligned}
\mathrm{KL}[q(h|x)\,\|\,p(h|x)] &= \mathbb{E}_{q(h|x)}\!\left[\log \frac{q(h|x)\,p(x)}{p(x,h)}\right] \qquad (10.5)\\[4pt]
&= \underbrace{\mathbb{E}_{q(h|x)}[\log q(h|x)]}_{-H(q(h|x))} \;-\; \mathbb{E}_{q(h|x)}[\log p(x,h)] \;+\; \mathbb{E}_{q(h|x)}[\log p(x)] \qquad (10.6)
\end{aligned}
$$

But we know that the KL divergence is always nonnegative, so we get

$$
\mathbb{E}_{q(h|x)}[\log p(x)] \;-\; \mathbb{E}_{q(h|x)}[\log p(x,h)] \;-\; H(q(h|x)) \;\geq\; 0, \qquad (10.7)
$$

which is the same as the ELBO (10.3), since $\log p(x)$ is constant over $q(h|x)$ and hence equal to its expectation.
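
To see the bound concretely, here is a small numerical check (a toy example of ours, not from the text): for a made-up joint $p(x,h)$ over a binary latent variable, any choice of $q(h|x)$ gives an ELBO at most $\log p(x)$, with equality exactly at the true posterior.

```python
import numpy as np

# Toy joint p(x, h) over a binary latent h, for one fixed observation x.
# These probabilities are made up purely to illustrate the bound.
p_xh = np.array([0.3, 0.1])        # p(x, h=0), p(x, h=1)
log_px = np.log(p_xh.sum())        # log p(x) = log sum_h p(x, h)

def elbo(q):
    """ELBO(q) = E_q[log p(x,h)] - E_q[log q(h|x)] = E_q[log p(x,h)] + H(q)."""
    return np.sum(q * np.log(p_xh)) - np.sum(q * np.log(q))

# Any distribution q(h|x) gives ELBO(q) <= log p(x) ...
q_arbitrary = np.array([0.5, 0.5])
print(elbo(q_arbitrary), "<=", log_px)

# ... with equality when q(h|x) = p(h|x), i.e. when KL[q || p(h|x)] = 0.
q_posterior = p_xh / p_xh.sum()
print(np.isclose(elbo(q_posterior), log_px))   # True
```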

Variational methods use some form of gradient descent or local improvement to improve q(h|x). For details, see the blog post on offconvex.org by Arora and Risteski.
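
As a rough illustration of such local improvement (a sketch under our own assumptions, not the method from the blog post), the following does gradient ascent on a Monte Carlo estimate of the ELBO for a one-dimensional conjugate Gaussian model, using the reparameterization trick; the model, variational family, and hyperparameters are all made up for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.tensor(2.0)                      # a single observed datapoint

# Assumed generative model for this toy example:
#   h ~ N(0, 1),   x | h ~ N(h, 1)
def log_joint(x, h):
    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(h)
    log_lik = torch.distributions.Normal(h, 1.0).log_prob(x)
    return log_prior + log_lik

# Variational family q(h|x) = N(mu, sigma^2), parameterized by mu and log_sigma.
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(2000):
    sigma = log_sigma.exp()
    eps = torch.randn(64)                  # reparameterization: h = mu + sigma * eps
    h = mu + sigma * eps
    q = torch.distributions.Normal(mu, sigma)
    elbo = (log_joint(x, h) - q.log_prob(h)).mean()   # Monte Carlo ELBO estimate
    loss = -elbo                           # minimize -ELBO, i.e. gradient ascent on ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()

# For this conjugate model the exact posterior is N(x/2, 1/2),
# so mu should approach 1.0 and sigma^2 should approach 0.5.
print(mu.item(), (log_sigma.exp() ** 2).item())
```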

10.3 Autoencoders

Autoencoders are a special case of density estimation that is popular in the neural nets world. It is assumed that the model first generates a latent representation from a simple distribution, and then uses some simple circuit to map it to the visible datapoint x. This circuit is called the decoder. There is a companion encoder circuit that maps x to a latent representation. The pair of circuits is called an autoencoder.

In other words, the autoencoder is trying to learn an approximation to the identity function, so as to output an x′ that is similar to x.
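
A minimal sketch of such an encoder/decoder pair, assuming PyTorch and illustrative dimensions (the architecture and training details are our choices, not prescribed by the text):

```python
import torch
import torch.nn as nn

n, k = 64, 8                      # visible and latent dimensions (illustrative values)

encoder = nn.Sequential(nn.Linear(n, 32), nn.ReLU(), nn.Linear(32, k))   # x -> z
decoder = nn.Sequential(nn.Linear(k, 32), nn.ReLU(), nn.Linear(32, n))   # z -> x'
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(256, n)           # stand-in data; any batch of n-dimensional points works

for step in range(500):
    x_hat = decoder(encoder(x))                   # reconstruction x'
    loss = ((x_hat - x) ** 2).mean()              # force x' to be close to x
    opt.zero_grad()
    loss.backward()
    opt.step()
```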

In order to force the algorithm to learn a function different from the identity function, we place constraints on the model, such as by limiting the representation z to be of low dimensionality. In the simplest example, assume k vectors $u_1, \ldots, u_k \in \mathbb{R}^n$, where $k \ll n$, and $x = \sum_i \alpha_i u_i + \sigma$, where $\sigma$ is Gaussian noise. By applying rank-k PCA, one can approximately recover the subspace $\mathrm{span}(u_1, \ldots, u_k)$. Thus PCA/SVD can be seen as a simple form of autoencoder learning.
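
The following sketch (our own illustration, with made-up dimensions and noise level) generates data of this form and checks that rank-k PCA via the SVD recovers the subspace $\mathrm{span}(u_1, \ldots, u_k)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 50, 3, 2000                      # ambient dim, latent dim, number of samples

U = np.linalg.qr(rng.standard_normal((n, k)))[0]        # orthonormal u_1, ..., u_k
alpha = rng.standard_normal((m, k))
X = alpha @ U.T + 0.05 * rng.standard_normal((m, n))    # x = sum_i alpha_i u_i + noise

# Rank-k PCA: the top-k right singular vectors of the centered data matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:k].T                               # estimated basis of span(u_1, ..., u_k)

# The recovered subspace matches span(U): the singular values of U^T V are the
# cosines of the principal angles between the two subspaces, all close to 1.
print(np.linalg.svd(U.T @ V, compute_uv=False))
```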

An autoencoder can also be seen as an example of the so-called manifold view, whereby data is assumed to have a latent representation z which has a simpler distribution than x.

10.3.1 Sparse autoencoder

Our argument above relied on the size of the encoded representation of the data being small. But even when this is not the case (i.e., k > n), we can still learn meaningful structure by imposing other constraints on the representation, such as sparsity.
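
One common way to impose such a constraint is an ℓ1 penalty on the latent activations; a minimal sketch, again assuming PyTorch, with illustrative sizes and penalty weight (the specific penalty and architecture are our choices, not taken from the text):

```python
import torch
import torch.nn as nn

n, k = 32, 128                    # overcomplete code: k > n
lam = 1e-3                        # sparsity penalty weight (illustrative value)

encoder = nn.Sequential(nn.Linear(n, k), nn.ReLU())    # x -> z, with z nonnegative
decoder = nn.Linear(k, n)                               # z -> x'
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(512, n)           # stand-in data

for step in range(500):
    z = encoder(x)                                      # latent code of dimension k > n
    x_hat = decoder(z)
    # Reconstruction error plus an l1 penalty that encourages sparse activations.
    loss = ((x_hat - x) ** 2).mean() + lam * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```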
