$$
\begin{aligned}
\mathrm{KL}[q(h|x)\,\|\,p(h|x)] &= \mathbb{E}_{q(h|x)}\!\left[\log \frac{q(h|x)\,p(x)}{p(x,h)}\right] \qquad (10.5)\\
&= \underbrace{\mathbb{E}_{q(h|x)}[\log q(h|x)]}_{-H(q(h|x))} - \mathbb{E}_{q(h|x)}[\log p(x,h)] + \mathbb{E}_{q(h|x)}[\log p(x)] \qquad (10.6)
\end{aligned}
$$
But we know that the KL divergence is always nonnegative, so we
get:
$$
\mathbb{E}_{q(h|x)}[\log p(x)] - \mathbb{E}_{q(h|x)}[\log p(x,h)] - H(q(h|x)) \geq 0 \qquad (10.7)
$$
which is the same as the ELBO (10.3), since $\log p(x)$ is constant over $q(h|x)$ and hence equal to its expectation.
Variational methods use some form of gradient descent or local improvement to improve $q(h|x)$. For details, see the blog post on offconvex.org by Arora and Risteski.
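To make this concrete, here is a minimal sketch of gradient ascent on the ELBO for a toy conjugate model $p(x,h) = N(h;0,1)\,N(x;h,1)$ with a Gaussian variational posterior $q(h|x) = N(h;\mu,\sigma^2)$. The model, the reparameterized gradient estimator, and all variable names are our own illustrative choices, not anything prescribed by the text.

```python
# Sketch: variational inference by gradient ascent on the ELBO, for the
# toy model p(x, h) = N(h; 0, 1) * N(x; h, 1) and q(h|x) = N(h; mu, sigma^2).
# The model and all parameter choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = 2.0                        # a single observed datapoint

mu, log_sigma = 0.0, 0.0       # parameters of q(h|x)
lr, n_samples = 0.05, 256

for step in range(500):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_samples)
    h = mu + sigma * eps       # reparameterized samples from q(h|x)

    # ELBO = E_q[log p(x, h)] + H(q); for a Gaussian q,
    # H(q) = 0.5 * log(2*pi*e*sigma^2). Tracked here for monitoring.
    log_p_xh = -0.5 * h**2 - 0.5 * (x - h) ** 2      # up to constants
    elbo = log_p_xh.mean() + 0.5 * np.log(2 * np.pi * np.e * sigma**2)

    # Reparameterized gradients: d log p / dh = -h + (x - h),
    # dh/dmu = 1, dh/dlog_sigma = sigma * eps; dH/dlog_sigma = 1.
    dlogp_dh = -h + (x - h)
    grad_mu = dlogp_dh.mean()
    grad_log_sigma = (dlogp_dh * sigma * eps).mean() + 1.0

    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

# For this conjugate toy model the exact posterior is N(x/2, 1/2).
print(mu, np.exp(log_sigma) ** 2)   # should approach 1.0 and 0.5
```

Because the toy model is conjugate, the exact posterior $N(x/2, 1/2)$ is available, so the quality of the learned $q(h|x)$ can be checked directly.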
10.3 Autoencoders
Autoencoders are a special case of density estimation that is popular in the neural nets world. The model is assumed to first generate a latent representation from a simple distribution, and then map it to the visible datapoint $x$ using a simple circuit, called the decoder. There is a companion encoder circuit that maps $x$ to a latent representation. The pair of circuits is called an autoencoder.
In other words, the autoencoder tries to learn an approximation to the identity function, so that its output $x'$ is similar to $x$.
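As an illustration, here is a minimal encoder/decoder pair trained to reconstruct its input. The framework (PyTorch), the linear layers, and the training details are assumptions made for this sketch, not details from the text.

```python
# A minimal encoder/decoder pair of the kind described above; framework,
# layer sizes, and training details are illustrative assumptions.
import torch
import torch.nn as nn

n, k = 64, 8                          # data dimension n, bottleneck k << n

encoder = nn.Linear(n, k)             # x  -> z
decoder = nn.Linear(k, n)             # z  -> x'
opt = torch.optim.Adam(list(encoder.parameters())
                       + list(decoder.parameters()), lr=1e-3)

x = torch.randn(512, n)               # stand-in data batch
for step in range(1000):
    z = encoder(x)
    x_hat = decoder(z)
    loss = ((x_hat - x) ** 2).mean()  # reconstruction error ||x' - x||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With linear maps and squared reconstruction error, the optimal solution spans the same subspace as rank-$k$ PCA, which is exactly the connection drawn next.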
In order to force the algorithm to learn a function different from the identity function, we place constraints on the model, such as limiting the representation $z$ to be of low dimensionality. In the simplest example, assume $k$ vectors $u_1, \ldots, u_k \in \mathbb{R}^n$, where $k \ll n$, and $x = \sum_i \alpha_i u_i + \sigma$, where $\sigma$ is Gaussian noise. By applying rank-$k$ PCA, one can recover the component of $x$ lying in $\mathrm{span}(u_1, \ldots, u_k)$. Thus PCA/SVD can be seen as a simple form of autoencoder learning.
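This PCA example can be checked in a few lines of numpy. The dimensions, the noise level, and the principal-angle test at the end are illustrative assumptions.

```python
# Sketch of the rank-k PCA example: data generated as x = sum_i a_i u_i
# + Gaussian noise, with span(u_1, ..., u_k) recovered from the top-k
# singular vectors. Dimensions and noise level are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 100, 5, 2000                 # ambient dim, latent dim, samples

U = np.linalg.qr(rng.standard_normal((n, k)))[0]        # orthonormal u_1..u_k
alpha = rng.standard_normal((m, k))
X = alpha @ U.T + 0.05 * rng.standard_normal((m, n))    # x = sum a_i u_i + noise

# Rank-k PCA: top-k right singular vectors of the centered data matrix.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt[:k].T                           # estimated basis for span(u_1..u_k)

# Principal-angle check: singular values of U^T V are the cosines of the
# angles between the true and recovered subspaces.
overlap = np.linalg.svd(U.T @ V, compute_uv=False)
print(overlap)                         # all values close to 1
```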
An autoencoder can also be seen as an example of the so-called manifold view, whereby data is assumed to have a latent representation $z$ which has a simpler distribution than $x$.
10.3.1 Sparse autoencoder
Our argument above relied on the encoded representation of the data being small. But even when this is not the case (i.e., $k > n$), we can still learn meaningful structure by imposing other constraints, such as requiring the representation $z$ to be sparse.
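A sketch of this idea, assuming (as the heading suggests) that the extra constraint is an $\ell_1$ sparsity penalty on the representation $z$; the overcomplete sizes and the penalty weight are our own choices, not the text's.

```python
# The autoencoder loop from above, now with an overcomplete code (k > n)
# kept meaningful by an L1 penalty on z. Sizes and the penalty weight
# are illustrative assumptions.
import torch
import torch.nn as nn

n, k = 64, 256                        # overcomplete: k > n
encoder = nn.Sequential(nn.Linear(n, k), nn.ReLU())
decoder = nn.Linear(k, n)
opt = torch.optim.Adam(list(encoder.parameters())
                       + list(decoder.parameters()), lr=1e-3)

x = torch.randn(512, n)
for step in range(1000):
    z = encoder(x)
    x_hat = decoder(z)
    # reconstruction error plus an L1 term pushing most coordinates of z to zero
    loss = ((x_hat - x) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```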