correspond to the average variance per pixel. Thus a random sample from
the learned distribution will look like a noisy version of the average image.
This example also shows that matching the log-likelihood of the average
training and held-out sample is insufficient for good distribution learning.
The Gaussian model has only d + 1 parameters, and simple concentration
bounds show, under fairly general conditions (such as the coordinates of the x_i's
being bounded), that once the number of training samples is moderately large,
the log-likelihood of the average test sample is close to that of the
average training sample. However, the learned distribution may be nothing
like the true distribution.
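To make this concrete, here is a small sketch, on synthetic placeholder data rather than any particular image dataset, of fitting such a (d + 1)-parameter Gaussian: a per-pixel mean vector plus one shared variance, the average variance per pixel. It compares the average train and held-out log-likelihoods and draws one sample, which is just the mean image plus isotropic noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "images": placeholder data standing in for a real image dataset.
d = 64                                  # number of pixels per image
n_train, n_test = 5000, 5000
pixel_means = rng.uniform(0.2, 0.8, size=d)
X_train = pixel_means + 0.3 * rng.standard_normal((n_train, d))
X_test = pixel_means + 0.3 * rng.standard_normal((n_test, d))

# Fit the (d + 1)-parameter model: mean image mu plus one shared variance
# sigma^2, taken as the average variance per pixel.
mu = X_train.mean(axis=0)
sigma2 = X_train.var(axis=0).mean()

def avg_loglik(X, mu, sigma2):
    """Average log-likelihood per sample under N(mu, sigma2 * I)."""
    diff = X - mu
    return np.mean(-0.5 * (d * np.log(2 * np.pi * sigma2)
                           + (diff ** 2).sum(axis=1) / sigma2))

print("avg train log-likelihood:", avg_loglik(X_train, mu, sigma2))
print("avg test  log-likelihood:", avg_loglik(X_test, mu, sigma2))

# A sample from the learned distribution: the mean image plus isotropic noise.
sample = mu + np.sqrt(sigma2) * rng.standard_normal(d)
```

On real image data the same phenomenon appears: the two average log-likelihoods stay close, yet every sample is only the mean image corrupted by pixel-wise noise.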
This is reminiscent of the situation in supervised learning, where
a nonsensical model (one that outputs random labels) also has excellent
generalization.
As in supervised learning, one has to keep track of the training log-likelihood
in addition to generalization, and choose among the models
that maximize it. In general this is computationally intractable even in
simple settings.
Theorem 10.1.2. The θ maximizing (10.2) minimizes the KL divergence
KL(P||Q), where P is the true distribution and Q is the learned distribution.
Proof. TBD
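A short sketch of the standard argument, under the assumption that (10.2) denotes the average log-likelihood of the training sample, which for a large sample concentrates around E_{x∼P} [log Q_θ(x)], writing Q_θ for the learned distribution with parameters θ:

E_{x∼P} [log Q_θ(x)] = E_{x∼P} [log P(x)] − E_{x∼P} [log (P(x)/Q_θ(x))] = −H(P) − KL(P || Q_θ).

Since the entropy H(P) does not depend on θ, maximizing the expected log-likelihood is equivalent to minimizing KL(P || Q_θ).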
10.2 Variational methods
The variational method leverages duality, a widespread principle
in mathematics. You may have seen LP duality in an algorithms class. The
name “variational” in the title refers to the calculus of variations, the part
of mathematics where such principles are studied.
This method maintains some estimate q(h|x) of p(h|x) and improves
it. One useful fact is that:
log p(x) ≥ E_{q(h|x)} [log p(x, h)] + H[q(h|x)],   ∀ q(h|x)    (10.3)
where H is the Shannon entropy.
We would like to prove this bound on log p(x) and then resort to maximizing
the lower bound given in (10.3), referred to as the evidence
lower bound (ELBO). Towards this end we introduce the Kullback-Leibler
(KL) divergence between two distributions, given by

KL[q(h|x) || p(h|x)] = E_{q(h|x)} [log (q(h|x) / p(h|x))].
Moreover, p(x) p(h|x) = p(x, h) holds by Bayes’ rule. Then we can
see that

KL[q(h|x) || p(h|x)] = E_{q(h|x)} [log q(h|x) − log p(x, h) + log p(x)]
                     = −H[q(h|x)] − E_{q(h|x)} [log p(x, h)] + log p(x).    (10.4)

Since the KL divergence is nonnegative, rearranging (10.4) gives the bound (10.3).
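As a quick numerical sanity check of (10.3) and (10.4), here is a small sketch with a made-up discrete latent-variable model; the joint probabilities and the variational distribution q are arbitrary placeholder numbers, not taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary joint distribution p(x, h) over a small discrete space
# (4 values of x, 3 values of h); purely illustrative numbers.
p_xh = rng.random((4, 3))
p_xh /= p_xh.sum()

x = 2                                   # condition on one observed value of x
p_x = p_xh[x].sum()                     # p(x), marginalizing over h
p_h_given_x = p_xh[x] / p_x             # true posterior p(h|x)

# An arbitrary variational estimate q(h|x); any distribution over h works.
q = rng.random(3)
q /= q.sum()

elbo = np.sum(q * np.log(p_xh[x])) - np.sum(q * np.log(q))   # E_q[log p(x,h)] + H[q]
kl = np.sum(q * np.log(q / p_h_given_x))                     # KL[q(h|x) || p(h|x)]

print("log p(x) :", np.log(p_x))
print("ELBO + KL:", elbo + kl)          # equals log p(x), as in (10.4)
print("ELBO     :", elbo)               # never exceeds log p(x), as in (10.3)
```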