
correspond to the average variance per pixel. Thus a random sample from the learned distribution will look like some noisy version of the average pixel. This example also shows that matching the log-likelihood of the average training sample and the average held-out sample is insufficient for good distribution learning.

The Gaussian model has only d + 1 parameters, and simple concentration bounds show, under fairly general conditions (such as the coordinates of the x_i's being bounded), that once the number of training samples is moderately large, the log-likelihood of the average test sample is similar to that of the average training sample. However, the learned distribution may be nothing like the true distribution.
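To make this concrete, here is a minimal numerical sketch (our own illustration, not from the text): synthetic bimodal "images" stand in for real data, numpy is the only dependency, and all constants are arbitrary. It checks both claims: train and held-out average log-likelihoods nearly coincide, yet a sample from the fitted Gaussian is just noise around the average pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 10_000                      # pixels per "image", sample size

def draw(n):
    # True distribution: a mixture of two well-separated images
    # (all-0s or all-1s), plus a little per-pixel noise.
    centers = np.stack([np.zeros(d), np.ones(d)])
    idx = rng.integers(0, 2, size=n)
    return centers[idx] + 0.05 * rng.standard_normal((n, d))

train, test = draw(n), draw(n)

# Maximum-likelihood fit of a diagonal Gaussian: per-pixel mean and variance.
mu, var = train.mean(axis=0), train.var(axis=0)

def avg_loglik(x):
    # Average log-density of N(mu, diag(var)) over the rows of x.
    quad = ((x - mu) ** 2 / var).sum(axis=1)
    return -0.5 * (quad + np.log(2 * np.pi * var).sum()).mean()

print(avg_loglik(train), avg_loglik(test))  # nearly identical
sample = mu + np.sqrt(var) * rng.standard_normal(d)
print(sample[:5])  # each pixel fluctuates independently around the average
                   # value 0.5, unlike true samples, which sit near all-0s
                   # or all-1s
```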

This is reminiscent of the situation in supervised learning, where a nonsensical model, one that outputs random labels, also generalizes excellently: its test loss matches its training loss.

As in supervised learning, one has to keep track of the training log-likelihood in addition to generalization, and choose among models those that maximize it. In general this is computationally intractable even for simple settings.

Theorem 10.1.2. The θ maximizing (10.2) minimizes the KL divergence KL(P||Q), where P is the true distribution and Q is the learned distribution.

Proof. By definition, KL(P||Q) = E_{x∼P}[log P(x)] − E_{x∼P}[log Q(x)]. The first term does not depend on θ, so maximizing the average log-likelihood in (10.2), which estimates E_{x∼P}[log Q(x)], is equivalent to minimizing KL(P||Q).
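As a quick sanity check of the theorem, the following Monte Carlo sketch (an illustrative setup of our own, not from the text) compares, over a grid of θ, the expected log-likelihood under P with the closed-form KL(P||Q_θ) for the family Q_θ = N(θ, 1) and true distribution P = N(2, 1); the maximizer of one is the minimizer of the other.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=200_000)   # samples from P = N(2, 1)

thetas = np.linspace(0.0, 4.0, 81)
# Expected log-likelihood of N(theta, 1), up to a constant shared by all theta.
loglik = np.array([-0.5 * ((x - t) ** 2).mean() for t in thetas])
# Closed-form KL between unit-variance Gaussians: (mu_P - theta)^2 / 2.
kl = (2.0 - thetas) ** 2 / 2

print(thetas[loglik.argmax()], thetas[kl.argmin()])  # both ~2.0
```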

10.2 Variational methods

The variational method leverages duality, a widespread principle

in math. You may have seen LP duality in an algorithms class. The

name “variational” in the title refers to the calculus of variations, the part of math where such principles are studied.

This method maintains some estimate q(h|x) of p(h|x) and improves

it. One useful fact is that:

$$\log p(x) \;\geq\; \mathbb{E}_{q(h|x)}\big[\log p(x,h)\big] + H[q(h|x)], \qquad \forall\, q(h|x), \tag{10.3}$$

where H is the Shannon entropy.

We would like to prove this bound on log p(x), so that we can resort to maximizing the lower bound in (10.3), referred to as the evidence lower bound (ELBO). Towards this end we introduce the Kullback-Leibler (KL) divergence between two distributions, given by

$$\mathrm{KL}\big[q(h|x)\,\|\,p(h|x)\big] = \mathbb{E}_{q(h|x)}\left[\log \frac{q(h|x)}{p(h|x)}\right].$$

Moreover, p(x)p(h|x) = p(x, h) holds by Bayes' rule. Then we can see that

$$\mathrm{KL}\big[q(h|x)\,\|\,p(h|x)\big] = \mathbb{E}_{q(h|x)}\big[\log q(h|x)\big] - \mathbb{E}_{q(h|x)}\big[\log p(x,h)\big] + \log p(x), \tag{10.4}$$

and rearranging, using $\mathbb{E}_{q(h|x)}[\log q(h|x)] = -H[q(h|x)]$, gives

$$\log p(x) = \mathbb{E}_{q(h|x)}\big[\log p(x,h)\big] + H[q(h|x)] + \mathrm{KL}\big[q(h|x)\,\|\,p(h|x)\big].$$

Since the KL divergence is nonnegative, this proves (10.3), with equality exactly when q(h|x) = p(h|x).
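The decomposition above is easy to verify numerically. The sketch below (a hypothetical two-state latent model of our own; numpy only) evaluates the ELBO for several choices of q(h|x) and checks that the gap log p(x) − ELBO equals KL[q(h|x) || p(h|x)], vanishing exactly at the posterior.

```python
import numpy as np

p_joint = np.array([0.1, 0.3])         # p(x, h) for h in {0, 1}, x fixed
log_px = np.log(p_joint.sum())         # log p(x), marginalizing over h
posterior = p_joint / p_joint.sum()    # p(h|x) by Bayes' rule

def elbo(q):
    # E_{q(h|x)}[log p(x, h)] + H[q(h|x)]
    return (q * np.log(p_joint)).sum() - (q * np.log(q)).sum()

for q0 in [0.1, posterior[0], 0.5, 0.9]:
    q = np.array([q0, 1 - q0])
    gap = log_px - elbo(q)             # equals KL[q || p(h|x)] >= 0
    print(f"q(h=0)={q0:.3f}  ELBO={elbo(q):.4f}  gap={gap:.4f}")
# The gap vanishes exactly when q matches the posterior p(h|x).
```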
