
decoder p. In particular,

$$
q(z \mid x) = \mathcal{N}\big(z;\ \mu_x,\ \sigma_x^2 I_d\big), \qquad \mu_x, \sigma_x = E_\phi(x), \tag{10.9}
$$

$$
p(x \mid z) = \mathcal{N}\big(x;\ \mu_z,\ \sigma_z^2 I_n\big), \qquad \mu_z, \sigma_z = D_\theta(z), \qquad p(z) = \mathcal{N}(z;\ 0,\ I_d), \tag{10.10}
$$

where $E_\phi$ and $D_\theta$ are the encoder and decoder neural networks parameterized by $\phi$ and $\theta$ respectively, $\mu_x, \mu_z$ are vectors of the corresponding dimensions, and $\sigma_x, \sigma_z$ are (nonnegative) scalars. The particular choice of Gaussians is not a necessity in itself for the model and can be replaced with any other relevant distribution. However, Gaussians provide, as is often the case, computational ease and intuitive backing.
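To make this parameterization concrete, here is a minimal sketch in PyTorch of an encoder/decoder pair producing $(\mu_x, \sigma_x)$ and $(\mu_z, \sigma_z)$. The class names, the single hidden layer of width 256, and the softplus used to keep the scalar scales nonnegative are illustrative assumptions, not part of the model specification above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    # E_phi: maps x in R^n to mu_x in R^d and a nonnegative scalar sigma_x, as in (10.9).
    def __init__(self, n, d, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, d)
        self.scale_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), F.softplus(self.scale_head(h))  # softplus keeps sigma_x >= 0

class GaussianDecoder(nn.Module):
    # D_theta: maps z in R^d to mu_z in R^n and a nonnegative scalar sigma_z, as in (10.10).
    def __init__(self, n, d, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, n)
        self.scale_head = nn.Linear(hidden, 1)

    def forward(self, z):
        h = self.body(z)
        return self.mu_head(h), F.softplus(self.scale_head(h))  # softplus keeps sigma_z >= 0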

The intuitive argument behind the use of Gaussian distributions is that, under mild regularity conditions, every distribution can be approximated (in distribution) by a mixture of Gaussians. This follows from the fact that by approximating the CDF of a distribution by step functions one obtains an approximation in distribution by a mixture of point masses (constant random variables), i.e. a mixture of Gaussians with variance $\approx 0$. The computational ease, on the other hand, is seen most clearly in the training process of VAEs.
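As a quick numerical check of this argument (not from the text), the following sketch places equally weighted Gaussians with near-zero spread at quantiles of an arbitrary target distribution, here a standard exponential, and compares CDFs. The grid size k = 50 and the scale 1e-3 are arbitrary illustrative choices; the gap shrinks as k grows.

import numpy as np
from scipy import stats

target = stats.expon()                           # arbitrary target distribution to approximate

k = 50                                           # number of mixture components (illustrative)
centers = target.ppf((np.arange(k) + 0.5) / k)   # one component at each midpoint quantile
scale = 1e-3                                     # near-zero spread: each component ~ a point mass

def mixture_cdf(x):
    # CDF of the equally weighted mixture of N(center, scale^2) components.
    return stats.norm.cdf((x[:, None] - centers[None, :]) / scale).mean(axis=1)

xs = np.linspace(0.0, 5.0, 200)
gap = np.max(np.abs(mixture_cdf(xs) - target.cdf(xs)))
print(f"max CDF gap on [0, 5]: {gap:.3f}")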

10.4.1 Training VAEs

As previously mentioned, the training of variational autoencoders involves maximizing the RHS of (10.8), the ELBO, over the parameters $\phi, \theta$ under the model described by (10.9), (10.10). Given that the parametric model is based on two neural networks $E_\phi$, $D_\theta$, the objective is optimized via gradient-based methods. Since the objective involves an expectation over $q(z \mid x)$, computing it exactly, and consequently its gradient, is intractable, so we resort to (unbiased) gradient estimators and eventually use a stochastic gradient-based optimization method (e.g. SGD).
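Under the model (10.9), (10.10) the objective splits conveniently: the KL term between $q(z \mid x)$ and the standard Gaussian prior is available in closed form, and only the reconstruction term needs Monte Carlo estimation. A standard rewriting under these modeling assumptions (the exact statement of (10.8) lies outside this excerpt) is

\begin{align*}
\mathrm{ELBO}(x; \phi, \theta)
  &= \mathbb{E}_{z \sim q(z \mid x)}\big[\log p(x \mid z)\big]
     - \mathrm{KL}\big(q(z \mid x)\,\big\|\,p(z)\big), \\
\mathrm{KL}\big(\mathcal{N}(\mu_x, \sigma_x^2 I_d)\,\big\|\,\mathcal{N}(0, I_d)\big)
  &= \tfrac{1}{2}\Big(\|\mu_x\|^2 + d\,\sigma_x^2 - d - d \log \sigma_x^2\Big).
\end{align*}

The first (reconstruction) term is the one whose exact computation is intractable; it is estimated from samples of $z \sim q(z \mid x)$, which is where the reparameterization trick below enters.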

In this section, we use the notation $\mu_\phi(x), \sigma_\phi(x) = E_\phi(x)$ and $\mu_\theta(z), \sigma_\theta(z) = D_\theta(z)$ to emphasize the dependence on the parameters $\phi, \theta$. Given training data $x_1, \ldots, x_m \in \mathbb{R}^n$, consider an arbitrary data point $x_i$, $i \in [m]$, and pass it through the encoder neural network $E_\phi$ to obtain $\mu_\phi(x_i), \sigma_\phi(x_i)$. Next, sample $s$ points $z_{i1}, \ldots, z_{is}$, where $s$ is the batch size, from the distribution $q(z \mid x = x_i) = \mathcal{N}(z;\, \mu_\phi(x_i), \sigma_\phi(x_i)^2 I_d)$ via the reparameterization trick, by sampling $\epsilon_1, \ldots, \epsilon_s \sim \mathcal{N}(0, I_d)$ from the standard Gaussian and using the transformation $z_{ij} = \mu_\phi(x_i) + \sigma_\phi(x_i) \cdot \epsilon_j$. The reason behind the reparameterization trick is that the gradient w.r.t. the parameter $\phi$ of an unbiased estimate of an expectation over a general distribution $q_\phi$ is not necessarily an unbiased estimate of the gradient of that expectation. This is the case, however, when the distribution $q_\phi$ can separate the parameter $\phi$ from the randomness.
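A minimal sketch of this sampling step in PyTorch, assuming the illustrative GaussianEncoder and GaussianDecoder classes sketched after (10.10); the dimensions and the number of samples s = 8 are placeholder choices, and only the reparameterized sampling itself mirrors the text above.

import torch

n, d, s = 784, 32, 8                          # data dim, latent dim, samples per point (placeholders)
enc, dec = GaussianEncoder(n, d), GaussianDecoder(n, d)

x_i = torch.rand(1, n)                        # one training point x_i (placeholder data)
mu_x, sigma_x = enc(x_i)                      # mu_phi(x_i), sigma_phi(x_i)

# Reparameterization trick: draw eps_j ~ N(0, I_d), then set z_ij = mu + sigma * eps_j.
# The randomness lives entirely in eps, so each z_ij is a deterministic, differentiable
# function of (mu_x, sigma_x), and gradients w.r.t. phi flow through the samples.
eps = torch.randn(s, d)
z = mu_x + sigma_x * eps                      # s samples from q(z | x = x_i)

mu_z, sigma_z = dec(z)                        # parameters of p(x | z_ij) for each sample

Because each $z_{ij}$ is a deterministic function of $(\phi, \epsilon_j)$, a Monte Carlo estimate of the ELBO built from these samples has a gradient w.r.t. $\phi$ that is itself an unbiased estimate of the gradient of the expectation, which is exactly the point made above.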
