
Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

About Chapter 29

The last couple of chapters have assumed that a Gaussian approximation to the probability distribution we are interested in is adequate. What if it is not? We have already seen an example – clustering – where the likelihood function is multimodal, and has nasty unboundedly-high spikes in certain locations in the parameter space; so maximizing the posterior probability and fitting a Gaussian is not always going to work. This difficulty with Laplace's method is one motivation for being interested in Monte Carlo methods. In fact, Monte Carlo methods provide a general-purpose set of tools with applications in Bayesian data modelling and many other fields.

This chapter describes a sequence of methods: importance sampling, rejection sampling, the Metropolis method, Gibbs sampling and slice sampling. For each method, we discuss whether the method is expected to be useful for high-dimensional problems such as arise in inference with graphical models. [A graphical model is a probabilistic model in which dependencies and independencies of variables are represented by edges in a graph whose nodes are the variables.] Along the way, the terminology of Markov chain Monte Carlo methods is presented. The subsequent chapter discusses advanced methods for reducing random walk behaviour.

For details of Monte Carlo methods, theorems and proofs and a full list of references, the reader is directed to Neal (1993b), Gilks et al. (1996), and Tanner (1996).

In this chapter I will use the word 'sample' in the following sense: a sample from a distribution P(x) is a single realization x whose probability distribution is P(x). This contrasts with the alternative usage in statistics, where 'sample' refers to a collection of realizations {x}.
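The distinction can be made concrete with a short NumPy sketch (not from the book; the distribution and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# This book's usage: a 'sample' is ONE realization x whose
# probability distribution is P(x) -- here a standard Gaussian.
x = rng.normal(loc=0.0, scale=1.0)

# The statisticians' usage: a 'sample' is a COLLECTION of
# realizations {x_1, ..., x_N}.
xs = rng.normal(loc=0.0, scale=1.0, size=1000)
```

In the book's terminology, `xs` is a set of 1000 samples, not a single sample of size 1000.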

When we discuss transition probability matrices, I will use a right-multiplication convention: I like my matrices to act to the right, preferring

u = Mv    (29.1)

to

u^T = v^T M^T.    (29.2)

A transition probability matrix T_ij or T_{i|j} specifies the probability, given the current state is j, of making the transition from j to i. The columns of T are probability vectors. If we write down a transition probability density, we use the same convention for the order of its arguments: T(x'; x) is a transition probability density from x to x'. This unfortunately means that you have to get used to reading from right to left – the sequence xyz has probability T(z; y)T(y; x)π(x).
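A minimal sketch of this convention in NumPy (the two-state chain and its transition probabilities are made up for illustration): columns of T sum to one, and a probability vector is updated by left-multiplying it with T.

```python
import numpy as np

# Hypothetical 2-state chain. T[i, j] is the probability of moving
# from state j to state i, so each COLUMN of T is a probability vector.
T = np.array([[0.9, 0.5],
              [0.1, 0.5]])
assert np.allclose(T.sum(axis=0), 1.0)  # columns sum to 1

# Right-multiplication convention (eq. 29.1): the matrix acts to the
# right on the current distribution, p' = T p.
p = np.array([1.0, 0.0])   # start in state 0 with certainty
p_next = T @ p             # one step of the chain

# An invariant distribution pi satisfies T pi = pi; iterating the
# update approaches it for this chain.
pi = p
for _ in range(1000):
    pi = T @ pi
```

With this T, `p_next` is simply the first column of T (the distribution over next states given that the current state is 0), and `pi` converges to (5/6, 1/6).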
