Information Theory, Inference, and Learning ... - MAELabs UCSD

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981

You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

296 21 — Exact Inference by Complete Enumeration

function for the decay length as a function of λ by evaluating the likelihood at a finely-spaced series of points.

A two-parameter model

Let’s look at the Gaussian distribution as an example of a model with a two-dimensional hypothesis space. The one-dimensional Gaussian distribution is parameterized by a mean µ and a standard deviation σ:

$$
P(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \equiv \mathrm{Normal}(x;\, \mu, \sigma^2). \tag{21.8}
$$
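Equation (21.8) is straightforward to evaluate numerically; a minimal sketch (the function name `normal_density` is my own, not the book's):

```python
import math

def normal_density(x, mu, sigma):
    # Normal(x; mu, sigma^2) as defined in equation (21.8):
    # (1 / (sqrt(2*pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
```

At $x = \mu$ this returns the peak value $1/(\sqrt{2\pi}\,\sigma)$, a quick sanity check on the normalization.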

Figure 21.2 shows an enumeration of one hundred hypotheses about the mean and standard deviation of a one-dimensional Gaussian distribution. These hypotheses are evenly spaced in a ten by ten square grid covering ten values of µ and ten values of σ. Each hypothesis is represented by a picture showing the probability density that it puts on x. We now examine the inference of µ and σ given data points $x_n$, $n = 1, \ldots, N$, assumed to be drawn independently from this density.

Imagine that we acquire data, for example the five points shown in figure 21.3. We can now evaluate the posterior probability of each of the one hundred subhypotheses by evaluating the likelihood of each, that is, the value of $P(\{x_n\}_{n=1}^{5} \mid \mu, \sigma)$. The likelihood values are shown diagrammatically in figure 21.4 using the line thickness to encode the value of the likelihood. Subhypotheses with likelihood smaller than $e^{-8}$ times the maximum likelihood have been deleted.
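The complete-enumeration procedure just described can be sketched directly: lay out the ten-by-ten grid, multiply the per-point densities of equation (21.8) to get each subhypothesis's likelihood, and discard those below $e^{-8}$ of the maximum. The five data values and the grid limits below are illustrative stand-ins, not the book's actual numbers:

```python
import math

def normal_density(x, mu, sigma):
    # Equation (21.8): the one-dimensional Gaussian density.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical stand-ins for the five datapoints of figure 21.3.
data = [0.3, 0.5, 0.7, 0.8, 1.9]

# Ten-by-ten grid of subhypotheses; the grid limits are guesses for illustration.
mus = [-0.5 + 0.25 * i for i in range(10)]      # ten values of mu
sigmas = [0.1 + 0.1 * i for i in range(10)]     # ten values of sigma

# Likelihood of each subhypothesis: product of densities over the i.i.d. data.
likelihood = {}
for mu in mus:
    for sigma in sigmas:
        L = 1.0
        for x in data:
            L *= normal_density(x, mu, sigma)
        likelihood[(mu, sigma)] = L

# Delete subhypotheses below e^{-8} times the maximum likelihood.
L_max = max(likelihood.values())
surviving = {h: L for h, L in likelihood.items() if L >= math.exp(-8) * L_max}
```

With a uniform prior over the grid, the surviving likelihoods, once normalized, are the posterior probabilities of the surviving subhypotheses.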

Using a finer grid, we can represent the same information by plotting the likelihood as a surface plot or contour plot as a function of µ and σ (figure 21.5).

A five-parameter mixture model

Eyeballing the data (figure 21.3), you might agree that it seems more plausible that they come not from a single Gaussian but from a mixture of two Gaussians, defined by two means, two standard deviations, and two mixing coefficients.

Figure 21.2. Enumeration of an entire (discretized) hypothesis space for one Gaussian with parameters µ (horizontal axis) and σ (vertical).

Figure 21.3. Five datapoints $\{x_n\}_{n=1}^{5}$. The horizontal coordinate is the value of the datum, $x_n$; the vertical coordinate has no meaning.
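The five-parameter mixture just introduced (two means, two standard deviations, and mixing coefficients that sum to one) has a likelihood that is a product, over the data, of weighted sums of two Gaussian densities. A minimal sketch, with parameter names of my own choosing:

```python
import math

def normal_density(x, mu, sigma):
    # Equation (21.8): the one-dimensional Gaussian density.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mixture_likelihood(data, mu1, sigma1, mu2, sigma2, pi1):
    # Likelihood of the data under a two-component Gaussian mixture.
    # The second mixing coefficient is determined by pi2 = 1 - pi1.
    L = 1.0
    for x in data:
        L *= pi1 * normal_density(x, mu1, sigma1) + (1 - pi1) * normal_density(x, mu2, sigma2)
    return L
```

Setting `pi1 = 1` recovers the single-Gaussian likelihood, which is a useful consistency check; exact inference would again enumerate a (now five-dimensional) grid of these parameters and evaluate this likelihood at each grid point.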
