

23.5: Distributions over probabilities

[Figure 23.8: three scatter-plot panels for u = (20, 10, 7), u = (0.2, 1, 2), and u = (0.2, 0.3, 0.15); lower-panel axes run from −8 to 8. Caption below.]

ln p/(1 − p). In the softmax basis, the ugly minus-ones in the exponents in the Dirichlet distribution (23.30) disappear, and the density is given by:

\[
P(\mathbf{a} \mid \alpha\mathbf{m}) \propto \frac{1}{Z(\alpha\mathbf{m})} \prod_{i=1}^{I} p_i^{\alpha m_i}\; \delta\!\Big(\sum_i a_i\Big). \tag{23.34}
\]
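For readers who want to experiment, here is a minimal numpy sketch (not from the book) of the softmax map of equation (23.33) and its inverse; the function names are illustrative only:

```python
import numpy as np

def softmax(a):
    """Map a real vector a to a probability vector p (equation 23.33)."""
    e = np.exp(a - a.max())          # subtract max for numerical stability
    return e / e.sum()

def to_softmax_basis(p):
    """Inverse map: represent p by a vector a with sum(a) = 0 (log, then centre)."""
    a = np.log(p)
    return a - a.mean()

rng = np.random.default_rng(0)
u = np.array([0.2, 1.0, 2.0])        # one of the parameter vectors in figure 23.8
p = rng.dirichlet(u)                 # one draw p ~ Dirichlet(u)
a = to_softmax_basis(p)
assert np.allclose(softmax(a), p)    # the round trip recovers p
```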

The role of the parameter α can be characterized in two ways. First, α measures the sharpness of the distribution (figure 23.8); it measures how different we expect typical samples p from the distribution to be from the mean m, just as the precision τ = 1/σ² of a Gaussian measures how far samples stray from its mean. A large value of α produces a distribution over p that is sharply peaked around m. The effect of α in higher-dimensional situations can be visualized by drawing a typical sample from the distribution Dirichlet^(I)(p | αm), with m set to the uniform vector m_i = 1/I, and making a Zipf plot, that is, a ranked plot of the values of the components p_i. It is traditional to plot both p_i (vertical axis) and the rank (horizontal axis) on logarithmic scales, so that power-law relationships appear as straight lines. Figure 23.9 shows these plots for a single sample from ensembles with I = 100 and I = 1000 and with α from 0.1 to 1000. For large α, the plot is shallow, with many components having similar values. For small α, typically one component p_i receives an overwhelming share of the probability; and of the small probability that remains to be shared among the other components, another component p_i′ receives a similarly large share. In the limit as α goes to zero, the plot tends to an increasingly steep power law.
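Plots like those of figure 23.9 are easy to reproduce; the following is a sketch assuming numpy and matplotlib (the book does not specify an implementation):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
I = 100                                   # dimension of p (100 or 1000 in figure 23.9)
m = np.full(I, 1.0 / I)                   # uniform mean vector, m_i = 1/I

for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    p = rng.dirichlet(alpha * m)          # one sample p ~ Dirichlet(alpha * m)
    ranked = np.sort(p)[::-1]             # components ranked by magnitude
    plt.loglog(np.arange(1, I + 1), ranked, label=f"alpha = {alpha}")

plt.xlabel("rank")
plt.ylabel("p_i")
plt.legend()
plt.show()
```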

Second, we can characterize the role of α in terms of the predictive distribution that results when we observe samples from p and obtain counts F = (F1, F2, . . . , FI) of the possible outcomes. The value of α defines the number of samples from p that are required in order that the data dominate over the prior in predictions.
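To make this concrete, the standard predictive rule for a Dirichlet prior (a known result, stated here for reference rather than quoted from this page) is
\[
P(i \mid F, \alpha\mathbf{m}) = \frac{F_i + \alpha m_i}{N + \alpha}, \qquad N = \sum_j F_j,
\]
so for N ≪ α the prediction stays close to the prior mean m_i, while for N ≫ α it approaches the empirical frequency F_i/N: α acts as an effective number of prior observations.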

Exercise 23.3.[3] The Dirichlet distribution satisfies a nice additivity property. Imagine that a biased six-sided die has two red faces and four blue faces. The die is rolled N times and two Bayesians examine the outcomes in order to infer the bias of the die and make predictions. One Bayesian has access to the red/blue colour outcomes only, and he infers a two-component probability vector (pR, pB). The other Bayesian has access to each full outcome: he can see which of the six faces came up, and he infers a six-component probability vector (p1, p2, p3, p4, p5, p6), where

Figure 23.8. Three Dirichlet distributions over a three-dimensional probability vector (p1, p2, p3). The upper figures show 1000 random draws from each distribution, showing the values of p1 and p2 on the two axes; p3 = 1 − (p1 + p2). The triangle in the first figure is the simplex of legal probability distributions. The lower figures show the same points in the ‘softmax’ basis (equation (23.33)). The two axes show a1 and a2; a3 = −a1 − a2.

[Figure 23.9: two log–log panels, I = 100 and I = 1000, showing the ranked p_i versus rank for α = 0.1, 1, 10, 100, 1000. Caption below.]

Figure 23.9. Zipf plots for random samples from Dirichlet distributions with various values of α = 0.1 . . . 1000. For each value of I = 100 or 1000 and each α, one sample p from the Dirichlet distribution was generated. The Zipf plot shows the probabilities p_i, ranked by magnitude, versus their rank.
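The additivity property invoked in exercise 23.3 can be checked numerically; the following is a minimal Monte Carlo sketch assuming numpy, with a hypothetical parameter vector u6 for the six faces. It relies on the aggregation property of the Dirichlet: summing components over a partition gives a Dirichlet (here, a beta) distribution with the correspondingly summed parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
u6 = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])   # hypothetical parameters, faces 1..6
samples = rng.dirichlet(u6, size=100_000)        # rows are p = (p1, ..., p6) ~ Dirichlet(u6)

# Group the faces into red (first two) and blue (last four).
pR = samples[:, :2].sum(axis=1)

# Aggregation property: pR should be Beta(u1 + u2, u3 + u4 + u5 + u6).
direct = rng.beta(u6[:2].sum(), u6[2:].sum(), size=100_000)
print(pR.mean(), direct.mean())   # both close to 2/6
print(pR.var(),  direct.var())    # variances agree to Monte Carlo error
```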
