24.1 Inferring the mean and variance of a Gaussian distribution

[Figure 24.1. The likelihood function for the parameters of a Gaussian distribution, repeated from figure 21.5. (a1, a2) Surface plot and contour plot of the log likelihood as a function of µ and σ. The data set of N = 5 points had mean x̄ = 1.0 and S = Σ(x − x̄)² = 1.0. Notice that the maximum is skew in σ. The two estimators of standard deviation have values σ_N = 0.45 and σ_{N−1} = 0.50. (c) The posterior probability of σ for various fixed values of µ (shown as a density over ln σ). (d) The posterior probability of σ, P(σ | D), assuming a flat prior on µ, obtained by projecting the probability mass in (a) onto the σ axis. The maximum of P(σ | D) is at σ_{N−1}. By contrast, the maximum of P(σ | D, µ = x̄) is at σ_N. (Both probabilities are shown as densities over ln σ.)]

In sampling theory, the estimators above can be motivated as follows. x̄ is an unbiased estimator of µ which, out of all the possible unbiased estimators of µ, has smallest variance (where this variance is computed by averaging over an ensemble of imaginary experiments in which the data samples are assumed to come from an unknown Gaussian distribution). The estimator (x̄, σ_N) is the maximum likelihood estimator for (µ, σ). The estimator σ_N is biased, however: the expectation of σ_N, given σ, averaging over many imagined experiments, is not σ.

Exercise 24.1. [2, p.323] Give an intuitive explanation why the estimator σ_N is biased.

This bias motivates the invention, in sampling theory, of σ_{N−1}, which can be shown to be an unbiased estimator. Or to be precise, it is σ²_{N−1} that is an unbiased estimator of σ².

We now look at some Bayesian inferences for this problem, assuming noninformative priors for µ and σ. The emphasis is thus not on the priors, but rather on (a) the likelihood function, and (b) the concept of marginalization. The joint posterior probability of µ and σ is proportional to the likelihood function illustrated by a contour plot in figure 24.1a. The log likelihood is:

    \ln P(\{x_n\}_{n=1}^N \mid \mu, \sigma) = -N \ln(\sqrt{2\pi}\,\sigma) - \sum_n (x_n - \mu)^2 / (2\sigma^2)        (24.5)
                                            = -N \ln(\sqrt{2\pi}\,\sigma) - [N(\mu - \bar{x})^2 + S] / (2\sigma^2),    (24.6)

where S ≡ Σ_n (x_n − x̄)². Given the Gaussian model, the likelihood can be expressed in terms of the two functions of the data x̄ and S, so these two quantities are known as 'sufficient statistics'. The posterior probability of µ and σ is, using the improper priors:

    P(\mu, \sigma \mid \{x_n\}_{n=1}^N) = \frac{P(\{x_n\}_{n=1}^N \mid \mu, \sigma)\, P(\mu, \sigma)}{P(\{x_n\}_{n=1}^N)}        (24.7)
                                        = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left( -\frac{N(\mu - \bar{x})^2 + S}{2\sigma^2} \right) \frac{1}{\sigma_\mu} \frac{1}{\sigma} \frac{1}{P(\{x_n\}_{n=1}^N)}.    (24.8)
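As a check on the statements above, here is a minimal numerical sketch (not from the book; it assumes NumPy and uses the data summary N = 5, x̄ = 1.0, S = 1.0 quoted in the caption of figure 24.1). Part (1) confirms by simulation that σ_N is biased while σ²_{N−1} is unbiased for σ²; part (2) evaluates the posterior (24.8) on a grid, assuming a flat prior on µ and a 1/σ prior on σ, and confirms that P(σ | D, µ = x̄) peaks at σ_N whereas the marginal P(σ | D) peaks at σ_{N−1}, both viewed as densities over ln σ.

    import numpy as np

    # Illustrative sketch only; the grid ranges and random seed are arbitrary choices.
    N, xbar, S = 5, 1.0, 1.0
    sigma_N   = np.sqrt(S / N)        # maximum-likelihood estimator, approx 0.45
    sigma_Nm1 = np.sqrt(S / (N - 1))  # 'unbiased' estimator, 0.50 for these data

    # (1) Bias of sigma_N, checked by simulation: draw many data sets of size N
    # from a Gaussian with sigma = 1 and average the two estimators.
    rng  = np.random.default_rng(0)
    x    = rng.normal(0.0, 1.0, size=(200000, N))
    Ssim = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    print(np.sqrt(Ssim / N).mean())   # E[sigma_N] is about 0.84, not 1: biased
    print((Ssim / (N - 1)).mean())    # E[sigma_{N-1}^2] is about 1: unbiased for sigma^2

    # (2) The two posteriors for sigma, using a flat prior on mu and a 1/sigma
    # prior on sigma as in equation (24.8), evaluated on a (mu, sigma) grid.
    mu    = np.linspace(-1.0, 3.0, 2001)
    sigma = np.exp(np.linspace(np.log(0.1), np.log(3.0), 2001))
    M, Sg = np.meshgrid(mu, sigma, indexing='ij')
    log_post = -N * np.log(Sg) - (N * (M - xbar) ** 2 + S) / (2 * Sg ** 2) - np.log(Sg)
    post = np.exp(log_post - log_post.max())

    # Multiplying a density over sigma by sigma converts it to a density over ln(sigma).
    cond = post[np.argmin(np.abs(mu - xbar)), :] * sigma  # P(sigma | D, mu = xbar)
    marg = post.sum(axis=0) * sigma                       # P(sigma | D), mu marginalized

    print(sigma[np.argmax(cond)], sigma_N)     # conditional peaks at sigma_N
    print(sigma[np.argmax(marg)], sigma_Nm1)   # marginal peaks at sigma_{N-1}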
