24 — Exact Marginalization

This function describes the answer to the question, 'given the data, and the noninformative priors, what might µ and σ be?' It may be of interest to find the parameter values that maximize the posterior probability, though it should be emphasized that posterior probability maxima have no fundamental status in Bayesian inference, since their location depends on the choice of basis. Here we choose the basis (µ, ln σ), in which our prior is flat, so that the posterior probability maximum coincides with the maximum of the likelihood. As we saw in exercise 22.4 (p.302), the maximum likelihood solution for µ and ln σ is

    \{\mu, \sigma\}_{\rm ML} = \Bigl\{ \bar{x},\; \sigma_N = \sqrt{S/N} \Bigr\}.

There is more to the posterior distribution than just its mode. As can be seen in figure 24.1a, the likelihood has a skew peak. As we increase σ, the width of the conditional distribution of µ increases (figure 24.1b). And if we fix µ to a sequence of values moving away from the sample mean x̄, we obtain a sequence of conditional distributions over σ whose maxima move to increasing values of σ (figure 24.1c).

The posterior probability of µ given σ is

    P(\mu \mid \{x_n\}_{n=1}^N, \sigma) = \frac{P(\{x_n\}_{n=1}^N \mid \mu, \sigma)\, P(\mu)}{P(\{x_n\}_{n=1}^N \mid \sigma)}                (24.9)
                                        \propto \exp\bigl( -N(\mu - \bar{x})^2 / (2\sigma^2) \bigr)                                          (24.10)
                                        = \mathrm{Normal}(\mu;\, \bar{x},\, \sigma^2/N).                                                     (24.11)

We note the familiar σ/√N scaling of the error bars on µ.

Let us now ask the question 'given the data, and the noninformative priors, what might σ be?' This question differs from the first one we asked in that we are now not interested in µ. This parameter must therefore be marginalized over. The posterior probability of σ is:

    P(\sigma \mid \{x_n\}_{n=1}^N) = \frac{P(\{x_n\}_{n=1}^N \mid \sigma)\, P(\sigma)}{P(\{x_n\}_{n=1}^N)}.                                  (24.12)

The data-dependent term P({x_n}_{n=1}^N | σ) appeared earlier as the normalizing constant in equation (24.9); one name for this quantity is the 'evidence', or marginal likelihood, for σ. We obtain the evidence for σ by integrating out µ; a noninformative prior P(µ) = constant is assumed; we call this constant 1/σ_µ, so that we can think of the prior as a top-hat prior of width σ_µ. The Gaussian integral, P({x_n}_{n=1}^N | σ) = ∫ P({x_n}_{n=1}^N | µ, σ) P(µ) dµ, yields:

    \ln P(\{x_n\}_{n=1}^N \mid \sigma) = -N \ln\bigl(\sqrt{2\pi}\,\sigma\bigr) - \frac{S}{2\sigma^2} + \ln \frac{\sqrt{2\pi}\,\sigma/\sqrt{N}}{\sigma_\mu}.     (24.13)

The first two terms are the best-fit log likelihood (i.e., the log likelihood with µ = x̄). The last term is the log of the Occam factor which penalizes smaller values of σ. (We will discuss Occam factors more in Chapter 28.) When we differentiate the log evidence with respect to ln σ, to find the most probable σ, the additional volume factor (σ/√N) shifts the maximum from σ_N to

    \sigma_{N-1} = \sqrt{S/(N-1)}.                                                                                                           (24.14)

Intuitively, the denominator (N − 1) counts the number of noise measurements contained in the quantity S = ∑_n (x_n − x̄)². The sum contains N residuals squared, but there are only (N − 1) effective noise measurements because the determination of one parameter µ from the data causes one dimension of noise to be gobbled up in unavoidable overfitting. In the terminology of classical
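The shift from σ_N to σ_{N−1} in equation (24.14) can be checked numerically. The following is a minimal Python sketch, not from the book: it assumes NumPy and an arbitrary synthetic data set, evaluates the log evidence of equation (24.13) on a grid of ln σ (dropping the constant −ln σ_µ, which does not move the maximum), and confirms that the maximizing σ agrees with σ_{N−1} rather than the maximum-likelihood value σ_N.

    import numpy as np

    # Synthetic data, assumed for illustration only; any real-valued sample works.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=10)
    N = len(x)
    xbar = x.mean()
    S = np.sum((x - xbar) ** 2)

    sigma_N = np.sqrt(S / N)              # maximum-likelihood value (with mu = xbar)
    sigma_Nminus1 = np.sqrt(S / (N - 1))  # predicted evidence maximum, equation (24.14)

    # Log evidence for sigma, equation (24.13), up to the constant -ln(sigma_mu).
    ln_sigma = np.linspace(np.log(sigma_N) - 2.0, np.log(sigma_N) + 2.0, 200001)
    sigma = np.exp(ln_sigma)
    log_evidence = (-N * np.log(np.sqrt(2 * np.pi) * sigma)
                    - S / (2 * sigma ** 2)
                    + np.log(np.sqrt(2 * np.pi) * sigma / np.sqrt(N)))

    sigma_star = sigma[np.argmax(log_evidence)]
    print(f"sigma_N            = {sigma_N:.6f}")
    print(f"sigma_(N-1)        = {sigma_Nminus1:.6f}")
    print(f"argmax of evidence = {sigma_star:.6f}")  # matches sigma_(N-1) to grid resolution

The grid is taken over ln σ to mirror the text, which differentiates the log evidence with respect to ln σ, the basis in which the prior on σ is flat.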
