…density of the inverse variance (the precision parameter) β = 1/σ²:

\[
P(\beta) = \Gamma(\beta;\, b_\beta, c_\beta) = \frac{1}{\Gamma(c_\beta)}\,\frac{\beta^{\,c_\beta - 1}}{b_\beta^{\,c_\beta}}\,\exp\!\left(-\frac{\beta}{b_\beta}\right), \qquad 0 \le \beta < \infty. \tag{24.2}
\]

This is a simple peaked distribution with mean b_β c_β and variance b_β² c_β. In the limit b_β c_β = 1, c_β → 0, we obtain the noninformative prior for a scale parameter, the 1/σ prior. This is 'noninformative' because it is invariant under the reparameterization σ′ = cσ. The 1/σ prior is less strange-looking if we examine the resulting density over ln σ, or ln β, which is flat. This is the prior that expresses ignorance about σ by saying 'well, it could be 10, or it could be 1, or it could be 0.1, . . . ' Scale variables such as σ are usually best represented in terms of their logarithm. Again, this noninformative 1/σ prior is improper.

Reminder: when we change variables from σ to l(σ), a one-to-one function of σ, the probability density transforms from P_σ(σ) to
\[
P_l(l) = P_\sigma(\sigma)\,\left|\frac{\partial \sigma}{\partial l}\right|.
\]
Here, the Jacobian is
\[
\left|\frac{\partial \sigma}{\partial \ln \sigma}\right| = \sigma.
\]

In the following examples, I will use the improper noninformative priors for µ and σ. Using improper priors is viewed as distasteful in some circles, so let me excuse myself by saying it's for the sake of readability; if I included proper priors, the calculations could still be done but the key points would be obscured by the flood of extra parameters.
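A brief numerical sketch of these two facts (not from the book; it assumes SciPy's gamma parameterization with shape c_β and scale b_β, which matches equation (24.2), and an arbitrary choice of b_β and c_β): the prior's mean and variance come out as b_β c_β and b_β² c_β, and applying the Jacobian |∂σ/∂ ln σ| = σ to a 1/σ density gives a flat density over ln σ.

```python
import numpy as np
from scipy import stats

# Gamma prior of equation (24.2): shape c_beta, scale b_beta
# (SciPy's parameterization matches MacKay's Gamma(beta; b, c)).
b_beta, c_beta = 2.0, 3.5                 # arbitrary illustrative values
prior = stats.gamma(a=c_beta, scale=b_beta)
print(prior.mean(), b_beta * c_beta)      # mean     = b_beta * c_beta
print(prior.var(), b_beta**2 * c_beta)    # variance = b_beta^2 * c_beta

# Change of variables: P_l(l) = P_sigma(sigma) * |d sigma / d l| with l = ln sigma.
# An (unnormalized) 1/sigma density therefore maps to a constant density over l.
sigma = np.logspace(-2, 2, 5)             # a few test points
p_sigma = 1.0 / sigma                     # unnormalized 1/sigma prior
p_l = p_sigma * sigma                     # Jacobian |d sigma / d ln sigma| = sigma
print(p_l)                                # -> [1. 1. 1. 1. 1.], flat in ln sigma
```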
Maximum likelihood and marginalization: σ_N and σ_{N−1}

The task of inferring the mean and standard deviation of a Gaussian distribution from N samples is a familiar one, though maybe not everyone understands the difference between the σ_N and σ_{N−1} buttons on their calculator. Let us recap the formulae, then derive them.

Given data D = {x_n}_{n=1}^N, an 'estimator' of µ is
\[
\bar{x} \equiv \sum_{n=1}^{N} x_n / N, \tag{24.3}
\]
and two estimators of σ are:
\[
\sigma_N \equiv \sqrt{\frac{\sum_{n=1}^{N} (x_n - \bar{x})^2}{N}}
\qquad \text{and} \qquad
\sigma_{N-1} \equiv \sqrt{\frac{\sum_{n=1}^{N} (x_n - \bar{x})^2}{N-1}}. \tag{24.4}
\]

There are two principal paradigms for statistics: sampling theory and Bayesian inference. In sampling theory (also known as 'frequentist' or orthodox statistics), one invents estimators of quantities of interest and then chooses between those estimators using some criterion measuring their sampling properties; there is no clear principle for deciding which criterion to use to measure the performance of an estimator; nor, for most criteria, is there any systematic procedure for the construction of optimal estimators. In Bayesian inference, in contrast, once we have made explicit all our assumptions about the model and the data, our inferences are mechanical. Whatever question we wish to pose, the rules of probability theory give a unique answer which consistently takes into account all the given information. Human-designed estimators and confidence intervals have no role in Bayesian inference; human input only enters into the important tasks of designing the hypothesis space (that is, the specification of the model and all its probability distributions), and figuring out how to do the computations that implement inference in that space. The answers to our questions are probability distributions over the quantities of interest. We often find that the estimators of sampling theory emerge automatically as modes or means of these posterior distributions when we choose a simple hypothesis space and turn the handle of Bayesian inference.
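As a numerical preview of that last point (my own sketch, not the book's derivation; it uses synthetic data and assumes a flat improper prior on µ, leaving σ's prior aside): σ_N of equation (24.4) is where the likelihood peaks when maximized jointly over (µ, σ), while σ_{N−1} is where the likelihood of σ peaks after µ has been marginalized out, so a sampling-theory estimator reappears as the mode of a Bayesian calculation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=10)   # N = 10 synthetic samples
N, xbar = len(x), x.mean()
S = np.sum((x - xbar) ** 2)

sigma_N   = np.sqrt(S / N)        # equivalently np.std(x, ddof=0), the sigma_N button
sigma_Nm1 = np.sqrt(S / (N - 1))  # equivalently np.std(x, ddof=1), the sigma_{N-1} button

# Negative log-likelihood with mu set to its maximizing value xbar:
def neg_log_lik_joint(s):
    return N * np.log(s) + S / (2 * s**2)

# Negative log-likelihood of sigma after marginalizing mu under a flat prior:
# integrating exp(-N (mu - xbar)^2 / (2 sigma^2)) over mu contributes a factor
# sigma * sqrt(2 pi / N), which removes one power of 1/sigma.
def neg_log_lik_marginal(s):
    return (N - 1) * np.log(s) + S / (2 * s**2)

opt_joint = minimize_scalar(neg_log_lik_joint, bounds=(1e-3, 100), method="bounded")
opt_marg  = minimize_scalar(neg_log_lik_marginal, bounds=(1e-3, 100), method="bounded")

print(sigma_N,   opt_joint.x)     # joint maximum-likelihood sigma     -> sigma_N
print(sigma_Nm1, opt_marg.x)      # mode after marginalizing over mu   -> sigma_{N-1}
```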
