Information Theory, Inference, and Learning ... - Inference Group

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

28 — Model Comparison and Occam's Razor

Figure 28.5. The Occam factor. This figure shows the quantities that determine the Occam factor for a hypothesis $H_i$ having a single parameter $w$. The prior distribution (solid line) for the parameter has width $\sigma_w$. The posterior distribution (dashed line) has a single peak at $w_{\mathrm{MP}}$ with characteristic width $\sigma_{w|D}$. The Occam factor is
$$\sigma_{w|D}\, P(w_{\mathrm{MP}} \mid H_i) = \frac{\sigma_{w|D}}{\sigma_w}.$$

[… no fundamental status in Bayesian inference – they both change under nonlinear reparameterizations. Maximization of a posterior probability is useful only if an approximation like equation (28.5) gives a good summary of the distribution.]

2. Model comparison. At the second level of inference, we wish to infer which model is most plausible given the data. The posterior probability of each model is:
$$P(H_i \mid D) \propto P(D \mid H_i)\, P(H_i). \tag{28.6}$$
Notice that the data-dependent term $P(D \mid H_i)$ is the evidence for $H_i$, which appeared as the normalizing constant in (28.4). The second term, $P(H_i)$, is the subjective prior over our hypothesis space, which expresses how plausible we thought the alternative models were before the data arrived. Assuming that we choose to assign equal priors $P(H_i)$ to the alternative models, models $H_i$ are ranked by evaluating the evidence. The normalizing constant $P(D) = \sum_i P(D \mid H_i)\, P(H_i)$ has been omitted from equation (28.6) because in the data-modelling process we may develop new models after the data have arrived, when an inadequacy of the first models is detected, for example.
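As a toy numerical sketch of equation (28.6) — the log-evidence values below are invented for illustration, not taken from the book — posterior model probabilities under equal priors can be computed as:

```python
import numpy as np

# A toy sketch of equation (28.6): ranking three candidate models by
# their evidence under equal priors. The log-evidence values are
# invented for illustration; in practice each P(D | H_i) would come
# from integrating out that model's parameters.
log_evidence = np.array([-112.3, -108.9, -110.5])   # log P(D | H_i)
log_prior = np.log(np.full(3, 1.0 / 3.0))           # equal priors P(H_i)

# P(H_i | D) is proportional to P(D | H_i) P(H_i);
# normalize in log space for numerical stability.
log_post = log_evidence + log_prior
log_post -= log_post.max()
posterior = np.exp(log_post)
posterior /= posterior.sum()

print(posterior)             # the middle model carries most of the mass
print(np.argmax(posterior))  # index of the most plausible model
```

With equal priors the prior term cancels, so the ranking is determined by the evidence alone, as the text notes.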
Inference is open ended: we continually seek more probable models to account for the data we gather.

To repeat the key idea: to rank alternative models $H_i$, a Bayesian evaluates the evidence $P(D \mid H_i)$. This concept is very general: the evidence can be evaluated for parametric and 'non-parametric' models alike; whatever our data-modelling task — a regression problem, a classification problem, or a density estimation problem — the evidence is a transportable quantity for comparing alternative models. In all these cases the evidence naturally embodies Occam's razor.

Evaluating the evidence

Let us now study the evidence more closely to gain insight into how the Bayesian Occam's razor works. The evidence is the normalizing constant for equation (28.4):
$$P(D \mid H_i) = \int P(D \mid w, H_i)\, P(w \mid H_i)\, \mathrm{d}w. \tag{28.7}$$
For many problems the posterior $P(w \mid D, H_i) \propto P(D \mid w, H_i)\, P(w \mid H_i)$ has a strong peak at the most probable parameters $w_{\mathrm{MP}}$ (figure 28.5). Then, taking for simplicity the one-dimensional case, the evidence can be approximated, using Laplace's method, by the height of the peak of the integrand $P(D \mid w, H_i)\, P(w \mid H_i)$ times its width, $\sigma_{w|D}$:
$$P(D \mid H_i) \simeq \underbrace{P(D \mid w_{\mathrm{MP}}, H_i)}_{\text{best fit likelihood}} \times \underbrace{P(w_{\mathrm{MP}} \mid H_i)\, \sigma_{w|D}}_{\text{Occam factor}}. \tag{28.8}$$
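A minimal one-dimensional check of this peak-height-times-width approximation, under made-up Gaussian assumptions (the prior width, noise level, and datum below are illustrative, not from the book): prior $w \sim N(0, \sigma_w^2)$ and likelihood $d \mid w \sim N(w, \sigma^2)$ for a single datum $d$.

```python
import numpy as np

# Made-up Gaussian assumptions for illustration only:
# prior w ~ N(0, sigma_w^2), likelihood d | w ~ N(w, sigma^2).
sigma_w, sigma, d = 10.0, 1.0, 3.0

def prior(w):
    return np.exp(-0.5 * (w / sigma_w) ** 2) / (np.sqrt(2 * np.pi) * sigma_w)

def likelihood(w):
    return np.exp(-0.5 * ((d - w) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Exact evidence: marginally, d ~ N(0, sigma^2 + sigma_w^2).
s2 = sigma ** 2 + sigma_w ** 2
exact = np.exp(-0.5 * d ** 2 / s2) / np.sqrt(2 * np.pi * s2)

# Laplace approximation: height of the integrand's peak times its width.
w_mp = d * sigma_w ** 2 / (sigma ** 2 + sigma_w ** 2)      # posterior mode
post_var = 1.0 / (1.0 / sigma ** 2 + 1.0 / sigma_w ** 2)   # posterior variance
width = np.sqrt(2 * np.pi * post_var)                      # sigma_{w|D}
occam = prior(w_mp) * width                                # the Occam factor
laplace = likelihood(w_mp) * occam

print(exact, laplace)   # agree here, since everything is Gaussian
print(occam)            # < 1: the broad prior is penalized
```

Because prior, likelihood, and hence posterior are all Gaussian in this sketch, the Laplace estimate coincides with the exact evidence; for non-Gaussian posteriors it is only an approximation. The Occam factor printed last is less than one, showing how a model with a broad prior over $w$ pays a penalty relative to its best-fit likelihood.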
