28 — Model Comparison and Occam's Razor

(Paul Dirac)); the second reason is the past empirical success of Occam's razor. However, there is a different justification for Occam's razor, namely:

    Coherent inference (as embodied by Bayesian probability) automatically embodies Occam's razor, quantitatively.

It is indeed more probable that there's one box behind the tree, and we can compute how much more probable one is than two.

Model comparison and Occam's razor

We evaluate the plausibility of two alternative theories H_1 and H_2 in the light of data D as follows: using Bayes' theorem, we relate the plausibility of model H_1 given the data, P(H_1 | D), to the predictions made by the model about the data, P(D | H_1), and the prior plausibility of H_1, P(H_1). This gives the following probability ratio between theory H_1 and theory H_2:

\[
\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(H_1)}{P(H_2)} \, \frac{P(D \mid H_1)}{P(D \mid H_2)} . \tag{28.1}
\]

The first ratio, P(H_1)/P(H_2), on the right-hand side measures how much our initial beliefs favoured H_1 over H_2. The second ratio expresses how well the observed data were predicted by H_1, compared to H_2.

How does this relate to Occam's razor, when H_1 is a simpler model than H_2? The first ratio, P(H_1)/P(H_2), gives us the opportunity, if we wish, to insert a prior bias in favour of H_1 on aesthetic grounds, or on the basis of experience. This would correspond to the aesthetic and empirical motivations for Occam's razor mentioned earlier. But such a prior bias is not necessary: the second ratio, the data-dependent factor, embodies Occam's razor automatically. Simple models tend to make precise predictions. Complex models, by their nature, are capable of making a greater variety of predictions (figure 28.3). So if H_2 is a more complex model, it must spread its predictive probability P(D | H_2) more thinly over the data space than H_1. Thus, in the case where the data are compatible with both theories, the simpler H_1 will turn out more probable than H_2, without our having to express any subjective dislike for complex models. Our subjective prior just needs to assign equal prior probabilities to the possibilities of simplicity and complexity. Probability theory then allows the observed data to express their opinion.

Figure 28.3. Why Bayesian inference embodies Occam's razor. This figure gives the basic intuition for why complex models can turn out to be less probable. The horizontal axis represents the space of possible data sets D. Bayes' theorem rewards models in proportion to how much they predicted the data that occurred. These predictions are quantified by a normalized probability distribution on D. This probability of the data given model H_i, P(D | H_i), is called the evidence for H_i. A simple model H_1 makes only a limited range of predictions, shown by P(D | H_1); a more powerful model H_2, that has, for example, more free parameters than H_1, is able to predict a greater variety of data sets. This means, however, that H_2 does not predict the data sets in region C_1 as strongly as H_1. Suppose that equal prior probabilities have been assigned to the two models. Then, if the data set falls in region C_1, the less powerful model H_1 will be the more probable model.
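To see equation (28.1) at work, here is a minimal numerical sketch (not from the book; the discrete data space, the region C_1 and all numbers are invented for illustration). A "simple" model concentrates its evidence on a narrow region C_1, while a "complex" model spreads the same total probability over the whole data space, so when the observed data fall inside C_1 the posterior ratio favours the simpler model even under equal priors.

    # A toy illustration (not from the book) of equation (28.1) and figure 28.3.
    import numpy as np

    data_space = np.arange(100)

    # Evidence P(D | H1): uniform over the narrow region C1 = {40, ..., 59}.
    p_D_given_H1 = np.where((data_space >= 40) & (data_space < 60), 1 / 20, 0.0)

    # Evidence P(D | H2): spread uniformly over the whole data space.
    p_D_given_H2 = np.full(100, 1 / 100)

    # Both evidences are normalized probability distributions on the data space.
    assert np.isclose(p_D_given_H1.sum(), 1.0) and np.isclose(p_D_given_H2.sum(), 1.0)

    # Equal prior probabilities assigned to the two models.
    prior_H1 = prior_H2 = 0.5

    # An observed data set that falls in region C1, compatible with both models.
    D = 47
    posterior_ratio = (prior_H1 * p_D_given_H1[D]) / (prior_H2 * p_D_given_H2[D])
    print(posterior_ratio)   # 5.0: the simpler model H1 is five times more probable

With equal priors the prior ratio is 1, so the answer is driven entirely by the evidence ratio, exactly the Occam factor described above.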
Let us turn to a simple example. Here is a sequence of numbers:

    −1, 3, 7, 11.

The task is to predict the next two numbers, and infer the underlying process that gave rise to this sequence. A popular answer to this question is the prediction '15, 19', with the explanation 'add 4 to the previous number'.

What about the alternative answer '−19.9, 1043.8', with the underlying rule being: 'get the next number from the previous number, x, by evaluating −x³/11 + 9x²/11 + 23/11'?
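Both candidate rules can be checked with a few lines of arithmetic. The short Python sketch below (mine, not the book's) regenerates the given terms under each rule and prints its two predicted continuations.

    # A quick check, not from the book: regenerate the sequence -1, 3, 7, 11 under
    # each candidate rule and print its two predicted continuations.

    def rule_add4(x):
        """'Add 4 to the previous number.'"""
        return x + 4

    def rule_cubic(x):
        """'Get the next number by evaluating -x^3/11 + 9x^2/11 + 23/11.'"""
        return -x**3 / 11 + 9 * x**2 / 11 + 23 / 11

    for rule in (rule_add4, rule_cubic):
        seq = [-1]
        for _ in range(5):   # the three remaining given terms plus two predictions
            seq.append(rule(seq[-1]))
        print(rule.__name__, [round(v, 1) for v in seq])

    # rule_add4  [-1, 3, 7, 11, 15, 19]
    # rule_cubic [-1, 3.0, 7.0, 11.0, -19.9, 1043.8]

Both rules reproduce the observed terms −1, 3, 7, 11 exactly; they differ only in what they predict next.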
