28.3 Minimum description length (MDL)

MDL does have its uses as a pedagogical tool. The description length concept is useful for motivating prior probability distributions. Also, different ways of breaking down the task of communicating data using a model can give helpful insights into the modelling process, as will now be illustrated.

On-line learning and cross-validation.

In cases where the data consist of a sequence of points D = t^(1), t^(2), ..., t^(N), the log evidence can be decomposed as a sum of 'on-line' predictive performances:

    log P(D | H) = log P(t^(1) | H) + log P(t^(2) | t^(1), H)
                   + log P(t^(3) | t^(1), t^(2), H) + ... + log P(t^(N) | t^(1), ..., t^(N-1), H).    (28.18)

This decomposition can be used to explain the difference between the evidence and 'leave-one-out cross-validation' as measures of predictive ability. Cross-validation examines the average value of just the last term, log P(t^(N) | t^(1), ..., t^(N-1), H), under random re-orderings of the data. The evidence, on the other hand, sums up how well the model predicted all the data, starting from scratch.

The 'bits-back' encoding method.

Another MDL thought experiment (Hinton and van Camp, 1993) involves incorporating random bits into our message. The data are communicated using a parameter block and a data block. The parameter vector sent is a random sample from the posterior, P(w | D, H) = P(D | w, H) P(w | H) / P(D | H). This sample w is sent to an arbitrarily small granularity δw using a message of length L(w | H) = −log[P(w | H) δw]. The data are encoded relative to w with a message of length L(D | w, H) = −log[P(D | w, H) δD]. Once the data message has been received, the random bits used to generate the sample w from the posterior can be deduced by the receiver. The number of bits so recovered is −log[P(w | D, H) δw]. These recovered bits need not count towards the message length, since we might use some other optimally-encoded message as a random bit string, thereby communicating that message at the same time. The net description cost is therefore

    L(w | H) + L(D | w, H) − 'bits back' = −log[ P(w | H) P(D | w, H) δD / P(w | D, H) ]
                                         = −log P(D | H) − log δD.    (28.19)

Thus this thought experiment has yielded the optimal description length. Bits-back encoding has been turned into a practical compression method for data modelled with latent variable models by Frey (1998).

Further reading

Bayesian methods are introduced and contrasted with sampling-theory statistics in (Jaynes, 1983; Gull, 1988; Loredo, 1990). The Bayesian Occam's razor is demonstrated on model problems in (Gull, 1988; MacKay, 1992a). Useful textbooks are (Box and Tiao, 1973; Berger, 1985).

One debate worth understanding is the question of whether it is permissible to use improper priors in Bayesian inference (Dawid et al., 1996). If we want to do model comparison (as discussed in this chapter), it is essential to use proper priors; otherwise the evidences and the Occam factors are meaningless.
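
As a small illustration of the decomposition (28.18), the following sketch (not from the book) assumes a Bernoulli model with a uniform Beta prior, for which the evidence is available in closed form, and checks numerically that the sum of 'on-line' log predictive probabilities equals the log evidence; it then computes the leave-one-out term for comparison. The function names and the choice of 20 data points are illustrative.

import math
import random

def log_evidence_online(data):
    # Sum of log predictive probabilities, term by term as in (28.18).
    # With a uniform Beta prior, P(t^(n) = 1 | k ones among the first n-1 points)
    # = (k + 1) / (n + 1), Laplace's rule of succession.
    log_ev, ones, seen = 0.0, 0, 0
    for t in data:
        p_one = (ones + 1) / (seen + 2)
        log_ev += math.log(p_one if t == 1 else 1.0 - p_one)
        ones += t
        seen += 1
    return log_ev

def log_evidence_closed_form(data):
    # log P(D | H) = log Beta(k+1, N-k+1) for a uniform prior on the bias.
    N, k = len(data), sum(data)
    return math.lgamma(k + 1) + math.lgamma(N - k + 1) - math.lgamma(N + 2)

data = [random.randint(0, 1) for _ in range(20)]
print(log_evidence_online(data))        # the two numbers agree
print(log_evidence_closed_form(data))   # up to rounding error

# Leave-one-out cross-validation instead averages only the final term of
# (28.18), log P(t^(N) | t^(1) ... t^(N-1), H), over re-orderings of the data.
loo_terms = []
for _ in range(2000):
    random.shuffle(data)
    ones, seen = sum(data[:-1]), len(data) - 1
    p_one = (ones + 1) / (seen + 2)
    loo_terms.append(math.log(p_one if data[-1] == 1 else 1.0 - p_one))
print(sum(loo_terms) / len(loo_terms))  # average leave-one-out log predictive probability

Because each predictive probability is conditioned on all previously seen points, the sum reproduces log P(D | H) exactly by the chain rule, whereas the cross-validation score measures only how well a nearly fully trained model predicts a single held-out point.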
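The accounting in (28.19) can be checked numerically in the same toy model. The sketch below (again not from the book) discretizes the Bernoulli bias w to a grid of spacing δw, draws w from the discretized posterior, and verifies that L(w | H) + L(D | w, H) minus the recovered bits equals −log P(D | H) − log δD, whatever sample w happens to be drawn. The granularities delta_w and delta_D are illustrative choices, and all lengths are in nats.

import math
import random

delta_w, delta_D = 1e-3, 1.0    # granularities for the parameter and the data
grid = [(i + 0.5) * delta_w for i in range(int(1 / delta_w))]

data = [random.randint(0, 1) for _ in range(20)]
N, k = len(data), sum(data)

def likelihood(w):              # P(D | w, H) for a Bernoulli model with bias w
    return w ** k * (1.0 - w) ** (N - k)

prior = 1.0                     # uniform P(w | H) on [0, 1]
evidence = sum(likelihood(w) * prior * delta_w for w in grid)   # approximates P(D | H)

# Send a random sample w drawn from the (discretized) posterior P(w | D, H).
posterior = [likelihood(w) * prior * delta_w / evidence for w in grid]
w = random.choices(grid, weights=posterior)[0]
posterior_density_w = likelihood(w) * prior / evidence

L_w = -math.log(prior * delta_w)                  # L(w | H)
L_D = -math.log(likelihood(w) * delta_D)          # L(D | w, H)
bits_back = -math.log(posterior_density_w * delta_w)

print(L_w + L_D - bits_back)                      # net description cost ...
print(-math.log(evidence) - math.log(delta_D))    # ... equals -log P(D | H) - log(delta_D)

The equality holds for every possible sample w, which is the point of the argument: once the recovered random bits are credited back, the apparent cost of sending a randomly chosen parameter vector disappears, leaving the optimal description length.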
