What do you notice about your solutions? Does each answer depend on the detailed contents of each urn?

The details of the other possible outcomes and their probabilities are irrelevant. All that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses. We need only to know the likelihood, i.e., how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the likelihood principle.

The likelihood principle: given a generative model for data d given parameters θ, P(d | θ), and having observed a particular outcome d_1, all inferences and predictions should depend only on the function P(d_1 | θ).

In spite of the simplicity of this principle, many classical statistical methods violate it.

2.4 Definition of entropy and related functions

The Shannon information content of an outcome x is defined to be

    h(x) = \log_2 \frac{1}{P(x)}.                                  (2.34)

It is measured in bits. [The word 'bit' is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.]

In the next few chapters, we will establish that the Shannon information content h(a_i) is indeed a natural measure of the information content of the event x = a_i. At that point, we will shorten the name of this quantity to 'the information content'.

The fourth column in table 2.9 shows the Shannon information content of the 27 possible outcomes when a random character is picked from an English document. The outcome x = z has a Shannon information content of 10.4 bits, and x = e has an information content of 3.5 bits.

     i   a_i    p_i    h(p_i)
     1   a     .0575    4.1
     2   b     .0128    6.3
     3   c     .0263    5.2
     4   d     .0285    5.1
     5   e     .0913    3.5
     6   f     .0173    5.9
     7   g     .0133    6.2
     8   h     .0313    5.0
     9   i     .0599    4.1
    10   j     .0006   10.7
    11   k     .0084    6.9
    12   l     .0335    4.9
    13   m     .0235    5.4
    14   n     .0596    4.1
    15   o     .0689    3.9
    16   p     .0192    5.7
    17   q     .0008   10.3
    18   r     .0508    4.3
    19   s     .0567    4.1
    20   t     .0706    3.8
    21   u     .0334    4.9
    22   v     .0069    7.2
    23   w     .0119    6.4
    24   x     .0073    7.1
    25   y     .0164    5.9
    26   z     .0007   10.4
    27   -     .1928    2.4

         \sum_i p_i \log_2 \frac{1}{p_i} = 4.1

Table 2.9. Shannon information contents of the outcomes a–z.

The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:

    H(X) \equiv \sum_{x \in A_X} P(x) \log_2 \frac{1}{P(x)},        (2.35)

with the convention for P(x) = 0 that 0 × log 1/0 ≡ 0, since \lim_{\theta \to 0^+} \theta \log 1/\theta = 0.

Like the information content, entropy is measured in bits.

When it is convenient, we may also write H(X) as H(p), where p is the vector (p_1, p_2, ..., p_I). Another name for the entropy of X is the uncertainty of X.

Example 2.12. The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in table 2.9. We obtain this number by averaging log_2 1/p_i (shown in the fourth column) under the probability distribution p_i (shown in the third column).
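The computation in example 2.12 is easy to reproduce. The following sketch (not from the book; plain Python, with the letter probabilities copied from table 2.9 and '-' standing for the space character) evaluates h(x) = log_2 1/P(x) for individual outcomes and the entropy H(X) of the whole ensemble. Because the tabulated p_i are rounded to four decimal places, the printed values can differ from the table's in the last digit.

```python
import math

# Letter probabilities p_i from table 2.9; '-' denotes the space character.
p = {
    'a': 0.0575, 'b': 0.0128, 'c': 0.0263, 'd': 0.0285, 'e': 0.0913,
    'f': 0.0173, 'g': 0.0133, 'h': 0.0313, 'i': 0.0599, 'j': 0.0006,
    'k': 0.0084, 'l': 0.0335, 'm': 0.0235, 'n': 0.0596, 'o': 0.0689,
    'p': 0.0192, 'q': 0.0008, 'r': 0.0508, 's': 0.0567, 't': 0.0706,
    'u': 0.0334, 'v': 0.0069, 'w': 0.0119, 'x': 0.0073, 'y': 0.0164,
    'z': 0.0007, '-': 0.1928,
}

def information_content(px):
    """Shannon information content h(x) = log2 1/P(x), in bits (eq. 2.34)."""
    return math.log2(1.0 / px)

def entropy(probs):
    """Entropy H(X) = sum_x P(x) log2 1/P(x), in bits (eq. 2.35).
    Outcomes with P(x) = 0 contribute nothing, by the convention 0 log 1/0 = 0."""
    return sum(px * math.log2(1.0 / px) for px in probs.values() if px > 0)

print(f"h(z) = {information_content(p['z']):.1f} bits")
print(f"h(e) = {information_content(p['e']):.1f} bits")
print(f"H(X) = {entropy(p):.2f} bits")
```

Running this prints h(e) ≈ 3.5 bits and H(X) ≈ 4.11 bits, matching table 2.9; h(z) comes out as ≈ 10.5 bits rather than the table's 10.4 only because p(z) = .0007 is a rounded value.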
