[Figure 18.7. Zipf plots for the words of two 'languages' generated by creating successive characters from a Dirichlet process with α = 2, and declaring one character to be the space character. The two curves result from two different choices of the space character.]

With a small tweak, however, Dirichlet processes can produce rather nice Zipf plots. Imagine generating a language composed of elementary symbols using a Dirichlet process with a rather small value of the parameter α, so that the number of reasonably frequent symbols is about 27. If we then declare one of those symbols (now called 'characters' rather than words) to be a space character, then we can identify the strings between the space characters as 'words'. If we generate a language in this way then the frequencies of words often come out as very nice Zipf plots, as shown in figure 18.7. Which character is selected as the space character determines the slope of the Zipf plot – a less probable space character gives rise to a richer language with a shallower slope.

18.3 Units of information content

The information content of an outcome, x, whose probability is P(x), is defined to be

    h(x) = \log \frac{1}{P(x)}.    (18.13)

The entropy of an ensemble is an average information content,

    H(X) = \sum_x P(x) \log \frac{1}{P(x)}.    (18.14)

When we compare hypotheses with each other in the light of data, it is often convenient to compare the log of the probability of the data under the alternative hypotheses,

    'log evidence for H_i' = \log P(D \mid H_i),    (18.15)

or, in the case where just two hypotheses are being compared, we evaluate the 'log odds',

    \log \frac{P(D \mid H_1)}{P(D \mid H_2)},    (18.16)

which has also been called the 'weight of evidence in favour of H_1'. The log evidence for a hypothesis, \log P(D \mid H_i), is the negative of the information content of the data D: if the data have large information content, given a hypothesis, then they are surprising to that hypothesis; if some other hypothesis is not so surprised by the data, then that hypothesis becomes more probable. 'Information content', 'surprise value', and log likelihood or log evidence are the same thing.

All these quantities are logarithms of probabilities, or weighted sums of logarithms of probabilities, so they can all be measured in the same units. The units depend on the choice of the base of the logarithm. The names that have been given to these units are shown in table 18.8.
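As a quick illustration of how the choice of logarithm base sets the unit, here is a minimal Python sketch (not part of the original text; the function names and the example ensemble are illustrative only) that evaluates the information content (18.13) and the entropy (18.14) of a small ensemble in base 2 and in base e.

```python
import math

def information_content(p, base=2):
    """Information content h(x) = log(1/p) of an outcome with probability p."""
    return math.log(1.0 / p, base)

def entropy(probs, base=2):
    """Entropy H(X) = sum_x P(x) log(1/P(x)) of an ensemble."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# Hypothetical example ensemble: a biased coin with P(heads) = 0.9, P(tails) = 0.1.
probs = [0.9, 0.1]

# The same quantities in different units; only the base of the logarithm changes.
print("h(tails) = %.4f (base 2)" % information_content(0.1, base=2))
print("h(tails) = %.4f (base e)" % information_content(0.1, base=math.e))
print("H(X)     = %.4f (base 2)" % entropy(probs, base=2))
print("H(X)     = %.4f (base e)" % entropy(probs, base=math.e))
```

Changing the base simply rescales every one of these quantities by a constant factor, which is why they can all be reported in whichever unit is convenient.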
