If we are given the text one word at a time, the Shannon information content of each five-character word is log₂ 32,768 = 15 bits, since Wenglish uses all its words with equal probability. The average information content per character is therefore 3 bits.

Now let's look at the information content if we read the document one character at a time. If, say, the first letter of a word is a, the Shannon information content is log₂ 1/0.0625 ≃ 4 bits. If the first letter is z, the Shannon information content is log₂ 1/0.001 ≃ 10 bits. The information content is thus highly variable at the first character. The total information content of the 5 characters in a word, however, is exactly 15 bits; so the letters that follow an initial z have lower average information content per character than the letters that follow an initial a. A rare initial letter such as z indeed conveys more information about what the word is than a common initial letter.

Similarly, in English, if rare characters occur at the start of the word (e.g. xyl...), then often we can identify the whole word immediately; whereas words that start with common characters (e.g. pro...) require more characters before we can identify them.

4.2 Data compression

The preceding examples justify the idea that the Shannon information content of an outcome is a natural measure of its information content. Improbable outcomes do convey more information than probable outcomes. We now discuss the information content of a source by considering how many bits are needed to describe the outcome of an experiment.

If we can show that we can compress data from a particular source into a file of L bits per source symbol and recover the data reliably, then we will say that the average information content of that source is at most L bits per symbol.

Example: compression of text files

A file is composed of a sequence of bytes. A byte is composed of 8 bits and can have a decimal value between 0 and 255. A typical text file is composed of the ASCII character set (decimal values 0 to 127). This character set uses only seven of the eight bits in a byte. (Here we use the word 'bit' with its meaning, 'a symbol with two values', not to be confused with the unit of information content.)

⊲ Exercise 4.4. [1, p.86] By how much could the size of a file be reduced given that it is an ASCII file? How would you achieve this reduction?

Intuitively, it seems reasonable to assert that an ASCII file contains 7/8 as much information as an arbitrary file of the same size, since we already know one out of every eight bits before we even look at the file. This is a simple example of redundancy. Most sources of data have further redundancy: English text files use the ASCII characters with non-equal frequency; certain pairs of letters are more probable than others; and entire words can be predicted given the context and a semantic understanding of the text.

Some simple data compression methods that define measures of information content

One way of measuring the information content of a random variable is simply to count the number of possible outcomes, |A_X|. (The number of elements in a set A is denoted by |A|.) If we gave a binary name to each outcome, the length of each name would be log₂ |A_X| bits, if |A_X| happened to be a power of 2.
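
As a quick numerical check of the figures quoted above, here is a minimal Python sketch (mine, not the book's) that evaluates the Shannon information content h(x) = log₂(1/P(x)) for the Wenglish first-letter probabilities given in the text, together with the 15-bit content of an equiprobable choice among 32,768 words.

    from math import log2

    def shannon_info_content(p):
        # Shannon information content of an outcome with probability p, in bits.
        return log2(1.0 / p)

    print(log2(32768))                   # 15.0 bits per equiprobable Wenglish word, i.e. 3 bits per character
    print(shannon_info_content(0.0625))  # 4.0 bits for a word whose first letter is 'a'
    print(shannon_info_content(0.001))   # ~9.97 bits, roughly 10, for a word whose first letter is 'z'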
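
On the question raised in Exercise 4.4, one way to realize the 7/8 reduction is to concatenate the seven significant bits of each ASCII byte, so that every eight characters are stored in seven bytes. The Python sketch below is only an illustration of this idea, not the solution given on p.86.

    def pack_ascii(data: bytes) -> bytes:
        # Pack 7-bit ASCII characters end to end: every 8 input bytes become 7 output bytes.
        bits, nbits, out = 0, 0, bytearray()
        for b in data:
            assert b < 128, "not a 7-bit ASCII byte"
            bits = (bits << 7) | b        # append the 7 significant bits of this character
            nbits += 7
            while nbits >= 8:             # emit a byte whenever 8 bits are available
                nbits -= 8
                out.append((bits >> nbits) & 0xFF)
        if nbits:                         # flush the remainder, padded with zeros
            out.append((bits << (8 - nbits)) & 0xFF)
        return bytes(out)                 # a real format would also store len(data) so the padding can be discarded

    text = b"This character set uses only seven of the eight bits in a byte."
    print(len(text), len(pack_ascii(text)))   # the packed size is roughly 7/8 of the original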
