4.3 Information content defined in terms of lossy compression

Imagine comparing the information contents of two text files: one in which all the characters are used with equal probability, and one in which the characters are used with their frequencies in English text. Can we define a measure of information content that distinguishes between these two files? Intuitively, the latter file contains less information per character because it is more predictable.

One simple way to use our knowledge that some symbols have a smaller probability is to imagine recoding the observations into a smaller alphabet – thus losing the ability to encode some of the more improbable symbols – and then measuring the raw bit content of the new alphabet. For example, we might take a risk when compressing English text, guessing that the most infrequent characters won't occur, and make a reduced ASCII code that omits the characters { !, @, #, %, ^, *, ~, <, >, /, \, _, {, }, [, ], | }, thereby reducing the size of the alphabet by seventeen. The larger the risk we are willing to take, the smaller our final alphabet becomes.

We introduce a parameter δ that describes the risk we are taking when using this compression method: δ is the probability that there will be no name for an outcome x.

Example 4.6. Let

    A_X = { a, b, c, d, e, f, g, h },
    P_X = { 1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64 }.    (4.17)

The raw bit content of this ensemble is 3 bits, corresponding to 8 binary names. But notice that P(x ∈ {a, b, c, d}) = 1/4 + 1/4 + 1/4 + 3/16 = 15/16. So if we are willing to run a risk of δ = 1/16 of not having a name for x, then we can get by with four names – half as many names as are needed if every x ∈ A_X has a name.

Table 4.5 shows binary names that could be given to the different outcomes in the cases δ = 0 and δ = 1/16. When δ = 0 we need 3 bits to encode the outcome; when δ = 1/16 we need only 2 bits.

    x    c(x), δ = 0    c(x), δ = 1/16
    a    000            00
    b    001            01
    c    010            10
    d    011            11
    e    100            −
    f    101            −
    g    110            −
    h    111            −

    Table 4.5. Binary names for the outcomes, for two failure probabilities δ.

Let us now formalize this idea. To make a compression strategy with risk δ, we make the smallest possible subset S_δ such that the probability that x is not in S_δ is less than or equal to δ, i.e., P(x ∉ S_δ) ≤ δ. For each value of δ we can then define a new measure of information content – the log of the size of this smallest subset S_δ. [In ensembles in which several elements have the same probability, there may be several smallest subsets that contain different elements, but all that matters is their sizes (which are equal), so we will not dwell on this ambiguity.]

The smallest δ-sufficient subset S_δ is the smallest subset of A_X satisfying

    P(x ∈ S_δ) ≥ 1 − δ.    (4.18)

The subset S_δ can be constructed by ranking the elements of A_X in order of decreasing probability and adding successive elements starting from the most probable elements until the total probability is ≥ (1 − δ).

We can make a data compression code by assigning a binary name to each element of the smallest sufficient subset. This compression scheme motivates the following measure of information content:

The essential bit content of X is:

    H_δ(X) = log_2 |S_δ|.    (4.19)

Note that H_0(X) is the special case of H_δ(X) with δ = 0 (if P(x) > 0 for all x ∈ A_X).
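The greedy construction of S_δ is mechanical enough to check numerically. Below is a minimal Python sketch (the code and the function names are ours, not the book's) that ranks outcomes by decreasing probability, builds S_δ, and computes the essential bit content for the ensemble of example 4.6:

    from math import log2

    # Ensemble of example 4.6.
    P = {'a': 1/4, 'b': 1/4, 'c': 1/4, 'd': 3/16,
         'e': 1/64, 'f': 1/64, 'g': 1/64, 'h': 1/64}

    def smallest_sufficient_subset(P, delta):
        """Rank outcomes by decreasing probability and keep adding
        them until the total probability reaches 1 - delta."""
        subset, total = [], 0.0
        for x in sorted(P, key=P.get, reverse=True):
            if total >= 1 - delta:
                break
            subset.append(x)
            total += P[x]
        return subset

    def essential_bit_content(P, delta):
        """H_delta(X) = log_2 |S_delta|."""
        return log2(len(smallest_sufficient_subset(P, delta)))

    print(essential_bit_content(P, 0))     # 3.0: every outcome needs a name
    print(essential_bit_content(P, 1/16))  # 2.0: S_delta = {a, b, c, d}

Ties among the equally probable outcomes a, b, c are broken arbitrarily by the sort, which is harmless: as noted above, only the size of the subset matters.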
[Caution: do not confuse H_0(X) and H_δ(X) with the function H_2(p) displayed in figure 4.1.]

Figure 4.6 shows H_δ(X) for the ensemble of example 4.6 as a function of δ.
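Because |S_δ| only changes when δ crosses the cumulative probability of the outcomes discarded so far, H_δ(X) is a staircase function of δ. A short sketch (again ours, not the book's) that enumerates those steps for the example 4.6 ensemble, recomputing the staircase that figure 4.6 plots:

    from math import log2

    # Probabilities of example 4.6, sorted in decreasing order.
    probs = sorted([1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64],
                   reverse=True)

    # |S_delta| drops to k once delta reaches the total probability of
    # the 8 - k least probable outcomes, so H_delta(X) steps down at
    # those cumulative tail probabilities.
    print(f"delta in [0, {1/64}): H_delta = {log2(8):.2f} bits")
    tail = 0.0
    for k in range(len(probs) - 1, 0, -1):
        tail += probs[k]   # discard the least probable remaining outcome
        next_step = tail + probs[k - 1] if k > 1 else 1.0
        print(f"delta in [{tail}, {next_step}): H_delta = {log2(k):.2f} bits")

For instance, this prints H_δ = 2.00 bits on the interval δ ∈ [1/16, 1/4), consistent with the four names of table 4.5.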

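Finally, the compression code itself: assigning a binary name to each element of the smallest sufficient subset, as the formalization above describes, is exactly what table 4.5 tabulates. A sketch that generates such a codebook automatically (delta_codebook is our name, not the book's):

    from math import ceil, log2

    def delta_codebook(P, delta):
        """Give fixed-length binary names to the smallest
        delta-sufficient subset; all other outcomes get none."""
        subset, total = [], 0.0
        for x in sorted(P, key=P.get, reverse=True):
            if total >= 1 - delta:
                break
            subset.append(x)
            total += P[x]
        width = max(1, ceil(log2(len(subset))))
        return {x: format(i, f"0{width}b") for i, x in enumerate(subset)}

    P = {'a': 1/4, 'b': 1/4, 'c': 1/4, 'd': 3/16,
         'e': 1/64, 'f': 1/64, 'g': 1/64, 'h': 1/64}
    print(delta_codebook(P, 0))     # eight 3-bit names, as in table 4.5
    print(delta_codebook(P, 1/16))  # four 2-bit names for {a, b, c, d}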