12.07.2015 Views

Data Compression: The Complete Reference

Data Compression: The Complete Reference

Data Compression: The Complete Reference

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

2.18 PPM 135probability.) Another reason why a symbol must have nonzero probability is that itsentropy (the smallest number of bits into which it can be encoded) depends on log 2 P ,which is undefined for P = 0 (but gets very large when P → 0). This zero-probabilityproblem faces any model, static or adaptive, that uses probabilities of occurrence ofsymbols to achieve compression. Two simple solutions are traditionally adopted for thisproblem, but neither has any theoretical justification.1. After analyzing a large quantity of data and counting frequencies, go over the frequencytable, looking for empty cells. Each empty cell is assigned a frequency countof 1, and the total count is also incremented by 1. This method pretends that everydigram and trigram has been seen at least once.2. Add 1 to the total count and divide this single 1 among all the empty cells. Eachwill get a count that’s less than 1 and, as a result, a very small probability. This assignsa very small probability to anything that hasn’t been seen in the training data used forthe analysis.An adaptive context-based modeler also maintains tables with the probabilities ofall the possible digrams (or trigrams or even longer contexts) of the alphabet and usesthe tables to assign a probability to the next symbol S depending on a few symbols immediatelypreceding it (its context C). <strong>The</strong> tables are updated all the time as more datais being input, which adapts the probabilities to the particular data being compressed.Such a model is slower and more complex than the static one but produces bettercompression, since it uses the correct probabilities even when the input has data withprobabilities much different from the average.A text that skews letter probabilities is called a lipogram. (Would a computerprogram without any goto statements be considered a lipogram?) <strong>The</strong> word comesfrom the Greek stem λɛίπω (lipo or leipo) meaning to miss, to lack, combined withthe Greek γράµµα (gramma), meaning “letter” or “of letters.” Together they formλιπoγράµµατoσ. <strong>The</strong>re are just a few examples of literary works that are lipograms:1. Perhaps the best-known lipogram in English is Gadsby, a full-length novel [Wright 39],by Ernest V. Wright, that does not contain any occurrences of the letter E.2. Alphabetical Africa by Walter Abish (W. W. Norton, 1974) is a readable lipogramwhere the reader is supposed to discover the unusual writing style while reading. Thisstyle has to do with the initial letters of words. <strong>The</strong> book consists of 52 chapters. In thefirst, all words begin with a; in the second, words start with either a or b, etc., until, inChapter 26, all letters are allowed at the start of a word. In the remaining 26 chapters,the letters are taken away one by one. Various readers have commented on how littleor how much they have missed the word “the” and how they felt on finally seeing it (inChapter 20).3. <strong>The</strong> novel La Disparition is a 1969 French lipogram by Georges Perec that does notcontain the letter E (this letter actually appears several times, outside the main text, inwords that the publisher had to include, and these are all printed in red). La Disparitionhas been translated to English, where it is titled A Void, by Gilbert Adair. Perec alsowrote a univocalic (text employing just one vowel) titled Les Revenentes (the revenents)employing only the vowel E. <strong>The</strong> title of the English translation (by Ian Monk) is <strong>The</strong>Exeter Text, Jewels, Secrets, Sex. (Perec also wrote a short history of lipograms, see[Motte 98].)

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!