Data Compression: The Complete Reference



1. Basic Techniques

There are three kinds of lies: lies, damned lies, and statistics.
(Attributed by Mark Twain to Benjamin Disraeli)

... encoding. The latter technique is described in Section 2.10, but here is how MNP5 solves problem 2 above.

When three or more identical consecutive bytes are found in the input stream, the compressor writes three copies of the byte on the output stream, followed by a repetition count. When the decompressor reads three identical consecutive bytes, it knows that the next byte is a repetition count (which may be 0, indicating just three repetitions). A disadvantage of the method is that a run of three characters in the input stream results in four characters written to the output stream: expansion! A run of four characters results in no compression. Only runs longer than 4 characters get compressed. Another slight problem is that the maximum count is artificially limited in MNP5 to 250 instead of 255.

To get an idea of the compression ratios produced by RLE, we assume a string of N characters that needs to be compressed. We assume that the string contains M repetitions of average length L each. Each of the M repetitions is replaced by 3 characters (escape, count, and data), so the size of the compressed string is N − M × L + M × 3 = N − M(L − 3) and the compression factor is

N/[N − M(L − 3)].

(For MNP5 just substitute 4 for 3.) Examples: N = 1000, M = 10, L = 4 yield a compression factor of 1000/[1000 − 10(4 − 3)] = 1.01. A better result is obtained in the case N = 1000, M = 50, L = 10, where the factor is 1000/[1000 − 50(10 − 3)] = 1.538.

A variant of run length encoding for text is digram encoding. This method is suitable for cases where the data to be compressed consists only of certain characters, e.g., just letters, digits, and punctuation.
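
Returning briefly to run-length encoding: the MNP5-style rule described earlier (three copies of the byte followed by a count) and the compression-factor formula can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the actual MNP5 implementation. The function names, the constant MAX_COUNT, and the choice to split runs longer than 253 bytes into several chunks are assumptions made for this sketch.

```python
MAX_COUNT = 250  # the text says MNP5 limits the count to 250


def mnp5_rle_encode(data: bytes) -> bytes:
    """Encode runs of 3+ identical bytes as three copies plus a count byte."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # Measure the run starting at i, capped at 3 + MAX_COUNT bytes
        # (assumption: longer runs are simply split into several chunks).
        j = i + 1
        while j < n and data[j] == data[i] and j - i < 3 + MAX_COUNT:
            j += 1
        run = j - i
        if run >= 3:
            out += bytes([data[i]]) * 3
            out.append(run - 3)  # number of repetitions beyond the first three
        else:
            out += data[i:j]  # runs of 1 or 2 bytes are copied unchanged
        i = j
    return bytes(out)


def mnp5_rle_decode(data: bytes) -> bytes:
    """Invert mnp5_rle_encode: after three equal bytes, read a count byte."""
    out = bytearray()
    i, n = 0, len(data)
    prev, run = None, 0
    while i < n:
        b = data[i]
        i += 1
        out.append(b)
        run = run + 1 if b == prev else 1
        prev = b
        if run == 3:
            if i >= n:
                raise ValueError("truncated input: missing repetition count")
            out += bytes([b]) * data[i]  # count may be 0
            i += 1
            prev, run = None, 0  # start counting afresh after a count byte
    return bytes(out)


def rle_compression_factor(n: int, m: int, l: float, per_run: int = 3) -> float:
    """N / [N - M(L - per_run)], with per_run = 3 (or 4 for MNP5)."""
    return n / (n - m * (l - per_run))
```

With these definitions, a run of three bytes encodes to four bytes (expansion), a run of four also encodes to four bytes (no gain), and a run of five is the first to shrink, which matches the behavior described in the text. Calling rle_compression_factor(1000, 50, 10) evaluates the second worked example.
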
The idea is to identify commonly occurring pairs of characters and to replace a pair (a digram) with one of the characters that cannot occur in the data (e.g., one of the ASCII control characters). Good results can be obtained if the data can be analyzed beforehand. We know that in plain English certain pairs of characters, such as “E␣”, “␣T”, “TH”, and “␣A”, occur often. Other types of data may have different common digrams. The sequitur method of Section 8.10 is an example of a method that compresses data by locating repeating digrams (as well as longer repeated phrases) and replacing them with special symbols.

A similar variant is pattern substitution. This is suitable for compressing computer programs, where certain words, such as for, repeat, and print, occur often. Each such word is replaced with a control character or, if there are many such words, with an escape character followed by a code character. Assuming that code “a” is assigned to the word print, the text “m:␣print,b,a;” will be compressed to “m:␣@a,b,a;”.

1.3.1 Relative Encoding

This is another variant, sometimes called differencing (see [Gottlieb 75]). It is used in cases where the data to be compressed consists of a string of numbers that do not differ
