23.11.2014 Views

Data Structures and Algorithms in Java[1].pdf - Fulvio Frisone

transmit our text. Likewise, text compression is also useful for storing collections of
large documents more efficiently, so as to allow for a fixed-capacity storage device to
contain as many documents as possible.

The method for text compression explored in this section is the Huffman code.
Standard encoding schemes, such as the ASCII and Unicode systems, use fixed-length
binary strings to encode characters (with 7 bits in the ASCII system and 16 in
the Unicode system). A Huffman code, on the other hand, uses a variable-length
encoding optimized for the string X. The optimization is based on the use of character
frequencies, where we have, for each character c, a count f(c) of the number of times
c appears in the string X. The Huffman code saves space over a fixed-length encoding
by using short code-word strings to encode high-frequency characters and long code-word
strings to encode low-frequency characters.
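
The frequency count f(c) described above can be sketched in a few lines of Java. This is a minimal illustration, not the book's code: the class name CharFrequencies and the method name count are hypothetical, and the sample string is the one used in Figure 12.11.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the names are chosen for this sketch only.
class CharFrequencies {
    /** Returns, for each character c occurring in x, the count f(c). */
    static Map<Character, Integer> count(String x) {
        Map<Character, Integer> f = new TreeMap<>();
        for (char c : x.toCharArray()) {
            f.merge(c, 1, Integer::sum);   // first occurrence stores 1, later ones add 1
        }
        return f;
    }

    public static void main(String[] args) {
        String x = "a fast runner need never be afraid of the dark";
        count(x).forEach((c, n) -> System.out.println("'" + c + "' -> " + n));
    }
}
```

For this string, the space character occurs 9 times and 'e' occurs 7 times, so a Huffman code would give those two characters the shortest code words.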

To encode the string X, we convert each character in X from its fixed-length code
word to its variable-length code word, and we concatenate all these code words in
order to produce the encoding Y for X. In order to avoid ambiguities, we insist that no
code word in our encoding is a prefix of another code word in our encoding. Such a
code is called a prefix code, and it simplifies the decoding of Y in order to get back X.
(See Figure 12.11.) Even with this restriction, the savings produced by a variable-length
prefix code can be significant, particularly if there is a wide variance in
character frequencies (as is the case for natural language text in almost every spoken
language).
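
To see why the prefix property makes decoding unambiguous, consider the following sketch. It assumes a small hand-picked code table (a = 0, b = 10, c = 11), which is not taken from the book, and it represents bits as the characters '0' and '1' for readability. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class; the code table below is an assumed example.
class PrefixCodeDemo {
    /** Assumed sample table: no code word is a prefix of another. */
    static Map<Character, String> sampleCode() {
        Map<Character, String> code = new HashMap<>();
        code.put('a', "0");
        code.put('b', "10");
        code.put('c', "11");
        return code;
    }

    /** True if no code word is a prefix of (or equal to) another character's code word. */
    static boolean isPrefixFree(Map<Character, String> code) {
        for (Map.Entry<Character, String> u : code.entrySet())
            for (Map.Entry<Character, String> w : code.entrySet())
                if (!u.getKey().equals(w.getKey()) && w.getValue().startsWith(u.getValue()))
                    return false;
        return true;
    }

    /** Concatenates the code words of the characters of x, in order. */
    static String encode(String x, Map<Character, String> code) {
        StringBuilder y = new StringBuilder();
        for (char c : x.toCharArray()) {
            String word = code.get(c);
            if (word == null) throw new IllegalArgumentException("no code word for '" + c + "'");
            y.append(word);
        }
        return y.toString();
    }

    /** Reads y bit by bit; in a prefix code the first match is the only possible one. */
    static String decode(String y, Map<Character, String> code) {
        Map<String, Character> inverse = new HashMap<>();
        code.forEach((c, word) -> inverse.put(word, c));
        StringBuilder x = new StringBuilder();
        StringBuilder bits = new StringBuilder();
        for (char bit : y.toCharArray()) {
            bits.append(bit);
            Character c = inverse.get(bits.toString());
            if (c != null) {          // a complete code word has been read
                x.append(c);
                bits.setLength(0);
            }
        }
        if (bits.length() > 0) throw new IllegalArgumentException("trailing bits are not a code word");
        return x.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> code = sampleCode();
        String y = encode("abca", code);
        System.out.println(y + " decodes to " + decode(y, code));
    }
}
```

With this table, "abca" is encoded as 010110. The decoder never needs to look ahead or backtrack, because a code word that has just been matched cannot also be the beginning of a longer one.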

Huffman's algorithm for producing an optimal variable-length prefix code for X is
based on the construction of a binary tree T that represents the code. Each node in T,
except the root, represents a bit in a code word, with each left child representing a "0"
and each right child representing a "1." Each external node v is associated with a
specific character, and the code word for that character is defined by the sequence of
bits associated with the nodes in the path from the root of T to v. (See Figure 12.11.)
Each external node v has a frequency f(v), which is simply the frequency in X of the
character associated with v. In addition, we give each internal node v in T a
frequency, f(v), that is the sum of the frequencies of all the external nodes in the
subtree rooted at v.
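
The tree representation just described can be sketched as follows. This is an illustration under stated assumptions, not the book's implementation: the Node class, the method names, and the tiny hand-built tree (frequencies a = 5, b = 2, c = 1) are all hypothetical, and the sketch only reads code words off an existing tree; it does not show how Huffman's algorithm constructs T.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; the tree and frequencies are assumed for illustration.
class HuffmanTreeDemo {
    static final class Node {
        final char symbol;        // meaningful only for external nodes
        final int freq;           // f(v)
        final Node left, right;   // both null for external nodes

        /** External node: f(v) is the frequency of its character. */
        Node(char symbol, int freq) {
            this.symbol = symbol;
            this.freq = freq;
            this.left = null;
            this.right = null;
        }

        /** Internal node: f(v) is the sum over the external nodes in its subtree. */
        Node(Node left, Node right) {
            this.symbol = '\0';
            this.freq = left.freq + right.freq;
            this.left = left;
            this.right = right;
        }

        boolean isExternal() {
            return left == null && right == null;
        }
    }

    /** Assumed sample tree for frequencies a = 5, b = 2, c = 1. */
    static Node sampleTree() {
        Node low = new Node(new Node('c', 1), new Node('b', 2));   // f = 3
        return new Node(new Node('a', 5), low);                    // f = 8
    }

    /** Maps each character to the bit sequence on the path from the root to its node. */
    static Map<Character, String> codeWords(Node root) {
        Map<Character, String> out = new TreeMap<>();
        collect(root, "", out);
        return out;
    }

    private static void collect(Node v, String path, Map<Character, String> out) {
        if (v.isExternal()) {
            out.put(v.symbol, path);
            return;
        }
        collect(v.left, path + "0", out);    // a left child contributes a 0
        collect(v.right, path + "1", out);   // a right child contributes a 1
    }

    public static void main(String[] args) {
        System.out.println(codeWords(sampleTree()));
    }
}
```

In this sample tree the most frequent character, a, sits one level below the root and gets the one-bit code word 0, while b and c sit two levels down and get 11 and 10. Because characters are stored only at external nodes, no code word can be a prefix of another.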

Figure 12.11: An illustration of an example Huffman code for the input string
X = "a fast runner need never be afraid of the dark": (a) frequency of each character
of X; (b) Huffman tree T for string X. The code for a character c is obtained by tracing
the path from the root of T to the external node where c is stored, and associating a
left child with 0 and a right child with 1.
