15.04.2018 Views

programming-for-dummies


Lossless Data Compression Algorithms


The basic idea behind lossless data compression is to pack data into a smaller space without losing any of it in the process. To do this, lossless data compression algorithms are typically optimized for specific types of data, such as text, audio, or video, although the general principles remain the same no matter what type of data the algorithm is compressing.

Run-length encoding

The simplest lossless data compression algorithm is run-length encoding (RLE). This method looks for runs of repeated data and replaces each run with a much shorter code. Suppose you had the following 17-character string:

WWWBBWWWWBBBBWWWW

RLE condenses each run of repeated characters, producing a 10-character string like this:

3W2B4W4B4W

The number in front of each letter identifies how many characters the code replaced. In this example, the first two characters, 3W, represent WWW; 2B represents BB; 4W represents WWWW; and so on.
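
The count-then-letter scheme described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular product implements RLE: the function names rle_encode and rle_decode are made up for this example, and the decoder assumes the original data never contains digit characters.

```python
from itertools import groupby


def rle_encode(text: str) -> str:
    """Replace each run of identical characters with <count><character>."""
    # groupby yields one (character, run) pair per consecutive run.
    return "".join(f"{len(list(run))}{char}" for char, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Expand <count><character> pairs back into the original string.

    Assumption for this sketch: the original data contains no digits,
    so every digit seen here belongs to a run count.
    """
    pieces = []
    count = ""
    for ch in encoded:
        if ch.isdigit():
            count += ch  # counts may span several digits, e.g. 12W
        else:
            pieces.append(ch * int(count))
            count = ""
    return "".join(pieces)


print(rle_encode("WWWBBWWWWBBBBWWWW"))  # 3W2B4W4B4W
print(rle_decode("3W2B4W4B4W"))         # WWWBBWWWWBBBBWWWW
```

Decoding the result gives back exactly the original 17 characters, which is what makes the method lossless.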

Run-length encoding is used by fax machines because most faxed pages consist mainly of white space with occasional black areas that represent letters or drawings.

The Burrows-Wheeler transform algorithm

One problem with run-length encoding is that it works best when repetitive characters appear grouped together. With a string like WBWBWB#, RLE can’t compress anything because no W or B characters are bunched together. (However, a smart version of the RLE algorithm notices the repeated two-character string WB and encodes the string as 3(WB)#, which tells the computer to repeat the two-character pattern WB three times.)

When redundant characters appear scattered, run-length encoding can be made more efficient by first transforming the data to group identical characters together and then using run-length encoding to compress the result. That’s the idea behind the Burrows-Wheeler transform (BWT) algorithm, developed by Michael Burrows and David Wheeler.
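
To make the grouping idea concrete, here is a rough Python sketch of the textbook form of the transform: list every rotation of the string, sort the rotations, and keep the last character of each one. It is a naive illustration, not an efficient implementation. The function names bwt and inverse_bwt are invented for this example, and the sketch assumes that # is an end-of-string marker that never appears in the data itself.

```python
def bwt(text: str, end_marker: str = "#") -> str:
    """Naive Burrows-Wheeler transform using sorted rotations."""
    if end_marker in text:
        raise ValueError("the end marker must not appear in the data")
    s = text + end_marker
    # Build every rotation of s and sort them (by character code).
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, end_marker: str = "#") -> str:
    """Undo bwt() by rebuilding the sorted rotation table column by column."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the transformed column to every row, then re-sort.
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The original string is the row that ends with the marker.
    original = next(row for row in table if row.endswith(end_marker))
    return original[:-1]


print(bwt("WBWBWB"))           # BWWWBB#
print(inverse_bwt("BWWWBB#"))  # WBWBWB
```

For the alternating string WBWBWB, the transform yields BWWWBB#, in which the W and B characters now sit in runs that run-length encoding can count. On a string this short the savings are nil, but the same grouping effect is what helps on longer data, and because the transform can be reversed exactly, no data is lost.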
