Algorithms and Data Structures for External Memory

15.2 External Memory Compressed Data Structures

Eltabakh et al. [146] consider sets of variable-length strings encoded using run-length coding in order to exploit space savings when there are repeated characters. They adapt string B-trees [157, 158] (see Section 14.2) with the EM priority search data structure for three-sided range searching [50] (see Section 12.4) to develop a dynamic compressed data structure for the run-length encoded data. The data structure supports substring matching, prefix matching, and range search queries. The data structure uses $O(\hat{N}/B)$ blocks of space, where $\hat{N}$ is the total length of the compressed strings. Insertions and deletions of $t$ run-length encoded suffixes take $O\bigl(t \log_B(\hat{N} + t)\bigr)$ I/Os. Queries for a pattern $P$ can be performed in $O\bigl(\log_B \hat{N} + (|\hat{P}| + Z)/B\bigr)$ I/Os, where $|\hat{P}|$ is the size of the search pattern in run-length encoded format.
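To make the size measure $|\hat{P}|$ concrete, the following is a minimal Python sketch of run-length coding. It is an illustration only, not the encoding used by Eltabakh et al.: the function names `rle_encode` and `rle_decode` are hypothetical, and counting the encoded size as the number of (character, count) pairs is an assumption made for this example.

```python
from itertools import groupby

def rle_encode(s):
    """Collapse each maximal run of a repeated character into a (char, count) pair."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(s)]

def rle_decode(runs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)

pattern = "aaaabbbcca"
encoded = rle_encode(pattern)
print(encoded)                      # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(len(pattern), len(encoded))   # 10 4
```

Under this assumed measure, the 10-character pattern shrinks to 4 runs. The saving disappears when the string has no repeated adjacent characters.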

One of the major advances in text indexing in the last decade has been the development of entropy-compressed data structures. Until fairly recently, the best general-purpose data structures for pattern matching queries were the suffix tree and its more space-efficient version, the suffix array, which we studied in Section 14.3. However, the suffix array requires several times more space than the text being indexed. The basic reason is that suffix arrays store an array of pointers, each requiring at least $\log N$ bits, whereas the original text being indexed consists of $N$ characters, each of size $\log |\Sigma|$ bits. For a terabyte of ASCII text (i.e., $N = 2^{40}$), each text character occupies 8 bits. The suffix array, on the other hand, consists of $N$ pointers to the text, each pointer requiring $\log N = 40$ bits, which is five times larger.¹
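The space comparison above can be checked with a short Python sketch. It builds a suffix array naively, by sorting suffix start positions, purely to show what the $N$ pointers are. This is not an external-memory construction, and the helper names `suffix_array` and `space_ratio` are hypothetical. The alphabet size of 256 is an assumption that matches the 8 bits per character stated in the text.

```python
import math

def suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def space_ratio(n, sigma):
    """Ratio of suffix-array bits (n pointers of ceil(log2 n) bits each)
    to text bits (n characters of ceil(log2 sigma) bits each)."""
    pointer_bits = math.ceil(math.log2(n))
    char_bits = math.ceil(math.log2(sigma))
    return (n * pointer_bits) / (n * char_bits)

print(suffix_array("banana"))     # [5, 3, 1, 0, 4, 2]
print(space_ratio(2**40, 256))    # 5.0
```

The ratio depends only on $\log N / \log |\Sigma|$, so the overhead grows with the text length and is larger still for small alphabets.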

For reasonable values of $h$, the compressed suffix array of Grossi et al. [185] requires an amount of space in bits per character only as large as the $h$th-order entropy $H_h(T)$ of the original text, plus lower-order terms. In addition, the compressed suffix array is self-indexing, in that it encodes the original text and provides random access to the characters of the original text. Therefore, the original text does not need to be stored, and we can delete it. The net result is a big improvement over conventional suffix arrays: Rather than having to store both the original

¹ Imagine going to a library and finding that the card catalog is stored in an annex that is five times larger than the main library! Suffix arrays support general pattern matching, which card catalogs do not, but it is still counterintuitive for an index to require so much more space than the text it is indexing.
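As a rough illustration of the $h$th-order entropy $H_h(T)$ that bounds the compressed suffix array's space, the sketch below computes the standard empirical entropy of a string: for each length-$h$ context, take the zeroth-order entropy of the characters that follow it, weight by the number of followers, and divide by the text length. The function names `h0` and `hk` are hypothetical, and the sketch only measures the entropy; it does not build a compressed suffix array.

```python
import math
from collections import Counter, defaultdict

def h0(symbols):
    """Zeroth-order empirical entropy, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def hk(text, h):
    """h-th order empirical entropy: weighted average of h0 over the
    characters that follow each distinct length-h context."""
    if not text:
        return 0.0
    if h == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(len(text) - h):
        followers[text[i:i + h]].append(text[i + h])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)

print(hk("abababab", 0))   # 1.0
print(hk("abababab", 1))   # 0.0
```

In this example each character is fully determined by its predecessor, so the first-order entropy drops to zero even though the zeroth-order entropy is one bit per character.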
