Algorithms and Data Structures for External Memory
15.2 External Memory Compressed Data Structures
Eltabakh et al. [146] consider sets of variable-length strings encoded
using run-length coding in order to exploit space savings when there
are repeated characters. They adapt string B-trees [157, 158] (see Section
14.2) with the EM priority search data structure for three-sided
range searching [50] (see Section 12.4) to develop a dynamic compressed
data structure for the run-length encoded data. The data structure supports
substring matching, prefix matching, and range search queries.
The data structure uses $O(\hat{N}/B)$ blocks of space, where $\hat{N}$ is the total
length of the compressed strings. Insertions and deletions of $t$ run-length
encoded suffixes take $O\bigl(t \log_B(\hat{N} + t)\bigr)$ I/Os. Queries for a pattern
$P$ can be performed in $O\bigl(\log_B \hat{N} + (|\hat{P}| + Z)/B\bigr)$ I/Os, where $|\hat{P}|$
is the size of the search pattern in run-length encoded format.
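As a small illustration of the encoding itself (not of the data structure of Eltabakh et al.), the following Python sketch run-length encodes a string and compares the raw length with the number of runs. The function names are hypothetical, and counting runs is only an assumed stand-in for the encoded size $|\hat{P}|$; the actual size depends on how the run lengths are stored.

```python
from itertools import groupby


def run_length_encode(text):
    """Collapse each maximal run of a repeated character into (char, count)."""
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(text)]


def run_length_decode(runs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in runs)


pattern = "aaaabbbcca"
encoded = run_length_encode(pattern)

print(encoded)                     # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(len(pattern), len(encoded))  # 10 4
```

The savings come entirely from repeated characters: a string with no repeats has as many runs as characters, so this encoding does not shrink it.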
One of the major advances in text indexing in the last decade has
been the development of entropy-compressed data structures. Until
fairly recently, the best general-purpose data structures for pattern
matching queries were the suffix tree and its more space-efficient version,
the suffix array, which we studied in Section 14.3. However, the
suffix array requires several times more space than the text being
indexed. The basic reason is that suffix arrays store an array of pointers,
each requiring at least $\log N$ bits, whereas the original text being
indexed consists of $N$ characters, each of size $\log |\Sigma|$ bits. For a terabyte
of ASCII text (i.e., $N = 2^{40}$), each text character occupies 8 bits.
The suffix array, on the other hand, consists of $N$ pointers to the text,
each pointer requiring $\log N = 40$ bits, which is five times larger.¹
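The arithmetic above can be checked with a short Python sketch. The helper names are hypothetical, the alphabet size of 256 is an assumption that each ASCII character is stored in one 8-bit byte, and the suffix array construction is a naive sort used only to show that the index holds one pointer per text character.

```python
def bits_needed(num_values):
    """Smallest number of bits that can distinguish num_values values."""
    return max(1, (num_values - 1).bit_length())


def suffix_array_overhead(n_chars, alphabet_size):
    """Return (bits per character, bits per pointer, pointer/character ratio)."""
    bits_per_char = bits_needed(alphabet_size)
    bits_per_pointer = bits_needed(n_chars)
    return bits_per_char, bits_per_pointer, bits_per_pointer / bits_per_char


def naive_suffix_array(text):
    """Starting positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


print(suffix_array_overhead(2**40, 256))  # (8, 40, 5.0)
print(naive_suffix_array("banana"))       # [5, 3, 1, 0, 4, 2]
```

The ratio grows with the text: doubling $N$ adds one bit to every pointer, while the cost per character stays fixed by the alphabet.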
For reasonable values of $h$, the compressed suffix array of Grossi et
al. [185] requires an amount of space in bits per character only as large
as the $h$th-order entropy $H_h(T)$ of the original text, plus lower-order
terms. In addition, the compressed suffix array is self-indexing, in that
it encodes the original text and provides random access to the characters
of the original text. Therefore, the original text does not need to be
stored, and we can delete it. The net result is a big improvement over
conventional suffix arrays: Rather than having to store both the original
¹ Imagine going to a library and finding that the card catalog is stored in an annex that is
five times larger than the main library! Suffix arrays support general pattern matching,
which card catalogs do not, but it is still counterintuitive for an index to require so much
more space than the text it is indexing.