23.11.2014 Views

Data Structures and Algorithms in Java[1].pdf - Fulvio Frisone



12.3 Tries

The pattern matching algorithms presented in the previous section speed up the search in a text by preprocessing the pattern (to compute the failure function in the KMP algorithm or the last function in the BM algorithm). In this section, we take a complementary approach, namely, we present string searching algorithms that preprocess the text. This approach is suitable for applications where a series of queries is performed on a fixed text, so that the initial cost of preprocessing the text is compensated by a speedup in each subsequent query (for example, a Web site that offers pattern matching in Shakespeare's Hamlet or a search engine that offers Web pages on the Hamlet topic).

A trie (pronounced "try") is a tree-based data structure for storing strings in order to support fast pattern matching. The main application for tries is in information retrieval. Indeed, the name "trie" comes from the word "retrieval." In an information retrieval application, such as a search for a certain DNA sequence in a genomic database, we are given a collection S of strings, all defined using the same alphabet. The primary query operations that tries support are pattern matching and prefix matching. The latter operation involves being given a string X, and looking for all the strings in S that contain X as a prefix.
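To pin down what a prefix-matching query returns, here is a brute-force reference in Java that simply scans the collection. It is not a trie; it is only a baseline that states the expected answer, and the class and method names (PrefixMatchBaseline, prefixMatches) are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixMatchBaseline {
    // Returns every string in the collection that has x as a prefix,
    // in the order the strings appear in the input list.
    // Each query scans the whole collection; no preprocessing is done.
    static List<String> prefixMatches(List<String> collection, String x) {
        List<String> hits = new ArrayList<>();
        for (String s : collection) {
            if (s.startsWith(x)) {
                hits.add(s);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> collection = List.of("bear", "bell", "bid", "sell");
        System.out.println(prefixMatches(collection, "be"));
    }
}
```

Every query here touches every string in S, which is the repeated cost that preprocessing the collection is meant to avoid.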

12.3.1 Standard Tries

Let S be a set of s strings from alphabet σ such that no string in S is a prefix of another string. A standard trie for S is an ordered tree T with the following properties (see Figure 12.6):

• Each node of T, except the root, is labeled with a character of σ.
• The ordering of the children of an internal node of T is determined by a canonical ordering of the alphabet σ.
• T has s external nodes, each associated with a string of S, such that the concatenation of the labels of the nodes on the path from the root to an external node v of T yields the string of S associated with v.
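The three properties above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the book's implementation: the class name StandardTrie, the nested Node type, and the methods insert and strings are hypothetical, a node's label is stored as its key in the parent's map, and the natural order of char stands in for the canonical ordering of the alphabet. The sketch assumes its input is prefix-free and does not check this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class StandardTrie {
    // Hypothetical node type. A child's label is its key in the parent's map;
    // TreeMap keeps the children sorted by character, which plays the role
    // of the canonical ordering of the alphabet.
    private static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();

        boolean isExternal() {
            return children.isEmpty();
        }
    }

    private final Node root = new Node(); // the root carries no label

    // Walks down from the root, creating one node per character as needed.
    void insert(String s) {
        Node cur = root;
        for (char c : s.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
    }

    // Lists the string spelled by each root-to-external-node path,
    // visiting children in sorted order.
    List<String> strings() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node v, StringBuilder path, List<String> out) {
        if (v != root && v.isExternal()) {
            out.add(path.toString());
            return;
        }
        for (Map.Entry<Character, Node> e : v.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1); // backtrack to the parent's path
        }
    }

    public static void main(String[] args) {
        StandardTrie t = new StandardTrie();
        // Sample strings chosen for illustration; no one is a prefix of another.
        for (String s : new String[] {"stop", "bell", "buy", "bear", "sell", "bid", "stock", "bull"}) {
            t.insert(s);
        }
        System.out.println(t.strings());
    }
}
```

Because the children are visited in sorted order and the input is prefix-free, strings() returns the s stored strings in alphabetical order, one per external node.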

Thus, a trie T represents the strings of S with paths from the root to the external nodes of T. Note the importance of assuming that no string in S is a prefix of another string. This ensures that each string of S is uniquely associated with an external node of T. We can always satisfy this assumption by adding a special character that is not in the original alphabet σ at the end of each string.
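The effect of the special end character can be checked with a short Java sketch. The names PrefixFree, isPrefixFree, and terminate are hypothetical, and '$' is only an assumed choice of a character outside the alphabet; the sketch rejects any input string that already contains it.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixFree {
    // Assumed terminator; it must not occur in any input string.
    static final char END = '$';

    // True if no string in the list is a prefix of a different entry.
    static boolean isPrefixFree(List<String> strings) {
        for (int i = 0; i < strings.size(); i++) {
            for (int j = 0; j < strings.size(); j++) {
                if (i != j && strings.get(j).startsWith(strings.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    // Appends the terminator to every string.
    static List<String> terminate(List<String> strings) {
        List<String> out = new ArrayList<>();
        for (String s : strings) {
            if (s.indexOf(END) >= 0) {
                throw new IllegalArgumentException("terminator occurs in: " + s);
            }
            out.add(s + END);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> s = List.of("be", "bear", "bell");
        System.out.println(isPrefixFree(s));            // "be" is a prefix of "bear"
        System.out.println(isPrefixFree(terminate(s))); // "be$" is not a prefix of "bear$"
    }
}
```

After termination, one string can be a prefix of another only if the terminator falls at the same position in both, which forces the two strings to be equal. For a set of distinct strings the terminated collection is therefore always prefix-free.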

An internal node in a standard trie T can have anywhere between 1 and d children, where d is the size of the alphabet. There is an edge going from the root r to one of its children for each character that is first in some string in the collection S. In addition, a path from the root of T to an internal node v at depth i corresponds to
