23.11.2014 Views

Data Structures and Algorithms in Java[1].pdf - Fulvio Frisone

Data Structures and Algorithms in Java[1].pdf - Fulvio Frisone

Data Structures and Algorithms in Java[1].pdf - Fulvio Frisone

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

The worst case for the number of nodes of a trie occurs when no two str<strong>in</strong>gs share a<br />

common nonempty prefix; that is, except for the root, all <strong>in</strong>ternal nodes have one<br />

child.<br />

A trie T for a set S of str<strong>in</strong>gs can be used to implement a dictionary whose keys are<br />

the str<strong>in</strong>gs of S. Namely, we perform a search <strong>in</strong> T for a str<strong>in</strong>g X by trac<strong>in</strong>g down<br />

from the root the path <strong>in</strong>dicated by the characters <strong>in</strong> X. If this path can be traced <strong>and</strong><br />

term<strong>in</strong>ates at an external node, then we know X is <strong>in</strong> the dictionary. For example, <strong>in</strong><br />

the trie <strong>in</strong> Figure 12.6, trac<strong>in</strong>g the path for "bull" ends up at an external node. If the<br />

path cannot be traced or the path can be traced but term<strong>in</strong>ates at an <strong>in</strong>ternal node,<br />

then X is not <strong>in</strong> the dictionary. In the example <strong>in</strong> Figure 12.6, the path for "bet"<br />

cannot be traced <strong>and</strong> the path for "be" ends at an <strong>in</strong>ternal node. Neither such word is<br />

<strong>in</strong> the dictionary. Note that <strong>in</strong> this implementation of a dictionary, s<strong>in</strong>gle characters<br />

are compared <strong>in</strong>stead of the entire str<strong>in</strong>g (key). It is easy to see that the runn<strong>in</strong>g time<br />

of the search for a str<strong>in</strong>g of size m is O(dm), where d is the size of the alphabet.<br />

Indeed, we visit at most m + 1 nodes of T <strong>and</strong> we spend O(d) time at each node. For<br />

some alphabets, we may be able to improve the time spent at a node to be O(1) or<br />

O(logd) by us<strong>in</strong>g a dictionary of characters implemented <strong>in</strong> a hash table or search<br />

table. However, s<strong>in</strong>ce d is a constant <strong>in</strong> most applications, we can stick with the<br />

simple approach that takes O(d) time per node visited.<br />

From the discussion above, it follows that we can use a trie to perform a special<br />

type of pattern match<strong>in</strong>g, called word match<strong>in</strong>g, where we want to determ<strong>in</strong>e<br />

whether a given pattern matches one of the words of the text exactly. (See Figure<br />

12.7.) Word match<strong>in</strong>g differs from st<strong>and</strong>ard pattern match<strong>in</strong>g s<strong>in</strong>ce the pattern<br />

cannot match an arbitrary substr<strong>in</strong>g of the text, but only one of its words. Us<strong>in</strong>g a<br />

trie, word match<strong>in</strong>g for a pattern of length m takes O(dm) time, where d is the size<br />

of the alphabet, <strong>in</strong>dependent of the size of the text. If the alphabet has constant size<br />

(as is the case for text <strong>in</strong> natural languages <strong>and</strong> DNA str<strong>in</strong>gs), a query takes O(m)<br />

time, proportional to the size of the pattern. A simple extension of this scheme<br />

supports prefix match<strong>in</strong>g queries. However, arbitrary occurrences of the pattern <strong>in</strong><br />

the text (for example, the pattern is a proper suffix of a word or spans two words)<br />

cannot be efficiently performed.<br />

To construct a st<strong>and</strong>ard trie for a set S of str<strong>in</strong>gs, we can use an <strong>in</strong>cremental<br />

algorithm that <strong>in</strong>serts the str<strong>in</strong>gs one at a time. Recall the assumption that no str<strong>in</strong>g<br />

of S is a prefix of another str<strong>in</strong>g. To <strong>in</strong>sert a str<strong>in</strong>g X <strong>in</strong>to the current trie T, we first<br />

try to trace the path associated with X <strong>in</strong> T. S<strong>in</strong>ce X is not already <strong>in</strong> T <strong>and</strong> no str<strong>in</strong>g<br />

<strong>in</strong> S is a prefix of another str<strong>in</strong>g, we will stop trac<strong>in</strong>g the path at an <strong>in</strong>ternal node v<br />

of T before reach<strong>in</strong>g the end of X. We then create a new cha<strong>in</strong> of node descendents<br />

of v to store the rema<strong>in</strong><strong>in</strong>g characters of X. The time to <strong>in</strong>sert X is O(dm), where m<br />

is the length of X <strong>and</strong> d is the size of the alphabet. Thus, construct<strong>in</strong>g the entire trie<br />

for set S takes O(dn) time, where n is the total length of the str<strong>in</strong>gs of S.<br />

Figure 12.7: Word match<strong>in</strong>g <strong>and</strong> prefix match<strong>in</strong>g<br />

with a st<strong>and</strong>ard trie: (a) text to be searched; (b) st<strong>and</strong>ard<br />

765

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!