01.04.2015 Views

Sequence Comparison.pdf

Sequence Comparison.pdf

Sequence Comparison.pdf

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

66 4 Homology Search Tools<br />

4.1.2 Suffix Trees<br />

A sequence A = a 1 a 2 ...a m has m suffixes, namely, a 1 ...a m , a 2 ...a m , a 3 ...a m ,...,<br />

and a m . A suffix tree for sequence A is a rooted tree such that every suffix of A<br />

corresponds uniquely to a path from the root to a tree node. Furthermore, each edge<br />

of the suffix tree is labeled with a nonempty substring of A, and all internal nodes<br />

except the root must have at lease two children.<br />

Figure 4.2 constructs a suffix tree for GATCCATCTT. The number of a node<br />

specifies the starting position of its corresponding suffix. Take the number “5” for<br />

example. If we concatenate the labels along the path from the root to the node with<br />

the number “5,” we get CATCTT, which is a suffix starting at position 5. Notice that<br />

some internal node might associate with a number, e.g., the node with number “10”<br />

in this figure. The reason is that T is suffix starting at position 10, yet it is also a<br />

prefix of another three suffixes TCCATCTT, TCTT, and TT.<br />

For convenience, one may require that all suffixes correspond to paths from the<br />

root to the leaves. In fact, if sequence A is padded with a terminal symbol, say $,<br />

that does not appear in A, then every suffix would correspond to a path from the<br />

root to a leaf node in the suffix tree because a suffix with “$” will not be a prefix of<br />

any other suffix. Figure 4.3 constructs a suffix tree for GATCCATCTT$. Now every<br />

suffix corresponds to a path from the root to a leaf, including the suffix starting at<br />

position 10.<br />

For a constant-size alphabet, the construction of a suffix tree for a sequence of<br />

length m can be done in O(m) time and space based on a few other crucial observations,<br />

including the use of suffix links. Once a suffix tree has been built, we can<br />

answer several kinds of pattern matching queries iteratively and efficiently. Take the<br />

exact word match problem for example. Given are two sequences A = a 1 a 2 ...a m ,<br />

Fig. 4.2 AsuffixtreeforGATCCATCTT.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!