01.04.2015 Views

Sequence Comparison.pdf


Chapter 6

Anatomy of Spaced Seeds

BLAST was developed by Lipman and his collaborators to meet the demanding needs of homology search in the late 1980s. It is based on the filtration technique and is many times faster than the Smith-Waterman algorithm. It first identifies short exact matches (called seed matches) of a fixed length (usually 11 bases) and then extends each match to both sides until a drop-off score is reached. Motivated by the success of BLAST, several other seeding strategies were proposed at about the same time in the early 2000s. In particular, PatternHunter demonstrated that an optimized spaced seed improves sensitivity substantially. Accordingly, elucidating the mechanism that confers power on spaced seeds and identifying good spaced seeds are two new issues in homology search.
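The seeding step described above can be sketched in a few lines of Python. This is only an illustration, not the implementation used by BLAST or PatternHunter. It assumes the usual 0/1 notation for seeds, in which a `1` marks a position that must match and a `0` marks a position that is ignored, so a consecutive seed is a string of all `1`s. The function name `seed_hits` and the toy sequences are hypothetical.

```python
def seed_hits(s, t, seed):
    """Return sorted (i, j) pairs such that s[i:i+span] and t[j:j+span]
    agree at every position where the seed has a '1'."""
    span = len(seed)
    care = [k for k, c in enumerate(seed) if c == "1"]  # positions that must match

    # Index every window of s by the letters at the '1' positions.
    index = {}
    for i in range(len(s) - span + 1):
        key = "".join(s[i + k] for k in care)
        index.setdefault(key, []).append(i)

    # Look up every window of t in the index.
    hits = []
    for j in range(len(t) - span + 1):
        key = "".join(t[j + k] for k in care)
        for i in index.get(key, []):
            hits.append((i, j))
    return sorted(hits)


if __name__ == "__main__":
    s, t = "ACGTACGT", "ACGAACGT"
    print(seed_hits(s, t, "1111"))  # consecutive seed of weight 4
    print(seed_hits(s, t, "1101"))  # spaced seed of weight 3 and span 4
```

A real program would then extend each reported hit in both directions, which this sketch leaves out.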

This chapter is divided into six sections. In Section 6.1, we define spaced seeds and discuss the trade-off between sensitivity and specificity for homology search. The sensitivity and specificity of a seeding-based program are largely related to the probability that a seed match occurs by chance, called the hit probability. Here we study spaced seeds analytically in the Bernoulli sequence model defined in Section 1.6. Section 6.2 gives a recurrence relation system for calculating hit probability.
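To make the notion of hit probability concrete, the following Python sketch computes it with a plain dynamic program. It is not the recurrence system of Section 6.2. It assumes that, in the Bernoulli model, each alignment position is a match independently with probability `p`, that a seed is written as a 0/1 string beginning with `1`, and that the seed hits a region if all its `1` positions fall on matches in some window. The name `hit_probability` is hypothetical.

```python
def hit_probability(seed, n, p):
    """Probability that `seed` hits somewhere in a random region of length n,
    where every position is a match (1) with probability p, independently."""
    assert seed and seed[0] == "1", "sketch assumes the seed starts with '1'"
    span = len(seed)
    mask = int(seed, 2)            # bit pattern of required matches
    keep = (1 << (span - 1)) - 1   # remember only the last span-1 positions

    # no_hit maps "last span-1 positions" -> probability of reaching that
    # state without any seed hit so far. The initial all-zero state acts as
    # padding; it cannot create a hit because the seed starts with '1'.
    no_hit = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for state, prob in no_hit.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if (window & mask) == mask:
                    continue       # seed hit: this mass leaves the no-hit set
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + prob * q
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


if __name__ == "__main__":
    print(hit_probability("111", 16, 0.7))
    print(hit_probability("1101", 16, 0.7))
```

The number of states grows as 2 to the power of the span minus one, so this direct approach is only practical for short seeds.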

In Section 6.3, we investigate the expected distance µ between adjacent nonoverlapping seed matches. By estimating µ, we further discuss why spaced seeds are often more sensitive than the consecutive seed used in BLAST.

A spaced seed has a larger span than the consecutive seed of the same weight. As a result, it has a lower hit probability in a small region but surpasses the consecutive seed in large regions. Section 6.4 studies the hit probability of spaced seeds in the asymptotic limit. Section 6.5 describes different methods for identifying good spaced seeds.
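The comparison between small and large regions can be checked on toy seeds by exhaustive enumeration. The sketch below is an illustration under stated assumptions, not a method from this chapter: it assumes the Bernoulli model with match probability `p`, seeds written as 0/1 strings, and it enumerates all 2^n match/mismatch patterns, so it is only usable for small n. The function name and the seeds `111` and `1101` (both of weight 3) are chosen for illustration.

```python
from itertools import product


def brute_force_hit_probability(seed, n, p):
    """Sum the probabilities of all 0/1 patterns of length n that the seed hits."""
    span = len(seed)
    care = [k for k, c in enumerate(seed) if c == "1"]
    total = 0.0
    for bits in product((0, 1), repeat=n):
        # The seed hits if some window has matches at all of its '1' positions.
        hit = any(all(bits[i + k] for k in care) for i in range(n - span + 1))
        if hit:
            ones = sum(bits)
            total += p ** ones * (1.0 - p) ** (n - ones)
    return total


if __name__ == "__main__":
    p = 0.5
    for n in range(4, 13):
        consecutive = brute_force_hit_probability("111", n, p)
        spaced = brute_force_hit_probability("1101", n, p)
        print(n, round(consecutive, 4), round(spaced, 4))
```

For a region of length 4 the spaced seed has only one window available while the consecutive seed has two, which is why it starts behind. Where the ordering changes depends on `p` and on the seeds chosen.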

Section 6.6 briefly introduces three generalizations of spaced seeds: transition seeds, multiple spaced seeds, and vector seeds.

Finally, we conclude the chapter with the bibliographic notes in Section 6.7.
