Chapter 6<br />
Anatomy of Spaced Seeds<br />
BLAST was developed by Lipman and his collaborators to meet demanding needs<br />
of homology search in the late 1980s. It is based on the filtration technique and<br />
is much faster than the Smith-Waterman algorithm. It first identifies short<br />
exact matches (called seed matches) of a fixed length (usually 11 bases) and then<br />
extends each match to both sides until a drop-off score is reached. Motivated by the<br />
success of BLAST, several other seeding strategies were proposed at about the same<br />
time in the early 2000s. In particular, PatternHunter demonstrates that an optimized<br />
spaced seed improves sensitivity substantially. Accordingly, elucidating the mechanism<br />
that confers power to spaced seeds and identifying good spaced seeds are two<br />
new issues of homology search.<br />
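
The seed-and-extend idea described above can be sketched in a few lines. This is only an illustration and not the BLAST implementation: the function names, the scoring values (match = 1, mismatch = -3), and the drop-off threshold are assumptions chosen for the example, and the extension is ungapped.

```python
def find_seed_matches(query, target, k=11):
    """Return all (i, j) with query[i:i+k] == target[j:j+k] (exact seed matches)."""
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], []).append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def _extend_one_way(a, b, match, mismatch, x_drop):
    """Walk along a and b in parallel; stop once the running score has fallen
    more than x_drop below the best score seen so far."""
    score = best = best_len = 0
    for n, (x, y) in enumerate(zip(a, b), start=1):
        score += match if x == y else mismatch
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break
    return best, best_len


def extend_hit(query, target, i, j, k, match=1, mismatch=-3, x_drop=5):
    """Extend the seed match at (i, j) to both sides without gaps.

    Returns (query_start, query_end, target_start, score), end exclusive.
    The scoring parameters are illustrative assumptions.
    """
    right, right_len = _extend_one_way(
        query[i + k:], target[j + k:], match, mismatch, x_drop)
    left, left_len = _extend_one_way(
        query[:i][::-1], target[:j][::-1], match, mismatch, x_drop)
    return (i - left_len, i + k + right_len, j - left_len,
            k * match + left + right)


if __name__ == "__main__":
    q, t = "AAACGTGGG", "CCACGTGGT"
    for i, j in find_seed_matches(q, t, k=4):
        print((i, j), extend_hit(q, t, i, j, k=4))
```

Indexing the target once and looking up every query word is what makes the filtration step cheap: only positions that share an exact word are ever passed to the more expensive extension.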
This chapter is divided into six sections. In Section 6.1, we define spaced seeds<br />
and discuss the trade-off between sensitivity and specificity for homology search.<br />
The sensitivity and specificity of a seeding-based program are largely related to<br />
the probability that a seed match occurs by chance, called the hit probability.<br />
Here we study spaced seeds analytically in the Bernoulli sequence model<br />
defined in Section 1.6. Section 6.2 gives a recurrence relation system for calculating<br />
hit probability.<br />
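
As a concrete illustration of what a hit probability is, the sketch below computes it exactly for a small seed by dynamic programming. It assumes that a seed is written as a string of 1s (positions that must match) and 0s (positions that are ignored), and that each position of a region of length n is a match independently with probability p. The function name and the state encoding are choices made for this example; it is not necessarily the recurrence system derived in Section 6.2, and its cost grows exponentially with the span of the seed.

```python
def hit_probability(seed, n, p):
    """Probability that `seed` hits a random region of length n.

    seed : string over {'0', '1'}, assumed to start and end with '1'
    n    : length of the region
    p    : probability that a single position is a match
    """
    span = len(seed)
    mask = int(seed, 2)              # required positions, oldest bit highest
    keep = (1 << (span - 1)) - 1     # retains the most recent span-1 positions
    # states[s] = probability that the most recent positions equal s
    # and the seed has not hit so far
    states = {0: 1.0}
    hit = 0.0
    for pos in range(1, n + 1):
        nxt = {}
        for s, prob in states.items():
            for bit, q in ((1, p), (0, 1.0 - p)):
                window = (s << 1) | bit
                if pos >= span and window & mask == mask:
                    hit += prob * q          # first hit ends at this position
                else:
                    key = window & keep
                    nxt[key] = nxt.get(key, 0.0) + prob * q
        states = nxt
    return hit


if __name__ == "__main__":
    for seed in ("111", "1101"):
        print(seed, [round(hit_probability(seed, n, 0.5), 5) for n in (3, 4, 5)])
```

For the consecutive seed 111 with p = 0.5 and n = 5, exactly 8 of the 32 possible match patterns contain three consecutive matches, so the function returns 0.25; for the seed 1101 the count is 7 of 32.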
In Section 6.3, we investigate the expected distance µ between adjacent nonoverlapping<br />
seed matches. By estimating µ, we further discuss why spaced seeds<br />
are often more sensitive than the consecutive seed used in BLAST.<br />
A spaced seed has a larger span than the consecutive seed of the same weight.<br />
As a result, it has a lower hit probability in a small region but surpasses the consecutive<br />
seed in large regions. Section 6.4 studies the hit probability of spaced seeds<br />
in the asymptotic limit. Section 6.5 describes different methods for identifying good<br />
spaced seeds.<br />
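
The contrast between small and large regions can be checked empirically with a short simulation. The sketch below is an assumption-laden example: it uses the weight-11 seed 111010010100110111 published with PatternHunter (the seed itself is not given in this introduction), a match probability of 0.7, and region lengths 18 and 64, all chosen only for illustration. It estimates hit probabilities by random sampling and is not one of the methods of Sections 6.4 or 6.5.

```python
import random


def has_hit(bits, ones, span):
    """True if the seed's required positions are all matches at some offset."""
    return any(all(bits[i + k] for k in ones)
               for i in range(len(bits) - span + 1))


def estimate_hit_probability(seed, n, p, trials=5000, rng=None):
    """Monte Carlo estimate of the hit probability of `seed` in a random
    region of length n where each position matches with probability p."""
    rng = rng or random.Random(0)
    ones = [k for k, c in enumerate(seed) if c == "1"]
    hits = 0
    for _ in range(trials):
        bits = [rng.random() < p for _ in range(n)]
        hits += has_hit(bits, ones, len(seed))
    return hits / trials


CONSECUTIVE = "1" * 11                 # weight 11, span 11
SPACED = "111010010100110111"          # weight 11, span 18

if __name__ == "__main__":
    for n in (18, 64):
        c = estimate_hit_probability(CONSECUTIVE, n, 0.7)
        s = estimate_hit_probability(SPACED, n, 0.7)
        print(f"n={n}: consecutive ~ {c:.3f}, spaced ~ {s:.3f}")
```

In a region of length 18 the spaced seed fits at only one offset while the consecutive seed fits at eight, which is why the consecutive seed is ahead there; in the longer region the ordering should reverse.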
Section 6.6 briefly introduces three generalizations of spaced seeds: transition<br />
seeds, multiple spaced seeds, and vector seeds.<br />
Finally, we conclude the chapter with the bibliographic notes in Section 6.7.<br />