6 Anatomy of Spaced Seeds<br />
6.1 Filtration Technique in Homology Search<br />
Filtration is a powerful strategy for homology search as exemplified by BLAST. It<br />
identifies short perfect matches defined by a fixed pattern between the query and<br />
target sequences and then extends each match in both directions into a local alignment; the<br />
resulting local alignments are scored for acceptance.<br />
6.1.1 Spaced Seed<br />
The pattern used in the filtration strategy is usually specified by one or more strings over<br />
the alphabet {1,∗}. Each such string is a spaced seed in which 1s denote matching<br />
positions. For instance, if the seed P = 11∗1∗∗11 is used, then the segments examined<br />
for a match in the first stage span 8 positions, and the program checks only<br />
whether they match in the 1st, 2nd, 4th, 7th, and 8th positions. If the segments<br />
match in these positions, they form a perfect match. As this example shows,<br />
the positions specified by ∗s are irrelevant and are sometimes called don’t-care<br />
positions.<br />
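The check described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `matching_offsets` and `is_seed_hit`, the use of an ASCII `*` for don't-care positions, and the example segments are assumptions made here, not part of any particular search program.

```python
def matching_offsets(seed: str) -> list[int]:
    """Return the 0-based offsets of the matching ('1') positions of a seed."""
    return [i for i, c in enumerate(seed) if c == "1"]


def is_seed_hit(seed: str, a: str, b: str) -> bool:
    """Return True if segments a and b agree at every matching position.

    Both segments must span exactly as many positions as the seed;
    don't-care ('*') positions are never compared.
    """
    if len(a) != len(seed) or len(b) != len(seed):
        raise ValueError("segments must have the same length as the seed")
    return all(a[i] == b[i] for i in matching_offsets(seed))


P = "11*1**11"  # the seed from the text, written with ASCII '*'

# Hypothetical segments: they differ at the 3rd, 5th, and 6th positions,
# which are all don't-care positions, so the seed still reports a match.
print(matching_offsets(P))                      # 0-based offsets 0, 1, 3, 6, 7
print(is_seed_hit(P, "ACGTACGT", "ACTTGGGT"))   # mismatches only at '*'
print(is_seed_hit(P, "ACGTACGT", "ACTAGGGT"))   # mismatch at the 4th position
```

The 0-based offsets 0, 1, 3, 6, 7 correspond to the 1st, 2nd, 4th, 7th, and 8th positions named in the text.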
The most natural seeds are those in which all the positions are matching positions.<br />
These seeds are called consecutive seeds. They are used in BLASTN.<br />
WABA, another homology search program, uses the spaced seed 11∗11∗11∗11∗11<br />
to align gene-coding regions. The rationale behind this seed is that a mutation in the<br />
third position of a codon usually does not change the encoded amino<br />
acid, and hence a substitution in the third base of a codon is irrelevant. PatternHunter<br />
uses a rather unusual seed, 111∗1∗∗1∗1∗∗11∗111, as its default seed. As we shall<br />
see later, this spaced seed is much more sensitive than BLASTN’s default seed<br />
although they have the same number of matching positions.<br />
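As a quick sanity check on the seeds just mentioned, the sketch below counts matching positions and total span for each one. The helper names `weight`, `span`, and `dont_care_offsets` are chosen here for illustration, and the BLASTN seed is written as 11 consecutive 1s on the assumption, inferred from the statement above, that it has the same number of matching positions as the PatternHunter seed.

```python
def weight(seed: str) -> int:
    """Number of matching ('1') positions in the seed."""
    return seed.count("1")


def span(seed: str) -> int:
    """Total number of positions covered by the seed."""
    return len(seed)


def dont_care_offsets(seed: str) -> list[int]:
    """0-based offsets of the don't-care ('*') positions."""
    return [i for i, c in enumerate(seed) if c == "*"]


WABA = "11*11*11*11*11"
PATTERNHUNTER = "111*1**1*1**11*111"
BLASTN = "1" * 11  # assumption: consecutive seed with 11 matching positions

for name, seed in [("WABA", WABA), ("PatternHunter", PATTERNHUNTER), ("BLASTN", BLASTN)]:
    print(name, "weight:", weight(seed), "span:", span(seed))

# In the WABA seed every don't-care position is the third base of a codon,
# assuming the seed is aligned to the start of a codon.
print(all(i % 3 == 2 for i in dont_care_offsets(WABA)))
```

Under that assumption, the PatternHunter and BLASTN seeds have equal weight but very different spans, which is the property the later comparison relies on.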
6.1.2 Sensitivity and Specificity<br />
The efficiency of a homology search program is measured by sensitivity and specificity.<br />
Sensitivity is the true positive rate, the chance that an alignment of interest is<br />
found by the program. Specificity is one minus the false positive rate, where the false positive<br />
rate is the chance that an alignment without any biological meaning is found in a<br />
comparison of random sequences.<br />
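The two definitions can be turned into simple estimators when a benchmark with known answers is available. The sketch below assumes such a benchmark exists and that its outcomes have already been tallied into counts; the function and parameter names are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Estimated true positive rate: fraction of alignments of interest found."""
    return true_positives / (true_positives + false_negatives)


def specificity(false_positives: int, true_negatives: int) -> float:
    """One minus the estimated false positive rate."""
    false_positive_rate = false_positives / (false_positives + true_negatives)
    return 1.0 - false_positive_rate


# Hypothetical counts, chosen only to show how the functions are called.
print(sensitivity(true_positives=80, false_negatives=20))
print(specificity(false_positives=5, true_negatives=95))
```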
Obviously, there is a trade-off between sensitivity and specificity. In a program<br />
powered by a spaced seed in the filtration phase, if the number of matching<br />
positions of the seed is large, the program is less sensitive but more specific. On<br />
the other hand, lowering the number of matching positions increases sensitivity<br />
but decreases specificity. What is not obvious is that the positional structure of a<br />
spaced seed greatly affects the sensitivity and specificity of the program.
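
Both effects can be explored numerically with a rough Monte Carlo sketch. Everything about the setup is an assumption made for illustration: random sequences are taken to be uniform over a 4-letter alphabet, an alignment of interest is modeled as 64 independent columns that each match with probability 0.7, a seed "hits" if all of its matching positions fall on matching columns at some offset, and the BLASTN seed is taken to be 11 consecutive 1s. The function names are hypothetical.

```python
import random


def offsets(seed: str) -> list[int]:
    """0-based offsets of the matching ('1') positions of a seed."""
    return [i for i, c in enumerate(seed) if c == "1"]


def random_hit_probability(seed: str, alphabet_size: int = 4) -> float:
    """Chance that two independent uniform random segments satisfy the seed.

    Depends only on the number of matching positions, not on their layout.
    """
    return (1.0 / alphabet_size) ** seed.count("1")


def has_hit(seed_offsets: list[int], bits: int) -> bool:
    """bits encodes an alignment: bit i is 1 if column i is a match.

    Bit s of the running AND stays set only if columns s + o are matches
    for every seed offset o, i.e. the seed hits at start position s.
    """
    h = bits
    for o in seed_offsets:
        h &= bits >> o
    return h != 0


def estimate_sensitivity(seed: str, length: int = 64, similarity: float = 0.7,
                         trials: int = 20000, rng_seed: int = 0) -> float:
    """Fraction of simulated alignments in which the seed hits at least once."""
    rng = random.Random(rng_seed)
    offs = offsets(seed)
    hits = 0
    for _ in range(trials):
        bits = 0
        for i in range(length):
            if rng.random() < similarity:
                bits |= 1 << i
        hits += has_hit(offs, bits)
    return hits / trials


PATTERNHUNTER = "111*1**1*1**11*111"
BLASTN = "1" * 11  # assumption: consecutive seed of the same weight

for name, seed in [("PatternHunter", PATTERNHUNTER), ("BLASTN", BLASTN)]:
    print(name, random_hit_probability(seed), estimate_sensitivity(seed))
```

Under these assumptions the two seeds have the same chance of a random hit, since that depends only on the number of matching positions, while the estimated hit rate on simulated alignments is noticeably higher for the spaced seed. The estimates are simulation results under the stated model, not exact values.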