01.04.2015 Views

Sequence Comparison.pdf

6 Anatomy of Spaced Seeds

6.1 Filtration Technique in Homology Search

Filtration is a powerful strategy for homology search, as exemplified by BLAST. It identifies short perfect matches, defined by a fixed pattern, between the query and target sequences and then extends each match to both sides to form local alignments; the resulting local alignments are scored for acceptance.
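The two stages described above can be sketched as follows. This is a minimal illustration, not BLAST's actual implementation: the function names, the +1/-1 match/mismatch scores, the ungapped extension, and the `min_score` acceptance threshold are all assumptions chosen for brevity. The pattern here is a run of `k` consecutive matching positions.

```python
from collections import defaultdict


def find_seed_hits(query, target, k):
    """Stage 1: return all (i, j) with query[i:i+k] == target[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def _best_extension(query, target, i, j, step, match, mismatch):
    """Walk from (i, j) in direction `step` without gaps.

    Returns (best score gain, number of columns needed to reach it).
    """
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(target):
        gain += match if query[i] == target[j] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        i += step
        j += step
    return best_gain, best_len


def seed_and_extend(query, target, k, min_score, match=1, mismatch=-1):
    """Stage 2: extend every hit to both sides and keep high-scoring results.

    Each result is (query start, target start, length, score).
    """
    alignments = set()
    for i, j in find_seed_hits(query, target, k):
        left_gain, left_len = _best_extension(
            query, target, i - 1, j - 1, -1, match, mismatch)
        right_gain, right_len = _best_extension(
            query, target, i + k, j + k, +1, match, mismatch)
        score = k * match + left_gain + right_gain
        if score >= min_score:
            alignments.add(
                (i - left_len, j - left_len, k + left_len + right_len, score))
    return sorted(alignments)
```

For example, `seed_and_extend("GATTACA", "CCGATTACATT", k=4, min_score=7)` finds four overlapping hits on the same diagonal, and all of them extend to the single alignment `(0, 2, 7, 7)`.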

6.1.1 Spaced Seed

The pattern used in a filtration strategy is usually specified by one or more strings over the alphabet {1,∗}. Each such string is a spaced seed in which the 1s denote matching positions. For instance, if the seed P = 11∗1∗∗11 is used, then the segments examined for a match in the first stage span 8 positions, and the program checks only whether they match in the 1st, 2nd, 4th, 7th, and 8th positions. If the segments match in these positions, they form a perfect match. As this example shows, the positions specified by ∗s are irrelevant and are sometimes called don’t care positions.
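As a small illustration of how a seed such as P could be applied, the sketch below checks two segments against a seed and finds all seed hits between two sequences. The function names are hypothetical, and the seed is written with the ASCII character `*` in place of ∗; indexing the target by the characters at the matching positions is one possible approach, assumed here for simplicity.

```python
from collections import defaultdict


def matching_offsets(seed):
    """Zero-based offsets of the matching ('1') positions of a seed."""
    return [i for i, c in enumerate(seed) if c == "1"]


def is_seed_match(seed, a, b):
    """True if segments a and b agree at every matching position of the seed."""
    if len(a) != len(seed) or len(b) != len(seed):
        raise ValueError("segments must have the same length as the seed")
    return all(a[i] == b[i] for i in matching_offsets(seed))


def project(seed, segment):
    """Keep only the characters of a segment at the seed's matching positions."""
    return "".join(segment[i] for i in matching_offsets(seed))


def spaced_seed_hits(seed, query, target):
    """Return all (i, j) such that the windows starting at query[i] and
    target[j] form a perfect match with respect to the seed."""
    span = len(seed)
    index = defaultdict(list)
    for j in range(len(target) - span + 1):
        index[project(seed, target[j:j + span])].append(j)
    return [(i, j)
            for i in range(len(query) - span + 1)
            for j in index.get(project(seed, query[i:i + span]), [])]
```

With the seed `"11*1**11"`, `matching_offsets` returns `[0, 1, 3, 6, 7]`, which corresponds to the 1st, 2nd, 4th, 7th, and 8th positions. The segments `ACGTACGT` and `ACTTGGGT` differ in three don't care positions but still form a perfect match.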

The most natural seeds are those in which all the positions are matching positions. These seeds are called consecutive seeds. They are used in BLASTN.

WABA, another homology search program, uses the spaced seed 11∗11∗11∗11∗11 to align gene-coding regions. The rationale behind this seed is that a mutation in the third position of a codon usually does not change the encoded amino acid, and hence a substitution in the third base of a codon is irrelevant. PatternHunter uses a rather unusual seed, 111∗1∗∗1∗1∗∗11∗111, as its default seed. As we shall see later, this spaced seed is much more sensitive than BLASTN’s default seed, although they have the same number of matching positions.
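The seeds named above can be compared by their number of matching positions and by the number of positions they span. The sketch below does this with two hypothetical helpers, `weight` and `span`. It assumes a consecutive seed of 11 ones for BLASTN, which is what the stated equality with PatternHunter's seed implies, and it writes ∗ as ASCII `*`. The last helper assumes a simple random model, uniform and independent bases over a four-letter alphabet, under which the chance that two random segments match at every matching position depends only on the weight.

```python
def weight(seed):
    """Number of matching ('1') positions in a seed."""
    return seed.count("1")


def span(seed):
    """Number of positions covered by a seed, don't care positions included."""
    return len(seed)


def random_match_probability(seed, alphabet_size=4):
    """Chance that two random segments form a perfect match for the seed,
    assuming uniform, independent letters (an assumption of this sketch)."""
    return (1.0 / alphabet_size) ** weight(seed)


SEEDS = {
    "consecutive (assumed BLASTN word size 11)": "1" * 11,
    "WABA": "11*11*11*11*11",
    "PatternHunter": "111*1**1*1**11*111",
}

if __name__ == "__main__":
    for name, seed in SEEDS.items():
        print(name, "weight:", weight(seed), "span:", span(seed))
```

Under that random model, two seeds with the same weight produce chance matches at the same rate per pair of positions, so any difference in sensitivity between them comes from how the matching positions are arranged.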

6.1.2 Sensitivity and Specificity

The efficiency of a homology search program is measured by its sensitivity and specificity. Sensitivity is the true positive rate, the chance that an alignment of interest is found by the program. Specificity is one minus the false positive rate, where the false positive rate is the chance that an alignment without any biological meaning is found when comparing random sequences.

Obviously, there is a trade-off between sensitivity and specificity. In a program that uses a spaced seed in the filtration phase, if the number of matching positions of the seed is large, the program is less sensitive but more specific. On the other hand, lowering the number of matching positions increases sensitivity but decreases specificity. What is not obvious is that the positional structure of a spaced seed greatly affects the sensitivity and specificity of the program.
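One way to explore how positional structure affects sensitivity is to compute the probability that a seed hits a random alignment. The sketch below assumes a simple model that is not stated in this section: an ungapped alignment of a given length in which each column is a match independently with probability `p`. The function name `hit_probability` is hypothetical, and seeds are written with ASCII `*`. It uses dynamic programming over the most recent columns, so its state space grows as 2^(span - 1) and it is practical only for short seeds.

```python
from collections import defaultdict


def hit_probability(seed, length, p):
    """Probability that the seed hits at least once in a random alignment.

    The alignment has `length` columns; each column is a match (bit 1)
    independently with probability p, otherwise a mismatch (bit 0).
    """
    span = len(seed)
    if length < span:
        return 0.0
    # Bit mask of required matches; seed[0] is the highest of `span` bits.
    mask = 0
    for c in seed:
        mask = (mask << 1) | int(c == "1")
    keep = (1 << (span - 1)) - 1  # state = the last span-1 columns
    # no_hit[state] = probability of reaching this state with no hit so far.
    no_hit = {0: 1.0}
    for t in range(1, length + 1):
        nxt = defaultdict(float)
        for state, prob in no_hit.items():
            for bit, w in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit  # the last `span` columns
                if t >= span and window & mask == mask:
                    continue  # a hit occurs here; remove this mass
                nxt[window & keep] += prob * w
        no_hit = nxt
    return 1.0 - sum(no_hit.values())
```

Calling, for instance, `hit_probability("1111", 64, 0.7)` and `hit_probability("11*11", 64, 0.7)` compares two seeds with the same number of matching positions but different structure under this model. A sanity check is that for `length == len(seed)` the result is `p` raised to the number of matching positions.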
