01.04.2015 Views

Sequence Comparison.pdf


6.3 Distance between Non-Overlapping Hits

Table 6.1 The values of the upper bound in Theorem 6.6 for different w and p (after rounding to the nearest integer).

p \ w     10     11     12     13     14
0.6       49     76    121    196    315
0.7       17     21     26     34     44
0.8       11     12     14     15     17
0.9       10     11     12     13     14

Using the above theorem, the following explicit upper bound on $\mu_\pi$ can be proved for non-uniformly spaced seeds $\pi$. Its proof is quite involved and so is omitted.

Theorem 6.5. For any non-uniformly spaced seed $\pi$,

$$\mu_\pi \le \sum_{i=1}^{w_\pi} \left(\frac{1}{p}\right)^{i} + (|\pi| - w_\pi) - \frac{q}{p}\left[\left(\frac{1}{p}\right)^{w_\pi - 2} - 1\right].$$
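The bound in Theorem 6.5 is easy to evaluate numerically. The following is a minimal sketch, assuming q = 1 − p (the mismatch probability in the Bernoulli model); the function names `geometric_sum` and `theorem_6_5_bound`, and the parameter values in the usage line, are illustrative and do not come from the text.

```python
def geometric_sum(p: float, weight: int) -> float:
    """Return sum_{i=1}^{weight} (1/p)^i, the first term of the bound."""
    return sum((1.0 / p) ** i for i in range(1, weight + 1))


def theorem_6_5_bound(p: float, weight: int, length: int) -> float:
    """Evaluate the right-hand side of Theorem 6.5.

    p      -- match probability (sequence similarity)
    weight -- w_pi, the number of match positions in the seed
    length -- |pi|, the total length of the seed
    """
    q = 1.0 - p  # assumption: q denotes 1 - p
    correction = (q / p) * ((1.0 / p) ** (weight - 2) - 1.0)
    return geometric_sum(p, weight) + (length - weight) - correction


if __name__ == "__main__":
    # Hypothetical example: a seed of weight 11 and length 18 at similarity 0.7.
    print(theorem_6_5_bound(0.7, 11, 18))
```

The value returned is only an upper bound on $\mu_\pi$, so it says how large the mean distance can be, not what it is for a particular seed.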

6.3.3 Why Do Spaced Seeds Have More Hits?<br />

Recall that, for the consecutive seed $\theta$ of weight $w$, $\mu_\theta = \sum_{i=1}^{w} \left(\frac{1}{p}\right)^{i}$. By Theorem 6.5, we have

Theorem 6.6. Let $\pi$ be a non-uniformly spaced seed and $\theta$ the consecutive seed of the same weight. If $|\pi| < w_\pi + \frac{q}{p}\left[\left(\frac{1}{p}\right)^{w_\pi - 2} - 1\right]$, then $\mu_\pi < \mu_\theta$.
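The step from Theorem 6.5 to Theorem 6.6 can be spelled out as follows, writing $w = w_\pi$ for the common weight of $\pi$ and $\theta$. Subtracting $\mu_\theta = \sum_{i=1}^{w} (1/p)^{i}$ from the bound of Theorem 6.5 gives

$$\mu_\pi - \mu_\theta \le (|\pi| - w) - \frac{q}{p}\left[\left(\frac{1}{p}\right)^{w - 2} - 1\right],$$

and the right-hand side is negative exactly when

$$|\pi| < w + \frac{q}{p}\left[\left(\frac{1}{p}\right)^{w - 2} - 1\right],$$

which is the hypothesis of Theorem 6.6.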

A non-overlapping hit of a spaced seed $\pi$ is a recurrent event under the following convention: if a hit at position $i$ is selected as a non-overlapping hit, then the next non-overlapping hit is the first hit at or after position $i + |\pi|$. By (B.45) in Section B.8, the expected number of non-overlapping hits of a spaced seed $\pi$ in a random sequence of length $N$ is approximately $N/\mu_\pi$. Therefore, if $|\pi| < w_\pi + \frac{q}{p}\left[\left(\frac{1}{p}\right)^{w_\pi - 2} - 1\right]$ (see Table 6.1 for the values of this bound for $p = 0.6, 0.7, 0.8, 0.9$ and $10 \le w \le 14$), Theorem 6.6 implies that $\pi$ has, on average, more non-overlapping hits than $\theta$ in a long homologous region with sequence similarity $p$ in the Bernoulli sequence model. Because overlapping hits can only be extended into one local alignment, this indicates that a homology search program with a good spaced seed is usually more sensitive than one with the consecutive seed of the same weight, especially for genome-genome comparison.
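The length condition and the selection convention for non-overlapping hits can be tried out with a small sketch. It assumes q = 1 − p, represents an alignment as a list of 0/1 values (1 for a match), writes a seed as a string in which `1` marks a required match and `*` a don't-care position, and indexes a hit by the position where the seed starts. The function names and the example seed pattern are illustrative and are not taken from the text.

```python
import random


def length_threshold(p: float, weight: int) -> float:
    """Return w + (q/p) * ((1/p)^(w-2) - 1), the bound on |pi| in Theorem 6.6."""
    q = 1.0 - p  # assumption: q denotes 1 - p
    return weight + (q / p) * ((1.0 / p) ** (weight - 2) - 1.0)


def count_nonoverlapping_hits(matches, seed: str) -> int:
    """Count non-overlapping hits of `seed` in the 0/1 sequence `matches`.

    After a hit starting at position i is selected, scanning resumes at
    position i + len(seed), so the next selected hit is the first hit at
    or after that position.
    """
    span = len(seed)
    required = [k for k, c in enumerate(seed) if c == "1"]
    count = 0
    i = 0
    while i + span <= len(matches):
        if all(matches[i + k] for k in required):
            count += 1
            i += span  # skip past the selected hit
        else:
            i += 1
    return count


if __name__ == "__main__":
    random.seed(0)
    p, n = 0.7, 20000
    # Bernoulli sequence model: each position matches independently with probability p.
    matches = [1 if random.random() < p else 0 for _ in range(n)]

    spaced = "111*1**1*1**11*111"  # illustrative seed: weight 11, length 18
    consecutive = "1" * 11         # consecutive seed of the same weight

    print("length threshold:", length_threshold(p, 11))
    print("spaced seed hits:", count_nonoverlapping_hits(matches, spaced))
    print("consecutive seed hits:", count_nonoverlapping_hits(matches, consecutive))
```

A single simulated sequence only gives a rough picture; the statement in the text concerns expected counts, so averaging over many sequences would be needed for a fair comparison.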
