01.04.2015 Views

Sequence Comparison.pdf


7 Local Alignment Statistics

probability is well approximated by the formula

$$\Pr[S_{n,r} \ge t] = \frac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}, \tag{7.40}$$

especially when $t > r(r-1)$. This is the formula used in the BLAST program for calculating the p-value when multiple highest-scoring segments are reported.
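As an illustration, formula (7.40) can be evaluated numerically with a few lines of Python. This is a minimal sketch, not the BLAST implementation: the function name `sum_score_tail` is hypothetical, and `t` is assumed to be a threshold on the statistic $S_{n,r}$ as defined earlier in the section. The computation is done in log space so that large factorials do not overflow.

```python
import math


def sum_score_tail(t: float, r: int) -> float:
    """Approximate Pr[S_{n,r} >= t] by e^{-t} t^{r-1} / (r! (r-1)!), as in (7.40).

    Hypothetical helper for illustration; the approximation is stated to be
    good especially when t > r(r - 1).
    """
    if r < 1:
        raise ValueError("r must be a positive integer")
    if t <= 0:
        raise ValueError("t must be positive")
    # log(r!) = lgamma(r + 1) and log((r - 1)!) = lgamma(r)
    log_p = -t + (r - 1) * math.log(t) - math.lgamma(r + 1) - math.lgamma(r)
    return math.exp(log_p)
```

For `r = 1` the expression reduces to `exp(-t)`, which gives a quick sanity check of the implementation.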

7.2.5 Local Ungapped Alignment

The statistical theory of Sections 7.2.1 and 7.2.2 considered the maximal-scoring segments in a fixed ungapped alignment. In practice, the objective of a database search is to find all good matches between a query sequence and the sequences in a database. Here, we consider the general problem of calculating the statistical significance of a local ungapped alignment between two sequences. The pair of segments in a highest-scoring local ungapped alignment between two sequences is called the maximal-scoring segment pair (MSP).

Consider two sequences of lengths $n_1$ and $n_2$. To find the MSPs of the sequences, we have to consider all $n_1 + n_2 - 1$ possible ungapped alignments between the sequences. Each such alignment yields a random walk like the one studied in Section 7.2.1. Because the $n_1 + n_2 - 1$ corresponding random walks are not independent, it is much more involved to estimate the mean value of the maximum segment score. The theory developed by Dembo et al. [56, 57] for this general case is too advanced to be discussed in this book. Here we simply state the relevant results.

To some extent, the key formulas in Section 7.2.2 can be carried over to the general case, with $n$ simply replaced by $n_1 n_2$. Consider the sequences $x_1 x_2 \ldots x_{n_1}$ and $y_1 y_2 \ldots y_{n_2}$, where $x_i$ and $y_j$ are residues. We use $s(x_i, y_j)$ to denote the score for aligning residues $x_i$ and $y_j$. The optimal local ungapped alignment score $S_{\max}$ from the comparison of the sequences is

$$S_{\max} = \max_{\Delta \le \min\{n_1, n_2\}} \; \max_{\substack{i \le n_1 - \Delta \\ j \le n_2 - \Delta}} \; \sum_{l=1}^{\Delta} s(x_{i+l}, y_{j+l}).$$
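The definition of the optimal score can be made concrete with a short Python sketch. It scans each of the n1 + n2 − 1 diagonals (one per ungapped alignment) and keeps the best-scoring contiguous run on any of them, using a running-maximum scan instead of enumerating every segment length and start position. The function name `max_ungapped_local_score` and the `score` callback are hypothetical, and the sketch assumes that the empty segment is allowed, so the result is never below 0.

```python
from typing import Callable, Sequence


def max_ungapped_local_score(
    x: Sequence[str],
    y: Sequence[str],
    score: Callable[[str, str], float],
) -> float:
    """Best score of any ungapped local alignment between x and y.

    Illustrative sketch: assumes the empty segment (score 0) is permitted.
    """
    n1, n2 = len(x), len(y)
    best = 0.0
    # Diagonal d = j - i runs over -(n1 - 1) .. (n2 - 1): n1 + n2 - 1 alignments.
    for d in range(-(n1 - 1), n2):
        i = max(0, -d)
        j = i + d
        running = 0.0
        while i < n1 and j < n2:
            # Extend the current segment, or restart once its sum drops below 0.
            running = max(0.0, running + score(x[i], y[j]))
            best = max(best, running)
            i += 1
            j += 1
    return best


def match_mismatch(a: str, b: str) -> float:
    """Toy scoring scheme (an assumption for the example): +1 match, -1 mismatch."""
    return 1.0 if a == b else -1.0
```

With the toy scheme, `max_ungapped_local_score("ACGT", "TACG", match_mismatch)` finds the shared run `ACG` on the diagonal shifted by one position. In practice `score` would look up a substitution matrix; the toy function is only there to keep the example self-contained.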

Suppose the sequences are random and independent, with $x_i$ and $y_j$ following the same distribution. The random variable $S_{\max}$ has the following tail probability distribution:

$$\Pr\left[S_{\max} > \frac{1}{\lambda}\log(n_1 n_2) + y\right] \approx 1 - e^{-K e^{-\lambda y}}, \tag{7.41}$$

where $\lambda$ is given by equation (7.5) and $K$ is given by equation (7.32) with $n$ replaced by $n_1 n_2$. The mean value of $S_{\max}$ is approximately

$$\frac{1}{\lambda}\log(K n_1 n_2). \tag{7.42}$$
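To see how (7.41) and (7.42) are used, it helps to express the tail probability directly in terms of a score threshold. Writing the threshold as x = (1/λ) log(n1 n2) + y in (7.41) gives e^{−λy} = n1 n2 e^{−λx}, so Pr[S_max > x] ≈ 1 − exp(−K n1 n2 e^{−λx}). The Python sketch below evaluates this expression and the approximate mean. The function names are hypothetical, and λ and K are taken as inputs because their defining equations (7.5) and (7.32) lie outside this section.

```python
import math


def smax_tail_probability(score: float, lam: float, K: float, n1: int, n2: int) -> float:
    """Approximate Pr[S_max > score] from (7.41), rewritten in terms of the score.

    lam and K are assumed to be precomputed for the scoring scheme.
    """
    expected = K * n1 * n2 * math.exp(-lam * score)
    # 1 - exp(-expected), computed stably when `expected` is tiny.
    return -math.expm1(-expected)


def smax_mean(lam: float, K: float, n1: int, n2: int) -> float:
    """Approximate mean of S_max from (7.42): log(K * n1 * n2) / lam."""
    return math.log(K * n1 * n2) / lam
```

Evaluating the tail probability at the approximate mean gives 1 − e^{−1}, since K n1 n2 e^{−λx} equals 1 there. This follows directly from the two formulas and serves as a consistency check on the code.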
