132 7 Local Alignment Statistics
probability is well approximated by the formula
\[
\Pr[S_{n,r} \ge t] = \frac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}, \tag{7.40}
\]
especially when $t > r(r-1)$. This is the formula used in the BLAST program for
calculating the p-value when multiple highest-scoring segments are reported.
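As a small illustration, formula (7.40) can be evaluated directly. The sketch below is not the BLAST implementation; the function names `sum_statistic_tail` and `in_recommended_regime` are hypothetical, and the code assumes that t is the already normalized sum score and r the number of segments.

```python
import math


def sum_statistic_tail(t: float, r: int) -> float:
    """Evaluate the right-hand side of equation (7.40):
    e^{-t} * t^{r-1} / (r! * (r-1)!)."""
    if r < 1:
        raise ValueError("r must be a positive integer")
    if t < 0:
        raise ValueError("t must be non-negative")
    return math.exp(-t) * t ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))


def in_recommended_regime(t: float, r: int) -> bool:
    """True when t > r(r - 1), the range where the text says the
    approximation works especially well."""
    return t > r * (r - 1)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the book.
    t, r = 10.0, 2
    print(sum_statistic_tail(t, r), in_recommended_regime(t, r))
```

For r = 1 the expression reduces to e^{-t}, which gives a quick sanity check on the implementation.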
7.2.5 Local Ungapped Alignment
The statistical theory of Sections 7.2.1 and 7.2.2 considered the maximal scoring segments
in a fixed ungapped alignment. In practice, the objective of database search is<br />
to find all good matches between a query sequence and the sequences in a database.<br />
Here, we consider a general problem of calculating the statistical significance of<br />
a local ungapped alignment between two sequences. The segments in a highest-scoring
local ungapped alignment between two sequences are called a maximal-scoring
segment pair (MSP).
Consider two sequences of lengths $n_1$ and $n_2$. To find the MSPs of the sequences,
we have to consider all $n_1 + n_2 - 1$ possible ungapped alignments between the sequences.
Each such alignment yields a random walk like the one studied in Section 7.2.1.
Because the $n_1 + n_2 - 1$ corresponding random walks are not independent, it is much
more involved to estimate the mean value of the maximum segment score. The theory<br />
developed by Dembo et al. [56, 57] for this general case is too advanced to be<br />
discussed in this book. Here we simply state the relevant results.<br />
To some extent, the key formulas in Section 7.2.2 carry over to the general
case, with $n$ simply replaced by $n_1 n_2$. Consider the sequences $x_1 x_2 \ldots x_{n_1}$
and $y_1 y_2 \ldots y_{n_2}$, where $x_i$ and $y_j$ are residues. We use $s(x_i, y_j)$ to denote the score for
aligning residues $x_i$ and $y_j$. The optimal local ungapped alignment score $S_{\max}$ from
the comparison of the sequences is
\[
S_{\max} = \max_{\Delta \le \min\{n_1, n_2\}} \; \max_{\substack{i \le n_1 - \Delta \\ j \le n_2 - \Delta}} \; \sum_{l=1}^{\Delta} s(x_{i+l}, y_{j+l}).
\]
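The definition of the optimal score can be made concrete with a short sketch. It scans each of the n1 + n2 − 1 diagonals and applies a maximum-subarray (Kadane) pass along each one. The function name `max_ungapped_local_score`, the toy +1/−1 scoring function, and the requirement that a segment contain at least one aligned pair are assumptions made for illustration, not part of the text.

```python
from typing import Callable, Sequence


def max_ungapped_local_score(
    x: Sequence[str],
    y: Sequence[str],
    score: Callable[[str, str], float],
) -> float:
    """Best score of any non-empty ungapped local alignment of x and y."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both sequences must be non-empty")
    best = None
    # Each offset d = j - i is one ungapped alignment (one diagonal);
    # there are n1 + n2 - 1 of them.
    for d in range(-(n1 - 1), n2):
        i, j = max(0, -d), max(0, d)
        running = 0.0
        while i < n1 and j < n2:
            # Kadane step: extend the current segment or start a new one here.
            running = max(running, 0.0) + score(x[i], y[j])
            best = running if best is None else max(best, running)
            i += 1
            j += 1
    return best


def toy_score(a: str, b: str) -> float:
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1.0 if a == b else -1.0


if __name__ == "__main__":
    print(max_ungapped_local_score("ACGT", "TACGA", toy_score))
```

The scan takes time proportional to n1 · n2, since every residue pair is visited exactly once.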
Suppose the sequences are random and independent, and that $x_i$ and $y_j$ follow the same
distribution. The random variable $S_{\max}$ has the following tail probability distribution:
\[
\Pr\left[S_{\max} > \frac{1}{\lambda}\log(n_1 n_2) + y\right] \approx 1 - e^{-K e^{-\lambda y}}, \tag{7.41}
\]
where $\lambda$ is given by equation (7.5) and $K$ is given by equation (7.32) with $n$ replaced
by $n_1 n_2$. The mean value of $S_{\max}$ is approximately
\[
\frac{1}{\lambda}\log(K n_1 n_2). \tag{7.42}
\]
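To see how (7.41) and (7.42) would be used numerically, the sketch below turns an observed score into an approximate tail probability. It treats λ and K as values supplied by the caller (the text obtains them from equations (7.5) and (7.32), which are not reproduced here). The function names and the numeric parameters in the usage example are hypothetical.

```python
import math


def smax_tail_probability(s: float, n1: int, n2: int, lam: float, K: float) -> float:
    """Approximate Pr[S_max > s] from equation (7.41).

    Writing s = log(n1 * n2) / lam + y gives y = s - log(n1 * n2) / lam,
    and the tail is 1 - exp(-K * exp(-lam * y)).
    """
    y = s - math.log(n1 * n2) / lam
    # -expm1(-a) equals 1 - exp(-a) and stays accurate when a is tiny.
    return -math.expm1(-K * math.exp(-lam * y))


def smax_mean(n1: int, n2: int, lam: float, K: float) -> float:
    """Approximate mean of S_max from equation (7.42)."""
    return math.log(K * n1 * n2) / lam


if __name__ == "__main__":
    # Illustrative parameters only; real values depend on the scoring scheme.
    lam, K, n1, n2 = 0.3, 0.1, 100, 200
    mean = smax_mean(n1, n2, lam, K)
    print(mean, smax_tail_probability(mean + 10.0, n1, n2, lam, K))
```

Substituting the expression for y shows that the exponent equals K n1 n2 e^{−λs}, so the tail probability is 1 − exp(−K n1 n2 e^{−λs}). At s equal to the approximate mean (7.42), this evaluates to 1 − e^{−1}.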