134 7 Local Alignment Statistics<br />
where $q_{ij}$ is the target frequency at which we expect to see residue $i$ aligned with<br />
residue $j$ in the MSP, and $s_{ij}$ is the score for aligning $i$ and $j$. With the value of $\lambda$,<br />
the scores $s_{ij}$, and the background frequencies $p_i$ and $p_j$ in hand, $q_{ij}$ can be calculated as<br />
$$q_{ij} \approx p_i p_j e^{\lambda s_{ij}}. \tag{7.50}$$<br />
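As a rough illustration of (7.50), the following sketch computes target frequencies for a toy scoring system. The uniform four-letter alphabet and the +1/-1 scores are assumptions chosen for illustration, and the helper names (`score_sum`, `solve_lambda`, `target_frequencies`) are hypothetical. The text takes $\lambda$ as given; here it is obtained, as an assumption, as the positive root of $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$, the usual definition for ungapped scoring.

```python
import math

def score_sum(p, s, lam):
    """Sum over all residue pairs of p_i * p_j * exp(lam * s_ij)."""
    n = len(p)
    return sum(p[i] * p[j] * math.exp(lam * s[i][j])
               for i in range(n) for j in range(n))

def solve_lambda(p, s, tol=1e-12):
    """Bisection for the positive root of score_sum(p, s, lam) = 1.

    Assumes the expected score is negative and at least one score is
    positive, so the sum dips below 1 just above lam = 0 and then grows.
    """
    lo, hi = 1e-6, 1.0
    while score_sum(p, s, hi) < 1.0:   # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score_sum(p, s, mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def target_frequencies(p, s, lam):
    """Target frequencies q_ij = p_i * p_j * exp(lam * s_ij), as in (7.50)."""
    n = len(p)
    return [[p[i] * p[j] * math.exp(lam * s[i][j]) for j in range(n)]
            for i in range(n)]

# Toy example (illustrative values, not from the text).
p = [0.25, 0.25, 0.25, 0.25]
s = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
lam = solve_lambda(p, s)
q = target_frequencies(p, s, lam)
```

For this toy system the root can be checked by hand: with $x = e^{\lambda}$ the condition becomes $x^2 - 4x + 3 = 0$, so $\lambda = \ln 3$, and the resulting $q_{ij}$ sum to 1.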
Simulations show that the values calculated from (7.49) are often larger than the empirical<br />
mean lengths of MSPs, especially when $n_1$ and $n_2$ are in the range from $10^2$<br />
to $10^3$ (Altschul and Gish, 1996, [6]). Accordingly, the effective lengths defined by<br />
(7.46) might lead to P-value estimates that are smaller than the correct values. The current version<br />
of BLAST calculates the mean length of MSPs empirically in database searches.<br />
7.3 Gapped Local Alignment Scores<br />
In this section, we are concerned with the optimal local alignment score $S$ in the general<br />
case where gaps are allowed. Although no explicit theory is known for this case,<br />
a number of empirical studies strongly suggest that $S$ also asymptotically follows an<br />
extreme value distribution (7.1) under certain conditions on the scoring matrix and<br />
gap penalty used for alignment. As we will see, these conditions are satisfied for<br />
most combinations of scoring matrices and gap penalty costs.<br />
7.3.1 Effects of Gap Penalty<br />
Most empirical studies focus on the statistical distribution of the scores of optimal<br />
local alignments with affine gap costs. Each such gap cost has a gap opening penalty<br />
$o$ and a gap extension penalty $e$, under which a gap of length $k$ receives a score of<br />
$-(o + k \times e)$.<br />
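The affine gap convention above can be written as a one-line function. This is a minimal sketch; the function name `affine_gap_score` and the penalty values in the example are illustrative assumptions, not taken from the text.

```python
def affine_gap_score(k, o, e):
    """Score of a single gap of length k >= 1 under the convention -(o + k*e)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(o + k * e)

# Illustrative penalties: opening o = 11, extension e = 1.
one_long_gap = affine_gap_score(3, 11, 1)          # a single gap of length 3
three_short_gaps = 3 * affine_gap_score(1, 11, 1)  # three separate gaps of length 1
```

Under this convention a single long gap is penalized less than several short gaps of the same total length, because the opening penalty is paid only once.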
Consider two random sequences $X'$ and $X''$ whose letters are generated independently<br />
according to a probability distribution. The optimal local alignment<br />
score $S_{\max}$ of $X'$ and $X''$ depends on the sequence lengths $m$ and $n$, the letter<br />
distribution, the substitution matrix, and the affine gap cost $(o, e)$. Although the exact<br />
probability distribution of $S_{\max}$ is unclear, $S_{\max}$ grows either linearly or logarithmically<br />
as $m$ and $n$ go to infinity.<br />
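One way to explore this growth behavior empirically is to compute the optimal local alignment score for random sequences of increasing length under different gap costs. The sketch below makes several assumptions that are not from the text: a uniform DNA alphabet, a +2/-1 match/mismatch score, and the standard affine-gap dynamic programming recurrences (Gotoh-style). All function names are hypothetical.

```python
import random

NEG_INF = float("-inf")

def match_score(a, b):
    """Illustrative substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1

def local_alignment_score(x, y, score, o, e):
    """Optimal local alignment score; a gap of length k scores -(o + k*e)."""
    n = len(y)
    best = 0
    h_prev = [0] * (n + 1)        # best score of an alignment ending at (i-1, j)
    f_prev = [NEG_INF] * (n + 1)  # same, but ending with a letter of x against a gap
    for i in range(1, len(x) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [NEG_INF] * (n + 1)
        e_gap = NEG_INF           # ending with a letter of y against a gap
        for j in range(1, n + 1):
            e_gap = max(h_cur[j - 1] - (o + e), e_gap - e)
            f_cur[j] = max(h_prev[j] - (o + e), f_prev[j] - e)
            diag = h_prev[j - 1] + score(x[i - 1], y[j - 1])
            h_cur[j] = max(0, diag, e_gap, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best

def random_sequence(n, rng, alphabet="ACGT"):
    """Random sequence with independent, uniformly distributed letters."""
    return "".join(rng.choice(alphabet) for _ in range(n))

def mean_score(n, o, e, trials=5, seed=0):
    """Average optimal local score over random sequence pairs of length n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x, y = random_sequence(n, rng), random_sequence(n, rng)
        total += local_alignment_score(x, y, match_score, o, e)
    return total / trials

if __name__ == "__main__":
    for n in (50, 100, 200):
        # Compare a cheap gap cost with an expensive one as n grows.
        print(n, mean_score(n, 0, 1), mean_score(n, 10, 2))
```

Printing the mean score against $n$ for a cheap and an expensive gap cost gives a quick impression of how the growth rate depends on $(o, e)$; a careful study would need longer sequences and many more trials.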
This phase transition phenomenon for the optimal local alignment score was<br />
studied by Arratia and Waterman [15]. Although a rigorous treatment is far beyond<br />
the scope of this book, an intuitive account is quite straightforward. Consider<br />
two sequences $x_1 x_2 \cdots x_n$ and $y_1 y_2 \cdots y_n$. Let $S_t$ denote the score of the optimal<br />
alignment of $x_1 x_2 \cdots x_t$ and $y_1 y_2 \cdots y_t$. Then,<br />
$$S_{t+k} \ge S_t + S(x_{t+1} x_{t+2} \cdots x_{t+k},\; y_{t+1} y_{t+2} \cdots y_{t+k}).$$
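
The inequality can be checked numerically: joining an optimal alignment of the prefixes with an optimal alignment of the remaining segments yields a valid alignment of the longer prefixes, so the optimal score is at least the sum. The sketch below assumes that the score in question is a global alignment score with a linear gap penalty, with +1/-1 match/mismatch scores; these choices and the function names are illustrative assumptions rather than the text's exact setting.

```python
import random

def global_alignment_score(x, y, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score with a linear gap penalty."""
    n = len(y)
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap] + [0] * n
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[n]

def superadditive_holds(x, y, t):
    """Check S_{t+k} >= S_t + S(remaining segments), where t + k = len(x)."""
    whole = global_alignment_score(x, y)
    head = global_alignment_score(x[:t], y[:t])
    tail = global_alignment_score(x[t:], y[t:])
    return whole >= head + tail

# Check every split point of one random pair of equal-length sequences.
rng = random.Random(1)
x = "".join(rng.choice("ACGT") for _ in range(40))
y = "".join(rng.choice("ACGT") for _ in range(40))
all_hold = all(superadditive_holds(x, y, t) for t in range(len(x) + 1))
```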