01.04.2015 Views

Sequence Comparison.pdf


7 Local Alignment Statistics

where $q_{ij}$ is the target frequency at which we expect to see residue $i$ aligned with residue $j$ in the MSP, and $s_{ij}$ is the score for aligning $i$ and $j$. With the value of $\lambda$, the scores $s_{ij}$, and the background frequencies $p_i$ and $p_j$ in hand, $q_{ij}$ can be calculated as

$$q_{ij} \approx p_i p_j e^{\lambda s_{ij}}. \tag{7.50}$$
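As a rough illustration of (7.50), the sketch below computes target frequencies for a toy DNA scoring scheme. It assumes that $\lambda$ is the positive root of $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$, as in the usual Karlin-Altschul setting; the definition of $\lambda$ lies outside this passage, so that step is an assumption. The function names, the uniform background, and the +1/−1 scores are hypothetical choices made only for the example.

```python
import math

def solve_lambda(p, s, iters=200):
    """Bisection for the positive root of sum_ij p_i p_j exp(lam * s_ij) = 1.

    Assumes the expected score is negative and some score is positive,
    so that exactly one positive root exists.
    """
    def f(lam):
        return sum(p[a] * p[b] * math.exp(lam * s[a][b]) for a in p for b in p) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:      # grow hi until it lies beyond the root
        hi *= 2.0
    lo = hi
    while f(lo) >= 0.0:      # shrink lo until it lies below the root
        lo /= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def target_frequencies(p, s, lam):
    """Apply q_ij = p_i * p_j * exp(lam * s_ij) to every residue pair."""
    return {(a, b): p[a] * p[b] * math.exp(lam * s[a][b]) for a in p for b in p}

# Toy setup (illustrative only): uniform background, match +1, mismatch -1.
letters = "ACGT"
p = {a: 0.25 for a in letters}
s = {a: {b: (1 if a == b else -1) for b in letters} for a in letters}

lam = solve_lambda(p, s)          # for this toy matrix the root is ln 3
q = target_frequencies(p, s, lam)
print(lam, sum(q.values()))
```

With $\lambda$ chosen this way the $q_{ij}$ sum to one, so they can be read as a probability distribution over aligned residue pairs. In the toy case, matches carry a total target frequency of 0.75 and mismatches 0.25.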

Simulation shows that the values calculated from (7.49) are often larger than the empirical mean lengths of MSPs, especially when $n_1$ and $n_2$ are in the range from $10^2$ to $10^3$ (Altschul and Gish, 1996, [6]). Accordingly, the effective lengths defined by (7.46) might lead to P-value estimates lower than the correct values. The current version of BLAST calculates the mean length of MSPs empirically in database searches.

7.3 Gapped Local Alignment Scores

In this section, we are concerned with the optimal local alignment scores $S$ in the general case where gaps are allowed. Although no explicit theory is known for this case, a number of empirical studies strongly suggest that $S$ also asymptotically follows an extreme value distribution (7.1) under certain conditions on the scoring matrix and gap penalty used for alignment. As we will see, these conditions are satisfied for most combinations of scoring matrices and gap penalty costs.

7.3.1 Effects of Gap Penalty

Most empirical studies focus on the statistical distribution of the scores of optimal local alignments with affine gap costs. Each such gap cost has a gap opening penalty $o$ and a gap extension penalty $e$, so that a gap of length $k$ receives a score of $-(o + k \times e)$.
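The affine gap score can be written as a one-line function. The sketch below is only an illustration: the name `gap_score` and the values $o = 11$, $e = 1$ are assumptions chosen for the example, not parameters taken from the text.

```python
def gap_score(k, o, e):
    """Score of one gap of length k under affine cost (o, e): -(o + k * e)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(o + k * e)

o, e = 11, 1                              # illustrative penalties only
one_long_gap = gap_score(3, o, e)         # a single gap of length 3
three_short_gaps = 3 * gap_score(1, o, e) # three separate gaps of length 1
print(one_long_gap, three_short_gaps)     # prints: -14 -36
```

Because the opening penalty is paid once per gap, one long gap is penalized less than several short gaps covering the same number of positions.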

Consider two random sequences $X'$ and $X''$ whose letters are generated independently according to a probability distribution. The optimal local alignment score $S_{\max}$ of $X'$ and $X''$ depends on the sequence lengths $m$ and $n$, the letter distribution, the substitution matrix, and the affine gap cost $(o, e)$. Although the exact probability distribution of $S_{\max}$ is unclear, $S_{\max}$ grows either linearly or logarithmically as $m$ and $n$ go to infinity.
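One way to explore this dependence empirically is to compute $S_{\max}$ for random sequence pairs of increasing length. The sketch below uses a standard Gotoh-style dynamic program for local alignment with affine gaps, in which a gap of length $k$ scores $-(o + k \times e)$. The four-letter uniform alphabet, the +1/−1 substitution scores, the gap costs, and all function names are assumptions made for the illustration.

```python
import random

NEG_INF = float("-inf")

def local_affine_score(x, y, sub, o, e):
    """Optimal local alignment score; a gap of length k scores -(o + k * e)."""
    n = len(y)
    h_prev = [0] * (n + 1)        # best score of an alignment ending at (i-1, j)
    f_prev = [NEG_INF] * (n + 1)  # same, but ending with a letter of x facing a gap
    best = 0
    for i in range(1, len(x) + 1):
        h = [0] * (n + 1)
        f = [NEG_INF] * (n + 1)
        gap_in_x = NEG_INF        # alignment ending with a letter of y facing a gap
        for j in range(1, n + 1):
            # Either open a new gap (cost o + e) or extend the current one (cost e).
            gap_in_x = max(h[j - 1] - (o + e), gap_in_x - e)
            f[j] = max(h_prev[j] - (o + e), f_prev[j] - e)
            diag = h_prev[j - 1] + sub(x[i - 1], y[j - 1])
            h[j] = max(0, diag, gap_in_x, f[j])
            best = max(best, h[j])
        h_prev, f_prev = h, f
    return best

def match_mismatch(a, b):
    """Toy substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def mean_smax(length, trials, o, e, alphabet="ACGT", seed=0):
    """Average S_max over random sequence pairs of the given length."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.choice(alphabet) for _ in range(length)]
        y = [rng.choice(alphabet) for _ in range(length)]
        total += local_affine_score(x, y, match_mismatch, o, e)
    return total / trials

if __name__ == "__main__":
    # Compare a cheap and an expensive gap cost at a few small lengths.
    for o, e in [(0, 0.2), (5, 1)]:
        for length in (50, 100, 200):
            print(o, e, length, mean_smax(length, 3, o, e))
```

Plotting the mean score against the length, or against its logarithm, for each gap cost gives a rough indication of which growth regime a parameter choice falls into. The small lengths and trial counts here only keep the run short and are far too few for a reliable estimate.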

This phase transition phenomenon for the optimal local alignment score was studied by Arratia and Waterman [15]. Although a rigorous treatment is far beyond the scope of this book, an intuitive account is quite straightforward. Consider two sequences $x_1 x_2 \ldots x_n$ and $y_1 y_2 \ldots y_n$. Let $S_t$ denote the score of the optimal alignment of $x_1 x_2 \ldots x_t$ and $y_1 y_2 \ldots y_t$. Then,

$$S_{t+k} \ge S_t + S(x_{t+1} x_{t+2} \cdots x_{t+k},\; y_{t+1} y_{t+2} \cdots y_{t+k}).$$
