Chapter 7
Local Alignment Statistics
The understanding of the statistical significance of local sequence alignment has
improved greatly since Karlin and Altschul published their seminal work [100] on
the distribution of optimal ungapped local alignment scores in 1990. In this chapter,
we discuss the local alignment statistics that are incorporated into BLAST and other
alignment programs. Our discussion focuses on protein sequences for two reasons.
First, the analysis for DNA sequences is theoretically similar to, but easier than, that
for protein sequences. Second, protein sequence comparison is more sensitive than
that of DNA sequences. Nucleotide bases in a DNA sequence have higher-order dependence
due to codon bias and other mechanisms, and hence DNA sequences with
normal complexity might encode protein sequences with extremely low complexity.
Accordingly, the statistical estimations from DNA sequence comparison are often
less reliable than those with proteins.
The statistics of local similarity scores are far more complicated than what we<br />
shall discuss in this chapter. Many theoretical problems arising from the general case<br />
in which gaps are allowed have yet to be well understood, even though they have been
investigated for three decades. Our aim is to present the key ideas in the work of<br />
Karlin and Altschul on optimal ungapped local alignment scores and its generalizations<br />
to gapped local alignment. Basic formulas used in BLAST are also described.<br />
This chapter is divided into five sections. In Section 7.1, we introduce the extreme<br />
value type-I distribution. Such a distribution is fundamental to the study of local<br />
similarity scores, with and without gaps.<br />
Section 7.2 presents the Karlin and Altschul statistics of local alignment scores.<br />
We first prove that maximal segment scores are accurately described by a geometric-like
distribution in the asymptotic limit in Sections 7.2.1 and 7.2.3; we then introduce the
Karlin-Altschul sum statistic in Section 7.2.4. Section 7.2.5 summarizes the corresponding<br />
results for optimal ungapped local alignment scores. Finally, we discuss<br />
the edge effect issue in Section 7.2.6.<br />
No explicit theory is known for the distribution of local similarity scores
when gaps are allowed. Hence, most studies of this case are empirical.
These studies suggest that the optimal local alignment scores also fit an extreme<br />
value type-I distribution for most cases of interest. Section 7.3.1 describes a phase<br />