ble 8.3), whereas in the alignment of two DNA sequences that have diverged by 96 (or
120 × 0.8) PAMs, every three residues (a codon) carry only about 0.62 bits of information.
In other words, at this evolutionary distance, as much as 37% of the information
available in protein comparison will be lost in DNA sequence comparison.
8.7 Gap Cost in Gapped Alignments
There is no general theory available for guiding the choice of gap costs. The most
straightforward scheme is to charge a fixed penalty for each indel. Over the years,
it has been observed that the optimal alignments produced by this scheme usually
contain a large number of short gaps and are often not biologically meaningful (see
Section 1.3 for the definition of gaps).
To capture the idea that a single mutational event might insert or delete a sequence
of residues, Waterman and Smith (1981, [180]) introduced the affine gap
penalty model. Under this model, a penalty of o + e × k is charged for a gap of length
k, where o is a large penalty for opening a gap and e is a smaller penalty for extending
it. The current version of BLASTP uses, by default, 11 for gap opening and 1 for
gap extension, together with BLOSUM62, for aligning protein sequences.
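
To make the difference between the two schemes concrete, here is a minimal Python sketch of a fixed per-indel penalty and of the affine penalty o + e × k. The function names are hypothetical, and the default values 11 and 1 are simply the BLASTP defaults quoted above.

```python
def linear_gap_penalty(k, per_indel):
    """Fixed penalty per indel: a gap of length k costs per_indel * k."""
    return per_indel * k


def affine_gap_penalty(k, o=11, e=1):
    """Affine penalty: o for opening a gap plus e for each of its k positions."""
    if k <= 0:
        return 0  # no gap, no penalty
    return o + e * k


# One long gap versus the same number of indels spread over short gaps.
one_gap_of_five = affine_gap_penalty(5)
five_gaps_of_one = 5 * affine_gap_penalty(1)
```

With these defaults, a single gap of length 5 costs 16, whereas five separate gaps of length 1 cost 60, so the affine model favors a few long gaps over many short ones. Under the fixed per-indel scheme both arrangements cost the same.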
The affine gap cost is based on the hypothesis that gap length has an exponential
distribution, that is, the probability of a gap of length k is $\alpha(1-\beta)\beta^{k}$ for some
constants $\alpha$ and $\beta$. Under this hypothesis, an affine gap cost is derived by charging
$\log(\alpha(1-\beta)\beta^{k})$ for a gap of length k, a quantity that is linear in k. However,
this hypothesis might not hold in
general. For instance, the study of Benner, Cohen, and Gonnet (1993, [26]) suggests
that the frequency of gaps of length k is accurately described by $m k^{-1.7}$ for some constant
m.
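
The step from the exponential hypothesis to an affine cost can be written out as follows. This is a short worked derivation, assuming $0 < \beta < 1$ and $\alpha(1-\beta) < 1$ so that both penalties are positive; identifying the two terms with o and e is our reading of the text.

```latex
\begin{align*}
\log\bigl(\alpha(1-\beta)\beta^{k}\bigr)
  &= \log\bigl(\alpha(1-\beta)\bigr) + k\log\beta \\
  &= -(o + e\,k),
  \qquad o = -\log\bigl(\alpha(1-\beta)\bigr), \quad e = -\log\beta . \\[4pt]
\log\bigl(m\,k^{-1.7}\bigr)
  &= \log m - 1.7\,\log k .
\end{align*}
```

The first score is linear in k, which is exactly the affine form. The second, corresponding to the power-law frequencies, grows only logarithmically in k, so no choice of o and e reproduces it for all gap lengths.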
A generalized affine gap cost was introduced by Altschul (1998, [2]). A generalized
gap consists of a consecutive sequence of indels in which spaces can be in either
row. A generalized gap of length 10 may contain 10 insertions; it may also contain
4 insertions and 6 deletions. To reflect the structural property of a generalized gap, a
generalized affine gap cost has three parameters a, b, c. The score −a is introduced
for the opening of a gap; −b is for each residue inserted or deleted; and −c is
for each pair of residues left unaligned. A generalized gap with k insertions and l
deletions scores $-(a + |k - l|\,b + c\min\{k, l\})$.
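
The scoring rule for a generalized gap can be sketched directly in Python. The function name is hypothetical, and the parameter values a = 11, b = 1, c = 3 used in the example are arbitrary illustrations, not values taken from Altschul's paper.

```python
def generalized_gap_score(k, l, a, b, c):
    """Score of a generalized gap with k insertions and l deletions.

    a: gap opening, b: per residue inserted or deleted beyond the paired ones,
    c: per pair of residues left unaligned.
    """
    if k == 0 and l == 0:
        return 0  # no gap at all
    return -(a + abs(k - l) * b + c * min(k, l))


# Two generalized gaps of length 10, scored with illustrative parameters.
ten_insertions = generalized_gap_score(10, 0, a=11, b=1, c=3)
four_and_six = generalized_gap_score(4, 6, a=11, b=1, c=3)
```

When one of k and l is zero, the formula reduces to the ordinary affine score −(a + b × max{k, l}). With the illustrative parameters, the gap of 10 insertions scores −21 and the gap of 4 insertions and 6 deletions scores −25.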
Generalized affine gap costs can be used for local or global alignment of protein
sequences. The dynamic programming algorithm carries over to this generalized
affine gap cost in a straightforward manner, and it still has quadratic time complexity.
For local alignment, the distribution of optimal alignment scores also approximately follows
an extreme value distribution (7.1). The empirical study of Zachariah
et al. (2005, [213]) shows that this generalized affine gap cost model significantly improves
the accuracy of protein alignment.
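
For orientation, the quadratic-time structure of the dynamic program can be seen in the ordinary affine case. The sketch below computes a global alignment score with the standard three-state recurrences for a gap penalty of o + e × k; it is not the generalized algorithm, which would need a further state for unaligned residue pairs. The function name and the simple match/mismatch scores are assumptions made for illustration, standing in for a substitution matrix such as BLOSUM62.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, o=11, e=1):
    """Best global alignment score of s and t; a gap of length k costs o + e*k."""
    n, m = len(s), len(t)
    # M[i][j]: best score with s[i-1] aligned to t[j-1]
    # X[i][j]: best score with s[i-1] aligned to a space
    # Y[i][j]: best score with t[j-1] aligned to a space
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(o + e * i)  # leading gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -(o + e * j)  # leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Open a new gap (cost o + e) or extend the current one (cost e).
            X[i][j] = max(M[i - 1][j] - (o + e), Y[i - 1][j] - (o + e), X[i - 1][j] - e)
            Y[i][j] = max(M[i][j - 1] - (o + e), X[i][j - 1] - (o + e), Y[i][j - 1] - e)
    return max(M[n][m], X[n][m], Y[n][m])
```

Each of the (n + 1)(m + 1) cells is filled in constant time, which gives the quadratic running time mentioned above.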