01.04.2015 Views

Sequence Comparison.pdf



ble 8.3), whereas in the alignment of two DNA sequences that have diverged by 96 (or 120 × 0.8) PAMs, every three residues (a codon) carry only about 0.62 bits of information. In other words, at this evolutionary distance, as much as 37% of the information available in protein comparison is lost in DNA sequence comparison.

8.7 Gap Cost in Gapped Alignments

There is no general theory available for guiding the choice of gap costs. The most straightforward scheme is to charge a fixed penalty for each indel. Over the years, it has been observed that the optimal alignments produced by this scheme usually contain a large number of short gaps and are often not biologically meaningful (see Section 1.3 for the definition of gaps).

To capture the idea that a single mutational event might insert or delete a sequence of residues, Waterman and Smith (1981, [180]) introduced the affine gap penalty model. Under this model, a penalty of $o + e \times k$ is charged for a gap of length $k$, where $o$ is a large penalty for opening a gap and $e$ a smaller penalty for extending it. The current version of BLASTP uses, by default, 11 for gap opening and 1 for gap extension, together with BLOSUM62, for aligning protein sequences.
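
The difference between the two schemes can be made concrete with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the defaults 11 and 1 are the BLASTP values quoted above, and the per-indel penalty in the fixed scheme is an arbitrary choice.

```python
def affine_gap_penalty(k, gap_open=11, gap_extend=1):
    """Penalty o + e*k for a single gap of length k (affine model)."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * k


def fixed_gap_penalty(k, per_indel=4):
    """Penalty under the fixed scheme: every indel costs the same.

    The value 4 is an arbitrary illustrative choice, not a recommended default.
    """
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return per_indel * k


# One gap of length 3 versus three separate gaps of length 1.
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
```

With the defaults above, one gap of length 3 costs 14 while three separate gaps of length 1 cost 36, so the affine model favors a few long gaps. Under the fixed scheme both arrangements cost the same, which is consistent with the many short gaps observed in practice.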

The affine gap cost is based on the hypothesis that gap length has an exponential distribution, that is, the probability of a gap of length $k$ is $\alpha(1 - \beta)\beta^{k}$ for some constants $\alpha$ and $\beta$. Under this hypothesis, an affine gap cost is derived by charging $\log\bigl(\alpha(1 - \beta)\beta^{k}\bigr)$ for a gap of length $k$. However, this hypothesis might not be true in general. For instance, the study of Benner, Cohen, and Gonnet (1993, [26]) suggests that the frequency of gaps of length $k$ is accurately described by $m k^{-1.7}$ for some constant $m$.
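
Expanding the two logarithms shows why the first hypothesis leads to an affine cost and the second does not. The identification of $o$ and $e$ with the terms below assumes that the penalty is taken as the negative log-probability; that sign convention is an assumption made here for illustration.

```latex
\begin{align*}
\log\bigl(\alpha(1-\beta)\beta^{k}\bigr)
  &= \log\bigl(\alpha(1-\beta)\bigr) + k\,\log\beta
  && \text{(affine in } k\text{)},\\
\log\bigl(m\,k^{-1.7}\bigr)
  &= \log m - 1.7\,\log k
  && \text{(logarithmic in } k\text{)}.
\end{align*}
```

Under that convention the first line corresponds to $o = -\log\bigl(\alpha(1-\beta)\bigr)$ and $e = -\log\beta$, whereas the second line yields a cost that grows with $\log k$ rather than linearly with $k$.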

A generalized affine gap cost is introduced by Altschul (1998, [2]). A generalized gap consists of a consecutive sequence of indels in which spaces can be in either row. A generalized gap of length 10 may contain 10 insertions; it may also contain 4 insertions and 6 deletions. To reflect the structural property of a generalized gap, a generalized affine gap cost has three parameters $a$, $b$, and $c$. The score $-a$ is introduced for the opening of a gap; $-b$ is for each residue inserted or deleted; and $-c$ is for each pair of residues left unaligned. A generalized gap with $k$ insertions and $l$ deletions scores $-(a + |k - l|\,b + c\,\min\{k, l\})$.
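
The score formula can be checked on the two length-10 gaps mentioned above. The sketch below is illustrative only: the function name is hypothetical and the parameter values $a = 10$, $b = 1$, $c = 1$ are arbitrary, not values recommended by the cited work.

```python
def generalized_gap_score(k, l, a, b, c):
    """Score -(a + |k - l|*b + c*min(k, l)) of a generalized gap.

    k is the number of insertions and l the number of deletions.
    """
    if k < 0 or l < 0 or k + l == 0:
        raise ValueError("a gap needs at least one insertion or deletion")
    return -(a + abs(k - l) * b + c * min(k, l))


# Illustrative (assumed) parameter values.
a, b, c = 10, 1, 1
only_insertions = generalized_gap_score(10, 0, a, b, c)
mixed_gap = generalized_gap_score(4, 6, a, b, c)
```

With these values the gap of 10 insertions scores -20, while the gap of 4 insertions and 6 deletions scores -16: four pairs of residues are left unaligned at cost $c$ each, and only the two surplus residues pay $b$.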

Generalized affine gap costs can be used for aligning protein sequences either locally or globally. The dynamic programming algorithm carries over to this generalized affine gap cost in a straightforward manner, and it still has quadratic time complexity. For local alignment, the distribution of optimal alignment scores also approximately follows an extreme value distribution (7.1). The empirical study of Zachariah et al. (2005, [213]) shows that this generalized affine gap cost model significantly improves the accuracy of protein alignment.
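
One way such a dynamic program could be organized is sketched below for global alignment. It is not taken from the cited papers: the four-state layout, the function name, and the toy scoring function in the usage note are assumptions made for illustration, and only the optimal score is computed, not the alignment itself. The sketch keeps one table per way an alignment can end, and it forces a gap that mixes insertions and deletions to go through the unaligned-pair state, so each gap is scored exactly as $-(a + |k - l|\,b + c\,\min\{k, l\})$.

```python
NEG_INF = float("-inf")


def generalized_affine_global_score(x, y, score, a, b, c):
    """Best global alignment score of x and y under parameters a, b, c.

    `score(p, q)` is the substitution score of residues p and q.
    """
    n, m = len(x), len(y)

    def table():
        return [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    M = table()  # alignment ends in an aligned pair
    P = table()  # ends in a pair of residues left unaligned (costs c)
    X = table()  # ends in a residue of x opposite a space (costs b)
    Y = table()  # ends in a residue of y opposite a space (costs b)
    M[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                best_prev = max(M[i - 1][j - 1], P[i - 1][j - 1],
                                X[i - 1][j - 1], Y[i - 1][j - 1])
                M[i][j] = best_prev + score(x[i - 1], y[j - 1])
                # Open a gap (-a) or continue a run of unaligned pairs.
                P[i][j] = max(M[i - 1][j - 1] - a, P[i - 1][j - 1]) - c
            if i > 0:
                # Single residues follow the unaligned pairs of the same gap;
                # X never follows Y directly, and vice versa.
                X[i][j] = max(M[i - 1][j] - a, P[i - 1][j], X[i - 1][j]) - b
            if j > 0:
                Y[i][j] = max(M[i][j - 1] - a, P[i][j - 1], Y[i][j - 1]) - b

    return max(M[n][m], P[n][m], X[n][m], Y[n][m])
```

Each cell of the four tables is filled in constant time, so the running time is proportional to the product of the sequence lengths, in line with the quadratic bound stated above. A call such as `generalized_affine_global_score("ACGT", "AT", lambda p, q: 2 if p == q else -3, 3, 1, 1)` exercises the sketch with a toy match/mismatch score.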
