07.02.2013 Views

Bioinformatics Algorithms: Techniques and Applications


12 DYNAMIC PROGRAMMING ALGORITHMS

(Fig. 2.2 b). Each column may consist of two aligned characters, v_i and w_j (1 ≤ i ≤ m, 1 ≤ j ≤ n), which is called a match (if v_i = w_j) or a mismatch (otherwise), or of one character and one gap, which is called an indel (insertion or deletion). A global alignment is evaluated by summing the scores of all its columns. Column scores are defined by a similarity matrix over all pairs of characters (4 nucleotides for DNA or 20 amino acids for proteins), which covers matches and mismatches, together with a gap penalty function. A simple scoring function for the global alignment of two DNA sequences rewards each match with score +1 and penalizes each mismatch with score −µ and each indel with score −σ. The alignment of two protein sequences usually involves more complicated scoring schemes that reflect models of protein evolution, for example, PAM [21] and BLOSUM [33].
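The simple DNA scoring function described above can be sketched as a column-by-column sum. This is only an illustration: the function name `score_alignment`, the use of `-` as the gap character, and the default values µ = σ = 1 are assumptions, not taken from the text.

```python
def score_alignment(row_v, row_w, mu=1, sigma=1, gap="-"):
    """Score a global alignment given as two gapped rows of equal length.

    Each column contributes +1 (match), -mu (mismatch) or -sigma (indel).
    The parameter values are illustrative assumptions.
    """
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row_v, row_w):
        if a == gap and b == gap:
            raise ValueError("a column cannot consist of two gaps")
        if a == gap or b == gap:
            total -= sigma      # indel column
        elif a == b:
            total += 1          # match column
        else:
            total -= mu         # mismatch column
    return total


# Alignment as it appears in panel (b) of Fig. 2.2: 5 matches and 3 indels.
example_score = score_alignment("ATCT--GC", "A-CTAAGC", mu=1, sigma=1)
print(example_score)  # 2
```

With µ = σ = 1 the five match columns contribute +5 and the three indel columns contribute −3, giving a total of 2.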

It is useful to map the global alignment problem, that is, finding the global alignment with the highest score for two given sequences, onto an alignment graph (Fig. 2.2 a). Given two sequences V and W, the alignment graph is a directed acyclic graph G on (m + 1) × (n + 1) nodes, each labeled with a pair of positions (i, j) (0 ≤ i ≤ m, 0 ≤ j ≤ n), with three types of weighted edges: horizontal edges from (i, j) to (i + 1, j) with weight δ(v_{i+1}, −), vertical edges from (i, j) to (i, j + 1) with weight δ(−, w_{j+1}), and diagonal edges from (i, j) to (i + 1, j + 1) with weight δ(v_{i+1}, w_{j+1}), where δ(v_i, −) and δ(−, w_j) represent the penalty scores for indels, and δ(v_i, w_j) represents the similarity score for a match or mismatch. Any global alignment between V and W corresponds to a path in the alignment graph from node (0, 0) to node (m, n), and the alignment score is equal to the total weight of the path. Therefore, the global alignment problem can be transformed into the problem of finding the longest path between two nodes in the alignment graph, and thus can be solved by a dynamic programming algorithm. To compute the optimal alignment score S(i, j) between the two prefixes v_1...v_i of V and w_1...w_j of W, that is, the total weight of the longest path from (0, 0) to node (i, j), one can use the following

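[Figure 2.2. Panel (a): alignment graph with the characters A T C T G C along the horizontal axis (index i) and A C T A A G C along the vertical axis (index j), running from node (0,0) to node (6,7), with an intermediate node (i,j) marked. Panel (b): the corresponding alignment

ATCT--GC
A-CTAAGC]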

FIGURE 2.2 The alignment graph for the alignment of two DNA sequences, ACCTGC and ACTAAGC. The optimal global alignment (b) can be represented as a path in the alignment graph from (0,0) to (6,7) (highlighted in bold).
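The longest-path formulation can be sketched in code as follows. The recurrence itself is cut off in this excerpt, so the sketch assumes the standard form in which S(i, j) is the maximum over the three edges entering node (i, j). The names `delta` and `global_align`, the integer parameters µ = σ = 1, and the `-` gap character are illustrative assumptions. The input sequences are the ones shown in panel (b) of Fig. 2.2, ATCTGC and ACTAAGC.

```python
def delta(a, b, mu=1, sigma=1):
    """Edge weight in the alignment graph; None stands for a gap (assumed convention)."""
    if a is None or b is None:
        return -sigma               # horizontal or vertical edge (indel)
    return 1 if a == b else -mu     # diagonal edge (match or mismatch)


def global_align(v, w, mu=1, sigma=1):
    """Return (score, gapped v, gapped w) for one optimal global alignment."""
    m, n = len(v), len(w)
    # S[i][j] holds the weight of the longest path from (0, 0) to (i, j).
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + delta(v[i - 1], None, mu, sigma)
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + delta(None, w[j - 1], mu, sigma)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + delta(v[i - 1], w[j - 1], mu, sigma),
                S[i - 1][j] + delta(v[i - 1], None, mu, sigma),
                S[i][j - 1] + delta(None, w[j - 1], mu, sigma),
            )
    # Trace back from (m, n) to (0, 0), following edges that attain S[i][j].
    row_v, row_w = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + delta(v[i - 1], w[j - 1], mu, sigma):
            row_v.append(v[i - 1])
            row_w.append(w[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + delta(v[i - 1], None, mu, sigma):
            row_v.append(v[i - 1])
            row_w.append("-")
            i -= 1
        else:
            row_v.append("-")
            row_w.append(w[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_v)), "".join(reversed(row_w))


score, top, bottom = global_align("ATCTGC", "ACTAAGC")
print(score)   # 2
print(top)     # ATCT--GC
print(bottom)  # A-CTAAGC
```

The table has (m + 1) × (n + 1) entries and each entry looks at three predecessors, so the sketch runs in O(mn) time and space. Integer weights are used so that the equality tests in the traceback are exact. With real-valued scores, storing a back-pointer per cell would be the safer choice.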
