6. MSA using DCA and branch-and-bound - Algorithms in ...

Comp. Sequence Analysis WS’04 ZBIT, D. Huson, (script by K. Reinert), December 17, 2004

6 Advanced Multiple Alignment (script by K. Reinert)

This exposition is based on the following sources, which are recommended reading:

1. Reinert, Stoye, Will, An iterative method for faster sum-of-pairs multiple sequence alignment, Bioinformatics, 2000, Vol 16, no 9, pages 808-814.
2. Stoye, Divide-and-Conquer Multiple Sequence Alignment, TR 97-02 Universität Bielefeld, 1997.

In this chapter we discuss how to combine three different algorithmic techniques to address the problem of computing good multiple sequence alignments:

• divide-and-conquer,
• dynamic programming and
• branch-and-bound.

First we will describe the DCA (divide-and-conquer alignment) method. Then we will discuss the naïve dynamic programming approach and show how to speed it up using branch-and-bound. Finally, we will discuss a hybrid method that uses both DCA and the branch-and-bound approach.

6.1 The DCA method

Given a set of k sequences $s_1, \dots, s_k$ for which we would like to compute a multiple alignment, DCA is based on the following idea:

Cut each of the k sequences and solve the resulting subproblems recursively. When the problem size is small enough, use an exact algorithm. The solutions can be combined simply by concatenation.

On the one hand, it is clear that optimal cut positions exist; on the other hand, finding them is NP-hard. Trivial ideas for determining the cut positions are not very successful. Hence, the main question is how to choose the cut positions.

This illustrates the main idea:

[Figure: the sequences are divided recursively, the resulting short subsequences are aligned optimally, and the partial alignments are concatenated.]

To help determine good cut sites, we will make use of so-called additional cost matrices for each pair of sequences and each possible cut position.

The additional cost is the score difference between the optimal global alignment and the best alignment going through position i and j.

[Figure: an alignment path of two sequences s and t that is forced through the point (i, j).]

Based on this, we will determine C-optimal families of cut positions, for which the (weighted) sum of pairwise additional costs is minimized.

To define this more formally, assume we are given two sequences $s_p = s_{p1} \dots s_{pn}$ and $s_q = s_{q1} \dots s_{qm}$. Let c denote a cost function and let $A^*$ denote an optimal alignment of $s_p$ and $s_q$.

Let (i, j) be a pair of possible cut positions in $s_p$ and $s_q$ with 0 ≤ i ≤ n and 0 ≤ j ≤ m. We use $\alpha^i_s$ and $\sigma^i_s$ to denote the i-th prefix and the i-th suffix of a string s, respectively: $\alpha^i_s$ consists of the first i characters of s and $\sigma^i_s$ of the remaining ones.

[Figure: $s_p$ is cut at position i into $\alpha^i_{s_p}$ and $\sigma^i_{s_p}$; $s_q$ is cut at position j into $\alpha^j_{s_q}$ and $\sigma^j_{s_q}$.]

If A is an alignment of a prefix of $s_p$ to a prefix of $s_q$, and if B is an alignment of the two remaining suffixes, then let A ++ B denote the concatenation of the two alignments.

Definition. For each pair (i, j) of possible cut positions of $s_p$ and $s_q$ we define the pairwise additional cost with respect to c by:

\[ C_{s_p,s_q}[i,j] := \min\{\, c(A \mathbin{+\!+} B) \mid A \in \mathcal{A}(\alpha^i_{s_p}, \alpha^j_{s_q}),\ B \in \mathcal{A}(\sigma^i_{s_p}, \sigma^j_{s_q}) \,\} - c(A^*), \]

where $\mathcal{A}(a, b)$ denotes the set of all possible alignments of two sequences a and b.

The matrix $C_{s_p,s_q} := (C_{s_p,s_q}[i,j])_{0 \le i \le |s_p|,\ 0 \le j \le |s_q|}$ is called the additional cost matrix of $s_p$ and $s_q$ with respect to c.

The additional cost matrices can be easily computed using the “forward” and “reverse” pairwise alignment matrices, that is,

\[ C_{s,t}[i,j] = D^f_{s,t}[i,j] + D^r_{s,t}[i,j] - c_{opt}(s,t), \]

where

\[ c_{opt}(s,t) = D^f_{s,t}[|s|,|t|] = D^r_{s,t}[0,0] \]

by definition.

Let's have a look at an example. Let s = CT and t = AGT. We use the unit cost function $d(x,y) = 1 - \delta_{xy}$ and a gap penalty of 1. In the tables below, columns correspond to the cut position i in s and rows to the cut position j in t:

D^f_{s,t}
       -  C  T
  -    0  1  2
  A    1  1  2
  G    2  2  2
  T    3  3  2

D^r_{s,t}
       -  C  T
  -    2  2  3
  A    1  1  2
  G    1  0  1
  T    2  1  0

C_{s,t}
       -  C  T
  -    0  1  3
  A    0  0  2
  G    1  0  1
  T    3  2  0

As stated above, the main problem is how to determine a good family of k cut positions for the sequences $s_1, s_2, \dots, s_k$.

Analogously to the WSOP score, we define the weighted multiple additional cost imposed by forcing the multiple alignment path of the sequences through a particular vertex $(c_1, \dots, c_k)$ in the k-dimensional hypercube.

Definition. Suppose that we are given a family of k sequences $s_1, \dots, s_k$ and pairwise weights $w_{i,j}$. The weighted SP multiple additional cost of a family of cut positions $(c_1, \dots, c_k)$ is defined as:

\[ C(c_1, \dots, c_k) = \sum_{1 \le i < j \le k} w_{i,j}\, C_{s_i,s_j}[c_i, c_j]. \]
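The computation of additional costs from forward and reverse matrices can be sketched in a few lines of Python. This is a minimal illustration, not the script's implementation: it assumes the unit cost function and a gap penalty of 1 from the example above, and the function names (`forward_matrix`, `reverse_matrix`, `additional_cost_matrix`, `multiple_additional_cost`) are chosen here for illustration only. The matrices are indexed as `C[i][j]` with i a cut position in s and j a cut position in t, so they appear transposed relative to tables that put t in the rows.

```python
from itertools import combinations
from math import inf


def forward_matrix(s, t, gap=1):
    """D^f[i][j]: optimal cost of aligning the prefixes s[:i] and t[:j]."""
    n, m = len(s), len(t)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                D[i][j] = min(D[i][j], D[i - 1][j] + gap)
            if j > 0:
                D[i][j] = min(D[i][j], D[i][j - 1] + gap)
            if i > 0 and j > 0:
                mismatch = int(s[i - 1] != t[j - 1])  # unit cost 1 - delta_xy
                D[i][j] = min(D[i][j], D[i - 1][j - 1] + mismatch)
    return D


def reverse_matrix(s, t, gap=1):
    """D^r[i][j]: optimal cost of aligning the suffixes s[i:] and t[j:]."""
    n, m = len(s), len(t)
    # Aligning two suffixes costs the same as aligning the reversed strings' prefixes.
    F = forward_matrix(s[::-1], t[::-1], gap)
    return [[F[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def additional_cost_matrix(s, t, gap=1):
    """C[i][j] = D^f[i][j] + D^r[i][j] - c_opt(s, t)."""
    Df, Dr = forward_matrix(s, t, gap), reverse_matrix(s, t, gap)
    c_opt = Df[len(s)][len(t)]
    return [[Df[i][j] + Dr[i][j] - c_opt for j in range(len(t) + 1)]
            for i in range(len(s) + 1)]


def multiple_additional_cost(seqs, cut, weight=lambda i, j: 1):
    """Weighted sum of pairwise additional costs for a family of cut positions."""
    # The pairwise matrices are recomputed on every call; a real program would cache them.
    return sum(weight(i, j) * additional_cost_matrix(seqs[i], seqs[j])[cut[i]][cut[j]]
               for i, j in combinations(range(len(seqs)), 2))


# s = CT, t = AGT as in the example; rows are cut positions in s.
print(additional_cost_matrix("CT", "AGT"))
# [[0, 0, 1, 3], [1, 0, 0, 2], [3, 2, 1, 0]]
```

Every cut position that lies on some optimal alignment path has additional cost 0, which is why each row and column of the printed matrix contains at least one zero.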

Example. For the three sequences $s_1$ = CT, $s_2$ = AGT and $s_3$ = G (unit cost, gap penalty 1), the pairwise additional cost matrices are as follows. Rows correspond to cut positions in the first sequence, columns to cut positions in the second:

C_{s_1,s_2}
       -  A  G  T
  -    0  0  1  3
  C    1  0  0  2
  T    3  2  1  0

C_{s_1,s_3}
       -  G
  -    0  1
  C    0  0
  T    1  0

C_{s_2,s_3}
       -  G
  -    0  2
  A    0  1
  G    1  0
  T    2  0

Set all pairwise weights to 1 and assume $\hat{c}_1 = 1$. It is not difficult to verify that $(c_2, c_3) = (1, 0)$ is C-optimal with respect to $\hat{c}_1$, that is, it minimizes $C(\hat{c}_1, c_2, c_3)$ over all $(c_2, c_3)$, since

\[ C(1, 1, 0) = C_{s_1,s_2}[1,1] + C_{s_1,s_3}[1,0] + C_{s_2,s_3}[1,0] = 0. \]

Given this cut, the best possible alignment has cost 7. The unique optimal alignment has cost 6 and can be achieved with the cut $(c_2, c_3) = (2, 1)$, which is also C-optimal with respect to $\hat{c}_1$.

cut (1,1,0), cost 7:

  C      -T      C-T
  A  ++  GT  =   AGT
  -      G-      -G-

cut (1,2,1), cost 6:

  -C     T       -CT
  AG ++  T   =   AGT
  -G     -       -G-

Assume we have a multiple sequence alignment program MSA that we can use to solve small instances of the problem of aligning k sequences. The DCA algorithm can be summarized as follows:

Algorithm DCA($s_1, s_2, \dots, s_k$, L)
Input: sequences $s_1, \dots, s_k$ and cut-off L
Output: alignment of $s_1, \dots, s_k$
begin
  if $\max\{n_1, \dots, n_k\} \le L$ then
    return MSA($s_1, s_2, \dots, s_k$)
  else
    Set $\hat{c}_1 := \lceil n_1 / 2 \rceil$
    Set $(c_2, \dots, c_k)$ := C-opt($(s_1, \dots, s_k)$, $\hat{c}_1$)
    return DCA($\alpha^{\hat{c}_1}_{s_1}, \alpha^{c_2}_{s_2}, \dots, \alpha^{c_k}_{s_k}$) ++ DCA($\sigma^{\hat{c}_1}_{s_1}, \sigma^{c_2}_{s_2}, \dots, \sigma^{c_k}_{s_k}$)
end

The function C-opt can naively be implemented as nested loops that run over all possible values $c_i \in \{0, \dots, n_i\}$ for i = 2, ..., k and return the family of cut positions with the minimal multiple additional cost.

The computation of C-opt requires at most $O(N k^2 n^2)$ steps, with $n = \max_i \{|s_i|\}$ and $N = \prod_i |s_i|$.

A direct improvement of this procedure is the combination of the loops with a simple branch-and-bound procedure that cuts off combinations of the $c_2, \dots, c_k$ as soon as a partial sum of the additional cost terms exceeds the minimal cost found so far.

6.2 Naïve Dynamic programming

The straightforward extension of the Needleman-Wunsch algorithm to compute an optimal (W)SOP-cost alignment $A^*$ for k sequences $s_1, \dots, s_k$ using dynamic programming takes time $O(n^k)$, with $n = \max_i |s_i|$. Obviously, this is only practical for very few, short sequences (k = 3, 4).

Our goal now is to develop a branch-and-bound approach.

In the following we will use the edit graph representation of a (multiple) alignment. In this representation, finding an optimal alignment corresponds to finding a shortest path in a directed acyclic graph (DAG).
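Returning briefly to the C-opt function of Section 6.1, its branch-and-bound variant can be sketched as a recursive search over $c_2, \dots, c_k$ that abandons a partial family of cut positions as soon as its partial additional cost exceeds the best complete cost found so far. This is a hedged sketch: the name `c_opt`, the dictionary layout `C[(p, q)]` (0-based sequence indices, p < q) and the `weight` parameter are assumptions made for this illustration. Pruning is valid because additional costs are non-negative; the sketch also assumes non-negative weights.

```python
from math import inf


def c_opt(C, lengths, c1_hat, weight=lambda p, q: 1):
    """Find (c_1, ..., c_k) with c_1 = c1_hat minimizing the multiple additional cost.

    C[(p, q)][cp][cq] is the pairwise additional cost of cutting sequence p at cp
    and sequence q at cq, for 0-based indices p < q. lengths[p] is the length of
    sequence p.
    """
    k = len(lengths)
    best_cost, best_cut = inf, None
    cut = [c1_hat]  # cut positions fixed so far

    def extend(partial):
        nonlocal best_cost, best_cut
        p = len(cut)  # index of the next sequence to cut
        if p == k:
            if partial < best_cost:
                best_cost, best_cut = partial, tuple(cut)
            return
        for c in range(lengths[p] + 1):
            # Cost contributed by pairing sequence p with all sequences fixed so far.
            added = sum(weight(q, p) * C[(q, p)][cut[q]][c] for q in range(p))
            if partial + added > best_cost:
                continue  # bound: this branch cannot beat the best known cut
            cut.append(c)
            extend(partial + added)
            cut.pop()

    extend(0)
    return best_cut, best_cost


# Additional cost matrices of the example with s1 = CT, s2 = AGT, s3 = G.
C = {
    (0, 1): [[0, 0, 1, 3], [1, 0, 0, 2], [3, 2, 1, 0]],
    (0, 2): [[0, 1], [0, 0], [1, 0]],
    (1, 2): [[0, 2], [0, 1], [1, 0], [2, 0]],
}
cut, cost = c_opt(C, lengths=(2, 3, 1), c1_hat=1)
print(cut, cost)
```

Because several families of cut positions can have the same minimal additional cost, which one is returned depends on the enumeration order; as the example in Section 6.1 shows, a C-optimal cut does not always lead to an optimal alignment.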

All considerations can also be done with affine gap costs, but for the sake of exposition we use linear gap costs.

Given sequences $s_1, \dots, s_k$, the corresponding edit graph has nodes

\[ V = \{\, v = (v[1], v[2], \dots, v[k]) \mid v \in \mathbb{N}^k \text{ and } 0 \le v[i] \le n_i,\ 1 \le i \le k \,\} \]

and edges

\[ E = \{\, (p, q) \mid p, q \in V,\ p \ne q \text{ and } q - p \in \{0,1\}^k \setminus \{0\}^k \,\}. \]

We use

\[ p \to q := \{\, (p = v_0, v_1, \dots, q = v_n) \mid (v_i, v_{i+1}) \in E,\ 0 \le i < n \,\} \]

to denote the set of all paths from node p to q. Similarly, we write p → q → r for the set of all paths from p to r which run through node q.

The graph contains two special nodes, the source s = (0, 0, ..., 0) and the sink $t = (n_1, n_2, \dots, n_k)$.

The following example shows a two-dimensional edit graph with source s, sink t and an optimal source-to-sink path s → t in dashed lines.

[Figure: two-dimensional edit graph for the sequences ANN and CAN, with source s, sink t and an optimal path drawn in dashed lines.]

[Figure: another example of an edit graph.]

A path π of length l from s = (0, ..., 0) to $t = (n_1, \dots, n_k)$ corresponds to an alignment defined through the following matrix:

\[ A_{ij} = \begin{cases} \text{-} & \text{if } v_j[i] - v_{j-1}[i] = 0 \\ s_i[v_j[i]] & \text{if } v_j[i] - v_{j-1}[i] = 1 \end{cases} \qquad \text{for } 1 \le i \le k,\ 1 \le j \le l. \]

The cost of an alignment is the sum of the costs of all edges:

\[ c(\pi) := \sum_{i=0}^{l-1} c(v_i, v_{i+1}) \]

with

\[ c(v, w) := \sum_{1 \le i < j \le k} d(a_{i*}, a_{j*}), \]

where $a_{i*}$ is the entry in row i of the alignment column that corresponds to the edge (v, w).
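The correspondence between paths and alignments can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names `path_to_alignment` and `sp_cost` are invented here, the pair cost is the unit cost with gap penalty 1, and a pair of two gap characters is assumed to cost 0. The sequences CT, AGT and G are used as sample input.

```python
from itertools import combinations


def path_to_alignment(path, seqs):
    """Turn a source-to-sink path in the edit graph into alignment rows.

    path is a list of nodes (tuples of length k); each step must add a
    nonzero vector from {0,1}^k. Row i receives the next character of
    seqs[i] if coordinate i advances, and a gap character otherwise.
    """
    rows = [[] for _ in seqs]
    for prev, cur in zip(path, path[1:]):
        steps = [c - p for p, c in zip(prev, cur)]
        if any(st not in (0, 1) for st in steps) or not any(steps):
            raise ValueError(f"{prev} -> {cur} is not an edge of the edit graph")
        for i, s in enumerate(seqs):
            # The text uses 1-based positions s_i[v_j[i]]; Python strings are 0-based.
            rows[i].append(s[cur[i] - 1] if steps[i] == 1 else "-")
    return ["".join(r) for r in rows]


def sp_cost(rows, d=lambda a, b: int(a != b)):
    """Sum-of-pairs cost: every column contributes d(a, b) for each pair of rows.

    The default d charges 1 for a mismatch or a character against a gap,
    and 0 for a match or for two gaps (an assumption of this sketch).
    """
    return sum(d(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


seqs = ["CT", "AGT", "G"]
path = [(0, 0, 0), (0, 1, 0), (1, 2, 1), (2, 3, 1)]
rows = path_to_alignment(path, seqs)
print("\n".join(rows))
print(sp_cost(rows))
```

Each edge of the path yields exactly one alignment column, so the number of columns equals the path length l.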

We overload the notation of c, since an edge in the edit graph corresponds exactly to a column in a multiple alignment. We denote the shortest path in p → q with $p \to^* q$ and its length with $c(p \to^* q)$.

Any node v in the edit graph cuts each sequence $s_i$ into a prefix $\alpha^v_i := \alpha^{v[i]}_{s_i}$ and a suffix $\sigma^v_i := \sigma^{v[i]}_{s_i}$, for i = 1, ..., k.

6.3 Dynamic programming and branch-and-bound

The edit graph is a DAG (directed acyclic graph) and a shortest path can be found by traversing the DAG in any topological order. This corresponds exactly to the natural extension of dynamic programming.

However, we do not want to construct the whole graph, as it has exponential size. To avoid this problem, we will take Dijkstra’s algorithm, which efficiently finds a shortest path in a graph with non-negative edge costs, and modify it so that (hopefully only) necessary parts of the graph are created on the fly.

Algorithm [Dijkstra]
Input: Graph G = (V, E), source node s, edge cost c
Output: predecessor map π defining shortest path from s
  for each node v do
    Set d[v] := ∞   // initial distance from source
    Set π[v] := 0
  Set d[s] := 0
  Set Q := V
  while Q ≠ ∅ do
    u := ExtractMin(Q)   // extract arg min_{v∈Q} {d[v]}
    for each child v of u do
      if d[v] > d[u] + c(u, v) then   // relax
        d[v] := d[u] + c(u, v)
        π[v] := u

6.4 Branch-and-bound

As noted above, we want to modify the algorithm so that only necessary parts of the graph are explicitly constructed, in the context of branch-and-bound.

Suppose we are given an upper bound U for the length of the shortest path.
We modify the algorithm as follows.

• Initially, we set Q := {s}, not Q := V.
• After extracting a node u of minimal distance from Q, we must expand u, that is, create all children of u and, for any such child v, insert it into Q if d[u] + c(u, v) ≤ U holds.

Algorithm [Modified Dijkstra with upper bound U]
Input: Graph G = (V, E), source node s, edge cost c, upper bound U
Output: predecessor map π defining shortest path from s
begin
  Create node s.
  Set d[s] := 0
  Set Q := {s}
  while Q ≠ ∅ do
    u := ExtractMin(Q)   // extract arg min_{v∈Q} {d[v]}
    for each potential child v of u do
      if d[u] + c(u, v) ≤ U then   // bound
        if v has not yet been created then   // create
          Create v
          Set d[v] := ∞
          Insert v into Q
        if d[v] > d[u] + c(u, v) then   // relax
          Set d[v] := d[u] + c(u, v)
          Set π[v] := u
end

6.5 Projections

We will discuss how to obtain a useful bound U.

First we need the definition of multiple alignments of subsets of the k strings.

Definition. Let A be a multiple alignment for the k strings $s_1, \dots, s_k$ and let I ⊆ {1, ..., k} be a set of indices defining a subset of the k strings. Then we define $A_I$ as the alignment resulting from first removing all rows i ∉ I from A and then deleting all columns consisting entirely of blanks. We call $A_I$ the projection of A to I. If I is given explicitly, we simplify notation and write $A_{i,j,k}$ instead of $A_{\{i,j,k\}}$.

Example

a1 = - G C T G A T A T A G C T
a2 = G G G T G A T - T A G C T
a3 = - G C T - A T - - C G C -
a4 = A G C G G A - A C A C C T

The projection $A_{2,3}$ is given by first taking the second and third row of the alignment:

a2 = G G G T G A T - T A G C T
a3 = - G C T - A T - - C G C -

and then deleting the column consisting only of blanks:

a2 = G G G T G A T T A G C T
a3 = - G C T - A T - C G C -

6.6 Standard-Bounding

Assume we have for each node v of the edit graph a lower bound L(v → t) for the cost of the shortest path in v → t. For example,

\[ L(v \to t) := \sum_{1 \le i < j \le k} c(A^*(\sigma^v_i, \sigma^v_j)) \]
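The modified Dijkstra procedure of Section 6.4 can be sketched in Python for the edit graph of k sequences. The sketch below is an illustration under several assumptions that are not part of the script: the name `bounded_dijkstra` is invented here, the edge cost is the sum-of-pairs column cost with unit mismatch cost, gap penalty 1 and cost 0 for two gaps, `heapq` with stale-entry skipping replaces a decrease-key operation, and the search stops as soon as the sink is extracted. Nodes exist only as dictionary keys, so they are created on the fly. Only the upper bound U is used for pruning.

```python
import heapq
from itertools import combinations, product
from math import inf


def column_cost(col):
    """Sum-of-pairs cost of one alignment column (unit cost, gap = 1, gap-gap = 0)."""
    return sum(1 for a, b in combinations(col, 2) if a != b)


def bounded_dijkstra(seqs, U):
    """Shortest source-to-sink path in the edit graph, ignoring nodes with distance > U.

    Returns (cost, pred); cost is None if the sink cannot be reached within U.
    pred maps every created node to its predecessor on a shortest path.
    """
    k = len(seqs)
    lengths = tuple(len(s) for s in seqs)
    source, sink = (0,) * k, lengths
    d = {source: 0}         # a node counts as "created" once it has an entry in d
    pred = {source: None}
    queue = [(0, source)]
    finished = set()
    while queue:
        du, u = heapq.heappop(queue)
        if u in finished:
            continue        # stale queue entry, u was already extracted
        finished.add(u)
        if u == sink:
            break           # early stop (an addition of this sketch)
        for step in product((0, 1), repeat=k):
            if not any(step):
                continue    # the zero vector is not an edge
            v = tuple(a + b for a, b in zip(u, step))
            if any(x > n for x, n in zip(v, lengths)):
                continue    # v lies outside the edit graph
            col = tuple(seqs[i][u[i]] if step[i] else "-" for i in range(k))
            cand = du + column_cost(col)
            if cand <= U and cand < d.get(v, inf):   # bound, then create/relax
                d[v] = cand
                pred[v] = u
                heapq.heappush(queue, (cand, v))
    return d.get(sink), pred


cost, pred = bounded_dijkstra(["CT", "AGT", "G"], U=7)
print(cost, len(pred))
```

A tighter U leaves fewer nodes to create; if U is smaller than the optimal cost, the sink is never created and the function returns `None` as cost.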

[Figure: compute optimal alignments, combine them into a heuristic alignment, and use the heuristic alignment value as upper bound for computing the optimal alignment.]

6.10 Summary

• Divide-and-conquer alignment needs the computation of good cut positions and a module to compute an optimal sum-of-pairs alignment.
• The naïve generalization of the Needleman-Wunsch algorithm to more than two sequences is not practical.
• If we view the DP matrix as a k-dimensional edit graph, we can apply standard branch-and-bound techniques to the graph.
• We can build the graph on the fly and explore it using a modification of Dijkstra’s algorithm.
• The standard branch-and-bound can be augmented by Carrillo-Lipman bounding.
• The DCA and branch-and-bound techniques can be combined to provide a hybrid algorithm.
