
6. MSA using DCA and branch-and-bound - Algorithms in ...


Comp. Sequence Analysis WS’04 ZBIT, D. Huson, (script by K. Reinert), December 17, 2004

6 Advanced Multiple Alignment (script by K. Reinert)

This exposition is based on the following sources, which are recommended reading:

1. Reinert, Stoye, Will, An iterative method for faster sum-of-pairs multiple sequence alignment, Bioinformatics, 2000, Vol 16, no 9, pages 808-814.
2. Stoye, Divide-and-Conquer Multiple Sequence Alignment, TR 97-02 Universität Bielefeld, 1997.

In this chapter we discuss how to combine three different algorithmic techniques to address the problem of computing good multiple sequence alignments:

• divide-and-conquer,
• dynamic programming and
• branch-and-bound.

First we will describe the DCA (divide-and-conquer) method. Then we will discuss the naïve dynamic programming approach and show how to speed it up using branch-and-bound. Finally, we will discuss a hybrid method that uses both DCA and the branch-and-bound approach.

6.1 The DCA method

Given a set of k sequences $s_1, \dots, s_k$ for which we would like to compute a multiple alignment, DCA is based on the following idea:

Cut each of the k sequences and solve the resulting subproblems recursively. When the problem size is small enough, use an exact algorithm. The solutions can be combined simply by concatenation.

On the one hand, it is clear that optimal cut positions exist; on the other hand, finding them is NP-hard. Trivial ideas for determining the cut positions are not very successful. Hence, the main question is how to choose the cut positions.

This illustrates the main idea:


[Figure: the DCA scheme. The sequences are divided recursively, the resulting small subproblems are aligned optimally, and the partial alignments are concatenated.]

To help determine good cut sites, we will make use of so-called additional cost matrices for each pair of sequences and each possible cut position.

The additional cost of a pair (i, j) is the difference between the cost of the best alignment that goes through positions i and j and the cost of the optimal global alignment.

[Figure: an alignment path of two sequences s and t that is forced through the cut point (i, j).]

Based on this, we will determine C-optimal families of cut positions, for which the (weighted) sum of pairwise additional costs is minimized.

To define this more formally, assume we are given two sequences $s_p = s_{p1} \dots s_{pn}$ and $s_q = s_{q1} \dots s_{qm}$. Let c denote a cost function and let $A^*$ denote an optimal alignment of $s_p$ and $s_q$.

Let (i, j) be a pair of possible cut positions in $s_p$ and $s_q$ with $0 \le i \le n$ and $0 \le j \le m$. We use $\alpha^i_s$ and $\sigma^i_s$ to denote the i-th prefix and the i-th suffix of a string s, respectively, that is, the parts of s before and after cut position i:

[Figure: $s_p$ is cut at position i into the prefix $\alpha^i_{s_p}$ and the suffix $\sigma^i_{s_p}$; $s_q$ is cut at position j into the prefix $\alpha^j_{s_q}$ and the suffix $\sigma^j_{s_q}$.]

If A is an alignment of a prefix of $s_p$ to a prefix of $s_q$, and if B is an alignment of the two remaining suffixes, then let A++B denote the concatenation of the two alignments.

Definition For each pair (i, j) of possible cut positions of $s_p$ and $s_q$ we define the pairwise additional cost with respect to c by:

\[ C_{s_p,s_q}[i,j] := \min\{\, c(A{+}{+}B) \mid A \in \mathcal{A}(\alpha^i_{s_p}, \alpha^j_{s_q}),\ B \in \mathcal{A}(\sigma^i_{s_p}, \sigma^j_{s_q}) \,\} - c(A^*), \]

where $\mathcal{A}(a, b)$ denotes the set of all possible alignments of two sequences a and b.

The matrix $C_{s_p,s_q} := (C_{s_p,s_q}[i,j])_{0 \le i \le |s_p|,\ 0 \le j \le |s_q|}$ is called the additional cost matrix of $s_p$ and $s_q$ with respect to c.

The additional cost matrices can be easily computed using the “forward” and “reverse” pairwise alignment matrices, that is,

\[ C_{s,t}[i,j] = D^f_{s,t}[i,j] + D^r_{s,t}[i,j] - c_{opt}(s,t), \]

where

\[ c_{opt}(s,t) = D^f_{s,t}[|s|,|t|] = D^r_{s,t}[0,0] \]

by definition.

Let's have a look at an example: Let s = CT and t = AGT. We use the unit cost function $d(x, y) = 1 - \delta_{xy}$ and a gap penalty of 1. In the three matrices below, the entry in column i and row j is the value at [i, j], where i is a cut position in s and j is a cut position in t:

            D^f_{s,t}        D^r_{s,t}        C_{s,t}
        i = 0  1  2      i = 0  1  2      i = 0  1  2
    j = 0   0  1  2          2  2  3          0  1  3
    j = 1   1  1  2          1  1  2          0  0  2
    j = 2   2  2  2          1  0  1          1  0  1
    j = 3   3  3  2          2  1  0          3  2  0

As stated above, the main problem is how to determine a good family of k cut positions for the sequences $s_1, s_2, \dots, s_k$.

Analogously to the WSOP score we define the weighted multiple additional cost imposed by forcing the multiple alignment path of the sequences through a particular vertex $(c_1, \dots, c_k)$ in the k-dimensional hypercube.

Definition Suppose that we are given a family of k sequences $s_1, \dots, s_k$ and pairwise weights $w_{i,j}$. The weighted SP multiple additional cost of a family of cut positions $(c_1, \dots, c_k)$ is defined as:

\[ C(c_1, \dots, c_k) = \sum_{1 \le i < j \le k} w_{i,j}\, C_{s_i,s_j}[c_i, c_j]. \]
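
As a minimal sketch of the two definitions above, the following Python code computes the forward and reverse matrices, the additional cost matrix, and the weighted multiple additional cost of a family of cut positions. The function names, the default gap penalty of 1 and the unit cost function are assumptions taken from the example rather than anything the script prescribes. Matrices are indexed as M[i][j], with i a cut position in s and j a cut position in t.

```python
from itertools import combinations


def d(x, y):
    """Unit cost d(x, y) = 1 - delta_xy (assumption: as in the example)."""
    return 0 if x == y else 1


def forward_matrix(s, t, gap=1):
    """D^f[i][j]: optimal cost of aligning the prefixes s[:i] and t[:j]."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append(D[i - 1][j - 1] + d(s[i - 1], t[j - 1]))
            if i > 0:
                candidates.append(D[i - 1][j] + gap)
            if j > 0:
                candidates.append(D[i][j - 1] + gap)
            D[i][j] = min(candidates)
    return D


def reverse_matrix(s, t, gap=1):
    """D^r[i][j]: optimal cost of aligning the suffixes s[i:] and t[j:].

    Obtained from the forward matrix of the reversed strings.
    """
    n, m = len(s), len(t)
    R = forward_matrix(s[::-1], t[::-1], gap)
    return [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def additional_cost_matrix(s, t, gap=1):
    """C[i][j] = D^f[i][j] + D^r[i][j] - c_opt(s, t)."""
    Df, Dr = forward_matrix(s, t, gap), reverse_matrix(s, t, gap)
    c_opt = Df[len(s)][len(t)]
    return [[Df[i][j] + Dr[i][j] - c_opt for j in range(len(t) + 1)]
            for i in range(len(s) + 1)]


def multiple_additional_cost(seqs, cuts, weights=None):
    """Weighted sum of pairwise additional costs for the cut family `cuts`.

    `weights[i][j]` is an optional pairwise weight (default 1 for all pairs).
    """
    total = 0
    for i, j in combinations(range(len(seqs)), 2):
        w = 1 if weights is None else weights[i][j]
        C = additional_cost_matrix(seqs[i], seqs[j])
        total += w * C[cuts[i]][cuts[j]]
    return total
```

For s = CT and t = AGT, `additional_cost_matrix("CT", "AGT")` returns `[[0, 0, 1, 3], [1, 0, 0, 2], [3, 2, 1, 0]]`, which is the matrix C_{s,t} of the example read with i as the row index. Recomputing each pairwise matrix inside `multiple_additional_cost` keeps the sketch short; a real implementation would compute the matrices once and reuse them.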


Additional cost matrices for the three sequences $s_1$ = CT, $s_2$ = AGT and $s_3$ = G (rows: cut position in the first sequence, columns: cut position in the second sequence):

    C_{s_1,s_2}            C_{s_1,s_3}        C_{s_2,s_3}

           A  G  T                G                  G
        0  0  1  3             0  1               0  2
     C  1  0  0  2          C  0  0            A  0  1
     T  3  2  1  0          T  1  0            G  1  0
                                               T  2  0

Set all pairwise weights to 1 and assume $\hat{c}_1 = 1$. It is not difficult to verify that $(c_2, c_3) = (1, 0)$ is C-optimal with respect to $\hat{c}_1$, since

\[ C(1, 1, 0) = C_{s_1,s_2}[1,1] + C_{s_1,s_3}[1,0] + C_{s_2,s_3}[1,0] = 0. \]

Given this cut, the best possible alignment has cost 7. The unique optimal alignment has cost 6 and can be achieved with the cut $(c_2, c_3) = (2, 1)$, which is also C-optimal with respect to $\hat{c}_1$.

    cut (1,1,0), cost 7        cut (1,2,1), cost 6

    C ++ -T = C-T              -C ++ T = -CT
    A ++ GT = AGT              AG ++ T = AGT
    - ++ G- = -G-              -G ++ - = -G-

Assume we have a multiple sequence alignment program MSA that we can use to solve small instances of the problem of aligning k sequences. The DCA algorithm can be summarized as follows (with $n_i = |s_i|$):

Algorithm DCA($s_1, s_2, \dots, s_k$, L)
Input: sequences $s_1, \dots, s_k$ and cut-off L
Output: alignment of $s_1, \dots, s_k$
begin
    if $\max\{n_1, \dots, n_k\} \le L$ then
        return MSA($s_1, s_2, \dots, s_k$)
    else
        Set $\hat{c}_1 := \lceil n_1 / 2 \rceil$
        Set $(c_2, \dots, c_k)$ := C-opt($(s_1, \dots, s_k)$, $\hat{c}_1$)
        return DCA($\alpha^{\hat{c}_1}_{s_1}, \alpha^{c_2}_{s_2}, \dots, \alpha^{c_k}_{s_k}$, L) ++ DCA($\sigma^{\hat{c}_1}_{s_1}, \sigma^{c_2}_{s_2}, \dots, \sigma^{c_k}_{s_k}$, L)
end

The function C-opt can naively be implemented as nested loops that run over all possible values $0, \dots, n_i$ for all i and return the cut positions with the minimal multiple additional cost.

The computation of C-opt requires at most $O(N k^2 n^2)$ steps, with $n = \max_i \{|s_i|\}$ and $N = \prod_i |s_i|$.

A direct improvement of this procedure is to combine the loops with a simple branch-and-bound procedure that cuts off combinations of $c_2, \dots, c_k$ as soon as a partial sum of the additional cost terms exceeds the minimal cost found so far.

6.2 Naïve Dynamic programming

The straightforward extension of the Needleman-Wunsch algorithm to compute an optimal (W)SOP-cost alignment $A^*$ for k sequences $s_1, \dots, s_k$ using dynamic programming takes time $O(n^k)$ for fixed k, with $n = \max_i |s_i|$. Obviously this is only practical for very few, short sequences (k = 3, 4).

Our goal now is to develop a branch-and-bound approach.

In the following we will use the edit graph representation of a (multiple) alignment. In this representation, finding an optimal alignment corresponds to finding a shortest path in a directed acyclic graph (DAG).
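
Returning to the C-opt function of Section 6.1, the nested loops with the partial-sum cutoff might be sketched as a recursive search. This is only an illustration under assumptions: the name `c_opt`, the dictionary layout `C[(i, j)][c_i][c_j]` (0-based sequence indices with i < j) and the optional `weights` argument are hypothetical. The additional cost matrices are assumed to be precomputed, and costs and weights are assumed to be non-negative.

```python
def c_opt(C, lengths, c1_hat, weights=None):
    """Search a family of cut positions with minimal multiple additional cost.

    C[(i, j)][ci][cj] is the pairwise additional cost of cutting sequence i
    at ci and sequence j at cj; lengths[i] is the length of sequence i.
    The first cut position is fixed to c1_hat.
    Returns (cuts, cost), where cuts includes c1_hat as its first entry.
    """
    k = len(lengths)
    best_cost = float("inf")
    best_cuts = None

    def w(i, j):
        return 1 if weights is None else weights[i][j]

    def extend(cuts, partial):
        nonlocal best_cost, best_cuts
        m = len(cuts)  # index of the next sequence to cut
        if m == k:
            # Only reached with partial < best_cost, so this is an improvement.
            best_cost, best_cuts = partial, tuple(cuts)
            return
        for c in range(lengths[m] + 1):
            # Cost contributed by pairing the new cut with all earlier cuts.
            extra = sum(w(i, m) * C[(i, m)][cuts[i]][c] for i in range(m))
            # Bound: with non-negative costs, a partial sum that already
            # reaches the best known cost cannot lead to a better family.
            if partial + extra < best_cost:
                extend(cuts + [c], partial + extra)

    extend([c1_hat], 0)
    return best_cuts, best_cost
```

With the three additional cost matrices of the example in Section 6.1 (sequences CT, AGT and G), `lengths = (2, 3, 1)` and `c1_hat = 1`, the search returns `((1, 1, 0), 0)`. The cut (1, 2, 1) has the same cost but is pruned, because the sketch keeps only the first optimum it finds.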


All considerations can also be done with affine gap costs, but for the sake of exposition we use linear gap costs.

Given sequences $s_1, \dots, s_k$, the corresponding edit graph has nodes

\[ V = \{\, v = (v[1], v[2], \dots, v[k]) \mid v \in \mathbb{N}^k \text{ and } 0 \le v[i] \le n_i,\ 1 \le i \le k \,\} \]

and edges

\[ E = \{\, (p, q) \mid p, q \in V,\ p \ne q \text{ and } q - p \in \{0,1\}^k \setminus \{0\}^k \,\}. \]

We use

\[ p \to q := \{\, (p = v_0, v_1, \dots, q = v_n) \mid (v_i, v_{i+1}) \in E,\ 0 \le i < n \,\} \]

to denote the set of all paths from node p to q. Similarly, we write $p \to q \to r$ for the set of all paths from p to r which run through node q.

The graph contains two special nodes, the source $s = (0, 0, \dots, 0)$ and the sink $t = (n_1, n_2, \dots, n_k)$.

The following example shows a two-dimensional edit graph with source s, sink t and an optimal source-to-sink path $s \to t$ in dashed lines.

[Figure: two-dimensional edit graph with axis labels A N N and C A N, positions 0 to 3, source s and sink t.]

Another example:

[Figure: second edit graph example.]

A path π of length l from $s = (0, \dots, 0)$ to $t = (n_1, \dots, n_k)$ corresponds to an alignment defined through the following matrix:

\[ A_{ij} = \begin{cases} \text{-} & \text{if } v_j[i] - v_{j-1}[i] = 0 \\ s_i[v_j[i]] & \text{if } v_j[i] - v_{j-1}[i] = 1 \end{cases} \qquad \text{for } 1 \le i \le k,\ 1 \le j \le l. \]

The cost of an alignment is the sum of the cost of all edges:

\[ c(\pi) := \sum_{i=0}^{l-1} c(v_i, v_{i+1}) \]

with

\[ c(v, w) := \sum_{1 \le i < j \le k} d(a_{i*}, a_{j*}). \]
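
The edit graph need not be stored explicitly to experiment with these definitions. The sketch below enumerates the predecessors of a node, turns an edge into an alignment column, and finds a shortest source-to-sink path by visiting all nodes in lexicographic order, which is one topological order of the DAG. It assumes the unit cost function d(x, y) = 1 − δ_xy with gap cost 1 and d(-, -) = 0, and all function names are hypothetical.

```python
from itertools import combinations, product

GAP = "-"


def d(x, y):
    """Unit cost; a symbol against a gap costs 1, two gaps cost 0."""
    return 0 if x == y else 1


def parents(q):
    """All nodes p with (p, q) in E: q - p in {0,1}^k minus the zero vector."""
    for step in product((0, 1), repeat=len(q)):
        if any(step):
            p = tuple(a - b for a, b in zip(q, step))
            if min(p) >= 0:
                yield p


def column(seqs, p, q):
    """Alignment column of edge (p, q): a gap where the coordinate stays put,
    otherwise the symbol s_i[q[i]] (1-based in the text, hence q[i] - 1)."""
    return [GAP if q[i] == p[i] else seqs[i][q[i] - 1] for i in range(len(seqs))]


def edge_cost(seqs, p, q):
    """Sum-of-pairs cost of the column that belongs to edge (p, q)."""
    return sum(d(a, b) for a, b in combinations(column(seqs, p, q), 2))


def path_to_alignment(seqs, path):
    """Rows of the alignment that corresponds to a source-to-sink path."""
    cols = [column(seqs, p, q) for p, q in zip(path, path[1:])]
    return ["".join(col[i] for col in cols) for i in range(len(seqs))]


def shortest_path(seqs):
    """Naive dynamic programming over the whole edit graph."""
    source = (0,) * len(seqs)
    sink = tuple(len(s) for s in seqs)
    dist, pred = {source: 0}, {}
    # Lexicographic order visits every predecessor of q before q itself.
    for q in product(*(range(n + 1) for n in sink)):
        if q == source:
            continue
        best = min(parents(q), key=lambda p: dist[p] + edge_cost(seqs, p, q))
        dist[q] = dist[best] + edge_cost(seqs, best, q)
        pred[q] = best
    path = [sink]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return dist[sink], path[::-1]
```

For the illustrative input `["CT", "AGT", "G"]` the sketch finds a path of cost 6, and `path_to_alignment` turns it into the rows `-CT`, `AGT` and `-G-`. Every node of the graph is visited, which is exactly the exponential behaviour that branch-and-bound tries to avoid.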


We overload the notation of c, since an edge in the edit graph corresponds exactly to a column in a multiple alignment. We denote the shortest path in $p \to q$ with $p \to^* q$ and its length with $c(p \to^* q)$.

Any node v in the edit graph cuts each sequence $s_i$ into a prefix $\alpha^v_i := \alpha^{v[i]}_{s_i}$ and a suffix $\sigma^v_i := \sigma^{v[i]}_{s_i}$, for $i = 1, \dots, k$.

6.3 Dynamic programming and branch-and-bound

The edit graph is a DAG (directed acyclic graph) and a shortest path can be found by traversing the DAG in any topological order. This corresponds exactly to the natural extension of dynamic programming.

However, we do not want to construct the whole graph, as it has exponential size. To avoid this problem, we will take Dijkstra’s algorithm, which efficiently finds a shortest path in a graph with non-negative edge costs, and modify it so that (hopefully only) necessary parts of the graph are created on the fly.

Algorithm [Dijkstra]
Input: Graph G = (V, E), source node s, edge cost c
Output: predecessor map π defining shortest path from s
    for each node v do
        Set d[v] := ∞    // initial distance from source
        Set π[v] := 0
    Set d[s] := 0
    Set Q := V
    while Q ≠ ∅ do
        u := ExtractMin(Q)    // extract arg min_{v∈Q} {d[v]}
        for each child v of u do
            if d[v] > d[u] + c(u, v) then    // relax
                d[v] := d[u] + c(u, v)
                π[v] := u

6.4 Branch-and-bound

As noted above, we want to modify the algorithm so that only necessary parts of the graph are explicitly constructed, in the context of branch-and-bound.

Suppose we are given an upper bound U for the length of the shortest path. We modify the algorithm as follows.

• Initially, we set Q := {s}, not Q := V.
• After extracting a node u of minimal distance from Q, we must expand u, that is, create all children of u and, for any such child v, insert it into Q if d[u] + c(u, v) ≤ U holds.

Algorithm [Modified Dijkstra with upper bound U]
Input: Graph G = (V, E), source node s, edge cost c, upper bound U
Output: predecessor map π defining shortest path from s
begin
    Create node s.
    Set d[s] := 0
    Set Q := {s}
    while Q ≠ ∅ do
        u := ExtractMin(Q)    // extract arg min_{v∈Q} {d[v]}
        for each potential child v of u do
            if d[u] + c(u, v) ≤ U then    // bound
                if v has not yet been created then    // create
                    Create v
                    Set d[v] := ∞
                    Insert v into Q
                if d[v] > d[u] + c(u, v) then    // relax
                    Set d[v] := d[u] + c(u, v)
                    Set π[v] := u
end

6.5 Projections

We will discuss how to obtain a useful bound U.

First we need the definition of multiple alignments of subsets of the k strings.

Definition Let A be a multiple alignment for the k strings $s_1, \dots, s_k$ and let $I \subseteq \{1, \dots, k\}$ be a set of indices defining a subset of the k strings. Then we define $A_I$ as the alignment resulting from first removing all rows $i \notin I$ from A and then deleting all columns consisting entirely of blanks. We call $A_I$ the projection of A to I. If I is given explicitly, we simplify notation and write $A_{i,j,k}$ instead of $A_{\{i,j,k\}}$.

Example

    a1 = - G C T G A T A T A G C T
    a2 = G G G T G A T - T A G C T
    a3 = - G C T - A T - - C G C -
    a4 = A G C G G A - A C A C C T

The projection $A_{2,3}$ is given by first taking the second and third row of the alignment:

    a2 = G G G T G A T - T A G C T
    a3 = - G C T - A T - - C G C -

and then deleting the column consisting only of blanks:

    a2 = G G G T G A T T A G C T
    a3 = - G C T - A T - C G C -

6.6 Standard-Bounding

Assume we have for each node v of the edit graph a lower bound $L(v \to t)$ for the cost of the shortest path in $v \to t$. For example

\[ L(v \to t) := \sum_{1 \le i < j \le k} c(A^*(\sigma^v_i, \sigma^v_j)) \]
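
To make the last sections concrete, here is a hedged Python sketch of the modified Dijkstra procedure of Section 6.4. It creates nodes on the fly and discards every child v with d[u] + c(u, v) > U. As an optional extra it can add the pairwise lower bound L(v → t) to that test; using the lower bound in this way is an assumption of the sketch, not something stated in the text above. The unit cost function with gap cost 1 and all names (`bounded_dijkstra`, `reverse_matrix`, `use_lower_bound`) are likewise assumptions.

```python
import heapq
from itertools import combinations, product

GAP = "-"


def d(x, y):
    """Unit cost; a symbol against a gap costs 1, two gaps cost 0."""
    return 0 if x == y else 1


def reverse_matrix(s, t):
    """D^r[i][j]: optimal cost of aligning the suffixes s[i:] and t[j:]."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            candidates = []
            if i < n and j < m:
                candidates.append(D[i + 1][j + 1] + d(s[i], t[j]))
            if i < n:
                candidates.append(D[i + 1][j] + 1)
            if j < m:
                candidates.append(D[i][j + 1] + 1)
            D[i][j] = min(candidates)
    return D


def bounded_dijkstra(seqs, upper, use_lower_bound=False):
    """Return (cost of a shortest source-to-sink path or None, predecessor
    map, number of created nodes). `upper` must not be below the optimum,
    otherwise the sink is never created and the cost is None."""
    k = len(seqs)
    source = (0,) * k
    sink = tuple(len(s) for s in seqs)
    pairs = list(combinations(range(k), 2))
    rev = {(i, j): reverse_matrix(seqs[i], seqs[j]) for i, j in pairs}

    def lower(v):
        # L(v -> t): sum of optimal pairwise costs of the remaining suffixes.
        if not use_lower_bound:
            return 0
        return sum(rev[i, j][v[i]][v[j]] for i, j in pairs)

    dist = {source: 0}  # holds exactly the nodes created so far
    pred = {}
    done = set()
    queue = [(0, source)]
    while queue:
        du, u = heapq.heappop(queue)
        if u in done:  # outdated queue entry
            continue
        done.add(u)
        for step in product((0, 1), repeat=k):
            if not any(step):
                continue
            v = tuple(a + b for a, b in zip(u, step))
            if any(v[i] > len(seqs[i]) for i in range(k)):
                continue  # v lies outside the edit graph
            col = [seqs[i][u[i]] if step[i] else GAP for i in range(k)]
            cost = du + sum(d(col[i], col[j]) for i, j in pairs)
            if cost + lower(v) > upper:  # bound
                continue
            if cost < dist.get(v, float("inf")):  # create and relax
                dist[v] = cost
                pred[v] = u
                heapq.heappush(queue, (cost, v))
    return dist.get(sink), pred, len(dist)
```

For example, `bounded_dijkstra(["CT", "AGT", "G"], 7)` returns cost 6. The binary heap with lazily skipped outdated entries stands in for ExtractMin and a decrease-key operation, which is a common way to write Dijkstra's algorithm in Python.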


[Figure: compute optimal alignments, combine them into a heuristic alignment, and use the value of the heuristic alignment as an upper bound for computing the optimal alignment.]

6.10 Summary

• Divide-and-conquer alignment needs the computation of good cut positions and a module to compute an optimal sum-of-pairs alignment.
• The naïve generalization of the Needleman-Wunsch algorithm to more than two sequences is not practical.
• If we view the DP matrix as a k-dimensional edit graph, we can apply standard branch-and-bound techniques to the graph.
• We can build the graph on the fly and explore it using a modification of Dijkstra’s algorithm.
• The standard branch-and-bound can be augmented by Carrillo-Lipman bounding.
• The DCA and branch-and-bound techniques can be combined to provide a hybrid algorithm.
