6. MSA using DCA and branch-and-bound - Algorithms in ...

Comp. Sequence Analysis WS’04 ZBIT, D. Huson, (script by K. Reinert), December 17, 2004

Additional-cost matrices for the example (rows index the cut position in the first sequence, columns the cut position in the second):

    C_{s_1,s_2}         A  G  T
                     0  0  1  3
                  C  1  0  0  2
                  T  3  2  1  0

    C_{s_1,s_3}         G
                     0  1
                  C  0  0
                  T  1  0

    C_{s_2,s_3}         G
                     0  2
                  A  0  1
                  G  1  0
                  T  2  0

Set all pairwise weights to 1 and assume $\hat{c}_1 = 1$. It is not difficult to verify that $(c_2, c_3) = (1, 0)$ is C-optimal with respect to $\hat{c}_1$, since

$$C(1, 1, 0) = C_{s_1,s_2}[1, 1] + C_{s_1,s_3}[1, 0] + C_{s_2,s_3}[1, 0] = 0.$$

Given this cut, the best possible alignment has cost 7. The unique optimal alignment has cost 6 and can be achieved with the cut $(c_2, c_3) = (2, 1)$, which is also C-optimal with respect to $\hat{c}_1$.

    cut (1,1,0), cost 7            cut (1,2,1), cost 6

    C    -T   C-T                  -C    T   -CT
    A ++ GT = AGT                  AG ++ T = AGT
    -    G-   -G-                  -G    -   -G-

Assume we have a multiple sequence alignment program MSA that we can use to solve small instances of the problem of aligning $k$ sequences. The DCA algorithm can be summarized as follows:

    Algorithm DCA(s_1, s_2, ..., s_k, L)
    Input: sequences s_1, ..., s_k and cut-off L
    Output: alignment of s_1, ..., s_k
    begin
      if max{n_1, ..., n_k} ≤ L then
        return MSA(s_1, s_2, ..., s_k)
      else
        Set ĉ_1 := ⌈n_1 / 2⌉
        Set (c_2, ..., c_k) := C-opt((s_1, ..., s_k), ĉ_1)
        return DCA(α^{ĉ_1}_{s_1}, α^{c_2}_{s_2}, ..., α^{c_k}_{s_k}, L) ++
               DCA(σ^{ĉ_1}_{s_1}, σ^{c_2}_{s_2}, ..., σ^{c_k}_{s_k}, L)
    end

Here α denotes the prefix and σ the suffix of a sequence at the given cut position.

The function C-opt can naively be implemented as nested loops that run over all possible values $0, \dots, n_i$ for every $i$ and return the cut positions with minimal multiple additional cost.

The computation of C-opt requires at most $O(N k^2 n^2)$ steps, with $n = \max_i \{|s_i|\}$ and $N = \prod_i |s_i|$.

A direct improvement of this procedure is to combine the loops with a simple branch-and-bound procedure that cuts off combinations of $c_2, \dots, c_k$ as soon as a partial sum of the additional cost term exceeds the minimal cost found so far.

6.2 Naïve Dynamic programming

The straightforward extension of the Needleman-Wunsch algorithm to compute an optimal (W)SOP-cost alignment $A^*$ for $k$ sequences $s_1, \dots, s_k$ using dynamic programming takes time $O(n^k)$, with $n = \min_i |s_i|$. Obviously this is only practical for very few, short sequences ($k = 3, 4$).

Our goal now is to develop a branch-and-bound approach.

In the following we will use the edit graph representation of a (multiple) alignment. In this representation, finding an optimal alignment corresponds to finding a shortest path in a directed acyclic graph (DAG).
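The C-opt search with branch-and-bound pruning described for DCA above can be sketched as follows. This is a minimal illustration, not the script's implementation: it assumes unit costs (mismatch 1, gap 1), all pairwise weights equal to 1, and that the additional cost of a pairwise cut is the cost of the best alignment forced through that cut minus the optimal pairwise cost. The names `edit_matrix`, `additional_cost` and `c_opt` are hypothetical.

```python
from itertools import combinations


def edit_matrix(a, b):
    """Forward DP matrix: d[i][j] = unit-cost edit distance of a[:i] and b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j  # only gaps
            else:
                d[i][j] = min(d[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                              d[i - 1][j] + 1,
                              d[i][j - 1] + 1)
    return d


def additional_cost(a, b):
    """C[i][j] = (best cost through cut (i, j)) - (optimal cost). Assumed definition."""
    fwd = edit_matrix(a, b)
    rev = edit_matrix(a[::-1], b[::-1])  # rev[x][y]: distance of the last x and last y chars
    opt = fwd[len(a)][len(b)]
    return [[fwd[i][j] + rev[len(a) - i][len(b) - j] - opt
             for j in range(len(b) + 1)]
            for i in range(len(a) + 1)]


def c_opt(seqs, c1):
    """Find cut positions (c1, c_2, ..., c_k) with minimal summed additional cost."""
    k = len(seqs)
    C = {(p, q): additional_cost(seqs[p], seqs[q])
         for p, q in combinations(range(k), 2)}
    best = [float("inf"), None]  # [minimal cost so far, cut achieving it]

    def extend(cut, partial):
        if partial >= best[0]:
            return  # bound: this partial sum cannot beat the best cut found so far
        q = len(cut)
        if q == k:
            best[0], best[1] = partial, tuple(cut)
            return
        for c in range(len(seqs[q]) + 1):
            # cost added by pairing the new cut position with all fixed ones
            added = sum(C[(p, q)][cut[p]][c] for p in range(q))
            extend(cut + [c], partial + added)

    extend([c1], 0)
    return best[1], best[0]
```

With the example sequences, `c_opt(["CT", "AGT", "G"], 1)` returns `((1, 1, 0), 0)`. Ties are pruned, so the second C-optimal cut (1, 2, 1) is not reported; collecting all minimal cuts would require pruning only on strictly larger partial sums.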
All considerations can also be done with affine gap costs, but for the sake of exposition we use linear gap costs.

Given sequences $s_1, \dots, s_k$, the corresponding edit graph has nodes

$$V = \{\, v = (v[1], v[2], \dots, v[k]) \mid v \in \mathbb{N}^k \text{ and } 0 \le v[i] \le n_i,\ 1 \le i \le k \,\}$$

and edges

$$E = \{\, (p, q) \mid p, q \in V,\ p \ne q \text{ and } q - p \in \{0, 1\}^k \setminus \{0\}^k \,\}.$$

We use

$$p \to q := \{\, (p = v_0, v_1, \dots, q = v_n) \mid (v_i, v_{i+1}) \in E,\ 0 \le i < n \,\}$$

to denote the set of all paths from node $p$ to $q$. Similarly, we write $p \to q \to r$ for the set of all paths from $p$ to $r$ that run through node $q$.

The graph contains two special nodes: the source $s = (0, 0, \dots, 0)$ and the sink $t = (n_1, n_2, \dots, n_k)$.

The following example shows a two-dimensional edit graph with source $s$, sink $t$ and an optimal source-to-sink path $s \to t$ in dashed lines.

[Figure: two-dimensional edit graph, axes labeled A N N and C A N with positions 0 to 3, source s and sink t.]

Another example: [figure not recovered]

A path $\pi$ of length $l$ from $s = (0, \dots, 0)$ to $t = (n_1, \dots, n_k)$ corresponds to an alignment defined through the following matrix:

$$A_{ij} = \begin{cases} \text{-} & \text{if } v_j[i] - v_{j-1}[i] = 0 \\ s_i[v_j[i]] & \text{if } v_j[i] - v_{j-1}[i] = 1 \end{cases} \qquad \text{for } 1 \le i \le k,\ 1 \le j \le l.$$

The cost of an alignment is the sum of the costs of all edges:

$$c(\pi) := \sum_{i=0}^{l-1} c(v_i, v_{i+1})$$

with

$$c(v, w) := \sum_{1 \le i \ldots} d(a_{i*}, a_{j*}).$$
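The edit graph definitions above translate fairly directly into code. The sketch below is illustrative only: it enumerates the outgoing edges of a node, converts a source-to-sink path into alignment rows following the matrix $A_{ij}$, and sums column costs over all pairs of rows. The pairwise sum and the unit cost function `d` (0 for equal symbols, 1 otherwise) are assumptions, as are the function names.

```python
from itertools import combinations, product


def successors(v, lengths):
    """Yield every q with (v, q) in E, i.e. q - v in {0,1}^k without the zero vector."""
    for step in product((0, 1), repeat=len(v)):
        if any(step):
            q = tuple(a + b for a, b in zip(v, step))
            if all(qi <= n for qi, n in zip(q, lengths)):
                yield q


def path_to_alignment(path, seqs):
    """Turn a path v_0, ..., v_l into one alignment row per sequence."""
    rows = [[] for _ in seqs]
    for prev, cur in zip(path, path[1:]):
        for i, s in enumerate(seqs):
            # a step of 1 in coordinate i emits s_i[v_j[i]] (1-based), a step of 0 a gap
            rows[i].append(s[cur[i] - 1] if cur[i] - prev[i] == 1 else "-")
    return ["".join(r) for r in rows]


def path_cost(path, seqs, d=lambda a, b: int(a != b)):
    """Sum the (assumed) pairwise column costs over all edges of the path."""
    rows = path_to_alignment(path, seqs)
    pairs = list(combinations(range(len(seqs)), 2))
    return sum(d(col[i], col[j]) for col in zip(*rows) for i, j in pairs)
```

For the sequences CT, AGT and G, the path `[(0, 0, 0), (0, 1, 0), (1, 2, 1), (2, 3, 1)]` yields the rows `-CT`, `AGT`, `-G-` with cost 6 under these assumptions, while `[(0, 0, 0), (1, 1, 0), (1, 2, 1), (2, 3, 1)]` yields `C-T`, `AGT`, `-G-` with cost 7.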