New Efficient Algorithms for LCS and Constrained LCS Problem

Costas S. Iliopoulos and M. Sohel Rahman

Algorithm Design Group
Department of Computer Science, King's College London,
Strand, London WC2R 2LS, England
{sohel,csi}@dcs.kcl.ac.uk
http://www.dcs.kcl.ac.uk/adg

Abstract. In this paper, we study the classic and well-studied longest common subsequence (LCS) problem and a recent variant of it, namely the constrained LCS (CLCS) problem. In CLCS, the computed LCS must also be a supersequence of a third given string. We first present an efficient algorithm for the traditional LCS problem that runs in O(R log log n + n) time, where R is the total number of ordered pairs of positions at which the two strings match and n is the length of the two given strings. Then, using this algorithm, we devise an algorithm for the CLCS problem having time complexity O(pR log log n + n) in the worst case, where p is the length of the third string. Note that if R = o(n²), our algorithms perform very well, but if R = Θ(n²) then, due to the log log n term, they behave slightly worse than the existing algorithms.

1 Introduction

The longest common subsequence (LCS) problem is one of the classical and well-studied problems in computer science, having extensive applications in diverse areas. Perhaps the strongest motivation for the LCS problem and variants thereof, to this date, comes from computational molecular biology.
The LCS problem is a common task in DNA sequence analysis and has applications in genetics and molecular biology. Variants of the LCS problem have been used to study similarity in RNA secondary structure [6, 15, 10, 17]. Very recently, Bereg and Zhu [4] presented a new model for RNA multiple sequence structural alignment based on the longest common subsequence. Several other variants with applications in computational biology have been studied in the literature recently [18, 24, 25, 3, 2].

In this paper, we study the traditional LCS problem along with an interesting and newer variant of it, namely the Constrained LCS (CLCS) problem. In CLCS, the computed LCS must also be a supersequence of a third (given) string [25, 8, 3, 2]. This problem also finds its motivation in bioinformatics: in the computation of the homology of two biological sequences, it is important to take into account a common specific or putative structure [25].

The longest common subsequence problem for k strings (k > 2) was first shown to be NP-hard [19] and later proved to be hard to approximate [14]. The restricted, but probably more studied, problem that deals with two strings has been studied extensively [21, 26, 23, 22, 20, 13, 12, 11]. The classic dynamic programming solution to


the LCS problem, invented by Wagner and Fischer [26], has O(n²) worst-case running time, where n is the length of the two strings. Masek and Paterson [20] improved this algorithm using the "Four-Russians" technique [1] to reduce the worst-case running time to O(n²/ log n)¹. Since then, not much improvement in terms of n can be found in the literature. However, several algorithms exist with complexities depending on other parameters. For example, Myers [22] and Nakatsu et al. [23] presented O(nD) algorithms, where the parameter D is the simple Levenshtein distance between the two given strings [16]. Another interesting and perhaps more relevant parameter for this problem is R, the total number of ordered pairs of positions at which the two strings match. Hunt and Szymanski [13] presented an algorithm running in O((R + n) log n) time. They also cited applications where R ∼ n, and thereby claimed that for these applications the algorithm would run in O(n log n) time. For a comprehensive comparison of the well-known algorithms for the LCS problem and a study of their behaviour in various application environments, the reader is referred to [5].

The CLCS problem, on the other hand, was introduced quite recently by Tsai [25], where an algorithm was presented solving the problem in O(pn⁴) time, where p is the length of the third string, which applies the constraint. Later, Chin et al. [8] and, independently, Arslan and Eğecioğlu [2, 3] presented improved algorithms with O(n²p) time and space complexity. In this paper we present two new algorithms: one for the LCS and one for the CLCS problem.
In particular, we first devise an efficient algorithm for the traditional LCS problem that runs in O(R log log n + n) time. Then, using this algorithm, we devise an algorithm for the CLCS problem having time complexity O(pR log log n + n) in the worst case. Note that if R = o(n²), our algorithms perform very well, but if R = Θ(n²) then, due to the log log n term, they behave slightly worse than the existing algorithms.

The rest of the paper is organized as follows. In Section 2, we present the preliminary concepts. Section 3 is devoted to a new O(R log log n + n) time algorithm for the LCS problem. In Section 4, using the algorithm of Section 3, we devise a new efficient algorithm to solve the CLCS problem, which runs in O(pR log log n + n) time. Finally, we briefly conclude in Section 5.

2 Preliminaries

Suppose we are given two strings X[1..n] = X[1] X[2] ... X[n] and Y[1..n] = Y[1] Y[2] ... Y[n]. A subsequence S[1..r] = S[1] S[2] ... S[r] of X is obtained by deleting n − r symbols from X. A common subsequence of two strings X and Y, denoted CS(X, Y), is a subsequence common to both X and Y. The longest common subsequence of X and Y, denoted LCS(X, Y), is a common subsequence of maximum length. We denote the length of LCS(X, Y) by L(X, Y).

Problem "LCS". Given two strings X and Y, we want to find an LCS(X, Y).

¹ Employing different techniques, the same worst-case bound was achieved in [9]. In particular, for most texts, the time complexity achieved in [9] is O(hn²/ log n), where h ≤ 1 is the entropy of the text.
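For concreteness, the classic quadratic dynamic program of Wagner and Fischer [26] mentioned in the introduction can be sketched as follows. This is only the O(n²) baseline against which the new algorithms are compared, not the algorithm of Section 3:

```python
def lcs_length(X, Y):
    """Classic Wagner-Fischer DP: O(|X| * |Y|) time and space."""
    n, m = len(X), len(Y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if X[i - 1] == Y[j - 1]:
                # The matched symbol extends an LCS of both prefixes.
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                # Drop the last symbol of one string or the other.
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]
```

For example, `lcs_length("TCCACA", "ACCAAG")` returns 4, in line with Example 1 below.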


Given two strings X[1..n] and Y[1..n] and a third string Z[1..p], a CS(X, Y) is said to be constrained by Z if, and only if, Z is a subsequence of that CS(X, Y). We use CS_Z(X, Y) to denote a common subsequence of X and Y constrained by Z. Then the longest common subsequence of X and Y constrained by Z is a CS_Z(X, Y) of maximum length and is denoted by LCS_Z(X, Y). We denote the length of LCS_Z(X, Y) by L_Z(X, Y). It is easy to note that L_Z(X, Y) ≤ L(X, Y).

Problem "CLCS". Given two strings X and Y and another string Z, we want to find an LCS_Z(X, Y).

X = T C C A C A        X = T C C A C A
Y = A C C A A G        Y = A C C A A G
                       Z = A C

Fig. 1. LCS(X, Y) and LCS_Z(X, Y) of Example 1.

Example 1. Suppose X = TCCACA, Y = ACCAAG and Z = AC. As is evident from Fig. 1, S_1 = CCAA is an LCS(X, Y). However, S_1 is not an LCS_Z(X, Y), because Z is not a subsequence of S_1. On the other hand, S_2 = ACA is an LCS_Z(X, Y). Note carefully that, in this case, L_Z(X, Y) < L(X, Y).

In this paper we use the following notions. We say a pair (i, j), 1 ≤ i, j ≤ n, defines a match if X[i] = Y[j]. The set of all matches, M, is defined as follows:

M = {(i, j) | X[i] = Y[j], 1 ≤ i, j ≤ n}.

We define |M| = R.

In what follows we assume that |X| = |Y| = n, but our results can easily be extended to the case |X| ≠ |Y|.

3 A New Algorithm for Problem LCS

In this section, we present a new efficient algorithm to solve Problem LCS, combining an interesting data structure of [7] with the techniques of [24].
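The quantities defined above can be checked on Example 1 by brute force. The sketch below enumerates all 2^|X| subsequences of X, so it is only viable for toy strings, but it makes M, R, L(X, Y) and L_Z(X, Y) concrete:

```python
from itertools import combinations

def is_subseq(s, t):
    """True iff s is a subsequence of t."""
    it = iter(t)
    return all(c in it for c in s)

X, Y, Z = "TCCACA", "ACCAAG", "AC"

# The match set M (1-based positions) and R = |M|.
M = {(i + 1, j + 1)
     for i, x in enumerate(X) for j, y in enumerate(Y) if x == y}
R = len(M)

# Every subsequence of X (2^|X| candidates -- toy sizes only),
# restricted to those that are also subsequences of Y.
subs = {"".join(X[i] for i in idx)
        for r in range(len(X) + 1)
        for idx in combinations(range(len(X)), r)}
common = {s for s in subs if is_subseq(s, Y)}

L = max(len(s) for s in common)                        # L(X, Y)
L_Z = max(len(s) for s in common if is_subseq(Z, s))   # L_Z(X, Y)
```

Here R = 12, L(X, Y) = 4 (e.g. CCAA) and L_Z(X, Y) = 3 (e.g. ACA), matching Example 1 and the inequality L_Z(X, Y) ≤ L(X, Y).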
The resulting algorithm will be applied to obtain an efficient algorithm for Problem CLCS later, in Section 4. We follow the dynamic programming formulation given in [24] and improve its running time from O(n² + R) to O(R log log n + n). The dynamic programming formulation from [24] is as follows:


T[i, j] =
  undefined                                                          if (i, j) ∉ M,
  1                                                                  if (i = 1 or j = 1) and (i, j) ∈ M,
  1 + max{T[l_i, l_j] | (l_i, l_j) ∈ M, 1 ≤ l_i < i, 1 ≤ l_j < j}    otherwise.      (1)

Fact 1. ([24]) Suppose (i, j), (i′, j) ∈ M (resp. (i, j), (i, j′) ∈ M) with i′ > i (resp. j′ > j). Then we must have T[i′][j] ≥ T[i][j] (resp. T[i][j′] ≥ T[i][j]).

Fact 2. ([24]) The calculation of a T[i][j], (i, j) ∈ M, 1 ≤ i, j ≤ n, is independent of any T[l][q], (l, q) ∈ M, with l = i, 1 ≤ q ≤ n.

Along with the above two facts, we use the 'BoundedHeap' data structure of [7], which supports the following operations:

Insert(H, pos, val, data): Insert into the BoundedHeap H the position pos with value val and associated information data.

IncreaseValue(H, pos, val, data): If H does not already contain the position pos, perform Insert(H, pos, val, data). Otherwise, set this position's value to max{val, val′}, where val′ is its previous value, and change the 'data' field accordingly.

BoundedMax(H, pos): Return the item (with additional data) that has maximum value among all items in H with position smaller than pos. If H does not contain any item with position smaller than pos, return 0.

The following theorem from [7] presents the time complexity of the BoundedHeap data structure.

Theorem 1. ([7]) The BoundedHeap data structure can support each of the above operations in O(log log n) amortized time, where keys are drawn from the set {1, . . . , n}. The data structure requires O(n) space.

The algorithm AlgLCSNew proceeds as follows. Recall that we perform a row-by-row operation. We always deal with two BoundedHeap data structures. While considering row i, we already have the BoundedHeap data structure H_{i-1} at hand; now we construct the BoundedHeap data structure H_i. At first, H_i is initialized

² In [24], the running time of this preprocessing step is reported as O(R log log n) under the usual assumption that R ≥ n.
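To make the interface concrete, here is a deliberately naive Python stand-in for the BoundedHeap. This is an illustration, not the data structure of [7]: each operation below costs O(n), whereas the real structure supports all three in O(log log n) amortized time for keys from {1, . . . , n}:

```python
class NaiveBoundedHeap:
    """Naive stand-in for the BoundedHeap interface of [7].

    Operations here take O(n) time each; the real structure of [7]
    achieves O(log log n) amortized time and O(n) space.
    """

    def __init__(self):
        self.items = {}  # position -> (value, data)

    def insert(self, pos, val, data):
        self.items[pos] = (val, data)

    def increase_value(self, pos, val, data):
        # Insert pos, or raise its value to max{val, val'}; the data
        # field changes only when the value actually increases.
        if pos not in self.items or self.items[pos][0] < val:
            self.items[pos] = (val, data)

    def bounded_max(self, pos):
        # Best (value, data) among positions strictly smaller than pos;
        # (0, None) plays the role of the '0' returned by [7].
        best = (0, None)
        for p, (v, d) in self.items.items():
            if p < pos and v > best[0]:
                best = (v, d)
        return best
```

For instance, after inserting position 3 with value 5, `bounded_max(4)` reports that item, while `bounded_max(2)` reports `(0, None)` because no stored position is smaller than 2.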


to H_{i-1}. Assume that we are considering the match (i, j) ∈ M_i, 1 ≤ j ≤ n, where M_i = {(i, j) | (i, j) ∈ M, 1 ≤ j ≤ n}. We calculate T[i, j] as follows:

T[i, j].Value = BoundedMax(H_{i-1}, j).Value + 1,
T[i, j].Prev = BoundedMax(H_{i-1}, j).data³.

Then we perform the following operation:

IncreaseValue(H_i, j, T[i, j].Value, (i, j)).

The correctness of the above procedure follows from Facts 1 and 2. Due to Fact 1, as soon as we compute the T-value of a new match in a column j, we can forget about the previous matches of that column. So, as soon as we compute T[i, j] in row i, we insert it into H_i to update it for the next row, i.e. i + 1. And due to Fact 2, we can use H_{i-1} for the computation of the T-values of the matches in row i and do the updates in H_i (initialized at first to H_{i-1}) to make H_i ready for row i + 1.

Next, we analyze the running time of AlgLCSNew. The preprocessing requires O(R log log n + n) time to get the set M in the required order [24]². Then we calculate each (i, j) ∈ M according to Equation 1 using the BoundedHeap data structure. Note carefully that we need to use the two operations, BoundedMax() and IncreaseValue(), once each for each of the matches in M. So, in total, the running time is O(R log log n + n). The space requirement is as follows. The preprocessing step requires Θ(max{R, n}) space [24]. And, in the main algorithm, we only need to keep track of two BoundedHeap data structures at a time, requiring O(n) space (Theorem 1). So, in total, the space requirement is Θ(max{R, n}).

Theorem 2. Problem LCS can be solved in O(R log log n + n) time requiring Θ(max{R, n}) space.

Algorithm 1 formally presents the algorithm.
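The row-by-row scheme can also be sketched in Python. To keep the sketch self-contained, the BoundedHeap is replaced by a plain dict scanned linearly (an assumption made here for illustration), so this version costs O(Rn) time rather than O(R log log n + n); the control flow, however, mirrors the procedure just described:

```python
def alg_lcs_new(X, Y):
    """Row-by-row LCS over the match set M, in the style of AlgLCSNew.

    H maps a column j to (value, match): the best T-value of any match
    seen in column j in earlier rows (keeping only the best per column
    is safe by Fact 1). A real implementation would use the
    BoundedHeap of [7] here instead of the linear scans below.
    """
    H = {}
    prev = {}                       # match -> predecessor match (Prev links)
    best_val, best_match = 0, None  # plays the role of globalLCS
    for i in range(1, len(X) + 1):
        row_updates = []
        for j in range(1, len(Y) + 1):
            if X[i - 1] != Y[j - 1]:
                continue            # (i, j) is not a match
            # BoundedMax(H_{i-1}, j): best value among columns < j.
            cand = [(v, m) for col, (v, m) in H.items() if col < j]
            v, m = max(cand, key=lambda vm: vm[0]) if cand else (0, None)
            t = v + 1
            row_updates.append((j, t))
            prev[(i, j)] = m
            if t > best_val:
                best_val, best_match = t, (i, j)
        # Apply the row's IncreaseValue updates only after the row is
        # done, so row i never reads its own entries (Fact 2).
        for j, t in row_updates:
            if j not in H or H[j][0] < t:
                H[j] = (t, (i, j))
    # Recover an actual LCS by following the Prev links.
    s, m = [], best_match
    while m is not None:
        s.append(X[m[0] - 1])
        m = prev[m]
    return best_val, "".join(reversed(s))
```

On Example 1, `alg_lcs_new("TCCACA", "ACCAAG")` returns `(4, "CCAA")`. Buffering each row's updates in `row_updates` is the sketch's substitute for keeping the two structures H_{i-1} and H_i side by side.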
Since we are not calculating all the entries of the table, we, of course, need to use variables to keep track of the actual LCS.

4 CLCS Algorithm

In this section, we present the new algorithm for Problem CLCS. We use the dynamic programming formulation for CLCS presented in [2, 3]. Extending our tabular notation from Equation 1, we use T[i, j, k], 1 ≤ i, j ≤ n, 0 ≤ k ≤ p, to denote L_{Z[1..k]}(X[1..i], Y[1..j]). We have the following formulation for Problem CLCS from [2, 3]:

T[i, j, k] = max{T′[i, j, k], T″[i, j, k], T[i, j − 1, k], T[i − 1, j, k]}      (2)

³ Note that we use the 'data' field to keep the 'address' of the match.


Algorithm 1 AlgLCSNew
 1: Construct the set M using Algorithm 1 of [24]. Let M_i = {(i, j) | (i, j) ∈ M, 1 ≤ j ≤ n}.
 2: globalLCS.Instance = ε
 3: globalLCS.Value = 0
 4: H_0 = ∅
 5: for i = 1 to n do
 6:   H_i = H_{i-1}
 7:   for each (i, j) ∈ M_i do
 8:     maxresult = BoundedMax(H_{i-1}, j)
 9:     T[i, j].Value = maxresult.Value + 1
10:     T[i, j].Prev = maxresult.Instance
11:     if globalLCS.Value < T[i, j].Value then
12:       globalLCS.Value = T[i, j].Value
13:       globalLCS.Instance = (i, j)
14:     end if
15:     IncreaseValue(H_i, j, T[i, j].Value, (i, j))
16:   end for
17:   Delete H_{i-1}
18: end for
19: return globalLCS

where

T′[i, j, k] =
  1 + T[i − 1, j − 1, k − 1]   if (k = 1 or (k > 1 and T[i − 1, j − 1, k − 1] > 0))
                               and X[i] = Y[j] = Z[k],
  0                            otherwise,      (3)

and

T″[i, j, k] =
  1 + T[i − 1, j − 1, k]       if (k = 0 or T[i − 1, j − 1, k] > 0) and X[i] = Y[j],
  0                            otherwise.      (4)

The following boundary conditions are assumed in Equations 2 to 4:

T[i, 0, k] = T[0, j, k] = 0,   0 ≤ i, j ≤ n, 0 ≤ k ≤ p.

It is straightforward to give an O(pn²) algorithm realizing the dynamic programming formulation presented in Equations 2 to 4. Our goal is to present a new efficient algorithm using the parameter R. In line with Equation 1, we can reformulate Equations 3 and 4 as follows:


V_1 = max{T[l_i, l_j, k − 1] | (l_i, l_j) ∈ M, 1 ≤ l_i < i, 1 ≤ l_j < j},      (5)

T′[i, j, k] =
  1 + V_1   if (k = 1 or (k > 1 and V_1 > 0)) and X[i] = Y[j] = Z[k],
  0         otherwise,      (6)

V_2 = max{T[l_i, l_j, k] | (l_i, l_j) ∈ M, 1 ≤ l_i < i, 1 ≤ l_j < j},      (7)

T″[i, j, k] =
  1 + V_2   if (k = 0 or V_2 > 0) and X[i] = Y[j],
  0         otherwise.      (8)


Algorithm 2 AlgCLCSNew
 1: Construct the set M using Algorithm 1 of [24]. Let M_i = {(i, j) | (i, j) ∈ M, 1 ≤ j ≤ n}.
 2: globalCLCS.Instance = ε
 3: globalCLCS.Value = 0
 4: H^{-1}_0 = ∅
 5: H^0_0 = ∅
 6: for i = 1 to n do
 7:   for k = 0 to p do
 8:     H^k_i = H^k_{i-1}
 9:     for each (i, j) ∈ M_i do
10:       max_1 = BoundedMax(H^{k-1}_{i-1}, j)
11:       max_2 = BoundedMax(H^k_{i-1}, j)
          {First, we consider Equations 7 and 8}
12:       Val_2.Value = 0
13:       Val_2.Instance = ε
14:       if (k = 0) or (max_2.Value > 0) then
15:         Val_2 = max_2
16:       end if
          {Now, we consider Equations 5 and 6}
17:       Val_1.Value = 0
18:       Val_1.Instance = ε
19:       if X[i] = Z[k] then   {this means X[i] = Y[j] = Z[k]}
20:         if (k = 1) or ((k > 1) and (max_1.Value > 0)) then
21:           Val_1 = max_1
22:         end if
23:       end if
          {Finally, we consider Equation 2}
24:       maxresult = max{Val_1, Val_2}
25:       T^k[i, j].Value = maxresult.Value + 1
26:       T^k[i, j].Prev = maxresult.Instance
27:       if globalCLCS.Value < T^k[i, j].Value then
28:         globalCLCS.Value = T^k[i, j].Value
29:         globalCLCS.Instance = (i, j, k)
30:       end if
31:       IncreaseValue(H^k_i, j, T^k[i, j].Value, (i, j, k))
32:     end for
33:     Delete H^{k-1}_{i-1}
34:   end for
35: end for
36: return globalCLCS
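For contrast with Algorithm 2, the straightforward O(pn²) realization of Equations 2 to 4 (the formulation of [2, 3], without the match-set machinery) can be sketched as follows. The encoding assumption is the one implicit in the formulation: for k ≥ 1, a table value of 0 means that no common subsequence containing Z[1..k] exists yet:

```python
def clcs_length(X, Y, Z):
    """Direct DP over Equations 2-4:
    T[k][i][j] = L_{Z[1..k]}(X[1..i], Y[1..j]), with 0 meaning
    'no common subsequence containing Z[1..k] exists' when k >= 1."""
    n1, n2, p = len(X), len(Y), len(Z)
    T = [[[0] * (n2 + 1) for _ in range(n1 + 1)] for _ in range(p + 1)]
    for k in range(p + 1):
        for i in range(1, n1 + 1):
            for j in range(1, n2 + 1):
                # Equation 2: drop a symbol of X or of Y.
                best = max(T[k][i - 1][j], T[k][i][j - 1])
                if X[i - 1] == Y[j - 1]:
                    # T'' (Eq. 4): extend without consuming a symbol of Z.
                    if k == 0 or T[k][i - 1][j - 1] > 0:
                        best = max(best, 1 + T[k][i - 1][j - 1])
                    # T' (Eq. 3): the matched symbol is also Z[k].
                    if k >= 1 and X[i - 1] == Z[k - 1] and \
                            (k == 1 or T[k - 1][i - 1][j - 1] > 0):
                        best = max(best, 1 + T[k - 1][i - 1][j - 1])
                T[k][i][j] = best
    return T[p][n1][n2]
```

On Example 1, `clcs_length("TCCACA", "ACCAAG", "AC")` returns 3; with an empty constraint the recurrence collapses to the classic LCS DP and returns 4.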


– We update H^k_i, which is initialized to H^k_{i-1} when we start row i. H^k_i is later used when we consider the matches of row i + 1 of the two matrices T^k and T^{k+1}.

Due to Fact 4, while computing a T^k[i][j], (i, j) ∈ M, 1 ≤ i, j ≤ n, 1 ≤ k ≤ p, we can employ, with H^k_i, H^k_{i-1} and H^{k-1}_{i-1}, the same technique used in AlgLCSNew (with H_i and H_{i-1}) to compute T[i, j], (i, j) ∈ M, 1 ≤ j ≤ n. On the other hand, courtesy of Fact 3, to realize Equations 5−8 we need only keep track of H^{k-1}_{i-1} and H^k_{i-1}. The steps are formally stated in Algorithm 2. In light of the analysis of AlgLCSNew, it is quite easy to see that the running time is O(pR log log n + n). The space complexity, like that of AlgLCSNew, is dominated by the preprocessing step of [24], because in the main algorithm we only need to keep track of three BoundedHeap data structures and the third string Z, requiring O(n) space in total.

Theorem 3. Problem CLCS can be solved in O(pR log log n + n) time requiring Θ(max{R, n}) space.

5 Conclusion

In this paper, we have studied the classic and well-studied longest common subsequence (LCS) problem and a recent variant of it, namely the constrained LCS (CLCS) problem. In LCS, given two strings, we want to find a common subsequence of maximum length; in CLCS, in addition to that, the solution must also be a supersequence of a third given string. The traditional LCS problem has extensive applications in diverse areas of computer science, including DNA sequence analysis, genetics and molecular biology.
CLCS, a relatively new variant of LCS, gets its motivation from computational bioinformatics as well. In this paper, we have presented an efficient algorithm for the traditional LCS problem that runs in O(R log log n + n) time. Then, using this algorithm, we have devised an algorithm for the CLCS problem having time complexity O(pR log log n + n) in the worst case. Note that if R = o(n²), our algorithms perform very well, but if R = Θ(n²) then, due to the log log n term, they behave slightly worse than the existing algorithms. It would be interesting to see whether our techniques can be employed for other variants of the LCS problem to devise similarly efficient algorithms.

References

1. V. Arlazarov, E. Dinic, M. Kronrod, and I. Faradzev. On economic construction of the transitive closure of a directed graph (English translation). Soviet Math. Dokl., 11:1209–1210, 1975.
2. A. N. Arslan and Ö. Eğecioğlu. Algorithms for the constrained longest common subsequence problems. In M. Simánek and J. Holub, editors, The Prague Stringology Conference, pages 24–32. Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, 2004.
3. A. N. Arslan and Ö. Eğecioğlu. Algorithms for the constrained longest common subsequence problems. Int. J. Found. Comput. Sci., 16(6):1099–1109, 2005.
4. S. Bereg and B. Zhu. RNA multiple structural alignment with longest common subsequences. In L. Wang, editor, Computing and Combinatorics (COCOON), volume 3595 of Lecture Notes in Computer Science, pages 32–41. Springer, 2005.


5. L. Bergroth, H. Hakonen, and T. Raita. A survey of longest common subsequence algorithms. In String Processing and Information Retrieval (SPIRE), pages 39–48. IEEE Computer Society, 2000.
6. G. Blin, G. Fertin, R. Rizzi, and S. Vialette. What makes the arc-preserving subsequence problem hard? In V. S. Sunderam, G. D. van Albada, P. M. A. Sloot, and J. Dongarra, editors, International Conference on Computational Science, volume 3515 of Lecture Notes in Computer Science, pages 860–868. Springer, 2005.
7. G. S. Brodal, K. Kaligosi, I. Katriel, and M. Kutz. Faster algorithms for computing longest common increasing subsequences. In M. Lewenstein and G. Valiente, editors, CPM, volume 4009 of Lecture Notes in Computer Science, pages 330–341. Springer, 2006.
8. F. Y. L. Chin, A. D. Santis, A. L. Ferrara, N. L. Ho, and S. K. Kim. A simple algorithm for the constrained sequence problems. Inf. Process. Lett., 90(4):175–179, 2004.
9. M. Crochemore, G. M. Landau, and M. Ziv-Ukelson. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. In Symposium on Discrete Algorithms (SODA), pages 679–688, 2002.
10. P. Evans. Algorithms and complexity for annotated sequence analysis. Ph.D. thesis, University of Victoria, 1999.
11. F. Hadlock. Minimum detour methods for string or sequence comparison. Congressus Numerantium, 61:263–274, 1988.
12. D. S. Hirschberg. Algorithms for the longest common subsequence problem. J. ACM, 24(4):664–675, 1977.
13. J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Commun. ACM, 20(5):350–353, 1977.
14. T. Jiang and M. Li. On the approximation of shortest common supersequences and longest common subsequences. SIAM Journal on Computing, 24(5):1122–1139, 1995.
15. T. Jiang, G. Lin, B. Ma, and K. Zhang. The longest common subsequence problem for arc-annotated sequences. In R. Giancarlo and D. Sankoff, editors, Combinatorial Pattern Matching (CPM), volume 1848 of Lecture Notes in Computer Science, pages 154–165. Springer, 2000.
16. V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Problems in Information Transmission, 1:8–17, 1965.
17. G.-H. Lin, Z.-Z. Chen, T. Jiang, and J. Wen. The longest common subsequence problem for sequences with nested arc annotations. J. Comput. Syst. Sci., 65(3):465–480, 2002.
18. B. Ma and K. Zhang. On the longest common rigid subsequence problem. In A. Apostolico, M. Crochemore, and K. Park, editors, CPM, volume 3537 of Lecture Notes in Computer Science, pages 11–20. Springer, 2005.
19. D. Maier. The complexity of some problems on subsequences and supersequences. Journal of the ACM, 25(2):322–336, 1978.
20. W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18–31, 1980.
21. V. Mäkinen, G. Navarro, and E. Ukkonen. Transposition invariant string matching. Journal of Algorithms, 56:124–153, 2005.
22. E. W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(2):251–266, 1986.
23. N. Nakatsu, Y. Kambayashi, and S. Yajima. A longest common subsequence algorithm suitable for similar text strings. Acta Inf., 18:171–179, 1982.
24. M. S. Rahman and C. S. Iliopoulos. Algorithms for computing variants of the longest common subsequence problem. In ISAAC, pages 399–408, 2006.
25. Y.-T. Tsai. The constrained longest common subsequence problem. Inf. Process. Lett., 88(4):173–176, 2003.
26. R. A. Wagner and M. J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.
