23.11.2014 Views

Data Structures and Algorithms in Java[1].pdf - Fulvio Frisone


12.5 Text Similarity Testing

A common text processing problem, which arises in genetics and software engineering, is to test the similarity between two text strings. In a genetics application, the two strings could correspond to two strands of DNA, which could, for example, come from two individuals, who we will consider genetically related if they have a long subsequence common to their respective DNA sequences. Likewise, in a software engineering application, the two strings could come from two versions of source code for the same program, and we may wish to determine which changes were made from one version to the next. Indeed, determining the similarity between two strings is considered such a common operation that the Unix and Linux operating systems come with a program, called diff, for comparing text files.

12.5.1 The Longest Common Subsequence Problem

There are several different ways we can define the similarity between two strings. Even so, we can abstract a simple, yet common, version of this problem using character strings and their subsequences. Given a string $X = x_0 x_1 x_2 \cdots x_{n-1}$, a subsequence of $X$ is any string of the form $x_{i_1} x_{i_2} \cdots x_{i_k}$, where $i_j < i_{j+1}$; that is, it is a sequence of characters that are not necessarily contiguous but are nevertheless taken in order from $X$. For example, the string AAAG is a subsequence of the string CGATAATTGAGA. Note that the concept of a subsequence of a string is different from that of a substring of a string, defined in Section 12.1.
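The difference between a subsequence and a substring can be checked directly. The following is a minimal Java sketch, not code from the text; the class name `SubsequenceCheck` and the method `isSubsequence` are hypothetical names chosen for illustration. It scans the longer string once, advancing through the candidate each time a character matches.

```java
final class SubsequenceCheck {
    /**
     * Returns true if s is a subsequence of x, that is, if s can be obtained
     * from x by deleting zero or more characters without reordering the rest.
     * (Hypothetical helper for illustration.)
     */
    static boolean isSubsequence(String s, String x) {
        int i = 0; // index of the next character of s still to be matched
        for (int j = 0; j < x.length() && i < s.length(); j++) {
            if (s.charAt(i) == x.charAt(j)) {
                i++; // matched s[i]; move on to the next character of s
            }
        }
        return i == s.length(); // true only if every character of s was matched in order
    }

    public static void main(String[] args) {
        String x = "CGATAATTGAGA";
        System.out.println(isSubsequence("AAAG", x)); // true: characters appear in order
        System.out.println(x.contains("AAAG"));       // false: not a contiguous substring
        System.out.println(isSubsequence("GC", x));   // false: no C occurs after a G
    }
}
```

The scan takes time proportional to the length of `x`, since each character of `x` is examined at most once.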

Problem Definition

The specific text similarity problem we address here is the longest common subsequence (LCS) problem. In this problem, we are given two character strings, $X = x_0 x_1 x_2 \cdots x_{n-1}$ and $Y = y_0 y_1 y_2 \cdots y_{m-1}$, over some alphabet (such as the alphabet {A, C, G, T} common in computational genetics) and are asked to find a longest string $S$ that is a subsequence of both $X$ and $Y$.

One way to solve the longest common subsequence problem is to enumerate all subsequences of $X$ and take the largest one that is also a subsequence of $Y$. Since each character of $X$ is either in or not in a subsequence, there are potentially $2^n$ different subsequences of $X$, each of which requires $O(m)$ time to determine whether it is a subsequence of $Y$. Thus, this brute-force approach yields an exponential-time algorithm that runs in $O(2^n m)$ time, which is very inefficient. In this section, we discuss how to use an algorithmic design pattern called dynamic programming to solve the longest common subsequence problem much faster than this.
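To make the cost of the brute-force approach concrete, here is a rough Java sketch of it, written for illustration rather than taken from the text. The class `BruteForceLcs` and its methods are hypothetical names. Each subset of positions of X is encoded as a bit mask, the corresponding subsequence is built, and it is then tested against Y. Two details are assumptions of this sketch: the input length is capped at 30 so that the mask fits in an `int`, and masks that cannot beat the current best are skipped, which saves work in practice but leaves the number of masks examined exponential.

```java
final class BruteForceLcs {
    /** Returns true if s is a subsequence of y (single left-to-right scan of y). */
    static boolean isSubsequence(CharSequence s, String y) {
        int i = 0; // next character of s to match
        for (int j = 0; j < y.length() && i < s.length(); j++) {
            if (s.charAt(i) == y.charAt(j)) {
                i++;
            }
        }
        return i == s.length();
    }

    /**
     * Brute-force LCS: tries every subset of positions of x.
     * Only usable for short x, since there are 2^n subsets to consider.
     */
    static String longestCommonSubsequence(String x, String y) {
        int n = x.length();
        if (n > 30) { // assumption of this sketch: keep 1 << n within a positive int
            throw new IllegalArgumentException("x is too long for brute force");
        }
        String best = "";
        for (int mask = 0; mask < (1 << n); mask++) {
            // Skip subsets that are no longer than the best candidate found so far.
            if (Integer.bitCount(mask) <= best.length()) {
                continue;
            }
            // Build the subsequence of x selected by the set bits of mask.
            StringBuilder candidate = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    candidate.append(x.charAt(i));
                }
            }
            if (isSubsequence(candidate, y)) {
                best = candidate.toString();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // For this pair, a longest common subsequence has length 4.
        String lcs = longestCommonSubsequence("ABCBDAB", "BDCABA");
        System.out.println(lcs + " (length " + lcs.length() + ")");
    }
}
```

Because the loop visits every one of the 2^n masks, adding a single character to X doubles the number of iterations, which is why this approach is only workable for very short strings.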

12.5.2 Dynamic Programming
