23.11.2014 Views

Data Structures and Algorithms in Java[1].pdf - Fulvio Frisone


often allows us to approximate solutions to hard problems, and for some problems
(such as in text compression) actually gives rise to optimal algorithms. Finally, in
discussing text similarity, we introduce the dynamic programming design pattern,
which can be applied in some special instances to solve a problem in polynomial time
that appears at first to require exponential time to solve.

Text Processing

At the heart of algorithms for processing text are methods for dealing with character
strings. Character strings can come from a wide variety of sources, including
scientific, linguistic, and Internet applications. Indeed, the following are examples
of such strings:

P = "CGTAAACTGCTTTAATCAAACGC"
S = "http://java.datastructures.net".

The first string, P, comes from DNA applications, and the second string, S, is the
Internet address (URL) for the Web site that accompanies this book.

Several of the typical string processing operations involve breaking large strings
into smaller strings. In order to be able to speak about the pieces that result from
such operations, we use the term substring of an m-character string P to refer to a
string of the form P[i]P[i + 1]P[i + 2] … P[j], for some 0 ≤ i ≤ j ≤ m − 1, that is, the
string formed by the characters in P from index i to index j, inclusive. Technically,
this means that a string is actually a substring of itself (taking i = 0 and j = m − 1),
so if we want to rule this out as a possibility, we must restrict the definition to
proper substrings, which require that either i > 0 or j < m − 1.
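The definition above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name SubstringDemo and the helpers substring and isProper are hypothetical names introduced here, not part of the book's code. The one real library call is String.substring(begin, end), whose end index is exclusive, so the inclusive index j is passed as j + 1.

```java
class SubstringDemo {
    /** Returns P[i]P[i+1]...P[j], both ends inclusive, for 0 <= i <= j <= m-1. */
    static String substring(String p, int i, int j) {
        if (i < 0 || i > j || j > p.length() - 1) {
            throw new IllegalArgumentException("need 0 <= i <= j <= m-1");
        }
        // String.substring uses an exclusive end index, hence j + 1.
        return p.substring(i, j + 1);
    }

    /** True when (i, j) selects a proper substring of an m-character string. */
    static boolean isProper(int i, int j, int m) {
        boolean valid = 0 <= i && i <= j && j <= m - 1;
        // Proper means the range does not cover the whole string.
        return valid && (i > 0 || j < m - 1);
    }

    public static void main(String[] args) {
        String p = "CGTAAACTGCTTTAATCAAACGC"; // the DNA string P from the text
        int m = p.length();
        System.out.println(substring(p, 11, 16));   // characters at indices 11 through 16
        System.out.println(isProper(0, m - 1, m));  // the whole string is not proper
        System.out.println(isProper(11, 16, m));    // a strictly smaller range is proper
    }
}
```

The range check in isProper mirrors the condition in the text: a substring is proper as soon as it omits at least one character at the front (i > 0) or at the back (j < m − 1).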

To simplify the notation for referring to substrings, let us use P[i..j] to denote the
substring of P from index i to index j, inclusive. That is,

P[i..j] = P[i]P[i + 1] … P[j].

We use the convention that if i > j, then P[i..j] is equal to the null string, which has
length 0. In addition, in order to distinguish some special kinds of substrings, let us
refer to any substring of the form P[0..i], for 0 ≤ i ≤ m − 1, as a prefix of P, and any
substring of the form P[i..m − 1], for 0 ≤ i ≤ m − 1, as a suffix of P. For example, if
we again take P to be the string of DNA given above, then "CGTAA" is a prefix of
P, "CGC" is a suffix of P, and "TTAATC" is a (proper) substring of P. Note that the
null string is a prefix and a suffix of any other string.
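As a minimal sketch of the P[i..j] notation and of prefixes and suffixes, the Java fragment below maps each definition onto java.lang.String. The class PrefixSuffixDemo and the helpers range, prefix, and suffix are hypothetical names chosen for this illustration; startsWith, endsWith, contains, and substring are the standard String methods.

```java
class PrefixSuffixDemo {
    /** P[i..j], both ends inclusive; the null (empty) string when i > j. */
    static String range(String p, int i, int j) {
        if (i > j) {
            return ""; // convention from the text: P[i..j] is null when i > j
        }
        return p.substring(i, j + 1); // exclusive end index, hence j + 1
    }

    /** The prefix P[0..i]. */
    static String prefix(String p, int i) {
        return range(p, 0, i);
    }

    /** The suffix P[i..m-1]. */
    static String suffix(String p, int i) {
        return range(p, i, p.length() - 1);
    }

    public static void main(String[] args) {
        String p = "CGTAAACTGCTTTAATCAAACGC"; // the DNA string P from the text
        System.out.println(prefix(p, 4));             // P[0..4]
        System.out.println(suffix(p, 20));            // P[20..22]
        System.out.println(p.startsWith("CGTAA"));    // prefix test
        System.out.println(p.endsWith("CGC"));        // suffix test
        System.out.println(p.contains("TTAATC"));     // substring test
        // The null string is a prefix and a suffix of any string.
        System.out.println(p.startsWith("") && p.endsWith(""));
    }
}
```

Returning the empty string for i > j, rather than throwing, keeps the helper in line with the stated convention, so that an expression such as prefix(p, -1) simply yields the null string.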

To allow for fairly general notions of a character string, we typically do not restrict
the characters in T and P to explicitly come from a well-known character set, like
the Unicode character set. Instead, we typically use the symbol σ to denote the
character set, or alphabet, from which characters can come. Since most document
processing algorithms are used in applications where the underlying character set is
