13.11.2014 Views

Introduction to Computational Linguistics


10. Characters, Strings and Regular Expressions

Definition 2 Let ⃗u be a string. The length of ⃗u is denoted by |⃗u|. Suppose that |⃗u| = m and |⃗v| = n. Then ⃗u ⌢ ⃗v is a string of length m + n, which is defined as follows.

$$(\vec{u} {}^{\frown} \vec{v})(j) = \begin{cases} \vec{u}(j) & \text{if } j < m, \\ \vec{v}(j - m) & \text{else.} \end{cases} \tag{70}$$

⃗u is a prefix of ⃗v if there is a ⃗w such that ⃗v = ⃗u ⌢ ⃗w, and a postfix of ⃗v if there is a ⃗w such that ⃗v = ⃗w ⌢ ⃗u. ⃗u is a substring of ⃗v if there are ⃗w and ⃗x such that ⃗v = ⃗w ⌢ ⃗u ⌢ ⃗x.
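The definitions above translate fairly directly into OCaml. The following is a minimal sketch, not part of the original notes: the names concat, is_prefix, is_postfix and is_substring are chosen here for illustration, and concat simply restates (70) even though OCaml already provides ^ for this purpose.

```ocaml
(* Concatenation following definition (70): position j of the result is
   taken from u if j < m, and from v at offset j - m otherwise. *)
let concat (u : string) (v : string) : string =
  let m = String.length u and n = String.length v in
  String.init (m + n) (fun j -> if j < m then u.[j] else v.[j - m])

(* u is a prefix of v if v = u ^ w for some w. *)
let is_prefix (u : string) (v : string) : bool =
  let m = String.length u in
  m <= String.length v && String.sub v 0 m = u

(* u is a postfix of v if v = w ^ u for some w. *)
let is_postfix (u : string) (v : string) : bool =
  let m = String.length u and n = String.length v in
  m <= n && String.sub v (n - m) m = u

(* u is a substring of v if v = w ^ u ^ x for some w and x,
   that is, if u occurs in v starting at some position i. *)
let is_substring (u : string) (v : string) : bool =
  let m = String.length u and n = String.length v in
  let rec try_from i =
    i + m <= n && (String.sub v i m = u || try_from (i + 1))
  in
  try_from 0
```

Under these definitions the empty string is a prefix, a postfix and a substring of every string, since ⃗w and ⃗x may themselves be empty.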

OCaml has a function String.length that returns the length of a given string. For example, String.length "cat" gives 3. Notice that by our conventions positions are numbered starting from 0, so you cannot access the symbol with number 3. Look at the following dialog.

(71)
    # "cat".[2];;
    - : char = 't'
    # "cat".[String.length "cat"];;
    Exception: Invalid_argument "String.get".

The last symbol of the string has the number 2, but the string has length 3. If you try to access an element that does not exist, OCaml raises the exception Invalid_argument. The message text, here "String.get", may differ between OCaml versions.
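One way to avoid the exception is to check the position against the length before accessing the string. The sketch below is an illustration only; nth_opt and last_opt are hypothetical helper names, not functions from the OCaml standard library.

```ocaml
(* Hypothetical helper: return Some c if position j exists in s,
   and None otherwise, instead of raising Invalid_argument. *)
let nth_opt (s : string) (j : int) : char option =
  if 0 <= j && j < String.length s then Some s.[j] else None

(* The last symbol of a nonempty string sits at position
   String.length s - 1; the empty string has no last symbol. *)
let last_opt (s : string) : char option =
  nth_opt s (String.length s - 1)
```

With these definitions, nth_opt "cat" 2 evaluates to Some 't', while nth_opt "cat" 3 evaluates to None rather than raising an exception.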

In OCaml, ^ is an infix operator for string concatenation. So, if we write "tom"^"cat", OCaml returns "tomcat". Here is a useful abbreviation. ⃗x^n denotes the string obtained by repeating ⃗x n times. This can be defined recursively as follows.

$$\vec{x}^{0} = \varepsilon \tag{72}$$

$$\vec{x}^{n+1} = \vec{x}^{n} {}^{\frown} \vec{x} \tag{73}$$

So, vux^3 = vuxvuxvux.
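The recursive definition of the n-fold repetition can be written as a recursive OCaml function. This is a sketch under the assumption that a negative exponent should be rejected; the name power is chosen here for illustration.

```ocaml
(* n-fold repetition of x: the empty string for n = 0, and the
   (n-1)-fold repetition followed by x otherwise. *)
let rec power (x : string) (n : int) : string =
  if n < 0 then invalid_arg "power: negative exponent"
  else if n = 0 then ""
  else power x (n - 1) ^ x
```

For example, power "vux" 3 evaluates to "vuxvuxvux", and power "vux" 0 evaluates to the empty string.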

A language over A is a set of strings over A. Here are a few useful operations on languages.

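$$L \cdot M := \{\vec{x}\vec{y} : \vec{x} \in L, \vec{y} \in M\} \tag{74a}$$

$$L / M := \{\vec{x} : \text{there exists } \vec{y} \in M \text{ such that } \vec{x}\vec{y} \in L\} \tag{74b}$$

$$M \backslash L := \{\vec{x} : \text{there exists } \vec{y} \in M \text{ such that } \vec{y}\vec{x} \in L\} \tag{74c}$$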
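For finite languages these operations can be tried out directly. The sketch below assumes that a finite language is represented as a sorted list of strings without duplicates; this representation and all function names are choices made here for illustration, and the code does not apply to infinite languages.

```ocaml
(* Sort and remove duplicates so that a list behaves like a finite set. *)
let dedup (l : string list) : string list = List.sort_uniq compare l

(* L . M: all concatenations x ^ y with x in L and y in M. *)
let product (l : string list) (m : string list) : string list =
  dedup (List.concat (List.map (fun x -> List.map (fun y -> x ^ y) m) l))

(* If z = x ^ y, return Some x; otherwise None. *)
let strip_postfix (y : string) (z : string) : string option =
  let k = String.length y and n = String.length z in
  if k <= n && String.sub z (n - k) k = y
  then Some (String.sub z 0 (n - k))
  else None

(* If z = y ^ x, return Some x; otherwise None. *)
let strip_prefix (y : string) (z : string) : string option =
  let k = String.length y and n = String.length z in
  if k <= n && String.sub z 0 k = y
  then Some (String.sub z k (n - k))
  else None

(* Collect every x obtained by stripping some y in M from some z in L. *)
let collect strip (m : string list) (l : string list) : string list =
  let from_z z =
    List.concat
      (List.map (fun y -> match strip y z with Some x -> [x] | None -> []) m)
  in
  dedup (List.concat (List.map from_z l))

(* L / M: all x such that x ^ y is in L for some y in M. *)
let right_quotient (l : string list) (m : string list) : string list =
  collect strip_postfix m l

(* M \ L: all x such that y ^ x is in L for some y in M. *)
let left_quotient (m : string list) (l : string list) : string list =
  collect strip_prefix m l
```

For example, right_quotient ["tomcat"; "dog"] ["cat"] evaluates to ["tom"], and left_quotient ["tom"] ["tomcat"; "dog"] evaluates to ["cat"].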
