Introduction to Computational Linguistics
10. Characters, Strings and Regular Expressions
Definition 2 Let $\vec{u}$ and $\vec{v}$ be strings. The length of $\vec{u}$ is denoted by $|\vec{u}|$. Suppose that $|\vec{u}| = m$ and $|\vec{v}| = n$. Then $\vec{u} \frown \vec{v}$ is a string of length $m + n$, which is defined as follows.
$$(\vec{u} \frown \vec{v})(j) = \begin{cases} \vec{u}(j) & \text{if } j < m, \\ \vec{v}(j - m) & \text{else.} \end{cases} \tag{70}$$
$\vec{u}$ is a prefix of $\vec{v}$ if there is a $\vec{w}$ such that $\vec{v} = \vec{u} \frown \vec{w}$, and a postfix if there is a $\vec{w}$ such that $\vec{v} = \vec{w} \frown \vec{u}$. $\vec{u}$ is a substring of $\vec{v}$ if there are $\vec{w}$ and $\vec{x}$ such that $\vec{v} = \vec{w} \frown \vec{u} \frown \vec{x}$.
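The definition can be tried out directly. The following is a minimal sketch, not a library implementation: the names concat_by_index, is_prefix, is_postfix and is_substring are hypothetical and chosen only for illustration. The first function builds the concatenation symbol by symbol as in (70); the others test the three relations by comparing slices obtained with String.sub.

```ocaml
(* Concatenation built index by index, following (70).
   concat_by_index is a hypothetical name used only in this sketch. *)
let concat_by_index (u : string) (v : string) : string =
  let m = String.length u and n = String.length v in
  String.init (m + n) (fun j -> if j < m then u.[j] else v.[j - m])

(* u is a prefix of v if v = u ^ w for some w. *)
let is_prefix (u : string) (v : string) : bool =
  let m = String.length u in
  m <= String.length v && String.sub v 0 m = u

(* u is a postfix of v if v = w ^ u for some w. *)
let is_postfix (u : string) (v : string) : bool =
  let m = String.length u and n = String.length v in
  m <= n && String.sub v (n - m) m = u

(* u is a substring of v if v = w ^ u ^ x for some w and x.
   We try every possible length i of w. *)
let is_substring (u : string) (v : string) : bool =
  let m = String.length u and n = String.length v in
  let rec try_from i =
    i + m <= n && (String.sub v i m = u || try_from (i + 1))
  in
  try_from 0
```

With these definitions, is_prefix "tom" "tomcat" and is_postfix "cat" "tomcat" both hold, and the empty string is a prefix, postfix and substring of every string.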
OCaml has a function String.length that returns the length of a given string.
For example, String.length "cat" will give 3. Notice that by our conventions,
you cannot access the symbol with number 3. Look at the following dialog.
(71)<br />
# "cat".[2];;<br />
- : char = 't'
# "cat".[String.length "cat"];;<br />
Exception: Invalid_argument "String.get".<br />
The last symbol of the string has the number 2, but the string has length 3. If
you try to access an element that does not exist, OCaml raises the exception
Invalid_argument "String.get".
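If a program should not stop with an exception, the position can be checked before the symbol is accessed. Here is a small sketch of two ways to do this; symbol_at and symbol_at_caught are hypothetical helper names, not standard library functions. The second version catches the exception with a wildcard pattern, so it does not rely on the exact message carried by Invalid_argument.

```ocaml
(* symbol_at is a hypothetical helper: it returns Some c if position j
   exists in s, and None otherwise, so no exception is raised. *)
let symbol_at (s : string) (j : int) : char option =
  if 0 <= j && j < String.length s then Some s.[j] else None

(* Alternative: attempt the access and catch the exception.
   The wildcard _ matches whatever message Invalid_argument carries. *)
let symbol_at_caught (s : string) (j : int) : char option =
  try Some s.[j] with Invalid_argument _ -> None
```

For the string "cat", both functions return Some 't' at position 2 and None at position 3.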
In OCaml, ^ is an infix operator for string concatenation. So, if we write
"tom"^"cat", OCaml returns "tomcat". Here is a useful abbreviation. $\vec{x}^n$ denotes
the string obtained by repeating $\vec{x}$ $n$ times. This can be defined recursively
as follows.
$$\vec{x}^0 = \varepsilon \tag{72}$$
$$\vec{x}^{n+1} = \vec{x}^n \frown \vec{x} \tag{73}$$
So, $\mathrm{vux}^3 = \mathrm{vuxvuxvux}$.
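The recursion in (72) and (73) carries over to OCaml almost word for word. The following is a minimal sketch; the name power is an assumption of this example, and treating a negative n like 0 is a choice made here, not part of the definition.

```ocaml
(* power x n repeats the string x n times.
   Base case (72): zero repetitions give the empty string.
   Recursive case (73): n+1 repetitions are n repetitions followed by x.
   Assumption of this sketch: a negative n is treated like 0. *)
let rec power (x : string) (n : int) : string =
  if n <= 0 then "" else power x (n - 1) ^ x
```

For example, power "vux" 3 evaluates to "vuxvuxvux".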
A language over A is a set of strings over A. Here are a few useful operations<br />
on languages.<br />
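$$L \cdot M := \{\vec{x}\vec{y} : \vec{x} \in L, \vec{y} \in M\} \tag{74a}$$
$$L/M := \{\vec{x} : \text{exists } \vec{y} \in M : \vec{x}\vec{y} \in L\} \tag{74b}$$
$$M\backslash L := \{\vec{x} : \text{exists } \vec{y} \in M : \vec{y}\vec{x} \in L\} \tag{74c}$$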
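For finite languages these operations can be computed directly. The sketch below assumes that a finite language is represented as a list of strings; this representation and all function names (concat_lang, right_quotient, left_quotient and the helpers) are assumptions made for illustration. Infinite languages cannot be handled this way.

```ocaml
(* A finite language is represented as a list of strings.
   This representation is an assumption of the sketch. *)

(* L · M: all strings xy with x in l and y in m, without duplicates. *)
let concat_lang (l : string list) (m : string list) : string list =
  List.sort_uniq compare
    (List.concat (List.map (fun x -> List.map (fun y -> x ^ y) m) l))

(* If z = x ^ y, return Some x; otherwise None. *)
let strip_postfix (y : string) (z : string) : string option =
  let k = String.length z - String.length y in
  if k >= 0 && String.sub z k (String.length y) = y
  then Some (String.sub z 0 k)
  else None

(* If z = y ^ x, return Some x; otherwise None. *)
let strip_prefix (y : string) (z : string) : string option =
  let m = String.length y in
  let k = String.length z - m in
  if k >= 0 && String.sub z 0 m = y
  then Some (String.sub z m k)
  else None

(* Gather every x with f y z = Some x, for z in l and y in m. *)
let collect f (l : string list) (m : string list) : string list =
  List.sort_uniq compare
    (List.concat
       (List.map
          (fun z ->
            List.concat
              (List.map
                 (fun y -> match f y z with Some x -> [x] | None -> [])
                 m))
          l))

(* L/M: strings x such that xy is in l for some y in m. *)
let right_quotient (l : string list) (m : string list) : string list =
  collect strip_postfix l m

(* M\L: strings x such that yx is in l for some y in m.
   The argument order mirrors the notation M\L. *)
let left_quotient (m : string list) (l : string list) : string list =
  collect strip_prefix l m
```

For instance, with L = {tomcat, cat, dog} and M = {cat}, right_quotient yields the language containing tom and the empty string, since both tom followed by cat and the empty string followed by cat are in L.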