13.11.2014 Views

Introduction to Computational Linguistics


10. Characters, Strings and Regular Expressions

(N is the set of natural numbers; it contains all integers starting from 0.) The symbol · is often omitted, so st is the same as s · t. For example, cat = c · a · t. It is easily seen that every set containing exactly one string is a regular language. Hence every finite set of strings is also regular, being a finite union of such sets. With more effort one can show that a set containing all but finitely many strings is regular, too.
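The step from single strings to finite sets can be illustrated with a minimal sketch in Python. The helper name `finite_language_pattern` and the sample words are assumptions made for illustration; Python's `re` syntax stands in for the regular terms of the text, with `|` playing the role of ∪.

```python
import re

def finite_language_pattern(strings):
    """Build a pattern whose language is exactly the given finite, non-empty set."""
    if not strings:
        raise ValueError("this sketch only handles non-empty sets")
    # Each escaped string denotes a one-element language; "|" forms their union.
    alternatives = [re.escape(s) for s in sorted(strings)]
    return "(?:" + "|".join(alternatives) + ")"

pattern = finite_language_pattern({"cat", "dog", "mouse"})
candidates = ["cat", "dog", "mouse", "cow", "cats"]
matches = [w for w in candidates if re.fullmatch(pattern, w)]
print(matches)  # ['cat', 'dog', 'mouse']
```

`re.fullmatch` is used so that a string counts only if the whole string falls under the term, not merely a part of it.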

Notice that whereas s is a term, L(s) is a language, namely the language of strings that fall under the term s. In ordinary usage, though, the two are not distinguished: we write a both for the string a and for the regular term whose language is {a}. It follows that

(85) L(a · (b ∪ c)) = {ab, ac}
(86) L((cb)* a) = {a, cba, cbcba, ...}
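Examples (85) and (86) can be checked mechanically. The sketch below assumes Python's `re` module as a stand-in for regular terms and lists every string over the alphabet {a, b, c}, up to a chosen length, that falls under a pattern. The function name `language_up_to` and the length bounds are illustrative choices.

```python
import re
from itertools import product

def language_up_to(pattern, alphabet, max_len):
    """List all strings of length <= max_len over `alphabet` that the pattern matches in full."""
    compiled = re.compile(pattern)
    found = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            if compiled.fullmatch(word):
                found.append(word)
    return found

print(language_up_to(r"a(b|c)", "abc", 3))  # ['ab', 'ac']
print(language_up_to(r"(cb)*a", "abc", 5))  # ['a', 'cba', 'cbcba']
```

The second language is infinite, so the enumeration can only ever show an initial portion of it.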

A couple of abbreviations are also used:

(87) s? := ε ∪ s
(88) s⁺ := s · s*
(89) sⁿ := s · s · ... · s (n times)
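Python's `re` syntax happens to offer the same three abbreviations (`?`, `+` and `{n}`), so the definitions can be compared with their expansions. The following is a bounded check over short strings, not a proof; the helper names and the choice s = ab are assumptions for illustration.

```python
import re
from itertools import product

def words(alphabet, max_len):
    """Yield every string over `alphabet` of length 0 up to max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def same_language(p, q, alphabet="ab", max_len=6):
    """True if p and q accept exactly the same strings up to max_len."""
    return all(
        bool(re.fullmatch(p, w)) == bool(re.fullmatch(q, w))
        for w in words(alphabet, max_len)
    )

s = "(?:ab)"  # the term s, grouped so that operators apply to all of it

print(same_language(s + "?", "(?:|" + s + ")"))  # s? against ε ∪ s
print(same_language(s + "+", s + s + "*"))       # s+ against s · s*
print(same_language(s + "{3}", s + s + s))       # s^3 against s · s · s
```

All three comparisons print `True`. The empty alternative in `(?:|...)` plays the role of ε.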

We say that s ⊆ t if L(s) ⊆ L(t). This is the same as saying that L(s ∪ t) = L(t).
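The equivalence between s ⊆ t and L(s ∪ t) = L(t) can be tried out on a small case. The sketch below assumes s = ab and t = (ab)*, written in Python `re` syntax, and compares the languages only up to length 4, so it illustrates the statement rather than proving it.

```python
import re
from itertools import product

def bounded_language(pattern, alphabet="ab", max_len=4):
    """Set of all strings up to max_len over the alphabet that the pattern matches in full."""
    return {
        "".join(letters)
        for n in range(max_len + 1)
        for letters in product(alphabet, repeat=n)
        if re.fullmatch(pattern, "".join(letters))
    }

s, t = "ab", "(?:ab)*"
L_s = bounded_language(s)
L_t = bounded_language(t)
L_union = bounded_language("(?:" + s + "|" + t + ")")

print(L_s <= L_t)      # True: every string under s is also under t
print(L_union == L_t)  # True: adding s to t contributes nothing new
```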

Regular expressions are used in many applications. For example, if you are searching for a particular string, say department, in a long document, it may actually appear in two shapes, department or Department. Also, if you are searching for two words in sequence, they may appear on different lines. This means that they are separated by any number of blanks and an optional carriage return. In order not to lose any occurrence of that sort, you will want to write a regular expression that matches all of these occurrences. The Unix command egrep allows you to search for strings that match a regular term. The particular constructors look a little different, but the underlying concepts are the same.
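Both search problems can be sketched in Python, used here as a stand-in for egrep; the sample text and the second search phrase are made up for illustration. One caveat: grep-style tools normally examine one line at a time, so the two-word search across a line break is shown on a whole string instead.

```python
import re

text = """The Department of Linguistics is large.
Our department teaches computational
linguistics every year."""

# [Dd] is the union D ∪ d; \b restricts the match to whole words.
print(re.findall(r"\b[Dd]epartment\b", text))  # ['Department', 'department']

# \s+ stands for one or more blanks, tabs or line breaks between the two words.
two_words = re.compile(r"computational\s+linguistics")
print(two_words.search(text) is not None)      # True
```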

Notice that there are regular expressions which are different but denote the same language. In general, the following laws hold. (A note of caution: these laws are not identities between the terms, which remain distinct. They are identities between the languages that these terms denote.)

(90a) s · (t · u) = (s · t) · u
(90b) s · ε = s        ε · s = s
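Because the laws concern languages rather than terms, they can be tested directly on sets of strings. In the sketch below the function `concat` models the operation · on languages, and the three small finite languages are arbitrary choices made for illustration.

```python
def concat(A, B):
    """Concatenation of two languages: each string of A followed by each string of B."""
    return {x + y for x in A for y in B}

EPSILON = {""}  # L(ε), the language containing only the empty string

S = {"a", "ab"}
T = {"b", "", "c"}
U = {"c", "bc"}

print(concat(S, concat(T, U)) == concat(concat(S, T), U))   # (90a): True
print(concat(S, EPSILON) == S and concat(EPSILON, S) == S)  # (90b): True
```

The two sides of (90a) are built in a different order, just as the two terms differ, yet the resulting sets coincide.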
