Introduction to Computational Linguistics
10. Characters, Strings and Regular Expressions 29<br />
(N is the set of natural numbers. It contains all integers starting from 0.) The symbol · is often<br />
omitted, so st is the same as s · t; for example, cat = c · a · t. It is easily seen<br />
that every set containing exactly one string is a regular language. Hence, every<br />
finite set of strings is also regular, being a finite union of such sets. With more effort one can show<br />
that a set containing all but finitely many strings is regular, too.<br />
Notice that whereas s is a term, L(s) is a language, namely the language of strings that<br />
fall under the term s. In ordinary usage, these two are not distinguished, though.<br />
We write a both for the string a and the regular term whose language is {a}. It<br />
follows that<br />
(85) L(a · (b ∪ c)) = {ab, ac}<br />
(86) L((cb)∗ · a) = {a, cba, cbcba, . . . }<br />
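The two examples can be checked mechanically. The following is a minimal sketch, not part of the original notes: it assumes a hypothetical encoding of regular terms as nested Python tuples and enumerates only the strings of L(s) up to a chosen length, since L(s) itself may be infinite.

```python
# Hypothetical term encoding (an assumption for illustration, not from the text):
#   a Python string   -> the term denoting the language {that string}
#   ("cat", s, t)     -> s · t
#   ("alt", s, t)     -> s ∪ t
#   ("star", s)       -> s∗

def language(term, max_len):
    """Return all strings of L(term) whose length is at most max_len."""
    if isinstance(term, str):
        return {term} if len(term) <= max_len else set()
    op = term[0]
    if op == "alt":
        return language(term[1], max_len) | language(term[2], max_len)
    if op == "cat":
        left = language(term[1], max_len)
        right = language(term[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if op == "star":
        # Drop the empty string so that every round makes the strings longer.
        base = language(term[1], max_len) - {""}
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown operator: {op!r}")

term_85 = ("cat", "a", ("alt", "b", "c"))        # a · (b ∪ c)
term_86 = ("cat", ("star", ("cat", "c", "b")), "a")  # (cb)∗ · a

print(sorted(language(term_85, 5)))  # ['ab', 'ac']
print(sorted(language(term_86, 5)))  # ['a', 'cba', 'cbcba']
```

The length bound is what makes the star case terminate; raising it yields further members of the infinite language in (86).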
A couple of abbreviations are also used:<br />
(87) s? := ε ∪ s<br />
(88) s⁺ := s · s∗<br />
(89) sⁿ := s · s · ⋯ · s (n times)<br />
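Most regular expression libraries offer these abbreviations directly. As an illustration, the sketch below uses Python's re module (an assumption; the notes do not prescribe a tool), where ?, + and {n} play the roles of s?, s⁺ and sⁿ. The helper same_language is hypothetical and compares two patterns only on a finite sample of strings, so it gives evidence for the equivalence but does not prove it.

```python
import re
from itertools import product

def same_language(p, q, samples):
    """Check that patterns p and q accept exactly the same sample strings."""
    return all(bool(re.fullmatch(p, w)) == bool(re.fullmatch(q, w))
               for w in samples)

# All strings over the alphabet {a, b} of length 0 to 8.
samples = ["".join(letters) for n in range(9)
           for letters in product("ab", repeat=n)]

# s? := ε ∪ s   (an empty alternative stands for ε)
print(same_language(r"(ab)?", r"(|ab)", samples))
# s+ := s · s∗
print(same_language(r"(ab)+", r"(ab)(ab)*", samples))
# s^3 := s · s · s
print(same_language(r"(ab){3}", r"(ab)(ab)(ab)", samples))
```

Here re.fullmatch is used rather than re.search because membership in L(s) concerns the whole string, not a substring.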
We say that s ⊆ t if L(s) ⊆ L(t). This is the same as saying that L(s ∪ t) = L(t).<br />
Regular expressions are used in a lot of applications. For example, if you are<br />
searching for a particular string, say department in a long document, it may<br />
actually appear in two shapes, department or Department. Also, if you are<br />
searching for two words in a sequence, you may face the fact that they appear on<br />
different lines. This means that they are separated by any number of blanks and<br />
an optional carriage return. In order not to lose any occurrence of that sort you<br />
will want to write a regular expression that matches any of these occurrences. The<br />
Unix command egrep allows you to search for strings that match a regular term.<br />
The particular constructors look a little bit different, but the underlying concepts<br />
are the same.<br />
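The two search problems just described might be handled as follows. This is a sketch using Python's re module in place of egrep (an assumption made for the sake of a runnable example); the sample text and the word pair regular expressions are invented for illustration.

```python
import re

# Invented sample text: both spellings of "department" occur, and the
# words "regular" and "expressions" are split across two lines.
text = ("The Department of Linguistics is large.\n"
        "Each department teaches regular   \n"
        "expressions.")

# [Dd] is a character class: it matches either D or d.
department = re.compile(r"[Dd]epartment")
print(department.findall(text))

# Any number of blanks or tabs, an optional line break (with or without
# a carriage return), then again any number of blanks or tabs.
two_words = re.compile(r"regular[ \t]*(?:\r?\n)?[ \t]*expressions")
print(two_words.search(text) is not None)
```

Since "any number" includes zero, the second pattern also accepts the two words written together without a separator; replacing the first * by + would demand at least one blank before the line break.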
Notice that there are regular expressions which are different but denote the<br />
same language. We have in general the following laws. (A note of caution. These<br />
laws are not identities between the terms; the terms are distinct. These are identities<br />
concerning the languages that these terms denote.)<br />
(90a) s · (t · u) = (s · t) · u<br />
(90b) s · ε = s, ε · s = s