Introduction to Computational Linguistics
Introduction to Computational Linguistics
Introduction to Computational Linguistics
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
11. Interlude: Regular Expressions in OCaML 35<br />
the position in the string where the search should start. The result is an integer<br />
and it gives the initial position of the next string that falls under the regular expression.<br />
If none can be found, OCaML raises the exception Not_found. If you<br />
want <strong>to</strong> avoid getting the exception you can also use string_match before <strong>to</strong> see<br />
whether or not there is any string of that sort. It is actually superfluous <strong>to</strong> first<br />
check <strong>to</strong> see whether a match exists and then actually performing the search. This<br />
requires <strong>to</strong> search the same string twice. Rather, there are two useful functions,<br />
match_beginning and match_end, which return the initial position (final position)<br />
of the string that was found <strong>to</strong> match the regular expression. Notice that you<br />
have <strong>to</strong> say Str.match_beginning(), supplying an empty parenthesis.<br />
There is an interesting difference in the way matching can be done. The Unix<br />
matcher tries <strong>to</strong> match the longest possible substring against the regular expression<br />
(this is called greedy matching), while the Windows matcher looks for the just<br />
enough of the string <strong>to</strong> get a match (lazy matching). Say we are looking for a ∗ in<br />
the string bcagaa.<br />
(98)<br />
let ast = Str.regexp "a*";;<br />
Str.string_match ast "bcagaa" 0;;<br />
In both versions, the answer is ”yes”. Moreover, if we ask for the first and the last<br />
position, both version will give 0 and 0. This is because there is no way <strong>to</strong> match<br />
more than zero a at that point. Now try instead<br />
(99) Str.string_match ast "bcagaa" 2;;<br />
Here, the two algorithms yield different values. The greedy algorithm will say that<br />
it has found a match between 2 and 3, since it was able <strong>to</strong> match one successive<br />
a at that point. The lazy algorithm is content with the empty string. The empty<br />
string matches, so it does not look further. Notice that the greedy algorithm does<br />
not look for the longest occurrence in the entire string. Rather, it proceeds from<br />
the starting position and tries <strong>to</strong> match from the successive positions against the<br />
regular expression. If it cannot match at all, it proceeds <strong>to</strong> the next position. If it<br />
can match, however, it will take the longest possible string that matches.<br />
The greedy algorithm actually allows <strong>to</strong> check whether a string falls under a