13.11.2014 Views

Introduction to Computational Linguistics

11. Interlude: Regular Expressions in OCaML

the position in the string where the search should start. The result is an integer giving the initial position of the next substring that matches the regular expression. If none can be found, OCaML raises the exception Not_found. If you want to avoid the exception, you can call string_match beforehand to see whether there is any string of that sort. However, first checking whether a match exists and then performing the search is wasteful, since it scans the same string twice. Instead, there are two useful functions, match_beginning and match_end, which return the initial position and the final position, respectively, of the string that was found to match the regular expression (the final position is the one just after the last matched character). Notice that you have to write Str.match_beginning (), supplying an empty pair of parentheses.
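The search-then-inspect pattern described above can be sketched as follows. This is a minimal illustration, not code from the notes: it assumes that the search function being described is Str.search_forward, that the Str library is linked (for example with ocamlfind ocamlopt -package str -linkpkg), and the helper name find_span is hypothetical.

```ocaml
(* Hypothetical helper: find the next match of [re] in [s] at or after
   position [start]. Returns Some (first, last) instead of letting
   Not_found escape, so the string is searched only once. *)
let find_span re s start =
  try
    let first = Str.search_forward re s start in
    (* match_end () is the position just after the last matched character *)
    Some (first, Str.match_end ())
  with Not_found -> None

let () =
  let re = Str.regexp "a+" in
  match find_span re "bcagaa" 0 with
  | Some (first, last) -> Printf.printf "match from %d to %d\n" first last
  | None -> print_endline "no match"
```

Catching Not_found around a single search avoids the double scan that a preliminary check would cause.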

There is an interesting difference in the way matching can be done. The Unix matcher tries to match the longest possible substring against the regular expression (this is called greedy matching), while the Windows matcher looks for just enough of the string to get a match (lazy matching). Say we are looking for a* in the string bcagaa.

(98)

let ast = Str.regexp "a*";;

Str.string_match ast "bcagaa" 0;;

In both versions, the answer is "yes". Moreover, if we ask for the first and the last position, both versions will give 0 and 0. This is because there is no way to match more than zero occurrences of a at that point. Now try instead

(99) Str.string_match ast "bcagaa" 2;;

Here, the two algorithms yield different values. The greedy algorithm will say that it has found a match between 2 and 3, since it was able to match one successive a at that point. The lazy algorithm is content with the empty string: the empty string matches, so it does not look further. Notice that the greedy algorithm does not look for the longest occurrence in the entire string. Rather, it proceeds from the starting position and tries to match the regular expression at successive positions. If it cannot match at all, it proceeds to the next position. If it can match, however, it will take the longest possible string that matches.
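The greedy behaviour just described can be checked with a short sketch. It assumes a greedy matcher and a linked Str library; the helper names span_at and show are hypothetical and only serve the illustration.

```ocaml
let ast = Str.regexp "a*"

(* Hypothetical helper: try to match [re] at exactly position [pos] of [s]
   and report the span of the match, if there is one. *)
let span_at re s pos =
  if Str.string_match re s pos
  then Some (Str.match_beginning (), Str.match_end ())
  else None

(* Hypothetical helper: print a labelled span. *)
let show label = function
  | Some (b, e) -> Printf.printf "%s: %d to %d\n" label b e
  | None -> Printf.printf "%s: no match\n" label

let () =
  (* At position 0 only the empty string can be matched. *)
  show "a* at 0" (span_at ast "bcagaa" 0);
  (* At position 2 a greedy matcher takes the single a. *)
  show "a* at 2" (span_at ast "bcagaa" 2);
  (* Searching for a+ from position 0 stops at the first place where a
     match is possible; it does not skip ahead to the longer run at 4. *)
  let first = Str.search_forward (Str.regexp "a+") "bcagaa" 0 in
  Printf.printf "a+ found at %d: %s\n" first (Str.matched_string "bcagaa")
```

A lazy matcher, as described above, would instead be content with the empty string at position 2.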

The greedy algorithm actually allows us to check whether a string falls under a
