13.11.2014 Views

Introduction to Computational Linguistics




the value you get 1:

(68)
# let q = new number;;
val q : number = <obj>
# q#succ;;
- : unit = ()
# q#get;;
- : int = 1
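The session in (68) relies on a class number whose definition is not shown here. As a reading aid, the following is a minimal sketch of a class that would behave the same way. It is a hypothetical reconstruction: the initial value 0 and the field name x are assumptions, and the definition given earlier in the text may differ.

```ocaml
(* Hypothetical reconstruction of the class `number`; the actual
   definition appears earlier in the text and may differ. *)
class number = object
  (* Assumed initial value: 0, so that one call to succ yields 1. *)
  val mutable x = 0
  (* get returns the current value. *)
  method get = x
  (* succ increments the stored value and returns unit. *)
  method succ = x <- x + 1
end

let () =
  let q = new number in
  q#succ;
  Printf.printf "%d\n" q#get
```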

Notice the following. If you have defined an object whose value is a set,
defining method get = x will not help you much in seeing what the current
value is, since OCaml does not display the contents of a set. You will have
to say, for example, method get = PStringSet.elements x if the object is
basically a set of pairs of strings as defined above.
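The remark about get can be illustrated with a small sketch. The module PStringSet is defined earlier in the text. Here it is assumed to be a set of pairs of strings built with Set.Make, and the class name pairs and the method add are hypothetical names introduced only for this example.

```ocaml
(* Assumption: PStringSet is a set of pairs of strings built with
   Set.Make; the definition used in the text may differ. *)
module PStringSet = Set.Make (struct
  type t = string * string
  let compare = Stdlib.compare
end)

(* Hypothetical class wrapping such a set. *)
class pairs = object
  val mutable x = PStringSet.empty
  (* add inserts one pair into the stored set. *)
  method add (p : string * string) = x <- PStringSet.add p x
  (* Returning x itself would only show an abstract value in the
     toplevel; elements converts the set into a sorted list. *)
  method get = PStringSet.elements x
end

let () =
  let s = new pairs in
  s#add ("dog", "dogs");
  s#add ("cat", "cats");
  List.iter (fun (a, b) -> Printf.printf "(%s, %s)\n" a b) s#get
```

With get defined this way, the toplevel shows the result of s#get as an ordinary list of pairs, which it can print in full.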

10 Characters, Strings and Regular Expressions

We are now ready to do some actual theory and implementation of linguistics.
We shall deal first with strings and then with finite state automata. Before we can
talk about strings, some words are necessary on characters. Characters are drawn
from a special set, which is also referred to as the alphabet. In principle the alphabet
can be anything, but in actual implementations the alphabet is always fixed.
OCaml, for example, is based on the character table of ISO Latin 1 (also referred
to as ISO 8859-1). It is included on the web page for you to look at. You may use
characters from this set only. In theory, any character of this set can be used. However,
there arises a problem of communication with OCaml. There are a number of characters
that do not show up on the screen, like carriage return. Other characters are
used as delimiters (such as the quotes). It is for this reason that one has to use
certain naming conventions. They are given on Page 90 – 91 of the manual. If
you write \ followed by three decimal digits, this denotes the character whose code
is given by that sequence. OCaml has a module called Char that has a few useful
functions. The function code, for example, converts a character into its numeric code.
Its inverse is the function chr. Type Char.code 'L';; and OCaml gives you 76. So, the
sequence \076 refers to the character L. You can try out that function and see that
it actually does support the full character set of ISO Latin 1. Another issue is of
course your editor: in order to put in that character, you have to learn how your
editor lets you do this. (In vi, you type either Ctrl-V and then the numeric code
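
The functions Char.code and Char.chr and the escape notation described in this section can be tried out with a short sketch like the following. The value names are chosen for illustration only. The code 233 is used as one example of a character outside the ASCII range, on the assumption that the bytes are read as ISO Latin 1.

```ocaml
(* Char.code maps a character to its integer code. *)
let code_of_l = Char.code 'L'

(* Char.chr is the inverse: it maps a code (0 to 255) to a character. *)
let l_again = Char.chr 76

(* A backslash followed by three decimal digits names a character
   by its code, so '\076' is the same character as 'L'. *)
let l_escaped = '\076'

(* Codes above 127 are accepted as well; read as ISO Latin 1,
   code 233 is the letter e with an acute accent. *)
let e_acute = Char.chr 233

let () =
  Printf.printf "%d %c %b %d\n"
    code_of_l l_again (l_escaped = 'L') (Char.code e_acute)
  (* prints: 76 L true 233 *)
```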
