26.12.2013 Views

A computational grammar and lexicon for Maltese



Importing data

Data was imported into the database using a number of scripts written in Haskell and the mongoDB Haskell library.⁴ The general steps in each case are described below.

Verbal roots project database: This data was available as a MySQL database, with one table for roots and another for verb forms. The import script used the mysql-simple Haskell library⁵ to query the database directly and insert the data into MongoDB.
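The query-and-insert step can be illustrated with a minimal sketch. The original script was written in Haskell against MySQL and MongoDB; the Python below is only a stand-in that uses an in-memory SQLite database and plain dictionaries in place of both. The table and column names (`roots`, `forms`, `radicals`, `surface`, `root_id`) and the sample rows are assumptions made for illustration, not the project's actual schema.

```python
import sqlite3

def table_to_documents(conn, table):
    """Read every row of `table` and return it as a list of dict 'documents'."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    query = f"SELECT * FROM {table} ORDER BY id"  # table name is trusted here
    return [dict(row) for row in conn.execute(query)]

# Hypothetical stand-in for the source database: one table per entity type.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE roots (id INTEGER PRIMARY KEY, radicals TEXT);
    CREATE TABLE forms (id INTEGER PRIMARY KEY, root_id INTEGER, surface TEXT);
    INSERT INTO roots VALUES (1, 'k-t-b');
    INSERT INTO forms VALUES (1, 1, 'kiteb');
    INSERT INTO forms VALUES (2, 1, 'nkiteb');
""")

# Each list would then be handed to the document store's bulk-insert call.
root_docs = table_to_documents(conn, "roots")
form_docs = table_to_documents(conn, "forms")
```

Whether the real import kept roots and forms in separate collections or nested one inside the other is not stated here; the sketch simply keeps one document per source row.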

Broken plurals: This resource is made available in tab-separated format, which was straightforward to process. Each line of the original file contained a new form. In the case of adjectives, two singular entries would have the same plural, e.g. abjad (m.sg.) and bajda (f.sg.) are separate entries which both have the plural bojod (‘white’). The import script therefore detects these cases and labels them as adjectives. However, this can fail with animate nouns which also have two genders, e.g. tabib (‘male doctor’), tabiba (‘female doctor’) and tobba (‘doctors’). In these cases the POS needs to be corrected manually, since the source gives no other information to help with disambiguation.
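The shared-plural heuristic described above might look roughly like the following. This is a Python sketch rather than the original Haskell script, and it assumes a two-column layout (singular, then plural) and a default label of noun for all other entries; neither detail is given in the text. The entry ktieb/kotba is added only to show the non-adjective branch.

```python
import csv
import io
from collections import defaultdict

def label_pos(tsv_text):
    """Label singulars as ADJ when two or more of them share one plural."""
    by_plural = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        singular, plural = row[0], row[1]
        by_plural[plural].append(singular)

    entries = []
    for plural, singulars in by_plural.items():
        # Assumption: anything without a shared plural defaults to noun.
        pos = "ADJ" if len(singulars) > 1 else "N"
        for singular in singulars:
            entries.append({"singular": singular, "plural": plural, "pos": pos})
    return entries

sample = (
    "abjad\tbojod\n"
    "bajda\tbojod\n"
    "tabib\ttobba\n"
    "tabiba\ttobba\n"
    "ktieb\tkotba\n"
)
labels = {e["singular"]: e["pos"] for e in label_pos(sample)}
```

On this sample the heuristic labels abjad and bajda as adjectives, but it also labels tabib and tabiba as adjectives, which is exactly the failure case that has to be corrected by hand.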

Verbal nouns: In this case, the word list was only available as a Microsoft Word document containing a table spanning many pages. As this format is less amenable to direct extraction, the data was first copied and pasted into a text editor. This step eliminated the table structure by placing every cell on a new line, which could then be processed more easily using standard techniques. A disadvantage of this step is that all formatting information was lost in the conversion. In particular, entries formatted in italics, which indicate hypothetical words, could no longer be distinguished.
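Once every cell sits on its own line, the rows can be recovered by regrouping the lines in fixed-size chunks. The Python sketch below shows one way to do this; it is not the original Haskell script, and the column count and the placeholder cell values are assumptions, since the layout of the source table is not described here.

```python
def regroup_cells(lines, columns):
    """Rebuild table rows from a flat list holding one cell per line."""
    cells = [line.strip() for line in lines]
    if len(cells) % columns != 0:
        # A stray or missing line would shift every later row, so fail loudly.
        raise ValueError(
            f"{len(cells)} cells cannot be split into rows of {columns}"
        )
    return [tuple(cells[i:i + columns]) for i in range(0, len(cells), columns)]

# Placeholder cells for a hypothetical three-column table.
pasted = ["r1c1", "r1c2", "r1c3", "r2c1", "", "r2c3"]
rows = regroup_cells(pasted, columns=3)
```

Blank lines are kept rather than dropped, because an empty table cell also becomes an empty line and removing it would misalign the remaining rows.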

Basic English-Maltese dictionary: This dictionary has been made available in an XML format and is therefore highly amenable to automatic processing. The import script for this resource uses the xml Haskell library⁶ for parsing XML. Unlike the importation steps described above, which work in isolation and provide non-overlapping sets of data, the importation of an entire dictionary must be able to handle duplicates. That is, if the lemma being imported already exists in the collection, the new data should be merged into the existing entry rather than a duplicate entry being created. The script runs in batch mode but logs which entries were merged, and these logs must be checked manually.
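A merge-on-duplicate import with a review log could be sketched as follows. This is a Python illustration, not the original Haskell script: the XML element names (`dictionary`, `entry`, `lemma`, `gloss`), the field names, the sample entries and the merge policy (fill in missing fields, keep existing values, record conflicting fields) are all assumptions, and a plain dictionary keyed by lemma stands in for the MongoDB collection.

```python
import xml.etree.ElementTree as ET

def parse_entries(xml_text):
    """Extract lemma/gloss pairs from a hypothetical dictionary XML layout."""
    root = ET.fromstring(xml_text)
    return [
        {"lemma": e.findtext("lemma"), "gloss": e.findtext("gloss")}
        for e in root.iter("entry")
    ]

def import_entries(collection, entries):
    """Insert new lemmas, merge into existing ones, and log every merge."""
    log = []
    for entry in entries:
        existing = collection.get(entry["lemma"])
        if existing is None:
            collection[entry["lemma"]] = dict(entry)
            continue
        conflicts = []
        for field, value in entry.items():
            if field not in existing:
                existing[field] = value   # fill in data the entry lacked
            elif existing[field] != value:
                conflicts.append(field)   # keep old value, flag for review
        log.append({"lemma": entry["lemma"], "conflicts": conflicts})
    return log

# The collection already holds one lemma from an earlier import.
collection = {"ktieb": {"lemma": "ktieb", "pos": "N", "plural": "kotba"}}
xml_text = (
    "<dictionary>"
    "<entry><lemma>ktieb</lemma><gloss>book</gloss></entry>"
    "<entry><lemma>kelb</lemma><gloss>dog</gloss></entry>"
    "</dictionary>"
)
merge_log = import_entries(collection, parse_entries(xml_text))
```

Here ktieb gains a gloss and appears in the log, while kelb is inserted as a new entry and is not logged. Keeping the existing value on conflict is only one possible policy; the log is what makes the later manual check feasible.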

⁴ http://hackage.haskell.org/package/mongoDB, accessed 2013-09-01

⁵ http://hackage.haskell.org/package/mysql-simple, accessed 2013-09-01

⁶ http://hackage.haskell.org/package/xml, accessed 2013-09-03
