26.12.2013 Views

A computational grammar and lexicon for Maltese


Anchored regex    db.wordforms.find({"surface_form":/^skrej/})    < 100
General regex     db.wordforms.find({"surface_form":/skrej/})     ~ 4000

One could use an alternative search engine, e.g. Apache Solr⁹. However, the response times in the application are generally acceptable and do not warrant a more heavy-duty solution.
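
The gap between the anchored and general regex timings is what one would expect if surface_form is indexed: a prefix pattern can be answered from a contiguous range of a sorted index, whereas an unanchored pattern has to be tested against every entry. The following is a minimal sketch of that idea only, not of MongoDB's internals. The sorted array stands in for an index, the function names are hypothetical, and the strings around skrej are made-up placeholders rather than real lexicon entries.

```javascript
// Sketch only: a sorted in-memory array stands in for a database index.
// Function names and sample strings are illustrative assumptions.

// Index of the first element >= target (binary search over a sorted array).
function lowerBound(sorted, target) {
  let lo = 0;
  let hi = sorted.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sorted[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Anchored search (like /^prefix/): jump to the start of the matching range,
// then read forward only while entries still begin with the prefix.
function prefixSearch(sorted, prefix) {
  const hits = [];
  for (let i = lowerBound(sorted, prefix); i < sorted.length; i++) {
    if (!sorted[i].startsWith(prefix)) break;
    hits.push(sorted[i]);
  }
  return hits;
}

// General search (like /substring/): every entry has to be inspected.
function substringSearch(sorted, substring) {
  return sorted.filter((entry) => entry.includes(substring));
}

// Placeholder strings, sorted in the same code-unit order that < uses.
const index = ["alfa", "xskrejx", "skreja", "skrejb", "zeta"].sort();

console.log(prefixSearch(index, "skrej"));    // only the entries starting with skrej
console.log(substringSearch(index, "skrej")); // also the entry with skrej in the middle
```

The anchored search touches a number of entries proportional to the size of the result plus a logarithmic lookup, while the general search always visits the whole list.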

Native sorting

MongoDB has a sort command which uses an inbuilt sorting algorithm for ordering documents by some field. Specifically, when sorting Unicode strings, the database engine sorts in order of binary representation. Apart from ruling out case-insensitive sorting, this also means that one cannot sort lexicographically according to the Maltese alphabet: the letters ċ, ġ, ħ and ż are incorrectly sorted after the letter z, and għ and ie are not correctly treated as digraphs.

This means, for example, that ġara (‘neighbour’) would be sorted after zuntier (‘churchyard’), though it should come before it. Since the sorting scheme does not support digraphs, a word containing għ such as għaraf (‘he recognised’) is erroneously sorted after gara (‘he threw’). Sorting by custom collations is not currently supported by the MongoDB engine, although it has been marked as a planned feature¹⁰. Until this is implemented, sorting can instead be performed at the application level. The disadvantage is that this rules out using the database engine to provide efficient pagination by combining the sort() and limit() commands.
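
As an illustration of application-level sorting, the following minimal sketch orders words by position in the Maltese alphabet, reading għ and ie as single letters, and then pages through the sorted list in memory. It is not the application's actual code: the function names are hypothetical, the alphabet order used is the standard one (with għ placed after g and ie after i), and characters outside the alphabet are simply assumed to sort last.

```javascript
// Sketch only: names are illustrative; the order is the standard Maltese alphabet.
const MALTESE_ALPHABET = [
  "a", "b", "ċ", "d", "e", "f", "ġ", "g", "għ", "h", "ħ", "i", "ie", "j",
  "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x",
  "ż", "z",
];
const RANK = new Map(MALTESE_ALPHABET.map((letter, i) => [letter, i]));

// Turn a word into a list of alphabet positions, treating the digraphs
// "għ" and "ie" as one unit. Unknown characters sort after the alphabet.
function sortKey(word) {
  const w = word.normalize("NFC").toLowerCase();
  const key = [];
  let i = 0;
  while (i < w.length) {
    const pair = w.slice(i, i + 2);
    if (pair.length === 2 && RANK.has(pair)) {
      key.push(RANK.get(pair));
      i += 2;
    } else {
      key.push(RANK.has(w[i]) ? RANK.get(w[i]) : MALTESE_ALPHABET.length);
      i += 1;
    }
  }
  return key;
}

// Compare two words position by position; a shorter word that is a prefix
// of a longer one comes first.
function compareMaltese(a, b) {
  const ka = sortKey(a);
  const kb = sortKey(b);
  const n = Math.min(ka.length, kb.length);
  for (let i = 0; i < n; i++) {
    if (ka[i] !== kb[i]) return ka[i] - kb[i];
  }
  return ka.length - kb.length;
}

// Application-level pagination: sort the full list, then slice out one page.
function page(words, pageNumber, pageSize) {
  const sorted = [...words].sort(compareMaltese);
  return sorted.slice(pageNumber * pageSize, (pageNumber + 1) * pageSize);
}

const words = ["zuntier", "ġara"];
console.log([...words].sort());               // code-unit order: zuntier first
console.log([...words].sort(compareMaltese)); // Maltese order: ġara first
console.log(page(words, 0, 1));               // first page of size 1
```

Note that page() must receive and sort the complete result set before it can return any single page, which is the cost of not being able to delegate ordering to the database.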

3.3 Monolingual GF dictionary<br />

With all the lexical entries gathered together in a computational lexicon, a monolingual GF grammar module can be easily constructed. Following the RGL convention, this consists of matching abstract and concrete modules named DictMltAbs.gf and DictMlt.gf respectively. These are included with the rest of the Maltese resource grammar described in chapter 2.

3.3.1 Method<br />

The method for generating a monolingual GF dictionary from a word list is quite straightforward:

1. For each lemma in the lexicon, a valid GF function identifier is generated which is guaranteed to be unique. This is often an ASCII-ised version of the lemma combined with some unique identifier and suffixed with the POS. In the case of verbs, for example, we

9 http://lucene.apache.org/solr/, accessed 2013-08-27<br />

10 https://jira.mongodb.org/browse/SERVER-1920, accessed 2013-08-27<br />
