26.12.2013 Views

A computational grammar and lexicon for Maltese


more regular. For verb inflections, the morphological processes are in fact more regular and the argument in favour of only generating full forms becomes stronger. One could imagine a system which combines a morphological automaton with a database of full forms acting as a kind of override to the former. In many ways, GF lexicons defined using smart paradigms essentially work in this way. However, GF itself is not really suitable as a final storage format for a lexicon because it essentially enforces a fixed schema on its entries, and extending it with new words from different sources may require considerable refactoring.
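As a rough illustration of the combined design described above, the following minimal Python sketch checks a table of stored full forms first and falls back to a generator only when no override exists. The names `make_lookup` and `toy_generate` are hypothetical, the "generator" is a trivial stand-in for a real morphological automaton, and the English toy data is used purely for readability; none of this reflects how GF itself is implemented.

```python
from typing import Callable, Dict, List

# A generator maps a lemma to its list of inflected forms (assumed shape).
Generator = Callable[[str], List[str]]


def make_lookup(overrides: Dict[str, List[str]], generate: Generator) -> Generator:
    """Return a lookup that prefers stored full forms over generated ones."""

    def lookup(lemma: str) -> List[str]:
        # Stored full forms act as an override to the generator.
        if lemma in overrides:
            return overrides[lemma]
        return generate(lemma)

    return lookup


def toy_generate(lemma: str) -> List[str]:
    """Toy stand-in for a morphological automaton: regular plural only."""
    return [lemma, lemma + "s"]


# Irregular entries are stored in full; everything else is generated.
lookup = make_lookup({"mouse": ["mouse", "mice"]}, toy_generate)
```

Here `lookup("mouse")` returns the stored forms, while `lookup("cat")` falls through to the generator.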

This work opts for the former of the options presented above, that is, storing all forms in a single database, without the use of any real-time morphological generator. Apart from making the system design simpler, this means that the lexicon can be searched more effectively (using regular expressions, for example) and that its contents can be more easily converted or exported to some other format.
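To make the search and export argument concrete, here is a small sketch of a full-form store queried with a regular expression and dumped to JSON. The record layout (`lemma`, `pos`, `form`), the function names, and the handful of sample rows are assumptions made for illustration; they are not the schema or the storage back end used in this work.

```python
import json
import re
from typing import Dict, List

Entry = Dict[str, str]

# Illustrative rows only: every inflected form is stored explicitly.
LEXICON: List[Entry] = [
    {"lemma": "kiteb", "pos": "verb", "form": "kiteb"},
    {"lemma": "kiteb", "pos": "verb", "form": "ktibt"},
    {"lemma": "kiteb", "pos": "verb", "form": "kitbu"},
    {"lemma": "kelb", "pos": "noun", "form": "kelb"},
    {"lemma": "kelb", "pos": "noun", "form": "klieb"},
]


def search_forms(entries: List[Entry], pattern: str) -> List[Entry]:
    """Return all entries whose surface form matches the regular expression."""
    regex = re.compile(pattern)
    return [entry for entry in entries if regex.search(entry["form"])]


def export_json(entries: List[Entry]) -> str:
    """Serialise the lexicon to JSON, keeping non-ASCII letters readable."""
    return json.dumps(entries, ensure_ascii=False, indent=2)


# All stored forms that start with "k" and end with "b".
matches = search_forms(LEXICON, r"^k.*b$")
```

Because every form is a plain row, the same query works regardless of how the form was originally derived, and exporting is a straightforward serialisation step.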

Size calculations

Dalli (2002a) estimates that at least 30,000 lemmas must be identified in order to have significant coverage of the Maltese language as a whole. As a comparison, Serracino-Inglott’s Maltese dictionary (Serracino-Inglott, 2003) contains roughly 26,000 entries, while the Maltese-English volumes of Aquilina’s dictionary (Aquilina, 1987, 1990) contain some 80,000. The total number of entries from the sources listed in the previous section amounts to almost 13,000. Note, however, that this does not take into account duplicate entries appearing in different sources; the number of unique entries will therefore likely be lower.

Considering the worst-case inflectional forms of each of the major parts of speech, we have:

• Nouns have 5 plural forms, each of which can appear with or without enclitic pronouns, giving an upper bound of 40 word forms.
• Verbs have 952 forms (see appendix C) for the main moods/aspects, and 14 for present and past participles (which take no enclitic pronouns). This gives a total of 966 verb word forms.
• Adjectives have 3 inflection cases and 3 forms, making a total of 9 possible combinations.
• We will ignore inflections of other word classes such as prepositions and pronouns, as such structural words are generally of a fixed, small number.
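The per-class upper bounds listed above reduce to simple arithmetic, which the following sketch records in one place. The constant names are hypothetical; the numbers are those quoted in the list, with the noun bound taken as given rather than re-derived.

```python
# Worst-case word forms per lemma, using the figures quoted in the text.
NOUN_FORMS_MAX = 40          # stated upper bound for nouns
VERB_MAIN_FORMS = 952        # main moods/aspects
VERB_PARTICIPLE_FORMS = 14   # present and past participles
ADJ_INFLECTION_CASES = 3
ADJ_FORMS = 3

WORST_CASE_FORMS = {
    "noun": NOUN_FORMS_MAX,
    "verb": VERB_MAIN_FORMS + VERB_PARTICIPLE_FORMS,
    "adjective": ADJ_INFLECTION_CASES * ADJ_FORMS,
}
```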

In order to estimate the total number of forms required over an entire lexicon, we must now establish the distribution of the different parts of speech in a Maltese dictionary. An analysis of the digitised version of the Aquilina dictionary (MLRS, 2013) gives a worst-case total of 86 million forms. This calculation is broken down in table 3.1. While this figure is higher than Dalli’s estimate of around 64 million unique wordforms (Dalli, 2002a, p. 69), it is certainly of the same order of magnitude. He goes on to estimate an average case of around 5.7 million unique wordforms (based on an “average word form” count of 72 (Mangion, 1999, p. 65)).
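The shape of this estimate can be sketched as a weighted sum: the number of lemmas in each part of speech multiplied by that class's worst-case form count. In the sketch below, the per-class bounds are the ones quoted earlier, but `TOY_COUNTS` is a deliberately tiny, made-up distribution and not the data behind table 3.1. The last line assumes that the average-case figure is a plain product of lemma count and average forms per lemma, which the text does not state explicitly.

```python
import math
from typing import Dict

# Worst-case forms per lemma, as quoted in the text.
WORST_CASE_FORMS: Dict[str, int] = {"noun": 40, "verb": 966, "adjective": 9}


def estimate_total_forms(lemma_counts: Dict[str, int],
                         forms_per_lemma: Dict[str, int]) -> int:
    """Sum, over parts of speech, lemma count times forms per lemma."""
    return sum(count * forms_per_lemma[pos] for pos, count in lemma_counts.items())


def same_order_of_magnitude(a: float, b: float) -> bool:
    """True when both values share the same power of ten."""
    return math.floor(math.log10(a)) == math.floor(math.log10(b))


# Hypothetical distribution, for illustration only (NOT table 3.1).
TOY_COUNTS = {"noun": 100, "verb": 10, "adjective": 20}
toy_total = estimate_total_forms(TOY_COUNTS, WORST_CASE_FORMS)

# The two worst-case figures quoted in the text.
same_order = same_order_of_magnitude(86_000_000, 64_000_000)

# Lemma count implied by 5.7 million forms at 72 forms per lemma (assumed product).
implied_lemmas = round(5_700_000 / 72)
```

Under that assumption the implied lemma count comes out just under 80,000, which is close to the size quoted for the Aquilina dictionary.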
