A computational grammar and lexicon for Maltese
more regular. For verb inflections, the morphological processes are in fact more regular and
the argument in favour of only generating full forms becomes stronger.
One could imagine a system which combines a morphological automaton
with a database of full forms acting as a kind of override to the former. In many ways, GF lexicons
defined using smart paradigms essentially work in this way. However, GF itself is not really
suitable as a final storage format for a lexicon because it enforces a fixed schema on
its entries, and extending it with new words from different sources may require considerable
refactoring.
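The combined design described above can be sketched roughly as follows. This is only a minimal illustration, not how GF implements smart paradigms: the names `regular_plural`, `PLURAL_OVERRIDES` and `lookup_plural`, the toy suffix rule and the example nouns are all assumptions introduced here.

```python
# Hypothetical sketch: a rule-based generator with a full-form override table.
# The rule and the data are illustrative assumptions, not taken from the lexicon.

def regular_plural(lemma: str) -> str:
    """Toy stand-in for a morphological generator: append the suffix -ijiet."""
    return lemma + "ijiet"

# Stored full forms that take precedence over whatever the rule would produce.
PLURAL_OVERRIDES = {
    "ktieb": "kotba",  # broken plural, not derivable by the toy suffix rule
}

def lookup_plural(lemma: str) -> str:
    """Return the stored form if one exists, otherwise fall back to the rule."""
    if lemma in PLURAL_OVERRIDES:
        return PLURAL_OVERRIDES[lemma]
    return regular_plural(lemma)
```

Under these assumptions, `lookup_plural("omm")` falls through to the rule, while `lookup_plural("ktieb")` is answered from the stored table.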
This work opts for the former of the options presented above, that is, storing all forms in a
single database, without the use of any real-time morphological generator. Apart from making
the system design simpler, this means that the lexicon can be searched more effectively (using
regular expressions, for example) and that its contents can be more easily converted or exported
to some other format.
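To illustrate why a full-form store is easy to search and export, here is a minimal sketch. The in-memory table `FULL_FORMS`, its three-column layout and the helper names `search_forms` and `export_tsv` are assumptions made for this example; they do not describe the actual database schema used in this work.

```python
import re

# Hypothetical full-form table: (surface form, lemma, part of speech).
FULL_FORMS = [
    ("kiteb", "kiteb", "verb"),
    ("kitbet", "kiteb", "verb"),
    ("kitbu", "kiteb", "verb"),
    ("ktibt", "kiteb", "verb"),
    ("ktieb", "ktieb", "noun"),
    ("kotba", "ktieb", "noun"),
]

def search_forms(pattern, entries=FULL_FORMS):
    """Return all entries whose surface form fully matches the regular expression."""
    regex = re.compile(pattern)
    return [entry for entry in entries if regex.fullmatch(entry[0])]

def export_tsv(entries=FULL_FORMS):
    """Flatten the table into tab-separated lines, one surface form per line."""
    return "\n".join("\t".join(entry) for entry in entries)
```

Because every surface form is stored explicitly, a query such as `search_forms(r"kit.*")` needs no morphological analysis at all, and the export is a plain traversal of the table.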
Size calculations<br />
Dalli (2002a) estimates that at least 30,000 lemmas must be identified in order to have significant
coverage of the Maltese language as a whole. As a comparison, Serracino-Inglott’s Maltese dictionary
(Serracino-Inglott, 2003) contains roughly 26,000 entries, while the Maltese-English volumes of
Aquilina’s dictionary (Aquilina, 1987, 1990) contain some 80,000. The total number of entries from the
sources listed in the previous section amounts to almost 13,000. Note, however, that this does
not take into account duplicate entries appearing in different sources; the number of
unique entries will therefore likely be lower.
Considering the worst-case inflectional forms of each of the major parts of speech, we have:
• Nouns have 5 plural forms, each of which can appear with or without enclitic pronouns,
giving an upper bound of 40 word forms.
• Verbs have 952 forms (see appendix C) for the main moods/aspects, and 14 for present
and past participles (which take no enclitic pronouns). This gives a total of 966 verb word
forms.
• Adjectives have 3 inflection cases and 3 forms, making a total of 9 possible combinations.
• We will ignore inflections of other word classes such as prepositions and pronouns, as
such structural words are generally of a fixed, small number.
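The per-class upper bounds above reduce to simple arithmetic, sketched below. The factor of 8 variants per noun form is an assumption inferred from 40 / 5 (one bare form plus seven enclitic pronouns); the text gives only the resulting bound of 40. All other figures are quoted directly from the list.

```python
# Worst-case word forms per lemma, using the figures quoted in the text.
NOUN_BASE_FORMS = 5         # from the text
NOUN_SUFFIX_VARIANTS = 8    # assumption: 40 / 5, i.e. bare form + 7 enclitic pronouns
VERB_MAIN_FORMS = 952       # from the text (main moods/aspects)
VERB_PARTICIPLE_FORMS = 14  # from the text (participles take no enclitic pronouns)
ADJ_CASES = 3               # from the text
ADJ_FORMS = 3               # from the text

WORST_CASE_FORMS = {
    "noun": NOUN_BASE_FORMS * NOUN_SUFFIX_VARIANTS,
    "verb": VERB_MAIN_FORMS + VERB_PARTICIPLE_FORMS,
    "adjective": ADJ_CASES * ADJ_FORMS,
}
```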
To estimate the total number of forms required over an entire lexicon, we must
establish the distribution of the different parts of speech in a Maltese dictionary. An analysis of
the digitised version of the Aquilina dictionary (MLRS, 2013) gives a worst-case total number of
forms of 86 million. This calculation is broken down in table 3.1. While this figure is higher than
Dalli’s estimate of around 64 million unique wordforms (Dalli, 2002a, p. 69), it is certainly of
the same order of magnitude. He goes on to estimate an average case of around 5.7 million
unique wordforms (based on an “average word form” count of 72 (Mangion, 1999, p. 65)).
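The shape of both estimates can be sketched as below. The per-class counts in `example_counts` are placeholders chosen purely for illustration and are not the figures from table 3.1; the function names are likewise assumptions. For the average case, note that 5.7 million / 72 is roughly 79,000 lemmas, which is close to the 80,000 entries cited for Aquilina's dictionary, although the text here does not state which lemma count Dalli used.

```python
# Worst-case forms per lemma, as stated in the text.
WORST_CASE_FORMS = {"noun": 40, "verb": 966, "adjective": 9}

def worst_case_total(lemma_counts):
    """Sum (number of lemmas x worst-case forms) over the parts of speech."""
    return sum(WORST_CASE_FORMS[pos] * count for pos, count in lemma_counts.items())

def average_case_total(lemma_count, average_forms=72):
    """Average-case estimate: lemmas x an average number of forms per lemma."""
    return lemma_count * average_forms

# Placeholder distribution, NOT the breakdown given in table 3.1.
example_counts = {"noun": 1000, "verb": 1000, "adjective": 1000}

print(worst_case_total(example_counts))  # 40*1000 + 966*1000 + 9*1000
print(average_case_total(80_000))        # 80,000 lemmas at 72 forms each
```

Since verbs contribute 966 forms per lemma against 40 and 9 for the other classes, the worst-case total is dominated by the number of verb lemmas.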