26.12.2013 Views

A computational grammar and lexicon for Maltese


ible to the author (and thus not listed in the bibliography) are Lawrenz Cachia’s Grammatika Ġdida tal-Malti (1994) and Edmund Sutcliffe’s A Grammar of the Maltese Language (1936). Older historical grammars include A Short Grammar of the Maltese Language by an unknown author in 1845, Francis Vella’s Maltese Grammar for the use of the English (1831), and Michelantonio Vassalli’s Grammatica della lingua Maltese (1827).

1.2.2 Corpora and terminologies

MLRS corpus<br />

A monolingual corpus for Maltese has been compiled and is hosted by the Maltese Language Resource Server (MLRS)¹ (Rosner et al., 2006). The MLRS Corpus (Gatt & Borg, 2011) can be described as an opportunistic text collection of nearly 130 million tokens, mostly created from publicly available documents, but also including a limited amount of user-contributed material. Texts for version 2.0 BETA of the corpus were pre-processed to remove duplicate material and long stretches of non-Maltese text. They were also processed with a simple dictionary-based spelling correction, and part-of-speech (POS) tagged using the TnT Tagger trained on circa 26k of manually annotated text.
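The duplicate removal and dictionary-based spelling correction mentioned above could look roughly like the following. This is only a minimal sketch, not the MLRS implementation: the function names, the normalisation used to detect duplicates, and the one-entry correction dictionary are all assumptions made for illustration. Filtering of non-Maltese text and TnT tagging are not shown.

```python
def remove_duplicates(paragraphs):
    """Keep the first occurrence of each paragraph.

    Assumption: two paragraphs count as duplicates when they are equal
    after collapsing whitespace and case-folding.
    """
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = " ".join(paragraph.split()).casefold()
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique


def correct_spelling(tokens, corrections):
    """Replace each token that has an entry in a misspelling -> correction dictionary."""
    return [corrections.get(token, token) for token in tokens]


# Hypothetical correction list: restores a missing diacritic.
corrections = {"gdida": "ġdida"}

paragraphs = [
    "Grammatika Ġdida tal-Malti",
    "grammatika  ġdida   tal-malti",  # duplicate after normalisation
    "MLRS corpus",
]
deduplicated = remove_duplicates(paragraphs)
corrected = correct_spelling(["grammatika", "gdida", "tal-Malti"], corrections)
```

A lookup table of this kind only fixes misspellings that have been listed in advance, which is consistent with the description of the correction step as "simple".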

Parallel corpora<br />

There are also some parallel corpora for Maltese, which have come from the translation efforts of the European Commission. The JRC-Acquis corpus (Ralf et al., 2006) is a parallel corpus containing the complete text of European Union law (the Acquis Communautaire) in 22 languages. Other parallel texts for Maltese are accessible through the OPUS project (Tiedemann, 2009), an open-access collection of translated texts from the web, automatically processed and annotated using open-source products. OPUS contains parallel corpora for English-Maltese from:

• EMEA - PDF documents from the European Medicines Agency (11.2M tokens)<br />

• ECB - Website and documentation from the European Central Bank (2.7M tokens)

• EUconst - A parallel corpus collected from the European Constitution (0.1M tokens)<br />
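Working with an English-Maltese parallel corpus such as those listed above usually starts by pairing each source segment with its translation. The sketch below assumes the data is available as two line-aligned sequences, one segment per line, which is an assumption about the download format rather than something stated here. The function name and the placeholder segments are hypothetical.

```python
from itertools import zip_longest


def pair_segments(src_lines, tgt_lines):
    """Yield (source, target) pairs from two line-aligned iterables.

    Raises ValueError if the inputs have different lengths, since the
    alignment can no longer be trusted in that case.
    """
    for number, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            raise ValueError(f"inputs differ in length at line {number}")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # skip pairs where either side is empty
            yield src, tgt


# Placeholder segments; real input would be two open text files.
english = ["first English segment\n", "\n", "third English segment\n"]
maltese = ["first Maltese segment\n", "second Maltese segment\n", "third Maltese segment\n"]
pairs = list(pair_segments(english, maltese))
```

Because the function accepts any iterable of lines, it can be fed open file objects directly, so a large corpus does not have to be loaded into memory at once.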

MAMCO: Maltese Multimodal corpus

Work has also begun on a multimodal corpus for Maltese, beginning with video recordings of first-encounter conversations between pairs of Maltese speakers. The setting and organisation replicate those used in the Nordic NOMCO corpus. Corpus annotations include transcription of the spoken data, as well as annotation of head movements and other gestures. This corpus has enabled studies on, among other topics, lengthening as a discourse strategy (Paggio et al., 2013).

1 http://mlrs.research.um.edu.mt/, accessed 2013-09-05<br />
