
Free/Open-Source Machine<br />

Translation: a three-day tutorial at the<br />

Centre for Next Generation Localisation<br />

Mikel L. Forcada<br />

Universitat d’Alacant, E-03071 Alacant, Spain<br />

Centre for Next Generation Localisation<br />

April 14–16, 2009<br />

1 Objectives<br />

• To become acquainted with existing free/open-source machine translation<br />

(FOSMT) software<br />

• To understand rule-based FOSMT using Apertium as an example<br />

• To experience FOSMT development in the Apertium platform<br />

• To understand corpus-based FOSMT using Moses as an example (*)<br />

• To experience FOSMT development with Moses (*)<br />

• To become aware of free/open-source resources which can be used<br />

to build FOSMT systems<br />

• To understand how FOSMT can be used to generate new human-language<br />

technology tools<br />

We will work on objectives marked with (*) if time allows.<br />

2 Audience<br />

PhD students and postdoctoral researchers in the areas of language technologies:<br />

machine translation, speech, natural language processing, digital<br />

content management —information retrieval, information extraction,<br />

adaptive hypermedia— and localization.<br />

May also be attended by translators with basic to intermediate ICT<br />

skills.<br />

3 Duration, structure, resources needed<br />

Five two-hour sessions spread over two and a half days, as follows:<br />

• Tuesday, April 14, 10:00–12:00 and 14:00–16:00, laboratory L1.14<br />

• Wednesday, April 15, 10:00–12:00 and 14:00–16:00, laboratory L1.01<br />

• Thursday, April 16, 10:00–12:00, laboratory L1.14<br />

Both labs are in the School of Computing. We’ll have two kinds of sessions:<br />

• “Classroom” sessions will be organized to be as participative as possible<br />

so as to accommodate participants' different backgrounds.<br />

• Laboratory sessions are designed for participants to get a grasp of<br />

what FOSMT development may look like.<br />

4 Tentative list of blocks & contents<br />

1. Free/open-source software<br />

• Free software: The four basic freedoms. The ambiguity<br />

of the word free. Adoption of “open-source”. Free/open-source<br />

software (FOSS) versus freeware, shareware,<br />

and cost-free services on the net.<br />

• Copyleft as an addition to free licenses, and the creation<br />

and securing of a commons. Copylefted and noncopylefted<br />

licenses.<br />

• Analysis of typical non-free licenses used in academic<br />

and commercial settings.<br />

• FOSS and science.<br />

• The shift toward FOSS-based business models: advantages<br />

and vulnerabilities (Behlendorf 1999)<br />

• FOSS project management. Roles. Releases. Repositories,<br />

modification control. “Bazaar-style” versus<br />

“cathedral-style”<br />

2. Free/open-source machine translation (FOSMT)<br />

• The three basic components of an MT system in development:<br />

engine, data, tools (ED&T). Basic tools (compiler),<br />

auxiliary tools (evaluation tools, dictionary management<br />

tools). FOSMT as MT in which ED&T are all<br />

free/open-source.<br />

• Free/open-source rule-based MT and corpus-based MT<br />

• Copyleft applied to linguistic data. Pooling and the commons<br />

(Streiter et al. 2006). Minor languages (Forcada<br />

2006).<br />

• Advantages of FOSMT: The advanced-user/developer<br />

continuum. In particular, advantages for minor languages.<br />

• Generation of resources for other language technologies<br />

(LT).<br />

• Using standard or well documented format: interoperability,<br />

transferability.<br />

• FOSMT development scenarios<br />

• Examples of existing FOSMT software.<br />

• FOSMT project management.<br />

3. Rule-based FOSMT: Apertium<br />

(a) Description<br />

• Background (interNOSTRUM, Universia)<br />

• Rationale<br />

• Apertium as a shallow-transfer machine translation platform:<br />

engine, data, tools (Armentano et al. 2006).<br />

• Language-pair data<br />

• Funding<br />

• The Apertium community as an example of FOSS development<br />

(repositories: the trunk, the incubator; roles).<br />

• Apertium tools: apertium-dixtools, apertium-transfer-tools,<br />

apertium-tagger-training-tools.<br />

• Apertium-based applications (Tinylex, WordPress plugin,<br />

Pidgin plugin, OpenOffice.org plugins, etc.)<br />

• Apertium as a research platform (Sánchez-Martínez and<br />

Forcada 2007, Sánchez-Martínez et al. 2008).<br />

(b) Laboratory: installing and modifying Apertium<br />

• Installing Apertium from the latest sources on a virtual<br />

Linux machine.<br />

• Changing the data for a language pair: vocabularies and,<br />

optionally, transfer rules.<br />

4. Corpus-based FOSMT: Moses (*)<br />

(a) Statistical machine translation<br />

• Statistical MT (SMT)<br />

• The data: sentence-aligned corpora.<br />

• Training: the statistical models<br />

• The SMT engine or “decoder”<br />

(b) Description of Moses and related software:<br />

• Training: Giza++<br />

• Language models: irstlm<br />

• Tuning: MERT<br />

• Decoding: Moses<br />

(c) Laboratory: installing and running Moses.<br />
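As a pointer for block 4(a), the data, training and decoder items can be tied together with the classic noisy-channel formulation of statistical MT. This is the usual textbook formulation and is not taken from this outline; the symbols f (source-language sentence) and e (candidate target-language sentence) are notation assumed here.

```latex
\[
  \hat{e} \;=\; \operatorname*{arg\,max}_{e} P(e \mid f)
          \;=\; \operatorname*{arg\,max}_{e} P(f \mid e)\, P(e)
\]
```

In this reading, the translation model P(f | e) is estimated from sentence-aligned corpora, the language model P(e) is estimated from target-language text, and the decoder carries out the search for the maximizing e. Moses generalizes this product to a weighted log-linear combination of feature functions, with the weights set in the tuning step.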

QUESTIONS:<br />

5. Additional material (*)<br />

• Installing Moses and all of the auxiliary software (Giza++,<br />

language model, etc.).<br />

• Training a toy SMT system with Moses.<br />

(a) Evaluation of machine translation<br />

• FOSS for evaluation: IQMT, etc.<br />

• Human evaluation: post-editing and the traditional adequacy/fluency/informativeness criteria<br />

• Automatic evaluation: BLEU, etc. Criticism of automatic<br />

evaluation.<br />

(b) Free/open-source resources that can be used to build FOSMT<br />

systems<br />

(c) Related software<br />

• Morphological analysers, taggers: FreeLing<br />

• Data useful for morphological analysers: Wiktionary (start<br />

of Breton, Icelandic and Faroese)<br />

• Free bilingual dictionaries: DACCO, etc.<br />

• Parallel corpora: Europarl, OPUS, etc.<br />

• ReTraTos: Perl tools to build bilingual dictionaries from<br />

aligned corpora<br />

• Free text: Wikipedia (easy access to free “raw” text)<br />

• Bitext tools: bitextor, tagaligner, etc.<br />

Non-MT FOSS for translation: OmegaT (translation memories),<br />

etc.<br />

Blocks marked with (*) will be dealt with if time allows.<br />
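To make the automatic-evaluation item in block 5 more concrete, here is a minimal sketch of a BLEU-style score for one candidate sentence against one reference. It is a toy illustration under stated assumptions, not the official BLEU implementation and not part of the tutorial materials: it uses whitespace tokenization, a single reference, no smoothing, and the function names `ngrams` and `sentence_bleu` are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Toy single-reference, sentence-level BLEU without smoothing.

    Assumptions (not from the tutorial): whitespace tokenization,
    uniform weights over n-gram orders 1..max_n.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # unsmoothed: one zero precision makes the score 0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))
    print(sentence_bleu("the cat sat on", "the cat sat on the mat"))
```

The hard zero for any missing n-gram order shows one reason sentence-level scores are often criticized: a short but acceptable translation with no matching 4-gram scores 0, which is why practical tools report corpus-level scores or apply smoothing.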

4.1 References<br />

• Armentano-Oller, C., Carrasco, R.C., Corbí-Bellot, A.M., Forcada,<br />

M.L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Ramírez-Sánchez,<br />

F., Sánchez-Martínez, F., Scalco, M. (2006) “Open-source<br />

Portuguese–Spanish machine translation” in Lecture Notes in Computer<br />

Science 3960, 50–59 (Computational Processing of the Portuguese<br />

Language, Proceedings of the 7th International Workshop on Computational<br />

Processing of Written and Spoken Portuguese, PROPOR<br />

2006, May 13–17, 2006, Itatiaia, Rio de Janeiro, Brazil). Available online:<br />

http://www.dlsi.ua.es/~fsanchez/pub/pdf/armentano06.pdf<br />

• Behlendorf, Bruce (1999) “Open source as a business strategy”, in<br />

DiBona, C., Ockman, S., Stone, M., eds. Open Sources: Voices from the<br />

Open Source Revolution, O’Reilly. Available online:<br />

http://oreilly.com/catalog/opensources/book/toc.html<br />

• Forcada, Mikel (2006) “Open source machine translation: an opportunity<br />

for minor languages” in LREC-2006: Fifth International Conference<br />

on Language Resources and Evaluation. 5th SALTMIL Workshop<br />

on Minority Languages: “Strategies for developing machine translation for<br />

minority languages”, Genoa, Italy, 23 May 2006; pp. 1–6. Available online:<br />

http://www.mt-archive.info/LREC-2006-Forcada.pdf,<br />

http://www.dlsi.ua.es/~mlf/docum/forcada06p2.pdf<br />

• Sánchez-Martínez, F., Forcada, Mikel L. (2007) “Automatic induction<br />

of shallow-transfer rules for open-source machine translation”<br />

in TMI-2007: Proceedings of the 11th International Conference on Theoretical<br />

and Methodological Issues in Machine Translation, Skövde [Sweden],<br />

7–9 September 2007; pp. 181–190. Available online:<br />

http://www.mt-archive.info/TMI-2007-Sanchez-Martinez.pdf,<br />

http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez07c.pdf<br />

• Sánchez-Martínez, F., Pérez-Ortiz, J.A., Forcada, M.L. (2008) “Using target-language<br />

information to train part-of-speech taggers for machine translation”,<br />

Machine Translation 22(1–2), 29–66.<br />

• Streiter, O., Scannell, K.P., Stuflesser, M. (2006) “Implementing<br />

NLP projects for noncentral languages: instructions for funding<br />

bodies, strategies for developers”, Machine Translation 20(4), 267–289.<br />

Draft available online: http://borel.slu.edu/pub/mt.pdf.<br />
