Free/Open-Source Machine Translation: a three-day tutorial ... - CNGL
Free/Open-Source Machine<br />
Translation: a three-day tutorial at the<br />
Centre for Next Generation Localisation<br />
Mikel L. Forcada<br />
Universitat d’Alacant, E-03071 Alacant, Spain<br />
Centre for Next Generation Localisation<br />
April 14–16, 2009<br />
1 Objectives<br />
• To become acquainted with existing free/open-source machine translation<br />
(FOSMT) software<br />
• To understand rule-based FOSMT using Apertium as an example<br />
• To experience FOSMT development in the Apertium platform<br />
• To understand corpus-based FOSMT using Moses as an example (*)<br />
• To experience FOSMT development with Moses (*)<br />
• To become aware of free/open-source resources which can be used<br />
to build FOSMT systems<br />
• To understand how FOSMT can be used to generate new human-language<br />
technology tools<br />
We will work on objectives marked with (*) if time allows.<br />
2 Audience<br />
PhD students and postdoctoral researchers in the areas of language technologies:<br />
machine translation, speech, natural language processing, digital<br />
content management —information retrieval, information extraction,<br />
adaptive hypermedia— and localization.<br />
Translators with basic to intermediate ICT skills may also<br />
attend.<br />
3 Duration, structure, resources needed<br />
Five two-hour sessions spread over two and a half days, as follows:<br />
• Tuesday, April 14, 10:00–12:00 and 14:00–16:00, laboratory L1.14<br />
• Wednesday, April 15, 10:00–12:00 and 14:00–16:00, laboratory L1.01<br />
• Thursday, April 16, 10:00–12:00, laboratory L1.14<br />
Both labs are in the School of Computing. We’ll have two kinds of sessions:<br />
• “Classroom” sessions will be organized to be as participative as possible,<br />
to accommodate different backgrounds.<br />
• Laboratory sessions are designed for participants to get a grasp of<br />
what FOSMT development may look like.<br />
4 Tentative list of blocks & contents<br />
1. Free/open-source software<br />
• Free software: The four basic freedoms. The ambiguity<br />
of the word free. Adoption of “open-source”. Free/open-source<br />
software (FOSS) versus freeware, shareware,<br />
and cost-free services on the net.<br />
• Copyleft as an addition to free licenses, and the creation<br />
and securing of a commons. Copylefted and noncopylefted<br />
licenses.<br />
• Analysis of typical non-free licenses used in academic<br />
and commercial settings.<br />
• FOSS and science.<br />
• The shift toward FOSS-based business models: advantages<br />
and vulnerabilities (Behlendorf 1999)<br />
• FOSS project management. Roles. Releases. Repositories,<br />
modification control. “Bazaar-style” versus<br />
“cathedral-style”<br />
2. Free/open-source machine translation (FOSMT)<br />
• The three basic components of an MT system in development:<br />
engine, data, tools (ED&T). Basic tools (compiler),<br />
auxiliary tools (evaluation tools, dictionary management<br />
tools). FOSMT as MT in which ED&T are all<br />
free/open-source.<br />
• Free/open-source rule-based MT and corpus-based MT<br />
• Copyleft applied to linguistic data. Pooling and the commons<br />
(Streiter et al. 2006). Minor languages (Forcada<br />
2006).<br />
• Advantages of FOSMT: The advanced-user/developer<br />
continuum. In particular, advantages for minor languages.<br />
• Generation of resources for other language technologies<br />
(LT).<br />
• Using standard or well-documented formats: interoperability,<br />
transferability.<br />
• FOSMT development scenarios<br />
• Examples of existing FOSMT software.<br />
• FOSMT project management.<br />
3. Rule-based FOSMT: Apertium<br />
(a) Description<br />
• Background (interNOSTRUM, Universia)<br />
• Rationale<br />
• Apertium as a shallow-transfer machine translation platform:<br />
engine, data, tools (Armentano et al. 2006).<br />
• Language-pair data<br />
• Funding<br />
• The Apertium community as an example of FOSS development<br />
(repositories: the trunk, the incubator; roles).<br />
• Apertium tools: apertium-dixtools, apertium-transfer-tools,<br />
apertium-tagger-training-tools.<br />
• Apertium-based applications (Tinylex, Wordpress plugin,<br />
Pidgin plugin, OpenOffice.org plugins, etc.)<br />
• Apertium as a research platform (Sánchez-Martínez and<br />
Forcada 2007, Sánchez-Martínez et al. 2008).<br />
(b) Laboratory: installing and modifying Apertium<br />
• Installing Apertium from the latest sources on a virtual<br />
Linux machine.<br />
• Changing the data for a language pair: vocabularies and,<br />
optionally, transfer rules.<br />
4. Corpus-based FOSMT: Moses (*)<br />
(a) Statistical machine translation<br />
• Statistical MT (SMT)<br />
• The data: sentence-aligned corpora.<br />
• Training: the statistical models<br />
• The SMT engine or “decoder”<br />
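The relationship between these four items can be summarized by the classical noisy-channel formulation of SMT. This is a textbook sketch, not necessarily the exact presentation used in the tutorial; the symbols f (source sentence) and e (candidate target sentence) are notation introduced here for illustration.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

Here P(f | e) is the translation model and P(e) is the target-language model; both are estimated during training, the former from sentence-aligned corpora. The decoder is the component that searches for the maximizing e. Phrase-based systems such as Moses generalize this product to a weighted log-linear combination of feature functions.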
(b) Description of Moses and related software:<br />
• Training: Giza++<br />
• Language models: irstlm<br />
• Tuning: MERT<br />
• Decoding: Moses<br />
(c) Laboratory: installing and running Moses.<br />
QUESTIONS:<br />
5. Additional material (*)<br />
• Installing Moses and all of the auxiliary software (Giza++,<br />
language model, etc.).<br />
• Training a toy SMT system with Moses.<br />
(a) Evaluation of machine translation<br />
• FOSS for evaluation: IQMT, etc.<br />
• Human evaluation: post-editing, and the traditional adequacy/fluency/informativeness ratings<br />
• Automatic evaluation: BLEU, etc. Criticism of automatic<br />
evaluation.<br />
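As a concrete illustration of what an automatic metric such as BLEU computes, the following is a minimal sketch only. It assumes a single reference translation, whitespace tokenization, no smoothing, and sentence-level scoring (BLEU is normally reported over a whole test corpus). The function names `ngrams` and `bleu` are hypothetical and do not come from IQMT or any other tool named in this outline.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against one reference (illustrative)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped counts: a candidate n-gram is credited at most as many
        # times as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            # Without smoothing, one zero precision makes the whole score zero.
            return 0.0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(bleu("the cat sat on the mat", "the cat sat on the mat"))
    print(bleu("the the the the", "the cat", max_n=1))
```

The hard zero for any missing n-gram order shows why sentence-level BLEU is fragile: the score rewards surface overlap with the reference rather than adequacy, which is one common strand of the criticism of automatic evaluation.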
(b) Free/open-source resources that can be used to build FOSMT<br />
systems<br />
(c) Related software<br />
• Morphological analysers, taggers: Freeling<br />
• Data useful for morphological analysers: Wiktionary (start<br />
of Breton, Icelandic and Faroese)<br />
• Free bilingual dictionaries: DACCO, etc.<br />
• Parallel corpora: Europarl, OPUS, etc.<br />
• ReTraTos: Perl tools to build bilingual dictionaries from<br />
aligned corpora<br />
• Free text: Wikipedia (easy access to free “raw” text)<br />
• Bitext tools: bitextor, tagaligner, etc.<br />
Non-MT FOSS for translation: OmegaT (translation memories),<br />
etc.<br />
Blocks marked with (*) will be dealt with if time allows.<br />
4.1 References<br />
• Armentano-Oller, C., Carrasco, R.C., Corbí-Bellot, A.M., Forcada,<br />
M.L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Ramírez-<br />
Sánchez, F., Sánchez-Martínez, F., Scalco, M. (2006) “Open-source<br />
Portuguese–Spanish machine translation” In Lecture Notes in Computer<br />
Science 3960, 50-59 (Computational Processing of the Portuguese<br />
Language, Proceedings of the 7th International Workshop on Computational<br />
Processing of Written and Spoken Portuguese, PROPOR<br />
2006, May 13-17, 2006, Itatiaia, Rio de Janeiro, Brazil). Available online:<br />
http://www.dlsi.ua.es/~fsanchez/pub/pdf/armentano06.pdf<br />
• Behlendorf, Bruce (1999) “Open source as a business strategy”, in<br />
DiBona, C., Ockman, S., Stone, M., eds. Open Sources: Voices from the<br />
Open Source Revolution, O’Reilly. Available online:<br />
http://oreilly.com/catalog/opensources/book/toc.html<br />
• Forcada, Mikel (2006) “Open source machine translation: an opportunity<br />
for minor languages” in LREC-2006: Fifth International Conference<br />
on Language Resources and Evaluation. 5th SALTMIL Workshop<br />
on Minority Languages: “Strategies for developing machine translation for<br />
minority languages”, Genoa, Italy, 23 May 2006; pp.1-6. Available online:<br />
http://www.mt-archive.info/LREC-2006-Forcada.pdf,<br />
http://www.dlsi.ua.es/~mlf/docum/forcada06p2.pdf<br />
• Sánchez-Martínez, F., Forcada, Mikel L. (2007) “Automatic induction<br />
of shallow-transfer rules for open-source machine translation”<br />
in TMI-2007: Proceedings of the 11th International Conference on Theoretical<br />
and Methodological Issues in Machine Translation, Skövde [Sweden],<br />
7-9 September 2007; pp.181-190. Available online:<br />
http://www.mt-archive.info/TMI-2007-Sanchez-Martinez.pdf,<br />
http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez07c.pdf<br />
• Sánchez-Martínez, F., Pérez-Ortiz, J.A., Forcada, M.L. (2008) “Using target-language<br />
information to train part-of-speech taggers for machine translation”.<br />
Machine Translation, 22(1–2)29–66.<br />
• Streiter, O., Scannell, K.P., Stuflesser, M. (2006) “Implementing<br />
NLP projects for noncentral languages: instructions for funding<br />
bodies, strategies for developers”, Machine Translation 20(4)267–289.<br />
Draft available online: http://borel.slu.edu/pub/mt.pdf.<br />