Free/Open-Source Machine Translation: a three-day tutorial ... - CNGL

Free/Open-Source Machine 

Translation: a three-day tutorial at the 

Centre for Next Generation Localisation 

Mikel L. Forcada 

Universitat d’Alacant, E-03071 Alacant, Spain 

Centre for Next Generation Localisation 

April 14–16, 2009 

1 Objectives 

• To become acquainted with existing free/open-source machine translation 

(FOSMT) software 

• To understand rule-based FOSMT using Apertium as an example 

• To experience FOSMT development in the Apertium platform 

• To understand corpus-based FOSMT using Moses as an example (*) 

• To experience FOSMT development with Moses (*) 

• To become aware of free/open-source resources which can be used 

to build FOSMT systems 

• To understand how FOSMT can be used to generate new human-language technology tools 

We will work on objectives marked with (*) if time allows. 

2 Audience 

PhD students and postdoctoral researchers in the areas of language technologies: 

machine translation, speech, natural language processing, digital 

content management —information retrieval, information extraction, 

adaptive hypermedia— and localization. 

The tutorial may also be attended by translators with basic to intermediate ICT skills. 

3 Duration, structure, resources needed 

Five two-hour sessions spread over two and a half days, as follows: 

• Tuesday, April 14, 10:00–12:00 and 14:00–16:00, laboratory L1.14 

• Wednesday, April 15, 10:00–12:00 and 14:00–16:00, laboratory L1.01 

• Thursday, April 16, 10:00–12:00, laboratory L1.14 

Both labs are in the School of Computing. We’ll have two kinds of sessions: 

• “Classroom” sessions will be organized to be as participative as possible, so as to accommodate different backgrounds. 

• Laboratory sessions are designed for participants to get a grasp of 

what FOSMT development may look like. 

4 Tentative list of blocks & contents 

1. Free/open-source software 

• Free software: the four basic freedoms. The ambiguity of the word “free”. Adoption of “open-source”. Free/open-source software (FOSS) versus freeware, shareware, and cost-free services on the net. 

• Copyleft as an addition to free licenses, and the creation and securing of a commons. Copylefted and non-copylefted licenses. 

• Analysis of typical non-free licenses used in academic 

and commercial settings. 

• FOSS and science. 

• The shift toward FOSS-based business models: advantages 

and vulnerabilities (Behlendorf 1999) 

• FOSS project management. Roles. Releases. Repositories, 

modification control. “Bazaar-style” versus 

“cathedral-style” 

2. Free/open-source machine translation (FOSMT) 

• The three basic components of an MT system in development: engine, data, tools (ED&T). Basic tools (compiler), auxiliary tools (evaluation tools, dictionary management tools). FOSMT as MT in which ED&T are all free/open-source. 

• Free/open-source rule-based MT and corpus-based MT 

• Copyleft applied to linguistic data. Pooling and the commons 

(Streiter et al. 2006). Minor languages (Forcada 

2006). 

• Advantages of FOSMT: The advanced-user/developer 

continuum. In particular, advantages for minor languages. 

• Generation of resources for other language technologies 

(LT). 

• Using standard or well-documented formats: interoperability, transferability. 

• FOSMT development scenarios 

• Examples of existing FOSMT software. 

• FOSMT project management. 

3. Rule-based FOSMT: Apertium 

(a) Description 

• Background (interNOSTRUM, Universia) 

• Rationale 

• Apertium as a shallow-transfer machine translation platform: 

engine, data, tools (Armentano et al. 2006). 

• Language-pair data 

• Funding 

• The Apertium community as an example of FOSS development 

(repositories: the trunk, the incubator; roles). 

• Apertium tools: apertium-dixtools, apertium-transfertools, 

apertium-tagger-training-tools. 

• Apertium-based applications (Tinylex, Wordpress plugin, 

Pidgin plugin, OpenOffice.org plugins, etc.) 

• Apertium as a research platform (Sánchez-Martínez and 

Forcada 2007, Sánchez-Martínez et al. 2008). 

(b) Laboratory: installing and modifying Apertium 

• Installing Apertium from the latest sources on a virtual 

Linux machine. 

• Changing the data for a language pair: vocabularies and, 

optionally, transfer rules. 

4. Corpus-based FOSMT: Moses (*) 

(a) Statistical machine translation 

• Statistical MT (SMT) 

• The data: sentence-aligned corpora. 

• Training: the statistical models 

• The SMT engine or “decoder” 

(b) Description of Moses and related software: 

• Training: Giza++ 

• Language models: irstlm 

• Tuning: MERT 

• Decoding: Moses 

(c) Laboratory: installing and running Moses. 

QUESTIONS: 

5. Additional material (*) 

• Installing Moses and all of the auxiliary software (Giza++, 

language model, etc.). 

• Training a toy SMT system with Moses. 

(a) Evaluation of machine translation 

• FOSS for evaluation: IQMT, etc. 

• Human evaluation: postediting and the traditional adequacy/fluency/informativeness 

• Automatic evaluation: BLEU, etc. Criticism of automatic 

evaluation. 

(b) Free/open-source resources that can be used to build FOSMT 

systems 

(c) Related software 

• Morphological analysers, taggers: Freeling 

• Data useful for morphological analysers: Wiktionary (start 

of Breton, Icelandic and Faroese) 

• Free bilingual dictionaries: DACCO, etc. 

• Parallel corpora: Europarl, OPUS, etc. 

• ReTraTos: Perl tools to build bilingual dictionaries from aligned corpora 

• Free text: Wikipedia (easy access to free “raw” text) 

• Bitext tools: bitextor, tagaligner, etc. 

Non-MT FOSS for translation: OmegaT (translation memories), 

etc. 

Blocks marked with (*) will be dealt with if time allows. 
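
As a rough illustration of the “Automatic evaluation: BLEU” item in block 5, the following is a minimal sketch of a BLEU-style score, not the reference implementation and not the code used in the tutorial. It assumes a single reference translation, pre-tokenized input, no smoothing, and scoring of one sentence pair at a time (BLEU proper is computed over a whole test corpus). The function names `ngrams` and `bleu` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed, single-reference BLEU-style score for one pair of token lists."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Modified precision: clip each candidate n-gram count by its
        # count in the reference, so repeated words are not over-rewarded.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            # Without smoothing, a single zero precision zeroes the score.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    ref = "the cat is on the mat".split()
    print(bleu("the cat is on the mat".split(), ref))
    print(bleu("the the the the".split(), ref, max_n=1))
```

The second call shows the clipping step at work: “the” occurs four times in the candidate but only twice in the reference, so only two of the four unigrams count as matches. That a sentence sharing no 4-gram with its reference scores zero under this unsmoothed variant is one concrete example of the kind of criticism of automatic evaluation mentioned in the same block.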

4.1 References 

• Armentano-Oller, C., Carrasco, R.C., Corbí-Bellot, A.M., Forcada, M.L., Ginestí-Rosell, M., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Ramírez-Sánchez, F., Sánchez-Martínez, F., Scalco, M. (2006) “Open-source Portuguese–Spanish machine translation”, in Lecture Notes in Computer Science 3960, 50–59 (Computational Processing of the Portuguese Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006, May 13–17, 2006, Itatiaia, Rio de Janeiro, Brazil). Available online: http://www.dlsi.ua.es/~fsanchez/pub/pdf/armentano06.pdf 

• Behlendorf, Bruce (1999) “Open source as a business strategy”, in DiBona, C., Ockman, S., Stone, M., eds. Open Sources: Voices from the Open Source Revolution, O’Reilly. Available online: http://oreilly.com/catalog/opensources/book/toc.html 

• Forcada, Mikel (2006) “Open source machine translation: an opportunity for minor languages”, in LREC-2006: Fifth International Conference on Language Resources and Evaluation. 5th SALTMIL Workshop on Minority Languages: “Strategies for developing machine translation for minority languages”, Genoa, Italy, 23 May 2006; pp. 1–6. Available online: http://www.mt-archive.info/LREC-2006-Forcada.pdf, http://www.dlsi.ua.es/~mlf/docum/forcada06p2.pdf 

• Sánchez-Martínez, F., Forcada, Mikel L. (2007) “Automatic induction of shallow-transfer rules for open-source machine translation”, in TMI-2007: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation, Skövde [Sweden], 7–9 September 2007; pp. 181–190. Available online: http://www.mt-archive.info/TMI-2007-Sanchez-Martinez.pdf, http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez07c.pdf 

• Sánchez-Martínez, F., Pérez-Ortiz, J.A., Forcada, M.L. (2008) “Using target-language information to train part-of-speech taggers for machine translation”, Machine Translation 22(1–2), 29–66. 

• Streiter, O., Scannell, K.P., Stuflesser, M. (2006) “Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers”, Machine Translation 20(4), 267–289. Draft available online: http://borel.slu.edu/pub/mt.pdf 
