12.07.2015 Views

Topics in Language Resources for Translation ... - ymerleksi - home

Carme Colominas and Toni Badia

their respective translations into Catalan or Spanish or vice versa, about 4 million words). Due to the numerous handicaps of building such types of resources, as we can see from the examples we have mentioned, parallel corpora tend to be domain-specific or relatively small. In this context, the OPUS collection becomes very important, as it is based on open source documentation that can be downloaded as files, and constitutes the largest collection of translated texts available. Training institutions can benefit from the OPUS initiative (adapting the resources to their respective interfaces) and they can contribute to the initiative as well in terms of data or tools. Presently OPUS allows access to the following parallel corpora: EUconst – The European constitution; 21 languages (3 million words), OO – the OpenOffice.org corpus (30 million), KDE – KDE system messages (20 million), KDEdoc – the KDE manual corpus (3.8 million), PHP – the PHP manual corpus (3.5 million), EUROPARL – European Parliament Proceedings 1996–2003 (296 million).

The most obvious benefit in building corpora from the Web is that they are easy and cheap to make. Furthermore, they reflect modern language use, are easily extensible and updated, and promote the technological development of non-major languages. Besides such obvious benefits, some shortcomings have already been pointed out, such as the fact that not all topics and not all text types are equally available, and that such biases become far more evident across languages. This is in fact true.
Internet corpora can no longer meet some of the traditional requirements made of corpora (e.g., to be balanced); however, this does not need to be seen (only) as a handicap. Actually, with the exception of the BNC, most of the so-called general corpora are also heavily biased (e.g., towards newspaper texts, like the IDS or the FR German corpora, or towards literary texts, like the Catalan CTILC). Internet corpora are at least representative of the language on the Web, and this can also be seen as valuable information, e.g., from a sociolinguistic perspective; consider for instance the possibilities of contrastive studies between languages and cultures by means of comparable Internet corpora. They can provide interesting and valuable information from a contrastive point of view to assess the impact of text genre and topic in Internet corpora across languages/cultures (Sharoff 2006). However, the real possibilities of carrying out such studies also depend on the possibilities of accessing the corpora through adequate interfaces.

3. The need for common interfaces to several corpora

As pointed out in Section 1 above, the usefulness of corpus resources, even when they exist, is not so evident to users, because too much time must be spent becoming familiar with the several interfaces and query languages. The problem derived from the multiplicity of interfaces and query lan-
