12.07.2015 Views

Topics in Language Resources for Translation ... - ymerleksi - home

Chapter 5. The real use of corpora in teaching and research contexts

(Grup de Lingüística Computacional, UPF) jointly with the Cátedra Telefónica (UPF) aimed at obtaining a large Catalan corpus from the Web.

In Leeds, general Internet corpora for a range of languages including Chinese, French, German, Italian, Polish, Romanian, Russian and Spanish have been developed in order to cover the needs of researchers and students enrolled at their Centre. The size of all of these general corpora ranges between 100 and 200 million words, which makes them especially suitable for contrastive studies as they are supposed to be comparable. The procedure adopted in these cases for the acquisition of data is based on BootCat (Baroni & Bernardini 2004) and is described in detail in Sharoff (2004).

Following the same idea, although using a different strategy for data compilation (see Boleda et al. 2004), CUCWeb, a 166 million word corpus for Catalan, was built by crawling the Web. This project is especially relevant due to the fact that it deals with a minor language (with some 12 million speakers), similar to Serbian (also about 12 million speakers), or Swedish (9.3). Before CUCWeb, the largest annotated Catalan corpus was the CTILC corpus (Rafel 1994), containing 50 million words stemming from literary and non literary documents between 1832 and 1998, which obviously do not reflect some modern usages of the language.

Of all these efforts to build corpora from the Web, the OPUS collection⁵ deserves special attention as it is a collection of parallel corpora which are, as is well known (Badia et al. 2002), much more expensive to build than any monolingual corpus. There is a general scarcity of annotated parallel corpora, but we can list a few here, such as the Europarl⁶ (extracted from the proceedings of the European Parliament⁷ in 11 European languages), the Canadian Hansard⁸ (a parallel corpus in French and English of the proceedings of the Canadian Parliament), the International Telecommunications Union (ITU) or CRATER Corpus⁹ (trilingual corpus of Spanish, French and English), the English-Norwegian Parallel Corpus¹⁰ (2.6 million in all), the Chemnitz corpus¹¹ (about 2 million words), and BancTrad¹² (a parallel corpus containing texts from English, French or German and

5. http://logos.uio.no/opus/
6. http://people.csail.mit.edu/koehn/publications/europarl/
7. http://www3.europarl.eu.int/omk/omnsapir.so/calendar?APP=CRE&LANGUE=EN
8. http://spraakbanken.gu.se/pedant/parabank/node6.html
9. http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
10. http://www.hf.uio.no/ilos/forskning/forskningsprosjekter/enpc/
11. http://www.tu-chemnitz.de/phil/english/chairs/linguist/real/independent/transcorpus/
12. http://mutis2.upf.es/bt/english/index.htm
