Topics in Language Resources for Translation ... - ymerleksi - home
Carme Colominas and Toni Badia

layout, but in the types of queries they allow for, and this even affects the exploitation possibilities, especially those that involve comparing the results obtained from several corpora. For instance, it is quite difficult to compare the usage of, e.g., ES/CA molar as a verb (for ‘to be great’, ‘amazing’, ‘cool’, etc.) in the jargon of young Catalan and Spanish speakers, as the corpus available in one language (CUCWeb, for Catalan searches) allows for searches by lemma, whereas the one available in the other (CREA) does not. A similar problem arises when we try to compare patterns of use of a verb such as like in the BNC and mögen in the German IDS corpus. Despite being one of the best available reference corpora, the BNC is not lemmatised, which considerably restricts its potential use and the possibilities of performing this kind of comparison with other languages for which a lemmatised corpus is available. In other words, the range of functionality for automated retrieval from corpora is greatly dependent on annotation, and differences between corpora in this matter limit their potential usage considerably. Besides annotation, corpora differ from each other depending on the query language used. Compare, for example, the different query syntaxes used by Xaira (to access the BNC) and Corpus Workbench.
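The dependence of retrieval on annotation can be illustrated with a minimal sketch. The toy corpus, the tagset, and the function names below are assumptions made purely for illustration; they do not reproduce the interface or data of CUCWeb, CREA, the BNC, or any other corpus mentioned here.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str   # word as it appears in the running text
    lemma: str  # headword; only present if the corpus is lemmatised
    pos: str    # part-of-speech tag (illustrative tagset, an assumption)


# Hypothetical toy corpus, invented for this example.
CORPUS = [
    Token("I", "I", "PRON"), Token("like", "like", "VERB"),
    Token("tea", "tea", "NOUN"), Token(".", ".", "PUNCT"),
    Token("She", "she", "PRON"), Token("liked", "like", "VERB"),
    Token("it", "it", "PRON"), Token(".", ".", "PUNCT"),
    Token("It", "it", "PRON"), Token("looks", "look", "VERB"),
    Token("like", "like", "ADP"), Token("rain", "rain", "NOUN"),
    Token(".", ".", "PUNCT"),
]


def search_by_form(corpus: list[Token], form: str) -> list[int]:
    """Positions whose surface form matches; needs no annotation at all."""
    return [i for i, tok in enumerate(corpus) if tok.form.lower() == form.lower()]


def search_by_lemma(corpus: list[Token], lemma: str,
                    pos: Optional[str] = None) -> list[int]:
    """Positions whose lemma (and optionally POS tag) matches; needs annotation."""
    return [
        i for i, tok in enumerate(corpus)
        if tok.lemma == lemma and (pos is None or tok.pos == pos)
    ]


# A form search finds the preposition as well and misses the inflected verb.
form_hits = search_by_form(CORPUS, "like")
# A lemma search restricted to verbs finds "like" and "liked" only.
lemma_hits = search_by_lemma(CORPUS, "like", pos="VERB")

print(form_hits)   # [1, 10]
print(lemma_hits)  # [1, 5]
```

In a corpus without lemma annotation only the first kind of query is possible, so the user would have to enumerate every inflected form by hand and then discard the non-verbal hits, which is what makes the results hard to compare with those from a lemmatised corpus.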
Since translation students and researchers commonly work with at least three or four different languages, they constantly need to access several URLs in order to get familiar with different interfaces and query languages and, what is worse, to face the differences between corpora in creating concordances (by form, lemma or part-of-speech (POS)), in gathering statistical information, etc. As a result, the usefulness of resources, even when they exist, becomes far from evident for users in general, as too much time must be spent (especially by users who are not trained in query formalisms, as is often the case in the context of translation) in order to familiarise themselves with the several interfaces and query languages.

The two aspects we have pointed out as the most desirable aims, that is, the availability of large and representative corpora and a more user-friendly access to the several corpora needed, are nowadays being addressed by some corpus developers by means of common platforms that allow access to several corpora eventually built from the Web.

2. Internet corpora: An alternative to large corpora

In recent years the arduous and expensive task of building large corpora has found real new chances in the World Wide Web as a source of linguistic data (Kilgarriff and Grefenstette 2003).
Exploiting the Web as a corpus is becoming a real alternative to the traditional building of large corpora, as shown by the Internet corpora compiled at the Centre for Translation Studies of Leeds (Sharoff 2006), the OPUS collection of parallel corpora, or the CUCWeb project developed by the GLiCom