Topics in Language Resources for Translation ... - ymerleksi - home

Chapter 2. Interactive reference grammars
Silvia Hansen-Schirra

[...] corpus design are employed in corpus-based translation work: the parallel corpus and the comparable corpus. Parallel corpora consist of source language texts and translations of those texts into a target language. They are commonly employed for terminology look-up and for teaching the usage of collocations. Furthermore, they are used for bilingual lexicography and, more recently, as training material in machine translation. Comparable corpora are collections of translations (from one or more source languages) into one target language and original texts in the target language. Comparable corpora are mainly used in translation research: they reveal properties that are characteristic of translated texts only, e.g., explicitation or simplification (for further information on corpora in translation studies see Olohan (2004)).

At the technical level, translation studies benefits from existing corpus-linguistic techniques, such as key word in context (KWIC) concordances, automatic frequency counts of words, etc. While the use of such tools has become an integral part of technical practice in corpus-based translation work, more sophisticated corpus techniques, notably tools for corpus annotation, corpus maintenance and corpus query as they have been developed for monolingual corpora, have so far rarely been exploited. Thus, new methodological and technical challenges emerge for the discipline. Since this chapter will show how treebanks, i.e., corpora annotated with syntax trees, can be used for translation, the following discusses the creation of monolingual (Subsection 2.1) as well as multilingual treebanks (Subsection 2.2).

2.1 Monolingual treebanking

One of the first and best known treebanks is the Penn Treebank for the English language (Marcus et al. 1994), which consists of more than 1 million words of newspaper text.
It contains part-of-speech tagging as well as syntactic and semantic annotation. A bracketing format is used to encode predicate argument structure, and trace-filler mechanisms are used to represent discontinuous phenomena. Other treebanks for English are, for instance, the Susanne Corpus (Sampson 1995) (containing detailed part-of-speech information and phrase structure annotation), the Lancaster Parsed Corpus (Leech 1992) (representing phrase structure annotation by means of labelled bracketing) and the British part of the International Corpus of English (Greenbaum 1996) (about 1 million words of British English that were tagged, parsed and afterwards checked).

For languages other than English, a fairly well-known treebank is the Prague Dependency Treebank for Czech (Hajic 1999). It contains more than 1 million tokens and is annotated on three levels: on the morphological level (tags, lemmata, word forms), on the syntactic level (using dependency syntax) and on the tectogrammatical level (encoding functions such as actor, patient, etc.). Recently, treebank projects for other languages have come to life as well, e.g., for French (Abeillé et al. 2000), Italian (Bosco et al. 2000), Spanish (Moreno et al. 2000), etc.

For German, the Verbmobil Treebank (Hinrichs et al. 2000) and the Tübingen Treebank (Telljohann et al. 2006) are available. However, they are rather small for use as reference works and, in the case of Verbmobil, restricted to spoken language. In contrast, the NEGRA/TiGer corpora (Brants et al. 2003), including 70,000 sentences, are the ideal basis for empirical investigations. For their annotation, a hybrid framework is used which combines advantages of dependency grammar and phrase structure grammar. The syntactic structure is represented by a tree. The branches of a tree may cross, allowing the encoding of local and non-local dependencies and eliminating the need for traces. This approach has considerable advantages for free-word order languages such as German, which show a large variety of discontinuous constituency types (Skut et al. 1997).

The linguistic annotation of each sentence in the TiGer Treebank is represented on a number of different levels: Information on part-of-speech, morphology and lemmata is encoded in terminal nodes (on the word level). Non-terminal nodes are labelled with phrase categories. The edges of a tree represent syntactic functions. Syntactic structures are rather flat and simple in order to reduce the potential for attachment ambiguities. The distinction between arguments and adjuncts, for instance, is not expressed in the constituent structure, but is instead encoded by means of syntactic functions.
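
The annotation levels just described (terminal nodes carrying part-of-speech, morphology and lemma; non-terminal nodes labelled with phrase categories; edges labelled with syntactic functions) can be pictured as a small data structure. The Python sketch below is only an illustration under stated assumptions: it is not the TiGer file format or any TiGer tool, and the class names, the example sentence and its analysis are invented for this purpose. It shows how crossing branches allow a discontinuous constituent to be stored directly, without a trace.

```python
from dataclasses import dataclass, field


@dataclass
class Terminal:
    """Word-level node: position in the sentence plus word-level annotation."""
    index: int
    word: str
    pos: str
    morph: str
    lemma: str


@dataclass
class NonTerminal:
    """Phrase node: category label plus (function label, child) edges."""
    category: str
    children: list = field(default_factory=list)


def yield_indices(node):
    """Collect the sentence positions of all words dominated by a node."""
    if isinstance(node, Terminal):
        return {node.index}
    indices = set()
    for _function, child in node.children:
        indices |= yield_indices(child)
    return indices


def is_discontinuous(node):
    """A node has crossing branches if its words do not form one unbroken span."""
    indices = yield_indices(node)
    return max(indices) - min(indices) + 1 != len(indices)


# Hypothetical example sentence: "Das Buch hat er gelesen"
das = Terminal(0, "Das", "ART", "Acc.Sg.Neut", "der")
buch = Terminal(1, "Buch", "NN", "Acc.Sg.Neut", "Buch")
hat = Terminal(2, "hat", "VAFIN", "3.Sg.Pres.Ind", "haben")
er = Terminal(3, "er", "PPER", "3.Nom.Sg.Masc", "er")
gelesen = Terminal(4, "gelesen", "VVPP", "--", "lesen")

# Flat structure: grammatical roles live on the edge labels, not in extra layers.
object_np = NonTerminal("NP", [("NK", das), ("NK", buch)])
vp = NonTerminal("VP", [("OA", object_np), ("HD", gelesen)])
s = NonTerminal("S", [("OC", vp), ("HD", hat), ("SB", er)])

print(sorted(yield_indices(vp)))    # [0, 1, 4]
print(is_discontinuous(vp))         # True: "Das Buch ... gelesen" is interrupted
print(is_discontinuous(object_np))  # False
```

In this toy analysis the verb phrase covers positions 0, 1 and 4, so its branches cross those of the finite verb and the subject; a purely bracketed representation would need a trace to express the same relation.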
Secondary edges, i.e., labelled directed arcs between arbitrary nodes, are used to encode coordination information.

Instead of having an automatic parser as pre-processor and a human annotator as postprocessor (as in the Penn Treebank project), interactive annotation with the annotation tool (Brants & Plaehn 2000) is used for the annotation process, efficiently combining automatic parsing and human annotation. The TnT tagger (Brants 2000) and the parser using Cascaded Markov Models (Brants 1999) generate small parts of the annotation, which are immediately presented visually to the human annotator, who can accept, correct or reject them. Based on the annotator’s decision, the parser proposes the next part of the annotation, which is again submitted to the annotator’s judgement. This process is repeated until the annotation of the sentence is complete. The advantage of this interactive method is that the human decisions can be used by the automatic parser. Thus, errors made by the automatic parser at lower levels are corrected instantly and do not ‘shine through’ on higher levels, which increases the chances that the automatic parser proposes correct analyses on higher levels.

In order to achieve a high level of consistency and to avoid mistakes, we use a very thorough approach to the annotation: First, each sentence is annotated independently by two annotators. With the support of scripts, they afterwards compare their annotations and correct obvious mistakes. Remaining differences are submitted to a discussion between the annotators. Al-
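
The script-supported comparison of two independent annotations mentioned above could look roughly like the following Python sketch. The text does not say what the project's scripts actually do, so everything here is an assumption: each annotation is reduced to a list of (category, token positions) pairs, and the function name and the sample data are hypothetical.

```python
def constituents_only_in_one(annotation_a, annotation_b):
    """Return the constituents that appear in only one of two annotations.

    Each annotation is a list of (category, token_positions) tuples.
    """
    only_a = sorted(set(annotation_a) - set(annotation_b))
    only_b = sorted(set(annotation_b) - set(annotation_a))
    return only_a, only_b


# Hypothetical annotations of the same five-token sentence by two annotators.
annotator_1 = [("NP", (0, 1)), ("VP", (0, 1, 4)), ("S", (0, 1, 2, 3, 4))]
annotator_2 = [("NP", (0, 1)), ("VP", (4,)), ("S", (0, 1, 2, 3, 4))]

only_1, only_2 = constituents_only_in_one(annotator_1, annotator_2)
print(only_1)  # [('VP', (0, 1, 4))]
print(only_2)  # [('VP', (4,))]
```

Constituents on which both annotators agree drop out of the result, so only the disputed nodes remain for correction or discussion.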