12.07.2015 Views

Topics in Language Resources for Translation ... - ymerleksi - home


Silvia Hansen-Schirra

treebank projects for other languages have come to life as well, e.g., for French (Abeillé et al. 2000), Italian (Bosco et al. 2000), Spanish (Moreno et al. 2000), etc. For German, the Verbmobil Treebank (Hinrichs et al. 2000) and the Tübingen Treebank (Telljohann et al. 2006) are available. However, they are rather small as reference work and restricted to spoken language (as in the case of Verbmobil). In contrast, the NEGRA/TiGer corpora (Brants et al. 2003), including 70,000 sentences, are the ideal basis for empirical investigations. For their annotation, a hybrid framework is used which combines advantages of dependency grammar and phrase structure grammar. The syntactic structure is represented by a tree. The branches of a tree may cross, allowing the encoding of local and non-local dependencies and eliminating the need for traces. This approach has considerable advantages for free-word order languages such as German, which show a large variety of discontinuous constituency types (Skut et al. 1997). The linguistic annotation of each sentence in the TiGer Treebank is represented on a number of different levels: Information on part-of-speech, morphology and lemmata is encoded in terminal nodes (on the word level). Non-terminal nodes are labelled with phrase categories. The edges of a tree represent syntactic functions. Syntactic structures are rather flat and simple in order to reduce the potential for attachment ambiguities.
The distinction between arguments and adjuncts, for instance, is not expressed in the constituent structure, but is instead encoded by means of syntactic functions. Secondary edges, i.e., labelled directed arcs between arbitrary nodes, are used to encode coordination information.

Instead of having an automatic parser as pre-processor and a human annotator as postprocessor (as in the Penn Treebank project), interactive annotation with the annotation tool (Brants & Plaehn 2000) is used for the annotation process, efficiently combining automatic parsing and human annotation. The TnT tagger (Brants 2000) and the parser using Cascaded Markov Models (Brants 1999) generate small parts of the annotation, which are immediately presented visually to the human annotator, who can either accept, correct or reject it. Based on the annotator’s decision, the parser proposes the next part of the annotation, which is again submitted to the annotator’s judgement. This process is repeated until the annotation of the sentence is complete. The advantage of this interactive method is that the human decisions can be used by the automatic parser. Thus, errors made by the automatic parser at lower levels are corrected instantly and do not ‘shine through’ on higher levels. The chances grow that the automatic parser proposes correct analyses on higher levels. In order to achieve a high level of consistency and to avoid mistakes, we use a very thorough approach to the annotation: First, each sentence is annotated independently by two annotators.
With the support of scripts, they afterwards compare their annotations and correct obvious mistakes. Remaining differences are submitted to a discussion between the annotators. Al-
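
The script-supported comparison of two independent annotations can be pictured with a minimal Python sketch. It is not the project's actual tooling: the `Edge` type, the `compare_annotations` function, the node identifiers and the edge labels are all hypothetical, and an annotation is simplified here to a set of labelled parent-child edges.

```python
from __future__ import annotations

from typing import NamedTuple


class Edge(NamedTuple):
    """One labelled edge of an annotation (hypothetical, simplified format)."""
    parent: str    # id of the dominating node, e.g. "VP"
    child: str     # id of the dominated node or token
    function: str  # syntactic function label on the edge


def compare_annotations(first: set[Edge], second: set[Edge]) -> dict[str, set[Edge]]:
    """Split two annotations of one sentence into agreed and disputed edges."""
    return {
        "agreed": first & second,        # edges both annotators produced
        "only_first": first - second,    # edges only annotator 1 produced
        "only_second": second - first,   # edges only annotator 2 produced
    }


# Illustrative data: the annotators disagree on one edge label.
annotator_1 = {Edge("S", "er", "SB"), Edge("VP", "Buch", "OA")}
annotator_2 = {Edge("S", "er", "SB"), Edge("VP", "Buch", "MO")}

report = compare_annotations(annotator_1, annotator_2)
disputed = report["only_first"] | report["only_second"]
print(f"{len(report['agreed'])} agreed edge(s), {len(disputed)} disputed edge(s)")
```

In a workflow like the one described, the disputed edges would be the ones the annotators go on to discuss.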

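The annotation levels described for the TiGer Treebank (word-level information in terminal nodes, phrase categories on non-terminal nodes, syntactic functions on edges, and branches that may cross) can be illustrated with a small Python sketch. This is an assumed in-memory model, not the TiGer file format: the class names, the example sentence and its labels are hypothetical illustrations, and secondary edges are left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Terminal:
    """Word-level node carrying part-of-speech, lemma and morphology."""
    position: int   # index of the token in the sentence
    word: str
    pos: str
    lemma: str
    morph: str = ""


@dataclass
class NonTerminal:
    """Phrase node; each child is stored with the function label of its edge."""
    category: str
    children: List[Tuple[str, "Node"]] = field(default_factory=list)


Node = Union[Terminal, NonTerminal]


def token_positions(node: Node) -> List[int]:
    """Return the sorted sentence positions of all tokens under a node."""
    if isinstance(node, Terminal):
        return [node.position]
    positions: List[int] = []
    for _function, child in node.children:
        positions.extend(token_positions(child))
    return sorted(positions)


def is_discontinuous(node: Node) -> bool:
    """True if the tokens under the node do not form one contiguous span."""
    positions = token_positions(node)
    if not positions:
        return False
    return positions[-1] - positions[0] + 1 != len(positions)


# Illustrative sentence "Das Buch hat er gelesen" with assumed labels.
das = Terminal(0, "Das", "ART", "der")
buch = Terminal(1, "Buch", "NN", "Buch")
hat = Terminal(2, "hat", "VAFIN", "haben")
er = Terminal(3, "er", "PPER", "er")
gelesen = Terminal(4, "gelesen", "VVPP", "lesen")

np = NonTerminal("NP", [("NK", das), ("NK", buch)])
vp = NonTerminal("VP", [("OA", np), ("HD", gelesen)])
s = NonTerminal("S", [("HD", hat), ("SB", er), ("OC", vp)])

print(token_positions(vp), is_discontinuous(vp))
```

Because the verb phrase covers tokens 0, 1 and 4 while the finite verb and the subject sit in between, its branches cross those of the sentence node, and the discontinuity is recorded directly in the tree without a trace.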