
A Treebank-based Investigation of IPP-triggering Verbs in Dutch


phrases are generated by a shallow parsing model (hidden Markov model) trained on the Le Monde corpus [1]. Finally, the monolingual treebanks are exported in the TIGER-XML format [6]. More details about the annotation of the German treebank can be found in [14], whereas the French annotation is described in [5].

We use the Bleualign algorithm [13] to align the sentences across both monolingual treebanks. Our alignment convention was to discard the automatic many-to-many word alignments in order to increase precision. Subsequently, a human annotator checked and, when required, corrected the remaining word and sentence alignments and then added the phrase alignments. Finally, the alignment file is available in XML format, as the following snippet shows:

[XML alignment snippet not preserved in this copy]

This says that node 18 in sentence 225 of the German treebank (de) is aligned with node 16 in sentence 231 of the French treebank (fr). The node identifiers refer to the IDs in the TIGER-XML treebanks. The alignment is labeled as good when the linked phrases represent exact translations and as fuzzy in case of approximate correspondences.
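To make the description above concrete, the following is a minimal sketch of how such an alignment entry could be read programmatically. The element names (`alignments`, `align`, `node`), the attribute names (`type`, `treebank_id`, `node_id`) and the identifier pattern `s<sentence>_<node>` are assumptions made for illustration; they are not taken from the published alignment file.

```python
import xml.etree.ElementTree as ET

# Hypothetical alignment entry. Element names, attribute names and the
# "s<sentence>_<node>" ID pattern are assumptions, not the actual file format.
ALIGNMENT_XML = """
<alignments>
  <align type="good">
    <node treebank_id="de" node_id="s225_18"/>
    <node treebank_id="fr" node_id="s231_16"/>
  </align>
</alignments>
"""


def read_alignments(xml_text):
    """Return one dict per <align> element: label plus (sentence, node) per treebank."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for align in root.iter("align"):
        entry = {"label": align.get("type")}
        for node in align.iter("node"):
            # Drop the leading "s", then split "225_18" into sentence and node number.
            sentence_no, node_no = node.get("node_id")[1:].split("_")
            entry[node.get("treebank_id")] = (int(sentence_no), int(node_no))
        result.append(entry)
    return result


alignments = read_alignments(ALIGNMENT_XML)
print(alignments)
```

Keeping the label next to the node pair makes it straightforward to filter, for instance, only the alignments marked as good.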

3 The Evaluation Tool: DELiC4MT<br />

DELiC4MT is an open-source tool that performs diagnostic evaluation of MT systems over user-defined linguistically-motivated constructions, also called checkpoints. This term was introduced by Zhou et al. [15] and refers to either lexical elements or grammatical constructions, such as ambiguous words, noun phrases, verb-object collocations, etc. The experiments reported in this paper follow the workflow proposed by Naskar et al. [9], because it allows new language pairs to be integrated.

First, the texts are PoS-tagged and exported in the KYOTO Annotation Format (KAF) [2]. This scheme facilitates the inspection of the terms in the sentences and thus querying for specific features, such as PoS sequences. Figure 1 depicts the KAF annotation of the German phrase den ersten Gipfel (EN: the first peak) and its French equivalent le premier sommet.
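As an illustration of how a KAF-style annotation supports querying for PoS sequences, the sketch below reads a simplified fragment for the German phrase and extracts the tag sequence. The fragment is a reduced approximation of KAF (word forms plus terms that point back to them); the lemma and PoS values shown are assumptions and need not match the tags used in the actual annotation.

```python
import xml.etree.ElementTree as ET

# Simplified KAF-like fragment. Lemma and PoS values are illustrative
# assumptions, not copied from the treebank annotation.
KAF_FRAGMENT = """
<KAF>
  <text>
    <wf wid="w1" sent="1">den</wf>
    <wf wid="w2" sent="1">ersten</wf>
    <wf wid="w3" sent="1">Gipfel</wf>
  </text>
  <terms>
    <term tid="t1" lemma="der" pos="ART"><span><target id="w1"/></span></term>
    <term tid="t2" lemma="erst" pos="ADJA"><span><target id="w2"/></span></term>
    <term tid="t3" lemma="Gipfel" pos="NN"><span><target id="w3"/></span></term>
  </terms>
</KAF>
"""


def read_terms(kaf_text):
    """Return (surface form, lemma, PoS) for every term in the fragment."""
    root = ET.fromstring(kaf_text.strip())
    # Map word-form IDs to their surface strings.
    words = {wf.get("wid"): wf.text for wf in root.iter("wf")}
    terms = []
    for term in root.iter("term"):
        tokens = [words[target.get("id")] for target in term.iter("target")]
        terms.append((" ".join(tokens), term.get("lemma"), term.get("pos")))
    return terms


terms = read_terms(KAF_FRAGMENT)
pos_sequence = [pos for _, _, pos in terms]
print(terms)
print(pos_sequence)
```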

The linguistic checkpoints are subsequently defined in the so-called kybot profiles. A kybot profile starts with the declaration of the involved variables and the relations among them and ends by specifying which attributes of the matched terms should be exported. For example, Figure 2 depicts the kybot profile for a nominal group consisting of a determiner, an adjective and a noun. Moreover, the constituent terms have to be consecutive. Once defined, the kybot profile is run over the source language KAF files and the matched terms (with the specified attributes) are stored in a separate XML file.
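The matching logic that such a profile expresses can be sketched in a few lines. The code below is not kybot syntax and does not use the DELiC4MT tooling; it is a plain Python stand-in that finds consecutive terms whose PoS tags form a determiner, adjective, noun sequence. The function name, the coarse tag labels and the sample sentence are all invented for illustration.

```python
from typing import List, Tuple

Term = Tuple[str, str]  # (token, coarse PoS tag); the tag names are assumptions


def match_checkpoint(terms: List[Term], pattern: List[str]) -> List[List[str]]:
    """Return the tokens of every run of consecutive terms whose tags equal `pattern`."""
    matches = []
    width = len(pattern)
    # Slide a window of the pattern's length over the sentence.
    for start in range(len(terms) - width + 1):
        window = terms[start:start + width]
        if [pos for _, pos in window] == pattern:
            matches.append([token for token, _ in window])
    return matches


# Made-up example sentence containing the phrase discussed in the text.
sentence = [
    ("Wir", "PRON"),
    ("erreichten", "VERB"),
    ("den", "DET"),
    ("ersten", "ADJ"),
    ("Gipfel", "NOUN"),
]
found = match_checkpoint(sentence, ["DET", "ADJ", "NOUN"])
print(found)
```

Requiring an exact match on a contiguous window mirrors the constraint that the constituent terms have to be consecutive; a determiner and a noun separated by other material would not be reported.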
