
A Treebank-based Investigation of IPP-triggering Verbs in Dutch


sv: Bestämmelserna i detta avtal får ändras eller revideras helt eller delvis efter gemensam överenskomst mellan parterna.

da: Bestemmelserne i denne aftale kan ændres og revideres helt eller delvis efter fælles overenskomst mellem parterne.

no: Bestemmelsene i denne avtalen kan endres eller revideres helt eller delvis etter felles overenskomst mellom partene.

en: The provisions of the Agreement may be amended and revised as a whole or in detail by common agreement between the two Parties.

Figure 3: Swedish-Danish parallel example.

driver we have written for the Norwegian tagset. We then delexicalise the corpora, replacing the word form and lemma columns in the CoNLL data with dummy values, leaving only the PoS information in the treebank, and train parsers on the delexicalised treebanks. The “Basic” column of the first two rows of Table 1 shows the performance of these parsers on the Norwegian test corpus. We tried to apply Søgaard’s [9] filtering technique in conjunction with these experiments, but it did not have any noticeable impact on performance, confirming Søgaard’s hypothesis that this technique is most useful for unrelated or distantly related languages.
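The delexicalisation step described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes tab-separated CoNLL-X input in which FORM and LEMMA are the second and third columns, and it uses `_` as the dummy value. The function name `delexicalise` and the three-token sample (including its tags, heads and relation labels) are invented for the example.

```python
def delexicalise(conll_text, dummy="_"):
    """Blank out FORM and LEMMA in CoNLL-X data, keeping PoS and tree columns.

    Assumes tab-separated CoNLL-X rows (ID, FORM, LEMMA, CPOSTAG, POSTAG,
    FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with blank lines between sentences.
    """
    out = []
    for line in conll_text.splitlines():
        if not line.strip():
            out.append("")          # keep sentence boundaries
            continue
        cols = line.split("\t")
        cols[1] = dummy             # FORM
        cols[2] = dummy             # LEMMA
        out.append("\t".join(cols))
    return "\n".join(out)


# Invented toy sentence; the annotation is illustrative only.
sample = "\n".join([
    "1\tAvtalen\tavtale\tN\tN\t_\t2\tsubj\t_\t_",
    "2\tkan\tkunne\tV\tV\t_\t0\troot\t_\t_",
    "3\tendres\tendre\tV\tV\t_\t2\tvcomp\t_\t_",
])

delexicalised = delexicalise(sample)
print(delexicalised)
```

After this transformation a parser trained on the data can only condition on PoS tags and tree structure, which is what makes the model transferable across tagset-harmonised languages.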

3.2 Lexicalised parsing

Not delexicalising the source corpora in a cross-lingual setting is unconventional, but, as noted above, the Scandinavian languages are extremely similar. Figure 3 shows an example parallel sentence 4 (Zeman and Resnik [12]), and we may note that of the 16 words in the sentence, 5 are identical between the Danish and Swedish versions. In the Norwegian version, 9 words (i, kan, og, revideres, helt, eller, delvis, overenskomst) are identical to the words in the Danish version and 6 (i, får, eller [twice], helt, delvis) to those in the Swedish. Given this similarity, our hypothesis was that keeping the words might give a small performance boost by making it possible to learn some selectional preferences during parser training.
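The Swedish-Danish count can be reproduced from the sentences in Figure 3 with a small sketch. It assumes that the two versions are word-aligned, so that tokens can be compared position by position, and that trailing punctuation is ignored; the paper does not state how the count was made, and the helper names are invented for this example.

```python
def tokens(sentence):
    """Split on whitespace and drop punctuation attached to a word."""
    return [w.strip(".,") for w in sentence.split()]


def identical_words(a, b):
    """Words that are string-identical at the same position in both sentences."""
    return [x for x, y in zip(tokens(a), tokens(b)) if x == y]


# Sentences copied from Figure 3.
sv = ("Bestämmelserna i detta avtal får ändras eller revideras helt eller "
      "delvis efter gemensam överenskomst mellan parterna.")
da = ("Bestemmelserne i denne aftale kan ændres og revideres helt eller "
      "delvis efter fælles overenskomst mellem parterne.")

shared = identical_words(sv, da)
print(len(tokens(sv)), len(shared), shared)
# 16 5 ['i', 'helt', 'eller', 'delvis', 'efter']
```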

Looking more closely at the lexical overlap between the training corpora and our test corpus, we find that, disregarding punctuation, 338 of the word-forms encountered in the Swedish training set also occur in the test corpus (a total of 3151 times), and 660 word-forms from the Danish corpus occur a total of 4273 times. This means that out of the 6809 non-punctuation tokens in the Norwegian test corpus, 46% are known to a lexicalised Swedish parser and 63% are known to a lexicalised Danish parser.
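The overlap statistic above can be computed along the following lines. This is a sketch under assumptions the paper does not spell out: word-forms are matched as exact, case-sensitive strings, and a token counts as punctuation when every character is ASCII punctuation. The function names and the toy token lists are invented; only the figures 3151, 4273 and 6809 come from the text.

```python
import string
from collections import Counter


def is_punct(token):
    """True if the token consists only of ASCII punctuation characters."""
    return all(ch in string.punctuation for ch in token)


def lexical_overlap(train_tokens, test_tokens):
    """Return (shared word-form types, their test occurrences, share of test tokens)."""
    vocab = {t for t in train_tokens if not is_punct(t)}
    test_counts = Counter(t for t in test_tokens if not is_punct(t))
    shared_types = vocab & set(test_counts)
    occurrences = sum(test_counts[t] for t in shared_types)
    return len(shared_types), occurrences, occurrences / sum(test_counts.values())


# Toy data, not taken from the corpora.
toy_train = ["i", "og", "kan", "."]
toy_test = ["i", "kan", "i", "ikke", "."]
toy_result = lexical_overlap(toy_train, toy_test)
print(toy_result)  # (2, 3, 0.75)

# Percentages implied by the counts reported in the text.
swedish_pct = round(100 * 3151 / 6809)
danish_pct = round(100 * 4273 / 6809)
print(swedish_pct, danish_pct)  # 46 63
```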

The “Basic” column of Table 1 shows the results of our experiments with the parsers described in this and the previous section. As the table makes clear, Talbanken fares far better than DDT as a source corpus. Given the large overlap in

4 The sentence is from the Acquis corpus. The Swedish, Danish and English versions are those in Acquis; Norway not being a member of the EU, we have supplied our own Norwegian translation.
