A Treebank-based Investigation of IPP-triggering Verbs in Dutch
sv: Bestämmelserna i detta avtal får ändras eller revideras helt eller delvis efter gemensam överenskomst mellan parterna.
da: Bestemmelserne i denne aftale kan ændres og revideres helt eller delvis efter fælles overenskomst mellem parterne.
no: Bestemmelsene i denne avtalen kan endres eller revideres helt eller delvis etter felles overenskomst mellom partene.
en: The provisions of the Agreement may be amended and revised as a whole or in detail by common agreement between the two Parties.
Figure 3: Swedish-Danish parallel example.
driver we have written for the Norwegian tagset. We then delexicalise the corpora,
replacing the word form and lemma columns in the CoNLL data with dummy
values, leaving only the PoS information in the treebank, and train parsers on the
delexicalised treebanks. The “Basic” column of the first two rows of Table 1 shows
the performance of these parsers on the Norwegian test corpus. We tried to apply
Søgaard’s [9] filtering technique in conjunction with these experiments, but it did
not have any noticeable impact on performance, confirming Søgaard’s hypothesis
that this technique is most useful for unrelated or distantly related languages.
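The delexicalisation step described above can be illustrated with a minimal sketch. It assumes CoNLL-X style input (tab-separated columns, with the word form in column 2 and the lemma in column 3, and a blank line between sentences). The function name delexicalise, the dummy value "_" and the toy annotation in the sample are illustrative assumptions, not details taken from the experiments.

```python
# Sketch only: assumes CoNLL-X style rows (tab-separated, FORM in column 2,
# LEMMA in column 3). The name `delexicalise` and the "_" dummy are hypothetical.

def delexicalise(conll_text, dummy="_"):
    """Replace the word form and lemma columns with a dummy value."""
    out_lines = []
    for line in conll_text.split("\n"):
        if not line.strip():
            # Blank lines separate sentences; keep them unchanged.
            out_lines.append(line)
            continue
        cols = line.split("\t")
        if len(cols) >= 3:
            cols[1] = dummy  # FORM
            cols[2] = dummy  # LEMMA
        out_lines.append("\t".join(cols))
    return "\n".join(out_lines)


# Toy two-token fragment; the tags, heads and labels are invented for illustration.
sample = (
    "1\tBestemmelsene\tbestemmelse\tN\tN\t_\t2\tSUBJ\t_\t_\n"
    "2\tkan\tkunne\tV\tV\t_\t0\tROOT\t_\t_\n"
)
print(delexicalise(sample))
```

After this step only the PoS columns and the dependency annotation distinguish one token from another, which is what makes a parser trained on one language's treebank applicable to another language with a compatible tagset.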
3.2 Lexicalised parsing
Not delexicalising the source corpora in a cross-lingual setting is unconventional,
but as noted above the Scandinavian languages are extremely similar. Figure 3
shows an example parallel sentence 4 (Zeman and Resnik [12]), and we may
note that of the 16 words in the sentence, 5 are identical between the Danish and
Swedish versions. In the Norwegian version, 9 words (i, kan, og, revideres, helt,
eller, delvis, overenskomst) are identical to the corresponding words in the Danish
version and 6 (i, får, eller [twice], helt, delvis) to those in the Swedish. Given this
similarity, our hypothesis was that keeping the words might give a small performance
boost by making it possible to learn some selectional preferences during parser training.
Looking more closely at the lexical overlap between the training corpora and
our test corpus, we find that, disregarding punctuation, 338 of the word-forms
encountered in the Swedish training set also occur in the test corpus (a total of
3151 times), and 660 word-forms from the Danish corpus occur a total of 4273
times. This means that of the 6809 non-punctuation tokens in the Norwegian
test corpus, 46% are words known to a lexicalised Swedish parser and 63%
are known to a lexicalised Danish parser.
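The overlap figures above amount to a known-token coverage count, which can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are compared by exact string match (the text does not say whether case was normalised), a token counts as punctuation if it consists only of ASCII punctuation characters, and the names is_punctuation and known_token_coverage as well as the toy token lists are hypothetical.

```python
import string

# Sketch only: the punctuation rule and exact string matching are assumptions.

def is_punctuation(token):
    """True if the token consists solely of ASCII punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def known_token_coverage(train_forms, test_tokens):
    """Return (shared word-form types, known test tokens,
    non-punctuation test tokens, coverage ratio)."""
    vocab = {t for t in train_forms if not is_punctuation(t)}
    words = [t for t in test_tokens if not is_punctuation(t)]
    known = [t for t in words if t in vocab]
    coverage = len(known) / len(words) if words else 0.0
    return len(set(known)), len(known), len(words), coverage


# Toy data, invented for illustration.
train = ["i", "kan", "helt", "og", "."]
test = ["i", "kan", "endres", ".", "i"]
print(known_token_coverage(train, test))  # (2, 3, 4, 0.75)

# The percentages reported in the text follow from the stated token counts:
print(round(100 * 3151 / 6809), round(100 * 4273 / 6809))  # 46 63
```

Note that the type count (338 or 660 shared word-forms) and the token count (3151 or 4273 occurrences) measure different things; the reported percentages are token-level coverage.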
The “Basic” column of Table 1 shows the results of our experiments with the
parsers described in this and the previous section. As the table makes clear, Talbanken
fares far better than DDT as source corpus. Given the large overlap in
4 The sentence is from the Acquis corpus. The Swedish, Danish and English versions are those in
Acquis; Norway not being a member of the EU, we have supplied our own Norwegian translation.