A Treebank-based Investigation of IPP-triggering Verbs in Dutch
mance is not inconsiderable, as was the case for Talbanken and NDT. However, as
evidenced by the DDT-NDT pair, the labelled performance varies more as a function
of the characteristics of the source treebank.
Finally, it turns out that delexicalisation is not required if the languages in question
are closely enough related that there are non-trivial amounts of lexical overlap
between them. But if a delexicalised parser is required and the source
treebank is large, it may be best not to use the entire treebank to train the adapted
parser. As the learning curves in Figure 4 show, a delexicalised parser trained on
more than 500-1,000 sentences starts to overtrain and is likely to perform worse
on target-language data than a parser trained on less material.
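The preparation step implied here, delexicalising a source treebank and capping the number of training sentences, can be sketched as follows. This is a minimal illustration, not the setup used in the experiments: it assumes the treebank is stored in tab-separated CoNLL-X format (word form in the second column, lemma in the third), and that simply taking the first sentences is an acceptable way to subsample. The function names `read_conll`, `delexicalise` and `training_subset` are hypothetical.

```python
from typing import Iterable, List

# Assumption: CoNLL-X column layout, 0-based indices of the lexical columns.
FORM, LEMMA = 1, 2

Sentence = List[List[str]]  # one row of columns per token


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated token lines into sentences (blank line = boundary)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


def delexicalise(sentence: Sentence) -> Sentence:
    """Blank out word forms and lemmas, keeping tags, heads and labels."""
    result: Sentence = []
    for row in sentence:
        row = list(row)  # copy so the input is left untouched
        row[FORM] = "_"
        row[LEMMA] = "_"
        result.append(row)
    return result


def training_subset(sentences: List[Sentence], max_sentences: int) -> List[Sentence]:
    """Delexicalise only the first max_sentences sentences of the treebank."""
    return [delexicalise(s) for s in sentences[:max_sentences]]
```

Calling `training_subset(read_conll(open(path, encoding="utf-8")), 500)` would give a delexicalised training set at the lower end of the range mentioned above; the cut-off is a parameter precisely because the best value has to be read off a learning curve.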
5 Future work
There are a number of topics that merit further investigation. First of all, a thorough
study is required of the influence of using pre-parsed data for annotation and the
influence of parser performance on the parameters of the annotation process, such as
time used per sentence and inter-annotator agreement. Fort and Sagot [3] study the
influence of pre-processed data for the part-of-speech annotation task, and Chiou
et al. [2] show that an accurate parser speeds up phrase structure annotation, but to
our knowledge no such study has been done for the task of dependency annotation.
Given languages as similar as the languages used here, it would also be quite
interesting to discard the notion of three distinct languages, and rather consider
them three dialects of the same language and apply orthographic normalisation
techniques to automatically convert Swedish or Danish data to Norwegian.
Finally comes the question of parameter tuning. In an in-domain setting, techniques
such as cross-validation over the training set are useful for finding the
best set of parameters for an algorithm on a data set, but for cross-lingual parsing, it
is not at all obvious that the model that performs optimally on the source-language
data is the best model for the target domain. It is thus necessary to further investigate
the possible correlation between source and target domain performance and
methods of a priori estimation of performance on out-of-domain data.
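One simple way to quantify such a correlation is to score each candidate parameter setting on both source-language and target-language data and compute a correlation coefficient over the paired scores. The sketch below uses Pearson's r as one possible choice; the text does not prescribe a measure, and the function name `pearson` and its inputs are illustrative assumptions.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long lists of scores.

    xs and ys are assumed to hold, for each parameter setting, the parser's
    score on source-domain and target-domain data respectively.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)
```

A value close to 1 would suggest that tuning on source-language data is a reasonable proxy for target-language performance; a value near 0 would suggest that it is not.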
References
[1] Charniak, E. and Johnson, M. (2005). Coarse-to-Fine N-Best Parsing and MaxEnt
Discriminative Reranking. In Proceedings of the 43rd Annual Meeting of
the Association for Computational Linguistics (ACL’05), pages 173–180, Ann
Arbor, Michigan. Association for Computational Linguistics.
[2] Chiou, F.-D., Chiang, D., and Palmer, M. (2001). Facilitating Treebank Annotation
Using a Statistical Parser. In Proceedings of the First International
Conference on Human Language Technology Research, pages 1–4.