A Treebank-based Investigation of IPP-triggering Verbs in Dutch

mance is not inconsiderable, as was the case for Talbanken and NDT. However, as evidenced by the DDT-NDT pair, the labelled performance varies more as a function of the characteristics of the source treebank.

Finally, it turns out that delexicalisation is not required if the languages in question are closely enough related that there are non-trivial amounts of lexical overlap between them. But if a delexicalised parser is required and the source treebank is large, it may be best not to use the entire treebank to train the adapted parser. As the learning curves in Figure 4 show, a delexicalised parser trained on more than 500-1,000 sentences starts to overtrain and is likely to perform worse on target-language data than a parser trained on less material.
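
The preparation of a small delexicalised training set can be sketched as follows. This is a minimal illustration, assuming tab-separated CoNLL-X input with ten columns and assuming that delexicalisation means blanking the FORM and LEMMA columns; the function names and the `max_sentences` parameter are hypothetical and not taken from the experiments described here.

```python
from typing import Iterable, List

# One sentence is a list of token rows; each row holds the 10 CoNLL-X columns.
Sentence = List[List[str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Split tab-separated CoNLL-X lines into sentences (blank line = boundary)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


def delexicalise(sentence: Sentence) -> Sentence:
    """Blank out FORM (column 2) and LEMMA (column 3); keep tags, heads and labels."""
    return [[row[0], "_", "_"] + row[3:] for row in sentence]


def training_subset(sentences: List[Sentence], max_sentences: int) -> List[Sentence]:
    """Keep only the first max_sentences sentences, delexicalised."""
    return [delexicalise(s) for s in sentences[:max_sentences]]
```

Calling `training_subset(sentences, 500)` would then yield a training set at the lower end of the range mentioned above; the best cut-off for a given language pair would have to be read off a learning curve.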

5 Future work

There are a number of topics that merit further investigation. First of all, a thorough study is required of the influence of using pre-parsed data for annotation and the influence of parser performance on the parameters of the annotation process, such as time spent per sentence and inter-annotator agreement. Fort and Sagot [3] study the influence of pre-processed data for the part-of-speech annotation task, and Chiou et al. [2] show that an accurate parser speeds up phrase structure annotation, but to our knowledge no such study has been done for the task of dependency annotation.

Given languages as similar as the ones used here, it would also be quite interesting to discard the notion of three distinct languages, instead treat them as three dialects of the same language, and apply orthographic normalisation techniques to automatically convert Swedish or Danish data to Norwegian.
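
As a very rough illustration of what the simplest form of such normalisation could look like, the sketch below maps the Swedish letters ä and ö to their Norwegian counterparts æ and ø. The mapping table and function names are assumptions made for this example; a realistic system would need far richer rules (for instance learned string transductions), and nothing here reflects a method that has actually been tested.

```python
from typing import List

# Toy letter correspondences (an assumption for illustration only):
# Swedish ä/ö are written æ/ø in Norwegian orthography.
SWEDISH_TO_NORWEGIAN_CHARS = str.maketrans("äöÄÖ", "æøÆØ")


def normalise_token(token: str) -> str:
    """Apply the character-level mapping to a single token."""
    return token.translate(SWEDISH_TO_NORWEGIAN_CHARS)


def normalise_sentence(tokens: List[str]) -> List[str]:
    """Normalise every token of a sentence, leaving other characters untouched."""
    return [normalise_token(token) for token in tokens]
```

A character mapping of this kind only changes spelling; it does not guarantee that the result is an existing Norwegian word, which is one reason the approach needs empirical study.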

Finally comes the question of parameter tuning. In an in-domain setting, techniques such as cross-validation over the training set are useful for finding the best set of parameters for an algorithm on a data set, but for cross-lingual parsing, it is not at all obvious that the model that performs optimally on the source-language data is the best model for the target domain. It is thus necessary to further investigate the possible correlation between source and target domain performance and methods of a priori estimation of performance on out-of-domain data.
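
One simple way to start such an investigation is sketched below, assuming that each parameter setting has been evaluated once on source-language data and once on target-language data. The functions `pearson` and `pick_by_source` are hypothetical helpers written for this example: the first measures how well source scores track target scores, the second reports how much target accuracy is lost by choosing the setting that is best on the source side.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def pick_by_source(source: Sequence[float], target: Sequence[float]) -> Tuple[int, float]:
    """Return the index of the best source-side setting and the target-side
    score lost relative to the best target-side setting."""
    if len(source) != len(target) or not source:
        raise ValueError("need two non-empty score lists of equal length")
    chosen = max(range(len(source)), key=lambda i: source[i])
    return chosen, max(target) - target[chosen]
```

A correlation close to 1 together with a small loss would suggest that tuning on source-language data is a reasonable proxy; a low correlation would indicate that a different selection criterion is needed.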

References

[1] Charniak, E. and Johnson, M. (2005). Coarse-to-Fine N-Best Parsing and MaxEnt Discriminative Reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 173–180, Ann Arbor, Michigan. Association for Computational Linguistics.

[2] Chiou, F.-D., Chiang, D., and Palmer, M. (2001). Facilitating Treebank Annotation Using a Statistical Parser. In Proceedings of the First International Conference on Human Language Technology Research, pages 1–4.
