06.07.2014 Views

A Treebank-based Investigation of IPP-triggering Verbs in Dutch



Impact of treebank characteristics on cross-lingual parser adaptation

Arne Skjærholt, Lilja Øvrelid

Department of Informatics, University of Oslo

{arnskj,liljao}@ifi.uio.no

Abstract

Treebank creation can benefit from the use of a parser. Recent work on cross-lingual parser adaptation has presented results that make this a viable option as an early-stage preprocessor in cases where there are no (or not enough) resources available to train a statistical parser for a language. In this paper we examine cross-lingual parser adaptation between three highly related languages: Swedish, Danish and Norwegian. Our focus on related languages allows for an in-depth study of factors influencing the performance of the adapted parsers, and we examine the influence of annotation strategy and treebank size. Our results show that a simple conversion process can give very good results, and that a few simple, linguistically informed changes to the source data yield even better results, even when the source treebank is annotated quite differently from the target treebank. We also show that for languages with large amounts of lexical overlap, delexicalisation is not necessary (indeed, lexicalised parsers outperform delexicalised parsers), and that for some treebanks it is possible to convert dependency relations and create a labelled cross-lingual parser.¹

1 Introduction

It is well known that when annotating new resources, the best way to go about it is not necessarily to annotate raw data, but rather to correct automatically annotated material (Chiou et al. [2], Fort and Sagot [3]). This can yield important speed gains, and often also improvements in measures of annotation quality such as inter-annotator agreement. However, if we wish to create a treebank for a language with no pre-existing resources, we face something of a Catch-22: on the one hand, we have no data to train a statistical parser to automatically annotate sentences, but on the other hand we really would like to have one. In this setting, one option is cross-lingual parser adaptation.

¹ The code used to obtain the data in this paper is available from https://github.com/arnsholt/tlt11-experiments.

