06.07.2014 Views

A Treebank-based Investigation of IPP-triggering Verbs in Dutch



[Figure 1: Noun phrase annotation (Det Adj N) in (a) Danish and (b) Swedish/Norwegian]

bank that fall below a certain perplexity threshold. All experiments are performed with dependency representations, using MSTParser (McDonald et al. [5]). The work shows that the filtering approach yields considerable improvements over the simple technique proposed by Zeman and Resnik [12] when applied to unrelated languages.
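As a rough illustration of filtering by a perplexity threshold, the sketch below scores source-language sentences with a language model estimated on target-language data and keeps those that score below a cutoff. The text above does not specify the model, so everything here is an assumption: a bigram model over part-of-speech tags with add-one smoothing, and the hypothetical function names `train_bigram`, `perplexity` and `filter_by_perplexity`.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary symbols (assumed)


def train_bigram(tag_sequences):
    """Collect context and bigram counts over boundary-padded tag sequences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), {EOS}
    for tags in tag_sequences:
        padded = [BOS] + list(tags) + [EOS]
        vocab.update(tags)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def perplexity(tags, model):
    """Per-token perplexity of one tag sequence under the smoothed bigram model."""
    context_counts, bigram_counts, vocab = model
    padded = [BOS] + list(tags) + [EOS]
    v = len(vocab) + 1  # one extra slot reserves mass for unseen tags
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothing: every possible next symbol gets a pseudo-count of 1.
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))


def filter_by_perplexity(source_sequences, model, threshold):
    """Keep only the source sentences whose perplexity is below the threshold."""
    return [tags for tags in source_sequences if perplexity(tags, model) < threshold]
```

In use, the model would be trained on tag sequences from the target language and applied to the source treebank; the threshold is a tuning parameter that this sketch leaves to the caller.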

The above approaches to parser adaptation train delexicalized parsers on the (unfiltered or filtered) source language data. However, there are clearly many lexical patterns that are similar cross-lingually if one finds the right level of abstraction. Täckström et al. [10] investigate the use of cross-lingual word clusters derived using parallel texts in order to add some level of lexical information that abstracts over individual words.
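To make the notion of delexicalization concrete, the following minimal sketch strips the lexical columns from treebank data so that only tags and dependency structure remain. It assumes the tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and blanks FORM and LEMMA with an underscore. The function name `delexicalize_conllx` is hypothetical, and the cited approaches may delexicalize in a different way.

```python
def delexicalize_conllx(lines):
    """Replace FORM and LEMMA with '_' in CoNLL-X token lines.

    Blank lines, which separate sentences, are passed through as empty strings.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            out.append("")  # keep the sentence boundary
            continue
        cols = line.split("\t")
        cols[1] = "_"  # FORM: the word itself
        cols[2] = "_"  # LEMMA: the base form
        out.append("\t".join(cols))
    return out


# Illustrative input only; these rows are made up and not taken from any treebank.
sample = [
    "1\tden\tden\tDET\tDET\t_\t3\tdet\t_\t_",
    "2\tgamle\tgammel\tADJ\tADJ\t_\t3\tmod\t_\t_",
    "3\tmand\tmand\tN\tN\t_\t0\troot\t_\t_",
    "",
]
delexicalized = delexicalize_conllx(sample)
```

A parser trained on such output can rely only on tag and structural features, which is what lets a model trained on one language be applied to another that uses a compatible tag set.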

2 Treebanks

In all our experiments, the target language is Norwegian, and either Swedish or Danish is used as the source. We use three treebanks: the Danish Dependency Treebank (DDT) (Kromann [4]) and the Swedish Talbanken05 (Nivre et al. [6]), both from the 2006 CoNLL shared task on multilingual dependency parsing, and an in-development Norwegian treebank being prepared by the Norwegian National Library (NDT)². Of the three, Talbanken is by far the largest, with some 11,000 sentences in the training set. The DDT is smaller, with about 5,200 sentences of training data, and the NDT is smaller still, with 4,400 sentences all told.

The Scandinavian languages, Danish, Norwegian and Swedish, are extremely closely related to one another. The split of Norse into the modern-day Scandinavian languages is very recent³, which means that the languages are mutually intelligible to a large extent, share a large number of syntactic and morphological features, and have a large common vocabulary. Perhaps the largest difference lies in their orthographic conventions.

Our treebanks are more diverse than the languages they are based on. As Zeman and Resnik [12] point out, Talbanken and DDT are different in many respects; most of these differences are due to DDT preferring functional heads while Tal-

² A beta release is available at http://www.nb.no/spraakbanken/tilgjengelege-ressursar/tekstressursar

³ Within the last 500 to 1000 years. For comparison, the split between German and Dutch occurred some 500 years earlier.
