A Treebank-based Investigation of IPP-triggering Verbs in Dutch
[Figure 1: Noun phrase annotation of a Det Adj N sequence in (a) Danish and (b) Swedish/Norwegian]
bank that fall below a certain perplexity threshold. All experiments are performed
with dependency representations, using MSTparser (McDonald et al. [5]). The
work shows that the filtering approach yields considerable improvements over the
simple technique proposed by Zeman and Resnik [12] when applied to unrelated
languages.
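
The filtering idea above can be illustrated with a minimal sketch. It assumes, since the passage does not spell this out, that perplexity is computed over part-of-speech tag sequences with an add-one smoothed bigram model estimated from target-language data. The function names, the smoothing scheme, and the threshold are all hypothetical choices made for illustration, not the cited authors' implementation.

```python
import math
from collections import Counter


def train_bigram(tag_sequences):
    """Collect context and bigram counts over POS tag sequences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tags in tag_sequences:
        padded = ["<s>"] + list(tags) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def perplexity(tags, model):
    """Per-transition perplexity of one tag sequence (add-one smoothing)."""
    contexts, bigrams, vocab = model
    padded = ["<s>"] + list(tags) + ["</s>"]
    v = len(vocab) + 1  # one extra slot reserved for unseen tags (assumption)
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))


def filter_by_perplexity(source_sequences, model, threshold):
    """Keep only source sentences whose perplexity is below the threshold."""
    return [t for t in source_sequences if perplexity(t, model) < threshold]


# Toy target-language tag sequences; real data would come from a treebank.
target_model = train_bigram([["DET", "ADJ", "NOUN"], ["DET", "NOUN", "VERB"]])
source = [["DET", "ADJ", "NOUN"], ["VERB", "VERB", "DET"]]
kept = filter_by_perplexity(source, target_model, threshold=5.0)
```

In this toy setting the source sequence that resembles the target data is retained and the dissimilar one is discarded. In practice the threshold would have to be tuned.
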
The above approaches to parser adaptation train delexicalized parsers on the
(unfiltered or filtered) source language data. However, there are clearly many lexical
patterns that are similar cross-lingually if one finds the right level of abstraction.
Täckström et al. [10] investigate the use of cross-lingual word clusters derived
using parallel texts in order to add some level of lexical information that abstracts
over individual words.
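
As an illustration of what delexicalization involves, the sketch below removes lexical information from CoNLL-X formatted lines (the format of the 2006 shared task data) by overwriting the word form with the coarse POS tag and blanking the lemma. Which column replaces the form, and the use of `_` for the lemma, are assumptions made here for illustration; the function name `delexicalize_conll` is hypothetical.

```python
def delexicalize_conll(lines):
    """Replace FORM with CPOSTAG and blank LEMMA in CoNLL-X lines.

    CoNLL-X columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS,
    HEAD, DEPREL, PHEAD, PDEPREL (tab-separated).
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            out.append("")  # a blank line separates sentences
            continue
        cols = line.split("\t")
        if len(cols) < 4:
            out.append(line)  # leave malformed lines untouched
            continue
        cols[1] = cols[3]  # FORM <- CPOSTAG (assumed choice)
        cols[2] = "_"      # LEMMA removed
        out.append("\t".join(cols))
    return out


sample = ["1\thuset\thus\tN\tNC\t_\t0\tROOT\t_\t_", ""]
delexicalized = delexicalize_conll(sample)
```

A parser trained on such output sees only tag sequences and tree structure, which is what allows it to be applied to another language that shares the tag set.
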
2 Treebanks
In all our experiments, the target language is Norwegian, and either Swedish or
Danish is used as the source. For these experiments, we used three treebanks: the
Danish Dependency Treebank (DDT) (Kromann [4]) and the Swedish Talbanken05
(Nivre et al. [6]), both from the 2006 CoNLL shared task on multilingual dependency
parsing, and an in-development Norwegian treebank being prepared by the
Norwegian National Library (NDT)². Of the three, Talbanken is by far the largest,
with some 11,000 sentences in the training set. The DDT is smaller, with about 5,200
sentences of training data, and the NDT smaller still, with 4,400 sentences all told.
The Scandinavian languages, Danish, Norwegian and Swedish, are extremely
closely related to one another. The split of Norse into the modern-day Scandinavian
languages is very recent³, which means that the languages are mutually
intelligible to a large extent, share a large number of syntactic and morphological
features, and have a large common vocabulary. Perhaps the largest difference
lies in orthographic conventions.
Our treebanks are more diverse than the languages they are based on. As Zeman
and Resnik [12] point out, Talbanken and DDT are different in many respects;
most of these differences are due to DDT preferring functional heads while Tal-
² A beta release is available at http://www.nb.no/spraakbanken/tilgjengelege-ressursar/tekstressursar
³ Within the last 1000 to 500 years. For comparison, the split between German and Dutch occurred some
500 years before this.