A Treebank-based Investigation of IPP-triggering Verbs in Dutch
statistical parsers. Since data sparsity is the biggest challenge in data-oriented
parsing, parsers perform poorly if they are trained on a small data set, or when
the domains of the training and test data are not similar. Brown et al. [2]
pioneered the use of word clustering for language modeling. Later on, word
clustering was widely used in various NLP applications, including parsing
[3, 7, 9]. Brown word clustering provides a coarse level of word representation
in which words that are similar to each other are assigned to one cluster. In this
type of annotation, the corresponding mapped clusters are used instead of the
words.
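The replacement of words by their clusters can be sketched as follows. This is only an illustration: the word-to-cluster table, the bit-string cluster IDs, and the `UNK` fallback are hypothetical values chosen for the example, whereas real Brown clusters are induced from a large corpus.

```python
# Hypothetical word-to-cluster table. Real Brown clusters are induced from
# corpus statistics; the IDs below are invented for illustration only.
WORD_TO_CLUSTER = {
    "monday": "0010",
    "friday": "0010",
    "london": "0111",
    "paris": "0111",
    "visited": "1100",
    "left": "1100",
}

UNKNOWN_CLUSTER = "UNK"  # assumed fallback for words missing from the table


def to_clusters(tokens, table=WORD_TO_CLUSTER):
    """Map each token to its cluster ID so the parser sees clusters, not words."""
    return [table.get(tok.lower(), UNKNOWN_CLUSTER) for tok in tokens]


sentence = ["Paris", "left", "Friday"]
clustered = to_clusters(sentence)
```

Because "Paris" and "London" share one cluster in this toy table, a parser trained on the clustered data would treat them alike, which is how clustering is meant to reduce sparsity.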
POS Tag: The syntactic functions of words at the sentence level are the most
basic linguistic knowledge that the parser learns; therefore, they play a very
important role in the performance of a parser. The quality of the POS tags
assigned to the words and the amount of information they contain have a direct
effect on the performance of the parser. This knowledge can be represented either
coarse-grained, such as Noun, Verb, Adjective, etc., or fine-grained, containing
morpho-syntactic and semantic information, such as Noun-Single, Verb-Past,
Adjective-superlative, etc. A fine-grained representation increases the tag set
size and makes it harder for a statistical POS tagger to select the correct
labels.
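The relation between the two granularities can be illustrated with a minimal sketch. It assumes that a fine-grained tag is written as a main category followed by a hyphen and further features, as in the examples above; the tags Noun-Plural and Verb-Present and the helper `coarsen` are hypothetical additions for the illustration.

```python
def coarsen(tag, sep="-"):
    """Keep only the main category of a fine-grained tag (assumed 'Category-Feature')."""
    return tag.split(sep, 1)[0]


# Noun-Single, Verb-Past, and Adjective-superlative come from the text;
# Noun-Plural and Verb-Present are hypothetical additions.
fine_tags = [
    "Noun-Single",
    "Noun-Plural",
    "Verb-Past",
    "Verb-Present",
    "Adjective-superlative",
]

coarse_tags = [coarsen(t) for t in fine_tags]

# The fine-grained tag set is larger than the coarse-grained one.
fine_size = len(set(fine_tags))
coarse_size = len(set(coarse_tags))
```

In this toy example the five fine-grained tags collapse into three coarse-grained ones, so the tagger has fewer labels to choose between but also passes less information to the parser.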
Phrase Structure: Depending on the formalism used as the backbone of a treebank,
the labels of the nodes at the phrasal level can be either fine- or
coarse-grained. The annotated data in the Penn Treebank [11] provides relatively
coarse-grained phrase structures in which mostly the types of the phrasal
constituents, such as NP, VP, etc., are determined. Admittedly, in the latest
version of the Penn Treebank syntactic functions are added to the labels as well,
but this information is not available for all nodes. In contrast, annotated data
of deep formalisms like HPSG, such as the Bulgarian treebank [15] and the Persian
treebank [6], provides a fine-grained representation of phrase structures, since
the types of head-daughter dependencies are defined explicitly for all nodes.
Representing dependency information on constituents makes it harder for the
parser to disambiguate the types of the dependencies, as the size of the
constituent label set increases.
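The effect of functional information on the size of the label set can be sketched on a single bracketed tree. The example uses Penn-style function suffixes (NP-SBJ, PP-LOC) as a stand-in for the dependency types discussed above; the tree, the helper `labels`, and the rule of stripping everything after the first hyphen are assumptions made for the illustration, and that rule would not handle special labels such as -NONE-.

```python
import re

# A node label is the symbol that directly follows an opening bracket.
LABEL = re.compile(r"\(([^\s()]+)")


def labels(tree, coarse=False):
    """Collect the set of node labels in a bracketed tree string.

    With coarse=True, anything after the first hyphen is dropped,
    so that NP-SBJ and NP count as the same label.
    """
    found = set()
    for lab in LABEL.findall(tree):
        if coarse:
            lab = lab.split("-", 1)[0]
        found.add(lab)
    return found


# Invented example tree with two function-tagged nodes.
tree = (
    "(S (NP-SBJ (DT the) (NN parser)) "
    "(VP (VBZ reads) (NP (DT the) (NN text)) "
    "(PP-LOC (IN in) (NP (NN class)))))"
)

fine_labels = labels(tree)
coarse_labels = labels(tree, coarse=True)
```

Here `fine_labels` keeps NP and NP-SBJ apart, while `coarse_labels` merges them, so the fine-grained set is strictly larger even for one sentence.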
3 Evaluation
3.1 Data Set and Tool
The Bijankhan Corpus 1 contains more than 2.5 million word tokens, and it is
POS tagged manually with a rich set of 586 tags containing morpho-syntactic and
semantic information [1], with a hierarchy on the assigned tags based on the
EAGLES guidelines [10]. The number of the main syntactic categories in the tag
set is 14, and it increases to 15 when clitics are split from their hosts.
The Persian Treebank (PerTreeBank) 2 is a treebank for Persian developed in the
framework of HPSG [13] such that the basic properties of HPSG are simulated
1 http://ece.ut.ac.ir/dbrg/bijankhan/
2 http://hpsg.fu-berlin.de/~ghayoomi/PTB.html