A Treebank-based Investigation of IPP-triggering Verbs in Dutch
ut without feature structures. This treebank contains 1016 trees with 27659 word tokens from the Bijankhan Corpus, and it was developed semi-automatically via a bootstrapping approach [6]. In this treebank, the types of head-daughter dependencies are defined according to the basic HPSG schemas, namely head-subject, head-complement, head-adjunct, and head-filler; it is therefore a hierarchical treebank which represents subcategorization requirements. Moreover, traces for scrambled or extraposed elements, as well as empty nodes for ellipses, are explicitly marked.
The Stanford Parser is a Java implementation of a lexicalized, probabilistic natural language parser [8]. The parser is based on an optimized Probabilistic Context-Free Grammar (PCFG) parser and a lexicalized dependency parser, combined into a lexicalized PCFG parser. Following the study of Collins [5], heads have to be provided for the parser; this was done semi-automatically for Persian based on the head-daughters' labels. The parsing results are evaluated with Evalb 3 to report the standard bracketing metrics: precision, recall, and F-measure.
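The bracketing metrics just mentioned can be illustrated with a small sketch. The function below is ours, not part of Evalb; it also simplifies matters by treating brackets as sets of labeled spans, whereas Evalb proper counts duplicate brackets:

```python
def bracket_scores(gold, pred):
    """Labeled bracketing precision, recall, and F-measure.

    gold, pred: sets of (label, start, end) constituent spans.
    Simplification: Evalb counts duplicate brackets; sets suffice
    for this illustration.
    """
    matched = len(gold & pred)          # brackets identical in label and span
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy example: one bracket differs in its right boundary.
gold = {("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)}
pred = {("NP", 0, 2), ("VP", 2, 4), ("S", 0, 5)}
p, r, f = bracket_scores(gold, pred)    # p = r = f = 2/3
```

Since the gold and predicted trees here contain the same number of brackets, precision, recall, and F-measure coincide.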
The SRILM Toolkit contains an implementation of the Brown word clustering algorithm [2]; it is used to cluster the lexical items in the Bijankhan Corpus [16].
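The core idea of the Brown algorithm can be sketched in a few lines: word clusters are merged greedily so that the average mutual information (AMI) of the class bigram model stays as high as possible. The following toy implementation is ours, not SRILM's; it is quadratic in the number of cluster pairs per merge and only meant to make the criterion concrete:

```python
from collections import Counter
from itertools import combinations
from math import log

def average_mutual_information(assign, bigrams):
    """AMI of the class bigram model induced by the word->cluster map."""
    cc = Counter()
    for (w1, w2), n in bigrams.items():
        cc[(assign[w1], assign[w2])] += n
    total = sum(cc.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in cc.items():
        left[c1] += n
        right[c2] += n
    return sum(n / total * log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in cc.items())

def brown_cluster(tokens, k):
    """Greedily merge word clusters until k remain, keeping AMI maximal."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    assign = {w: w for w in tokens}           # each word type starts alone
    while len(set(assign.values())) > k:
        best_ami, best_pair = float("-inf"), None
        for c1, c2 in combinations(sorted(set(assign.values())), 2):
            trial = {w: (c1 if c == c2 else c) for w, c in assign.items()}
            ami = average_mutual_information(trial, bigrams)
            if ami > best_ami:
                best_ami, best_pair = ami, (c1, c2)
        c1, c2 = best_pair
        assign = {w: (c1 if c == c2 else c) for w, c in assign.items()}
    return assign

# Toy corpus: determiners, nouns, and verbs fall into 3 clusters.
tokens = "the cat sat the dog sat a cat ran a dog ran".split()
clusters = brown_cluster(tokens, 3)
```

On this corpus, "cat" and "dog" have identical bigram contexts, so merging them costs no mutual information and the greedy search performs it first; real implementations use incremental updates instead of recomputing the AMI from scratch.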
3.2 Setup of the Experiments
Section 2 described the three possible annotation dimensions for parsing. In the following, we describe the data preparation and the setup of our experiments to study the effect of each dimension's annotation on parsing performance.
Besides preparing PerTreeBank in the Penn Treebank style explained in Ghayoomi [7], we modified the treebank in three directions for our experiments. The SRILM toolkit is used to cluster the words in the Bijankhan Corpus. Since the Brown algorithm requires a predefined number of clusters, we set the number of clusters to 700, based on the extended model of the Brown algorithm proposed by Ghayoomi [7] to treat homographs distinctly. To provide a coarse-grained representation of the morpho-syntactic information of the words, only the main POS tags (the 15 tags) are used instead of all 586 tags; and to provide simple head-daughter relations as coarse-grained phrase structures, the types of dependencies on adjunct-daughters and complement-daughters, as well as the types of clauses on marker-daughters, are removed without any changes to other head-daughter relations.
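Both coarsening steps amount to truncating a label at its first feature delimiter. The helpers below are a sketch under assumed formats: the separators and example tags are ours for illustration, and the actual Bijankhan tag layout and PerTreeBank label decoration may differ:

```python
def coarsen_tag(fine_tag, sep=","):
    """Keep only the main POS category of a fine-grained tag.

    Illustrative assumption: a fine-grained tag is the main category
    followed by sep-delimited morpho-syntactic features, e.g.
    "N,SING,NOM" -> "N". The real Bijankhan format may differ.
    """
    return fine_tag.split(sep)[0]

def simplify_daughter_label(label, sep="-"):
    """Drop the dependency-type decoration from a daughter label,
    e.g. "PP-adjunct" -> "PP" (the decoration scheme is assumed)."""
    return label.split(sep)[0]
```

Applied over all preterminals and daughter labels in the treebank, this reduces the tagset from 586 tags to the 15 main categories and collapses adjunct-, complement-, and marker-daughter distinctions.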
In each model, we train the Stanford parser on the Persian data. Since PerTreeBank is currently the only available data set for Persian annotated with constituents, it is used for both training and testing. 10-fold cross-validation is performed to avoid any overlap between the two data sets.
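The cross-validation scheme can be sketched as follows; the round-robin fold assignment is our choice for illustration, and each of the 1016 trees lands in exactly one test fold, so training and test data never overlap:

```python
def ten_fold_splits(trees, k=10):
    """Yield (train, test) partitions for k-fold cross-validation.

    trees: any list of parse trees. Round-robin assignment keeps the
    fold sizes balanced; each tree appears in exactly one test fold.
    """
    folds = [trees[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [t for j in range(k) if j != i for t in folds[j]]
        yield train, test

# With 1016 trees (the size of PerTreeBank), every split uses all data.
trees = list(range(1016))               # stand-ins for parse trees
splits = list(ten_fold_splits(trees))
```

In each of the ten rounds, the parser is trained on nine folds and evaluated on the held-out tenth; the reported scores are aggregated over the rounds.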
3.3 Results
In the first step of our experiments, we trained the Stanford parser on PerTreeBank without any changes to the annotation granularity (Model 1) and took it as the baseline model. To further study the effect of each annotation dimension,
3 http://nlp.cs.nyu.edu/evalb/