
...but without feature structures. This treebank contains 1016 trees with 27659 word tokens from the Bijankhan Corpus, and it was developed semi-automatically via a bootstrapping approach [6]. In this treebank, the types of head-daughter dependencies are defined according to the basic HPSG schemas, namely head-subject, head-complement, head-adjunct, and head-filler; it is therefore a hierarchical treebank which represents subcategorization requirements. Moreover, traces for scrambled or extraposed elements, as well as empty nodes for ellipses, are explicitly marked.
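
As a rough illustration of what such schema-annotated constituents look like, the toy structure below records the head-daughter schema on each phrasal node. The node labels and tokens are invented for illustration and do not reproduce the actual PerTreeBank label set.

    # A toy constituent tree as nested Python tuples; the schema name attached
    # to each phrasal node is illustrative only, not the PerTreeBank labeling.
    toy_tree = ("S/head-subject",
                ("NP", "w1"),                   # subject daughter
                ("VP/head-complement",
                 ("NP", "w2"),                  # complement daughter
                 ("V", "w3")))                  # head daughter (verb-final order)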

The Stanford Parser is a Java implementation of a lexicalized, probabilistic natural language parser [8]. The parser is based on an optimized Probabilistic Context-Free Grammar (PCFG) parser, a lexicalized dependency parser, and a lexicalized PCFG parser. Following Collins [5], heads must be provided to the parser; this was done semi-automatically for Persian based on the head-daughters' labels. The parsing results are evaluated with Evalb 3 to report the standard bracketing metrics, namely precision, recall, and F-measure.
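
As a reminder of what these bracketing scores measure, the sketch below computes PARSEVAL-style labeled precision, recall, and F-measure for a single sentence. It is a simplified illustration, not the Evalb implementation, and it assumes constituents are represented as (label, start, end) spans.

    from collections import Counter

    def bracketing_scores(gold_spans, pred_spans):
        """PARSEVAL-style labeled bracketing precision, recall, and F-measure.

        gold_spans, pred_spans: lists of (label, start, end) constituents for
        one sentence; duplicate brackets are matched at most once.
        """
        gold, pred = Counter(gold_spans), Counter(pred_spans)
        matched = sum((gold & pred).values())  # brackets found in both trees
        precision = matched / sum(pred.values()) if pred else 0.0
        recall = matched / sum(gold.values()) if gold else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f

    # One missed and one spurious bracket: P = R = F = 2/3.
    gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)]
    pred = [("S", 0, 5), ("NP", 0, 2), ("NP", 3, 5)]
    print(bracketing_scores(gold, pred))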

The SRILM Toolkit contains an implementation of the Brown word clustering algorithm [2], and it is used to cluster the lexical items in the Bijankhan Corpus [16].
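
The output of such a clustering step can be treated as a mapping from word forms to cluster IDs. The sketch below shows one way to load such a mapping and look up cluster labels for corpus tokens; the two-column file format and the file name are assumptions made for illustration, not the actual SRILM output format.

    def load_clusters(path):
        """Read an assumed whitespace-separated 'word cluster_id' file."""
        word2cluster = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, cluster_id = line.split()
                word2cluster[word] = cluster_id
        return word2cluster

    def clusterize(tokens, word2cluster, unknown="UNK"):
        """Map each token to its cluster label, with a fallback for unseen words."""
        return [word2cluster.get(token, unknown) for token in tokens]

    # clusters = load_clusters("bijankhan.clusters")   # hypothetical file name
    # cluster_ids = clusterize(sentence_tokens, clusters)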

3.2 Setup of the Experiments

Section 2 described the three possible annotation dimensions for parsing. In the following, we describe the data preparation and the setup of our experiments to study the effect of each dimension's annotation on parsing performance.

Besides preparing PerTreeBank in the Penn Treebank style explained in Ghayoomi [7], we modified the treebank in three directions for our experiments. The SRILM toolkit is used to cluster the words in the Bijankhan Corpus; since the Brown algorithm requires a predefined number of clusters, we set the number of clusters to 700, based on the extended model of the Brown algorithm proposed by Ghayoomi [7] to treat homographs distinctly. To provide a coarse-grained representation of the morpho-syntactic information of the words, only the main POS tags of the words (the 15 tags) are used instead of all 586 tags; and to provide simple head-daughter relations as coarse-grained phrase structures, the dependency types on adjunct-daughters and complement-daughters, as well as the clause types on marker-daughters, are removed, while all other head-daughter relations are left unchanged.
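
A minimal sketch of these two coarse-graining steps is given below. The tag and label formats are assumptions made for illustration (hierarchical POS tags whose first field is the main category, and phrase labels with a dependency-type suffix); the actual Bijankhan and PerTreeBank label schemes may differ.

    def main_pos(fine_tag, sep=","):
        """Keep only the main POS category of an assumed hierarchical tag
        such as 'N,COM,SING'."""
        return fine_tag.split(sep)[0]

    def simplify_label(phrase_label, sep="-"):
        """Strip an assumed dependency/clause-type suffix from a phrase label
        such as 'NP-adjunct'."""
        return phrase_label.split(sep)[0]

    print(main_pos("N,COM,SING"))        # -> 'N'
    print(simplify_label("NP-adjunct"))  # -> 'NP'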

In each model, we train the Stanford parser on the Persian data. Since PerTreeBank is currently the only annotated constituency data set available for Persian, it is used for both training and testing, and 10-fold cross-validation is performed to avoid any overlap between the training and test sets.
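
A minimal sketch of such a 10-fold split over the treebank's trees is shown below; the shuffling step and the helper functions in the usage comment are assumptions, since those details are not specified above.

    import random

    def ten_fold_splits(trees, k=10, seed=0):
        """Yield (train, test) partitions of the treebank for k-fold cross-validation."""
        trees = list(trees)
        random.Random(seed).shuffle(trees)        # shuffling is an assumption
        folds = [trees[i::k] for i in range(k)]   # k roughly equal folds
        for i in range(k):
            test = folds[i]
            train = [t for j, fold in enumerate(folds) if j != i for t in fold]
            yield train, test

    # for train_trees, test_trees in ten_fold_splits(pertreebank_trees):
    #     model = train_parser(train_trees)        # hypothetical helpers
    #     evaluate(model, test_trees)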

3.3 Results

In the first step of our experiments, we trained the Stanford parser on PerTreeBank without any changes to the annotation granularity (Model 1) and took it as the baseline model. To further study the effect of each annotation dimension,

3 http://nlp.cs.nyu.edu/evalb/

