06.07.2014 Views

A Treebank-based Investigation of IPP-triggering Verbs in Dutch



statistical parsers. Since data sparsity is the biggest challenge in data-oriented parsing, parsers perform poorly if they are trained on a small data set, or when the domains of the training and test data are not similar. Brown et al. [2] pioneered the use of word clustering for language modeling. Later on, word clustering was widely used in various NLP applications, including parsing [3, 7, 9]. Brown word clustering provides a coarse level of word representation in which words that are similar to each other are assigned to one cluster. In this type of annotation, the corresponding mapped clusters are used instead of the words.
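As a minimal sketch of this word-to-cluster substitution, the snippet below replaces each token with a cluster ID. The table, the bit-string IDs, and the `UNK` fallback are illustrative assumptions; real Brown clusters are induced from a corpus, and nothing here is taken from the paper's data.

```python
# Hypothetical word-to-cluster table (assumption for illustration only).
# Real Brown clusters are learned from a corpus; IDs are often bit strings.
WORD_TO_CLUSTER = {
    "cat": "0010",
    "dog": "0010",
    "runs": "1101",
    "sleeps": "1101",
    "the": "0111",
}

UNKNOWN_CLUSTER = "UNK"  # assumed fallback for words missing from the table


def to_clusters(tokens, table=WORD_TO_CLUSTER, unknown=UNKNOWN_CLUSTER):
    """Replace each token with its cluster ID, or `unknown` if unseen."""
    return [table.get(tok.lower(), unknown) for tok in tokens]


sentence = ["The", "dog", "sleeps"]
print(to_clusters(sentence))
```

Because "cat" and "dog" share a cluster in this toy table, a parser trained on the cluster IDs would see both as the same symbol, which is how the coarser representation can reduce sparsity.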

POS Tag: The syntactic functions of words at the sentence level are the most basic linguistic knowledge that the parser learns; therefore, they play a very important role in the performance of a parser. The quality of the POS tags assigned to the words and the amount of information they contain have a direct effect on the performance of the parser. The representation of this knowledge can be either coarse-grained, such as Noun, Verb, Adjective, etc., or fine-grained, containing morpho-syntactic and semantic information, such as Noun-Single, Verb-Past, Adjective-superlative, etc. A fine-grained representation increases the tag set size and makes it harder for a statistical POS tagger to disambiguate the correct labels.
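The trade-off between the two granularities can be sketched as follows, assuming fine-grained tags are written as a main category followed by a hyphen and a feature, as in the examples above. The sample words, the tag Noun-Plural, and the `coarsen` helper are hypothetical and serve only as illustration.

```python
def coarsen(tag):
    """Keep only the main category of a fine-grained tag such as 'Verb-Past'."""
    return tag.split("-", 1)[0]


# Toy tagged words (assumed data, not drawn from any corpus).
fine_tagged = [
    ("book", "Noun-Single"),
    ("books", "Noun-Plural"),
    ("fell", "Verb-Past"),
    ("tallest", "Adjective-superlative"),
]

fine_tags = {tag for _, tag in fine_tagged}
coarse_tags = {coarsen(tag) for tag in fine_tags}

# The fine-grained tag set is larger than the coarse-grained one.
print(len(fine_tags), len(coarse_tags))
```

Collapsing the tags shrinks the label set the tagger must choose from, at the cost of discarding the morpho-syntactic detail.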

Phrase Structure: Depending on the formalism used as the backbone of a treebank, the labels of the nodes at the phrasal level can be either fine- or coarse-grained. The annotated data in the Penn Treebank [11] provides relatively coarse-grained phrase structures in which mostly the types of the phrasal constituents, such as NP, VP, etc., are determined. Admittedly, the latest version of the Penn Treebank also adds syntactic functions to the labels, but this information is not available for all nodes. In contrast, data annotated with deep formalisms such as HPSG, for example the Bulgarian treebank [15] and the Persian treebank [6], provides a fine-grained representation of phrase structures, since the types of head-daughter dependencies are defined explicitly for all nodes. Representing dependency information on constituents makes it harder for the parser to disambiguate the dependency types, because the size of the constituent label set increases.
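One way to picture the difference in label granularity is to strip the function part from Penn Treebank-style labels, turning NP-SBJ into NP. The sketch below assumes the common convention that a function tag or index follows the category after a hyphen or equals sign; the helper name and the sample labels are illustrative and not part of the evaluation described here.

```python
import re


def strip_function_tags(label):
    """Reduce a label such as 'NP-SBJ' or 'PP-LOC-1' to its bare category."""
    # Labels that start with a hyphen (e.g. '-NONE-') are special symbols
    # and are left untouched.
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label, maxsplit=1)[0]


fine_labels = ["NP-SBJ", "NP-TMP", "PP-LOC-1", "VP", "-NONE-"]
coarse_labels = [strip_function_tags(label) for label in fine_labels]
print(coarse_labels)

# The coarse label set is smaller than the fine one.
print(len(set(fine_labels)), len(set(coarse_labels)))
```

The smaller label set is easier for a parser to predict, while the fine-grained labels carry the functional information that the coarse ones drop.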

3 Evaluation

3.1 Data Set and Tool

The Bijankhan Corpus¹ contains more than 2.5 million word tokens and is manually POS tagged with a rich set of 586 tags containing morpho-syntactic and semantic information [1], such that there is a hierarchy on the assigned tags based on the EAGLES guidelines [10]. The number of main syntactic categories in the tag set is 14, which increases to 15 when clitics are split from their hosts. The Persian Treebank (PerTreeBank)² is a treebank for Persian developed in the framework of HPSG [13] such that the basic properties of HPSG are simulated

¹ http://ece.ut.ac.ir/dbrg/bijankhan/

² http://hpsg.fu-berlin.de/~ghayoomi/PTB.html

