06.07.2014 Views

A Treebank-based Investigation of IPP-triggering Verbs in Dutch


The Effect of Treebank Annotation Granularity on Parsing: A Comparative Study

Masood Ghayoomi
Freie Universität Berlin
E-mail: masood.ghayoomi@fu-berlin.de

Omid Moradiannasab
Iran University of Science and Technology
E-mail: omidmoradiannasab@gmail.com

Abstract

Statistical parsers need annotated data for training. Depending on the linguistic information available in the training data, the performance of the parsers varies. In this paper, we study the effect of annotation granularity on parsing from three points of view: lexicon, part-of-speech tag, and phrase structure. The results show that changing annotation granularity along each of these dimensions has a significant impact on parsing performance.

1 Introduction

Parsing is one of the main tasks in Natural Language Processing (NLP). State-of-the-art statistical parsers are trained with treebanks [4, 5], mainly developed based on Phrase Structure Grammar (PSG). The part-of-speech (POS) tags of the words in the treebanks are defined according to a tag set that contains the syntactic categories of the words, optionally with additional morpho-syntactic information. Moreover, the constituent labels in treebanks might also be enriched with syntactic functions. Data annotated in the framework of deeper formalisms such as HPSG [13] provide a fine-grained representation of linguistic knowledge. The performance of parsers trained with the latter kind of data has not beaten the state-of-the-art results [12], which shows that a fine-grained representation of linguistic knowledge adds complexity to a parser and has an adverse effect on its performance. In this paper, we aim to study comprehensively the effect of annotation granularity on parsing from three points of view: lexicon, POS tag, and phrase structure. This study takes a different perspective from that of Rehbein and van Genabith [14]. We selected Persian, a language from the Indo-European language family, as the target language of our study.
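To make the notion of annotation granularity concrete, the following minimal sketch shows how a fine-grained annotation could be mapped to a coarser one. It is an illustration only, not the procedure used in this study. The tag format (a main category followed by underscore-separated morpho-syntactic features, such as `N_SING_COM`), the placeholder tokens, and the helper names `coarsen_pos` and `strip_function` are all assumptions made for the example. The hyphenated label `NP-SBJ` follows the Penn Treebank convention for marking a syntactic function on a constituent label.

```python
def coarsen_pos(tag: str, sep: str = "_") -> str:
    """Keep only the main syntactic category of a fine-grained POS tag.

    Assumes a hypothetical tag format in which the category comes first
    and morpho-syntactic features follow, separated by `sep`.
    """
    return tag.split(sep, 1)[0]


def strip_function(label: str, sep: str = "-") -> str:
    """Drop a syntactic-function suffix from a constituent label.

    Labels that start with the separator (e.g. "-NONE-") are left
    unchanged, since they carry no category before the separator.
    """
    if label.startswith(sep):
        return label
    return label.split(sep, 1)[0]


# Placeholder tokens with hypothetical fine-grained tags.
fine_tagged = [("w1", "N_SING_COM"), ("w2", "ADJ_SIM"), ("w3", "V_PRS")]

# Coarse-grained version: same tokens, main categories only.
coarse_tagged = [(word, coarsen_pos(tag)) for word, tag in fine_tagged]

# Constituent labels with and without function information.
fine_labels = ["NP-SBJ", "VP", "PP-LOC"]
coarse_labels = [strip_function(label) for label in fine_labels]
```

Training the same parser once on the fine-grained version and once on the coarse-grained version of a treebank is one way to compare the two granularities on equal footing.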

2 Treebank Annotation Dimensions

Lexicon: The words of a language represent fine-grained concepts, and the linguistic information added to the words plays a very important role for lexicalized,
