06.07.2014

A Treebank-based Investigation of IPP-triggering Verbs in Dutch

Profiling Feature Selection for Named Entity Classification in the TüBa-D/Z Treebank

Kathrin Beck, Erhard Hinrichs

Department of General and Computational Linguistics

University of Tübingen

Email: {kbeck;eh}@sfs.uni-tuebingen.de

Abstract

This paper provides an overview of the named entity annotation included in the TüBa-D/Z treebank of German newspaper articles. It describes the subclasses of named entities distinguished by the TüBa-D/Z annotation scheme and profiles a set of surface-oriented and syntax-based features that are highly predictive for different subclasses of named entities.

1 Introduction

The annotation and automatic detection of named entities (henceforth abbreviated as NE) in large corpora has played a major role in recent research in computational linguistics and in the field of digital humanities. In digital humanities research, NEs have featured prominently in the creation of linked data for classical text collections and in the visualization of the content of corpus collections by linking NEs with their geospatial coordinates. In computational linguistics, the identification of NEs is an important ingredient in the automatic detection of coreference relations in texts, in automatic topic detection and text classification, and in question answering and other information retrieval and extraction applications.

In order to provide training material for (semi-)supervised learning algorithms for automatic NE detection, the annotation of corpus materials with NEs has been an important desideratum. For this reason, the linguistic annotation of the Tübingen Treebank of Written German (TüBa-D/Z)¹ has been enhanced in recent years by NE information. To the best of our knowledge, the TüBa-D/Z NE annotation constitutes the largest dataset of this kind for German apart from the German data prepared for the CoNLL-2003 NE recognition shared task (Tjong Kim Sang and de Meulder [9]) and the GerNED corpus (Ploch et al. [6]) (cf. Table 1).

¹ The TüBa-D/Z Treebank (PID: http://hdl.handle.net/11858/00-1778-0000-0005-896C-F) is freely available on http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tuebadz/.
