A Treebank-based Investigation of IPP-triggering Verbs in Dutch
Profiling Feature Selection for Named Entity
Classification in the TüBa-D/Z Treebank
Kathrin Beck, Erhard Hinrichs
Department of General and Computational Linguistics
University of Tübingen
Email: {kbeck;eh}@sfs.uni-tuebingen.de
Abstract
This paper provides an overview of the named entity annotation included in the
TüBa-D/Z treebank of German newspaper articles. It describes the subclasses of
named entities distinguished by the TüBa-D/Z annotation scheme and profiles a set
of surface-oriented and syntax-based features that are highly predictive for different
subclasses of named entities.
1 Introduction
The annotation and automatic detection of named entities (henceforth
abbreviated as NE) in large corpora has played a major role in recent research
in computational linguistics and in the field of digital humanities. In digital
humanities research, NEs have featured prominently in the creation of linked
data for classical text collections and in the visualization of the content of
corpus collections by linking NEs with their geospatial coordinates. In
computational linguistics, the identification of NEs is an important ingredient
in the automatic detection of coreference relations in texts, in automatic topic
detection and text classification, and in question answering and other
information retrieval and extraction applications.
In order to provide training material for (semi-)supervised learning
algorithms for automatic NE detection, the annotation of corpus materials
with NEs has been an important desideratum. For this reason, the linguistic
annotation of the Tübingen Treebank of Written German (TüBa-D/Z) 1 has
been enhanced in recent years by NE information. To the best of our
knowledge, the TüBa-D/Z NE annotation constitutes the largest dataset of
this kind for German apart from the German data prepared for the CoNLL-2003
NE recognition shared task (Tjong Kim Sang and de Meulder [9]) and
the GerNED corpus (Ploch et al. [6]) (cf. Table 1).
1 The TüBa-D/Z Treebank (PID: http://hdl.handle.net/11858/00-1778-0000-0005-896C-F)
is freely available at http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tuebadz/.