PhD thesis - School of Informatics - University of Edinburgh
Chapter 3. Tracking English Inclusions in German
made by the POS tagger and could therefore be avoided if the latter performed with perfect accuracy. One reason for lower tagging accuracy is that POS taggers trained on data for a particular language do not necessarily deal well with text containing foreign inclusions. Moreover, some taggers have difficulty differentiating between common and proper nouns.
In order to gain a better understanding of how POS tagging influences the performance of the English inclusion classifier, I compared three different taggers in a task-based evaluation:
• TnTNEGRA - the TnT tagger trained on the NEGRA corpus of approximately 355,000 tokens (Skut et al., 1997)
• TnTTIGER - the TnT tagger trained on the TIGER corpus of approximately 700,000 tokens (Brants et al., 2002)
• TreeTagger - the TreeTagger trained on a small German newspaper corpus of 25,000 tokens drawn from the Stuttgarter Zeitung (Schmid, 1994, 1995)
The English inclusion classifier is run on the same data set tagged by each of the three POS tagging setups and evaluated against the hand-annotated gold standard. Note that this method does not necessarily determine the best and most accurate POS tagger, but rather the one that is most beneficial for identifying English inclusions in German text. Before discussing the results for each setup, the characteristics of the two POS taggers used in this evaluation are explained in detail.
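The task-based evaluation described above can be sketched as a token-level comparison against the gold standard. The following is a minimal illustration, not the thesis's actual evaluation code: it assumes tokens are labelled "EN" for an English inclusion and "O" otherwise, and computes precision, recall and F-score for the "EN" class.

```python
def evaluate(gold, predicted):
    """Token-level precision, recall and F1 for the "EN" (English inclusion) class.

    gold and predicted are parallel lists of labels, "EN" or "O".
    """
    tp = sum(1 for g, p in zip(gold, predicted) if g == "EN" and p == "EN")
    fp = sum(1 for g, p in zip(gold, predicted) if g != "EN" and p == "EN")
    fn = sum(1 for g, p in zip(gold, predicted) if g == "EN" and p != "EN")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: five tokens, one inclusion missed by the classifier.
gold = ["O", "EN", "EN", "O", "EN"]
pred = ["O", "EN", "O", "O", "EN"]
precision, recall, f1 = evaluate(gold, pred)
```

Running the same scoring function over the classifier output for each of the three tagger setups then gives directly comparable figures.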
3.5.1.1 TnT - Trigrams'n'Tags
TnT is a very efficient statistical POS tagger which can be trained on corpora in different languages and domains and on new POS tag sets (Brants, 2000b). Moreover, the tagger is very fast to train and run. It is based on the Viterbi algorithm for second-order Markov models and therefore assigns the tag ti that is most likely to generate the word wi given the two previous tags ti−1 and ti−2. The output and transition probabilities are estimated from an annotated corpus. In order to deal with data sparseness, the system incorporates linear interpolation-based smoothing and handles unknown words via n-gram-based suffix analysis.
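The interpolation-based smoothing of the transition probabilities can be illustrated as follows. This is a simplified sketch, not TnT's implementation: the trigram probability P(ti | ti−2, ti−1) is approximated as a weighted sum of maximum-likelihood unigram, bigram and trigram estimates. The lambda weights are fixed here for illustration; TnT derives them automatically by deleted interpolation.

```python
from collections import Counter

# Toy training tag sequence; a real tagger would use an annotated corpus.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADJ", "NOUN"]

uni = Counter(tags)
bi = Counter(zip(tags, tags[1:]))
tri = Counter(zip(tags, tags[1:], tags[2:]))
n = len(tags)

def p_interp(t, prev1, prev2, lambdas=(0.2, 0.3, 0.5)):
    """Smoothed P(t | prev2, prev1) = l1*P(t) + l2*P(t|prev1) + l3*P(t|prev2, prev1).

    prev1 is the immediately preceding tag, prev2 the one before it.
    Maximum-likelihood estimates; zero when the conditioning context is unseen.
    """
    l1, l2, l3 = lambdas
    p_uni = uni[t] / n
    p_bi = bi[(prev1, t)] / uni[prev1] if uni[prev1] else 0.0
    p_tri = tri[(prev2, prev1, t)] / bi[(prev2, prev1)] if bi[(prev2, prev1)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Probability of NOUN following the context VERB DET.
prob = p_interp("NOUN", "DET", "VERB")
```

Because the three estimates are interpolated, a trigram never seen in training still receives non-zero probability as long as its bigram or unigram components were observed, which is what makes the smoothing effective against data sparseness.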