PhD thesis - School of Informatics - University of Edinburgh

Chapter 3. Tracking English Inclusions in German

made by the POS tagger and therefore could be avoided if the latter performed with perfect accuracy. One reason for lower tagging accuracy is the fact that POS taggers trained on data for a particular language do not necessarily deal well with text containing foreign inclusions. Moreover, some taggers have difficulty differentiating between common and proper nouns.

In order to gain a better understanding of how POS tagging influences the performance of the English inclusion classifier, I compared three different taggers in a task-based evaluation:

• TnTNEGRA - the TnT tagger trained on the NEGRA corpus of approximately 355,000 tokens (Skut et al., 1997)
• TnTTIGER - the TnT tagger trained on the TIGER corpus of approximately 700,000 tokens (Brants et al., 2002)
• TreeTagger - the TreeTagger trained on a small corpus of the German newspaper Stuttgarter Zeitung containing 25,000 tokens (Schmid, 1994, 1995)

The English inclusion classifier is run on the same set of data tagged by each of the three POS taggers and evaluated against the hand-annotated gold standard. Note that this method does not necessarily determine the best and most accurate POS tagger, but rather the one that is most beneficial for identifying English inclusions in German text. Before discussing the results for each setup, the characteristics of the two POS taggers used in this evaluation are explained in detail.
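The task-based comparison above amounts to scoring each tagger's downstream output against the same gold standard. The following is a minimal sketch of such a scorer; the token-level F1 helper and the "ENG" inclusion label are illustrative assumptions, not the actual evaluation code or annotation scheme used in the thesis.

```python
# Hypothetical token-level scorer: both inputs are parallel lists of
# labels, where "ENG" marks a token classified as an English inclusion.
def f1_against_gold(predicted, gold):
    """F1 of predicted English-inclusion labels against the gold standard."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == g == "ENG")
    fp = sum(1 for p, g in zip(predicted, gold) if p == "ENG" and g != "ENG")
    fn = sum(1 for p, g in zip(predicted, gold) if p != "ENG" and g == "ENG")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One such score per tagger setup (TnTNEGRA, TnTTIGER, TreeTagger),
# all computed over the same gold-annotated data.
```

Because every setup is scored on identical data against the same gold annotation, differences in F1 can be attributed to the upstream POS tagging alone.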

3.5.1.1 TnT - Trigrams’n’Tags<br />

TnT is a very efficient statistical POS tagger which can be trained on corpora in different languages and domains, and on new POS tag sets (Brants, 2000b). Moreover, the tagger is very fast to train and run. It is based on the Viterbi algorithm for second-order Markov models and therefore assigns the tag t_i that is most likely to generate the word w_i given the two previous tags t_{i-1} and t_{i-2}. The output and transition probabilities are estimated from an annotated corpus. In order to deal with data sparseness, the system incorporates linear interpolation-based smoothing and handles unknown words via n-gram-based suffix analysis.
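The decoding step described above can be sketched as a Viterbi search over trigram tag histories. This is not TnT's implementation, only an illustration of the underlying model; the toy probabilities and tag names are invented, and TnT additionally smooths the trigram transition estimates by linearly interpolating unigram, bigram and trigram frequencies.

```python
from math import log

def viterbi_trigram(words, tags, trans, emit):
    """Return the tag sequence maximising
    prod_i P(t_i | t_{i-2}, t_{i-1}) * P(w_i | t_i),
    i.e. Viterbi decoding for a second-order Markov model.

    trans: dict mapping (t_{i-2}, t_{i-1}, t_i) to a transition probability
    emit:  dict mapping (t_i, w_i) to an output probability
    """
    # chart maps a history (t_{i-1}, t_i) to the best log-probability
    # of any path ending in that history, plus the path itself.
    chart = {("<s>", "<s>"): (0.0, [])}
    for w in words:
        new_chart = {}
        for (t2, t1), (lp, path) in chart.items():
            for t in tags:
                # Unseen events get a small floor probability; TnT instead
                # uses interpolated smoothing and suffix-based estimates.
                p = trans.get((t2, t1, t), 1e-8) * emit.get((t, w), 1e-8)
                cand = lp + log(p)
                key = (t1, t)
                if key not in new_chart or cand > new_chart[key][0]:
                    new_chart[key] = (cand, path + [t])
        chart = new_chart
    # Best-scoring path over all final histories.
    return max(chart.values())[1]

# Toy example: a German determiner followed by an (English) noun.
trans = {("<s>", "<s>", "ART"): 0.6, ("<s>", "<s>", "NN"): 0.4,
         ("<s>", "ART", "NN"): 0.9, ("<s>", "ART", "ART"): 0.1,
         ("<s>", "NN", "NN"): 0.5, ("<s>", "NN", "ART"): 0.5}
emit = {("ART", "das"): 0.9, ("NN", "das"): 0.05,
        ("NN", "Update"): 0.8, ("ART", "Update"): 0.01}
best = viterbi_trigram(["das", "Update"], ["ART", "NN"], trans, emit)
# best == ["ART", "NN"]
```

The chart only ever stores one entry per (t_{i-1}, t_i) history, which is what makes second-order Viterbi decoding efficient: the cost per token is quadratic in the tag-set size times the number of candidate tags, not exponential in the sentence length.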
