PhD thesis - School of Informatics - University of Edinburgh
Chapter 3. Tracking English Inclusions in German
Post-processing type    Example
Ambiguous words         Space Station Crew
Single letters          E-mail
Currencies & Units      Euro
Function words          Friends of the Earth
Abbreviations           Europäische Union (EU)
Person names            Präsident Bush
Table 3.7: Different types of post-processing rules.
ule involve resolving the language classification of ambiguous words, single-letter tokens,
currencies and units of measurement, function words, abbreviations and person names.
Each type of post-processing is listed in Table 3.7 with an example and explained in
more detail below. The individual contribution of each type is presented in
Section 3.4.2.3. Most of the rules improve performance in all three domains, and none
of them lowers the scores.
As the lexicon module classification considers only the token and its POS tag, not its
surrounding context, it is difficult to identify the language of interlingual
homographs, i.e. tokens with the same spelling in both languages (e.g. Station). Therefore,
the majority of post-processing rules are designed to disambiguate such instances. For
example, if a language-ambiguous token is preceded and followed by an English token,
then it is also likely to be of English origin (e.g. Space Station Crew versus macht
Station auf Sizilien). The post-processing module applies rules that disambiguate such
interlingual homographs based on their POS tag and contextual information.
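
The context rule described above can be sketched as follows. This is a minimal illustration, not the implementation used in the thesis: the label names (EN, DE, AMBIG), the function name resolve_ambiguous and the requirement that both neighbours be English are assumptions made for the example, and the POS-tag conditions that the actual rules also consult are omitted.

```python
from typing import List, Tuple

# A token is a (surface form, language label) pair.
# The labels "EN", "DE" and "AMBIG" are hypothetical names for this sketch.
Token = Tuple[str, str]


def resolve_ambiguous(tokens: List[Token]) -> List[Token]:
    """Relabel an ambiguous token as English when both neighbours are English."""
    resolved = list(tokens)
    for i in range(1, len(tokens) - 1):
        word, label = tokens[i]
        if label != "AMBIG":
            continue
        # Inspect the original neighbour labels so that one relabelling
        # cannot trigger a chain of further relabellings.
        if tokens[i - 1][1] == "EN" and tokens[i + 1][1] == "EN":
            resolved[i] = (word, "EN")
    return resolved


english_context = [("Space", "EN"), ("Station", "AMBIG"), ("Crew", "EN")]
german_context = [("macht", "DE"), ("Station", "AMBIG"),
                  ("auf", "DE"), ("Sizilien", "DE")]

print(resolve_ambiguous(english_context))
print(resolve_ambiguous(german_context))
```

In this sketch, Station is relabelled as English in Space Station Crew, whereas in macht Station auf Sizilien it is left unchanged because neither neighbour is English.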
Moreover, the module contains rules designed to flag single-character tokens correctly.
These occur because the tokeniser is set up to split hyphenated compounds
like E-mail into three separate tokens (Section 3.3.2). The core system only identifies the
language of tokens longer than one character and therefore only recognises
mail as English in this example. The post-processing rule flags E as English as
well. Several additional rules deal with names of currencies and units of measurement
and prevent them from being mistaken for English inclusions. Furthermore, some rules
were designed to classify English function words as English.
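
The single-letter rule for the E-mail example can be illustrated in the same way. Again, this is only a sketch under stated assumptions: the text says that the rule flags E as English, but the exact trigger condition is not given, so the condition used here (a single letter followed by a hyphen token and an English token) is a guess, as are the label UNK for tokens the core system leaves unclassified and the function name flag_single_letters.

```python
from typing import List, Tuple

# A token is a (surface form, language label) pair.
# "UNK" is a hypothetical label for tokens the core system did not classify.
Token = Tuple[str, str]


def flag_single_letters(tokens: List[Token]) -> List[Token]:
    """Flag a single letter as English if it precedes '-' plus an English token."""
    resolved = list(tokens)
    for i in range(len(tokens) - 2):
        word = tokens[i][0]
        hyphen = tokens[i + 1][0]
        next_label = tokens[i + 2][1]
        # Assumed trigger: single letter, then a hyphen, then an English token.
        if (len(word) == 1 and word.isalpha()
                and hyphen == "-" and next_label == "EN"):
            resolved[i] = (word, "EN")
    return resolved


# E-mail after tokenisation: only "mail" was recognised by the core system.
email = [("E", "UNK"), ("-", "UNK"), ("mail", "EN")]
# A German compound with the same shape should not be affected.
s_bahn = [("S", "UNK"), ("-", "UNK"), ("Bahn", "DE")]

print(flag_single_letters(email))
print(flag_single_letters(s_bahn))
```

Under these assumptions, E is flagged as English because mail already is, while S in S-Bahn is left untouched.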
As the core system classifies each token individually, a further post-processing