05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh


Chapter 3. Tracking English Inclusions in German

Post-processing type    Example
Ambiguous words         Space Station Crew
Single letters          E-mail
Currencies & Units      Euro
Function words          Friends of the Earth
Abbreviations           Europäische Union (EU)
Person names            Präsident Bush

Table 3.7: Different types of post-processing rules.

ule involve resolving the language classification of ambiguous words, single-letter tokens, currencies and units of measurement, function words, abbreviations and person names. Each type of post-processing is listed in Table 3.7 with an example and explained in more detail below. The individual contributions of each type are presented in Section 3.4.2.3. Most of the rules improve performance for all three domains, and none of them lowers the scores.

As the lexicon module classification considers only the token and its POS tag, not its surrounding context, it is difficult to identify the language of interlingual homographs, i.e. tokens with the same spelling in both languages (e.g. Station). Therefore, the majority of post-processing rules are designed to disambiguate such instances. For example, if a language-ambiguous token is preceded and followed by an English token, then it is also likely to be of English origin (e.g. Space Station Crew versus macht Station auf Sizilien). The post-processing module applies rules that disambiguate such interlingual homographs based on their POS tag and contextual information.
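The context rule from the example above can be illustrated with a minimal sketch. It is not the thesis implementation: the label names `EN`, `DE` and `AMBIG` and the function `resolve_ambiguous` are hypothetical, and the sketch uses only the neighbouring language labels, whereas the actual rules also take POS tags into account.

```python
from typing import List


def resolve_ambiguous(labels: List[str]) -> List[str]:
    """Relabel an ambiguous token as English when both neighbours are English.

    `labels` holds one (hypothetical) language label per token:
    "EN", "DE" or "AMBIG" for interlingual homographs.
    """
    resolved = list(labels)
    # Only tokens with both a left and a right neighbour can match the rule.
    for i in range(1, len(labels) - 1):
        if labels[i] == "AMBIG" and labels[i - 1] == "EN" and labels[i + 1] == "EN":
            resolved[i] = "EN"
    return resolved


# "Space Station Crew": Station is surrounded by English tokens.
print(resolve_ambiguous(["EN", "AMBIG", "EN"]))        # ['EN', 'EN', 'EN']
# "macht Station auf Sizilien": no English context, so the label is left alone.
print(resolve_ambiguous(["DE", "AMBIG", "DE", "DE"]))  # ['DE', 'AMBIG', 'DE', 'DE']
```

The sketch reads the neighbours from the original label list rather than the updated one, so a relabelled token never triggers the rule for the token after it.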

Moreover, the module contains rules designed to flag single-character tokens correctly. These occur because the tokeniser is set up to split hyphenated compounds like E-mail into three separate tokens (Section 3.3.2). The core system identifies the language of tokens with a length of more than one character and therefore only recognises mail as English in this example. The post-processing rule flags E as English as well. Several additional rules deal with names of currencies and units of measurement and prevent them from being mistaken for English inclusions. Furthermore, some rules were designed to classify English function words as English.
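The single-character case can be sketched in the same style. The text does not state the exact condition under which E is flagged, so the rule below is an assumption: a single letter is labelled English when it is followed by a hyphen token and an English token. The function name `flag_single_letters` and the use of `None` for tokens the core system leaves unclassified are likewise hypothetical.

```python
from typing import List, Optional


def flag_single_letters(
    tokens: List[str], labels: List[Optional[str]]
) -> List[Optional[str]]:
    """Flag a single-letter token as English if it starts a hyphenated
    compound whose final part is already labelled English.

    Assumed rule, for illustration only: letter + "-" + English token.
    """
    flagged = list(labels)
    for i in range(len(tokens) - 2):
        is_single_letter = len(tokens[i]) == 1 and tokens[i].isalpha()
        if is_single_letter and tokens[i + 1] == "-" and labels[i + 2] == "EN":
            flagged[i] = "EN"
    return flagged


# "E-mail" is split into three tokens; only "mail" is classified by the core system.
tokens = ["E", "-", "mail"]
labels = [None, None, "EN"]
print(flag_single_letters(tokens, labels))  # ['EN', None, 'EN']
```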

As the core system classifies each token individually, a further post-processing
