05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh


Chapter 3. Tracking English Inclusions in German

tool based on NXT (Carletta et al., 2003) which operates with stand-off XML input and output. The binary annotation distinguishes between two classes using the BIO-encoding (Ramshaw and Marcus, 1995): English inclusion tokens are marked as I-EN (inside an English token) and any token that falls outside this category is marked as O (outside). As the annotation was performed at the level of the token (and not the phrase), an English inclusion received the tag B-EN only if it was preceded by another English inclusion. The annotation guidelines, which are presented in detail in Appendix B, instructed the annotators to mark as English inclusions:

• all tokens that are English words even if part of NEs: Google

• all abbreviations that expand to English terms: ISS

• compounds that are made up of two English words: Bluetooth
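
The token-level tagging scheme described above can be illustrated with a minimal sketch. It is one reading of the description, not the annotation tool itself: the function name `bio_tags`, the boolean input flags, and the example sentence are hypothetical and serve only to show how I-EN, B-EN and O would be assigned once each token has been judged English or not.

```python
from typing import List


def bio_tags(is_english: List[bool]) -> List[str]:
    """Assign token-level tags from per-token English/non-English flags.

    Assumed reading of the scheme: a token that is not an English
    inclusion gets O; an English inclusion gets B-EN only if the
    immediately preceding token is also an English inclusion, and
    I-EN otherwise.
    """
    tags: List[str] = []
    prev_english = False  # no token precedes the first one
    for flag in is_english:
        if not flag:
            tags.append("O")
        elif prev_english:
            tags.append("B-EN")
        else:
            tags.append("I-EN")
        prev_english = flag
    return tags


# Hypothetical example sentence (not taken from the corpus).
tokens = ["Das", "Bluetooth", "Headset", "ist", "neu"]
flags = [False, True, True, False, False]
for token, tag in zip(tokens, bio_tags(flags)):
    print(f"{token}\t{tag}")
```

Because the annotation is per token rather than per phrase, B-EN here only separates adjacent English tokens from one another; it does not mark the start of a multi-word phrase.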

For the evaluation, it was also decided to ignore English-like person and location names as well as English inclusions occurring:

• as part of URLs: www.stepstone.de

• in mixed-lingual unhyphenated compounds: Shuttleflug (shuttle flight)

• with German inflections: Receivern (with German dative plural case inflection)

Recognising these cases requires further morphological analysis. They will be addressed in future work, when mixed-lingual compounds and inflected inclusions will also need to be represented in the gold standard annotation.

Table 3.1 provides some corpus statistics for each domain and presents the number of English inclusions annotated in the various gold standard development and test sets. Interestingly, the percentage of English inclusions varies considerably across the three domains. There are far more English tokens in the articles on internet & telecoms and space travel than in those on the EU. This result seemed surprising at first, as the development of the EU has facilitated increasing contact between German and English speaking cultures. However, political structures and concepts are intrinsic parts of individual cultures and therefore tend to have their own expressions. Moreover, EU legislation is translated into all its official languages, currently numbering 23. This language policy renders English less dominant in this domain than was
