
PhD thesis - School of Informatics - University of Edinburgh



Chapter 3. Tracking English Inclusions in German

3.2.4 Annotation Issues

Although the annotation guidelines described above are relatively clear, the actual annotation revealed some tricky cases which were difficult to classify with the binary classification scheme. This section discusses the main issues which need to be clarified in order to revise and possibly extend the current guidelines.

These complicated instances mainly concern NEs which cannot be found in the individual lexicons but have certain language-specific morphology and comprise character sequences typical of that language. Table 3.4 lists some examples of different types of NEs that stem from German, English and other language origins. Dudenhöffer and Neckermann are clearly German names, just as Hutchison and Forrester are English names. Difficult cases are Sony (sonus + sonny), Activy (similar to activity) and Booxtra (book + extra). Such English-like examples were not annotated as English inclusions in the gold standard. Therefore, if the system identifies them as English, its performance scores determined in the evaluation (Section 3.4) are to some extent unfairly penalised. One way of including these instances in the evaluation is to annotate them as English with an attribute distinguishing them from real English words.
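The effect of such an attribute on the scores can be illustrated with a minimal sketch. It assumes token-level scoring of the English class, which may differ from the evaluation actually used in Section 3.4. The names `Token`, `english_like` and `prf` are hypothetical, and the four tokens and the system output are a toy input made up for illustration, not data from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    gold_english: bool          # binary gold-standard label
    english_like: bool = False  # hypothetical attribute for names such as "Booxtra"


def prf(tokens, predicted, count_english_like=False):
    """Token-level precision, recall and F1 for the English class.

    If count_english_like is True, tokens carrying the hypothetical
    english_like attribute are treated as English in the gold standard.
    """
    tp = fp = fn = 0
    for tok, pred in zip(tokens, predicted):
        gold = tok.gold_english or (count_english_like and tok.english_like)
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif gold and not pred:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy input (assumed, not from the corpus): the system marks "Booxtra" as English.
tokens = [
    Token("Booxtra", gold_english=False, english_like=True),
    Token("Internet", gold_english=True),
    Token("Neckermann", gold_english=False),
    Token("Software", gold_english=True),
]
predicted = [True, True, False, True]

strict = prf(tokens, predicted)                            # Booxtra counts as an error
lenient = prf(tokens, predicted, count_english_like=True)  # Booxtra counts as correct
```

Under the strict binary scheme the system's decision on Booxtra is counted as a false positive and lowers precision. With the attribute taken into account, the same decision is counted as correct, while the two sets of scores can still be reported separately.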

NE type    German        English     Other
Person     Dudenhöffer   Hutchison   Kinoshita
Company    Neckermann    Forrester   Kelkoo

Table 3.4: Difficult annotation examples.

The German development corpus also contains NEs from other languages. Readers might well know that Kinoshita is neither a German nor an English name, although they may not be able to identify it as a Japanese name. Interestingly, Font-Llitjós and Black (2001) show that out of a list of 516 names, only 43% can be labelled confidently by human annotators with respect to their language origin. An example in our corpus where the language origin is not at all apparent from the character sequence is the name Kelkoo, a play on words derived from the French for "What a bargain" (Quel coup). It represents the English phonetic spelling of the French phrase. Other foreign names are Toshiba (Japanese), Svanberg (Swedish), Altavista (Spanish) and Centrino (Italian).
