
to tokenise the XML documents. The first grammar pre-tokenises the text into tokens surrounded by white space and punctuation, and the second grammar groups together various abbreviations, numerals and URLs. Grammar rules also split hyphenated tokens. The two grammars are applied with lxtransduce (http://www.ltg.ed.ac.uk/~richard/lxtransduce.html), a transducer which adds or rewrites XML markup in an input stream based on the rules provided. lxtransduce is an improved version of fsgmatch, the core program of LT-TTT (Grover et al., 2000). The tokenised text is then POS-tagged using the statistical POS tagger TnT (Trigrams'n'Tags). The tagger is trained on the TIGER Treebank (Release 1), which consists of 700,000 tokens of German newspaper text (Brants et al., 2002) annotated with the Stuttgart-Tübingen Tagset (Schiller et al., 1995), henceforth referred to as STTS.
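The grammars themselves are written in lxtransduce's rule language and are not reproduced here. Purely as an illustration, a rough Python analogue of the tokenisation step might look as follows, collapsing the two passes into one prioritised match; the regular expressions and the function name are invented for this sketch, not taken from the grammars:

```python
import re

# Stand-ins for the second grammar's grouping rules: pieces that should
# survive as single tokens rather than being split at punctuation.
URL = re.compile(r"https?://\S+")
ABBREV = re.compile(r"(?:[A-Za-z]\.){2,}")   # e.g. "z.B.", "U.S."
NUMERAL = re.compile(r"\d+(?:[.,]\d+)*")

# Stand-in for the first grammar: word-like chunks and single
# punctuation marks, i.e. tokens delimited by white space and punctuation.
PRETOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenise(text: str) -> list[str]:
    tokens = []
    pos = 0
    while pos < len(text):
        # Grouping patterns take priority so URLs, abbreviations and
        # numerals are emitted as single tokens.
        for pattern in (URL, ABBREV, NUMERAL, PRETOKEN):
            m = pattern.match(text, pos)
            if m:
                tokens.append(m.group())
                pos = m.end()
                break
        else:
            pos += 1  # white space: advance without emitting a token
    # The grammar rules also split hyphenated tokens into their parts.
    parts = []
    for tok in tokens:
        if "-" in tok:
            parts.extend(p for p in tok.split("-") if p)
        else:
            parts.append(tok)
    return parts
```

For example, tokenise("Die E-Mail kostet 3,50 Euro") yields ["Die", "E", "Mail", "kostet", "3,50", "Euro"], with the numeral kept whole and the hyphenated token split.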

3.3.3 Lexicon Lookup Module

The lexicon module performs an initial language classification run based on a case-insensitive lookup procedure using two lexicons, one for the base language of the text and one for the language of the inclusions. The system is designed to search CELEX Version 2 (Celex, 1993), a lexical database of German, English and Dutch. The German database holds 51,728 lemmas and their 365,530 word forms, and the English database contains 52,446 lemmas representing 160,594 corresponding word forms. A CELEX lookup is only performed for tokens which TnT tags as NN (common noun), NE (named entity), ADJA or ADJD (attributive and adverbial or predicatively used adjectives), as well as FM (foreign material). Anglicisms representing other parts of speech are used relatively infrequently in German (Yeandle, 2001), which is the principal reason for focussing on the classification of noun and adjective phrases. Before the lexicon lookup is performed, distinctive characteristics of German orthography are exploited for classification: all tokens containing German umlauts are automatically recognised as German and are not processed further by the system.
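Since CELEX is a licensed database, the following Python sketch uses plain sets of word forms (german_forms and english_forms are hypothetical stand-ins, not CELEX's actual interface) to illustrate the gating logic; how tokens found in both lexicons, or in neither, are resolved is left open here:

```python
# STTS tags for which a CELEX lookup is attempted: common nouns, named
# entities, attributive and adverbial/predicative adjectives, foreign material.
LOOKUP_TAGS = {"NN", "NE", "ADJA", "ADJD", "FM"}

UMLAUTS = set("äöüÄÖÜ")

def initial_classification(token, pos_tag, german_forms, english_forms):
    """First-pass language label for one token, or None if undecided."""
    # German-orthography shortcut: any token containing an umlaut is
    # recognised as German without consulting the lexicons at all.
    if any(ch in UMLAUTS for ch in token):
        return "German"
    # Tokens outside the selected POS categories are not looked up.
    if pos_tag not in LOOKUP_TAGS:
        return None
    in_german = token.lower() in german_forms
    in_english = token.lower() in english_forms
    if in_german and not in_english:
        return "German"
    if in_english and not in_german:
        return "English"
    return None  # in both lexicons or in neither: left to later modules
```

Under this logic a token such as "schön" is classified as German before any lookup takes place, while an umlaut-free token tagged NN falls through to the lexicon check.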

The core lexicon lookup algorithm involves looking up each token twice, once in the German and once in the English CELEX database. Each part of a hyphenated compound is checked individually. Moreover, the lookup in the English database is made case-insensitive in order to identify the capitalised English tokens in the corpus, the reason being that German orthography capitalises all nouns, so English inclusions functioning as nouns also appear with an initial capital.
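A sketch of that double lookup, again over hypothetical german_forms/english_forms sets: each part of a hyphenated compound is checked on its own, and tokens are lowercased before matching against the English side. Whether a compound must match on every part or only some is not specified in this excerpt; the sketch assumes every part:

```python
def lookup_token(token, german_forms, english_forms):
    """Look one token up in both lexicons; hyphen parts are checked individually."""
    parts = [p for p in token.split("-") if p] or [token]
    # Assumption: a hyphenated compound counts as found only if all parts are.
    in_german = all(p.lower() in german_forms for p in parts)
    # Case-insensitive English lookup, so that tokens capitalised under
    # German orthography still match lower-case English lexicon entries.
    in_english = all(p.lower() in english_forms for p in parts)
    return in_german, in_english
```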

