PhD thesis - School of Informatics - University of Edinburgh
Chapter 3. Tracking English Inclusions in German 57
to tokenise the XML documents. The first grammar pre-tokenises the text into tokens surrounded by white space and punctuation and the second grammar groups together various abbreviations, numerals and URLs. Grammar rules also split hyphenated tokens. The two grammars are applied with lxtransduce6, a transducer which adds or rewrites XML markup to an input stream based on the rules provided. lxtransduce is an improved version of fsgmatch, the core program of LT-TTT (Grover et al., 2000). The tokenised text is then POS-tagged using the statistical POS tagger TnT (Trigrams'n'Tags). The tagger is trained on the TIGER Treebank (Release 1) which consists of 700,000 tokens of German newspaper text (Brants et al., 2002) annotated with the Stuttgart-Tübingen Tagset (Schiller et al., 1995), henceforth referred to as STTS.
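The two-pass design above can be illustrated with a simplified sketch. This is not the actual lxtransduce grammar; the regular expressions and the abbreviation list are made-up stand-ins that mimic the described behaviour: pass 1 splits on white space and punctuation, pass 2 regroups abbreviations and URLs while hyphenated tokens remain split.

```python
import re

# Illustrative stand-in for the two lxtransduce grammars described above.
# Patterns and the abbreviation list are simplified examples, not the
# grammars used in the thesis.

PASS1 = re.compile(r"\w+|[^\w\s]")        # words vs. single punctuation marks
URL = re.compile(r"https?://\S+")
ABBREVIATIONS = {"z.B.", "d.h.", "Dr.", "etc."}  # illustrative only

def tokenise(text):
    # Protect URLs so the coarse pass 1 split does not shred them.
    urls = URL.findall(text)
    for i, u in enumerate(urls):
        text = text.replace(u, f" URLTOKEN{i} ")
    tokens = PASS1.findall(text)
    # Pass 2: regroup abbreviation pieces such as 'z', '.', 'B', '.'.
    out, i = [], 0
    while i < len(tokens):
        for n in (4, 2):
            cand = "".join(tokens[i:i + n])
            if cand in ABBREVIATIONS:
                out.append(cand)
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    # Restore the protected URLs.
    return [urls[int(t[8:])] if t.startswith("URLTOKEN") else t for t in out]

print(tokenise("Die S-Bahn fährt z.B. nach http://example.com ."))
```

Note that a hyphenated token such as "S-Bahn" comes out as three tokens ("S", "-", "Bahn"), matching the splitting behaviour of the grammar rules.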
3.3.3 Lexicon Lookup Module
The lexicon module performs an initial language classification run based on a case-insensitive lookup procedure using two lexicons, one for the base language of the text and one for the language of the inclusions. The system is designed to search CELEX Version 2 (Celex, 1993), a lexical database of German, English and Dutch. The German database holds 51,728 lemmas and their 365,530 word forms and the English database contains 52,446 lemmas representing 160,594 corresponding word forms. A CELEX lookup is only performed for tokens which TnT tags as NN (common noun), NE (named entity), ADJA or ADJD (attributive and adverbial or predicatively used adjectives) as well as FM (foreign material). Anglicisms representing other parts of speech are relatively infrequently used in German (Yeandle, 2001), which is the principal reason for focussing on the classification of noun and adjective phrases. Before the lexicon lookup is performed, distinctive characteristics of German orthography are exploited for classification: all tokens containing German umlauts are automatically recognised as German and are not processed further by the system.
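The pre-lookup filtering can be sketched as follows. The function and its return labels are hypothetical; the tag names are the STTS tags the text names, and the umlaut heuristic and POS filter are as described above.

```python
# Sketch of the pre-lookup filter: only tokens TnT tags as NN, NE, ADJA,
# ADJD or FM proceed to the CELEX lookup, and any token containing a
# German umlaut is classified as German immediately. Function name and
# return labels are illustrative, not from the thesis.

LOOKUP_TAGS = {"NN", "NE", "ADJA", "ADJD", "FM"}
GERMAN_UMLAUTS = set("äöüÄÖÜ")

def classify_before_lookup(token, pos_tag):
    """Return 'german', 'skip', or 'lookup' for a POS-tagged token."""
    if any(ch in GERMAN_UMLAUTS for ch in token):
        return "german"   # umlaut heuristic: no further processing needed
    if pos_tag not in LOOKUP_TAGS:
        return "skip"     # other parts of speech are not checked
    return "lookup"       # candidate for the CELEX lookup

print(classify_before_lookup("schön", "ADJD"))   # → german
print(classify_before_lookup("Computer", "NN"))  # → lookup
print(classify_before_lookup("und", "KON"))      # → skip
```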
The core lexicon lookup algorithm involves each token being looked up twice, in both the German and English CELEX databases. Each part of a hyphenated compound is checked individually. Moreover, the lookup in the English database is made case-insensitive in order to identify the capitalised English tokens in the corpus, the reason
6 http://www.ltg.ed.ac.uk/~richard/lxtransduce.html
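The dual-lookup step described in the paragraph above can be sketched as follows, assuming the two CELEX word-form lists have been loaded into sets. The sample entries are illustrative, not taken from CELEX, and the function name is hypothetical.

```python
# Minimal sketch of the dual CELEX lookup: each token is looked up in
# both the German and the English word-form sets, each part of a
# hyphenated compound is checked individually, and the lookups are
# case-insensitive. The word lists below are made-up examples.

german_wordforms = {"haus", "der", "software"}
english_wordforms = {"house", "update", "software"}

def lookup(token):
    """Return (part, in_german, in_english) for each hyphen-separated part."""
    parts = token.split("-")
    results = []
    for part in parts:
        key = part.lower()  # case-insensitive: catches capitalised anglicisms
        results.append((part,
                        key in german_wordforms,
                        key in english_wordforms))
    return results

print(lookup("Update"))         # → [('Update', False, True)]
print(lookup("Software-Haus"))  # each compound part is checked individually
```

The case-folding matters because English inclusions used as German nouns are conventionally capitalised, so an exact-case lookup against the English word-form list would miss them.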