PhD thesis - School of Informatics - University of Edinburgh
Chapter 4. System Extension to a New Language
tools and therefore demanded more time and effort to be customised for French. The
core system was adapted in approximately one person-week in total (Section 4.1).
Figure 4.2 illustrates the system architecture after its extension to French. Note that the
system now includes an additional document-level language identification step after
pre-processing, in which the base language of the document is determined by TextCat
(Cavnar and Trenkle, 1994). TextCat, a traditional language identification tool,
performs well at identifying the language of sentences and larger passages. This makes it
possible to run the English inclusion classifier over either German or French text without
having to specify the base language of the text manually. The base-language-specific
classifier components are therefore initiated purely on the basis of TextCat's language
identification. For both the German and French newspaper articles, TextCat always
identifies the language correctly.
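The language identification and dispatch step can be illustrated with a minimal sketch of the rank-ordered character n-gram comparison described by Cavnar and Trenkle (1994). This is not TextCat's code: the toy training strings, the profile length `TOP_K`, and the component names in `COMPONENTS` are all assumptions made for illustration, and a real profile would be built from far more text.

```python
from collections import Counter

TOP_K = 300  # profile length; an illustrative value, not taken from the thesis


def ngram_profile(text, max_n=5, top_k=TOP_K):
    """Return the most frequent character n-grams (n = 1..max_n), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # underscores mark the word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending frequency, then alphabetical so that ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def out_of_place(doc_profile, lang_profile, max_penalty=TOP_K):
    """Sum of rank differences; n-grams missing from the language profile
    receive a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


# Toy training text (assumption): a handful of function words per language.
SAMPLES = {
    "de": "der die das und ist nicht ein eine ich sie wir mit für auf dem den von zu sich auch",
    "fr": "le la les et est ne pas un une je elle nous avec pour sur du des de à se aussi",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

# Hypothetical names for the base-language-specific components.
COMPONENTS = {
    "de": {"tokeniser": "german-tokeniser", "tagger": "german-tagger"},
    "fr": {"tokeniser": "french-tokeniser", "tagger": "french-tagger"},
}


def identify_language(document):
    """Pick the language whose profile is closest to the document's profile."""
    doc_profile = ngram_profile(document)
    return min(PROFILES, key=lambda lang: out_of_place(doc_profile, PROFILES[lang]))


def select_components(document):
    """Choose the components to initiate from the identified base language."""
    return COMPONENTS[identify_language(document)]
```

Calling `select_components(article_text)` on a whole article mirrors the document-level decision described above: the base language is identified once, and every later component is chosen from that single result.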
4.3.1 Pre-processing Module
The pre-processing module involves tokenisation and part-of-speech (POS) tagging
(cf. Section 3.3.2). First, the German tokeniser was adapted to French and a French POS
tagger was integrated into the system. The French tokeniser consists of two rule-based
tokenisation grammars. Like the German version, it not only identifies tokens
surrounded by white space and punctuation but also resolves typical abbreviations,
numerals and URLs. Both grammars are applied by means of upgraded versions
of the XML tools described in Thompson et al. (1997) and Grover et al. (2000). These
tools process an XML input stream and rewrite it on the basis of the rules provided.
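To make the kind of rule the tokeniser applies more concrete, the following sketch uses ordered regular-expression rules as a stand-in for the tokenisation grammars. It is not the XML-based tooling used in the thesis: the abbreviation list, the rule names and the `tokenise` function are hypothetical, and the sketch returns labelled tokens rather than rewriting an XML stream.

```python
import re

# Hypothetical abbreviation list; the actual grammar rules are not enumerated here.
ABBREVIATIONS = ["M.", "Mme.", "etc.", "p.ex.", "c.-à-d."]

# Longest abbreviations first so that a longer rule is never shadowed by a shorter one.
_ABBR = "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))

# Rules are tried in order: special cases first, plain words and punctuation last.
TOKEN_RE = re.compile(
    r"""
    (?P<url>(?:https?://|www\.)[^\s<>"]+[^\s<>".,;:!?)])   # URL without trailing punctuation
  | (?P<abbr>{abbr})                                       # known abbreviations keep their full stop
  | (?P<num>\d+(?:[.,]\d+)*)                               # numerals such as 35,448 or 94.5
  | (?P<word>\w+(?:-\w+)*)                                 # words, including hyphenated ones
  | (?P<punct>[^\w\s])                                     # any other single symbol
    """.format(abbr=_ABBR),
    re.VERBOSE,
)


def tokenise(text):
    """Return a list of (rule_name, token) pairs in document order."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for kind, token in tokenise("M. Dupont a lu 35,448 mots sur http://www.example.org/page, etc."):
        print(kind, token)
```

Ordering the rules from most to least specific is the design choice that matters here: the full stop in `M.` and the comma in `35,448` are kept inside their tokens only because the abbreviation and numeral rules are tried before the generic word and punctuation rules.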
The French TreeTagger (see Section 3.5.1.2) is used for POS tagging. It is freely
available for research and has also been trained for a number of other languages, including
German and English. The TreeTagger functions on the basis of binary decision trees trained on
a French corpus of 35,448 word forms and yields a tagging accuracy of over 94% on
an evaluation data set comprising 10,000 word forms (Stein and Schmidt, 1995). 2
2 While trained models are available online, the tagged data set that was used to train and evaluate
the French TreeTagger model is not part of the distribution. Otherwise, the data could have been used
to train TnT, as that tagger yielded better performance of the English inclusion classifier on the
German development data (see Section 3.5.1).