05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh


Chapter 4. System Extension to a New Language

tools and therefore demanded more time and effort to be customised for French. The core system was adapted in approximately one person week in total (Section 4.1). Figure 4.2 illustrates the system architecture after extending it to French. Note that the system now involves an additional document-based language identification step after pre-processing in which the base language of the document is determined by TextCat (Cavnar and Trenkle, 1994). TextCat, a traditional language identification tool, performs well on identifying the language of sentences and larger passages. This enables running the English inclusion classifier over either German or French text without having to specify the base language of the text manually. The base-language-specific classifier components are therefore initiated purely based on TextCat’s language identification. For both the German and French newspaper articles, TextCat is always able to identify the language correctly.
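The language identification and dispatch step described above can be sketched as follows. This is a minimal illustration of the character n-gram rank-order method of Cavnar and Trenkle (1994), on which TextCat is based; it is not TextCat itself. The toy training sentences, the function names and the component names in `COMPONENTS` are assumptions made for the example, and real language profiles are built from far more text.

```python
from collections import Counter

# Toy training samples (assumption): real profiles use much larger corpora.
SAMPLES = {
    "de": "die regierung hat gestern in berlin ein neues gesetz beschlossen "
          "und die wirtschaft wird weiter wachsen",
    "fr": "le gouvernement a adopté hier à paris une nouvelle loi "
          "et la croissance économique devrait se poursuivre",
}

# Hypothetical names for the base-language-specific classifier components.
COMPONENTS = {
    "de": "german_classifier_components",
    "fr": "french_classifier_components",
}


def ngram_profile(text, max_n=5, top_k=300):
    """Map the top_k most frequent character n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a fixed penalty."""
    max_penalty = len(lang_profile)
    distance = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            distance += abs(rank - lang_profile[gram])
        else:
            distance += max_penalty
    return distance


def identify_language(text, profiles):
    """Return the language whose profile is closest to the document profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


def select_components(text, profiles):
    """Pick the base-language-specific components from the identified language."""
    return COMPONENTS[identify_language(text, profiles)]


PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Calling `select_components(document_text, PROFILES)` then replaces a manually specified base language. Because identification works on the whole document rather than on single tokens, short English inclusions are unlikely to outweigh the n-gram statistics of the surrounding German or French text.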

4.3.1 Pre-processing Module<br />

The pre-processing module involves tokenisation and part-of-speech (POS) tagging (cf. Section 3.3.2). First, the German tokeniser was adapted to French and a French POS tagger was integrated into the system. The French tokeniser consists of two rule-based tokenisation grammars. In the same way as the German version, it not only identifies tokens surrounded by white space and punctuation but also resolves typical abbreviations, numerals and URLs. Both grammars are applied by means of improved versions of the XML tools described in Thompson et al. (1997) and Grover et al. (2000). These tools process an XML input stream and rewrite it on the basis of the rules provided. The French TreeTagger (see Section 3.5.1.2) is used for POS tagging. It is freely available for research and is also trained for a number of other languages, including German and English. The TreeTagger functions on the basis of binary decision trees trained on a French corpus of 35,448 word forms and yields a tagging accuracy of over 94% on an evaluation data set comprising 10,000 word forms (Stein and Schmidt, 1995).²

² While trained models are available online, the tagged data set that was used to train and evaluate the French TreeTagger model is not part of the distribution. Otherwise, the data could have been used to train TnT, as that tagger resulted in a better performance of the English inclusion classifier on the German development data (see Section 3.5.1).
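The kind of rule-based tokenisation described in this section, covering abbreviations, numerals and URLs, might be approximated as in the sketch below. It swaps in a single Python regular expression for the two XML tokenisation grammars, so it is only an illustration of the idea and not the system's actual rules. The abbreviation list, the function names and the `<s>`/`<w>` element names are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Alternatives are tried in order, so the more specific rules come first.
TOKEN_RE = re.compile(
    r"""
    (?P<url>https?://\S+[^\s.,;:!?)])      # URL, excluding trailing punctuation
  | (?P<abbr>\b(?:MM|M|Dr|etc|cf)\.)       # hypothetical abbreviation list
  | (?P<num>\d+(?:[.,]\d+)*)               # numerals such as 35,448 or 94.5
  | (?P<word>\w+(?:-\w+)*['’]?)            # words, hyphenated forms, elided forms
  | (?P<punct>[^\w\s])                     # any remaining non-space character
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Return (token, rule name) pairs; white space between tokens is skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]


def to_xml(tokens):
    """Rewrite the token list as XML markup (element names are assumptions)."""
    words = "".join(
        f'<w type="{kind}">{escape(token)}</w>' for token, kind in tokens
    )
    return f"<s>{words}</s>"


if __name__ == "__main__":
    sentence = "M. Dupont l'a vu sur http://www.example.com, etc."
    for token, kind in tokenise(sentence):
        print(f"{kind:6} {token}")
```

Ordering the alternatives from most to least specific keeps the full stop in `M.` and `etc.` attached to the abbreviation while a full stop or comma after a URL still becomes its own token. The resulting tokens would then be passed to the POS tagger.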
