PhD thesis - School of Informatics - University of Edinburgh
Chapter 4. System Extension to a New Language
that they are less repetitive than those contained in the German articles. However,
overall TTRs are 0.6 points lower for French than for German, which means that the
remaining vocabulary in the French articles is somewhat less heterogeneous than in
the German data. This is due to the nature of the two languages. In German, lexical
variety is higher due to compounding and case inflection for articles, adjectives and
nouns.
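The type-token ratio (TTR) comparison above can be illustrated with a minimal sketch. It assumes TTR is the number of distinct tokens divided by the total number of tokens, with whitespace tokenisation and case-folding; the scaling and normalisation used for the reported figures may differ. The function name and the two toy sentences are illustrative only and are not taken from the thesis data.

```python
def type_token_ratio(tokens, lowercase=True):
    """Return distinct tokens (types) divided by total tokens.

    Assumption: types are compared after optional case-folding.
    """
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy sentences (hypothetical, not thesis data): the German article
# appears in three case forms (der, den, des), whereas the French
# sentence repeats "le", which yields fewer distinct types.
german_tokens = "der Hund sieht den Hund des Nachbarn".split()
french_tokens = "le chien voit le chien du voisin".split()

german_ttr = type_token_ratio(german_tokens)
french_ttr = type_token_ratio(french_tokens)
```

In this toy case the German sentence has six types over seven tokens and the French sentence five over seven, which mirrors the effect of case inflection described above on a very small scale.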
French internet data
Token       f    Token     f
e          49    web      21
Google     44    spam     18
internet   35    mail     18
Microsoft  33    Firefox  16
mails      32    Windows  14
Table 4.2: Ten most frequent (f) English inclusions in the French data.<br />
The ten most frequent English inclusions found in the French data set are listed
in Table 4.2. The majority of them are either common or proper nouns. The most
frequent English token is actually the single-character token e, as occurring in e-mail.
Less frequent pseudo-anglicisms like le forcing in the sense of putting pressure on
someone or le parking referring to a car park are also contained in the French data.
Although such lexical items are either non-existent or scarcely used by native English
speakers, they are also annotated as English inclusions.
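A frequency list such as Table 4.2 could be derived by tallying every token annotated as an English inclusion. The sketch below is one way to do this, assuming the annotated data is available as (token, is_english) pairs; that input format, the function name and the toy tokens are assumptions for illustration, not the format or data used in the thesis. Tokens are counted as written, so that forms like mail and mails remain separate entries, as they do in the table.

```python
from collections import Counter


def most_frequent_inclusions(annotated_tokens, n=10):
    """Count tokens flagged as English and return the n most frequent.

    annotated_tokens: iterable of (token, is_english) pairs
    (hypothetical input format).
    """
    counts = Counter(tok for tok, is_english in annotated_tokens if is_english)
    return counts.most_common(n)


# Toy annotated sequence (hypothetical, not thesis data).
sample = [
    ("e", True), ("mail", True), ("de", False),
    ("Google", True), ("e", True), ("le", False),
    ("Google", True), ("e", True),
]

top_two = most_frequent_inclusions(sample, n=2)
```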
4.3 System Module Conversion to French<br />
The extended system architecture of the English inclusion classifier consists of several
pre-processing steps, followed by a lexicon module, a search engine module and
a post-processing module. Converting the search engine module to a new language
required little computational cost and time. Conversely, the pre- and post-processing
as well as the lexicon modules necessitated some language knowledge resources or