PhD thesis - School of Informatics - University of Edinburgh
Chapter 2. Background and Theory
that could have generated the observed token sequence. This is done by means of the
Viterbi algorithm (e.g. Rabiner, 1989).
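As an illustration of this decoding step, the following is a minimal sketch of Viterbi decoding over language tags, not Farrugia's implementation. The two states (`mt`, `en`), all probabilities, the example tokens and the `floor` value for unseen tokens are assumptions invented for the example.

```python
import math

def viterbi(tokens, states, start, trans, emit, floor=1e-6):
    """Most likely state (language tag) sequence for tokens, in log space."""
    if not tokens:
        return []

    def log_emit(state, token):
        # Assumption: unseen tokens get a small floor probability, not zero.
        return math.log(emit[state].get(token, floor))

    # best[s]: log-probability of the best path that ends in state s
    best = {s: math.log(start[s]) + log_emit(s, tokens[0]) for s in states}
    backpointers = []
    for token in tokens[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + math.log(trans[p][s]))
            new_best[s] = best[prev] + math.log(trans[prev][s]) + log_emit(s, token)
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy model: every number below is made up for illustration.
states = ["mt", "en"]
start = {"mt": 0.5, "en": 0.5}
trans = {"mt": {"mt": 0.8, "en": 0.2}, "en": {"mt": 0.2, "en": 0.8}}
emit = {
    "mt": {"il": 0.45, "kelb": 0.45, "the": 0.05, "dog": 0.05},
    "en": {"il": 0.05, "kelb": 0.05, "the": 0.45, "dog": 0.45},
}

tags = viterbi(["il", "kelb", "the", "dog"], states, start, trans, emit)
print(tags)  # ['mt', 'mt', 'en', 'en']
```

Working in log space avoids numerical underflow on longer messages; the transition probabilities are what allow context to influence the tag of an ambiguous token.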
The LID algorithm handles unknown words first by means of a dictionary lookup
for each language involved. If an unknown token is present in a dictionary, four
training samples are added with the corresponding language tag. If it is not found in the
dictionary, one training sample is added. If a token is not found in any dictionary, the
system backs off to a character n-gram language model based on a training corpus for
each language (e.g. Dunning, 1994). Farrugia uses a parallel Maltese-English corpus
of legislative documents for this purpose. Three samples are then added to the SMS
training corpus for the most likely language guess. After biasing the training sample
in this way, the HMM is rebuilt and the input text is tagged with language tags.
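The unknown-word procedure described above can be sketched as follows. This is one possible reading of the description, not Farrugia's code: it assumes that each per-language lookup adds four samples on a hit and one on a miss, and that the three samples for the n-gram guess are added on top of those. The function names, the add-one smoothing, the `alphabet_size` value and the toy word lists are all hypothetical.

```python
import math
from collections import Counter

def train_char_ngrams(words, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    counts, contexts = Counter(), Counter()
    for word in words:
        padded = "^" * (n - 1) + word.lower() + "$"
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            counts[gram] += 1
            contexts[gram[:-1]] += 1
    return counts, contexts

def log_prob(word, model, n=3, alphabet_size=30):
    """Add-one smoothed log-probability of word under a character n-gram model."""
    counts, contexts = model
    padded = "^" * (n - 1) + word.lower() + "$"
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        total += math.log((counts[gram] + 1) / (contexts[gram[:-1]] + alphabet_size))
    return total

def bias_samples(token, dictionaries, ngram_models):
    """Return the (token, language) samples to add for one unknown token."""
    samples = []
    hits = [lang for lang, words in dictionaries.items() if token.lower() in words]
    for lang in dictionaries:
        # Assumed reading: four samples on a dictionary hit, one on a miss.
        samples.extend([(token, lang)] * (4 if lang in hits else 1))
    if not hits:
        # Back off to the character n-gram models; three samples for the best guess.
        guess = max(ngram_models, key=lambda lang: log_prob(token, ngram_models[lang]))
        samples.extend([(token, guess)] * 3)
    return samples

# Toy resources, invented for illustration only.
dictionaries = {"mt": {"kelb"}, "en": {"dog"}}
ngram_models = {
    "mt": train_char_ngrams(["kelb", "qattus", "dar"]),
    "en": train_char_ngrams(["dog", "cat", "the"]),
}

print(Counter(bias_samples("kelb", dictionaries, ngram_models)))
print(Counter(bias_samples("dogs", dictionaries, ngram_models)))
```

The returned samples would be appended to the training corpus before the HMM is re-estimated, which is what biases the tagger towards the preferred language for that token.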
Farrugia’s algorithm is set up to distinguish between Maltese and English tokens.
He reports an average LID accuracy of 95% for all tokens in three different test sets
containing 100 random SMS messages each, obtained via a three-fold cross-validation
experiment. As the language distribution for each of the test sets is not provided, it is
unclear how well the system performs for each language in terms of precision, recall
and F-score and consequently how proficient it is at determining English inclusions.
Therefore, it is difficult to say what improvement this LID system provides over simply
assuming that the input text is monolingual Maltese or English. In fact, Farrugia (2005)
does not clarify at what level code-switching takes place, i.e. if SMS messages are
made up of mostly Maltese text containing embedded English expressions, if language
changes are on the sentence level, or if messages are written entirely in Maltese or
English. Furthermore, it would be interesting to investigate how well the approach of
Farrugia (2005) performs on running text in other domains and what the performance
contribution of each of the system components is. Considering that languages are
constantly evolving and new words enter the vocabulary every day, the dictionary and
character n-gram based approach for dealing with unknown words is relatively static
and may not perform well for languages that are closely related.
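The point about overall accuracy can be made concrete with a small sketch. The token distribution below (95 Maltese tokens, 5 English tokens) is invented purely for illustration and is not taken from Farrugia (2005). Under that assumption, a trivial tagger that labels every token as Maltese also reaches 95% accuracy while finding no English inclusion at all, which is why per-language precision, recall and F-score are needed.

```python
def accuracy(gold, predicted):
    """Proportion of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def per_language_scores(gold, predicted):
    """Return {language: (precision, recall, F-score)} over token tags."""
    scores = {}
    for lang in sorted(set(gold) | set(predicted)):
        tp = sum(g == lang and p == lang for g, p in zip(gold, predicted))
        fp = sum(g != lang and p == lang for g, p in zip(gold, predicted))
        fn = sum(g == lang and p != lang for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        scores[lang] = (precision, recall, f_score)
    return scores

# Hypothetical test set: the 95/5 split is an assumption, not reported data.
gold = ["mt"] * 95 + ["en"] * 5
monolingual_baseline = ["mt"] * 100  # assume every token is Maltese

print(accuracy(gold, monolingual_baseline))  # 0.95
print(per_language_scores(gold, monolingual_baseline))
```

With a more balanced language distribution the same baseline would score far lower, so the reported figure can only be interpreted once the distribution is known.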
2.2.1.4 Lexicon Lookup, Chargrams and Regular Expression Matching: Andersen (2005)
Andersen (2005) notes the importance of recognising anglicisms to lexicographers.
He tests several algorithms based on lexicon lookup, character n-grams and regular