05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh


Chapter 2. Background and Theory

that could have generated the observed token sequence. This is done by means of the Viterbi algorithm (e.g. Rabiner, 1989).
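As an illustration of the decoding step, the following is a minimal sketch of Viterbi decoding for an HMM whose hidden states are language tags. The tag names `mt` and `en`, the toy probabilities and the `unk` floor for unseen tokens are assumptions made for this example; they are not taken from Farrugia (2005).

```python
import math

def viterbi(tokens, states, log_start, log_trans, log_emit, unk=-20.0):
    """Return the most likely sequence of states (language tags) for tokens."""
    if not tokens:
        return []
    # best[i][s]: log-probability of the best path that ends in state s at position i
    best = [{s: log_start[s] + log_emit[s].get(tokens[0], unk) for s in states}]
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s].get(tok, unk)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy model (all numbers are invented for illustration)
states = ["mt", "en"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "mt": {"mt": math.log(0.8), "en": math.log(0.2)},
    "en": {"mt": math.log(0.2), "en": math.log(0.8)},
}
log_emit = {
    "mt": {"il": math.log(0.5), "kelb": math.log(0.5)},
    "en": {"the": math.log(0.5), "dog": math.log(0.5)},
}
tags = viterbi(["il", "kelb", "the", "dog"], states, log_start, log_trans, log_emit)
```

Working in log space avoids numerical underflow on longer messages. With this toy model the decoder assigns `mt` to the first two tokens and `en` to the last two.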

The LID algorithm handles unknown words first by means of a dictionary lookup for each language involved. If an unknown token is present in a dictionary, four training samples are added with the corresponding language tag. If it is not found in the dictionary, one training sample is added. If a token is not found in any dictionary, the system backs off to a character n-gram language model based on a training corpus for each language (e.g. Dunning, 1994). Farrugia uses a parallel Maltese-English corpus of legislative documents for this purpose. Three samples are then added to the SMS training corpus for the most likely language guess. After biasing the training sample in this way, the HMM is rebuilt and the input text is tagged with language tags.
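The biasing procedure described above can be sketched roughly as follows. This is one possible reading, not Farrugia's implementation: it assumes that the four-versus-one weighting is applied per language dictionary, and it replaces the character n-gram language model with a simplified score based on add-one smoothed trigram frequencies rather than the formulation in Dunning (1994). All names (`CharNgramModel`, `bias_samples`) and the toy word lists are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with '_' at the boundaries."""
    padded = "_" * (n - 1) + word.lower() + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class CharNgramModel:
    """Simplified stand-in for a character n-gram language model."""

    def __init__(self, corpus_words, n=3):
        self.n = n
        self.counts = Counter(g for w in corpus_words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen n-grams

    def log_prob(self, word):
        # Sum of add-one smoothed log relative frequencies of the word's n-grams
        return sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in char_ngrams(word, self.n)
        )

def bias_samples(token, dictionaries, models, found=4, not_found=1, guessed=3):
    """Return the (token, language) samples to add for an unknown token."""
    hits = {lang for lang, words in dictionaries.items() if token.lower() in words}
    if hits:
        # Assumed reading: four samples per dictionary hit, one per miss
        samples = []
        for lang in dictionaries:
            samples.extend([(token, lang)] * (found if lang in hits else not_found))
        return samples
    # Token is in no dictionary: back off to the character n-gram models
    best = max(models, key=lambda lang: models[lang].log_prob(token))
    return [(token, best)] * guessed

# Toy resources (invented for illustration)
dictionaries = {"mt": {"kelb", "dar"}, "en": {"dog", "house"}}
models = {
    "mt": CharNgramModel(["kelb", "dar", "qattus", "xemx"]),
    "en": CharNgramModel(["dog", "house", "the", "thing", "weather"]),
}
in_dictionary = bias_samples("dog", dictionaries, models)
backed_off = bias_samples("thinking", dictionaries, models)
```

In this toy setup, `dog` is found in the English dictionary only, while `thinking` is in neither dictionary and is assigned to the language whose trigram statistics fit it best. The returned samples would then be appended to the training corpus before the HMM is rebuilt.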

Farrugia’s algorithm is set up to distinguish between Maltese and English tokens. He reports an average LID accuracy of 95% for all tokens in three different test sets containing 100 random SMS messages each, obtained via a three-fold cross-validation experiment. As the language distribution for each of the test sets is not provided, it is unclear how well the system performs for each language in terms of precision, recall and F-score and consequently how proficient it is at determining English inclusions. Therefore, it is difficult to say what improvement this LID system provides over simply assuming that the input text is monolingual Maltese or English. In fact, Farrugia (2005) does not clarify at what level code-switching takes place, i.e. whether SMS messages are made up of mostly Maltese text containing embedded English expressions, whether language changes occur at the sentence level, or whether messages are written entirely in Maltese or English. Furthermore, it would be interesting to investigate how well the approach of Farrugia (2005) performs on running text in other domains and what the performance contribution of each of the system components is. Considering that languages are constantly evolving and new words enter the vocabulary every day, the dictionary and character n-gram based approach for dealing with unknown words is relatively static and may not perform well for languages that are closely related.
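To make the point about the missing language distribution concrete, the sketch below computes per-language precision, recall and F-score alongside overall token accuracy. The 95/5 split between Maltese and English tokens is invented purely for illustration; Farrugia (2005) does not report the actual distribution, and the function name `per_language_scores` is hypothetical.

```python
def per_language_scores(gold, predicted):
    """Return {language: (precision, recall, f_score)} for token-level tags."""
    scores = {}
    pairs = list(zip(gold, predicted))
    for lang in sorted(set(gold) | set(predicted)):
        tp = sum(g == lang and p == lang for g, p in pairs)
        fp = sum(g != lang and p == lang for g, p in pairs)
        fn = sum(g == lang and p != lang for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[lang] = (precision, recall, f_score)
    return scores

# Hypothetical test set: 95 Maltese tokens and 5 English tokens
gold = ["mt"] * 95 + ["en"] * 5
# Baseline that assumes the whole input is monolingual Maltese
baseline = ["mt"] * 100

accuracy = sum(g == p for g, p in zip(gold, baseline)) / len(gold)
scores = per_language_scores(gold, baseline)
```

Under this assumed distribution the monolingual baseline reaches the same overall accuracy of 95% while identifying no English tokens at all, which is why accuracy alone says little about the detection of English inclusions.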

2.2.1.4 Lexicon Lookup, Chargrams and Regular Expression Matching: Andersen (2005)

Andersen (2005) notes the importance of recognising anglicisms to lexicographers. He tests several algorithms based on lexicon lookup, character n-grams and regular
