05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh


Chapter 2. Background and Theory

by calculating the sum of all absolute rank distances for each n-gram t_i. The language l resulting in the smallest distance D signals the language of the test document.

$$D = \sum_{i=1}^{N} \left| \operatorname{rank}(t_i, \text{text}) - \operatorname{rank}(t_i, l) \right| \qquad (2.1)$$

TextCat provides corpus profiles for 69 languages, each containing 400 n-grams. If an n-gram does not occur in the profile, it is given a maximum distance score of 400. Cavnar and Trenkle (1994) report an accuracy of 99.8% for input strings of at least 300 bytes and an accuracy of 98.3% for strings of fewer than 300 bytes using their LID approach. No information is given regarding its performance on very short strings. In the experiments presented in Sections 3.4.1 and 3.4.3.1, the collection of corpus profiles was limited to the required languages, namely German and English.

Most automatic LID systems are successful in identifying the base language of a document but are not designed to deal with mixed-lingual text, that is, to identify the origin of individual foreign words and proper names within a given sentence. Initial work on this issue has been carried out by Font-Llitjós and Black (2001), who use Cavnar’s n-gram statistics approach to estimate the origin of unseen proper names as a means of improving the pronunciation accuracy for TTS synthesis. While the LID performance (for 26 languages) is not evaluated separately, the pronunciation accuracy of proper names increases by 7.6% from the baseline of 54.08% when adding language origin probabilities as a feature to the CART-based decision tree model. LID has also been applied in the field of name transliteration. Qu and Grefenstette (2004) used LID as a way of determining language-specific transliteration methods for Japanese, Chinese and English named entities (NEs) written in Latin script. Their language identifier is based on tri-gram frequencies, whereby the language of a word is that for which the sum of the normalised tri-gram frequencies is highest. On the training data, LID yields accuracies of 92% for Japanese, 87% for Chinese and 70% for English names. A further study is that of Lewis et al. (2004), who implemented a character n-gram classifier to differentiate between the 10,000 most common English words, 3,143 unique transliterated Arabic and 20,577 Russian names. Each of the three data sets is divided into 80% training and 20% test data, a process which is repeated 4 times, resulting in overlapping training and test sets. Average precision values amount to 81.1% for Russian names,
