PhD thesis - School of Informatics - University of Edinburgh
PhD thesis - School of Informatics - University of Edinburgh
PhD thesis - School of Informatics - University of Edinburgh
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Chapter 2. Background and Theory 33<br />
by calculating the sum <strong>of</strong> all absolute rank distances for each n-gram ti. The language<br />
l resulting in the smallest distance D signals the language <strong>of</strong> the test document.<br />
D =<br />
N<br />
∑<br />
i=1<br />
|rank(ti,text) − rank(ti,l)| (2.1)<br />
TextCat provides corpus pr<strong>of</strong>iles for 69 languages each containing 400 n-grams per<br />
language. If an n-gram does not occur in the pr<strong>of</strong>ile, it is given a maximum distance<br />
score <strong>of</strong> 400. Cavnar and Trenkle (1994) report an accuracy <strong>of</strong> 99.8% for input strings<br />
<strong>of</strong> at least 300 bytes and an accuracy <strong>of</strong> 98.3% for strings <strong>of</strong> less that 300 bytes using<br />
their LID approach. No information is given regarding its performance for really small<br />
strings. In the experiments presented in Sections 3.4.1 and 3.4.3.1, the collection <strong>of</strong><br />
corpus pr<strong>of</strong>iles was limited to the required languages, namely German and English.<br />
Most automatic LID systems are successful in identifying the base language <strong>of</strong> a<br />
document but are not designed to deal with mixed-lingual text to identify the origin<br />
<strong>of</strong> individual foreign words and proper names within a given sentence. Initial work<br />
on this issue has been carried out by Font-Llitjós and Black (2001) who use Cavnar’s<br />
n-gram statistics approach to estimate the origin <strong>of</strong> unseen proper names as a means <strong>of</strong><br />
improving the pronunciation accuracy for TTS syn<strong>thesis</strong>. While the LID performance<br />
(for 26 languages) is not evaluated separately, the pronunciation accuracy <strong>of</strong> proper<br />
names increases by 7.6% from the baseline <strong>of</strong> 54.08% when adding language origin<br />
probabilities as a feature to the CART-based decision tree model. LID has also been<br />
applied in the field <strong>of</strong> name transliteration. Qu and Grefenstette (2004) used LID as<br />
a way <strong>of</strong> determining language-specific transliteration methods for Japanese, Chinese<br />
and English named entities (NEs) written in Latin script. Their language identifier is<br />
based on tri-gram frequencies, whereby the language <strong>of</strong> a word is that for which the<br />
sum <strong>of</strong> the normalised tri-grams is highest. On the training data, LID yields accura-<br />
cies <strong>of</strong> 92% for Japanese, 87% for Chinese and 70% for English names. A further<br />
study is that <strong>of</strong> Lewis et al. (2004) who implemented a character n-gram classifier to<br />
differentiate between the 10,000 most common English words, 3,143 unique transliter-<br />
ated Arabic and 20,577 Russian names. Each <strong>of</strong> the three data sets is divided into 80%<br />
training and 20% test data, a process which is repeated 4 times resulting in overlapping<br />
training and test sets. Average precision values amount to 81.1% for Russian names,