PhD thesis - School of Informatics - University of Edinburgh
Chapter 2. Background and Theory
92.0% for Arabic names and 98.9% for English common words. When combining
LID and language-specific letter-to-sound rule decision trees, the precision of phones
in the system transcription that match hand-transcribed phones amounts to 89.2% compared
to a baseline precision of 80.1% for a system that is simply trained on the CMU
lexicon, a pronunciation lexicon for American English.
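As a rough illustration of the phone precision figure reported above, the following Python sketch computes the share of system phones that align with identical phones in a hand transcription. The alignment via `difflib.SequenceMatcher`, the function name `phone_precision` and the ARPAbet-style example phones are assumptions made for illustration; the source does not specify how matches were determined.

```python
from difflib import SequenceMatcher


def phone_precision(system, reference):
    """Return the fraction of system phones matched in the reference.

    Assumption: a phone counts as correct if it falls inside a matching
    block of the sequence alignment computed by difflib.
    """
    if not system:
        return 0.0
    matcher = SequenceMatcher(a=system, b=reference, autojunk=False)
    # The final block returned is a zero-size sentinel, so it adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(system)


# Hypothetical transcriptions, not taken from the cited work.
system_phones = ["K", "AE", "T", "S"]
hand_phones = ["K", "AH", "T", "S"]
print(phone_precision(system_phones, hand_phones))
```

Dividing by the length of the system transcription rather than the reference is what makes this a precision and not a recall measure.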
The latter work largely depends on the languages involved having distinct statistical
characteristics. English, Russian and Arabic are very different languages. Although
English and Russian are both Indo-European languages, they belong to different language
groups, namely Germanic and Slavic, respectively. Arabic, on the other hand, is a
member of the Afro-Asiatic language family. Therefore, LID for NEs of more closely
related languages is anticipated to be a more challenging task. Another interesting
point made by Lewis et al. (2004) is that unseen foreign words in English documents
are generally proper names. While this is largely true for English, the same cannot be
said for German text, for example, in which increasing numbers of anglicisms have
been recorded, particularly in the last 50 years. This can mainly be attributed to
technological advances, in particular the invention of the computer and the internet, as
well as to political events such as the creation and enlargement of the EU. As a result,
German documents frequently contain English inclusions, not only NEs but also many
other content words. The influx of anglicisms into German and other languages was
examined in detail in Sections 2.1.2 to 2.1.4.
It is evident that LID information would be beneficial to multilingual TTS synthesis
and other NLP applications that need to handle foreign names. However, with the
increasing influence that English has on other languages, state-of-the-art systems must
also be able to deal with other types of foreign inclusions, such as English computer
terminology, expressions from the business world or advertising slogans, that are
encountered in texts written in other languages. Moreover, such language mixing happens
not only at the word level, i.e. a German sentence containing some foreign words, but
also at the morpheme level, when a word contains morphemes from different languages.
First efforts that address mixed-lingual LID are discussed in the following section.