05.03.2013 Views

PhD thesis - School of Informatics - University of Edinburgh



Chapter 2. Background and Theory

92.0% for Arabic names and 98.9% for English common words. When combining LID and language-specific letter-to-sound rule decision trees, the precision of phones in the system transcription that match hand-transcribed phones amounts to 89.2%, compared to a baseline precision of 80.1% for a system that is simply trained on the CMU lexicon, a pronunciation lexicon for American English.

The latter work largely depends on the distinct statistical characteristics of the languages involved. English, Russian and Arabic are very different languages. Although English and Russian are both Indo-European languages, they belong to different language groups, Germanic and Slavic, respectively. Arabic, on the other hand, is a member of the Afro-Asiatic language family. Therefore, LID for NEs of more closely related languages is anticipated to be a more challenging task. Another interesting point made by Lewis et al. (2004) is that unseen foreign words in English documents are generally proper names. While this is largely true for English, the same cannot be said for German text, for example, in which increasing numbers of anglicisms have been recorded, particularly in the last 50 years. This can be mainly attributed to technological advances, in particular the invention of the computer and the internet, as well as political events such as the creation and enlargement of the EU. As a result, German documents frequently contain English inclusions, not only NEs but also many other content words. The influx of anglicisms into German and other languages was examined in detail in Sections 2.1.2 to 2.1.4.

It is evident that LID information would be beneficial to multilingual TTS synthesis and other NLP applications that need to handle foreign names. However, with the increasing influence that English has on other languages, state-of-the-art systems must also be able to deal with other types of foreign inclusions, such as English computer terminology, expressions from the business world or advertising slogans, that are encountered in texts written in other languages. Moreover, such language mixing does not only happen on the word level, e.g. in a German sentence containing some foreign words. It also occurs on the morpheme level, when a word contains morphemes from different languages. First efforts that address mixed-lingual LID are discussed in the following section.
