13.07.2015 Views

A corpus-based approach for thai romanization.pdf - NAiST


Introduction Previous Works Methodology Experiment Result Discussion


What is Romanization?
Romanization is a way to write text using the Roman alphabet.
Thai Romanization uses Roman characters to represent Thai words.
E.g.
ชลบุรี can be romanized to Chonburi
ปทุมธานี can be romanized to Pathumthani


Two common systems for Thai Romanization:
- Orthographic Romanization (transliteration)
- Pronunciation-based Romanization
Problem: The rules for Thai Romanization are so complicated that a single Thai character can be rendered as more than one Roman character sequence, depending on its function and position in the word.


Solution: Statistics collected from a corpus of Thai-English Romanization pairs are applied. These statistics provide the information needed to select suitable candidates for transcribing Thai words.


Thai Romanization methods
- Rule-based
- Dictionary-based
- Statistical
- Combination of rule-based and statistical
Our research proposes a corpus-based approach for automatic Romanization of Thai words.


- Improve the correctness of Thai geographical name spelling
- Help prevent misunderstanding of Thai geographical names


Map a Thai word to its corresponding pronunciation before translating the result into a romanized word (Thai word → pronunciation)
- Syllabification is used to segment the word into syllables
- Syllable-based tri-grams are used to select the most suitable pronunciation
- The pronunciation is translated into Roman characters


1. TCC segmentation
A Thai word is segmented using the concept of the Thai character cluster (TCC). A TCC is an unambiguous group of Thai characters defined by a set of rules. The rules used in TCC are simply computerized versions of the Royal Institute's rules. A TCC represents an inseparable unit of Thai characters used in composing words.
e.g.
ชลบุรี → ช | ล | บุ | รี
พระนคร → พ | ระ | น | ค | ร
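The slide does not list the TCC rules themselves, so the following is only a rough sketch of the idea, not the Royal Institute rule set. It assumes a much simpler clustering rule (stated in the comments), and the function name `rough_tcc_segment` is hypothetical. Under that assumption it reproduces the two examples above.

```python
# Simplified, ASSUMED clustering rule (not the full TCC rule set):
#  - a leading vowel (เ แ โ ใ ไ) opens a new cluster and absorbs the next consonant
#  - any other consonant opens a new cluster
#  - every other sign (vowels written above, below or after; tone marks)
#    attaches to the current cluster

LEADING_VOWELS = {chr(c) for c in range(0x0E40, 0x0E45)}  # เ แ โ ใ ไ
CONSONANTS = {chr(c) for c in range(0x0E01, 0x0E2F)}      # ก .. ฮ


def rough_tcc_segment(word: str) -> list[str]:
    """Split a Thai word into approximate character clusters."""
    clusters: list[str] = []
    for ch in word:
        if ch in LEADING_VOWELS:
            clusters.append(ch)
        elif ch in CONSONANTS:
            # A cluster made only of leading vowels is still waiting for its consonant.
            if clusters and all(c in LEADING_VOWELS for c in clusters[-1]):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        elif clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


print(" | ".join(rough_tcc_segment("ชลบุรี")))
print(" | ".join(rough_tcc_segment("พระนคร")))
```

This simplification does not cover clusters in which a vowel sign requires a following consonant (for example สัน), which a complete rule set would keep together.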


2. ECC segmentation
The English Character Cluster (ECC) is a concept similar to the TCC. It is a group of Roman characters suitable for mapping to a TCC. We use '?' in an ECC to indicate a TCC that is not pronounced, such as a TCC with the voice cancellation character (Karan character).
e.g.
Chonburi → Cho | n | bu | ri
Phranakhon → Ph | ra | na | kho | n


3. TCC-ECC alignment
e.g.
cho|n|bu|ri
ช|ล|บุ|รี
4. Store the TCC-ECC alignment in the corpus
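Steps 3 and 4 can be sketched as pairing clusters position by position and counting the resulting pairs. This is a minimal illustration that assumes both sides are already segmented with '|' and have the same number of clusters. The names `align` and `corpus_counts` are hypothetical, and a `Counter` merely stands in for whatever corpus storage the authors used.

```python
from collections import Counter


def align(tcc_segmented: str, ecc_segmented: str) -> list[tuple[str, str]]:
    """Pair each TCC with the ECC at the same position."""
    tccs = tcc_segmented.split("|")
    eccs = ecc_segmented.split("|")
    if len(tccs) != len(eccs):
        raise ValueError("TCC and ECC sequences must have the same length")
    return list(zip(tccs, eccs))


# The "corpus" here is just a frequency table of (TCC, ECC) pairs.
corpus_counts: Counter = Counter()
for thai, roman in [
    ("ช|ล|บุ|รี", "cho|n|bu|ri"),
    ("พ|ระ|น|ค|ร", "ph|ra|na|kho|n"),
]:
    corpus_counts.update(align(thai, roman))

for (tcc, ecc), n in corpus_counts.items():
    print(tcc, ecc, n)
```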


[Diagram: corpus construction for the pair Chonburi / ชลบุรี]
- ECC: segment the group of Roman characters that corresponds to the Thai word → cho|n|bu|ri
- TCC: segment the Thai word → ช|ล|บุ|รี
- Alignment: map TCCs to ECCs (cho|n|bu|ri ↔ ช|ล|บุ|รี)
- Corpus: alignments are kept in the corpus


The Statistical Model
(equation not recoverable from the source)
where e_i is a romanized character, t_{i-1}, t_i, t_{i+1} are TCC units, and the five coefficients indexed 1, …, 5 are weighting factors.
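The exact terms of the model are not preserved here, so the sketch below is not the authors' formula. It only illustrates the general idea of scoring a candidate ECC by a weighted sum of relative frequencies conditioned on the neighbouring TCCs. The four context types, the weights, the boundary marker and all function names are assumptions. In particular, the slide mentions five weighting factors, while this sketch uses four terms.

```python
from collections import Counter, defaultdict

BOUNDARY = "#"  # assumed marker for "no neighbour" at word edges


def contexts(tccs, i):
    """Four assumed context keys for position i, from most to least specific."""
    prev_t = tccs[i - 1] if i > 0 else BOUNDARY
    next_t = tccs[i + 1] if i + 1 < len(tccs) else BOUNDARY
    cur = tccs[i]
    return [
        ("tri", prev_t, cur, next_t),
        ("left", prev_t, cur),
        ("right", cur, next_t),
        ("uni", cur),
    ]


def train(aligned_words):
    """Count how often each ECC occurs in each context."""
    counts = defaultdict(Counter)
    for tccs, eccs in aligned_words:
        for i, ecc in enumerate(eccs):
            for ctx in contexts(tccs, i):
                counts[ctx][ecc] += 1
    return counts


def score(counts, tccs, i, ecc, weights):
    """Weighted sum of relative frequencies of `ecc` over the contexts."""
    total = 0.0
    for w, ctx in zip(weights, contexts(tccs, i)):
        seen = counts.get(ctx)
        if seen:
            total += w * seen[ecc] / sum(seen.values())
    return total


def best_ecc(counts, tccs, i, weights):
    """Pick the highest-scoring ECC among those seen with this TCC, else None."""
    candidates = counts.get(("uni", tccs[i]), {})
    return max(candidates, key=lambda e: score(counts, tccs, i, e, weights), default=None)


data = [
    ("ช|ล|บุ|รี".split("|"), "cho|n|bu|ri".split("|")),
    ("พ|ระ|น|ค|ร".split("|"), "ph|ra|na|kho|n".split("|")),
]
model = train(data)
weights = (0.4, 0.3, 0.2, 0.1)  # hypothetical values
word = "พ|ระ|น|ค|ร".split("|")
print([best_ecc(model, word, i, weights) for i in range(len(word))])
```

With only two training words every context is unambiguous, so the example shows the mechanics rather than any realistic disambiguation.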


Experiment Methods
- Close Test: the whole corpus is used for both training and testing
- Open Test: five-fold cross-validation (1st Test, 2nd Test, 3rd Test, 4th Test, 5th Test)
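The open test can be sketched as follows. This is a minimal illustration of five-fold cross-validation in general. The slide does not say how the corpus was divided into folds, so the round-robin assignment and the name `five_fold_splits` are assumptions.

```python
def five_fold_splits(items, k=5):
    """Yield (train, test) pairs; each item is held out exactly once."""
    # Round-robin assignment of items to folds (an assumed choice).
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, test


entries = list(range(10))  # stand-in for corpus entries
for n, (train_part, test_part) in enumerate(five_fold_splits(entries), start=1):
    print(n, len(train_part), len(test_part))
```

In the close test, by contrast, the same list would be passed as both the training and the testing set.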


Evaluation Type
- ECC level: shows the basic performance of the system and is used to check for its limitations and shortcomings
- Word level: shows the performance of the system when it is used in a real situation
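The two evaluation levels can be illustrated with a small sketch. It assumes that ECC-level accuracy is the fraction of clusters matching position by position, and that word-level accuracy is the fraction of words whose whole ECC sequence matches. The slide does not give the exact definitions, and the function names are hypothetical.

```python
def ecc_accuracy(actual, generated):
    """Fraction of ECC units that match, compared position by position."""
    correct = total = 0
    for a, g in zip(actual, generated):
        a_units, g_units = a.split("|"), g.split("|")
        total += len(a_units)
        correct += sum(x == y for x, y in zip(a_units, g_units))
    return correct / total


def word_accuracy(actual, generated):
    """Fraction of words whose full ECC sequence matches."""
    return sum(a == g for a, g in zip(actual, generated)) / len(actual)


# One word with a single wrong cluster, one assumed-correct word.
actual = ["SAN|?|SA|I", "CHO|N|BU|RI"]
generated = ["SAN|?|A|I", "CHO|N|BU|RI"]
print(ecc_accuracy(actual, generated))   # 7 of 8 clusters match
print(word_accuracy(actual, generated))  # 1 of 2 words match
```

A single wrong cluster costs one unit at the ECC level but the whole word at the word level, which is why word-level figures are the lower of the two.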


Evaluation Type    Test Type
                   Open Test    Close Test
ECC Level          93.4%        99.6%
Word Level         74.5%        98.1%


Error Analysis
TCCs    Actual ECCs    Generated ECCs
สัน|ท|รา|ย    SAN|?|SA|I    SAN|?|A|I
หัว|ไท|ร    HUA|SAI|?    HUA|SAI|I
ไท|ร|น้|อ|ย    SAI|?|N|O|I    SA|?|N|O|I
ลา|ด|ห|ลุ|ม|แก้|ว    LA|T|?|LU|M|KA|EO    LA|T|?|LU|M|KA|O
ห|น|อ|ง|แค    ?|N|O|NG|KHAE    ?|N|O|NG|KAE
เฉ|ลิ|ม|พ|ระ|เกีย|ร|ติ    CHA|LOE|M|PH|RA|KIA|?|T    CHA|LER|M|PH|RA|KIA|?|T
ท่า|ให|ม่    THA|MA|I    THA|MA|MAI
บ|ร|บื|อ    BO|RA|BUE|?    B|RA|BUE|?
ส|ว|ร|ร|ค|โล|ก    SA|W|A|N|KHA|LO|K    S|W|A|N|KHA|LO|K


Sources of Errors
- A single-character TCC has multiple pronunciations.
  For example, the character “น” in “พระนคร” should be pronounced as ‘na’, but it becomes ‘n’ since that has a higher probability.
- Some consonants can be pronounced independently or pronounced together with their succeeding consonants.
  For example, the string “สระ” is pronounced as ‘sa ra’ in the word “สระบุรี” but as ‘sa’ in the word “สระแก้ว”.


Sources of Errors
- One character has two roles in pronunciation.
  For example, “ศัตรู” is pronounced as /sat tru/. The character “ต” acts as the final consonant for “ศัต” and also as the initial consonant for “ตรู”.


To solve the errors
- More accurate statistics
- Wider context
- Improved statistical model
