Representing Myanmar in Unicode - Evertype
The problem with inserting Word Joiners is that it makes searching for polysyllabic words much harder,
since the search engine must recognise the Word Joiner characters and ignore them. This is
unlikely to happen. It is therefore advisable to avoid Word Joiner characters wherever possible.
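
The search problem can be illustrated with a minimal sketch. It assumes a search routine that is free to preprocess both the text and the query; the function names are hypothetical, and the Latin strings are placeholders standing in for Myanmar syllables. The only fixed fact used is that Word Joiner is U+2060.

```python
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER


def strip_word_joiners(text: str) -> str:
    """Return text with every Word Joiner removed."""
    return text.replace(WORD_JOINER, "")


def contains_ignoring_wj(haystack: str, needle: str) -> bool:
    """Substring search that ignores Word Joiners on both sides (hypothetical helper)."""
    return strip_word_joiners(needle) in strip_word_joiners(haystack)


# Placeholder data: "ka" + WJ + "la" stands in for a two-syllable word.
stored = "sa ka\u2060la ta"
plain_hit = "kala" in stored                    # naive search misses the word
tolerant_hit = contains_ignoring_wj(stored, "kala")
```

A naive substring search fails here because the stored word contains an invisible character that the query does not, which is exactly the burden the text describes.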
Dictionary Based Line Breaking
The next level of sophistication builds on the previous one by adding the ability for the line breaker to identify
polysyllabic words. Such words are usually held in a dictionary. Thankfully, such a dictionary need only
contain polysyllabic words, which are far fewer than a complete wordlist for a language. The main weakness
of this approach arises when new words are used that are not in the dictionary. For these, one may need to
fall back to the ZWSP or WJ approaches. A further difficulty is that users are not generally aware
of the contents of such dictionaries and so cannot predict when they will have difficulties and when not.
Notice that at each level of sophistication, the line breaking approach must be able to handle
data that has been generated for a less sophisticated line breaking approach, and to handle it appropriately.
For example, if a text contains ZWSP characters, they should be honoured.
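
A minimal sketch of dictionary based breaking follows. It assumes the text has already been split at ZWSP characters into runs and that each run has been segmented into syllables by some other component; the greedy longest-match strategy, the function name, and the `max_len` parameter are illustrative assumptions, not something the text prescribes. Latin strings are placeholders for Myanmar syllables.

```python
def unbreakable_units(runs, dictionary, max_len=4):
    """Group syllables into units; a line break is allowed between units.

    runs       -- list of syllable lists; runs were separated by ZWSP in the
                  source text, so no unit may span two runs (ZWSP is honoured)
    dictionary -- set of tuples of syllables, polysyllabic words only
    max_len    -- longest dictionary word, in syllables (assumed bound)
    """
    units = []
    for run in runs:
        i = 0
        while i < len(run):
            length = 1  # default: a single syllable is its own unit
            # Try the longest candidate first (greedy longest match).
            for k in range(min(max_len, len(run) - i), 1, -1):
                if tuple(run[i:i + k]) in dictionary:
                    length = k
                    break
            units.append(tuple(run[i:i + length]))
            i += length
    return units


# Placeholder dictionary and text.
words = {("ka", "la"), ("ma", "ni", "pu")}
units = unbreakable_units([["sa", "ka", "la", "ta"], ["ma", "ni", "pu"]], words)
split_by_zwsp = unbreakable_units([["ka"], ["la"]], words)
```

In the second call a ZWSP sits inside what would otherwise be a dictionary word, so the two syllables stay separate units and the break the author asked for is kept.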
Sorting
Sorting Myanmar strings is a complex process involving significant string transformation and four levels of
comparison. The string transformation is a syllable based operation, for which the identification of syllable
boundaries (but not word boundaries) is required. The same techniques that are used for line-breaking
may therefore be used for sorting.
The basic principle used in sorting most languages written in the Myanmar script is to treat a syllable as
consisting of one or more of the following components, in order:
Consonant Medials Vowels Finals Tone
There are two primary approaches to sorting. The thinbongyi approach is the current national standard and
reorders the components so that the Finals occur before the Vowels:
Consonant Medials Finals Vowels Tone
The Pali sort uses a different reordering:
Consonant Medials Vowels Tone Finals
Sorting then proceeds simply, taking each component as having a primary sort relationship to the other
components. It should be noted that where there is more than one medial character, they may interact to
produce a single sort key. This is also true for sequences of vowels.
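
The two reorderings can be sketched as sort keys. This is a simplified illustration under stated assumptions: the syllable has already been split into its components, each component is compared as a plain string (a real implementation would map components, including combined medials or vowel sequences, to collation weights), and the `Syllable` type and key functions are hypothetical names. Latin letters are placeholders for Myanmar characters.

```python
from typing import NamedTuple


class Syllable(NamedTuple):
    """Components in stored order; an empty string means 'absent'."""
    consonant: str
    medials: str = ""
    vowels: str = ""
    finals: str = ""
    tone: str = ""


def thinbongyi_key(s: Syllable):
    # Consonant Medials Finals Vowels Tone
    return (s.consonant, s.medials, s.finals, s.vowels, s.tone)


def pali_key(s: Syllable):
    # Consonant Medials Vowels Tone Finals
    return (s.consonant, s.medials, s.vowels, s.tone, s.finals)


closed = Syllable("k", vowels="a", finals="n")  # placeholder "kan"
open_ = Syllable("k", vowels="i")               # placeholder "ki"

by_thinbongyi = sorted([closed, open_], key=thinbongyi_key)
by_pali = sorted([closed, open_], key=pali_key)
```

With these placeholders the two orders disagree: under the thinbongyi key the syllable without a final sorts first because Finals are compared before Vowels, while under the Pali key the vowel decides first.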
Contractions
The Myanmar language has a system of double acting consonants, where a consonant acts as both the final
of one syllable and the initial of the following syllable. These are significant for sorting purposes. Double acting
consonants are rare, but occur in two common words.
ယောက်ျား    101A 1031 102C 1000 103A 103B 102C 1038    man, husband
ကျွန်ုပ်    1000 103B 103D 1014 103A 102F 1015 103A    I (1st person singular)
This storage approach also affects syllable breaking, since a devowelised consonant with a vowel acts like a
normal base consonant, with a syllable break preceding it.
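
The stored sequences above can be inspected with a short sketch. The detection rule used here, an asat (U+103A) immediately followed by a medial (U+103B to U+103E) or a vowel sign (U+102B to U+1032), is an assumption inferred from the two example words only and is not a complete rule for the script; the function names are hypothetical.

```python
ASAT = "\u103a"                                      # MYANMAR SIGN ASAT
MEDIALS = {chr(c) for c in range(0x103B, 0x103F)}    # U+103B..U+103E
VOWELS = {chr(c) for c in range(0x102B, 0x1033)}     # U+102B..U+1032


def from_hex(codes: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(h, 16)) for h in codes.split())


def double_acting_positions(word: str):
    """Indices of consonants whose asat is directly followed by a medial or vowel.

    Assumed heuristic: such a consonant closes one syllable and also
    opens the next, so a syllable break precedes it.
    """
    followers = MEDIALS | VOWELS
    return [i - 1
            for i in range(1, len(word) - 1)
            if word[i] == ASAT and word[i + 1] in followers]


man = from_hex("101A 1031 102C 1000 103A 103B 102C 1038")
first_person = from_hex("1000 103B 103D 1014 103A 102F 1015 103A")

man_hits = double_acting_positions(man)
first_person_hits = double_acting_positions(first_person)
```

In both words the consonant found is the one at index 3 (U+1000 and U+1014 respectively); the final asat of the second word is not reported because nothing follows it.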
There are also words with double acting consonants that are unmarked. Since these are unmarked, it has
been decided that, despite their etymology, these words should be sorted as if there were no double acting
consonant.
Representing Myanmar in Unicode Page 10 of 37 Version: 433