26.08.2013 Views

Representing Myanmar in Unicode - Evertype


The problem with inserting Word Joiners is that it makes searching for polysyllabic words much harder, since the search engine must be able to recognise the Word Joiner characters and ignore them. This is unlikely to happen. Therefore it is advisable not to use Word Joiner characters if at all possible.
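The search problem can be seen in a minimal Python sketch. The function names here are hypothetical, and the sketch only shows what a search routine would have to do; it does not claim that any particular search engine does it.

```python
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER

def strip_word_joiners(text: str) -> str:
    """Remove Word Joiner characters so they do not affect matching."""
    return text.replace(WORD_JOINER, "")

def contains_ignoring_wj(haystack: str, needle: str) -> bool:
    """Substring search that ignores Word Joiners on both sides."""
    return strip_word_joiners(needle) in strip_word_joiners(haystack)

# A two-syllable word stored with a Word Joiner between its syllables.
stored = "\u1019\u103c\u1014\u103a" + WORD_JOINER + "\u1019\u102c"
# The same word as a user would normally type it in a search box.
query = "\u1019\u103c\u1014\u103a\u1019\u102c"

naive_hit = query in stored                       # plain search misses the word
tolerant_hit = contains_ignoring_wj(stored, query)
```

A plain substring search fails on the stored text, which is the difficulty described above; only a search that explicitly filters out U+2060 finds the word.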

Dictionary Based Line Breaking

The next level of sophistication builds on the previous one by adding the ability for the line breaker to identify polysyllabic words. Such words are usually held in a dictionary. Thankfully, such a dictionary need only contain polysyllabic words, which are far fewer than the entries in a complete wordlist for a language. The main weakness of this approach arises when new words are used that are not in the dictionary. For these, one may need to fall back on the ZWSP or WJ approaches. A further complication is that users are not generally aware of the contents of such dictionaries and so cannot predict when they will have difficulties and when not.
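One way to picture dictionary based line breaking is as a greedy longest match over text that has already been split into syllables. The Python sketch below is an illustration under assumptions: the function name, the representation of the dictionary as a set of syllable tuples, and the greedy strategy are all choices made for this example, not something the approach prescribes.

```python
def group_syllables(syllables, dictionary):
    """Merge runs of syllables that form a known polysyllabic word.

    `syllables` is a list of syllable strings (syllable breaking is assumed
    to have been done already). `dictionary` is a set of tuples of syllables.
    Line breaks would then be allowed only between the returned units.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    units = []
    i = 0
    while i < len(syllables):
        # Try the longest possible match first, down to two syllables.
        for n in range(min(max_len, len(syllables) - i), 1, -1):
            candidate = tuple(syllables[i:i + n])
            if candidate in dictionary:
                units.append("".join(candidate))
                i += n
                break
        else:
            # No dictionary word starts here: the syllable stands alone.
            units.append(syllables[i])
            i += 1
    return units

# Illustrative data: one two-syllable word followed by a single syllable.
myan, ma, sa = "မြန်", "မာ", "စာ"
units = group_syllables([myan, ma, sa], {(myan, ma)})
```

A word missing from the dictionary is simply left as separate syllables, so a break could fall inside it. That is the weakness noted above, and the point at which ZWSP or WJ would be needed.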

Notice that at each level of sophistication, the line breaking approach must be able to handle, appropriately, data that has been generated for a less sophisticated line breaking approach. For example, if a text contains ZWSP characters, they should be honoured.

Sorting

Sorting Myanmar strings is a complex process involving significant string transformation and four levels of comparison. The string transformation is a syllable based operation for which the identification of syllable boundaries (but not word boundaries) is required. The same techniques that are used for line-breaking, therefore, may be used for sorting.

The basic principle used in sorting most languages written in the Myanmar script is to treat a syllable as consisting of one or more of the following components, in order:

Consonant Medials Vowels Finals Tone

There are two primary approaches to sorting. The thinbongyi approach is the current national standard and reorders the components so that the Finals occur before the Vowels:

Consonant Medials Finals Vowels Tone

The Pali sort uses a different reordering:

Consonant Medials Vowels Tone Finals

Then sorting proceeds simply, taking each component as having a primary sort relationship to the other components. It should be noted that where there is more than one medial character, they may interact to produce a single sort key. This is also true for sequences of vowels.
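The two reorderings can be sketched in Python as follows. This is a simplified illustration only: the `Syllable` type, the `ORDERS` table and `sort_key` are hypothetical names, syllables are assumed to be already split into components, and raw string comparison stands in for the real multi-level comparison, which is not reproduced here.

```python
from typing import NamedTuple

class Syllable(NamedTuple):
    """Components of one syllable, in their original order."""
    consonant: str
    medials: str   # all medials kept together so they form a single key
    vowels: str    # likewise for a sequence of vowels
    finals: str
    tone: str

# Component order used to build the key for each sorting approach.
ORDERS = {
    "thinbongyi": ("consonant", "medials", "finals", "vowels", "tone"),
    "pali": ("consonant", "medials", "vowels", "tone", "finals"),
}

def sort_key(word, scheme):
    """Build a comparable key for a word given as a list of Syllables."""
    order = ORDERS[scheme]
    return tuple(tuple(getattr(s, field) for field in order) for s in word)

# Abstract placeholder components, chosen so that the two schemes disagree.
w1 = [Syllable("k", "", "a", "n", "")]
w2 = [Syllable("k", "", "b", "m", "")]

by_thinbongyi = sorted([w1, w2], key=lambda w: sort_key(w, "thinbongyi"))
by_pali = sorted([w1, w2], key=lambda w: sort_key(w, "pali"))
```

With these placeholder values the thinbongyi key compares the finals first and puts `w2` ahead of `w1`, while the Pali key compares the vowels first and keeps `w1` ahead of `w2`.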

Contractions

The Myanmar language has a system of double acting consonants, where a consonant acts as both the final of a syllable and the initial of a following syllable. These are significant for sorting purposes. Double acting consonants are rare, but occur in two common words.

ယောက်ျား    101A 1031 102C 1000 103A 103B 102C 1038    man, husband

ကျွန်ုပ်    1000 103B 103D 1014 103A 102F 1015 103A    I (1st person singular)
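The two words can be rebuilt from the code point sequences listed above with a short Python helper. The function name is hypothetical; the sketch only converts hexadecimal code points to characters, which makes it easy to inspect where the devowelising sign U+103A falls in each word.

```python
def from_code_points(hex_sequence: str) -> str:
    """Turn a space-separated list of hexadecimal code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in hex_sequence.split())

man = from_code_points("101A 1031 102C 1000 103A 103B 102C 1038")
first_person = from_code_points("1000 103B 103D 1014 103A 102F 1015 103A")

# In both words U+103A is followed directly by a medial or vowel sign
# rather than by a new base consonant.
after_asat_man = man[man.index("\u103a") + 1]
after_asat_first = first_person[first_person.index("\u103a") + 1]
```

In the first word the character after U+103A is the medial U+103B, and in the second it is the vowel sign U+102F, so in each case the devowelised consonant also carries the start of the next syllable.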

This storage approach also affects syllable breaking, since a devowelised consonant followed by a vowel acts like a normal base consonant, with a syllable break preceding it.

There are also words with double acting consonants which are unmarked. Since these are unmarked, it has been decided that despite their etymology, these words should be sorted as if there were no double acting consonant.

Representing Myanmar in Unicode Page 10 of 37 Version: 433
