26.08.2013 Views

Representing Myanmar in Unicode - Evertype


Advanced Issues

So far we have covered what is explained in the Unicode Standard [3]. In this section we examine some of the more difficult areas of the Myanmar language, including some implementation details regarding line breaking, sorting and rendering; further examination of the kinzi question; contractions; and some issues with respect to Old Myanmar.

Line breaking

Burmese does not have interword spaces like English. Instead, spaces are used to mark phrases. Some phrases are relatively short (two or three syllables, 1.5em, or 2.3 times the width of U+1000 က) while others can be quite long (8.5em, or 13 times the width of U+1000 က). A common approach to addressing line breaking issues is to adjust the phrase spacing so that a line breaks at a phrase break. If line breaking is required within a phrase, there are a number of possible approaches. What is presented here is a sliding scale of quality of line breaking approaches, starting with the simplest.
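The phrase-spacing approach can be sketched as a greedy line filler. This is only an illustration, not a description of any particular layout engine: the function name `layout_phrases`, the `min_gap` parameter and the phrase widths in em are all assumptions made for the example.

```python
def layout_phrases(widths, line_width, min_gap=0.5):
    """Greedily pack phrases into lines, then stretch the gaps to fill each line.

    widths: phrase widths in em (hypothetical measurements).
    Returns a list of (phrase_indices, gap_in_em), one entry per line.
    """
    lines, current, used = [], [], 0.0
    for i, width in enumerate(widths):
        needed = width if not current else used + min_gap + width
        if current and needed > line_width:
            # The phrase does not fit: end the line at this phrase break.
            lines.append(current)
            current, used = [i], width
        else:
            current.append(i)
            used = needed
    if current:
        lines.append(current)

    result = []
    for n, line in enumerate(lines):
        gaps = len(line) - 1
        is_last = n == len(lines) - 1
        if gaps == 0 or is_last:
            gap = min_gap  # nothing to stretch, or last line left ragged
        else:
            gap = (line_width - sum(widths[i] for i in line)) / gaps
        result.append((line, gap))
    return result


# Four phrases between 1.5em and 8.5em wide on a 12em line.
print(layout_phrases([1.5, 8.5, 3.0, 2.0], 12.0))
```

A phrase wider than the line ends up alone on its own line in this sketch; that is the case where breaking within the phrase becomes necessary.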

Insert Zero-Width Spaces

The simplest approach is to insert a U+200B ZERO WIDTH SPACE (ZWSP) between words in a phrase. This allows line breaks between the words of a phrase. The difficulty is ensuring that ZWSP characters are inserted only between words. The standard approach is to insert ZWSP between syllables, since most words in the languages using the script are monosyllabic. Problems occur when a ZWSP is erroneously inserted into the middle of a polysyllabic word: such insertions interfere with searching and indexing. Thus ZWSP should only be used where it is certain that there is a word break.
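As a minimal sketch of this approach, assume the phrase has already been segmented into words by some trusted means (a human editor, for instance). The helper names `mark_word_breaks` and `strip_break_hints` are hypothetical, and the three-word segmentation used below is an assumption made for illustration.

```python
ZWSP = "\u200B"  # U+200B ZERO WIDTH SPACE


def mark_word_breaks(words):
    """Join already-segmented words, placing a ZWSP at each known word break."""
    return ZWSP.join(words)


def strip_break_hints(text):
    """Remove ZWSP so that searching and indexing see the plain text."""
    return text.replace(ZWSP, "")


# Assumed segmentation of a phrase into three words.
words = [
    "\u1021\u102D\u1015\u103A\u1001\u1014\u103A\u1038",  # two syllables
    "\u1010\u1036\u1001\u102B\u1038",                    # two syllables
    "\u1000\u102D\u102F",                                # one syllable
]
marked = mark_word_breaks(words)
assert strip_break_hints(marked) == "".join(words)
```

Stripping ZWSP before indexing is one way to limit the search problems described above, but it does not help a reader whose line was broken inside a word.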

Automated Syllable Breaking

A better approach is purely algorithmic, basing line breaking on syllable breaks. The outline algorithm described here has not been found to fail in any of the languages using the Myanmar script. A syllable break may occur before any cluster (as described in the diacritic ordering section) so long as the kinzi, asat and stacked slots remain empty in the cluster following the possible break point. In practice such an algorithm requires refinement, but it still requires no dictionary. For example, sequences of digits should be kept together.
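The rule above can be sketched in a few lines. This is a simplified reading, not the full cluster model: it assumes a cluster starts at any code point in U+1000..U+102A or at U+103F, treats a following U+103A or U+1039 as a filled asat, stacked or kinzi slot, allows U+1037 to sit between a consonant and its asat, and ignores the extended Myanmar blocks. All function names are hypothetical.

```python
ASAT = "\u103A"       # visible killer on a syllable-final consonant
VIRAMA = "\u1039"     # invisible stacker; also the last code point of a kinzi
DOT_BELOW = "\u1037"  # may sit between a final consonant and its asat


def is_base(ch):
    """Consonants and independent vowels (U+1000..U+102A) plus U+103F."""
    return "\u1000" <= ch <= "\u102A" or ch == "\u103F"


def is_digit(ch):
    """Myanmar digits U+1040..U+1049."""
    return "\u1040" <= ch <= "\u1049"


def syllable_breaks(text):
    """Return the indices in `text` before which a syllable break may occur."""
    breaks = []
    for i in range(1, len(text)):
        ch, prev = text[i], text[i - 1]
        if is_digit(ch):
            if not is_digit(prev):  # keep runs of digits together
                breaks.append(i)
            continue
        if not is_base(ch) or prev == VIRAMA:  # mark, or stacked consonant
            continue
        j = i + 1
        if j < len(text) and text[j] == DOT_BELOW:
            j += 1
        if j < len(text) and text[j] in (ASAT, VIRAMA):
            continue  # this consonant closes the previous syllable
        breaks.append(i)
    return breaks


def split_syllables(text):
    cuts = [0] + syllable_breaks(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def to_hex(text):
    return " ".join(f"{ord(c):04X}" for c in text)


phrase = ("\u1021\u102D\u1015\u103A\u1001\u1014\u103A\u1038"
          "\u1010\u1036\u1001\u102B\u1038\u1000\u102D\u102F")
print(" | ".join(to_hex(s) for s in split_syllables(phrase)))
# prints: 1021 102D 1015 103A | 1001 1014 103A 1038 | 1010 1036 | 1001 102B 1038 | 1000 102D 102F
```

The sketch reports every syllable break as a possible line break; it makes no attempt to distinguish the non-line-breaking syllable breaks that sorting needs.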

These same syllable breaking rules are used for sorting purposes, with the addition of non-line-breaking syllable breaks, such as those occurring between the two characters in a syllable chain. For example, these phrases show possible inter-syllable line breaks.

Text: ကောင်လေးတွေကျောင်းကိုသွားကြတယ်။
Code points: 1000 1031 102C 1004 103A | 101C 1031 1038 | 1010 103D 1031 | 1000 103B 1031 102C 1004 103A 1038 | 1000 102D 102F | 101E 103D 102C 1038 | 1000 103C | 1010 101A 103A 104B
Gloss: the kids are going to school

Text: အိပ်ခန်းတံခါးကို
Code points: 1021 102D 1015 103A | 1001 1014 103A 1038 | 1010 1036 | 1001 102B 1038 | 1000 102D 102F
Gloss: to the bedroom door

Notice how in the second example the word 1010 1036 | 1001 102B 1038 is a single word with multiple syllables. Is there some way, without a dictionary, to ensure that the word is not broken across lines? There is a Unicode character for this: U+2060 WORD JOINER. Its role is to indicate a non-breaking point in a text; lines should not be broken at that point. Therefore, to ensure that no line break occurs at the syllable boundary within our polysyllabic word, we can insert a U+2060 into the data stream between the two syllables, and a rendering engine should not break a line at that point. Thus:

Text: အိပ်ခန်းတံ⁠ခါးကို
Code points: 1021 102D 1015 103A | 1001 1014 103A 1038 | 1010 1036 2060 1001 102B 1038 | 1000 102D 102F
Gloss: to the bedroom door

[3] Version 5.1

Representing Myanmar in Unicode, Page 9 of 37, Version: 433
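The word joiner technique can be sketched as two small steps: glue the syllables of a known polysyllabic word with U+2060, then discard any candidate break that touches a U+2060. The helper names `protect_word` and `filter_breaks` are hypothetical, and the candidate indices passed in are assumed to come from some separate syllable breaker.

```python
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER


def protect_word(text, syllables):
    """Insert U+2060 between the syllables of one known polysyllabic word.

    Replaces every occurrence of the joined syllables, so it can also match
    the same sequence where it happens to span a real word boundary.
    """
    return text.replace("".join(syllables), WORD_JOINER.join(syllables))


def filter_breaks(text, candidates):
    """Drop candidate break indices that sit next to a WORD JOINER."""
    return [i for i in candidates
            if text[i - 1] != WORD_JOINER and text[i] != WORD_JOINER]


def to_hex(text):
    return " ".join(f"{ord(c):04X}" for c in text)


phrase = ("\u1021\u102D\u1015\u103A\u1001\u1014\u103A\u1038"
          "\u1010\u1036\u1001\u102B\u1038\u1000\u102D\u102F")
protected = protect_word(phrase, ["\u1010\u1036", "\u1001\u102B\u1038"])
print(to_hex(protected))
# prints: 1021 102D 1015 103A 1001 1014 103A 1038 1010 1036 2060 1001 102B 1038 1000 102D 102F

# Candidate indices as a syllable breaker might report them for `protected`.
print(filter_breaks(protected, [4, 8, 11, 14]))
# prints: [4, 8, 14]
```

Deciding which syllable pairs belong to one word still needs outside knowledge; the sketch only records that decision in the text.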
