Representing Myanmar in Unicode - Evertype
Introduction<br />
This document aims to give guidance on the encoding of text using the Myanmar script. Since the script is<br />
used for a number of orthographies covering different languages, the development of this document is<br />
ongoing. It aims to bring together the results of consensus between experts in the encoding of the various<br />
orthographies using the script. In terms of the Unicode standard, this document is purely informative since it<br />
is concerned with issues not covered by that standard. But within the country, and by developers of the<br />
script, this document has been accorded a certain degree of authority. This provides further encouragement<br />
to maintain this document and update it as new issues arise.<br />
Readers interested in following the history of the development of this script are advised to read the<br />
earlier versions of this document; this version does not reproduce its predecessors.<br />
The Myanmar script is used for a number of languages. This means that when considering the script as a<br />
whole, care must be taken not to over-specify constraints on which character sequences should be considered<br />
valid or in error. The temptation is to use script-level sequence constraints as a form of spell checking. But<br />
spell checking is inherently language-specific. The result is that script constraints need to be the lowest<br />
common denominator of all the orthographies supported by the script. The orthography list is not closed: we<br />
have not yet described all the existing orthographies, and languages change and develop, and their orthographies<br />
with them. As a result, script constraints cannot simply be the intersection of all known writing system<br />
constraints, but must take a more intentional approach. The basic principle used here is not to try to<br />
constrain what users can generate, but only to ensure that, within a writing system, no two different valid<br />
sequences look the same. We do this by specifying a valid string as being a sequence of slots. Each<br />
slot may be empty or contain a character (or a sequence, as specified by the slot). Implementations may well<br />
add further, language-specific constraints to help their users.<br />
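The slot idea can be sketched in a few lines of Python. This is only an illustration: the slot names, their order and their contents below are simplified assumptions made for the example, not the slot table defined by this document, and the function name is hypothetical.

```python
import re

# ASSUMPTION: a small, simplified slot list for illustration only.
# The authoritative slot order and contents come from the specification.
SLOTS = [
    ("consonant", "[\u1000-\u1021]"),          # MYANMAR LETTER KA .. LETTER A
    ("medial_ya", "\u103B"),                   # CONSONANT SIGN MEDIAL YA
    ("medial_ra", "\u103C"),                   # CONSONANT SIGN MEDIAL RA
    ("medial_wa", "\u103D"),                   # CONSONANT SIGN MEDIAL WA
    ("medial_ha", "\u103E"),                   # CONSONANT SIGN MEDIAL HA
    ("vowel_e", "\u1031"),                     # VOWEL SIGN E
    ("upper_vowel", "[\u102D\u102E\u1032]"),   # VOWEL SIGN I, II, AI
    ("lower_vowel", "[\u102F\u1030]"),         # VOWEL SIGN U, UU
]

# Every slot is optional, but the slots must appear in this fixed order.
CLUSTER = re.compile("".join(f"(?:{pattern})?" for _, pattern in SLOTS))


def is_valid_cluster(text: str) -> bool:
    """Return True if text fills the slots in order, each at most once."""
    return bool(text) and CLUSTER.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_valid_cluster("\u1000\u103B\u103D"))  # KA, medial YA, medial WA
    print(is_valid_cluster("\u1000\u103D\u103B"))  # same marks, other order
```

Because each slot can be filled at most once and the order is fixed, a given set of marks on a base letter has exactly one accepted spelling under this sketch. A language-specific checker could be layered on top by rejecting further sequences without changing the script-level pattern.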
A further concern when reading a developing document such as this is stability. What can we be<br />
sure about going into the future? The approach taken in this document follows the core principle of stability<br />
in Unicode: any data that is valid today will always remain valid. This requires that any changes to the sequence<br />
order, for example, will always be to loosen it. Thus more sequences will be allowed rather than fewer. This<br />
means that data that is invalid today may become valid in future versions of this document.<br />
This version brings the specification in line with Unicode 5.2.<br />
Introduction to Version 2<br />
The first edition of this technical note addressed the issue of how Myanmar text was encoded using the<br />
Unicode standard as it stood until version 5.1. With Unicode 5.1 various new characters were added to the<br />
Myanmar block, which had the effect of simplifying the encoding model considerably. Such a change could<br />
only come about with agreement from all implementors and those with existing data, because they would need<br />
to update and change to the new model. This is nearly impossible to achieve if existing implementations are<br />
already in widespread use, which was not the case at the time for the Myanmar block. In addition, such a<br />
change was necessary to facilitate the encoding of minority scripts. So with a necessity and a unique<br />
opportunity for change, the characters were added and the encoding model simplified.<br />
The author wishes to thank the Myanmar Language Commission, the Myanmar NLP Lab and the Myanmar<br />
Computer Federation for reviewing and providing input to this version of the document.<br />
Introduction to Version 3
The first two editions of this document were almost exclusively concerned with the needs of the Burmese<br />
language. This edition drastically extends the set of allowable sequences and considers the needs of a<br />
number of minority languages. It also adds summary descriptions of a number of languages that have<br />
Myanmar-based writing systems and indicates how they are encoded, along with other<br />
computational issues that these writing systems raise.<br />
Representing Myanmar in Unicode Page 2 of 37 Version: 433