26.08.2013 Views

Representing Myanmar in Unicode - Evertype

Representing Myanmar in Unicode - Evertype

Representing Myanmar in Unicode - Evertype

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Introduction<br />

This document aims to give guidance on the encod<strong>in</strong>g of text us<strong>in</strong>g the <strong>Myanmar</strong> script. S<strong>in</strong>ce the script is<br />

used for a number of orthographies cover<strong>in</strong>g different languages, the development of this document is<br />

ongo<strong>in</strong>g. It aims to br<strong>in</strong>g together the results of consensus between experts <strong>in</strong> the encod<strong>in</strong>g of the various<br />

orthographies us<strong>in</strong>g the script. In terms of the <strong>Unicode</strong> standard, this document is purely <strong>in</strong>formative s<strong>in</strong>ce it<br />

is concerned with issues not covered by that standard. But with<strong>in</strong> the country, and by developers of the<br />

script, this document has been accorded a certa<strong>in</strong> degree of authority. This provides further encouragement<br />

to ma<strong>in</strong>ta<strong>in</strong> this document and update it as new issues arise.<br />

Readers <strong>in</strong>terested <strong>in</strong> follow<strong>in</strong>g the history of the development of this script are recommended to read the<br />

different versions of this document, rather than expect<strong>in</strong>g to f<strong>in</strong>d this document conta<strong>in</strong><strong>in</strong>g all versions of<br />

itself with<strong>in</strong>.<br />

The <strong>Myanmar</strong> script is used for a number of languages. This means that when consider<strong>in</strong>g the script as a<br />

whole, care must be taken not to over specify constra<strong>in</strong>ts on what character sequences should be considered<br />

valid or <strong>in</strong> error. The temptation is to use script level sequence constra<strong>in</strong>ts as a form of spell check<strong>in</strong>g. But<br />

spell check<strong>in</strong>g is <strong>in</strong>herently language specific. The result is that script constra<strong>in</strong>ts need to be the lowest<br />

common denom<strong>in</strong>ator of all the orthographies supported by the script. The orthography list is not closed: we<br />

have not described all the exist<strong>in</strong>g orthographies yet; languages change and develop and their orthographies<br />

with them. As a result, script constra<strong>in</strong>ts cannot simply be the <strong>in</strong>tersection of all known writ<strong>in</strong>g system<br />

constra<strong>in</strong>ts, but must take a more <strong>in</strong>tentional approach. The basic pr<strong>in</strong>ciple used here is not to try to<br />

constra<strong>in</strong> what users can generate, but only to ensure that there are no two different valid sequences that look<br />

the same, with<strong>in</strong> a writ<strong>in</strong>g system. We do this by specify<strong>in</strong>g a valid str<strong>in</strong>g as be<strong>in</strong>g a sequence of slots. Each<br />

slot may be empty or conta<strong>in</strong> a character (or sequence as specified by the slot). Implementations may well<br />

add further, language specific, constra<strong>in</strong>ts to help their users.<br />

A further concern when read<strong>in</strong>g a develop<strong>in</strong>g document such as this is the stability criteria. What can we be<br />

sure about go<strong>in</strong>g <strong>in</strong>to the future? The approach taken <strong>in</strong> this document follows the core pr<strong>in</strong>ciple of stability<br />

<strong>in</strong> <strong>Unicode</strong>: Any valid data today will always rema<strong>in</strong> valid. This requires that any changes to the sequence<br />

order, for example, will always be to loosen it. Thus more sequences will be allowed rather than less. This<br />

means that <strong>in</strong>valid data today may not always rema<strong>in</strong> <strong>in</strong>valid <strong>in</strong> future versions of this document.<br />

This version br<strong>in</strong>gs the specification <strong>in</strong> l<strong>in</strong>e with <strong>Unicode</strong> 5.2.<br />

Introduction to Version 2<br />

The first edition of this technical note addressed the issue of how <strong>Myanmar</strong> text was encoded us<strong>in</strong>g the<br />

<strong>Unicode</strong> standard as it stood until version 5.1. With <strong>Unicode</strong> 5.1 various new characters were added to the<br />

<strong>Myanmar</strong> block which had the effect of simplify<strong>in</strong>g the encod<strong>in</strong>g model considerably. Such a change could<br />

only come about with agreement from all implementors and those with exist<strong>in</strong>g data because they will need<br />

to update and change to the new model. This is nearly impossible to achieve if exist<strong>in</strong>g implementations are<br />

already <strong>in</strong> widespread use, which was not the case at the time for the <strong>Myanmar</strong> block. In addition, such a<br />

change was necessary to facilitate the encod<strong>in</strong>g of m<strong>in</strong>ority scripts. So with a necessity and a unique<br />

opportunity for change, the characters were added and the encod<strong>in</strong>g model simplified.<br />

The author wishes to thank the <strong>Myanmar</strong> Language Commission, the <strong>Myanmar</strong> NLP Lab and the <strong>Myanmar</strong><br />

Computer Federation for review<strong>in</strong>g and provid<strong>in</strong>g <strong>in</strong>put to this version of the document.]<br />

Introduction to version 3<br />

The first two editions of this document were almost exclusively concerned with the needs of the Burmese<br />

language. This edition drastically extends the set of allowable sequences and considers the needs of a<br />

number of m<strong>in</strong>ority languages. It also adds summary descriptions of a number of languages that have<br />

<strong>Myanmar</strong> based writ<strong>in</strong>g systems and gives <strong>in</strong>dication on how they are encoded along with other<br />

computational issues that these writ<strong>in</strong>g systems raise.<br />

<strong>Represent<strong>in</strong>g</strong> <strong>Myanmar</strong> <strong>in</strong> <strong>Unicode</strong> Page 2 of 37 Version: 433

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!