Free/Open Source Software: Localization
Annex B. Localization – Technical Aspects

Figure 2. Unicode Basic Multilingual Plane

Code points in supplementary planes are instead represented as pairs of 16-bit unsigned integers. These pairs of code units are called surrogate pairs. The values used for surrogate pairs lie in the range D800–DFFF, which is not assigned to any character, so UTF-16 readers can easily distinguish between a single code unit and a surrogate pair. The Unicode Standard⁸ provides more details on surrogates.

UTF-16 is a good choice for keeping general Unicode strings, as it is optimized for characters in the BMP, which covers the characters used in 99 percent of Unicode texts. For such text it consumes about half of the storage required by UTF-32.

UTF-8

To meet the requirements of legacy byte-oriented, ASCII-based systems, UTF-8 is defined as a variable-width encoding form that preserves ASCII compatibility. It uses one to four 8-bit code units to represent a Unicode character, depending on the code point value. Code points between 0000 and 007F are encoded in a single byte, making any ASCII string valid UTF-8. Beyond the ASCII range, code points between 0080 and 07FF, which cover some non-ideographic characters, are encoded with two bytes. Code points between 0800 and FFFF, which include Indic scripts and CJK ideographs, are encoded with three bytes. Supplementary characters beyond the BMP require four bytes. The Unicode Standard⁹ provides more details on UTF-8.

UTF-8 is typically the preferred encoding form for the Internet. Its ASCII compatibility helps a lot in migration from old systems. UTF-8 also has the advantage of being byte-serialized and friendly to the APIs of C and other programming languages. For example, traditional string collation using byte-wise comparison works with UTF-8, because it orders strings by code point value.

In short, UTF-8 is the most widely adopted encoding form of Unicode.

Character Properties

In addition to code points, Unicode also provides a database of character properties called the Unicode Character Database (UCD),¹⁰ which consists of a set of files describing the following properties:

• Name.
• General category (classification as letters, numbers, symbols, punctuation, etc.).
• Other important general characteristics (white space, dash, ideographic, alphabetic, noncharacter, deprecated, etc.).
• Character shaping (bidi category, shaping, mirroring, width, etc.).
• Case (upper, lower, title, folding; both simple and full).
• Numeric values and types (for digits).
• Script and block.
• Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, etc.).
• Age (version of the standard in which the code point was first designated).
• Boundaries (grapheme cluster, word, line and sentence).
• Standardized variants.

The database is useful for Unicode implementation in general. It is available at the Unicode.org Web site. The Unicode Standard¹¹ provides more details of the database.

Technical Reports

In addition to the code points, encoding forms and character properties, Unicode also provides some technical reports that can serve as implementation guidelines. Some of these reports have been included as annexes to the Unicode Standard, and some are published individually as Technical Standards.

In Unicode 4.0, the standard annexes are:

UAX 9: The Bidirectional Algorithm
Specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew.

UAX 11: East Asian Width
Specifications of an informative property of Unicode characters that is useful when interoperating with East Asian legacy character sets.

UAX 14: Line Breaking Properties
Specification of line breaking properties for Unicode characters, as well as a model algorithm for determining line break opportunities.

UAX 15: Unicode Normalization Forms
Specifications for four normalized forms of Unicode text. With these forms, equivalent text (canonical or compatibility) will have identical binary representations. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

UAX 24: Script Names
Assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

UAX 29: Text Boundaries

8 The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 76–77.
9 The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 77–78.
10 Ibid., pp. 95–104.
11 Unicode.org, ‘Unicode Technical Reports’; available from www.unicode.org/reports/index.html.
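The surrogate pair mechanism described in the UTF-16 discussion above can be illustrated with a short Python sketch. The function names `to_surrogate_pair` and `is_surrogate` are hypothetical helpers introduced only for this illustration; the arithmetic is the standard UTF-16 rule of splitting the 20-bit offset above 10000 into two 10-bit halves.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits -> DC00..DFFF
    return high, low


def is_surrogate(code_unit: int) -> bool:
    """True if a 16-bit code unit falls in the reserved range D800..DFFF."""
    return 0xD800 <= code_unit <= 0xDFFF


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = "\U0001D11E".encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Because the range D800–DFFF is never assigned to characters, a reader can classify each 16-bit unit with a check like `is_surrogate` without looking at its neighbours.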
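The UTF-8 byte counts given in the UTF-8 section, and the remark about byte-wise comparison, can be checked with a small Python sketch. `utf8_length` is a hypothetical helper that only restates the ranges from the text; the sample characters and words are arbitrary choices for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a code point, per the ranges in the text."""
    if code_point <= 0x7F:
        return 1        # ASCII
    if code_point <= 0x7FF:
        return 2        # e.g. Latin letters with diacritics
    if code_point <= 0xFFFF:
        return 3        # rest of the BMP, e.g. Thai and CJK ideographs
    return 4            # supplementary planes


samples = [
    "A",            # U+0041 LATIN CAPITAL LETTER A
    "\u00e9",       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u0e01",       # U+0E01 THAI CHARACTER KO KAI
    "\u4e2d",       # U+4E2D CJK ideograph
    "\U0001D11E",   # U+1D11E MUSICAL SYMBOL G CLEF
]
for ch in samples:
    # The range table agrees with Python's built-in encoder.
    assert len(ch.encode("utf-8")) == utf8_length(ord(ch))

# Byte-wise comparison of UTF-8 data gives the same order as comparing
# the strings by code point (which is how Python compares str objects).
words = ["zebra", "\u00e9clair", "apple", "\u4e2d\u6587", "\U0001D11E"]
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
by_code_point = sorted(words)
assert by_bytes == by_code_point
```

Note that code point order is not the same as language-aware collation; the sketch only shows that byte-oriented routines keep working on UTF-8 data.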
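Several of the character properties listed under Character Properties can be queried from Python's standard `unicodedata` module, which is built from the UCD. The sketch below is illustrative only: `describe` is a hypothetical helper, and the module ships a newer UCD version than the Unicode 4.0 data discussed here, although the properties of the characters shown are stable.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD properties for a single character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),            # general category
        "bidi": unicodedata.bidirectional(ch),           # bidi category
        "east_asian_width": unicodedata.east_asian_width(ch),
        "combining": unicodedata.combining(ch),          # canonical combining class
        "mirrored": unicodedata.mirrored(ch),            # 1 if mirrored in bidi text
        "numeric": unicodedata.numeric(ch, None),        # numeric value, if any
        "decomposition": unicodedata.decomposition(ch),  # decomposition mapping
    }


latin_a = describe("A")
thai_one = describe("\u0e51")   # THAI DIGIT ONE
alef = describe("\u05d0")       # HEBREW LETTER ALEF
e_acute = describe("\u00e9")    # LATIN SMALL LETTER E WITH ACUTE

for info in (latin_a, thai_one, alef, e_acute):
    print(info)
```

A localization library would typically rely on such lookups instead of hard-coding script-specific tables, for instance to recognize native digits or right-to-left letters.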
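The effect of the normalization forms specified in UAX 15 can be seen with Python's `unicodedata.normalize`. This is a minimal sketch: the helper `equivalent` is hypothetical, and the choice of NFC as the default comparison form is an assumption made for the example, not a recommendation from the source.

```python
import unicodedata

composed = "\u00e9"       # U+00E9, precomposed e with acute
decomposed = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# The two strings are canonically equivalent but differ as code point sequences.
assert composed != decomposed

# After normalization, equivalent text has one representation.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms also fold compatibility characters such as the
# U+FB01 LATIN SMALL LIGATURE FI; canonical forms leave it untouched.
ligature = "\ufb01"
assert unicodedata.normalize("NFKC", ligature) == "fi"
assert unicodedata.normalize("NFC", ligature) == ligature


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


assert equivalent(composed, decomposed)
```

Keeping stored strings in a single form, as the UAX 15 summary suggests, means that a plain equality test is enough at comparison time.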