12.07.2015 Views

free openSS

32 ❘ FREE/OPEN SOURCE SOFTWARE: LOCALIZATION

beyond BMP require four bytes. The Unicode Standard [9] provides more detail on UTF-8.

UTF-8 is typically the preferred encoding form for the Internet. Its ASCII compatibility helps a lot in migration from old systems. UTF-8 also has the advantage of being byte-serialized and friendly to the APIs of C and other programming languages. For example, the traditional string collation using byte-wise comparison works with UTF-8.

In short, UTF-8 is the most widely adopted encoding form of Unicode.

Character Properties

In addition to code points, Unicode also provides a database of character properties called the Unicode Character Database (UCD), [10] which consists of a set of files describing the following properties:

- Name.
- General category (classification as letters, numbers, symbols, punctuation, etc.).
- Other important general characteristics (white space, dash, ideographic, alphabetic, noncharacter, deprecated, etc.).
- Character shaping (bidi category, shaping, mirroring, width, etc.).
- Case (upper, lower, title, folding; both simple and full).
- Numeric values and types (for digits).
- Script and block.
- Normalization properties (decompositions, decomposition type, canonical combining class, composition exclusions, etc.).
- Age (version of the standard in which the code point was first designated).
- Boundaries (grapheme cluster, word, line and sentence).
- Standardized variants.

The database is useful for Unicode implementation in general. It is available at the Unicode.org Web site. The Unicode Standard [11] provides more details of the database.

Technical Reports

In addition to the code points, encoding forms and character properties, Unicode also provides some technical reports that can serve as implementation guidelines.
Some of these reports have been included as annexes to the Unicode standard, and some are published individually as Technical Standards.

In Unicode 4.0, the standard annexes are:

UAX 9: The Bidirectional Algorithm
Specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew.

UAX 11: East-Asian Width
Specifications of an informative property of Unicode characters that is useful when interoperating with East-Asian Legacy character sets.

UAX 14: Line Breaking Properties
Specification of line breaking properties for Unicode characters as well as a model algorithm for determining line break opportunities.

UAX 15: Unicode Normalization Forms
Specifications for four normalized forms of Unicode text. With these forms, equivalent text (canonical or compatibility) will have identical binary representations. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

UAX 24: Script Names
Assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

UAX 29: Text Boundaries

[9] The Unicode Consortium. The Unicode Standard, Version 4.0, pp. 77–78.
[10] Ibid., pp. 95–104.
[11] Unicode.org, ‘Unicode Technical Reports’; available from www.unicode.org/reports/index.html.
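Several points on this page can be checked with a short Python sketch: that code points beyond the BMP take four bytes in UTF-8, that byte-wise comparison of UTF-8 gives the same order as code point comparison, that character properties can be looked up from the character database, and that the normalization forms of UAX 15 give equivalent text identical representations. This is only an illustration under stated assumptions. It relies on Python's standard `unicodedata` module, which ships a newer version of the character database than Unicode 4.0, and the function names (`utf8_len`, `same_order`, `properties`, `equivalent`) are made up for this example.

```python
import unicodedata


def utf8_len(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def same_order(strings: list[str]) -> bool:
    """Check that sorting by UTF-8 bytes matches sorting by code points."""
    by_bytes = sorted(strings, key=lambda s: s.encode("utf-8"))
    by_code_points = sorted(strings)  # Python compares str by code point
    return by_bytes == by_code_points


def properties(ch: str) -> dict:
    """Look up a few character properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
        "bidi_category": unicodedata.bidirectional(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
        "combining_class": unicodedata.combining(ch),
    }


def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form.

    form is one of the four normalization forms: NFC, NFD, NFKC, NFKD.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    # ASCII, Latin-1 supplement, Thai, and a code point beyond the BMP
    for ch in ["A", "\u00e9", "\u0e01", "\U00010000"]:
        print(f"U+{ord(ch):04X} takes {utf8_len(ch)} byte(s) in UTF-8")

    print(same_order(["b", "\u00e9", "a", "\U00010000", "\u0e01"]))
    print(properties("\u05d0"))  # a Hebrew letter, written right to left

    # Precomposed e-acute versus "e" followed by a combining acute accent
    print(equivalent("\u00e9", "e\u0301"))
```

The last call compares two strings that differ as raw code point sequences but are canonically equivalent, so they only compare equal once both are normalized. A compatibility form such as NFKC goes further and also folds characters like the "fi" ligature (U+FB01) into plain letters, which is why the choice of form matters when strings are stored or compared.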
