10.07.2015 Views

Download - Multivac!


4.2 Important Unicode Concepts

Characters and glyphs. When dealing with text it is important to clearly distinguish the following concepts:

> Characters are the smallest units which convey information in a language. Common examples are the letters in the Latin alphabet, Chinese ideographs, and Japanese syllables. Characters have a meaning: they are semantic entities.
> Glyphs are different graphical variants which represent one or more particular characters. Glyphs have an appearance: they are representational entities.

There is no one-to-one relationship between characters and glyphs. For example, a ligature is a single glyph which represents two or more separate characters. On the other hand, a specific glyph may be used to represent different characters depending on the context (some characters look identical, see Figure 4.1).

Unicode encoding forms (UTF formats). The Unicode standard assigns a number (code point) to each character. In order to use these numbers in computing, they must be represented in some way. In the Unicode standard this is called an encoding form (formerly: transformation format); this term should not be confused with font encodings. Unicode defines the following encoding forms:

> UTF-8: This is a variable-width format where code points are represented by 1-4 bytes. ASCII characters in the range U+0000...U+007F are represented by a single byte in the range 00...7F. Latin-1 characters in the range U+00A0...U+00FF are represented by two bytes, where the first byte is always 0xC2 or 0xC3 (these values represent Â and Ã in Latin-1).
> UTF-16: Code points in the Basic Multilingual Plane (BMP), i.e. characters in the range U+0000...U+FFFF, are represented by a single 16-bit value. Code points in the supplementary planes, i.e. in the range U+10000...U+10FFFF, are represented by a pair of 16-bit values. Such pairs are called surrogate pairs. A surrogate pair consists of a high-surrogate value in the range D800...DBFF and a low-surrogate value in the range DC00...DFFF. High- and low-surrogate values can only appear as parts of surrogate pairs, but not in any other context.
> UTF-32: Each code point is represented by a single 32-bit value.

Fig. 4.1. Relationship of glyphs and characters (the glyph images are not reproduced here). The characters listed in the figure are:

> U+0067 LATIN SMALL LETTER G
> U+0066 LATIN SMALL LETTER F + U+0069 LATIN SMALL LETTER I
> U+2126 OHM SIGN or U+03A9 GREEK CAPITAL LETTER OMEGA
> U+2167 ROMAN NUMERAL EIGHT or U+0056 V U+0049 I U+0049 I U+0049 I

Chapter 4: Unicode and Legacy Encodings
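
The three encoding forms described above can be compared with a short sketch. The following Python example is only an illustration and not part of the original manual; the function names utf16_surrogates and encoded_forms are hypothetical and were chosen for this example. It computes the surrogate pair for a supplementary-plane code point by hand and then shows the UTF-8, UTF-16, and UTF-32 representations of a character using Python's built-in codecs (big-endian byte order is assumed for the 16-bit and 32-bit forms).

```python
def utf16_surrogates(cp):
    """Split a supplementary-plane code point into (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits -> D800...DBFF
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits -> DC00...DFFF
    return high, low


def encoded_forms(ch):
    """Return the hex representation of one character in each encoding form.

    Big-endian byte order is an assumption made for readability.
    """
    return {
        "UTF-8": ch.encode("utf-8").hex(" "),         # one group per byte
        "UTF-16": ch.encode("utf-16-be").hex(" ", 2),  # one group per 16-bit value
        "UTF-32": ch.encode("utf-32-be").hex(" ", 4),  # one group per 32-bit value
    }


# ASCII, Latin-1, BMP, and supplementary-plane examples
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", encoded_forms(ch))

high, low = utf16_surrogates(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")
```

For U+00E9 the UTF-8 result starts with the byte C3, which matches the statement above about Latin-1 characters. For U+1D11E the hand-computed surrogate pair agrees with the two 16-bit values produced by the UTF-16 codec.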
