
PDFlib 8 Windows COM/.NET Tutorial


5.2 Unicode Characters and Glyphs

5.2.1 Glyph IDs

A font is a collection of glyphs, where each glyph is defined by its geometric outline. PDFlib assigns a number to each glyph in the font. This number is called the glyph ID or GID. GID 0 (zero) refers to the .notdef glyph in all font formats. The visual appearance of the .notdef glyph varies among font formats and vendors; typical implementations are the space glyph or a hollow or crossed-out rectangle. The highest GID is one less than the number of glyphs in the font, which can be queried with the numglyphs keyword of info_font( ).

The assignment of glyph IDs depends on the font format:

> Since TrueType and OpenType fonts already contain internal GIDs, PDFlib uses these GIDs.

> For CID-keyed OpenType CJK fonts, CIDs will be used as GIDs.

> For other font types, PDFlib numbers the glyphs according to the order of the corresponding outline descriptions in the font.

PDFlib supports glyph selection via GID as an alternative to Unicode and other encodings (see »Glyphid encoding«, page 129). Direct GID addressing is only useful for specialized applications, e.g. printing font overview tables by querying the number of glyphs and iterating over all glyph IDs.
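As a rough illustration of the font overview use case, the following Python sketch groups all glyph IDs of a font into table rows. It is not PDFlib code: numglyphs stands for the value reported by the numglyphs keyword of info_font( ), while the function name glyph_id_rows and the column count are hypothetical choices made for this example.

```python
def glyph_id_rows(numglyphs, columns=16):
    """Group the glyph IDs 0 .. numglyphs-1 into rows for an overview table.

    numglyphs is assumed to come from the font (e.g. the numglyphs keyword);
    glyph_id_rows and columns are illustrative names, not part of any API.
    """
    if numglyphs < 0 or columns < 1:
        raise ValueError("numglyphs must be >= 0 and columns >= 1")
    gids = range(numglyphs)  # GID 0 is .notdef; the highest GID is numglyphs - 1
    return [list(gids[i:i + columns]) for i in range(0, numglyphs, columns)]
```

Each row could then be rendered as one line of the overview table, with the individual glyphs selected by GID.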

5.2.2 Unicode Mappings for Glyphs

Unicode mappings. PDFlib assigns a unique Unicode value to each GID. This mapping process depends on the font format and is detailed in the sections below for the supported font types. Although exactly one Unicode value will be assigned to each GID, the reverse is not necessarily true, i.e. a particular glyph can represent multiple Unicode values. Common examples in many TrueType and OpenType fonts are the empty glyph which represents U+0020 Space as well as U+00A0 No-Break Space, and a glyph which represents both U+2126 Ohm Sign and U+03A9 Greek Capital Letter Omega. If multiple Unicode values point to the same glyph in a font, PDFlib will assign the first Unicode value found in the font.
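The "first value found" rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the font's character map is modeled as a list of (Unicode value, GID) pairs in the order in which they are found, and the function name and the sample GIDs are hypothetical. How PDFlib actually traverses a given font format is described in the format-specific sections, not here.

```python
def first_unicode_per_gid(cmap_entries):
    """Return {gid: unicode}, keeping the first Unicode value seen per GID.

    cmap_entries is an iterable of (unicode_value, gid) pairs in the order
    they are encountered in the font (an assumption of this sketch).
    """
    gid_to_unicode = {}
    for codepoint, gid in cmap_entries:
        # setdefault stores the value only if the GID has none yet,
        # so later Unicode values for the same glyph are ignored.
        gid_to_unicode.setdefault(gid, codepoint)
    return gid_to_unicode


# Hypothetical font data: GID 3 is the empty glyph, GID 100 the Omega glyph.
entries = [(0x0020, 3), (0x00A0, 3), (0x03A9, 100), (0x2126, 100)]
mapping = first_unicode_per_gid(entries)
```

With this sample data, GID 3 is associated with U+0020 and GID 100 with U+03A9, because those values appear first in the list.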

Unmapped glyphs and the Private Use Area (PUA). In some situations the font may not provide a Unicode value for a particular glyph. In this case PDFlib assigns a value from the Unicode Private Use Area (PUA, see Section 4.1, »Important Unicode Concepts«, page 99) to the glyph. Such glyphs are called unmapped glyphs. The number of unmapped glyphs in a font can be queried with the unmappedglyphs keyword of info_font( ). Unmapped glyphs will be represented by the Unicode replacement character U+FFFD in the font’s ToUnicode CMap which controls searchability and text extraction. As a consequence, unmapped glyphs cannot be properly extracted as text from the generated PDF.

When PDFlib assigns PUA values to unmapped glyphs it uses ascending values from the following pool:

> The basis is the Unicode PUA range in the Basic Multilingual Plane (BMP), i.e. the range U+E000 - U+F8FF. Additional PUA values in plane 15 (U+F0000 to U+FFFFD) are used if required.
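The ascending pool described above can be modeled as follows. This is a simplified Python sketch, not PDFlib's implementation: the names pua_values and assign_pua are hypothetical, and the sketch assumes that every value in both ranges is available. Whether values already used by the font are skipped is not covered here.

```python
# PUA ranges named in the text (end points inclusive).
BMP_PUA = range(0xE000, 0xF8FF + 1)        # 6400 values in the BMP
PLANE15_PUA = range(0xF0000, 0xFFFFD + 1)  # additional values in plane 15


def pua_values():
    """Yield PUA values in ascending order: BMP range first, then plane 15."""
    yield from BMP_PUA
    yield from PLANE15_PUA


def assign_pua(unmapped_gids):
    """Pair each unmapped GID with the next value from the pool.

    zip() stops silently when the pool is exhausted; a real implementation
    would need to handle that case explicitly.
    """
    return dict(zip(unmapped_gids, pua_values()))
```

For example, assign_pua([7, 9]) pairs GID 7 with U+E000 and GID 9 with U+E001; plane 15 values only come into play once all 6400 BMP values have been handed out.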
