PDFlib Text Extraction Toolkit (TET) Manual
Fig. 6.5<br />
Sample font reports created with the PDFlib FontReporter plugin for Adobe Acrobat<br />
Precedence rules. TET applies the glyph mapping controls in the following order:<br />
> Codelist and ToUnicode CMap resources are consulted first.<br />
> If the font has an internal ToUnicode CMap, it is considered next.<br />
> For glyph names, TET applies an external or internal glyph name mapping rule if<br />
one is available that matches the font and glyph name.<br />
> Lastly, a user-supplied glyph list is applied.<br />
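The ordering above can be illustrated with a minimal Python sketch. It is not TET's implementation: the function `resolve_unicode`, its parameters, and the sample data are hypothetical, and it assumes that the first source which yields a mapping wins, which the list implies but does not state outright.

```python
def resolve_unicode(code, glyph_name, code_resources, internal_tounicode,
                    glyph_name_rule, user_glyph_list):
    """Return the Unicode code points for a glyph, or None if nothing matches.

    Assumption: the first source that provides a mapping wins.
    """
    # 1. External code list and ToUnicode CMap resources (keyed by code)
    if code in code_resources:
        return code_resources[code]
    # 2. ToUnicode CMap embedded in the font (keyed by code)
    if code in internal_tounicode:
        return internal_tounicode[code]
    if glyph_name is not None:
        # 3. Glyph name mapping rule matching the font and glyph name
        mapped = glyph_name_rule(glyph_name)
        if mapped is not None:
            return mapped
        # 4. User-supplied glyph list (keyed by glyph name)
        if glyph_name in user_glyph_list:
            return user_glyph_list[glyph_name]
    return None


def glyph_name_rule(name):
    """Hypothetical rule: names of the form uniXXXX carry a hex code point."""
    if name.startswith("uni") and len(name) == 7:
        try:
            return [int(name[3:], 16)]
        except ValueError:
            return None
    return None


# Hypothetical sample data for illustration only
code_resources = {0x21: [0x2192]}
internal_tounicode = {0x21: [0x0021], 0x41: [0x0041]}
user_glyph_list = {"myarrow": [0x2190]}

result = resolve_unicode(0x21, None, code_resources, internal_tounicode,
                         glyph_name_rule, user_glyph_list)
print([hex(cp) for cp in result])
```

In this sketch code 0x21 appears in both the external resource and the internal ToUnicode CMap, and the external resource takes priority because it is consulted first.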
Code list resources for all font types. Code lists are similar to glyph lists except that<br />
they specify Unicode values for individual codes instead of glyph names. Although<br />
multiple fonts from the same foundry may use identical code assignments, codes (also<br />
called glyph IDs) are generally font-specific. As a consequence, separate code lists are<br />
required for individual fonts. A code list is a text file in which each line describes a Unicode<br />
mapping for a single code according to the following rules:<br />
> Text after a percent sign '%' is ignored; this can be used for comments.<br />
> The first column contains the glyph code in decimal or hexadecimal notation. This<br />
must be a value in the range 0-255 for simple fonts, and in the range 0-65535 for CID<br />
fonts.<br />
> The remainder of the line contains up to 7 Unicode code points for the code. The values<br />
can be supplied in decimal notation or (with the prefix x or 0x) in hexadecimal<br />
notation. UTF-32 is supported, i.e. surrogate pairs can be used.<br />
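
To make the file format concrete, here is a minimal Python sketch of a reader for it. This is not part of TET: `parse_number`, `parse_code_list`, and the sample entries are hypothetical. It assumes that the glyph code in the first column accepts the same x and 0x prefixes as the Unicode values, that columns are separated by whitespace, and that a line may list zero Unicode values. Surrogate values are passed through unchanged rather than combined.

```python
def parse_number(token):
    """Parse a decimal token, or a hexadecimal token prefixed with x or 0x."""
    lowered = token.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if lowered.startswith("x"):
        return int(lowered[1:], 16)
    return int(token, 10)


def parse_code_list(text, cid_font=False):
    """Return a dict mapping each glyph code to a list of Unicode values."""
    # Codes are limited to 0-255 for simple fonts and 0-65535 for CID fonts
    max_code = 65535 if cid_font else 255
    mapping = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        # Everything after a percent sign is a comment
        line = raw.split("%", 1)[0].strip()
        if not line:
            continue
        tokens = line.split()
        code = parse_number(tokens[0])
        if not 0 <= code <= max_code:
            raise ValueError(f"line {line_no}: code {code} out of range")
        values = [parse_number(t) for t in tokens[1:]]
        if len(values) > 7:
            raise ValueError(f"line {line_no}: more than 7 Unicode values")
        for value in values:
            if not 0 <= value <= 0x10FFFF:
                raise ValueError(f"line {line_no}: invalid value {value}")
        mapping[code] = values
    return mapping


# Hypothetical code list contents for illustration only
sample = """% sample code list
32 0x0020          % space
0x41 x0041         % A
200 0x0066 0x0069  % one code mapped to two code points
"""

table = parse_code_list(sample)
print(table)
```

Passing `cid_font=True` widens the accepted code range to 0-65535; with the default, a code such as 256 is rejected.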
6.8 Advanced Unicode Mapping Controls 77