PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Fig. 6.6<br />
The font report for a logotype font shows that the font contains wrong Unicode mappings.<br />
A custom code list can correct such mappings.<br />
By convention, code lists use the file name suffix .cl. Code lists can be configured with<br />
the codelist resource. If no code list resource has been specified explicitly, <strong>TET</strong> will search<br />
for a file named .gl (where is the resource name) in the searchpath<br />
hierarchy (see Section 5.2, »Resource Configuration and File Searching«, page 51 for<br />
details). In other words: if the resource name and the file name (without the .cl suffix)<br />
are identical you don’t have to configure the resource since <strong>TET</strong> will implicitly do the<br />
equivalent of the following call (where name is an arbitrary resource name):<br />
<strong>TET</strong>_set_option(tet, "codelist {name name.cl}");<br />
The following sample demonstrates the use of code lists. Consider the mismapped logotype<br />
glyphs in Figure 6.6 where a single glyph of the font actually represents multiple<br />
characters, and all characters together create the company logotype. However, the<br />
glyphs are wrongly mapped to the characters a, b, c, d, and e. In order to fix this you<br />
could create the following code list:<br />
% Unicode mappings for codes in the GlobeLogosOne font<br />
x61 x0054 x0068 x0065 x0020 % The<br />
x62 x0042 x006F % Bo<br />
x63 x0073 x0074 x006F x006E x0020 % ston<br />
x64 x0047 x006C x006F % Glo<br />
x65 x0062 x0065 % be<br />
Then supply the codelist with the following option to <strong>TET</strong>_open_document( ) (assuming<br />
the code list is available in a file called GlobeLogosOne.cl and can be found via the search<br />
path):<br />
glyphmapping {{fontname=GlobeLogosOne codelist=GlobeLogosOne}}<br />
ToUnicode CMap resources for all font types. PDF supports a data structure called<br />
ToUnicode CMap which can be used to provide Unicode values for the glyphs of a font.<br />
If this data structure is present in a PDF file <strong>TET</strong> will use it. Alternatively, a ToUnicode<br />
CMap can be supplied in an external file. This is useful when a ToUnicode CMap in the<br />
PDF is incomplete, contains wrong entries, or is missing. A ToUnicode CMap will take<br />
precedence over a code list. However, code lists use an easier format the ToUnicode<br />
CMaps so they are the preferred format.<br />
By convention, CMaps don’t use any file name suffix. ToUnicode CMaps can be configured<br />
with the cmap resource (see Section 5.2, »Resource Configuration and File<br />
Searching«, page 51). The contents of a cmap resource must adhere to the standard CMap<br />
syntax. 1 In order to apply a ToUnicode CMap to all fonts in the Warnock family use the<br />
following option to <strong>TET</strong>_open_document( ):<br />
1. See partners.adobe.com/public/developer/en/acrobat/5411.ToUnicode.pdf<br />
78 Chapter 6: <strong>Text</strong> <strong>Extraction</strong>