17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Fig. 6.6<br />

The font report for a logotype font shows that the font contains wrong Unicode mappings.<br />

A custom code list can correct such mappings.<br />

By convention, code lists use the file name suffix .cl. Code lists can be configured with<br />

the codelist resource. If no code list resource has been specified explicitly, <strong>TET</strong> will search<br />

for a file named .gl (where is the resource name) in the searchpath<br />

hierarchy (see Section 5.2, »Resource Configuration and File Searching«, page 51 for<br />

details). In other words: if the resource name and the file name (without the .cl suffix)<br />

are identical you don’t have to configure the resource since <strong>TET</strong> will implicitly do the<br />

equivalent of the following call (where name is an arbitrary resource name):<br />

<strong>TET</strong>_set_option(tet, "codelist {name name.cl}");<br />

The following sample demonstrates the use of code lists. Consider the mismapped logotype<br />

glyphs in Figure 6.6 where a single glyph of the font actually represents multiple<br />

characters, and all characters together create the company logotype. However, the<br />

glyphs are wrongly mapped to the characters a, b, c, d, and e. In order to fix this you<br />

could create the following code list:<br />

% Unicode mappings for codes in the GlobeLogosOne font<br />

x61 x0054 x0068 x0065 x0020 % The<br />

x62 x0042 x006F % Bo<br />

x63 x0073 x0074 x006F x006E x0020 % ston<br />

x64 x0047 x006C x006F % Glo<br />

x65 x0062 x0065 % be<br />

Then supply the codelist with the following option to <strong>TET</strong>_open_document( ) (assuming<br />

the code list is available in a file called GlobeLogosOne.cl and can be found via the search<br />

path):<br />

glyphmapping {{fontname=GlobeLogosOne codelist=GlobeLogosOne}}<br />

ToUnicode CMap resources for all font types. PDF supports a data structure called<br />

ToUnicode CMap which can be used to provide Unicode values for the glyphs of a font.<br />

If this data structure is present in a PDF file <strong>TET</strong> will use it. Alternatively, a ToUnicode<br />

CMap can be supplied in an external file. This is useful when a ToUnicode CMap in the<br />

PDF is incomplete, contains wrong entries, or is missing. A ToUnicode CMap will take<br />

precedence over a code list. However, code lists use an easier format the ToUnicode<br />

CMaps so they are the preferred format.<br />

By convention, CMaps don’t use any file name suffix. ToUnicode CMaps can be configured<br />

with the cmap resource (see Section 5.2, »Resource Configuration and File<br />

Searching«, page 51). The contents of a cmap resource must adhere to the standard CMap<br />

syntax. 1 In order to apply a ToUnicode CMap to all fonts in the Warnock family use the<br />

following option to <strong>TET</strong>_open_document( ):<br />

1. See partners.adobe.com/public/developer/en/acrobat/5411.ToUnicode.pdf<br />

78 Chapter 6: <strong>Text</strong> <strong>Extraction</strong>

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!