17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


... already built into TET. The option encodinghint can be used to control the internal rules.

> In addition to dozens of predefined encodings, custom encodings can be defined for use with the encodinghint option or the encoding suboption of the glyphrule option.

> External fonts can be configured to provide Unicode mapping information if the PDF does not provide enough information and the font is not embedded in the PDF.
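
As a rough illustration of how the options named above might be combined, the following sketch only assembles an option list string and does not call TET itself. It rests on several assumptions that should be checked against the TET option reference: that encodinghint is passed as a document option, that glyphrule is supplied inside a glyphmapping option list together with a fontname pattern, and that glyphrule takes prefix, base, and encoding suboptions. The encoding name `my_custom_enc`, the font pattern `T*`, and both helper functions are hypothetical.

```python
def glyphrule_entry(fontname, prefix, base, encoding):
    """Build one glyph mapping entry (assumed nesting: fontname + glyphrule)."""
    rule = "prefix=" + prefix + " base=" + base + " encoding=" + encoding
    return "{fontname=" + fontname + " glyphrule={" + rule + "}}"


def build_doc_optlist(encodinghint=None, rules=()):
    """Assemble a document option list string from the pieces above."""
    parts = []
    if encodinghint is not None:
        parts.append("encodinghint=" + encodinghint)
    if rules:
        # Assumption: all glyph rules live inside one glyphmapping list.
        parts.append("glyphmapping={" + " ".join(rules) + "}")
    return " ".join(parts)


# Hypothetical values: a custom encoding used both as the hint and
# as the target encoding for glyph names such as "c41" in fonts "T*".
optlist = build_doc_optlist(
    encodinghint="my_custom_enc",
    rules=[glyphrule_entry("T*", "c", "hex", "my_custom_enc")],
)
print(optlist)
```

The resulting string would then be handed to whatever call opens the document in the chosen language binding; keeping the string construction separate makes it easy to log and compare the exact options used for a problematic PDF.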

Analyzing PDF documents with the PDFlib FontReporter Plugin¹. To obtain the information required to create appropriate Unicode mapping tables, you must analyze the problematic PDF documents.

PDFlib GmbH provides a free companion product to TET which assists in this situation: PDFlib FontReporter is an Adobe Acrobat plugin for easily collecting font, encoding, and glyph information. The plugin creates detailed font reports containing the actual glyphs along with the following information:

> The corresponding code: the first hex digit is given in the left-most column, the second hex digit is given in the top row. For CID fonts, the offset printed in the header must be added to obtain the code corresponding to the glyph.

> The glyph name, if present.

> The Unicode value(s) corresponding to the glyph (if Acrobat can determine them).
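
The code lookup described in the first item can be sketched as a small calculation. This is only an illustration of the arithmetic stated above; the function name `glyph_code` and the sample values are hypothetical and are not part of FontReporter or TET.

```python
def glyph_code(row_digit, col_digit, cid_offset=0):
    """Combine the two hex digits read from a font report table.

    row_digit:  hex digit from the left-most column (first digit)
    col_digit:  hex digit from the top row (second digit)
    cid_offset: offset printed in the table header (CID fonts only)
    """
    if len(row_digit) != 1 or len(col_digit) != 1:
        raise ValueError("each argument must be a single hex digit")
    high = int(row_digit, 16)
    low = int(col_digit, 16)
    # The first digit is the high nibble, the second the low nibble.
    return cid_offset + ((high << 4) | low)


# Simple font: row 4, column 1 gives code 0x41.
simple_code = glyph_code("4", "1")

# CID font: the same cell under a header offset of 0x0100 gives 0x0141.
cid_code = glyph_code("4", "1", cid_offset=0x0100)

print(hex(simple_code), hex(cid_code))
```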

These pieces of information play an important role in TET's glyph mapping controls. Figure 7.2 shows two pages from a sample font report. Font reports created with the FontReporter plugin can be used to analyze PDF fonts and create mapping tables for successfully extracting the text with TET. We highly recommend consulting the corresponding font report if you want to write Unicode mapping tables or glyph name heuristics to control text extraction with TET.

1. The PDFlib FontReporter plugin is available for free download at www.pdflib.com/products/fontreporter

Chapter 7: Advanced Unicode Handling
