17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


6.8 Advanced Unicode Mapping Controls<br />

<strong>TET</strong> implements many workarounds for PDF documents which do not actually contain<br />

Unicode values, so that it can extract the text nevertheless. However, for some documents<br />

the text still cannot be extracted because the PDF and the relevant font data structures<br />

do not provide enough information. <strong>TET</strong> contains various configuration features<br />

which can be used to supply additional Unicode mapping information. These features<br />

are detailed in this section.<br />

Summary of Unicode mapping controls. Using the glyphmapping option of<br />

<strong>TET</strong>_open_document( ) (see Section 10.4, »Document Functions«, page 129) you can control Unicode<br />

mapping for glyphs in several ways. The following list gives an overview of available<br />

methods (which can be combined). These controls can be applied on a per-font basis or<br />

globally for all fonts in a document:<br />

> The suboption forceencoding can be used to completely override all occurrences of<br />

the predefined PDF encodings WinAnsiEncoding or MacRomanEncoding.<br />

> The suboptions codelist and tounicodecmap can be used to supply Unicode values in a<br />

simple text format (a codelist resource).<br />

> The suboption glyphlist can be used to supply Unicode values for non-standard glyph<br />

names.<br />

> The suboption glyphrule can be used to define a rule which will be used to derive Unicode<br />

values from numerical glyph names in an algorithmic way. Several rules are already<br />

built into <strong>TET</strong>. The option encodinghint can be used to control the internal<br />

rules.<br />

> In addition to dozens of predefined encodings, custom encodings can be defined for<br />

use with the encodinghint option or the encoding suboption of the glyphrule option.<br />

> External fonts can be configured to provide Unicode mapping information if the<br />

PDF does not provide enough information and the font is not embedded in the PDF.<br />
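
The glyphrule idea in the list above can be illustrated with a small Python sketch that derives a Unicode value from a numerical glyph name. This is not TET code and does not use the TET API: the function name, its parameters, and the use of Python's cp1252 codec as a stand-in for a WinAnsi-style encoding are assumptions made purely for this example.

```python
from typing import Optional


def unicode_from_glyph_name(name: str, prefix: str, base: int,
                            encoding: str = "cp1252") -> Optional[str]:
    """Derive a character from a numerical glyph name such as 'c41'.

    Hypothetical helper, not part of TET: strip the prefix, read the rest
    as a number in the given base, and look the resulting code up in an
    8-bit encoding (cp1252 is only a stand-in chosen for this sketch).
    """
    if not name.startswith(prefix):
        return None                      # the rule does not apply to this name
    digits = name[len(prefix):]
    try:
        code = int(digits, base)         # e.g. '41' in base 16 -> 0x41
    except ValueError:
        return None                      # remainder is not a number in this base
    if not 0 <= code <= 0xFF:
        return None                      # outside the 8-bit encoding
    try:
        return bytes([code]).decode(encoding)
    except UnicodeDecodeError:
        return None                      # code is unassigned in the encoding


if __name__ == "__main__":
    # Hexadecimal and decimal glyph names that both denote code 0x41
    print(unicode_from_glyph_name("c41", "c", 16))
    print(unicode_from_glyph_name("c065", "c", 10))
```

A rule of this kind only helps if the glyph names in a font really follow a numbering scheme, which is why inspecting the font first is worthwhile.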

Analyzing PDF documents with the <strong>PDFlib</strong> FontReporter plugin¹. In order to obtain<br />

the information required to create appropriate Unicode mapping tables you must analyze<br />

the problematic PDF documents.<br />

<strong>PDFlib</strong> GmbH provides a free companion product to <strong>TET</strong> which assists in this situation:<br />

<strong>PDFlib</strong> FontReporter is an Adobe Acrobat plugin for easily collecting font, encoding,<br />

and glyph information. The plugin creates detailed font reports containing the actual<br />

glyphs along with the following information:<br />

> The corresponding code: the first hex digit is given in the left-most column, the second<br />

hex digit is given in the top row. For CID fonts the offset printed in the header<br />

must be added to obtain the code corresponding to the glyph.<br />

> The glyph name if present.<br />

> The Unicode value(s) corresponding to the glyph (if Acrobat can determine them).<br />
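
The way a code is read from a font report, as described in the list above, comes down to simple arithmetic. The following Python sketch shows it under stated assumptions: the function name and parameters are hypothetical, and the CID offset is assumed to have been read from the report header and converted to an integer by the caller.

```python
def glyph_code(row_digit: str, column_digit: str, cid_offset: int = 0) -> int:
    """Combine the two hex digits of a font report cell into a code.

    row_digit    -- first hex digit, taken from the left-most column
    column_digit -- second hex digit, taken from the top row
    cid_offset   -- offset printed in the header (CID fonts only);
                    leave it at 0 for other fonts
    """
    if len(row_digit) != 1 or len(column_digit) != 1:
        raise ValueError("each argument must be a single hex digit")
    high = int(row_digit, 16)            # raises ValueError for non-hex input
    low = int(column_digit, 16)
    return cid_offset + ((high << 4) | low)


if __name__ == "__main__":
    # Row 4, column 1 in a simple font
    print(hex(glyph_code("4", "1")))             # 0x41
    # Row A, column F on a CID font page whose header offset is 0x0100
    print(hex(glyph_code("A", "F", 0x0100)))     # 0x1af
```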

These pieces of information play an important role for <strong>TET</strong>’s glyph mapping controls.<br />

Figure 6.5 shows two pages from a sample font report. Font reports created with the<br />

FontReporter plugin can be used to analyze PDF fonts and create mapping tables for<br />

successfully extracting the text with <strong>TET</strong>. We strongly recommend consulting the<br />

corresponding font report before you write Unicode mapping tables or glyph name<br />

heuristics to control text extraction with <strong>TET</strong>.<br />

1. The <strong>PDFlib</strong> FontReporter plugin is available for free download at www.pdflib.com/products/fontreporter<br />
