PDFlib Text Extraction Toolkit (TET) Manual
6.8 Advanced Unicode Mapping Controls<br />
<strong>TET</strong> implements many workarounds so that it can extract text even from PDF documents<br />
which do not actually contain Unicode values.<br />
However, there are still documents where the text cannot be extracted because the PDF<br />
and the relevant font data structures do not provide enough information. <strong>TET</strong> offers<br />
various configuration features which can be used to supply additional Unicode<br />
mapping information. These features are detailed in this section.<br />
Summary of Unicode mapping controls. Using the glyphmapping option of TET_open_document( )<br />
(see Section 10.4, »Document Functions«, page 129) you can control Unicode<br />
mapping for glyphs in several ways. The following list gives an overview of available<br />
methods (which can be combined). These controls can be applied on a per-font basis or<br />
globally for all fonts in a document:<br />
> The suboption forceencoding can be used to completely override all occurrences of<br />
the predefined PDF encodings WinAnsiEncoding or MacRomanEncoding.<br />
> The suboptions codelist and tounicodecmap can be used to supply Unicode values in a<br />
simple text format (a codelist resource).<br />
> The suboption glyphlist can be used to supply Unicode values for non-standard glyph<br />
names.<br />
> The suboption glyphrule can be used to define a rule which will be used to derive Unicode<br />
values from numerical glyph names in an algorithmic way. Several rules are already<br />
built into <strong>TET</strong>. The option encodinghint can be used to control the internal<br />
rules.<br />
> In addition to dozens of predefined encodings, custom encodings can be defined for<br />
use with the encodinghint option or the encoding suboption of the glyphrule option.<br />
> External fonts can be configured to provide Unicode mapping information if the<br />
PDF does not provide enough information and the font is not embedded in the PDF.<br />
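The glyphrule idea from the list above, deriving a Unicode value from a numerical glyph name, can be sketched in a few lines of Python. This is only an illustration of the principle and not TET's implementation: the function name apply_glyph_rule, its parameters prefix and base, and the use of Python's cp1252 codec as a stand-in for a WinAnsi-like encoding are all assumptions made for this sketch.

```python
from typing import Optional

DIGITS = "0123456789abcdef"


def apply_glyph_rule(glyph_name: str, prefix: str, base: int,
                     codec: str = "cp1252") -> Optional[str]:
    """Hypothetical rule: <prefix><number> -> code -> Unicode character.

    Returns None if the glyph name does not match the rule or if the
    code has no mapping in the chosen encoding.
    """
    if not 2 <= base <= 16:
        raise ValueError("base must be between 2 and 16")
    if not glyph_name.startswith(prefix):
        return None
    number = glyph_name[len(prefix):].lower()
    allowed = DIGITS[:base]
    # Accept only plain digits of the given base (no signs, no spaces).
    if not number or any(ch not in allowed for ch in number):
        return None
    code = int(number, base)
    if code > 0xFF:
        return None  # this sketch only handles single-byte encodings
    try:
        return bytes([code]).decode(codec)
    except UnicodeDecodeError:
        return None  # the code is undefined in the encoding


print(apply_glyph_rule("g41", "g", 16))    # hex 41 -> 'A'
print(apply_glyph_rule("c65", "c", 10))    # decimal 65 -> 'A'
print(apply_glyph_rule("space", "g", 16))  # not a numerical name -> None
```

A rule of this kind only helps when the glyph names really encode character codes. Whether they do is something the font report described below can reveal.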
Analyzing PDF documents with the <strong>PDFlib</strong> FontReporter plugin¹. In order to obtain
the information required to create appropriate Unicode mapping tables you must analyze<br />
the problematic PDF documents.<br />
<strong>PDFlib</strong> GmbH provides a free companion product to <strong>TET</strong> which assists in this situation:<br />
<strong>PDFlib</strong> FontReporter is an Adobe Acrobat plugin for easily collecting font, encoding,<br />
and glyph information. The plugin creates detailed font reports containing the actual<br />
glyphs along with the following information:<br />
> The corresponding code: the first hex digit is given in the left-most column, the second<br />
hex digit is given in the top row. For CID fonts the offset printed in the header<br />
must be added to obtain the code corresponding to the glyph.<br />
> The glyph name, if present.<br />
> The Unicode value(s) corresponding to the glyph (if Acrobat can determine them).<br />
These pieces of information play an important role for <strong>TET</strong>’s glyph mapping controls.<br />
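Reading a code off a font report as described above amounts to combining two hex digits and, for CID fonts, adding the header offset. The following Python sketch shows that arithmetic. The function name report_cell_to_code and its parameter names are hypothetical and not part of FontReporter or TET.

```python
HEX_DIGITS = "0123456789abcdef"


def report_cell_to_code(row_digit: str, column_digit: str,
                        cid_offset: int = 0) -> int:
    """Combine a font report cell position into a code.

    row_digit:    first hex digit (label in the left-most column)
    column_digit: second hex digit (label in the top row)
    cid_offset:   offset printed in the report header (CID fonts only)
    """
    for digit in (row_digit, column_digit):
        if len(digit) != 1 or digit.lower() not in HEX_DIGITS:
            raise ValueError(f"expected a single hex digit, got {digit!r}")
    return cid_offset + int(row_digit + column_digit, 16)


# Simple font: row 4, column 1 gives code 0x41.
print(hex(report_cell_to_code("4", "1")))          # 0x41
# CID font with an assumed header offset of 0x0100.
print(hex(report_cell_to_code("4", "1", 0x0100)))  # 0x141
```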
Figure 6.5 shows two pages from a sample font report. Font reports created with the<br />
FontReporter plugin can be used to analyze PDF fonts and create mapping tables for<br />
successfully extracting the text with <strong>TET</strong>. We strongly recommend consulting the<br />
corresponding font report before you write Unicode mapping tables or glyph name<br />
heuristics to control text extraction with <strong>TET</strong>.<br />
1. The <strong>PDFlib</strong> FontReporter plugin is available for free download at www.pdflib.com/products/fontreporter<br />