17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Legacy PDF documents with missing Unicode values. In some situations you must process PDF documents created by legacy applications, and these PDFs may not contain enough information for proper Unicode mapping. With the default settings TET may be unable to extract some or all of the text contents. Recommendations:

> Start by extracting the text with default settings, and analyze the results. Identify the fonts which do not provide enough information for proper Unicode mapping.

> Write custom encoding tables and glyph name lists to fix problematic fonts. Use the PDFlib FontReporter plugin for analyzing the fonts and preparing Unicode mapping tables.

> Configure the custom mapping tables and extract the text again, using a larger number of documents. If there are still unmappable glyphs or fonts, adjust the mapping tables as appropriate.

> If you have a large number of documents with unmappable fonts, PDFlib GmbH may be able to assist you in creating the required mapping tables.
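
The first recommendation, identifying which fonts lack usable Unicode mappings, can be sketched as a small tally over extracted characters. The snippet below does not call TET; it assumes a hypothetical list of `(font_name, unicode_value)` pairs collected during extraction, and it assumes that unmappable glyphs show up as U+FFFD (the Unicode replacement character). The record layout, the font names, and the function name are illustrative only.

```python
from collections import Counter

# Assumption: unmappable glyphs are reported as U+FFFD (REPLACEMENT CHARACTER).
UNKNOWN_CHAR = 0xFFFD


def fonts_with_unmapped_glyphs(chars):
    """Count unmapped glyphs per font.

    `chars` is a hypothetical iterable of (font_name, unicode_value) pairs
    collected while extracting text; it is not a TET data structure.
    """
    counts = Counter()
    for font_name, unicode_value in chars:
        if unicode_value == UNKNOWN_CHAR:
            counts[font_name] += 1
    # Fonts with the most unmapped glyphs come first.
    return counts.most_common()


# Made-up sample data for illustration.
sample = [
    ("LegacySans", 0x0041),
    ("LegacySymbols", 0xFFFD),
    ("LegacySymbols", 0xFFFD),
    ("CorpLogo", 0xFFFD),
]
report = fonts_with_unmapped_glyphs(sample)
```

Fonts at the top of such a report would be the first candidates for a custom mapping table.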

Convert PDF documents to another format. If you want to import the page contents of PDF documents into your application while retaining as much information as possible, you’ll need precise character metrics. Recommendations:

> Use TET_get_char_info() to retrieve precise character metrics and font names. Even if you use the uv field to retrieve the Unicode values of individual characters, you must also call TET_get_text() since it fills the char_info structure.

> Use granularity=glyph or granularity=word in TET_open_page(), depending on which is better suited for your application.

Corporate fonts with custom-encoded logos. In many cases corporate fonts containing custom logos have missing or wrong Unicode mapping information for the logos. If you have a large number of PDF documents containing such fonts, it is recommended to create a custom mapping table with proper Unicode values.

Start by creating a font report (see »Analyzing PDF documents with the PDFlib FontReporter plugin«, page 76) for a PDF containing the font, and locate mismapped glyphs in the font report. Depending on the font type you can use any of the available configuration tables to provide the missing Unicode mappings. See »Code list resources for all font types«, page 77, for a detailed example of a code list for a logotype font.
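
Conceptually, such a mapping table associates each glyph code in the font with the Unicode text it should produce. The sketch below illustrates only that idea in plain Python; it does not use TET or the code list file format, and the codes, the replacement strings, and the helper name `map_codes` are made up for illustration.

```python
# Hypothetical mapping for a logotype font: glyph code -> Unicode text.
# The codes and replacement strings are made up for illustration.
LOGO_CODE_TO_UNICODE = {
    0x01: "ACME",
    0x02: "\u00ae",  # REGISTERED SIGN
}


def map_codes(codes, table, unknown="\ufffd"):
    """Translate glyph codes to text, using `unknown` for unmapped codes."""
    return "".join(table.get(code, unknown) for code in codes)


# Code 0x7F is not in the table, so it falls back to the unknown character.
text = map_codes([0x01, 0x02, 0x7F], LOGO_CODE_TO_UNICODE)
```

Codes that are missing from the table fall back to a placeholder, which makes remaining gaps easy to spot when the extraction is repeated.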

TeX documents. PDF documents produced with TeX often contain numerical glyph names, Type 3 fonts, and other features which prevent other products from successfully extracting the text. TET contains many heuristics and workarounds for dealing with such documents. However, a particular flavor of TeX documents can only be processed with a workaround that requires more processing time and is therefore disabled by default. You can enable this more CPU-intensive font processing for these documents with the following document option:

checkglyphlists=true<br />

56 Chapter 5: Configuration
