17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Also, to ensure text fidelity you may want to disable text extraction for text which is not visible on the page:

ignoreinvisibletext=true

Processing documents with PDFlib+PDI. When using PDFlib+PDI to process PDF documents on a per-page basis, you can integrate TET to control the splitting or merging process. For example, you could split a PDF document based on the contents of a page. If you have control over the creation process, you can insert separator pages with suitable processing instructions in the text. The TET Cookbook contains examples for analyzing documents with TET and then processing them with PDFlib+PDI.
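
The splitting decision itself can be sketched independently of any PDF library. The following minimal Python sketch assumes that the text of each page has already been extracted (for example with TET) into a list of strings. The marker string `%%SPLIT%%` and the function name `split_ranges` are hypothetical and not part of TET or PDFlib+PDI. The function computes the 1-based page ranges of the output documents and drops the separator pages themselves.

```python
from typing import List, Tuple

# Hypothetical processing instruction printed on separator pages.
SEPARATOR_MARKER = "%%SPLIT%%"


def split_ranges(page_texts: List[str],
                 marker: str = SEPARATOR_MARKER) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges between separator pages."""
    ranges = []
    start = None  # first page of the range currently being collected
    for pageno, text in enumerate(page_texts, start=1):
        if marker in text:
            # Separator page: close the open range (if any) and skip this page.
            if start is not None:
                ranges.append((start, pageno - 1))
                start = None
        elif start is None:
            # First content page after a separator (or at the beginning).
            start = pageno
    if start is not None:
        ranges.append((start, len(page_texts)))
    return ranges


pages = ["Invoice 1", "continued", "%%SPLIT%% customer=42", "Invoice 2"]
print(split_ranges(pages))  # prints [(1, 2), (4, 4)]
```

Each resulting range could then be handed to the page-import loop of the PDF processing step to create one output document per range.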

Legacy PDF documents with missing Unicode values. In some situations you must process PDF documents created by legacy applications which do not contain enough information for proper Unicode mapping. With the default settings TET may be unable to extract some or all of the text contents. Recommendations:

> Start by extracting the text with default settings, and analyze the results. Identify the fonts which do not provide enough information for proper Unicode mapping.

> Write custom encoding tables and glyph name lists to fix problematic fonts. Use the PDFlib FontReporter plugin for analyzing the fonts and preparing Unicode mapping tables.

> Configure the custom mapping tables and extract the text again, using a larger number of documents. If there are still unmappable glyphs or fonts, adjust the mapping tables as appropriate.

> If you have a large number of documents with unmappable fonts, PDFlib GmbH may be able to assist you in creating the required mapping tables.
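
The first recommendation (identify the fonts which lack mapping information) can be sketched as a small analysis over per-glyph results. This is only an illustration under stated assumptions: the input is a list of (font name, Unicode value) pairs collected by the calling application, and unmapped glyphs are assumed to show up as U+FFFD. Adjust the constant if a different replacement character is configured. The function names are hypothetical and not part of the TET API.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumption: glyphs without a usable mapping are reported as U+FFFD.
REPLACEMENT_CHAR = 0xFFFD


def unmapped_ratio_per_font(glyphs: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return, per font name, the fraction of glyphs without a usable Unicode value."""
    total: Counter = Counter()
    unmapped: Counter = Counter()
    for font, uv in glyphs:
        total[font] += 1
        if uv == REPLACEMENT_CHAR:
            unmapped[font] += 1
    return {font: unmapped[font] / total[font] for font in total}


def problem_fonts(glyphs: Iterable[Tuple[str, int]],
                  threshold: float = 0.0) -> List[str]:
    """List fonts whose unmapped fraction exceeds the threshold, sorted by name."""
    ratios = unmapped_ratio_per_font(glyphs)
    return sorted(font for font, ratio in ratios.items() if ratio > threshold)


sample = [("LegacyFont", 0x41), ("LegacyFont", 0xFFFD), ("Helvetica", 0x42)]
print(problem_fonts(sample))  # prints ['LegacyFont']
```

Running such a report over a growing set of documents shows whether the custom mapping tables have reduced the number of problematic fonts.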

Convert PDF documents to another format. If you want to import the page contents of PDF documents into your application while retaining as much information as possible, you’ll need precise character metrics. Recommendations:

> Use TET_get_char_info() to retrieve precise character metrics and font names. Even if you use the uv field to retrieve the Unicode values of individual characters, you must also call TET_get_text() since it fills the char_info structure.

> Use granularity=glyph or word in TET_open_page(), depending on what is better suited for your application. Working with granularity=glyph may result in conflicts between the visual layout of text and the processed logical text created by TET (e.g. the two characters created by a ligature glyph may not fit into the same space as the ligature).
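
The ligature conflict mentioned above can be made concrete with a small sketch. It shows one possible way for an importing application to place the logical characters: divide the box of the single ligature glyph evenly among the characters it maps to. This is an approximation chosen for illustration, not something TET does; the function name and the (character, x, width) tuple layout are hypothetical.

```python
from typing import List, Tuple


def split_glyph_box(x: float, width: float, text: str) -> List[Tuple[str, float, float]]:
    """Divide one glyph box evenly among the characters the glyph maps to.

    Returns a list of (character, x position, width) tuples.
    """
    if not text:
        return []
    step = width / len(text)
    return [(ch, x + i * step, step) for i, ch in enumerate(text)]


# A ligature glyph 6 units wide at x=100 which maps to the two characters "fi":
print(split_glyph_box(100.0, 6.0, "fi"))
# prints [('f', 100.0, 3.0), ('i', 103.0, 3.0)]
```

The evenly divided boxes are narrower than the separate "f" and "i" glyphs of the same font would be, which is exactly why the logical text may not fit the visual layout.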

Corporate fonts with custom-encoded logos. In many cases corporate fonts containing custom logos have missing or wrong Unicode mapping information for the logos. If you have a large number of PDF documents containing such fonts, it is recommended to create a custom mapping table with proper Unicode values.

Start by creating a font report (see »Analyzing PDF documents with the PDFlib FontReporter Plugin«, page 108) for a PDF containing the font, and locate mismapped glyphs in the font report. Depending on the font type you can use any of the available configuration tables to provide the missing Unicode mappings. See »Code list resources for all font types«, page 109, for a detailed example of a code list for a logotype font.
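
Conceptually, such a table maps the glyph codes of one font to the intended Unicode values. The following sketch illustrates the idea with an in-memory Python dictionary. It does not reproduce the syntax of TET's configuration tables; the font name `ExampleCorpLogos`, the codes, and the function `remap` are hypothetical. Values from the Unicode Private Use Area (starting at U+E000) are used here because logos usually have no standard Unicode code point.

```python
from typing import Dict

# Hypothetical mapping table: font name -> {glyph code -> Unicode value}.
LOGO_MAPPINGS: Dict[str, Dict[int, int]] = {
    "ExampleCorpLogos": {
        0x01: 0xE000,  # company logo
        0x02: 0xE001,  # product logo
    },
}


def remap(font: str, code: int, uv: int,
          mappings: Dict[str, Dict[int, int]] = LOGO_MAPPINGS) -> int:
    """Return the custom Unicode value for (font, code), or the original value."""
    return mappings.get(font, {}).get(code, uv)


# A mismapped logo glyph gets its custom value; other fonts are left unchanged.
print(hex(remap("ExampleCorpLogos", 0x01, 0x0041)))  # prints 0xe000
print(hex(remap("Helvetica", 0x01, 0x0041)))         # prints 0x41
```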

5.3 Recommendations for common Scenarios
