PDFlib Text Extraction Toolkit (TET) Manual
Also, to ensure text fidelity you may want to disable extraction of text which is
not visible on the page:
ignoreinvisibletext=true
Processing documents with PDFlib+PDI. When using PDFlib+PDI to process PDF documents
on a per-page basis you can integrate TET to control the splitting or merging
process. For example, you could split a PDF document based on the contents of a page. If
you have control over the creation process you can insert separator pages with suitable
processing instructions in the text. The TET Cookbook contains examples for analyzing
documents with TET and then processing them with PDFlib+PDI.
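The separator-page idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text of each page is assumed to have been extracted already (for example with TET) and is passed in as a list of strings, the marker string `%%SPLIT%%` and the function name `split_ranges` are hypothetical, and the resulting page ranges would still have to be handed to PDFlib+PDI for the actual splitting.

```python
from typing import List, Tuple

# Hypothetical marker text printed on the separator pages (an assumption).
SEPARATOR_MARK = "%%SPLIT%%"


def split_ranges(page_texts: List[str],
                 marker: str = SEPARATOR_MARK) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) page ranges between separator pages.

    page_texts[i] holds the extracted text of page i+1. Pages containing
    the marker are treated as separators and are not part of any range.
    """
    ranges: List[Tuple[int, int]] = []
    start = None  # first page of the range currently being collected
    for pageno, text in enumerate(page_texts, start=1):
        if marker in text:
            if start is not None:
                ranges.append((start, pageno - 1))
            start = None
        elif start is None:
            start = pageno
    if start is not None:
        ranges.append((start, len(page_texts)))
    return ranges


if __name__ == "__main__":
    pages = ["Invoice 1", "details", "%%SPLIT%% next", "Invoice 2"]
    print(split_ranges(pages))
```

Each returned range corresponds to one output document; the separator pages themselves are dropped.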
Legacy PDF documents with missing Unicode values. In some situations you must process
PDF documents which were created by legacy applications and may not contain
enough information for proper Unicode mapping. With the default settings TET may
be unable to extract some or all of the text contents. Recommendations:
> Start by extracting the text with default settings, and analyze the results. Identify
the fonts which do not provide enough information for proper Unicode mapping.
> Write custom encoding tables and glyph name lists to fix problematic fonts. Use the
PDFlib FontReporter plugin for analyzing the fonts and preparing Unicode mapping
tables.
> Configure the custom mapping tables and extract the text again, using a larger number
of documents. If there are still unmappable glyphs or fonts, adjust the mapping
tables as appropriate.
> If you have a large number of documents with unmappable fonts, PDFlib GmbH may
be able to assist you in creating the required mapping tables.
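The first recommendation (extract with default settings, then identify the problematic fonts) can be supported by a small tally over the extracted characters. The following Python sketch is an assumption-laden illustration: it expects the caller to supply (font name, Unicode value) pairs collected during extraction, it assumes that unmappable glyphs show up as the replacement character U+FFFD, and the function name `unmapped_by_font` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed placeholder value reported for glyphs without a Unicode mapping.
REPLACEMENT = 0xFFFD


def unmapped_by_font(chars: Iterable[Tuple[str, int]],
                     placeholder: int = REPLACEMENT) -> Dict[str, float]:
    """Return the fraction of unmapped characters for each affected font.

    chars yields (fontname, unicode_value) pairs; fonts without any
    unmapped character are omitted from the result.
    """
    total: Counter = Counter()
    bad: Counter = Counter()
    for fontname, uv in chars:
        total[fontname] += 1
        if uv == placeholder:
            bad[fontname] += 1
    return {font: bad[font] / total[font] for font in total if bad[font]}


if __name__ == "__main__":
    sample = [("Helvetica", 0x41), ("LegacyFont", 0xFFFD), ("LegacyFont", 0x42)]
    print(unmapped_by_font(sample))
```

Fonts with a high fraction are the first candidates for custom mapping tables; rerunning the tally after configuring the tables shows whether unmappable glyphs remain.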
Convert PDF documents to another format. If you want to import the page contents of
PDF documents into your application while retaining as much information as possible,
you’ll need precise character metrics. Recommendations:
> Use TET_get_char_info( ) to retrieve precise character metrics and font names. Even if
you use the uv field to retrieve the Unicode values of individual characters, you must
also call TET_get_text( ) since it fills the char_info structure.
> Use granularity=glyph or word in TET_open_page( ), depending on what is better suited
for your application. Working with granularity=glyph may result in conflicts between
the visual layout of text and the processed logical text created by TET (e.g. the two
characters created by a ligature glyph may not fit into the same space as the ligature).
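The call order described above (first TET_get_text( ), then TET_get_char_info( ) for the characters of that text chunk) can be sketched as a nested loop. The Python code below is a sketch rather than a reference implementation: the method names `get_text` and `get_char_info` are assumed to mirror the C function names, the return conventions (`None` when no more text is available, a false value when no more character information is available) and the dict-like character records are assumptions, and `chunks_with_metrics` is a hypothetical helper name. The `tet` argument can be any object offering these two methods.

```python
from typing import Any, Dict, List, Tuple


def chunks_with_metrics(tet: Any, page: Any) -> List[Tuple[str, List[Dict]]]:
    """Collect (text, [char_info, ...]) for every text chunk on a page.

    Assumes tet.get_text(page) returns the next chunk or None, and
    tet.get_char_info(page) returns a dict-like record for the next
    character of that chunk, or a false value when the chunk is done.
    """
    result: List[Tuple[str, List[Dict]]] = []
    # The text must be fetched first: it fills the character information.
    text = tet.get_text(page)
    while text is not None:
        infos: List[Dict] = []
        ci = tet.get_char_info(page)
        while ci:
            infos.append(dict(ci))  # copy in case the record is reused
            ci = tet.get_char_info(page)
        result.append((text, infos))
        text = tet.get_text(page)
    return result
```

With granularity=word each chunk would be one word; with granularity=glyph each chunk would be the text of a single glyph, which may consist of more than one character in the ligature case.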
Corporate fonts with custom-encoded logos. In many cases corporate fonts containing
custom logos have missing or wrong Unicode mapping information for the logos. If
you have a large number of PDF documents containing such fonts it is recommended to
create a custom mapping table with proper Unicode values.
Start by creating a font report (see »Analyzing PDF documents with the PDFlib
FontReporter Plugin«, page 108) for a PDF containing the font, and locate mismapped glyphs
in the font report. Depending on the font type you can use any of the available configuration
tables to provide the missing Unicode mappings. See »Code list resources for all
font types«, page 109, for a detailed example of a code list for a logotype font.
5.3 Recommendations for common Scenarios 67