17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual



1 Introduction

The PDFlib Text Extraction Toolkit (TET) is targeted at extracting text and images from PDF documents, but can also be used to retrieve other information from PDF. TET can be used as a base component for realizing the following tasks:

> search the text contents of PDF
> create a list of all words contained in a PDF (concordance)
> implement a search engine for processing large numbers of PDF files
> extract text from PDF to store, translate, or otherwise repurpose it
> convert the text contents of PDF to other formats
> process or enhance PDFs based on their contents
> compare the text contents of multiple PDF documents
> extract the raster images from PDF for repurposing
> extract metadata and other information from PDF

TET has been designed for standalone use, and does not require any third-party software. It is robust and suitable for multi-threaded server use.

1.1 Overview of TET Features

Supported PDF input. TET has been tested against thousands of PDF test files from various sources. It accepts PDF 1.0 up to PDF 1.7 extension level 3 (corresponding to Acrobat 1-9) as well as encrypted documents.

Unicode support. TET includes a considerable number of algorithms and data to achieve reliable Unicode mappings for all text. Although text in PDF documents is not usually encoded in Unicode, TET will normalize the text from a PDF document to Unicode:

> TET converts all text contents to Unicode. In C the text will be returned in UTF-8 or UTF-16 format; in other language bindings as native Unicode strings.
> Ligatures and other multi-character glyphs will be decomposed into a sequence of their constituent Unicode characters.
> Vendor-specific Unicode values (Corporate Use Subarea, CUS) are identified, and will be mapped to characters with precisely defined meanings if possible.
> Glyphs which are lacking Unicode mapping information are identified as such, and will be mapped to a configurable replacement character.
> UTF-16 surrogate pairs for characters outside the Basic Multilingual Plane (BMP) are properly interpreted and maintained. Surrogate pairs and UTF-32 values can be retrieved in all language bindings.
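
To make two of the points above concrete, the following is a minimal sketch of ligature decomposition and surrogate pair interpretation. It uses only Python's standard library and does not call TET; the function names decompose_ligatures and utf16_to_code_points are made up for this illustration, and the approach shown (Unicode compatibility decomposition) is an assumption, not necessarily how TET works internally.

```python
import unicodedata


def decompose_ligatures(text):
    """Expand Latin ligature code points (U+FB00..U+FB06) into plain letters.

    Illustrative only: relies on Unicode compatibility decomposition (NFKD),
    which is an assumption and not TET's documented algorithm.
    """
    pieces = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # NFKD applies the compatibility mapping, e.g. U+FB01 -> "fi"
            pieces.append(unicodedata.normalize("NFKD", ch))
        else:
            pieces.append(ch)
    return "".join(pieces)


def utf16_to_code_points(units):
    """Combine UTF-16 code units into Unicode code points (UTF-32 values)."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # A high/low surrogate pair encodes one character outside the BMP
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through unchanged)
            code_points.append(unit)
            i += 1
    return code_points


if __name__ == "__main__":
    print(decompose_ligatures("o\ufb03ce"))  # office
    print([hex(cp) for cp in utf16_to_code_points([0x41, 0xD835, 0xDC00])])  # ['0x41', '0x1d400']
```

The second function shows why surrogate pairs and UTF-32 values carry the same information: the pair 0xD835 0xDC00 and the single value U+1D400 denote the same character.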

Some PDF documents do not contain enough information for reliable Unicode mapping. To extract the text successfully nevertheless, TET offers various configuration options which can be used to supply auxiliary information for proper Unicode mappings. To facilitate writing the required mapping tables, we make available PDFlib FontReporter, a free plugin for Adobe Acrobat. This plugin can be used for analyzing fonts, encodings, and glyphs in PDF.
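
The idea of supplying auxiliary mapping information can be sketched as a simple lookup table with a replacement character as fallback. This is an illustration of the concept only: the table AUX_GLYPH_TABLE, its glyph names, the function map_glyphs, and the choice of U+FFFD are all hypothetical and do not reflect TET's actual option syntax or table formats.

```python
# Assumption for this sketch: U+FFFD serves as the replacement character.
REPLACEMENT_CHAR = "\ufffd"

# Hypothetical auxiliary table: glyph name -> Unicode string.
# A real table would be written after inspecting the font's glyphs.
AUX_GLYPH_TABLE = {
    "g001": "A",
    "g002": "\u00e4",      # a with diaeresis
    "ffi_custom": "ffi",   # one glyph mapped to several characters
}


def map_glyphs(glyph_names, aux_table, replacement=REPLACEMENT_CHAR):
    """Map glyph names to text, substituting a replacement for unknown glyphs.

    Returns the resulting text and the list of glyph names that had no mapping,
    so that the table can be extended for them later.
    """
    text = []
    unmapped = []
    for name in glyph_names:
        if name in aux_table:
            text.append(aux_table[name])
        else:
            text.append(replacement)
            unmapped.append(name)
    return "".join(text), unmapped


if __name__ == "__main__":
    result, missing = map_glyphs(["g001", "g999", "ffi_custom"], AUX_GLYPH_TABLE)
    print(repr(result), missing)
```

Collecting the unmapped glyph names separately mirrors the workflow described above: glyphs without mapping information are identified first, then a mapping table is written to cover them.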

