PDFlib Text Extraction Toolkit (TET) Manual
Recognize superscript, subscript and dropcaps (large initial characters at the start of
a paragraph)
pCOS interface for simple access to PDF objects. TET includes pCOS (PDFlib Comprehensive
Object System) for retrieving arbitrary PDF objects. With pCOS you can retrieve
PDF metadata, interactive elements (e.g. bookmark text, contents of form fields), or any
other information from a PDF document with a simple query interface. The syntax of
pCOS query paths is described separately in the pCOS Path Reference.
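As a minimal sketch of what such a query looks like from a program, the following Python function reads the page count and a few document info entries through pCOS paths. It assumes an object that exposes `pcos_get_number()` and `pcos_get_string()` as the TET language bindings do, and it assumes the pCOS paths `length:pages`, `/Info/<key>` and the `type:` prefix; check these against the pCOS Path Reference before relying on them. The `FakeTet` class and the `read_basic_info` helper are hypothetical and exist only so the example runs without TET installed.

```python
from typing import Any, Dict

# Document info keys to look up; an illustrative selection, not a complete list.
INFO_KEYS = ("Title", "Author", "Subject")


def read_basic_info(tet: Any, doc: int) -> Dict[str, Any]:
    """Collect the page count and available info entries via pCOS paths."""
    info: Dict[str, Any] = {
        "pages": int(tet.pcos_get_number(doc, "length:pages"))
    }
    for key in INFO_KEYS:
        path = "/Info/" + key
        # Assumption: the "type:" prefix reports the object type, so entries
        # that are missing (or not strings) can be skipped safely.
        if tet.pcos_get_string(doc, "type:" + path) == "string":
            info[key] = tet.pcos_get_string(doc, path)
    return info


class FakeTet:
    """Hypothetical stand-in for a TET object; not part of the TET package."""

    def __init__(self, numbers: Dict[str, float], strings: Dict[str, str]):
        self._numbers = numbers
        self._strings = strings

    def pcos_get_number(self, doc: int, path: str) -> float:
        return self._numbers[path]

    def pcos_get_string(self, doc: int, path: str) -> str:
        if path.startswith("type:"):
            return "string" if path[len("type:"):] in self._strings else "null"
        return self._strings[path]


fake = FakeTet({"length:pages": 3}, {"/Info/Title": "Demo"})
info = read_basic_info(fake, doc=0)
print(info)
```

With the real library, the same function would presumably receive a TET object and the document handle returned when the PDF was opened, in place of the stand-in.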
TET Markup Language (TETML). The information retrieved from a PDF document can
be presented in an XML format called TET Markup Language (TETML) for processing
with standard XML tools. TETML contains text, image, and metadata information and
can optionally also contain font- and geometry-related details.
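To illustrate processing with standard XML tools, here is a minimal sketch that uses Python's built-in `xml.etree.ElementTree` to collect the words on each page. The embedded sample is a simplified, hypothetical fragment: the element names (`Page`, `Para`, `Word`, `Text`) only mimic the general shape of TETML, and real TETML output uses an XML namespace and carries many more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical TETML-like sample (no namespace, no geometry).
SAMPLE = """<TET>
  <Document filename="sample.pdf">
    <Pages>
      <Page number="1">
        <Para>
          <Word><Text>Hello</Text></Word>
          <Word><Text>TETML</Text></Word>
        </Para>
      </Page>
      <Page number="2">
        <Para>
          <Word><Text>Second</Text></Word>
          <Word><Text>page</Text></Word>
        </Para>
      </Page>
    </Pages>
  </Document>
</TET>"""


def words_per_page(xml_text):
    """Map each page number to the list of word texts found on that page."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter("Page"):
        number = int(page.get("number"))
        result[number] = [text.text for text in page.iter("Text")]
    return result


pages = words_per_page(SAMPLE)
for number, words in pages.items():
    print(number, " ".join(words))
```

For real TETML files the element lookups would need the namespace from the file's root element, for example through the namespace-qualified form `{namespace-uri}Page` that ElementTree expects.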
What is text? While TET deals with a large class of PDF documents, not all visible text
can successfully be extracted. The text must be encoded using PDF’s text and encoding
facilities (i.e., it must be based on a font). Although the following flavors of text may be
visible on the page, they cannot be extracted with TET:
> Rasterized (pixel image) text, e.g. scanned pages;
> Text which is directly represented by vector elements without any font.
Note that metadata and text in hypertext elements (such as bookmarks, form fields,
notes, or annotations) can be retrieved with the pCOS interface. On the other hand, TET
may extract some text which is not visible on the page. This may happen in the following
situations:
> Text using PDF’s invisible attribute (however, there is an option to exclude this kind
of text from the text retrieval process).
> Text which is obscured or clipped by some other element on the page, e.g. an image.
> PDF layers are ignored; TET will retrieve the text from all layers regardless of their
visibility.
1.2 Many ways to use TET
TET is available as a programming library (component) for various development environments,
and as a command-line tool for batch operations. Both offer similar features,
but are suitable for different deployment tasks. Both the TET library and command-line
tool can create TETML, TET’s XML-based output format.
> The TET programming library can be used for integration into your desktop or server
application. Many different programming languages are supported. Examples for
using the TET library with all supported language bindings are included in the TET
package.
> The TET command-line tool is suited for batch processing PDF documents. It doesn’t
require any programming, but offers command-line options which can be used to
integrate it into complex workflows.
> TETML output is suited for XML-based workflows and developers who are familiar
with the wide range of XML processing tools and languages, e.g. XSLT.
> TET connectors are suited for integrating TET in various common software packages,
e.g. databases and search engines.
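As a rough sketch of how the command-line tool might be wired into a batch workflow, the following Python code builds one command per PDF file in a directory. The binary name `tet` and the options `--tetml` and `--targetdir` are assumptions here and should be verified against the description of the command-line tool; the helper names `build_tet_command` and `build_batch` are hypothetical. The code only constructs the argument lists and does not run anything.

```python
from pathlib import Path
from typing import List


def build_tet_command(pdf: Path, target_dir: Path,
                      tet_binary: str = "tet") -> List[str]:
    """Return the argument list for one run that should produce TETML."""
    # Assumed options: --tetml selects TETML output (here at word level),
    # --targetdir names the output directory.
    return [tet_binary, "--tetml", "word",
            "--targetdir", str(target_dir), str(pdf)]


def build_batch(pdf_dir: Path, target_dir: Path) -> List[List[str]]:
    """Build one command per PDF file found directly in pdf_dir."""
    return [build_tet_command(pdf, target_dir)
            for pdf in sorted(pdf_dir.glob("*.pdf"))]


command = build_tet_command(Path("report.pdf"), Path("out"))
print(" ".join(command))
```

Each argument list could then be passed to `subprocess.run(cmd, check=True)`, so that a failing document stops the batch instead of being skipped silently.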