17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Recognize superscript, subscript and dropcaps (large initial characters at the start of a paragraph)

pCOS interface for simple access to PDF objects. TET includes pCOS (PDFlib Comprehensive Object System) for retrieving arbitrary PDF objects. With pCOS you can retrieve PDF metadata, interactive elements (e.g. bookmark text, contents of form fields), or any other information from a PDF document with a simple query interface. The syntax of pCOS query paths is described separately in the pCOS Path Reference.

TET Markup Language (TETML). The information retrieved from a PDF document can be presented in an XML format called TET Markup Language (TETML) for processing with standard XML tools. TETML contains text, image, and metadata information and can optionally also contain font- and geometry-related details.
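As a rough illustration of what processing with standard XML tools can look like, the following Python sketch collects the words on each page using only the standard library. The embedded sample is a hypothetical, simplified TETML-like fragment: its element names, attributes, and namespace URI are assumptions made for this example and should be checked against the TETML schema shipped with TET. The helper names `local_name` and `words_per_page` are likewise invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TETML-like fragment. Element names, attributes and
# the namespace URI are assumptions for illustration, not the real schema.
SAMPLE = """\
<TET xmlns="http://example.com/tetml-sketch">
  <Document filename="sample.pdf">
    <Pages>
      <Page number="1">
        <Para>
          <Word><Text>Hello</Text></Word>
          <Word><Text>world</Text></Word>
        </Para>
      </Page>
      <Page number="2">
        <Para>
          <Word><Text>Second</Text></Word>
          <Word><Text>page</Text></Word>
        </Para>
      </Page>
    </Pages>
  </Document>
</TET>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def words_per_page(xml_text):
    """Return {page number: [word, ...]} for a TETML-like document."""
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        # Collect the text of every Text element below this page.
        words = [
            el.text.strip()
            for el in page.iter()
            if local_name(el.tag) == "Text" and el.text
        ]
        pages[int(page.get("number"))] = words
    return pages


if __name__ == "__main__":
    for number, words in words_per_page(SAMPLE).items():
        print(number, " ".join(words))
```

Matching on local element names keeps the sketch independent of the exact namespace URI, which may differ between TET versions.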

What is text? While TET handles a large class of PDF documents, not all visible text can be extracted successfully. The text must be encoded using PDF’s text and encoding facilities (i.e., it must be based on a font). The following kinds of text may be visible on the page, but cannot be extracted with TET:

> Rasterized (pixel image) text, e.g. scanned pages;
> Text which is directly represented by vector elements without any font.

Note that metadata and text in hypertext elements (such as bookmarks, form fields, notes, or annotations) can be retrieved with the pCOS interface. On the other hand, TET may extract some text which is not visible on the page. This may happen in the following situations:

> Text using PDF’s invisible attribute (however, there is an option to exclude this kind of text from the text retrieval process);
> Text which is obscured or clipped by some other element on the page, e.g. an image;
> Text on hidden layers: TET ignores PDF layers and retrieves the text from all layers regardless of their visibility.

1.2 Many ways to use TET

TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features, but are suitable for different deployment tasks. Both the TET library and command-line tool can create TETML, TET’s XML-based output format.

> The TET programming library can be used for integration into your desktop or server application. Many different programming languages are supported. Examples for using the TET library with all supported language bindings are included in the TET package.
> The TET command-line tool is suited for batch processing PDF documents. It doesn’t require any programming, but offers command-line options which can be used to integrate it into complex workflows.
> TETML output is suited for XML-based workflows and developers who are familiar with the wide range of XML processing tools and languages, e.g. XSLT.
> TET connectors are suited for integrating TET in various common software packages, e.g. databases and search engines.
