17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


may extract some text which is not visible on the page. This may happen in the following situations:

> Text using PDF’s invisible attribute (however, there is an option to exclude this kind of text from the text retrieval process).
> Text which is obscured or clipped by some other element on the page, e.g. an image.
> Text on hidden PDF layers: layers are ignored, so TET will retrieve the text from all layers regardless of their visibility.

1.2 Many ways to use TET

TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features, but are suitable for different deployment tasks. Both the TET library and command-line tool can create TETML, TET’s XML-based output format.

> The TET programming library can be used for integration into your desktop or server application. Many different programming languages are supported. Examples for using the TET library with all supported language bindings are included in the TET package.
> The TET command-line tool is suited for batch processing PDF documents. It doesn’t require any programming, but offers command-line options which can be used to integrate it into complex workflows.
> TETML output is suited for XML-based workflows and developers who are familiar with the wide range of XML processing tools and languages, e.g. XSLT.
> TET connectors are suited for integrating TET in various common software packages, e.g. databases and search engines.
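
To illustrate the XML-based workflow mentioned above, the following minimal sketch uses Python's standard `xml.etree.ElementTree` module to collect the words on each page of an XML document. The input is a hypothetical, heavily simplified stand-in for TETML: the element and attribute names (`Document`, `Page`, `Para`, `Word`, `number`) are assumptions made for this example, and real TETML uses XML namespaces and a richer structure. The helper name `words_per_page` is likewise invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for TETML output. Real TETML uses
# XML namespaces and a richer schema; these names are assumptions.
SAMPLE = """\
<Document filename="sample.pdf">
  <Page number="1">
    <Para><Word>Hello</Word><Word>TET</Word></Para>
  </Page>
  <Page number="2">
    <Para><Word>Second</Word><Word>page</Word></Para>
    <Para><Word>More</Word><Word>text</Word></Para>
  </Page>
</Document>
"""


def words_per_page(xml_text):
    """Return a dict mapping page number to the list of words on that page."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter("Page"):
        number = int(page.get("number"))
        # iter() walks all descendants, so words in every paragraph are found
        result[number] = [word.text for word in page.iter("Word")]
    return result


pages = words_per_page(SAMPLE)
for number, words in sorted(pages.items()):
    print(number, " ".join(words))
```

The same traversal could also be written as an XSLT stylesheet; Python is used here only to keep the sketch self-contained.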

1.3 Roadmap to Documentation and Samples

Mini samples for the TET library. The TET distribution contains programming examples for all supported language bindings. These mini samples can serve as a starting point for your own applications, or to test your TET installation. They comprise source code for the following applications:

> The extractor sample demonstrates the basic loops for extracting text and images from a PDF document.
> The dumper sample shows the use of the integrated pCOS interface for querying general information about a PDF document.
> The fontfilter sample shows how to process font-related information, such as font name and font size.
> The tetml sample contains the prototypical code for generating TETML (TET’s XML language for expressing PDF contents) from a PDF document.
> The get_attachments sample (not available for all language bindings) demonstrates how to process PDF file attachments, i.e. PDF documents which are embedded in another PDF document.

Note: On Windows Vista the mini samples will be installed in the »Program Files« directory by default. Due to a new protection scheme in Windows Vista the PDF output files created by these samples will only be visible under »compatibility files«. Recommended workaround: copy the examples to a user directory.
