17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual



8 TET Markup Language (TETML)

8.1 Creating TETML

As an alternative to supplying the contents of a PDF document via a programming interface, TET can create XML output which represents the same information. We refer to the XML output created by TET as TET Markup Language (TETML). TETML contains the text content of the PDF pages plus optional information such as text position, font, font size, etc. If TET detects table-like structures on the page, the tables will be expressed in TETML as a hierarchy of table, row, and cell elements. Note that table information is not available via the TET programming interface, but only through TETML. TETML also contains information about images and colorspaces.
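The table hierarchy described above can be walked with any XML parser. The following Python fragment is a minimal sketch, not taken from the TET distribution: the sample markup is hand-written, the namespace is a placeholder, and the element names Table, Row, Cell, and Text are assumptions based on the description above. Check them against real TETML output before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only. Real TETML uses its own
# namespace and carries more elements and attributes than shown here.
SAMPLE = """\
<TET xmlns="urn:example:placeholder">
  <Table>
    <Row>
      <Cell><Text>Item</Text></Cell>
      <Cell><Text>Price</Text></Cell>
    </Row>
    <Row>
      <Cell><Text>Pen</Text></Cell>
      <Cell><Text>2.50</Text></Cell>
    </Row>
  </Table>
</TET>
"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def descendants(elem, name):
    """Return all elements below elem (in document order) named `name`."""
    return [e for e in elem.iter() if local_name(e.tag) == name]


def tables_as_rows(root):
    """Convert each table into a list of rows, each row a list of cell strings.

    Nested tables are not handled specially in this sketch.
    """
    tables = []
    for table in descendants(root, "Table"):
        rows = []
        for row in descendants(table, "Row"):
            cells = []
            for cell in descendants(row, "Cell"):
                words = [t.text or "" for t in descendants(cell, "Text")]
                cells.append(" ".join(words))
            rows.append(cells)
        tables.append(rows)
    return tables


tables = tables_as_rows(ET.fromstring(SAMPLE))
```

Matching on local names keeps the sketch independent of the exact namespace URI, which differs from the placeholder used here.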

You can convert PDF documents to TETML with the TET command-line tool or the TET library. In both cases there are various options available for controlling details of TETML generation.

Creating TETML with the TET command-line tool. Using the TET command-line tool you can generate TETML output with the --tetml option. The following command will create a TETML output document file.tetml:

tet --tetml word file.pdf<br />

You can use various options to convert only some pages of the document, supply processing options, etc. Refer to Section 2.1, »Command-Line Options«, page 15, for more details.
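When the conversion is part of a larger script, the command-line tool can be started from that script. The Python sketch below only assembles the invocation `tet --tetml word file.pdf` shown in this section. The function names are hypothetical, and two things are assumptions: that the executable is on the search path as `tet`, and that it signals failure with a non-zero exit status. The directory in which the tool writes its output is not covered here.

```python
import subprocess
from pathlib import Path


def tetml_command(pdf_path, mode="word", executable="tet"):
    """Build the argument list for `tet --tetml <mode> <file>`."""
    return [executable, "--tetml", mode, str(pdf_path)]


def expected_output_name(pdf_path):
    """File name of the TETML output, following the file.pdf -> file.tetml example."""
    return Path(pdf_path).with_suffix(".tetml").name


def convert(pdf_path, mode="word"):
    """Run the tool and return the expected output file name.

    Raises subprocess.CalledProcessError if the exit status is non-zero
    (assumption: the tool reports failure that way).
    """
    subprocess.run(tetml_command(pdf_path, mode), check=True)
    return expected_output_name(pdf_path)
```

Calling `convert("file.pdf")` runs the same command as the one shown above. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.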

Creating TETML with the TET library. Using a simple sequence of API calls you can generate TETML output with the TET library. The tetml sample program demonstrates the canonical sequence for programmatically generating TETML. This sample program is available in all supported language bindings.

TETML output can be written to a disk file or generated in memory. The generated TETML stream can be parsed into an XML tree using the XML support provided by most modern programming languages. Processing the TETML tree is also demonstrated in the tetml sample programs.
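As a rough illustration of parsing an in-memory TETML stream into a tree, the Python sketch below feeds UTF-8 bytes to the standard library parser and collects the text of each page. It is not the tetml sample program from the distribution. The byte string is a hand-written stand-in for generated output, the namespace is a placeholder, and the element and attribute names (Page, number, Text) are assumptions to verify against real TETML. The `{*}` namespace wildcard requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a TETML stream held in memory as UTF-8 bytes.
# Real TETML is considerably richer than this fragment.
TETML_BYTES = """\
<?xml version="1.0" encoding="UTF-8"?>
<TET xmlns="urn:example:placeholder">
  <Page number="1">
    <Para>
      <Word><Text>Caf\u00e9</Text></Word>
      <Word><Text>12</Text></Word>
    </Para>
  </Page>
  <Page number="2">
    <Para>
      <Word><Text>End</Text></Word>
    </Para>
  </Page>
</TET>
""".encode("utf-8")


def words_by_page(tetml_bytes):
    """Parse TETML bytes and map each page number to its list of text strings."""
    root = ET.fromstring(tetml_bytes)  # bytes input honours the XML declaration
    result = {}
    for page in root.findall(".//{*}Page"):
        texts = [t.text or "" for t in page.findall(".//{*}Text")]
        result[page.get("number")] = texts
    return result


pages = words_by_page(TETML_BYTES)
```

For TETML written to a disk file, `ET.parse(path).getroot()` yields the same kind of tree.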

What’s included in TETML? TETML output is encoded in UTF-8 (on zSeries with USS or MVS: EBCDIC-UTF-8, see www.unicode.org/reports/tr16), and includes the following information:

> general document information and metadata<br />

> text contents of each page (words or paragraphs)

> glyph information (font name, size, coordinates)<br />

> structure information, e.g. tables<br />

> information about placed images on the page<br />

> resource information, i.e. fonts, colorspaces, and images<br />

> error messages if an exception occurred during PDF processing<br />

Various elements and attributes in TETML are optional. See Section 8.2, »Controlling TETML Details«, page 92, for details.

