PDFlib Text Extraction Toolkit (TET) Manual
8 TET Markup Language (TETML)
8.1 Creating TETML
As an alternative to supplying the contents of a PDF document via a programming interface,
TET can create XML output which represents the same information. We refer to
the XML output created by TET as TET Markup Language (TETML). TETML contains the
text content of the PDF pages plus optional information such as text position, font, font
size, etc. If TET detects table-like structures on the page, the tables are expressed in
TETML as a hierarchy of table, row, and cell elements. Note that table information is
available only through TETML, not via the TET programming interface. TETML also contains
information about images and colorspaces.
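As a rough illustration of the table, row, and cell hierarchy, the following Python sketch turns each table into a list of rows of cell strings. The sample markup is hand-written for this example and is not real TET output: the element names `Page`, `Table`, `Row`, `Cell`, `Word`, and `Text` are assumptions, so check them against actual TETML before relying on them. The `{*}` wildcard, which matches an element in any namespace or none, requires Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in, NOT real TET output. Element names are assumed.
SAMPLE = """
<Page>
  <Table>
    <Row>
      <Cell><Word><Text>Item</Text></Word></Cell>
      <Cell><Word><Text>Price</Text></Word></Cell>
    </Row>
    <Row>
      <Cell><Word><Text>TET</Text></Word><Word><Text>license</Text></Word></Cell>
      <Cell><Word><Text>n/a</Text></Word></Cell>
    </Row>
  </Table>
</Page>
"""


def table_to_rows(table):
    """Return the table as a list of rows, each row a list of cell strings."""
    rows = []
    for row in table.findall("{*}Row"):          # direct Row children only
        cells = []
        for cell in row.findall("{*}Cell"):      # direct Cell children only
            # Join the text of every Text element below the cell with spaces.
            texts = [t.text or "" for t in cell.findall(".//{*}Text")]
            cells.append(" ".join(texts))
        rows.append(cells)
    return rows


root = ET.fromstring(SAMPLE)
tables = [table_to_rows(t) for t in root.findall(".//{*}Table")]
print(tables)
```

Because the cell text is collected from all descendants, a table nested inside a cell would contribute its text to the outer cell as well; handle that case separately if it matters for your data.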
You can convert PDF documents to TETML with the TET command-line tool or the
TET library. In both cases there are various options available for controlling details of
TETML generation.
Creating TETML with the TET command-line tool. Using the TET command-line tool
you can generate TETML output with the --tetml option. The following command will
create a TETML output document file.tetml:
tet --tetml word file.pdf<br />
You can use various options to convert only some pages of the document, supply processing<br />
options, etc. Refer to Section 2.1, »Command-Line Options«, page 15, for more details.<br />
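If the conversion is driven from a script, the command shown above can be wrapped in a few lines of Python. This is a minimal sketch, not part of the TET distribution: the helper names are hypothetical, it assumes the `tet` executable is on the search path, and it uses only the `--tetml word` invocation quoted above. The second helper derives only the output file name (file.pdf gives file.tetml); the directory in which the tool places the file is not assumed here.

```python
import subprocess
from pathlib import Path


def build_tetml_command(pdf_path, mode="word", tet_executable="tet"):
    """Return the argument list for `tet --tetml <mode> <pdf>`."""
    return [tet_executable, "--tetml", mode, str(pdf_path)]


def tetml_name_for(pdf_path):
    """Derive the output file name: file.pdf -> file.tetml (name only)."""
    return Path(pdf_path).with_suffix(".tetml").name


def convert_to_tetml(pdf_path, mode="word"):
    """Run the tool; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_tetml_command(pdf_path, mode), check=True)
    return tetml_name_for(pdf_path)
```

Calling `convert_to_tetml("file.pdf")` would run the command from the text and return `"file.tetml"`. Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.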
Creating TETML with the TET library. Using a simple sequence of API calls you can generate
TETML output with the TET library. The tetml sample program demonstrates the
canonical sequence for programmatically generating TETML. This sample program is
available in all supported language bindings.
TETML output can be written to a disk file or generated in memory. The generated TETML
stream can be parsed into an XML tree using the XML support provided by most modern
programming languages. Processing the TETML tree is also demonstrated in the tetml
sample programs.
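The sketch below shows one way to parse a TETML stream into an XML tree with Python's standard library, once from a memory buffer and once from a disk file. It is not the tetml sample program mentioned above. The sample document is a hand-written stand-in rather than real TET output; its namespace URI, the element names `Page` and `Text`, and the `number` attribute are assumptions for illustration. The `{*}` wildcard requires Python 3.8 or later.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hand-written stand-in, NOT real TET output. Names and namespace are assumed.
SAMPLE_TETML = b"""<?xml version="1.0" encoding="UTF-8"?>
<TET xmlns="http://example.com/tetml-sketch">
  <Document>
    <Pages>
      <Page number="1">
        <Word><Text>Hello</Text></Word>
        <Word><Text>TETML</Text></Word>
      </Page>
      <Page number="2">
        <Word><Text>Second</Text></Word>
      </Page>
    </Pages>
  </Document>
</TET>
"""


def words_by_page(root):
    """Map each page number to the list of text strings found on that page."""
    pages = {}
    for page in root.findall(".//{*}Page"):
        texts = [t.text or "" for t in page.findall(".//{*}Text")]
        pages[int(page.get("number"))] = texts
    return pages


# In memory: parse the byte buffer directly.
from_memory = words_by_page(ET.fromstring(SAMPLE_TETML))

# On disk: write the same bytes to a temporary file and parse that file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.tetml")
    with open(path, "wb") as fh:
        fh.write(SAMPLE_TETML)
    from_disk = words_by_page(ET.parse(path).getroot())

print(from_memory)
```

Both routes produce the same tree, so the processing code does not need to know where the TETML came from. For very large documents an incremental parser such as `ET.iterparse` may be preferable to loading the whole tree.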
What’s included in TETML? TETML output is encoded in UTF-8 (on zSeries with USS or
MVS: EBCDIC-UTF-8, see www.unicode.org/reports/tr16), and includes the following information:
> general document information and metadata<br />
> text contents of each page (words or paragraphs)
> glyph information (font name, size, coordinates)<br />
> structure information, e.g. tables<br />
> information about placed images on the page<br />
> resource information, i.e. fonts, colorspaces, and images<br />
> error messages if an exception occurred during PDF processing<br />
Various elements and attributes in TETML are optional. See Section 8.2, »Controlling
TETML Details«, page 92, for details.