17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


8.2 Controlling TETML Details

TETML text modes. TETML can be generated in various text modes which include different amounts of font and geometry information, and differ regarding the grouping of text into larger units (granularity). The text mode can be specified individually for each page. However, in most situations TETML files will contain the data for all pages in the same mode. The following text modes are available:

> Glyph mode is a low-level flavor which includes the text, font, and coordinates for each glyph, without any word grouping or structure information. It is intended for debugging and analysis purposes since it represents the original text information on the page.

> Word mode groups text into words and adds Box elements with the coordinates of each word. No font information is available. This mode is suitable for applications which operate on a word basis. Punctuation characters will by default be treated as individual words, but this behavior can be changed with a page option (see »Word boundary detection«, page 72).

> Wordplus mode is similar to word mode, but adds font and coordinate details for all glyphs in a word. This makes it possible to analyze font usage and track changes of font, font size, etc. within a word. Since wordplus is the only text mode which contains all relevant TETML elements, it is suited for all kinds of processing tasks. On the other hand, it creates the largest amount of output due to the wealth of information contained in the TETML.

> Line mode places all text which makes up a line in a separate Line element. In addition, multiple lines may be grouped in a Para element. Line mode is recommended only in situations where the page content is known to be grouped into lines, or the receiving application can only deal with line-based text input.

> Page mode includes structure information starting at the paragraph level, but does not include any font or coordinate details.
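To illustrate what an application which operates on a word basis might do with word mode output, the following Python sketch collects each word together with its box coordinates. Only the element names Para, Word, and Box are taken from the description above. The sample fragment, the Text child element, and the llx/lly/urx/ury attribute names are assumptions made for this illustration and should be checked against real TETML output before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. The element names Para, Word and Box come
# from the manual text; the Text child and the llx/lly/urx/ury attributes are
# assumptions made for this illustration.
SAMPLE = """
<Para>
  <Word><Text>Hello</Text><Box llx="10" lly="700" urx="40" ury="712"/></Word>
  <Word><Text>world</Text><Box llx="44" lly="700" urx="75" ury="712"/></Word>
</Para>
"""


def local_name(tag):
    """Strip an XML namespace prefix of the form {uri}name, if present."""
    return tag.rsplit("}", 1)[-1]


def words_with_boxes(xml_text):
    """Return (text, (llx, lly, urx, ury)) for every Word that has a Box."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root.iter():
        if local_name(elem.tag) != "Word":
            continue
        text = ""
        box = None
        for child in elem:
            name = local_name(child.tag)
            if name == "Text":
                text = child.text or ""
            elif name == "Box":
                box = tuple(
                    float(child.get(key)) for key in ("llx", "lly", "urx", "ury")
                )
        if box is not None:
            result.append((text, box))
    return result


if __name__ == "__main__":
    for text, box in words_with_boxes(SAMPLE):
        print(text, box)
```

Matching on the local element name keeps the sketch independent of whatever XML namespace the real output declares.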

Table 8.1 lists the TETML elements which are present in the text modes.

Table 8.1 TETML elements in various text modes

text mode   structure    tables             text position   text details
glyph       –            –                  –               Glyph
word        Para, Word   Table, Row, Cell   Box             –
wordplus    Para, Word   Table, Row, Cell   Box             Glyph
line        Para, Line   –                  –               –
page        Para         Table, Row, Cell   –               –
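When choosing a text mode programmatically, the contents of Table 8.1 can be encoded as a small lookup structure. The Python sketch below is one possible way to do so; the names TETML_ELEMENTS and modes_providing are hypothetical helpers and not part of TET.

```python
# Elements present in each text mode, transcribed from Table 8.1.
TETML_ELEMENTS = {
    "glyph": {"Glyph"},
    "word": {"Para", "Word", "Table", "Row", "Cell", "Box"},
    "wordplus": {"Para", "Word", "Table", "Row", "Cell", "Box", "Glyph"},
    "line": {"Para", "Line"},
    "page": {"Para", "Table", "Row", "Cell"},
}


def modes_providing(*required):
    """Return the text modes whose output contains all required elements."""
    needed = set(required)
    return [mode for mode, elems in TETML_ELEMENTS.items() if needed <= elems]


if __name__ == "__main__":
    print(modes_providing("Box", "Glyph"))  # word coordinates plus glyph details
    print(modes_providing("Table"))         # modes with table structure
```

For example, an application which needs both word boxes and glyph details ends up with wordplus as the only candidate, while table structure is available in word, wordplus, and page mode.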

Selecting the text mode. With the TET command-line tool (see Section 2.1, »Command-Line Options«, page 15) you can specify the desired text mode as a parameter for the --tetml option. The following command generates TETML output in wordplus mode:

tet --tetml wordplus file.pdf<br />
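If the command-line tool is driven from a script, the call above can be wrapped as in the following Python sketch. It assumes that a tet executable is on the search path and that the other mode names listed in this section are accepted by --tetml in the same spelling as wordplus; only the wordplus form is shown in this manual section. The functions tetml_command and run_tetml are hypothetical helpers.

```python
import subprocess

# Mode names as listed in this section. Assumption: each one is accepted
# verbatim by the --tetml option, as shown for "wordplus".
TEXT_MODES = ("glyph", "word", "wordplus", "line", "page")


def tetml_command(pdf_path, mode="wordplus", executable="tet"):
    """Build the argument list for generating TETML in the given text mode."""
    if mode not in TEXT_MODES:
        raise ValueError(f"unknown text mode: {mode!r}")
    return [executable, "--tetml", mode, pdf_path]


def run_tetml(pdf_path, mode="wordplus"):
    """Run the command-line tool; raises CalledProcessError on failure."""
    return subprocess.run(tetml_command(pdf_path, mode), check=True)


if __name__ == "__main__":
    print(" ".join(tetml_command("file.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.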

With the TET library the text mode cannot be specified directly, but only as a combination of options:

