17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


9.2 Controlling TETML Details

TETML text modes. TETML can be generated in various text modes which include different amounts of font and geometry information, and differ regarding the grouping of text into larger units (granularity). The text mode can be specified individually for each page. However, in most situations TETML files will contain the data for all pages in the same mode. The following text modes are available:

> Glyph mode is a low-level flavor which includes the text, font, and coordinates for each glyph, without any word grouping or structure information. It is intended for debugging and analysis purposes since it represents the original text information on the page.

> Word mode groups text into words and adds Box elements with the coordinates of each word. No font information is available. This mode is suitable for applications which operate on a word basis. Punctuation characters will by default be treated as individual words, but this behavior can be changed with a page option (see »Word boundary detection for Western text«, page 85). Lines of text can optionally be identified with the Line element; this is controlled via the tetml page option.

> Wordplus mode is similar to word mode, but adds font and coordinate details for all glyphs in a word. The coordinates are expressed relative to the lower left or upper left corner, depending on the topdown page option. Wordplus mode makes it possible to analyze font usage and track changes of font, font size, etc. within a word. Since wordplus is the only text mode which contains all relevant TETML elements, it is suited for all kinds of processing tasks. On the other hand, it creates the largest amount of output due to the wealth of information contained in the TETML.

> Line mode places all text which makes up a line in a separate Line element. In addition, multiple lines may be grouped in a Para element. Line mode is recommended only in situations where the page content is known to be grouped into lines, or the receiving application can only deal with line-based text input.

> Page mode includes structure information starting at the paragraph level, but does not include any font or coordinate details.
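As an illustration of tracking font changes within a word in wordplus mode, the following is a minimal Python sketch that works on a hand-written fragment. The element names Para, Word, Box and Glyph are taken from this section; the Text element, all attribute names (font, size, x, y, width, llx, lly, urx, ury), and the omission of an XML namespace are assumptions made for the example and should be checked against actual TETML output.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only. Word, Box and Glyph are named
# in this section; the Text element, the attribute names and the missing XML
# namespace are assumptions, not taken from the manual text above.
SAMPLE = """
<Para>
  <Word>
    <Text>TETML</Text>
    <Box llx="72.0" lly="700.0" urx="110.0" ury="712.0">
      <Glyph font="F0" size="12" x="72.0" y="700.0" width="7.3">T</Glyph>
      <Glyph font="F0" size="12" x="79.3" y="700.0" width="7.3">E</Glyph>
      <Glyph font="F0" size="12" x="86.6" y="700.0" width="7.3">T</Glyph>
      <Glyph font="F1" size="12" x="93.9" y="700.0" width="10.0">M</Glyph>
      <Glyph font="F1" size="12" x="103.9" y="700.0" width="6.1">L</Glyph>
    </Box>
  </Word>
</Para>
"""


def font_runs(word):
    """Group consecutive glyphs of a word that share font and size.

    Returns a list of (font, size, text) tuples in reading order.
    """
    runs = []
    for glyph in word.iter("Glyph"):
        key = (glyph.get("font"), glyph.get("size"))
        if runs and runs[-1][0] == key:
            # Same font and size as the previous glyph: extend the run.
            runs[-1][1].append(glyph.text or "")
        else:
            # Font or size changed: start a new run.
            runs.append((key, [glyph.text or ""]))
    return [(font, size, "".join(chars)) for (font, size), chars in runs]


root = ET.fromstring(SAMPLE.strip())
for word in root.iter("Word"):
    print(word.findtext("Text"), font_runs(word))
```

A word with more than one run changes font or size somewhere inside it. In word mode the Glyph elements would be absent, so every word would yield an empty list.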

Table 9.1 lists the TETML elements which are present in the text modes.

Table 9.1 TETML elements in various text modes

text mode   structure                      tables             text position   text details
glyph       –                              –                  –               Glyph
word        Para, Word, optionally: Line   Table, Row, Cell   Box             –
wordplus    Para, Word, optionally: Line   Table, Row, Cell   Box             Glyph
line        Para, Line                     –                  –               –
page        Para                           Table, Row, Cell   –               –
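The contents of Table 9.1 can also be used programmatically to decide which text modes provide the elements an application needs. The following Python sketch transcribes the table into a dictionary. The names ELEMENTS, OPTIONAL and modes_providing are hypothetical helpers for this example and are not part of TET.

```python
# Elements present in each text mode, transcribed from Table 9.1.
ELEMENTS = {
    "glyph":    {"Glyph"},
    "word":     {"Para", "Word", "Table", "Row", "Cell", "Box"},
    "wordplus": {"Para", "Word", "Table", "Row", "Cell", "Box", "Glyph"},
    "line":     {"Para", "Line"},
    "page":     {"Para", "Table", "Row", "Cell"},
}

# The Line element is only optionally present in word and wordplus mode.
OPTIONAL = {"word": {"Line"}, "wordplus": {"Line"}}


def modes_providing(required, allow_optional=True):
    """Return the text modes whose output contains all required elements.

    With allow_optional=False, optional elements are not counted.
    """
    required = set(required)
    result = []
    for mode, elements in ELEMENTS.items():
        available = set(elements)
        if allow_optional:
            available |= OPTIONAL.get(mode, set())
        if required <= available:
            result.append(mode)
    return result


print(modes_providing({"Box", "Glyph"}))
print(modes_providing({"Table"}))
print(modes_providing({"Line"}, allow_optional=False))
```

Asking for both Box and Glyph leaves only wordplus, which matches the statement that wordplus is the only mode containing all relevant elements.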

Selecting the text mode. With the TET command-line tool (see Section 2.1, »Command-Line Options«, page 17) you can specify the desired text mode as a parameter for the --tetml option. The following command generates TETML output in wordplus mode:
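A minimal Python sketch of how such an invocation could be assembled from a script. It assumes that the executable is named tet, that the text mode directly follows --tetml as described above, and that the input file is given last. The file name input.pdf and the helper tetml_command are placeholders for this example.

```python
# The five text modes described in this section.
TEXT_MODES = ("glyph", "word", "wordplus", "line", "page")


def tetml_command(pdf_path, mode="wordplus", executable="tet"):
    """Build the argument list for a TETML run in the given text mode.

    Assumption: the mode is passed as the parameter of --tetml and the
    input file follows the options.
    """
    if mode not in TEXT_MODES:
        raise ValueError(f"unknown text mode: {mode!r}")
    return [executable, "--tetml", mode, pdf_path]


# input.pdf is a placeholder file name.
print(" ".join(tetml_command("input.pdf")))
```

The resulting list can be passed to subprocess.run once the tool is installed. Validating the mode name first avoids starting the tool with a mode that this section does not describe.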
