PDFlib Text Extraction Toolkit (TET) Manual
9.2 Controlling <strong>TET</strong>ML Details<br />
<strong>TET</strong>ML text modes. <strong>TET</strong>ML can be generated in various text modes which include different<br />
amounts of font and geometry information, and differ regarding the grouping of<br />
text into larger units (granularity). The text mode can be specified individually for each<br />
page. However, in most situations <strong>TET</strong>ML files will contain the data for all pages in the<br />
same mode. The following text modes are available:<br />
> Glyph mode is a low-level flavor which includes the text, font, and coordinates for<br />
each glyph, without any word grouping or structure information. It is intended for<br />
debugging and analysis purposes since it represents the original text information on<br />
the page.<br />
> Word mode groups text into words and adds Box elements with the coordinates of<br />
each word. No font information is available. This mode is suitable for applications<br />
which operate on a per-word basis. Punctuation characters are by default treated as individual<br />
words, but this behavior can be changed with a page option (see »Word<br />
boundary detection for Western text«, page 85). Lines of text can optionally be identified<br />
with the Line element; this is controlled via the tetml page option.<br />
> Wordplus mode is similar to word mode, but adds font and coordinate details for all<br />
glyphs in a word. The coordinates are expressed relative to the lower left or upper<br />
left corner subject to the topdown page option. Wordplus mode makes it possible to<br />
analyze font usage and track changes of font, font size, etc. within a word. Since<br />
wordplus is the only text mode which contains all relevant <strong>TET</strong>ML elements, it is suited<br />
for all kinds of processing tasks. On the other hand, it creates the largest amount<br />
of output due to the wealth of information contained in the <strong>TET</strong>ML.<br />
> Line mode places all text which makes up a line in a separate Line element. In addition,<br />
multiple lines may be grouped in a Para element. Line mode is recommended<br />
only in situations where the page content is known to be grouped into lines, or the<br />
receiving application can only deal with line-based text input.<br />
> Page mode includes structure information starting at the paragraph level, but does<br />
not include any font or coordinate details.<br />
Table 9.1 lists the <strong>TET</strong>ML elements which are present in the text modes.<br />
Table 9.1 <strong>TET</strong>ML elements in the various text modes<br />
text mode | structure | tables | text position | text details<br />
glyph | – | – | – | Glyph<br />
word | Para, Word; optionally: Line | Table, Row, Cell | Box | –<br />
wordplus | Para, Word; optionally: Line | Table, Row, Cell | Box | Glyph<br />
line | Para, Line | – | – | –<br />
page | Para | Table, Row, Cell | – | –<br />
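The element sets in Table 9.1 can be used to guess which text mode produced a given piece of TETML. The following Python sketch checks only for the presence of the element names listed in the table. The function names, the sample fragment, its `Page` wrapper, and its nesting are invented for illustration and are not taken from the TETML schema; real TETML uses an XML namespace, which the sketch strips before comparing names.

```python
import xml.etree.ElementTree as ET


def local_names(xml_text):
    """Return the set of element names in the document, without XML namespaces."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{uri}Name"; keep only "Name".
    return {el.tag.rsplit("}", 1)[-1] for el in root.iter()}


def guess_text_mode(xml_text):
    """Guess the text mode from the element names present (per Table 9.1).

    Returns None if no text element is found, e.g. for an empty page.
    """
    names = local_names(xml_text)
    if "Word" in names:
        # Only wordplus combines Word elements with Glyph details.
        return "wordplus" if "Glyph" in names else "word"
    if "Glyph" in names:
        return "glyph"
    if "Line" in names:
        return "line"
    if "Para" in names:
        return "page"
    return None


# Hypothetical, heavily simplified fragment; not a complete TETML document.
SAMPLE = (
    "<Page><Para><Word><Box>"
    "<Glyph>H</Glyph><Glyph>i</Glyph>"
    "</Box></Word></Para></Page>"
)

print(guess_text_mode(SAMPLE))  # prints "wordplus"
```

Since the text mode can be set individually for each page, a check of this kind would have to run per page rather than per file when modes are mixed.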
Selecting the text mode. With the <strong>TET</strong> command-line tool (see Section 2.1, »Command-Line<br />
Options«, page 17) you can specify the desired text mode as a parameter for<br />
the --tetml option. The following command generates <strong>TET</strong>ML output in wordplus mode:<br />
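When the tool is called from a script, the argument list can be assembled programmatically. The following Python sketch is only an illustration under stated assumptions: the executable name `tet`, the position of the input file as the last argument, and the helper name `build_tet_command` are assumptions and are not confirmed by this section. Only the `--tetml` option and the five text mode names come from the text above.

```python
# The five text modes described in this section.
TEXT_MODES = ("glyph", "word", "wordplus", "line", "page")


def build_tet_command(text_mode, pdf_path, executable="tet"):
    """Build an argument list that requests TETML output in the given text mode.

    Assumption: the input file name follows the options on the command line.
    """
    if text_mode not in TEXT_MODES:
        raise ValueError(f"unknown text mode: {text_mode!r}")
    return [executable, "--tetml", text_mode, pdf_path]


print(build_tet_command("wordplus", "input.pdf"))
# prints ['tet', '--tetml', 'wordplus', 'input.pdf']
```

The resulting list could be passed to `subprocess.run` on a system where the command-line tool is installed. Rejecting unknown mode names up front avoids launching the tool with an invalid parameter.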