17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


You can specify the amount of text in the smallest element with the granularity option
of TET_process_page().

For granularity=glyph or granularity=word you can additionally specify the amount of
glyph details. Using the geometry and font suboptions of glyphdetails you can omit
those parts of the glyph information that you don't need.

The following page option list generates TETML output in wordplus mode with all glyph
details:

granularity=word tetml={glyphdetails={geometry=true font=true}}

Table 8.2 summarizes the options for creating TETML text modes.

Table 8.2 Creating TETML text modes with the TET library

text mode   granularity option of TET_process_page()   tetml option of TET_process_page()
glyph       granularity=glyph                          tetml={glyphdetails={geometry=true font=true}}
word        granularity=word                           –
wordplus    granularity=word                           tetml={glyphdetails={geometry=true font=true}}
line        granularity=line                           –
page        granularity=page                           –
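
The mapping in Table 8.2 can be captured in a small lookup. The following Python sketch only assembles the page option list string for a given text mode; it does not call the TET library. The names TEXT_MODES and page_optlist are hypothetical helpers introduced for illustration, and joining the two options with a space in a single list is an assumption based on the option list syntax shown in this section.

```python
# Page options per TETML text mode, transcribed from Table 8.2.
# Each entry is (granularity value, tetml suboptions or None).
# None corresponds to the dash in the table: no tetml option is listed.
TEXT_MODES = {
    "glyph": ("glyph", "glyphdetails={geometry=true font=true}"),
    "word": ("word", None),
    "wordplus": ("word", "glyphdetails={geometry=true font=true}"),
    "line": ("line", None),
    "page": ("page", None),
}


def page_optlist(mode: str) -> str:
    """Return a page option list string for the given text mode.

    Hypothetical helper: it only builds the string that would be
    passed on as page options; it performs no PDF processing.
    """
    granularity, tetml_suboptions = TEXT_MODES[mode]
    parts = [f"granularity={granularity}"]
    if tetml_suboptions is not None:
        # Wrap the suboptions in tetml={...} as shown in Table 8.2.
        parts.append(f"tetml={{{tetml_suboptions}}}")
    return " ".join(parts)


if __name__ == "__main__":
    for mode in TEXT_MODES:
        print(f"{mode}: {page_optlist(mode)}")
```

Keeping the table as data means a mode name that is not in Table 8.2 fails with a KeyError instead of silently producing an option list.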

Document options for controlling TETML output. This section summarizes the document
options which directly control the generated TETML output. All other document options
control processing details. The complete description of document options can be found
in Table 10.3.

Document-related options must be supplied to the --docopt command-line option or
to the TET_open_document() function.

The tetml option controls general aspects of TETML. The elements suboption can be
used to suppress certain TETML elements if they are not required. The following
document option list will suppress document-level XMP metadata in the generated
TETML output:

tetml={ elements={nodocxmp} }

The engines option enables or disables the text and image extraction engines. The
following option list will process text contents, but disable image processing:

engines={noimage}

All document options which have been supplied when creating TETML will be recorded
in the /TET/Document/Options element unless disabled with the following document
option:

tetml={ elements={nooptions} }
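
The document option lists above can also be assembled programmatically. The following Python sketch builds a single document option list from the keywords shown in this section. The function doc_optlist is a hypothetical helper, and two points are assumptions rather than statements from this section: that several keywords may be listed together, separated by spaces, inside elements={...}, and that the tetml and engines options may be combined in one list.

```python
def doc_optlist(suppress_elements=(), disable_engines=()):
    """Build a document option list string (hypothetical helper).

    suppress_elements: keywords for tetml={ elements={...} },
        e.g. "nodocxmp" or "nooptions".
    disable_engines: keywords for engines={...}, e.g. "noimage".
    Assumption: several keywords can be combined, space-separated.
    """
    parts = []
    if suppress_elements:
        parts.append("tetml={ elements={" + " ".join(suppress_elements) + "} }")
    if disable_engines:
        parts.append("engines={" + " ".join(disable_engines) + "}")
    return " ".join(parts)


if __name__ == "__main__":
    # Suppress document-level XMP metadata and disable image processing.
    print(doc_optlist(["nodocxmp"], ["noimage"]))
    # prints: tetml={ elements={nodocxmp} } engines={noimage}
```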

Page options for controlling TETML output. The complete description of page options
can be found in Table 10.5. Page-related options must be supplied to the --pageopt
command-line option or to the TET_process_page() function.
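
When the command-line tool is driven from a script, document and page options travel as two separate arguments. The following Python sketch builds such an argument list without running anything. The function tet_argv is a hypothetical helper, and the executable name tet and the placement of the input file as the last argument are assumptions; only --docopt and --pageopt are taken from this section. Any further switches needed to request TETML output are not covered here and are left out.

```python
def tet_argv(pdf_path, docopt="", pageopt="", executable="tet"):
    """Return an argument list for the command-line tool (hypothetical helper).

    Each option list is passed as ONE argument, so the spaces and
    braces inside it need no shell quoting when the list is handed
    to subprocess.run() without shell=True.
    Assumptions: the executable is called "tet" and the input file
    comes last.
    """
    argv = [executable]
    if docopt:
        argv += ["--docopt", docopt]
    if pageopt:
        argv += ["--pageopt", pageopt]
    argv.append(pdf_path)
    return argv


if __name__ == "__main__":
    args = tet_argv(
        "file.pdf",
        docopt="engines={noimage}",
        pageopt="granularity=word tetml={glyphdetails={geometry=true font=true}}",
    )
    print(args)
```

Passing the list form to a process launcher avoids quoting problems, since option lists contain both spaces and curly braces.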

8.2 Controlling TETML Details
