
PDFlib Text Extraction Toolkit (TET) Manual

Table 10.10 Page options for TET_open_page( ) and TET_process_page( )

option
    description

layouteffort
    (Keyword) Controls the quality/performance trade-off of layout recognition. Spending more effort can
    improve layout recognition, but may slow down operation. Supported keywords: none, low, medium,
    high, and extra. Default: low

layouthint
    (Option list) Informs the layout recognition engine about the presence of certain page layout elements:
    subsummary  (Keyword) Informs the engine about the presence of subsummaries (marginalia) and
                possibly also their position. Supported keywords (default: none):
                auto   Try to detect subsummaries automatically.
                left   Try to detect subsummaries on the left side of the page.
                none   No subsummary detection
                right  Try to detect subsummaries on the right side of the page.
    header      (Boolean) If true, the engine tries to detect page headers (default: false).
    footer      (Boolean) If true, the engine tries to detect page footers (default: false).

skipengines
    (List of keywords) Skips some of the available parsers for the page contents. A skipped engine never
    returns any data for this page. Skipping an engine improves performance for applications which
    don’t need the data delivered by this engine (default: all engines are active):
    text   (Keyword) Skip the text extraction engine.
    image  (Keyword) Skip the image extraction engine.

structureanalysis
    (Option list; not for granularity=glyph) List of suboptions according to Table 10.14 for controlling
    page structure analysis.

topdown
    (Option list) Specifies a coordinate system with the origin in the top left corner of the visible page
    and y coordinates which increase downwards; otherwise the default coordinate system with the origin
    in the lower left corner will be used. Topdown coordinates match the coordinate system which is
    displayed in Acrobat. Supported suboptions:
    input   (Boolean) If true, enable topdown coordinates for the following items (default: false):
            page options includebox, excludebox
    output  (Boolean) If true, enable topdown coordinates for the following items (default: false):
            TET_char_info: y, alpha, beta
            TET_image_info: y, alpha, beta
            TETML: Glyph/@y, Glyph/@alpha, Glyph/@beta, Box/@lly, Box/@ury, PlacedImage/@y,
            PlacedImage/@alpha, PlacedImage/@beta
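
The options in Table 10.10 are passed as a single option list string. The following Python sketch shows one way such a string might be assembled and checked against the keywords listed in the table. The helper build_page_optlist and its parameter names are hypothetical and not part of TET; the sketch also assumes the usual key=value notation with braces around nested lists, and it does not call the TET library.

```python
# Hypothetical helper, not part of the TET API: it only assembles a string.
# Option names and keyword values are taken from Table 10.10; the
# "key=value" / "{...}" notation is an assumption about option list syntax.

LAYOUTEFFORT_KEYWORDS = {"none", "low", "medium", "high", "extra"}
SUBSUMMARY_KEYWORDS = {"auto", "left", "none", "right"}
ENGINE_KEYWORDS = {"text", "image"}


def _bool(value):
    """Render a Python bool as an option list Boolean."""
    return "true" if value else "false"


def build_page_optlist(layouteffort=None, subsummary=None, header=None,
                       footer=None, skipengines=(), topdown_input=None,
                       topdown_output=None):
    """Return a page option list string; options left unset are omitted."""
    parts = []

    if layouteffort is not None:
        if layouteffort not in LAYOUTEFFORT_KEYWORDS:
            raise ValueError(f"unknown layouteffort keyword: {layouteffort}")
        parts.append(f"layouteffort={layouteffort}")

    # layouthint is an option list with the suboptions subsummary/header/footer.
    hints = []
    if subsummary is not None:
        if subsummary not in SUBSUMMARY_KEYWORDS:
            raise ValueError(f"unknown subsummary keyword: {subsummary}")
        hints.append(f"subsummary={subsummary}")
    if header is not None:
        hints.append(f"header={_bool(header)}")
    if footer is not None:
        hints.append(f"footer={_bool(footer)}")
    if hints:
        parts.append("layouthint={" + " ".join(hints) + "}")

    # skipengines is a list of keywords.
    if skipengines:
        unknown = set(skipengines) - ENGINE_KEYWORDS
        if unknown:
            raise ValueError(f"unknown engine keyword(s): {sorted(unknown)}")
        parts.append("skipengines={" + " ".join(skipengines) + "}")

    # topdown is an option list with the Boolean suboptions input/output.
    topdown = []
    if topdown_input is not None:
        topdown.append(f"input={_bool(topdown_input)}")
    if topdown_output is not None:
        topdown.append(f"output={_bool(topdown_output)}")
    if topdown:
        parts.append("topdown={" + " ".join(topdown) + "}")

    return " ".join(parts)


if __name__ == "__main__":
    print(build_page_optlist(layouteffort="high", header=True,
                             skipengines=["image"], topdown_output=True))
```

The resulting string would then be supplied as the option list argument of TET_open_page( ) or TET_process_page( ). Validating keywords before the call is a design choice of this sketch, intended to surface typos early.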

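
To illustrate what the topdown option changes, the following Python sketch converts a y coordinate between the default system (origin in the lower left corner, y increasing upwards) and the topdown system (origin in the top left corner, y increasing downwards). The function name flip_y is hypothetical. The sketch assumes that the lower left corner of the visible page lies at y = 0 and that the page is not rotated, and it does not cover the alpha and beta angles listed in the table.

```python
# Hypothetical helper, not part of the TET API.
# Assumptions: the visible page starts at y = 0 in the default coordinate
# system, the page is not rotated, and all values are in the same unit.

def flip_y(y, page_height):
    """Convert y between the bottom-up and topdown coordinate systems.

    The mapping is its own inverse, so the same function converts in
    both directions.
    """
    return page_height - y


if __name__ == "__main__":
    page_height = 792          # nominal height of a US Letter page in points
    y_default = 720            # one inch below the top edge, measured from the bottom
    y_topdown = flip_y(y_default, page_height)
    print(y_topdown)           # distance from the top edge
    print(flip_y(y_topdown, page_height))  # back to the default system
```

With topdown={output=true} the library is described as reporting such coordinates directly, so a manual conversion like this should only be needed when mixing values from both systems.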
