
PDFlib Text Extraction Toolkit (TET) Manual


Table 10.5 Page options for TET_open_page( ) and TET_process_page( )

Options, in the order in which they are described: docstyle, excludebox, fontsizerange, granularity, ignoreinvisibletext, imageanalysis, includebox, layoutanalysis, layouteffort, skipengines, structureanalysis

docstyle (Keyword) A hint which will be used by the layout detection engine to select various parameters. These parameters will optimize layout detection for situations where the input document belongs to one of the following classes:

    book          Typical book
    business      Business documents
    fancy         Fancy pages with complex layout
    forms         Structured forms
    generic       The most general document class without any further qualification.
    magazines     Magazine articles
    papers        Newspaper
    science       Scientific article
    searchengine  The application is a search engine or similar, and mainly interested in retrieving the word list for the page as fast as possible. Table and page structure recognition will be disabled.

excludebox (List of rectangles) Exclude the combined area of the specified rectangles from text extraction. Default: empty

fontsizerange (List of two floats) Two numbers specifying the minimum and maximum font size of text. Text with a size outside of this interval will be ignored. The maximum can be specified with the keyword unlimited, which means that no upper limit will be active. Default: { 0 unlimited }

granularity (Keyword) The granularity of the text fragments returned by TET_get_text( ); all modes except glyph will enable the Wordfinder. See »Text granularity«, page 71, for more details (default: word).

    glyph  A fragment contains the result of mapping one glyph, but may contain more than one character (e.g. for ligatures).
    word   A fragment contains a word as determined by the Wordfinder.
    line   A fragment contains a line of text, or the closest approximation thereof. Word separators will be inserted between two consecutive words.
    page   A fragment contains the contents of a single page. Word, line, and zone separators will be inserted as appropriate.

ignoreinvisibletext (Boolean) If true, text with rendering mode 3 (invisible) will be ignored. Default: false (since invisible text is mainly used for image+text PDFs containing scanned pages and the corresponding OCR text)

imageanalysis (Option list) List of suboptions according to Table 10.8 for controlling high-level image processing.

includebox (List of rectangles) Restrict text extraction to the combined area of the specified rectangles. Default: the complete clipping area

layoutanalysis (Option list; not for granularity=glyph) List of suboptions according to Table 10.7 for controlling layout detection features.

layouteffort (Keyword) Controls the quality/performance trade-off of layout recognition. Layout recognition can be improved by spending more effort, but this may slow down operation. The layout recognition effort can be controlled with the keywords none, low, medium, high, and extra. Default: low

skipengines (List of keywords) Skip some of the available parsers for the page contents. A skipped engine never returns any data for this page. Skipping an engine which is not required will improve performance for applications which don't need the data delivered by this engine (default: all engines are active):

    text   (Keyword) Skip the text extraction engine.
    image  (Keyword) Skip the image extraction engine.

structureanalysis (Option list; not for granularity=glyph) List of suboptions according to Table 10.9 for controlling page structure analysis.
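The page options in Table 10.5 are passed as a single option list string. As a minimal sketch of how such a string might be assembled, the Python helper below builds one from a few of the options and checks keyword values against the table. The helper names (rect_list, page_optlist) are hypothetical and not part of TET. The key=value form is taken from the table's own "granularity=glyph" notation; writing a list of rectangles as nested braces is an assumption based on the "{ 0 unlimited }" default shown for fontsizerange.

```python
# Keyword sets copied from Table 10.5. Everything else here is a
# hypothetical helper, not part of the TET API.
GRANULARITIES = {"glyph", "word", "line", "page"}
LAYOUT_EFFORTS = {"none", "low", "medium", "high", "extra"}


def rect_list(rects):
    """Format rectangles as a nested brace list, e.g. {{0 0 200 100}}.

    The nested-brace syntax is an assumption, see the text above.
    """
    inner = " ".join("{" + " ".join(str(v) for v in r) + "}" for r in rects)
    return "{" + inner + "}"


def page_optlist(granularity="word", layouteffort=None, includebox=None,
                 excludebox=None, fontsizerange=None,
                 ignoreinvisibletext=None):
    """Assemble a page option list string from a subset of Table 10.5."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {granularity!r}")
    parts = [f"granularity={granularity}"]

    if layouteffort is not None:
        if layouteffort not in LAYOUT_EFFORTS:
            raise ValueError(f"unknown layouteffort: {layouteffort!r}")
        parts.append(f"layouteffort={layouteffort}")

    if includebox:
        parts.append("includebox=" + rect_list(includebox))
    if excludebox:
        parts.append("excludebox=" + rect_list(excludebox))

    if fontsizerange is not None:
        smallest, largest = fontsizerange
        # None stands for the keyword "unlimited" (no upper limit).
        upper = "unlimited" if largest is None else str(largest)
        parts.append(f"fontsizerange={{{smallest} {upper}}}")

    if ignoreinvisibletext is not None:
        flag = "true" if ignoreinvisibletext else "false"
        parts.append(f"ignoreinvisibletext={flag}")

    return " ".join(parts)


if __name__ == "__main__":
    print(page_optlist(granularity="page", layouteffort="high",
                       includebox=[(0, 0, 200, 100)],
                       fontsizerange=(10, None)))
```

The resulting string would then be supplied as the option list argument of TET_open_page( ) or TET_process_page( ). Validating keywords before the call is a design choice of this sketch; it only surfaces typos earlier than the library would.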

