PDFlib Text Extraction Toolkit (TET) Manual


Table 10.10 Page options for TET_open_page( ) and TET_process_page( )

option
    description

docstyle
    (Keyword) A hint which will be used by the layout detection engine to select various parameters. These
    parameters optimize layout detection for situations where the document belongs to one of the classes
    below. If the document is known to fall into one of these classes layout detection results can be improved
    significantly by supplying a suitable value for this option. This option activates advanced layout recognition
    (default: none):

        book          Typical book
        business      Business documents
        cad           Technical or architectural drawings which are typically heavily fragmented
        fancy         Fancy pages with complex layout
        forms         Structured forms
        generic       The most general document class without any further qualification.
        magazines     Magazine articles
        none          No specific document style is known and advanced layout recognition will be disabled.
        papers        Newspaper
        science       Scientific article
        searchengine  The application is a search engine indexer or similar application, and mainly interested in
                      retrieving the word list for the page as fast as possible. Table and page structure recognition
                      are disabled.
        spacegrid     List-oriented report (often generated on mainframe systems) where the visual layout is
                      generated using space characters. Since many heuristics like shadow detection and
                      sophisticated word boundary detection are not required for this class of documents text
                      extraction can be accelerated with this option.

excludebox
    (List of rectangles) Exclude the combined area of the specified rectangles from text extraction. Default:
    empty

fontsizerange
    (List of two floats) Two numbers specifying the minimum and maximum font size of text. Text with a size
    outside of this interval will be ignored. The maximum can be specified with the keyword unlimited,
    which means that no upper limit will be active. Default: { 0 unlimited }

granularity
    (Keyword) The granularity of the text fragments returned by TET_get_text( ); all modes except glyph will
    enable the Wordfinder. See »Text granularity«, page 84, for more details (default: word).

        glyph         A fragment contains the result of mapping one glyph, but may contain more than one
                      character (e.g. for ligatures).
        word          A fragment contains a word as determined by the Wordfinder.
        line          A fragment contains a line of text, or the closest approximation thereof. Word separators will
                      be inserted between two consecutive words.
        page          A fragment contains the contents of a single page. Word, line, and zone separators will be
                      inserted as appropriate.

ignoreinvisibletext
    (Boolean) If true, text with rendering mode 3 (invisible) will be ignored. Default: false (since invisible
    text is mainly used for image+text PDFs containing scanned pages and the corresponding OCR text)

imageanalysis
    (Option list) List of suboptions according to Table 10.13 for controlling high-level image processing.

includebox
    (List of rectangles) Restrict text extraction to the combined area of the specified rectangles. Default: the
    complete clipping area

layoutanalysis
    (Option list; not for granularity=glyph) List of suboptions according to Table 10.12 for controlling layout
    detection features.
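
The page options in Table 10.10 are combined into a single option list string. The following Python sketch shows one way such a string might be assembled and checked against the keywords listed in the table. The helper name build_page_optlist and its parameters are hypothetical and not part of the TET API, and the key=value syntax with braces around lists is assumed from the default { 0 unlimited } shown above and from general PDFlib option list conventions. The sketch does not call TET itself.

```python
# Hypothetical helper (not part of TET): assembles a page option list string
# from a subset of the options in Table 10.10.
# Assumption: options are written as key=value, lists are wrapped in braces.

DOCSTYLES = {
    "book", "business", "cad", "fancy", "forms", "generic", "magazines",
    "none", "papers", "science", "searchengine", "spacegrid",
}
GRANULARITIES = {"glyph", "word", "line", "page"}


def _rect_list(rects):
    """Format rectangles as a nested list, e.g. {{0 0 200 200}}."""
    inner = " ".join(
        "{" + " ".join(f"{v:g}" for v in rect) + "}" for rect in rects
    )
    return "{" + inner + "}"


def build_page_optlist(granularity="word", docstyle=None, includebox=None,
                       excludebox=None, fontsizerange=None,
                       ignoreinvisibletext=None):
    """Return an option list string; options left as None are omitted."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {granularity!r}")
    parts = [f"granularity={granularity}"]

    if docstyle is not None:
        if docstyle not in DOCSTYLES:
            raise ValueError(f"unknown docstyle: {docstyle!r}")
        parts.append(f"docstyle={docstyle}")

    for name, rects in (("includebox", includebox), ("excludebox", excludebox)):
        if rects:
            for rect in rects:
                if len(rect) != 4:
                    raise ValueError(f"{name} rectangles need four numbers")
            parts.append(f"{name}={_rect_list(rects)}")

    if fontsizerange is not None:
        minimum, maximum = fontsizerange
        # None stands for the keyword "unlimited" (no upper limit).
        upper = "unlimited" if maximum is None else f"{maximum:g}"
        parts.append(f"fontsizerange={{{minimum:g} {upper}}}")

    if ignoreinvisibletext is not None:
        flag = "true" if ignoreinvisibletext else "false"
        parts.append(f"ignoreinvisibletext={flag}")

    return " ".join(parts)


if __name__ == "__main__":
    print(build_page_optlist(granularity="line",
                             includebox=[(0, 0, 200, 200)],
                             fontsizerange=(10, None)))
```

The resulting string would then be supplied as the option list argument of TET_open_page( ) or TET_process_page( ). Rejecting unknown keywords before the call is a design choice of this sketch; the library performs its own option checking.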

Chapter 10: TET Library API Reference
