PDFlib Text Extraction Toolkit (TET) Manual


Table 10.11 Suboptions for the contentanalysis option of TET_open_page( ) and TET_process_page( )

option            description

bidi              (Keyword; will be ignored for granularity=glyph; has an effect only if right-to-left
                  characters are present on the page) Control the inverse Bidi algorithm which reorders
                  right-to-left and left-to-right text in a chunk (default: logical):
                  visual    Keep RTL and LTR characters in a chunk in visual order, i.e. do not apply
                            the inverse Bidi algorithm.
                  logical   Apply the inverse Bidi algorithm to bring the characters in a chunk in
                            logical order.

bidilevel         (Keyword) Specify the page’s base level (i.e. the main direction of text progression)
                  for the inverse Bidi algorithm (default: auto):
                  auto      Determine the main direction of text progression heuristically based on
                            the content.
                  ltr       Assume left-to-right as main direction of text progression (e.g. Latin
                            documents).
                  rtl       Assume right-to-left as main direction of text progression (e.g. Hebrew or
                            Arabic documents).

dehyphenate       (Boolean) If true, hyphenated words will be identified and the text fragments
                  surrounding the hyphen will be combined. The hyphen itself will be treated according
                  to the keephyphens option. Default: true

dropcapsize       (Float) The minimum size at which large glyphs will be recognized as a drop cap. Drop
                  caps are large characters at the beginning of a zone that are enlarged to »drop« down
                  several lines. They will be merged with the remainder of the zone and form part of
                  the first word in the zone. Default: 35

dropcapratio      (Float) The minimum ratio of the font size of drop caps and neighboring text. Large
                  characters will be recognized as drop caps if their size exceeds dropcapsize and the
                  font size quotient exceeds dropcapratio. In other words, this is the number of text
                  lines spanned by drop caps. Default: 4 (drop caps spanning three lines are very
                  common, but additional line spacing must be taken into account)

ideographic       (Keyword) Control word boundary detection for ideographic characters. It is
                  recommended to set this option to keep although the default is split for
                  compatibility reasons (default: split):
                  keep      Ideographic characters generally don’t constitute a word boundary.
                            Punctuation and the transition between ideographic and non-ideographic
                            characters still constitute a word boundary. For granularity=word
                            ideographic comma U+3001 and ideographic full stop U+3002 also constitute
                            word boundaries. For granularity=page no line separator will be inserted
                            at the end of a line.
                  split     Ideographic characters always constitute a word boundary.

includeboxorder   (Integer) When multiple include boxes have been supplied (see option includebox),
                  this option controls how the order of boxes affects the Wordfinder (default: 0):
                  0         Ignore include box ordering when analyzing the page contents. The result
                            will be the same as if all the text outside the include boxes was deleted.
                            This is useful for eliminating unwanted text (e.g. headers and footers)
                            while not affecting the Wordfinder in any way.
                  1         Take include box ordering into account when creating words and zones, but
                            not for zone ordering. A word will never belong to more than one box. The
                            resulting zones will be sorted in logical order. In case of overlapping
                            boxes the text will belong to the box which is earlier in the list. Other
                            than that, the ordering of include boxes in the option list doesn’t
                            matter. This setting is useful for extracting text from forms, extracting
                            text from tables, or when include boxes overlap for complicated layouts.
                  2         Consider include box ordering for all operations. The contents of each
                            include box will be treated independently from other boxes, and the
                            resulting text will be concatenated according to the order of the include
                            boxes. This is useful for extracting text from forms in a particular
                            ordering, or extracting article columns in a magazine layout in a
                            predefined order. In these cases advance knowledge about the page layout
                            is required in order to specify the include boxes in appropriate order.
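The suboptions in Table 10.11 are supplied as a nested option list inside the contentanalysis page option. As a rough illustration, the following Python sketch assembles such an option list string and checks the values against the keywords listed in the table. The helper name contentanalysis_optlist and its validation rules are hypothetical and not part of TET; the sketch only builds a string and makes no library calls.

```python
# Hypothetical helper (not part of the TET API): builds the text of a
# contentanalysis option list from the suboptions in Table 10.11.

# Allowed keyword values, copied from the table.
KEYWORD_VALUES = {
    "bidi": {"visual", "logical"},
    "bidilevel": {"auto", "ltr", "rtl"},
    "ideographic": {"keep", "split"},
}
FLOAT_OPTIONS = {"dropcapsize", "dropcapratio"}
INCLUDEBOXORDER_VALUES = {0, 1, 2}


def contentanalysis_optlist(**suboptions):
    """Return a string such as 'contentanalysis={bidilevel=rtl}'."""
    parts = []
    for name, value in suboptions.items():
        if name in KEYWORD_VALUES:
            if value not in KEYWORD_VALUES[name]:
                raise ValueError(f"{name}: unknown keyword {value!r}")
            text = value
        elif name == "dehyphenate":
            if not isinstance(value, bool):
                raise ValueError("dehyphenate expects a Boolean")
            text = "true" if value else "false"
        elif name in FLOAT_OPTIONS:
            if isinstance(value, bool):
                raise ValueError(f"{name} expects a number")
            # "g" formatting drops a trailing ".0", so 35 stays "35".
            text = format(float(value), "g")
        elif name == "includeboxorder":
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or value not in INCLUDEBOXORDER_VALUES:
                raise ValueError("includeboxorder expects 0, 1 or 2")
            text = str(value)
        else:
            raise ValueError(f"unknown suboption {name!r}")
        parts.append(f"{name}={text}")
    # Suboptions are separated by spaces and wrapped in braces.
    return "contentanalysis={" + " ".join(parts) + "}"


if __name__ == "__main__":
    print(contentanalysis_optlist(bidilevel="rtl", ideographic="keep"))
```

The resulting string could then be combined with other page options (for example granularity=word) and passed as the option list of TET_open_page( ) or TET_process_page( ). How that call looks in a given language binding is outside the scope of this sketch.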

Chapter 10: TET Library API Reference
