PDFlib Text Extraction Toolkit (TET) Manual
Table 10.11 Suboptions for the contentanalysis option of TET_open_page( ) and TET_process_page( )
option: bidi, bidilevel, dehyphenate, dropcapsize, dropcapratio, ideographic, includeboxorder
description (one entry per option, in the order listed):
bidi
(Keyword; will be ignored for granularity=glyph; has an effect only if right-to-left characters are present on the page) Control the inverse Bidi algorithm, which reorders right-to-left and left-to-right text in a chunk (default: logical):
visual Keep RTL and LTR characters in a chunk in visual order, i.e. do not apply the inverse Bidi algorithm.
logical Apply the inverse Bidi algorithm to bring the characters in a chunk into logical order.
bidilevel
(Keyword) Specify the page’s base level (i.e. the main direction of text progression) for the inverse Bidi algorithm (default: auto):
auto Determine the main direction of text progression heuristically based on the content.
ltr Assume left-to-right as main direction of text progression (e.g. Latin documents).
rtl Assume right-to-left as main direction of text progression (e.g. Hebrew or Arabic documents).
dehyphenate
(Boolean) If true, hyphenated words will be identified and the text fragments surrounding the hyphen will be combined. The hyphen itself will be treated according to the keephyphens option. Default: true
dropcapsize
(Float) The minimum size at which large glyphs will be recognized as a drop cap. Drop caps are large characters at the beginning of a zone that are enlarged to "drop" down several lines. They will be merged with the remainder of the zone and form part of the first word in the zone. Default: 35
dropcapratio
(Float) The minimum ratio of the font size of drop caps and neighboring text. Large characters will be recognized as drop caps if their size exceeds dropcapsize and the font size quotient exceeds dropcapratio. In other words, this is the number of text lines spanned by drop caps. Default: 4 (drop caps spanning three lines are very common, but additional line spacing must be taken into account)
ideographic
(Keyword) Control word boundary detection for ideographic characters. It is recommended to set this option to keep, although the default is split for compatibility reasons (default: split):
keep Ideographic characters generally don’t constitute a word boundary. Punctuation and the transition between ideographic and non-ideographic characters still constitute a word boundary. For granularity=word, ideographic comma U+3001 and ideographic full stop U+3002 also constitute word boundaries. For granularity=page, no line separator will be inserted at the end of a line.
split Ideographic characters always constitute a word boundary.
includeboxorder
(Integer) When multiple include boxes have been supplied (see option includebox), this option controls how the order of boxes affects the Wordfinder (default: 0):
0 Ignore include box ordering when analyzing the page contents. The result will be the same as if all the text outside the include boxes was deleted. This is useful for eliminating unwanted text (e.g. headers and footers) while not affecting the Wordfinder in any way.
1 Take include box ordering into account when creating words and zones, but not for zone ordering. A word will never belong to more than one box. The resulting zones will be sorted in logical order. In case of overlapping boxes the text will belong to the box which is earlier in the list. Other than that, the ordering of include boxes in the option list doesn’t matter. This setting is useful for extracting text from forms, extracting text from tables, or when include boxes overlap for complicated layouts.
2 Consider include box ordering for all operations. The contents of each include box will be treated independently from other boxes, and the resulting text will be concatenated according to the order of the include boxes. This is useful for extracting text from forms in a particular ordering, or extracting article columns in a magazine layout in a predefined order. In these cases advance knowledge about the page layout is required in order to specify the include boxes in appropriate order.
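The suboptions in this table are supplied together as one nested option list. The following is a minimal sketch of how such a string might be assembled and checked against the values listed above. It assumes the usual PDFlib option list syntax (name=value pairs separated by spaces, nested lists enclosed in braces). The helper contentanalysis_optlist and the KEYWORDS table are hypothetical names introduced for illustration; they are not part of the TET API, and the sketch only builds a string without calling TET.

```python
# Hypothetical helper, not part of TET: builds a contentanalysis option
# list from the suboptions described in Table 10.11.

# Allowed keyword values, copied from the table.
KEYWORDS = {
    "bidi": {"visual", "logical"},
    "bidilevel": {"auto", "ltr", "rtl"},
    "ideographic": {"keep", "split"},
}


def contentanalysis_optlist(**subopts):
    """Return a string such as 'contentanalysis={bidi=visual}'.

    Assumes PDFlib-style option list syntax: name=value pairs separated
    by spaces, with the nested list enclosed in braces.
    """
    parts = []
    for name, value in subopts.items():
        if name in KEYWORDS:
            if value not in KEYWORDS[name]:
                raise ValueError(f"invalid keyword for {name}: {value!r}")
            text = value
        elif name == "dehyphenate":
            if not isinstance(value, bool):
                raise ValueError("dehyphenate expects a boolean")
            text = "true" if value else "false"
        elif name in ("dropcapsize", "dropcapratio"):
            # Float suboptions; 'g' formatting drops a trailing '.0'.
            text = format(float(value), "g")
        elif name == "includeboxorder":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or value not in (0, 1, 2):
                raise ValueError("includeboxorder expects 0, 1 or 2")
            text = str(value)
        else:
            raise ValueError(f"unknown suboption: {name}")
        parts.append(f"{name}={text}")
    return "contentanalysis={" + " ".join(parts) + "}"


if __name__ == "__main__":
    print(contentanalysis_optlist(ideographic="keep", includeboxorder=2))
```

For example, contentanalysis_optlist(ideographic="keep", includeboxorder=2) returns the string contentanalysis={ideographic=keep includeboxorder=2}, which could then be supplied as part of the option list for TET_open_page( ) or TET_process_page( ). Validating the values before the call is a design choice of this sketch; it reports a misspelled keyword in the calling code before the option list reaches the library.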
Chapter 10: TET Library API Reference