PDFlib Text Extraction Toolkit (TET) Manual
Table 10.7 Suboptions for the layoutanalysis option of TET_open_page() and TET_process_page()
option  description
layoutastable (Boolean) If true, the layout recognition engine will treat the zones on the page as one or more
    tables. The minimum number of columns which is required to consider the sequence as a table depends on
    the document style. If false, supertable recognition will be disabled (default: true).
layoutcolumnhint (Keyword) This option may improve zone reading order detection for complex layouts.
    Supported keywords (default: multicolumn):
    multicolumn  The page contains multi-column text; zones will be sorted column by column.
    none  No hint available; zone ordering will be determined by page content order.
    singlecolumn  The page contains single-column text; zones will be sorted row by row.
layoutdetect (Integer) Specifies the depth of recursive layout recognition (default: 1):
    0  No layout recognition.
    1  Layout recognition for the whole page. This is sufficient for the vast majority of documents.
    2  Layout recognition for the results of level 1. This is required for layouts with different multi-column
       sublayouts and titles on different parts of the page as well as multi-paragraph tables.
    3  Layout recognition for the results of level 2. This is required only for very complex layouts.
layoutrowhint (Option list) Control layout row processing. Supported options (default: none):
    full  Enable layout row processing.
    none  Disable layout row processing.
    separation  (Keyword) Enable layout row processing, but disable it if layout recognition suspects a
        supertable. The following suboptions can be supplied:
        preservecolumns  Try to keep vertical columns based on the geometric relationship between zones.
            This is recommended if zones within columns are separated by large gaps (e.g. caused by
            images).
        thick  Try to combine neighboring zones and place them in the same layout row. This results in a
            smaller number of larger layout rows. This is recommended for complex layouts, such as
            magazines and papers where paragraphs within columns are separated from each other by more
            than the font size, and for layouts with several multi-column articles one under the other.
        thin  Try to separate neighboring zones and place them in different layout rows. This results in a
            larger number of smaller layout rows.
    Example: layoutanalysis = {layoutrowhint={full separation=thick}}
mergetables (Keyword) Tables with a single row will be skipped during table recognition, and treated as
    regular zones. If two sequential zones are tables (even with only a single row) they can be combined.
    Supported keywords (default: none):
    down  Combine downwards only.
    none  Don’t merge.
    up  Combine upwards only.
    updown  Combine in both directions.
splithint (Keyword or option list) Activate special treatment of double-page spreads (or spreads consisting of
    even more pages). The page may be divided vertically or horizontally into two or more sections. The
    keyword includebox means that the split areas will be defined by the includebox option. Alternatively,
    the following options can be supplied:
    x  (Float) Divider for the x axis, e.g. 0.5 for a double-page spread, 0.33 for a three-page spread.
    y  (Float) Divider for the y axis.
standalonefontsize (Float) Minimum font size for huge glyphs. Huge glyphs form single-glyph strips, and will
    not be combined with other zones (default: 70).
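
The nested syntax of these suboptions is easy to get wrong when option lists are assembled by hand. The following is a minimal sketch of a string builder, not part of TET: the names ALLOWED, format_value, format_items and layoutanalysis are hypothetical helpers, and only the keyword sets are taken from the table above. It calls no TET function; the resulting string is meant to be supplied in the option list of TET_open_page() or TET_process_page().

```python
# Keyword sets copied from Table 10.7. Everything else in this block is a
# hypothetical helper for illustration and is not part of the TET API.
ALLOWED = {
    "layoutcolumnhint": {"multicolumn", "none", "singlecolumn"},
    "layoutdetect": {0, 1, 2, 3},
    "mergetables": {"down", "none", "up", "updown"},
}


def format_value(value):
    """Render one option value in option-list syntax."""
    if isinstance(value, bool):  # test bool first: True is also an int in Python
        return "true" if value else "false"
    if isinstance(value, list):  # a nested option list is wrapped in braces
        return "{" + format_items(value) + "}"
    return str(value)


def format_items(items):
    """Render a list of bare keywords (str) and (key, value) tuples."""
    parts = []
    for item in items:
        if isinstance(item, tuple):
            key, value = item
            # Reject values that the table does not list for this suboption.
            if key in ALLOWED and value not in ALLOWED[key]:
                raise ValueError(f"unsupported value for {key}: {value!r}")
            parts.append(f"{key}={format_value(value)}")
        else:
            parts.append(str(item))
    return " ".join(parts)


def layoutanalysis(items):
    """Wrap the suboptions in a layoutanalysis option."""
    return "layoutanalysis={" + format_items(items) + "}"


# Same suboptions as the layoutrowhint example in the table.
row_hint = layoutanalysis([("layoutrowhint", ["full", ("separation", "thick")])])

# Assumed combination for a double-page spread: two recognition levels, split at x=0.5.
spread = layoutanalysis([("layoutdetect", 2), ("splithint", [("x", 0.5)])])

print(row_hint)
print(spread)
```

The first string matches the table's layoutrowhint example except for the spaces around the first equals sign; treating the two forms as equivalent is an assumption. Whether the second combination suits a particular document depends on its layout and should be checked against the extracted text.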