PDFlib Text Extraction Toolkit (TET) Manual
Table 10.7 Suboptions for the layoutanalysis option of TET_open_page( ) and TET_process_page( )

option              description
supertablecolumns   (Integer; only if layoutastable=true) Minimum number of columns in a layout row
                    for the sequence of zones to be considered a supertable. When a table is created
                    from paragraphs, these columns are recognized as separate zones instead of being
                    combined. As a consequence, layout recognition can identify these zone sequences
                    as a table. Default: 4
tabledetect         (Integer) Specifies the depth of recursive table recognition. Default: 1
                    0   No table recognition.
                    1   Table recognition for each zone.
                    2   Table recognition for each table cell detected in level 1. This is required
                        for nested tables and for resolving row spans.
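
The suboptions above are passed as a nested option list in the page option string. The following minimal sketch shows how such a string could be assembled. It only builds the text and does not call TET. The helper name format_optlist is hypothetical, and the placement of layoutastable next to the other suboptions is an assumption based on the "only if layoutastable=true" note.

```python
def format_optlist(options):
    """Render a dict as an option list string (hypothetical helper, not part of TET).

    Nested dicts become braced sub-lists, booleans become true/false.
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, dict):
            parts.append(f"{key}={{{format_optlist(value)}}}")
        elif isinstance(value, bool):
            # bool is checked before the generic case because it must print as true/false
            parts.append(f"{key}={'true' if value else 'false'}")
        else:
            parts.append(f"{key}={value}")
    return " ".join(parts)


# Assumed combination: enable supertable handling and recurse into table cells.
layout_options = {
    "layoutanalysis": {
        "layoutastable": True,
        "supertablecolumns": 4,  # documented default
        "tabledetect": 2,        # needed for nested tables and row spans
    }
}

page_optlist = format_optlist(layout_options)
print(page_optlist)
# prints: layoutanalysis={layoutastable=true supertablecolumns=4 tabledetect=2}
```

The resulting string would be supplied as the option list argument of TET_open_page( ) or TET_process_page( ).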
Table 10.8 Suboptions for the imageanalysis option of TET_open_page( ) and TET_process_page( )

option        description
smallimages   (Option list) Control small image removal. Small images must often be ignored since they
              are artifacts and not real images. Supported options:
              disable    (Boolean) If true, small image removal will be disabled. Default: false
              maxarea    (Float) Maximum area (= width x height) in pixels of an image to be
                         considered a small image. Default: 500
              maxcount   (Integer) Maximum allowed number of small images. If more small images are
                         found, all of them will be removed. Default: 50
merge         (Option list) Control image merging. This process combines adjacent images which together
              may form a single larger image. This is useful for multi-strip images where the
              individual strips have been preserved in the PDF, and for background images which are
              broken into a large number of very small images. Supported options:
              disable    (Boolean) If true, image merging will be disabled. Default: false
              gap        (Float) Maximum gap in points between two images to be considered for
                         merging. Default: 1.0 (not 0.0 because of unavoidable inaccuracies in the
                         position calculations)
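
The interaction of maxarea and maxcount can be illustrated with a small sketch. It restates the rule as described in the table and is not TET's actual implementation. The function name, the (width, height) tuple representation, and the treatment of an area exactly equal to maxarea as "small" are assumptions.

```python
def remove_small_images(images, maxarea=500.0, maxcount=50, disable=False):
    """Return the images that survive small image removal.

    images: list of (width, height) tuples in pixels (assumed representation).
    An image counts as small if width * height <= maxarea (boundary is an assumption).
    Small images are dropped only if there are more than maxcount of them.
    """
    if disable:
        return list(images)

    def is_small(image):
        width, height = image
        return width * height <= maxarea

    small_count = sum(1 for image in images if is_small(image))
    if small_count > maxcount:
        # Too many small images: treat all of them as artifacts.
        return [image for image in images if not is_small(image)]
    return list(images)


# Three 10x10 fragments (area 100 each) and one real 100x100 image.
page_images = [(10, 10), (10, 10), (10, 10), (100, 100)]

kept_default = remove_small_images(page_images)             # 3 small <= 50: nothing removed
kept_strict = remove_small_images(page_images, maxcount=2)  # 3 small > 2: all small removed
```

With the default maxcount of 50, a handful of small images is left alone; only a page with many small fragments has them stripped.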
C++        void close_page(int page)
C# Java    void close_page(int page)
Perl PHP   TET_close_page(resource tet, long page)
VB         Sub close_page(page As Long)
C          void TET_close_page(TET *tet, int page)

Release a page handle and all related resources.

page       A valid page handle obtained with TET_open_page( ).

Details    All open pages of the document will be closed automatically when TET_close_document( )
           is called. It is good programming practice, however, to close pages explicitly when they
           are no longer needed. A closed page handle must not be used in any subsequent function call.
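
One way to follow the advice on explicit closing is to tie the page handle to a scope so that it is released even when processing fails. The sketch below assumes a binding object exposing open_page(doc, pagenumber, optlist) and close_page(page) methods, and assumes that -1 signals a failed open. The wrapper opened_page and the stand-in class FakeTET are hypothetical and exist only to make the example runnable without the library.

```python
from contextlib import contextmanager


@contextmanager
def opened_page(tet, doc, pagenumber, optlist=""):
    """Open a page and guarantee close_page() is called (hypothetical wrapper)."""
    page = tet.open_page(doc, pagenumber, optlist)
    if page == -1:  # assumed error return value
        raise RuntimeError(f"could not open page {pagenumber}")
    try:
        yield page
    finally:
        # After this call the handle must not be used again.
        tet.close_page(page)


class FakeTET:
    """Stand-in for a TET object that only tracks open page handles."""

    def __init__(self):
        self.open_handles = set()
        self._next_handle = 0

    def open_page(self, doc, pagenumber, optlist):
        handle = self._next_handle
        self._next_handle += 1
        self.open_handles.add(handle)
        return handle

    def close_page(self, page):
        self.open_handles.remove(page)


tet = FakeTET()
with opened_page(tet, doc=0, pagenumber=1) as page:
    handle_was_open = page in tet.open_handles  # True while inside the block
handles_after = set(tet.open_handles)           # empty once the block has exited
```

The handle is valid only inside the with block, which matches the rule that closed page handles must not be reused.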
Chapter 10: TET Library API Reference