17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


10.8 TET Markup Language (TETML) Functions

C++ int process_page(int doc, int pagenumber, string optlist)

C# Java int process_page(int doc, int pagenumber, String optlist)

Perl PHP long TET_process_page(resource tet, long doc, long pagenumber, string optlist)

VB Function process_page(doc As Long, pagenumber As Long, optlist As String) As Int

C int TET_process_page(TET *tet, int doc, int pagenumber, const char *optlist)

Process a page and create TETML output.

doc A valid document handle obtained with TET_open_document*( ).

pagenumber The physical number of the page to be processed. The first page has page number 1. The total number of pages can be retrieved with TET_pcos_get_number( ) and the pCOS path length:pages. The pagenumber parameter may be 0 if trailer=true.

optlist An option list specifying options from the following groups:

> General page-related options according to Table 10.5 (these will be ignored if pagenumber=0): clippingarea, contentanalysis, excludebox, fontsizerange, granularity, ignoreinvisibletext, imageanalysis, includebox, layoutanalysis, skipengines

> Option specifying processing details according to Table 10.13: tetml

Table 10.13 Additional options for TET_process_page( )

option: tetml

description: (Option list) Controls details of TETML. The following options are available:

glyphdetails (Option list; only for granularity=glyph and word) Specify which members of the TET_char_info structure will be reported for each Glyph element:

    geometry (Boolean) Emit attributes x, y, width, alpha, beta. Default: false

    font (Boolean) Emit attributes font, fontsize, textrendering, unknown. Default: false

trailer (Boolean) If true, document trailer data, i.e. data after the last page, will be emitted (it must be appended to the page-specific data emitted earlier). This option is required in the last call to this function in order to emit trailer data. If pagenumber=0 only trailer data (without any page-specific data) will be emitted. Once trailer=true has been supplied, no more calls to TET_process_page( ) are allowed for the same document. Default: false
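The nested option list described in Table 10.13 can be assembled as a plain string before it is passed to TET_process_page( ). The following C sketch only builds that string and does not call the library. The helper name build_tetml_optlist and its parameters are hypothetical and not part of the TET API. It assumes that glyphdetails and trailer are written as suboptions inside tetml={...}, as the table lists them.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the TET API): writes a tetml option
 * list for TET_process_page() into buf.
 * glyphdetails is emitted only if geometry or font is requested, since
 * the table restricts it to granularity=glyph and word.
 * Returns 0 on success, or -1 if buf is too small.
 */
int build_tetml_optlist(char *buf, size_t size,
                        int geometry, int font, int trailer)
{
    const char *t = trailer ? "true" : "false";
    int n;

    if (geometry || font) {
        n = snprintf(buf, size,
            "tetml={glyphdetails={geometry=%s font=%s} trailer=%s}",
            geometry ? "true" : "false",
            font ? "true" : "false",
            t);
    } else {
        n = snprintf(buf, size, "tetml={trailer=%s}", t);
    }

    /* snprintf reports the length it needed; reject truncated output */
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

The resulting string would be combined with any page-related options (for example granularity) and supplied as the optlist parameter.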

Returns -1 on error, or 1 otherwise. However, in TETML mode this function will always succeed since problems related to the input document will be reported in a TETML element, and all other problems will raise an exception.

Details This function will open a page, create output according to the format-related options supplied to TET_open_document*( ), and close the page. The generated data can be retrieved with TET_get_xml_data( ).

This function must only be called if the option tetml has been supplied in the corresponding call to TET_open_document*( ). Header data, i.e. document-specific data before the first page, will be created by TET_open_document*( ) before the first page data. It can be retrieved separately by calling TET_get_xml_data( ) before the first call to TET_process_page( ), or in combination with page-related data.
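The call sequence described above (one call per page, with the trailer requested in the last call, or a single trailer-only call with pagenumber=0) can be sketched in C as follows. To keep the sketch independent of the library, the actual work is hidden behind a hypothetical callback type page_fn; in a real program that callback would presumably wrap TET_process_page( ) followed by TET_get_xml_data( ). The names page_fn, process_all_pages, call_log and record_call are illustrative assumptions, as is the spelling tetml={trailer=true} for the trailer option.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: stands in for a wrapper around
 * TET_process_page(); returns -1 on error, 1 otherwise. */
typedef int (*page_fn)(void *ctx, int pagenumber, const char *optlist);

/*
 * Calls fn once per page, starting with page number 1. The last call
 * carries the trailer option. If there are no pages, a single
 * trailer-only call with pagenumber 0 is made.
 * Returns the number of calls made, or -1 if a call reported an error.
 */
int process_all_pages(int npages, page_fn fn, void *ctx)
{
    int page, calls = 0;

    if (npages <= 0)
        return fn(ctx, 0, "tetml={trailer=true}") == -1 ? -1 : 1;

    for (page = 1; page <= npages; page++) {
        /* assumed option spelling; only the final call asks for the trailer */
        const char *optlist =
            (page == npages) ? "tetml={trailer=true}" : "";

        if (fn(ctx, page, optlist) == -1)
            return -1;
        calls++;
    }
    return calls;
}

/* Recording callback for illustration: notes what it was asked to do. */
struct call_log {
    int calls;          /* number of calls seen */
    int last_page;      /* page number of the most recent call */
    int trailer_calls;  /* calls that requested the trailer */
};

int record_call(void *ctx, int pagenumber, const char *optlist)
{
    struct call_log *seen = ctx;

    seen->calls++;
    seen->last_page = pagenumber;
    if (strstr(optlist, "trailer=true") != NULL)
        seen->trailer_calls++;
    return 1;
}
```

Requesting the trailer in the final page call avoids a separate call with pagenumber=0; the separate call is only needed when no page is processed at the end.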
