17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


10.5 Page Functions

C++ int open_page(int doc, int pagenumber, string optlist)

C# Java int open_page(int doc, int pagenumber, String optlist)

Perl PHP long TET_open_page(resource tet, long doc, long pagenumber, string optlist)

VB Function open_page(doc As Long, pagenumber As Long, optlist As String) As Long

C int TET_open_page(TET *tet, int doc, int pagenumber, const char *optlist)

Open a page for content extraction.

doc A valid document handle obtained with TET_open_document*( ).

pagenumber The physical number of the page to be opened. The first page has page number 1. The total number of pages can be retrieved with TET_pcos_get_number( ) and the pCOS path length:pages.

optlist An option list specifying page options according to Table 10.5. The following options can be used: clippingarea, contentanalysis, docstyle, excludebox, fontsizerange, granularity, ignoreinvisibletext, imageanalysis, includebox, layoutanalysis, layouteffort, skipengines, structureanalysis.

Returns A handle for the page, or -1 in case of an error.

Details Within a single document an arbitrary number of pages may be kept open at the same time. The same page may be opened multiple times with different options. However, options cannot be changed while a page is being processed.

Layer definitions (optional content groups) which may be present on the page are not taken into account: all text on all layers of the page will be extracted, regardless of the visibility of layers.
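Since optlist is an ordinary string, it can be assembled at run time before the page is opened. The following is a minimal sketch, assuming the usual key=value form of option lists (as in granularity=glyph in Table 10.5). The helper name build_page_optlist is hypothetical and not part of TET; only the option names granularity and clippingarea are taken from the list above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of TET): writes
 * "granularity=<g> clippingarea=<c>" into buf.
 * Returns 0 on success, -1 if an argument is missing
 * or the buffer is too small for the complete string. */
int build_page_optlist(char *buf, size_t size,
                       const char *granularity, const char *clippingarea)
{
    int n;

    if (buf == NULL || size == 0 || granularity == NULL || clippingarea == NULL)
        return -1;

    n = snprintf(buf, size, "granularity=%s clippingarea=%s",
                 granularity, clippingarea);

    /* snprintf reports the length it needed; treat truncation as an error */
    if (n < 0 || (size_t) n >= size)
        return -1;

    return 0;
}
```

The resulting string would then be passed as the optlist argument of TET_open_page( ). Because options cannot be changed while a page is being processed, a different option set requires opening the page again with a new string.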

Table 10.5 Page options for TET_open_page( ) and TET_process_page( )

option            description

clippingarea      (Keyword) Specifies the clipping area (default: cropbox):
                  mediabox    Use the MediaBox (which is always present)
                  cropbox     Use the CropBox (the area visible in Acrobat) if present, else MediaBox
                  bleedbox    Use the BleedBox if present, else use cropbox
                  trimbox     Use the TrimBox if present, else use cropbox
                  artbox      Use the ArtBox if present, else use cropbox
                  unlimited   Consider all text, regardless of its location

contentanalysis   (Option list; not for granularity=glyph) List of suboptions according to Table 10.6 for controlling high-level text processing.
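The fallback rules for the clippingarea keywords can be illustrated with a small sketch. It models one reading of the table, in which "else use cropbox" means that the cropbox rule is applied in turn (CropBox if present, else MediaBox). The types box_kind and page_boxes and the function resolve_clipping_box are hypothetical names for illustration only; they are not part of the TET API.

```c
/* Page boxes named by the clippingarea keywords in Table 10.5 */
typedef enum {
    BOX_MEDIA, BOX_CROP, BOX_BLEED, BOX_TRIM, BOX_ART, BOX_UNLIMITED
} box_kind;

/* Hypothetical record of which optional boxes a page defines.
 * The MediaBox is always present, so it needs no flag. */
typedef struct {
    int has_crop;
    int has_bleed;
    int has_trim;
    int has_art;
} page_boxes;

/* Return the box that would be used for the requested keyword,
 * assuming the fallback chain described in the lead-in. */
box_kind resolve_clipping_box(box_kind requested, const page_boxes *p)
{
    switch (requested) {
    case BOX_MEDIA:     return BOX_MEDIA;
    case BOX_UNLIMITED: return BOX_UNLIMITED;
    case BOX_BLEED:     if (p->has_bleed) return BOX_BLEED; break;
    case BOX_TRIM:      if (p->has_trim)  return BOX_TRIM;  break;
    case BOX_ART:       if (p->has_art)   return BOX_ART;   break;
    case BOX_CROP:      break;
    }

    /* cropbox rule: CropBox if present, else MediaBox */
    return p->has_crop ? BOX_CROP : BOX_MEDIA;
}
```

For example, under this reading a request for trimbox on a page that defines neither a TrimBox nor a CropBox ends up using the MediaBox.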

Chapter 10: TET Library API Reference
