17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Trailer data, i.e. document-specific data after the last page, must be requested with<br />

the trailer suboption when this function is called for the last time for a document. Trailer<br />

data can be created with a separate call after the last page (pagenumber=0), or together<br />

with the last page (pagenumber is different from 0). Pages can be retrieved in any order,<br />

and any subset of the document’s pages can be retrieved.<br />

It is an error to call <strong>TET</strong>_close_document( ) without retrieving the trailer, or to call <strong>TET</strong>_<br />

process_page( ) again after retrieving the trailer.<br />

C++ const char *get_xml_data(int doc, size_t *length, string optlist)<br />

C# Java final byte[ ] get_xml_data(int doc, String optlist)<br />

Perl PHP string <strong>TET</strong>_get_xml_data(resource tet, long doc, string optlist)<br />

VB Function get_xml_data(doc As Long, optlist As String)<br />

C const char * <strong>TET</strong>_get_xml_data(<strong>TET</strong> *tet, int doc, size_t *length, const char *optlist)<br />

Retrieve <strong>TET</strong>ML data from memory.<br />

doc A valid document handle obtained with <strong>TET</strong>_open_document*( ).<br />

length (C and C++ language binding only) A pointer to a variable which will hold the<br />

length of the returned string in bytes. length does not count the terminating null byte.<br />

optlist<br />

(Currently there are no supported options.)<br />

Returns<br />

A byte array containing the next chunk of data according to the specified options. If the<br />

buffer is empty an empty string will be returned (in C: a NULL pointer and *len=0).<br />

Details This functions retrieves <strong>TET</strong>ML data which has been created by <strong>TET</strong>_open_document*( )<br />

and one or more calls to <strong>TET</strong>_process_page( ). The <strong>TET</strong>ML data will always be encoded in<br />

UTF-8, regardless of the outputformat option. The internal buffer will be cleared by this<br />

call. It is not required to call <strong>TET</strong>_get_xml_data( ) after each call to <strong>TET</strong>_process_page( ).<br />

The client may accumulate the data for one or more pages or for the whole document in<br />

the buffer.<br />

In <strong>TET</strong>ML mode this function must be called at least once before <strong>TET</strong>_close_<br />

document( ) since otherwise the data would no longer be accessible. If <strong>TET</strong>_get_xml_<br />

data( ) is called exactly once (such a single call must happen between the last call to <strong>TET</strong>_<br />

process_page( ) and <strong>TET</strong>_close_document( )) the buffer is guaranteed to contain wellformed<br />

<strong>TET</strong>ML output for the whole document.<br />

This function must not be called if the filename suboption has been supplied to the<br />

tetml option of <strong>TET</strong>_open_document*( ).<br />

Bindings<br />

C and C++ language bindings: the result will be provided as null-terminated UTF-8. On<br />

iSeries and zSeries EBCDIC-encoded UTF-8 will be returned. The returned data buffer can<br />

be used until the next call to<strong>TET</strong>_get_xml_data( ).<br />

Java and .NET language bindings: the result will be provided as a byte array containing<br />

UTF-8 data.<br />

COM: Most client programs will use the Variant type to hold the UTF-8 data.<br />

PHP language binding: the result will be provided as UTF-8 string.<br />

RPG language binding: the result will be provided as null-terminated EBCDIC-encoded<br />

UTF-8.<br />

150 Chapter 10: <strong>TET</strong> Library API Reference

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!