PDFlib Text Extraction Toolkit (TET) Manual
Details
This function opens a page, creates output according to the format-related options
supplied to TET_open_document*( ), and closes the page. The generated data can be retrieved
with TET_get_xml_data( ).
This function must only be called if the option tetml has been supplied in the corresponding
call to TET_open_document*( ). Header data, i.e. document-specific data before
the first page, is created by TET_open_document*( ) before the first page data. It can
be retrieved separately by calling TET_get_xml_data( ) before the first call to
TET_process_page( ), or in combination with page-related data.
Trailer data, i.e. document-specific data after the last page, must be requested with
the trailer suboption when this function is called for the last time for a document. Trailer
data can be created with a separate call after the last page (pagenumber=0), or together
with the last page (pagenumber is different from 0). Pages can be retrieved in any order,
and any subset of the document's pages can be retrieved.
It is an error to call TET_close_document( ) without retrieving the trailer, or to call
TET_process_page( ) again after retrieving the trailer.
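The ordering rules above can be modeled as a small call-order checker. The following C sketch is an illustration only: `tetml_seq`, `seq_init`, `seq_process_page`, and `seq_close_document` are hypothetical names that are not part of the TET API, and the checker does nothing but mirror the two documented rules (no page processing after the trailer, no closing before it). It does not call the library.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical call-order checker; none of these names belong to TET. */
typedef enum {
    SEQ_OK = 0,
    SEQ_ERR_AFTER_TRAILER,  /* page processing attempted after the trailer */
    SEQ_ERR_NO_TRAILER      /* close attempted before the trailer */
} seq_status;

typedef struct {
    bool trailer_done;      /* set once the trailer suboption has been used */
} tetml_seq;

static void seq_init(tetml_seq *s)
{
    s->trailer_done = false;
}

/* Record one page-processing call. pagenumber == 0 combined with
 * want_trailer models the separate trailer-only call; a non-zero
 * pagenumber with want_trailer models "trailer together with the
 * last page". Pages may arrive in any order. */
static seq_status seq_process_page(tetml_seq *s, int pagenumber, bool want_trailer)
{
    assert(pagenumber >= 0);
    if (s->trailer_done)
        return SEQ_ERR_AFTER_TRAILER;
    if (want_trailer)
        s->trailer_done = true;
    return SEQ_OK;
}

/* Closing is only valid once the trailer has been requested. */
static seq_status seq_close_document(const tetml_seq *s)
{
    return s->trailer_done ? SEQ_OK : SEQ_ERR_NO_TRAILER;
}
```

A checker like this could wrap the real calls in a debug build so that ordering mistakes surface in client code before they reach the library.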
C++ const char *get_xml_data(int doc, size_t *length, wstring optlist)
C# Java final byte[] get_xml_data(int doc, String optlist)
Perl PHP string get_xml_data(long doc, string optlist)
VB RB Function get_xml_data(doc As Long, optlist As String)
C const char *TET_get_xml_data(TET *tet, int doc, size_t *length, const char *optlist)
Retrieve TETML data from memory.
doc A valid document handle obtained with TET_open_document*( ).
length (C and C++ language binding only) A pointer to a variable which will hold the
length of the returned string in bytes. length does not count the terminating null byte.
optlist (Currently there are no supported options.)
Returns
A byte array containing the next chunk of data according to the specified options. If the
buffer is empty, an empty string will be returned (in C: a NULL pointer and *length=0).
Details This function retrieves TETML data which has been created by TET_open_document*( )
and one or more calls to TET_process_page( ). The TETML data will always be encoded in
UTF-8, regardless of the outputformat option. The internal buffer will be cleared by this
call. It is not required to call TET_get_xml_data( ) after each call to TET_process_page( ).
The client may accumulate the data for one or more pages or for the whole document in
the buffer.
In TETML mode this function must be called at least once before
TET_close_document( ), since otherwise the data would no longer be accessible. If
TET_get_xml_data( ) is called exactly once (such a single call must happen between the last call to
TET_process_page( ) and TET_close_document( )), the buffer is guaranteed to contain well-formed
TETML output for the whole document. This function must not be called if the
filename suboption has been supplied to the tetml option of TET_open_document*( ).
Bindings
C and C++ language bindings: the result will be provided as null-terminated UTF-8. On
i5/iSeries and zSeries, EBCDIC-encoded UTF-8 will be returned. The returned data buffer
can be used until the next call to TET_get_xml_data( ).
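Because the returned buffer is only valid until the next retrieval call, a C client that fetches TETML in several chunks needs its own copy of each chunk. The following is a minimal sketch of one way to do that, assuming the client wants the chunks concatenated in memory. `chunk_buf` and its functions are hypothetical helpers, not part of TET; the `chunk` and `length` arguments stand for the pointer and byte count that the C binding returns, including the NULL pointer with length 0 for an empty buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Client-owned, growable byte buffer (hypothetical helper, not part of TET). */
typedef struct {
    char  *data;   /* null-terminated once non-NULL */
    size_t len;    /* bytes stored, excluding the terminator */
} chunk_buf;

static void chunk_buf_init(chunk_buf *b)
{
    b->data = NULL;
    b->len = 0;
}

/* Copy one retrieved chunk to the end of the client buffer.
 * Returns 0 on success and -1 if memory could not be allocated. */
static int chunk_buf_append(chunk_buf *b, const char *chunk, size_t length)
{
    char *grown;

    assert(b != NULL);
    if (chunk == NULL || length == 0)
        return 0;                    /* empty library buffer: nothing to copy */

    grown = realloc(b->data, b->len + length + 1);
    if (grown == NULL)
        return -1;                   /* b->data is untouched and still valid */

    memcpy(grown + b->len, chunk, length);
    b->len += length;
    grown[b->len] = '\0';            /* keep the copy null-terminated */
    b->data = grown;
    return 0;
}

static void chunk_buf_free(chunk_buf *b)
{
    free(b->data);
    chunk_buf_init(b);
}
```

In a real program each chunk would come from a call such as `TET_get_xml_data(tet, doc, &length, "")` and be passed to `chunk_buf_append` before the next retrieval. A client that retrieves the whole document with a single call does not need this helper, provided it finishes with the data while the returned buffer is still valid.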
Chapter 10: TET Library API Reference