17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


10.6 Text and Metrics Retrieval Functions

C++ string get_text(int page)

C# Java String get_text(int page)

Perl PHP string TET_get_text(resource tet, long page)

VB Function get_text(page As Long) As String

C const char *TET_get_text(TET *tet, int page, int *len)

Get the next text fragment from a page’s content.

page A valid page handle obtained with TET_open_page().

len (C language binding only) A pointer to a variable which will hold the length of the returned string in UTF-16 values (not bytes!). To determine the number of bytes this value must be multiplied by 2 if outputformat=utf16; the string length of the returned null-terminated string must be used if outputformat=utf8.
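The difference between UTF-16 values and bytes can be illustrated without TET itself. The following is a minimal sketch using only the Python standard library; the helper names `utf16_units` and `byte_length` are hypothetical and not part of the TET API. It follows the rule stated above: double the count for outputformat=utf16, and use the length of the encoded string for outputformat=utf8.

```python
def utf16_units(text: str) -> int:
    """Count the UTF-16 values (code units) needed for text.

    This is the unit the manual describes for the len parameter of the
    C binding; the helper itself is only an illustration.
    """
    # Encoding without a byte order mark yields exactly 2 bytes per code unit.
    return len(text.encode("utf-16-le")) // 2


def byte_length(text: str, outputformat: str) -> int:
    """Number of bytes the text occupies in the given output format."""
    if outputformat == "utf16":
        # UTF-16 values times 2, as described for outputformat=utf16.
        return utf16_units(text) * 2
    if outputformat == "utf8":
        # Equivalent to the C string length of the null-terminated UTF-8 result.
        return len(text.encode("utf-8"))
    raise ValueError("unsupported outputformat: " + outputformat)


# 'A', 'ä' (U+00E4) and a musical G clef (U+1D11E, outside the BMP).
sample = "A\u00e4\U0001d11e"
print(utf16_units(sample))           # UTF-16 values
print(byte_length(sample, "utf16"))  # bytes as UTF-16
print(byte_length(sample, "utf8"))   # bytes as UTF-8
```

The sample contains a character outside the Basic Multilingual Plane, which needs two UTF-16 values, so the count of UTF-16 values differs both from the number of characters and from either byte count.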

Returns

A string containing the next text fragment on the page. The length of the fragment is determined by the granularity option of TET_open_page(). Even for granularity=glyph the string may contain more than one character (see Section 6.2, »Unicode Concepts«, page 61).

If all text on the page has been retrieved an empty string or null object will be returned (see below). In this case TET_get_errnum() should be called to find out whether there is no more text because of an error on the page, or because the end of the page has been reached.

Bindings

C language binding: the result will be provided as null-terminated UTF-8 (default) or UTF-16 string according to the outputformat option of TET_set_option(). On iSeries and zSeries EBCDIC-encoded UTF-8 can also be selected, and is enabled by default. The returned data buffer can be used until the next call to this function. If no more text is available a NULL pointer and *len=0 will be returned.

C++ and COM: the result will be provided as Unicode string in UTF-16 format. If no more text is available an empty string will be returned.

Java and .NET: the result will be provided as Unicode string. If no more text is available a null object will be returned.

Perl, PHP, and Python language bindings: the result will be provided as UTF-8 string. If no more text is available a null object will be returned.

RPG language binding: the result will be provided as null-terminated ASCII- or EBCDIC-encoded UTF-8 string, or as a null-terminated UTF-16 string according to the outputformat option of TET_set_option(). If no more text is available a NULL pointer will be returned.
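The retrieval pattern described above (fetch fragments until the end marker appears, then ask for the error number) might be sketched as follows. This is an illustration only: `FakeTET` is a hypothetical stand-in so that the sketch runs without the library, it is not the real Python binding, and treating an error number of 0 as "no error" is an assumption of this sketch.

```python
class FakeTET:
    """Hypothetical stand-in for a TET object, NOT the real binding."""

    def __init__(self, fragments, errnum=0):
        self._fragments = list(fragments)
        self._errnum = errnum

    def get_text(self, page):
        # Mimics the behaviour described for the Python binding:
        # a null object (None) once no more text is available.
        return self._fragments.pop(0) if self._fragments else None

    def get_errnum(self):
        # Assumption of this sketch: 0 means no error occurred.
        return self._errnum


def collect_text(tet, page):
    """Collect all fragments of a page, then check why the text ended."""
    fragments = []
    while True:
        text = tet.get_text(page)
        if text is None:  # end marker in the Python binding
            break
        fragments.append(text)
    # Distinguish "end of page reached" from "error on the page".
    errnum = tet.get_errnum()
    if errnum != 0:
        raise RuntimeError("text retrieval stopped with error %d" % errnum)
    return fragments


print(collect_text(FakeTET(["Hello", "world"]), page=1))
```

In bindings that signal the end of text with an empty string or a NULL pointer instead, the loop condition would test for that value; the error check after the loop stays the same.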

142 Chapter 10: TET Library API Reference
