PDFlib Text Extraction Toolkit (TET) Manual
10.9 <strong>Text</strong> and Metrics Retrieval Functions<br />
C++ wstring get_text(int page)<br />
C# Java String get_text(int page)<br />
Perl PHP string get_text(long page)<br />
VB RB Function get_text(page As Long) As String<br />
C const char *<strong>TET</strong>_get_text(<strong>TET</strong> *tet, int page, int *len)<br />
Get the next text fragment from a page’s content.<br />
page A valid page handle obtained with <strong>TET</strong>_open_page( ).<br />
len (C language binding only) A pointer to a variable which will hold the length of the<br />
returned string. The unit of the length depends on the outputformat option of <strong>TET</strong>_set_option( ):<br />
If outputformat=utf8 the length is reported as number of Unicode characters. The<br />
number of bytes in the null-terminated string (which is identical to the number of 8-bit<br />
code units) can be determined with the strlen( ) function.<br />
If outputformat=utf16 the length is reported as number of 16-bit code units; surrogate<br />
pairs are counted as two code units. The number of bytes in the string is 2*len.<br />
If outputformat=utf32 the length is reported as number of 32-bit code units (which is<br />
identical to the number of Unicode characters). The number of bytes in the string is<br />
4*len.<br />
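The length rules above can be illustrated with a small C sketch. It does not call TET. The helper names utf8_char_count, utf16_char_count and fragment_bytes are hypothetical and exist only for this illustration, and the input is assumed to be well-formed UTF-8 or UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count Unicode characters in a null-terminated UTF-8 string by skipping
 * continuation bytes (bit pattern 10xxxxxx). Assumes well-formed UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Count Unicode characters in len UTF-16 code units. A surrogate pair is two
 * code units but one character, so low surrogates (DC00-DFFF) are skipped. */
size_t utf16_char_count(const uint16_t *units, int len)
{
    size_t n = 0;
    assert(units != NULL && len >= 0);
    for (int i = 0; i < len; i++) {
        if (units[i] < 0xDC00 || units[i] > 0xDFFF)
            n++;
    }
    return n;
}

/* Bytes occupied by a fragment whose length was reported as len code units.
 * unit_size is 2 for outputformat=utf16 and 4 for outputformat=utf32. */
size_t fragment_bytes(int len, size_t unit_size)
{
    assert(len >= 0);
    return (size_t)len * unit_size;
}
```

For the UTF-8 string "Größe" the character count is 5 while strlen( ) reports 7 bytes, because ö and ß each occupy two bytes. A UTF-16 fragment holding "A" followed by U+1D11E has a length of 3 code units (6 bytes) but contains only 2 characters.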
Returns<br />
A string containing the next text fragment on the page. The length of the fragment is<br />
determined by the granularity option of <strong>TET</strong>_open_page( ). Even for granularity=glyph the<br />
string may contain more than one character (see Section 7.1, »Important Unicode Concepts«,<br />
page 91).<br />
If all text on the page has been retrieved, an empty string or null object is returned,<br />
depending on the language binding (see below). In this case <strong>TET</strong>_get_errnum( ) should be called to find out whether<br />
no more text is available because of an error on the page or because the end of the page has<br />
been reached.<br />
Bindings<br />
C language binding: the result is provided as a null-terminated UTF-8 (default) or UTF-16/<br />
UTF-32 string according to the outputformat option of <strong>TET</strong>_set_option( ). On i5/iSeries and<br />
zSeries, EBCDIC-encoded UTF-8 can also be selected and is enabled by default. The returned<br />
data buffer remains valid only until the next call to this function. If no more text is<br />
available, a NULL pointer and *len=0 will be returned.<br />
C++ and COM: the result is provided as Unicode string in UTF-16 format (wstring in C++).<br />
If no more text is available an empty string will be returned.<br />
Java, .NET and Objective-C: the result is provided as Unicode string. If no more text is<br />
available a null (nil in Objective-C) object will be returned.<br />
Perl, PHP, Python and Ruby language bindings: the result is provided as UTF-8 (default)<br />
or UTF-16/UTF-32 string according to the outputformat option of <strong>TET</strong>_set_option( ). In Python<br />
3 UTF-16/UTF-32 results are returned as bytes. If no more text is available a null object<br />
will be returned.<br />
REALbasic: the result is provided as Unicode string. If no more text is available an empty<br />
string will be returned.<br />
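The retrieval pattern described for the C language binding can be sketched as follows. To stay self-contained, the sketch does not link against TET. The type fake_tet and the functions fake_get_text and fake_get_errnum are hypothetical stand-ins that only mimic the documented shape of TET_get_text( ) and TET_get_errnum( ), and the sketch assumes that an error number of 0 means "no error". In real code the stand-ins would be replaced by the corresponding TET calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a TET object (not part of the TET API):
 * a fixed list of fragments plus an error number. */
typedef struct {
    const char **fragments;  /* fragments to hand out, in order */
    int count;               /* number of fragments */
    int next;                /* index of the next fragment */
    int errnum;              /* assumption: 0 means no error */
} fake_tet;

/* Stand-in shaped like TET_get_text(): returns the next fragment, or NULL
 * with *len = 0 when no more text is available. */
const char *fake_get_text(fake_tet *tet, int page, int *len)
{
    (void)page;  /* the stand-in has only one page */
    if (tet->errnum != 0 || tet->next >= tet->count) {
        *len = 0;
        return NULL;
    }
    const char *s = tet->fragments[tet->next++];
    *len = (int)strlen(s);  /* ASCII test data: characters == bytes */
    return s;
}

/* Stand-in shaped like TET_get_errnum(). */
int fake_get_errnum(fake_tet *tet)
{
    return tet->errnum;
}

/* Append all fragments of a page to out (capacity cap, null-terminated).
 * Returns 0 at a normal end of page, -1 on error or if out is too small. */
int collect_page_text(fake_tet *tet, int page, char *out, size_t cap)
{
    const char *text;
    int len;
    size_t used = 0;

    assert(tet != NULL && out != NULL);
    if (cap == 0)
        return -1;
    out[0] = '\0';

    while ((text = fake_get_text(tet, page, &len)) != NULL) {
        size_t bytes = strlen(text);  /* byte count for UTF-8 output */
        if (used + bytes + 1 > cap)
            return -1;
        /* Copy now: the buffer is only usable until the next call. */
        memcpy(out + used, text, bytes + 1);
        used += bytes;
    }

    /* NULL only says "no more text"; check whether an error caused it. */
    return fake_get_errnum(tet) == 0 ? 0 : -1;
}
```

Each fragment is copied before the next call because the returned buffer may be reused. The error check is made only after the loop ends, since a NULL return alone does not distinguish an error from the end of the page.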