17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


10.9 Text and Metrics Retrieval Functions

C++ wstring get_text(int page)

C# Java String get_text(int page)

Perl PHP string get_text(long page)

VB RB Function get_text(page As Long) As String

C const char *TET_get_text(TET *tet, int page, int *len)

Get the next text fragment from a page’s content.

page A valid page handle obtained with TET_open_page( ).

len (C language binding only) A pointer to a variable which will hold the length of the returned string. The unit of the length depends on the outputformat option of TET_set_option( ):

If outputformat=utf8, the length is reported as the number of Unicode characters. The number of bytes in the null-terminated string (which is identical to the number of 8-bit code units) can be determined with the strlen( ) function.

If outputformat=utf16, the length is reported as the number of 16-bit code units; surrogate pairs are counted as two code units. The number of bytes in the string is 2*len.

If outputformat=utf32, the length is reported as the number of 32-bit code units (which is identical to the number of Unicode characters). The number of bytes in the string is 4*len.
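The relationship between the reported length and the byte count can be checked without TET. The following Python sketch does not call the library at all; it only uses the standard str.encode( ) method to count characters, code units, and bytes for a sample string. The helper name length_report is made up for illustration, and "Unicode characters" is taken here to mean code points.

```python
def length_report(text):
    """Return {outputformat: (reported_length, byte_count)} for a sample string.

    Mirrors the length rules described above; it does not call TET.
    Byte counts exclude any terminating null character.
    """
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # explicit byte order, so no BOM is added
    utf32 = text.encode("utf-32-le")
    return {
        "utf8": (len(text), len(utf8)),          # code points, bytes
        "utf16": (len(utf16) // 2, len(utf16)),  # 16-bit code units, bytes
        "utf32": (len(utf32) // 4, len(utf32)),  # 32-bit code units, bytes
    }


sample = "a\U0001D11E"  # 'a' plus U+1D11E, which needs a surrogate pair in UTF-16
report = length_report(sample)
for fmt, (reported, nbytes) in report.items():
    print(fmt, reported, nbytes)
```

For this sample the UTF-16 length is one larger than the UTF-8 and UTF-32 lengths, because the character outside the Basic Multilingual Plane is counted as two 16-bit code units.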

Returns A string containing the next text fragment on the page. The length of the fragment is determined by the granularity option of TET_open_page( ). Even for granularity=glyph the string may contain more than one character (see Section 7.1, »Important Unicode Concepts«, page 91).

If all text on the page has been retrieved, an empty string or null object will be returned, depending on the language binding (see below). In this case TET_get_errnum( ) should be called to find out whether there is no more text because of an error on the page, or because the end of the page has been reached.

Bindings

C language binding: the result is provided as null-terminated UTF-8 (default) or UTF-16/UTF-32 string according to the outputformat option of TET_set_option( ). On i5/iSeries and zSeries EBCDIC-encoded UTF-8 can also be selected, and is enabled by default. The returned data buffer can be used until the next call to this function. If no more text is available a NULL pointer and *len=0 will be returned.

C++ and COM: the result is provided as Unicode string in UTF-16 format (wstring in C++). If no more text is available an empty string will be returned.

Java, .NET and Objective-C: the result is provided as Unicode string. If no more text is available a null (nil in Objective-C) object will be returned.

Perl, PHP, Python and Ruby language bindings: the result is provided as UTF-8 (default) or UTF-16/UTF-32 string according to the outputformat option of TET_set_option( ). In Python 3 UTF-16/UTF-32 results are returned as bytes. If no more text is available a null object will be returned.

REALbasic: the result is provided as Unicode string. If no more text is available an empty string will be returned.
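A retrieval loop that follows these end-of-text conventions might look like the sketch below. Because the TET library cannot be assumed to be installed, FakeTET is a hypothetical stand-in that only imitates the behavior described for the Python binding (a null object, i.e. None, when no more text is available). The helper collect_text, the method name get_errnum( ), and the convention that an error number of 0 means "no error" are assumptions made for illustration, not confirmed API details.

```python
class FakeTET:
    """Hypothetical stand-in for a TET object; NOT the real binding."""

    def __init__(self, fragments, errnum=0):
        self._fragments = list(fragments)
        self._errnum = errnum

    def get_text(self, page):
        # Convention described for Python: None when no more text is available.
        return self._fragments.pop(0) if self._fragments else None

    def get_errnum(self):
        # Assumption: 0 means the end of the page was reached without an error.
        return self._errnum


def collect_text(tet, page):
    """Fetch fragments until get_text() signals that no more text is available."""
    fragments = []
    while True:
        text = tet.get_text(page)
        if text is None:
            break
        fragments.append(text)
    # Distinguish "end of page reached" from "error on the page".
    if tet.get_errnum() != 0:
        raise RuntimeError("text extraction stopped because of an error")
    return fragments


words = collect_text(FakeTET(["Hello", "world"]), page=0)
print(" ".join(words))
```

In bindings that signal the end of the text with an empty string instead of a null object, the loop condition would test for an empty string; the error check after the loop stays the same.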
