PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
RPG language binding: the result is provided as Unicode string. If no more text is available<br />
NULL will be returned.<br />
C++ const <strong>TET</strong>_char_info *get_char_info(int page)<br />
C# Java int get_char_info(int page)<br />
Perl PHP object get_char_info(long page)<br />
VB RB Function get_char_info(int page) As Long<br />
C const <strong>TET</strong>_char_info *<strong>TET</strong>_get_char_info(<strong>TET</strong> *tet, int page)<br />
Get detailed information for the next glyph in the most recent text chunk.<br />
page A valid page handle obtained with <strong>TET</strong>_open_page( ).<br />
Note The name of this function is a misnomer. It should better be called <strong>TET</strong>_get_glyph_info( ) since<br />
it reports information about visual glyphs on the page, not the corresponding Unicode<br />
characters.<br />
Returns<br />
Details<br />
If no more glyphs are available for the most recent text fragment returned by <strong>TET</strong>_get_<br />
text( ), a binding-specific value will be returned. See section Bindings below for more details.<br />
This function can be called one or more times after <strong>TET</strong>_get_text( ). It will advance to the<br />
next glyph for the current text chunk associated with the supplied page handle (or return<br />
nothing if there are no more glyphs), and provide detailed information for this<br />
glyph. There will be N >0 successful calls to this function (corresponding to N glyphs)<br />
for a text chunk with M logical characters. The relationship between N and M depends<br />
on the granularity:<br />
> For granularity=glyph each text chunk corresponds to a single glyph, i.e. N=1. One<br />
glyph corresponds to one character in many cases, i.e. M=1. However, for ligature<br />
glyphs multiple characters correspond to a single glyph, i.e. M >1 and <strong>TET</strong>_get_char_<br />
info( ) must be called more than once.<br />
> For granularities other than glyph a sequence of glyphs results in a sequence of characters,<br />
where each glyph may contribute to 0, 1, or more characters. The sequence of<br />
glyphs serves as raw material for the sequence of Unicode characters. In other words,<br />
there is no known relationship between N and M. The relationship between N and M<br />
may be influenced by content analysis (e.g. hyphens are removed by the dehyphenation<br />
process) or Unicode postprocessing (e.g. characters are added or deleted because<br />
of a folding).<br />
For granularities other than glyph this function advances to the next glyph which contributes<br />
to the chunk returned by the most recent call to <strong>TET</strong>_get_text( ). This way it is<br />
possible to retrieve glyph metrics when the Wordfinder is active and a text chunk may<br />
contain more than one character. In order to retrieve all glyph details for the current<br />
text chunk this function must be called repeatedly until it returns no more info.<br />
The glyph details in the structure or properties/fields are valid until the next call to <strong>TET</strong>_<br />
get_char_info( ) or <strong>TET</strong>_close_page( ) with the same page handle (whichever occurs first).<br />
Since there is only a single set of glyph info properties/fields per <strong>TET</strong> object, clients<br />
must retrieve all glyph info before they call <strong>TET</strong>_get_char_info( ) again for the same or<br />
another page or document.<br />
178 Chapter 10: <strong>TET</strong> Library API Reference