17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

RPG language binding: the result is provided as Unicode string. If no more text is available<br />

NULL will be returned.<br />

C++ const <strong>TET</strong>_char_info *get_char_info(int page)<br />

C# Java int get_char_info(int page)<br />

Perl PHP object get_char_info(long page)<br />

VB RB Function get_char_info(int page) As Long<br />

C const <strong>TET</strong>_char_info *<strong>TET</strong>_get_char_info(<strong>TET</strong> *tet, int page)<br />

Get detailed information for the next glyph in the most recent text chunk.<br />

page A valid page handle obtained with <strong>TET</strong>_open_page( ).<br />

Note The name of this function is a misnomer. It should better be called <strong>TET</strong>_get_glyph_info( ) since<br />

it reports information about visual glyphs on the page, not the corresponding Unicode<br />

characters.<br />

Returns<br />

Details<br />

If no more glyphs are available for the most recent text fragment returned by <strong>TET</strong>_get_<br />

text( ), a binding-specific value will be returned. See section Bindings below for more details.<br />

This function can be called one or more times after <strong>TET</strong>_get_text( ). It will advance to the<br />

next glyph for the current text chunk associated with the supplied page handle (or return<br />

nothing if there are no more glyphs), and provide detailed information for this<br />

glyph. There will be N >0 successful calls to this function (corresponding to N glyphs)<br />

for a text chunk with M logical characters. The relationship between N and M depends<br />

on the granularity:<br />

> For granularity=glyph each text chunk corresponds to a single glyph, i.e. N=1. One<br />

glyph corresponds to one character in many cases, i.e. M=1. However, for ligature<br />

glyphs multiple characters correspond to a single glyph, i.e. M >1 and <strong>TET</strong>_get_char_<br />

info( ) must be called more than once.<br />

> For granularities other than glyph a sequence of glyphs results in a sequence of characters,<br />

where each glyph may contribute to 0, 1, or more characters. The sequence of<br />

glyphs serves as raw material for the sequence of Unicode characters. In other words,<br />

there is no known relationship between N and M. The relationship between N and M<br />

may be influenced by content analysis (e.g. hyphens are removed by the dehyphenation<br />

process) or Unicode postprocessing (e.g. characters are added or deleted because<br />

of a folding).<br />

For granularities other than glyph this function advances to the next glyph which contributes<br />

to the chunk returned by the most recent call to <strong>TET</strong>_get_text( ). This way it is<br />

possible to retrieve glyph metrics when the Wordfinder is active and a text chunk may<br />

contain more than one character. In order to retrieve all glyph details for the current<br />

text chunk this function must be called repeatedly until it returns no more info.<br />

The glyph details in the structure or properties/fields are valid until the next call to <strong>TET</strong>_<br />

get_char_info( ) or <strong>TET</strong>_close_page( ) with the same page handle (whichever occurs first).<br />

Since there is only a single set of glyph info properties/fields per <strong>TET</strong> object, clients<br />

must retrieve all glyph info before they call <strong>TET</strong>_get_char_info( ) again for the same or<br />

another page or document.<br />

178 Chapter 10: <strong>TET</strong> Library API Reference

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!