17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Bindings

C and C++ language bindings: If no more glyphs are available for the most recent text chunk returned by TET_get_text(), a NULL pointer will be returned. Otherwise, a pointer to a TET_char_info structure containing information about a single glyph will be returned. The members of the data structure are detailed in Table 10.15.

COM, Java, .NET, and Objective-C language bindings: -1 will be returned if no more glyphs are available for the most recent text chunk returned by TET_get_text(), otherwise 1 will be returned. Individual glyph info can be retrieved from the TET properties/public fields according to Table 10.15. If the properties/fields are accessed after the function has returned -1, all of them contain the value -1 (the unknown field contains false).

Perl and Python language bindings: 0 will be returned if no more glyphs are available for the most recent text chunk returned by TET_get_text(), otherwise a hash containing the keys listed in Table 10.15. Individual glyph info can be retrieved with the keys in this hash.

PHP language binding: an empty (null) object will be returned if no more glyphs are available for the most recent text chunk returned by TET_get_text(), otherwise an object containing the fields listed in Table 10.15. Individual glyph info can be retrieved from the member fields of this object. Integer fields in the glyph info object are implemented as long in the PHP language binding.

REALbasic binding: nil will be returned if no more glyphs are available for the most recent text chunk returned by TET_get_text(), otherwise a TET_char_info object containing the members listed in Table 10.15. Individual glyph info can be retrieved with the keys in this object. The attributes field is called attrs in the REALbasic binding to work around a REALbasic interface problem.

Ruby binding: nil (null object) will be returned if no more glyphs are available, and a TET_char_info object otherwise.
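The return convention described for the Perl and Python bindings (a hash per glyph, then 0 once the current text chunk is exhausted) can be sketched as a simple loop. This is only an illustration under stated assumptions: it does not call the TET library, and `make_stub_source`, `iter_glyphs`, and the sample glyph data are hypothetical names and values introduced here. Only the `uv` and `type` keys are taken from Table 10.15.

```python
def make_stub_source(glyphs):
    """Hypothetical stand-in for a per-chunk glyph source (not the TET API)."""
    pending = list(glyphs)

    def get_char_info():
        # Mirrors the documented Perl/Python convention: one hash (dict)
        # per glyph, then 0 once the chunk has no more glyphs.
        return pending.pop(0) if pending else 0

    return get_char_info


def iter_glyphs(get_char_info):
    """Yield glyph dicts until the source signals exhaustion with 0."""
    while True:
        info = get_char_info()
        if info == 0:
            return
        yield info


# Made-up sample data: an "fi" ligature reported as a two-entry sequence.
source = make_stub_source([
    {"uv": 0x66, "type": 1},   # start of a sequence
    {"uv": 0x69, "type": 10},  # continuation of the sequence
])
glyphs = list(iter_glyphs(source))
print([g["uv"] for g in glyphs])
```

Wrapping the sentinel check in a generator keeps the "0 means no more glyphs" test in one place. The other bindings would test for NULL, -1, an empty object, or nil instead, as described above.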

Table 10.15 Members of the TET_char_info structure (C, C++, Ruby), equivalent public fields (Java, PHP, Objective-C), keys (Perl) or properties (COM and .NET) with their type and meaning. See »Glyph metrics«, page 74, and Figure 6.3 for more details.

property/field name: explanation

uv: (Integer) UTF-32 Unicode value for the current glyph. For granularities other than glyph this may be an artificial or intermediate value which has no relationship to the final text chunk. For granularity=glyph the sequence of Unicode values for the glyphs is identical to the logical text, but for other granularities it may be modified by various processing steps.

type: (Integer) Type of the character. The following types describe real characters which correspond to a glyph on the page. The values of all other properties/fields are determined by the corresponding glyph:
    0 Normal character which corresponds to exactly one glyph
    1 Start of a sequence (e.g. ligature)
The following types describe artificial characters which do not correspond to a glyph on the page. The x and y fields will specify the most recent real character’s endpoint, the width field will be 0, and all other fields except uv will contain the values corresponding to the most recent real character:
    10 Continuation of a sequence (e.g. ligature)
    11 (Deprecated and unused)
    12 Inserted word, line, or zone separator
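The split between real and artificial character types in the table above can be captured in a small lookup. The following is a minimal sketch, not part of the TET API: the names `REAL_TYPES`, `ARTIFICIAL_TYPES`, and `describe_type` are hypothetical, and only the numeric codes and their meanings come from Table 10.15.

```python
# Type codes for real characters, which correspond to a glyph on the page.
REAL_TYPES = {
    0: "normal character which corresponds to exactly one glyph",
    1: "start of a sequence (e.g. ligature)",
}

# Type codes for artificial characters, which have no glyph on the page.
ARTIFICIAL_TYPES = {
    10: "continuation of a sequence (e.g. ligature)",
    11: "deprecated and unused",
    12: "inserted word, line, or zone separator",
}


def describe_type(type_code):
    """Return (kind, meaning) for a type code listed in the table."""
    if type_code in REAL_TYPES:
        return ("real", REAL_TYPES[type_code])
    if type_code in ARTIFICIAL_TYPES:
        return ("artificial", ARTIFICIAL_TYPES[type_code])
    # Codes not listed in the table are rejected rather than guessed at.
    raise ValueError(f"type code not listed in the table: {type_code}")


for code in (0, 1, 10, 12):
    kind, meaning = describe_type(code)
    print(code, kind, meaning)
```

A caller that needs glyph geometry would typically skip the entries classified as artificial, since for those the width is 0 and the remaining fields except uv repeat the most recent real character.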

10.9 Text and Metrics Retrieval Functions
