PDFlib Text Extraction Toolkit (TET) Manual
Bindings
C and C++ language bindings: If no more glyphs are available for the most recent text
chunk returned by TET_get_text( ), a NULL pointer will be returned. Otherwise, a pointer
to a TET_char_info structure containing information about a single glyph will be returned.
The members of the data structure are detailed in Table 10.15.
COM, Java, .NET, and Objective-C language bindings: -1 will be returned if no more
glyphs are available for the most recent text chunk returned by TET_get_text( ), otherwise
1. Individual glyph info can be retrieved from the TET properties/public fields according
to Table 10.15. If the properties/fields are accessed although the function returned -1,
they all contain the value -1 (the unknown field contains false).
Perl and Python language bindings: 0 will be returned if no more glyphs are available
for the most recent text chunk returned by TET_get_text( ), otherwise a hash containing
the keys listed in Table 10.15 will be returned. Individual glyph info can be retrieved
with the keys in this hash.
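The return convention described for the Perl and Python bindings can be sketched as a loop that stops as soon as 0 comes back. The snippet below does not call TET itself: `make_fake_get_char_info` and `collect_glyphs` are hypothetical names, and the stand-in function only mimics the behavior described above (one hash per glyph with keys from Table 10.15, then 0). Only the shape of the loop is meant to carry over to real code.

```python
# Hypothetical stand-in for the TET call; it is NOT part of the TET API.
def make_fake_get_char_info(glyphs):
    """Return a function that hands out one glyph dict per call, then 0."""
    pending = list(glyphs)

    def fake_get_char_info():
        if pending:
            return pending.pop(0)
        return 0  # documented "no more glyphs" value for Perl and Python

    return fake_get_char_info


def collect_glyphs(get_char_info):
    """Call get_char_info repeatedly until it returns 0; gather the hashes."""
    collected = []
    while True:
        char_info = get_char_info()
        if char_info == 0:  # no more glyphs for the most recent text chunk
            break
        collected.append(char_info)
    return collected
```

Comparing the result against 0 explicitly, rather than relying on truthiness, keeps the check aligned with the documented return value.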
PHP language binding: an empty (null) object will be returned if no more glyphs are
available for the most recent text chunk returned by TET_get_text( ), otherwise an object
containing the fields listed in Table 10.15 will be returned. Individual glyph info can be
retrieved from the member fields of this object. Integer fields in the glyph info object are
implemented as long in the PHP language binding.
REALbasic binding: nil will be returned if no more glyphs are available for the most recent
text chunk returned by TET_get_text( ), otherwise a TET_char_info object containing
the members listed in Table 10.15 will be returned. Individual glyph info can be retrieved
from the members of this object. The attributes field is called attrs in the REALbasic
binding to work around a REALbasic interface problem.
Ruby binding: nil (null object) will be returned if no more glyphs are available for the
most recent text chunk returned by TET_get_text( ), otherwise a TET_char_info object.
Table 10.15 Members of the TET_char_info structure (C, C++, Ruby), equivalent public fields (Java, PHP, Objective-C), keys
(Perl) or properties (COM and .NET) with their type and meaning. See »Glyph metrics«, page 74, and Figure 6.3 for more
details.

property/field name   explanation

uv      (Integer) UTF-32 Unicode value for the current glyph. For granularities other than glyph this may be an
        artificial or intermediate value which has no relationship to the final text chunk. For granularity=glyph
        the sequence of Unicode values for the glyphs is identical to the logical text, but for other granularities it
        may be modified by various processing steps.

type    (Integer) Type of the character. The following types describe real characters which correspond to a glyph
        on the page. The values of all other properties/fields are determined by the corresponding glyph:
          0   Normal character which corresponds to exactly one glyph
          1   Start of a sequence (e.g. ligature)
        The following types describe artificial characters which do not correspond to a glyph on the page. The x
        and y fields will specify the most recent real character’s endpoint, the width field will be 0, and all other
        fields except uv will contain the values corresponding to the most recent real character:
          10  Continuation of a sequence (e.g. ligature)
          11  (Deprecated and unused)
          12  Inserted word, line, or zone separator
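
As an illustration of the uv and type entries, the following minimal sketch separates real characters (types 0 and 1) from artificial ones (types 10, 11, 12) and converts the UTF-32 values to text. It works on plain dictionaries shaped like the hash described for the Python binding and does not call TET. The helper names (`is_real`, `describe`, `text_of`) are hypothetical, and any sample data fed to them is made up for demonstration.

```python
# Type codes as listed in Table 10.15; the label strings are illustrative.
REAL_TYPES = {
    0: "normal character",
    1: "start of a sequence",
}
ARTIFICIAL_TYPES = {
    10: "continuation of a sequence",
    11: "deprecated and unused",
    12: "inserted separator",
}


def is_real(char_info):
    """True if the entry corresponds to a glyph on the page (type 0 or 1)."""
    return char_info["type"] in REAL_TYPES


def describe(char_info):
    """Return a short label for the entry's type code."""
    code = char_info["type"]
    if code in REAL_TYPES:
        return REAL_TYPES[code]
    if code in ARTIFICIAL_TYPES:
        return ARTIFICIAL_TYPES[code]
    return "unknown type code"


def text_of(char_infos):
    """Join the UTF-32 values in uv into a Python string."""
    return "".join(chr(ci["uv"]) for ci in char_infos)
```

For a ligature, one would expect an entry of type 1 followed by one or more entries of type 10; only the first counts as a real character under `is_real`.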
10.9 Text and Metrics Retrieval Functions 179