PDFlib Text Extraction Toolkit (TET) Manual
Table 10.10 Members of the TET_char_info structure (C and C++), equivalent public fields (Java, PHP), keys (Perl) or<br />
properties (COM and .NET) with their type and meaning. See »Glyph metrics«, page 64, for more details.<br />
property/field name and explanation:<br />
uv (Integer) UTF-32 Unicode value of the current character. It will be 0 if the corresponding UTF-16 value is<br />
the trailing value of a surrogate pair (i.e. if type=11).<br />
type (Integer) Type of the character. The following types describe real characters which correspond to a glyph<br />
on the page. The values of all other properties/fields are determined by the corresponding glyph:<br />
0 Normal character which corresponds to exactly one glyph<br />
1 Start of a sequence (e.g. ligature)<br />
The following types describe artificial characters which do not correspond to a glyph on the page. The x<br />
and y fields will specify the most recent real character’s endpoint, the width field will be 0, and all other<br />
fields except uv will contain the values corresponding to the most recent real character:<br />
10 Continuation of a sequence (e.g. ligature)<br />
11 Trailing value of a surrogate pair; the leading value has type=0, 1, or 10.<br />
12 Inserted word, line, or zone separator<br />
unknown (Boolean, in C and C++: integer) Usually false (0), but will be true (1) if the original glyph could not be<br />
mapped to Unicode and has been replaced with the character specified as unknownchar.<br />
x, y (Double) Position of the glyph’s reference point. The reference point is the lower left corner of the glyph<br />
box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters<br />
the x, y coordinates will be those of the end point of the most recent real character.<br />
width (Double) Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial<br />
characters the width will be 0.<br />
alpha (Double) Direction of inline text progression in degrees measured counter-clockwise. For horizontal writing<br />
mode this is the direction of the text baseline; for vertical writing mode it is the digression from the<br />
standard -90° direction. The angle will be in the range -180° < alpha ≤ +180°. For standard horizontal<br />
text as well as for standard text in vertical writing mode the angle will be 0°.<br />
beta (Double) Text slanting angle in degrees (counter-clockwise), relative to the perpendicular of alpha. The<br />
angle will be 0° for upright text, and negative for italicized (slanted) text. The angle will be in the range<br />
-180° < beta ≤ 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.<br />
fontid (Integer) Index of the font in the fonts[] pseudo object (see Table 9.5). fontid is never negative.<br />
fontsize (Double) Size of the font (always positive); the relation of this value to the actual height of glyphs is not<br />
fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses<br />
all ascenders (including accented characters) and descenders.<br />
textrendering (Integer) Text rendering mode:<br />
0 fill text<br />
1 stroke text (outline)<br />
2 fill and stroke text<br />
3 invisible text (often used for OCR results)<br />
4 fill text and add it to the clipping path<br />
5 stroke text and add it to the clipping path<br />
6 fill and stroke text and add it to the clipping path<br />
7 add text to the clipping path<br />
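The distinction between real and artificial characters in Table 10.10 can be illustrated with a small Python sketch. `CharInfo` is a hypothetical stand-in that models only three of the fields; it is not a class of the TET API, and `is_real` and `to_text` are illustrative helpers. The sketch assumes that the entry preceding a type=11 entry already carries the full UTF-32 value in uv, which is how the uv description reads, and that the inserted separator is a space.

```python
from dataclasses import dataclass


@dataclass
class CharInfo:
    """Hypothetical stand-in for TET_char_info; only three fields are modelled."""
    uv: int             # UTF-32 value, 0 for the trailing half of a surrogate pair
    type: int           # character type as listed in Table 10.10
    width: float = 0.0  # glyph width, 0 for artificial characters


REAL_TYPES = {0, 1}              # backed by a glyph on the page
ARTIFICIAL_TYPES = {10, 11, 12}  # no glyph of their own


def is_real(ci: CharInfo) -> bool:
    """True if the character corresponds to a glyph on the page."""
    return ci.type in REAL_TYPES


def to_text(chars) -> str:
    """Join the uv values, skipping trailing surrogates (type 11, uv == 0)."""
    return "".join(chr(ci.uv) for ci in chars if ci.type != 11)


# Made-up sample data: an "fi" ligature, a separator, and one non-BMP character.
sample = [
    CharInfo(uv=0x66, type=1, width=5.5),     # "f", start of the ligature sequence
    CharInfo(uv=0x69, type=10),               # "i", continuation of the sequence
    CharInfo(uv=0x20, type=12),               # inserted word separator (assumed to be a space)
    CharInfo(uv=0x1D49C, type=0, width=7.0),  # character outside the BMP
    CharInfo(uv=0, type=11),                  # trailing surrogate of the previous character
]

text = to_text(sample)
glyph_count = sum(is_real(ci) for ci in sample)
```

In this sample five entries yield four characters of text but only two glyphs, which is why code that needs glyph geometry would typically filter on the type field first.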
144 Chapter 10: <strong>TET</strong> Library API Reference
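The alpha, beta and textrendering entries of Table 10.10 lend themselves to a few small helpers. The following Python sketch is illustrative only: the function names are hypothetical and not part of TET. The mode table and the angle ranges are taken from the table above. `normalize_angle` is only needed when an application combines angles itself, since the documented values are already in range.

```python
def decode_text_rendering(mode: int) -> dict:
    """Split a textrendering value (0..7) into its fill/stroke/clip/invisible parts."""
    if mode not in range(8):
        raise ValueError(f"text rendering mode must be 0..7, got {mode!r}")
    return {
        "fill": mode in (0, 2, 4, 6),
        "stroke": mode in (1, 2, 5, 6),
        "clip": mode >= 4,        # modes 4..7 add the text to the clipping path
        "invisible": mode == 3,   # often used for OCR results
    }


def normalize_angle(degrees: float) -> float:
    """Map any angle to the range -180 < angle <= +180 used for alpha and beta."""
    a = degrees % 360.0           # Python's % yields a value in [0, 360)
    return a - 360.0 if a > 180.0 else a


def is_mirrored(beta: float) -> bool:
    """Text is mirrored at the baseline if abs(beta) exceeds 90 degrees."""
    return abs(beta) > 90.0
```

For example, `decode_text_rendering(3)` flags the text as invisible, which an application could use to tell OCR text layers apart from visibly painted text.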