PDFlib Text Extraction Toolkit (TET) Manual
Table 10.15 Members of the TET_char_info structure (C, C++, Ruby), equivalent public fields (Java, PHP, Objective-C), keys (Perl) or properties (COM and .NET) with their type and meaning. See »Glyph metrics«, page 74, and Figure 6.3 for more details.
property/field name    explanation
attributes¹ (Integer) Glyph attributes expressed as bits which can be combined:
    bit 0  Geometric or semantic subscript
    bit 1  Geometric or semantic superscript
    bit 2  Drop cap character (initial large character at the start of a paragraph)
    bit 3  Glyph- or word-based shadow duplicate of this glyph has been removed
    bit 4  Glyph represents last character before hyphenation point
    bit 5  Hyphenation artifact (i.e. the hyphen character) which was removed unless contentanalysis={keephyphenglyphs=true} was specified.
    bit 6  Glyph represents the character after hyphenation point
unknown (Boolean, in C, C++ and Perl: integer) Usually false (0), but will be true (1) if the original glyph could not be mapped to Unicode and has been replaced with the character specified as unknownchar.
x, y (Double) Position of the glyph’s reference point. The reference point is the lower left corner of the glyph box for horizontal writing mode, and the top center point for vertical writing mode. For artificial characters the x, y coordinates will be those of the end point of the most recent real character.
width (Double) Width of the corresponding glyph (for both horizontal and vertical writing mode). For artificial characters (i.e. inserted separators with type=12 and hyphenation artifacts with attribute bit 5 set) the width is 0.
alpha (Double) Direction of inline text progression in degrees measured counter-clockwise (or clockwise for topdown coordinates). For horizontal writing mode this is the direction of the text baseline; for vertical writing mode it is the digression from the standard vertical direction. The angle will be in the range -180° < alpha ≤ +180°. For standard horizontal text as well as for standard text in vertical writing mode the angle will be 0°.
beta (Double) Text slanting angle in degrees measured counter-clockwise (or clockwise for topdown coordinates), relative to the perpendicular of alpha. The angle will be 0° for upright text, and negative for italicized (slanted) text (positive for topdown coordinates). The angle will be in the range -180° < beta ≤ 180°, but different from ±90°. If abs(beta) > 90° the text is mirrored at the baseline.
fontid (Integer) Index of the font in the fonts[] pseudo object (see the pCOS Path Reference). fontid is never negative.
fontsize (Double) Size of the font (always positive); the relation of this value to the actual height of glyphs is not fixed, but may vary with the font design. For most fonts the font size is chosen such that it encompasses all ascenders (including accented characters) and descenders.
textrendering (Integer) Text rendering mode:
    0  fill text
    1  stroke text (outline)
    2  fill and stroke text
    3  invisible text (often used for OCR results)
    4  fill text and add it to the clipping path
    5  stroke text and add it to the clipping path
    6  fill and stroke text and add it to the clipping path
    7  add text to the clipping path
1. In the REALbasic binding this field is called attrs.<br />
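The integer-coded fields in Table 10.15 can be interpreted with ordinary bit tests and lookups. The following is a minimal sketch in Python, assuming the values of attributes, textrendering and beta have already been read from a character info result. The helper names (describe_attributes, is_invisible, adds_to_clipping_path, is_mirrored) and the short labels are hypothetical and chosen for illustration only; they are not part of the TET API.

```python
# Hypothetical helpers for interpreting TET_char_info values.
# Bit numbers and rendering modes are taken from Table 10.15;
# all function and constant names here are illustrative assumptions.

# bit number -> short label (labels are paraphrased from the table)
ATTRIBUTE_BITS = {
    0: "subscript",
    1: "superscript",
    2: "drop cap",
    3: "shadow duplicate removed",
    4: "last character before hyphenation point",
    5: "hyphenation artifact",
    6: "character after hyphenation point",
}

# text rendering mode -> meaning
TEXT_RENDERING = {
    0: "fill text",
    1: "stroke text (outline)",
    2: "fill and stroke text",
    3: "invisible text",
    4: "fill text and add it to the clipping path",
    5: "stroke text and add it to the clipping path",
    6: "fill and stroke text and add it to the clipping path",
    7: "add text to the clipping path",
}


def describe_attributes(attributes):
    """Return the labels of all bits set in the attributes value."""
    return [label for bit, label in ATTRIBUTE_BITS.items()
            if attributes & (1 << bit)]


def is_invisible(textrendering):
    """Mode 3 marks invisible text, which is often used for OCR results."""
    return textrendering == 3


def adds_to_clipping_path(textrendering):
    """Modes 4 to 7 add the text to the clipping path."""
    return 4 <= textrendering <= 7


def is_mirrored(beta):
    """Text is mirrored at the baseline if abs(beta) > 90 degrees."""
    return abs(beta) > 90


if __name__ == "__main__":
    # Bits 0 and 5 set: a subscript glyph that is also a hyphenation artifact.
    print(describe_attributes((1 << 0) | (1 << 5)))
    print(TEXT_RENDERING[3], is_invisible(3))
```

Because the attribute bits can be combined, testing each bit separately is safer than comparing the whole value against a single constant.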
Chapter 10: TET Library API Reference