17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Fig. 6.2 Configuring coordinate display in Acrobat; use View, Cursor Coordinates to display cursor coordinates.

constructed by determining the union of all rectangles specified in the includebox option, and subtracting the union of all rectangles specified in the excludebox option. A character is considered inside the clipping area if its reference point is inside the clipping area. This means that a character could be considered inside the clipping area even if parts of it extend beyond the clipping area, or vice versa.
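The clipping rule described above can be sketched in a few lines of Python. This is only an illustrative model of the rule, not TET code: the Rect class and the in_clipping_area function are hypothetical names, and treating points on a rectangle border as inside is an assumption that the text does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle given by its lower left and upper right corners."""
    llx: float
    lly: float
    urx: float
    ury: float

    def contains(self, x: float, y: float) -> bool:
        # Assumption: points on the border count as inside.
        return self.llx <= x <= self.urx and self.lly <= y <= self.ury


def in_clipping_area(x, y, includeboxes, excludeboxes):
    """Test a character's reference point (x, y) against the clipping area.

    The clipping area is the union of all includeboxes minus the union
    of all excludeboxes. Only the reference point is tested, so the
    extent of the glyph itself plays no role.
    """
    included = any(r.contains(x, y) for r in includeboxes)
    excluded = any(r.contains(x, y) for r in excludeboxes)
    return included and not excluded


if __name__ == "__main__":
    include = [Rect(0, 0, 100, 100), Rect(100, 0, 200, 50)]
    exclude = [Rect(40, 40, 60, 60)]
    # A glyph whose reference point is at x=95 counts as inside, even if
    # its outline would extend beyond the right edge at x=100.
    print(in_clipping_area(95, 10, include, exclude))
    print(in_clipping_area(50, 50, include, exclude))
```

Because only the reference point is tested, a wide glyph placed just inside the right edge is reported as inside, while a glyph whose reference point lies just left of the area is reported as outside even if most of its outline overlaps the area.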

Glyph metrics. Using TET_get_char_info( ) you can retrieve font and metrics information for the characters which are returned for a particular glyph. The following values are available for each character in the output (see Figure 6.3 and Table 10.15):

> The uv value contains the UTF-32 Unicode value of the current character, i.e. the character for which details are retrieved. This field always contains UTF-32, even in language bindings whose native Unicode strings can hold only UTF-16. Accessing the uv field allows applications to deal with characters outside the BMP without having to interpret surrogate pairs. Since surrogate pairs are reported as two separate characters, the uv field of the leading surrogate value contains the actual Unicode value (larger than U+FFFF). The trailing surrogate value is treated as an artificial character, and its uv field has a value of 0.

> The type field specifies how the character was created. There are two groups: real and artificial characters. The group of real characters comprises normal characters (i.e. the complete result of a single glyph) and characters which start a multi-character sequence that corresponds to a single glyph (e.g. the first character of a ligature). The

74 Chapter 6: Text Extraction
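The uv convention for surrogate pairs described above can be illustrated with a small self-contained sketch. It does not call TET; the function uv_per_utf16_unit is a hypothetical helper that only mimics what the text describes: each UTF-16 code unit is reported as a separate character, the leading surrogate carries the full UTF-32 value, and the trailing surrogate carries 0.

```python
def uv_per_utf16_unit(text: str) -> list:
    """Return (utf16_code_unit, uv) pairs for each UTF-16 unit of text.

    Models the described convention only: BMP characters have uv equal
    to their code point; for characters above U+FFFF the leading
    surrogate gets the actual Unicode value and the trailing surrogate
    gets uv = 0.
    """
    result = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Standard UTF-16 encoding of a supplementary character.
            v = cp - 0x10000
            lead = 0xD800 + (v >> 10)
            trail = 0xDC00 + (v & 0x3FF)
            result.append((lead, cp))   # leading surrogate: real value
            result.append((trail, 0))   # trailing surrogate: uv is 0
        else:
            result.append((cp, cp))
    return result


if __name__ == "__main__":
    # "A" followed by U+1D11E (MUSICAL SYMBOL G CLEF), which is outside the BMP.
    for unit, uv in uv_per_utf16_unit("A\U0001D11E"):
        print(f"unit=U+{unit:04X} uv=U+{uv:04X}")
```

An application that reads only the uv values and skips entries with uv equal to 0 obtains the sequence of full code points without decoding surrogate pairs itself.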
