17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

width<br />

(x, y)<br />

beta<br />

fontsize<br />

baseline<br />

fontsize<br />

(x, y)<br />

alpha<br />

width<br />

Fig. 6.3<br />

Glyph metrics for horizontal and vertical writing mode<br />

ed as two separate characters, the uv field of the leading surrogate value will contain<br />

the actual Unicode value (larger than U+FFFF). The uv field of the trailing surrogate<br />

value will be treated as an artificial character, and will have an uv value of 0.<br />

> The type field specifies how the character was created. There are two groups: real and<br />

artificial characters. The group of real characters comprises normal characters (i.e.<br />

the complete result of a single glyph) and characters which start a multi-character<br />

sequence that corresponds to a single glyph (e.g. the first character of a ligature). The<br />

group of artificial characters comprises the continuation of a multi-character sequence<br />

(e.g. the second character of a ligature), the trailing value of a surrogate pair,<br />

and inserted separator characters. For artificial characters the position (x, y) will<br />

specify the endpoint of the most recent real character, the width will be 0, and all<br />

other fields except uv will be those of the most recent real character. The endpoint is<br />

the point (x, y) plus the width added in direction alpha (in horizontal writing mode)<br />

or plus the fontsize in direction -90˚ (in vertical writing mode).<br />

> The unknown field will usually be false (in C and C++: 0), but has a value of true (in C<br />

and C++: 1) if the original glyph could not be mapped to Unicode and has therefore<br />

been replaced with the character specified in the unknownchar option. Using this<br />

field you can distinguish real document content from replaced characters if you<br />

specified a common character as unknownchar, such as a question mark or space.<br />

> The (x, y) fields specify the position of the glyph’s reference point, which is the lower<br />

left corner of the glyph rectangle in horizontal writing mode, and the top center in<br />

vertical writing mode (see Section 6.4, »Support for Chinese, Japanese, and Korean<br />

<strong>Text</strong>«, page 67 for details on vertical writing mode). For artificial characters, which do<br />

not correspond to any glyph on the page, the point (x, y) specifies the end point of<br />

the most recent real character.<br />

> The width field specifies the width of a glyph according to the corresponding font<br />

metrics and text output parameters, such as character spacing and horizontal scaling.<br />

Since these parameters control the position of the next glyph, the distance between<br />

the reference points of two adjacent glyphs may be different from width. The<br />

6.3 Page and <strong>Text</strong> Geometry 65

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!