17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


width may be zero for non-spacing characters. On the other hand, the outline may actually be wider than the glyph’s width value, e.g. for slanted text. The width will be 0 for artificial characters.

> The angle alpha provides the direction of inline text progression, specified as the deviation from the standard direction. The standard direction is 0° for horizontal writing mode, and -90° for vertical writing mode (see below for more details on vertical writing mode). Therefore, the angle alpha will be 0° for standard horizontal text as well as for standard vertical text.

> The angle beta specifies any skewing which has been applied to the text, e.g. for slanted (italicized) text. The angle will be measured against the perpendicular of alpha. It will be 0° for standard upright text (for both horizontal and vertical writing mode). If the absolute value of beta is greater than 90° the text will be mirrored at the baseline.

> The fontid field contains the pCOS ID of the font used for the glyph. It can be used to retrieve detailed font information, such as the font name, embedding status, writing mode (horizontal/vertical), etc. Section 9.1, »Simple pCOS Examples«, page 105 shows sample code for retrieving font details.

> The fontsize field specifies the size of the text in points. It will be normalized, and will therefore always be positive.

> The textrendering field specifies the kind of rendering for a glyph, e.g. stroked, filled, or invisible. It will reflect the numerical text rendering mode as defined for PDF page descriptions (see Table 10.10, page 144). Invisible text will be extracted by default, but this can be changed with the ignoreinvisibletext option of TET_open_page( ).
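The field descriptions above can be turned into a few simple checks. The following is a minimal sketch, not part of the TET API: `GlyphInfo` and `describe_glyph` are hypothetical names introduced here for illustration, and only the field names (alpha, beta, fontsize, textrendering) come from the text. The mode numbers follow the text rendering modes of the PDF page description language, where mode 3 denotes invisible text; angles are assumed to be given in degrees, as in the description above.

```python
from dataclasses import dataclass

# Text rendering modes as defined for PDF page descriptions.
TEXT_RENDERING_MODES = {
    0: "fill",
    1: "stroke",
    2: "fill and stroke",
    3: "invisible",
    4: "fill and add to clipping path",
    5: "stroke and add to clipping path",
    6: "fill, stroke, and add to clipping path",
    7: "add to clipping path",
}


@dataclass
class GlyphInfo:
    """Hypothetical container; the field names follow the text above."""
    alpha: float        # deviation from the standard direction, in degrees
    beta: float         # skewing angle, in degrees
    fontsize: float     # normalized size in points, always positive
    textrendering: int  # numerical PDF text rendering mode


def describe_glyph(g: GlyphInfo) -> dict:
    """Summarize a glyph using the rules stated in the field descriptions."""
    return {
        "rendering": TEXT_RENDERING_MODES.get(g.textrendering, "unknown"),
        "invisible": g.textrendering == 3,
        # alpha is 0 for standard horizontal and standard vertical text
        "standard_direction": g.alpha == 0,
        # beta is 0 for standard upright text
        "upright": g.beta == 0,
        # |beta| > 90 means the text is mirrored at the baseline
        "mirrored": abs(g.beta) > 90,
    }
```

A caller could use the `invisible` flag to skip such glyphs after extraction; the ignoreinvisibletext option mentioned above achieves the same effect inside the library.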

End points of glyphs and words. In order to do proper highlighting you need the end position of the last character in a word. Using x, y, width, and alpha returned by TET_get_char_info( ) you can determine the end point of a glyph in horizontal writing mode:

x_end = x + width * cos(alpha)

y_end = y + width * sin(alpha)

In the common case of horizontally oriented text (i.e. alpha=0) this reduces to

x_end = x + width

y_end = y

For CJK text with vertical writing mode the end point calculation works as follows:

x_end = x

y_end = y - fontsize
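The end point formulas above can be sketched in a few lines of Python. This is an illustration only: `glyph_end_point` is a hypothetical helper, not a TET function, and its parameters simply mirror the x, y, width, alpha, and fontsize values described in the text. It assumes alpha is supplied in degrees (as in the angle descriptions above), so the value is converted to radians before the trigonometric functions are applied.

```python
import math


def glyph_end_point(x, y, width, alpha_deg=0.0, fontsize=0.0, vertical=False):
    """Return (x_end, y_end) for a glyph.

    Hypothetical helper; alpha_deg is assumed to be in degrees.
    """
    if vertical:
        # Vertical writing mode (CJK): the glyph advances downwards
        # by the font size.
        return (x, y - fontsize)
    # Horizontal writing mode: advance by width along direction alpha.
    a = math.radians(alpha_deg)
    return (x + width * math.cos(a), y + width * math.sin(a))
```

For highlighting a word, the start point of its first glyph and the end point of its last glyph, computed this way, give the extent of the word along the baseline.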
