17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


6.3 Page and Text Geometry

Coordinate system. TET represents all page and text metrics in the default coordinate system of PDF. However, the origin of the coordinate system (which could be located outside the page) will be adjusted to the lower left corner of the visible page. More precisely, the origin will be located in the lower left corner of the CropBox if it is present, or the MediaBox otherwise. Page rotation will be applied if the page has a Rotate key. The coordinate system uses the DTP point as unit:

1 pt = 1 inch / 72 = 25.4 mm / 72 = 0.3528 mm

The first coordinate increases to the right, the second coordinate increases upwards. All coordinates expected or returned by TET are interpreted in this coordinate system, regardless of their representation in the underlying PDF document. See Section 9.1, »Simple pCOS Examples«, page 105 to see how to determine the size of a PDF page.
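The unit conversion and the origin adjustment described above can be illustrated with a short Python sketch. It is not part of the TET API: the function names and the sample box values are hypothetical, and page rotation is deliberately ignored.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def pt_to_mm(points):
    """Convert DTP points to millimeters (1 pt = 25.4 mm / 72)."""
    return points * MM_PER_INCH / POINTS_PER_INCH


def mm_to_pt(mm):
    """Convert millimeters to DTP points."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def to_visible_page_coords(x, y, box):
    """Shift a point so that the origin is the lower left corner of `box`.

    `box` is (llx, lly, urx, ury) in default PDF coordinates; pass the
    CropBox if the page has one, the MediaBox otherwise.
    Assumption: the page has no Rotate key (rotation is not handled here).
    """
    llx, lly, _urx, _ury = box
    return (x - llx, y - lly)


if __name__ == "__main__":
    print(round(pt_to_mm(1), 4))
    # Hypothetical CropBox whose lower left corner is not at (0, 0)
    print(to_visible_page_coords(100, 150, (20, 30, 615, 822)))
```

With a box whose lower left corner is at (0, 0) the shift has no effect, which is the common case for most documents.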

Area of text extraction. By default, TET will extract all text from the visible page area. Using the clippingarea option of TET_open_page() (see Table 10.5, page 134) you can change this to any of the PDF page box entries (e.g. TrimBox). With the keyword unlimited, all text can be extracted regardless of any page boxes. The default value cropbox instructs TET to extract text within the area which is visible in Acrobat.

The area of text extraction can be specified in more detail by providing an arbitrary number of rectangular areas in the includebox and excludebox options of TET_open_page(). This is useful for extracting partial page content (e.g. selected columns), or for excluding irrelevant parts (e.g. margins, headers and footers). The final clipping area is constructed by determining the union of all rectangles specified in the includebox option, and subtracting the union of all rectangles specified in the excludebox option. A character is considered inside the clipping area if its reference point is inside the clipping area. This means that a character could be considered inside the clipping area even if parts of it extend beyond the clipping area, or vice versa.
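The rule "union of include rectangles minus union of exclude rectangles, tested against the reference point" can be sketched as follows. This Python fragment only models the geometry and does not call TET. The function names are hypothetical, rectangles are assumed to be (llx, lly, urx, ury) tuples in points, and treating points on a rectangle edge as inside is an assumption of this sketch.

```python
def in_rect(point, rect):
    """Return True if `point` lies inside `rect` (edges count as inside)."""
    x, y = point
    llx, lly, urx, ury = rect
    return llx <= x <= urx and lly <= y <= ury


def in_clipping_area(ref_point, include_boxes, exclude_boxes):
    """Test a character's reference point against the final clipping area.

    The point must lie in at least one include rectangle (the union)
    and in none of the exclude rectangles (the subtracted union).
    Only the reference point is tested, so a glyph may partly extend
    beyond the area and still count as inside, or vice versa.
    """
    included = any(in_rect(ref_point, r) for r in include_boxes)
    excluded = any(in_rect(ref_point, r) for r in exclude_boxes)
    return included and not excluded


if __name__ == "__main__":
    include = [(0, 0, 300, 842)]      # hypothetical left column
    exclude = [(0, 750, 595, 842)]    # hypothetical header strip
    print(in_clipping_area((100, 400), include, exclude))  # left column body
    print(in_clipping_area((100, 780), include, exclude))  # inside the header
    print(in_clipping_area((400, 400), include, exclude))  # right column
```

Because the exclusion is applied after the union, an excluded rectangle always wins where it overlaps an included one.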

In order to determine suitable values for the includebox and excludebox options it can be helpful to display cursor coordinates in Acrobat. Note, however, that in Acrobat coordinates increase towards the bottom of the page, while the PDF coordinate system (and therefore TET) uses coordinates which increase towards the top of the page.

> Display cursor coordinates in Acrobat:
Acrobat 7/8: View, Navigation Panels, Info;
Acrobat 9: View, Cursor Coordinates.

> Select points as units for the coordinate display:
Acrobat 7/8/9: Edit, Preferences, [General], Units & Guides, Page & Ruler, Points;
In Acrobat 7/8 you can also use Options, Points in the Info panel.
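Since Acrobat counts the vertical coordinate downwards and TET counts it upwards, values read from the Acrobat cursor display must be flipped before they are used for includebox or excludebox. The Python sketch below shows one way to do this. It assumes that Acrobat measures from the top left corner of the visible page, that points are selected as units, and that the height of the visible page is known. The function name is hypothetical.

```python
def acrobat_rect_to_box(left, top, right, bottom, page_height):
    """Convert a rectangle read in Acrobat to (llx, lly, urx, ury).

    `top` and `bottom` are distances from the top edge of the visible
    page (Acrobat display, increasing downwards, assumed origin at the
    top left corner). The result uses coordinates which increase
    towards the top of the page.
    """
    llx = left
    lly = page_height - bottom   # lower edge is farther from the top
    urx = right
    ury = page_height - top
    return (llx, lly, urx, ury)


if __name__ == "__main__":
    # Hypothetical rectangle on a page which is 842 pt high (A4)
    print(acrobat_rect_to_box(50, 100, 300, 700, 842))
```

The horizontal coordinates are unchanged; only the two vertical values are mirrored and swap roles.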

Glyph metrics. Using TET_get_char_info() you can retrieve font and metrics information for the characters which are returned for a particular glyph. The following values are available for each character in the output (see Figure 6.3 and Table 10.10, page 144):

> The uv value contains the UTF-32 Unicode value of the current character, i.e. the character for which details are retrieved. This field will always contain UTF-32, even in language bindings whose native Unicode strings support only UTF-16. Accessing the uv field allows applications to deal with characters outside the BMP without having to interpret surrogate pairs. Since surrogate pairs will be report-
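For context, the following Python sketch shows the arithmetic an application would otherwise need in order to map between a code point outside the BMP and its UTF-16 surrogate pair. It illustrates standard UTF-16 encoding only and does not use the TET API. The function names are hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a UTF-16 surrogate pair into a single UTF-32 value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)   # U+1D11E, outside the BMP
    print(hex(high), hex(low))
    print(hex(from_surrogate_pair(high, low)))
```

A single UTF-32 value such as the one in the uv field makes this decoding step unnecessary.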

