17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


6.2 Page and Text Geometry

Default coordinate system. By default, TET represents all page and text metrics in the
standard coordinate system of PDF. However, the origin of the coordinate system
(which could be located outside the page) is moved to the lower left corner of the visible
page. More precisely, the origin is located in the lower left corner of the CropBox if it
is present, or of the MediaBox otherwise. Page rotation is applied if the page has a Rotate
key. The coordinate system uses the DTP point as its unit:

1 pt = 1 inch / 72 = 25.4 mm / 72 ≈ 0.3528 mm

The first coordinate increases to the right, the second coordinate increases upwards. All
coordinates expected or returned by TET are interpreted in this coordinate system, regardless
of their representation in the underlying PDF document. See the pCOS Path
Reference to learn how to determine the size of a PDF page.
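
The unit relationship above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names pt_to_mm and mm_to_pt are hypothetical helpers and not part of the TET API.

```python
# Conversion between DTP points and millimeters, following
# 1 pt = 1 inch / 72 = 25.4 mm / 72.
# pt_to_mm and mm_to_pt are hypothetical helper names, not TET functions.

MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def pt_to_mm(points: float) -> float:
    """Convert a length in points to millimeters."""
    return points * MM_PER_INCH / POINTS_PER_INCH


def mm_to_pt(mm: float) -> float:
    """Convert a length in millimeters to points."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


if __name__ == "__main__":
    # An A4 page measures 210 mm x 297 mm.
    print(round(mm_to_pt(210), 2), round(mm_to_pt(297), 2))  # 595.28 841.89
```

Such helpers may be convenient when an area is measured on paper in millimeters but must be supplied to TET in points.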

Top-down coordinate system. Unlike PDF’s bottom-up coordinate system, some graphics
environments use top-down coordinates, which some developers may prefer.
To facilitate the use of top-down coordinates, TET supports an alternative
coordinate system in which all relevant coordinates are interpreted relative to the upper
left corner of the page instead of the lower left corner, with y coordinates increasing
downwards. This topdown feature lets TET users work naturally
in a top-down coordinate system. As an additional advantage, top-down coordinates
are identical to the coordinate values displayed in Acrobat (see below). The
top-down coordinate system for a page can be activated with the page option
topdown={output}.
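
The geometric relationship between the two systems can be sketched as follows. This is a minimal illustration, assuming that page_height is the height in points of the visible page (CropBox or MediaBox, after rotation); it shows only the arithmetic and is not a description of how TET implements the topdown option. The names flip_y and rect_to_topdown are hypothetical.

```python
# Relationship between bottom-up (default) and top-down coordinates.
# Assumption: page_height is the height of the visible page in points.
# flip_y and rect_to_topdown are hypothetical helpers, not TET functions.

from typing import Tuple


def flip_y(y: float, page_height: float) -> float:
    """Convert a y coordinate between bottom-up and top-down systems.

    The mapping is its own inverse, so it works in both directions.
    """
    return page_height - y


def rect_to_topdown(
    llx: float, lly: float, urx: float, ury: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Convert a bottom-up rectangle to (left, top, right, bottom) in top-down terms."""
    return (llx, flip_y(ury, page_height), urx, flip_y(lly, page_height))


if __name__ == "__main__":
    # A text line near the top of a 792 pt high page.
    print(rect_to_topdown(72, 700, 300, 720, 792))  # (72, 72, 300, 92)
```

Only the second coordinate changes; x coordinates are the same in both systems.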

Visualizing coordinates in Acrobat. You can visualize page coordinates in Acrobat as
follows (see Figure 6.2):

> To display cursor coordinates in Acrobat X/XI, use View, Show/Hide, Cursor Coordinates.
> The coordinates are displayed in the unit which is currently selected in Acrobat. To
change the display units to points (as used in TET) in Acrobat X/XI,
go to Edit, Preferences, [General...], Units & Guides, Units and select Points.

Note that the coordinates displayed refer to an origin in the top left corner of the page,
and not to the default coordinate system of PDF and TET with its origin in the lower left
corner. See the description of the top-down coordinate system above for details on selecting
a coordinate system which aligns with Acrobat’s coordinate display.

Area of text extraction. By default, TET extracts all text from the visible page area.
Using the clippingarea option of TET_open_page( ) (see Table 10.10, page 169) you can
change this to any of the PDF page box entries (e.g. TrimBox). With the keyword unlimited,
all text can be extracted regardless of any page boxes. The default value cropbox instructs
TET to extract text within the area which is visible in Acrobat.
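
As a rough sketch, the option list for the clippingarea option could be assembled as shown below. The keywords cropbox and unlimited come from the text above; the remaining lowercase keywords are an assumption derived from the standard PDF page box names and should be checked against Table 10.10. The helper name clippingarea_optlist is hypothetical, and the resulting string is meant to be passed as the option list of TET_open_page( ).

```python
# Build a "clippingarea=..." page option string.
# Assumption: apart from "cropbox" and "unlimited" (named in the manual text),
# the keywords below are lowercase forms of the standard PDF page box names.
# clippingarea_optlist is a hypothetical helper, not part of the TET API.

PAGE_BOX_KEYWORDS = ("mediabox", "cropbox", "bleedbox", "trimbox", "artbox")
ALLOWED_KEYWORDS = PAGE_BOX_KEYWORDS + ("unlimited",)


def clippingarea_optlist(area: str = "cropbox") -> str:
    """Return an option string selecting the area of text extraction."""
    keyword = area.lower()
    if keyword not in ALLOWED_KEYWORDS:
        raise ValueError(f"unsupported clipping area: {area!r}")
    return f"clippingarea={keyword}"


if __name__ == "__main__":
    print(clippingarea_optlist("TrimBox"))    # clippingarea=trimbox
    print(clippingarea_optlist("unlimited"))  # clippingarea=unlimited
```

Validating the keyword before calling the library keeps typos from surfacing later as option list errors.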

The area of text extraction can be specified in more detail by providing an arbitrary
number of rectangular areas in the includebox and excludebox options of TET_open_page( ).
This is useful for extracting partial page content (e.g. selected columns), or for
excluding irrelevant parts (e.g. margins, headers and footers). The final clipping area is
