PDFlib Text Extraction Toolkit (TET) Manual
6.2 Page and <strong>Text</strong> Geometry<br />
Default coordinate system. By default <strong>TET</strong> represents all page and text metrics in the<br />
standard coordinate system of PDF. However, the origin of the coordinate system<br />
(which could be located outside the page) is adjusted to the lower left corner of the visible<br />
page. More precisely, the origin is located in the lower left corner of the CropBox if it<br />
is present, or the MediaBox otherwise. Page rotation is applied if the page has a Rotate<br />
key. The coordinate system uses the DTP point as unit:<br />
1 pt = 1 inch / 72 = 25.4 mm / 72 = 0.3528 mm<br />
The first coordinate increases to the right, the second coordinate increases upwards. All<br />
coordinates expected or returned by <strong>TET</strong> are interpreted in this coordinate system, regardless<br />
of their representation in the underlying PDF document. See the pCOS Path<br />
Reference to learn how to determine the size of a PDF page.<br />
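The unit conversion and the origin shift described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the TET API: the function names are hypothetical, the box is assumed to be given as (llx, lly, urx, ury) in PDF user space, and page rotation is deliberately ignored.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # 1 pt = 1 inch / 72


def pt_to_mm(points):
    """Convert DTP points to millimeters."""
    return points * MM_PER_INCH / POINTS_PER_INCH


def mm_to_pt(mm):
    """Convert millimeters to DTP points."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def to_visible_page_coords(x, y, box):
    """Shift a point so that the lower left corner of `box` becomes the origin.

    `box` is the CropBox if present, otherwise the MediaBox, given as
    (llx, lly, urx, ury). Rotation (the Rotate key) is NOT handled here.
    """
    llx, lly, _urx, _ury = box
    return (x - llx, y - lly)
```

For example, with a hypothetical CropBox of (20, 30, 615, 822), the user-space point (100, 150) would map to (80, 120) relative to the visible page.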
Top-down coordinate system. Unlike PDF’s bottom-up coordinate system, some graphics<br />
environments use top-down coordinates, which some developers may prefer.<br />
In order to facilitate the use of top-down coordinates <strong>TET</strong> supports an alternative<br />
coordinate system in which all relevant coordinates are interpreted relative to the upper<br />
left corner of the page instead of the lower left corner, with y coordinates increasing<br />
downwards. This topdown feature has been designed to make it quite natural for <strong>TET</strong> users<br />
to work in a top-down coordinate system. As an additional advantage, top-down coordinates<br />
are identical to the coordinate values displayed in Acrobat (see below). The<br />
top-down coordinate system for a page can be activated with the page option<br />
topdown={output}.
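The relationship between the two coordinate systems can be illustrated with a short Python sketch. It assumes that the top-down y value is simply the visible page height minus the bottom-up y value, with x unchanged. The function names are hypothetical, and the way a rectangle is represented here is an assumption for illustration, not TET's documented output format.

```python
def to_topdown_y(y, page_height):
    """Convert a bottom-up y coordinate to a top-down one (both in points).

    `page_height` is the height of the visible page (CropBox or MediaBox).
    """
    return page_height - y


def rect_to_topdown(rect, page_height):
    """Convert a bottom-up rectangle (llx, lly, urx, ury) to top-down
    (left, top, right, bottom). The upper edge gets the smaller y value."""
    llx, lly, urx, ury = rect
    return (llx, page_height - ury, urx, page_height - lly)
```

On a page that is 842 pt high, a point at the top edge (y = 842) would get the top-down value 0, and the rectangle (50, 700, 200, 750) would become (50, 92, 200, 142).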
Visualizing coordinates in Acrobat. You can visualize page coordinates in Acrobat as<br />
follows (see Figure 6.2):<br />
> To display cursor coordinates in Acrobat X/XI use View, Show/Hide, Cursor Coordinates.<br />
> The coordinates are displayed in the unit which is currently selected in Acrobat. To<br />
change the display units to points (as used in <strong>TET</strong>) in Acrobat X/XI proceed as follows:<br />
go to Edit, Preferences, [General...], Units & Guides, Units and select Points.<br />
Note that the coordinates displayed refer to an origin in the top left corner of the page,<br />
and not the default coordinate system of PDF and <strong>TET</strong> with an origin in the lower left<br />
corner. See the previous section for details on selecting a top-down coordinate system<br />
which aligns with Acrobat’s coordinate display.<br />
Area of text extraction. By default, <strong>TET</strong> will extract all text from the visible page area.<br />
Using the clippingarea option of <strong>TET</strong>_open_page( ) (see Table 10.10, page 169) you can<br />
change this to any of the PDF page box entries (e.g. TrimBox). With the keyword unlimited<br />
all text regardless of any page boxes can be extracted. The default value cropbox instructs<br />
<strong>TET</strong> to extract text within the area which is visible in Acrobat.<br />
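As a rough sketch, the option list text for clippingarea could be assembled with a small helper like the one below. The helper name is hypothetical, and the keyword set is an assumption: only cropbox and unlimited are named in the text above, while the lowercase spellings of the other page boxes should be checked against Table 10.10. The resulting string would then be passed as part of the option list of TET_open_page( ).

```python
# Assumed keyword set; only "cropbox" and "unlimited" are confirmed by the text.
ASSUMED_CLIPPINGAREA_KEYWORDS = (
    "mediabox", "cropbox", "bleedbox", "trimbox", "artbox", "unlimited",
)


def clippingarea_option(keyword="cropbox"):
    """Build the text of a clippingarea page option, e.g. 'clippingarea=trimbox'.

    Raises ValueError for keywords outside the assumed set.
    """
    key = keyword.lower()
    if key not in ASSUMED_CLIPPINGAREA_KEYWORDS:
        raise ValueError("unknown clippingarea keyword: " + keyword)
    return "clippingarea=" + key
```

Validating the keyword before building the string keeps typos from reaching the library call, where they would otherwise surface only as an option list error.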
The area of text extraction can be specified in more detail by providing an arbitrary<br />
number of rectangular areas in the includebox and excludebox options of <strong>TET</strong>_open_page( ).
This is useful for extracting partial page content (e.g. selected columns), or for
excluding irrelevant parts (e.g. margins, headers and footers). The final clipping area is<br />