17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


7 Image Extraction

7.1 Image Extraction Basics

Image formats. TET extracts raster images from PDF pages and stores each extracted image in one of the following formats:

> TIFF (.tif) images will be created in the majority of cases. Most TIFF images created by TET can be used in the majority of TIFF viewers and consumers. However, some advanced TIFF features are not supported by all image viewers. Note that the Windows XP image viewer does not support the common Flate compression method in TIFF. We regard Adobe Photoshop as the benchmark for the validity of TIFF images.

> JPEG (.jpg) will be created for images which are already compressed with the JPEG algorithm (DCTDecode filter) in PDF. However, in some cases DCT-compressed images must be extracted as TIFF since not all aspects of PDF color handling can be expressed in JPEG.

> JPEG 2000 (.jpx) will be created for images which are already compressed with the JPEG 2000 algorithm (JPXDecode filter) in PDF.
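
The format rules above can be summarized in a small decision function. This is only an illustrative sketch, not TET's actual implementation: the function name `expected_extension` and the flag `expressible_as_jpeg` are hypothetical, and the flag merely stands in for the cases where PDF color handling cannot be expressed in JPEG.

```python
def expected_extension(pdf_filter: str, expressible_as_jpeg: bool = True) -> str:
    """Return the file extension suggested by the rules in the list above.

    pdf_filter          -- name of the PDF compression filter of the image
    expressible_as_jpeg -- hypothetical flag: False when the image's PDF color
                           handling cannot be represented in a JPEG file
    """
    if pdf_filter == "DCTDecode":
        # JPEG-compressed data is normally kept as JPEG, with TIFF as fallback.
        return ".jpg" if expressible_as_jpeg else ".tif"
    if pdf_filter == "JPXDecode":
        # JPEG 2000 data is kept as JPEG 2000.
        return ".jpx"
    # Everything else (for example Flate-compressed images) ends up as TIFF.
    return ".tif"


print(expected_extension("DCTDecode"))         # .jpg
print(expected_extension("DCTDecode", False))  # .tif
print(expected_extension("JPXDecode"))         # .jpx
print(expected_extension("FlateDecode"))       # .tif
```

An application can use such a rule to anticipate which file types to expect, but it should always rely on the format actually reported for each extracted image.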

Placed images and image objects. TET distinguishes between placed images and image objects. A placed image corresponds to an image on a page. A placed image has geometric properties: it is placed at a certain location and has a size (measured in points, millimeters, or some other absolute unit). The image is usually visible on the page, but in some cases it may be invisible because it is obscured by other objects on the page, is placed outside the visible page area, is fully or partially clipped, etc. Placed images are represented by the PlacedImage element in TETML.

An image object is a resource which represents the actual pixel data, colorspace and number of components, number of bits per component, etc. Unlike placed images, image objects don’t have any intrinsic geometry. However, they do have width and height properties (measured in pixels). Each image object has a unique associated ID which can be used to extract its pixel data. Image objects are represented by the Image element in TETML.

A placed image always corresponds to exactly one image object, but the relationship is not one-to-one: several placed images may refer to the same image object. For example, consider an image for a company logo which is used repeatedly in the header of each page in the document. Each logo on a page constitutes a placed image, but all those placed images may be associated with the same image object in an optimized PDF. On the other hand, in a non-optimized PDF each placed logo could be based on its own copy of the same image object. This would result in the same visual appearance, but a more bloated PDF document.
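
The relationship between the two concepts can be modeled as follows. This is a minimal sketch with made-up values; the class and field names are assumptions chosen for illustration and do not mirror the TET API or the TETML schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageObject:
    """Pixel data resource: no geometry, only pixel dimensions."""
    image_id: int
    width_px: int
    height_px: int
    bits_per_component: int
    colorspace: str


@dataclass(frozen=True)
class PlacedImage:
    """One occurrence of an image on a page: position and size in points."""
    page: int
    x_pt: float
    y_pt: float
    width_pt: float
    height_pt: float
    image_id: int  # refers to exactly one ImageObject


def effective_dpi(width_px: int, width_pt: float) -> float:
    """Resolution of a placement: pixels per inch, with 72 points per inch."""
    return width_px / (width_pt / 72.0)


# Optimized PDF: one logo image object, placed in the header of three pages.
logo = ImageObject(image_id=0, width_px=300, height_px=100,
                   bits_per_component=8, colorspace="DeviceRGB")
placements = [
    PlacedImage(page=p, x_pt=36.0, y_pt=756.0, width_pt=72.0, height_pt=24.0,
                image_id=logo.image_id)
    for p in (1, 2, 3)
]

# Group the placements by the image object they refer to.
by_object = defaultdict(list)
for placed in placements:
    by_object[placed.image_id].append(placed.page)

print(dict(by_object))                                      # {0: [1, 2, 3]}
print(effective_dpi(logo.width_px, placements[0].width_pt))  # 300.0
```

In the non-optimized case each placement would instead carry its own image ID, so the grouping would yield three entries with one page each.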

Page-oriented and document-oriented image extraction. The distinction between placed images and image objects gives rise to two fundamentally different approaches to image extraction:

> Page-oriented image extraction: the application is interested in the exact page layout, but doesn’t care about duplicated images. Extracting images in a page-oriented way may result in the same data for more than one extracted image. However, the application can avoid image duplication by checking for duplicate image IDs.
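
Checking for duplicate image IDs during page-oriented extraction could look like the following sketch. It is independent of any particular API: `extract_unique` and the `write_image` callback are hypothetical names, and the list of `(page, image_id)` pairs stands in for whatever the application collects while walking the pages.

```python
from typing import Callable, Iterable, List, Tuple


def extract_unique(placed_images: Iterable[Tuple[int, int]],
                   write_image: Callable[[int], None]) -> List[int]:
    """Write each image object once, even if it is placed several times.

    placed_images -- (page number, image ID) pairs in page order
    write_image   -- callback that stores the pixel data for one image ID
    Returns the image IDs in the order in which they were written.
    """
    seen = set()
    written = []
    for _page, image_id in placed_images:
        if image_id in seen:
            continue  # this image object has already been extracted
        seen.add(image_id)
        write_image(image_id)
        written.append(image_id)
    return written


# The image with ID 0 (say, a logo) is placed on each of three pages.
placed = [(1, 0), (1, 5), (2, 0), (3, 0), (3, 7)]
calls = []
print(extract_unique(placed, calls.append))  # [0, 5, 7]
```

Note that this check only catches placements that share one image object. If a non-optimized PDF stores several copies of the same image data, each copy has its own ID and would still be written separately.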
