17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


8.3 Placed Images and Image Resources

TET distinguishes between placed images and image resources.

> A placed image corresponds to an image on a page. A placed image has geometric properties: it is placed at a certain location and has a size (measured in points, millimeters, or some other absolute unit). In most cases the image is visible on the page, but in some cases it may be invisible because it is obscured by other objects on the page, is placed outside the visible page area, is fully or partially clipped, etc. Placed images are represented by the PlacedImage element in TETML.

> An image resource is a resource which represents the actual pixel data, colorspace and number of components, number of bits per component, etc. Unlike placed images, image resources don’t have any intrinsic geometry. However, they do have width and height properties (measured in pixels). Each image resource has a unique ID which can be used to extract its pixel data. Image resources are represented by the Image element in TETML.
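The distinction can be made concrete with a small data model. The following is a minimal sketch, not the TET API: the class names, field names, and the `effective_dpi` helper are hypothetical and chosen only for illustration. It assumes unrotated, unskewed placements and relies on the PDF convention that one point is 1/72 inch. It shows that a resolution only exists once pixel dimensions (from the resource) are combined with a physical size (from the placement).

```python
from dataclasses import dataclass
from typing import Tuple

POINTS_PER_INCH = 72.0  # PDF default user space: 1 point = 1/72 inch


@dataclass(frozen=True)
class ImageResource:
    """Pixel data description without page geometry (hypothetical model)."""
    image_id: int            # unique ID of the resource
    width_px: int
    height_px: int
    components: int          # e.g. 3 for RGB
    bits_per_component: int


@dataclass(frozen=True)
class PlacedImage:
    """One appearance of a resource on a page (hypothetical model)."""
    image_id: int            # identifies only the underlying resource
    page: int
    x_pt: float
    y_pt: float
    width_pt: float
    height_pt: float


def effective_dpi(placed: PlacedImage, resource: ImageResource) -> Tuple[float, float]:
    """Return horizontal and vertical pixels per inch for one placement."""
    if placed.image_id != resource.image_id:
        raise ValueError("placement does not refer to this resource")
    dpi_x = resource.width_px / (placed.width_pt / POINTS_PER_INCH)
    dpi_y = resource.height_px / (placed.height_pt / POINTS_PER_INCH)
    return dpi_x, dpi_y


# One resource, placed twice at different sizes.
logo = ImageResource(image_id=0, width_px=600, height_px=300,
                     components=3, bits_per_component=8)
small = PlacedImage(image_id=0, page=1, x_pt=36.0, y_pt=700.0,
                    width_pt=144.0, height_pt=72.0)
large = PlacedImage(image_id=0, page=2, x_pt=36.0, y_pt=500.0,
                    width_pt=288.0, height_pt=144.0)

print(effective_dpi(small, logo))  # (300.0, 300.0)
print(effective_dpi(large, logo))  # (150.0, 150.0)
```

The same pixel data yields 300 dpi in one placement and 150 dpi in the other, which is why resolution is a property of the placement and not of the resource.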

An image resource may be used as the basis for an arbitrary number of placed images in the document. Commonly each image resource will be placed exactly once, but it could also be placed repeatedly on the same page or on multiple pages. For example, consider an image for a company logo which is used repeatedly in the header of each page in the document. Each logo on a page constitutes a placed image, but all those placed images may be associated with the same image resource in an optimized PDF. On the other hand, in a non-optimized PDF each placed logo could be based on its own copy of the same image resource. This would result in the same visual appearance, but a larger PDF document. Non-optimized PDF documents may even contain image resources which are not referenced on any page (i.e. unused resources).
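As a rough illustration of the optimized versus non-optimized cases, the sketch below works on hypothetical, hand-made data: a mapping from resource ID to raw pixel bytes and a list of (page, resource ID) placements. Comparing SHA-256 digests of the pixel bytes is an assumption made for this example and not a description of how TET or any PDF optimizer detects duplicates; it also ignores colorspace and other resource properties.

```python
import hashlib
from collections import defaultdict

# Hypothetical input data, not produced by any real extraction call.
resources = {
    0: b"\x00\x01\x02\x03",   # logo
    1: b"\x00\x01\x02\x03",   # second copy of the same logo (non-optimized PDF)
    2: b"\xff\xee\xdd\xcc",   # photo
    3: b"\x10\x20\x30\x40",   # never placed on any page
}
placements = [(1, 0), (2, 1), (2, 2), (3, 0)]  # (page number, resource ID)


def placement_counts(resources, placements):
    """Count how many placed images refer to each resource."""
    counts = {image_id: 0 for image_id in resources}
    for _page, image_id in placements:
        counts[image_id] += 1
    return counts


def duplicate_groups(resources):
    """Group resource IDs whose pixel bytes are identical."""
    by_digest = defaultdict(list)
    for image_id, pixels in resources.items():
        by_digest[hashlib.sha256(pixels).hexdigest()].append(image_id)
    return sorted(sorted(ids) for ids in by_digest.values() if len(ids) > 1)


counts = placement_counts(resources, placements)
unused = sorted(image_id for image_id, n in counts.items() if n == 0)

print(counts)                       # {0: 2, 1: 1, 2: 1, 3: 0}
print(unused)                       # [3]
print(duplicate_groups(resources))  # [[0, 1]]
```

Here resource 0 is shared by two placements, resources 0 and 1 are redundant copies of the same pixel data, and resource 3 is an unused resource.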

Table 8.1 compares various properties of placed images and image resources.<br />

Table 8.1 Comparison of placed images and image resources

property                      placed images                          image resources
----------------------------  -------------------------------------  --------------------------------
TETML element                 PlacedImage                            Image
affected by image merging     yes                                    yes
associated with a page        yes                                    –
width and height in pixels    yes                                    yes
width and height in points    yes                                    –
position on the page          yes                                    –
number of appearances         1                                      0, 1, or more
unique ID                     no: the imageid member returned by     yes: imageid member returned by
                              TET_get_image_info( ) and the          TET_get_image_info( ),
                              PlacedImage/@image attribute in        Image/@id attribute in TETML
                              TETML identify only the underlying
                              image resource
file naming convention in     _p_.[tif|jpg|jpx|jbig2]                _I.[tif|jpg|jpx|jbig2]
the TET command-line tool
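The "number of appearances" and "unique ID" rows can be illustrated by joining PlacedImage/@image against Image/@id. The sketch below uses a simplified, hand-written XML fragment: only the element names PlacedImage and Image and those two attributes are taken from the text above. The wrapper elements, the ID values, and the omission of an XML namespace are assumptions, so this is not a parser for real TETML output.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified fragment for illustration only. Wrapper elements and ID values
# are assumptions; real TETML uses an XML namespace that is omitted here.
FRAGMENT = """
<Document>
  <Page number="1"><PlacedImage image="I0"/></Page>
  <Page number="2"><PlacedImage image="I0"/><PlacedImage image="I1"/></Page>
  <Resources>
    <Image id="I0"/>
    <Image id="I1"/>
    <Image id="I2"/>
  </Resources>
</Document>
"""

root = ET.fromstring(FRAGMENT.strip())

# Every Image element carries its own unique ID.
resource_ids = [img.get("id") for img in root.iter("Image")]

# A PlacedImage has no ID of its own; its image attribute names the resource.
appearances = Counter(p.get("image") for p in root.iter("PlacedImage"))

# Number of appearances per resource: 0, 1, or more.
table = {rid: appearances.get(rid, 0) for rid in resource_ids}
print(table)  # {'I0': 2, 'I1': 1, 'I2': 0}
```

Two PlacedImage elements carry the same image value, so that attribute cannot distinguish the placements from each other; it only identifies the shared resource.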
