17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual



7.3 Image Analysis

Image merging. Sometimes it is not desirable to extract images exactly as they are represented in the PDF document: in many situations what appears to be a single image is actually a collection of several smaller images which are placed close to each other. There are some common reasons for this image fragmentation:

> Some applications and drivers convert multi-strip TIFF images to fragmented PDF images. The number of strips can range from dozens to hundreds.

> Some scanning software divides scanned pages into smaller fragments (strips or tiles). The number of fragments is usually not more than a few dozen.

> Some applications break images into small pieces when generating print or PDF output. In extreme cases, especially documents created with Microsoft Office applications, a page may contain thousands of small image fragments.

TET’s image merging engine detects this situation and recombines the image parts to form a larger and more useful image. Several conditions must be met in order for images to be considered as candidates for merging:

> The image fragments are oriented horizontally or vertically (but not at arbitrary angles), and form a rectangular grid of sub-images.

> The number of bits per component must be the same.

> The colorspace must be the same or compatible.

> Some combinations of colorspace and compression scheme (in particular, JPEG 2000 compression) prevent image merging.
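The conditions above can be sketched as a simple check. This is only an illustration and not TET's actual merging algorithm: the `Fragment` record, the `is_merge_candidate_set` function and the string `"jpeg2000"` are hypothetical names, colorspace compatibility is simplified to strict equality, and "placed close to each other" is simplified to exact abutment of axis-aligned rectangles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical description of one axis-aligned image fragment."""
    x: int            # lower-left corner
    y: int
    width: int
    height: int
    bpc: int          # bits per component
    colorspace: str
    compression: str


def is_merge_candidate_set(frags):
    """Return True if the fragments satisfy the (simplified) merge conditions."""
    if len(frags) < 2:
        return False
    first = frags[0]
    # Same bits per component; colorspace simplified to "identical".
    if any(f.bpc != first.bpc or f.colorspace != first.colorspace for f in frags):
        return False
    # Assumption: treat JPEG 2000 as always blocking the merge.
    if any(f.compression == "jpeg2000" for f in frags):
        return False

    # Rectangular grid: every (column, row) position is occupied exactly once.
    xs = sorted({f.x for f in frags})
    ys = sorted({f.y for f in frags})
    cells = {(f.x, f.y) for f in frags}
    if len(cells) != len(frags) or len(cells) != len(xs) * len(ys):
        return False

    # All fragments in a column share a width, all in a row share a height.
    col_width, row_height = {}, {}
    for f in frags:
        if col_width.setdefault(f.x, f.width) != f.width:
            return False
        if row_height.setdefault(f.y, f.height) != f.height:
            return False

    # Neighbouring columns and rows must touch without gaps or overlaps.
    if any(a + col_width[a] != b for a, b in zip(xs, xs[1:])):
        return False
    if any(a + row_height[a] != b for a, b in zip(ys, ys[1:])):
        return False
    return True
```

For example, three stacked strips of equal width with identical bit depth and colorspace pass the check, while the same strips with a gap between them or with differing bit depths do not.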

If the merging candidates can be combined into a larger image, they will be merged. Merged images can be identified as such by the images[ ]/mergetype pCOS pseudo object: it will have the value 1 (artificial) for merged images and 2 (consumed) for images which have been consumed by the merging process. Consumed images should generally be ignored by the receiving application.
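A receiving application might skip consumed images along the following lines. This is a minimal sketch: `get_number` is a hypothetical callable standing in for whatever pCOS number query the language binding provides, and the image count is assumed to have been obtained separately. Only the values 1 and 2 are taken from the text above; the value 0 used for ordinary images in the usage example is an assumption.

```python
MERGETYPE_ARTIFICIAL = 1   # image created by the merging process
MERGETYPE_CONSUMED = 2     # fragment that was absorbed into a merged image


def images_to_extract(image_count, get_number):
    """Return the indices of all images that are not consumed fragments.

    get_number is a hypothetical stand-in for a pCOS query function that
    takes a path string and returns a number.
    """
    keep = []
    for i in range(image_count):
        mergetype = int(get_number(f"images[{i}]/mergetype"))
        if mergetype != MERGETYPE_CONSUMED:
            keep.append(i)
    return keep


# Usage with a fake lookup table instead of a real document:
fake_pcos = {
    "images[0]/mergetype": 2,   # consumed strip
    "images[1]/mergetype": 2,   # consumed strip
    "images[2]/mergetype": 1,   # merged (artificial) image
    "images[3]/mergetype": 0,   # assumed value for an ordinary image
}
selected = images_to_extract(4, fake_pcos.__getitem__)
```

Passing the query function in as a parameter keeps the filtering logic independent of a particular binding and makes it easy to test without a PDF document.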

In order to completely disable image merging use the following page option:

imageanalysis={merge={disable}}
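Since the option is supplied as part of a page option list, an application may want to assemble that list conditionally. The helper below is a hypothetical sketch that only builds the option string. It assumes that options in an option list are separated by spaces, and the `granularity=page` option in the usage example merely stands in for any other page option.

```python
def page_options(disable_merging, *other_options):
    """Build a page option list string, optionally disabling image merging.

    Hypothetical helper: it only concatenates option strings and does not
    call the library itself.
    """
    parts = list(other_options)
    if disable_merging:
        parts.append("imageanalysis={merge={disable}}")
    return " ".join(parts)


# Usage: combine the merge switch with another page option.
optlist = page_options(True, "granularity=page")
```

The resulting string would then be passed wherever the binding expects the page option list.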

Fig. 7.2 Although this image consists of many little strips, TET will extract it as a single reusable image.
