17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


When are images merged? Analyzing and merging images on a page are triggered by the corresponding call to TET_open_page(). This has the following important consequences:

> The number of entries in the pCOS images[ ] array, i.e. the value of the length:images pseudo object, may increase: as more pages are processed, artificial images which result from image merging are added to the array. In order to extract all merged images you must therefore open all pages in the document before querying length:images and extracting image data. Artificial (merged) images are marked with the corresponding flag artificial (numerical value 1) in the images[ ]/mergetype pseudo object.

> On the other hand, some elements in the images[ ] array may be used only as parts of merged images. Such entries are never removed from the images[ ] array; instead, the consumed entries are marked with the corresponding flag consumed (numerical value 2) in the images[ ]/mergetype pseudo object.
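The ordering requirement described above (open every page first, then query length:images) can be sketched as follows. This is only an illustration under stated assumptions: the method names open_page, close_page and pcos_get_number are assumed to mirror the corresponding TET functions, the FakeTET class is a hypothetical stand-in so that the sketch runs without the library, and the value 0 for normal images is an assumption (the text only names the values 1 and 2).

```python
# 1 (artificial) and 2 (consumed) come from the text; 0 for "normal" is an assumption.
NORMAL, ARTIFICIAL, CONSUMED = 0, 1, 2


class FakeTET:
    """Hypothetical stand-in for a TET object so the sketch runs without the library."""

    def __init__(self, merges_per_page):
        # merges_per_page[i] lists the raw image indices merged on page i + 1.
        self._merges = merges_per_page
        self._mergetypes = [NORMAL] * 4  # four raw image resources
        self._analyzed = set()

    def open_page(self, doc, pageno, optlist):
        # Image analysis and merging are triggered when a page is opened.
        if pageno not in self._analyzed:
            self._analyzed.add(pageno)
            parts = self._merges[pageno - 1]
            if parts:
                for i in parts:
                    self._mergetypes[i] = CONSUMED
                self._mergetypes.append(ARTIFICIAL)
        return pageno

    def close_page(self, page):
        pass

    def pcos_get_number(self, doc, path):
        if path == "length:pages":
            return len(self._merges)
        if path == "length:images":
            return len(self._mergetypes)
        # Remaining paths have the form images[i]/mergetype.
        index = int(path[path.index("[") + 1 : path.index("]")])
        return self._mergetypes[index]


def image_mergetypes(tet, doc):
    """Open every page first, then read the mergetype of every images[] entry."""
    n_pages = int(tet.pcos_get_number(doc, "length:pages"))
    for pageno in range(1, n_pages + 1):
        page = tet.open_page(doc, pageno, "")
        tet.close_page(page)
    # Only now is length:images guaranteed to include all artificial images.
    n_images = int(tet.pcos_get_number(doc, "length:images"))
    return [
        int(tet.pcos_get_number(doc, f"images[{i}]/mergetype"))
        for i in range(n_images)
    ]
```

With the stand-in object, querying length:images before any page is opened reports only the four raw entries, while image_mergetypes() also sees the artificial entry appended during page processing.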

How many images are in a document? Surprisingly, there is no simple answer to this simple question. The answer depends on the following decisions:

> Do you want to count image resources or placed images?

> Do you want to take into account images which are only used as parts of merged images, but are never placed in isolation?

Using TET and pCOS pseudo objects you can determine all variants of the image count. The image_count topic in the TET Cookbook demonstrates various possibilities for image counting. It generates output like the following:

No of raw image resources before merging: 82
No of placed images: 12
No of images after merging (all types): 83
normal images: 1
artificial (merged) images: 1
consumed images: 81
No of relevant (normal or artificial) image resources: 2
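The relationship between the resource counts in this output can be sketched in a few lines. This is not the Cookbook code; it is a minimal illustration which assumes that the images[ ]/mergetype values have already been collected into a list, and that normal images carry the value 0 (the values 1 and 2 are from the text, 0 is an assumption). The function name summarize_images is hypothetical. The number of placed images cannot be derived from mergetype values and is therefore not covered.

```python
from collections import Counter

# 1 (artificial) and 2 (consumed) come from the text; 0 for "normal" is an assumption.
NORMAL, ARTIFICIAL, CONSUMED = 0, 1, 2


def summarize_images(mergetypes):
    """Derive the resource counts from a list of images[]/mergetype values."""
    counts = Counter(mergetypes)
    normal = counts[NORMAL]
    artificial = counts[ARTIFICIAL]
    consumed = counts[CONSUMED]
    return {
        # Artificial entries are added by merging, so they are not raw resources.
        "raw_before_merging": normal + consumed,
        "after_merging": len(mergetypes),
        "normal": normal,
        "artificial": artificial,
        "consumed": consumed,
        # Consumed entries only serve as parts of merged images.
        "relevant": normal + artificial,
    }


# Same distribution as in the sample output: 1 normal, 81 consumed, 1 artificial.
summary = summarize_images([NORMAL] + [CONSUMED] * 81 + [ARTIFICIAL])
```

For the sample distribution this yields 82 raw resources, 83 entries after merging, and 2 relevant resources, matching the figures shown.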

Small image filtering. TET ignores very small images if many of them are present on the page. Since the image merging process often combines many small images into a larger image, small image removal is performed after image merging. Only images which cannot be merged to form a larger image are candidates for small image removal. In addition, they must satisfy the conditions for size and count which can be specified in the maxarea and maxcount suboptions of the smallimages suboption of the imageanalysis page option. In order to completely disable small image removal use the following page option:

imageanalysis={smallimages={disable}}<br />
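The nesting of these suboptions can be made explicit with a small helper which assembles the page option string. This is a sketch only: the function smallimages_option is hypothetical and not part of TET, and the numeric values in the usage lines are arbitrary placeholders, not defaults or recommendations.

```python
def smallimages_option(maxarea=None, maxcount=None, disable=False):
    """Build the imageanalysis page option for small image filtering."""
    if disable:
        inner = "disable"
    else:
        parts = []
        if maxarea is not None:
            parts.append(f"maxarea={maxarea}")
        if maxcount is not None:
            parts.append(f"maxcount={maxcount}")
        if not parts:
            raise ValueError("specify maxarea, maxcount, or disable=True")
        # Suboptions within one option list are separated by spaces.
        inner = " ".join(parts)
    return "imageanalysis={smallimages={" + inner + "}}"


# Reproduces the option shown above.
disabled = smallimages_option(disable=True)

# Arbitrary placeholder values, for illustration of the nesting only.
tuned = smallimages_option(maxarea=0.5, maxcount=20)
```

The resulting string would be supplied as (part of) the option list when opening a page.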

