17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


Document-oriented image extraction: the application is interested in all images in the document, but doesn’t care which image is used on which page. Images which are placed more than once should be extracted only once.

In order to extract an image, the corresponding image ID is required. The image ID is used as an index in the pCOS images[] array, and can be obtained in the following ways which correspond to page-oriented and document-oriented image extraction, respectively:

> TET_get_image_info() retrieves geometric information about a placed image as well as the pCOS image ID (in the imageid field of TET_image_info) of the underlying image data. This ID can be used to retrieve more image details with TET_pcos_get_number(), such as the color space or the width and height in pixels, as well as the actual pixel data with TET_write_image_file() or TET_get_image_data(). TET_get_image_info() will not touch the actual pixel data of the image. If the same image is referenced more than once on one or more pages, the corresponding IDs will be the same.

This method for page-oriented image extraction is demonstrated in the extractor mini sample and the image_extractor topic in the TET Cookbook.

> Enumerate all values from 0 to the highest image ID. The number of entries in the images[] array is queried with TET_pcos_get_number() as the value of the pCOS path length:images, so the highest image ID is this value minus 1. Note that the relationship between images and pages will be lost in this case.
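The two ways of obtaining image IDs can be sketched in Python as follows. This is a minimal illustration, not code from the TET distribution: it assumes that the Python binding exposes TET_pcos_get_number() as a method pcos_get_number(doc, path), and the helper names document_image_ids and unique_image_ids are hypothetical. The first helper enumerates all IDs for document-oriented extraction; the second reduces a list of IDs collected via TET_get_image_info() so that an image placed several times is handled only once.

```python
def document_image_ids(tet, doc):
    """Return all pCOS image IDs of an opened document.

    Assumption: `tet` offers pcos_get_number(doc, path), mirroring
    TET_pcos_get_number() in the C API.
    """
    # length:images is the number of entries in the images[] array,
    # so valid IDs run from 0 to count - 1.
    count = int(tet.pcos_get_number(doc, "length:images"))
    return list(range(count))


def unique_image_ids(placed_ids):
    """Drop repeated IDs while keeping the order of first appearance."""
    seen = set()
    result = []
    for imageid in placed_ids:
        if imageid not in seen:
            seen.add(imageid)
            result.append(imageid)
    return result
```

With the first helper the page relationship is lost, as noted above; with the second, the caller can still record which page each ID was first seen on before deduplicating.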

Once an image ID has been retrieved, the functions TET_write_image_file() or TET_get_image_data() can be called to write the image data to a disk file or fetch the pixel data in memory, respectively.
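Writing a set of images to disk might look roughly like the sketch below. Several details are assumptions rather than statements from this section: that the Python binding offers write_image_file(doc, imageid, optlist) with this argument order, that the option list accepts a filename option, and that a return value of -1 signals failure. The helper name write_images and the file naming scheme are hypothetical.

```python
def write_images(tet, doc, image_ids, basename):
    """Write each image to its own file; return the IDs that failed.

    Assumptions: `tet` offers write_image_file(doc, imageid, optlist),
    the optlist understands `filename`, and -1 means failure.
    """
    failed = []
    for imageid in image_ids:
        # Braces protect file names containing spaces in the option list.
        optlist = "filename={%s_I%d}" % (basename, imageid)
        if tet.write_image_file(doc, imageid, optlist) == -1:
            failed.append(imageid)
    return failed
```

Collecting the failed IDs instead of stopping at the first error lets the caller report unsupported images while still extracting the rest.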

XMP metadata for images. PDF uses the XMP format to attach metadata to the whole document or parts of it. You can find more information about XMP and its use in PDF at the following location: www.pdflib.com/developer/xmp-metadata.

An image object may have XMP metadata associated with it in the PDF document. If XMP metadata is present, TET will by default embed it in the extracted image for the output formats JPEG and TIFF. This behavior can be controlled with the keepxmp option of TET_write_image_file() and TET_get_image_data(). If this option has been set to false, TET will ignore image metadata when generating the image output file.
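As an illustration, an option list that switches off metadata embedding could be assembled as shown below. Only the keepxmp option is taken from the text above; the filename option and the helper name image_optlist are assumptions made for this sketch.

```python
def image_optlist(filename, keepxmp=True):
    """Build an option list string for image extraction.

    `keepxmp` is the option described above; `filename` is an assumed
    companion option naming the output file.
    """
    flag = "true" if keepxmp else "false"
    return "filename={%s} keepxmp=%s" % (filename, flag)
```

Passing keepxmp=False yields an option list ending in keepxmp=false, which asks TET to leave image metadata out of the generated file.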

The image_metadata topic in the pCOS Cookbook shows how to extract image metadata with the pCOS interface directly, without generating any image file.

Chapter 7: Image Extraction
