PDFlib Text Extraction Toolkit (TET) Manual

8 Image Extraction

8.1 Image Extraction Basics

Image formats. TET extracts raster images from PDF pages and stores the extracted images in one of the following formats:

> TIFF (.tif) images are created in most cases. Most TIFF images created by TET can be used in the majority of TIFF viewers and consumers. However, some advanced TIFF features are not supported by all image viewers. We regard Adobe Photoshop as the benchmark for the validity of TIFF images. Note that the Windows XP image viewer does not support the common Flate compression method in TIFF. To work around this viewer restriction you can enable LZW compression with the option preferredtiffcompression=lzw in TET_write_image_file( ) or TET_get_image_data( ).

> JPEG (.jpg) is created for images which are compressed with the JPEG algorithm (DCTDecode filter) in PDF. However, in some cases DCT-compressed images must be extracted as TIFF since not all aspects of PDF color handling can be expressed in JPEG.

> JPEG 2000 (.jpx) is created for images which are compressed with the JPEG 2000 algorithm (JPXDecode filter) in PDF.

> JBIG2 (.jbig2) is created for images which are compressed with the JBIG2 algorithm (JBIG2Decode filter) in PDF. JBIG2 files are created with »sequential organization« according to ISO 14492.
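The relationship between PDF compression filters and output formats in the list above can be summarized as a small lookup. The following is only a sketch: the class ExtractedImageSuffix and the method likelySuffix are hypothetical names, not part of the TET API, and the result is an expectation rather than a guarantee, since DCT-compressed images may still be extracted as TIFF.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative lookup only; this class is not part of TET. */
final class ExtractedImageSuffix {
    // Filter-to-suffix pairs taken from the format list; all other cases fall back to TIFF.
    private static final Map<String, String> SUFFIX_BY_FILTER = new LinkedHashMap<>();
    static {
        SUFFIX_BY_FILTER.put("DCTDecode", ".jpg");
        SUFFIX_BY_FILTER.put("JPXDecode", ".jpx");
        SUFFIX_BY_FILTER.put("JBIG2Decode", ".jbig2");
    }

    /**
     * Hypothetical helper: returns the suffix most likely used for an image
     * stored with the given PDF filter. TET decides the real format, so the
     * actual type should always be queried from TET itself.
     */
    static String likelySuffix(String pdfFilter) {
        return SUFFIX_BY_FILTER.getOrDefault(pdfFilter, ".tif");
    }

    public static void main(String[] args) {
        String[] filters = {"FlateDecode", "DCTDecode", "JPXDecode", "JBIG2Decode"};
        for (String filter : filters) {
            System.out.println(filter + " -> " + likelySuffix(filter));
        }
    }
}
```

Such a table is useful only for planning, for example to estimate which viewers will be needed downstream. The format actually produced should be read from TET as described later in this section.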

Extracting images to disk or memory. The TET API can deliver the images extracted from PDF documents in two different ways:

> The TET_write_image_file( ) API function creates an image file on disk. The base file name of this image file must be specified in the filename option. TET will automatically add a suitable suffix depending on the image type.

> The TET_get_image_data( ) API function delivers the image data in memory. This is convenient if you want to pass on the image data to another processing component without having to deal with disk files.

Details depend on your image extraction requirements (see Section 8.4, »Page-based and Resource-based Image Loops«, page 118). In both cases you can determine the type of the extracted image (see next section).
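The option lists for both delivery paths can be assembled as plain strings before they are handed to TET. The following sketch uses only the filename and preferredtiffcompression options named in this section. The class ImageOptionLists and its methods are hypothetical helpers, and the commented-out calls assume that the Java binding exposes get_image_data alongside the write_image_file method used in this section.

```java
/** Illustrative option-list builders; this class is not part of TET. */
final class ImageOptionLists {

    /** Hypothetical helper: option list for writing an image file to disk. */
    static String diskOptions(String baseName, boolean preferLzw) {
        // Braces keep a base name that contains spaces together as one option value.
        StringBuilder opts = new StringBuilder("filename={").append(baseName).append("}");
        if (preferLzw) {
            // Workaround for viewers that cannot read Flate-compressed TIFF.
            opts.append(" preferredtiffcompression=lzw");
        }
        return opts.toString();
    }

    /** Hypothetical helper: option list for fetching image data in memory (no file name needed). */
    static String memoryOptions(boolean preferLzw) {
        return preferLzw ? "preferredtiffcompression=lzw" : "";
    }

    public static void main(String[] args) {
        String disk = diskOptions("invoice page1", true);
        String memory = memoryOptions(true);
        System.out.println(disk);
        System.out.println(memory);

        // With an open TET document the strings would be used roughly like this
        // (assumed usage, not verified against a specific TET release):
        //   tet.write_image_file(doc, tet.imageid, disk);
        //   byte[] data = tet.get_image_data(doc, tet.imageid, memory);
    }
}
```

Keeping the option lists in one place makes it easy to switch the TIFF compression for all extracted images at once. TET appends the suffix itself, so the base name is passed without an extension.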

Determine the file type of extracted images. The image file type is reported in the Image/@extractedAs attribute in TETML. At the API level you can use the following idiom to determine the type of an extracted image.

int imageType = tet.write_image_file(doc, tet.imageid, "typeonly");

/* Map the numerical image type to a format */
String imageFormat;
switch (imageType) {
    case 10:
        imageFormat = "TIFF";
        break;
    case 20:
