17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

CJK support. <strong>TET</strong> includes full support for extracting Chinese, Japanese, and Korean<br />

text:<br />

> All predefined CJK CMaps (encodings) are recognized; CJK text will be converted to<br />

Unicode. The CMap files for CJK encoding conversion are included in the <strong>TET</strong> distribution.<br />

> Horizontal and vertical writing modes are supported.<br />

> CJK font names will be normalized to Unicode.<br />

Image extraction. <strong>TET</strong> extracts raster images from PDF. Adjacent parts of a segmented<br />

image will be recombined to facilitate postprocessing and re-use (e.g. multi-strip images<br />

created by some applications). Small images can be filtered in order to exclude tiny<br />

image fragments from cluttering the output.<br />

Images will be extracted in the common TIFF, JPEG, or JPEG 2000 formats.<br />

Geometry. <strong>TET</strong> provides precise metrics for the text, such as the position on the page,<br />

glyph widths, and text direction. Specific areas on the page can be excluded or included<br />

in the text extraction process, e.g. to ignore headers and footers or margins.<br />

For images the pixel size, physical size, and color space are available as well as position<br />

and angle.<br />

Word detection and content analysis. <strong>TET</strong> can be used to retrieve low-level glyph information,<br />

but also includes advanced algorithms for high-level content analysis:<br />

> Detect word boundaries to retrieve words instead of characters.<br />

> Recombine the parts of hyphenated words (dehyphenation).<br />

> Remove duplicate instances of text, e.g. shadow and fake bold text.<br />

> Recombine paragraphs into reading order.<br />

> Reorder text which is scattered over the page.<br />

> Reconstruct lines of text.<br />

> Recognize tabular structures on the page.<br />

pCOS interface for simple access to PDF objects. <strong>TET</strong> includes pCOS (<strong>PDFlib</strong> Comprehensive<br />

Object System) for retrieving arbitrary PDF objects. With pCOS you can retrieve<br />

PDF metadata, interactive elements (e.g. bookmark text, contents of form fields), or any<br />

other information from a PDF document with a simple query interface.<br />

<strong>TET</strong> Markup Language (<strong>TET</strong>ML). The information retrieved from a PDF document can<br />

be presented in an XML format called <strong>TET</strong> Markup Language, or <strong>TET</strong>ML for processing<br />

with standard XML tools. <strong>TET</strong>ML contains text, image, and metadata information and<br />

can optionally also contain font- and geometry-related details.<br />

What is text? While <strong>TET</strong> deals with a large class of PDF documents, not all visible text<br />

can successfully be extracted. The text must be encoded using PDF’s text and encoding<br />

facilities (i.e., it must be based on a font). Although the following flavors of text may be<br />

visible on the page they cannot be extracted with <strong>TET</strong>:<br />

> Rasterized (pixel image) text, e.g. scanned pages;<br />

> <strong>Text</strong> which is directly represented by vector elements without any font.<br />

Note that metadata and text in hypertext elements (such as bookmarks, form fields,<br />

notes, or annotations) can be retrieved with the pCOS interface. On the other hand, <strong>TET</strong><br />

12 Chapter 1: Introduction

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!