17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


in the Java language, but can easily be adjusted to other programming languages since the TET API is almost identical for all supported language bindings. Some Cookbook samples are written in the XSLT language. The TET Cookbook is organized in the following groups:

> Text: samples related to text extraction

> Font: samples related to text with a focus on font properties

> Image: samples related to image extraction

> TET & PDFlib+PDI: samples which extract information from a PDF with TET and construct a new PDF based on the original PDF and the extracted information. These samples require the PDFlib+PDI product in addition to TET.

> TETML: XSLT samples for processing TETML

> Special: other samples

The TET Cookbook is available at the following URL:

www.pdflib.com/tet-cookbook

pCOS Cookbook. The pCOS Cookbook is a collection of code fragments for the pCOS interface, which is integrated in TET. It is available at the following URL:

www.pdflib.com/pcos-cookbook

Details of the pCOS interface are documented in the pCOS Path Reference, which is included in the TET package.

1.4 What’s new?

What’s new in TET 4.0? The following features are new or considerably improved in TET 4.0:

> performance enhancements: faster for many classes of documents

> higher speed and lower memory consumption for very large documents (up to hundreds of thousands of pages)

> extraction of right-to-left and bidirectional text for Arabic, Hebrew, etc.

> Unicode postprocessing with normalization, folding, and decomposition controls

> improved shadow removal, word boundary detection, and dehyphenation

> improved super- and subscript detection

> workarounds for non-conforming PDF documents to enhance robustness

> enhanced repair mode for successfully extracting text from damaged PDF documents

> more information in XML output (TETML): dehyphenation, dropcap, shadow, and super/subscript; coordinates in topdown system, PDF/A-2, PDF/E, font subsets,

> improved C++ and Perl language bindings
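To illustrate what the three kinds of Unicode postprocessing mentioned in the list above mean in general, here is a minimal Java sketch. It uses only the standard `java.text.Normalizer` class and does not call the TET API or use TET option names. The class name, the method names, and the choice of space characters to fold are assumptions made for illustration; TET's own controls may behave differently.

```java
import java.text.Normalizer;

// Illustrative sketch only: plain JDK calls, not TET's postprocessing options.
class UnicodePostprocessingSketch {

    // Normalization: bring equivalent character sequences into one canonical
    // form (here NFC, which composes a base letter and a combining mark).
    static String normalizeNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Decomposition: split compatibility characters such as ligatures into
    // their constituent characters (here via NFKD).
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    // Folding: map a group of similar characters to a single representative.
    // Assumption for this sketch: fold no-break space and thin space to U+0020.
    static String foldSpaces(String s) {
        return s.replace('\u00A0', ' ').replace('\u2009', ' ');
    }

    public static void main(String[] args) {
        String ligature = "\uFB01le";   // "fi" ligature followed by "le"
        String combining = "e\u0301";   // "e" followed by a combining acute accent
        String spaced = "a\u00A0b";     // "a", no-break space, "b"

        System.out.println(decompose(ligature));              // prints "file"
        System.out.println(normalizeNfc(combining).length()); // prints 1
        System.out.println(foldSpaces(spaced));               // prints "a b"
    }
}
```

Text extracted from PDF often contains ligatures, combining sequences, and unusual space characters, so applying steps like these before indexing or searching makes string comparisons more predictable.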

What’s new in TET 4.1? The following features are new or considerably improved in TET 4.1:

> support for PDF 1.7 extension level 8 (encryption method specified in ISO 32000-2)

> updated to pCOS interface 8 with more pseudo objects (e.g. font details) and clarified handling of encrypted attachments

> additional information about PDF documents and fonts in TETML output

> new language bindings for Objective-C and Ruby

> improved word boundary detection for ideographic CJK text (suboption ideographic)

> new API functions TET_convert_to_unicode() and TET_info_pvf()

> updated connectors for Lucene and Solr
