PDFlib Text Extraction Toolkit (TET) Manual
in the Java language, but can easily be adjusted to other programming languages since<br />
the <strong>TET</strong> API is almost identical for all supported language bindings. Some Cookbook<br />
samples are written in the XSLT language. The <strong>TET</strong> Cookbook is organized in the following<br />
groups:<br />
> <strong>Text</strong>: samples related to text extraction<br />
> Font: samples related to text with a focus on font properties<br />
> Image: samples related to image extraction<br />
> <strong>TET</strong> & <strong>PDFlib</strong>+PDI: samples which extract information from a PDF with <strong>TET</strong> and construct<br />
a new PDF based on the original PDF and the extracted information. These<br />
samples require the <strong>PDFlib</strong>+PDI product in addition to <strong>TET</strong>.<br />
> <strong>TET</strong>ML: XSLT samples for processing <strong>TET</strong>ML<br />
> Special: other samples<br />
The <strong>TET</strong> Cookbook is available at the following URL:<br />
www.pdflib.com/tet-cookbook.<br />
pCOS Cookbook. The pCOS Cookbook is a collection of code fragments for the pCOS interface<br />
which is integrated in <strong>TET</strong>. It is available at the following URL:<br />
www.pdflib.com/pcos-cookbook.<br />
Details of the pCOS interface are documented in the pCOS Path Reference which is<br />
included in the <strong>TET</strong> package.<br />
1.4 What’s new?<br />
What’s new in <strong>TET</strong> 4.0? The following features are new or considerably improved in<br />
<strong>TET</strong> 4.0:<br />
> performance enhancements: faster processing for many classes of documents<br />
> higher speed and lower memory consumption for very large documents with up to<br />
hundreds of thousands of pages<br />
> extract right-to-left and bidirectional text for Arabic, Hebrew, etc.<br />
> Unicode postprocessing with normalization, folding, and decomposition controls<br />
> improved shadow removal, word boundary detection, and dehyphenation<br />
> improved super- and subscript detection<br />
> workarounds for non-conforming PDF documents to enhance robustness<br />
> enhanced repair mode for successfully extracting text from damaged PDF documents<br />
> more information in XML output (<strong>TET</strong>ML): dehyphenation, dropcap, shadow, and super/subscript;<br />
coordinates in topdown system, PDF/A-2, PDF/E, font subsets<br />
> improved C++ and Perl language bindings<br />
What’s new in <strong>TET</strong> 4.1? The following features are new or considerably improved in<br />
<strong>TET</strong> 4.1:<br />
> support for PDF 1.7 extension level 8 (encryption method specified in ISO 32000-2)<br />
> updated to pCOS interface 8 with more pseudo objects (e.g. font details) and clarified<br />
handling of encrypted attachments<br />
> additional information about PDF documents and fonts in <strong>TET</strong>ML output<br />
> new language bindings for Objective-C and Ruby<br />
> improved word boundary detection for ideographic CJK text (suboption ideographic)<br />
> new API functions <strong>TET</strong>_convert_to_unicode( ) and <strong>TET</strong>_info_pvf( )<br />
> updated connectors for Lucene and Solr<br />