17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


4.4 TET Connector for Oracle

4.5 TET PDF IFilter for Microsoft Products

4.6 TET Connector for the Apache TIKA Toolkit

4.7 TET Connector for MediaWiki

5 Configuration

5.1 Extracting Content from protected PDF

5.2 Resource Configuration and File Searching

5.3 Recommendations for common Scenarios

6 Text Extraction

6.1 PDF Document Domains

6.2 Page and Text Geometry

6.3 Chinese, Japanese, and Korean Text

6.3.1 CJK Encodings and CMaps

6.3.2 Word Boundaries for CJK Text

6.3.3 Vertical Writing Mode

6.3.4 CJK Decompositions: Narrow, wide, vertical, etc.

6.4 Bidirectional Arabic and Hebrew Text

6.4.1 General Bidi Topics

6.4.2 Postprocessing Arabic Text

6.5 Content Analysis

6.6 Layout Analysis

7 Advanced Unicode Handling

7.1 Important Unicode Concepts

7.2 Unicode Preprocessing (Filtering)

7.2.1 Filters for all Granularities

7.2.2 Filters for Granularity Word and above

7.3 Unicode Postprocessing

7.3.1 Unicode Folding

7.3.2 Unicode Decomposition

7.3.3 Unicode Normalization

7.4 Supplementary Characters and Surrogates

7.5 Unicode Mapping for Glyphs

8 Image Extraction

8.1 Image Extraction Basics

8.2 Image Merging and Filtering

8.3 Placed Images and Image Resources

8.4 Page-based and Resource-based Image Loops

8.5 Geometry of Placed Images

