PDFlib Text Extraction Toolkit (TET) Manual
4.4 TET Connector for Oracle
4.5 TET PDF IFilter for Microsoft Products
4.6 TET Connector for the Apache TIKA Toolkit
4.7 TET Connector for MediaWiki
5 Configuration
5.1 Extracting Content from protected PDF
5.2 Resource Configuration and File Searching
5.3 Recommendations for common Scenarios
6 Text Extraction
6.1 PDF Document Domains
6.2 Page and Text Geometry
6.3 Chinese, Japanese, and Korean Text
6.3.1 CJK Encodings and CMaps
6.3.2 Word Boundaries for CJK Text
6.3.3 Vertical Writing Mode
6.3.4 CJK Decompositions: Narrow, wide, vertical, etc.
6.4 Bidirectional Arabic and Hebrew Text
6.4.1 General Bidi Topics
6.4.2 Postprocessing Arabic Text
6.5 Content Analysis
6.6 Layout Analysis
7 Advanced Unicode Handling
7.1 Important Unicode Concepts
7.2 Unicode Preprocessing (Filtering)
7.2.1 Filters for all Granularities
7.2.2 Filters for Granularity Word and above
7.3 Unicode Postprocessing
7.3.1 Unicode Folding
7.3.2 Unicode Decomposition
7.3.3 Unicode Normalization
7.4 Supplementary Characters and Surrogates
7.5 Unicode Mapping for Glyphs
8 Image Extraction
8.1 Image Extraction Basics
8.2 Image Merging and Filtering
8.3 Placed Images and Image Resources
8.4 Page-based and Resource-based Image Loops
8.5 Geometry of Placed Images