PDFlib Text Extraction Toolkit (TET) Manual

1 Introduction

The PDFlib Text Extraction Toolkit (TET) is targeted at extracting text and images from PDF documents, but can also be used to retrieve other information from PDF. TET can be used as a base component for realizing the following tasks:

> search the text contents of PDF
> create a list of all words contained in a PDF (concordance)
> implement a search engine for processing large numbers of PDF files
> extract text from PDF to store, translate, or otherwise repurpose it
> convert the text contents of PDF to other formats
> process or enhance PDFs based on their contents
> compare the text contents of multiple PDF documents
> extract the raster images from PDF for repurposing
> extract metadata and other information from PDF

TET has been designed for standalone use, and does not require any third-party software. It is robust and suitable for multi-threaded server use.

1.1 Overview of TET Features

Supported PDF input. TET has been tested against thousands of PDF test files from various sources. It accepts PDF 1.0 up to PDF 1.7 extension level 3 (corresponding to Acrobat 1-9) as well as encrypted documents.

Unicode support. TET includes a considerable number of algorithms and data to achieve reliable Unicode mappings for all text. Although text in PDF documents is not usually encoded in Unicode, TET will normalize the text from a PDF document to Unicode:

> TET converts all text contents to Unicode. In C the text will be returned in UTF-8 or UTF-16 format; in other language bindings as native Unicode strings.
> Ligatures and other multi-character glyphs will be decomposed into a sequence of their constituent Unicode characters.
> Vendor-specific Unicode values (Corporate Use Subarea, CUS) are identified, and will be mapped to characters with precisely defined meanings if possible.
> Glyphs which are lacking Unicode mapping information are identified as such, and will be mapped to a configurable replacement character.
> UTF-16 surrogate pairs for characters outside the Basic Multilingual Plane (BMP) are properly interpreted and maintained. Surrogate pairs and UTF-32 values can be retrieved in all language bindings.

Some PDF documents do not contain enough information for reliable Unicode mapping. In order to successfully extract the text nevertheless TET offers various configuration options which can be used to supply auxiliary information for proper Unicode mappings. In order to facilitate writing the required mapping tables we make available PDFlib FontReporter, a free plugin for Adobe Acrobat. This plugin can be used for analyzing fonts, encodings, and glyphs in PDF.
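The Unicode concepts in the list above (ligature decomposition, UTF-16 surrogate pairs, and UTF-32 values) can be illustrated with a short sketch. It uses only the Python standard library and does not call the TET API. The helper names are hypothetical, and NFKD compatibility normalization is used here merely as a stand-in for ligature decomposition; it is an assumption for illustration, not a description of how TET performs this step internally.

```python
import unicodedata
from typing import List


def decompose_ligatures(text: str) -> str:
    """Expand compatibility ligatures (e.g. U+FB01) into constituent characters.

    Illustration only: NFKD also decomposes other compatibility and
    accented characters, so it is broader than ligature handling alone.
    """
    return unicodedata.normalize("NFKD", text)


def to_utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text; non-BMP characters yield two units."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_utf32_values(text: str) -> List[int]:
    """Return one UTF-32 value (Unicode code point) per character."""
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    # U+FB01 LATIN SMALL LIGATURE FI followed by "nal"
    print(decompose_ligatures("\ufb01nal"))
    # U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP
    print([hex(u) for u in to_utf16_units("\U0001D11E")])
    print([hex(v) for v in to_utf32_values("\U0001D11E")])
```

A character outside the BMP occupies two UTF-16 code units (a high and a low surrogate) but a single UTF-32 value, which is why an application processing extracted text may want access to both representations.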
