PDFlib Text Extraction Toolkit (TET) Manual

2.2 Command-line Examples

The following examples demonstrate some useful combinations of TET command-line options. The samples are shown in two variations; the first uses the long format of all options, while the second uses the equivalent short option format.

Extract the text from a PDF document file.pdf in UTF-8 format and store it in file.txt:

    tet file.pdf

Exclude the first and last page from text extraction:

    tet --firstpage 2 --lastpage last-1 file.pdf
    tet -f 2 -l last-1 file.pdf

Supply a directory where the CJK CMaps are located (required for CJK text extraction):

    tet --searchpath /usr/local/cmaps file.pdf
    tet -s /usr/local/cmaps file.pdf

Extract the text from a PDF in UTF-16 format and store it in file.utf16:

    tet --format utf16 --outfile file.utf16 file.pdf
    tet --format utf16 -o file.utf16 file.pdf

Extract the text from all PDF files in a directory and store the generated *.txt files in another directory (which must already exist):

    tet --targetdir out in/*.pdf
    tet -t out in/*.pdf

Restrict text extraction to a particular area on the page; this can be achieved by supplying a suitable list of page options:

    tet --pageopt "includebox={{0 0 200 200}}" file.pdf

Extract images from file.pdf and store them in file*.tif/file*.jpg in the directory out:

    tet --targetdir out --image file.pdf
    tet -t out -i file.pdf

Extract images from file.pdf without image merging; this can be achieved by supplying a suitable list of page options which are relevant for image processing:

    tet --targetdir out --image --pageopt "imageanalysis={merge={disable}}" file.pdf
    tet -t out -i --pageopt "imageanalysis={merge={disable}}" file.pdf

Generate TETML output in word mode for PDF document file.pdf and store it in file.tetml:

    tet --tetml word file.pdf
    tet -m word file.pdf

Generate TETML output without any Options elements; this can be achieved by supplying a suitable list of document options:

    tet --docopt "tetml={elements={options=false}}" --tetml word file.pdf
Extract images and generate TETML in a single call:

    tet --image --tetml word file.pdf
    tet -i -m word file.pdf

Use a response file which contains various command-line options and process all PDF documents in the current directory:

    tet @options *.pdf
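
The batch example above requires that the target directory already exists. The following is a minimal shell sketch of how that step might be wrapped in a script. The directory names in and out, the TET_DRY_RUN variable, and the run_tet helper are illustrative assumptions and not part of TET; only the --targetdir option is taken from the examples above.

```shell
#!/bin/sh
# Sketch: batch text extraction into a target directory.
# IN_DIR, OUT_DIR, TET_DRY_RUN and run_tet are hypothetical names
# introduced for this example; they are not TET features.

IN_DIR="${IN_DIR:-in}"
OUT_DIR="${OUT_DIR:-out}"

# --targetdir expects an existing directory, so create it first.
mkdir -p "$OUT_DIR"

# Run tet with the given arguments. If TET_DRY_RUN is set, or no tet
# binary is found on the PATH, print the command line instead of
# executing it.
run_tet() {
    if [ -n "$TET_DRY_RUN" ] || ! command -v tet >/dev/null 2>&1; then
        echo "tet $*"
    else
        tet "$@"
    fi
}

# Equivalent to: tet --targetdir out in/*.pdf
run_tet --targetdir "$OUT_DIR" "$IN_DIR"/*.pdf
```

Setting TET_DRY_RUN=1 before running the script prints the command line that would be executed, which can help when checking option combinations before processing a large directory.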