PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
2 <strong>TET</strong> Command-Line Tool<br />
2.1 Command-Line Options<br />
The <strong>TET</strong> command-line tool allows you to extract text and images from one or more PDF<br />
documents without the need for any programming. Output can be generated in plain<br />
text (Unicode) format or in <strong>TET</strong>ML, <strong>TET</strong>’s XML-based output format. The <strong>TET</strong> program<br />
can be controlled via a number of command-line options. The program will insert space<br />
characters (U+0020) after each word, U+000A after each line, and U+000C after each<br />
page. It is called as follows for one or more input PDF files:<br />
tet [] ...<br />
The <strong>TET</strong> command-line tool is built on top of the <strong>TET</strong> library. You can supply library options<br />
using the --docopt, --tetopt, --imageopt, and --pageopt options according to the option<br />
list tables in Chapter 10, »<strong>TET</strong> Library API Reference«, page 121. Table 2.1 lists all <strong>TET</strong><br />
command-line options (this list will also be displayed if you run the <strong>TET</strong> program without<br />
any options).<br />
Note In order to extract CJK text you must configure access to the CMap files which are shipped with<br />
<strong>TET</strong> according to Section 0.1, »Installing the Software«, page 7.<br />
Table 2.1 <strong>TET</strong> command-line options<br />
option parameters function<br />
-- End the list of options; this is useful if file names start with a - character.<br />
@filename 1<br />
Specify a response file with options; for a syntax description see »Response files«,<br />
page 17. Response files will only be recognized before the -- option and before<br />
the first filename, and can not be used to replace the parameter for another option.<br />
--docopt Additional option list for <strong>TET</strong>_open_document( ) (see Table 10.3, page 130). The<br />
filename suboption of the tetml option can not be used here.<br />
--firstpage<br />
-f<br />
| last<br />
The number of the page where content extraction will start. The keyword last<br />
specifies the last page, last-1 the page before the last page, etc. Default: 1<br />
--format utf8 | utf16 Specifies the format for text output (default: utf8):<br />
utf8 UTF-8 with BOM (byte order mark)<br />
utf16 UTF-16 in native byte ordering with BOM<br />
--help, -?<br />
(or no option)<br />
--inmemory<br />
--image 2<br />
-i<br />
Display help with a summary of available options.<br />
Load the input file into memory and process it from there. This can result in a significant<br />
performance gain on some systems at the expense of memory usage.<br />
Extract images from the document. Extracted images will be placed in files according<br />
to the following naming scheme:<br />
_p_.[tif|jpg|jpx]<br />
--imageopt Additional option list for <strong>TET</strong>_write_image_file( ) (see Table 10.12, page 147)<br />
--lastpage<br />
-l<br />
| last<br />
The number of the page where content extraction will finish. The keyword last<br />
specifies the last page, last-1 the page before the last page, etc. Default: last<br />
2.1 Command-Line Options 15