17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


2 TET Command-Line Tool

2.1 Command-Line Options

The TET command-line tool allows you to extract text and images from one or more PDF documents without the need for any programming. Output can be generated in plain text (Unicode) format or in TETML, TET’s XML-based output format. The TET program can be controlled via a number of command-line options. The program will insert a space character (U+0020) after each word, a line feed (U+000A) after each line, and a form feed (U+000C) after each page. It is called as follows for one or more input PDF files:

tet [options] filename...
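The separator characters described above make the plain-text output easy to split back into pages, lines, and words. The following Python sketch illustrates this. The function name split_tet_text and the sample string are hypothetical and not part of TET, and the text is assumed to have been decoded to a Python string already.

```python
def split_tet_text(text: str) -> list[list[list[str]]]:
    """Split TET plain-text output into pages -> lines -> words.

    Assumes the separators described in the manual: U+0020 after each
    word, U+000A after each line, and U+000C after each page.
    """
    pages = text.split("\f")  # "\f" is U+000C (form feed)
    # A form feed follows every page, so the last piece is empty.
    if pages and pages[-1] == "":
        pages.pop()

    result = []
    for page in pages:
        lines = page.split("\n")  # "\n" is U+000A (line feed)
        # A line feed follows every line, so the last piece is empty.
        if lines and lines[-1] == "":
            lines.pop()
        # A space follows every word; drop the empty trailing piece.
        result.append([[w for w in line.split(" ") if w] for line in lines])
    return result


# Hypothetical sample: two pages, the first with two lines.
sample = "Hello world \nsecond line \n\fPage two \n\f"
pages = split_tet_text(sample)
```

Splitting on the form feed first keeps page boundaries intact even when a page contains no text at all.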

The TET command-line tool is built on top of the TET library. You can supply library options using the --docopt, --tetopt, --imageopt, and --pageopt options according to the option list tables in Chapter 10, »TET Library API Reference«, page 121. Table 2.1 lists all TET command-line options (this list will also be displayed if you run the TET program without any options).

Note: In order to extract CJK text you must configure access to the CMap files which are shipped with TET according to Section 0.1, »Installing the Software«, page 7.

Table 2.1 TET command-line options

option, parameters
    function

--
    End the list of options; this is useful if file names start with a - character.

@filename¹
    Specify a response file with options; for a syntax description see »Response files«, page 17. Response files will only be recognized before the -- option and before the first filename, and can not be used to replace the parameter for another option.

--docopt
    Additional option list for TET_open_document( ) (see Table 10.3, page 130). The filename suboption of the tetml option can not be used here.

--firstpage, -f    number | last
    The number of the page where content extraction will start. The keyword last specifies the last page, last-1 the page before the last page, etc. Default: 1

--format    utf8 | utf16
    Specifies the format for text output (default: utf8):
    utf8     UTF-8 with BOM (byte order mark)
    utf16    UTF-16 in native byte ordering with BOM

--help, -? (or no option)
    Display help with a summary of available options.

--inmemory
    Load the input file into memory and process it from there. This can result in a significant performance gain on some systems at the expense of memory usage.

--image², -i
    Extract images from the document. Extracted images will be placed in files according to the following naming scheme:
    _p_.[tif|jpg|jpx]

--imageopt
    Additional option list for TET_write_image_file( ) (see Table 10.12, page 147)

--lastpage, -l    number | last
    The number of the page where content extraction will finish. The keyword last specifies the last page, last-1 the page before the last page, etc. Default: last
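When the tool is driven from a script, the options in Table 2.1 can be assembled into an argument list before the program is started. The Python sketch below shows one way to do this. The helper build_tet_args and its parameter names are hypothetical, and it assumes that each option value is passed as a separate argument and that the program is available under the name tet. It only uses options listed in the table and always emits -- so that file names starting with a - character are not mistaken for options.

```python
import re

# Page values accepted by --firstpage/--lastpage according to Table 2.1:
# a page number, the keyword "last", or "last-N".
_PAGE_SPEC = re.compile(r"(?:[1-9]\d*|last(?:-\d+)?)")


def build_tet_args(files, first_page=None, last_page=None,
                   text_format=None, extract_images=False,
                   in_memory=False, program="tet"):
    """Return an argument list for the TET command-line tool (hypothetical helper)."""
    if not files:
        raise ValueError("at least one input PDF file is required")

    args = [program]
    for option, value in (("--firstpage", first_page),
                          ("--lastpage", last_page)):
        if value is not None:
            spec = str(value)
            if not _PAGE_SPEC.fullmatch(spec):
                raise ValueError(f"invalid page specification: {spec!r}")
            args += [option, spec]

    if text_format is not None:
        if text_format not in ("utf8", "utf16"):
            raise ValueError(f"unsupported text format: {text_format!r}")
        args += ["--format", text_format]

    if extract_images:
        args.append("--image")
    if in_memory:
        args.append("--inmemory")

    # "--" ends the option list, so names such as "-draft.pdf" are safe.
    args.append("--")
    args.extend(files)
    return args


args = build_tet_args(["-draft.pdf"], first_page=2, last_page="last-1",
                      text_format="utf16")
```

If TET is installed, the resulting list could be handed to subprocess.run(args, check=True). Building a list rather than a single command string avoids shell quoting problems with unusual file names.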
