
PDFlib Text Extraction Toolkit (TET) Manual


Table 2.1 TET command-line options

option  parameters  function

--outfile, -o
    (Not allowed if multiple input file names are supplied) File name for text or TETML output. The file name »-« can be used to designate standard output provided only a single input file has been supplied. Default: name of the input file, with .pdf or .PDF replaced with .txt (for text output) or .tetml (for TETML output).

--pageopt
    Additional option list which will be supplied to TET_open_page( ) if text output is generated, or to TET_process_page( ) if TETML output is generated. See Table 10.5, page 134 and Table 10.13, page 149, for a list of available options. For text output the option granularity will always be set to page.

--password, -p
    User or master password for encrypted documents. In some situations the shrug feature can be used to index protected documents without supplying a password (see Section 5.1, »Indexing protected PDF Documents«, page 49).

--searchpath¹, -s  ...
    Name of one or more directories where files (e.g. CMaps) will be searched. Default: installation-specific

--targetdir, -t
    Output directory for generated text, TETML, and image files. The directory must exist. Default: . (i.e. the current working directory)

--tetml, -m  glyph | word | wordplus | line | page
    (Cannot be combined with --text) Create TETML output according to the TET 3 schema containing text and image information. TETML output will always be created in UTF-8 encoding. The supplied parameter selects one of several variants (see Section 8.2, »Controlling TETML Details«, page 92, for more details):
        glyph     Glyph-based TETML with glyph geometry and font details
        word      Word-based TETML with word boxes
        wordplus  Word-based TETML with word boxes plus glyph geometry and font details
        line      Line-based TETML (text only)
        page      Page-based TETML (text only)

--tetopt
    Additional option list for TET_set_option( ) (see Table 10.14, page 151). The option outputformat will be ignored (use --format instead).

--text²
    (Cannot be combined with --tetml) Extract text from the document (enabled by default)

--unique, -u
    (Only relevant for --image) Extract each image only once, even if it is placed in the document more than once (e.g. repeated headers).

--verbose, -v  0 | 1 | 2 | 3
    Verbosity level (default: 1):
        0  no output at all
        1  emit only errors
        2  emit errors and file names
        3  detailed reporting

--version, -V
    Print the TET version number.

--xml³, -x  glyph | word | word2 | line | zone | page
    (Deprecated; new applications should use --tetml instead) Create glyph-, word-, line-, zone-, or page-based XML output according to the deprecated TET 2 DTD containing the text and metrics information. The word2 mode is similar to word mode, but includes details for all the characters in a word.

1. This option can be supplied more than once.

2. The option --image disables text extraction, but it can be combined with --text or --tetml.

3. Deprecated, use --tetml instead.
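
The rules in Table 2.1 can be illustrated with a small Python sketch that assembles an argument list for the command-line tool and computes the default output file name. This is only an illustration under stated assumptions, not part of TET: the executable name "tet", the "--option value" spelling with a separating space, the placement of options before the input file names, and the helper names default_outfile and build_tet_args are all hypothetical. Only the option names, the allowed parameter values, and the restrictions are taken from the table.

```python
from typing import List, Optional, Sequence

# Parameter values for --tetml as listed in Table 2.1.
TETML_MODES = ("glyph", "word", "wordplus", "line", "page")


def default_outfile(input_name: str, tetml: bool = False) -> str:
    """Default output name: .pdf or .PDF replaced with .txt or .tetml."""
    new_ext = ".tetml" if tetml else ".txt"
    if input_name.endswith((".pdf", ".PDF")):
        return input_name[:-4] + new_ext
    # Assumption: the table does not cover other suffixes; we simply append.
    return input_name + new_ext


def build_tet_args(
    inputs: Sequence[str],
    *,
    executable: str = "tet",  # assumed name of the command-line tool
    outfile: Optional[str] = None,
    tetml: Optional[str] = None,
    text: bool = False,
    password: Optional[str] = None,
    searchpaths: Sequence[str] = (),
    targetdir: Optional[str] = None,
    verbose: Optional[int] = None,
) -> List[str]:
    """Build an argument list, enforcing the restrictions from Table 2.1."""
    if not inputs:
        raise ValueError("at least one input file is required")
    if outfile is not None and len(inputs) > 1:
        raise ValueError("--outfile is not allowed with multiple input files")
    if tetml is not None and text:
        raise ValueError("--tetml cannot be combined with --text")
    if tetml is not None and tetml not in TETML_MODES:
        raise ValueError(f"--tetml expects one of {TETML_MODES}")
    if verbose is not None and verbose not in (0, 1, 2, 3):
        raise ValueError("--verbose expects 0, 1, 2, or 3")

    args = [executable]
    if tetml is not None:
        args += ["--tetml", tetml]
    elif text:
        args.append("--text")
    if outfile is not None:
        args += ["--outfile", outfile]  # "-" designates standard output
    if password is not None:
        args += ["--password", password]
    for path in searchpaths:
        # Footnote 1: --searchpath can be supplied more than once.
        args += ["--searchpath", path]
    if targetdir is not None:
        args += ["--targetdir", targetdir]  # the directory must already exist
    if verbose is not None:
        args += ["--verbose", str(verbose)]
    args += list(inputs)
    return args


if __name__ == "__main__":
    print(build_tet_args(["manual.pdf"], tetml="wordplus", targetdir="out"))
    print(default_outfile("manual.pdf", tetml=True))
```

The resulting list can be handed to subprocess.run as is, which avoids shell quoting problems with passwords or directory names that contain spaces. Checking the restrictions before launching the tool is a design choice of this sketch; the tool itself reports such conflicts according to the verbosity level.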

Chapter 2: TET Command-Line Tool
