17.05.2014

PDFlib Text Extraction Toolkit (TET) Manual


Table 2.1 TET command-line options

option    parameters    function

--imageloop    page | resource
    Specifies the kind of enumeration loop for extracting images with the --image option (default: resource if --tetml is specified, otherwise page):

    page
        Enumerate all images on the selected pages. Images which are placed more than once will be extracted more than once. Extracted images will be named according to the following pattern:
        _p_.[tif|jpg|jpx|jbig2]

    resource
        Enumerate all (merged) image resources in the document. Each image resource will be extracted only once, regardless of the number of occurrences in the document. The --firstpage and --lastpage options will be ignored for extracting images. Extracted images will be named according to the following pattern:
        _I.[tif|jpg|jpx|jbig2]
        Note: I is also used in the TETML attribute Image/@id.

--imageopt
    Additional option list for TET_write_image_file() (see Table 10.17, page 183)

--lastpage, -l    | last
    The number of the page where content extraction will finish. The keyword last specifies the last page, last-1 the page before the last page, etc. Default: last

--outfile, -o
    (Not allowed if multiple input file names are supplied) File name for text or TETML output. The file name »-« can be used to designate standard output provided only a single input file has been supplied. Default: name of the input file, with .pdf or .PDF replaced with .txt (for text output) or .tetml (for TETML output).

--pagecount
    Print the number of pages in the document, i.e. the value of the pCOS path length:pages, to stdout or the file provided with --outfile.

--pageopt
    Additional option list which will be supplied to TET_open_page() if text output is generated, or to TET_process_page() if TETML output is generated. See Table 10.10, page 169 and Table 10.18, page 185, for a list of available options. For text output the option granularity will always be set to page.

--password, -p
    User, master or attachment password for encrypted documents. In some situations the shrug feature can be used to index protected documents without supplying a password (see Section 5.1, »Extracting Content from protected PDF«, page 59).

--samedir
    Create output files in the same directory as the input file(s).

--searchpath 1, -s    ...
    Name of one or more directories where files (e.g. CMaps) will be searched. Default: installation-specific

--targetdir, -t
    Output directory for generated text, TETML, and image files. The directory must exist. This option is ignored if --samedir is specified. Default: . (i.e. the current working directory)

--tetml, -m    glyph | word | wordplus | line | page
    (Cannot be combined with --text) Create TETML output according to the TET 3 schema containing text and image information. TETML will always be created in UTF-8. The supplied parameter selects one of several variants (see Section 9.2, »Controlling TETML Details«, page 127):

    glyph       Glyph-based TETML with glyph geometry and font details
    word        Word-based TETML with word boxes
    wordplus    Word-based TETML with word boxes plus all glyph details
    line        Line-based TETML (text only)
    page        Page-based TETML (text only)

--tetopt
    Additional option list for TET_set_option() (see Table 10.2, page 148). The option outputformat will be ignored (use --format instead).
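
The rules stated in the table for --lastpage (the keywords last, last-1, and so on) and for the default output file name can be illustrated with a small Python sketch. This is not TET code: the function names resolve_lastpage and default_outfile are hypothetical, the range check is an assumption, and the handling of input names without a .pdf or .PDF suffix is an assumption because the table does not cover that case. Placement of the file via --targetdir or --samedir is not modeled.

```python
import re


def resolve_lastpage(spec: str, page_count: int) -> int:
    """Map a --lastpage value such as "7", "last" or "last-2" to a page number.

    Hypothetical helper; it only mirrors the wording of the table.
    """
    spec = spec.strip()
    if spec.isdigit():
        page = int(spec)
    else:
        match = re.fullmatch(r"last(?:-(\d+))?", spec)
        if match is None:
            raise ValueError(f"unsupported page specification: {spec!r}")
        # "last" means offset 0, "last-1" means one page before the last, etc.
        page = page_count - int(match.group(1) or 0)
    # Assumption of this sketch: pages outside 1..page_count are rejected.
    if not 1 <= page <= page_count:
        raise ValueError(f"page {page} is outside 1..{page_count}")
    return page


def default_outfile(input_name: str, tetml: bool = False) -> str:
    """Derive the default output name: .pdf/.PDF becomes .txt or .tetml."""
    suffix = ".tetml" if tetml else ".txt"
    if input_name.endswith((".pdf", ".PDF")):
        return input_name[:-4] + suffix
    # Assumption: the table does not say what happens without a PDF suffix,
    # so this sketch simply appends the new suffix.
    return input_name + suffix


if __name__ == "__main__":
    print(resolve_lastpage("last-1", 10))
    print(default_outfile("report.pdf", tetml=True))
```

For a 10-page document, resolve_lastpage("last-1", 10) yields page 9, and default_outfile("report.pdf", tetml=True) yields report.tetml.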

18 Chapter 2: TET Command-Line Tool
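
As a rough illustration of how the options in Table 2.1 combine, the following Python sketch assembles an argument list for the command-line tool without running it. Several points are assumptions rather than facts from the table: the executable name tet, the convention that options precede the input file names, and the helper name build_tet_command. The sketch enforces two rules that the table does state: --outfile is not allowed with multiple input files, and --targetdir is ignored when --samedir is given.

```python
from typing import List, Optional, Sequence

# TETML variants listed for --tetml in Table 2.1.
TETML_VARIANTS = ("glyph", "word", "wordplus", "line", "page")


def build_tet_command(
    inputs: Sequence[str],
    tetml: Optional[str] = None,
    lastpage: Optional[str] = None,
    outfile: Optional[str] = None,
    targetdir: Optional[str] = None,
    samedir: bool = False,
    executable: str = "tet",  # assumption: name of the command-line tool
) -> List[str]:
    """Return an argv list; hypothetical helper, nothing is executed."""
    if not inputs:
        raise ValueError("at least one input file is required")
    if outfile is not None and len(inputs) > 1:
        # Table 2.1: --outfile is not allowed with multiple input files.
        raise ValueError("--outfile cannot be used with multiple input files")
    if tetml is not None and tetml not in TETML_VARIANTS:
        raise ValueError(f"unknown TETML variant: {tetml!r}")

    argv = [executable]
    if tetml is not None:
        argv += ["--tetml", tetml]
    if lastpage is not None:
        argv += ["--lastpage", lastpage]
    if outfile is not None:
        argv += ["--outfile", outfile]
    if samedir:
        # Table 2.1: --targetdir is ignored if --samedir is specified,
        # so the sketch leaves it out in that case.
        argv.append("--samedir")
    elif targetdir is not None:
        argv += ["--targetdir", targetdir]
    argv += list(inputs)
    return argv


if __name__ == "__main__":
    print(build_tet_command(["a.pdf"], tetml="word", lastpage="last-1",
                            targetdir="out"))
```

The returned list could be handed to subprocess.run on a system where the tool is installed; building a list instead of a single string avoids shell quoting problems with file names that contain spaces.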
