PDFlib Text Extraction Toolkit (TET) Manual
Table 2.1 TET command-line options
option    parameters    function
--imageloop    page | resource
    Specifies the kind of enumeration loop for extracting images with the --image option (default: resource if --tetml is specified, otherwise page):
    page      Enumerate all images on the selected pages. Images that are placed more than once will be extracted more than once. Extracted images will be named according to the following pattern:
              _p_.[tif|jpg|jpx|jbig2]
    resource  Enumerate all (merged) image resources in the document. Each image resource will be extracted only once, regardless of the number of occurrences in the document. The --firstpage and --lastpage options will be ignored for extracting images. Extracted images will be named according to the following pattern:
              _I.[tif|jpg|jpx|jbig2]
              Note: I is also used in the TETML attribute Image/@id.
--imageopt
    Additional option list for TET_write_image_file() (see Table 10.17, page 183)
--lastpage, -l    | last
    The number of the page where content extraction will finish. The keyword last specifies the last page, last-1 the page before the last page, etc. Default: last
--outfile, -o
    (Not allowed if multiple input file names are supplied) File name for text or TETML output. The file name »-« can be used to designate standard output, provided only a single input file has been supplied. Default: name of the input file, with .pdf or .PDF replaced with .txt (for text output) or .tetml (for TETML output).
--pagecount
    Print the number of pages in the document, i.e. the value of the pCOS path length:pages, to stdout or the file provided with --outfile.
--pageopt
    Additional option list which will be supplied to TET_open_page() if text output is generated, or to TET_process_page() if TETML output is generated. See Table 10.10, page 169 and Table 10.18, page 185, for a list of available options. For text output the option granularity will always be set to page.
--password, -p
    User, master or attachment password for encrypted documents. In some situations the shrug feature can be used to index protected documents without supplying a password (see Section 5.1, »Extracting Content from protected PDF«, page 59).
--samedir
    Create output files in the same directory as the input file(s).
--searchpath 1, -s    ...
    Name of one or more directories where files (e.g. CMaps) will be searched. Default: installation-specific
--targetdir, -t
    Output directory for generated text, TETML, and image files. The directory must exist. This option is ignored if --samedir is specified. Default: . (i.e. the current working directory)
--tetml, -m    glyph | word | wordplus | line | page
    (Cannot be combined with --text) Create TETML output according to the TET 3 schema containing text and image information. TETML will always be created in UTF-8. The supplied parameter selects one of several variants (see Section 9.2, »Controlling TETML Details«, page 127):
    glyph     Glyph-based TETML with glyph geometry and font details
    word      Word-based TETML with word boxes
    wordplus  Word-based TETML with word boxes plus all glyph details
    line      Line-based TETML (text only)
    page      Page-based TETML (text only)
--tetopt
    Additional option list for TET_set_option() (see Table 10.2, page 148). The option outputformat will be ignored (use --format instead).
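
The rules stated in the table for --lastpage, for the default --outfile name, and for the --tetml variants can be sketched in a few lines of Python. This is an illustrative sketch only and not part of TET. The helper names resolve_page, default_outfile and build_tet_command are hypothetical. The executable name tet, the ordering of options before the input file, the acceptance of a plain page number for --lastpage, the error raised for out-of-range pages, and the handling of input names that do not end in .pdf or .PDF are assumptions which the table does not confirm.

```python
import os
import re

# The five --tetml variants listed in the table.
TETML_VARIANTS = ("glyph", "word", "wordplus", "line", "page")


def resolve_page(spec, page_count):
    """Resolve a --lastpage value ('last', 'last-N' or a number) to a page number."""
    match = re.fullmatch(r"last(?:-(\d+))?", spec)
    if match:
        # 'last' is the last page, 'last-1' the page before it, etc.
        page = page_count - int(match.group(1) or 0)
    else:
        page = int(spec)  # assumption: a plain page number is accepted
    if not 1 <= page <= page_count:
        # assumption: the table does not say how TET treats out-of-range pages
        raise ValueError(f"page {page} is outside 1..{page_count}")
    return page


def default_outfile(input_name, tetml=False):
    """Default output name: .pdf or .PDF replaced with .txt or .tetml."""
    name = os.path.basename(input_name)
    ext = ".tetml" if tetml else ".txt"
    if name.endswith((".pdf", ".PDF")):
        return name[:-4] + ext
    # assumption: the table does not cover other suffixes; append the extension
    return name + ext


def build_tet_command(pdf, tetml=None, lastpage=None, targetdir=None, password=None):
    """Build an argument list using only options from Table 2.1."""
    cmd = ["tet"]  # assumption: name of the command-line executable
    if tetml is not None:
        if tetml not in TETML_VARIANTS:
            raise ValueError(f"unknown TETML variant: {tetml}")
        cmd += ["--tetml", tetml]
    if lastpage is not None:
        cmd += ["--lastpage", lastpage]
    if targetdir is not None:
        cmd += ["--targetdir", targetdir]  # the directory must already exist
    if password is not None:
        cmd += ["--password", password]
    cmd.append(pdf)  # assumption: options precede the input file name
    return cmd
```

For a ten-page document, resolve_page("last-1", 10) yields 9, and default_outfile("report.PDF", tetml=True) yields report.tetml. The list returned by build_tet_command() could be handed to subprocess.run() once the executable name and argument order have been checked against the installed tool.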
Chapter 2: TET Command-Line Tool