17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


2.3 Command-Line Examples

The following examples demonstrate some useful combinations of TET command-line options. Where the options have a short form, the sample is shown in two variations: the first uses the long format of all options, while the second uses the equivalent short option format.

2.3.1 Extracting Text

Extract the text from a PDF document file.pdf in UTF-8 format and store it in file.txt:

tet file.pdf

Exclude the first and last page from text extraction:

tet --firstpage 2 --lastpage last-1 file.pdf
tet -f 2 -l last-1 file.pdf

Supply a directory where the CJK CMaps are located (required for CJK text extraction):

tet --searchpath /usr/local/cmaps file.pdf
tet -s /usr/local/cmaps file.pdf

Extract the text from a PDF in UTF-16 format and store it in file.utf16:

tet --format utf16 --outfile file.utf16 file.pdf
tet --format utf16 -o file.utf16 file.pdf

Extract the text from all PDF files in a directory and store the generated *.txt files in another directory (which must already exist):

tet --targetdir out in/*.pdf
tet -t out in/*.pdf
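Because the target directory must already exist, a wrapper script can create it before calling tet. The following is a minimal sketch, not part of the TET distribution: it reuses the directory names in and out from the example above and assumes tet is found via PATH. If tet is not installed, it only prints a notice.

```shell
#!/bin/sh
# Sketch: create the target directory first, since tet expects it to exist.
# "in" and "out" are the example directory names used above.
mkdir -p out

# Run the extraction only if a tet executable is available in PATH.
if command -v tet >/dev/null 2>&1; then
    tet --targetdir out in/*.pdf
else
    echo "tet not found in PATH; skipping extraction" >&2
fi
```

mkdir -p succeeds whether or not the directory is already present, so the script can be run repeatedly.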

Extract the text from all PDF files from two directories and store the generated *.txt files in the same directory as the corresponding input document:

tet --samedir dir1/*.pdf dir2/*.pdf

Restrict text extraction to a particular area on the page; this can be achieved by supplying a suitable list of page options:

tet --pageopt "includebox={{0 0 200 200}}" file.pdf

Use a response file which contains various command-line options and process all PDF documents in the current directory (the file options contains command-line options):

tet @options *.pdf
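As an illustration, a response file could be created and used as follows. This is a sketch under assumptions: the contents of the file options are hypothetical, and it is assumed that options in a response file may be separated by whitespace, including line breaks. The options themselves are taken from the text extraction examples above.

```shell
#!/bin/sh
# Sketch: write a hypothetical response file named "options".
# Assumption: options in the file are separated by whitespace or newlines.
printf '%s\n' '--firstpage 2' '--lastpage last-1' > options

# Process all PDF documents in the current directory with these options,
# provided a tet executable is available in PATH.
if command -v tet >/dev/null 2>&1; then
    tet @options *.pdf
else
    echo "tet not found in PATH; skipping extraction" >&2
fi
```

Keeping frequently used options in a response file avoids retyping them and keeps long option lists out of the shell history.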

2.3.2 Extracting Images

Extract images from file.pdf in a page-oriented manner and store them in file*.tif/file*.jpg in the directory out:

tet --targetdir out --image file.pdf
tet -t out -i file.pdf
