17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual



2.2 Command-line Examples

The following examples demonstrate some useful combinations of TET command-line options. The samples are shown in two variations; the first uses the long format of all options, while the second uses the equivalent short option format.

Extract the text from a PDF document file.pdf in UTF-8 format and store it in file.txt:

tet file.pdf

Exclude the first and last page from text extraction:

tet --firstpage 2 --lastpage last-1 file.pdf

tet -f 2 -l last-1 file.pdf
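When such calls are issued from a script, the page range options can be assembled programmatically. The following is a minimal Python sketch and not part of TET itself: the helper name `tet_page_range_args` is hypothetical, and it uses only the `--firstpage` and `--lastpage` options shown above. Running the resulting list assumes a `tet` executable on the PATH.

```python
def tet_page_range_args(pdf, firstpage=None, lastpage=None):
    """Build an argument list for the tet command-line tool (hypothetical helper).

    Page values are converted to strings so that keywords such as
    "last-1" from the example above are passed through unchanged.
    """
    args = ["tet"]
    if firstpage is not None:
        args += ["--firstpage", str(firstpage)]
    if lastpage is not None:
        args += ["--lastpage", str(lastpage)]
    args.append(pdf)
    return args


# Skip the first and the last page, as in the example above.
cmd = tet_page_range_args("file.pdf", firstpage=2, lastpage="last-1")
print(" ".join(cmd))

# To execute the command (assumes tet is installed and on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids any shell quoting concerns, which becomes more relevant for option lists that contain braces or spaces.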

Supply a directory where the CJK CMaps are located (required for CJK text extraction):

tet --searchpath /usr/local/cmaps file.pdf

tet -s /usr/local/cmaps file.pdf

Extract the text from a PDF in UTF-16 format and store it in file.utf16:

tet --format utf16 --outfile file.utf16 file.pdf

tet --format utf16 -o file.utf16 file.pdf

Extract the text from all PDF files in a directory and store the generated *.txt files in another directory (which must already exist):

tet --targetdir out in/*.pdf

tet -t out in/*.pdf
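Two details of the batch example are easy to get wrong in scripts: the target directory must exist before the call, and the `in/*.pdf` wildcard is normally expanded by the shell, not by the tool. The Python sketch below handles both itself. The helper name `tet_batch_args` is hypothetical, and creating the directory automatically is a design choice of this sketch, not TET behavior.

```python
import glob
import os


def tet_batch_args(input_dir, target_dir):
    """Build a tet call for all *.pdf files in input_dir (hypothetical helper)."""
    # The target directory must already exist, so create it if necessary.
    os.makedirs(target_dir, exist_ok=True)
    # Expand the wildcard here instead of relying on the shell.
    pdfs = sorted(glob.glob(os.path.join(input_dir, "*.pdf")))
    if not pdfs:
        raise FileNotFoundError(f"no PDF files found in {input_dir}")
    return ["tet", "--targetdir", target_dir] + pdfs


# Example usage (assumes a directory "in" that contains PDF files):
# import subprocess
# subprocess.run(tet_batch_args("in", "out"), check=True)
```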

Restrict text extraction to a particular area on the page; this can be achieved by supplying a suitable list of page options:

tet --pageopt "includebox={{0 0 200 200}}" file.pdf
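The nested braces in the `includebox` value are easy to mistype when the coordinates come from a program. The sketch below formats one or more rectangles into the same shape as the example above. The helper name `includebox_option` is hypothetical, and two points are assumptions: that each rectangle consists of four numbers (lower-left and upper-right corner), and that several rectangles can be listed inside the outer braces separated by spaces.

```python
def includebox_option(*rects):
    """Format rectangles as an includebox page option (hypothetical helper).

    Each rectangle is a sequence of four numbers; the output mirrors the
    "includebox={{0 0 200 200}}" form used in the example above.
    """
    boxes = []
    for rect in rects:
        if len(rect) != 4:
            raise ValueError(f"expected 4 values per rectangle, got {len(rect)}")
        boxes.append("{" + " ".join(str(v) for v in rect) + "}")
    return "includebox={" + " ".join(boxes) + "}"


opt = includebox_option((0, 0, 200, 200))
# As a list element the option needs no additional shell quoting.
args = ["tet", "--pageopt", opt, "file.pdf"]
print(opt)
```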

Extract images from file.pdf and store them in file*.tif/file*.jpg in the directory out:

tet --targetdir out --image file.pdf

tet -t out -i file.pdf

Extract images from file.pdf without image merging; this can be achieved by supplying a suitable list of page options which are relevant for image processing:

tet --targetdir out --image --pageopt "imageanalysis={merge={disable}}" file.pdf

tet -t out -i --pageopt "imageanalysis={merge={disable}}" file.pdf

Generate TETML output in word mode for PDF document file.pdf and store it in file.tetml:

tet --tetml word file.pdf

tet -m word file.pdf

Generate TETML output without any Options elements; this can be achieved by supplying a suitable list of document options:

tet --docopt "tetml={elements={options=false}}" --tetml word file.pdf
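Nested option lists such as the one above can be generated from a data structure instead of being written by hand. The following Python sketch converts a nested dictionary into the `key={...}` notation. The function name `optlist` is hypothetical, and the sketch covers only what this example needs: nested suboptions, boolean values, and plain values, with several options at one level separated by spaces (an assumption about the option list syntax).

```python
def optlist(options):
    """Render a nested dict as an option list string (hypothetical helper)."""
    parts = []
    for key, value in options.items():
        if isinstance(value, dict):
            # Suboptions are wrapped in one pair of braces.
            parts.append(f"{key}={{{optlist(value)}}}")
        elif isinstance(value, bool):
            # Booleans are written in lowercase, as in options=false.
            parts.append(f"{key}={'true' if value else 'false'}")
        else:
            parts.append(f"{key}={value}")
    return " ".join(parts)


docopt = optlist({"tetml": {"elements": {"options": False}}})
cmd = ["tet", "--docopt", docopt, "--tetml", "word", "file.pdf"]
print(docopt)
```

Keyword-only values such as `merge={disable}` in the image example are not handled by this sketch and would need an additional case.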

Chapter 2: TET Command-Line Tool
