17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


To verify that the TET connector for Tika is actually used for the MIME type application/pdf, execute the following command in the directory /connectors/Tika on Unix and OS X systems:

java -Djava.library.path=/bind/java -classpath /bind/java/TET.jar:tika-app-1.0.jar:tet-tika.jar org.apache.tika.cli.TikaCLI --list-parser-details

On Windows:

java -Djava.library.path=/bind/java -classpath /bind/java/TET.jar;tika-app-1.0.jar;tet-tika.jar org.apache.tika.cli.TikaCLI --list-parser-details

The following fragment should appear in the generated output:

com.pdflib.tet.tika.TETPDFParser

application/pdf
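The check for this fragment can be scripted. The following is a minimal sketch for a POSIX shell, not part of the TET distribution: the function name has_tet_parser is hypothetical, and the sketch assumes that the listing prints each parser class name on its own line, followed by the MIME types that parser handles.

```shell
# Hypothetical helper: reads the output of "TikaCLI --list-parser-details" on stdin.
# Succeeds only if application/pdf is listed under com.pdflib.tet.tika.TETPDFParser.
# Assumption: every parser entry starts with a line containing a class name ending in "Parser".
has_tet_parser() {
    awk '
        # Start of the TET parser entry
        /com\.pdflib\.tet\.tika\.TETPDFParser/ { in_tet = 1; next }
        # Any other parser class name ends the TET entry
        in_tet && /Parser/ { in_tet = 0 }
        # MIME type listed while still inside the TET entry
        in_tet && /application\/pdf/ { found = 1 }
        END { exit !found }
    '
}

# Usage on Unix and OS X, with the command given above:
#   java -Djava.library.path=/bind/java \
#        -classpath /bind/java/TET.jar:tika-app-1.0.jar:tet-tika.jar \
#        org.apache.tika.cli.TikaCLI --list-parser-details | has_tet_parser \
#     && echo "TET connector is used for application/pdf"
```

The helper only inspects text, so it reports a failure both when the TET parser is missing from the listing and when application/pdf is claimed by a different parser.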

> To run the Tika GUI application with the TET connector, execute the following command in the directory /connectors/Tika:

On Unix and OS X systems:

java -Djava.library.path=/bind/java -classpath /bind/java/TET.jar:tika-app-1.0.jar:tet-tika.jar org.apache.tika.cli.TikaCLI

On Windows:

java -Djava.library.path=\bind\java -classpath \bind\java\TET.jar;tika-app-1.0.jar;tet-tika.jar org.apache.tika.cli.TikaCLI
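Apart from the path notation, the Unix and Windows commands differ only in the classpath separator (":" versus ";"). A minimal sketch of a helper that assembles the classpath for either platform is shown below. The function name tet_tika_classpath and its parameters are hypothetical; the jar names are the ones used in the commands above.

```shell
# Hypothetical helper: prints the classpath for running Tika with the TET connector.
#   $1 = directory containing TET.jar
#   $2 = classpath separator, ":" on Unix and OS X (default), ";" on Windows
tet_tika_classpath() {
    bind_dir=$1
    sep=${2:-:}
    printf '%s/TET.jar%stika-app-1.0.jar%stet-tika.jar\n' "$bind_dir" "$sep" "$sep"
}

# Usage on Unix and OS X:
#   java -Djava.library.path=/bind/java \
#        -classpath "$(tet_tika_classpath /bind/java)" org.apache.tika.cli.TikaCLI
```

Quoting the command substitution matters when the separator is ";", since an unquoted semicolon would otherwise end the shell command.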

Customizing the TET connector for Tika. You can customize the Tika connector as follows in the TETPDFParser.java source module:

> Add document options to the DOC_OPT_LIST variable, e.g. the shrug option for processing protected documents;

> Add page options to the PAGE_OPT_LIST variable;

> Customize the searchpath for resources such as CJK CMaps in the SEARCHPATH variable. Alternatively, the tet.searchpath property can be supplied when processing PDF documents.
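The text does not say how the tet.searchpath property is supplied. The sketch below assumes it is read as a Java system property, in which case it could be passed with a -D option on the java command line. This is an assumption and should be checked against the TETPDFParser.java source; the function name tet_searchpath_opt and the example directory are hypothetical.

```shell
# Hypothetical helper: prints a JVM option carrying the resource search path.
# ASSUMPTION: the connector reads tet.searchpath as a Java system property.
#   $1 = directory with resources such as CJK CMaps
tet_searchpath_opt() {
    printf '%s%s\n' '-Dtet.searchpath=' "$1"
}

# Usage on Unix and OS X (the CMap directory is a made-up example):
#   java "$(tet_searchpath_opt /path/to/cmap)" -Djava.library.path=/bind/java \
#        -classpath /bind/java/TET.jar:tika-app-1.0.jar:tet-tika.jar \
#        org.apache.tika.cli.TikaCLI
```

If the assumption holds, this avoids editing and recompiling the source module just to change the resource location.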

4.6 TET Connector for the Apache TIKA Toolkit
