17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


4.2 TET Connector for the Lucene Search Engine

Lucene is an open-source search engine. It is primarily a Java project, but a C version is also available, and a version for .NET is under development. For more information on Lucene, see lucene.apache.org.

Note: Protected documents can be indexed with the shrug option under certain conditions (see Chapter 5.1, »Indexing protected PDF Documents«, page 49, for details). The Connector files are prepared for this, but you must enable the option manually.

Requirements and installation. The TET distribution contains a TET connector which can be used to enable PDF indexing in Lucene Java. We describe this connector for Lucene Java in more detail below, assuming the following requirements are met:

> JDK 1.4 or newer

> A working installation of the Ant build tool

> The Lucene distribution with the Lucene core JAR file. The Ant build file distributed with TET expects the file lucene-core-2.4.0.jar, which is part of the Lucene 2.4.0 distribution.

> An installed TET distribution package for Unix, Linux, Mac, or Windows.

To set up the TET connector for Lucene, perform the following steps at a command prompt:

> cd to the directory /connectors/lucene.

> Copy the file lucene-core-2.4.0.jar to this directory.

> Optionally customize the settings by adding global, document-, and page-related TET options in TetReader.java. For example, the global option list can be used to supply a suitable search path for resources (e.g. if the CJK CMaps are installed in a directory different from the default installation). The PdfDocument.java module demonstrates how to process PDF documents which are stored either in a disk file or in a memory buffer (e.g. supplied by a Web crawler).

> Run the command ant index. This compiles the source code and runs the indexer on the PDF files contained in the directory /bind/data.

> Run the command ant search to start the command-line search client, where you can enter queries in the Lucene query language.
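The optional customization step can be illustrated with a minimal sketch of how the three kinds of option list strings might be assembled before they are placed in TetReader.java. The class name TetOptionLists, the helper searchPath, and the directory /opt/cmaps are hypothetical and not part of the TET distribution. The brace syntax for searchpath and the page option granularity=page reflect our understanding of TET option lists and should be checked against the option list reference in this manual.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the TET distribution: it only builds
// option list strings and does not call the TET library itself.
class TetOptionLists {

    // Builds a global option list of the assumed form
    // searchpath={{dir1} {dir2}}, wrapping every directory in braces
    // so that directory names containing spaces stay intact.
    static String searchPath(List<String> dirs) {
        StringBuilder sb = new StringBuilder("searchpath={");
        for (int i = 0; i < dirs.size(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append('{').append(dirs.get(i)).append('}');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Global options: where to look for resources such as CJK CMaps.
        // /opt/cmaps is an illustrative directory name only.
        String globalOptions = searchPath(Arrays.asList("/opt/cmaps", "../data"));

        // Document options: left empty here. The shrug option would go in
        // this list, subject to the conditions described in Chapter 5.1.
        String documentOptions = "";

        // Page options: assumed setting that returns text one page at a time.
        String pageOptions = "granularity=page";

        System.out.println("global:   " + globalOptions);
        System.out.println("document: " + documentOptions);
        System.out.println("page:     " + pageOptions);
    }
}
```

Keeping the three strings separate mirrors the distinction between global, document-, and page-related options, so a change to the resource search path does not affect how individual documents or pages are processed.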

Testing TET and Lucene with the command-line search client. The following sample session demonstrates the commands and output for indexing with TET and Lucene, and testing the generated index with the Lucene command-line query tool. The process is started by running the command ant index:

devserver (1)$ ant index
Buildfile: build.xml
...
index:
[java] adding ../data/Whitepaper-XMP-metadata-in-PDFlib-products.pdf
[java] adding ../data/Whitepaper-PDFA-with-PDFlib-products.pdf
[java] adding ../data/FontReporter.pdf
[java] adding ../data/TET-PDF-IFilter-datasheet.pdf
[java] adding ../data/PDFlib-datasheet.pdf
[java] 1255 total milliseconds
BUILD SUCCESSFUL
Total time: 2 seconds
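When the indexer runs unattended, it can be useful to confirm that every expected PDF was added. The following sketch parses captured output of ant index and collects the file names and the reported total time. The class name IndexLogSummary is hypothetical, and the two line formats it recognizes are inferred only from the sample session above, so other versions of the connector may print something different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the TET distribution. The line formats
// are assumptions taken from the sample session.
class IndexLogSummary {

    // Matches lines such as "[java] adding ../data/FontReporter.pdf".
    private static final Pattern ADDING =
            Pattern.compile("^\\s*\\[java\\] adding (.+\\.pdf)\\s*$");

    // Matches lines such as "[java] 1255 total milliseconds".
    private static final Pattern TOTAL =
            Pattern.compile("^\\s*\\[java\\] (\\d+) total milliseconds\\s*$");

    final List<String> files = new ArrayList<>();
    long totalMillis = -1; // stays -1 if no total line was found

    static IndexLogSummary parse(String log) {
        IndexLogSummary summary = new IndexLogSummary();
        for (String line : log.split("\\R")) {
            Matcher adding = ADDING.matcher(line);
            if (adding.matches()) {
                summary.files.add(adding.group(1));
                continue;
            }
            Matcher total = TOTAL.matcher(line);
            if (total.matches()) {
                summary.totalMillis = Long.parseLong(total.group(1));
            }
        }
        return summary;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
                "index:",
                "[java] adding ../data/FontReporter.pdf",
                "[java] adding ../data/PDFlib-datasheet.pdf",
                "[java] 1255 total milliseconds",
                "BUILD SUCCESSFUL");
        IndexLogSummary summary = parse(log);
        // Prints "2 files in 1255 ms" for the sample log above.
        System.out.println(summary.files.size() + " files in "
                + summary.totalMillis + " ms");
    }
}
```

Lines that match neither pattern, such as Buildfile: build.xml or BUILD SUCCESSFUL, are ignored, so the parser tolerates additional Ant output around the indexer messages.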

