17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual



add here would be /bind/lucene/index if Ant was run without overriding the property for the location of the Lucene index.

Indexing metadata fields. The TET connector for Lucene indexes the following metadata fields:

> path (tokenized field): the pathname of the document
> modified (DateField): the date of the last modification
> contents (Reader field): the full text contents of the document
> All predefined and custom PDF document info entries, e.g. Title, Subject, Author. Document info entries can be queried with the pCOS interface which is integrated in TET (see the pCOS Path Reference for more details on pCOS), e.g.

    // The "type:" prefix yields the pCOS type name, which is "null" if the entry is absent
    String objType = tet.pcos_get_string(tetHandle, "type:/Info/Subject");
    if (!objType.equals("null"))
    {
        // Index the Subject entry as the "summary" field only if it exists
        doc.add(new Field("summary",
            tet.pcos_get_string(tetHandle, "/Info/Subject"),
            Field.Store.YES, Field.Index.ANALYZED));
    }

> font: the names of all fonts in the PDF document

You can customize metadata fields by modifying the set of indexed document info entries or by adding more information based on pCOS paths in PdfDocument.java.
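One way to keep such customizations manageable is to drive the lookup from a table that maps document info keys to index field names. The following is only a sketch under stated assumptions, not the code in PdfDocument.java: the `PcosSource` interface is a hypothetical stand-in for calls such as `tet.pcos_get_string(tetHandle, path)`, `fromMap` is a stub used in place of a real PDF, and the class and method names are illustrative. It reuses the "type:" check shown above to skip entries that are absent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class InfoFieldSketch {
    /** Hypothetical stand-in for tet.pcos_get_string(tetHandle, path). */
    interface PcosSource {
        String getString(String path);
    }

    /**
     * Collects index field values for the requested document info entries.
     * keyToField maps an info key (e.g. "Subject") to a field name (e.g. "summary").
     * Entries whose pCOS type is "null" (not present in the document) are skipped.
     */
    static Map<String, String> collectInfoFields(PcosSource pcos,
                                                 Map<String, String> keyToField) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : keyToField.entrySet()) {
            String path = "/Info/" + e.getKey();
            if (!"null".equals(pcos.getString("type:" + path))) {
                fields.put(e.getValue(), pcos.getString(path));
            }
        }
        return fields;
    }

    /** Stub source backed by a map of pCOS paths to values (assumption, for illustration). */
    static PcosSource fromMap(Map<String, String> entries) {
        return path -> path.startsWith("type:")
                ? (entries.containsKey(path.substring("type:".length())) ? "string" : "null")
                : entries.get(path);
    }

    public static void main(String[] args) {
        Map<String, String> fakeInfo = new LinkedHashMap<>();
        fakeInfo.put("/Info/Subject", "Quarterly report");
        fakeInfo.put("/Info/Author", "A. Writer");

        Map<String, String> keyToField = new LinkedHashMap<>();
        keyToField.put("Subject", "summary");
        keyToField.put("Author", "author");
        keyToField.put("Keywords", "keywords"); // absent in fakeInfo, so it is skipped

        System.out.println(collectInfoFields(fromMap(fakeInfo), keyToField));
    }
}
```

In a real connector each collected pair would be added to the Lucene document in the same way as the "summary" field above; adding or removing an indexed entry then only means editing the mapping table.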

PDF file attachments. The Lucene connector for TET recursively processes all PDF file attachments in a document, and feeds the text and metadata of each attachment to the Lucene search engine for indexing. This way search hits will be generated even if the searched text is not present in the main document but only in an attachment. Recursive attachment traversal is especially important for PDF packages and portfolios.
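The recursive traversal described above can be illustrated with a small depth-first walk. This sketch does not use the TET API: `PdfNode` is a hypothetical in-memory model of a document with nested PDF attachments, and `IndexEntry` is an illustrative holder for what would be handed to the indexer. It only shows how text at any nesting depth ends up as its own indexable unit.

```java
import java.util.ArrayList;
import java.util.List;

class AttachmentWalkSketch {
    /** Hypothetical model of a PDF: its text plus any nested PDF attachments. */
    static final class PdfNode {
        final String name;
        final String text;
        final List<PdfNode> attachments = new ArrayList<>();

        PdfNode(String name, String text) {
            this.name = name;
            this.text = text;
        }

        PdfNode attach(PdfNode child) {
            attachments.add(child);
            return this;
        }
    }

    /** One unit for the indexer: where the text came from, and the text itself. */
    static final class IndexEntry {
        final String path;
        final String text;

        IndexEntry(String path, String text) {
            this.path = path;
            this.text = text;
        }
    }

    /** Returns the main document followed by all attachments, depth first. */
    static List<IndexEntry> collect(PdfNode root) {
        List<IndexEntry> out = new ArrayList<>();
        walk(root, root.name, out);
        return out;
    }

    private static void walk(PdfNode node, String path, List<IndexEntry> out) {
        out.add(new IndexEntry(path, node.text));
        for (PdfNode child : node.attachments) {
            // Recurse so that attachments of attachments are indexed as well
            walk(child, path + " > " + child.name, out);
        }
    }

    public static void main(String[] args) {
        PdfNode portfolio = new PdfNode("portfolio.pdf", "cover sheet")
                .attach(new PdfNode("report.pdf", "annual figures")
                        .attach(new PdfNode("appendix.pdf", "raw data")));
        for (IndexEntry e : collect(portfolio)) {
            System.out.println(e.path + ": " + e.text);
        }
    }
}
```

Recording the attachment path alongside the text is a design choice of this sketch; it lets a search hit report which nested file contained the match.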

4.2 TET Connector for the Lucene Search Engine
