17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


Indexing metadata fields. The TET connector for Lucene indexes the following metadata fields:

> path (tokenized field): the pathname of the document

> modified (DateField): the date of the last modification

> contents (Reader field): the full text contents of the document

> All predefined and custom PDF document info entries, e.g. Title, Subject, Author, etc. Document info entries can be queried with the pCOS interface which is integrated in TET (see Chapter 9, »The pCOS Interface«, page 105, for more details on pCOS), e.g.

    String objType = tet.pcos_get_string(tetHandle, "type:/Info/Subject");
    if (!objType.equals("null"))
    {
        doc.add(new Field("summary", tet.pcos_get_string(tetHandle,
            "/Info/Subject"), Field.Store.YES, Field.Index.ANALYZED));
    }

> font: the names of all fonts in the PDF document

You can customize metadata fields by modifying the set of indexed document info entries or by adding more information based on pCOS paths in PdfDocument.java.
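
As a rough illustration of such a customization, the following sketch maps a configurable set of document info keys to index field names and skips entries that are absent. It deliberately uses no TET or Lucene classes: the `pcos` lookup function is a hypothetical stand-in for `tet.pcos_get_string(tetHandle, path)`, a plain `Map` stands in for the Lucene document, and the class name `InfoFieldMapper` and the `keywords` field name are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Sketch: map selected document info entries to index field names.
 * Nothing here is TET or Lucene API; the lookup function is a hypothetical
 * stand-in for tet.pcos_get_string(tetHandle, path).
 */
class InfoFieldMapper {

    /**
     * Collects the info entries that exist in the document.
     * keyToField maps an info key (e.g. "Subject") to an index field name.
     */
    static Map<String, String> collectFields(Function<String, String> pcos,
                                             Map<String, String> keyToField) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : keyToField.entrySet()) {
            String path = "/Info/" + e.getKey();
            // As in the manual's example, query the object type first;
            // the string "null" means the entry is absent.
            String type = pcos.apply("type:" + path);
            if (!"null".equals(type)) {
                fields.put(e.getValue(), pcos.apply(path));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Fake pCOS data for a document that has only a Subject entry (assumption).
        Map<String, String> fake = new HashMap<>();
        fake.put("type:/Info/Subject", "string");
        fake.put("/Info/Subject", "Quarterly report");
        Function<String, String> pcos = p -> fake.getOrDefault(p, "null");

        // Info key -> index field name; extend this map to index more entries.
        Map<String, String> wanted = new LinkedHashMap<>();
        wanted.put("Subject", "summary");
        wanted.put("Keywords", "keywords");

        System.out.println(collectFields(pcos, wanted));
    }
}
```

In the real connector, each collected pair would presumably be added to the Lucene document in the same way as the `summary` field in the manual's example.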

PDF file attachments. The Lucene connector for TET recursively processes all PDF file attachments in a document, and feeds the text and metadata of each attachment to the Lucene search engine for indexing. This way search hits will be generated even if the searched text is not present in the main document but only in an attachment. Recursive attachment traversal is especially important for PDF packages and portfolios.
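
The recursive traversal can be pictured with the following sketch. It is not the connector's actual code: `PdfNode` is a hypothetical in-memory stand-in for an opened PDF with its extracted text and attachments, and the `collect` method simply records one entry per document where a real connector would hand the text and metadata to Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of recursive attachment traversal. PdfNode is a hypothetical
 * in-memory stand-in for an opened PDF; it is not a TET or Lucene class.
 */
class AttachmentWalker {

    /** A document with its extracted text and its PDF attachments. */
    static final class PdfNode {
        final String name;
        final String text;
        final List<PdfNode> attachments = new ArrayList<>();

        PdfNode(String name, String text) {
            this.name = name;
            this.text = text;
        }

        /** Adds an attachment and returns this node to allow chaining. */
        PdfNode attach(PdfNode child) {
            attachments.add(child);
            return this;
        }
    }

    /** Depth-first walk that emits one "path: text" entry per document. */
    static void collect(PdfNode doc, String parentPath, List<String> out) {
        String path = parentPath.isEmpty() ? doc.name : parentPath + "/" + doc.name;
        out.add(path + ": " + doc.text);   // a real connector would index here
        for (PdfNode child : doc.attachments) {
            collect(child, path, out);     // attachments may carry attachments
        }
    }

    public static void main(String[] args) {
        PdfNode portfolio = new PdfNode("portfolio.pdf", "cover sheet")
            .attach(new PdfNode("invoice.pdf", "total due")
                .attach(new PdfNode("terms.pdf", "payment terms")));
        List<String> entries = new ArrayList<>();
        collect(portfolio, "", entries);
        for (String entry : entries) {
            System.out.println(entry);
        }
    }
}
```

Because every nested document produces its own entry, a search for text that occurs only in a deeply nested attachment still yields a hit, and the recorded path shows which container it came from.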

4.2 TET Connector for the Lucene Search Engine 39
