PDFlib Text Extraction Toolkit (TET) Manual
add here would be /bind/lucene/index if Ant was run without overriding<br />
the property for the location of the Lucene index.<br />
Indexing metadata fields. The <strong>TET</strong> connector for Lucene indexes the following metadata<br />
fields:<br />
> path (tokenized field): the pathname of the document<br />
> modified (DateField): the date of the last modification<br />
> contents (Reader field): the full text contents of the document<br />
> All predefined and custom PDF document info entries, e.g. Title, Subject, and Author.<br />
Document info entries can be queried with the pCOS interface, which is integrated<br />
in <strong>TET</strong> (see the pCOS Path Reference for more details on pCOS), for example:<br />
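// pCOS reports the type "null" if the info entry does not exist<br />
String objType = tet.pcos_get_string(tetHandle, "type:/Info/Subject");<br />
if (!objType.equals("null"))<br />
{<br />
// Index the Subject entry as a stored, analyzed "summary" field<br />
doc.add(new Field("summary", tet.pcos_get_string(tetHandle,<br />
"/Info/Subject"), Field.Store.YES, Field.Index.ANALYZED));<br />
}<br />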
> font: the names of all fonts in the PDF document<br />
You can customize metadata fields by modifying the set of indexed document info entries<br />
or by adding more information based on pCOS paths in PdfDocument.java.<br />
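One way to keep the set of indexed info entries easy to change is to hold it in a table that maps index field names to pCOS paths. The following is only a minimal sketch of that idea and is not taken from PdfDocument.java: the class InfoFieldMapper, the field names "title" and "author", and the Function parameter (which stands in for a call such as tet.pcos_get_string(tetHandle, path)) are assumptions made for illustration. It uses only the Java standard library, so it runs without TET or Lucene.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class InfoFieldMapper {
    // Hypothetical table: index field name -> pCOS path of the info entry.
    // Add or remove rows here to change which entries get indexed.
    static Map<String, String> defaultPaths() {
        Map<String, String> paths = new LinkedHashMap<>();
        paths.put("title", "/Info/Title");
        paths.put("summary", "/Info/Subject");
        paths.put("author", "/Info/Author");
        return paths;
    }

    // 'pcos' is a stand-in for a lookup such as tet.pcos_get_string(tetHandle, path).
    static Map<String, String> collect(Map<String, String> paths,
                                       Function<String, String> pcos) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : paths.entrySet()) {
            // "type:<path>" yields "null" when the entry does not exist.
            String type = pcos.apply("type:" + e.getValue());
            if (!"null".equals(type)) {
                fields.put(e.getKey(), pcos.apply(e.getValue()));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stub that imitates a document with two info entries (assumed data).
        Map<String, String> info = new LinkedHashMap<>();
        info.put("/Info/Title", "Sample title");
        info.put("/Info/Subject", "Sample subject");
        Function<String, String> stub = path -> {
            if (path.startsWith("type:")) {
                return info.containsKey(path.substring(5)) ? "string" : "null";
            }
            return info.get(path);
        };
        // Only the entries that exist in the stub end up in the result.
        System.out.println(collect(defaultPaths(), stub));
    }
}
```

In a real connector, each entry of the resulting map would presumably be added to the Lucene document as a field, in the same way as the "summary" field shown earlier.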
PDF file attachments. The Lucene connector for <strong>TET</strong> recursively processes all PDF file<br />
attachments in a document, and feeds the text and metadata of each attachment to the<br />
Lucene search engine for indexing. This way search hits will be generated even if the<br />
searched text is not present in the main document but only in an attachment. Recursive attachment<br />
traversal is especially important for PDF packages and portfolios.<br />
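The recursive traversal described above can be pictured as a depth-first walk over a tree of documents. The sketch below is an illustration only and does not reproduce the connector's code: PdfNode is a hypothetical in-memory stand-in for a PDF and its embedded PDF attachments, and the path strings it builds are an assumed naming scheme. A real implementation would open each attachment with TET rather than read it from memory.

```java
import java.util.ArrayList;
import java.util.List;

class AttachmentWalker {
    // Hypothetical stand-in for a PDF document with embedded PDF attachments.
    static final class PdfNode {
        final String name;
        final String text;
        final List<PdfNode> attachments = new ArrayList<>();

        PdfNode(String name, String text) {
            this.name = name;
            this.text = text;
        }

        PdfNode attach(PdfNode child) {
            attachments.add(child);
            return this;
        }
    }

    // Depth-first walk: record the document itself, then descend into
    // each attachment so that nested attachments are reached as well.
    static void collect(PdfNode node, String parentPath, List<String> out) {
        String path = parentPath.isEmpty() ? node.name : parentPath + "/" + node.name;
        out.add(path + ": " + node.text);
        for (PdfNode child : node.attachments) {
            collect(child, path, out);
        }
    }

    public static void main(String[] args) {
        // A portfolio whose attachment carries an attachment of its own.
        PdfNode portfolio = new PdfNode("portfolio.pdf", "cover sheet")
            .attach(new PdfNode("report.pdf", "quarterly figures")
                .attach(new PdfNode("appendix.pdf", "raw data")));
        List<String> entries = new ArrayList<>();
        collect(portfolio, "", entries);
        for (String entry : entries) {
            System.out.println(entry);
        }
    }
}
```

Each collected entry corresponds to one unit of text and metadata that would be handed to the search engine, which is why text found only in a nested attachment can still produce a hit.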
4.2 <strong>TET</strong> Connector for the Lucene Search Engine 47