17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

PDFlib Text Extraction Toolkit (TET) Manual

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

4.3 <strong>TET</strong> Connector for the Solr Search Server<br />

Solr is a high performance open-source enterprise search server based on the Lucene<br />

search library, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted<br />

search, caching, replication, and a web admin interface. It runs in a Java servlet container<br />

(see lucene.apache.org/solr).<br />

Solar acts as an additional layer around the Lucene core engine. It expects the indexed<br />

data in a simple XML format. Solr input can most easily be generated based on<br />

<strong>TET</strong>ML, the XML flavor produced by <strong>TET</strong>. The <strong>TET</strong> connector for Solr consists of an XSLT<br />

stylesheet which converts <strong>TET</strong>ML to the XML format expected by Solr. The <strong>TET</strong>ML input<br />

for this stylesheet can be generated with the <strong>TET</strong> library or the <strong>TET</strong> command-line tool<br />

(see Section 8.1, »Creating <strong>TET</strong>ML«, page 89).<br />

Note Protected documents can be indexed with the shrug option under certain conditions (see<br />

Chapter 5.1, »Indexing protected PDF Documents«, page 49, for details). In order to index protected<br />

documents you must enable this option in the <strong>TET</strong> library or the <strong>TET</strong> command-line tool<br />

when generating the <strong>TET</strong>ML input for Solr.<br />

Indexing metadata fields. The <strong>TET</strong> connector for Solr indexes all standard document<br />

info fields. The key of each field will be used as the field name.<br />

PDF file attachments. The <strong>TET</strong> connector for Solr recursively processes all PDF file attachments<br />

in a document, and feeds the text and metadata of each attachment to the<br />

search engine for indexing. This way search hits will be generated even if the searched<br />

text is not present in the main document but some attachment. Recursive attachment<br />

traversal is especially important for PDF packages and portfolios.<br />

XSLT stylesheet for converting <strong>TET</strong>ML. The solr.xsl stylesheet expects <strong>TET</strong>ML input in<br />

any mode except glyph. It generates the XML required to supply input data to the search<br />

server. Document info entries are supplied as fields which carry the name of the info<br />

entry (plus the _s suffix to indicate a string value), and the main text is supplied in a<br />

number of text fields. PDF attachments (including PDF packages and portfolios) in the<br />

document will be processed recursively:<br />

<br />

<br />

<br />

<strong>PDFlib</strong>-FontReporter-E.pdf<br />

<strong>PDFlib</strong> GmbH<br />

2008-07-08T15:05:39+00:00<br />

FrameMaker 7.0<br />

2008-07-08T15:05:39+00:00<br />

Acrobat Distiller 7.0.5 (Windows)<br />

<strong>PDFlib</strong> FontReporter<br />

<strong>PDFlib</strong> FontReporter 1.3 <strong>Manual</strong><br />

<strong>PDFlib</strong><br />

GmbH<br />

München<br />

...<br />

40 Chapter 4: <strong>TET</strong> Connectors

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!