17.05.2014 Views

PDFlib Text Extraction Toolkit (TET) Manual


4.3 TET Connector for the Solr Search Server

Solr is a high-performance open-source enterprise search server based on the Lucene search library, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface. It runs in a Java servlet container (see lucene.apache.org/solr).

Solr acts as an additional layer around the Lucene core engine. It expects the data to be indexed in a simple XML format. The easiest way to generate Solr input is from TETML, the XML flavor produced by TET. The TET connector for Solr consists of an XSLT stylesheet which converts TETML to the XML format expected by Solr. The TETML input for this stylesheet can be generated with the TET library or the TET command-line tool (see Section 9.1, »Creating TETML«, page 123).

Note Protected documents can be indexed with the shrug option under certain conditions (see Chapter 5.1, »Extracting Content from protected PDF«, page 59, for details). In order to index protected documents you must enable this option in the TET library or the TET command-line tool when generating the TETML input for Solr.

Indexing metadata fields. The TET connector for Solr indexes all standard document info fields. The key of each field will be used as the field name.

PDF file attachments. The TET connector for Solr recursively processes all PDF file attachments in a document, and feeds the text and metadata of each attachment to the search engine for indexing. This way search hits will be generated even if the searched text is not present in the main document but only in an attachment. Recursive attachment traversal is especially important for PDF packages and portfolios.
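The recursive traversal described above can be illustrated with a small Python sketch. It is not the connector's implementation (the connector is an XSLT stylesheet operating on TETML); the nested dictionary layout and the names `iter_documents`, `portfolio`, `name`, `text`, and `attachments` are hypothetical and chosen only for illustration.

```python
def iter_documents(doc, path=()):
    """Yield (path, info, text) for a document and all nested attachments."""
    here = path + (doc["name"],)
    # Emit the document itself first ...
    yield "/".join(here), doc.get("info", {}), doc.get("text", "")
    # ... then descend into each attachment, which may contain attachments too.
    for attachment in doc.get("attachments", []):
        yield from iter_documents(attachment, here)


# Hypothetical portfolio: the search term occurs only in a nested attachment.
portfolio = {
    "name": "portfolio.pdf",
    "text": "cover sheet",
    "attachments": [
        {
            "name": "report.pdf",
            "text": "quarterly figures",
            "attachments": [{"name": "appendix.pdf", "text": "raw tables"}],
        },
        {"name": "notes.pdf", "text": "meeting notes"},
    ],
}

hits = [p for p, _info, text in iter_documents(portfolio) if "tables" in text]
print(hits)
```

Because every level is fed to the index, the term is found although the main document does not contain it.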

XSLT stylesheet for converting TETML. The solr.xsl stylesheet expects TETML input in any mode except glyph. It generates the XML required to supply input data to the search server. Document info entries are supplied as fields which carry the name of the info entry (plus the _s suffix to indicate a string value), and the main text is supplied in a number of text fields. PDF attachments (including PDF packages and portfolios) in the document will be processed recursively:

<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="id">PDFlib-FontReporter-E.pdf</field>
<field name="Author_s">PDFlib GmbH</field>
<field name="CreationDate_s">2008-07-08T15:05:39+00:00</field>
<field name="Creator_s">FrameMaker 7.0</field>
<field name="ModDate_s">2008-07-08T15:05:39+00:00</field>
<field name="Producer_s">Acrobat Distiller 7.0.5 (Windows)</field>
<field name="Subject_s">PDFlib FontReporter</field>
<field name="Title_s">PDFlib FontReporter 1.3 Manual</field>
<field name="text">PDFlib</field>
<field name="text">GmbH</field>
<field name="text">München</field>
...
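The field naming rules described above (info entry name plus the _s suffix, and the main text in a number of text fields) can be sketched in Python with the standard library. This is only an illustration of the output format, not a replacement for the solr.xsl stylesheet. The function name `build_solr_add` is hypothetical, and using the file name as an `id` field and `text` as the name of the text fields are assumptions.

```python
import xml.etree.ElementTree as ET


def build_solr_add(doc_id, doc_info, words):
    """Build a Solr <add> document from info entries and extracted words."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Assumption: the file name serves as the unique "id" field.
    ET.SubElement(doc, "field", name="id").text = doc_id
    # Each document info entry becomes a field named <key>_s (string value).
    for key in sorted(doc_info):
        ET.SubElement(doc, "field", name=key + "_s").text = doc_info[key]
    # Assumption: the main text is supplied as repeated "text" fields.
    for word in words:
        ET.SubElement(doc, "field", name="text").text = word
    return ET.tostring(add, encoding="unicode")


solr_xml = build_solr_add(
    "PDFlib-FontReporter-E.pdf",
    {"Author": "PDFlib GmbH", "Title": "PDFlib FontReporter 1.3 Manual"},
    ["PDFlib", "GmbH", "München"],
)
print(solr_xml)
```

Building the tree with ElementTree rather than string concatenation ensures that special characters in info entries or text are escaped correctly.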

