PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
PDFlib Text Extraction Toolkit (TET) Manual
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
4.3 <strong>TET</strong> Connector for the Solr Search Server<br />
Solr is a high performance open-source enterprise search server based on the Lucene<br />
search library, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted<br />
search, caching, replication, and a web admin interface. It runs in a Java servlet container<br />
(see lucene.apache.org/solr).<br />
Solar acts as an additional layer around the Lucene core engine. It expects the indexed<br />
data in a simple XML format. Solr input can most easily be generated based on<br />
<strong>TET</strong>ML, the XML flavor produced by <strong>TET</strong>. The <strong>TET</strong> connector for Solr consists of an XSLT<br />
stylesheet which converts <strong>TET</strong>ML to the XML format expected by Solr. The <strong>TET</strong>ML input<br />
for this stylesheet can be generated with the <strong>TET</strong> library or the <strong>TET</strong> command-line tool<br />
(see Section 8.1, »Creating <strong>TET</strong>ML«, page 89).<br />
Note Protected documents can be indexed with the shrug option under certain conditions (see<br />
Chapter 5.1, »Indexing protected PDF Documents«, page 49, for details). In order to index protected<br />
documents you must enable this option in the <strong>TET</strong> library or the <strong>TET</strong> command-line tool<br />
when generating the <strong>TET</strong>ML input for Solr.<br />
Indexing metadata fields. The <strong>TET</strong> connector for Solr indexes all standard document<br />
info fields. The key of each field will be used as the field name.<br />
PDF file attachments. The <strong>TET</strong> connector for Solr recursively processes all PDF file attachments<br />
in a document, and feeds the text and metadata of each attachment to the<br />
search engine for indexing. This way search hits will be generated even if the searched<br />
text is not present in the main document but some attachment. Recursive attachment<br />
traversal is especially important for PDF packages and portfolios.<br />
XSLT stylesheet for converting <strong>TET</strong>ML. The solr.xsl stylesheet expects <strong>TET</strong>ML input in<br />
any mode except glyph. It generates the XML required to supply input data to the search<br />
server. Document info entries are supplied as fields which carry the name of the info<br />
entry (plus the _s suffix to indicate a string value), and the main text is supplied in a<br />
number of text fields. PDF attachments (including PDF packages and portfolios) in the<br />
document will be processed recursively:<br />
<br />
<br />
<br />
<strong>PDFlib</strong>-FontReporter-E.pdf<br />
<strong>PDFlib</strong> GmbH<br />
2008-07-08T15:05:39+00:00<br />
FrameMaker 7.0<br />
2008-07-08T15:05:39+00:00<br />
Acrobat Distiller 7.0.5 (Windows)<br />
<strong>PDFlib</strong> FontReporter<br />
<strong>PDFlib</strong> FontReporter 1.3 <strong>Manual</strong><br />
<strong>PDFlib</strong><br />
GmbH<br />
München<br />
...<br />
40 Chapter 4: <strong>TET</strong> Connectors