11.05.2016 Views

Apache Solr Reference Guide Covering Apache Solr 6.0


The TikaEntityProcessor uses Apache Tika to process incoming documents. This is similar to Uploading Data with Solr Cell using Apache Tika, but using the DataImportHandler options instead.

Here is an example from the "tika" collection of the dih example (example/example-DIH/tika/conf/tika-data-config.xml):

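A minimal sketch of a TikaEntityProcessor configuration of this kind, assuming a BinFileDataSource and a single local PDF. The entity name, the file path, and the Solr field names (author, title, text) are illustrative assumptions, not a verbatim copy of the shipped file:

```xml
<dataConfig>
  <!-- BinFileDataSource reads binary content from the local filesystem -->
  <dataSource type="BinFileDataSource"/>
  <document>
    <!-- url is required; /path/to/sample.pdf is a hypothetical path -->
    <entity name="tika-example" processor="TikaEntityProcessor"
            url="/path/to/sample.pdf" format="text">
      <!-- meta="true": take the value from the document metadata, not the body -->
      <field column="Author" name="author" meta="true"/>
      <field column="title" name="title" meta="true"/>
      <!-- the extracted body is mapped from the "text" column -->
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>
```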

The parameters for this processor are described in the table below:

Attribute / Use

dataSource
    This parameter defines the data source and an optional name which can be referred to in later parts of the configuration if needed. This is the same dataSource explained in the description of general entity processor attributes above.
    The available data source types for this processor are:
    BinURLDataSource: used for HTTP resources, but can also be used for files.
    BinContentStreamDataSource: used for uploading content as a stream.
    BinFileDataSource: used for content on the local filesystem.

url
    The path to the source file(s), as a file path or a traditional internet URL. This parameter is required.

htmlMapper
    Allows control of how Tika parses HTML. The "default" mapper strips much of the HTML from documents while the "identity" mapper passes all HTML as-is with no modifications. If this parameter is defined, it must be either default or identity; if it is absent, "default" is assumed.

format
    The output format. The options are text, xml, html or none. The default is "text" if not defined. The format "none" can be used if metadata only should be indexed and not the body of the documents.

parser
    The default parser is org.apache.tika.parser.AutoDetectParser. If a custom or other parser should be used, it should be entered as a fully-qualified name of the class and path.

fields
    The list of fields from the input documents and how they should be mapped to Solr fields. If the attribute meta is defined as "true", the field will be obtained from the metadata of the document and not parsed from the body of the main text.

extractEmbedded
    Instructs the TikaEntityProcessor to extract embedded documents or attachments when true. If false, embedded documents and attachments will be ignored.
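To illustrate how the optional attributes combine, here is a hedged sketch that fetches an HTML page over HTTP, keeps its markup, and also extracts embedded documents. The data source name "bin", the entity name, the example.com URL, and the Solr field names are assumptions made for illustration only:

```xml
<dataConfig>
  <!-- Named BinURLDataSource so the entity can refer to it explicitly -->
  <dataSource type="BinURLDataSource" name="bin"/>
  <document>
    <!-- htmlMapper="identity" passes HTML through unmodified;
         format="html" selects HTML output;
         extractEmbedded="true" also processes attachments.
         The URL below is hypothetical. -->
    <entity name="tika-html" processor="TikaEntityProcessor"
            dataSource="bin"
            url="http://example.com/docs/page.html"
            htmlMapper="identity"
            format="html"
            extractEmbedded="true">
      <field column="title" name="title" meta="true"/>
      <field column="text" name="text"/>
    </entity>
  </document>
</dataConfig>
```

Omitting htmlMapper, format, and extractEmbedded falls back to the defaults described in the table ("default" mapper and "text" output).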
