11.05.2016 Views

Apache Solr Reference Guide Covering Apache Solr 6.0


Method                 Description
MD5Signature           128-bit hash used for exact duplicate detection.
Lookup3Signature       64-bit hash used for exact duplicate detection; much faster than MD5 and smaller to index.
TextProfileSignature   Fuzzy hashing implementation from Nutch for near-duplicate detection. It's tunable but works best on longer text.

Other, more sophisticated algorithms for fuzzy/near hashing can be added later.
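To make the exact-duplicate case concrete, here is a minimal Python sketch. It is not Solr's implementation: it simply hashes the values of a few chosen fields with MD5 from the standard library. The function name `exact_signature` and the sample documents are assumptions made for illustration.

```python
import hashlib


def exact_signature(doc, fields):
    """Return a 128-bit MD5 hex digest over the chosen fields of a document.

    Illustrative only: Solr's MD5Signature may combine field values differently.
    """
    digest = hashlib.md5()
    for name in sorted(fields):
        value = str(doc.get(name, ""))
        # Include the field name and a separator so that values cannot
        # run together across field boundaries.
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(value.encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


# Hypothetical sample documents.
original = {"id": "1", "name": "Solr in Action", "cat": "book"}
copy = {"id": "2", "name": "Solr in Action", "cat": "book"}
near_copy = {"id": "3", "name": "Solr in Action!", "cat": "book"}

fields = ["name", "cat"]
sig_original = exact_signature(original, fields)
sig_copy = exact_signature(copy, fields)
sig_near = exact_signature(near_copy, fields)
```

The two exact copies share a signature even though their ids differ, while the one-character change in `near_copy` produces an unrelated signature. That gap is what a fuzzy method such as TextProfileSignature is meant to cover.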

Adding in the de-duplication process will change the allowDups setting so that it applies to an update Term (with signatureField in this case) rather than the unique field Term. Of course the signatureField could be the unique field, but generally you want the unique field to be unique. When a document is added, a signature will automatically be generated and attached to the document in the specified signatureField.
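The effect of keying updates on the signature rather than on the unique field can be modeled with a toy in-memory index. This is a sketch under stated assumptions, not how Solr stores documents: the class `TinyIndex`, the field name `signature`, and the MD5-based hashing are all hypothetical choices for illustration.

```python
import hashlib


class TinyIndex:
    """Toy in-memory index that replaces documents sharing a signature."""

    def __init__(self, signature_field, fields):
        self.signature_field = signature_field
        self.fields = fields
        self.docs = {}  # signature value -> stored document

    def _signature(self, doc):
        joined = "\x00".join(str(doc.get(f, "")) for f in self.fields)
        return hashlib.md5(joined.encode("utf-8")).hexdigest()

    def add(self, doc):
        stored = dict(doc)
        # Generate the signature and attach it to the document.
        stored[self.signature_field] = self._signature(doc)
        # Keying on the signature means a later duplicate replaces the earlier one.
        self.docs[stored[self.signature_field]] = stored
        return stored


index = TinyIndex(signature_field="signature", fields=["name", "cat"])
index.add({"id": "1", "name": "Solr in Action", "cat": "book"})
index.add({"id": "2", "name": "Solr in Action", "cat": "book"})
index.add({"id": "3", "name": "Lucene in Action", "cat": "book"})
```

After the three calls the index holds two documents: the one with id 2 has replaced id 1 because their signatures match, and each stored document carries its generated signature.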

Configuration Options

There are two places in Solr to configure de-duplication: in solrconfig.xml and in schema.xml.

In solrconfig.xml

The SignatureUpdateProcessorFactory has to be registered in solrconfig.xml as part of an Update Request Processor Chain, as in this example:

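<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>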
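Once a chain is registered, an update has to be routed through it. One way to do that is the `update.chain` request parameter. The Python sketch below only builds such a request and does not send it. The host and port, the core name `mycore`, and the chain name `dedupe` are assumptions for illustration and should be replaced with the values of the actual installation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions: a local Solr on the default port, a core named "mycore",
# and a chain registered under the name "dedupe".
SOLR_BASE = "http://localhost:8983/solr"


def build_update_request(core, chain, docs):
    """Build (but do not send) a JSON update request routed through a chain."""
    query = urlencode({"update.chain": chain, "commit": "true"})
    url = f"{SOLR_BASE}/{core}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request(
    "mycore", "dedupe", [{"id": "1", "name": "Solr in Action", "cat": "book"}]
)
# Sending is left out here; urllib.request.urlopen(request) would perform the POST.
```

Building the request separately from sending it keeps the example runnable without a live Solr instance and makes the resulting URL easy to inspect.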

The SignatureUpdateProcessorFactory takes several properties:

Parameter        Default                                             Description
signatureClass   org.apache.solr.update.processor.Lookup3Signature   A Signature implementation for generating a signature hash. The full class name of the implementation, including its package, must be specified. The available options are described above; the associated class names to use are:
                                                                     org.apache.solr.update.processor.Lookup3Signature
                                                                     org.apache.solr.update.processor.MD5Signature
                                                                     org.apache.solr.update.processor.TextProfileSignature
