
Apache Solr Reference Guide Covering Apache Solr 6.0


The level of redundancy built into the Collection and how fault tolerant the Cluster can be in the<br />

event that some Nodes become unavailable.<br />

The theoretical limit on the number of concurrent search requests that can be processed under heavy load.<br />

Shards and Indexing Data in <strong>Solr</strong>Cloud<br />

When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index.<br />

A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard<br />

for data that represents each state, or different categories that are likely to be searched independently, but are<br />

often combined.<br />

Before <strong>Solr</strong>Cloud, <strong>Solr</strong> supported Distributed Search, which allowed one query to be executed across multiple<br />

shards, so the query was executed against the entire <strong>Solr</strong> index and no documents would be missed from the<br />

search results. So splitting the core across shards is not exclusively a <strong>Solr</strong>Cloud concept. There were, however,<br />

several problems with the distributed approach that necessitated improvement with <strong>Solr</strong>Cloud:<br />

1. Splitting of the core into shards was somewhat manual.<br />

2. There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; <strong>Solr</strong> couldn't figure out on its own what shards to send documents to.<br />

3. There was no load balancing or failover, so if you got a high number of queries, you needed to figure out where to send them, and if one shard died it was just gone.<br />

<strong>Solr</strong>Cloud fixes all those problems. There is support for distributing both the indexing process and the queries<br />

automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have<br />

multiple replicas for additional robustness.<br />

In <strong>Solr</strong>Cloud there are no masters or slaves. Instead, there are leaders and replicas. Leaders are automatically<br />

elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.<br />

If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's<br />

assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard<br />

ID.<br />
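The assignment rule above can be sketched in a few lines (an illustration only, with made-up shard names and replica counts; this is not Solr's internal code):<br />

```python
def assign_new_node(shards):
    """Choose a shard for a newly started node: the shard with the
    fewest replicas wins, and ties go to the lowest shard ID."""
    # shards maps shard ID -> current replica count
    return min(shards, key=lambda shard_id: (shards[shard_id], shard_id))

# shard1 and shard3 are tied with one replica each; shard1 has the lower ID.
print(assign_new_node({"shard1": 1, "shard2": 2, "shard3": 1}))  # shard1
```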

When a document is sent to a machine for indexing, the system first determines if the machine is a replica or a<br />

leader.<br />

If the machine is a replica, the document is forwarded to the leader for processing.<br />

If the machine is a leader, <strong>Solr</strong>Cloud determines which shard the document should go to, forwards the<br />

document to the leader for that shard, indexes the document for this shard, and forwards the index notation<br />

to itself and any replicas.<br />
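The forwarding steps above can be sketched as a toy model (all names here are hypothetical; Solr does this internally over HTTP, and CRC32 merely stands in for its MurmurHash3-based routing):<br />

```python
import zlib

class Node:
    def __init__(self, name):
        self.name = name
        self.indexed = []  # document IDs this node has indexed

class Cluster:
    """Toy cluster: each shard name maps to [leader, replica, ...]."""
    def __init__(self, shards):
        self.shards = shards

    def shard_for(self, doc_id):
        # Stand-in hash routing; real Solr uses MurmurHash3 hash ranges.
        names = sorted(self.shards)
        return names[zlib.crc32(doc_id.encode()) % len(names)]

    def index_document(self, receiving_node, doc_id):
        # A replica forwards the document to its leader, which routes it
        # to the owning shard's leader (the hops are collapsed here).
        shard = self.shard_for(doc_id)
        leader, *replicas = self.shards[shard]
        leader.indexed.append(doc_id)       # the leader indexes it...
        for replica in replicas:
            replica.indexed.append(doc_id)  # ...and forwards to its replicas
        return shard
```

However the document enters the cluster, it ends up indexed exactly once on every node of the shard that owns it.<br />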

Document Routing<br />

<strong>Solr</strong> offers the ability to specify the router implementation used by a collection by specifying the router.name parameter when creating your collection. If you use the "compositeId" router, you can send documents with a<br />

prefix in the document ID which will be used to calculate the hash <strong>Solr</strong> uses to determine the shard a document<br />

is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for<br />

example), but it must be consistent so <strong>Solr</strong> behaves consistently. For example, if you wanted to co-locate<br />

documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for<br />

example, with a document with the ID "12345", you would insert the prefix into the document id field:<br />

"IBM!12345". The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which<br />

shard to direct the document to.<br />
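The routing effect of the prefix can be sketched as follows. This is an illustration only: the hash here is CRC32 over the prefix alone, whereas Solr's compositeId router uses MurmurHash3 and combines prefix bits with document-ID bits over per-shard hash ranges. The key property is the same: documents sharing a prefix always route to the same shard.<br />

```python
import zlib

def shard_for(doc_id, num_shards):
    """Route a compositeId: hash only the part before '!', so every
    document with the same prefix lands on the same shard."""
    prefix = doc_id.split("!", 1)[0]
    return zlib.crc32(prefix.encode()) % num_shards

# All of customer IBM's documents are co-located on one shard:
assert shard_for("IBM!12345", 4) == shard_for("IBM!67890", 4)
```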

