15.07.2016 Views

MARKLOGIC SERVER

Inside-MarkLogic-Server

Inside-MarkLogic-Server

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

CLUSTERING AND CACHING<br />

As your data size grows and your request load increases, you might hear one of two<br />

things from a software vendor. The vendor might tell you that they have a monolithic<br />

server design requiring you to buy ever larger boxes (each one exponentially more<br />

expensive than the last), or they might tell you they've architected their system to<br />

cluster—to take a group of commodity servers and have them operate together as a<br />

single system (thus keeping your costs down). MarkLogic? It's designed to cluster.<br />

Clustering provides four key advantages:<br />

1. The ability to use commodity servers, bought for reasonable prices<br />

2. The ability to incrementally add (or remove) new servers as needed (as data or<br />

users grow)<br />

3. The ability to maximize cache locality, by having different servers optimized for<br />

different roles and managing different parts of the data<br />

4. The ability to include failover capabilities, to handle server failures<br />

MarkLogic servers placed in a cluster specialize in one of two roles. They can be<br />

Evaluators (E-nodes) or Data Managers (D-nodes). E-nodes listen on a socket, parse<br />

requests, and generate responses. D-nodes hold data along with its associated indexes<br />

and support E-nodes by providing them with the data they need to satisfy requests, as<br />

well as to process updates.<br />

As your user load grows, you can add more E-nodes. As your data size grows, you can<br />

add more D-nodes. A mid-sized cluster might have two E-nodes and 10 D-nodes, with<br />

each D-node responsible for about 10% of the total data.<br />

A load balancer usually spreads incoming requests across the E-nodes. An E-node<br />

processes the request and delegates to the D-nodes any subexpression of the request<br />

involving data retrieval or storage. If the request needs, for example, the 10 most recent<br />

items matching a query, the E-node sends the query constraint to every D-node with<br />

a forest in the database, and each D-node responds with the most recent results for<br />

its portion of the data. The E-node coalesces the partial answers into a single unified<br />

answer: the most recent items across the full cluster. It's an intra-cluster back-and-forth<br />

that happens frequently as part of every request. It happens any time the request<br />

includes a subexpression requiring index work, document locking, fragment retrieval, or<br />

fragment storage.<br />

D-nodes, for the sake of efficiency, don't send full fragments across the wire unless<br />

they're truly needed by the E-node. When doing a relevance-based search, for example,<br />

each D-node forest returns an iterator, within which there's an ordered series of<br />

fragment IDs and scores, extracted from indexes. The E-node pulls entries from each<br />

47

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!