CLUSTERING AND CACHING
As your data size grows and your request load increases, you might hear one of two things from a software vendor. The vendor might tell you that they have a monolithic server design requiring you to buy ever larger boxes (each one exponentially more expensive than the last), or they might tell you they've architected their system to cluster: to take a group of commodity servers and have them operate together as a single system (thus keeping your costs down). MarkLogic? It's designed to cluster.
Clustering provides four key advantages:
1. The ability to use commodity servers, bought at reasonable prices
2. The ability to incrementally add (or remove) servers as needed, as data or user load grows
3. The ability to maximize cache locality, by having different servers optimized for different roles and managing different parts of the data
4. The ability to include failover capabilities, to handle server failures
MarkLogic servers placed in a cluster specialize in one of two roles. They can be Evaluators (E-nodes) or Data Managers (D-nodes). E-nodes listen on a socket, parse requests, and generate responses. D-nodes hold data along with its associated indexes and support E-nodes by providing them with the data they need to satisfy requests, as well as to process updates.
As your user load grows, you can add more E-nodes. As your data size grows, you can add more D-nodes. A mid-sized cluster might have two E-nodes and 10 D-nodes, with each D-node responsible for about 10% of the total data.
A load balancer usually spreads incoming requests across the E-nodes. An E-node processes the request and delegates to the D-nodes any subexpression of the request involving data retrieval or storage. If the request needs, for example, the 10 most recent items matching a query, the E-node sends the query constraint to every D-node with a forest in the database, and each D-node responds with the most recent results for its portion of the data. The E-node then coalesces the partial answers into a single unified answer: the most recent items across the full cluster. This intra-cluster back-and-forth happens as part of every request that includes a subexpression requiring index work, document locking, fragment retrieval, or fragment storage.
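The scatter-gather pattern described above can be sketched in a few lines. This is an illustrative model only, not MarkLogic's API: `dnode_top_k` stands in for the per-forest work a D-node does over its slice of the data, and `enode_search` plays the E-node's role of fanning the query out and merging the already-sorted partial answers into a cluster-wide result.

```python
import heapq
from itertools import islice

def dnode_top_k(forest, query, k):
    """Each D-node answers only for its own slice of the data:
    it returns its k most recent matches as (timestamp, doc_id),
    newest first. (Hypothetical names, for illustration.)"""
    hits = [(doc["ts"], doc["id"]) for doc in forest if query(doc)]
    return heapq.nlargest(k, hits)

def enode_search(forests, query, k):
    """Scatter the query to every forest, then gather: merge the
    sorted partial lists and keep the overall k newest items."""
    partials = [dnode_top_k(f, query, k) for f in forests]
    return list(islice(heapq.merge(*partials, reverse=True), k))
```

Note that each D-node only needs to return at most k items for the E-node to produce a correct global top-k, which is what keeps the amount of data crossing the wire small.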
D-nodes, for the sake of efficiency, don't send full fragments across the wire unless they're truly needed by the E-node. When doing a relevance-based search, for example, each D-node forest returns an iterator, within which there's an ordered series of fragment IDs and scores, extracted from indexes. The E-node pulls entries from each