
A MapReduce job reading data from MarkLogic employs some tricks to make the reading more efficient. It gets from each forest a manifest of what documents are present in the forest (optionally limiting it to documents matching an ad hoc query constraint, or to subdocument sections), divides the documents into "splits," and assigns those splits to map tasks that can run in parallel across Hadoop nodes. Communication always happens between the Hadoop node running the map task and the machine hosting the forest. (Side note: in order for the Hadoop node to communicate with the MarkLogic node hosting the forest, the MarkLogic node has to have an open XDBC port, meaning it needs to be a combination E-node/D-node.)
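One way to picture the ad hoc constraint is as a cts query over the forest's documents. As a hypothetical sketch (the collection and element names are illustrative, not taken from the connector documentation), a constraint limiting the manifest to open invoices might look like:

    (: Hypothetical constraint: handed to the connector's input configuration,
       it would limit each forest's manifest to matching documents only.
       The collection and element names are made up for illustration. :)
    cts:and-query((
      cts:collection-query("invoices"),
      cts:element-value-query(xs:QName("status"), "open")
    ))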

The connector code also internally uses the little-known "unordered" feature to pull the documents in the order they're stored on disk. The fn:unordered($sequence) function provides a hint to the optimizer that the order of the items in the sequence does not matter. MarkLogic uses that liberty to return results in the order they're stored on disk (by increasing internal fragment ID), providing better performance on disks that optimize sequential reads.
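In hand-written XQuery the same hint looks like the sketch below; the word query is just a placeholder:

    (: fn:unordered tells the optimizer the result order is irrelevant,
       so MarkLogic is free to stream matches back in on-disk
       fragment-ID order rather than relevance or document order. :)
    fn:unordered(
      cts:search(fn:collection(), cts:word-query("hadoop"))
    )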

A MapReduce job writing data to MarkLogic also employs a trick to gain performance. If the reduce step inserts a document into the database and the connector deems it safe, the insert will use in-forest placement so that communication happens directly with the host holding the forest.
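A minimal sketch of in-forest placement in XQuery, assuming the older xdmp:document-insert signature whose final argument is a sequence of forest IDs; the forest name and the default permission, collection, and quality arguments are placeholders:

    (: Pinning a document to a specific forest so the insert goes
       straight to that forest's host. "my-forest" is a placeholder. :)
    xdmp:document-insert(
      "/example/doc.xml",
      <doc>example content</doc>,
      xdmp:default-permissions(),
      (),                        (: collections :)
      0,                         (: quality :)
      xdmp:forest("my-forest")   (: target forest ID :)
    )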

As an administrator, whether you want to place a Hadoop process onto each MarkLogic node or keep them separate depends on your workload. Co-location reduces network traffic between MarkLogic and the MapReduce tasks but places a heavier computational and memory burden on the host.

The MarkLogic Content Pump (MLCP), discussed later, is a command-line program built on Hadoop for managing bulk import, export, and data copy tasks. MarkLogic can also leverage the Hadoop Distributed File System (HDFS) for storing database content. This is useful for tiered storage deployments, discussed next.

A World with No Hadoop

What about a world with no Hadoop? You could still do the same work, and people have, but usually for simplicity's sake they have loaded from one machine, driven a bulk transformation from one machine, or pulled an export down to one machine. This limits performance to that of the one client. Hadoop lets an arbitrarily large cluster of machines act as the client, talking in parallel to all of the MarkLogic nodes, and that speeds up data flow into and out of MarkLogic.

