A MapReduce job reading data from MarkLogic employs some tricks to make the reading more efficient. It gets from each forest a manifest of what documents are present in the forest (optionally limiting it to documents matching an ad hoc query constraint, or to subdocument sections), divides up the documents into "splits," and assigns those splits to map jobs that can run in parallel across Hadoop nodes. Communication always happens between the Hadoop node running the map job and the machine hosting the forest. (Side note: In order for the Hadoop node to communicate with the MarkLogic node hosting the forest, the MarkLogic node has to have an open XDBC port, meaning it needs to be a combination E-node/D-node.)
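The split-generation step above can be sketched in miniature. This is an illustrative simulation, not the real connector API: the host names, URIs, and the `make_splits` helper are all invented for the example.

```python
# Illustrative sketch: divide each forest's document manifest into
# fixed-size "splits," each of which names the host holding that
# forest, so a map task reads directly from the right machine.

def make_splits(forest_manifests, max_split_size):
    """forest_manifests: {forest_host: [doc_uri, ...]}.
    Returns a list of (forest_host, [doc_uris]) splits."""
    splits = []
    for host, docs in forest_manifests.items():
        for start in range(0, len(docs), max_split_size):
            splits.append((host, docs[start:start + max_split_size]))
    return splits

# Hypothetical two-forest cluster: five documents across two hosts.
manifests = {
    "dnode1": ["/a.xml", "/b.xml", "/c.xml"],
    "dnode2": ["/d.xml", "/e.xml"],
}
splits = make_splits(manifests, max_split_size=2)
# Three splits result: two for dnode1, one for dnode2. Each split can
# be handed to a separate map task running in parallel.
```

Because every split carries the identity of its forest's host, the map tasks never need to route reads through an intermediary node.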
The connector code also internally uses the little-known "unordered" feature to pull the documents in the order they're stored on disk. The fn:unordered($sequence) function provides a hint to the optimizer that the order of the items in the sequence does not matter. MarkLogic uses that liberty to return the results in the order they're stored on disk (by increasing internal fragment ID), providing better performance on disks that optimize sequential reads.
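The effect of the unordered hint can be illustrated with a toy model. The fragment IDs below are invented; the point is only that retrieval in fragment-ID order turns scattered seeks into one sequential scan.

```python
# Illustrative sketch: "unordered" retrieval is free to return documents
# in on-disk order (increasing internal fragment ID) rather than in the
# order the caller listed them. Fragment IDs here are made up.

docs = {"/c.xml": 7, "/a.xml": 2, "/b.xml": 5}  # uri -> fragment ID

requested = ["/a.xml", "/c.xml", "/b.xml"]        # caller's order
unordered_scan = sorted(requested, key=docs.get)  # on-disk order
# unordered_scan == ["/a.xml", "/b.xml", "/c.xml"]
```

Reading in ascending fragment-ID order means the disk head (or read-ahead cache) moves in one direction through the forest's data, which is exactly the access pattern sequential-read optimizations reward.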
A MapReduce job writing data to MarkLogic also employs a trick to gain performance. If the reduce step inserts a document into the database and the connector deems it safe, the insert uses in-forest placement, so that communication happens directly with the host holding the forest.
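In-forest placement can be sketched as follows. The hashing scheme, host names, and `target_host` helper are stand-ins invented for this example, not MarkLogic's actual placement logic.

```python
# Illustrative sketch: with in-forest placement the writer computes the
# target forest itself and connects straight to that forest's host,
# rather than sending the insert to an arbitrary E-node that would
# forward it across the network.

import zlib

FORESTS = {0: "dnode1", 1: "dnode2", 2: "dnode3"}  # forest id -> host

def target_host(doc_uri):
    # Deterministic URI -> forest assignment (a stand-in for the real
    # placement decision the connector makes).
    forest_id = zlib.crc32(doc_uri.encode()) % len(FORESTS)
    return FORESTS[forest_id]

# A reduce task would open its connection to target_host(uri) and insert
# there, so the document never takes an extra inter-host hop.
```

The determinism matters: every writer computes the same forest for a given URI, so parallel reduce tasks can each talk to their own subset of hosts without coordination.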
As an administrator, whether you want to place a Hadoop process onto each MarkLogic node or keep them separate depends on your workload. Co-location reduces network traffic between MarkLogic and the MapReduce tasks but places a heavier computational and memory burden on the host.
The MarkLogic Content Pump (MLCP), discussed later, is a command-line program built on Hadoop for managing bulk import, export, and data copy tasks. MarkLogic can also leverage the Hadoop Distributed File System (HDFS) for storing database content. This is useful for tiered storage deployments, discussed next.
A World with No Hadoop
What about a world with no Hadoop? You could still do the same work, and people have, but for simplicity's sake they have usually loaded from one machine, driven a bulk transformation from one machine, or pulled down an export to one machine. This limits performance to that of the one client. Hadoop lets an arbitrarily large cluster of machines act as the client, talking in parallel to all of the MarkLogic nodes, and that speeds up data flow into and out of MarkLogic.