threshold needs to be reached among the forests to trigger rebalancing. This keeps<br />
rebalancing from occurring unnecessarily when there is a small difference in<br />
forest counts. 21<br />
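The threshold rule from footnote 21 can be sketched in a few lines of Java. This is an illustrative calculation, not MarkLogic's internal code; the class and method names are made up for the example.

```java
// Sketch of the rebalancing threshold from footnote 21:
// threshold = max(AVG / (20 * N), 10000), where AVG is the average
// fragment count across forests and N is the number of forests.
public class Rebalance {
    static long threshold(long avg, int forests) {
        return Math.max(avg / (20L * forests), 10_000L);
    }

    // Fragments move from forest A to forest B only when A's count is
    // above AVG + threshold and B's count is below AVG - threshold.
    static boolean shouldMove(long aCount, long bCount, long avg, int forests) {
        long t = threshold(avg, forests);
        return aCount > avg + t && bCount < avg - t;
    }

    public static void main(String[] args) {
        // With 4 forests averaging 1,000,000 fragments each,
        // the threshold is max(1000000/80, 10000) = 12,500.
        System.out.println("threshold = " + threshold(1_000_000, 4));
    }
}
```

With those numbers, rebalancing only kicks in once one forest exceeds 1,012,500 fragments while another sits below 987,500 — a small skew never triggers data movement.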
HADOOP<br />
MarkLogic leverages Apache Hadoop—specifically the MapReduce part of the<br />
Hadoop stack—to facilitate bulk processing of data. The Hadoop MapReduce engine<br />
has become a popular way to run Java-based, computationally intensive programs<br />
across a large number of nodes. It's called MapReduce because it breaks down all<br />
work into two tasks. The first is a user-written "map" task that takes in key-value<br />
pairs and outputs a series of intermediate key-value pairs. The second is a "reduce"<br />
task that takes in each key from the "map" phase along with all of the intermediate<br />
values generated for that key, then outputs a final series of values. This simple model makes it easy to parallelize<br />
the work, running the map and reduce tasks in parallel across machines, yet the model<br />
has proven to be robust enough to handle many complex workloads. MarkLogic uses<br />
MapReduce for bulk processing: for large-scale data ingestion, transformation, and<br />
export. (MarkLogic does not use Hadoop to run live queries or updates.)<br />
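The map/reduce model described above can be illustrated with a small in-memory word count in plain Java — no Hadoop dependency, and the class and method names here are invented for the sketch, not Hadoop's API.

```java
import java.util.*;
import java.util.stream.*;

// Minimal in-memory sketch of the map/reduce model: "map" emits
// intermediate key-value pairs, the framework groups values by key,
// and "reduce" folds each group into a final value.
public class MapReduceSketch {
    // "map" task: one input record -> intermediate (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .map(w -> Map.entry(w, 1));
    }

    // "reduce" task: a key plus all its intermediate values -> final count.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    static Map<String, Integer> run(List<String> input) {
        // Group intermediate pairs by key (the "shuffle" step),
        // then apply reduce to each group.
        Map<String, List<Integer>> grouped = input.stream()
            .flatMap(MapReduceSketch::map)
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        return grouped.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey,
                e -> reduce(e.getKey(), e.getValue())));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be")));
    }
}
```

Because each map call sees only one record and each reduce call sees only one key's values, both phases parallelize trivially — which is exactly what Hadoop exploits across a cluster.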
At a technical level, MarkLogic provides and supports a bi-directional connector for<br />
Hadoop. 22 This connector is open source, written in Java, and available separately<br />
from the main MarkLogic Server package. The connector includes logic that enables<br />
coordinated activity between a MarkLogic cluster and a Hadoop cluster. It ensures that<br />
all connectivity happens directly between machines, node-to-node, with no machine<br />
becoming a bottleneck.<br />
A MarkLogic Hadoop job typically follows one of three patterns:<br />
1. Read data from an external source, such as a filesystem or HDFS (the Hadoop Distributed File System), and push it into MarkLogic. This is your classic ETL (extract-transform-load) job. The data can be transformed (standardized, denormalized, deduplicated, reshaped) as much as necessary within Hadoop as part of the work.<br />
2. Read data from MarkLogic and output to an external destination, such as a filesystem or HDFS. This is your classic database export.<br />
3. Read data from MarkLogic and write the results back into MarkLogic. This is an efficient way to run a bulk transformation. For example, the MarkMail.org project wanted to "geo-tag" every email message with a latitude-longitude location based on IP addresses in the mail headers. The logic to determine the location based on IP was written in Java and executed in parallel against all messages with a Hadoop job running on a small Hadoop cluster.<br />
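The third pattern — a bulk read-transform-write-back job like MarkMail's geo-tagging — can be sketched in plain Java. This is only an illustration of the transform step: the lookup table, record shape, and names below are hypothetical, and no Hadoop or MarkLogic connector API appears here. In the real job, the equivalent lookup ran inside parallel Hadoop map tasks.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of pattern 3: read records, apply a per-record
// transform in parallel, and collect the rewritten results.
public class GeoTagSketch {
    // Hypothetical IP -> lat,long table (addresses from the
    // documentation ranges; a real job would use a geo-IP database).
    static final Map<String, String> ipToLocation = Map.of(
        "203.0.113.5", "37.77,-122.42",
        "198.51.100.7", "51.51,-0.13");

    // Transform one message: append a lat-long tag when the IP is known.
    static String geoTag(String message, String senderIp) {
        String loc = ipToLocation.getOrDefault(senderIp, "unknown");
        return message + " [geo=" + loc + "]";
    }

    public static void main(String[] args) {
        List<String[]> messages = List.of(
            new String[]{"msg-1", "203.0.113.5"},
            new String[]{"msg-2", "192.0.2.9"});
        // Parallel bulk transform, analogous to map tasks running
        // over all messages at once.
        List<String> tagged = messages.parallelStream()
            .map(m -> geoTag(m[0], m[1]))
            .collect(Collectors.toList());
        tagged.forEach(System.out::println);
    }
}
```

The design point is that each message is transformed independently, so the work shards cleanly across however many map tasks (or cores) are available.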
21 Suppose the average fragment count of the forests is AVG and the number of forests is N. A threshold is<br />
calculated as max(AVG/(20*N),10000). So, data is moved between forests A and B only when A's fragment<br />
count is greater than AVG + threshold and B's fragment count is smaller than AVG – threshold.<br />
22 MarkLogic supports only specific versions of Hadoop, and only certain operating systems. See the<br />
documentation for specifics.<br />