
threshold needs to be reached among the forests to trigger rebalancing. This keeps rebalancing from occurring unnecessarily when there is a small difference in forest counts. 21
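As a rough illustration of the threshold arithmetic spelled out in footnote 21, the following sketch computes the threshold from the average fragment count (AVG) and the number of forests (N), and checks whether two forests are far enough apart to justify moving data. The method and variable names are purely illustrative, not part of any MarkLogic API.

```java
// Illustrative sketch of the rebalancing threshold from footnote 21.
// Not MarkLogic code; names here are hypothetical.
public class RebalanceThreshold {

    // threshold = max(AVG / (20 * N), 10000)
    static long threshold(long avgFragments, int numForests) {
        return Math.max(avgFragments / (20L * numForests), 10_000L);
    }

    // Data moves from forest A to forest B only when A's fragment count is
    // above AVG + threshold and B's fragment count is below AVG - threshold.
    static boolean shouldMove(long aCount, long bCount, long avg, int numForests) {
        long t = threshold(avg, numForests);
        return aCount > avg + t && bCount < avg - t;
    }

    public static void main(String[] args) {
        long avg = 1_000_000;   // average fragment count across forests
        int forests = 4;
        System.out.println(threshold(avg, forests));                        // 12500
        System.out.println(shouldMove(1_020_000, 980_000, avg, forests));   // true
    }
}
```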

HADOOP

MarkLogic leverages Apache Hadoop—specifically the MapReduce part of the Hadoop stack—to facilitate bulk processing of data. The Hadoop MapReduce engine has become a popular way to run Java-based, computationally intensive programs across a large number of nodes. It's called MapReduce because it breaks down all work into two tasks. The first is a coded "map" task that takes in key-value pairs and outputs a series of intermediate key-value pairs. The second is a "reduce" task that takes in each key from the "map" phase along with all earlier generated values for that key, then outputs a final series of values. This simple model makes it easy to parallelize the work, running the map and reduce tasks in parallel across machines, yet the model has proven to be robust enough to handle many complex workloads. MarkLogic uses MapReduce for bulk processing: for large-scale data ingestion, transformation, and export. (MarkLogic does not use Hadoop to run live queries or updates.)

At a technical level, MarkLogic provides and supports a bi-directional connector for Hadoop. 22 This connector is open source, written in Java, and available separately from the main MarkLogic Server package. The connector includes logic that enables coordinated activity between a MarkLogic cluster and a Hadoop cluster. It ensures that all connectivity happens directly between machines, node-to-node, with no machine becoming a bottleneck.

A MarkLogic Hadoop job typically follows one of three patterns:

1. Read data from an external source, such as a filesystem or HDFS (the Hadoop Distributed File System), and push it into MarkLogic. This is your classic ETL (extract-transform-load) job. The data can be transformed (standardized, denormalized, deduplicated, reshaped) as much as necessary within Hadoop as part of the work.

2. Read data from MarkLogic and output to an external destination, such as a filesystem or HDFS. This is your classic database export.

3. Read data from MarkLogic and write the results back into MarkLogic. This is an efficient way to run a bulk transformation.

For example, the MarkMail.org project wanted to "geo-tag" every email message with a latitude-longitude location based on IP addresses in the mail headers. The logic to determine the location based on IP was written in Java and executed in parallel against all messages with a Hadoop job running on a small Hadoop cluster.
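A job driver for the third pattern might look roughly like the sketch below. It assumes the connector's `DocumentInputFormat`, `ContentOutputFormat`, `DocumentURI`, and `DatabaseDocument` classes and its `mapreduce.marklogic.*` configuration properties; these names are taken from the connector documentation as best recalled and may differ by connector version, so check the documentation for the exact classes and required settings. `GeoTagMapper` and its `getContentAsString()` call are illustrative assumptions standing in for the MarkMail geo-tagging logic.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

import com.marklogic.mapreduce.ContentOutputFormat;
import com.marklogic.mapreduce.DatabaseDocument;
import com.marklogic.mapreduce.DocumentInputFormat;
import com.marklogic.mapreduce.DocumentURI;

// Sketch of a "read from MarkLogic, write back to MarkLogic" bulk job.
// Class and property names follow the connector documentation and may
// differ across connector versions; connection values are placeholders.
public class GeoTagJob {

    // Hypothetical map task: reads each document and emits it (possibly
    // modified) back under its original URI. The real MarkMail job added a
    // latitude/longitude element derived from IP addresses in the headers.
    public static class GeoTagMapper
            extends Mapper<DocumentURI, DatabaseDocument, DocumentURI, Text> {
        @Override
        protected void map(DocumentURI uri, DatabaseDocument doc, Context context)
                throws IOException, InterruptedException {
            String content = doc.getContentAsString();   // assumed accessor
            String tagged = content;                      // geo-tagging logic elided
            context.write(uri, new Text(tagged));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Source cluster connection (input side). Placeholder values.
        conf.set("mapreduce.marklogic.input.host", "ml-host-1");
        conf.set("mapreduce.marklogic.input.port", "8000");
        conf.set("mapreduce.marklogic.input.username", "hadoop-user");
        conf.set("mapreduce.marklogic.input.password", "secret");

        // Destination connection (output side); here, the same cluster.
        conf.set("mapreduce.marklogic.output.host", "ml-host-1");
        conf.set("mapreduce.marklogic.output.port", "8000");
        conf.set("mapreduce.marklogic.output.username", "hadoop-user");
        conf.set("mapreduce.marklogic.output.password", "secret");

        Job job = Job.getInstance(conf, "geo-tag emails");
        job.setJarByClass(GeoTagJob.class);

        // Pull documents out of MarkLogic and push updated content back in.
        job.setInputFormatClass(DocumentInputFormat.class);
        job.setOutputFormatClass(ContentOutputFormat.class);

        // Map-only job: no reducer is needed for a per-document transform.
        job.setMapperClass(GeoTagMapper.class);
        job.setOutputKeyClass(DocumentURI.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```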

21 Suppose the average fragment count of the forests is AVG and the number of forests is N. A threshold is calculated as max(AVG/(20*N), 10000). So, data is moved between forests A and B only when A's fragment count is greater than AVG + threshold and B's fragment count is smaller than AVG - threshold.

22 MarkLogic supports only specific versions of Hadoop, and only certain operating systems. See the documentation for specifics.

