Magellan Final Report - Office of Science - U.S. Department of Energy


summarize some key results that complement the Magellan evaluation

Scientists are struggling with a deluge of data in various disciplines. Emerging sensor networks, more capable instruments, and ever increasing simulation scales are generating data at a rate that exceeds our ability to effectively manage, curate, analyze, and share it. Data-intensive Computing is expected to revolutionize the entire hardware and software stack. Hadoop provides a way for “Big data” to be seamlessly processed through the MapReduce model. The inherent parallelization, synchronization and fault-tolerance make it ideal for highly-parallel data intensive applications. Thus it is important to evaluate Hadoop specifically for data-intensive workloads to understand the various trade-offs. In this study, we specifically evaluate Hadoop for various data operations and understand the impact of the file system, network and programming model on performance.

Figure 10.5: Streaming. (a) Streaming Language Differences: processing time (s) versus size (TB) for C and Java. (b) Streaming Comparison: processing time (s) versus size (TB) for Streaming_C and Hadoop_Native.

Experiment Setup

All experiments were conducted on the Magellan NERSC Hadoop testbed described in Chapter 5. The three common data operations in scientific applications were identified as:

• Filter. A filter operation analyzes the data and outputs a subset of the entire data set. Scientific applications that search for a certain pattern (e.g., finding a tropical storm in a region) fit this kind of data operation. The volume of input data processed is significantly larger than the volume of output data.

• Reorder. A reorder operation rearranges the input in some way to produce an output dataset. Sorting the input data is an example of this kind of operation. Reorder results in an output dataset that is identical in size to the input data.
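
The filter and reorder operations described above can be illustrated with a small sketch. This is not the code used in the evaluation; it is a minimal, single-process Python example written in the line-oriented style that Hadoop Streaming jobs typically follow. The record layout (region, tab, wind speed), the 64-knot threshold, and the function names `filter_mapper`, `reorder`, and `wind_speed` are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List


def filter_mapper(lines: Iterable[str], keep: Callable[[str], bool]) -> Iterator[str]:
    """Filter: emit only the records that satisfy the predicate."""
    for line in lines:
        record = line.rstrip("\n")
        if keep(record):
            yield record


def reorder(lines: Iterable[str], key: Callable[[str], object]) -> List[str]:
    """Reorder: return every input record, sorted by the given key."""
    return sorted((line.rstrip("\n") for line in lines), key=key)


def wind_speed(record: str) -> int:
    """Extract the second tab-separated field (hypothetical wind speed in knots)."""
    return int(record.split("\t")[1])


# Hypothetical input records: "<region>\t<wind speed in knots>"
records = ["pacific\t30\n", "atlantic\t70\n", "indian\t45\n", "atlantic\t20\n"]

# Filter: the output is a (usually much smaller) subset of the input.
storms = list(filter_mapper(records, lambda r: wind_speed(r) >= 64))

# Reorder: the output has exactly as many records as the input.
ordered = reorder(records, key=wind_speed)

print(storms)
print(ordered)
```

In this sketch the filter keeps one record out of four, while the reorder returns all four records in a different order, which mirrors the difference in output volume between the two operations.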
