Magellan Final Report - Office of Science - U.S. Department of Energy
summarize some key results that complement the Magellan evaluation
Scientists in many disciplines are struggling with a deluge of data. Emerging sensor networks, more capable
instruments, and ever-increasing simulation scales are generating data at a rate that exceeds our ability
to effectively manage, curate, analyze, and share it. Data-intensive computing is expected to revolutionize
the entire hardware and software stack. Hadoop provides a way for “big data” to be processed seamlessly
through the MapReduce model. Its inherent parallelization, synchronization, and fault tolerance make it
well suited to highly parallel, data-intensive applications. It is therefore important to evaluate Hadoop
specifically for data-intensive workloads to understand the various trade-offs. In this study, we evaluate Hadoop
for various data operations and examine the impact of the file system, network, and programming model
on performance.
Figure 10.5: Streaming. (a) Streaming Language Differences: processing time (s) versus size (TB) for C and Java. (b) Streaming Comparison: processing time (s) versus size (TB) for Streaming_C and Hadoop_Native.
Experiment Setup<br />
All experiments were conducted on the Magellan NERSC Hadoop testbed described in Chapter 5. The three
common data operations in scientific applications were identified as:
• Filter. A filter operation analyzes the data and outputs a subset of the entire data set.
Scientific applications that search for a certain pattern, e.g., finding a tropical storm
in a region, fit this kind of data operation. The volume of input data processed is significantly
larger than the volume of output data.
• Reorder. A reorder operation rearranges the input in some way to produce an output
dataset. Sorting the input data is an example of this kind of operation. Reorder produces an output
dataset that is the same size as the input.
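The filter and reorder operations described above can be illustrated with a small in-memory sketch of the MapReduce pattern. This is not the benchmark code used in the study and does not call any Hadoop API. The helper `run_mapreduce`, the mapper and reducer names, the sample records, and the "storm" pattern are all hypothetical, chosen only to show how the two operations differ in the ratio of input to output volume.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for the map, shuffle/sort, and reduce phases."""
    # Map phase: each input record may emit zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle/sort phase: order pairs by key so equal keys become adjacent.
    pairs.sort(key=itemgetter(0))
    output = []
    # Reduce phase: call the reducer once per distinct key.
    for key, group in groupby(pairs, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def filter_mapper(record):
    # Filter: emit only records that match an assumed pattern.
    if "storm" in record:
        yield (record, None)


def reorder_mapper(record):
    # Reorder: emit every record keyed by itself, so the sort phase does the work.
    yield (record, None)


def identity_reducer(key, values):
    # Emit the key once per occurrence so duplicate records are preserved.
    for _ in values:
        yield key


# Made-up sample input, for illustration only.
records = ["calm sea", "tropical storm A", "clear sky", "storm B"]

filtered = run_mapreduce(records, filter_mapper, identity_reducer)
reordered = run_mapreduce(records, reorder_mapper, identity_reducer)

print(len(records), len(filtered), len(reordered))
```

In this sketch the filter output is smaller than the input, while the reorder output contains exactly as many records as the input, which mirrors the distinction drawn in the two definitions.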