Magellan Final Report - Office of Science - U.S. Department of Energy
Magellan Final Report
TeraSort is a standard map/reduce application for Hadoop that was used in the terabyte sort competition. TeraGen generates the data, and TeraSort then samples the input data and uses map/reduce to sort the data in total order. Figures 10.3 and 10.4 compare HDFS and GPFS as the underlying file system for TeraGen with a varying number of maps. HDFS and GPFS were designed for largely different usage scenarios, and the goal of our comparison is not a quantitative performance comparison of the two systems; each of these file systems has its own strengths for certain workloads. Our goal here is to understand whether scientific applications can benefit from Hadoop's job management framework while using the POSIX-compliant file systems available in HPC centers.
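As a point of reference, a TeraGen/TeraSort run can be sketched as follows. The jar name and HDFS paths below are illustrative assumptions (they vary by Hadoop version and installation); the commands are composed as strings rather than executed so the sketch is self-contained. TeraGen's argument is the number of 100-byte rows to generate, so 1 TB corresponds to 10^10 rows.

```python
# Sketch of a typical TeraGen/TeraSort invocation (jar name and HDFS
# paths are hypothetical; adjust for the local Hadoop installation).
ROW_BYTES = 100                      # TeraGen writes fixed 100-byte rows
rows = 10**12 // ROW_BYTES           # 1 TB of data -> 10^10 rows

teragen = f"hadoop jar hadoop-examples.jar teragen {rows} /user/demo/tera-in"
terasort = "hadoop jar hadoop-examples.jar terasort /user/demo/tera-in /user/demo/tera-out"

print(teragen)
print(terasort)
```

The number of concurrent maps used by TeraGen (the x-axis in Figures 10.3 and 10.4) is controlled through the job configuration, not the command-line arguments shown here.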
Figure 10.3 shows the time for TeraGen to generate 1 TB of data in Hadoop on both file systems. GPFS shows a slight decrease in performance as the number of concurrent maps is increased. HDFS's performance, on the other hand, improves significantly as the number of maps increases, since HDFS is able to leverage the additional bandwidth available from the disks in every compute node. Figure 10.4 shows the effective bandwidth for both systems; HDFS's effective bandwidth increases steadily. Our GPFS system is in production use, and the variability seen here comes from other production workloads sharing the system. Hadoop and HDFS were designed to handle high levels of parallelism for data-parallel applications. These results show that for small- to medium-scale parallelism, applications can use Hadoop with GPFS without any loss in performance.
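The effective bandwidth plotted in Figure 10.4 is simply the data volume divided by the elapsed generation time. A minimal sketch of that conversion (the 4-minute runtime below is an illustrative value, not a measured result from these experiments):

```python
def effective_bandwidth_mbs(data_bytes: float, minutes: float) -> float:
    """Effective bandwidth in MB/s for a run of the given duration."""
    return data_bytes / 1e6 / (minutes * 60)

# Example: generating 1 TB (10^12 bytes) in 4 minutes
bw = effective_bandwidth_mbs(1e12, 4.0)
print(f"{bw:.0f} MB/s")  # on the order of a few thousand MB/s
```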
[Figure: time in minutes (0-12) versus number of maps (0-3000) for HDFS and GPFS, with linear and exponential trend lines for each file system.]
Figure 10.3: HDFS and GPFS Comparison (Time)
[Figure: bandwidth in MB/s (0-7000) versus number of maps (400-1000) for HDFS and GPFS.]
Figure 10.4: HDFS and GPFS Comparison (Bandwidth)
10.5.2 Data Intensive Benchmarks<br />
The following work was funded through a project evaluating Hadoop specifically for data-intensive scientific applications. Magellan staff supervised the student who conducted these experiments, and all experiments were run on the Magellan Hadoop testbed. A paper describing the results in detail is in preparation. Here we