
TeraSort is a standard map/reduce application for Hadoop that was used in the terabyte sort competition. TeraGen generates the data, and TeraSort then samples the input data and uses map/reduce to sort the data in total order. Figures 10.3 and 10.4 show the comparison of using HDFS and GPFS as the underlying file system for TeraGen with a varying number of maps. HDFS and GPFS have been designed for largely different usage scenarios, and the goal of our comparison is not a quantitative performance comparison of the two systems. Each of these file systems has its own strengths for certain workloads. Our goal here is to understand whether scientific applications can benefit from Hadoop's job management framework while using POSIX-compliant file systems available in HPC centers.
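To make the total-order sort concrete, the sketch below shows the core idea behind TeraSort's partitioning step: sample the input keys, choose split points, and route each key to the reducer whose key range contains it, so that the concatenated reducer outputs are globally sorted. This is a minimal illustration rather than Hadoop's actual TeraSort code; the class name RangePartitioner and the hard-coded split points are hypothetical stand-ins for the split points TeraSort computes by sampling.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class RangePartitioner extends Partitioner<Text, Text> {
    // TeraSort derives split points by sampling the input; these
    // literals are placeholders for illustration only.
    private static final Text[] SPLIT_POINTS = {
        new Text("g"), new Text("n"), new Text("u")
    };

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // Send each key to the partition whose range contains it, so
        // partition i holds only keys smaller than those in partition i+1.
        for (int i = 0; i < SPLIT_POINTS.length && i < numPartitions - 1; i++) {
            if (key.compareTo(SPLIT_POINTS[i]) < 0) {
                return i;
            }
        }
        return Math.min(SPLIT_POINTS.length, numPartitions - 1);
    }
}

Because each reducer sorts its own partition and the partitions themselves are ordered, simply concatenating the reducer output files yields one globally sorted data set.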

Figure 10.3 shows the time for TeraGen to generate 1 TB of data in Hadoop on both file systems. We see that GPFS shows a slight decrease in performance as the number of concurrent maps is increased. On the other hand, HDFS's performance improves significantly as the number of maps increases, since HDFS is able to leverage the additional bandwidth available from the disks in every compute node. Figure 10.4 shows the effective bandwidth for both systems, and we can see that HDFS's effective bandwidth increases steadily. Our GPFS system is in production use, and the variability seen here comes from other production workloads sharing the system. Hadoop and HDFS have been designed to handle high levels of parallelism for data-parallel applications. These results show that for small- to medium-scale parallelism, applications can use Hadoop with GPFS without any loss in performance.
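As a rough sketch of how Hadoop can be pointed at a POSIX file system such as GPFS (an assumption for illustration, not the exact Magellan configuration): when the shared mount is visible on every compute node, Hadoop's local file system implementation can be used in place of HDFS. The path /gpfs/terasort below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GpfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "file:///" selects Hadoop's local (POSIX) file system; a GPFS
        // mount appears as an ordinary local path on each node. Older
        // Hadoop releases use the key "fs.default.name" instead.
        conf.set("fs.defaultFS", "file:///");
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical GPFS path, used only to show file system access.
        System.out.println(fs.exists(new Path("/gpfs/terasort")));
    }
}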

Figure 10.3: HDFS and GPFS Comparison (Time). [Plot of TeraGen time in minutes (0-12) versus number of maps (0-3000) for HDFS and GPFS, with linear and exponential trend lines for each file system.]
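As a reading aid connecting the two figures (our arithmetic, not a number taken from the report's data): generating 1 TB, taken here as 10^6 MB, in t minutes corresponds to an effective bandwidth of 10^6 / (60t) MB/s, so, for example, a 4-minute TeraGen run corresponds to roughly 4,170 MB/s.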

Figure 10.4: HDFS and GPFS Comparison (Bandwidth). [Plot of effective bandwidth in MB/s (0-7000) versus number of maps (400-1000) for HDFS and GPFS.]

10.5.2 Data Intensive Benchmarks

The following work was funded through a project evaluating Hadoop specifically for data-intensive scientific applications. Magellan staff supervised the student who conducted these experiments, and all experiments were run on the Magellan Hadoop testbed. A paper describing the results in detail is in preparation. Here we
