
• Merge. Some scientific applications result in a merge of two or more data sources. For example, observation data might need to be merged with simulation data. Such a merge operation results in an output data volume that is significantly larger than the input data size.

We developed a synthetic benchmark suite to represent each of these data operations. We used the Wikipedia data set, which is 6 TB in size, for our analysis. We use a subset of this data for some experiments as noted below. For the filter operation, we constructed an index map of all the titles of the Wikipedia pages and their corresponding files. For the reorder operation, we perform a data conversion that reorders the fields within each record. For the merge operation, we create an index of all lines of the Wikipedia data.
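The report does not include the benchmark source, so the following is only a minimal sketch of what the native (Java, Hadoop API) version of the filter operation might look like. The class name, the `filter.title.pattern` configuration key, and the tab-separated record layout are assumptions for illustration, not details taken from the report.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical filter mapper: emits only those Wikipedia records whose
 * title matches a pattern from the job configuration, producing a
 * (title, input offset) index as map output.
 */
public class TitleFilterMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private String pattern;

    @Override
    protected void setup(Context context) {
        // Illustrative configuration key for the filter pattern.
        pattern = context.getConfiguration().get("filter.title.pattern", "");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assume each input line begins with the page title followed by a tab.
        String record = line.toString();
        int tab = record.indexOf('\t');
        String title = tab >= 0 ? record.substring(0, tab) : record;

        // Filter: keep only records whose title contains the pattern.
        if (title.contains(pattern)) {
            context.write(new Text(title), offset);
        }
    }
}
```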

Results

We summarize the results from our experiments on understanding the effects of streaming, file systems, network, and replication.

Streaming. Existing scientific applications can benefit from the MapReduce framework using the streaming model. Thus, for our first experiment, we constructed the filter operation in both C and Java (using the Hadoop APIs) and compared the two to understand the overheads of streaming. Comparing the overheads of streaming is tricky, since there are some differences in timing induced by language choice alone. Hadoop streaming also does not support Java programs, since Java programs can use the Hadoop API directly. Figure 10.5a shows the comparison of the timings of both programs running on a single node. We see that the C implementation is more efficient, and the performance improves as the data size increases. Figure 10.5b shows the comparison of the streaming C version with the native Hadoop version for varying file sizes. We notice that the performance of the streaming version is similar to the native Hadoop version for smaller file sizes. This indicates that streaming has additional overhead, since the standalone C version is otherwise more efficient. As the file size increases, we see that the overhead of streaming increases.
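Hadoop streaming runs an arbitrary executable as the mapper and exchanges records with it over stdin and stdout as tab-separated key/value lines; the extra serialization and pipe traffic is the main source of the streaming overhead discussed above. The report's streaming mapper was written in C; the sketch below only restates that stdin/stdout contract, shown in Java for consistency with the other examples, and the matching logic is again an assumption.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

/**
 * Minimal illustration of the Hadoop streaming contract: the framework
 * pipes input records to the process on stdin, one per line, and reads
 * "key<TAB>value" lines back from stdout as map output.
 */
public class StreamingFilter {
    public static void main(String[] args) throws Exception {
        String pattern = args.length > 0 ? args[0] : "";
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            int tab = line.indexOf('\t');
            String title = tab >= 0 ? line.substring(0, tab) : line;
            // Emit only matching records; each stdout line becomes a map output pair.
            if (title.contains(pattern)) {
                System.out.println(title + "\t" + line);
            }
        }
    }
}
```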

Effect of file system. Figure 10.6 shows the comparison of performance of the data operations on both GPFS and HDFS. For the filter operation, there is negligible difference between HDFS and GPFS performance until 2 TB. However, at 3 TB, HDFS performs significantly better than GPFS. For the reorder and merge operations, GPFS seems to achieve better performance than HDFS, and the gap increases with increasing file sizes.

Figure 10.7 shows the comparison of HDFS and GPFS for all three data operations on a 2 TB data set. The figure shows the breakdown of the processing time. The difference between HDFS and GPFS is negligible for the filter operation. We notice that for the reorder and merge, GPFS seems to achieve better performance than HDFS overall at this data size. However, HDFS performs better than GPFS at reads, whereas GPFS performs better than HDFS for the write part of the application at the given concurrency.

Our earlier results compared the performance of TeraGen on HDFS and GPFS with a varying number of maps. TeraGen is a write-intensive operation. Figure 10.8 shows the effect of varying the number of mappers on processing time for the filter operation on both HDFS and GPFS. We see that with increasing maps, the performance of GPFS goes down significantly. The clearly noticeable performance variations are likely due to artifacts of Hadoop's scheduling.

Thus, HDFS seems to achieve better read performance than GPFS and better write performance at higher concurrencies.
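For context on how the mapper count is varied in experiments like the one in Figure 10.8: in the Hadoop (old `org.apache.hadoop.mapred`) API of that era, a job can request a map-task count directly, although the framework treats it only as a hint and the number of input splits ultimately decides the real count. The driver below is a minimal sketch under those assumptions; the class name, paths, and reduce count are illustrative and not taken from the report.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

/**
 * Illustrative driver showing how a map-task count can be requested.
 * setNumMapTasks() is only a hint; the input splits decide the actual count.
 * Mapper/reducer classes are omitted, so the old-API identity defaults apply.
 */
public class FilterDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FilterDriver.class);
        conf.setJobName("wikipedia-filter");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // Hypothetical input and output paths passed on the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Requested number of map tasks (a hint to the framework).
        conf.setNumMapTasks(Integer.parseInt(args[2]));
        conf.setNumReduceTasks(4);

        JobClient.runJob(conf);
    }
}
```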

Effect of network. Scientific applications in HPC centers traditionally use high-performance, low-latency networks. However, Hadoop has traditionally been run on commodity clusters based on Ethernet networks. The shuffle phase between the map and reduce phases is considered to be the most network-intensive operation, since all the keys are sorted and data belonging to a single key is sent to the same reducer, resulting in large data movement across the network. Figure 10.9 shows the comparison of the network effects on the various data operations with varying file sizes. We observe that filter and reorder are not affected as much by the changes in network.