Magellan Final Report - Office of Science - U.S. Department of Energy
Magellan Final Report
• Merge. Some scientific applications require merging two or more data sources. For example, observation data might need to be merged with simulation data. Such a merge operation results in an output data volume that is significantly larger than the input data size.
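The merge pattern above can be sketched as a simple key join, purely for illustration (the data sources and the shared key are hypothetical, not the report's actual datasets): when one key matches several records on each side, the joined output grows beyond either input.

```python
from collections import defaultdict

def merge(observations, simulations):
    """Join two data sources on a shared key. Each input is an
    iterable of (key, record) pairs. One key may match many records
    on each side, so the merged output volume can be significantly
    larger than either input."""
    by_key = defaultdict(list)
    for key, sim in simulations:
        by_key[key].append(sim)
    merged = []
    for key, obs in observations:
        # Pair every observation with every matching simulation record.
        for sim in by_key[key]:
            merged.append((key, obs, sim))
    return merged
```

With two observations and two simulations sharing one key, the join emits four records, illustrating the output growth the bullet describes.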
We developed a synthetic benchmark suite to represent each of these data operations. We used the 6TB Wikipedia dataset for our analysis, and a subset of this data for some experiments as noted below. For the filter operation, we constructed an index map of all the titles of the Wikipedia pages and their corresponding files. For the reorder operation, we perform a data conversion of the page records. For the merge operation, we create an index of all lines of the Wikipedia data.
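The title-to-file index used by the filter benchmark can be sketched as follows. This is a minimal illustration, not the report's Hadoop implementation: the file layout and the `(filename, title)` pairing are assumptions.

```python
def build_title_index(pages):
    """Map each page title to the file that contains it, so a filter
    query can locate the relevant files without a full scan.
    `pages` is an iterable of (filename, title) pairs."""
    index = {}
    for filename, title in pages:
        index[title] = filename
    return index

def filter_pages(index, wanted_titles):
    """Filter operation: select only the titles of interest, yielding
    an output significantly smaller than the input."""
    return {t: index[t] for t in wanted_titles if t in index}
```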
Results
We summarize the results from our experiments examining the effects of streaming, file systems, network, and replication.
Streaming. Existing scientific applications can benefit from the MapReduce framework using the streaming model. For our first experiment, we constructed the filter operation in both C and Java (using the Hadoop APIs) and compared the two to understand the overheads of streaming. Comparing the overheads of streaming is tricky, since language choice alone induces some differences in timing. Hadoop streaming also does not support Java programs, since Java programs can use the Hadoop API directly. Figure 10.5a shows the comparison of the timings of both programs running on a single node. We see that the C implementation is more efficient, and that its advantage grows as the data size increases. Figure 10.5b shows the comparison of the streaming C version with the native Hadoop version for varying file sizes. We notice that the performance of the two is similar for smaller file sizes. Since the C implementation is inherently more efficient, this similarity indicates that streaming adds overhead. As the file size increases, we see that the streaming overhead grows.
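Under Hadoop streaming, a mapper is simply a program that reads records on standard input and writes tab-separated key/value pairs on standard output; every record must cross this pipe, which is the source of the overhead measured above. A minimal filter mapper in this style might look like the sketch below (the substring predicate is a placeholder, not the benchmark's actual filter; under Hadoop streaming the function would be wired to `sys.stdin` and `sys.stdout`):

```python
def filter_mapper(stream, pattern, out):
    """Hadoop-streaming-style mapper: read lines from an input
    stream and emit a tab-separated key/value pair for each line
    matching the filter. Serializing every record through such a
    pipe is the per-record cost of the streaming model."""
    for line in stream:
        line = line.rstrip("\n")
        if pattern in line:
            out.write(f"{pattern}\t{line}\n")
```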
Effect of file system. Figure 10.6 shows the comparison of performance of the data operations on both GPFS and HDFS. For the filter operation, there is negligible difference between HDFS and GPFS performance up to 2TB. However, at 3TB, HDFS performs significantly better than GPFS. For the reorder and merge, GPFS seems to achieve better performance than HDFS, and the gap increases with increasing file sizes.
Figure 10.7 shows the comparison of HDFS and GPFS for all three data operations for a 2TB data set. The figure shows a breakdown of the processing time. The difference between HDFS and GPFS is negligible for the filter operation. We notice that for the reorder and merge, GPFS seems to achieve better performance than HDFS overall at this data size. However, HDFS performs better than GPFS at reads, whereas GPFS performs better than HDFS for the write part of the application at the given concurrency.
Our earlier results compared the performance for TeraGen on HDFS and GPFS with a varying number of maps. TeraGen is a write-intensive operation. Figure 10.8 shows the effect of varying the number of mappers on processing time for the filter operation on both HDFS and GPFS. We see that with an increasing number of maps, the performance of GPFS degrades significantly. The clearly noticeable performance variations are likely due to artifacts of Hadoop's scheduling.
Thus, HDFS seems to achieve better read performance than GPFS, and better write performance at higher concurrencies.
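For reference, the number of map tasks in a run like this is controlled through Hadoop's job configuration. A sketch of how it might be set for a streaming job in the Hadoop versions of this era follows; the jar path is a placeholder, the property name varies by release (newer releases use `mapreduce.job.maps`), and Hadoop treats the value as a hint that also depends on the input splits.

```shell
# Suggest 64 map tasks for a map-only streaming job (a hint, not a
# guarantee; actual task count also depends on input split sizes).
hadoop jar hadoop-streaming.jar \
    -D mapred.map.tasks=64 \
    -input /data/wikipedia -output /out/filter \
    -mapper ./filter_mapper -numReduceTasks 0
```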
Effect of network. Scientific applications in HPC centers traditionally use high-performance, low-latency networks. However, Hadoop has traditionally been run on commodity clusters based on Ethernet networks. The shuffle phase between the map and reduce phases is considered to be the most network-intensive operation, since all the keys are sorted and data belonging to a single key is sent to the same reducer, resulting in large data movement across the network. Figure 10.9 shows the comparison of the network on various data operations with varying file sizes. We observe that filter and reorder are not affected as much by the changes