Magellan Final Report - Office of Science - U.S. Department of Energy
10.4.2 Application Examples
Magellan staff worked with some key scientific groups to port their applications to the streaming model. We describe some of these in this section. Additional case studies are presented in Chapter 11.
BLAST. JGI’s IMG pipeline has been tested in the Hadoop framework to manage a set of parallel BLAST computations. The performance across an HPC machine, virtual machines, and a Hadoop cluster was found to be comparable (within 10%), making this a feasible application for cloud environments. The experience with running the IMG pipeline on virtual machines is described in greater detail in Chapter 11; here we describe the challenges of running it in the Hadoop framework. BLAST takes a multi-line sequence input for each run, but by default MapReduce assumes that inputs are line-oriented and that each line can be processed independently. One could define and write Java classes to handle other input formats; however, difficulties with managing API versions and jar files can make that process cumbersome. Instead, we experimented with a wrapper script that reformatted the input to be line-oriented before it was passed to the Hadoop framework; the streaming mapper script then restored the original format expected by the application. This experience highlights the need for more custom plugins in Hadoop for science applications.
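The reformatting approach described above can be sketched as follows. This is a minimal illustration, not the actual wrapper script from the pipeline; it assumes tab characters never occur inside FASTA records, so they can serve as a reversible stand-in for record-internal newlines:

```python
# Sketch of the line-orienting wrapper for Hadoop streaming (illustrative,
# not the production IMG/BLAST script).

def flatten_fasta(text):
    """Collapse each multi-line FASTA record onto a single line.

    Newlines inside a record are replaced with tabs, which do not occur
    in FASTA data, so the transform is reversible in the mapper.
    """
    records, current = [], []
    for line in text.strip().splitlines():
        if line.startswith(">") and current:
            records.append("\t".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\t".join(current))
    return "\n".join(records)

def restore_record(line):
    """Mapper-side inverse: rebuild the multi-line record before
    handing it to BLAST."""
    return line.replace("\t", "\n")
```

Each flattened line then becomes one independent map input, which is exactly the shape Hadoop streaming expects.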
Tropical storm detection code. Another application that we plugged into Hadoop was TSTORMS, a single-threaded analysis program that finds tropical storms in climate data. TSTORMS takes as input a NetCDF file, which has a binary format. While Hadoop and HDFS can handle binary files, current versions of Hadoop streaming cannot handle binary input and output. The NetCDF libraries provide utilities to manage an ASCII version of NetCDF files and to convert between the ASCII and binary representations. While not the most efficient approach, we explored pre-converting the files to the ASCII format and loading the converted files into HDFS; the streaming mapper function then converts each input back to binary before running the application. In addition, TSTORMS does not read from standard input, so the file must be streamed to disk and the file path passed to the binary. The application enjoyed the scalability and fault-tolerance benefits of the Hadoop framework, but it highlights the difficulties of working with scientific data formats in this environment.
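A mapper along these lines might look like the sketch below. The NetCDF `ncgen` utility (the standard inverse of `ncdump`) rebuilds the binary file from its ASCII CDL form; the TSTORMS executable name `tstorms_driver` is an assumption for illustration only:

```python
# Hypothetical streaming mapper sketch for TSTORMS: CDL (ASCII NetCDF)
# text arrives, is converted back to binary, and the file path is passed
# to the binary since TSTORMS does not read standard input.
import os
import subprocess
import sys
import tempfile

def ncgen_command(cdl_path, nc_path):
    """Build the ncgen invocation that regenerates a binary NetCDF file
    from its ASCII (CDL) representation."""
    return ["ncgen", "-o", nc_path, cdl_path]

def run_tstorms(cdl_text, tstorms_binary="tstorms_driver"):
    """Stream the CDL to disk, convert it to binary NetCDF, and invoke
    TSTORMS on the resulting path (binary name is illustrative)."""
    with tempfile.TemporaryDirectory() as work:
        cdl_path = os.path.join(work, "input.cdl")
        nc_path = os.path.join(work, "input.nc")
        with open(cdl_path, "w") as f:
            f.write(cdl_text)
        subprocess.check_call(ncgen_command(cdl_path, nc_path))
        result = subprocess.run([tstorms_binary, nc_path],
                                capture_output=True, text=True, check=True)
        sys.stdout.write(result.stdout)  # detections flow on to the reducer
```

The round trip through ASCII costs both conversion time and storage, which is part of the inefficiency noted above.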
Atmospheric River Detection. The atmospheric river detection code, like TSTORMS, works on NetCDF input files, and the challenges described for TSTORMS also hold for this application. In addition, this application takes a parameter that selects the day to examine in the NetCDF file, so each map should be differentiated by a combination of file and parameter. This cannot be handled easily in the current Hadoop implementation, where a mapper is differentiated only by the file or data it operates on. As a workaround, we experimented with a wrapper script that sequentially processes all days in a file. Additional parallelism could be achieved for such applications if Hadoop allowed maps to be differentiated not just by a file but by additional parameters.
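The sequential workaround can be sketched as follows. The detector name `ar_detect` and its `-day` flag are hypothetical placeholders, since the report does not give the actual command line:

```python
# Sketch of the per-file wrapper: each map task receives one NetCDF file
# and loops over every day in it, because Hadoop streaming keys map tasks
# on input data alone and cannot pass a per-map day parameter.
import subprocess
import sys

def day_commands(nc_path, n_days, detector="ar_detect"):
    """Enumerate one detector invocation per day; detector name and
    -day flag are illustrative, not the real CLI."""
    return [[detector, "-day", str(day), nc_path] for day in range(n_days)]

def process_file(nc_path, n_days):
    """Run the detector sequentially over all days in the file, the
    workaround used in place of per-(file, day) map tasks."""
    for cmd in day_commands(nc_path, n_days):
        result = subprocess.run(cmd, capture_output=True, text=True,
                                check=True)
        sys.stdout.write(result.stdout)
```

With per-(file, day) map tasks, each of these invocations could instead run as an independent map, recovering the lost parallelism.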
10.5 Benchmarking<br />
In addition to understanding the programming effort required for applications to use the Hadoop framework, we conducted some benchmarking to understand the effects of various Hadoop parameters, underlying file systems, network interfaces, etc.
10.5.1 Standard Hadoop Benchmarks<br />
First, we used the standard Hadoop benchmark tests on the NERSC Magellan Hadoop cluster (described in Chapter 5). The benchmarking data enables us to understand the effect of various parameters and system configurations.