
10.4.2 Application Examples

Magellan staff worked with some key scientific groups to port their applications to the streaming model. We describe some of these in this section. Additional case studies are presented in Chapter 11.

BLAST. JGI's IMG pipeline has been tested in the Hadoop framework to manage a set of parallel BLAST computations. The performance across an HPC machine, virtual machines, and a Hadoop cluster was found to be comparable (within 10%), making this a feasible application for cloud environments. The experience with running the IMG pipeline on virtual machines is described in greater detail in Chapter 11. Here we describe the challenges of running it in the Hadoop framework. BLAST takes a multi-line sequence input for each run. By default, MapReduce assumes that inputs are line-oriented and that each line can be processed independently. One could define and write Java classes to handle other input formats; however, difficulties with managing API versions and jar files can make that process cumbersome. Instead, we experimented with a wrapper script that reformatted the input to be line-oriented before it was passed to the Hadoop framework; the streaming mapper script then reformatted the input back to the form the application expects. This highlights the need for more custom plugins in Hadoop for science applications.
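A minimal sketch of this wrapper approach follows, assuming FASTA-style multi-line sequence records. The record encoding (tab-joined sequence lines) and the blastall command line are illustrative assumptions, not details taken from the IMG pipeline.

```python
#!/usr/bin/env python
# Sketch of the line-oriented workaround for multi-line BLAST input.
# The record layout (one FASTA record per line, sequence lines joined
# with tabs) and the blastall invocation are assumptions for
# illustration; the report does not give the exact format or command.
import subprocess
import sys
import tempfile

def flatten(stream, out):
    """Pre-processing step: collapse each multi-line FASTA record onto
    a single tab-delimited line so Hadoop streaming can split on lines."""
    header, seq = None, []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                out.write("\t".join([header] + seq) + "\n")
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        out.write("\t".join([header] + seq) + "\n")

def mapper(stream):
    """Streaming mapper: restore the original multi-line FASTA record,
    write it to a temporary file, and hand that file to BLAST."""
    for line in stream:
        record = line.rstrip("\n").replace("\t", "\n") + "\n"
        with tempfile.NamedTemporaryFile(mode="w", suffix=".fa",
                                         delete=False) as f:
            f.write(record)
            path = f.name
        # Hypothetical BLAST invocation; flags depend on the local setup.
        subprocess.call(["blastall", "-p", "blastn", "-d", "refdb",
                         "-i", path])

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "flatten":
        flatten(sys.stdin, sys.stdout)
    else:
        mapper(sys.stdin)
```

The same script serves both roles: it is run once with the flatten argument to prepare the input before loading it into HDFS, and again (without arguments) as the streaming mapper.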

Tropical storm detection code. Another application that we plugged into Hadoop was TSTORMS, a single-threaded analysis program that finds tropical storms in climate data. TSTORMS takes as input a NetCDF file, which is a binary format. While Hadoop and HDFS can handle binary files, current versions of Hadoop streaming cannot handle binary input and output. The NetCDF libraries provide utilities to manage an ASCII version of NetCDF files and to convert between the ASCII and binary versions. While not the most efficient approach, we explored pre-converting the files to ASCII and loading the converted files into HDFS. The streaming mapper function then converts each file back to binary before running the application. In addition, TSTORMS does not read from standard input, so the file must be streamed to disk and the file's path passed to the binary. The application enjoyed the benefits of scalability and fault tolerance in the Hadoop framework, but the experience highlights the difficulties of working with scientific data formats in this environment.
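A minimal sketch of such a streaming mapper is shown below, assuming the ASCII form is the CDL text produced by the standard NetCDF ncdump utility (with ncgen performing the reverse conversion). The tstorms binary name and its command line are illustrative assumptions.

```python
#!/usr/bin/env python
# Sketch of a streaming mapper for TSTORMS. Assumes the ASCII form in
# HDFS is CDL text from ncdump; the "tstorms" binary name and its
# arguments are assumptions, as the report does not specify them.
import os
import subprocess
import sys
import tempfile

def mapper(stream):
    workdir = tempfile.mkdtemp()
    cdl_path = os.path.join(workdir, "input.cdl")
    nc_path = os.path.join(workdir, "input.nc")

    # Hadoop streaming delivers the ASCII file contents on stdin;
    # spool them to disk, since the tools below operate on files.
    with open(cdl_path, "w") as f:
        for line in stream:
            f.write(line)

    # Convert the ASCII CDL back to a binary NetCDF file with ncgen.
    subprocess.check_call(["ncgen", "-o", nc_path, cdl_path])

    # TSTORMS does not read standard input, so pass the file path.
    subprocess.check_call(["tstorms", nc_path])

if __name__ == "__main__":
    mapper(sys.stdin)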

Atmospheric River Detection. The atmospheric river detection code, like TSTORMS, works on NetCDF input files, and the challenges described for TSTORMS also hold for this application. In addition, this application takes a parameter that selects the day to examine within the NetCDF file. Thus, each map task must be differentiated by a combination of a file and a parameter value. This cannot be handled easily in the current Hadoop implementation, where a mapper is differentiated only by the file or data it operates on. As a workaround, we experimented with a wrapper script that sequentially processes all days in a file (sketched below). Additional parallelism could be achieved for such applications if Hadoop allowed maps to be differentiated not just by a file but also by additional parameters.
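A minimal sketch of such a wrapper follows; the ar_detect binary, its --day option, and the fixed day count are hypothetical stand-ins, since the report names neither the binary nor its options.

```python
#!/usr/bin/env python
# Sketch of the per-file wrapper that loops over days sequentially.
# The ar_detect binary, its --day flag, and DAYS_PER_FILE are
# illustrative assumptions.
import subprocess
import sys

DAYS_PER_FILE = 365  # assumed; in practice read from NetCDF metadata

def process_file(nc_path):
    # One map task gets one file, so parallelism across days is lost:
    # Hadoop cannot differentiate map tasks by (file, parameter) pairs.
    for day in range(1, DAYS_PER_FILE + 1):
        subprocess.check_call(["ar_detect", "--day", str(day), nc_path])

if __name__ == "__main__":
    process_file(sys.argv[1])
```

The loop makes the lost parallelism concrete: each map task serializes over all days of its file, whereas a (file, day) map key would let Hadoop schedule each day independently.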

10.5 Benchmarking

In addition to understanding the programming effort required for applications to use the Hadoop framework, we conducted some benchmarking to understand the effects of various Hadoop parameters, underlying file systems, network interfaces, etc.

10.5.1 Standard Hadoop Benchmarks

First, we used the standard Hadoop benchmark tests on the NERSC Magellan Hadoop cluster (described in Chapter 5). The benchmarking data enables us to understand the effect of various parameters and system configurations.
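For example, the DFS I/O benchmark (TestDFSIO) that ships with the stock Hadoop test jar can be driven as below; the jar location is an assumption and varies across Hadoop versions and installations.

```python
#!/usr/bin/env python
# Sketch of driving the standard TestDFSIO benchmark from Python.
# The jar path is an assumption; TestDFSIO and its flags ship with
# the stock Hadoop test jar of this era.
import subprocess

HADOOP_TEST_JAR = "/usr/lib/hadoop/hadoop-test.jar"  # assumed location

# Write phase (10 files of 1000 MB each), then the matching read phase.
for phase in ("-write", "-read"):
    subprocess.check_call(["hadoop", "jar", HADOOP_TEST_JAR, "TestDFSIO",
                           phase, "-nrFiles", "10", "-fileSize", "1000"])
```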

