29.12.2014 Views

Magellan Final Report - Office of Science - U.S. Department of Energy

Magellan Final Report

of the pipeline makes it necessary to have specific library and OS versions and ends up being a barrier to making use of other large resources. The Supernova Factory project finds cloud computing attractive because it offers control over software environments and over the user accounts and groups that have access to the software. Initial experiments conducted by the group on Amazon EC2, in collaboration with Magellan project personnel, show that the cloud is a feasible platform for this application. There is also interest in using Hadoop to coordinate and manage the loosely coupled jobs.

4.4.4 ATLAS

The ATLAS project is investigating the use of cloud platforms to support analysis jobs. The project has hundreds of jobs that operate on terabytes of data and can greatly benefit from timely access to cloud resources. The cloud environment also promises to be an effective platform for transitioning scientific codes from testing on the desktop to large-scale cloud resources. The group is investigating the use of virtual machine images for distribution of all required software [10]. This would enable different sites to boot the virtual machines with minimal or no software management work.

4.4.5 Integrated Microbial Genomes (IMG) Pipeline

The Integrated Microbial Genomes (IMG) pipeline at the DOE Joint Genome Institute (JGI) provides analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. IMG has workloads that need to run periodically, every few weeks to months, for content maintenance [8]. Timely completion of these workloads is critical for the community, and the tremendous growth of these data sets makes access to a large number of resources critical. The computational stage consists of functional annotation of individual genes, identification of pair-wise genes, and identification of chromosomal clusters. The most computationally intensive step is performing a BLAST analysis against a reference database. Subsequent steps characterize the genes based on the alignments reported by BLAST. The BLAST output alone is typically over a terabyte. Consequently, the analysis of the output to find the top matches and identify taxa can be time consuming and must be done in parallel. There is interest in using technologies such as Hadoop to ease management of these loosely coupled application runs.
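
The top-match step described above maps naturally onto a map/reduce pattern. The following is a minimal sketch, not the IMG pipeline's actual code. It assumes the BLAST output is in the standard 12-column tab-separated tabular format (query ID in column 1, subject ID in column 2, bit score in column 12). The function names `map_hit`, `reduce_top`, and `top_hits` are hypothetical, and the grouping that Hadoop would perform in its shuffle phase is imitated here with an in-memory dictionary.

```python
import heapq
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Hit = Tuple[float, str]  # (bit score, subject ID)


def map_hit(line: str) -> Iterator[Tuple[str, Hit]]:
    """Map step: emit (query ID, (bit score, subject ID)) for one line.

    Assumes 12-column tabular BLAST output; comment, blank, and
    malformed lines are skipped.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return
    fields = line.split("\t")
    if len(fields) < 12:
        return
    yield fields[0], (float(fields[11]), fields[1])


def reduce_top(hits: Iterable[Hit], k: int = 1) -> List[Hit]:
    """Reduce step: keep the k hits with the highest bit score."""
    return heapq.nlargest(k, hits)


def top_hits(lines: Iterable[str], k: int = 1) -> Dict[str, List[Hit]]:
    """Group mapped hits by query ID (a stand-in for the shuffle phase),
    then reduce each group to its top k hits."""
    grouped: Dict[str, List[Hit]] = defaultdict(list)
    for line in lines:
        for query_id, hit in map_hit(line):
            grouped[query_id].append(hit)
    return {query_id: reduce_top(hits, k) for query_id, hits in grouped.items()}


if __name__ == "__main__":
    # Made-up illustrative records, not real IMG data.
    sample = [
        "geneA\trefX\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t350.0",
        "geneA\trefY\t90.0\t200\t20\t0\t1\t200\t5\t204\t1e-80\t280.0",
        "geneB\trefZ\t75.0\t150\t37\t1\t1\t150\t10\t159\t1e-30\t120.0",
    ]
    for query_id, hits in top_hits(sample).items():
        print(query_id, hits)
```

Because each query's hits are reduced independently, the work can be split across many nodes by query ID, which is what makes the step a fit for loosely coupled execution.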

Currently, the output of “all vs. all” pairwise gene sequence comparisons is stored in compressed files. However, modifying individual entries and querying the data is not easy in this format. The group is interested in exploring the use of HBase for managing data, which will allow the users to update individual rows and perform simple queries.
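
To illustrate the kind of access a row-keyed store makes possible, here is a minimal sketch that uses a plain in-memory Python class as a stand-in for an HBase table. It does not use the HBase API. The class `PairwiseHitTable`, the `geneA|geneB` row-key layout, and the `score` column are all hypothetical choices for illustration; the source does not describe the group's schema. The sketch relies only on the general property that rows are kept sorted by key, so one row can be updated in place and all rows sharing a key prefix can be scanned together.

```python
import bisect
from typing import Dict, Iterator, List, Optional, Tuple


class PairwiseHitTable:
    """In-memory stand-in for a table whose rows are sorted by row key."""

    def __init__(self) -> None:
        self._keys: List[str] = []              # row keys, kept sorted
        self._rows: Dict[str, Dict[str, object]] = {}

    @staticmethod
    def row_key(gene_a: str, gene_b: str) -> str:
        # Hypothetical key layout: query gene, separator, subject gene.
        return f"{gene_a}|{gene_b}"

    def put(self, gene_a: str, gene_b: str, **columns: object) -> None:
        """Insert a row, or update columns of an existing row in place."""
        key = self.row_key(gene_a, gene_b)
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def get(self, gene_a: str, gene_b: str) -> Optional[Dict[str, object]]:
        """Fetch a single row by its key, or None if it does not exist."""
        return self._rows.get(self.row_key(gene_a, gene_b))

    def scan_prefix(self, gene_a: str) -> Iterator[Tuple[str, Dict[str, object]]]:
        """Yield all rows whose key starts with 'gene_a|', in key order."""
        prefix = f"{gene_a}|"
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._rows[key]


if __name__ == "__main__":
    table = PairwiseHitTable()
    table.put("g1", "g2", score=10)
    table.put("g1", "g3", score=5)
    table.put("g2", "g3", score=7)
    table.put("g1", "g2", score=12)   # update one entry without rewriting the rest
    print(table.get("g1", "g2"))
    print(list(table.scan_prefix("g1")))
```

With compressed flat files, the single-row update in the usage example would require decompressing and rewriting a whole file; with a keyed layout it touches only one row.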

4.5 Summary

The results of the detailed survey have helped us understand the science requirements for cloud environments and have influenced the direction of research in the project. We summarize the requirements gathered from the user survey and the corresponding project activities:

• The user requirements for cloud computing are diverse, ranging from access to custom environments to the MapReduce programming model. These diverse requirements guided our flexible software stack. Users of Magellan had access to (a) traditional batch queue access with the additional capability of custom software environments through xCAT; (b) customized virtual machines through Eucalyptus or OpenStack front ends, enabling users to port between commercial providers and the private cloud; and (c) a Hadoop installation that allowed users to access the MapReduce programming model, the Hadoop Distributed File System, and other job management features such as fault tolerance. More details of the software stack are presented in Chapter 5, our Hadoop evaluation is presented in Chapter 10, and our user experiences are summarized in Chapter 11.

• It is important to understand whether commercial cloud platforms such as Amazon EC2 and private cloud software such as Eucalyptus and Hadoop meet the needs of the science. We identified the existing
