Magellan Final Report - Office of Science - U.S. Department of Energy
of the pipeline makes it necessary to have specific library and OS versions and ends up being a barrier to<br />
making use of other large resources. The Supernova Factory project finds cloud computing attractive due<br />
to the ability to control software environments and the ability to manage and control user accounts and<br />
groups for access to the software. Initial experiments conducted by the group in collaboration with Magellan<br />
project personnel on Amazon EC2 show that the cloud is a feasible platform for this application. There is<br />
also interest in using Hadoop to coordinate and manage the loosely coupled jobs.<br />
4.4.4 ATLAS<br />
The ATLAS project is investigating the use of cloud platforms to support analysis jobs. The ATLAS project<br />
has hundreds of jobs that operate on terabytes of data and can greatly benefit from timely access to cloud<br />
resources. The cloud environment also promises to be an effective platform for transitioning scientific codes<br />
from testing on the desktop to large-scale cloud resources. The group is investigating the use of virtual<br />
machine images for distribution of all required software [10]. This would enable different sites to boot the<br />
virtual machines with minimal or no software management work.<br />
4.4.5 Integrated Microbial Genomes (IMG) Pipeline<br />
The Integrated Microbial Genomes (IMG) pipeline at the DOE Joint Genome Institute (JGI) provides analysis<br />
of microbial community metagenomes in the integrated context of all public reference isolate microbial<br />
genomes. IMG has workloads that need to run periodically every few weeks to months for content maintenance<br />
[8]. Timely completion of workloads is critical for the community, and the tremendous growth<br />
of these data sets makes access to a large number of resources critical. The computational stage consists of<br />
functional annotation of individual genes, identification of pair-wise genes, and identification of chromosomal<br />
clusters. The most computationally intensive step is performing a BLAST analysis against a reference<br />
database. Subsequent steps characterize the genes based on the alignments reported by BLAST. The BLAST<br />
output alone is typically over a terabyte. Consequently, the analysis of the output to find the top matches and<br />
identify taxa can be time consuming and must be done in parallel. There is interest in using technologies<br />
such as Hadoop to ease management of these loosely coupled application runs.<br />
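The top-match step described above maps naturally onto a map/reduce pattern. The following is a minimal single-process sketch, not the IMG pipeline's actual code: it assumes the BLAST results are in the standard 12-column tabular layout (BLAST+ `-outfmt 6`), and the function names and sample records are hypothetical. In a Hadoop setting the grouping dictionary would be replaced by the framework's shuffle phase.

```python
from collections import defaultdict
import heapq

# Assumed column order: the default BLAST+ tabular layout (-outfmt 6).
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()


def map_hit(line):
    """Map step: turn one tabular line into (query_id, (bitscore, subject_id))."""
    cols = line.rstrip("\n").split("\t")
    rec = dict(zip(FIELDS, cols))
    return rec["qseqid"], (float(rec["bitscore"]), rec["sseqid"])


def reduce_top(hits, k):
    """Reduce step: keep the k highest-scoring hits for one query."""
    return heapq.nlargest(k, hits)


def top_matches(lines, k=1):
    """Group hits by query (stand-in for the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key, value = map_hit(line)
        grouped[key].append(value)
    return {query: reduce_top(hits, k) for query, hits in grouped.items()}


# Hypothetical sample records, for illustration only.
sample = [
    "geneA\tref1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t365.0",
    "geneA\tref2\t80.0\t180\t36\t0\t1\t180\t9\t188\t1e-40\t150.0",
    "geneB\tref3\t91.2\t150\t13\t0\t1\t150\t2\t151\t1e-60\t220.0",
]
result = top_matches(sample, k=1)
```

Because each query's hits are reduced independently, the work can be split by query identifier across many workers, which is the property that makes the step a fit for loosely coupled execution.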
Currently the output of “all vs. all” pairwise gene sequence comparisons is stored in compressed files.<br />
However, modifying individual entries and querying the data is not easy in this format. The group is<br />
interested in exploring the use of HBase for managing data, which will allow the users to update individual<br />
rows and perform simple queries.<br />
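To illustrate why a sorted, row-keyed table eases updates and simple queries compared with compressed flat files, here is a toy in-memory sketch. It is not the HBase API and not the group's schema: the `RowStore` class, the `geneX|geneY` row-key layout, and the column names are all assumptions made for illustration.

```python
import bisect


class RowStore:
    """Toy stand-in for a sorted, row-keyed table (hypothetical, not HBase)."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, **columns):
        """Insert a row, or update only the given columns of an existing row."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        """Return the columns of one row, or None if the row is absent."""
        return self._rows.get(row_key)

    def scan(self, prefix):
        """Yield (key, columns) for every row whose key starts with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            yield key, self._rows[key]


store = RowStore()
# Assumed row key: "<query gene>|<subject gene>" for one pairwise comparison.
store.put("geneA|geneB", identity=91.0, bitscore=220.0)
store.put("geneA|geneC", identity=75.5, bitscore=120.0)
store.put("geneB|geneC", identity=60.0, bitscore=80.0)

# Update a single entry in place, without rewriting the other rows.
store.put("geneA|geneC", bitscore=125.0)

# Simple query: all comparisons whose query gene is geneA.
hits_for_a = [key for key, _ in store.scan("geneA|")]
```

Keeping the keys sorted is what turns "all comparisons for one gene" into a contiguous range scan rather than a pass over the whole data set.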
4.5 Summary<br />
The results of the detailed survey have helped us understand the science requirements for cloud environments<br />
and have influenced the direction of research in the project. We summarize the requirements<br />
gathered from the user survey and the corresponding project activities:<br />
• The user requirements for cloud computing are diverse, ranging from access to custom environments<br />
to the MapReduce programming model. These diverse requirements guided our flexible software stack.<br />
Users of Magellan had access to (a) traditional batch queue access with the additional capability of<br />
custom software environments through xCAT; (b) customized virtual machines through Eucalyptus or<br />
OpenStack front ends, enabling users to port between commercial providers and the private cloud; and (c)<br />
a Hadoop installation that allowed users to access the MapReduce programming model, the Hadoop<br />
Distributed File System, and other job management features such as fault tolerance. More details of<br />
the software stack are presented in Chapter 5, our Hadoop evaluation is presented in Chapter 10, and<br />
our user experiences are summarized in Chapter 11.<br />
• It is important to understand whether commercial cloud platforms such as Amazon EC2 and private<br />
cloud software such as Eucalyptus and Hadoop met the needs of the science. We identified the existing<br />