Magellan Final Report - Office of Science - U.S. Department of Energy
data collected at Brookhaven National Laboratory’s Relativistic Heavy Ion Collider. Previously, STAR has demonstrated the use of cloud resources both on Amazon and at other local sites [28, 56]. STAR used Magellan resources to process near-real-time data from Brookhaven’s 2011 run. The need for on-demand access to resources to process real-time data with a complex software stack makes clouds a natural platform to consider for this application.
The STAR pipeline transferred the input files from Brookhaven to a NERSC scratch directory. The files were then transferred via scp to the VMs, and after processing, the output files were moved back to the NERSC scratch directory. The output files were finally moved to Brookhaven, where they were catalogued and stored in mass storage. STAR calibration database snapshots were generated every two hours, and each running VM used them to update its database information once per day. Update times were randomized across VMs to optimize snapshot availability.
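The staging and update steps above can be sketched in Python. The scp round trip between scratch and the VMs and the daily randomized update follow the report; the host names, directory layout, and output-file naming are invented for illustration.

```python
import random

SNAPSHOT_INTERVAL = 2 * 3600   # calibration snapshots generated every two hours
SECONDS_PER_DAY = 24 * 3600

def vm_transfer_cmds(input_file, vm_host, scratch_dir):
    """Build the scp commands for one file's round trip between the NERSC
    scratch directory and a worker VM (paths/extensions are hypothetical)."""
    stage_in = ["scp", f"{scratch_dir}/{input_file}", f"{vm_host}:/data/in/"]
    stage_out = ["scp", f"{vm_host}:/data/out/{input_file}.root",
                 f"{scratch_dir}/out/"]
    return stage_in, stage_out

def daily_update_offsets(vm_hosts, rng=None):
    """Pick a random time of day (in seconds) per VM for its once-daily
    calibration-DB update, spreading load on the snapshot server."""
    rng = rng or random.Random()
    return {vm: rng.uniform(0, SECONDS_PER_DAY) for vm in vm_hosts}
```

In a real driver the command lists would be handed to a process runner (e.g. `subprocess.run`), and each VM would sleep until its assigned offset before pulling the latest snapshot.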
The STAR software stack was initially deployed on NERSC/Magellan. After two months of successful running, it was expanded to a coherent cluster of over 100 VMs drawn from three resource pools: the NERSC Eucalyptus cloud (up to 60 eight-core VMs), the ALCF Nimbus cloud (up to 60 eight-core VMs), and the ALCF OpenStack cloud (up to 30 eight-core VMs). Figure 11.3 shows the virtual machines used and the corresponding aggregated load for a single run over three days.
In summary, over a period of four months, about 18,000 files were processed, 70 TB of input data was moved from Brookhaven to NERSC, and about 40 TB of output data was moved back to Brookhaven. The number of simultaneous jobs varied between 160 and 750 over this period. In total, about 25,000 CPU days, or roughly 70 CPU years, was used for the processing.
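As a quick consistency check on these figures (taking four months as roughly 120 days, and assuming one job occupies roughly one core):

```python
cpu_days = 25_000

# 25,000 CPU days is about 68.5 CPU years, consistent with "about 70".
cpu_years = cpu_days / 365

# Averaged over ~120 calendar days, that is ~208 cores busy at any moment,
# which sits inside the reported range of 160 to 750 simultaneous jobs.
avg_concurrency = cpu_days / 120
```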
The STAR data processing is a good example of an application that can leverage a cloud computing testbed such as Magellan. The software can be easily packaged as a virtual machine image, and the jobs within the STAR analysis require little communication, so virtualization overheads have minimal impact. Using cloud resources enabled the data processing to proceed during the five months of experiments and to finish at nearly the same time as the experiments themselves.
There were a number of lessons learned. The goal was to build a VM image that mirrored the STAR reconstruction and analysis workflow from an existing system at NERSC. Building such an image from scratch required extensive system administration skills and took weeks of effort, and several additional days of effort were required when the image was ported to the ALCF Nimbus and OpenStack clouds. The image creator must be careful not to compromise the security of the VMs by leaving personal information such as passwords or usernames in the image. Additionally, the image must be as complete as possible while limiting its size, since it resides in memory. Using cloud resources was not a turnkey operation and required significant design and development effort. The team took advantage of on-demand resources with custom scripts to automate and prioritize the workflow.
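The report does not include those custom scripts themselves. The sketch below shows one minimal way such a prioritized workflow might be driven; the class name and the priority convention (lower number runs first) are invented for illustration.

```python
import heapq

class WorkQueue:
    """Minimal prioritized queue of input files awaiting processing.
    Files with a lower priority number are handed out first; ties
    preserve insertion order via a monotonically increasing counter."""

    def __init__(self):
        self._heap = []
        self._counter = 0

    def add(self, filename, priority):
        heapq.heappush(self._heap, (priority, self._counter, filename))
        self._counter += 1

    def next_file(self):
        """Return the next file to process, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A dispatcher on each resource pool would call `next_file()` whenever a VM slot frees up, which is one simple way to keep on-demand resources saturated while urgent files jump the line.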
11.2.2 Genome Sequencing of Soil Samples
Magellan resources at both ALCF and NERSC were used to perform genome sequencing of soil samples pulled from two plots at the Rothamsted Research Center in the UK. Rothamsted is one of the oldest soil research centers and maintains a controlled environment, off limits to farming and any other human disturbance, reaching back 350 years. The specific aim of this project was to understand the impact of long-term plant influence (rhizosphere) on microbial community composition and function. Two distinct fields were selected to understand the differences in microbial populations associated with different land management practices.
The demonstration used Argonne’s MG-RAST metagenomics analysis software to gain a preliminary overview of the microbial populations of these two soil types (grassland and bare-fallow). MG-RAST is a large-scale metagenomics data analysis service. Metagenomics uses shotgun sequencing methods to gather a random sampling of short DNA sequences from community members in the source environment. MG-RAST provides a series of tools for analyzing the community structure (which organisms are present in an environment) and the proteins present in an environment, which describe discrete functional potentials. MG-RAST also provides unique statistical tools, including novel sequence quality assessment and sample