
11.2.5 Integrated Metagenome Pipeline

The Integrated Microbial Genomes (IMG) system hosted at the DOE Joint Genome Institute (JGI) supports analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. The content maintenance cycle involves running BLAST to identify pairwise gene similarities between new metagenomes and reference genomes; the reference genome baseline is updated with approximately 500 new genomes every four months. This processing takes about three weeks on a Linux cluster with 256 cores. Since the databases keep growing, it is important that the processing can still be completed in a timely manner. The primary computation in the IMG pipeline is BLAST, a data-parallel application that requires no communication between tasks and thus resembles traditional cloud applications. The need for on-demand access to resources makes clouds an attractive platform for this workload.

The pipeline is primarily written in Perl, but it includes components written in Java as well as compiled components written in C and C++. The pipeline also uses several reference collections (typically called databases), including one for RNA alignment and a periodically updated reference database for BLAST. The pipeline and databases are currently around 16 GB in size; this does not include BioPerl, BLAST, and other utilities. The pipeline was run across both Magellan sites through Eucalyptus. A simple task farmer framework was used to distribute the workload across both sites: as virtual machines came up, a client would query the main server for work and run the computation.
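The task-farmer pattern can be illustrated with a short sketch. The following is a minimal, hypothetical Python version, not JGI's actual framework: a farmer process hands out BLAST query chunks over XML-RPC, and a client started inside each Eucalyptus virtual machine pulls work until the queue is drained. The hostnames, file paths, and blastp options are assumptions for illustration only.

```python
"""Minimal sketch (not JGI's task-farmer code) of the pattern described above:
a central farmer hands out BLAST query chunks, and clients inside the
Eucalyptus VMs pull work until none remains."""
import queue
import subprocess
import sys
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def run_farmer(chunks):
    # Farmer: expose a single get_task() call that drains a queue of inputs.
    work = queue.Queue()
    for chunk in chunks:
        work.put(chunk)

    def get_task():
        try:
            return work.get_nowait()
        except queue.Empty:
            return ""  # empty string signals "no more work"

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(get_task, "get_task")
    server.serve_forever()


def run_worker(farmer_url):
    # Worker (inside a VM): poll for a chunk, BLAST it locally, repeat.
    farmer = xmlrpc.client.ServerProxy(farmer_url)
    while True:
        chunk = farmer.get_task()
        if not chunk:
            break
        subprocess.run(
            ["blastp", "-query", chunk,
             "-db", "/ref/img_proteins",      # assumed database location
             "-out", chunk + ".hits", "-outfmt", "6"],
            check=True)


if __name__ == "__main__":
    if sys.argv[1] == "farmer":
        run_farmer([f"chunk_{i:04d}.faa" for i in range(512)])
    else:
        run_worker("http://farmer.example.org:8000")  # assumed farmer host
```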

11.2.6 Fusion

Fusion is a traditional 320-node high-performance compute cluster operated by the Laboratory Computing Resource Center at Argonne. In order to expand the capabilities of the cluster, a project was launched to prototype an application that could offload traditional HPC-style jobs to Magellan resources at the ALCF in a transparent fashion. HPC batch jobs often come bundled with many expectations regarding file system locations, the amount and type of resources, software availability, and so on. To seamlessly offload these types of jobs onto the ALCF Magellan cloud, a compute node image was created that establishes a VPN connection back to the Fusion HPC cluster on launch and appears to users as one of the default batch computation systems. Upon submission of a job, the application requests a number of instances from the cloud on the fly, passes the user's job to them for execution, and terminates them when the work is completed. The end result was a success: a number of test serial jobs were offloaded to the Magellan cloud at ALCF. This case study demonstrates a scenario where serial jobs can be redirected to cloud resources, allowing the HPC resources to cater to larger parallel jobs and thereby making better use of the specialized hardware.
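The burst-to-cloud mechanism can be sketched roughly as follows. This is not the Fusion prototype itself but a hypothetical illustration using boto3 against an EC2-compatible endpoint (Eucalyptus exposes one); the endpoint URL, credentials, image ID, instance type, and the job-drain placeholder are all assumptions.

```python
"""Hypothetical sketch of the offload pattern described above: request
instances from an EC2-compatible cloud on demand, let them join the Fusion
VPN as ordinary compute nodes, then terminate them when the work is done."""
import time

import boto3

ec2 = boto3.client(
    "ec2",
    endpoint_url="https://magellan.example.gov:8773/services/Eucalyptus",  # assumed
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    region_name="magellan",
)


def wait_for_jobs_to_drain():
    # Stand-in: a real implementation would poll the Fusion batch queue.
    time.sleep(600)


def burst(n_instances, image_id="emi-12345678"):  # assumed VPN-enabled image
    # Launch n copies of the compute-node image on the fly.
    resp = ec2.run_instances(ImageId=image_id, InstanceType="m1.large",
                             MinCount=n_instances, MaxCount=n_instances)
    ids = [inst["InstanceId"] for inst in resp["Instances"]]
    try:
        # Once the instances connect back over the VPN, the batch system sees
        # them as default compute nodes and dispatches the serial jobs to them.
        wait_for_jobs_to_drain()
    finally:
        # Tear the instances down as soon as the offloaded work completes.
        ec2.terminate_instances(InstanceIds=ids)
```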

11.2.7 RAST

SEED is an open-source comparative-analysis genomics effort. Its compute-intensive protein-based searches were offloaded from the traditional cluster onto the Magellan resources at ALCF during periods of heavy demand. Incoming raw genome sequences were processed with Glimmer to identify individual genes, which were subsequently annotated with RAST. The resulting protein sequences were then processed via BLAST on a persistent 32-node Magellan virtual cluster against the NCBI and Swiss-Prot protein-sequence databases, which were shared between the two clusters via NFS. The results were then fed into the SEED database, enabling users to query similarities among genomes. This was a heavily compute-bound application, and performance was deemed to be very good on virtual resources.
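A rough sketch of the per-genome processing chain described above might look like the following. The production SEED/RAST pipeline is considerably more involved; the tool arguments, file names, and NFS paths here are illustrative assumptions, and the ORF translation and RAST annotation steps are elided.

```python
"""Rough per-genome sketch, assuming Glimmer3 and BLAST+ are installed and the
reference databases are BLAST-formatted under an NFS-mounted path (all names
here are assumptions, not the production SEED/RAST code)."""
import subprocess


def process_genome(fasta, tag, workdir="/nfs/rast/work"):
    # Gene calling with Glimmer3; g3-from-scratch.csh writes <tag>.predict
    # containing the predicted gene coordinates.
    subprocess.run(["g3-from-scratch.csh", fasta, f"{workdir}/{tag}"],
                   check=True)

    # Extracting the predicted ORFs, translating them to protein sequences,
    # and the RAST annotation step are elided; assume the translated proteins
    # end up in the file below.
    proteins = f"{workdir}/{tag}.proteins.faa"

    # Protein similarity searches on the virtual cluster, one BLAST run per
    # reference collection shared over NFS between the two clusters.
    for db in ("/nfs/refdb/nr", "/nfs/refdb/swissprot"):
        name = db.rsplit("/", 1)[-1]
        subprocess.run(["blastp", "-query", proteins, "-db", db,
                        "-outfmt", "6",
                        "-out", f"{proteins}.{name}.hits"],
                       check=True)
```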

RAST faced a number of similar challenges in managing virtual machine images and tying in to existing infrastructure such as NFS and databases in order to make this load balancing transparent.

11.2.8 QIIME

QIIME (Quantitative Insights Into Microbial Ecology) is a toolset for analyzing high-throughput 16S amplicon datasets. 16S rRNA is a highly conserved structure in microbial genomes that can be sequenced