
11.2.5 Integrated Metagenome Pipeline

The Integrated Microbial Genomes (IMG) system hosted at the DOE Joint Genome Institute (JGI) supports analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. The content maintenance cycle involves running BLAST to identify pairwise gene similarities between new metagenomes and reference genomes; the reference genome baseline is updated with approximately 500 new genomes every four months. This processing takes about three weeks on a Linux cluster with 256 cores. Since the databases keep growing, it is important that the processing can still be completed in a timely manner. The primary computation in the IMG pipeline is BLAST, a data-parallel application that requires no communication between tasks and thus resembles traditional cloud applications. The need for on-demand access to resources makes clouds an attractive platform for this workload.

The pipeline is primarily written in Perl, but it includes components written in Java as well as compiled components written in C and C++. The pipeline also uses several reference collections (typically called databases), including one for RNA alignment and a periodically updated reference database for BLAST. The pipeline and databases are currently around 16 GB in size; this does not include BioPerl, BLAST, and other utilities. The pipeline was run across both Magellan sites through Eucalyptus. A simple task farmer framework was used to distribute the workload across both sites: as virtual machines came up, a client would query the main server for work and run the computation.
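The task-farmer pattern can be illustrated with a short sketch. The following is a minimal, hypothetical Python version, not JGI's actual framework: a farmer process hands out BLAST query chunks over XML-RPC, and a client started inside each Eucalyptus virtual machine pulls work until the queue is drained. The hostnames, file paths, and blastp options are assumptions for illustration only.

```python
"""Minimal sketch (not JGI's task-farmer code) of the pattern described above:
a central farmer hands out BLAST query chunks, and clients inside the
Eucalyptus VMs pull work until none remains."""
import queue
import subprocess
import sys
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer


def run_farmer(chunks):
    # Farmer: expose a single get_task() call that drains a queue of inputs.
    work = queue.Queue()
    for chunk in chunks:
        work.put(chunk)

    def get_task():
        try:
            return work.get_nowait()
        except queue.Empty:
            return ""  # empty string signals "no more work"

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(get_task, "get_task")
    server.serve_forever()


def run_worker(farmer_url):
    # Worker (inside a VM): poll for a chunk, BLAST it locally, repeat.
    farmer = xmlrpc.client.ServerProxy(farmer_url)
    while True:
        chunk = farmer.get_task()
        if not chunk:
            break
        subprocess.run(
            ["blastp", "-query", chunk,
             "-db", "/ref/img_proteins",      # assumed database location
             "-out", chunk + ".hits", "-outfmt", "6"],
            check=True)


if __name__ == "__main__":
    if sys.argv[1] == "farmer":
        run_farmer([f"chunk_{i:04d}.faa" for i in range(512)])
    else:
        run_worker("http://farmer.example.org:8000")  # assumed farmer host
```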

11.2.6 Fusion

Fusion is a traditional 320-node high-performance compute cluster operated by the Laboratory Computing Resource Center at Argonne. In order to expand the capabilities of the cluster, a project was launched to prototype an application that could offload traditional HPC-style jobs to Magellan resources at the ALCF in a transparent fashion. HPC batch jobs often come bundled with many expectations regarding file system locations, the amount and type of resources, software availability, and so on. To seamlessly offload these types of jobs onto the ALCF Magellan cloud, a compute node image was created that establishes a VPN connection back to the Fusion HPC cluster on launch and appears to users as one of the default batch computation systems. Upon submission of a job, the application requests a number of instances from the cloud on the fly, passes the user's job to them for execution, and terminates them when the work is completed. The end result was a success: a number of test serial jobs were offloaded to the Magellan cloud at ALCF. This case study demonstrates a scenario where serial jobs can be redirected to cloud resources, allowing the HPC resources to cater to larger parallel jobs and thereby making better use of the specialized hardware.
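The burst-to-cloud mechanism can be sketched roughly as follows. This is not the Fusion prototype itself but a hypothetical illustration using boto3 against an EC2-compatible endpoint (Eucalyptus exposes one); the endpoint URL, credentials, image ID, instance type, and the job-drain placeholder are all assumptions.

```python
"""Hypothetical sketch of the offload pattern described above: request
instances from an EC2-compatible cloud on demand, let them join the Fusion
VPN as ordinary compute nodes, then terminate them when the work is done."""
import time

import boto3

ec2 = boto3.client(
    "ec2",
    endpoint_url="https://magellan.example.gov:8773/services/Eucalyptus",  # assumed
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    region_name="magellan",
)


def wait_for_jobs_to_drain():
    # Stand-in: a real implementation would poll the Fusion batch queue.
    time.sleep(600)


def burst(n_instances, image_id="emi-12345678"):  # assumed VPN-enabled image
    # Launch n copies of the compute-node image on the fly.
    resp = ec2.run_instances(ImageId=image_id, InstanceType="m1.large",
                             MinCount=n_instances, MaxCount=n_instances)
    ids = [inst["InstanceId"] for inst in resp["Instances"]]
    try:
        # Once the instances connect back over the VPN, the batch system sees
        # them as default compute nodes and dispatches the serial jobs to them.
        wait_for_jobs_to_drain()
    finally:
        # Tear the instances down as soon as the offloaded work completes.
        ec2.terminate_instances(InstanceIds=ids)
```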

11.2.7 RAST

SEED is an open-source comparative-analysis genomics effort. Its compute-intensive protein-based searches were offloaded from the traditional cluster onto the Magellan resources at ALCF during periods of heavy demand. Incoming raw genome sequences were processed with Glimmer to identify individual genes, which were subsequently annotated with RAST. The resulting protein sequences were then processed via BLAST on a persistent 32-node Magellan virtual cluster against the NCBI and Swiss-Prot protein-sequence databases, which were shared between the two clusters via NFS. The results were then fed into the SEED database, enabling users to query similarities among genomes. This was a heavily compute-bound application, and performance was deemed to be very good on virtual resources.
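A rough sketch of the per-genome processing chain described above might look like the following. The production SEED/RAST pipeline is considerably more involved; the tool arguments, file names, and NFS paths here are illustrative assumptions, and the ORF translation and RAST annotation steps are elided.

```python
"""Rough per-genome sketch, assuming Glimmer3 and BLAST+ are installed and the
reference databases are BLAST-formatted under an NFS-mounted path (all names
here are assumptions, not the production SEED/RAST code)."""
import subprocess


def process_genome(fasta, tag, workdir="/nfs/rast/work"):
    # Gene calling with Glimmer3; g3-from-scratch.csh writes <tag>.predict
    # containing the predicted gene coordinates.
    subprocess.run(["g3-from-scratch.csh", fasta, f"{workdir}/{tag}"],
                   check=True)

    # Extracting the predicted ORFs, translating them to protein sequences,
    # and the RAST annotation step are elided; assume the translated proteins
    # end up in the file below.
    proteins = f"{workdir}/{tag}.proteins.faa"

    # Protein similarity searches on the virtual cluster, one BLAST run per
    # reference collection shared over NFS between the two clusters.
    for db in ("/nfs/refdb/nr", "/nfs/refdb/swissprot"):
        name = db.rsplit("/", 1)[-1]
        subprocess.run(["blastp", "-query", proteins, "-db", db,
                        "-outfmt", "6",
                        "-out", f"{proteins}.{name}.hits"],
                       check=True)
```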

RAST faced a number of similar challenges in managing virtual machine images and tying in to existing infrastructure such as NFS and databases in order to make this load balancing transparent.

11.2.8 QIIME

QIIME (Quantitative Insights Into Microbial Ecology) is a toolset for analyzing high-throughput 16S amplicon datasets. 16S rRNA is a highly conserved structure in microbial genomes that can be sequenced