Magellan Final Report - Office of Science - U.S. Department of Energy

Similarly we polled our users on their data requirements in the cloud. A large number of the scientific applications rely on a parallel file system and expect local disk on the nodes. Applications also have large amounts of data (order of gigabytes or terabytes) that they assume are available at the sites and/or data that arrives from remote sites. While a number of users anticipated the performance of their applications would be adversely impacted by virtualization, many felt that the other advantages of cloud computing still made it an attractive platform for their science.

Scientific applications additionally face other challenges when running in cloud environments. Code bases often change and their codes need to be recompiled periodically, requiring new virtual images to be created. Additionally, data is updated often and/or large data from the runs need to be saved since virtual machines do not maintain persistent state.

4.4 Application Use Cases

We followed up the survey with one-on-one discussions with some of the application groups. We summarize the discussions with some of the early cloud adopters and their requirements from Magellan. Many of these users have subsequently run on Magellan resources at both sites and their experiences are described in greater detail in Chapter 11.

4.4.1 Climate 100

Climate scientists increasingly rely on the ability to generate and share large amounts of data in order to better understand the potential impacts of climate change and evaluate the effectiveness of possible mitigation. The goal of the Climate 100 project is to bring together middleware and network researchers to develop the needed tools and techniques for moving unprecedented amounts of data using the Advanced Networking Initiative (ANI) 100 Gbps network. The data for the Climate 100 project consists of on the order of a million files that average about 100 megabytes each.
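
As a rough sense of scale, the figures above (about a million files of about 100 megabytes each, moved over a 100 Gbps network) can be turned into a back-of-the-envelope estimate. The sketch below is illustrative only: the function names and the `efficiency` parameter are hypothetical, decimal units are assumed (1 MB = 10^6 bytes, 1 Gbps = 10^9 bits per second), and the result is a lower bound at full line rate, since sustained throughput across network, compute and disk is normally lower.

```python
def dataset_bytes(num_files: int, avg_file_mb: float) -> float:
    """Total dataset size in bytes, assuming decimal megabytes (10**6 bytes)."""
    return num_files * avg_file_mb * 10**6


def transfer_seconds(total_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Time to move total_bytes over a link of link_gbps gigabits per second.

    efficiency is a hypothetical fraction (0 < efficiency <= 1) of the line
    rate that is actually sustained; 1.0 means the full line rate.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the interval (0, 1]")
    bits = total_bytes * 8
    return bits / (link_gbps * 10**9 * efficiency)


if __name__ == "__main__":
    total = dataset_bytes(1_000_000, 100)   # about a million files of ~100 MB
    seconds = transfer_seconds(total, 100)  # 100 Gbps at full line rate
    print(f"Dataset size: {total / 1e12:.0f} TB")         # Dataset size: 100 TB
    print(f"Transfer time: {seconds / 3600:.1f} hours")   # Transfer time: 2.2 hours
```

Under these assumptions the data set is on the order of 100 terabytes, and even at full line rate a single pass over the network takes a little over two hours.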
Climate 100 could benefit from cloud environments such as virtual machines and Hadoop to perform large-scale data analysis on the climate data. The volume of data requires coordination of network, compute and disk resources.

4.4.2 Open Science Grid/STAR

The STAR nuclear physics experiment studies fundamental properties of nuclear matter from the data collected at Brookhaven National Laboratory’s Relativistic Heavy Ion Collider. The STAR experiment has access to a number of different resource sites that are used for regularly processing experiment data. STAR experiments are embarrassingly parallel applications (i.e., non-MPI codes) where each job fits in one processor or core. The STAR suite of application use cases consists of various analysis and simulation programs. The Monte Carlo cases in STAR are well suited to run in cloud environments due to the minimal requirements for data transfer and I/O.

The ability to control the software stack in a virtual machine is very attractive to the community due to the complexity of the software stack. The ability to grow in scale during burst periods using existing virtual machine images can greatly enhance scientific productivity and time to solution. The group has previously demonstrated the use of Amazon EC2 resources to process large amounts of data in time for the Quark Matter physics conference. Different groups in the STAR community are interested in cloud computing and virtual machines. The community continues to investigate use of virtual machine images as a way of packaging and distributing software for future experiments.

4.4.3 Supernova Factory

The Supernova Factory project is building tools to measure the expansion of the universe and dark energy. The experiment has a large number of simulations that require custom environments where the end results also need to be shared with other collaborators, and hence there is a need to co-allocate compute and network resources and move the data to storage resources.
The Supernova Factory relies on large data volumes for the supernova search, and the code base consists of a large number of custom modules. The complexity of the pipeline makes it necessary to have specific library and OS versions and ends up being a barrier to making use of other large resources. The Supernova Factory project finds cloud computing attractive due to the ability to control software environments and the ability to manage and control user accounts and groups for access to the software. Initial experiments conducted by the group in collaboration with Magellan project personnel on Amazon EC2 show that the cloud is a feasible platform for this application. There is also interest in using Hadoop to coordinate and manage the loosely coupled jobs.

4.4.4 ATLAS

The ATLAS project is investigating the use of cloud platforms to support analysis jobs. The ATLAS project has hundreds of jobs that operate on terabytes of data and can greatly benefit from timely access to cloud resources. The cloud environment also promises to be an effective platform for transitioning scientific codes from testing on the desktop to large-scale cloud resources. The group is investigating the use of virtual machine images for distribution of all required software [10]. This would enable sites to boot the virtual machines at different sites with minimal or no work involved with software management.

4.4.5 Integrated Microbial Genomes (IMG) Pipeline

The Integrated Microbial Genomes (IMG) pipeline at the DOE Joint Genome Institute (JGI) provides analysis of microbial community metagenomes in the integrated context of all public reference isolate microbial genomes. IMG has workloads that need to run periodically every few weeks to months for content maintenance [8]. Timeliness of completion of workloads is critical for the community, and the tremendous growth of these data sets makes access to a large number of resources critical. The computational stage consists of functional annotation of individual genes, identification of pair-wise genes, and identification of chromosomal clusters.
The most computationally intensive step is performing a BLAST analysis against a reference database. Subsequent steps characterize the genes based on the alignments reported by BLAST. The BLAST output alone is typically over a terabyte. Consequently the analysis of the output to find the top matches and identify taxons can be time consuming and must be done in parallel. There is interest in using technologies such as Hadoop to ease management of these loosely coupled application runs.

Currently the output of “all vs. all” pairwise gene sequence comparisons is stored in compressed files. However, modifying individual entries and querying the data is not easy in this format. The group is interested in exploring the use of HBase for managing data, which will allow the users to update individual rows and perform simple queries.

4.5 Summary

The results of the detailed survey have helped us in understanding the science requirements for cloud environments and have influenced the direction of research in the project. We summarize the requirements gathered from the user survey and corresponding project activities:

• The user requirements for cloud computing are diverse, ranging from access to custom environments to the MapReduce programming model. These diverse requirements guided our flexible software stack. Users of Magellan had access to (a) traditional batch queue access with the additional capability of custom software environments through xCAT; (b) customized virtual machines through Eucalyptus or OpenStack front ends, enabling users to port between commercial providers and the private cloud; (c) a Hadoop installation that allowed users to access the MapReduce programming model, the Hadoop Distributed File System and other job management features such as fault tolerance. More details of the software stack are presented in Chapter 5, our Hadoop evaluation is presented in Chapter 10, and our user experiences are summarized in Chapter 11.
• It is important to understand whether commercial cloud platforms such as Amazon EC2 and private cloud software such as Eucalyptus and Hadoop met the needs of the science. We identified the existing