
Applications must be designed to dynamically adjust to compute nodes entering and leaving the resource pool. They must be capable of dealing with failures and rescheduling work. In traditional clusters, batch systems are routinely used to manage workflows. Batch systems such as Torque, Sun GridEngine, and Condor can be and have been deployed in virtualized cloud environments [71, 49]. However, these deployments typically require system administration expertise with batch systems and an understanding of how to best configure them for the cloud environment. Grid tools can also play a role, but they require the user to understand and manage certificates and deploy tools like Globus. To lower the entry barrier for scientific users, Magellan personnel have developed and deployed Torque and Globus-based system images.
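
As a concrete illustration of the batch-oriented approach, the short Python sketch below submits a set of independent jobs to a Torque resource manager by shelling out to qsub. The job script contents, resource requests, and the process_sample executable are illustrative assumptions rather than the actual Magellan configuration.

```python
"""Minimal sketch: submitting independent tasks to a Torque batch system
running in a virtualized cloud environment by calling qsub."""
import subprocess

def submit_task(task_id, input_file):
    # Build a small PBS job script on the fly; the resource requests are assumptions.
    job_script = (
        f"#PBS -N task_{task_id}\n"
        "#PBS -l nodes=1:ppn=1\n"
        "#PBS -l walltime=01:00:00\n"
        "cd $PBS_O_WORKDIR\n"
        f"./process_sample {input_file} > task_{task_id}.out\n"
    )
    # With no script argument, qsub reads the job script from standard input
    # and prints the identifier of the queued job.
    result = subprocess.run(["qsub"], input=job_script,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    # Submit ten independent tasks; the input file names are placeholders.
    job_ids = [submit_task(i, f"inputs/sample_{i}.fastq") for i in range(10)]
    print("Submitted jobs:", job_ids)
```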

Each of our applications has used different mechanisms to distribute work. MG-RAST and the IMG task farmer are examples of locally built tools that handle this problem. Hadoop might be run on top of virtual machines for this purpose; however, it then suffers from a lack of knowledge of data locality. STAR jobs are embarrassingly parallel (i.e., non-MPI) applications, where each job fits on one core and uses custom scripts to handle workflow and data management.
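
The sketch below illustrates the general task-farmer pattern described above using a standard Python process pool; it is not the MG-RAST or IMG implementation, and the analyze_sequence worker command and file layout are assumptions.

```python
"""Illustrative task farmer for embarrassingly parallel, single-core jobs."""
from concurrent.futures import ProcessPoolExecutor, as_completed
import subprocess

def run_task(input_path):
    # Each task is a self-contained, single-core, non-MPI job.
    out_path = input_path + ".out"
    subprocess.run(["./analyze_sequence", input_path, out_path], check=True)
    return out_path

def farm(inputs, workers=8):
    """Run independent tasks and report which ones need rescheduling."""
    results, failed = [], []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_task, path): path for path in inputs}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception:
                # Tasks are independent, so a failure only requires
                # rescheduling that one input on another worker.
                failed.append(futures[fut])
    return results, failed
```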

A majority of our Magellan domain science users, despite previous experience with clouds and virtualization, ranked their experience with workflow management in virtualized cloud environments as medium to difficult.

11.4.6 Data Management

The last significant challenge is managing the data for the workload. This includes both input and output data. For the bioinformatic workloads, the input data includes both the new sequence data and the reference data. Some consideration has to be given to where this data will be stored and read from, how it will be transported, and how this will scale with many worker nodes. In a traditional batch cluster, users would typically have access to a cluster-wide file system. However, EC2-style cloud systems offer a different set of building blocks: volatile local storage, persistent block storage associated with a single instance (EBS), and a scalable put/get storage system (S3). These are different from the traditional cluster and HPC offerings that users are accustomed to, and hence our users made limited use of these storage systems. The virtual machine image can also be used to store static data. Each of these options has different performance characteristics that depend on the application. Thus, the choice of storage components depends on the volume of data and the access patterns (and cost, in the case of EC2). Similar to workflow management, a majority of our users noted that data management in these environments took some effort.
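
As an illustration of the put/get model, the following sketch stages reference data into S3 and moves per-sample inputs and outputs between object storage and a worker's local scratch space. It uses the boto3 Python library; the bucket name, object keys, and local paths are placeholders, and credentials are assumed to be configured in the environment.

```python
"""Sketch of the put/get (S3-style) storage pattern for a worker node."""
import boto3

s3 = boto3.client("s3")

# Stage reference data into object storage once, before launching workers.
s3.upload_file("reference/genome.fasta", "my-project-bucket",
               "reference/genome.fasta")

# Each worker pulls the input it needs onto local (volatile) storage...
s3.download_file("my-project-bucket", "inputs/sample_001.fastq",
                 "/scratch/sample_001.fastq")

# ...and pushes its results back so they survive instance termination.
s3.upload_file("/scratch/sample_001.out", "my-project-bucket",
               "results/sample_001.out")
```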

11.4.7 Performance, Reliability and Portability

Scientific applications need to consider the performance and reliability of these environments. Our benchmarks (Chapter 9) show that the performance of an application is largely dependent on the type of application.

Cloud resources have also gained popularity because resources are immediately available to handle spikes in load. However, starting a set of virtual machines can take a few minutes, and startup times can be highly variable when requesting a large number of VMs [66]. In the case of synchronous applications, it is necessary to wait until all the virtual machines are up, and in public clouds the user is charged for the time the machines are up even if they are not yet being used.
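
The following sketch illustrates the startup-latency issue for a synchronous run: a pool of instances is requested and the job blocks until every instance reports as running, during which time charges are already accruing. The AMI identifier, instance type, and pool size are placeholders, and the example uses the boto3 Python library rather than any tooling from the Magellan project.

```python
"""Launch a pool of VMs and wait for all of them before a synchronous run."""
import boto3

ec2 = boto3.client("ec2")

resp = ec2.run_instances(ImageId="ami-00000000", InstanceType="m1.large",
                         MinCount=32, MaxCount=32)
instance_ids = [inst["InstanceId"] for inst in resp["Instances"]]

# For a synchronous (e.g., MPI-style) run, no useful work can start until the
# slowest instance is up, yet billing for the early instances has already begun.
waiter = ec2.get_waiter("instance_running")
waiter.wait(InstanceIds=instance_ids)
print("All %d instances running; starting the synchronous job." % len(instance_ids))
```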

High availability is often mentioned as one of the advantages of cloud resources. It is true that a number of cloud computing vendors such as Amazon provide various fault tolerance mechanisms, such as availability zones and regions, to enable highly fault-tolerant applications. However, the burden is largely on the end users to design their applications to be fault-tolerant, usually at a higher cost. Failures can occur at the hardware or software level, and good design practices to survive failures must be used when building software and services. The Amazon outage in April 2011 was triggered by an incorrect network configuration change, made during maintenance, that affected multiple services. Applications that used multiple regions were less impacted by the event; however, the shared EBS control plane still impacted EBS activities. Thus, cloud computing has many of the same availability challenges that impact HPC centers.
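
One example of the kind of fault-tolerant design left to the end user is spreading identical worker instances across several availability zones, so that the loss of a single zone does not remove the entire pool. The sketch below shows this pattern with the boto3 Python library; the zone names, AMI identifier, and instance counts are assumptions, not values from the report.

```python
"""Spread worker instances across availability zones for fault tolerance."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
instances_per_zone = 4

launched = []
for zone in zones:
    resp = ec2.run_instances(
        ImageId="ami-00000000", InstanceType="m1.large",
        MinCount=instances_per_zone, MaxCount=instances_per_zone,
        Placement={"AvailabilityZone": zone},  # pin each group to one zone
    )
    launched.extend(inst["InstanceId"] for inst in resp["Instances"])

print("Launched %d workers across %d zones." % (len(launched), len(zones)))
```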
