Magellan Final Report - Office of Science - U.S. Department of Energy


by the networking service, and are then ’attached’ to the requested virtual machine via additional network address translation rules.

Because all networking is handled by a single service, this introduces a performance bottleneck for communication-intensive applications. Furthermore, the networking setup is quite complex and requires significant effort to understand. Some inflexibility with regard to support of physical network architectures was also encountered in the design.
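The address-translation step described above can be sketched as follows. This is a minimal illustration, not OpenStack's actual implementation: the hypothetical helper simply generates the kind of iptables DNAT/SNAT rules a single networking service might install to map a public (floating) IP onto an instance's private address.

```python
# Minimal sketch (not OpenStack's actual code): generate the iptables
# rules a single networking service might install to "attach" a public
# floating IP to a virtual machine's private address via NAT.

def floating_ip_rules(public_ip: str, private_ip: str) -> list[str]:
    """Return DNAT/SNAT rule strings mapping public_ip <-> private_ip."""
    return [
        # Inbound: rewrite the destination of traffic aimed at the public IP.
        f"iptables -t nat -A PREROUTING -d {public_ip} "
        f"-j DNAT --to-destination {private_ip}",
        # Outbound: make the instance's traffic appear to come from the public IP.
        f"iptables -t nat -A POSTROUTING -s {private_ip} "
        f"-j SNAT --to-source {public_ip}",
    ]

rules = floating_ip_rules("198.51.100.10", "10.0.0.5")
for rule in rules:
    print(rule)
```

Since every such rule lives on the one host running the networking service, all translated traffic must traverse that host, which is the bottleneck noted above.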

Scalability. Overall, OpenStack scales well, and performance on the 420-node cloud at ALCF was reliable up to per-project instance counts of more than 200. Instance counts of 300 or greater have been achieved successfully, though at that scale the single API node becomes overloaded and instance launch times increase dramatically.

6.3 Nimbus<br />

Nimbus is an open-source IaaS cloud-computing framework created by a team of developers at Argonne National Laboratory and the University of Chicago. It has been successfully deployed at a number of sites around the country, one of which is an 82-node cloud running on Magellan project hardware at Argonne.

Internals. Nimbus supports the creation of virtual machines using both Xen and KVM, the latter being used on the Magellan cloud. The core services include the Workspace Service, which handles all virtual machine management and user interaction via grid credentials or the Amazon EC2 API. The configuration clients run on individual virtual machines and control the setup of services, user and virtual-cluster credentials, and image metadata acquisition. The cloudinit.d script is one of these clients, so user images can easily be modified for compatibility with Amazon EC2, similar to Eucalyptus and OpenStack.
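As a rough illustration of the kind of lifecycle bookkeeping a virtual-machine manager such as the Workspace Service must perform, the following sketch tracks instances through a simplified propagate → running → terminated state machine. The states, class, and method names here are hypothetical simplifications for illustration, not Nimbus internals.

```python
# Hedged sketch: a toy VM-lifecycle registry, loosely modeled on what any
# workspace/VM-management service must track. Not Nimbus's actual code.

PROPAGATING, RUNNING, TERMINATED = "propagating", "running", "terminated"

# Legal state transitions: image propagation, then running, then terminated.
VALID = {PROPAGATING: {RUNNING}, RUNNING: {TERMINATED}, TERMINATED: set()}

class WorkspaceRegistry:
    def __init__(self):
        self._vms = {}  # vm_id -> current state

    def launch(self, vm_id: str) -> None:
        # A new request starts in the image-propagation phase.
        self._vms[vm_id] = PROPAGATING

    def transition(self, vm_id: str, new_state: str) -> None:
        current = self._vms[vm_id]
        if new_state not in VALID[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self._vms[vm_id] = new_state

    def state(self, vm_id: str) -> str:
        return self._vms[vm_id]
```

A real service layers image transfer, hypervisor control, and credential setup on top of this bookkeeping; the sketch only shows the state tracking itself.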

Running Nimbus on Magellan. The two biggest challenges encountered running Nimbus on Magellan hardware involved networking issues and software dependencies. The network issues revolved primarily around the security measures in place at the site. Difficulties with ingress firewall settings caused user ssh sessions to time out prematurely and complicated the assignment of public IP addresses to virtual machines; this prompted one user to eschew externally routable IPs altogether. Taken together, these issues made management of the cloud difficult, and even simple administrative tasks required exorbitant effort.

To alleviate the aforementioned difficulties, an image propagation protocol called LANTorrent was developed. LANTorrent is a multicast system that allows researchers to quickly distribute VM images across the cloud at one time; tests on Magellan achieved 1,000 simultaneous boots. LANTorrent forms a store-and-forward chain of recipients that allows an image to be sent at a much faster rate than can be achieved with http or ssh transfers, significantly reducing the time a user's cluster required to boot.
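The advantage of a store-and-forward chain can be shown with a small timing sketch. Assuming every link carries data at the same bandwidth, each recipient forwards blocks onward while still receiving, so the whole chain finishes in roughly one transfer time plus a short pipeline-fill delay, whereas sequential unicast from a single server takes one full transfer time per node. The model below is a simplification for illustration, not LANTorrent's implementation.

```python
# Toy bandwidth model (not LANTorrent itself): compare sequential unicast
# from one server against a store-and-forward chain in which each recipient
# forwards blocks to the next node as soon as it has stored them.

def sequential_time(n_nodes: int, image_gb: float, gb_per_s: float) -> float:
    # The server sends the whole image to each node, one after another.
    return n_nodes * image_gb / gb_per_s

def chain_time(n_nodes: int, image_gb: float, gb_per_s: float,
               block_gb: float = 0.1) -> float:
    # Pipelined chain: after a one-block delay per hop, all links run in
    # parallel, so total time ~ one transfer plus the pipeline fill.
    transfer = image_gb / gb_per_s
    pipeline_fill = (n_nodes - 1) * block_gb / gb_per_s
    return transfer + pipeline_fill

# Hypothetical numbers: 100 nodes, a 10 GB image, 1 GB/s links.
print(sequential_time(100, 10.0, 1.0))  # 1000.0 seconds
print(chain_time(100, 10.0, 1.0))       # 19.9 seconds
```

Under these assumed numbers the chain is roughly 50x faster, which is the effect that shortened cluster boot times on Magellan.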

6.4 Discussion<br />

In the previous section, we detailed current cloud offerings. Here we examine some of the gaps and challenges in using existing cloud offerings directly for scientific computing.

Resources. Cloud computing carries with it the assumption of an unlimited supply of resources available on demand. While this was true in the early days of cloud computing, when demand for resources was still ramping up, users have more recently found that their requests on providers such as Amazon EC2 were not satisfied due to insufficient capacity. This situation is similar to present-day supercomputing and grid resources, which are often over-subscribed and have long wait queues. Thus, for the scientific user, the promise of cloud computing as an unlimited supply of cycles is less compelling in practice. There is a need for differentiated levels of service similar to Amazon's current offerings, but with advanced resource request interfaces with

