Magellan Final Report - Office of Science - U.S. Department of Energy

Chapter 7 User Support

Many of the aspects of cloud computing that make it so powerful also introduce new complexities and challenges for both users and user support staff. Cloud computing gives users the flexibility to customize their software stack, but it comes with the additional burden of managing that stack. Commercial cloud providers have a limited user support model, and additional support typically comes at extra cost. This chapter describes the user support model that was used for the Magellan project, including some of the challenges that emerged during the course of the project. We discuss the key aspects of cloud computing architecture that bear on user support, along with several examples of user usage patterns and how they were addressed. Finally, we summarize the overall assessment of the user support process for mid-range computing users on cloud platforms.

7.1 Comparison of User Support Models

HPC centers provide a well-curated environment for robust, high-performance computing, which they make accessible to non-expert users through a variety of activities. In these environments, substantial effort is put into helping users be productive and successful on the hosted platform. These efforts take a number of forms, from building a tuned software environment that is optimized for HPC workloads, to user education, to application porting and optimization. They are important to the success of current and new users in HPC facilities, as many computational scientists are not necessarily deeply familiar with the details of modern computing hardware and software architecture.

HPC centers typically provide a single system software stack, paired with purpose-built hardware, and a set of policies for user access and prioritization. Users rely on a relatively fixed set of interfaces for interaction with the resource manager, file system, and other facility services.
Many HPC use cases are well covered within this scope; for example, this environment is well adapted to MPI applications that perform I/O to a parallel file system. Other use cases, such as high-throughput computing and data-intensive computing, may not be so well supported at HPC centers. For example, computer scientists developing low-level runtime software for HPC applications have a particularly difficult time performing this work at production computing centers. Similarly, deploying Hadoop on demand for computations could be done within the framework of a traditional HPC system, albeit with significant effort and in a less optimized fashion.

Cloud systems provide Application Programming Interfaces (APIs) for low-level resource provisioning. These APIs enable users to provision new virtual machines, storage, and network resources. These resources are configured by the user and can be built into complex networks including dozens, hundreds, or potentially even thousands of VMs with distinct software configurations, security policies, and service architectures. The flexibility of the capabilities provided by cloud APIs is substantial, allowing users to manage clusters built out of virtual machines hosted inside a cloud. This power comes at some cost in terms of support. The cloud model offers a large amount of flexibility, making it difficult and expensive to provide support to cloud users. The opportunities for errors or mistakes greatly increase once a user begins to modify virtual machine configurations. User expertise plays a role here, as it does on HPC systems; some users are comfortable building new configurations and designing system infrastructure, while many others are not.

This difference in capabilities demonstrates an important set of trade-offs between cloud and HPC system support models: specific purpose-built APIs (like those provided by HPC centers) can provide deep support for a relatively fixed set of services, while general-purpose APIs (like those provided by IaaS cloud systems) can only provide high-level support efficiently. The crux of the user support issue on cloud systems is determining how much support, from complete to partial, to offer for each kind of user activity.

7.2 Magellan User Support Model and Experience

Both Magellan sites leveraged their extensive experience supporting diverse user communities to support the Magellan user groups. In addition, the Argonne Magellan user support model benefited from extensive experience supporting users of experimental hardware testbeds. These testbeds are similar to clouds in that users often need administrative access to resources. This experience enabled us to understand the trade-off between depth of support and range of user activities required for private cloud environments. Both sites also relied heavily on existing systems to manage mailing lists, track tickets, and manage user accounts.

Our user support model consisted of five major areas: monitoring system services, direct support of the high-level tools and APIs provided by the system, construction and support of a set of baseline VM configurations, training sessions, and building a community support base for more complex problems. We discuss each of these areas below.

Monitoring. Monitoring is a critical component of providing any user service, even if the services are being offered as a testbed.
While cloud APIs provide access to more generic functionality than the traditional HPC system software stack, basic testing is quite straightforward. An initial step in monitoring system operational status was to extend the existing monitoring infrastructure to cover the cloud systems. At ALCF, an existing test harness was extended in order to run system correctness tests on Magellan easily and often. The test harness ran a variety of tests, from allocating VM instances to running performance benchmarks. These tests confirmed that the system was performing reliably and consistently over time. A similar approach was used at NERSC, where tests were run on a routine basis to ensure that instances could be spawned, networking could be configured, and the storage system was functioning correctly. These tests enabled the centers to proactively address the most routine problems.

Cloud setup support. The second component of the support model was helping users make use of the basic cloud services provided by Magellan. These activities included generating user credentials for access to the cloud services and providing documentation to address routine questions about how to set up simple VM instance deployments, storage, networks, and security policies. This approach provided enough help for users to get started, and is analogous in an HPC setting to ensuring that users can log in to a system and submit jobs.

Image support. Our initial assessment suggested that directly supporting more complex user activities would be prohibitively expensive, so we opted to take a different approach. Instead of direct support, we provided a number of pre-configured baseline configurations that users could build on without needing to start from scratch. We provided a number of starting VM instance configurations that were verified to work, as well as several recipes for building common network configurations and storage setups.
This approach was relatively effective; users could easily build on the basic images and documented examples. NERSC also provided tools to automate building virtual clusters with pre-configured services such as a batch system and an NFS file system.

Documentation and Tutorials. Both sites also provided tutorials and online documentation to introduce users to cloud models and address the most common questions and issues. Tutorials were organized for both
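
The routine checks described under Monitoring above (spawning instances, configuring networking, exercising storage) could be organized along the following lines. This is a minimal Python sketch under stated assumptions, not the harness used at ALCF or NERSC: the check functions, their names, and the `CHECKS` table are hypothetical stand-ins for real cloud API calls, and the storage check fails on purpose to show how a failure is reported.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    seconds: float
    detail: str = ""


def run_checks(checks: Dict[str, Callable[[], None]]) -> List[CheckResult]:
    """Run every check once; a check passes if it returns without raising."""
    results = []
    for name, check in checks.items():
        start = time.monotonic()
        try:
            check()
            results.append(CheckResult(name, True, time.monotonic() - start))
        except Exception as exc:
            # One failing check must not prevent the remaining checks from running.
            elapsed = time.monotonic() - start
            results.append(CheckResult(name, False, elapsed, str(exc)))
    return results


def failed(results: List[CheckResult]) -> List[str]:
    """Names of the checks that did not pass, in the order they ran."""
    return [r.name for r in results if not r.passed]


# Hypothetical stand-ins: a real harness would call the site's cloud API here.
def check_spawn_instance() -> None:
    pass  # e.g. request a VM and wait until it reports as running


def check_network() -> None:
    pass  # e.g. assign an address and confirm the instance is reachable


def check_storage() -> None:
    raise RuntimeError("volume attach timed out")  # simulated failure


CHECKS = {
    "spawn_instance": check_spawn_instance,
    "network": check_network,
    "storage": check_storage,
}

if __name__ == "__main__":
    for result in run_checks(CHECKS):
        status = "ok" if result.passed else "FAILED: " + result.detail
        print(f"{result.name}: {status} ({result.seconds:.3f}s)")
```

Running such a script on a schedule and alerting whenever `failed(...)` is non-empty is one way to catch routine problems before users report them; recording the elapsed time per check also leaves room for tracking performance consistency over time.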