Magellan Final Report - Office of Science - U.S. Department of Energy
pacity.
Sporadic Demand. One of the more common cases for using commercial cloud offerings is when demand
is highly variable, especially if there are also time-sensitive requirements for the service. In the analysis
of DOE centers above, the demand is still highly variable. However, scientists can typically tolerate reasonable
delays in the start of an application, especially if this results in access to more cycles. For cases where
demand must be met quickly, the ability to rapidly add resources can mean the difference
between completing a project and not. One example is a DOE lab science project that is conceived
suddenly and requires more resources than can be obtained quickly from a DOE HPC Center. DOE HPC
Centers typically do not have a viable way to add the necessary resources on a short timescale, and the
existing resources are heavily utilized, so the only way to make room for the new project would be to push
out existing projects. Other examples have real time-critical requirements, e.g., modeling an oil
spill to direct containment efforts, tracking a hurricane, or simulating a critical piece of equipment when
it fails and causes an expensive resource to go down. Additionally, some experiments, such as those at light
sources and accelerators, require or benefit from real-time analysis resources. If those experiments only
run a fraction of the time, they may benefit from on-demand cloud-type models. These kinds of opportunity
costs can be difficult to quantify but should be weighed when deciding whether to move to a cloud model.
Facility Constrained. Some sites are severely infrastructure-limited. This could be due to building restrictions,
insufficient power at the site, or other limitations. In these cases, commercial offerings may be the only
reasonable option available. However, if expansion is an option, the long-term cost should be considered:
while infrastructure expansion can be costly, those costs can be amortized over a long period, typically 15
years.
The potential cost savings to the customer in these cases come from a few common sources. One is
the ability to avoid purchasing and deploying computing resources when demand is unclear. The
other is avoiding the operation of resources that are needed only infrequently and would therefore sit at very
low utilization. Once a project can maintain reasonably high utilization of a resource, the cost savings
typically vanish.
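The break-even reasoning above can be made concrete with a small sketch. All of the prices below are illustrative assumptions, not figures from this report: a hypothetical on-demand cloud rate, a hypothetical node purchase price amortized over four years, and hypothetical annual operating costs. The point of the sketch is the shape of the comparison, not the specific numbers.

```python
# Break-even utilization sketch: cloud is pay-per-use, while an owned
# node's cost is fixed, so idle time inflates the owned cost per
# useful hour. All prices below are hypothetical assumptions.

CLOUD_RATE = 1.30            # $/hour on-demand, hypothetical
OWNED_CAPITAL = 6000.0       # $ per node, hypothetical
OWNED_YEARS = 4              # amortization period, hypothetical
OWNED_OPS_PER_YEAR = 1500.0  # $/year power, space, admin, hypothetical

HOURS_PER_YEAR = 24 * 365

def owned_cost_per_hour():
    """Effective $/hour of an owned node, independent of utilization."""
    annual = OWNED_CAPITAL / OWNED_YEARS + OWNED_OPS_PER_YEAR
    return annual / HOURS_PER_YEAR

def owned_cost_per_useful_hour(utilization):
    """Fixed cost spread over only the hours actually used."""
    return owned_cost_per_hour() / utilization

def cloud_cost_per_useful_hour(utilization):
    """Pay-per-use: cost per useful hour is flat regardless of utilization."""
    return CLOUD_RATE

def break_even_utilization():
    """Utilization above which owning becomes cheaper than renting."""
    return owned_cost_per_hour() / CLOUD_RATE
    # about 26% with the hypothetical prices above
```

With these assumed prices, a project idle 95% of the time pays several times the cloud rate per useful hour on owned hardware, while a project above roughly 26% utilization does better owning, which illustrates why the savings vanish at sustained high utilization.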
12.6 Late Update<br />
As this report was being finalized, Amazon announced several important updates. Due to the timing of
these announcements, we have left most of the analysis unchanged, but we considered it important to
discuss their impact on the analysis. There were three significant developments: an updated
Top500 entry from Amazon, the release of a new instance type that was used for the Top500 entry, and
new pricing. We discuss each of these and its impact on the analysis below.
Amazon’s Top500 entry for November 2011 achieved 240 TF, placing 42nd on the list. More interesting
than the absolute numbers or position is the improvement in efficiency to 68% of peak; on previous Top500
entries, Amazon had achieved approximately 50% of peak. This is likely due to better tuning of the Linpack
execution. Traditional HPC systems typically achieve between 80% and 90% of peak. Eventually, virtualized
systems may approach these efficiencies through improved integration with the interconnect and continued
improvements in the virtualization stacks.
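The efficiency figures above can be checked with simple arithmetic. The 240 TF sustained result and the 68% efficiency come from the report; the implied theoretical peak, and the comparison at a traditional-HPC efficiency of 85%, are derived values rather than published figures.

```python
# Back-of-the-envelope check of the Linpack efficiency figures.
# Rmax (240 TF) and 68% efficiency are from the report; the implied
# Rpeak and the 85% comparison point are derived assumptions.

RMAX_TF = 240.0     # sustained Linpack result (from the report)
EFFICIENCY = 0.68   # fraction of theoretical peak (from the report)

def implied_rpeak(rmax_tf, efficiency):
    """Theoretical peak implied by a sustained rate and its efficiency."""
    return rmax_tf / efficiency

def sustained(rpeak_tf, efficiency):
    """Sustained Linpack rate at a given fraction of peak."""
    return rpeak_tf * efficiency

rpeak = implied_rpeak(RMAX_TF, EFFICIENCY)  # roughly 353 TF of peak
# The same hardware at a traditional-HPC efficiency of 85% would
# sustain about 300 TF, a 25% gain with no additional peak capacity.
hpc_equivalent = sustained(rpeak, 0.85)
```

This is why the efficiency improvement matters more than the list position: closing the gap between 68% and the 80-90% typical of traditional HPC systems recovers substantial sustained performance from the same peak.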
In parallel with the new Top500 result, Amazon announced a new instance type, cc2.8xlarge. This
instance type is notable for several reasons. It is the first significant deployment of Intel Sandy Bridge EP.
In addition to increasing the number of cores per socket from four to eight compared with the
Nehalem processor used in the previous cluster compute instance type, the Sandy Bridge processor also
effectively doubles the number of floating-point operations per cycle. However, the processors used in the
new instance type run at a slightly lower clock rate (2.6 GHz versus 2.95 GHz). As a result of these differences,
the new instance has a theoretical peak FLOP rate that is approximately 3.5x that of the previous cluster