Magellan Final Report - Office of Science - U.S. Department of Energy
grid) on 256 cores, but for only 3 timesteps. This problem is more typically benchmarked on 512 cores for 10 timesteps.
MILC. This code represents lattice gauge computation that is used to study quantum chromodynamics (QCD), the theory of the subatomic strong interactions responsible for binding quarks into protons and neutrons and holding them together in the nucleus. QCD discretizes space and evaluates field variables on sites and links of a regular hypercube lattice in four-dimensional spacetime. It involves integrating an equation of motion for hundreds or thousands of time steps, and each integration step requires inverting a large, sparse matrix. The sparse, nearly singular matrix problem is solved using a conjugate gradient (CG) method, and many CG iterations are required for convergence. Within a processor, the four-dimensional nature of the problem requires gathers from widely separated locations in memory. The inversion by CG requires repeated three-dimensional complex matrix-vector multiplications, which reduce to dot products of three pairs of three-dimensional complex vectors. Each dot product consists of five multiply-add operations and one multiply operation. The parallel programming model for MILC is a 4-D domain decomposition in which each task exchanges data with its eight nearest neighbors and participates in the small-payload all-reduce calls of the CG algorithm. MILC is extremely dependent on memory bandwidth and prefetching and exhibits a high computational intensity.
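The CG inversion described above can be sketched as follows. This is a generic conjugate gradient on a small dense Hermitian positive-definite stand-in matrix, not MILC's actual sparse staggered Dirac operator; the matrix, tolerance, and sizes are illustrative assumptions. Note the scalar `np.vdot` reductions, which correspond to the small-payload all-reduce calls in the parallel version.

```python
# Minimal CG sketch for a Hermitian positive-definite system A x = b.
# Stand-in matrix and tolerance are illustrative; MILC applies CG to a
# large sparse, nearly singular lattice operator instead.
import numpy as np

def cg(apply_A, b, tol=1e-12, max_iter=1000):
    """Conjugate gradient solve of A x = b; apply_A computes A @ v."""
    x = np.zeros_like(b)
    r = b - apply_A(x)           # residual
    p = r.copy()                 # search direction
    rr = np.vdot(r, r).real      # the tiny all-reduce payload in parallel CG
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rr / np.vdot(p, Ap).real
        x += alpha * p
        r -= alpha * Ap
        rr_new = np.vdot(r, r).real
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Example: a random Hermitian positive-definite system.
rng = np.random.default_rng(0)
M = rng.standard_normal((16, 16)) + 1j * rng.standard_normal((16, 16))
A = M.conj().T @ M + 16 * np.eye(16)   # shift ensures positive definiteness
b = rng.standard_normal(16) + 0j
x = cg(lambda v: A @ v, b)
```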
Two different MILC configurations were used in the benchmarking efforts. In one case (Section 9.1.3) we used a 32×32×16×18 global lattice on 64 cores with two quark flavors, four trajectories, and eight steps per trajectory. In contrast, MILC benchmarking for NERSC procurements has used up to 8,192 cores on a 64³×144 grid with 15 steps per trajectory. For the scaling study (Section 9.1.5) we used a 64×32×32×72 global lattice with two quark flavors, four trajectories, and 15 steps per trajectory.
PARATEC. PARATEC (PARAllel Total Energy Code) performs ab initio density functional theory quantum-mechanical total energy calculations using pseudopotentials, a plane wave basis set, and an all-band (unconstrained) conjugate gradient (CG) approach. Part of the calculation is carried out in Fourier space; custom parallel three-dimensional FFTs are used to transform the wave functions between real and Fourier space. PARATEC uses MPI and parallelizes over grid points, thereby achieving a fine-grained level of parallelism. The real-space data layout of wave functions is on a standard Cartesian grid. In general, the speed of the FFT dominates the runtime, since it stresses global communications bandwidth, though mostly point-to-point, using relatively short messages. Optimized system libraries (such as Intel MKL or AMD ACML) are used for both BLAS3 and 1-D FFTs; this results in high cache reuse and a high percentage of per-processor peak performance.
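The real-to-Fourier transform described above can be illustrated with serial NumPy FFTs on a toy wavefunction; the grid size and Gaussian test function are assumptions, and PARATEC's actual FFTs are custom parallel implementations. The last two lines also show why optimized 1-D library FFTs suffice: a 3-D FFT decomposes into 1-D FFTs along each axis, with the parallel code supplying transposes in between.

```python
# Serial sketch of moving a wavefunction between real and Fourier space,
# as PARATEC does with custom parallel 3-D FFTs. Grid size and the
# Gaussian test function are illustrative assumptions.
import numpy as np

n = 32                                     # grid points per dimension (assumed)
x = np.linspace(-1.0, 1.0, n, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
psi = np.exp(-(X**2 + Y**2 + Z**2))        # toy wavefunction on a Cartesian grid

psi_g = np.fft.fftn(psi)                   # real space -> Fourier space
psi_back = np.fft.ifftn(psi_g)             # Fourier space -> real space
assert np.allclose(psi_back.real, psi)     # the round trip is the identity

# A 3-D FFT is equivalent to 1-D FFTs applied along each axis in turn,
# which is why optimized 1-D library FFTs (plus transposes) are enough.
psi_g_1d = np.fft.fft(np.fft.fft(np.fft.fft(psi, axis=0), axis=1), axis=2)
assert np.allclose(psi_g, psi_g_1d)
```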
The benchmark used here is based on the NERSC-5 input, which does not allow any aggregation of the transpose data. The input contains 250 silicon atoms in a diamond lattice configuration and runs for six conjugate gradient iterations. For the scaling study (Section 9.1.5), the input is based on the NERSC-6 input; it contains 686 silicon atoms in a diamond lattice configuration and runs for 20 conjugate gradient iterations. In contrast, a real science run might use 60 or more iterations.
HPCC. In addition to the application benchmarks discussed above, we also ran the High Performance Computing Challenge (HPCC) benchmark suite [41]. HPCC consists of seven synthetic benchmarks: three targeted and four complex. The targeted synthetics are DGEMM, STREAM, and measures of network latency and bandwidth. These are microkernels that quantify basic system parameters, separately characterizing computation and communication performance. The complex synthetics are HPL, FFTE, PTRANS, and RandomAccess. These combine computation and communication and can be thought of as very simple proxy applications. Taken together, these benchmarks measure a variety of lower-level factors that are important for performance, which is why we chose them for this work.
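To make the "targeted synthetic" idea concrete, here is a minimal sketch of the STREAM triad, one of the memory-bandwidth microkernels HPCC includes. The array size and scalar are assumptions; the real STREAM benchmark is a tuned C code with strict array-sizing and timing rules.

```python
# Sketch of the STREAM triad kernel (a = b + s*c), illustrating what the
# memory-bandwidth targeted synthetic measures. Array size and scalar are
# illustrative assumptions, not STREAM's required configuration.
import time
import numpy as np

n = 5_000_000
a = np.zeros(n)
b = np.full(n, 2.0)
c = np.full(n, 0.5)
scalar = 3.0

t0 = time.perf_counter()
a[:] = b + scalar * c                    # triad
elapsed = time.perf_counter() - t0

# Three arrays of 8-byte doubles traverse memory per pass.
bandwidth_gbs = 3 * n * 8 / elapsed / 1e9
```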