FIAS Scientific Report 2011 - Frankfurt Institute for Advanced Studies

High Performance GPU-based DGEMM and Linpack

Collaborators: D. Rohr 1, M. Bach 1, M. Kretz 1, V. Lindenstruth 1

1 Frankfurt Institute for Advanced Studies

Linpack is a popular benchmark for supercomputers and forms the basis of the half-yearly Top500 list. It iteratively solves a dense system of linear equations. Each iteration consists of panel factorization, panel broadcast, LASWP, U-matrix broadcast, and the trailing matrix update [1]. The trailing matrix update is performed via a matrix multiplication (DGEMM) and is the most compute-intensive step. Since GPUs excel at matrix multiplication, a heterogeneous Linpack has been implemented for the LOEWE-CSC cluster which performs the update step on the GPU and all other tasks on the processor. Because only one process per node can use the GPU efficiently, a single multi-threaded process per node is used instead of one MPI process per CPU core. Therefore, all other tasks during Linpack have been parallelized. The employed multi-threaded BLAS libraries were adapted to support the reservation of CPU cores for GPU pre- and postprocessing. A binary patch to the AMD driver reduces the page fault rate.
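The iteration structure described above can be sketched as a blocked, right-looking LU decomposition in which the trailing matrix update is a single large DGEMM. The sketch below is illustrative only: it omits pivoting (so the LASWP step disappears) and therefore assumes a diagonally dominant matrix; the real HPL performs these steps with pivoting and MPI distribution.

```python
import numpy as np

def blocked_lu(A, nb):
    """Blocked right-looking LU without pivoting (illustrative sketch).

    Per block step: panel factorization, a triangular solve on the
    block row of U (DTRSM), and one trailing-matrix DGEMM, which is
    the compute-intensive step HPL offloads to the GPU.
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # panel factorization: unblocked LU on columns k..e
        for j in range(k, e):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        # U block row: triangular solve (DTRSM)
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        # trailing matrix update: the DGEMM step
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```

The returned array stores the unit-lower-triangular L (below the diagonal) and U (on and above it) in place, mirroring LAPACK's dgetrf layout.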

The DGEMM of the update step is split into many multiplications of submatrices called tiles, which are small enough to fit into GPU memory. Processing of these tiles is arranged in a pipeline which ensures continuous GPU utilization and allows for concurrent GPU and CPU processing as well as DMA transfer. In addition, dynamic buffering ensures that no tile is ever retransferred. To avoid wasting GPU cycles during non-DGEMM Linpack tasks, the implementation allows factorization and broadcast for the next iteration to execute in parallel with the DGEMM of the current iteration, and it pipelines the LASWP task. Linpack performance has been demonstrated to scale linearly up to several hundred nodes [1].
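The tile decomposition can be illustrated with a minimal sketch. The tile size and loop order here are chosen for illustration only; in the actual implementation each tile job flows through a multi-buffer DMA pipeline, and the per-tile multiplication runs as a GPU kernel rather than a NumPy call.

```python
import numpy as np

def tiled_dgemm_update(C, A, B, tile=4):
    """Trailing update C -= A @ B, computed tile by tile.

    Each C tile, together with the needed row slice of A and column
    slice of B, must fit into GPU memory; here the "GPU kernel" is
    simply a NumPy matmul on the tile.
    """
    m, n = C.shape
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # one tile job of the pipeline:
            # transfer slices in, multiply, merge the result back
            C[i:i + tile, j:j + tile] -= A[i:i + tile, :] @ B[:, j:j + tile]
    return C
```

Because the tile jobs are independent, they can be dispatched to the GPU while the CPU works on other tiles and DMA transfers overlap with computation, which is what keeps the GPU continuously busy.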

Usually, the per-node performance of multi-node Linpack runs is limited by the slowest nodes. For clusters whose nodes differ in performance (e.g. the LOEWE-CSC has CPU-only and GPU-enabled compute nodes), the new Linpack allows for an unequal distribution of the workload among the systems. Measurements show that the implementation achieves almost the accumulated performance of all participating nodes, with less than 3% granularity loss.
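A proportional split by node performance can be sketched as follows. The rounding scheme is a hypothetical illustration, not the exact HPL heuristic; rounding the assignment to whole matrix columns is the source of the small granularity loss mentioned above.

```python
def split_workload(n_cols, node_perf):
    """Distribute n_cols matrix columns among nodes in proportion to
    their relative performance (illustrative sketch).

    Leftover columns from rounding down are handed to the fastest
    nodes first.
    """
    total = sum(node_perf)
    shares = [int(n_cols * p / total) for p in node_perf]
    leftover = n_cols - sum(shares)
    fastest = sorted(range(len(node_perf)),
                     key=lambda i: node_perf[i], reverse=True)
    for k in range(leftover):
        shares[fastest[k % len(shares)]] += 1
    return shares
```

For example, with GPU nodes roughly four times faster than CPU-only nodes, `split_workload(1000, [4, 4, 1, 1])` assigns 400, 400, 100, and 100 columns respectively.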

Multiple GPUs can be used in parallel, with e.g. two GPUs showing a speedup factor of about 1.98. Up to four GPUs in one system have been tested, reaching 2 TFlop/s of Linpack performance. With three GPUs, an energy efficiency of about 1200 MFlops/W was demonstrated, corresponding to second place in the Green500 list at the time the experiment was conducted [2]. The DGEMM kernel on the GPU achieves about 90% of the theoretical peak performance on both the Cypress and Cayman series of AMD GPUs. 75% of the accumulated theoretical CPU and GPU performance is available in Linpack. The LOEWE-CSC ranked 22nd in the November 2010 Top500 list, demonstrating the highest efficiency with respect to theoretical peak performance of all listed GPU clusters.

[Figure omitted: timeline showing CPU cores 0-23 and GPUs 0-2 concurrently executing the GotoBLAS CPU DGEMM, LASWP + DTRSM, panel factorization, panel and U broadcasts, buffer divide/merge steps, and the GPU DGEMM kernels across iterations N and N+1.]

Figure 1: Concurrent execution of all HPL tasks on CPU cores and GPUs

The current CAL-based implementation is being extended to support OpenCL and CUDA as well as the new AMD Graphics Core Next GPUs.

Related publications in 2011:

1) M. Bach, M. Kretz, V. Lindenstruth, D. Rohr: Optimized HPL for AMD GPU and multi-core CPU usage, Computer Science - Research and Development 26, 153 (2011)

2) D. Rohr, M. Bach, M. Kretz, V. Lindenstruth: Multi-GPU DGEMM and HPL on highly energy efficient clusters, IEEE Micro, Special Issue: CPU, GPU, and Hybrid Computing
