FIAS Scientific Report 2011 - Frankfurt Institute for Advanced Studies

High Performance GPU-based DGEMM and Linpack

Collaborators: D. Rohr 1, M. Bach 1, M. Kretz 1, V. Lindenstruth 1

1 Frankfurt Institute for Advanced Studies

Linpack is a popular benchmark for supercomputers and forms the basis of the half-yearly Top500 list. It iteratively solves a dense system of linear equations. Each iteration consists of panel factorization, panel broadcast, LASWP, U-matrix broadcast, and the trailing matrix update [1]. The trailing matrix update is performed via a matrix multiplication (DGEMM) and is the most compute-intensive step. Since GPUs excel at matrix multiplication, a heterogeneous Linpack has been implemented for the LOEWE-CSC cluster which performs the update step on the GPU and all other tasks on the processor. Because only one process per node can use the GPU efficiently, a single multi-threaded process per node is used instead of one MPI process per CPU core. Therefore, all other tasks during Linpack have been parallelized. The employed multi-threaded BLAS libraries were adapted to support the reservation of CPU cores for GPU pre- and postprocessing. A binary patch to the AMD driver reduces the page fault rate.
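The iteration structure described above can be sketched as a blocked, right-looking LU decomposition in which the trailing matrix update is a single large DGEMM. The sketch below is illustrative only: it omits pivoting (so the LASWP step disappears) and therefore assumes a diagonally dominant matrix; the real HPL performs these steps with pivoting and MPI distribution.

```python
import numpy as np

def blocked_lu(A, nb):
    """Blocked right-looking LU without pivoting (illustrative sketch).

    Per block step: panel factorization, a triangular solve on the
    block row of U (DTRSM), and one trailing-matrix DGEMM, which is
    the compute-intensive step HPL offloads to the GPU.
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # panel factorization: unblocked LU on columns k..e
        for j in range(k, e):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:e] -= np.outer(A[j + 1:, j], A[j, j + 1:e])
        # U block row: triangular solve (DTRSM)
        L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
        # trailing matrix update: the DGEMM step
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A
```

The returned array stores the unit-lower-triangular L (below the diagonal) and U (on and above it) in place, mirroring LAPACK's dgetrf layout.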

The DGEMM of the update step is split into many multiplications of submatrices called tiles, which are small enough to fit into GPU memory. Processing of these tiles is arranged in a pipeline which ensures continuous GPU utilization and allows for concurrent GPU and CPU processing as well as DMA transfer. In addition, dynamic buffering ensures that no tile is ever retransferred. To avoid wasting GPU cycles during non-DGEMM Linpack tasks, the implementation allows factorization and broadcast for the next iteration to execute in parallel with the DGEMM of the current iteration, and it pipelines the LASWP task. Linpack performance has been demonstrated to scale linearly up to several hundred nodes [1].
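The tile decomposition can be illustrated with a minimal sketch. The tile size and loop order here are chosen for illustration only; in the actual implementation each tile job flows through a multi-buffer DMA pipeline, and the per-tile multiplication runs as a GPU kernel rather than a NumPy call.

```python
import numpy as np

def tiled_dgemm_update(C, A, B, tile=4):
    """Trailing update C -= A @ B, computed tile by tile.

    Each C tile, together with the needed row slice of A and column
    slice of B, must fit into GPU memory; here the "GPU kernel" is
    simply a NumPy matmul on the tile.
    """
    m, n = C.shape
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            # one tile job of the pipeline:
            # transfer slices in, multiply, merge the result back
            C[i:i + tile, j:j + tile] -= A[i:i + tile, :] @ B[:, j:j + tile]
    return C
```

Because the tile jobs are independent, they can be dispatched to the GPU while the CPU works on other tiles and DMA transfers overlap with computation, which is what keeps the GPU continuously busy.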

Usually, the per-node performance of multi-node Linpack runs is limited by the slowest nodes. For clusters whose nodes differ in performance (e.g. the LOEWE-CSC has CPU-only and GPU-enabled compute nodes), the new Linpack allows for an unequal distribution of the workload among the systems. Measurements show that the implementation achieves almost the accumulated performance of all participating nodes, with less than 3% granularity loss.
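A proportional split by node performance can be sketched as follows. The rounding scheme is a hypothetical illustration, not the exact HPL heuristic; rounding the assignment to whole matrix columns is the source of the small granularity loss mentioned above.

```python
def split_workload(n_cols, node_perf):
    """Distribute n_cols matrix columns among nodes in proportion to
    their relative performance (illustrative sketch).

    Leftover columns from rounding down are handed to the fastest
    nodes first.
    """
    total = sum(node_perf)
    shares = [int(n_cols * p / total) for p in node_perf]
    leftover = n_cols - sum(shares)
    fastest = sorted(range(len(node_perf)),
                     key=lambda i: node_perf[i], reverse=True)
    for k in range(leftover):
        shares[fastest[k % len(shares)]] += 1
    return shares
```

For example, with GPU nodes roughly four times faster than CPU-only nodes, `split_workload(1000, [4, 4, 1, 1])` assigns 400, 400, 100, and 100 columns respectively.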

Multiple GPUs can be used in parallel, with e.g. two GPUs showing a speedup factor of about 1.98. Up to four GPUs in one system have been tested, reaching 2 TFlop/s of Linpack performance. With three GPUs, an energy efficiency of about 1200 MFlops/W was demonstrated, corresponding to second place in the Green500 list at the time the experiment was conducted [2]. The DGEMM kernel on the GPU achieves about 90% of the theoretical peak performance on both the Cypress and Cayman series of AMD GPUs. 75% of the accumulated theoretical CPU and GPU performance is available in Linpack. The LOEWE-CSC ranked 22nd in the November 2010 Top500 list, demonstrating the highest efficiency with respect to theoretical peak performance of all listed GPU clusters.

[Figure omitted: timeline showing CPU cores 0-23 and GPUs 0-2 concurrently executing the GotoBLAS CPU DGEMM, LASWP + DTRSM, panel factorization, panel and U broadcasts, buffer divide/merge steps, and the GPU DGEMM kernels across iterations N and N+1.]

Figure 1: Concurrent execution of all HPL tasks on CPU cores and GPUs

The current CAL-based implementation is being extended to support OpenCL and CUDA as well as the new AMD Graphics Core Next GPUs.

Related publications in 2011:

1) M. Bach, M. Kretz, V. Lindenstruth, D. Rohr: Optimized HPL for AMD GPU and multi-core CPU usage, Computer Science - Research and Development 26, 153 (2011)

2) D. Rohr, M. Bach, M. Kretz, V. Lindenstruth: Multi-GPU DGEMM and HPL on highly energy efficient clusters, IEEE Micro, Special Issue: CPU, GPU, and Hybrid Computing
