CUDA Libraries and MPI+OpenMP+CUDA - Prace Training Portal

WHEN and WHY is more than one MPI process per GPU good?
Usually CUDA optimizations are performed starting from a serial code → Visual Profiler (or directly on the text file).
Introducing a parallelization means distributing data → the compute footprint of some CUDA kernels might decrease, or transfer time overcomes compute time (even worse!).
The GPU performs its "duty" (accelerator) better when there is enough computation to exploit all the parallelism of all SMs → let's safely share it and its resources.
Less speed-up for a single piece of computation, but there is more interleaved work on the GPU coming from different processes → kernels are less efficient but more can run concurrently → PERFORMANCE!
February 10, 2012, PRACE Winter School 2012, slide 68
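Sharing a GPU among several MPI processes requires every process to pick a device, and more than one process may land on the same one. The following is a minimal sketch of one common choice, round-robin over the node-local rank, written in plain C so it can be read without MPI or CUDA installed. The function names `device_for_rank` and `ranks_sharing_device` are hypothetical and are not taken from the slides. A real code would obtain the local rank from MPI and pass the result to `cudaSetDevice`.

```c
/* Hypothetical helper (not from the slides): map a node-local MPI rank
 * to a GPU index in round-robin fashion. Returns -1 on invalid input. */
int device_for_rank(int local_rank, int num_devices)
{
    if (local_rank < 0 || num_devices <= 0)
        return -1;
    return local_rank % num_devices;
}

/* Hypothetical helper: number of node-local ranks that end up on a
 * given device under the round-robin mapping above. */
int ranks_sharing_device(int device, int local_ranks, int num_devices)
{
    if (num_devices <= 0 || device < 0 || device >= num_devices || local_ranks < 0)
        return 0;
    int base = local_ranks / num_devices;   /* every device gets at least this many */
    int extra = local_ranks % num_devices;  /* the first 'extra' devices get one more */
    return base + (device < extra ? 1 : 0);
}
```

With 5 local ranks and 2 GPUs, ranks 0, 2 and 4 map to device 0 and ranks 1 and 3 map to device 1, so kernels from up to three processes can interleave on the first GPU.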
Can I achieve a better acceleration?
ALWAYS! But remember that… Amdahl's law still exists!
Def: In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations.
[Figure: two execution timelines, A and B, each made of the phases I/O, PARALLEL, I/O]
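The definition above can be made concrete with the usual form of Amdahl's law, S(n) = 1 / ((1 - p) + p / n), where p is the fraction of run time that scales and n is the number of workers. The C sketch below is only an illustration; the names `amdahl_speedup` and `amdahl_limit` are hypothetical, and it assumes the scalable part speeds up perfectly with n.

```c
/* Amdahl's law: speed-up with n workers when a fraction p (0..1) of the
 * run time scales perfectly and the rest (1 - p) does not scale at all.
 * Returns -1.0 on invalid input. */
double amdahl_speedup(double parallel_fraction, double n)
{
    if (parallel_fraction < 0.0 || parallel_fraction > 1.0 || n < 1.0)
        return -1.0;
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n);
}

/* Upper limit of the speed-up as n grows without bound: 1 / (1 - p).
 * Returns -1.0 when p is out of range or the limit is unbounded (p == 1). */
double amdahl_limit(double parallel_fraction)
{
    if (parallel_fraction < 0.0 || parallel_fraction >= 1.0)
        return -1.0;
    return 1.0 / (1.0 - parallel_fraction);
}
```

For example, if 90% of the run time scales, 10 workers give a speed-up of about 5.3, and no number of workers can exceed 10. Non-scalable phases, such as the I/O blocks in the figure, are what set that ceiling.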
Page 1 and 2: PRACE Winter School 2012Hybrid Prog
Page 3 and 4: Acknowledgments first!• ICHEC, Ir
Page 5 and 6: CUDA library ecosystemNVIDIA CUDATo
Page 7 and 8: CUFFT - Code Sample (C)#define NX 2
Page 9 and 10: CUBLAS - Thunking versus non-Thunki
Page 11 and 12: CUBLAS - Code Sample (FORTRAN)PROGR
Page 13 and 14: use-case: compute 3D-FFT on multi-G
Page 15 and 16: fft3Dtest.cu - Code Snippet (C++)lo
Page 17 and 18: fft3Dtest.cu - Profiling12101 GPU2
Page 19 and 20: Pinned allocation - Code Sample (FO
Page 21 and 22: Pinned allocation - Code Sample (FO
Page 23 and 24: Consideration over PINNED memory4m3
Page 25: PHIGEMM - Code Snippet (C)for (iDev
Page 28 and 29: cuBLAS API: new versus Legacy• th
Page 30 and 31: Performance Measurement and Metrics
Page 32 and 33: Performance Measurement and Metrics
Page 34 and 35: MPI + CUDA, short summary1MPI commu
Page 36 and 37: Know what you have…Let’s forget
Page 38 and 39: Example: PLASMA, DPLASMA & MAGMAMAG
Page 40 and 41: Time-consuming steps in PWSCF• Ca
Page 42 and 43: GPU Developments• MPI-GPU binding
Page 44 and 45: MPI-GPU binding - Code Snippet (C)/
Page 46 and 47: GPU memory management - Code Snippe
Page 48 and 49: ADDUSDENS - Code Snippet (FORTRAN)D
Page 50 and 51: QVAN2 - Code Snippet (FORTRAN)DO lm
Page 52 and 53: CUDA ADDUSDENS, Considerations• C
Page 54 and 55: “new” CUDA ADDUSDENS - Code Sni
Page 56 and 57: CUDA NEWD - Code Snippet (C)for( ih
Page 58 and 59: CUDA NEWD - Code Snippet (C)...#if
Page 60 and 61: CUDA VLOC_PSI_K serial - Computatio
Page 62 and 63: KERNEL_INIT_PSIC_K - Code Snippet (
Page 64 and 65: CUDA VLOC_PSI_K - parallelPSIPSICFF
Page 66 and 67: PWSCF GPU, results & benchmarks(ser
Page 70: Thank you for your attention!No CPU