CUDA Libraries and MPI+OpenMP+CUDA - Prace Training Portal

CUBLAS – Basic Linear Algebra Library
• Self-contained at the API level, same naming convention as BLAS
• Supports all the BLAS functions:
  – Level 1 (vector-vector, O(N)): AXPY, DOT
  – Level 2 (matrix-vector, O(N^2)): GEMV, TRSV
  – Level 3 (matrix-matrix, O(N^3)): GEMM, TRSM
• Following the BLAS convention, column-major storage
• Error handling
• FORTRAN interfaces: thunking and non-thunking (default)
• Supports streams and asynchronous calls
• Batched GEMM (small matrices)

February 10, 2012, PRACE Winter School 2012
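
The three BLAS levels and the column-major convention listed above can be illustrated with a small host-only reference in C. This is a sketch for orientation, not CUBLAS code: the names `IDX2C`, `ref_axpy`, `ref_gemv` and `ref_gemm` are assumptions made for this example, nothing runs on a GPU, and the routines skip the special cases a real BLAS handles (transposes, strides, `beta == 0` with uninitialised output).

```c
#include <assert.h>
#include <stddef.h>

/* Column-major index of element (i, j) in a matrix with leading dimension ld.
   Hypothetical helper for this sketch: columns are contiguous in memory. */
#define IDX2C(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Level 1, O(N): y = alpha * x + y */
void ref_axpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2, O(N^2): y = alpha * A * x + beta * y, with A (m x n) column-major */
void ref_gemv(size_t m, size_t n, double alpha, const double *a, size_t lda,
              const double *x, double beta, double *y)
{
    assert(lda >= m);
    for (size_t i = 0; i < m; ++i)
        y[i] *= beta;
    for (size_t j = 0; j < n; ++j) {
        double t = alpha * x[j];
        for (size_t i = 0; i < m; ++i)      /* walk down one column */
            y[i] += t * a[IDX2C(i, j, lda)];
    }
}

/* Level 3, O(N^3): C = alpha * A * B + beta * C,
   with A (m x k), B (k x n), C (m x n), all column-major */
void ref_gemm(size_t m, size_t n, size_t k, double alpha,
              const double *a, size_t lda, const double *b, size_t ldb,
              double beta, double *c, size_t ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i)
            c[IDX2C(i, j, ldc)] *= beta;
        for (size_t p = 0; p < k; ++p) {
            double t = alpha * b[IDX2C(p, j, ldb)];
            for (size_t i = 0; i < m; ++i)
                c[IDX2C(i, j, ldc)] += t * a[IDX2C(i, p, lda)];
        }
    }
}
```

With this layout the 2 x 2 matrix with rows (1, 2) and (3, 4) is stored as `{1, 3, 2, 4}`, which is the ordering a C program has to produce before handing a matrix to a column-major BLAS.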
CUBLAS - Thunking versus non-Thunking
Thunking:
• Allows interfacing to existing applications without any changes, through wrappers
• During each call, the wrappers allocate GPU memory, copy the source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy the results back to CPU memory space and deallocate the GPU memory
• Intended for light testing because of the per-call overhead
Non-Thunking (default):
• Existing applications need to be modified slightly to allocate/deallocate data in the GPU memory space (using CUBLAS_ALLOC and CUBLAS_FREE) and to copy data between GPU and CPU memory spaces (using CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)
• Intended for production code, high flexibility
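
The difference in call overhead between the two interfaces can be sketched with a host-only model in C. Everything here is an assumption made for illustration: the "device" is an ordinary heap buffer, and `dev_alloc`, `dev_set`, `dev_get`, `dev_axpy`, `thunk_axpy` and `nonthunk_axpy` are hypothetical names, not CUBLAS routines. The counters only record how many allocations and copies each style performs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts the memory traffic caused by each calling style (hypothetical type). */
typedef struct { size_t allocs, copies_in, copies_out; } xfer_stats;

/* Stand-ins for device allocation and host<->device copies (host memory only). */
float *dev_alloc(size_t n, xfer_stats *s)
{
    s->allocs++;
    return malloc(n * sizeof(float));
}

void dev_set(float *d, const float *h, size_t n, xfer_stats *s)
{
    memcpy(d, h, n * sizeof(float));
    s->copies_in++;
}

void dev_get(float *h, const float *d, size_t n, xfer_stats *s)
{
    memcpy(h, d, n * sizeof(float));
    s->copies_out++;
}

/* Stand-in for the library routine working on "device" buffers: y = alpha*x + y */
void dev_axpy(size_t n, float alpha, const float *dx, float *dy)
{
    for (size_t i = 0; i < n; ++i)
        dy[i] += alpha * dx[i];
}

/* Thunking style: every call allocates, copies in, computes, copies out, frees. */
int thunk_axpy(size_t n, float alpha, const float *x, float *y, xfer_stats *s)
{
    assert(s != NULL);
    float *dx = dev_alloc(n, s);
    float *dy = dev_alloc(n, s);
    if (!dx || !dy) { free(dx); free(dy); return -1; }
    dev_set(dx, x, n, s);
    dev_set(dy, y, n, s);
    dev_axpy(n, alpha, dx, dy);
    dev_get(y, dy, n, s);
    free(dx);
    free(dy);
    return 0;
}

/* Non-thunking style: the caller keeps data resident across repeated calls. */
int nonthunk_axpy(size_t n, float alpha, const float *x, float *y,
                  int reps, xfer_stats *s)
{
    assert(s != NULL);
    float *dx = dev_alloc(n, s);
    float *dy = dev_alloc(n, s);
    if (!dx || !dy) { free(dx); free(dy); return -1; }
    dev_set(dx, x, n, s);              /* copy in once */
    dev_set(dy, y, n, s);
    for (int r = 0; r < reps; ++r)     /* many calls, no extra transfers */
        dev_axpy(n, alpha, dx, dy);
    dev_get(y, dy, n, s);              /* copy out once */
    free(dx);
    free(dy);
    return 0;
}
```

In this model, three thunking calls cost six allocations, six copies in and three copies out, while three non-thunking calls cost two allocations, two copies in and one copy out for the same result. That fixed traffic per call is the overhead the slide refers to.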
