CUDA Accelerated Linpack on Clusters - Nvidia

More documents

Recommendations

Info

<strong>CUDA</strong> <strong>Accelerated</strong> LINPACK Both CPU cores and GPUs are used in synergy with minor or no modifications to the original source code (HPL 2.0): - An host library intercepts the calls to DGEMM and DTRSM and executes them simultaneously on the GPUs and CPU cores. - Use of pinned memory for fast PCI-e transfers, up to 5.7GB/s on x16 gen2 slots. Only changes to the HPL source
NVIDIA Tesla GPU Computing Products
Page 1 and 2: CUDA Accel
Page 3 and 4: LINPACK Benchmark The LINPACK bench
Page 5: LINPACK Benchmark Solve a dense NxN
Page 9 and 10: Overlap DGEMM on CPU and GPU // Cop
Page 11 and 12: DGEMM performance on GPU (T10) A DG
Page 13 and 14: Results on workstation SUN Ultra 24
Page 15 and 16: Effect of PCI-e bandwidth Page lock
Page 17 and 18: FERMI T20 GPU
Page 19 and 20: T10 DGEMM 16 • T10 DGEMM algorith
Page 21 and 22: T20 DGEMM Performance • Inner Loo
Page 23 and 24: DGEMM Compute/COPY Analysis
Page 25 and 26: Fermi DGEMM Strategy • Slice Matr
Page 27 and 28: Additional Overlap strategy • Cop
Page 29 and 30: Optimizations - Auto Split • Keep
Page 31 and 32: DTRSM Same basic splitting approach
Page 33 and 34: DTRSM • GPU implimentation based
Page 35 and 36: DTRSM • Repeat: • DTRSM diagona
Page 37 and 38: Results on single node Supermicro 6
Page 39 and 40: Linpack Score (GFL
Page 41 and 42: Nebulae Cluster #2 on Latest Top500
Page 43 and 44: Future Challenges • Compute growi

CUDA Accelerated Linpack on Clusters - Nvidia

Create successful ePaper yourself

Delete template?

Save as template?