CUDA Accelerated Linpack on Clusters - Nvidia

More documents

Recommendations

Info

Optimal split If A(M,K), B(K,N) and C(M,N), a DGEMM call performs 2*M*K*N operations T CPU (M,K,N2) = T GPU (M,k,N1) N=N1+N2 If G CPU denotes the DGEMM performance of the CPU in Gflops and G GPU the one of the GPU, The optimal split is G GPU / (G CPU +G GPU )
DGEMM performance on GPU (T10) A DGEMM call in CUBLAS maps to several different kernels depending on the size With the combined CPU/GPU approach, we can always send optimal work to the GPU. M K N M%64 K%16 N%16 Gflops 448 400 12320 Y Y Y 82.4 12320 400 1600 N Y Y 75.2 12320 300 448 N N Y 55.9 12320 300 300 N N N 55.9 Tesla T10 1.44Ghz, data resident in GPU memory. Optimal kernel achieves 95% of peak
Page 1 and 2: CUDA Accel
Page 3 and 4: LINPACK Benchmark The LINPACK bench
Page 5 and 6: LINPACK Benchmark Solve a dense NxN
Page 7 and 8: NVIDIA Tesla GPU Computing Products
Page 9: Overlap DGEMM on CPU and GPU // Cop
Page 13 and 14: Results on workstation SUN Ultra 24
Page 15 and 16: Effect of PCI-e bandwidth Page lock
Page 17 and 18: FERMI T20 GPU
Page 19 and 20: T10 DGEMM 16 • T10 DGEMM algorith
Page 21 and 22: T20 DGEMM Performance • Inner Loo
Page 23 and 24: DGEMM Compute/COPY Analysis
Page 25 and 26: Fermi DGEMM Strategy • Slice Matr
Page 27 and 28: Additional Overlap strategy • Cop
Page 29 and 30: Optimizations - Auto Split • Keep
Page 31 and 32: DTRSM Same basic splitting approach
Page 33 and 34: DTRSM • GPU implimentation based
Page 35 and 36: DTRSM • Repeat: • DTRSM diagona
Page 37 and 38: Results on single node Supermicro 6
Page 39 and 40: Linpack Score (GFL
Page 41 and 42: Nebulae Cluster #2 on Latest Top500
Page 43 and 44: Future Challenges • Compute growi

CUDA Accelerated Linpack on Clusters - Nvidia

Create successful ePaper yourself

Delete template?

Save as template?