CUDA Accelerated Linpack on Clusters - Nvidia

T20 DGEMM
• 16x16 threads update a 64x64 tile of C
• Instruction throughput bottleneck
  - Maximize density of DFMA instructions
  - Register blocking
  - Use wide vector loads (128 bits)
  - DFMA dual-issues with texture
• Texture fetch A and B
• Hide latencies
  - Double-buffer shared memory
[Figure: tiling diagram of A, B, and C, with edges labeled 64 and 16]

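
The register-blocking layout on this slide can be sketched on the CPU. The Python below is only an illustration of the index arithmetic, not the GPU kernel: it assumes each of the 16x16 threads owns a contiguous 4x4 block of the 64x64 C tile, and that A and B are consumed in slices 16 deep (the name `K_STEP` and the function `tile_update` are hypothetical). Under those assumptions, each step along K needs 4 values of A and 4 of B, which is 8 doubles or four 128-bit loads, feeding 16 multiply-adds.

```python
TILE = 64                  # edge of the C tile owned by one thread block (slide)
THREADS = 16               # threads per block edge, 16x16 in total (slide)
REG = TILE // THREADS      # 4: each thread keeps a 4x4 block of C in registers
K_STEP = 16                # assumed depth of the A and B slices staged per pass


def tile_update(A, B, C, K):
    """C (64x64) += A (64xK) * B (Kx64), looped the way a 16x16 block would.

    Returns the number of multiply-adds issued by a single emulated thread.
    """
    assert K % K_STEP == 0
    dfma_per_thread = 0
    for ty in range(THREADS):
        for tx in range(THREADS):
            # Per-thread accumulators, standing in for registers.
            acc = [[0.0] * REG for _ in range(REG)]
            dfma = 0
            for k0 in range(0, K, K_STEP):          # one staged slice per pass
                for k in range(k0, k0 + K_STEP):
                    # 4 values of A and 4 of B: 8 doubles per k step.
                    a = [A[ty * REG + i][k] for i in range(REG)]
                    b = [B[k][tx * REG + j] for j in range(REG)]
                    for i in range(REG):
                        for j in range(REG):
                            acc[i][j] += a[i] * b[j]  # one multiply-add
                            dfma += 1
            # Write the 4x4 register block back to the C tile.
            for i in range(REG):
                for j in range(REG):
                    C[ty * REG + i][tx * REG + j] += acc[i][j]
            dfma_per_thread = dfma
    return dfma_per_thread
```

Calling `tile_update` with `K = 16` makes each emulated thread issue 16 x 16 = 256 multiply-adds, which lines up with the 256 DFMA quoted for the 16-iteration outer loop.
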
T20 DGEMM Performance
• Inner loop
  - Multiply 4x4 elements per thread
  - 4 loads (128-bit), 16 DFMA = 80%
• Outer loop (16 iterations)
  - 256 DFMA / 341 instructions = 75%
• CUDA code ~301 GFLOPS
  - General: all configs, all sizes
• Linpack version ~360 GFLOPS
  - NT config, M%64, N%64, K%16
  - 360/515 = 70%, 360/386 = 93%
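
The percentages on this slide can be reproduced with a few lines of arithmetic. The sketch below makes two assumptions that the slide does not state outright: that 515 GFLOPS is the T20 double-precision peak, and that 386 GFLOPS is that peak scaled by the 75% DFMA density of the outer loop. The variable names are illustrative only.

```python
PEAK_GFLOPS = 515.0      # assumed: T20 double-precision peak (slide denominator)
LINPACK_GFLOPS = 360.0   # Linpack-specific DGEMM rate quoted on the slide

# Inner loop: 16 DFMA out of 16 DFMA + 4 loads.
inner_density = 16 / (16 + 4)

# Outer loop: 16 inner iterations give 256 DFMA out of 341 instructions.
outer_density = 256 / 341

# Instructions in the outer loop that are neither DFMA nor inner-loop loads.
overhead_instructions = 341 - 16 * (16 + 4)

# Assumed reading of the 386 figure: peak scaled by the rounded 75% density.
instruction_bound = 0.75 * PEAK_GFLOPS

print(f"inner loop DFMA density: {inner_density:.0%}")
print(f"outer loop DFMA density: {outer_density:.0%}")
print(f"overhead instructions:   {overhead_instructions}")
print(f"instruction-limited rate: {instruction_bound:.0f} GFLOPS")
print(f"fraction of peak:        {LINPACK_GFLOPS / PEAK_GFLOPS:.0%}")
print(f"fraction of bound:       {LINPACK_GFLOPS / instruction_bound:.0%}")
```

Read this way, the Linpack kernel runs at about 70% of the hardware peak but at about 93% of what the instruction mix allows, which is consistent with the slide calling instruction throughput the bottleneck.
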
Page 1 and 2: CUDA Accel
Page 3 and 4: LINPACK Benchmark The LINPACK bench
Page 5 and 6: LINPACK Benchmark Solve a dense NxN
Page 7 and 8: NVIDIA Tesla GPU Computing Products
Page 9 and 10: Overlap DGEMM on CPU and GPU // Cop
Page 11 and 12: DGEMM performance on GPU (T10) A DG
Page 13 and 14: Results on workstation SUN Ultra 24
Page 15 and 16: Effect of PCI-e bandwidth Page lock
Page 17 and 18: FERMI T20 GPU
Page 19: T10 DGEMM 16 • T10 DGEMM algorith
Page 23 and 24: DGEMM Compute/COPY Analysis
Page 25 and 26: Fermi DGEMM Strategy • Slice Matr
Page 27 and 28: Additional Overlap strategy • Cop
Page 29 and 30: Optimizations - Auto Split • Keep
Page 31 and 32: DTRSM Same basic splitting approach
Page 33 and 34: DTRSM • GPU implementation based
Page 35 and 36: DTRSM • Repeat: • DTRSM diagona
Page 37 and 38: Results on single node Supermicro 6
Page 39 and 40: Linpack Score (GFL
Page 41 and 42: Nebulae Cluster #2 on Latest Top500
Page 43 and 44: Future Challenges • Compute growi
