CUDA Accelerated Linpack on Clusters - Nvidia

T20 (Fermi) Architecture
• Increased DP throughput
  - 515 GFLOPS *PEAK* vs 85 GFLOPS on T10
  - Theoretical 75% DGEMM: 515 x 0.75 = 386 GFLOPS
• Cache hierarchy
  - 64 KB configurable L1 + shared memory (48/16 or 16/48)
  - 768 KB unified L2
  - Load/store architecture
• Dual copy engines
  - Overlap both download and upload of data with compute
  - Support asynchronous 2D data transfers
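The DGEMM estimate on this slide is a single multiplication. A minimal sketch of that arithmetic follows; the function name `sustained_gflops` is hypothetical, and the 75% efficiency figure is the slide's own theoretical assumption, not a measurement.

```python
def sustained_gflops(peak_gflops, efficiency):
    """Estimate sustained throughput as a fraction of the peak rate."""
    return peak_gflops * efficiency

T20_PEAK_DP = 515.0  # GFLOPS, peak double precision on T20 (from the slide)
T10_PEAK_DP = 85.0   # GFLOPS, peak double precision on T10 (from the slide)

# Theoretical DGEMM rate on T20 at 75% of peak
t20_dgemm = sustained_gflops(T20_PEAK_DP, 0.75)

# Ratio of the two peak rates, generation over generation
peak_ratio = T20_PEAK_DP / T10_PEAK_DP

print(f"T20 DGEMM estimate: {t20_dgemm:.0f} GFLOPS")
print(f"T20/T10 peak DP ratio: {peak_ratio:.1f}x")
```

The same helper can be reused with a different efficiency to see how sensitive the estimate is to that assumption.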
T10 DGEMM
• T10 DGEMM algorithm (Volkov)
  - 64 threads, 64x16 tile of C
  - B reuse in shared memory
  - A loaded from global memory
• DP limited on T10
  - DP 8x slower than other instructions
• T20 DP full speed
  - T10 DGEMM runs 175 GFLOPS
  - Different bottlenecks
[Figure: diagram of the A, B, and C tiles, labeled with dimensions 64 and 16]
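The tiling described on this slide can be emulated on the CPU to make the data reuse pattern concrete. This is a plain Python sketch, not the GPU kernel. It assumes each of the 64 "threads" owns one row of a 64x16 tile of C and that B is staged in 16x16 blocks; the 16-deep block size and all names (`tiled_dgemm`, `b_shared`, `acc`) are assumptions for illustration.

```python
TILE_M, TILE_N, TILE_K = 64, 16, 16  # 64 "threads", 64x16 tile of C, 16x16 block of B

def tiled_dgemm(A, B, M, N, K):
    """Return C = A * B for row-major nested lists.

    M, N, and K must be multiples of TILE_M, TILE_N, and TILE_K.
    """
    assert M % TILE_M == 0 and N % TILE_N == 0 and K % TILE_K == 0
    C = [[0.0] * N for _ in range(M)]
    for bm in range(0, M, TILE_M):       # one "thread block" per tile of C
        for bn in range(0, N, TILE_N):
            # acc[t] stands in for the 16 accumulators thread t keeps in registers
            acc = [[0.0] * TILE_N for _ in range(TILE_M)]
            for bk in range(0, K, TILE_K):
                # Stage a 16x16 block of B once; all 64 threads reuse it
                b_shared = [B[bk + k][bn:bn + TILE_N] for k in range(TILE_K)]
                for t in range(TILE_M):  # each thread owns one row of the tile
                    a_row = A[bm + t]
                    for k in range(TILE_K):
                        a = a_row[bk + k]  # A is read directly, with no staging
                        b_row = b_shared[k]
                        for j in range(TILE_N):
                            acc[t][j] += a * b_row[j]
            # Write the finished tile back to C
            for t in range(TILE_M):
                C[bm + t][bn:bn + TILE_N] = acc[t]
    return C
```

In this arrangement each staged element of B is used by all 64 rows of the tile, while each element of A is used for 16 outputs by a single thread, which is why only B benefits from shared memory.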