CUDA Accelerated Linpack on Clusters - Nvidia
T20 DGEMM Performance
• Inner Loop
— Multiply 4x4 elements per thread
— 4 loads (128-bit), 16 DFMA = 80% (16 of 20 instructions are DFMA)
• Outer Loop (16 iterations)
— 256 DFMA / 341 instructions = 75%
• CUDA code ~301 GFLOPS
— General – all configs – all sizes
• Linpack version ~360 GFLOPS
— NT config, M%64, N%64, K%16
— 360/515 = 70%, 360/386 = 93% (386 ≈ 75% of 515)