30.04.2014 Views

CUDA Accelerated Linpack on Clusters - Nvidia


T20 DGEMM Performance

• Inner loop
— Multiply 4x4 elements per thread
— 4 loads (128-bit) + 16 DFMA: 16 of 20 instructions are DFMA = 80%
• Outer loop (16 iterations)
— 256 DFMA / 341 instructions = 75%
• CUDA code ~301 GFLOPS
— General: all configs, all sizes
• Linpack version ~360 GFLOPS
— NT config, M%64, N%64, K%16
— 360/515 = 70%, 360/386 = 93%
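
The percentages on this slide are plain instruction-count and throughput ratios, and a short script makes the arithmetic easy to check. The sketch below only re-derives the quoted numbers. It assumes that 515 GFLOPS is the peak double-precision rate the slide compares against, and that 386 is 75% of that peak (the outer-loop DFMA fraction); neither assumption is spelled out in the source, and all variable names are illustrative.

```python
# Figures quoted on the slide.
LOADS_PER_INNER = 4          # 128-bit loads per inner-loop pass
DFMA_PER_INNER = 16          # one DFMA per element of the 4x4 tile
OUTER_ITERATIONS = 16
OUTER_DFMA = 256             # DFMA instructions over the outer loop
OUTER_INSTRUCTIONS = 341     # all instructions over the outer loop
PEAK_GFLOPS = 515.0          # assumption: peak double-precision rate
LINPACK_DGEMM_GFLOPS = 360.0

# Inner loop: fraction of issued instructions that are DFMA.
inner_eff = DFMA_PER_INNER / (DFMA_PER_INNER + LOADS_PER_INNER)

# Outer loop: 16 iterations x 16 DFMA = 256 DFMA out of 341 instructions.
outer_dfma_check = OUTER_ITERATIONS * DFMA_PER_INNER
outer_eff = OUTER_DFMA / OUTER_INSTRUCTIONS

# Assumption: the 386 GFLOPS bound is peak scaled by the rounded 75%.
kernel_bound = PEAK_GFLOPS * 0.75

vs_peak = LINPACK_DGEMM_GFLOPS / PEAK_GFLOPS
vs_bound = LINPACK_DGEMM_GFLOPS / 386.0

print(f"inner loop DFMA fraction: {inner_eff:.0%}")
print(f"outer loop DFMA fraction: {outer_eff:.1%}")
print(f"instruction-mix bound:    {kernel_bound:.2f} GFLOPS")
print(f"Linpack DGEMM vs peak:    {vs_peak:.1%}")
print(f"Linpack DGEMM vs bound:   {vs_bound:.1%}")
```

Read this way, the specialized Linpack kernel runs at roughly 70% of peak but about 93% of what the instruction mix allows, so most of the remaining gap comes from non-DFMA instructions rather than from stalls.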
