30.04.2014 Views

CUDA Accelerated Linpack on Clusters - Nvidia


T10 DGEMM

• T10 DGEMM algorithm (Volkov)
  — 64 threads compute a 64x16 tile of C
  — B reused from shared memory
  — A loaded from global memory
  — DP limited on T10
    • DP 8x slower than other instructions
    • T20 DP full speed
  — T10 DGEMM runs 175 GFLOPS
  — Different bottlenecks
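The tiling in the bullets above can be sketched as a small CUDA kernel. This is a minimal illustration, not the kernel from the presentation or Volkov's actual code: the names `dgemm_tile_sketch` and `max_error_of_sketch` are hypothetical, the matrices are assumed to be column-major, the matrix sizes are assumed to be exact multiples of the tile sizes, and the depth of the shared B tile (16) is an assumption. It computes C = A * B only and omits CUDA error checking.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

#define TILE_M 64   // rows of the C tile: one row per thread, 64 threads per block
#define TILE_N 16   // columns of the C tile
#define TILE_K 16   // depth of the B tile kept in shared memory (assumed)

// Column-major C (m x n) = A (m x k) * B (k x n).
// Assumes m % 64 == 0, n % 16 == 0, k % 16 == 0.
__global__ void dgemm_tile_sketch(int m, int n, int k,
                                  const double *A, const double *B, double *C)
{
    __shared__ double Bs[TILE_K][TILE_N];       // B tile shared by all 64 threads

    const int tid  = threadIdx.x;               // 0..63
    const int row  = blockIdx.x * TILE_M + tid; // row of C owned by this thread
    const int col0 = blockIdx.y * TILE_N;       // first column of this C tile

    double acc[TILE_N];                         // 16 accumulators per thread
    for (int j = 0; j < TILE_N; ++j) acc[j] = 0.0;

    for (int kk = 0; kk < k; kk += TILE_K) {
        // The 64 threads cooperatively load 16x16 = 256 entries of B, 4 each.
        for (int e = tid; e < TILE_K * TILE_N; e += TILE_M) {
            int bk = e % TILE_K;                // row inside the B tile
            int bj = e / TILE_K;                // column inside the B tile
            Bs[bk][bj] = B[(kk + bk) + (size_t)(col0 + bj) * k];
        }
        __syncthreads();

        for (int p = 0; p < TILE_K; ++p) {
            // A comes straight from global memory; consecutive threads read
            // consecutive rows, so the loads are contiguous in column-major A.
            double a = A[row + (size_t)(kk + p) * m];
            for (int j = 0; j < TILE_N; ++j)
                acc[j] += a * Bs[p][j];         // each B entry is reused by 64 threads
        }
        __syncthreads();                        // finish reads before the next tile load
    }

    for (int j = 0; j < TILE_N; ++j)
        C[row + (size_t)(col0 + j) * m] = acc[j];
}

// Hypothetical host helper: runs the kernel and returns the largest absolute
// difference from a plain CPU triple loop.
double max_error_of_sketch(int m, int n, int k)
{
    assert(m % TILE_M == 0 && n % TILE_N == 0 && k % TILE_K == 0);
    std::vector<double> hA((size_t)m * k), hB((size_t)k * n);
    std::vector<double> hC((size_t)m * n), ref((size_t)m * n, 0.0);
    for (size_t i = 0; i < hA.size(); ++i) hA[i] = (double)(i % 7) * 0.25;
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = (double)(i % 5) - 2.0;

    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                ref[i + (size_t)j * m] += hA[i + (size_t)p * m] * hB[p + (size_t)j * k];

    double *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(double));
    cudaMalloc(&dB, hB.size() * sizeof(double));
    cudaMalloc(&dC, hC.size() * sizeof(double));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(double), cudaMemcpyHostToDevice);

    dim3 grid(m / TILE_M, n / TILE_N);          // one block per 64x16 tile of C
    dgemm_tile_sketch<<<grid, TILE_M>>>(m, n, k, dA, dB, dC);
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    double err = 0.0;
    for (size_t i = 0; i < hC.size(); ++i) err = fmax(err, fabs(hC[i] - ref[i]));
    return err;
}
```

Calling `max_error_of_sketch(128, 32, 48)` from a `main` on a machine with a double-precision-capable GPU should return an error close to zero. Each thread keeps its 16 partial sums in registers, so per inner step it issues one global load of A and 16 multiply-adds against shared B values, which is why double-precision throughput rather than memory traffic can become the limit.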

[Figure: tile diagram of matrices A, B, and C, with dimensions labeled 64, 16, and 16]
