CUDA Accelerated Linpack on Clusters - Nvidia
T10 DGEMM
• T10 DGEMM algorithm (Volkov)
— 64 threads compute a 64x16 tile of C
— B tile reused from shared memory
— A loaded from global memory
— Double precision (DP) limited on T10
• DP instructions are 8x slower than other instructions
• T20 runs DP at full speed
— The T10 DGEMM kernel runs at 175 GFLOPS on T20
— Different bottlenecks
[Figure: DGEMM tiling diagram showing matrices A, B, and C, with dimension labels 64, 16, and 16.]
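
The tiling on this slide can be sketched as a CPU emulation in plain Python. This is only an illustration of the blocking scheme, not the actual CUDA kernel: the names `tiled_dgemm`, `naive_dgemm`, `TILE_M`, `TILE_N`, and `TILE_K` are hypothetical, the 16-deep B tile is inferred from the figure labels, and the mapping of one thread to one row of the C tile is an assumption. Matrix sizes are assumed to be exact multiples of the tile sizes so that no edge handling is needed.

```python
# Hypothetical CPU emulation of the tiling described on the slide.
# All names here are illustrative; none come from the original kernel.

TILE_M = 64  # rows of C per thread block (assumed: one row per thread, 64 threads)
TILE_N = 16  # columns of C per thread block
TILE_K = 16  # depth of the B tile staged in "shared memory" (assumed from the figure)


def tiled_dgemm(A, B, m, n, k):
    """Return C = A * B for row-major lists of lists.

    Assumes m, n, k are multiples of TILE_M, TILE_N, TILE_K respectively.
    """
    assert m % TILE_M == 0 and n % TILE_N == 0 and k % TILE_K == 0
    C = [[0.0] * n for _ in range(m)]

    # Each (bi, bj) pair plays the role of one thread block.
    for bi in range(0, m, TILE_M):
        for bj in range(0, n, TILE_N):
            # Per-thread accumulators: thread t holds one 1x16 row of the C tile.
            acc = [[0.0] * TILE_N for _ in range(TILE_M)]

            for k0 in range(0, k, TILE_K):
                # Stage a 16x16 tile of B once per block (stands in for shared memory).
                b_tile = [B[k0 + kk][bj:bj + TILE_N] for kk in range(TILE_K)]

                # Each of the 64 "threads" streams its own row of A
                # (stands in for global memory) and reuses the staged B tile.
                for t in range(TILE_M):
                    row = acc[t]
                    a_row = A[bi + t]
                    for kk in range(TILE_K):
                        a = a_row[k0 + kk]  # one A element feeds 16 multiply-adds
                        b_row = b_tile[kk]
                        for j in range(TILE_N):
                            row[j] += a * b_row[j]

            # Write the finished 64x16 tile back to C.
            for t in range(TILE_M):
                C[bi + t][bj:bj + TILE_N] = acc[t]
    return C


def naive_dgemm(A, B, m, n, k):
    """Reference triple loop, used only to check the tiled version."""
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]
```

In this sketch each staged B tile is read by all 64 emulated threads, while every A element is read by exactly one thread and used for 16 multiply-adds. That matches the slide's split between B reuse in shared memory and A loaded from global memory.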