CUDA Accelerated Linpack on Clusters - Nvidia
CUDA Accelerated Linpack on Clusters - Nvidia
CUDA Accelerated Linpack on Clusters - Nvidia
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
T20 DGEMM<br />
• 16x16 Threads update 64x64 of C<br />
• Instructi<strong>on</strong> throughput bottleneck<br />
— Maximize Density of DFMA instrucit<strong>on</strong>s<br />
— Register Blocking<br />
— Use wide vector Loads (128 bits)<br />
— DFMA dual issues with Texture<br />
• Texture fetch A and B<br />
• Hide Latencies<br />
— Double Buffer Shared Memory<br />
64<br />
A<br />
16<br />
16<br />
B<br />
C<br />
64