30.04.2014 Views

CUDA Accelerated Linpack on Clusters - Nvidia

CUDA Accelerated Linpack on Clusters - Nvidia

CUDA Accelerated Linpack on Clusters - Nvidia

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

T20 DGEMM<br />

• 16x16 Threads update 64x64 of C<br />

• Instructi<strong>on</strong> throughput bottleneck<br />

— Maximize Density of DFMA instrucit<strong>on</strong>s<br />

— Register Blocking<br />

— Use wide vector Loads (128 bits)<br />

— DFMA dual issues with Texture<br />

• Texture fetch A and B<br />

• Hide Latencies<br />

— Double Buffer Shared Memory<br />

64<br />

A<br />

16<br />

16<br />

B<br />

C<br />

64

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!