30.04.2014

CUDA Accelerated Linpack on Clusters - Nvidia

T20 (Fermi) Architecture

• Increased DP Throughput

— 515 GFLOPS *PEAK* vs 85 GFLOPS on T10

— DGEMM at a theoretical 75% of peak: 515 x 0.75 = 386 GFLOPS

• Cache Hierarchy

— 64 KB configurable L1 + shared memory (48/16 or 16/48)

— 768 KB unified L2

— Load/Store architecture

• Dual Copy Engines

— Overlap both download and upload of data with compute

— Support Asynchronous 2D data transfers
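
The cache configuration and copy-engine bullets above can be illustrated with a minimal sketch. It is not taken from the Linpack code the slides describe: the kernel `scale_tile`, the tile sizes, and the two-stream layout are hypothetical stand-ins chosen for illustration. It uses only standard CUDA runtime calls (`cudaFuncSetCacheConfig`, `cudaMallocHost`, `cudaMallocPitch`, `cudaMemcpy2DAsync`, streams). Each tile is uploaded, processed, and downloaded in its own stream, so on a device with two copy engines the upload of one tile, the kernel of another, and the download of a third may run concurrently.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical stand-in for the real work (e.g. a DGEMM update):
// multiplies every element of a pitched 2D tile by alpha.
__global__ void scale_tile(double *tile, size_t pitch_bytes,
                           int rows, int cols, double alpha)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < cols) {
        double *row = (double *)((char *)tile + r * pitch_bytes);
        row[c] *= alpha;
    }
}

int main()
{
    const int rows = 1024, cols = 1024;   // assumed tile shape
    const int n_tiles = 4, n_streams = 2; // assumed pipeline depth
    const size_t row_bytes = cols * sizeof(double);
    const size_t tile_elems = (size_t)rows * cols;

    // Request the 48 KB shared / 16 KB L1 split for this kernel.
    // This is a preference; the runtime may choose otherwise.
    CHECK(cudaFuncSetCacheConfig(scale_tile, cudaFuncCachePreferShared));

    // Asynchronous copies require page-locked (pinned) host memory.
    double *h_data = NULL;
    CHECK(cudaMallocHost((void **)&h_data,
                         n_tiles * tile_elems * sizeof(double)));
    for (size_t i = 0; i < n_tiles * tile_elems; ++i) h_data[i] = 1.0;

    // One stream and one pitched device buffer per pipeline slot.
    cudaStream_t streams[n_streams];
    double *d_tile[n_streams];
    size_t pitch[n_streams];
    for (int s = 0; s < n_streams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMallocPitch((void **)&d_tile[s], &pitch[s],
                              row_bytes, rows));
    }

    dim3 block(32, 8);
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);

    // Work within one stream runs in order; work in different streams
    // may overlap, which is what lets the copy engines hide transfers.
    for (int t = 0; t < n_tiles; ++t) {
        int s = t % n_streams;
        double *h_tile = h_data + t * tile_elems;
        CHECK(cudaMemcpy2DAsync(d_tile[s], pitch[s], h_tile, row_bytes,
                                row_bytes, rows,
                                cudaMemcpyHostToDevice, streams[s]));
        scale_tile<<<grid, block, 0, streams[s]>>>(d_tile[s], pitch[s],
                                                   rows, cols, 2.0);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpy2DAsync(h_tile, row_bytes, d_tile[s], pitch[s],
                                row_bytes, rows,
                                cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());

    // Every element started at 1.0 and was scaled by 2.0 exactly once.
    int ok = 1;
    for (size_t i = 0; i < n_tiles * tile_elems; ++i)
        if (h_data[i] != 2.0) { ok = 0; break; }
    printf("%s\n", ok ? "ok" : "mismatch");

    for (int s = 0; s < n_streams; ++s) {
        CHECK(cudaFree(d_tile[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    CHECK(cudaFreeHost(h_data));
    return ok ? 0 : 1;
}
```

Whether the transfers really overlap with compute depends on the device; the `asyncEngineCount` field of `cudaDeviceProp` reports how many copy engines are available. Reusing one device buffer per stream is safe here because the next upload into that buffer is queued behind the previous download in the same stream.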
