12.07.2015 Views

GPU Performance Analysis and Optimization - GPU Technology ...




Case Study 7: Matrix Transpose

• Staged via SMEM to coalesce GMEM addresses
  – 32x32 threadblock, double-precision values
  – 32x32 array in shared memory
• Initial implementation:
  – A warp writes a row of values to SMEM (read from GMEM)
  – A warp reads a column of values from SMEM (to be written to GMEM)
• Diagnosing:
  – 15 replays per shared memory instruction
  – Replays make up 56% of instructions issued
    • Ratio of l1_shared_bank_conflict to inst_issued
  – Code achieves only 45% of DRAM bandwidth
  – Conclusion: bank conflicts add latency and prevent GMEM instructions from executing efficiently

© 2012, NVIDIA
