12.07.2015 Views

GPU Performance Analysis and Optimization - GPU Technology ...

GPU Performance Analysis and Optimization - GPU Technology ...

GPU Performance Analysis and Optimization - GPU Technology ...

SHOW MORE
SHOW LESS
  • No tags were found...

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Summary: Shared Memory• Shared memory is a tremendous resource– Very high b<strong>and</strong>width (terabytes per second)– 20-30x lower latency than accessing GMEM– Data is programmer-managed, no evictions by hardware• <strong>Performance</strong> issues to look out for:– Bank conflicts add latency <strong>and</strong> reduce throughput– Many-way bank conflicts can be very expensive• Replay latency adds up• However, few code patterns have high conflicts, padding is a very simple <strong>and</strong> effective solution• Kepler has 2x SMEM throughput compared to Fermi:– SMEM throughput is doubled by increasing bank width to 8 bytes– Kernels with 8-byte words will benefit without changing kernel code• Put <strong>GPU</strong> into 8-byte bank mode with cudaSetSharedMemConfig() call– Kernels with smaller words will benefit if words are grouped into 8-byte structures© 2012, NVIDIA96

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!