29.01.2013 Views

Tutorial CUDA



Coalescing: Timing Results

Experiment:

Kernel: read a float, increment, write back

3M floats (12 MB)

Times averaged over 10K runs

12K blocks x 256 threads:

356 µs – coalesced

357 µs – coalesced, some threads don’t participate

3,494 µs – permuted/misaligned thread access

4K blocks x 256 threads:

3,302 µs – float3 non-coalesced
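The access patterns compared above can be sketched roughly as follows. This is a minimal illustration, not the code behind the measurements: the kernel names, the stride of 33 used to scramble thread order, and the helper `run_increment_kernels` are all assumptions made for the example, and timings on current hardware will differ from the 2008 figures.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: thread i touches element i, so each warp covers one
// contiguous span of memory.
__global__ void inc_coalesced(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Permuted: threads in a block visit the block's elements in scrambled
// order. The stride 33 is an arbitrary (hypothetical) choice; it is odd,
// so with a power-of-two block size every element is still visited
// exactly once. Assumes n is a multiple of blockDim.x.
__global__ void inc_permuted(float *data, int n) {
    int t = (threadIdx.x * 33) % blockDim.x;
    int i = blockIdx.x * blockDim.x + t;
    if (i < n) data[i] += 1.0f;
}

// float3: each thread reads and writes 12 bytes, so neighbouring threads
// access addresses 12 bytes apart rather than 4.
__global__ void inc_float3(float3 *data, int n3) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n3) {
        float3 v = data[i];
        v.x += 1.0f; v.y += 1.0f; v.z += 1.0f;
        data[i] = v;
    }
}

// Hypothetical helper: runs each kernel once over the same zeroed buffer
// and returns how many elements differ from 3.0f afterwards (0 means each
// kernel incremented every element exactly once), or -1 if allocation
// fails. Assumes n is a multiple of 3 * 256.
int run_increment_kernels(int n) {
    const int threads = 256;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemset(d, 0, n * sizeof(float));  // all-zero bytes are 0.0f

    inc_coalesced<<<n / threads, threads>>>(d, n);
    inc_permuted<<<n / threads, threads>>>(d, n);
    inc_float3<<<n / 3 / threads, threads>>>(
        reinterpret_cast<float3 *>(d), n / 3);

    std::vector<float> h(n);
    // cudaMemcpy waits for the preceding kernels on the default stream.
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int bad = 0;
    for (float v : h) if (v != 3.0f) ++bad;
    return bad;
}
```

Calling `run_increment_kernels(3 * 1024 * 1024)` from a `main` function launches 12K blocks of 256 threads for the two float kernels and 4K blocks for the `float3` kernel, matching the grid sizes listed above. To reproduce a timing comparison, each launch could be bracketed with `cudaEventRecord` calls and averaged over many runs.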

Conclusion:

Coalescing greatly improves throughput: the non-coalesced variants ran roughly 9 to 10 times slower.

This is critical for small or memory-bound kernels.
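As a rough check on what these timings mean in bandwidth terms, the arithmetic can be worked through as below. Two assumptions are made here that the slide does not state explicitly: "3M floats" is read as 3 × 2^20 elements (consistent with 12K blocks of 256 threads), and each run is counted as one 4-byte read plus one 4-byte write per element.

```latex
\begin{align*}
\text{bytes moved per run} &= 2 \times (3 \times 2^{20}) \times 4\,\text{B} = 25{,}165{,}824\,\text{B} \\[4pt]
\text{coalesced:} \quad & \frac{25{,}165{,}824\,\text{B}}{356\,\mu\text{s}} \approx 70.7\,\text{GB/s} \\[4pt]
\text{permuted/misaligned:} \quad & \frac{25{,}165{,}824\,\text{B}}{3{,}494\,\mu\text{s}} \approx 7.2\,\text{GB/s} \\[4pt]
\text{float3 non-coalesced:} \quad & \frac{25{,}165{,}824\,\text{B}}{3{,}302\,\mu\text{s}} \approx 7.6\,\text{GB/s} \\[4pt]
\text{slowdown:} \quad & \frac{3{,}494}{356} \approx 9.8, \qquad \frac{3{,}302}{356} \approx 9.3
\end{align*}
```

Under these assumptions, the same kernel moves the same data at about a tenth of the effective bandwidth once accesses stop coalescing.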

© NVIDIA Corporation 2008
