Memory Bandwidth Limited Kernels - GPU Technology Conference

More documents

Recommendations

Info

© NVIDIA Corporation 2011Summary
Summary• Strive for perfect coalescing• Align starting address (may require padding)• A warp should access within a contiguous region• Have enough concurrent accesses to saturate the bus• Process several elements per thread- Multiple loads get pipelined- Indexing calculations can often be reused• Launch enough threads to maximize throughput- Latency is hidden by switching threads (warps)• Try L1 and caching configurations to see which one works best• Caching vs non-caching loads (compiler option)• 16KB vs 48KB L1 (CUDA call)- Sometimes using shared memory or the texture / constant cache is the bestchoice© NVIDIA Corporation 2011
Page 1 and 2: CUDA Optimization:Memory Bandwidth
Page 3: © NVIDIA Corporation 2011Launch Co
Page 7 and 8: © NVIDIA Corporation 2011Memory Ac
Page 9 and 10: Load Operation• Memory operations
Page 11 and 12: Non-caching Load• Warp requests 3
Page 17 and 18: Non-caching Load• All threads in
Page 21 and 22: REFERENCESr ,1.2.3.4.5.6.7.8.9.10.1
Page 23: Additional “memories”• Textur
Page 27: Questions?

Memory Bandwidth Limited Kernels - GPU Technology Conference

Create successful ePaper yourself

Delete template?

Save as template?