Slides - Par Lab

More documents

Recommendations

Info

Imperatives for Efficient CUDA Code ¡ Expose abundant fine-‐grained parallelism § need 1000’s of threads for full utilization ¡ Maximize on-‐chip work § on-‐chip memory orders of magnitude faster ¡ Minimize execution divergence § SIMT execution of threads in 32-‐thread warps ¡ Minimize memory divergence § warp loads and consumes complete 128-‐byte cache line 30/73
Mapping CUDA to Nvidia GPUs ¡ CUDA is designed to be functionally forgiving § First priority: make things work. Second: get performance. ¡ However, to get good performance, one must understand how CUDA is mapped to Nvidia GPUs ¡ Threads: each thread is a SIMD vector lane ¡ Warps: A SIMD instruction acts on a “warp” § Warp width is 32 elements: LOGICAL SIMD width ¡ Thread blocks: Each thread block is scheduled onto an SM § Peak efficiency requires multiple thread blocks per SM 31/73
Page 1: An Introduction to CUDA/OpenCL a
Page 13 and 14: What is a CUDA Thread Block?
Page 15 and 16: Synchronization ¡ Threads within
Page 17 and 18: Scalability ¡ Manycore chips exi
Page 19 and 20: Memory model Thread Per-‐threa
Page 24 and 25: Using per-‐block shared memo
Page 26 and 27: CUDA: Runtime support ¡Explicit
Page 28 and 29: CUDA and OpenCL correspondence
Page 32 and 33: Mapping CUDA to a GPU, continu
Page 34 and 35: Profiling NVIDIA Visual Profiler©
Page 36 and 37: Memory, Memory, Memory ! Chapter
Page 38 and 39: Coalescing ¡ GPUs and CPUs both
Page 40 and 41: SoA, AoS ¡ Different data acces
Page 42 and 43: Efficiency and Productivity" Parall
Page 44 and 45: Abstraction, cont." Reduction is on
Page 46 and 47: Diving In#include #include #include
Page 48 and 49: Containers" Concise and readable co
Page 50 and 51: Iterators" Iterators act like point
Page 52 and 53: Algorithms" Elementwise operations"
Page 54 and 55: Algorithms" Standard data types// a
Page 56 and 57: Custom Types & Operators// compare
Page 58 and 59: Recap" Containers manage memory" He
Page 60 and 61: Explicit versus implicit parallelis
Page 62 and 63: Explicit versus implicit parallelis
Page 64 and 65: Productivity Implications" Consider
Page 66 and 67: Productivity Implications" Consider
Page 68 and 69: Leveraging Parallel Primitives" Use
Page 70 and 71: Leveraging Parallel Primitives" Com
Page 72 and 73: Summary ¡ Throughput optimized p

Slides - Par Lab

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?