DirectCompute Optimizations and Best Practices - Nvidia

More documents

Recommendations

Info

What is Our Optimization Goal • We should strive to reach GPU peak performance • Choose the right metric: — GFLOP/s: for compute-bound shaders — Bandwidth: for memory-bound shaders • Reductions have very low arithmetic intensity — 1 flop per element loaded (bandwidth-optimal) • Therefore we should strive for peak bandwidth • We use a G80 GPU for this Optimization — 384-bit memory interface, 900 MHz DDR — 384 * 1800 / 8 = 86.4 GB/s — Optimization techniques are equally applicable to newer GPUs
Reduction #1: Interleaved Addressing RWStructuredBuffer g_data; #define groupDim_x 128 groupshared float sdata[groupDim_x]; [numthreads( groupDim_x, 1, 1)] void reduce1( uint3 threadIdx : SV_GroupThreadID, uint3 groupIdx : SV_GroupID) { } // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = groupIdx.x*groupDim_x + threadIdx.x; sdata[tid] = g_data[i]; GroupMemoryBarrierWithGroupSync(); // do reduction in shared mem for(unsigned int s=1; s < groupDim_x; s *= 2) { if (tid % (2*s) == 0) { sdata[tid] += sdata[tid + s]; } GroupMemoryBarrierWithGroupSync(); } // write result for this block to global mem if (tid == 0) g_data[groupIdx.x] = sdata[0];
Page 1 and 2: DirectCompute Optimizations and Bes
Page 3 and 4: Why GPUs • GPUs are throughput or
Page 5 and 6: Advantages of DirectCompute • Dir
Page 7 and 8: Parallel Execution Model Thread Gro
Page 9 and 10: Memory Coalescing • A coordinated
Page 11 and 12: Uncoalesced Access: Reading floats
Page 13 and 14: Compute 1.2+ Coalesced Access: Read
Page 15 and 16: Thread Group Shared Memory (TGSM)
Page 17 and 18: Shared Memory Bank Addressing • N
Page 19 and 20: What is Occupancy GPUs typical
Page 21 and 22: Maximizing HW Occupancy • Registe
Page 23 and 24: DirectCompute Optimization Example
Page 25 and 26: Parallel Reduction • Tree-based a
Page 27: Solution: Shader Decomposition •
Page 31 and 32: Performance for 4M element reductio
Page 33 and 34: What is Thread Divergence • Diver
Page 35 and 36: Parallel Reduction: Interleaved Add
Page 37 and 38: Thread Group Shared Memory Bank Con
Page 39 and 40: Reduction #3: Sequential Addressing
Page 41 and 42: Idle Threads Problem: for (unsigned
Page 45 and 46: Unrolling the Last Warp • As redu
Page 49 and 50: Reduction #6: Completely Unrolled #
Page 51 and 52: Reduction #7: Multiple Adds / Threa
Page 55 and 56: Time (ms) Performance Comparison 10
Page 58 and 59: Extra Slides
Page 60 and 61: Memory Coalescing (Matrix Multiply)
Page 62: Matrix Multiplication (cont.) Optim

DirectCompute Optimizations and Best Practices - Nvidia

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?