DirectCompute Optimizations and Best Practices - Nvidia

Final Optimized Compute Shader

    // groupDim_x (a compile-time constant), g_idata and g_odata are not declared on this slide.
    cbuffer consts {
        uint n;              // total number of input elements
        uint dispatchDim_x;  // number of thread groups dispatched
    };

    groupshared float sdata[groupDim_x];

    [numthreads(groupDim_x, 1, 1)]
    void reduce6(uint tid : SV_GroupIndex, uint3 groupIdx : SV_GroupID)
    {
        uint i = groupIdx.x * (groupDim_x * 2) + tid;
        uint dispatchSize = groupDim_x * 2 * dispatchDim_x;
        sdata[tid] = 0;

        // Each thread sums multiple elements, two per iteration, striding by the whole dispatch.
        do {
            sdata[tid] += g_idata[i] + g_idata[i + groupDim_x];
            i += dispatchSize;
        } while (i < n);
        GroupMemoryBarrierWithGroupSync();

        // Completely unrolled tree reduction in shared memory.
        if (groupDim_x >= 512) {
            if (tid < 256) { sdata[tid] += sdata[tid + 256]; }
            GroupMemoryBarrierWithGroupSync();
        }
        if (groupDim_x >= 256) {
            if (tid < 128) { sdata[tid] += sdata[tid + 128]; }
            GroupMemoryBarrierWithGroupSync();
        }
        if (groupDim_x >= 128) {
            if (tid < 64) { sdata[tid] += sdata[tid + 64]; }
            GroupMemoryBarrierWithGroupSync();
        }

        // Last warp: no barriers between steps.
        if (tid < 32) {
            if (groupDim_x >= 64) sdata[tid] += sdata[tid + 32];
            if (groupDim_x >= 32) sdata[tid] += sdata[tid + 16];
            if (groupDim_x >= 16) sdata[tid] += sdata[tid + 8];
            if (groupDim_x >= 8)  sdata[tid] += sdata[tid + 4];
            if (groupDim_x >= 4)  sdata[tid] += sdata[tid + 2];
            if (groupDim_x >= 2)  sdata[tid] += sdata[tid + 1];
        }

        // Thread 0 writes this group's partial sum.
        if (tid == 0) g_odata[groupIdx.x] = sdata[0];
    }
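To check the shader's indexing, it can help to replay one dispatch on the CPU. The following C++ sketch is an illustration only, not part of the original deck: `reduce_dispatch_cpu` is a hypothetical name, out-of-range reads are assumed to return 0, and the unrolled tree is written as a loop. It produces one partial sum per thread group, as the shader writes to g_odata.

```cpp
#include <cstddef>
#include <vector>

// CPU replay of one dispatch of the reduction shader (illustrative sketch).
// groupDim: threads per group (power of two); dispatchDim: number of groups.
// Returns one partial sum per group.
std::vector<float> reduce_dispatch_cpu(const std::vector<float>& in,
                                       unsigned groupDim,
                                       unsigned dispatchDim) {
    const std::size_t n = in.size();
    // Assumption: reads past the end of the input yield 0.
    auto load = [&](std::size_t i) { return i < n ? in[i] : 0.0f; };
    const std::size_t dispatchSize = std::size_t(groupDim) * 2 * dispatchDim;

    std::vector<float> out(dispatchDim, 0.0f);
    std::vector<float> sdata(groupDim);
    for (unsigned g = 0; g < dispatchDim; ++g) {
        // Phase 1: each "thread" sums two elements per iteration,
        // then strides by the size of the whole dispatch.
        for (unsigned tid = 0; tid < groupDim; ++tid) {
            std::size_t i = std::size_t(g) * groupDim * 2 + tid;
            float acc = 0.0f;
            do {
                acc += load(i) + load(i + groupDim);
                i += dispatchSize;
            } while (i < n);
            sdata[tid] = acc;
        }
        // Phase 2: tree reduction with sequential addressing
        // (the shader unrolls this loop by hand).
        for (unsigned s = groupDim / 2; s > 0; s >>= 1) {
            for (unsigned tid = 0; tid < s; ++tid) {
                sdata[tid] += sdata[tid + s];
            }
        }
        out[g] = sdata[0];  // thread 0 writes the group's result
    }
    return out;
}
```

The partial sums still have to be combined, for example by a second dispatch with a single group or by a short loop on the host.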
Performance Comparison

[Figure: time (ms, log scale from 0.01 to 10) versus number of elements (131072 to 33554432) for seven kernel versions]

1: Interleaved Addressing: Divergent Branches
2: Interleaved Addressing: Bank Conflicts
3: Sequential Addressing
4: First add during global load
5: Unroll last warp
6: Completely unroll
7: Multiple elements per thread (max 64 blocks)
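Kernel 7 caps the dispatch at 64 groups, so the work per thread grows with the input size. A rough C++ sketch of that sizing arithmetic follows. `ReductionPlan` and `plan_reduction` are hypothetical names, and the group size is a parameter because the slide does not state one.

```cpp
#include <algorithm>
#include <cstdint>

struct ReductionPlan {
    std::uint32_t dispatchDim_x;      // number of thread groups to dispatch
    std::uint32_t elementsPerThread;  // elements each thread sums before the tree step
};

// Hypothetical helper: choose a group count capped at maxGroups.
ReductionPlan plan_reduction(std::uint32_t n, std::uint32_t groupDim_x,
                             std::uint32_t maxGroups = 64) {
    // One loop iteration of a group consumes 2 * groupDim_x elements.
    const std::uint32_t perGroupPass = groupDim_x * 2;
    const std::uint32_t needed = (n + perGroupPass - 1) / perGroupPass;
    const std::uint32_t groups =
        std::max<std::uint32_t>(1, std::min(needed, maxGroups));
    const std::uint32_t threads = groups * groupDim_x;
    return {groups, (n + threads - 1) / threads};
}
```

Assuming 128 threads per group, a 4194304-element input gives 64 groups and 512 elements per thread, while a 1024-element input needs only 4 groups with 2 elements per thread.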
