DirectCompute Optimizations and Best Practices - Nvidia

More documents

Recommendations

Info

Reduction #7: Multiple Adds / Thread Replace load <strong>and</strong> add of two elements: unsigned int tid = threadIdx.x; unsigned int i = groupIdx.x*(groupDim_x*2) + threadIdx.x; sdata[tid] = g_idata[i] + g_idata[i+groupDim_x]; GroupMemoryBarrierWithGroupSync(); With a while loop to add as many as necessary: unsigned int tid = threadIdx.x; unsigned int i = groupIdx.x*(groupDim_x*2) + threadIdx.x; unsigned int dispatchSize = groupDim_x*2*gridDim.x; sdata[tid] = 0; Note: dispatchSize loop stride to maintain coalescing! while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+groupDim_x]; i += dispatchSize; } GroupMemoryBarrierWithGroupSync();
Performance for 4M element reduction Shader 1: interleaved addressing with divergent branching Shader 2: interleaved addressing with bank conflicts Shader 3: sequential addressing Shader 4: first add during global load Shader 5: unroll last warp Shader 6: completely unrolled Shader 7: multiple elements per thread Time (2 22 ints) B<strong>and</strong>width 8.054 ms 2.083 GB/s 3.456 ms 4.854 GB/s 2.33x 2.33x 1.722 ms 9.741 GB/s 2.01x 4.68x 0.965 ms 17.377 GB/s 1.78x 8.34x 0.536 ms 31.289 GB/s 1.8x 15.01x 0.381 ms 43.996 GB/s 1.41x 21.16x 0.268 ms 62.671 GB/s 1.42x 30.04x Shader 7 on 32M elements: 73 GB/s! Step Speedup Cumulative Speedup
Page 1 and 2: DirectCompute Optimizations and Bes
Page 3 and 4: Why GPUs • GPUs are throughput or
Page 5 and 6: Advantages of DirectCompute • Dir
Page 7 and 8: Parallel Execution Model Thread Gro
Page 9 and 10: Memory Coalescing • A coordinated
Page 11 and 12: Uncoalesced Access: Reading floats
Page 13 and 14: Compute 1.2+ Coalesced Access: Read
Page 15 and 16: Thread Group Shared Memory (TGSM)
Page 17 and 18: Shared Memory Bank Addressing • N
Page 19 and 20: What is Occupancy GPUs typical
Page 21 and 22: Maximizing HW Occupancy • Registe
Page 23 and 24: DirectCompute Optimization Example
Page 25 and 26: Parallel Reduction • Tree-based a
Page 27 and 28: Solution: Shader Decomposition •
Page 29 and 30: Reduction #1: Interleaved Addressin
Page 31 and 32: Performance for 4M element reductio
Page 33 and 34: What is Thread Divergence • Diver
Page 35 and 36: Parallel Reduction: Interleaved Add
Page 37 and 38: Thread Group Shared Memory Bank Con
Page 39 and 40: Reduction #3: Sequential Addressing
Page 41 and 42: Idle Threads Problem: for (unsigned
Page 45 and 46: Unrolling the Last Warp • As redu
Page 49 and 50: Reduction #6: Completely Unrolled #
Page 51: Reduction #7: Multiple Adds / Threa
Page 55 and 56: Time (ms) Performance Comparison 10
Page 58 and 59: Extra Slides
Page 60 and 61: Memory Coalescing (Matrix Multiply)
Page 62: Matrix Multiplication (cont.) Optim

DirectCompute Optimizations and Best Practices - Nvidia

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?