DirectCompute Optimizations and Best Practices - Nvidia

Problem: Global Synchronization
• If we could synchronize across all thread groups, we could run the reduction on a very large array
  - A global sync after each group produces its result
  - Once all groups reach the sync, continue recursively
• But GPUs have no global synchronization. Why?
  - Expensive to build in hardware for GPUs with a high processor count
  - Would force the programmer to run fewer groups (no more than # multiprocessors * # resident groups per multiprocessor) to avoid deadlock, which may reduce overall efficiency
• Solution: decompose into multiple shader dispatches
  - A Dispatch() call serves as a global synchronization point
  - Dispatch() has negligible HW overhead, low SW overhead
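
Because each Dispatch() acts as the global sync point, the number of dispatches a reduction needs follows from how many elements one thread group consumes. The Python sketch below only does that bookkeeping on the CPU. The function name dispatch_plan and the parameter elements_per_group are hypothetical, chosen for illustration, and the group size of 512 in the usage note is an assumption, not a figure from the slides.

```python
def dispatch_plan(num_elements, elements_per_group):
    """Return the number of thread groups launched at each dispatch level.

    Hypothetical helper: each group is assumed to reduce up to
    `elements_per_group` inputs to one partial result, and the next
    dispatch reduces those partial results.
    """
    if num_elements < 1:
        raise ValueError("num_elements must be at least 1")
    if elements_per_group < 2:
        raise ValueError("elements_per_group must be at least 2")

    plan = []
    remaining = num_elements
    while remaining > 1:
        # Ceiling division: a partially filled last group still needs a launch.
        groups = -(-remaining // elements_per_group)
        plan.append(groups)
        remaining = groups  # one partial result per group feeds the next level
    return plan
```

For example, dispatch_plan(64, 8) gives two levels with 8 groups and then 1 group, and a 4M-element input (4 * 1024 * 1024) with an assumed 512 elements per group needs three dispatches of 8192, 16, and 1 groups.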

Solution: Shader Decomposition
• Avoid global sync by decomposing computation into multiple dispatches
[Figure: each block performs the tree reduction 3 1 7 0 4 1 6 3 -> 4 7 5 9 -> 11 14 -> 25. Level 0: 8 blocks. Level 1: 1 block.]
• In the case of reductions, code for all levels is the same
  - Implement with recursive dispatches
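
The recursive-dispatch idea can be sketched on the CPU as follows. This is a minimal Python simulation, not DirectCompute code: reduce_group stands in for one thread group's tree reduction, dispatch stands in for one Dispatch() call, and all three function names and the default group size of 8 are assumptions made for illustration. The same "shader" (reduce_group) is reused at every level, and the return from each dispatch plays the role of the global synchronization point.

```python
def reduce_group(values):
    """Stand-in for one thread group: pairwise tree reduction of its slice."""
    data = list(values)
    while len(data) > 1:
        if len(data) % 2:
            data.append(0)  # pad odd-length levels with the additive identity
        # Each step halves the number of active elements, as in the tree figure.
        data = [data[i] + data[i + 1] for i in range(0, len(data), 2)]
    return data[0]


def dispatch(buffer, group_size):
    """Stand-in for one Dispatch(): every group writes one partial sum."""
    return [reduce_group(buffer[i:i + group_size])
            for i in range(0, len(buffer), group_size)]


def reduce_recursive(buffer, group_size=8):
    """Re-dispatch the same reduction until a single value remains.

    Returns (total, number_of_dispatches).
    """
    if not buffer:
        raise ValueError("buffer must not be empty")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    levels = 0
    while len(buffer) > 1:
        buffer = dispatch(buffer, group_size)  # implicit global sync on return
        levels += 1
    return buffer[0], levels
```

With the eight-element block from the figure repeated eight times, reduce_recursive([3, 1, 7, 0, 4, 1, 6, 3] * 8) performs two dispatches: level 0 yields eight partial sums of 25, and level 1 combines them into 200.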