DirectCompute Optimizations and Best Practices - Nvidia
Problem: Global Synchronization
• If we could synchronize across all thread groups, we could run reduce on a very large array
— A global sync after each group produces its result
— Once all groups reach the sync, continue recursively
• But GPUs have no global synchronization. Why?
— Expensive to build in hardware for GPUs with a high processor count
— Would force the programmer to run fewer groups (no more than # multiprocessors × # resident groups per multiprocessor) to avoid deadlock, which may reduce overall efficiency
• Solution: decompose into multiple shader dispatches
— A Dispatch() call serves as a global synchronization point
— Dispatch() has negligible HW overhead and low SW overhead
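
The multi-dispatch idea above can be sketched on the CPU. This is not DirectCompute code and not the slide author's implementation. It is a minimal C++ simulation in which each call to the hypothetical `reduce_pass` stands in for one Dispatch(), and returning from that call plays the role of the global synchronization point. The names `kGroupSize`, `reduce_pass`, and `reduce_multi_pass` are assumptions made for illustration, and the group size of 256 is arbitrary.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed thread-group size; a real shader would fix this in [numthreads].
constexpr std::size_t kGroupSize = 256;

// Simulates one dispatch: every "thread group" g reduces its own slice of
// the input to a single partial sum. Groups never communicate with each
// other inside a pass, so no global synchronization is needed here.
std::vector<long long> reduce_pass(const std::vector<long long>& in) {
    const std::size_t groups = (in.size() + kGroupSize - 1) / kGroupSize;
    std::vector<long long> out(groups, 0);
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t begin = g * kGroupSize;
        const std::size_t end = std::min(begin + kGroupSize, in.size());
        for (std::size_t i = begin; i < end; ++i) {
            out[g] += in[i];
        }
    }
    return out;
}

// Repeats passes until one value remains. The boundary between two passes
// is where all groups are known to have finished, which is the role a
// dispatch boundary plays on the GPU.
long long reduce_multi_pass(std::vector<long long> data, int* passes = nullptr) {
    int count = 0;
    if (data.empty()) {
        if (passes) *passes = 0;
        return 0;
    }
    while (data.size() > 1) {
        data = reduce_pass(data);  // one simulated dispatch
        ++count;
    }
    if (passes) *passes = count;
    return data[0];
}
```

With this group size, an input of 1,000,000 elements shrinks to 3,907 partial sums, then 16, then 1, so three passes suffice. Each pass divides the element count by the group size, so the number of dispatches grows only logarithmically with the array length.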