GPU Performance Analysis and Optimization - GPU Technology ...

Case Study 7: Matrix Transpose
• Staged via SMEM to coalesce GMEM addresses
  – 32x32 threadblock, double-precision values
  – 32x32 array in shared memory
• Initial implementation:
  – A warp writes a row of values to SMEM (read from GMEM)
  – A warp reads a column of values from SMEM (to be written to GMEM)
• Diagnosing:
  – 15 replays per shared memory instruction
  – Replays make up 56% of instructions issued
    • Ratio of l1_shared_bank_conflict to inst_issued
  – Code achieves only 45% of DRAM bandwidth
  – Conclusion: bank conflicts add latency and prevent GMEM instructions from executing efficiently
© 2012, NVIDIA

Case Study 7: Remedy and Results
• Remedy:
  – Simply pad each row of SMEM array with an extra element
    • 32x33 array, as opposed to 32x32
    • Effort: 1 character, literally
  – Warp access to SMEM
    • Writes still have no bank conflicts:
      – Threads access successive elements
    • Reads also have no bank conflicts:
      – Stride between threads is 33 8-byte words, thus each goes to a different bank
• Results:
  – Initial: 22.6 ms (worse than naïve with scattered GMEM access)
  – Optimized: 11.2 ms (~2x speedup)
    • 0 bank conflicts, 65% of theoretical DRAM bandwidth
