GPU Performance Analysis and Optimization - GPU Technology ...

Memory Hierarchy Review

• Registers
  – Storage local to each thread
  – Compiler-managed

• Shared memory / L1
  – 64 KB, program-configurable into a shared:L1 split
  – Program-managed
  – Accessible by all threads in the same threadblock
  – Low latency, high bandwidth: 1.5-2 TB/s on Kepler GK104

• Read-only cache
  – Up to 48 KB per Kepler SM
  – Hardware-managed (also used by texture units)
  – Used for read-only GMEM accesses (not coherent with writes)

• L2
  – Up to 512 KB on Kepler GK104, 1.5 MB on Kepler GK110 (768 KB on Fermi)
  – Hardware-managed: all accesses to global memory go through L2, including accesses from the CPU and peer GPUs

• Global memory
  – Accessible by all threads, the host (CPU), and other GPUs in the same system
  – Higher latency (400-800 cycles)
  – Tesla K10 bandwidth: 2x160 GB/s (2 chips on a board)

© 2012, NVIDIA
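The sketch below is not from the slides; it is a minimal CUDA example of the two program-managed points above: staging global-memory data in shared memory so a threadblock can reuse it at low latency, and requesting the larger shared-memory side of the configurable shared:L1 split via the standard runtime call cudaFuncSetCacheConfig. The kernel name, tile size, and averaging computation are illustrative assumptions; the "const __restrict__" qualifier is only a hint that may let the compiler route loads through the read-only cache on GPUs that support it (e.g. GK110).

#include <cuda_runtime.h>

// Illustrative kernel (assumed name): stage one tile of global memory in
// shared memory, then let neighboring reads hit shared memory instead of GMEM.
__global__ void smemExample(const float* __restrict__ in, float* out, int n)
{
    __shared__ float tile[256];                       // program-managed shared memory

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;   // one global load per thread
    __syncthreads();                                  // tile now visible to the whole threadblock

    if (gid < n) {
        // Neighbor accesses are served from shared memory at low latency.
        float left  = (threadIdx.x > 0)              ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        float right = (threadIdx.x < blockDim.x - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
        out[gid] = 0.5f * (left + right);
    }
}

int main()
{
    // Ask for the shared-memory-heavy side of the 64 KB shared:L1 split for this
    // kernel; the runtime treats this as a preference, not a guarantee.
    cudaFuncSetCacheConfig(smemExample, cudaFuncCachePreferShared);

    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    smemExample<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}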
