Cache Usage Tutorial - MGNet

HPCToolkit Philosophy 2

Provide information needed for analysis and tuning
• Multilanguage applications
• Multiple metrics
  o Must compare metrics which are causes versus effects (examples: misses, flops, loads, mispredicts, cycles, stall cycles, etc.)
• Hide getting details from user as much as possible

Prof. Craig C. Douglas, University of Kentucky and Yale University
HiPC2003, 12/17/2003, HPC Cache Aware Methods, slide 46
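One way to read the "causes versus effects" bullet is that raw counts only become comparable once they are turned into ratios. The sketch below is a minimal illustration of that idea, not HPCToolkit's interface: the `Counters` class, the `derived_metrics` function, and all counter values are hypothetical and were made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw event counts for one code region (hypothetical values, not measurements)."""
    cycles: int        # effect: total time spent
    instructions: int
    loads: int
    l1_misses: int     # cause: loads that missed the L1 cache
    flops: int


def derived_metrics(c: Counters) -> dict:
    """Combine raw counts into ratios that relate causes to effects."""
    return {
        "miss_ratio": c.l1_misses / c.loads,              # cause
        "cycles_per_instruction": c.cycles / c.instructions,  # effect
        "flops_per_cycle": c.flops / c.cycles,            # effect
        "flops_per_load": c.flops / c.loads,              # cause (memory pressure)
    }


# Two hypothetical loops doing the same arithmetic with different memory behavior.
loop_a = Counters(cycles=8000, instructions=4000, loads=1000, l1_misses=250, flops=2000)
loop_b = Counters(cycles=4000, instructions=4000, loads=1000, l1_misses=50, flops=2000)

for name, counters in (("loop_a", loop_a), ("loop_b", loop_b)):
    print(name, derived_metrics(counters))
```

With these invented numbers, the loop with the higher miss ratio also shows the higher cycles per instruction, which is the kind of side-by-side comparison the slide asks for.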
HPCToolkit Philosophy 3

Eliminate manual labor from analyze, tune, run cycle
• Collect multiple data automatically
• Eliminate 90-10 rule
  o 90% of cycles in 10% of code… for a 50K line code, the hotspot is only 5,000 lines of code. How do you deal with a 5K hotspot???
• Drive the process with simple scripts
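The 90-10 rule can be checked on any per-region cycle profile with a few lines of script. The following is a minimal sketch under stated assumptions: the `hotspot` function, the region names, and the cycle counts are hypothetical and only illustrate how a simple script might pick out the regions that account for a given share of the cycles.

```python
def hotspot(cycles_by_region: dict, percent: int = 90):
    """Return the hottest regions that together cover `percent` of all cycles.

    Regions are taken in decreasing order of cycle count until the
    requested share is reached. Integer arithmetic avoids rounding issues.
    """
    total = sum(cycles_by_region.values())
    ranked = sorted(cycles_by_region.items(), key=lambda kv: kv[1], reverse=True)
    picked, covered = [], 0
    for name, cycles in ranked:
        if covered * 100 >= percent * total:
            break
        picked.append(name)
        covered += cycles
    return picked, covered


# Hypothetical profile of a multigrid-style code (cycle counts are invented).
profile = {
    "relax": 700,
    "residual": 200,
    "restrict": 50,
    "interpolate": 30,
    "io": 20,
}

regions, covered = hotspot(profile)
print(regions, covered)
```

In this invented profile, two of the five regions already cover 90% of the cycles, so tuning effort would go there first.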
Page 4 and 5: Part I: Architectures and Fundamental
Page 6 and 7: How fast should a solver be? (just a
Page 8 and 9: Comparison of solvers (what got me s
Page 10 and 11: Pipelining
Page 12 and 13: CPU trends: EPIC (similar to VLIW) (I
Page 14 and 15: Memory acceleration techniques: Inter
Page 16 and 17: Caches: Fast but small extra memory. Ho
Page 18 and 19: Cache issues: Uniqueness and transpar
Page 20 and 21: Effect of cache hit ratio: The cache
Page 22 and 23: Cache organization: Number of cache l
Page 24 and 25: Typical architectures: IBM Power 3:
Page 26 and 27: Basic efficiency guidelines: Choose t
Page 28 and 29: Choose the best algorithm (cont’d)
Page 30 and 31: Sources for libraries: Vendor-indepen
Page 32 and 33: Find good compiler options: Modern co
Page 34 and 35: Use suitable data layout: Access memo
Page 36 and 37: Profiling: Subroutine-level profiling
Page 38 and 39: Profiling: hardware performance coun
Page 40 and 41: Profiling tools: valgrind: Memory/thr
Page 42 and 43: Profiling tools: PAPI: PAPI = Perform
Page 44 and 45: hpcview Screen Shot
Page 48 and 49: Our reference code: 2D structured mul
Page 50 and 51: Using PCL - Example 1: Digital PWS 50
Page 52 and 53: Using PCL - Example 2: PCLinit(&descr
Page 54 and 55: Examples of DCPI tools: dcpiwhatcg: W
Page 56 and 57: Using DCPI - Example 2: Call the DCPI
Page 58 and 59: Using DCPI - Example 2: Static stalls
Page 60 and 61: Part II: Optimization Techniques for St
Page 62 and 63: Optimization of Floating-Point Operat
Page 64 and 65: Loop unrolling: Simplest effect of lo
Page 66 and 67: Loop unrolling: Improving flop/load
Page 68 and 69: Fused Multiply-Add (FMA): On many CPU
Page 70 and 71: Exposing ILP (cont’d): Superscalar
Page 72 and 73: Aliasing: Arrays (or other data) that
Page 74 and 75: Aliasing (cont’d): Aliasing is lega
Page 76 and 77: Eliminating overheads: if statements
Page 78 and 79: Eliminating loop overheads: For start
Page 80 and 81: Eliminating subroutine calling overh
Page 82 and 83: Data layout optimizations: Array tran
Page 84 and 85: Data layout optimizations: Stride-1 a
Page 86 and 87: Data layout optimizations: Cache-awa
Page 88 and 89: Data layout optimizations: Array pad
Page 90 and 91: Data layout optimizations: Array pad
Page 92 and 93: Loop optimizations: Loop unrolling (s
Page 94 and 95: Data access optimizations: Loop fusi
Page 96 and 97: Data access optimizations: Loop fusi
Page 98 and 99:
Page 100 and 101:
Page 102 and 103: Data access optimizations: Loop spli
Page 104 and 105: Data access optimizations: Loop bloc
Page 106 and 107: Data access optimizations: Loop bloc
Page 108 and 109: Data access optimizations: Loop block
Page 110 and 111: Two common multigrid algorithms: V Cyc
Page 112 and 113: DiMEPACK library: C++ interface, fast
Page 114 and 115: Example: Cache-Optimized Multigrid on
Page 116 and 117: Data layout optimizations for 3D mul
Page 118 and 119: Data layout optimizations for 3D mul
Page 120 and 121: Data access optimizations for 3D mul
Page 122 and 123: Example: Cache Optimizations for the
Page 124 and 125: LBM (cont’d): Stream: read distribu
Page 126 and 127: Layout 2: Grid Compression: save me
Page 128 and 129: 3-way blocking (cont’d):
Page 130 and 131: 4-way blocking (cont’d):
Page 132 and 133: Layout: Grid compression, access pa
Page 134 and 135: Performance results: MFLOPS for 2D GS
Page 136 and 137: Memory access behavior: Standard impl
Page 138 and 139: Memory access behavior: 2D skewed blo
Page 140 and 141: Performance results (cont’d): 2D LB
Page 142 and 143: Performance results (cont’d): 3D LB
Page 144 and 145: C++-specific considerations: We will
Page 146 and 147: Inlining (cont’d): Advantages: • R
Page 148 and 149: Inlining virtual functions: Virtual f
Page 150 and 151: Example: Define a simple vector class
Page 152 and 153: Example (cont’d): Need a wrapper cl
Page 154 and 155: Example (cont’d): Need overloaded o
Page 156 and 157: Part III: Optimization Techniques for
Page 158 and 159: Is it really unstructured? This is r
Page 160 and 161: Motivating example: • Suppose probl
Page 162 and 163: Cache aware Gauss-Seidel: Preprocessi
Page 164 and 165: Example of subblock membership: Cache
Page 166 and 167: Standard Gauss-Seidel: The complexity
Page 168 and 169: Cache boundaries connected
Page 170 and 171: Physical boundaries unknown
Page 172 and 173: Physical boundaries known
Page 174 and 175: Physical boundaries known: The comple
Page 176 and 177: Preprocessing costs: Pessimistic (nev
Page 178 and 179: Numerical experiments: Bavaria: Exper
Page 180 and 181: Implementation details: One parameter
Page 182 and 183: Journal articles: C.C. Douglas, Cach
Page 184 and 185: Conference proceedings: C. Weiss, W.