- Page 4 and 5: Part I Architectures and Fundamental
- Page 6 and 7: How fast should a solver be? (just a
- Page 8 and 9: Comparison of solvers (what got me s
- Page 10 and 11: Pipelining Prof. Craig C. Douglas Uni
- Page 12 and 13: CPU trends EPIC (similar to VLIW) (I
- Page 14 and 15: Memory acceleration techniques Inter
- Page 16 and 17: Caches Fast but small extra memory Ho
- Page 18 and 19: Cache issues Uniqueness and transpar
- Page 20 and 21: Effect of cache hit ratio The cache
- Page 22 and 23: Cache organization Number of cache l
- Page 24 and 25: Typical architectures IBM Power 3:
- Page 26 and 27: Basic efficiency guidelines Choose t
- Page 28 and 29: Choose the best algorithm (cont’d)
- Page 30 and 31: Sources for libraries Vendor-indepen
- Page 32 and 33: Find good compiler options Modern co
- Page 34 and 35: Use suitable data layout Access memo
- Page 36 and 37: Profiling Subroutine-level profiling
- Page 38 and 39: Profiling: hardware performance coun
- Page 40 and 41: Profiling tools: valgrind Memory/thr
- Page 42 and 43: Profiling tools: PAPI PAPI = Perform
- Page 44 and 45: hpcview Screen Shot
- Page 48 and 49: Our reference code 2D structured mul
- Page 50 and 51: Using PCL - Example 1 Digital PWS 50
- Page 52 and 53: Using PCL - Example 2 PCLinit(&descr
- Page 54 and 55: Examples of DCPI tools dcpiwhatcg: W
- Page 56 and 57: Using DCPI - Example 2 Call the DCPI
- Page 58 and 59: Using DCPI - Example 2 Static stalls
- Page 60 and 61: Part II Optimization Techniques for St
- Page 62 and 63: Optimization of Floating-Point Operat
- Page 64 and 65: Loop unrolling Simplest effect of lo
- Page 66 and 67: Loop unrolling: Improving flop/load
- Page 68 and 69: Fused Multiply-Add (FMA) On many CPU
- Page 70 and 71: Exposing ILP (cont’d) Superscalar
- Page 72 and 73: Aliasing Arrays (or other data) that
- Page 74 and 75: Aliasing (cont’d) Aliasing is lega
- Page 76 and 77: Eliminating overheads: if statements
- Page 78 and 79: Eliminating loop overheads For start
- Page 80 and 81: Eliminating subroutine calling overh
- Page 82 and 83: Data layout optimizations Array tran
- Page 84 and 85: Data layout optimizations Stride-1 a
- Page 86 and 87: Data layout optimizations: Cache-awa
- Page 88 and 89: Data layout optimizations: Array pad
- Page 90 and 91: Data layout optimizations: Array pad
- Page 92 and 93: Loop optimizations Loop unrolling (s
- Page 94 and 95: Data access optimizations: Loop fusi
- Page 96 and 97: Data access optimizations: Loop fusi
- Page 98 and 99: Data access optimizations: Loop fusi
- Page 100 and 101: Data access optimizations: Loop fusi
- Page 102 and 103: Data access optimizations: Loop spli
- Page 104 and 105: Data access optimizations: Loop bloc
- Page 106 and 107: Data access optimizations: Loop bloc
- Page 108 and 109: Data access optimizations Loop block
- Page 110 and 111: Two common multigrid algorithms V Cyc
- Page 112 and 113: DiMEPACK library C++ interface, fast
- Page 114 and 115: Example: Cache-Optimized Multigrid on
- Page 116 and 117: Data layout optimizations for 3D mul
- Page 118 and 119: Data layout optimizations for 3D mul
- Page 120 and 121: Data access optimizations for 3D mul
- Page 122 and 123: Example: Cache Optimizations for the
- Page 124 and 125: LBM (cont’d) Stream: read distribu
- Page 126 and 127: Layout 2: Grid Compression: save me
- Page 128 and 129: 3-way blocking (cont’d):
- Page 130 and 131: 4-way blocking (cont’d):
- Page 132 and 133: Layout: Grid compression, access pa
- Page 134 and 135: Performance results MFLOPS for 2D GS
- Page 136 and 137: Memory access behavior Standard impl
- Page 138 and 139: Memory access behavior 2D skewed blo
- Page 140 and 141: Performance results (cont’d) 2D LB
- Page 142 and 143: Performance results (cont’d) 3D LB
- Page 144 and 145: C++-specific considerations We will
- Page 146 and 147: Inlining (cont’d) Advantages: • R
- Page 148 and 149: Inlining virtual functions Virtual f
- Page 150 and 151: Example Define a simple vector class
- Page 152 and 153: Example (cont’d) Need a wrapper cl
- Page 154 and 155: Example (cont’d) Need overloaded o
- Page 156 and 157: Part III Optimization Techniques for
- Page 158 and 159: Is it really unstructured? This is r
- Page 160 and 161: Motivating example • Suppose probl
- Page 162 and 163: Cache aware Gauss-Seidel Preprocessi
- Page 164 and 165: Example of subblock membership Cache
- Page 166 and 167: Standard Gauss-Seidel The complexity
- Page 168 and 169: Cache boundaries connected
- Page 170 and 171: Physical boundaries unknown
- Page 172 and 173: Physical boundaries known
- Page 174 and 175: Physical boundaries known The comple
- Page 176 and 177: Preprocessing costs Pessimistic (nev
- Page 178 and 179: Numerical experiments: Bavaria Exper
- Page 180 and 181: Implementation details One parameter
- Page 182 and 183: Journal articles C.C. Douglas, Cach
- Page 184 and 185: Conference proceedings C. Weiss, W.