- Page 4 and 5: Part I Architectures and Fundamental
- Page 6 and 7: How fast should a solver be? (just a
- Page 8 and 9: Comparison of solvers (what got me s
- Page 10 and 11: Pipelining Prof. Craig C. Douglas Uni
- Page 12 and 13: CPU trends EPIC (similar to VLIW) (I
- Page 14 and 15: Memory acceleration techniques Inter
- Page 16 and 17: Caches Fast but small extra memory Ho
- Page 18 and 19: Cache issues Uniqueness and transpar
- Page 20 and 21: Effect of cache hit ratio The cache
- Page 22 and 23: Cache organization Number of cache l
- Page 24 and 25: Typical architectures IBM Power 3:
- Page 26 and 27: Basic efficiency guidelines Choose t
- Page 28 and 29: Choose the best algorithm (cont’d)
- Page 30 and 31: Sources for libraries Vendor-indepen
- Page 32 and 33: Find good compiler options Modern co
- Page 34 and 35: Use suitable data layout Access memo
- Page 36 and 37: Profiling Subroutine-level profiling
- Page 38 and 39: Profiling: hardware performance coun
- Page 40 and 41: Profiling tools: valgrind Memory/thr
- Page 42 and 43: Profiling tools: PAPI PAPI = Perform
- Page 44 and 45: hpcview Screen Shot
- Page 46 and 47: HPCToolkit Philosophy 2 Provide info
- Page 48 and 49: Our reference code 2D structured mul
- Page 50 and 51: Using PCL - Example 1 Digital PWS 50
- Page 52 and 53: Using PCL - Example 2 PCLinit(&descr
- Page 54 and 55: Examples of DCPI tools dcpiwhatcg: W
- Page 56 and 57: Using DCPI - Example 2 Call the DCPI
- Page 58 and 59: Using DCPI - Example 2 Static stalls
- Page 60 and 61: Part II Optimization Techniques for St
- Page 62 and 63: Optimization of Floating-Point Operat
- Page 64 and 65: Loop unrolling Simplest effect of lo
- Page 66 and 67: Loop unrolling: Improving flop/load
- Page 68 and 69: Fused Multiply-Add (FMA) On many CPU
- Page 70 and 71: Exposing ILP (cont’d) Superscalar
- Page 72 and 73: Aliasing Arrays (or other data) that
- Page 74 and 75: Aliasing (cont’d) Aliasing is lega
- Page 76 and 77: Eliminating overheads: if statements
- Page 78 and 79: Eliminating loop overheads For start
- Page 80 and 81: Eliminating subroutine calling overh
- Page 82 and 83: Data layout optimizations Array tran
- Page 84 and 85: Data layout optimizations Stride-1 a
- Page 86 and 87: Data layout optimizations: Cache-awa
- Page 88 and 89: Data layout optimizations: Array pad
- Page 90 and 91: Data layout optimizations: Array pad
- Page 92 and 93: Loop optimizations Loop unrolling (s
- Page 94 and 95: Data access optimizations: Loop fusi
- Page 96 and 97: Data access optimizations: Loop fusi
- Page 98 and 99: Data access optimizations: Loop fusi
- Page 100 and 101: Data access optimizations: Loop fusi
- Page 102 and 103: Data access optimizations: Loop spli
- Page 104 and 105: Data access optimizations: Loop bloc
- Page 106 and 107: Data access optimizations: Loop bloc
- Page 108 and 109: Data access optimizations Loop block
- Page 110 and 111: Two common multigrid algorithms V Cyc
- Page 112 and 113: DiMEPACK library C++ interface, fast
- Page 114 and 115: Example: Cache-Optimized Multigrid on
- Page 116 and 117: Data layout optimizations for 3D mul
- Page 118 and 119: Data layout optimizations for 3D mul
- Page 120 and 121: Data access optimizations for 3D mul
- Page 122 and 123: Example: Cache Optimizations for the
- Page 124 and 125: LBM (cont’d) Stream: read distribu
- Page 126 and 127: Layout 2: Grid Compression: save me
- Page 128 and 129: 3-way blocking (cont’d)
- Page 130 and 131: 4-way blocking (cont’d)
- Page 132 and 133: Layout: Grid compression, access pa
- Page 134 and 135: Performance results MFLOPS for 2D GS
- Page 136 and 137: Memory access behavior Standard impl
- Page 138 and 139: Memory access behavior 2D skewed blo
- Page 140 and 141: Performance results (cont’d) 2D LB
- Page 142 and 143: Performance results (cont’d) 3D LB
- Page 144 and 145: C++-specific considerations We will
- Page 146 and 147: Inlining (cont’d) Advantages: • R
- Page 148 and 149: Inlining virtual functions Virtual f
- Page 150 and 151: Example Define a simple vector class
- Page 152 and 153: Example (cont’d) Need a wrapper cl
- Page 156 and 157: Part III Optimization Techniques for
- Page 158 and 159: Is It really unstructured? This is r
- Page 160 and 161: Motivating example • Suppose probl
- Page 162 and 163: Cache aware Gauss-Seidel Preprocessi
- Page 164 and 165: Example of subblock membership Cache
- Page 166 and 167: Standard Gauss-Seidel The complexity
- Page 168 and 169: Cache boundaries connected
- Page 170 and 171: Physical boundaries unknown
- Page 172 and 173: Physical boundaries known
- Page 174 and 175: Physical boundaries known The comple
- Page 176 and 177: Preprocessing costs Pessimistic (nev
- Page 178 and 179: Numerical experiments: Bavaria Exper
- Page 180 and 181: Implementation details One parameter
- Page 182 and 183: Journal articles C.C. Douglas, Cach
- Page 184 and 185: Conference proceedings C. Weiss, W.