Cache Usage Tutorial - MGNet

Example (cont'd)

Need overloaded operator+() variants for all possible return types, for example:

    template <class A, class B>
    DExpr<DExprSum<DExpr<A>, DExpr<B> > >
    operator+(const DExpr<A>& a, const DExpr<B>& b) {
        // ExprT is the node type that represents the sum a + b
        typedef DExprSum<DExpr<A>, DExpr<B> > ExprT;
        return DExpr<ExprT>(ExprT(a, b));
    }

Prof. Craig C. Douglas, University of Kentucky and Yale University. HiPC2003, 12/17/2003. vHPC Cache Aware Methods.
Example (cont'd)

The vector class must contain a member function operator=(const A& ea), where A is an expression template class.

The actual computation (the vector sum) takes place only when this member function is called.
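The two slides above can be put together in a minimal, self-contained sketch. It is an illustration under stated assumptions, not the tutorial's own code: DExpr and DExprSum follow the names on the slides, while DVec, DVecRef, expr(), and sum_three are hypothetical names introduced here, and the template parameter lists are one plausible reading of the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Wrapper class: every expression is a DExpr<A>, whatever A is.
template <class A>
class DExpr {
    A expr_;  // small node object, stored by value
public:
    explicit DExpr(const A& e) : expr_(e) {}
    double operator[](std::size_t i) const { return expr_[i]; }
    std::size_t size() const { return expr_.size(); }
};

// Node for an element-wise sum; evaluates one element on demand.
template <class L, class R>
class DExprSum {
    L l_;
    R r_;
public:
    DExprSum(const L& l, const R& r) : l_(l), r_(r) {}
    double operator[](std::size_t i) const { return l_[i] + r_[i]; }
    std::size_t size() const { return l_.size(); }
};

// Hypothetical leaf node: a non-owning view of a vector's data.
class DVecRef {
    const double* p_;
    std::size_t n_;
public:
    DVecRef(const double* p, std::size_t n) : p_(p), n_(n) {}
    double operator[](std::size_t i) const { return p_[i]; }
    std::size_t size() const { return n_; }
};

// Hypothetical simple vector class.
class DVec {
    std::vector<double> data_;
public:
    explicit DVec(std::size_t n, double v = 0.0) : data_(n, v) {}
    double& operator[](std::size_t i) { return data_[i]; }
    double operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return data_.size(); }
    DExpr<DVecRef> expr() const {
        return DExpr<DVecRef>(DVecRef(data_.data(), data_.size()));
    }
    // The only place where a loop runs: assignment from an expression.
    template <class A>
    DVec& operator=(const DExpr<A>& ea) {
        assert(ea.size() == data_.size());
        for (std::size_t i = 0; i < data_.size(); ++i) data_[i] = ea[i];
        return *this;
    }
};

// operator+ variants for each combination of operand types.
template <class A, class B>
DExpr<DExprSum<DExpr<A>, DExpr<B>>>
operator+(const DExpr<A>& a, const DExpr<B>& b) {
    typedef DExprSum<DExpr<A>, DExpr<B>> ExprT;
    return DExpr<ExprT>(ExprT(a, b));
}
template <class A>
DExpr<DExprSum<DExpr<A>, DExpr<DVecRef>>>
operator+(const DExpr<A>& a, const DVec& b) { return a + b.expr(); }
template <class B>
DExpr<DExprSum<DExpr<DVecRef>, DExpr<B>>>
operator+(const DVec& a, const DExpr<B>& b) { return a.expr() + b; }
inline DExpr<DExprSum<DExpr<DVecRef>, DExpr<DVecRef>>>
operator+(const DVec& a, const DVec& b) { return a.expr() + b.expr(); }

// Hypothetical helper: a + b + c builds a nested expression object only;
// the single loop over the elements runs inside DVec::operator=.
inline DVec sum_three(const DVec& a, const DVec& b, const DVec& c) {
    DVec r(a.size());
    r = a + b + c;
    return r;
}
```

In this sketch no temporary vector is allocated for the intermediate result a + b. The nested DExpr object only records what to add, and each result element is computed once, inside the assignment loop.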