Non-linear memory layout transformations and data prefetching ...

LIST OF FIGURES

4.4 Alignment of arrays A, B, C, when $C_{L1} = N^2$, with $T = N$
4.5 Alignment of arrays A, B, C, when $N^2 > C_{L1}$ and $T \cdot N < C_{L1}$
4.6 Alignment of arrays A, B, C, when $3T^2 < C_{L1} \le T \cdot N$
4.7 Alignment of arrays A, B, C, when $N^2 > C_{L1}$, $T^2 \le C_{L1} < 3T^2$
4.8 Alignment of arrays A, B, C, when $N^2 > C_{L1}$, $T^2 > C_{L1} > T$
4.9 Number of L1 cache misses for various array and tile sizes in direct mapped caches, when the alignment described in section 4.1.2 has been applied (UltraSPARC II architecture)
4.10 Number of L1 cache misses for various array and tile sizes in set associative caches (Xeon DP architecture)
4.11 Number of L2 direct mapped cache misses for various array and tile sizes
4.12 Number of TLB misses for various array and tile sizes
4.13 The total miss cost for various array and tile sizes
5.1 Resource partitioning in Intel hyperthreading architecture
5.2 Average CPI for different TLP and ILP execution modes of some common instruction streams
5.3 Slowdown factors for the co-execution of various integer instruction streams
5.4 Slowdown factors for the co-execution of various floating-point instruction streams
6.1 Total execution results in matrix multiplication (-xO0, UltraSPARC)
6.2 Total execution results in matrix multiplication (-fast, UltraSPARC)
6.3 Total execution results in matrix multiplication (-fast, SGI Origin)
6.4 Total execution results in LU-decomposition (-xO0, UltraSPARC)
6.5 Total execution results in LU-decomposition (-fast, UltraSPARC)
6.6 Total execution results in LU-decomposition for larger arrays and hand optimized codes (-fast, SGI Origin)
6.7 Total execution results in SSYR2K (-xO0, UltraSPARC)
6.8 Total execution results in SSYR2K (-fast, UltraSPARC)
6.9 Total execution results in SSYMM (-xO0, UltraSPARC)
6.10 Total execution results in SSYMM (-fast, UltraSPARC)
6.11 Total execution results in SSYMM (-O0, Athlon XP)
6.12 Total execution results in SSYMM (-O3, Athlon XP)
6.13 Total execution results in STRMM (-xO0, UltraSPARC)
6.14 Total execution results in STRMM (-fast, UltraSPARC)
6.15 Misses in Data L1, Unified L2 cache and data TLB for matrix multiplication (UltraSPARC)
6.16 Misses in Data L1, Unified L2 cache and data TLB for LU-decomposition (UltraSPARC)
6.17 Misses in Data L1, Unified L2 cache and data TLB for LU-decomposition (SGI Origin)
6.18 Execution time of the Matrix Multiplication kernel for various array and tile sizes (UltraSPARC, -fast)
6.19 Total performance penalty due to data L1 cache misses, L2 cache misses and data TLB misses for the Matrix Multiplication kernel with use of Blocked array Layouts and efficient indexing. The real execution time of this benchmark is also illustrated (UltraSPARC)
6.20 Total performance penalty and real execution time for the Matrix Multiplication kernel (linear array layouts - UltraSPARC)
6.21 The relative performance of the two different data layouts (UltraSPARC)
6.22 Normalized performance of 5 benchmarks for various array and tile sizes (UltraSPARC)
6.23 Total performance penalty for the Matrix Multiplication kernel (Pentium III)
6.24 Pentium III - Normalized performance of five benchmarks for various array and tile sizes
6.25 Athlon XP - Normalized performance of five benchmarks for various array and tile sizes
6.26 Xeon - The relative performance of the three different versions
6.27 Xeon - Normalized performance of the matrix multiplication benchmark for various array and tile sizes (serial MBaLt)
6.28 Xeon - Normalized performance of the matrix multiplication benchmark for various array and tile sizes (2 threads - MBaLt)
6.29 Xeon - Normalized performance of the matrix multiplication benchmark for various array and tile sizes (4 threads - MBaLt)
6.30 SMT experimental results in the Intel Xeon Architecture, with HT enabled
6.31 Instruction issue ports and main execution units of the Xeon processor
Page 1: NATIONAL TECHNICAL UNIVERSITY OF AT
Page 4 and 5: .................Evangelia G. Athan