Non-linear memory layout transformations and data prefetching ...

... the instruction stream access pattern. Since tiled code focuses on a sub-block of an array, why not put these array elements in contiguous memory locations? In other words, since the instruction stream and, consequently, the respective data access patterns are blocked, it would be ideal to store arrays following exactly the same pattern, so that instruction and data streams are aligned. Conflict misses, especially for direct-mapped or low-associativity caches, are considerably reduced, since all array elements within the same block are now mapped to contiguous cache locations and self-interference is avoided.

1.1.2 Non-linear Memory Layouts

Chatterjee et al. [CJL+99], [CLPT99] explored the merit of non-linear memory layouts and quantified their implementation cost. They proposed two families of blocked layout functions, both of which split array elements into tiles whose size fits the cache characteristics, with the elements inside each tile stored linearly. Although they report improved execution time, with four-dimensional arrays, as proposed, any advantage obtained from the data locality of blocked layouts seems to be counterbalanced by the slowdown caused by referencing four-dimensional array elements (compared with the access time of two- or one-dimensional arrays). Even if we convert these four-dimensional arrays to two-dimensional ones, as proposed by Lin et al. in [LLC02], naively indexing the right array elements requires expensive array subscripts.

Furthermore, blocked layouts in combination with level-order, Ahnentafel and Morton indexing were used by Wise et al. in [WAFG01] and [WF99]. Although their quad-tree based layouts seem to work well with recursive algorithms, thanks to efficient element indexing, no locality gain can be obtained in non-recursive codes.
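
To make the layouts discussed above concrete, the following is a minimal sketch in C of three address computations for a square n x n array: the conventional row-major offset, a blocked layout with linearly stored t x t tiles, and a Morton (Z-order) index obtained by bit interleaving. The function names, the row-major ordering of tiles and of elements inside a tile, and the 16-bit coordinate width are assumptions made for illustration; they are not taken from the cited papers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Conventional row-major offset of element (i, j) in an n x n array. */
size_t row_major_index(size_t i, size_t j, size_t n)
{
    return i * n + j;
}

/* Blocked layout (illustrative): tiles of t x t elements are stored one
 * after another in row-major tile order, and the elements inside each
 * tile are stored linearly (row-major). Assumes t divides n. */
size_t blocked_index(size_t i, size_t j, size_t n, size_t t)
{
    assert(t > 0 && n % t == 0);
    size_t tiles_per_row = n / t;
    size_t tile = (i / t) * tiles_per_row + (j / t); /* which tile        */
    size_t within = (i % t) * t + (j % t);           /* offset in the tile */
    return tile * t * t + within;
}

/* Morton (Z-order) index: interleave the bits of the column (even bit
 * positions) and the row (odd bit positions). 16-bit coordinates only. */
uint32_t morton_index(uint16_t i, uint16_t j)
{
    uint32_t z = 0;
    for (unsigned b = 0; b < 16; ++b) {
        z |= (uint32_t)((j >> b) & 1u) << (2 * b);
        z |= (uint32_t)((i >> b) & 1u) << (2 * b + 1);
    }
    return z;
}
```

With n = 8 and t = 4, the sixteen elements of the first tile occupy offsets 0 to 15 in the blocked layout, whereas in row-major order the same elements are spread over four rows. The divisions and modulo operations in `blocked_index` are an instance of the expensive naive subscripts mentioned in the text.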
Moreover, because the quad-tree layouts of Wise et al. recurse down to the level of a single array element, they incur an additional performance loss.

Non-linear layouts have been shown to favor cache locality at all levels of the memory hierarchy, including L1 and L2 caches and TLBs. Experiments in [RT99b] that exploit locality in the L2 cache, rather than considering only one-level caches, demonstrate reduced cache misses, but performance improvements are rarely significant. Targeting the L1 cache, nearly all locality benefits can be achieved. Most previous approaches mainly target cache performance. As problem sizes become larger, TLB performance becomes more significant. If TLB thrashing occurs [PHP02], overall performance is drastically degraded. Hence, TLB and cache must be considered in concert when optimizing application performance. Park et al. in [PHP03] derive a lower bound on TLB performance for standard matrix access patterns and show that block data layouts and the Morton layout (for recursive codes) achieve this bound. These layouts, with block size equal to the page size, minimize the number of TLB misses. Considering all levels of cache together with the TLB, a block size selection algorithm should be used to calculate a tight range of optimal block sizes.

However, the automatic application of non-linear layouts in real compilers is a tedious task. It does not suffice to identify the optimal non-linear (blocked) layout for a specific array; we also need to automatically generate the mapping from the multidimensional iteration indices to the correct location of the respective data element in linear memory. Blocked layouts are very promising, subject to an efficient address computation method. In the following, when referring to our non-linear layouts, we will name them Blocked Array Layouts, as they are always combined with loop tiling (they split array elements into blocks) and apply efficient indexing to the derived tiles.

1.1.3 Tile Size/Shape Selection

Early efforts [MCT96], [WL91] were dedicated to selecting the tile in such a way that its working set fits in the cache, so as to eliminate capacity misses. To minimize loop overhead, the tile size should be the maximum that meets this requirement. Recent work takes conflict misses into account as well. Conflict misses [TFJ94] may occur when too many data items map to the same set of cache locations, causing cache lines to be flushed from the cache before they can be used, despite sufficient capacity in the overall cache. As a result, in addition to eliminating capacity misses [MCT96], [WL91] and maximizing cache utilization, the tile should be selected in such a way that there are no (or few) self-conflict misses, while cross-conflict misses are minimized [CM99], [CM95], [Ess93], [LRW91], [RT99a].

To find tile sizes that have few capacity misses, the surveyed algorithms restrict their candidate tile sizes to those whose working set fits entirely in the cache. To model self-conflict misses due to low-associativity caches, [WMC96] and [MHCF98] use the effective cache size q×C (q < 1) instead of the actual cache size C, while [CM99], [CM95], [LRW91] and [SL01] explicitly find the non-conflicting tile sizes. Taking cache line size into account as well, column dimensions (without loss of generality, assume a column-major data array layout) should be a multiple of the cache line size [CM95]. If fixed blocks are chosen, Lam et al. in [LRW91] have found that the best square tile is not larger than $\sqrt{aC/(a+1)}$, where a is the associativity. In practice, the optimal choice may occupy only a small fraction of the cache, typically less than 10%. What is more, the fraction of the cache used by the optimal block size decreases as the cache size increases.

The desired tile shape has been explicitly specified in algorithms such as [Ess93], [CM99], [CM95], [WL91], [WMC96], [LRW91]. Both [WL91] and [LRW91] search for square tiles. In contrast, [CM99], [CM95] and [WMC96] find rectangular tiles, and [Ess93] even extremely tall tiles (the maximum number of complete columns that fit in the cache). Tile shape and cache utilization are two important performance factors considered by many algorithms, either implicitly through the cost model or explicitly through candidate tiles. However, extremely wide tiles may introduce TLB thrashing. On the other hand, extremely tall or square tiles may have low cache utilization.

Apart from the static techniques, iterative compilation has been implemented in [KKO00]. Although it can achieve high speedups, the obvious drawback of iterative compilation is its long compilation time, required to generate and profile many versions of the source program.
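
As a small worked illustration of the two size limits quoted above, the sketch below computes the bound $\sqrt{aC/(a+1)}$ on the side of a square tile and the largest square side that fits in an effective cache of size q×C. It assumes that C is measured in array elements, that q is given as a fraction q_num/q_den, and, for the second function, that the working set is a single t x t tile; the function names are hypothetical and an integer square root is used to keep the example free of floating point.

```c
#include <stddef.h>

/* Largest integer s with s * s <= x (simple integer square root). */
static size_t isqrt(size_t x)
{
    size_t s = 0;
    while ((s + 1) * (s + 1) <= x)
        ++s;
    return s;
}

/* Upper bound on the side of a square tile, floor(sqrt(a*C/(a+1))),
 * with cache_elems = C in array elements and assoc = a. */
size_t max_square_tile_side(size_t cache_elems, size_t assoc)
{
    return isqrt(assoc * cache_elems / (assoc + 1));
}

/* Largest side t such that a single t x t tile (an assumption of this
 * sketch) fits in the effective cache size q*C, with q = q_num/q_den. */
size_t effective_tile_side(size_t cache_elems, size_t q_num, size_t q_den)
{
    return isqrt(cache_elems * q_num / q_den);
}
```

For example, a 32 KiB cache holding 8-byte elements has C = 4096. Under these assumptions the bound on the square tile side is 45 for a direct-mapped cache (a = 1) and 57 for a 4-way set-associative cache, while restricting the tile to one tenth of the cache (q = 1/10) gives a side of 20.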
