Non-linear memory layout transformations and data prefetching ...

Unfortunately, tiled programs produced by existing tiling heuristics do not perform robustly [PNDN99], [RT99a]. Instability comes from the so-called pathological array sizes, where array dimensions are near powers of two, since cache interference is a particular risk at that point. Array padding [HK04], [PNDN99], [RT98b], [SL01] is a compiler optimization that increases the array sizes and changes initial locations to avoid pathological cases. It introduces space overhead but effectively stabilizes program performance. Cache utilization for padded benchmark codes is much higher overall, since padding is used to avoid small tiles [RT99a]. As a result, more recent research efforts have investigated the combination of loop tiling and array padding, in the hope that both the magnitude and the stability of performance improvements of tiled programs can be achieved at the same time [HK04], [RT98b], [PNDN99].

An alternative method for avoiding conflict misses is to copy tiles to a buffer and modify the code to use data directly from the buffer [Ess93], [LRW91], [TGJ93]. Since data in the buffer is contiguous, self-interference is eliminated. If buffers are adjacent, then cross-interference misses are also avoided. Copying in [LRW91] can take full advantage of the cache, as it enables the use of tiles of size √C × √C in each blocked loop nest. However, the performance overhead due to runtime copying is low if tiles need to be copied only once.

TLB thrashing is a crucial performance factor, since a TLB miss costs more than an L1 cache miss and can cause severe cache stalls. While a one-level cost function suffices to yield good performance at a single level of the memory hierarchy, it may not be globally optimal. Multi-level tiling can be used to achieve locality in multiple levels (registers, caches and TLBs) simultaneously. Such optimizations have been considered in [CM95], [HK04] and a version of [MHCF98], guided by multi-level cost functions.
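To make the tile-copying idea described above concrete, here is a minimal C sketch of a tiled matrix multiplication that copies each tile of one operand into a contiguous buffer before using it. It is an illustration only and not code from the cited works: the names `copy_tile` and `matmul_tiled_copy`, the fixed tile edge `TILE`, and the restriction to square row-major matrices whose size is a multiple of `TILE` are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge; a real choice would depend on the cache capacity. */
enum { TILE = 4 };

/* Copy the TILE x TILE tile of the row-major n x n matrix `src` whose
 * top-left element is (row0, col0) into the contiguous buffer `buf`. */
static void copy_tile(const double *src, size_t n,
                      size_t row0, size_t col0, double *buf)
{
    for (size_t i = 0; i < TILE; i++)
        for (size_t j = 0; j < TILE; j++)
            buf[i * TILE + j] = src[(row0 + i) * n + (col0 + j)];
}

/* Compute c += a * b for row-major n x n matrices.
 * Assumption: n is a multiple of TILE. */
static void matmul_tiled_copy(const double *a, const double *b,
                              double *c, size_t n)
{
    double buf[TILE * TILE];

    assert(n % TILE == 0);
    for (size_t kk = 0; kk < n; kk += TILE)
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Each tile of b is copied exactly once and then reused
             * for every row i of a. */
            copy_tile(b, n, kk, jj, buf);
            for (size_t i = 0; i < n; i++)
                for (size_t k = 0; k < TILE; k++) {
                    double aik = a[i * n + kk + k];
                    for (size_t j = 0; j < TILE; j++)
                        c[i * n + jj + j] += aik * buf[k * TILE + j];
                }
        }
}
```

Because the buffer is contiguous, the elements of a tile cannot map onto each other's cache lines the way widely strided rows of the original array can. The loop order is chosen so that every tile is copied only once, which is the case in which the copying overhead stays low.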
Minimizing a multi-level cost function balances the relative costs of TLB and cache misses. Optimal tiling must satisfy the capacity requirements of both the TLB and the cache.

In general, most algorithms search for the largest tile sizes that generate the fewest capacity misses, eliminate self-conflict misses and minimize cross-conflict misses. Sometimes, high cache utilization [SL99] and a low number of cache misses may not be achieved simultaneously. As a result, each algorithm has a different approximation to cache utilization and number of cache misses, and weighs these two quantities differently.

Significant work has been done to quantify the total number of conflict misses [CM99], [FST91], [GMM99], [HKN99], [Ver03]. Cache behaviour is extremely difficult to analyze, reflecting its unstable nature, in which small modifications can lead to disproportionate changes in cache miss ratio [TFJ94]. Traditionally, cache performance evaluation has mostly used simulation. Although the results are accurate, the time needed to obtain them is typically many times greater than the total execution time of the program being simulated. To try to overcome such problems, analytical models of cache behaviour combined with heuristics have also been developed, to guide optimizing compilers [GMM99], [RT98b] and [WL91], or study the cache performance of particular types of algorithm, especially blocked ones [CM99], [HKN99],
[LRW91], and [Ver03]. Code optimizations, such as tile size selection, that are chosen with the help of predicted miss ratios require a highly accurate assessment of the program's code behaviour. Performance degradation due to tiled code complexity and mispredicted branches should also be taken into account. Miss ratios of blocked kernels are generally much smaller than those of unblocked kernels, which amplifies the significance of small errors in prediction. For this reason, a combination of cache miss analysis, simulation and experimentation is the best solution for the optimal selection of critical transformations.

The previous approaches assumed linear array layouts. However, as the aforementioned studies have shown, such linear array memory layouts produce unfavorable memory access patterns that cause interference misses and increase memory system overhead. Several different approaches exist for quantifying the benefits of adopting nonlinear layouts to reduce cache misses. In [RT99b], Rivera et al. consider all levels of the memory hierarchy, so as to reduce L2 cache misses as well, rather than only L1 ones. They report even fewer overall misses; however, performance improvements are rarely significant. In another approach, TLB and cache misses are considered in concert. Park et al. in [PHP02] analyze the TLB and cache performance for standard matrix access patterns when tiling is used together with block data layouts. Such layouts, with block size equal to the page size, seem to minimize the number of TLB misses. Considering all levels of cache (L1 and L2) as well as the TLB, a block size selection algorithm calculates a range of optimal block sizes.

1.2 Contributions

A detailed model of cache behaviour can give accurate information to compilers or programmers to optimize codes. However, this is a really demanding task, especially in respect of giving feedback to guide code transformations.
This thesis offers some advance in the automation of code optimization, focusing on the application of non-linear layouts in numerical codes. The optimization algorithm takes into account cache parameters, in order to determine the best processing sizes that match the memory hierarchy characteristics of each specific platform.

The primary contributions of this thesis are:

• The proposal of a fast indexing scheme that makes the performance of blocked data layouts efficient. We succeed in increasing the effectiveness of such layouts when applied to complex numerical codes, in combination with the loop tiling transformation. The provided framework can be integrated in a static tool, such as an optimizing compiler.

• The proposal of a simple heuristic to make one-level tiling size decisions easy. It outlines the convergence point of factors that affect or determine the performance of the multiple hierarchical memory levels.
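As a rough illustration of what indexing a blocked data layout involves, the C sketch below maps a two-dimensional index (i, j) to an offset in an array stored block by block. It is not the indexing scheme proposed in this thesis. It assumes a power-of-two block edge, so that divisions and remainders reduce to shifts and masks, and the names `blocked_offset`, `LOG_B` and `B` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameters: block edge B = 2^LOG_B, for illustration only. */
enum { LOG_B = 2, B = 1 << LOG_B };

/* Offset of element (i, j) of an n x n matrix stored block by block.
 * Blocks are laid out in row-major order, and so are the elements
 * inside each block. Assumption: n is a multiple of B. */
static size_t blocked_offset(size_t i, size_t j, size_t n)
{
    size_t blocks_per_row = n >> LOG_B;

    assert(n % B == 0 && i < n && j < n);
    /* Which block holds the element. */
    size_t block = (i >> LOG_B) * blocks_per_row + (j >> LOG_B);
    /* Position of the element inside that block. */
    size_t within = ((i & (B - 1)) << LOG_B) | (j & (B - 1));
    /* Every block occupies B * B = 2^(2 * LOG_B) consecutive elements. */
    return (block << (2 * LOG_B)) | within;
}
```

With this layout all elements of a block are contiguous in memory, so a tile that coincides with a block touches one dense region instead of B widely separated row segments. The cost is the extra index arithmetic on every access, which is why an efficient indexing scheme matters for blocked layouts.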