Non-linear memory layout transformations and data prefetching ...

2.3 Cache Organization

...processor clocks favor this organization for first-level caches, which should be kept small in size.

• Fully associative cache: A given block of memory elements can reside anywhere in the cache. This mapping reduces the miss rate because it eliminates conflict misses. Of course, increased associativity comes at a cost: a longer hit time and a higher structural cost. However, even though fast processor clocks favor simple on-chip caches, as we move away from the processor chip to higher-level caches, the increasing miss penalty rewards associativity.

• N-way set associative cache: In a set associative organization the cache is divided into sets of N cache lines. A given block of memory can be placed anywhere within a single set, which is determined by the block address:

    (set mapped) = (block address) MOD (number of sets in cache)

A direct mapped cache can be considered a 1-way set associative cache, while a fully associative cache with a capacity of M cache lines can be considered an M-way set associative cache. As far as cache performance is concerned, an 8-way set associative cache has, in essence, the same miss rate as a fully associative cache.

2.3.1 Pseudo-associative caches

Another approach to improving miss rates without affecting the processor clock is the pseudo-associative cache. This mechanism is roughly as effective as 2-way associativity. Pseudo-associative caches therefore have one fast and one slow hit time, corresponding to a regular hit and a pseudo-hit. On a regular hit, pseudo-associative caches work just like direct mapped caches. When a miss occurs in the direct mapped entry, an alternate entry (the index with the highest index bit flipped) is checked. A hit to the alternate entry (a pseudo-hit) requires an extra cycle. The pseudo-hit results in the two entries being swapped, so that the next access to the same line will be a fast access.
A miss in both entries (fast and alternate) causes an eviction of whichever of the two lines is LRU. The new data is always placed in the fast index, so if the alternate index's line was evicted, the line in the fast index must be moved to the alternate index. Thus a regular hit takes no extra cycles, a pseudo-hit takes one extra cycle, and an access to the L2 cache or main memory takes one cycle longer in a system with a pseudo-associative cache than in one without.

2.3.2 Victim Caches

Cache conflicts can be addressed in hardware through associativity of some form. While associativity has the advantage of reducing conflicts by allowing locations to map to multiple
