Non-linear memory layout transformations and data prefetching ...
During a memory reference, a request for specific data can be satisfied from the cache without using main memory when the requested memory address is present in the cache. This situation is known as a cache hit. The opposite situation, when the cache is consulted and found not to contain the desired data, is known as a cache miss. In the latter case, the block of data to which the requested element belongs is usually inserted into the cache, ready for the next access. This block is the smallest unit of data that can be transferred from one memory level to another; it is alternatively called a cache line.

Memory latency and bandwidth are the two key factors that determine the time needed to resolve a cache miss. Memory latency specifies the time between the data request to main memory and the arrival in the cache of the first data element of the requested block. Memory bandwidth specifies the arrival rate of the remaining elements of the requested block. Resolving a cache miss quickly is critical, because the processor has to stall until the requested data arrive from main memory (especially in an in-order execution).

2.2 Cache misses

Cache memories are designed to keep the most recently used pieces of content (either program instructions or data). However, it is not feasible to satisfy all data requests from the cache. In the case of a miss in the instruction cache, the processor stall is resolved only when the requested instruction has been fetched from main memory.
A cache read miss (on a data load instruction) can be less severe, as there may be subsequent instructions that do not depend on the expected data element. Execution continues until the operation that actually needs the loaded data is ready to execute. In practice, however, data is often used immediately after the load instruction. The last case is a cache write miss; it is the least serious kind, because there are usually write buffers that hold the data until they are transferred to main memory or a block is allocated in the cache. The processor can continue until the buffer is full.

In order to lower the cache miss rate, a great deal of analysis has been done on cache behavior in an attempt to find the best combination of size, associativity, block size, and so on. Sequences of memory references performed by benchmark programs are saved as address traces, and subsequent analysis simulates many different possible cache designs on these long traces. Making sense of how the many variables affect the cache hit rate can be quite confusing. One significant contribution to this analysis was made by Mark Hill, who separated misses into three categories (known as the Three Cs):

• Compulsory misses: misses caused by the very first reference to a block of data. They are alternatively called cold-start or first-reference misses. Cache capacity and associativity do not affect the number of compulsory misses that come up by an
