12.07.2015 Views

Non-linear memory layout transformations and data prefetching ...


1.1 Motivation

we also need to automatically generate the mapping from the multidimensional iteration indices to the correct location of the respective data element in linear memory. Blocked layouts are very promising, subject to an efficient address computation method. In the following, when referring to our non-linear layouts, we will name them Blocked Array Layouts, as they are always combined with loop tiling (they split array elements into blocks) and apply efficient indexing to the derived tiles.

1.1.3 Tile Size/Shape Selection

Early efforts [MCT96], [WL91] have been dedicated to selecting the tile in such a way that its working set fits in the cache, so as to eliminate capacity misses. To minimize loop overhead, the tile size should be the maximum that meets the above requirement. Recent work takes conflict misses into account as well. Conflict misses [TFJ94] may occur when too many data items map to the same set of cache locations, causing cache lines to be flushed from the cache before they can be used, despite sufficient capacity in the overall cache. As a result, in addition to eliminating capacity misses [MCT96], [WL91] and maximizing cache utilization, the tile should be selected in such a way that there are no (or few) self-conflict misses, while cross-conflict misses are minimized [CM99], [CM95], [Ess93], [LRW91], [RT99a].

To find tile sizes that have few capacity misses, the surveyed algorithms restrict their candidate tile sizes to the ones whose working set can fit entirely in the cache. To model self-conflict misses due to low-associativity caches, [WMC96] and [MHCF98] use the effective cache size q×C (q < 1) instead of the actual cache size C, while [CM99], [CM95], [LRW91] and [SL01] explicitly find the non-conflicting tile sizes.
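The effective-cache-size restriction can be illustrated with a minimal sketch. It is not taken from any of the cited algorithms: the function name `candidate_square_tiles`, the default q = 0.5, and the working-set model (three square tiles live at once, as one might assume for a tiled matrix multiplication) are all assumptions made for the example.

```python
def candidate_square_tiles(cache_bytes, elem_bytes, q=0.5, live_tiles=3):
    """Return square tile edges T whose working set fits in q * C.

    Hypothetical working-set model: `live_tiles` tiles of T x T elements
    are assumed to be resident in the cache at the same time.
    """
    effective = q * cache_bytes  # effective cache size q*C, with q < 1
    tiles = []
    t = 1
    while live_tiles * t * t * elem_bytes <= effective:
        tiles.append(t)
        t += 1
    return tiles


# Assumed example: 32 KiB cache, 8-byte elements, q = 0.5.
tiles = candidate_square_tiles(32 * 1024, 8, q=0.5, live_tiles=3)
largest = tiles[-1]  # the largest candidate keeps loop overhead lowest
print(largest)       # 26
```

Under these assumptions the largest admissible edge is 26 elements, since 3 × 26² × 8 = 16224 bytes fits in the 16384-byte effective cache while an edge of 27 does not.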
Taking into account cache line size as well, column dimensions (without loss of generality, assume a column-major data array layout) should be a multiple of the cache line size [CM95]. If fixed blocks are chosen, Lam et al. [LRW91] have found that the best square tile is not larger than $\sqrt{aC/(a+1)}$, where $a$ is the associativity. In practice, the optimal choice may occupy only a small fraction of the cache, typically less than 10%. Moreover, the fraction of the cache used for the optimal block size decreases as the cache size increases.

The desired tile shape has been explicitly specified in algorithms such as [Ess93], [CM99], [CM95], [WL91], [WMC96], [LRW91]. Both [WL91] and [LRW91] search for square tiles. In contrast, [CM99], [CM95] and [WMC96] find rectangular tiles, and [Ess93] even finds extremely tall tiles (the maximum number of complete columns that fit in the cache). Tile shape and cache utilization are two important performance factors considered by many algorithms, either implicitly through the cost model or explicitly through candidate tiles. However, extremely wide tiles may introduce TLB thrashing. On the other hand, extremely tall or square tiles may have low cache utilization.

Apart from the static techniques, iterative compilation has been implemented in [KKO00]. Although it can achieve high speedups, the obvious drawback of iterative compilation is its long compilation time, required to generate and profile many versions of the source program.
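As a rough illustration of the bound attributed to [LRW91] and the line-size rule attributed to [CM95], the sketch below evaluates $\sqrt{aC/(a+1)}$ and rounds the result down to a multiple of the cache line size. It assumes that C is measured in array elements; combining the two rules in one step, as well as the helper names `lam_square_tile_bound` and `round_down_to_multiple`, are choices made for this example and not part of the cited work.

```python
import math


def lam_square_tile_bound(cache_elems, associativity):
    """Upper bound sqrt(a*C / (a + 1)) on the edge of a square tile.

    Assumption: `cache_elems` (C) is the cache capacity in array elements.
    """
    a = associativity
    return math.sqrt(a * cache_elems / (a + 1))


def round_down_to_multiple(value, multiple):
    """Largest multiple of `multiple` that does not exceed `value`."""
    return (int(value) // multiple) * multiple


# Assumed example: 32 KiB cache of 8-byte elements (C = 4096 elements),
# 64-byte cache lines (8 elements per line).
CACHE_ELEMS = 4096
LINE_ELEMS = 8

for a in (1, 2, 4):
    bound = lam_square_tile_bound(CACHE_ELEMS, a)
    edge = round_down_to_multiple(bound, LINE_ELEMS)
    print(f"a={a}: bound={bound:.2f}, line-aligned edge={edge}")
```

For a direct-mapped cache (a = 1) the bound is about 45.25 elements, so the line-aligned edge in this example is 40. The bound grows with associativity but always stays below $\sqrt{C}$ = 64, so a square tile never fills the whole cache.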
