12.07.2015 Views

Non-linear memory layout transformations and data prefetching ...


[LRW91], and [Ver03]. Code optimizations chosen with the help of predicted miss ratios, such as tile size selection, require a very accurate assessment of the program's behaviour. Performance degradation due to tiled code complexity and mispredicted branches should also be taken into account. Miss ratios of blocked kernels are generally much smaller than those of unblocked kernels, which amplifies the significance of small prediction errors. For this reason, a combination of cache miss analysis, simulation and experimentation is the best solution for the optimal selection of critical transformations.

The previous approaches assumed linear array layouts. However, as the aforementioned studies have shown, such linear array memory layouts produce unfavorable memory access patterns that cause interference misses and increase memory system overhead. Several different approaches exist to quantify the benefits of adopting non-linear layouts to reduce cache misses. In [RT99b], Rivera et al. consider all levels of the memory hierarchy in order to reduce L2 cache misses as well, rather than only L1 misses. They report even fewer overall misses; however, performance improvements are rarely significant. In another approach, TLB and cache misses are considered in concert. Park et al. in [PHP02] analyze the TLB and cache performance for standard matrix access patterns when tiling is used together with block data layouts. Such layouts, with block size equal to the page size, seem to minimize the number of TLB misses.
Considering all levels of cache (L1 and L2) together with the TLB, a block size selection algorithm calculates a range of optimal block sizes.

1.2 Contributions

A detailed model of cache behaviour can give accurate information to compilers or programmers to optimize codes. However, this is a very demanding task, especially with respect to giving feedback to guide code transformations. This thesis offers some advances in the automation of code optimization, focusing on the application of non-linear layouts in numerical codes. The optimization algorithm takes cache parameters into account in order to determine the best processing sizes that match the memory hierarchy characteristics of each specific platform.

The primary contributions of this thesis are:

• The proposal of a fast indexing scheme that makes the performance of blocked data layouts efficient. We succeed in increasing the effectiveness of such layouts when applied to complex numerical codes, in combination with the loop tiling transformation. The provided framework can be integrated into a static tool, such as an optimizing compiler.
• The proposal of a simple heuristic to make one-level tiling size decisions easy. It outlines the convergence point of the factors that affect or determine the performance of the multiple hierarchical memory levels.
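
To make the idea of indexing a blocked data layout concrete, the following is a minimal sketch and not the indexing scheme proposed in the thesis, which this excerpt does not describe. It assumes a square n x n array, a block size that is a power of two, and n a multiple of the block size. The names `blocked_offset` and `to_blocked` are hypothetical. Under these assumptions the offset of an element can be computed with shifts and masks only, avoiding division and modulo operations.

```python
def blocked_offset(i, j, n, log2_b):
    """Offset of element (i, j) in an n x n array stored block by block.

    Illustrative assumptions: the block size is b = 2**log2_b, n is a
    multiple of b, blocks are stored in row-major order, and elements
    inside each block are stored in row-major order as well.
    """
    mask = (1 << log2_b) - 1          # selects the position inside a block
    blocks_per_row = n >> log2_b      # number of blocks along one dimension
    block_id = (i >> log2_b) * blocks_per_row + (j >> log2_b)
    in_row = i & mask                 # row inside the block
    in_col = j & mask                 # column inside the block
    # Each block holds b*b elements, so its base offset is block_id * b * b.
    return (block_id << (2 * log2_b)) | (in_row << log2_b) | in_col


def to_blocked(matrix, log2_b):
    """Copy a square list-of-lists matrix into a flat blocked layout."""
    n = len(matrix)
    flat = [None] * (n * n)
    for i in range(n):
        for j in range(n):
            flat[blocked_offset(i, j, n, log2_b)] = matrix[i][j]
    return flat


if __name__ == "__main__":
    # 4 x 4 matrix holding 0..15 in row-major order, stored in 2 x 2 blocks.
    m = [[4 * r + c for c in range(4)] for r in range(4)]
    print(to_blocked(m, 1))
```

With 2 x 2 blocks, the four elements of each block become contiguous in memory, so a tile of the iteration space that matches the block touches one contiguous region instead of several distant rows.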
