12.07.2015 Views

Non-linear memory layout transformations and data prefetching ...



LIST OF FIGURES

6.17 Misses in Data L1, Unified L2 cache and data TLB for LU-decomposition (SGI Origin)
6.18 Execution time of the Matrix Multiplication kernel for various array and tile sizes (UltraSPARC, -fast)
6.19 Total performance penalty due to data L1 cache misses, L2 cache misses and data TLB misses for the Matrix Multiplication kernel with use of Blocked array Layouts and efficient indexing. The real execution time of this benchmark is also illustrated (UltraSPARC)
6.20 Total performance penalty and real execution time for the Matrix Multiplication kernel (linear array layouts - UltraSPARC)
6.21 The relative performance of the two different data layouts (UltraSPARC)
6.22 Normalized performance of 5 benchmarks for various array and tile sizes (UltraSPARC)
6.23 Total performance penalty for the Matrix Multiplication kernel (Pentium III)
6.24 Pentium III - Normalized performance of five benchmarks for various array and tile sizes
6.25 Athlon XP - Normalized performance of five benchmarks for various array and tile sizes
6.26 Xeon - The relative performance of the three different versions
6.27 Xeon - Normalized performance of the matrix multiplication benchmark for various array and tile sizes (serial MBaLt)
6.28 Xeon - Normalized performance of the matrix multiplication benchmark for various array and tile sizes (2 threads - MBaLt)
6.29 Xeon - Normalized performance of the matrix multiplication benchmark for various array and tile sizes (4 threads - MBaLt)
6.30 SMT experimental results in the Intel Xeon Architecture, with HT enabled
6.31 Instruction issue ports and main execution units of the Xeon processor
