Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

• The hardware prefetcher may consume extra system bandwidth if the application's memory traffic has significant portions with strides of cache misses greater than the trigger distance threshold of hardware prefetch (large-stride memory traffic).

• The effectiveness with existing applications depends on the proportions of small-stride versus large-stride accesses in the application's memory traffic. An application with a preponderance of small-stride memory traffic with good temporal locality will benefit greatly from the automatic hardware prefetcher.

• In some situations, memory traffic consisting of a preponderance of large-stride cache misses can be transformed by re-arrangement of data access sequences to alter the concentration of small-stride cache misses at the expense of large-stride cache misses to take advantage of the automatic hardware prefetcher (a loop-interchange sketch appears after Example 9-2 below).

9.6.3 Example of Effective Latency Reduction with Hardware Prefetch

Consider the situation that an array is populated with data corresponding to a constant-access-stride, circular pointer chasing sequence (see Example 9-2). The potential of employing the automatic hardware prefetching mechanism to reduce the effective latency of fetching a cache line from memory can be illustrated by varying the access stride between 64 bytes and the trigger threshold distance of hardware prefetch when populating the array for circular pointer chasing.

Example 9-2. Populating an Array for Circular Pointer Chasing with Constant Stride

register char **p;
char *next;                         // Populating pArray for circular pointer
                                    // chasing with constant access stride
                                    // p = (char **) *p; loads a value pointing to next load
p = (char **)&pArray;
for (i = 0; i < aperture; i += stride) {
    p = (char **)&pArray[i];
    if (i + stride >= g_array_aperture) {
        next = &pArray[0];
    }
    else {
        next = &pArray[i + stride];
    }
    *p = next;                      // populate the address of the next node
}

The effective latency reduction for several microarchitecture implementations is shown in Figure 9-1. For a constant-stride access pattern, the benefit of the automatic hardware prefetcher
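Example 9-2 only builds the pointer chain; the effective latency itself is observed by then chasing that chain with dependent loads, so that each cache-line fetch cannot start until the previous load has returned (as the comment notes, p = (char **) *p; loads a value pointing to the next load). The following is a minimal, self-contained sketch of such a measurement, not the manual's own harness: the aperture size, the stride value, the iteration count, and the use of the standard C clock() routine for timing are illustrative assumptions.

/* Sketch only: chase the circular pointer chain of Example 9-2 and report
 * an approximate per-load latency.  Sizes and timing method are assumed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { APERTURE = 64 * 1024 * 1024 };   /* bytes in the chased region (assumed)   */

int main(void)
{
    char *pArray = malloc(APERTURE);
    long stride = 256;                  /* vary between 64 bytes and the hardware-
                                           prefetch trigger threshold distance     */
    long iterations = 20 * 1000 * 1000; /* dependent loads to time (assumed count) */
    char **p;
    char *next;

    if (pArray == NULL)
        return 1;

    /* Populate pArray for circular pointer chasing, as in Example 9-2. */
    for (long i = 0; i < APERTURE; i += stride) {
        p = (char **)&pArray[i];
        if (i + stride >= APERTURE)
            next = &pArray[0];
        else
            next = &pArray[i + stride];
        *p = next;                      /* populate the address of the next node */
    }

    /* Chase the chain: each load address depends on the previously loaded value,
       so elapsed time divided by the iteration count approximates the effective
       latency of fetching a cache line from memory. */
    clock_t t0 = clock();
    p = (char **)&pArray[0];
    for (long n = 0; n < iterations; n++)
        p = (char **)*p;                /* loads a value pointing to next load */
    clock_t t1 = clock();

    double ns = (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9;
    /* Printing p keeps the compiler from discarding the dependent-load chain. */
    printf("stride %ld: ~%.1f ns per dependent load (last p = %p)\n",
           stride, ns / (double)iterations, (void *)p);

    free(pArray);
    return 0;
}

Re-running the sketch with the stride swept from 64 bytes up past the trigger threshold distance shows the latency reduction the section describes: small strides are streamed ahead by the automatic hardware prefetcher, while strides beyond the trigger threshold pay close to the full memory latency on every load.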

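The re-arrangement of data access sequences mentioned in the bullet list above can be as simple as a loop interchange. The sketch below is illustrative and not taken from the manual; the array name grid and its dimensions are assumptions. Summing a two-dimensional array column-by-column generates one large-stride cache miss per element, while the interchanged row-by-row loop turns the same traffic into small, unit-stride accesses that the automatic hardware prefetcher can follow.

/* Sketch only: the same total work expressed as large-stride and as
 * small-stride memory traffic.  Array name and sizes are assumptions. */
#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static double grid[ROWS][COLS];

/* Large-stride traffic: consecutive accesses are COLS * sizeof(double)
 * bytes apart, typically beyond the hardware-prefetch trigger threshold. */
static double sum_by_columns(void)
{
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += grid[i][j];
    return sum;
}

/* Small-stride traffic after loop interchange: consecutive accesses are
 * 8 bytes apart, so the automatic hardware prefetcher can stream each row. */
static double sum_by_rows(void)
{
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += grid[i][j];
    return sum;
}

int main(void)
{
    printf("column order: %f\n", sum_by_columns());
    printf("row order:    %f\n", sum_by_rows());
    return 0;
}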