13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

Method 2:
• Organize the data in consecutive lines.
• Access the data in increasing addresses, in sequential cache lines.

Example 3-43 demonstrates accesses to sequential cache lines that can benefit from the first-level cache prefetcher.

Example 3-43. Technique For Using L1 Hardware Prefetch

    unsigned int *p1, j, a, b;
    for (j = 0; j < num; j += 16)
    {
        a = p1[j];
        b = p1[j+1];
        // Use these two values
    }

By elevating the load operations from memory to the beginning of each iteration, it is likely that a significant part of the latency of the pair cache line transfer from memory to the second-level cache will be in parallel with the transfer of the first cache line.

The IP prefetcher uses only the lower 8 bits of the address to distinguish a specific address. If the code size of a loop is bigger than 256 bytes, two loads may appear similar in the lowest 8 bits and the IP prefetcher will be restricted. Therefore, if you have a loop bigger than 256 bytes, make sure that no two loads have the same lowest 8 bits in order to use the IP prefetcher.

3.7.4 Hardware Prefetching for Second-Level Cache

The Intel Core microarchitecture contains two second-level cache prefetchers:
• Streamer: Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line. To software, the L2 streamer's functionality is similar to the adjacent cache line prefetch mechanism found in processors based on Intel NetBurst microarchitecture.
• Data prefetch logic (DPL): DPL and L2 Streamer are triggered only by writeback memory type. They prefetch only within a page boundary (4 KBytes). Both L2 prefetchers can be triggered by software prefetch instructions and by prefetch requests from DCU prefetchers. DPL can also be triggered by read for ownership (RFO) operations. The L2 Streamer can also be triggered by DPL requests for L2 cache misses.

Software can gain from organizing data both according to the instruction pointer and according to line strides. For example, for matrix calculations, columns can be
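
The address rules described on this page (the lowest 8 bits used by the IP prefetcher, 128-byte aligned blocks holding a pair of cache lines, and the 4-KByte page boundary) can be sketched as plain integer arithmetic. The following is a minimal illustration, not part of the manual: the helper names ip_low8_collide, pair_line, same_page and demo are hypothetical, and the 64-byte line size is an assumption inferred from the statement that a 128-byte block holds two cache lines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sizes taken or inferred from the text (assumptions, see lead-in):
   two cache lines per 128-byte aligned block, 4-KByte pages. */
#define LINE_BYTES 64u
#define PAGE_BYTES 4096u

/* True if two load-instruction addresses share their lowest 8 bits,
   which is the case the text says restricts the IP prefetcher. */
static bool ip_low8_collide(uintptr_t ip_a, uintptr_t ip_b)
{
    return ((ip_a ^ ip_b) & 0xFFu) == 0;
}

/* Start address of the other cache line in the same 128-byte block. */
static uintptr_t pair_line(uintptr_t addr)
{
    uintptr_t line = addr & ~(uintptr_t)(LINE_BYTES - 1);
    return line ^ LINE_BYTES;
}

/* True if both addresses lie in the same 4-KByte page. */
static bool same_page(uintptr_t a, uintptr_t b)
{
    return (a / PAGE_BYTES) == (b / PAGE_BYTES);
}

/* Hypothetical usage with made-up addresses. */
static void demo(void)
{
    /* Loads 256 bytes apart look identical in the lowest 8 bits. */
    assert(ip_low8_collide(0x401010u, 0x401110u));
    assert(!ip_low8_collide(0x401010u, 0x401014u));

    /* The two lines of the block at 0x1000 are each other's pair. */
    assert(pair_line(0x1000u) == 0x1040u);
    assert(pair_line(0x1047u) == 0x1000u);

    /* 0x1FFF and 0x2000 are adjacent bytes but in different pages. */
    assert(!same_page(0x1FFFu, 0x2000u));
    assert(same_page(0x2000u, 0x2FFFu));
}
```

A check like ip_low8_collide only makes sense against the instruction addresses reported by a disassembler or linker map for the loads in a loop; the addresses used in demo are illustrative values only.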
