13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



GENERAL OPTIMIZATION GUIDELINES

miss patterns). Optimizing data access patterns to suit the hardware prefetcher is highly recommended, and should be a higher-priority consideration than using software prefetch instructions.

The hardware prefetcher is best for small-stride data access patterns in either direction with a cache-miss stride not far from 64 bytes. This is true for data accesses to addresses that are either known or unknown at the time of issuing the load operations. Software prefetch can complement the hardware prefetcher if used carefully.

There is a trade-off to make between hardware and software prefetching. This pertains to application characteristics such as regularity and stride of accesses. Bus bandwidth, issue bandwidth (the latency of loads on the critical path) and whether access patterns are suitable for non-temporal prefetch will also have an impact.

For a detailed description of how to use prefetching, see Chapter 9, “Optimizing Cache Usage.”

Chapter 5, “Optimizing for SIMD Integer Applications,” contains an example that uses software prefetch to implement a memory copy algorithm.

Tuning Suggestion 2. If a load is found to miss frequently, either insert a prefetch before it or (if issue bandwidth is a concern) move the load up to execute earlier.

3.7.3 Hardware Prefetching for First-Level Data Cache

The hardware prefetching mechanism for L1 in Intel Core microarchitecture is discussed in Section 2.1.4.2. A similar L1 prefetch mechanism is also available to processors based on Intel NetBurst microarchitecture with CPUID signature of family 15 and model 6.

Example 3-41 depicts a technique to trigger hardware prefetch. The code demonstrates traversing a linked list and performing some computational work on 2 members of each element that reside in 2 different cache lines. Each element is of size 192 bytes. The total size of all elements is larger than can be fitted in the L2 cache.
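
The data layout described for Example 3-41 can be sketched in C as follows. This is an illustrative sketch only and does not reproduce the manual's example. The names `struct node`, `build_list`, and `sum_list` are hypothetical, as are the 64-byte cache line size, the placement of the two members at offsets 8 and 64, and the contiguous allocation of the elements. The optional software prefetch of the next element illustrates Tuning Suggestion 2 and uses `__builtin_prefetch`, which is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical 192-byte list element spanning three cache lines. */
struct node {
    struct node *next; /* cache line 0 */
    int32_t a;         /* cache line 0: first member that is worked on */
    char pad0[CACHE_LINE - sizeof(struct node *) - sizeof(int32_t)];
    int32_t b;         /* cache line 1: second member that is worked on */
    char pad1[CACHE_LINE - sizeof(int32_t)];
    char pad2[CACHE_LINE]; /* cache line 2: not touched by the traversal */
};

_Static_assert(sizeof(struct node) == 192, "element must be 192 bytes");
_Static_assert(offsetof(struct node, b) == CACHE_LINE,
               "second member must start the second cache line");

/* Allocate n elements contiguously (an assumption of this sketch) and link
 * them in address order. Returns NULL if n is 0 or allocation fails. */
struct node *build_list(size_t n)
{
    if (n == 0)
        return NULL;
    struct node *nodes = aligned_alloc(CACHE_LINE, n * sizeof *nodes);
    if (nodes == NULL)
        return NULL;
    assert((uintptr_t)nodes % CACHE_LINE == 0);
    for (size_t i = 0; i < n; i++) {
        nodes[i].a = (int32_t)i;
        nodes[i].b = (int32_t)(2 * i);
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
    }
    return nodes;
}

/* Walk the list and touch two members per element, one in each of the
 * first two cache lines. If use_prefetch is nonzero, request both lines of
 * the next element before working on the current one. */
int64_t sum_list(const struct node *head, int use_prefetch)
{
    int64_t total = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        if (use_prefetch && p->next != NULL) {
            __builtin_prefetch(p->next, 0, 3);
            __builtin_prefetch((const char *)p->next + CACHE_LINE, 0, 3);
        }
        total += p->a; /* stand-in for work on the first member */
        total += p->b; /* stand-in for work on the second member */
    }
    return total;
}
```

To approximate the situation in the text, a caller would choose `n` so that `n * 192` bytes exceeds the L2 cache size of the target processor, then time `sum_list` with and without the prefetch flag. Whether the software prefetch helps or merely competes with the hardware prefetcher depends on the processor and on the access pattern, so any difference should be measured rather than assumed.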
