13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

• Use software prefetch concatenation. Arrange prefetches to avoid unnecessary prefetches at the end of an inner loop and to prefetch the first few iterations of the inner loop inside the next outer loop.
• Minimize the number of software prefetches. Prefetch instructions are not completely free in terms of bus cycles, machine cycles and resources; excessive usage of prefetches can adversely impact application performance.
• Interleave prefetches with computation instructions. For best performance, software prefetch instructions must be interspersed with computational instructions in the instruction sequence (rather than clustered together).

9.2 HARDWARE PREFETCHING OF DATA

Pentium M, Intel Core Solo, and Intel Core Duo processors and processors based on Intel Core microarchitecture and Intel NetBurst microarchitecture provide hardware data prefetch mechanisms which monitor application data access patterns and prefetch data automatically. This behavior is automatic and does not require programmer intervention.

For processors based on Intel NetBurst microarchitecture, characteristics of the hardware data prefetcher are:

1. It requires two successive cache misses in the last level cache to trigger the mechanism; these two cache misses must satisfy the condition that strides of the cache misses are less than the trigger distance of the hardware prefetch mechanism (see Table 2-6).
2. Attempts to stay 256 bytes ahead of current data access locations.
3. Follows only one stream per 4-KByte page (load or store).
4. Can prefetch up to 8 simultaneous, independent streams from eight different 4-KByte regions.
5. Does not prefetch across 4-KByte boundary. This is independent of paging modes.
6. Fetches data into second/third-level cache.
7. Does not prefetch UC or WC memory types.
8. Follows load and store streams. Issues Read For Ownership (RFO) transactions for store streams and Data Reads for load streams.

Other than items 2 and 4 discussed above, most other characteristics also apply to Pentium M, Intel Core Solo and Intel Core Duo processors. The hardware prefetcher implemented in the Pentium M processor fetches data to a second level cache. It can track 12 independent streams in the forward direction and 4 independent streams in the backward direction. The hardware prefetcher of Intel Core Solo processor can track 16 forward streams and 4 backward streams. On the Intel Core Duo processor, the hardware prefetcher in each core fetches data independently.
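
The address arithmetic behind items 1, 2 and 5 of the Intel NetBurst list can be sketched in C. This is only an illustrative software model, not a description of the actual hardware logic. The function names (same_4k_page, misses_can_trigger, prefetch_target) are hypothetical, and the trigger distance is passed in by the caller because the value from Table 2-6 is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_SIZE    4096u  /* 4-KByte region, from items 3 to 5 */
#define PREFETCH_AHEAD 256u   /* "256 bytes ahead", from item 2 */

/* True when both addresses fall inside the same 4-KByte region. */
static bool same_4k_page(uint64_t a, uint64_t b)
{
    return (a / REGION_SIZE) == (b / REGION_SIZE);
}

/*
 * Item 1 (simplified model): two successive last-level cache misses
 * qualify when their stride is less than the trigger distance.
 * trigger_distance is an assumption supplied by the caller.
 */
static bool misses_can_trigger(uint64_t miss1, uint64_t miss2,
                               uint64_t trigger_distance)
{
    uint64_t stride = (miss2 > miss1) ? (miss2 - miss1) : (miss1 - miss2);
    return stride < trigger_distance;
}

/*
 * Items 2 and 5 (simplified model, forward stream only): the target is
 * 256 bytes ahead of the current access, and no prefetch is modelled
 * when that target lies in a different 4-KByte region.
 * Returns true and stores the target address if a prefetch is modelled.
 */
static bool prefetch_target(uint64_t current, uint64_t *target)
{
    assert(target != NULL);
    uint64_t candidate = current + PREFETCH_AHEAD;
    if (!same_4k_page(current, candidate)) {
        return false;  /* would cross the 4-KByte boundary */
    }
    *target = candidate;
    return true;
}
```

Under this model, an access at 0x1000 yields a target of 0x1100, while an access at 0x1F80 yields no target because 0x2080 lies in the next 4-KByte region. A data layout that keeps each stream's working range inside as few 4-KByte regions as practical gives such a prefetcher fewer boundaries to stop at.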
