13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

• Prefetching far ahead can cause eviction of cached data from the caches prior to the data being used in execution.
• Not prefetching far enough can reduce the ability to overlap memory and execution latencies.

The processor treats a software prefetch as a hint to initiate a request to fetch data from the memory system. Prefetches consume processor resources, and using too many of them can limit their effectiveness. Examples of this include prefetching data in a loop for a reference outside the loop, and prefetching in a basic block that is frequently executed but seldom precedes the reference that the prefetch targets.

See: Chapter 9, “Optimizing Cache Usage.”

Automatic hardware prefetch is a feature in the Pentium 4 processor. It brings cache lines into the unified second-level cache based on prior reference patterns.

Software prefetching has the following characteristics:
• handles irregular access patterns, which do not trigger the hardware prefetcher
• handles prefetching of short arrays and avoids the hardware prefetching start-up delay before initiating the fetches
• must be added to new code, so it does not benefit existing applications

Hardware prefetching for the Pentium 4 processor has the following characteristics:
• works with existing applications
• does not require extensive study of prefetch instructions
• requires regular access patterns
• avoids instruction and issue port bandwidth overhead
• has a start-up penalty before the hardware prefetcher triggers and begins initiating fetches

The hardware prefetcher can handle multiple streams in either the forward or backward direction.
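As an illustration of software prefetching for an irregular access pattern, the following C sketch issues a prefetch hint a fixed number of iterations ahead of an indexed load. It is a minimal sketch, not code from the manual: `gather_sum` and `PREFETCH_DISTANCE` are hypothetical names, the distance of 8 is an arbitrary placeholder that would need tuning (too far risks eviction before use, too near hides little latency), and `__builtin_prefetch` is a GCC/Clang builtin assumed here as a stand-in for a specific prefetch instruction.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch.
   The value 8 is a placeholder, not a recommendation from the manual. */
#define PREFETCH_DISTANCE 8

/* Sum data[idx[i]] for i in [0, n). The indexed (gather) access is an
   irregular pattern of the kind that does not trigger a stride-based
   hardware prefetcher, so a software hint is issued instead. */
double gather_sum(const double *data, const size_t *idx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Only prefetch for references that the loop will actually make;
           prefetching past the end would waste resources. */
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: a hint only. Arguments: address,
               0 = read access, 3 = high temporal locality. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

Because the prefetch is only a hint, removing the `__builtin_prefetch` call does not change the result; it changes only how much memory latency can overlap with execution.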
The start-up delay and fetch-ahead have a larger effect for short arrays, where hardware prefetching generates requests for data beyond the end of the array (data that is never used). The hardware penalty diminishes if it is amortized over longer arrays.

Hardware prefetching is triggered after two successive cache misses in the last-level cache, and only if the linear address distance between these cache misses is within a threshold value. The threshold value depends on the processor implementation (see Table 2-6). However, hardware prefetching will not cross 4-KByte page boundaries. As a result, hardware prefetching can be very effective when dealing with cache miss patterns that have small strides, significantly less than half the threshold distance that triggers hardware prefetching. On the other hand, hardware prefetching will not benefit cache miss patterns that have frequent DTLB misses, or access strides that place successive cache misses farther apart than the trigger threshold distance.
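The trigger conditions described above can be sketched as a small C model. This is an illustrative approximation under stated assumptions, not the processor's actual logic: the function names are hypothetical, the threshold is passed as a parameter because its real value is implementation-specific (Table 2-6), "within a threshold" is assumed to mean less than or equal, and the page rule is modeled as the prefetch target having to lie in the same 4-KByte page as the most recent miss.

```c
#include <stdbool.h>
#include <stdint.h>

/* 4-KByte page size, as stated in the text. */
#define PAGE_BYTES 4096u

/* Absolute distance between two linear addresses, so that forward and
   backward streams are treated alike. */
static uint64_t abs_distance(uint64_t a, uint64_t b)
{
    return a > b ? a - b : b - a;
}

/* True if two successive misses are close enough to trigger prefetching.
   Assumption: "within" means distance <= threshold. */
bool misses_within_threshold(uint64_t prev_miss, uint64_t cur_miss,
                             uint64_t threshold)
{
    return abs_distance(prev_miss, cur_miss) <= threshold;
}

/* True if both addresses fall in the same 4-KByte page. */
bool same_4k_page(uint64_t a, uint64_t b)
{
    return (a / PAGE_BYTES) == (b / PAGE_BYTES);
}

/* Model: a prefetch of `target` may be issued only if the two misses
   satisfy the distance condition and the target does not cross into
   another 4-KByte page. */
bool may_prefetch(uint64_t prev_miss, uint64_t cur_miss, uint64_t target,
                  uint64_t threshold)
{
    return misses_within_threshold(prev_miss, cur_miss, threshold)
        && same_4k_page(cur_miss, target);
}
```

With a purely illustrative threshold of 256 bytes, misses at 0x1000 and 0x1040 would allow a prefetch of 0x1080, whereas misses at 0x1F80 and 0x1FC0 would not allow a prefetch of 0x2000, because that target lies in the next page.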
