Intel® 64 and IA-32 Architectures Optimization Reference Manual
OPTIMIZING CACHE USAGE

The automatic hardware prefetcher begins at half the trigger threshold distance and reaches maximum benefit when the cache-miss stride is 64 bytes.

[Figure 9-1 plots the upper bound of pointer-chasing latency reduction: effective latency reduction (0%–120%) as a function of access stride (64–240 bytes), with separate curves for Family 15 Models 3–4; Family 15 Models 0, 1, 2; Family 6 Model 13; Family 6 Model 14; and Family 15 Model 6.]

Figure 9-1. Effective Latency Reduction as a Function of Access Stride

9.6.4 Example of Latency Hiding with S/W Prefetch Instruction

Achieving the highest level of memory optimization using PREFETCH instructions requires an understanding of the architecture of a given machine. This section translates the key architectural implications into several simple guidelines for programmers to use.

Figure 9-2 and Figure 9-3 show two scenarios of a simplified 3D geometry pipeline as an example. A 3D-geometry pipeline typically fetches one vertex record at a time and then performs transformation and lighting functions on it. Both figures show two separate pipelines: an execution pipeline and a memory pipeline (front-side bus).

Since the Pentium 4 processor (similar to the Pentium II and Pentium III processors) completely decouples the functionality of execution and memory access, the two pipelines can function concurrently. Figure 9-2 shows "bubbles" in both the execution and memory pipelines. When loads are issued for accessing vertex data, the execution units sit idle and wait until data is returned. On the other hand, the memory bus sits idle while the execution units are processing vertices. This scenario severely decreases the advantage of having a decoupled architecture.
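The bubbles in Figure 9-2 can be reduced by issuing a software prefetch for a future vertex record while the execution units are busy with the current one, so the memory and execution pipelines overlap. The following C sketch is not taken from the manual; the Vertex layout, the transform_and_light stand-in, and the prefetch distance of four records are illustrative assumptions that would need tuning for a real pipeline and a real machine.

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#include <stddef.h>

/* Hypothetical vertex record; the layout is illustrative and padded
 * to 64 bytes so one record maps to one cache line. */
typedef struct {
    float position[4];
    float normal[4];
    float texcoord[2];
    float pad[6];
} Vertex;

/* Stand-in for the real transformation and lighting work. */
static void transform_and_light(Vertex *v)
{
    for (int i = 0; i < 4; i++)
        v->position[i] *= 2.0f;
}

void process_vertices(Vertex *vertices, size_t count)
{
    /* Assumed look-ahead: how many records ahead to prefetch.
     * Choose it so the memory latency is covered by the work done
     * on the intervening vertices. */
    const size_t PREFETCH_DISTANCE = 4;

    for (size_t i = 0; i < count; i++) {
        /* Request a future vertex while the execution units process
         * the current one, overlapping the two pipelines. */
        if (i + PREFETCH_DISTANCE < count)
            _mm_prefetch((const char *)&vertices[i + PREFETCH_DISTANCE],
                         _MM_HINT_T0);

        transform_and_light(&vertices[i]);
    }
}

In this sketch _MM_HINT_T0 brings the record into all cache levels; if the vertex data will not be reused after processing, _MM_HINT_NTA could be used instead to reduce cache pollution.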
