13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING CACHE USAGE

[Figure 9-8. Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes Loops. Left panel, temporally adjacent passes: Prefetchnta Dataset A, Reuse Dataset A, Prefetchnta Dataset B, Reuse Dataset B. Right panel, temporally non-adjacent passes: Prefetcht0 Dataset A, Prefetcht0 Dataset B, Reuse Dataset A, Reuse Dataset B. Strips are labeled SM1 and SM2.]

For Pentium 4 processors, the left scenario shows a graphical implementation of using PREFETCHNTA to prefetch data into selected ways of the second-level cache only (SM1 denotes strip mine one way of second-level), minimizing second-level cache pollution. Use PREFETCHNTA if the data is only touched once during the entire execution pass in order to minimize cache pollution in the higher level caches. This provides instant availability when the read access is issued, assuming the prefetch was issued far enough ahead.

In the scenario to the right (see Figure 9-8), keeping the data in one way of the second-level cache does not improve cache locality. Therefore, use PREFETCHT0 to prefetch the data. This amortizes the latency of the memory references in passes 1 and 2, and keeps a copy of the data in second-level cache, which reduces memory traffic and latencies for passes 3 and 4. To further reduce the latency, it might be worth considering extra PREFETCHNTA instructions prior to the memory references in passes 3 and 4.

In Example 9-6, consider the data access patterns of a 3D geometry engine first without strip-mining and then incorporating strip-mining. Note that the 4-wide SIMD instructions of the Pentium III processor can process 4 vertices per iteration.

Without strip-mining, all the x,y,z coordinates for the four vertices must be re-fetched from memory in the second pass, that is, the lighting loop. This causes
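
The temporally adjacent case discussed above (a transform pass followed by a lighting pass over the same vertices) can be illustrated with a minimal C sketch. It is not the manual's Example 9-6: the `Vertex` layout, the `transform` and `light` helpers, and the `STRIP` and `PF_AHEAD` values are hypothetical placeholders chosen only for illustration, and the prefetch distance would need tuning on real hardware. The sketch assumes the `_mm_prefetch` intrinsic with `_MM_HINT_NTA` as a way to request PREFETCHNTA from C, and falls back to a no-op on targets without SSE.

```c
#include <assert.h>
#include <stddef.h>

/* Prefetch wrapper: SSE intrinsic where available, otherwise a no-op. */
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)0)
#endif

/* Hypothetical vertex layout and tuning constants (assumptions). */
typedef struct { float x, y, z; } Vertex;
enum {
    STRIP = 256,    /* vertices per strip; should fit the targeted cache portion */
    PF_AHEAD = 16   /* how many vertices ahead to prefetch */
};

/* Hypothetical pass 1: scale a vertex in place. */
static void transform(Vertex *v, float scale) {
    v->x *= scale; v->y *= scale; v->z *= scale;
}

/* Hypothetical pass 2: a stand-in "lighting" term computed from z. */
static float light(const Vertex *v) {
    return v->z > 0.0f ? v->z : 0.0f;
}

/* Without strip-mining: pass 2 starts only after pass 1 has walked the
 * whole array, so for large n the early vertices have likely been evicted. */
float two_pass_plain(Vertex *v, size_t n, float scale) {
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        transform(&v[i], scale);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += light(&v[i]);
    return sum;
}

/* With strip-mining: both passes run over one strip before moving on,
 * so pass 2 reuses data that pass 1 just brought into the cache. */
float two_pass_strip_mined(Vertex *v, size_t n, float scale) {
    assert(v != NULL || n == 0);
    float sum = 0.0f;
    for (size_t base = 0; base < n; base += STRIP) {
        size_t end = (n - base > STRIP) ? base + STRIP : n;
        for (size_t i = base; i < end; i++) {
            if (i + PF_AHEAD < n)
                PREFETCH_NTA(&v[i + PF_AHEAD]);  /* hint only; no data is changed */
            transform(&v[i], scale);
        }
        for (size_t i = base; i < end; i++)
            sum += light(&v[i]);
    }
    return sum;
}
```

Both functions visit the vertices in the same order within each pass, so they return the same result; only the memory access pattern differs. For passes that are not temporally adjacent, the same wrapper pattern with `_MM_HINT_T0` would correspond to the PREFETCHT0 choice described in the text.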
