
Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING CACHE USAGE

…is not resident in the TLB, a TLB miss results and the page table must be read from memory. The TLB miss results in a performance degradation, since another memory access must be performed (assuming that the translation is not already present in the processor caches) to update the TLB. The TLB can be preloaded with the page table entry for the next desired page by accessing (or touching) an address in that page. This is similar to prefetch, but instead of a data cache line, the page table entry is loaded in advance of its use. This helps to ensure that the page table entry is resident in the TLB and that the prefetch happens as requested subsequently.

9.7.2.6 Using the 8-byte Streaming Stores and Software Prefetch

Example 9-10 presents the copy algorithm that uses second-level cache. The algorithm performs the following steps:

1. Uses a blocking technique to transfer 8-byte data from memory into second-level cache using the _MM_PREFETCH intrinsic, 128 bytes at a time, to fill a block. The size of a block should be less than one half of the size of the second-level cache, but large enough to amortize the cost of the loop.

2. Loads the data into an XMM register using the _MM_LOAD_PS intrinsic.

3. Transfers the 8-byte data to a different memory location via the _MM_STREAM intrinsics, bypassing the cache. For this operation, it is important to ensure that the page table entry prefetched for the memory is preloaded in the TLB.

In Example 9-10, eight _MM_LOAD_PS and _MM_STREAM_PS intrinsics are used so that all of the data prefetched (a 128-byte cache line) is written back. The prefetch and streaming stores are executed in separate loops to minimize the number of transitions between reading and writing data.
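The steps above can be sketched in C with SSE intrinsics. This is a minimal illustration of the scheme, not Example 9-10 itself: the block size, the choice of _MM_HINT_T1 as the prefetch hint targeting the second-level cache, the GCC-style alignment attribute, and all names are assumptions made for the sketch.

```c
#include <xmmintrin.h>  /* SSE: _mm_prefetch, _mm_load_ps, _mm_stream_ps, _mm_sfence */

#define N     8192      /* total floats to copy */
#define BLOCK 2048      /* floats per block: 8 KB here, assumed well under
                           half the second-level cache size */

/* 16-byte alignment is required by _mm_load_ps and _mm_stream_ps.
   The source gets one extra block so the TLB touch below stays in bounds. */
static float a[N + BLOCK] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));

static void block_copy(const float *src, float *dst, int n)
{
    volatile float temp;

    for (int kk = 0; kk < n; kk += BLOCK) {
        /* Touch one element a block ahead so the page table entry for the
           next region is loaded into the TLB before the prefetches reach
           that page (the TEMP = A[KK+CACHESIZE] idea in the text). */
        temp = src[kk + BLOCK];
        (void)temp;

        /* Read phase: prefetch the block toward the second-level cache,
           one 128-byte region (32 floats) per iteration. */
        for (int j = kk; j < kk + BLOCK; j += 32)
            _mm_prefetch((const char *)&src[j], _MM_HINT_T1);

        /* Write phase, kept in a separate loop: load 16 bytes (4 floats)
           into an XMM register, then stream them to memory, bypassing
           the cache. */
        for (int j = kk; j < kk + BLOCK; j += 4)
            _mm_stream_ps(&dst[j], _mm_load_ps(&src[j]));
    }
    /* Fence so the non-temporal stores become globally visible. */
    _mm_sfence();
}
```

Note that the prefetch loop and the streaming-store loop are deliberately not interleaved, so the bus alternates between read and write bursts only once per block rather than on every iteration.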
This significantly improves the bandwidth of the memory accesses.

The TEMP = A[KK+CACHESIZE] instruction is used to ensure that the page table entry for array A is entered in the TLB prior to prefetching. This is essentially a prefetch itself, as a cache line is filled from that memory location with this instruction. Hence, the prefetching starts from KK+4 in this loop.

This example assumes that the destination of the copy is not temporally adjacent to the code. If the copied data is destined to be reused in the near future, then the streaming store instructions should be replaced with regular 128-bit stores (_MM_STORE_PS). This is required because the implementation of streaming stores on the Pentium 4 processor writes data directly to memory, bypassing the cache, while maintaining cache coherency.

9.7.2.7 Using 16-byte Streaming Stores and Hardware Prefetch

An alternate technique for optimizing a large-region memory copy is to take advantage of the hardware prefetcher, 16-byte streaming stores, and apply a segmented
