13.07.2015

Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING CACHE USAGE

9.7.2 Cache Management

Streaming instructions (PREFETCH and STORE) can be used to manage data and minimize disturbance of temporal data held within the processor’s caches.

In addition, the Pentium 4 processor takes advantage of Intel C++ Compiler support for C++ language-level features for the Streaming SIMD Extensions. Streaming SIMD Extensions and MMX technology instructions provide intrinsics that allow you to optimize cache utilization. Examples of such Intel compiler intrinsics are _MM_PREFETCH, _MM_STREAM, _MM_LOAD, _MM_SFENCE. For details, refer to the Intel C++ Compiler User’s Guide documentation.

The following examples of using prefetching instructions in a video encoder and decoder, as well as in a simple 8-byte memory copy, illustrate the performance gain from using the prefetching instructions for efficient cache management.

9.7.2.1 Video Encoder

In a video encoder, some of the data used during the encoding process is kept in the processor’s second-level cache. This is done to minimize the number of reference streams that must be re-read from system memory. To ensure that other writes do not disturb the data in the second-level cache, streaming stores (MOVNTQ) are used to write around all processor caches.

The prefetching cache management implemented for the video encoder reduces memory traffic. Second-level cache pollution is reduced by preventing single-use video frame data from entering the second-level cache. A non-temporal PREFETCH (PREFETCHNTA) instruction brings data into only one way of the second-level cache, thus reducing pollution of the second-level cache.

If the data brought directly into the second-level cache is not re-used, then there is a performance gain from the non-temporal prefetch over a temporal prefetch. The encoder uses non-temporal prefetches to avoid pollution of the second-level cache, increasing the number of second-level cache hits and decreasing the number of polluting write-backs to memory. The performance gain results from the more efficient use of the second-level cache, not only from the prefetch itself.

9.7.2.2 Video Decoder

In the video decoder example, completed frame data is written to the local memory of the graphics card, which is mapped to the WC (write-combining) memory type. A copy of the reference data is stored to WB memory at a later time by the processor in order to generate future data. The assumption is that the reference data is too large to fit in the processor’s caches. A streaming store is used to write the data around the cache, to avoid displacing other temporal data held in the caches. Later, the processor re-reads the data using PREFETCHNTA, which ensures maximum bandwidth yet minimizes disturbance of other cached temporal data by using the non-temporal (NTA) version of prefetch.
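The pattern described in these examples, reading single-use data with a non-temporal prefetch and writing results with streaming stores, can be sketched with compiler intrinsics. The following is a minimal illustration, not code from the manual. The function name stream_copy and the constants CACHE_LINE and PREFETCH_AHEAD are hypothetical, and the 64-byte line size and 4-line prefetch distance are assumptions that would need tuning on a given processor. The sketch also uses the 16-byte SSE2 streaming store (_mm_stream_si128, MOVNTDQ) in place of the 8-byte MOVNTQ named in the text, and falls back to memcpy when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_prefetch and _mm_sfence */
#define HAVE_SSE2 1
#endif

/* Assumed values for illustration only; tune for the target processor. */
enum { CACHE_LINE = 64, PREFETCH_AHEAD = 4 * CACHE_LINE };

/* Copy n bytes from src to dst while trying to keep both streams out of
 * the caches. Both pointers must be 16-byte aligned and n must be a
 * multiple of 16. */
void stream_copy(void *dst, const void *src, size_t n)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && n % 16 == 0);
#ifdef HAVE_SSE2
    const char *s = (const char *)src;
    char *d = (char *)dst;
    for (size_t i = 0; i < n; i += 16) {
        /* Once per assumed cache line, hint that data a few lines ahead
         * will be read once (PREFETCHNTA). Stay inside the buffer. */
        if (i % CACHE_LINE == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch(s + i + PREFETCH_AHEAD, _MM_HINT_NTA);

        __m128i v = _mm_load_si128((const __m128i *)(s + i));
        /* Streaming (non-temporal) store: write around the caches. */
        _mm_stream_si128((__m128i *)(d + i), v);
    }
    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
#else
    memcpy(dst, src, n);  /* portable fallback without cache hints */
#endif
}
```

A caller would pass buffers with 16-byte alignment, for example obtained from aligned_alloc(16, n) in C11. Whether this beats an ordinary copy depends on buffer size and reuse: if the destination is read again soon, bypassing the cache can hurt rather than help.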
