Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

Example 9-7. Data Access of a 3D Geometry Engine with Strip-mining (Contd.)

            compute the light vectors
            POINT LIGHTING code
            nvtx+=4
        }
    }

With strip-mining, all vertex data can be kept in the cache (for example, one way of the second-level cache) during the strip-mined transformation loop and reused in the lighting loop. Keeping data in the cache reduces both bus traffic and the number of prefetches used.

Table 9-1 summarizes the steps of the basic usage model that incorporates only software prefetch with strip-mining. The steps are:

• Do strip-mining: partition loops so that the dataset fits into the second-level cache.
• Use PREFETCHNTA if the data is only used once or the dataset fits into 32 KBytes (one way of the second-level cache). Use PREFETCHT0 if the dataset exceeds 32 KBytes.

The above steps are platform-specific and provide an implementation example. The variables NUM_STRIP and MAX_NUM_VX_PER_STRIP can be heuristically determined for peak performance for a specific application on a specific platform.

Table 9-1. Software Prefetching Considerations into Strip-mining Code

|                | Read-Once Array References        | Read-Multiple-Times Array References, Adjacent Passes   | Read-Multiple-Times Array References, Non-Adjacent Passes |
| Prefetch type  | Prefetchnta                       | Prefetch0, SM1                                          | Prefetch0, SM1 (2nd-level pollution)                      |
| Cache behavior | Evict one way; minimize pollution | Pay memory access cost for the first pass of each array; amortize the first pass with subsequent passes | Pay memory access cost for the first pass of every strip; amortize the first pass with subsequent passes |

9.6.11 Hardware Prefetching and Cache Blocking Techniques

Tuning data access patterns for the automatic hardware prefetch mechanism can minimize the memory access costs of the first pass of the read-multiple-times and some of the read-once memory references.
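The two-pass strip-mining usage model can be sketched in C as follows. The Vertex layout, the strip size, and the bodies of the transformation and lighting passes are illustrative assumptions, not the manual's actual code; `_mm_prefetch` with `_MM_HINT_T0` stands in for the PREFETCHT0 instruction.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

/* Illustrative strip size: chosen so that one strip of vertices fits in
 * one way of the second-level cache (a heuristic, platform-specific value). */
#define MAX_NUM_VX_PER_STRIP 1024

typedef struct { float x, y, z, w; } Vertex;   /* assumed vertex layout */

void transform_and_light(Vertex *v, size_t nvtx_total)
{
    for (size_t base = 0; base < nvtx_total; base += MAX_NUM_VX_PER_STRIP) {
        size_t n = nvtx_total - base;
        if (n > MAX_NUM_VX_PER_STRIP) n = MAX_NUM_VX_PER_STRIP;
        Vertex *strip = v + base;

        /* Pass 1: transformation loop. PREFETCHT0 (not NTA) is used here
         * because the lighting pass below reuses the same strip. */
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                _mm_prefetch((const char *)&strip[i + 8], _MM_HINT_T0);
            strip[i].x *= 2.0f;     /* stand-in for TRANSFORMATION code */
            strip[i].y *= 2.0f;
            strip[i].z *= 2.0f;
        }

        /* Pass 2: lighting loop. The strip is still resident in the
         * second-level cache, so no prefetch is required. */
        for (size_t i = 0; i < n; i++) {
            strip[i].x += 1.0f;     /* stand-in for POINT LIGHTING code */
        }
    }
}
```

Splitting the work into two loops over one cached strip, rather than two loops over the whole array, is what lets the second pass run without any memory traffic.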
An example of a read-once memory reference pattern is a matrix or image transpose: reading from a column-first orientation and writing to a row-first orientation, or vice versa. Example 9-8 shows a nested loop of data movement that represents a typical matrix/image transpose problem. If the dimensions of the array are large, not only will the footprint of the dataset exceed the last-level cache, but cache misses will occur at large strides.
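Example 9-8 itself is not reproduced in this excerpt; the following is a minimal sketch of the transpose loop it describes, with `num`, `src`, and `dest` as illustrative names. The inner loop reads `src` column-first, so each read jumps a full row of memory.

```c
#include <stddef.h>

/* Minimal sketch of a matrix transpose in the style of Example 9-8.
 * src is read column-first (stride of num floats per inner iteration);
 * dest is written row-first (unit stride). */
void transpose(float *dest, const float *src, size_t num)
{
    for (size_t i = 0; i < num; i++) {
        for (size_t j = 0; j < num; j++) {
            /* Each read of src advances num * sizeof(float) bytes; for
             * large num these strided misses dominate the memory cost. */
            dest[i * num + j] = src[j * num + i];
        }
    }
}
```

The large-stride reads are what defeat both the cache and the hardware prefetcher when `num` is large, which motivates the cache blocking techniques this section introduces.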
