Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

…more expensive than accessing data from the immediate inner level in the cache/memory hierarchy, assuming similar degrees of data access parallelism. Thus, locality enhancement should start with characterizing the dominant data traffic locality. Appendix A, "Application Performance Tools," describes some techniques that can be used to determine the dominant data traffic locality for any workload.

Even if cache miss rates of the last level cache are low relative to the number of cache references, processors typically spend a sizable portion of their execution time waiting for cache misses to be serviced. Reducing cache misses by enhancing a program's locality is a key optimization. This can take several forms:

• Blocking to iterate over a portion of an array that will fit in the cache, so that subsequent references to the data block [or tile] are cache hit references (see the blocking sketch at the end of this section).
• Loop interchange to avoid crossing cache lines or page boundaries (see the interchange sketch at the end of this section).
• Loop skewing to make accesses contiguous.

Locality enhancement to the last level cache can be accomplished by sequencing the data access pattern to take advantage of hardware prefetching. This can also take several forms:

• Transformation of a sparsely populated multi-dimensional array into a one-dimensional array such that memory references occur in a sequential, small-stride pattern that is friendly to the hardware prefetch (see Section 2.2.4.4, "Data Prefetch," and the packing sketch at the end of this section).
• Optimal tile size and shape selection can further improve temporal data locality by increasing hit rates into the last level cache and reducing memory traffic resulting from the actions of hardware prefetching (see Section 9.6.11, "Hardware Prefetching and Cache Blocking Techniques").

It is important to avoid operations that work against locality-enhancing techniques. Using the lock prefix heavily can incur large delays when accessing memory, regardless of whether the data is in the cache or in system memory.

User/Source Coding Rule 10. (H impact, H generality) Optimization techniques such as blocking, loop interchange, loop skewing, and packing are best done by the compiler. Optimize data structures either to fit in one-half of the first-level cache or in the second-level cache; turn on loop optimizations in the compiler to enhance locality for nested loops.

Optimizing for one-half of the first-level cache will bring the greatest performance benefit in terms of cycle cost per data access. If one-half of the first-level cache is too small to be practical, optimize for the second-level cache. Optimizing for a point in between (for example, for the entire first-level cache) will likely not bring a substantial improvement over optimizing for the second-level cache.
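The following is a minimal sketch of the loop blocking technique listed above, applied to a matrix transpose. The matrix dimension N, the tile size BLOCK, and the function name are illustrative placeholders rather than code from this manual; BLOCK is chosen here so that one source tile plus one destination tile of doubles occupy roughly one-half of a 32-KByte first-level data cache, in line with User/Source Coding Rule 10.

/* Blocked (tiled) transpose: each BLOCK x BLOCK tile is reused while it
 * is resident in the first-level cache, so subsequent references to the
 * tile are cache hits rather than misses to memory. */
#define N      1024
#define BLOCK  32   /* 2 * 32 * 32 * sizeof(double) = 16 KBytes, about half of a 32-KByte L1D */

void transpose_blocked(double dst[N][N], const double src[N][N])
{
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}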
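The next sketch illustrates loop interchange for a two-dimensional C array stored in row-major order; the array dimensions and function name are again illustrative. In the original ordering the inner loop steps down a column, touching a different cache line (and eventually a different page) on every iteration; interchanging the loops makes the inner loop traverse a row in unit stride.

void scale_interchanged(float a[1024][1024], float s)
{
    /* Original nest (strided inner accesses):
     *   for (int j = 0; j < 1024; j++)
     *       for (int i = 0; i < 1024; i++)
     *           a[i][j] *= s;
     */
    for (int i = 0; i < 1024; i++)       /* after interchange */
        for (int j = 0; j < 1024; j++)   /* unit-stride inner loop */
            a[i][j] *= s;
}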
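Finally, a sketch of packing a sparsely populated multi-dimensional array into a one-dimensional array so that the hot loop reads memory in a sequential, small-stride pattern the hardware prefetcher can follow. The 3-D grid, the nonzero-occupancy test, and the function names are assumptions made for illustration.

#include <stddef.h>

#define DX 64
#define DY 64
#define DZ 64

static double grid[DX][DY][DZ];      /* sparsely populated source */
static double packed[DX * DY * DZ];  /* dense copy traversed sequentially */

/* Gather the occupied cells once into a contiguous array. */
size_t pack_grid(void)
{
    size_t n = 0;
    for (int x = 0; x < DX; x++)
        for (int y = 0; y < DY; y++)
            for (int z = 0; z < DZ; z++)
                if (grid[x][y][z] != 0.0)        /* occupancy test (assumed) */
                    packed[n++] = grid[x][y][z];
    return n;
}

/* The hot loop now walks packed[] in unit stride, which the hardware
 * prefetcher can recognize and stream ahead of the loads. */
double sum_packed(size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += packed[i];
    return s;
}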
