13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual




GENERAL OPTIMIZATION GUIDELINES

For example, using a simple function that returns an input parameter (representative of tight, short loops), the per-iteration cost of packing/unpacking may range from slightly more than 7 cycles (the shuffle with store forwarding case, Example 3-22) to ~0.9 cycles (accomplished by several test cases). Across 27 test cases (consisting of one of the alternate packing methods, no result-simplification/simplification of either 1 or 4 results, no stack optimization or with stack optimization), the average per-iteration cost of packing/unpacking is about 1.7 cycles.

Generally speaking, packing methods 2 and 3 (see Example 3-23) tend to be more robust than packing method 1; the optimal choice of simplifying 1 or 4 results will be affected by register pressure of the runtime and other relevant microarchitectural conditions.

Note that the numeric discussion of per-iteration cost of packing/unpacking is illustrative only. It will vary with test cases using a different baseline code sequence and will generally increase if the non-vectorizable routine requires a longer time to execute, because the number of loop iterations that can reside in flight in the execution core decreases.

3.6 OPTIMIZING MEMORY ACCESSES

This section discusses guidelines for optimizing code and data memory accesses. The most important recommendations are:

• Execute load and store operations within available execution bandwidth.
• Enable forward progress of speculative execution.
• Enable store forwarding to proceed.
• Align data, paying attention to data layout and stack alignment.
• Place code and data on separate pages.
• Enhance data locality.
• Use prefetching and cacheability control instructions.
• Enhance code locality and align branch targets.
• Take advantage of write combining.

Alignment and forwarding problems are among the most common sources of large delays on processors based on Intel NetBurst microarchitecture.

3.6.1 Load and Store Execution Bandwidth

Typically, loads and stores are the most frequent operations in a workload; it is not uncommon for up to 40% of the instructions in a workload to carry load or store intent. Each generation of microarchitecture provides multiple buffers to support executing load and store operations while there are instructions in flight.
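
As an illustration of the "align data" recommendation listed in Section 3.6, the following C11 sketch shows one possible way to request cache-line alignment for a data structure and to check it at run time. It is not taken from the manual: the 64-byte line size (CACHE_LINE), the record type (struct sample), and the helpers is_aligned and alloc_samples are assumptions introduced only for this example, and the appropriate alignment should be confirmed for the target processor.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>   /* aligned_alloc, free */

/* Assumption: 64-byte cache lines. Verify this for the target processor. */
#define CACHE_LINE 64

/* Hypothetical record type. alignas makes every instance start on a
   cache-line boundary, so no instance straddles two lines. */
struct sample {
    alignas(CACHE_LINE) float values[16];
};

/* aligned_alloc requires the size to be a multiple of the alignment;
   this compile-time check documents that the type satisfies it. */
static_assert(sizeof(struct sample) % CACHE_LINE == 0,
              "struct sample must fill whole cache lines");

/* Returns nonzero if p is aligned to `alignment` bytes.
   `alignment` must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Allocates `count` samples starting on a cache-line boundary.
   Returns NULL on failure; release the memory with free(). */
static struct sample *alloc_samples(size_t count)
{
    return aligned_alloc(CACHE_LINE, count * sizeof(struct sample));
}
```

A caller would obtain a buffer with alloc_samples(n), may confirm the placement with is_aligned(buffer, CACHE_LINE), and releases it with free(buffer). Whether alignment of this kind removes a measurable delay depends on the access pattern and the microarchitecture, so it should be validated by measurement.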
