13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

3.6.4 Store Forwarding

The processor’s memory system only sends stores to memory (including cache) after store retirement. However, store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency.

There are two kinds of requirements for store forwarding. If these requirements are violated, store forwarding cannot occur and the load must get its data from the cache (so the store must write its data back to the cache first). This incurs a penalty that is largely related to the pipeline depth of the underlying microarchitecture.

The first requirement pertains to the size and alignment of the store-forwarding data. This restriction is likely to have high impact on overall application performance. Typically, a performance penalty due to violating this restriction can be prevented. The store-to-load forwarding restrictions vary from one microarchitecture to another. Several examples of coding pitfalls that cause store-forwarding stalls and solutions to these pitfalls are discussed in detail in Section 3.6.4.1, “Store-to-Load-Forwarding Restriction on Size and Alignment.” The second requirement is the availability of data, discussed in Section 3.6.4.2, “Store-forwarding Restriction on Data Availability.”

A good practice is to eliminate redundant load operations.

It may be possible to keep a temporary scalar variable in a register and never write it to memory. Generally, such a variable must not be accessible using indirect pointers. Moving a variable to a register eliminates all loads and stores of that variable and eliminates potential problems associated with store forwarding. However, it also increases register pressure.

Load instructions tend to start chains of computation. Since the out-of-order engine is based on data dependence, load instructions play a significant role in the engine’s ability to execute at a high rate. Eliminating loads should be given a high priority.

If a variable does not change between the time when it is stored and the time when it is used again, the register that was stored can be copied or used directly. If register pressure is too high, or an unseen function is called before the store and the second load, it may not be possible to eliminate the second load.

Assembly/Compiler Coding Rule 46. (H impact, M generality) Pass parameters in registers instead of on the stack where possible. Passing arguments on the stack requires a store followed by a reload. While this sequence is optimized in hardware by providing the value to the load directly from the memory order buffer without the need to access the data cache if permitted by store-forwarding restrictions, floating-point values incur a significant latency in forwarding. Passing floating-point arguments in (preferably XMM) registers should save this long-latency operation.

Parameter passing conventions may limit the choice of which parameters are passed in registers and which are passed on the stack. However, these limitations may be overcome if the compiler has control of the compilation of the whole binary (using whole-program optimization).
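
The advice on eliminating loads and passing values in registers can be illustrated with a small C sketch. The function names `scale_via_memory` and `scale_in_register` are hypothetical and do not come from the manual. The first version reads the scale factor through a pointer, so the variable is accessible indirectly and the compiler may have to reload it after every store to `out`. The second version takes the value as a plain argument, which common 64-bit calling conventions pass in an XMM register. Whether a given compiler actually keeps the value in a register depends on the ABI, the optimization level, and register pressure, so treat this as an illustration rather than a guarantee.

```c
#include <stddef.h>

/* Hypothetical illustration, not taken from the manual.
 * The scale factor lives in memory and is reached through a pointer.
 * Because out[] and *scale have the same type, the compiler must assume
 * a store to out[i] could modify *scale, so it may reload *scale on
 * every iteration. */
void scale_via_memory(float *out, const float *in, size_t n,
                      const float *scale)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *scale;   /* potential load of *scale each time */
}

/* Same computation with the scale factor passed by value.
 * The argument is a local scalar that no pointer can reach, so the
 * compiler is free to hold it in a register for the whole loop. */
void scale_in_register(float *out, const float *in, size_t n, float scale)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * scale;    /* no reload of the scale factor needed */
}
```

Both functions compute the same result when `out` does not overlap the scale factor; the difference lies only in how many loads the generated code may need. Inspecting the compiler's assembly output is the reliable way to confirm what happens on a particular toolchain.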
