13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


GENERAL OPTIMIZATION GUIDELINES

When not all sources can be read, a μop can stall in the rename stage until it can get access to enough ROB read ports to complete renaming the μop. This stall is usually short-lived. Typically, a μop will complete renaming in the next cycle, but it appears to the application as a loss of rename bandwidth.

Some of the software-visible situations that can cause ROB read port stalls include:
• Registers that have become cold and require a ROB read port because execution units are doing other independent calculations.
• Constants inside registers
• Pointer and index registers

In rare cases, ROB read port stalls may lead to more significant performance degradations. There are a couple of heuristics that can help prevent over-subscribing the ROB read ports:
• Keep common register usage clustered together. Multiple references to the same written-back register can be “folded” inside the out-of-order execution core.
• Keep dependency chains intact. This practice ensures that the registers will not have been written back when the new micro-ops are written to the RS.

These two scheduling heuristics may conflict with other more common scheduling heuristics.
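The effect of the two heuristics can be illustrated with a deliberately simplified model. The sketch below is not a description of the actual rename hardware: the port count READ_PORTS, the register names, and the rule that folding applies across one rename group are all assumptions chosen for illustration.

```python
# Toy model of ROB read-port demand for one group of uops renamed together.
# Assumptions made for this sketch (not taken from the manual text):
#   - READ_PORTS is a hypothetical port count
#   - sources still "in flight" are bypassed and need no ROB read port
#   - repeated reads of the same written-back register fold into one port use
READ_PORTS = 2


def ports_needed(group, written_back):
    """Count distinct written-back (cold) registers read by the group."""
    cold = set()
    for sources in group:
        for reg in sources:
            if reg in written_back:
                cold.add(reg)
    return len(cold)


def stalls(group, written_back):
    """True if the group needs more ROB read ports than the model provides."""
    return ports_needed(group, written_back) > READ_PORTS


written_back = {"rax", "rbx", "rcx", "rdx"}  # cold registers; r8 is in flight
clustered = [("rax", "rbx"), ("rax", "rbx"), ("rax", "r8")]
scattered = [("rax", "rbx"), ("rcx", "rdx"), ("rax", "r8")]

print(ports_needed(clustered, written_back), stalls(clustered, written_back))
print(ports_needed(scattered, written_back), stalls(scattered, written_back))
```

In this model the clustered group reads only two distinct cold registers and fits the assumed ports, while the scattered group reads four and would stall. A source that is still in flight (r8 here) costs no port, which is the intent of keeping dependency chains intact.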
To reduce demand on the ROB read port, use these two heuristics only if both of the following conditions are met:
• the code uses short-latency operations
• indications of actual ROB read port stalls can be confirmed by measurements of the performance event (the relevant event is RAT_STALLS.ROB_READ_PORT, see Appendix A of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B)

If the code has a long dependency chain, these two heuristics should not be used because they can cause the RS to fill, causing damage that outweighs the positive effects of reducing demands on the ROB read port.

3.5.2.2 Bypass between Execution Domains

Floating point (FP) loads have an extra cycle of latency. Moves between FP and SIMD stacks have another additional cycle of latency.

Example:
    ADDPS XMM0, XMM1
    PAND XMM0, XMM3
    ADDPS XMM2, XMM0

The overall latency for the above calculation is 9 cycles:
• 3 cycles for each ADDPS instruction
• 1 cycle for the PAND instruction

The remaining 2 cycles come from the two domain crossings (ADDPS to PAND, then PAND to ADDPS), each of which adds one bypass cycle.
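The 9-cycle total can be reproduced with a small latency model for a serial dependency chain. This is a minimal sketch, not a cycle-accurate simulator: the instruction latencies and the one-cycle bypass are taken from the text above, while the domain labels "fp" and "int" and the function name chain_latency are names chosen for illustration.

```python
# Toy latency model for a serial dependency chain.
# Latencies (3 for ADDPS, 1 for PAND) and the 1-cycle bypass come from the
# text above; the domain labels are hypothetical names for this sketch.
BYPASS_PENALTY = 1

INSTRUCTIONS = {
    "ADDPS": ("fp", 3),   # FP domain, 3-cycle latency
    "PAND": ("int", 1),   # SIMD integer domain, 1-cycle latency
}


def chain_latency(chain):
    """Sum instruction latencies, adding one cycle per domain crossing."""
    total = 0
    prev_domain = None
    for mnemonic in chain:
        domain, latency = INSTRUCTIONS[mnemonic]
        if prev_domain is not None and domain != prev_domain:
            total += BYPASS_PENALTY  # result moves to a different domain
        total += latency
        prev_domain = domain
    return total


print(chain_latency(["ADDPS", "PAND", "ADDPS"]))  # 9
```

For comparison, a chain that stays in one domain, such as chain_latency(["ADDPS", "ADDPS"]), pays no bypass cycles in this model and returns 6.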
