13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


GENERAL OPTIMIZATION GUIDELINES

3.5.2.5 Partial Flag Register Stalls

A “partial flag register stall” occurs when an instruction modifies a part of the flag register and the following instruction is dependent on the outcome of the flags. This happens most often with shift instructions (SAR, SAL, SHR, SHL). The flags are not modified in the case of a zero shift count, but the shift count is usually known only at execution time. The front end stalls until the instruction is retired.

Other instructions that can modify some part of the flag register include CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly with a partial flag register stall and alternative code without the stall is shown in Example 3-21.

In processors based on Intel Core microarchitecture, a shift by an immediate count of 1 is handled by special hardware, so it does not experience a partial flag stall.

Example 3-21. Avoiding Partial Flag Register Stalls

A Sequence with Partial Flag Register Stall

    xor  eax, eax
    mov  ecx, a
    sar  ecx, 2
    setz al
    ; No partial register stall,
    ; but flag stall as sar may
    ; change the flags

Alternate Sequence without Partial Flag Register Stall

    xor  eax, eax
    mov  ecx, a
    sar  ecx, 2
    test ecx, ecx
    setz al
    ; No partial reg or flag stall,
    ; test always updates
    ; all the flags

3.5.2.6 Floating Point/SIMD Operands in Intel NetBurst microarchitecture

In processors based on Intel NetBurst microarchitecture, the latency of MMX or SIMD floating point register-to-register moves is significant. This can have implications for register allocation.

Moves that write a portion of a register can introduce unwanted dependences. The MOVSD REG, REG instruction writes only the bottom 64 bits of a register, not all 128 bits. This introduces a dependence on the preceding instruction that produces the upper 64 bits (even if those bits are no longer wanted). The dependence inhibits register renaming, and thereby reduces parallelism.

Use MOVAPD as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the μops for MOVAPD use a different execution port and this port is more likely to be free. The change can impact performance. There may be exceptional cases where the latency matters more than the dependence or the execution port.
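
The difference between the two moves can be pictured with a small C model. This is only a sketch of the architectural behavior described above, not of the hardware: the type xmm_model and the functions movsd_model, movapd_model, and xmm_move_self_check are hypothetical names introduced for illustration. The point is that the modeled MOVSD result cannot be formed without the old destination value, while the modeled MOVAPD result can.

```c
#include <assert.h>

/* Hypothetical model of a 128-bit XMM register holding two doubles. */
typedef struct {
    double lo;  /* bits 63:0   */
    double hi;  /* bits 127:64 */
} xmm_model;

/* Models MOVSD dst, src (register form): only the low 64 bits are
 * written, so the result still needs the old upper half of dst.
 * That input is the unwanted dependence. */
xmm_model movsd_model(xmm_model dst, xmm_model src)
{
    dst.lo = src.lo;
    return dst;
}

/* Models MOVAPD dst, src: all 128 bits are overwritten, so the old
 * value of dst is never read. */
xmm_model movapd_model(xmm_model dst, xmm_model src)
{
    (void)dst;  /* deliberately unused: no dependence on old dst */
    return src;
}

/* Small self-check that doubles as a usage example. */
void xmm_move_self_check(void)
{
    xmm_model old_dst = { 1.0, 2.0 };
    xmm_model src     = { 3.0, 4.0 };

    xmm_model merged = movsd_model(old_dst, src);
    assert(merged.lo == 3.0 && merged.hi == 2.0);  /* upper half kept */

    xmm_model full = movapd_model(old_dst, src);
    assert(full.lo == 3.0 && full.hi == 4.0);      /* fully replaced */
}
```

In the model, the dst parameter of movapd_model is dead, which mirrors why a full-width write lets the register be renamed freely.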
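
For Section 3.5.2.5, the flag rule that makes shifts awkward (a zero count leaves the flags untouched) can also be sketched in C. This is a simplified model under stated assumptions: it tracks only the zero flag, it masks the count to 5 bits as for 32-bit operands, and every name in it (flags_model, sar_model, test_model, setz_model, setz_after_sar, setz_after_sar_test, flag_self_check) is hypothetical. It shows the architectural flag dependence only; the stall itself is a microarchitectural effect that C cannot reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified flag state: only the zero flag (ZF) is modeled. */
typedef struct {
    bool zf;
} flags_model;

/* Models SAR r32, count. A zero count leaves the flags as they were,
 * so the flag result depends on an earlier flag writer. */
int32_t sar_model(int32_t value, unsigned count, flags_model *f)
{
    count &= 31u;                 /* count is masked to 5 bits */
    if (count == 0) {
        return value;             /* flags keep their previous state */
    }
    /* Portable arithmetic shift for negative values. */
    int32_t result = value >= 0 ? value >> count : ~(~value >> count);
    f->zf = (result == 0);
    return result;
}

/* Models TEST r32, r32 with the same register: ZF is always rewritten. */
void test_model(int32_t value, flags_model *f)
{
    f->zf = (value == 0);
}

/* Models SETZ r8: returns 1 when ZF is set, else 0. */
uint8_t setz_model(const flags_model *f)
{
    return f->zf ? 1 : 0;
}

/* First sequence of Example 3-21: SETZ reads whatever SAR left in ZF. */
uint8_t setz_after_sar(int32_t a, unsigned count, bool stale_zf)
{
    flags_model f = { stale_zf };
    (void)sar_model(a, count, &f);
    return setz_model(&f);
}

/* Alternate sequence: TEST rewrites ZF before SETZ reads it. */
uint8_t setz_after_sar_test(int32_t a, unsigned count, bool stale_zf)
{
    flags_model f = { stale_zf };
    int32_t ecx = sar_model(a, count, &f);
    test_model(ecx, &f);
    return setz_model(&f);
}

/* Small self-check that doubles as a usage example. */
void flag_self_check(void)
{
    /* Zero count: SAR passes the stale ZF through, TEST does not. */
    assert(setz_after_sar(8, 0, true) == 1);
    assert(setz_after_sar_test(8, 0, true) == 0);
    /* Nonzero count: both sequences agree. */
    assert(setz_after_sar(3, 2, false) == 1);
    assert(setz_after_sar_test(3, 2, false) == 1);
}
```

With a zero count, the first sequence returns the stale flag value while the alternate sequence does not, which is the dependence that the inserted TEST removes.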
