Intel® 64 and IA-32 Architectures Optimization Reference Manual

INSTRUCTION LATENCY AND THROUGHPUT

- The FP_EXECUTE unit is actually a cluster of execution units, roughly consisting of seven separate execution units.
- The FP_ADD unit handles x87 and SIMD floating-point add and subtract operations.
- The FP_MUL unit handles x87 and SIMD floating-point multiply operations.
- The FP_DIV unit handles x87 and SIMD floating-point divide and square-root operations.
- The MMX_SHFT unit handles shift and rotate operations.
- The MMX_ALU unit handles SIMD integer ALU operations.
- The MMX_MISC unit handles reciprocal MMX computations and some integer operations.
- FP_MISC designates other execution units in port 1 that are separate from the six units listed above.

3. It may be possible to construct repetitive calls to some Intel 64 and IA-32 instructions in code sequences to achieve latency that is one or two clock cycles faster than the more realistic number listed in this table.

4. Latency and throughput of transcendental instructions can vary substantially in a dynamic execution environment. Only an approximate value or a range of values is given for these instructions.

5. The FXCH instruction has 0 latency in code sequences. However, it is limited to an issue rate of one instruction per clock cycle.

6. The load constant instructions, FINCSTP, and FDECSTP have 0 latency in code sequences.

7. Selection of conditional jump instructions should be based on the recommendations of Section 3.4.1, “Branch Prediction Optimization,” to improve the predictability of branches. When branches are predicted successfully, the latency of jcc is effectively zero.

8. RCL/RCR with a shift count of 1 are optimized. RCL/RCR with a shift count other than 1 execute more slowly. This applies to the Pentium 4 and Intel Xeon processors.

C.3.3 Latency and Throughput with Memory Operands

Typically, an instruction with a memory address as the source operand adds one more μop to its “reg, reg” form. However, the throughput in most cases remains the same because the load operation utilizes port 2 without affecting port 0 or port 1.

Many instructions accept a memory address as either the source operand or the destination operand. The former is commonly referred to as a load operation, while the latter is referred to as a store operation.
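
The rule of thumb for memory source operands can be written as a small bookkeeping sketch. This is only an illustration of the "one more μop, load on port 2" statement above, not a model of any real instruction: the `RegRegForm` record, the `with_memory_source` helper, and the example values are hypothetical and do not come from the manual's tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegRegForm:
    """Hypothetical record describing the "reg, reg" form of an instruction."""
    name: str
    uops: int        # number of uops in the register-register form
    exec_port: int   # execution port the operation itself uses (0 or 1)


# Per the text, the load operation is issued on port 2.
LOAD_PORT = 2


def with_memory_source(form: RegRegForm) -> tuple[int, set[int]]:
    """Estimate uop count and ports when the source operand is a memory address.

    Applies the typical case only: one extra uop for the load, issued on
    port 2, so the execution port of the "reg, reg" form is unaffected.
    """
    return form.uops + 1, {form.exec_port, LOAD_PORT}


# Illustrative values only (not taken from the latency tables).
example = RegRegForm(name="hypothetical_op", uops=1, exec_port=1)
uops, ports = with_memory_source(example)
print(uops, sorted(ports))  # 2 [1, 2]
```

Because the extra μop lands on a port the "reg, reg" form does not use, the sketch leaves the execution port unchanged, which mirrors why throughput usually stays the same. Instructions that deviate from the typical case would need their own entries.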
