13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



INSTRUCTION LATENCY AND THROUGHPUT

C.3 LATENCY AND THROUGHPUT

This section presents the latency and throughput information for commonly-used instructions including: MMX technology, Streaming SIMD Extensions, subsequent generations of SIMD instruction extensions, and most of the frequently used general-purpose integer and x87 floating-point instructions.

Due to the complexity of dynamic execution and out-of-order nature of the execution core, the instruction latency data may not be sufficient to accurately predict realistic performance of actual code sequences based on adding instruction latency data.

• Instruction latency data is useful when tuning a dependency chain. However, dependency chains limit the out-of-order core’s ability to execute micro-ops in parallel. Instruction throughput data are useful when tuning parallel code unencumbered by dependency chains.
• Numeric data in the tables is:
  - approximate and subject to change in future implementations of the microarchitecture.
  - not meant to be used as reference for instruction-level performance benchmarks. Comparison of instruction-level performance of microprocessors that are based on different microarchitectures is a complex subject and requires information that is beyond the scope of this manual. Comparisons of latency and throughput data between different microarchitectures can be misleading.

Appendix C.3.1 provides latency and throughput data for the register-to-register instruction type.
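The distinction drawn above between latency-bound dependency chains and throughput-bound parallel code can be illustrated with a deliberately simplified cycle estimate. This is a rough sketch, not a performance model. It assumes the throughput figure is expressed as clock cycles between successive issues of the same instruction, it ignores every other pipeline effect, and the function names and sample numbers are hypothetical, not taken from the tables.

```python
def dependent_chain_cycles(n_ops, latency):
    """Rough cycle count for n_ops instructions where each one
    consumes the result of the previous one (a dependency chain).
    Every instruction must wait for the full latency of its predecessor."""
    return n_ops * latency


def independent_stream_cycles(n_ops, latency, throughput):
    """Rough cycle count for n_ops mutually independent instructions.
    A new instruction issues every `throughput` cycles (assumed unit:
    clock cycles per issue); the last one still needs `latency` cycles
    to produce its result."""
    if n_ops == 0:
        return 0
    return (n_ops - 1) * throughput + latency


# Hypothetical figures: latency 4 clocks, throughput 1 clock.
chain = dependent_chain_cycles(100, 4)           # 400
stream = independent_stream_cycles(100, 4, 1)    # 103
# A half-clock figure (latency and throughput both 0.5) fits the same formula.
fast = independent_stream_cycles(100, 0.5, 0.5)  # 50.0
```

With the same instruction, the dependent version is bounded by latency and the independent version by throughput, which is why the two kinds of data apply to different tuning situations. Real code rarely matches either bound exactly, for the reasons the section gives.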
Appendix C.3.3 discusses how to adjust latency and throughput specifications for the register-to-memory and memory-to-register instructions.

In some cases, the latency or throughput figures given are just one half of a clock. This occurs only for the double-speed ALUs.

C.3.1 Latency and Throughput with Register Operands

Instruction latency and throughput data are presented in Table C-1 through Table C-10. Tables include Supplemental Streaming SIMD Extension 3, Streaming SIMD Extension 3, Streaming SIMD Extension 2, Streaming SIMD Extension, MMX technology and most common Intel 64 and IA-32 instructions. Instruction latency and throughput for different processor microarchitectures are in separate columns.

Processor instruction timing data may vary from one implementation to another. Intel 64 and IA-32 processors with different implementation characteristics can be identified by the encoded values of “display_family” and “display_model”. The definitions of “display_family” and “display_model” can be found in the reference pages of CPUID (see Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A). The tables of instruction and latency data are grouped by an abbreviated form of hex values “DisplayFamilyValue_DisplayModelValue”. Processors based on Intel NetBurst microarchitecture have a “DisplayFamilyValue” of 0FH, “DisplayModelValue”
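As an illustration of the “display_family” and “display_model” values referred to above, the following sketch decodes them from the 32-bit value that CPUID leaf 01H returns in EAX, following the rule given in the CPUID reference pages. The function names are hypothetical, the sample EAX values are illustrative inputs only, and the two-digit formatting with a trailing "H" is an assumption about how the abbreviated table labels are written.

```python
def display_family_model(eax):
    """Decode (display_family, display_model) from the CPUID.01H EAX value.

    Bit fields of EAX: model [7:4], family [11:8],
    extended model [19:16], extended family [27:20].
    """
    family = (eax >> 8) & 0xF
    model = (eax >> 4) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is added only when the base family field is 0FH.
    display_family = family + ext_family if family == 0xF else family

    # The extended model is prepended only for family 06H or 0FH.
    if family in (0x6, 0xF):
        display_model = (ext_model << 4) + model
    else:
        display_model = model
    return display_family, display_model


def signature_label(eax):
    """Format the pair as 'FF_MMH' (formatting is an assumption)."""
    fam, mod = display_family_model(eax)
    return f"{fam:02X}_{mod:02X}H"


# Illustrative EAX inputs (not read from real hardware here):
print(signature_label(0x00000F29))  # 0F_02H
print(signature_label(0x000006FB))  # 06_0FH
print(signature_label(0x000106A5))  # 06_1AH
```

On real hardware the EAX value would come from executing CPUID with EAX = 1; that step is omitted here so the decoding logic stays portable.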
