Intel® 64 and IA-32 Architectures Optimization Reference Manual

INSTRUCTION LATENCY AND THROUGHPUT

While several items on the above list involve selecting the right instruction, this appendix focuses on the following issues. These are listed in priority order, though which item contributes most to performance varies by application:

• Maximize the flow of μops into the execution core. Instructions that consist of more than four μops require additional steps from microcode ROM. Instructions with longer μop flows incur a delay in the front end and reduce the supply of μops to the execution core.
  In Pentium 4 and Intel Xeon processors, transfers to microcode ROM often reduce how efficiently μops can be packed into the trace cache. Where possible, it is advisable to select instructions with four or fewer μops. For example, a 32-bit integer multiply with a memory operand fits in the trace cache without going to microcode, while a 16-bit integer multiply to memory does not.

• Avoid resource conflicts. Interleaving instructions so that they do not compete for the same port or execution unit can increase throughput. For example, alternate PADDQ and PMULUDQ (each has a throughput of one issue per two clock cycles). When interleaved, they can achieve an effective throughput of one instruction per cycle because they use the same port but different execution units. Selecting instructions with fast throughput also helps to preserve issue port bandwidth, hide latency, and achieve higher software performance.

• Minimize the latency of dependency chains that are on the critical path. For example, an operation to shift left by two bits executes faster when encoded as two adds than when it is encoded as a shift.
If latency is not an issue, the shift results in a denser byte encoding.

In addition to the general and specific rules, coding guidelines, and the instruction data provided in this manual, you can take advantage of the software performance analysis and tuning toolset available at http://developer.intel.com/software/products/index.htm. The tools include the Intel VTune Performance Analyzer, with its performance-monitoring capabilities.

C.2 DEFINITIONS

The data is listed in several tables. The tables contain the following:

• Instruction Name — The assembly mnemonic of each instruction.

• Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction.

• Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.
