13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


GENERAL OPTIMIZATION GUIDELINES

extended double-precision computation. These characteristics affect computations including floating-point divide and square root.

Assembly/Compiler Coding Rule 61. (H impact, L generality) Minimize the number of changes to the precision mode.

3.8.3.3 Improving Parallelism and the Use of FXCH

The x87 instruction set relies on the floating-point stack for one of its operands. If the dependence graph is a tree, which means each intermediate result is used only once, and code is scheduled carefully, it is often possible to use only operands that are on the top of the stack or in memory, and to avoid using operands that are buried under the top of the stack. When operands need to be pulled from the middle of the stack, an FXCH instruction can be used to swap the operand on the top of the stack with another entry in the stack.

The FXCH instruction can also be used to enhance parallelism. Dependent chains can be overlapped to expose more independent instructions to the hardware scheduler. An FXCH instruction may be required to effectively increase the register name space so that more operands can be simultaneously live.

In processors based on Intel NetBurst microarchitecture, however, FXCH inhibits issue bandwidth in the trace cache. It does this not only because it consumes a slot, but also because of issue slot restrictions imposed on FXCH. If the application is not bound by issue or retirement bandwidth, FXCH will have no impact.

The effective instruction window size in processors based on Intel NetBurst microarchitecture is large enough to permit instructions that are as far away as the next iteration to be overlapped. This often obviates the need to use FXCH to enhance parallelism.

The FXCH instruction should be used only when it is needed to express an algorithm or to enhance parallelism. If the size of register name space is a problem, the use of XMM registers is recommended.

Assembly/Compiler Coding Rule 62. (M impact, M generality) Use FXCH only where necessary to increase the effective name space.

This in turn allows instructions to be reordered and made available for execution in parallel. Out-of-order execution precludes the need for using FXCH to move instructions for very short distances.

3.8.4 x87 vs. Scalar SIMD Floating-point Trade-offs

There are a number of differences between x87 floating-point code and scalar floating-point code (using SSE and SSE2). The following differences should drive decisions about which registers and instructions to use:

• When an input operand for a SIMD floating-point instruction contains values that are less than the representable range of the data type, a denormal exception occurs. This causes a significant performance penalty. A SIMD floating-point
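The remark in 3.8.3.3 about overlapping dependent chains can be illustrated at the source level. The following is a minimal sketch in C, not code from the manual: the function names sum_one_chain and sum_two_chains are hypothetical, and whether the compiler emits x87 or SSE2 instructions (and whether any FXCH appears) depends on the compiler and its options. It only shows the shape of the dependence graph: one accumulator forms a single serial chain, while two accumulators form two chains that out-of-order hardware may overlap, at the cost of keeping more values live at once.

```c
#include <assert.h>
#include <stddef.h>

/* One dependent chain: each add needs the result of the previous add. */
double sum_one_chain(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Two independent chains: s0 and s1 never depend on each other inside
 * the loop, so their adds can be in flight at the same time. Both
 * accumulators are live simultaneously, which is the register
 * name-space demand the text refers to. */
double sum_two_chains(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];      /* chain 0: even indices */
        s1 += a[i + 1];  /* chain 1: odd indices  */
    }
    if (i < n)           /* leftover element when n is odd */
        s0 += a[i];
    return s0 + s1;      /* the chains join only once, at the end */
}
```

Because floating-point addition is not associative, the two functions can round differently for general inputs; they agree exactly only when every partial sum is exactly representable. Whether splitting a chain helps on a given processor is something to measure rather than assume.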
