13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

  • Enabling vectorization
  • Unrolling loops

User/Source Coding Rule 14. (H impact, ML generality) Make sure your application stays in range to avoid denormal values and underflows. Out-of-range numbers cause very high overhead.

User/Source Coding Rule 15. (M impact, ML generality) Do not use double precision unless necessary. Set the precision control (PC) field in the x87 FPU control word to “Single Precision”. This allows single precision (32-bit) computation to complete faster on some operations (for example, divides due to early out). However, be careful of introducing more than a total of two values for the floating-point control word, or there will be a large performance penalty. See Section 3.8.3.

User/Source Coding Rule 16. (H impact, ML generality) Use fast float-to-int routines, FISTTP, or SSE2 instructions. If coding these routines, use the FISTTP instruction if SSE3 is available, or the CVTTSS2SI and CVTTSD2SI instructions if coding with Streaming SIMD Extensions 2.

Many libraries generate x87 code that does more work than is necessary. The FISTTP instruction in SSE3 can convert floating-point values to 16-bit, 32-bit, or 64-bit integers using truncation without accessing the floating-point control word (FCW). The instructions CVTTSS2SI and CVTTSD2SI save many µops and some store-forwarding delays over some compiler implementations. Either way, the rounding mode does not have to be changed.

User/Source Coding Rule 17. (M impact, ML generality) Removing data dependence enables the out-of-order engine to extract more ILP from the code. When summing up the elements of an array, use partial sums instead of a single accumulator.

For example, to calculate Z = A + B + C + D, instead of:

    X = A + B;
    Y = X + C;
    Z = Y + D;

use:

    X = A + B;
    Y = C + D;
    Z = X + Y;

User/Source Coding Rule 18. (M impact, ML generality) Usually, math libraries take advantage of the transcendental instructions (for example, FSIN) when evaluating elementary functions. If there is no critical need to evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve
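As a rough illustration of Rule 14, the sketch below flags single-precision values that have dropped into the denormal range, which can help locate out-of-range data during testing. The helper name `is_denormal` is hypothetical and does not come from the manual; the check relies only on `FLT_MIN` from `<float.h>`.

```c
#include <float.h>

/* Hypothetical helper (not from the manual): returns 1 when x is a
 * nonzero value smaller in magnitude than the smallest normal float. */
int is_denormal(float x)
{
    float ax = (x < 0.0f) ? -x : x;   /* magnitude without calling libm */
    return ax != 0.0f && ax < FLT_MIN;
}
```

A debug build might call this on intermediate results and rescale the inputs when it fires; the check itself is too costly to leave in a hot loop.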
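The truncating conversions named in Rule 16 can be reached from C through compiler intrinsics. The following is a minimal sketch, assuming a compiler that defines `__SSE2__` and ships `<emmintrin.h>`; the wrapper names `trunc_float_to_int` and `trunc_double_to_int` are illustrative only. On other targets the sketch falls back to a plain C cast, which also truncates toward zero.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; also provides the SSE ones */
#endif

/* Truncate a float to int. With SSE2 this maps to CVTTSS2SI, which
 * truncates regardless of the current rounding mode. */
int trunc_float_to_int(float x)
{
#if defined(__SSE2__)
    return _mm_cvttss_si32(_mm_set_ss(x));
#else
    return (int)x;       /* portable fallback: C casts truncate toward zero */
#endif
}

/* Truncate a double to int. With SSE2 this maps to CVTTSD2SI. */
int trunc_double_to_int(double x)
{
#if defined(__SSE2__)
    return _mm_cvttsd_si32(_mm_set_sd(x));
#else
    return (int)x;
#endif
}
```

Many compilers already emit these instructions for an ordinary cast when targeting SSE2, so the intrinsics mainly matter when the default code generation is x87.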
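The partial-sum advice in Rule 17 can be sketched for an array as follows. This is one possible shape, assuming four accumulators are enough to cover the add latency; the function name `sum_partial` and the accumulator count are assumptions, not values taken from the manual. Because floating-point addition is not associative, the result can differ in the last bits from a single-accumulator loop.

```c
#include <stddef.h>

/* Sum n floats using four independent accumulators so that consecutive
 * additions do not wait on each other. */
float sum_partial(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    float tail = 0.0f;
    size_t i = 0;

    /* Main loop: each accumulator forms its own dependence chain. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder: at most three leftover elements. */
    for (; i < n; ++i)
        tail += a[i];

    /* Combine pairwise, mirroring Z = (A + B) + (C + D). */
    return (s0 + s1) + (s2 + s3) + tail;
}
```

The right number of accumulators depends on the latency and throughput of the add on the target microarchitecture, so it is worth measuring rather than fixing at four.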
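To make the look-up-table idea in Rule 18 concrete, here is a small sketch of sine on [0, π/2] using a 17-entry table and linear interpolation. The table size, the function name `sin_lut`, and the clamping of out-of-range inputs are all assumptions chosen for brevity; a real replacement would add range reduction and pick the table size from the accuracy it needs. With this step size the interpolation error stays below about 0.0013.

```c
#define LUT_STEPS 16

/* sin(k * pi/32) for k = 0..16, precomputed so no libm call is needed. */
static const float SIN_LUT[LUT_STEPS + 1] = {
    0.0000000000f, 0.0980171403f, 0.1950903220f, 0.2902846773f,
    0.3826834324f, 0.4713967368f, 0.5555702330f, 0.6343932842f,
    0.7071067812f, 0.7730104534f, 0.8314696123f, 0.8819212643f,
    0.9238795325f, 0.9569403357f, 0.9807852804f, 0.9951847267f,
    1.0000000000f
};

/* Approximate sin(x) for x in [0, pi/2]; inputs outside are clamped. */
float sin_lut(float x)
{
    const float half_pi = 1.57079632679f;
    float pos, frac;
    int i;

    if (x <= 0.0f)    return 0.0f;
    if (x >= half_pi) return 1.0f;

    pos = x * ((float)LUT_STEPS / half_pi);  /* position in table units */
    i = (int)pos;                            /* index of the lower entry */
    if (i >= LUT_STEPS)
        i = LUT_STEPS - 1;                   /* guard against rounding up */
    frac = pos - (float)i;                   /* distance past that entry */

    return SIN_LUT[i] + frac * (SIN_LUT[i + 1] - SIN_LUT[i]);
}
```

Doubling the number of table entries cuts the linear-interpolation error by roughly a factor of four, which is the precision-versus-table-size trade-off this kind of approach exposes.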
