13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


SUMMARY OF RULES AND SUGGESTIONS

that are significantly smaller than half the hardware prefetch trigger threshold ........ 3-67

User/Source Coding Rule 13. (M impact, M generality) Enable the compiler’s use of SSE, SSE2 or SSE3 instructions with appropriate switches ........ 3-77

User/Source Coding Rule 14. (H impact, ML generality) Make sure your application stays in range to avoid denormal values, underflows. ........ 3-78

User/Source Coding Rule 15. (M impact, ML generality) Do not use double precision unless necessary. Set the precision control (PC) field in the x87 FPU control word to “Single Precision”. This allows single precision (32-bit) computation to complete faster on some operations (for example, divides due to early out). However, be careful of introducing more than a total of two values for the floating point control word, or there will be a large performance penalty. See Section 3.8.3 ........ 3-78

User/Source Coding Rule 16. (H impact, ML generality) Use fast float-to-int routines, FISTTP, or SSE2 instructions. If coding these routines, use the FISTTP instruction if SSE3 is available, or the CVTTSS2SI and CVTTSD2SI instructions if coding with Streaming SIMD Extensions 2. ........ 3-78

User/Source Coding Rule 17. (M impact, ML generality) Removing data dependence enables the out-of-order engine to extract more ILP from the code. When summing up the elements of an array, use partial sums instead of a single accumulator. ........ 3-78

User/Source Coding Rule 18. (M impact, ML generality) Usually, math libraries take advantage of the transcendental instructions (for example, FSIN) when evaluating elementary functions. If there is no critical need to evaluate the transcendental functions using the extended precision of 80 bits, applications should consider an alternate, software-based approach, such as a look-up-table-based algorithm using interpolation techniques. It is possible to improve transcendental performance with these techniques by choosing the desired numeric precision and the size of the look-up table, and by taking advantage of the parallelism of the SSE and the SSE2 instructions. ........ 3-78

User/Source Coding Rule 19. (H impact, ML generality) Denormalized floating-point constants should be avoided as much as possible ........ 3-79

User/Source Coding Rule 20. (M impact, H generality) Insert the PAUSE instruction in fast spin loops and keep the number of loop repetitions to a minimum to improve overall system performance. ........ 7-17

User/Source Coding Rule 21. (M impact, L generality) Replace a spin lock that may be acquired by multiple threads with pipelined locks such that no more than two threads have write accesses to one lock. If only one thread needs to write to a variable shared by two threads, there is no need to use a lock. ........ 7-18

User/Source Coding Rule 22. (H impact, M generality) Use a thread-blocking API in a long idle loop to free up the processor ........ 7-19

User/Source Coding Rule 23. (H impact, M generality) Beware of false sharing within a cache line (64 bytes on Intel Pentium 4, Intel Xeon, Pentium M, Intel
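The partial-sum advice in Rule 17 can be illustrated with a minimal sketch in C. The function names sum_single and sum_partial4, and the choice of four accumulators, are assumptions made for illustration and do not come from the manual. Splitting a floating-point sum also changes the order of the additions, so the result may differ from the single-accumulator version in the last bits.

```c
#include <stddef.h>

/* Baseline: a single accumulator. Each add must wait for the previous
 * one, so the loop is limited by the latency of the add. */
float sum_single(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent partial sums (the count of four is an assumption).
 * The four adds in one iteration do not depend on each other, which
 * gives the out-of-order engine independent work to overlap. */
float sum_partial4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Handle the 0 to 3 elements left over after the unrolled loop. */
    for (; i < n; i++)
        s0 += a[i];

    /* Combine the partial sums once, at the end. */
    return (s0 + s1) + (s2 + s3);
}
```

Whether this is faster depends on the compiler and target; an optimizing compiler may already vectorize the simple loop when floating-point reassociation is permitted.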
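Rule 20 (PAUSE in fast spin loops) might look like the following sketch in C11. The demo_spinlock type, the demo_* functions and the cpu_relax macro are hypothetical names introduced here. The sketch assumes the compiler provides the _mm_pause() intrinsic on x86 targets, and it falls back to a no-op elsewhere. It is not a production lock: per Rule 22, a long wait would be better served by a thread-blocking API.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* intrinsic that emits PAUSE */
#else
#define cpu_relax() ((void)0)     /* assumption: no-op on non-x86 targets */
#endif

/* Hypothetical minimal spin lock, for illustration only. */
typedef struct { atomic_bool locked; } demo_spinlock;

void demo_init(demo_spinlock *l)
{
    atomic_init(&l->locked, false);
}

/* Returns true if the lock was free and is now held by the caller. */
bool demo_trylock(demo_spinlock *l)
{
    /* exchange returns the previous value: false means it was free. */
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

void demo_lock(demo_spinlock *l)
{
    while (!demo_trylock(l)) {
        /* Wait on a plain load and execute PAUSE on each repetition,
         * as Rule 20 suggests for fast spin loops. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            cpu_relax();
    }
}

void demo_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

A caller would run demo_init once, then bracket each short critical section with demo_lock and demo_unlock.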
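The text of Rule 23 is cut off in this excerpt, so the following is only a sketch of one common way to avoid false sharing, not the manual's own recommendation. It assumes the 64-byte line size quoted in the rule and a C11 compiler. The struct names and the same_cache_line helper are hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* line size quoted in Rule 23; an assumption on other CPUs */

/* Unpadded: in an array, several of these counters fit in one cache
 * line, so writes by different threads can contend for the same line. */
struct counter_packed {
    long value;
};

/* Padded: alignas raises the struct's alignment, and therefore its
 * size, to one full line, so each array element gets its own line. */
struct counter_padded {
    alignas(CACHE_LINE) long value;
};

/* Helper: do two addresses fall within the same cache line? */
int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

With one struct counter_padded element per thread, two threads updating their own counters never write to the same line, at the cost of 64 bytes per counter.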
