13.07.2015

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS

6.6.1.2 SSE3 and Horizontal Computation

In SIMD floating-point code, the array-of-structures (AOS) data organization is sometimes the more natural fit for an algebraic formula. SSE3 enhances the flexibility of SIMD programming for applications that rely on the horizontal computation model. SSE3 offers several instructions that perform horizontal arithmetic operations.

With Intel Core microarchitecture, the latency and throughput of SSE3 instructions for horizontal computation have been significantly improved over previous microarchitectures.

Example 6-15 compares using SSE2 and SSE3 to implement the dot product of a pair of vectors of four elements each. The performance of calculating dot products can be further improved by unrolling to calculate four pairs of vectors per iteration; see Example 6-16.

In both cases, the SSE3 versions are faster than the SSE2 implementations.

Example 6-15. Dot Product of Vector Length 4

Optimized for Intel Core Duo Processor (SSE2)

    movaps  xmm0, [eax]         ; load the first vector
    mulps   xmm0, [eax+16]      ; w0*w1 z0*z1 y0*y1 x0*x1
    movhlps xmm1, xmm0          ; copy the upper two products to the low half of xmm1
    addps   xmm0, xmm1          ; low lanes now hold x+z and y+w
    pshufd  xmm1, xmm0, 1       ; move y+w into the low lane of xmm1
    addss   xmm0, xmm1          ; low lane = x+y+z+w
    movss   [ecx], xmm0         ; store the scalar result

Optimized for Intel Core Microarchitecture (SSE3)

    movaps  xmm0, [eax]         ; load the first vector
    mulps   xmm0, [eax+16]      ; w0*w1 z0*z1 y0*y1 x0*x1
    haddps  xmm0, xmm0          ; low lanes now hold x+y and z+w
    movaps  xmm1, xmm0          ; keep a copy with x+y in the low lane
    psrlq   xmm0, 32            ; shift z+w down into the low lane
    addss   xmm0, xmm1          ; low lane = x+y+z+w
    movss   [eax], xmm0         ; store the scalar result

Example 6-16. Unrolled Implementation of Four Dot Products

SSE2 Implementation

    movaps  xmm0, [eax]
    mulps   xmm0, [eax+16]      ; w0*w1 z0*z1 y0*y1 x0*x1
    movaps  xmm2, [eax+32]
    mulps   xmm2, [eax+16+32]   ; w2*w3 z2*z3 y2*y3 x2*x3
    movaps  xmm3, [eax+64]
    mulps   xmm3, [eax+16+64]   ; w4*w5 z4*z5 y4*y5 x4*x5
    movaps  xmm4, [eax+96]
    mulps   xmm4, [eax+16+96]   ; w6*w7 z6*z7 y6*y7 x6*x7

SSE3 Implementation

    movaps  xmm0, [eax]
    mulps   xmm0, [eax+16]      ; products for vector pair 0
    movaps  xmm1, [eax+32]
    mulps   xmm1, [eax+16+32]   ; products for vector pair 1
    movaps  xmm2, [eax+64]
    mulps   xmm2, [eax+16+64]   ; products for vector pair 2
    movaps  xmm3, [eax+96]
    mulps   xmm3, [eax+16+96]   ; products for vector pair 3
    haddps  xmm0, xmm1          ; partial sums for pairs 0 and 1
    haddps  xmm2, xmm3          ; partial sums for pairs 2 and 3
    haddps  xmm0, xmm2          ; one complete dot product per lane
    movaps  [ecx], xmm0         ; store all four results
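The lane arithmetic behind the SSE3 listings can be checked with a small scalar model in C. This is only a sketch under stated assumptions: the type `vec4` and the C functions `mulps`, `haddps`, `dot4`, and `dot4x4` are hypothetical names introduced for illustration. They imitate the lane behavior of the instructions of the same name and are not compiler intrinsics or anything from the manual. Lane 0 stands for the lowest element of an XMM register.

```c
/* Scalar model of a 128-bit XMM register holding four floats.
   lane[0] is the lowest element. (Hypothetical type, for illustration.) */
typedef struct { float lane[4]; } vec4;

/* Models mulps: lane-wise multiply. */
static vec4 mulps(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] * b.lane[i];
    }
    return r;
}

/* Models haddps dst, src:
   result = { dst0+dst1, dst2+dst3, src0+src1, src2+src3 }. */
static vec4 haddps(vec4 dst, vec4 src) {
    vec4 r = {{ dst.lane[0] + dst.lane[1], dst.lane[2] + dst.lane[3],
                src.lane[0] + src.lane[1], src.lane[2] + src.lane[3] }};
    return r;
}

/* One dot product, following the SSE3 sequence of Example 6-15:
   multiply, horizontal add, then add the two remaining partial sums. */
static float dot4(vec4 a, vec4 b) {
    vec4 p = mulps(a, b);
    vec4 h = haddps(p, p);            /* { p0+p1, p2+p3, p0+p1, p2+p3 } */
    return h.lane[0] + h.lane[1];     /* stands in for the shift and addss */
}

/* Four dot products at once, following the SSE3 column of Example 6-16.
   Input vectors are paired as (v[0],v[1]), (v[2],v[3]), (v[4],v[5]), (v[6],v[7]). */
static vec4 dot4x4(const vec4 v[8]) {
    vec4 p0 = mulps(v[0], v[1]);
    vec4 p1 = mulps(v[2], v[3]);
    vec4 p2 = mulps(v[4], v[5]);
    vec4 p3 = mulps(v[6], v[7]);
    vec4 s01 = haddps(p0, p1);        /* two partial sums each for pairs 0 and 1 */
    vec4 s23 = haddps(p2, p3);        /* two partial sums each for pairs 2 and 3 */
    return haddps(s01, s23);          /* lane i = dot product of pair i */
}
```

In this model, three horizontal adds reduce sixteen products to four finished dot products, one per lane, so the result can be written with a single full-width store. The single-vector version spends a horizontal add plus extra shuffling to produce only one scalar, which is consistent with the text's point that unrolling to four pairs per iteration improves performance.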
