13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS

Example 6-16. Unrolled Implementation of Four Dot Products (Contd.)

SSE2 Implementation                 SSE3 Implementation

movaps   xmm1, xmm0
unpcklps xmm0, xmm2     ; y2*y3 y0*y1 x2*x3 x0*x1
unpckhps xmm1, xmm2     ; w2*w3 w0*w1 z2*z3 z0*z1
movaps   xmm5, xmm3
unpcklps xmm3, xmm4     ; y6*y7 y4*y5 x6*x7 x4*x5
unpckhps xmm5, xmm4     ; w6*w7 w4*w5 z6*z7 z4*z5
addps    xmm0, xmm1
addps    xmm5, xmm3
movaps   xmm1, xmm5
movhlps  xmm1, xmm0
movlhps  xmm0, xmm5
addps    xmm0, xmm1
movaps   [ecx], xmm0

6.6.1.3 Packed Floating-Point Performance in Intel Core Duo Processor

Most packed SIMD floating-point code will speed up on Intel Core Solo processors relative to Pentium M processors. This is due to improvements in decoding packed SIMD instructions.

The improvement of packed floating-point performance on the Intel Core Solo processor over the Pentium M processor depends on several factors. Generally, code that is decoder-bound and/or has a mixture of integer and packed floating-point instructions can expect significant gain. Code that is limited by execution latency and has a “cycles per instruction” ratio greater than one will not benefit from the decoder improvement.

When targeting complex arithmetic on Intel Core Solo and Intel Core Duo processors, using single-precision SSE3 instructions can deliver higher performance than alternatives. On the other hand, tasks requiring double-precision complex arithmetic may perform better using scalar SSE2 instructions on Intel Core Solo and Intel Core Duo processors. This is because scalar SSE2 instructions can be dispatched through two ports and executed using two separate floating-point units.

Packed horizontal SSE3 instructions (HADDPS and HSUBPS) can simplify the code sequence for some tasks. However, these instructions consist of more than five micro-ops on Intel Core Solo and Intel Core Duo processors. Care must be taken to ensure that the latency and decoding penalty of the horizontal instructions do not offset any algorithmic benefits.
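
To make the trade-off around the packed horizontal instructions more concrete, the following is a minimal sketch in portable C that models the lane behavior of HADDPS with scalar arithmetic and uses it to reduce four element-wise products to four dot products. It is an illustration only, not the manual's listing: the type vec4 and the functions hadd_ps, mul_ps, four_dots_hadd, and dot_scalar are hypothetical names introduced here, and the model says nothing about latency or micro-op counts.

```c
#include <assert.h>

/* Hypothetical 4-lane single-precision vector; lane[0] is the lowest lane. */
typedef struct { float lane[4]; } vec4;

/* Scalar model of HADDPS dst, src (the matching intrinsic is _mm_hadd_ps):
   the low two result lanes hold pairwise sums of dst, the high two lanes
   hold pairwise sums of src. */
vec4 hadd_ps(vec4 dst, vec4 src) {
    vec4 r = {{ dst.lane[0] + dst.lane[1], dst.lane[2] + dst.lane[3],
                src.lane[0] + src.lane[1], src.lane[2] + src.lane[3] }};
    return r;
}

/* Scalar model of MULPS: lane-wise multiply. */
vec4 mul_ps(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Four dot products against a common vector c.
   Result lane i holds dot(v[i], c). Three horizontal adds replace the
   unpack/add/move shuffle sequence of a version without HADDPS. */
vec4 four_dots_hadd(const vec4 v[4], vec4 c) {
    vec4 p0 = mul_ps(v[0], c);
    vec4 p1 = mul_ps(v[1], c);
    vec4 p2 = mul_ps(v[2], c);
    vec4 p3 = mul_ps(v[3], c);
    vec4 lo = hadd_ps(p0, p1);   /* partial sums of p0 and p1 */
    vec4 hi = hadd_ps(p2, p3);   /* partial sums of p2 and p3 */
    return hadd_ps(lo, hi);      /* one full sum per lane */
}

/* Plain scalar reference for a single dot product. */
float dot_scalar(vec4 a, vec4 b) {
    float s = 0.0f;
    for (int i = 0; i < 4; ++i)
        s += a.lane[i] * b.lane[i];
    return s;
}
```

The sketch shows why the horizontal form is attractive algorithmically: three horizontal adds produce all four sums. Whether that wins on a given processor depends on the cost of each horizontal add, which the text above puts at more than five micro-ops on Intel Core Solo and Intel Core Duo processors, so both variants should be measured.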
