13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS

Example 6-10. Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS (Contd.)

    tmm2 = _mm_movelh_ps(tmm2, tmm3);        // tmm2 = C1 C2 D1 D2
    tmm3 = _mm_movehl_ps(tmm3, tmm4);        // tmm3 = C3 C4 D3 D4
    tmm3 = _mm_add_ps(tmm3, tmm2);           // tmm3 = C1+C3 C2+C4 D1+D3 D2+D4
    tmm6 = tmm3;                             // tmm6 = C1+C3 C2+C4 D1+D3 D2+D4
    tmm6 = _mm_shuffle_ps(tmm3, tmm5, 0xDD); // tmm6 = A1+A3 B1+B3 C1+C3 D1+D3
    tmm5 = _mm_shuffle_ps(tmm5, tmm6, 0x88); // tmm5 = A2+A4 B2+B4 C2+C4 D2+D4
    tmm6 = _mm_add_ps(tmm6, tmm5);           // tmm6 = A1+A2+A3+A4 B1+B2+B3+B4
                                             //        C1+C2+C3+C4 D1+D2+D3+D4
    _mm_store_ps(out, tmm6);
}

6.5.2 Use of CVTTPS2PI/CVTTSS2SI Instructions

The CVTTPS2PI and CVTTSS2SI instructions encode the truncate/chop rounding mode implicitly in the instruction. They take precedence over the rounding mode specified in the MXCSR register. This behavior can eliminate the need to change the rounding mode from round-nearest to truncate/chop and then back to round-nearest to resume computation.

Avoid frequent changes to the MXCSR register, since there is a penalty associated with writing this register. Typically, when using CVTTPS2PI/CVTTSS2SI, rounding control in MXCSR can always be set to round-nearest.

6.5.3 Flush-to-Zero and Denormals-are-Zero Modes

The flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes are not compatible with IEEE Standard 754. They are provided to improve performance for applications where underflow is common and where the generation of a denormalized result is not necessary.

See also: Section 3.8.2, "Floating-point Modes and Exceptions."

6.6 SIMD OPTIMIZATIONS AND MICROARCHITECTURES

Pentium M, Intel Core Solo and Intel Core Duo processors have a different microarchitecture than Intel NetBurst microarchitecture. Intel Core microarchitecture offers significantly more efficient SIMD floating-point capability than previous microarchitectures. In addition, instruction latency and throughput of SSE3 instructions are significantly improved in Intel Core microarchitecture over previous microarchitectures.
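
The points made in Sections 6.5.2 and 6.5.3 can be sketched with intrinsics as follows. This is a minimal illustration, not code from the manual: the helper names (truncate_to_int, round_to_int, truncate_via_mxcsr, enable_ftz_daz, sse_mul) are hypothetical, and the sketch assumes an x86 compiler that provides <xmmintrin.h> and <pmmintrin.h> and a processor that supports the DAZ bit. The volatile temporaries are an assumption about how to keep the compiler from folding or moving the floating-point operations across the MXCSR writes.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: conversions, _mm_getcsr/_mm_setcsr, rounding and FTZ macros */
#include <pmmintrin.h>  /* DAZ macros: _MM_SET_DENORMALS_ZERO_MODE and friends */

/* CVTTSS2SI: truncation is encoded in the instruction, so MXCSR is not written. */
static int truncate_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}

/* CVTSS2SI: rounds with whatever mode MXCSR currently holds (round-nearest by default). */
static int round_to_int(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* The pattern the text advises against: two MXCSR writes around a single conversion. */
static int truncate_via_mxcsr(float x)
{
    unsigned int saved = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
    volatile float in = x;                       /* read after the mode switch */
    volatile int out = _mm_cvtss_si32(_mm_set_ss(in));
    _MM_SET_ROUNDING_MODE(saved);                /* restore the caller's mode */
    return out;
}

/* Set FTZ and DAZ once (for example at thread start) rather than toggling them.
   Returns the previous MXCSR value so the caller can restore it with _mm_setcsr(). */
static unsigned int enable_ftz_daz(void)
{
    unsigned int saved = _mm_getcsr();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    /* Sanity check: both bits read back as set. */
    assert(_MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON);
    assert(_MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON);
    return saved;
}

/* Scalar SSE multiply (MULSS), so the MXCSR modes in effect at run time apply. */
static float sse_mul(float a, float b)
{
    volatile float va = a, vb = b;               /* block compile-time folding */
    return _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(va), _mm_set_ss(vb)));
}
```

With MXCSR left at round-nearest, truncate_to_int(2.7f) yields 2 and round_to_int(2.7f) yields 3, so a truncating conversion needs no MXCSR write at all. After enable_ftz_daz(), a product that would underflow to a denormal is returned as zero, and a denormal operand is treated as zero; because this departs from IEEE 754 behavior, the saved MXCSR value should be restored before running code that depends on denormals.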
