13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

CODING FOR SIMD ARCHITECTURES

If in the code above the filter operation of data element I is the vector dot product that begins at data element J, then the filter operation of data element I+1 begins at data element J+1.

Assuming you have a 64-bit aligned data vector and a 64-bit aligned coefficients vector, the filter operation on the first data element will be fully aligned. For the second data element, however, access to the data vector will be misaligned. For an example of how to avoid the misalignment problem in the FIR filter, refer to Intel application notes on Streaming SIMD Extensions and filters.

Duplication and padding of data structures can be used to avoid the problem of data accesses in algorithms which are inherently misaligned. Section 4.5.1, “Data Structure Layout,” discusses trade-offs for organizing data structures.

NOTE
The duplication and padding technique overcomes the misalignment problem, thus avoiding the expensive penalty for misaligned data access, at the cost of increasing the data size. When developing your code, you should consider this tradeoff and use the option which gives the best performance.

4.4.2 Stack Alignment For 128-bit SIMD Technologies

For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data. However, the existing software conventions for IA-32 (STDCALL, CDECL, FASTCALL), as implemented in most compilers, do not provide any mechanism for ensuring that certain local data and certain parameters are 16-byte aligned.
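As a minimal sketch of the 16-byte requirement, the following C11 code tests whether an address lies on a 16-byte boundary and requests that alignment explicitly with alignas. The helper names is_aligned16, local_buffer_is_aligned, and element_is_aligned are hypothetical and chosen only for illustration. The standard alignas specifier is used here as a stand-in for whatever mechanism a given compiler provides, and a 4-byte float is assumed.

```c
#include <stdalign.h>  /* alignas (C11) */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if p lies on a 16-byte boundary. */
bool is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* A local buffer that explicitly requests 16-byte alignment.
   Without alignas, a float array is only guaranteed the
   alignment of float. */
bool local_buffer_is_aligned(void)
{
    alignas(16) float buf[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    return is_aligned16(buf);
}

/* Illustration data: eight floats starting on a 16-byte boundary. */
static alignas(16) float data[8];

/* Reports whether a 4-float window starting at element i (i < 8)
   begins on a 16-byte boundary. Assumes sizeof(float) == 4, so
   only every fourth element starts an aligned window. */
bool element_is_aligned(size_t i)
{
    return is_aligned16(&data[i]);
}
```

Under these assumptions, element_is_aligned(0) and element_is_aligned(4) hold, while a window starting at element 1 does not begin on a 16-byte boundary. The explicit alignment request is needed precisely because the calling conventions themselves give no such guarantee for local data.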
Therefore, Intel has defined a new set of IA-32 software conventions for alignment to support the new __M128* datatypes (__M128, __M128D, and __M128I). These meet the following conditions:

• Functions that use Streaming SIMD Extensions or Streaming SIMD Extensions 2 data need to provide a 16-byte aligned stack frame.
• __M128* parameters need to be aligned to 16-byte boundaries, possibly creating “holes” (due to padding) in the argument block.

The new conventions presented in this section, as implemented by the Intel C++ Compiler, can be used as a guideline for assembly language code as well. In many cases, this section assumes the use of the __M128 data type, as defined by the Intel C++ Compiler, which represents an array of four 32-bit floats.

For more details on the stack alignment for Streaming SIMD Extensions and SSE2, see Appendix D, “Stack Alignment.”
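The “holes” mentioned above can be illustrated by analogy with ordinary C structure layout. The sketch below is not the calling convention itself: vec4f is a hypothetical stand-in for a 128-bit SIMD value, and arg_block is a hypothetical model of an argument block for a function taking (int, vec4f, int). It only shows how a 16-byte alignment requirement on one item forces padding before it.

```c
#include <stdalign.h>  /* alignas (C11) */
#include <stddef.h>    /* offsetof, size_t */

/* Hypothetical stand-in for a 128-bit SIMD value:
   four 32-bit floats with a 16-byte alignment requirement. */
typedef struct {
    alignas(16) float f[4];
} vec4f;

/* Hypothetical model of an argument block for (int a, vec4f v, int b).
   The compiler must place v on a 16-byte boundary, so it inserts
   padding after a. */
typedef struct {
    int   a;
    vec4f v;
    int   b;
} arg_block;

/* Number of padding bytes (the "hole") between a and v. */
size_t hole_before_v(void)
{
    return offsetof(arg_block, v) - (offsetof(arg_block, a) + sizeof(int));
}
```

With a 4-byte int, v lands at offset 16 and hole_before_v() reports 12 bytes of padding. The total size of arg_block is also rounded up to a multiple of 16, which is the data-size cost that aligned parameters carry.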
