13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

CODING FOR SIMD ARCHITECTURES

• Video display and capture routines
• Rendering routines
• 3D graphics (geometry)
• Image and video processing algorithms
• Spatial (3D) audio
• Physical modeling (graphics, CAD)
• Workstation applications
• Encryption algorithms
• Complex arithmetics

Generally, good candidate code is code that contains small-sized repetitive loops that operate on sequential arrays of integers of 8, 16 or 32 bits, single-precision 32-bit floating-point data, double precision 64-bit floating-point data (integer and floating-point data items should be sequential in memory). The repetitiveness of these loops incurs costly application processing time. However, these routines have potential for increased performance when you convert them to use one of the SIMD technologies.

Once you identify your opportunities for using a SIMD technology, you must evaluate what should be done to determine whether the current algorithm or a modified one will ensure the best performance.

4.3 CODING TECHNIQUES

The SIMD features of SSE3, SSE2, SSE, and MMX technology require new methods of coding algorithms. One of them is vectorization. Vectorization is the process of transforming sequentially-executing, or scalar, code into code that can execute in parallel, taking advantage of the SIMD architecture parallelism. This section discusses the coding techniques available for an application to make use of the SIMD architecture.

To vectorize your code and thus take advantage of the SIMD architecture, do the following:

• Determine if the memory accesses have dependencies that would prevent parallel execution.
• “Strip-mine” the inner loop to reduce the iteration count by the length of the SIMD operations (for example, four for single-precision floating-point SIMD, eight for 16-bit integer SIMD on the XMM registers).
• Re-code the loop with the SIMD instructions.

Each of these actions is discussed in detail in the subsequent sections of this chapter. These sections also discuss enabling automatic vectorization using the Intel C++ Compiler.
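
The three vectorization steps listed above can be illustrated with a minimal C sketch. It is not taken from the manual: the function names add_scalar and add_simd are hypothetical, and element-wise addition of two float arrays was chosen as an assumed example loop. The restrict qualifiers state the assumption that the arrays do not overlap (step one, no memory dependencies). The main loop advances four elements at a time (step two, strip-mining for single-precision SIMD), and its body uses SSE intrinsics from xmmintrin.h (step three). The sketch falls back to the scalar loop when the compiler does not define __SSE__.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Scalar reference loop: one element per iteration. */
void add_scalar(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Vectorized loop (hypothetical helper, not from the manual).
 * Step 1: 'restrict' records the assumption that dst, a and b do not
 *         overlap, so no iteration depends on another.
 * Step 2: the main loop is strip-mined by 4, the number of
 *         single-precision floats in one XMM register.
 * Step 3: the strip body is re-coded with SSE intrinsics. */
void add_simd(float *restrict dst, const float *restrict a,
              const float *restrict b, size_t n)
{
    assert(n == 0 || (dst && a && b));
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads and stores, so no alignment is assumed. */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Remainder loop: handles the last n % 4 elements, or every
     * element when SSE is not available at compile time. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Calling add_simd(out, x, y, n) should give the same results as add_scalar(out, x, y, n) for non-overlapping arrays. The unaligned load and store intrinsics were chosen only to keep the sketch free of alignment requirements.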
