13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

CODING FOR SIMD ARCHITECTURES

As a basis for the usage model discussed in this section, consider the simple loop shown in Example 4-6.

Example 4-6. Simple Four-Iteration Loop

void add(float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}

Note that the loop runs for only four iterations. This allows a simple replacement of the code with Streaming SIMD Extensions, because four single-precision floats fit exactly in one 128-bit XMM register.

For the optimal use of the Streaming SIMD Extensions that need data alignment on the 16-byte boundary, all examples in this chapter assume that the arrays passed to the routine (a, b, and c) are aligned to 16-byte boundaries by a calling routine. For the methods to ensure this alignment, please refer to the application notes for the Pentium 4 processor.

The sections that follow provide details on the coding methodologies: inlined assembly, intrinsics, C++ vector classes, and automatic vectorization.

4.3.1.1 Assembly

Key loops can be coded directly in assembly language using an assembler or by using inlined assembly (C-asm) in C/C++ code. The Intel compiler or assembler recognizes the new instructions and registers, then directly generates the corresponding code. This model offers the opportunity for attaining the greatest performance, but this performance is not portable across the different processor architectures.
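To make the "simple replacement" concrete, the following is a minimal sketch of what the four-iteration loop collapses to under SSE. It uses intrinsics from <xmmintrin.h> rather than inlined assembly, because inline-assembly syntax differs between compilers; each intrinsic here corresponds to one SSE instruction (MOVAPS for the aligned load and store, ADDPS for the add). The function name add_sse is hypothetical and not taken from the manual, and the sketch assumes, as the text above does, that a, b, and c are 16-byte aligned.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Scalar reference: the loop from Example 4-6. */
void add(float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Hypothetical SSE counterpart (the name add_sse is an assumption).
 * Precondition: a, b, and c each point to 4 floats on a 16-byte
 * boundary; the aligned load/store used here may fault otherwise. */
void add_sse(const float *a, const float *b, float *c)
{
    __m128 va = _mm_load_ps(a);           /* load a[0..3] (aligned) */
    __m128 vb = _mm_load_ps(b);           /* load b[0..3] (aligned) */
    _mm_store_ps(c, _mm_add_ps(va, vb));  /* c[0..3] = a[0..3] + b[0..3] */
}
```

A caller can satisfy the alignment precondition in C11 by declaring the arrays with _Alignas(16), for example `_Alignas(16) float a[4];`. If alignment cannot be guaranteed, _mm_loadu_ps and _mm_storeu_ps are the unaligned counterparts, typically at some performance cost on older processors.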