13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS

Now consider the case when the data is organized as SoA. Example 6-2 demonstrates how four results are computed for five instructions.

Example 6-2. Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation

mulps    ; x*x' for all 4 x-components of 4 vertices
mulps    ; y*y' for all 4 y-components of 4 vertices
mulps    ; z*z' for all 4 z-components of 4 vertices
addps    ; x*x' + y*y'
addps    ; x*x' + y*y' + z*z'

For the most efficient use of the four-component-wide registers, reorganizing the data into the SoA format yields increased throughput and hence much better performance for the instructions used.

As seen from this simple example, vertical computation yielded 100% use of the available SIMD registers and produced four results. (The results may vary based on the application.) If the data structures must be in a format that is not “friendly” to vertical computation, the data can be rearranged “on the fly” to achieve full utilization of the SIMD registers. This operation is referred to as “swizzling,” and the reverse operation is referred to as “deswizzling.”

6.5.1.2 Data Swizzling

Swizzling data from one format to another may be required in many algorithms when the available instruction set extension is limited (for example: only SSE is available). An example of this is AoS format, where the vertices come as XYZ adjacent coordinates. Rearranging them into SoA format (XXXX, YYYY, ZZZZ) allows more efficient SIMD computations.

For efficient data shuffling and swizzling use the following instructions:

• MOVLPS, MOVHPS load/store and move data on half sections of the registers.
• SHUFPS, UNPCKHPS, and UNPCKLPS unpack data.

To gather data from four different memory locations on the fly, follow these steps:

1. Identify the first half of the 128-bit memory location.
2. Group the different halves together using MOVLPS and MOVHPS to form an XYXY layout in two registers.
3. From the 4 attached halves, get XXXX by using one shuffle, YYYY by using another shuffle.

ZZZZ is derived the same way but only requires one shuffle. Example 6-3 illustrates the swizzle function.
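
The gather steps and the vertical computation of Example 6-2 can also be sketched with SSE intrinsics in C. This is an illustrative sketch, not the manual's Example 6-3. The type names VertexAoS and VertexSoA and the functions swizzle4 and dot4 are hypothetical, and the fourth component (w) in each AoS vertex is an assumption made so that every vertex occupies exactly 128 bits. The intrinsics _mm_loadl_pi, _mm_loadh_pi, and _mm_shuffle_ps are expected to compile to MOVLPS, MOVHPS, and SHUFPS, though the compiler is free to choose other encodings.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical AoS vertex. The w component is an assumption so that
   each vertex spans one 128-bit memory location. */
typedef struct { float x, y, z, w; } VertexAoS;

/* Hypothetical SoA block holding four vertices (XXXX, YYYY, ZZZZ). */
typedef struct { float x[4], y[4], z[4]; } VertexSoA;

/* Swizzle four AoS vertices into SoA form. */
static void swizzle4(const VertexAoS *in, VertexSoA *out)
{
    __m128 xy01 = _mm_setzero_ps();
    __m128 xy23 = _mm_setzero_ps();
    __m128 zw01 = _mm_setzero_ps();
    __m128 zw23 = _mm_setzero_ps();

    /* Steps 1-2: load the first (XY) half of each vertex and pair the
       halves up, giving an XYXY layout in two registers. */
    xy01 = _mm_loadl_pi(xy01, (const __m64 *)&in[0].x);  /* -- -- y0 x0 */
    xy01 = _mm_loadh_pi(xy01, (const __m64 *)&in[1].x);  /* y1 x1 y0 x0 */
    xy23 = _mm_loadl_pi(xy23, (const __m64 *)&in[2].x);  /* -- -- y2 x2 */
    xy23 = _mm_loadh_pi(xy23, (const __m64 *)&in[3].x);  /* y3 x3 y2 x2 */

    /* Same grouping for the second (ZW) half of each vertex. */
    zw01 = _mm_loadl_pi(zw01, (const __m64 *)&in[0].z);  /* -- -- w0 z0 */
    zw01 = _mm_loadh_pi(zw01, (const __m64 *)&in[1].z);  /* w1 z1 w0 z0 */
    zw23 = _mm_loadl_pi(zw23, (const __m64 *)&in[2].z);  /* -- -- w2 z2 */
    zw23 = _mm_loadh_pi(zw23, (const __m64 *)&in[3].z);  /* w3 z3 w2 z2 */

    /* Step 3: one shuffle picks the even elements (XXXX), another
       picks the odd elements (YYYY). ZZZZ needs a single shuffle. */
    _mm_storeu_ps(out->x, _mm_shuffle_ps(xy01, xy23, _MM_SHUFFLE(2, 0, 2, 0)));
    _mm_storeu_ps(out->y, _mm_shuffle_ps(xy01, xy23, _MM_SHUFFLE(3, 1, 3, 1)));
    _mm_storeu_ps(out->z, _mm_shuffle_ps(zw01, zw23, _MM_SHUFFLE(2, 0, 2, 0)));
}

/* Vertical computation as in Example 6-2: three multiplies and two
   adds produce four dot products at once. */
static void dot4(const VertexSoA *a, const VertexSoA *b, float out[4])
{
    __m128 xx = _mm_mul_ps(_mm_loadu_ps(a->x), _mm_loadu_ps(b->x));  /* x*x' */
    __m128 yy = _mm_mul_ps(_mm_loadu_ps(a->y), _mm_loadu_ps(b->y));  /* y*y' */
    __m128 zz = _mm_mul_ps(_mm_loadu_ps(a->z), _mm_loadu_ps(b->z));  /* z*z' */
    __m128 sum = _mm_add_ps(_mm_add_ps(xx, yy), zz);  /* x*x'+y*y'+z*z' */
    _mm_storeu_ps(out, sum);
}
```

A caller would pass an array of four VertexAoS values to swizzle4 and then feed the resulting VertexSoA blocks to dot4. Unaligned loads and stores are used here only to keep the sketch free of alignment requirements; aligned data with _mm_load_ps and _mm_store_ps is the more usual choice when the layout is under the programmer's control.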
