13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS

The same situation can occur for the above MOVHPS/MOVLPS/SHUFPS sequence. Since each MOVHPS/MOVLPS instruction bypasses part of the destination register, the instruction cannot execute until the prior instruction that generates this register has completed. As with the XORPS example, in the worst case this dependence can prevent successive loop iterations from executing in parallel.

A solution is to include a 128-bit load (that is, from a dummy local variable, such as TMP in Example 6-4) to each register to be used with a MOVHPS/MOVLPS instruction. This action effectively breaks the dependence by performing an independent load from a memory or cached location.

6.5.1.3 Data Deswizzling

In the deswizzle operation, we want to arrange the SoA format back into AoS format so the XXXX, YYYY, ZZZZ are rearranged and stored in memory as XYZ. To do this we can use the UNPCKLPS/UNPCKHPS instructions to regenerate the XYXY layout and then store each half (XY) into its corresponding memory location using MOVLPS/MOVHPS. This is followed by another MOVLPS/MOVHPS to store the Z component. Example 6-5 illustrates the deswizzle function:

Example 6-5. Deswizzling Single-Precision SIMD Data

void deswizzle_asm(Vertex_soa *in, Vertex_aos *out)
{
    __asm {
        mov      ecx, in            // load structure addresses
        mov      edx, out
        movaps   xmm7, [ecx]        // load x1 x2 x3 x4 => xmm7
        movaps   xmm6, [ecx+16]     // load y1 y2 y3 y4 => xmm6
        movaps   xmm5, [ecx+32]     // load z1 z2 z3 z4 => xmm5
        movaps   xmm4, [ecx+48]     // load w1 w2 w3 w4 => xmm4
        // START THE DESWIZZLING HERE
        movaps   xmm0, xmm7         // xmm0= x1 x2 x3 x4
        unpcklps xmm7, xmm6         // xmm7= x1 y1 x2 y2
        movlps   [edx], xmm7        // v1 = x1 y1 -- --
        movhps   [edx+16], xmm7     // v2 = x2 y2 -- --
        unpckhps xmm0, xmm6         // xmm0= x3 y3 x4 y4
        movlps   [edx+32], xmm0     // v3 = x3 y3 -- --
        movhps   [edx+48], xmm0     // v4 = x4 y4 -- --
        movaps   xmm0, xmm5         // xmm0= z1 z2 z3 z4
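
For readers who prefer compiler intrinsics to inline assembly, the deswizzle technique described in 6.5.1.3 can be sketched as follows. This is an illustrative sketch and not part of the manual's listing. The struct definitions for Vertex_soa and Vertex_aos and the function name deswizzle_intrin are assumptions inferred from the load offsets and comments in Example 6-5. The z/w half is assumed to be handled with the same unpack-and-store pattern as the x/y half. Unaligned loads are used so the sketch does not depend on 16-byte alignment, whereas the listing uses MOVAPS.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Assumed layouts (not shown in this section): four vertices per block. */
typedef struct { float x[4], y[4], z[4], w[4]; } Vertex_soa;  /* SoA */
typedef struct { float x, y, z, w; } Vertex_aos;              /* AoS */

/* Convert one SoA block (xxxx yyyy zzzz wwww) into four AoS vertices. */
static void deswizzle_intrin(const Vertex_soa *in, Vertex_aos *out)
{
    /* Unaligned loads: an assumption made here to avoid alignment faults. */
    __m128 x = _mm_loadu_ps(in->x);          /* x1 x2 x3 x4 */
    __m128 y = _mm_loadu_ps(in->y);          /* y1 y2 y3 y4 */
    __m128 z = _mm_loadu_ps(in->z);          /* z1 z2 z3 z4 */
    __m128 w = _mm_loadu_ps(in->w);          /* w1 w2 w3 w4 */

    /* UNPCKLPS/UNPCKHPS interleave the low/high halves of two registers. */
    __m128 xy_lo = _mm_unpacklo_ps(x, y);    /* x1 y1 x2 y2 */
    __m128 xy_hi = _mm_unpackhi_ps(x, y);    /* x3 y3 x4 y4 */
    __m128 zw_lo = _mm_unpacklo_ps(z, w);    /* z1 w1 z2 w2 */
    __m128 zw_hi = _mm_unpackhi_ps(z, w);    /* z3 w3 z4 w4 */

    /* MOVLPS/MOVHPS store the low/high 64 bits (two floats) each. */
    _mm_storel_pi((__m64 *)&out[0].x, xy_lo);   /* v1 = x1 y1 -- -- */
    _mm_storeh_pi((__m64 *)&out[1].x, xy_lo);   /* v2 = x2 y2 -- -- */
    _mm_storel_pi((__m64 *)&out[2].x, xy_hi);   /* v3 = x3 y3 -- -- */
    _mm_storeh_pi((__m64 *)&out[3].x, xy_hi);   /* v4 = x4 y4 -- -- */

    _mm_storel_pi((__m64 *)&out[0].z, zw_lo);   /* v1 = x1 y1 z1 w1 */
    _mm_storeh_pi((__m64 *)&out[1].z, zw_lo);   /* v2 = x2 y2 z2 w2 */
    _mm_storel_pi((__m64 *)&out[2].z, zw_hi);   /* v3 = x3 y3 z3 w3 */
    _mm_storeh_pi((__m64 *)&out[3].z, zw_hi);   /* v4 = x4 y4 z4 w4 */
}
```

With intrinsics the compiler performs register allocation, so the explicit register copy (movaps xmm0, xmm7) in the assembly version has no direct counterpart here. Whether the generated code matches the hand-written sequence depends on the compiler and its optimization settings.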
