Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

3.5.4.1 Alternate Packing Techniques

The packing method implemented in the reference code of Example 3-22 will experience delay as it assembles four doubleword results from memory into an XMM register, due to store-forwarding restrictions.

Three alternate techniques for packing, using different SIMD instructions to assemble contents in XMM registers, are shown in Example 3-23. All three techniques avoid store-forwarding delay by satisfying the restrictions on data sizes between a preceding store and subsequent load operations.

Example 3-23. Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty

Packing Method 1

    movd      xmm0, [ebp+16+4]
    movd      xmm1, [ebp+16+8]
    movd      xmm2, [ebp+16+12]
    movd      xmm3, [ebp+12+16+4]
    punpckldq xmm0, xmm1
    punpckldq xmm2, xmm3
    punpckldq xmm0, xmm2

Packing Method 2

    movd      xmm0, [ebp+16+4]
    movd      xmm1, [ebp+16+8]
    movd      xmm2, [ebp+16+12]
    movd      xmm3, [ebp+12+16+4]
    psllq     xmm3, 32
    orps      xmm2, xmm3
    psllq     xmm1, 32
    orps      xmm0, xmm1
    movlhps   xmm0, xmm2

Packing Method 3

    movd      xmm0, [ebp+16+4]
    movd      xmm1, [ebp+16+8]
    movd      xmm2, [ebp+16+12]
    movd      xmm3, [ebp+12+16+4]
    movlhps   xmm1, xmm3
    psllq     xmm1, 32
    movlhps   xmm0, xmm2
    orps      xmm0, xmm1

3.5.4.2 Simplifying Result Passing

In Example 3-22, individual results were passed to the packing stage by storing to contiguous memory locations. Instead of using memory spills to pass four results, result passing may be accomplished by using one or more registers. Using registers to simplify result passing and reduce memory spills can improve performance by varying degrees, depending on the register pressure at runtime.

Example 3-24 shows the coding sequence that uses four extra XMM registers to eliminate the memory spills otherwise needed to pass results back to the parent routine.
However, software must observe the following conditions when using this technique:

• There is no register shortage.
• If the loop does not have many stores or loads but has many computations, this technique does not help performance. It adds work to the computational units while the store and load ports are idle.
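
To make the data movement in the three packing methods of Example 3-23 easier to follow, here is a minimal sketch in C using SSE2 intrinsics. It is an illustration under stated assumptions, not the manual's code. The function names (`pack_method1`, `pack_method2`, `pack_method3`, `check_methods_agree`) and the input array `r` are hypothetical, and the sketch models only which doubleword ends up in which lane, not store-forwarding timing. One deliberate difference: the last step of Method 1 uses a quadword unpack (`punpcklqdq`, via `_mm_unpacklo_epi64`), because a doubleword unpack of [r0, r1, 0, 0] with [r2, r3, 0, 0] would yield the order r0, r2, r1, r3 rather than the memory order that Methods 2 and 3 produce.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE ones */
#include <stdint.h>
#include <string.h>

/* movd: load one doubleword into lane 0 and zero the upper lanes. */
static __m128i load_dword(const uint32_t *p) {
    int32_t v;
    memcpy(&v, p, sizeof v);
    return _mm_cvtsi32_si128(v);
}

/* orps: bitwise OR issued as a single-precision operation. */
static __m128i or_ps(__m128i a, __m128i b) {
    return _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* movlhps: low quadword of b is copied into the high quadword of a. */
static __m128i movlhps(__m128i a, __m128i b) {
    return _mm_castps_si128(
        _mm_movelh_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
}

/* Method 1: unpack-based. Lanes are written low to high in comments. */
__m128i pack_method1(const uint32_t r[4]) {
    __m128i x0 = load_dword(&r[0]), x1 = load_dword(&r[1]);
    __m128i x2 = load_dword(&r[2]), x3 = load_dword(&r[3]);
    x0 = _mm_unpacklo_epi32(x0, x1);    /* [r0, r1, 0, 0] */
    x2 = _mm_unpacklo_epi32(x2, x3);    /* [r2, r3, 0, 0] */
    /* ASSUMPTION: quadword unpack (punpcklqdq) to keep memory order. */
    return _mm_unpacklo_epi64(x0, x2);  /* [r0, r1, r2, r3] */
}

/* Method 2: shift and OR within each half, then merge the halves. */
__m128i pack_method2(const uint32_t r[4]) {
    __m128i x0 = load_dword(&r[0]), x1 = load_dword(&r[1]);
    __m128i x2 = load_dword(&r[2]), x3 = load_dword(&r[3]);
    x3 = _mm_slli_epi64(x3, 32);        /* [0, r3, 0, 0]  */
    x2 = or_ps(x2, x3);                 /* [r2, r3, 0, 0] */
    x1 = _mm_slli_epi64(x1, 32);        /* [0, r1, 0, 0]  */
    x0 = or_ps(x0, x1);                 /* [r0, r1, 0, 0] */
    return movlhps(x0, x2);             /* [r0, r1, r2, r3] */
}

/* Method 3: merge halves first, then one shift and one OR. */
__m128i pack_method3(const uint32_t r[4]) {
    __m128i x0 = load_dword(&r[0]), x1 = load_dword(&r[1]);
    __m128i x2 = load_dword(&r[2]), x3 = load_dword(&r[3]);
    x1 = movlhps(x1, x3);               /* [r1, 0, r3, 0] */
    x1 = _mm_slli_epi64(x1, 32);        /* [0, r1, 0, r3] */
    x0 = movlhps(x0, x2);               /* [r0, 0, r2, 0] */
    return or_ps(x0, x1);               /* [r0, r1, r2, r3] */
}

/* Asserts that every method reproduces the memory order r[0..3]. */
void check_methods_agree(const uint32_t r[4]) {
    __m128i packed[3] = { pack_method1(r), pack_method2(r), pack_method3(r) };
    for (int i = 0; i < 3; i++) {
        uint32_t out[4];
        _mm_storeu_si128((__m128i *)out, packed[i]);
        assert(memcmp(out, r, sizeof out) == 0);
    }
}
```

Calling `check_methods_agree` from a `main` function with any four values exercises all three variants on an x86 target with SSE2 enabled. A compiler is free to select different instructions for these intrinsics, so the assembly in Example 3-23 remains the reference for the exact instruction sequences.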
