Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING FOR SIMD INTEGER APPLICATIONS

• Runtime initialization: Use _mm_empty() during runtime initialization of __m64 and x87 FP data types. This resets the register state between data type transitions. See Example 5-1 for coding usage.

Example 5-1. Resetting Register Between __m64 and FP Data Types Code

Incorrect Usage:
    __m64 x = _m_paddd(y, z);
    float f = init();

Correct Usage:
    __m64 x = _m_paddd(y, z);
    float f = (_mm_empty(), init());

Be aware that, with the Intel C++ Compiler, your code generates an MMX instruction, which uses MMX registers, in the following situations:

• when using a 64-bit SIMD integer intrinsic from MMX technology, SSE/SSE2/SSSE3
• when using a 64-bit SIMD integer instruction from MMX technology, SSE/SSE2/SSSE3 through inline assembly
• when referencing an __m64 data type variable

Additional information on the x87 floating-point programming model can be found in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1. For more on EMMS, visit http://developer.intel.com.

5.3 DATA ALIGNMENT

Make sure that 64-bit SIMD integer data is 8-byte aligned and that 128-bit SIMD integer data is 16-byte aligned. Referencing unaligned 64-bit SIMD integer data can incur a performance penalty due to accesses that span two cache lines. Referencing unaligned 128-bit SIMD integer data results in an exception unless the MOVDQU (move double-quadword unaligned) instruction is used. Using the MOVDQU instruction on unaligned data can result in lower performance than using 16-byte aligned references.
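The difference between aligned and unaligned 128-bit references can be sketched with SSE2 intrinsics. This is an illustrative sketch, not code from the manual: the helper names is_aligned16, load_aligned, load_unaligned, and equals_bytes are hypothetical. It relies only on the SSE2 intrinsics _mm_load_si128 (which typically compiles to MOVDQA and requires a 16-byte-aligned address) and _mm_loadu_si128 (MOVDQU, valid for any address).

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_load_si128, _mm_loadu_si128 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Aligned 128-bit load. The address must be 16-byte aligned;
   an unaligned address would raise an exception at run time,
   so the precondition is checked in debug builds. */
__m128i load_aligned(const void *p)
{
    assert(is_aligned16(p));
    return _mm_load_si128((const __m128i *)p);
}

/* Unaligned 128-bit load. Valid for any address, but it can be
   slower than an aligned load, for example when the 16 bytes
   span two cache lines. */
__m128i load_unaligned(const void *p)
{
    return _mm_loadu_si128((const __m128i *)p);
}

/* Hypothetical check: does v hold the same 16 bytes as expected[0..15]? */
int equals_bytes(__m128i v, const uint8_t *expected)
{
    uint8_t out[16];
    _mm_storeu_si128((__m128i *)out, v);
    return memcmp(out, expected, 16) == 0;
}
```

A buffer declared as an array of __m128i (or with a C11 alignas(16) specifier) satisfies the 16-byte requirement, so load_aligned can be used on its start, while an address such as buffer + 3 needs load_unaligned.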
Refer to Section 4.4, “Stack and Data Alignment,” for more information.

Loading 16 bytes of SIMD data efficiently requires data alignment on 16-byte boundaries. SSSE3 provides the PALIGNR instruction, which reduces overhead in situations that require software to process data elements from non-aligned addresses. The PALIGNR instruction is most valuable when loading or storing unaligned data whose address is shifted by a few bytes. You can replace a set of unaligned loads with aligned loads followed by PALIGNR instructions and simple register-to-register copies.

Using PALIGNR to replace unaligned loads improves performance by eliminating cache line splits and other penalties. In routines like memcpy(), PALIGNR can boost
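The replacement of unaligned loads by aligned loads plus PALIGNR can be sketched as follows. This is a minimal illustration under stated assumptions, not the manual's own routine: the function copy_shifted, the macro SSSE3_FN, and the fixed byte offset SHIFT are hypothetical names introduced here. The offset is a compile-time constant because PALIGNR takes its shift count as an immediate. The sketch uses the SSSE3 intrinsic _mm_alignr_epi8, which concatenates its first operand (high half) with its second (low half), shifts the 32-byte value right by the given number of bytes, and returns the low 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

/* Assumption: on GCC/Clang builds without -mssse3, enable SSSE3
   for this one function only. */
#if defined(__GNUC__) && !defined(__SSSE3__)
#define SSSE3_FN __attribute__((target("ssse3")))
#else
#define SSSE3_FN
#endif

/* Hypothetical fixed misalignment in bytes (0..15). */
#define SHIFT 3

/* Copy nblocks * 16 bytes starting at src_aligned + SHIFT into dst,
   using only aligned loads from the source.
   Precondition: src_aligned is 16-byte aligned and at least
   (nblocks + 1) * 16 bytes are readable from it. */
SSSE3_FN void copy_shifted(uint8_t *dst, const uint8_t *src_aligned,
                           size_t nblocks)
{
    assert(((uintptr_t)src_aligned & 15u) == 0);
    const __m128i *in = (const __m128i *)src_aligned;

    __m128i lo = _mm_load_si128(in);            /* first aligned block */
    for (size_t i = 0; i < nblocks; i++) {
        __m128i hi = _mm_load_si128(in + i + 1); /* next aligned block */
        /* Bytes SHIFT..SHIFT+15 of the 32-byte pair hi:lo. */
        __m128i v = _mm_alignr_epi8(hi, lo, SHIFT);
        _mm_storeu_si128((__m128i *)(dst + 16 * i), v);
        lo = hi;   /* register-to-register copy; each block is loaded once */
    }
}
```

Each source block is read exactly once with an aligned load, so no load crosses a cache line. A general-purpose routine would need one such loop per possible offset (or a dispatch over the 16 cases), since the shift count cannot be a run-time value.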
