13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS

When using scalar floating-point instructions, it is not necessary to ensure that the data appears in vector form. However, the optimizations regarding alignment, scheduling, instruction selection, and other optimizations covered in Chapter 3 and Chapter 4 should be observed.

6.5 DATA ALIGNMENT

SIMD floating-point data is 16-byte aligned. Referencing unaligned 128-bit SIMD floating-point data will result in an exception unless MOVUPS or MOVUPD (move unaligned packed single or unaligned packed double) is used. The unaligned instructions used on aligned or unaligned data will also suffer a performance penalty relative to aligned accesses.

See also: Section 4.4, “Stack and Data Alignment.”

6.5.1 Data Arrangement

Because SSE and SSE2 incorporate SIMD architecture, arranging data to fully use the SIMD registers produces optimum performance. This implies contiguous data for processing, which leads to fewer cache misses. Correct data arrangement can potentially quadruple data throughput when using SSE or double throughput when using SSE2. Performance gains can occur because four data elements can be loaded with 128-bit load instructions into XMM registers using SSE (MOVAPS). Similarly, two data elements can be loaded with 128-bit load instructions into XMM registers using SSE2 (MOVAPD).

Refer to Section 4.4, “Stack and Data Alignment,” for data arrangement recommendations. Duplicating and padding techniques overcome misalignment problems that occur in some data structures and arrangements. This increases the data space but avoids penalties for misaligned data access.

For some applications (for example: 3D geometry), traditional data arrangement requires some changes to fully utilize the SIMD registers and parallel techniques. Traditionally, the data layout has been an array of structures (AoS).
To fully utilize the SIMD registers in such applications, a new data layout has been proposed: a structure of arrays (SoA), resulting in more optimized performance.

6.5.1.1 Vertical versus Horizontal Computation

The majority of the floating-point arithmetic instructions in SSE/SSE2 are focused on vertical data processing for parallel data elements. This means the destination of each element is the result of an arithmetic operation performed on input operands in the same vertical position (Figure 6-1).

To supplement these homogeneous arithmetic operations on parallel data elements, SSE and SSE2 provide data movement instructions (e.g., SHUFPS) that facilitate moving data elements horizontally.
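
The AoS and SoA layouts and the vertical computation described above can be sketched in C roughly as follows. This is an illustrative sketch, not code from the manual: the names VertexAoS, VerticesSoA, aos_to_soa, add_vertical, and the element count N are hypothetical. It assumes a C11 compiler, and uses the SSE intrinsics _mm_load_ps, _mm_add_ps, and _mm_store_ps, which compilers typically translate to MOVAPS and ADDPS. Where SSE is not available, the sketch falls back to an equivalent scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>              /* SSE intrinsics */
#endif

#define N 8                         /* hypothetical vertex count, multiple of 4 */

/* AoS: the x, y, z of one vertex sit next to each other in memory. */
typedef struct { float x, y, z; } VertexAoS;

/* SoA: each coordinate is stored contiguously and 16-byte aligned,
   so four consecutive x values fill one XMM register. */
typedef struct {
    _Alignas(16) float x[N];
    _Alignas(16) float y[N];
    _Alignas(16) float z[N];
} VerticesSoA;

/* Rearrange AoS input into the SoA layout (one-time conversion). */
void aos_to_soa(const VertexAoS *in, VerticesSoA *out)
{
    for (size_t i = 0; i < N; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Vertical computation: out[i] = a[i] + b[i], four elements per step.
   Assumes n is a multiple of 4 and all pointers are 16-byte aligned. */
void add_vertical(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    assert((((uintptr_t)a | (uintptr_t)b | (uintptr_t)out) % 16) == 0);
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             /* aligned 128-bit load */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  /* lane-wise add, aligned store */
    }
#else
    for (size_t i = 0; i < n; ++i)                  /* scalar fallback */
        out[i] = a[i] + b[i];
#endif
}
```

Calling add_vertical(s.x, s.y, r.x, N) on two VerticesSoA objects s and r adds the x and y coordinates of all vertices, four at a time. With the AoS layout, the same operation would first need the coordinates gathered or shuffled into registers, which is the cost the SoA arrangement is meant to avoid.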
