13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



CODING FOR SIMD ARCHITECTURES

• Use of fewer prefetches, due to fewer streams
• Efficient cache line packing of data elements that are used concurrently.

With the advent of the SIMD technologies, the choice of data organization becomes more important and should be carefully based on the operations to be performed on the data. This will become increasingly important in the Pentium 4 processor and future processors. In some applications, traditional data arrangements may not lead to the maximum performance. Application developers are encouraged to explore different data arrangements and data segmentation policies for efficient computation. This may mean using a combination of AoS, SoA, and Hybrid SoA in a given application.

4.5.2 Strip-Mining

Strip-mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD-encodings of loops, as well as providing a means of improving memory performance. First introduced for vectorizers, this technique consists of generating code in which each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine. By fragmenting a large loop into smaller segments or strips, this technique transforms the loop structure by:

• Increasing the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm.
• Reducing the number of iterations of the loop by a factor of the length of each “vector,” or number of operations being performed per SIMD operation. In the case of Streaming SIMD Extensions, the number of iterations is reduced by a factor of 4, because each single-precision floating-point SIMD operation processes four floating-point data items.

Consider Example 4-16.

Example 4-16. Pseudo-code Before Strip Mining

    typedef struct _VERTEX {
        float x, y, z, nx, ny, nz, u, v;
    } Vertex_rec;

    main()
    {
        Vertex_rec v[Num];
        ....
        for (i=0; i
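
The example listing above is cut off in this copy, so the following is not a reconstruction of it. It is a minimal C sketch of the loop-sectioning idea described in the text, under stated assumptions: the passes `transform` and `lighting`, the functions `process_two_sweeps` and `process_strip_mined`, and the `strip_size` parameter are all hypothetical names chosen for illustration. Only the `Vertex_rec` field layout comes from the text. The sketch shows the memory-locality side of strip-mining: both passes run on one strip while it is likely still in cache, instead of each pass sweeping the whole array.

```c
#include <assert.h>
#include <stddef.h>

/* Same field layout as the Vertex_rec shown in the text. */
typedef struct {
    float x, y, z, nx, ny, nz, u, v;
} Vertex_rec;

/* Hypothetical pass 1: a stand-in for a geometric transform. */
static void transform(Vertex_rec *p)
{
    p->x *= 2.0f;
    p->y *= 2.0f;
    p->z *= 2.0f;
}

/* Hypothetical pass 2: a stand-in for lighting; it reads pass 1 results. */
static void lighting(Vertex_rec *p)
{
    p->u = p->x + p->nx;
    p->v = p->y + p->ny;
}

/* Without strip-mining: each pass walks the entire array, so for a large
   array the data touched by pass 1 may be evicted before pass 2 reads it. */
void process_two_sweeps(Vertex_rec *v, size_t num)
{
    for (size_t i = 0; i < num; i++)
        transform(&v[i]);
    for (size_t i = 0; i < num; i++)
        lighting(&v[i]);
}

/* With strip-mining: the outer loop steps through the array one strip at a
   time, and both passes run over that strip before moving to the next. */
void process_strip_mined(Vertex_rec *v, size_t num, size_t strip_size)
{
    assert(strip_size > 0);
    for (size_t i = 0; i < num; i += strip_size) {
        /* Clamp the last strip so it does not run past the array end. */
        size_t end = (num - i < strip_size) ? num : i + strip_size;
        for (size_t j = i; j < end; j++)
            transform(&v[j]);
        for (size_t j = i; j < end; j++)
            lighting(&v[j]);
    }
}
```

Both functions compute the same result, since each vertex is processed independently; only the order of memory accesses differs. A reasonable starting point is a `strip_size` small enough that one strip of vertices fits in the cache level being targeted, which would need to be confirmed by measurement on the actual processor.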
