13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


OPTIMIZING FOR SIMD INTEGER APPLICATIONS

5.7.1.1 Supplemental Techniques for Avoiding Cache Line Splits

Video processing applications sometimes cannot avoid loading data from memory addresses that are not aligned to 16-byte boundaries. An example of this situation is when each line in a video frame is averaged by shifting horizontally half a pixel. Example 5-35 shows a common operation in video processing that loads data from a memory address not aligned to a 16-byte boundary. As video processing traverses each line in the video frame, it experiences a cache line split for each 64-byte chunk loaded from memory.

Example 5-35. An Example of Video Processing with Cache Line Splits

    // Average half-pels horizontally (on the "x" axis),
    // from one reference frame only.
    nextLinesLoop:
        movdqu xmm0, XMMWORD PTR [edx]          // may not be 16B aligned
        movdqu xmm1, XMMWORD PTR [edx+1]
        movdqu xmm2, XMMWORD PTR [edx+eax]
        movdqu xmm3, XMMWORD PTR [edx+eax+1]
        pavgb  xmm0, xmm1                       // average row 0 with its 1-byte shift
        pavgb  xmm2, xmm3                       // average row 1 with its 1-byte shift
        movdqa XMMWORD PTR [ecx], xmm0
        movdqa XMMWORD PTR [ecx+eax], xmm2
        // (repeat ...)

SSE3 provides an instruction, LDDQU, for loading from memory addresses that are not 16-byte aligned. LDDQU is a special 128-bit unaligned load designed to avoid cache line splits. If the address of the load is aligned on a 16-byte boundary, LDDQU loads the 16 bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the address of the load request. It then provides the requested 16 bytes. If the address is aligned on a 16-byte boundary, the effective number of memory requests is implementation dependent (one, or more).

LDDQU is designed for loading data from memory without storing modified data back to the same address. Thus, the usage of LDDQU should be restricted to situations where no store-to-load forwarding is expected. For situations where store-to-load forwarding is expected, use regular store/load pairs (either aligned or unaligned, based on the alignment of the data accessed).
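As an illustration only, the half-pel averaging of Example 5-35 might be written in C with the unaligned loads replaced by LDDQU, as sketched below. The sketch assumes a single row rather than the two rows of the example, a row length that is a multiple of 16, and a source buffer with at least one readable byte past the row. The function names halfpel_row_scalar and halfpel_row_lddqu are hypothetical and do not come from the manual. _mm_lddqu_si128 and _mm_avg_epu8 are the compiler intrinsics for LDDQU and PAVGB. When the compiler is not targeting SSE3, the sketch falls back to the scalar reference so that it still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE3__)
#include <pmmintrin.h>   /* SSE3 header: declares _mm_lddqu_si128 */
#endif

/* PAVGB semantics for one unsigned byte pair: rounded average (a + b + 1) >> 1. */
static uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Scalar reference: dst[i] = avg(src[i], src[i + 1]) for i in [0, n).
   Reads n + 1 source bytes. */
static void halfpel_row_scalar(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = avg_u8(src[i], src[i + 1]);
}

/* Hypothetical LDDQU variant of the same row operation.
   Assumption: n is a multiple of 16 and src has n + 1 readable bytes.
   The data is only read here, never stored back to the same address,
   so no store-to-load forwarding is expected on these loads. */
static void halfpel_row_lddqu(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(n % 16 == 0);
#if defined(__SSE3__)
    for (size_t i = 0; i < n; i += 16) {
        /* Both loads may be misaligned; the second is offset by one byte. */
        __m128i a = _mm_lddqu_si128((const __m128i *)(src + i));
        __m128i b = _mm_lddqu_si128((const __m128i *)(src + i + 1));
        /* Unaligned store so the sketch places no alignment demand on dst;
           Example 5-35 uses MOVDQA because its destination is aligned. */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_avg_epu8(a, b));
    }
#else
    /* Fallback when SSE3 is not enabled at compile time. */
    halfpel_row_scalar(dst, src, n);
#endif
}
```

With GCC or Clang, the intrinsic path is taken when the file is compiled with SSE3 enabled (for example with -msse3). Whether LDDQU is faster than MOVDQU for a given workload depends on the processor implementation and should be measured rather than assumed.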
