13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING FOR SIMD INTEGER APPLICATIONS

Example 5-36. Video Processing Using LDDQU to Avoid Cache Line Splits

// Average half-pels horizontally (on the "x" axis),
// from one reference frame only.
nextLinesLoop:
    lddqu   xmm0, XMMWORD PTR [edx]         // row 0, pixels x; may not be 16B aligned
    lddqu   xmm1, XMMWORD PTR [edx+1]       // row 0, pixels x+1
    lddqu   xmm2, XMMWORD PTR [edx+eax]     // row 1, pixels x (eax holds the row stride)
    lddqu   xmm3, XMMWORD PTR [edx+eax+1]   // row 1, pixels x+1
    pavgb   xmm0, xmm1                      // average horizontal neighbors, row 0
    pavgb   xmm2, xmm3                      // average horizontal neighbors, row 1
    movdqa  XMMWORD PTR [ecx], xmm0         // results stored elsewhere
    movdqa  XMMWORD PTR [ecx+eax], xmm2
    // (repeat ...)

5.7.2 Increasing Bandwidth of Memory Fills and Video Fills

It is beneficial to understand how memory is accessed and filled. A memory-to-memory fill (for example, a memory-to-video fill) is defined as a 64-byte (cache line) load from memory which is immediately stored back to memory (such as a video frame buffer).

The following are guidelines for obtaining higher bandwidth and shorter latencies for sequential memory fills (video fills). These recommendations are relevant for all Intel architecture processors with MMX technology and refer to cases in which the loads and stores do not hit in the first- or second-level cache.

5.7.2.1 Increasing Memory Bandwidth Using the MOVDQ Instruction

Loading any size data operand will cause an entire cache line to be loaded into the cache hierarchy. Thus, any size load looks more or less the same from a memory bandwidth perspective. However, using many smaller loads consumes more microarchitectural resources than using fewer larger loads. Consuming too many resources can cause the processor to stall and reduce the bandwidth that the processor can request of the memory subsystem.

Using MOVDQ to store the data back to UC memory (or WC memory in some cases) instead of using 32-bit stores (for example, MOVD) reduces the number of stores per memory fill cycle by three-quarters: a 64-byte line takes four 16-byte stores instead of sixteen 4-byte stores. As a result, using MOVDQ in memory fill cycles can achieve significantly higher effective bandwidth than using MOVD.
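The store-count argument above can be illustrated with a minimal C sketch using SSE2 intrinsics. This code is not taken from the manual: the function names fill_32bit and fill_128bit are hypothetical, the buffers are ordinary cacheable memory rather than UC or WC memory, and the returned counts reflect store operations as written in the source (an optimizing compiler may merge the 32-bit stores). It assumes an x86 target with SSE2 and 16-byte-aligned buffers for the 128-bit version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_store_si128 */

/* Hypothetical helper: copy `bytes` bytes using 32-bit loads and stores.
 * Issues 16 stores per 64-byte cache line. Returns the number of stores
 * written in the source. */
size_t fill_32bit(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    size_t stores = 0;
    for (size_t i = 0; i + 4 <= bytes; i += 4) {
        uint32_t v;
        memcpy(&v, src + i, sizeof v);   /* 32-bit load  */
        memcpy(dst + i, &v, sizeof v);   /* 32-bit store */
        stores++;
    }
    return stores;
}

/* Hypothetical helper: copy `bytes` bytes using 16-byte aligned loads and
 * stores (these intrinsics typically compile to MOVDQA).
 * Issues 4 stores per 64-byte cache line, one quarter of fill_32bit.
 * Assumption: dst and src are both 16-byte aligned. */
size_t fill_128bit(uint8_t *dst, const uint8_t *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert(((uintptr_t)src & 15) == 0);

    size_t stores = 0;
    for (size_t i = 0; i + 16 <= bytes; i += 16) {
        __m128i v = _mm_load_si128((const __m128i *)(src + i));
        _mm_store_si128((__m128i *)(dst + i), v);
        stores++;
    }
    return stores;
}
```

For two cache lines (128 bytes), fill_32bit performs 32 stores while fill_128bit performs 8, which is the three-quarters reduction described above. Whether this translates into higher measured bandwidth depends on the memory type and the processor, so it should be confirmed by measurement on the target system.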
