13.07.2015

Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING CACHE USAGE

approach to separate bus read and write transactions. See Section 3.6.11, “Minimizing Bus Latency.”

The technique employs two stages. In the first stage, a block of data is read from memory to the cache sub-system. In the second stage, cached data are written to their destination using streaming stores.

Example 9-11. Memory Copy Using Hardware Prefetch and Bus Segmentation

    void block_prefetch(void *dst,void *src)
    { _asm {
        mov edi,dst
        mov esi,src
        mov edx,SIZE
        align 16
    main_loop:
        xor ecx,ecx
        align 16
    prefetch_loop:
        movaps xmm0, [esi+ecx]
        movaps xmm0, [esi+ecx+64]
        add ecx,128
        cmp ecx,BLOCK_SIZE
        jne prefetch_loop
        xor ecx,ecx
        align 16
    cpy_loop:
        movdqa xmm0,[esi+ecx]
        movdqa xmm1,[esi+ecx+16]
        movdqa xmm2,[esi+ecx+32]
        movdqa xmm3,[esi+ecx+48]
        movdqa xmm4,[esi+ecx+64]
        movdqa xmm5,[esi+ecx+16+64]
        movdqa xmm6,[esi+ecx+32+64]
        movdqa xmm7,[esi+ecx+48+64]
        movntdq [edi+ecx],xmm0
        movntdq [edi+ecx+16],xmm1
        movntdq [edi+ecx+32],xmm2
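The two stages described above can also be sketched in C with SSE2 intrinsics. This is an illustrative sketch only, not the manual's listing: the function name block_copy_sketch and the constants CACHE_LINE and BLOCK_BYTES are hypothetical, the block size is an assumed value rather than a tuned one, and the code assumes 16-byte-aligned buffers whose size is a multiple of the block size.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for illustration; not values from the manual. */
#define CACHE_LINE  64    /* assumed cache-line size in bytes */
#define BLOCK_BYTES 4096  /* assumed block size in bytes */

/* Copy `size` bytes from src to dst, one block at a time, in two stages.
 * Assumptions: dst and src are 16-byte aligned and do not overlap, and
 * size is a multiple of BLOCK_BYTES. */
static void block_copy_sketch(void *dst, const void *src, size_t size)
{
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(size % BLOCK_BYTES == 0);

    char *d = dst;
    const char *s = src;

    for (size_t done = 0; done < size; done += BLOCK_BYTES) {
        /* Stage 1: issue one load per cache line so the whole block is
         * brought from memory into the cache hierarchy. */
        __m128i sink = _mm_setzero_si128();
        for (size_t i = 0; i < BLOCK_BYTES; i += CACHE_LINE) {
            __m128i v = _mm_load_si128((const __m128i *)(s + done + i));
            sink = _mm_or_si128(sink, v);
        }
        /* Consume the result so the compiler keeps the stage-1 loads. */
        volatile int keep = _mm_cvtsi128_si32(sink);
        (void)keep;

        /* Stage 2: re-read the (now cached) block and write it out with
         * streaming (non-temporal) stores. */
        for (size_t i = 0; i < BLOCK_BYTES; i += 16) {
            __m128i v = _mm_load_si128((const __m128i *)(s + done + i));
            _mm_stream_si128((__m128i *)(d + done + i), v);
        }
    }

    /* Order the streaming stores before any later stores. */
    _mm_sfence();
}
```

Separating the read pass from the write pass keeps bus reads and bus writes in distinct bursts, which is the bus-segmentation idea the text describes. Whether this helps on a given processor, and which block size works best, would need to be measured.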
