13.07.2015

Intel® 64 and IA-32 Architectures Optimization Reference Manual


GENERAL OPTIMIZATION GUIDELINES

request bandwidth and delivers significantly lower data bandwidth. This difference is depicted in Examples 3-39 and 3-40.

Example 3-39. Using Non-temporal Stores and 64-byte Bus Write Transactions

    #define STRIDESIZE 256
        lea ecx, p64byte_Aligned
        mov edx, ARRAY_LEN
        xor eax, eax
    slloop:
        movntps XMMWORD ptr [ecx + eax], xmm0
        movntps XMMWORD ptr [ecx + eax+16], xmm0
        movntps XMMWORD ptr [ecx + eax+32], xmm0
        movntps XMMWORD ptr [ecx + eax+48], xmm0
        ; 64 bytes are written in one bus transaction
        add eax, STRIDESIZE
        cmp eax, edx
        jl slloop

Example 3-40. Non-temporal Stores and Partial Bus Write Transactions

    #define STRIDESIZE 256
        lea ecx, p64byte_Aligned
        mov edx, ARRAY_LEN
        xor eax, eax
    slloop:
        movntps XMMWORD ptr [ecx + eax], xmm0
        movntps XMMWORD ptr [ecx + eax+16], xmm0
        movntps XMMWORD ptr [ecx + eax+32], xmm0
        ; Storing 48 bytes results in 6 partial bus transactions
        add eax, STRIDESIZE
        cmp eax, edx
        jl slloop

3.7 PREFETCHING

Recent Intel processor families employ several prefetching mechanisms to accelerate the movement of data or code and improve performance:
  • Hardware instruction prefetcher
  • Software prefetch for data
  • Hardware prefetch for cache lines of data or instructions
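Of the mechanisms listed above, software prefetch for data is the one a programmer controls directly. The following is a minimal sketch in C, not taken from the manual, of what a software prefetch hint can look like. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` builtin typically compiles to a PREFETCHh instruction on x86. The function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical, and the distance of 64 elements is an illustrative placeholder rather than a recommended value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning value: how many elements ahead of the current
   index the prefetch hint is issued. Chosen for illustration only. */
#define PREFETCH_DISTANCE 64

/* Sums n floats while hinting that data a fixed distance ahead
   will be read soon. The hint does not change the result. */
float sum_with_prefetch(const float *data, size_t n)
{
    assert(data != NULL || n == 0);

    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for a read;
               third argument 3 = high temporal locality. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

A simple sequential walk like this one is usually also detected by the hardware data prefetcher, so the explicit hint may bring little or no benefit here. Software prefetch tends to matter more for access patterns the hardware cannot predict, and any chosen distance should be confirmed by measurement on the target processor.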
