Intel® 64 and IA-32 Architectures Optimization Reference Manual

OPTIMIZING FOR SIMD INTEGER APPLICATIONS

5.7.2.2 Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page

DRAM is divided into pages, which are not the same as operating system (OS) pages. The size of a DRAM page is a function of the total size of the DRAM and the organization of the DRAM. Page sizes of several kilobytes are common. Like OS pages, DRAM pages are constructed of sequential addresses. Sequential memory accesses to the same DRAM page have shorter latencies than sequential accesses to different DRAM pages.

In many systems the latency for a page miss (that is, an access to a different page instead of the page previously accessed) can be twice as large as the latency of a memory page hit (an access to the same page as the previous access). Therefore, if the loads and stores of the memory fill cycle are to the same DRAM page, a significant increase in the bandwidth of the memory fill cycles can be achieved.

5.7.2.3 Increasing UC and WC Store Bandwidth by Using Aligned Stores

Using aligned stores to fill UC or WC memory yields higher bandwidth than using unaligned stores. If a UC store or some WC stores cross a cache line boundary, a single store results in two transactions on the bus, reducing the efficiency of the bus transactions. By aligning the stores to the size of the stores, you eliminate the possibility of crossing a cache line boundary, and the stores are not split into separate transactions.

5.8 CONVERTING FROM 64-BIT TO 128-BIT SIMD INTEGERS

SSE2 defines a superset of the 64-bit integer instructions available in MMX technology; the operation of the extended instructions remains the same. The superset simply operates on data that is twice as wide.
This simplifies porting of 64-bit integer applications. However, there are a few considerations:

• Computation instructions that use a memory operand which may not be aligned to a 16-byte boundary must be replaced with an unaligned 128-bit load (MOVDQU) followed by the same computation operation using register operands instead. Use of a 128-bit integer computation instruction with a memory operand that is not 16-byte aligned results in a #GP (general-protection) exception. Unaligned 128-bit loads and stores are not as efficient as the corresponding aligned versions; this can reduce the performance gain from the 128-bit SIMD integer extensions.

• General guidelines on the alignment of memory operands are:

— The greatest performance gains can be achieved when all memory streams are 16-byte aligned.
