Intel® 64 and IA-32 Architectures Optimization Reference Manual

GENERAL OPTIMIZATION GUIDELINES

There are six write-combining buffers in each processor core in Intel Core Duo and Intel Core Solo processors. Processors based on Intel Core microarchitecture have eight write-combining buffers in each core.

Assembly/Compiler Coding Rule 58. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the body of the loop such that only four arrays are being written to in each iteration of each of the resulting loops.

Write-combining buffers are used for stores of all memory types. They are particularly important for writes to uncached memory: writes to different parts of the same cache line can be grouped into a single, full-cache-line bus transaction instead of going across the bus (since they are not cached) as several partial writes. Avoiding partial writes can have a significant impact on bus-bandwidth-bound graphics applications, where graphics buffers are in uncached memory. Separating writes to uncached memory and writes to writeback memory into separate phases can ensure that the write-combining buffers fill before being evicted by other write traffic. Eliminating partial write transactions has been found to have a performance impact on the order of 20% for some applications. Because cache lines are 64 bytes, a write to the bus for 63 bytes will result in 8 partial bus transactions.

When coding functions that execute simultaneously on two threads, reducing the number of writes that are allowed in an inner loop will help take full advantage of write-combining store buffers. For write-combining buffer recommendations for Hyper-Threading Technology, see Chapter 8, "Multicore and Hyper-Threading Technology."

Store ordering and visibility are also important issues for write combining.
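As an illustration of the loop-fission rule above (Rule 58), the following hypothetical C sketch splits a loop that streams to six distinct arrays into two loops that each write at most four, so every store stream can claim a write-combining buffer. The function and array names are invented for the example:

```c
#include <stddef.h>

/* Hypothetical "before" case: the loop writes six distinct arrays
   (six store streams), exceeding the four streams that can be
   write-combined simultaneously under Rule 58's assumptions. */
void write_six_fused(float *a, float *b, float *c,
                     float *d, float *e, float *f,
                     const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i];
        b[i] = src[i] * 2.0f;
        c[i] = src[i] + 1.0f;
        d[i] = src[i] - 1.0f;
        e[i] = src[i] * 0.5f;
        f[i] = -src[i];
    }
}

/* After loop fission: each resulting loop writes at most four
   arrays, so all of its store streams can be write-combined. */
void write_six_fissioned(float *a, float *b, float *c,
                         float *d, float *e, float *f,
                         const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i];
        b[i] = src[i] * 2.0f;
        c[i] = src[i] + 1.0f;
        d[i] = src[i] - 1.0f;
    }
    for (size_t i = 0; i < n; i++) {
        e[i] = src[i] * 0.5f;
        f[i] = -src[i];
    }
}
```

The transformation is legal here because the stores are independent; a compiler may perform this fission automatically under aggressive loop optimization, but it is also easy to apply by hand.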
When a write to a write-combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO). If a subsequent write happens to another write-combining buffer, a separate RFO may be caused for that cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been serviced, to guarantee properly ordered visibility of the writes. If the memory type for the writes is write-combining, there will be no RFO since the line is not cached, and there is no such delay. For details on write combining, see Chapter 10, "Memory Cache Control," of Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.

3.6.10 Locality Enhancement

Locality enhancement can reduce data traffic originating from an outer-level sub-system in the cache/memory hierarchy. This addresses the fact that the access cost, in terms of cycle count, from an outer level will be higher than from an inner level. Typically, the cycle cost of accessing a given cache level (or memory system) varies across microarchitectures, processor implementations, and platform components. It may be sufficient to recognize the relative data-access-cost trend by locality, rather than to follow a large table of numeric cycle costs listed per locality, per processor/platform implementation, etc. The general trend is that access cost from an outer sub-system may be approximately 3-10X that of an inner level.
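One classic locality-enhancement technique in this spirit is loop blocking (tiling). The sketch below is a hypothetical illustration, not code from the manual: it restructures a matrix transpose so that each BLOCK x BLOCK tile is fully processed while it is resident in an inner cache level, reducing the traffic that must come from outer levels of the hierarchy. The tile size is an assumed placeholder that would be tuned for a particular cache:

```c
#include <stddef.h>

#define BLOCK 64 /* hypothetical tile size; tune per target cache level */

/* Blocked (tiled) matrix transpose of an n x n matrix. A naive
   transpose walks one operand column-wise, touching a new cache
   line on nearly every access once n is large. Tiling keeps the
   working set of each tile small enough to stay in an inner cache. */
void transpose_blocked(float *dst, const float *src, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            /* Process one BLOCK x BLOCK tile entirely before
               moving on, so its lines are reused while cached. */
            for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The same tiling pattern applies to matrix multiplication and other kernels whose naive loop order has a working set larger than an inner cache level.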
