13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

MULTICORE AND HYPER-THREADING TECHNOLOGY

User/Source Coding Rule 28. (M impact, M generality) Consider overlapping multiple back-to-back memory reads to improve effective cache miss latencies.

Another technique to reduce effective memory latency is possible if one can adjust the data access pattern such that the access strides causing successive cache misses in the last-level cache are predominantly less than the trigger threshold distance of the automatic hardware prefetcher. See Section 9.6.3, “Example of Effective Latency Reduction with Hardware Prefetch.”

User/Source Coding Rule 29. (M impact, M generality) Consider adjusting the sequencing of memory references such that the distribution of distances of successive cache misses of the last-level cache peaks towards 64 bytes.

8.5.5 Use Full Write Transactions to Achieve Higher Data Rate

Write transactions across the bus can result in writes to physical memory using either the full line size of 64 bytes or less than the full line size. The latter is referred to as a partial write. Typically, writes to writeback (WB) memory addresses are full-size, and writes to write-combine (WC) or uncacheable (UC) type memory addresses result in partial writes. Both cached WB store operations and WC store operations utilize a set of six WC buffers (64 bytes wide) to manage the traffic of write transactions. When competing traffic closes a WC buffer before all writes to the buffer are finished, this results in a series of 8-byte partial bus transactions rather than a single 64-byte write transaction.

User/Source Coding Rule 30. (M impact, M generality) Use full write transactions to achieve higher data throughput.

Frequently, multiple partial writes to WC memory can be combined into full-sized writes using a software write-combining technique that keeps WC store operations from competing with WB store traffic.
To implement software write-combining, uncacheable writes to memory with the WC attribute are written to a small, temporary buffer (WB type) that fits in the first-level data cache. When the temporary buffer is full, the application copies the content of the temporary buffer to the final WC destination.

When partial writes are transacted on the bus, the effective data rate to system memory is reduced to only 1/8 of the system bus bandwidth.

8.6 MEMORY OPTIMIZATION

Efficient operation of caches is a critical aspect of memory optimization. Efficient operation of caches needs to address the following:
• Cache blocking
• Shared memory optimization
• Eliminating 64-KByte aliased data accesses
• Preventing excessive evictions in first-level cache
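
The software write-combining technique described in Section 8.5.5 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the manual's own code: the names swc_writer, swc_init, swc_write, and swc_flush are hypothetical, the staging buffer is assumed to be a single 64-byte line in ordinary (WB) memory, and plain memcpy stands in for the copy to the final destination. Mapping the destination as WC memory is platform-specific and is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64  /* full write transaction size given in the text */

/* Hypothetical helper: a one-line staging buffer in ordinary (WB) memory. */
typedef struct {
    uint8_t  stage[LINE_SIZE]; /* small temporary buffer, stays in L1 */
    size_t   used;             /* number of bytes currently staged */
    uint8_t *dst;              /* next write position in the final destination */
} swc_writer;

void swc_init(swc_writer *w, void *final_dst)
{
    w->used = 0;
    w->dst  = (uint8_t *)final_dst;
}

/* Copy whatever is staged to the destination in one burst.
 * Assumption: on real WC memory, a full 64-byte burst gives the hardware
 * the chance to issue one full-line write instead of partial writes. */
void swc_flush(swc_writer *w)
{
    if (w->used == 0)
        return;
    memcpy(w->dst, w->stage, w->used);
    w->dst  += w->used;
    w->used  = 0;
}

/* Accumulate small writes in the staging buffer; flush each time it fills. */
void swc_write(swc_writer *w, const void *src, size_t n)
{
    const uint8_t *p = (const uint8_t *)src;
    while (n > 0) {
        size_t room  = LINE_SIZE - w->used;
        size_t chunk = n < room ? n : room;
        memcpy(w->stage + w->used, p, chunk);
        w->used += chunk;
        p       += chunk;
        n       -= chunk;
        if (w->used == LINE_SIZE)
            swc_flush(w);
    }
}
```

A caller would issue its small stores through swc_write and call swc_flush once at the end to push out any final, partially filled line. Only that last flush can be shorter than 64 bytes; every other copy to the destination covers a full line.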
