Intel® 64 and IA-32 Architectures Optimization Reference Manual
3.6.11 Minimizing Bus Latency

Each bus transaction includes the overhead of making requests and arbitrations. The average latency of bus read and bus write transactions will be longer if reads and writes alternate. Segmenting reads and writes into phases can reduce the average latency of bus transactions, because it reduces the number of successive transactions in which a read follows a write or a write follows a read. The first sketch at the end of this section illustrates such a transformation.

User/Source Coding Rule 11. (M impact, ML generality) If there is a blend of reads and writes on the bus, changing the code to separate these bus transactions into read phases and write phases can help performance.

Note, however, that the order of read and write operations on the bus is not the same as it appears in the program.

Bus latency for fetching a cache line of data can vary as a function of the access stride of data references. In general, bus latency increases with the stride of successive cache misses. Independently, bus latency also increases with bus queue depth (the number of outstanding bus requests of a given transaction type). The combination of these two trends can be highly non-linear: under large-stride, bandwidth-sensitive conditions, the effective throughput of the bus system for data-parallel accesses can be significantly lower than under small-stride, bandwidth-sensitive conditions.

To minimize the per-access cost of memory traffic, or to amortize raw memory latency effectively, software should control its cache miss pattern to favor a higher concentration of smaller-stride cache misses. The second sketch at the end of this section contrasts the two access patterns.

User/Source Coding Rule 12. (H impact, H generality) To achieve effective amortization of bus latency, software should favor data access patterns that result in higher concentrations of cache miss patterns, with cache miss strides that are significantly smaller than half the hardware prefetch trigger threshold.

3.6.12 Non-Temporal Store Bus Traffic

Peak system bus bandwidth is shared by several types of bus activities, including reads (from memory), reads for ownership (of a cache line), and writes. The data transfer rate for bus write transactions is higher if 64 bytes are written out to the bus at a time.

Typically, bus writes to Writeback (WB) memory must share the system bus bandwidth with read-for-ownership (RFO) traffic. Non-temporal stores do not require RFO traffic, but they do require care in managing access patterns to ensure that 64 bytes are evicted at once rather than as several 8-byte chunks.

Although the data bandwidth of full 64-byte bus writes due to non-temporal stores is twice that of bus writes to WB memory, transferring 8-byte chunks wastes bus request bandwidth and delivers significantly lower data bandwidth.
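The first sketch below illustrates Coding Rule 11. It is not taken from the manual; the array types, the BLOCK size, and the doubling operation are illustrative assumptions. The interleaved loop alternates a bus read (of src) with a bus write (of dst) on every cache-missing iteration, while the phased loop gathers a block of reads into a small cache-resident buffer and then issues the corresponding writes together.

/* Sketch of User/Source Coding Rule 11: segmenting bus reads and bus
 * writes into separate phases.  Names and sizes are illustrative. */
#include <stddef.h>

#define BLOCK 1024  /* elements per phase; tune so tmp stays in cache */

/* Interleaved version: each iteration issues a read of src[] followed
 * by a write to dst[], so read and write bus transactions alternate
 * whenever both arrays miss the cache. */
void copy_interleaved(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0;
}

/* Phased version: a block of reads lands in a small stack buffer
 * (cache hits, no bus writes), then a block of writes to dst follows,
 * reducing read-after-write and write-after-read alternation. */
void copy_phased(double *dst, const double *src, size_t n)
{
    double tmp[BLOCK];
    for (size_t base = 0; base < n; base += BLOCK) {
        size_t len = (n - base < BLOCK) ? (n - base) : BLOCK;
        for (size_t i = 0; i < len; i++)   /* read phase  */
            tmp[i] = src[base + i] * 2.0;
        for (size_t i = 0; i < len; i++)   /* write phase */
            dst[base + i] = tmp[i];
    }
}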
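The second sketch contrasts the access patterns behind Coding Rule 12; the array dimensions are illustrative assumptions. With 64-byte cache lines and 8-byte elements, the row-major loop misses at most once per eight accesses and its successive misses are only 64 bytes apart, well within the hardware prefetcher's trigger threshold; the column-major loop's misses are COLS * 8 bytes apart.

/* Sketch of User/Source Coding Rule 12: favoring a high concentration
 * of small-stride cache misses.  Dimensions are illustrative. */
#define ROWS 1024
#define COLS 1024

static double a[ROWS][COLS];

/* Large-stride traversal: successive references are COLS * 8 bytes
 * apart, so nearly every access can miss, and the misses are too far
 * apart for the hardware prefetcher to hide the bus latency. */
double sum_column_major(void)
{
    double s = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            s += a[i][j];
    return s;
}

/* Small-stride traversal: successive references are 8 bytes apart;
 * one miss brings in a 64-byte line that serves the next seven
 * accesses, and the sequential miss stride stays small. */
double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            s += a[i][j];
    return s;
}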
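The final sketch shows one way to make non-temporal stores evict full 64-byte lines, as Section 3.6.12 requires, using the standard SSE2 streaming-store intrinsics from <emmintrin.h>; the function name, fill pattern, and alignment contract are assumptions for illustration. Four consecutive 16-byte streaming stores complete a full line in the write-combining buffer, so it can be flushed as a single full-line bus write with no RFO, rather than as several partial chunks.

/* Sketch: filling whole 64-byte lines with non-temporal stores.
 * Assumes dst is 64-byte aligned and len is a multiple of 64. */
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

void fill_streaming(void *dst, uint64_t pattern, size_t len)
{
    __m128i v = _mm_set1_epi64x((long long)pattern);
    char *p = (char *)dst;

    /* Four back-to-back 16-byte movntdq stores complete one 64-byte
     * line, letting the write-combining buffer evict it in a single
     * full-line bus write without read-for-ownership traffic. */
    for (size_t i = 0; i + 64 <= len; i += 64) {
        _mm_stream_si128((__m128i *)(p + i),      v);
        _mm_stream_si128((__m128i *)(p + i + 16), v);
        _mm_stream_si128((__m128i *)(p + i + 32), v);
        _mm_stream_si128((__m128i *)(p + i + 48), v);
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}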
