Intel® 64 and IA-32 Architectures Optimization Reference Manual

Reducing the amount of large-stride cache misses (or reducing DTLB misses) will alleviate the problem of bandwidth reduction due to large-stride cache misses.

One way to conserve available bus command bandwidth is to improve the locality of code and data. Improving the locality of data reduces the number of cache line evictions and requests to fetch data. This technique also reduces the number of instruction fetches from system memory.

User/Source Coding Rule 26. (M impact, H generality) Improve data and code locality to conserve bus command bandwidth.

Using a compiler that supports profile-guided optimization can improve code locality by keeping frequently used code paths in the cache. This reduces instruction fetches. Loop blocking can also improve data locality (an illustrative blocking sketch follows Section 8.5.2 below). Other locality enhancement techniques can also be applied in a multithreading environment to conserve bus bandwidth (see Section 9.6, "Memory Optimization Using Prefetch").

Because the system bus is shared between many bus agents (logical processors or processor cores), software tuning should recognize symptoms of the bus approaching saturation. One useful technique is to examine the queue depth of bus read traffic (see Appendix A.2.1.3, "Workload Characterization"). When the bus queue depth is high, locality enhancement to improve cache utilization will benefit performance more than other techniques, such as inserting more software prefetches or masking memory latency with overlapping bus reads. An approximate working guideline for software to operate below bus saturation is to check whether the bus read queue depth is significantly below 5.

Some MP and workstation platforms may have a chipset that provides two system buses, with each bus servicing one or more physical processors. The guidelines for conserving bus bandwidth described above also apply to each bus domain.

8.5.2 Understand the Bus and Cache Interactions

Be careful when parallelizing code sections with data sets that result in the total working set exceeding the second-level cache and/or consumed bandwidth exceeding the capacity of the bus. On an Intel Core Duo processor, if only one thread is using the second-level cache and/or bus, then it is expected to get the maximum benefit of the cache and bus systems because the other core does not interfere with the progress of the first thread. However, if two threads use the second-level cache concurrently, there may be performance degradation if one of the following conditions is true:

• Their combined working set is greater than the second-level cache size.
• Their combined bus usage is greater than the capacity of the bus.
• They both have extensive access to the same set in the second-level cache, and at least one of the threads writes to a cache line in this set.

To avoid these pitfalls, multithreading software should investigate parallelism schemes in which only one of the threads accesses the second-level cache at a time, or in which combined second-level cache and bus usage do not exceed their limits.
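The following sketch illustrates the loop-blocking technique mentioned in Section 8.5.1 above. It tiles a matrix transpose so that the strided accesses stay within a small, cache-resident tile instead of walking an entire large-stride column. The matrix dimension N, the tile size BLOCK, and the function name are illustrative assumptions, not values taken from this manual; BLOCK should be tuned so that a source tile and a destination tile together fit in the first-level cache of the target processor.

    /* Illustrative loop-blocking (cache-blocking) sketch in C.
     * N and BLOCK are placeholder values, not recommendations. */
    #include <stddef.h>

    #define N     1024      /* matrix dimension (assumed)             */
    #define BLOCK 64        /* tile edge; tune to the target caches   */

    void transpose_blocked(const double *src, double *dst)
    {
        for (size_t ii = 0; ii < N; ii += BLOCK) {
            for (size_t jj = 0; jj < N; jj += BLOCK) {
                /* Process one BLOCK x BLOCK tile at a time so that the
                 * column-wise (large-stride) accesses are confined to a
                 * tile that remains resident in the cache. */
                for (size_t i = ii; i < ii + BLOCK && i < N; i++) {
                    for (size_t j = jj; j < jj + BLOCK && j < N; j++) {
                        dst[j * N + i] = src[i * N + j];
                    }
                }
            }
        }
    }

Confining the strided accesses to a tile reduces large-stride cache misses and DTLB misses, which in turn conserves bus command bandwidth as described above.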
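To make the guidance of Section 8.5.2 more concrete, the next sketch shows one possible parallelism scheme in which two threads operate on disjoint chunks that are deliberately sized so that their combined working set stays within the shared second-level cache. The assumed L2 size, the headroom factor, the thread count, and all identifiers are assumptions chosen for illustration; actual values should come from the target platform.

    /* Illustrative sketch: size per-thread working sets so that the
     * combined working set of both threads fits in the shared L2.
     * L2_SIZE_ASSUMED and the headroom divisor are assumptions. */
    #include <pthread.h>
    #include <stdlib.h>

    #define NUM_THREADS      2
    #define L2_SIZE_ASSUMED  (2u * 1024u * 1024u)   /* assumed 2-MByte shared L2 */
    #define CHUNK_BYTES      (L2_SIZE_ASSUMED / (2 * NUM_THREADS)) /* leave headroom */

    typedef struct {
        const float *data;     /* this thread's private chunk        */
        size_t       count;    /* number of elements in the chunk    */
        float        result;   /* per-thread partial sum             */
    } work_t;

    static void *accumulate(void *arg)
    {
        work_t *w = (work_t *)arg;
        float sum = 0.0f;
        /* Each thread streams only through its own chunk; because the
         * chunks together fit in the second-level cache, the threads do
         * not continually evict each other's lines or refetch data over
         * the bus. */
        for (size_t i = 0; i < w->count; i++)
            sum += w->data[i];
        w->result = sum;
        return NULL;
    }

    int main(void)
    {
        size_t per_thread = CHUNK_BYTES / sizeof(float);
        float *buf = calloc((size_t)NUM_THREADS * per_thread, sizeof(float));
        pthread_t tid[NUM_THREADS];
        work_t    work[NUM_THREADS];

        if (buf == NULL)
            return 1;

        for (int t = 0; t < NUM_THREADS; t++) {
            work[t].data   = buf + (size_t)t * per_thread;
            work[t].count  = per_thread;
            work[t].result = 0.0f;
            pthread_create(&tid[t], NULL, accumulate, &work[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);

        free(buf);
        return 0;
    }

A larger data set would be processed in passes, with each pass bounded by CHUNK_BYTES per thread, so that combined second-level cache and bus usage remain below their limits.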