Intel® 64 and IA-32 Architectures Optimization Reference Manual

MULTICORE AND HYPER-THREADING TECHNOLOGY

Tuning Suggestion 6. Use on-chip execution resources cooperatively if two logical processors are sharing the execution resources in the same processor core.

8.9.1 Using Shared Execution Resources in a Processor Core

One way to measure the degree of overall resource utilization by a single thread is to use performance-monitoring events to count the clock cycles that a logical processor is executing code, and compare that number to the number of instructions executed to completion. Such performance metrics are described in Appendix B and can be accessed using the Intel VTune Performance Analyzer.

Event ratios like non-halted cycles per instruction retired (non-halted CPI) and non-sleep CPI can be useful in directing code-tuning efforts. The non-sleep CPI metric can be interpreted as the inverse of the overall throughput of a physical processor package. The non-halted CPI metric can be interpreted as the inverse of the throughput of a logical processor.9

When a single thread is executing and all on-chip execution resources are available to it, non-halted CPI can indicate the unused execution bandwidth available in the physical processor package. If the value of non-halted CPI is significantly higher than unity and overall on-chip execution resource utilization is low, a multithreaded application can direct tuning efforts to encompass the factors discussed earlier.

An optimized single thread with exclusive use of on-chip execution resources may exhibit a non-halted CPI in the neighborhood of unity.10 Because the most frequently used instructions typically decode into a single micro-op and have a throughput of no more than two cycles, an optimized thread that retires one micro-op per cycle is only consuming about one third of peak retirement bandwidth.
Significant portions of the issue port bandwidth are left unused. Thus, optimizing single-thread performance usually can be complementary with optimizing a multithreaded application to take advantage of the benefits of HT Technology.

On a processor supporting HT Technology, it is possible that an execution unit with lower throughput than one issue every two cycles may find itself in contention from two threads implemented using a data decomposition threading model. In one scenario, this can happen when the inner loops of both threads rely on executing a low-throughput instruction, such as FDIV, and the execution time of the inner loops is bound by the throughput of FDIV.

Using a function decomposition threading model, a multithreaded application can pair up a thread with critical dependence on a low-throughput resource with other threads that do not have the same dependency.

9. Non-halted CPI can correlate to the resource utilization of an application thread, if the application thread is affinitized to a fixed logical processor.

10. In current implementations of processors based on Intel NetBurst microarchitecture, the theoretical lower bound for either non-halted CPI or non-sleep CPI is 1/3. Practical applications rarely achieve any value close to this lower bound.
