13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

MULTICORE AND HYPER-THREADING TECHNOLOGY

8.7.1 Avoid Excessive Loop Unrolling

Unrolling loops can reduce the number of branches and improve the branch predictability of application code. Loop unrolling is discussed in detail in Chapter 3. Loop unrolling must be used judiciously. Be sure to weigh the benefit of improved branch predictability against the cost of increased code size relative to the Trace Cache.

User/Source Coding Rule 36. (M impact, L generality) Avoid excessive loop unrolling to ensure the Trace Cache is operating efficiently.

On HT-Technology-enabled processors, excessive loop unrolling is likely to reduce the Trace Cache's ability to deliver high bandwidth μop streams to the execution engine.

8.7.2 Optimization for Code Size

When the Trace Cache is continuously and repeatedly delivering μop traces that are pre-built, the scheduler in the execution engine can dispatch μops for execution at a high rate and maximize the utilization of available execution resources. Optimizing application code size by organizing code sequences that are repeatedly executed into sections, each with a footprint that can fit into the Trace Cache, can improve application performance greatly.

On HT-Technology-enabled processors, multithreaded applications should improve code locality of frequently executed sections and target one half of the size of the Trace Cache for each application thread when considering code size optimization. If code size becomes an issue affecting the efficiency of the front end, this may be detected by evaluating performance metrics discussed in the previous sub-section with respect to loop unrolling.

User/Source Coding Rule 37. (L impact, L generality) Optimize code size to improve locality of the Trace Cache and increase delivered trace length.

8.8 USING THREAD AFFINITIES TO MANAGE SHARED PLATFORM RESOURCES

Each logical processor in an MP system has a unique initial APIC_ID, which can be queried using CPUID. Resources shared by more than one logical processor in a multithreading platform can be mapped into a three-level hierarchy for a non-clustered MP system. Each of the three levels can be identified by a label, which can be extracted from the initial APIC_ID associated with a logical processor. See Chapter 7 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A for details. The three levels are:

• Physical processor package. A PACKAGE_ID label can be used to distinguish different physical packages within a cluster.
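
As a rough illustration of how a label can be extracted from the initial APIC_ID, the sketch below treats PACKAGE_ID as the upper bits of the ID. It is a minimal sketch under stated assumptions, not the procedure from the Software Developer's Manual: the function names and the sub_package_bits parameter (the number of low-order bits occupied by the levels below the package) are hypothetical, and on real hardware that width must be derived from CPUID as described in Volume 3A. The only hardware detail relied on is that CPUID leaf 1 reports the 8-bit initial APIC ID in EBX bits 31:24.

```python
def initial_apic_id(cpuid1_ebx: int) -> int:
    """Return the 8-bit initial APIC ID held in bits 31:24 of CPUID.1:EBX."""
    return (cpuid1_ebx >> 24) & 0xFF


def package_id(apic_id: int, sub_package_bits: int) -> int:
    """Return the PACKAGE_ID label: the APIC ID bits above the sub-package fields.

    sub_package_bits is an assumed input (hypothetical parameter); it is not
    computed here and must come from CPUID on a real system.
    """
    if sub_package_bits < 0:
        raise ValueError("sub_package_bits must be non-negative")
    return apic_id >> sub_package_bits


def group_by_package(apic_ids, sub_package_bits):
    """Group initial APIC IDs by the physical package they belong to."""
    groups = {}
    for apic_id in apic_ids:
        key = package_id(apic_id, sub_package_bits)
        groups.setdefault(key, []).append(apic_id)
    return groups
```

For example, with one low-order bit reserved below the package level, initial APIC IDs 0 and 1 map to package 0 and IDs 2 and 3 map to package 1. A grouping like this is what an application would consult before assigning thread affinities so that threads either share or avoid sharing a package's resources.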
