
Intel® 64 and IA-32 Architectures Optimization Reference Manual



USING PERFORMANCE MONITORING EVENTS

  • Bus_Snoops, event number 77H, unit mask 00H: This event counts the number of CLEAN, HIT, or HITM responses to external snoops detected on the bus.

    In a single-processor system, CLEAN and HIT responses are not likely to happen. In a multiprocessor system this event indicates an L2 miss in one processor that did not find the missed data on other processors.

    In a single-processor system, an HITM response indicates that an L1 miss (instruction or data) found the missed cache line in the other core in a modified state. In a multiprocessor system, this event also indicates that an L1 miss (instruction or data) found the missed cache line in another core in a modified state.

B.6 DRILL-DOWN TECHNIQUES FOR PERFORMANCE ANALYSIS

Software performance intertwines code and microarchitectural characteristics of the processor. Performance monitoring events provide insights into these interactions. Each microarchitecture often provides a large set of performance events that target different sub-systems within the microarchitecture. Having a methodical approach to selecting key performance events will likely improve a programmer’s understanding of the performance bottlenecks and improve the efficiency of the code-tuning effort.

Recent generations of Intel 64 and IA-32 processors feature microarchitectures using an out-of-order execution engine. They are also accompanied by an in-order front end and retirement logic that enforces program order. Superscalar hardware, buffering and speculative execution often complicate the interpretation of performance events and software-visible performance bottlenecks.

This section discusses a methodology of using performance events to drill down on likely areas of performance bottleneck.
By narrowing down to a small set of performance events, the programmer can take advantage of Intel VTune Performance Analyzer to correlate performance bottlenecks with source code locations and apply coding recommendations discussed in Chapter 3 through Chapter 8. Although the general principles of our method can be applied to different microarchitectures, this section will use performance events available in processors based on Intel Core microarchitecture for simplicity.

Performance tuning usually centers around reducing the time it takes to complete a well-defined workload. Performance events can be used to measure the elapsed time between the start and end of a workload. Thus, reducing the elapsed time of completing a workload is equivalent to reducing measured processor cycles.

The drill-down methodology can be summarized as four phases of performance event measurements to help characterize interactions of the code with key pipe stages or sub-systems of the microarchitecture. The relation of the performance event drill-down methodology to the software tuning feedback loop is illustrated in Figure B-2.
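
To illustrate the earlier point that reducing elapsed time is equivalent to reducing measured processor cycles, the following is a minimal sketch rather than anything taken from the manual. It assumes the core clock runs at a constant frequency for the whole measurement. The function names, the 2.4 GHz clock rate, and the cycle counts are all hypothetical values chosen for illustration.

```python
def cycles_to_seconds(cycles: int, core_hz: float) -> float:
    """Convert a measured core cycle count to elapsed seconds.

    Assumption: the core clock stays at core_hz for the whole workload.
    """
    if core_hz <= 0:
        raise ValueError("core_hz must be positive")
    return cycles / core_hz


def speedup(baseline_cycles: int, tuned_cycles: int) -> float:
    """Ratio of baseline to tuned cycle counts for the same workload."""
    if tuned_cycles <= 0:
        raise ValueError("tuned_cycles must be positive")
    return baseline_cycles / tuned_cycles


# Hypothetical measurements of one well-defined workload, before and after tuning.
CORE_HZ = 2.4e9                    # assumed fixed core clock rate
BASELINE_CYCLES = 4_800_000_000    # illustrative cycle count before tuning
TUNED_CYCLES = 3_600_000_000       # illustrative cycle count after tuning

baseline_s = cycles_to_seconds(BASELINE_CYCLES, CORE_HZ)
tuned_s = cycles_to_seconds(TUNED_CYCLES, CORE_HZ)

print(f"baseline: {baseline_s:.3f} s, tuned: {tuned_s:.3f} s")
print(f"speedup:  {speedup(BASELINE_CYCLES, TUNED_CYCLES):.3f}x")
```

Under the fixed-frequency assumption the ratio of cycle counts equals the ratio of elapsed times, which is why a cycle count can serve as the tuning target. If the clock rate varies during the run, the conversion to seconds no longer holds exactly.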
