13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



USING PERFORMANCE MONITORING EVENTS

that allows qualification at the physical processor boundary or at bus agent boundary.

Some events allow qualifications that permit the counting of microarchitectural conditions associated with a particular core versus counts from all cores in a physical processor (see L2 and bus related events in Appendix A of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B).

When a multi-threaded workload does not use all cores continuously, a performance counter counting a core-specific condition may progress to some extent on the halted core and stop progressing, or a unit mask may be qualified to continue counting occurrences of the condition attributed to either processor core. Typically, one can adjust the highest two bits (bits 15:14 of the IA32_PERFEVTSELx MSR) in the unit mask field to distinguish such asymmetry (see Chapter 18, “Debugging and Performance Monitoring,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B).

There are three cycle-counting events which will not progress on a halted core, even if the halted core is being snooped. These are: Unhalted core cycles, Unhalted reference cycles, and Unhalted bus cycles. All three events are detected for the unit selected by event 3CH.

Some events detect microarchitectural conditions but are limited in their ability to identify the originating core or physical processor. For example, bus_drdy_clocks may be programmed with a unit mask of 20H to include all agents on a bus. In this case, the performance counter in each core will report nearly identical values.
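The unit-mask qualifiers described above can be illustrated with a small sketch that assembles an IA32_PERFEVTSELx value. It is not taken from the manual: the helper names (`perfevtsel`, `core_specificity`), both event codes, and the 01B "this core" encoding are assumptions made for illustration and should be checked against the event tables in the Software Developer’s Manual. Only the 20H all-agents unit mask and the 11B all-cores setting come from the text.

```python
# Bit positions of IA32_PERFEVTSELx fields (architectural layout, SDM Volume 3B).
UMASK_SHIFT = 8      # unit mask occupies bits 15:8; event select is bits 7:0
USR = 1 << 16        # count while in user mode
OS = 1 << 17         # count while in kernel mode
EN = 1 << 22         # enable the counter

# Qualifiers inside the 8-bit unit mask.
CORE_ALL = 0b11 << 6   # MSR bits 15:14 = 11B: all cores (per the text)
CORE_THIS = 0b01 << 6  # MSR bits 15:14 = 01B: this core only (assumed encoding)
AGENT_ALL = 0x20       # unit mask 20H: all agents on the bus (per the text)

# Event codes below are assumptions; confirm them in the SDM event tables.
BUS_DRDY_CLOCKS = 0x62     # assumed code for bus_drdy_clocks
EXAMPLE_CORE_EVENT = 0x24  # placeholder for some core-qualified L2 event


def perfevtsel(event, umask, flags=USR | OS | EN):
    """Return the value to write to IA32_PERFEVTSELx for one event."""
    if not 0 <= event <= 0xFF or not 0 <= umask <= 0xFF:
        raise ValueError("event and umask must each fit in 8 bits")
    return event | (umask << UMASK_SHIFT) | flags


def core_specificity(value):
    """Extract bits 15:14 (the core-specificity sub-field) from an MSR value."""
    return (value >> 14) & 0b11


bus_all_agents = perfevtsel(BUS_DRDY_CLOCKS, AGENT_ALL)
l2_this_core = perfevtsel(EXAMPLE_CORE_EVENT, CORE_THIS)
l2_all_cores = perfevtsel(EXAMPLE_CORE_EVENT, CORE_ALL)

for name, value in [("bus, all agents", bus_all_agents),
                    ("L2, this core", l2_this_core),
                    ("L2, all cores", l2_all_cores)]:
    print(f"{name}: {value:#x} (bits 15:14 = {core_specificity(value):02b})")
```

Writing the value to the MSR itself requires privileged access (for example through a kernel driver) and is deliberately left out of the sketch.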
Performance tools interpreting these counts should equate bus activity with the event count from a single core, not with the sum of the counts from all cores.

The above is also applicable when the core-specificity sub-field (bits 15:14 of the IA32_PERFEVTSELx MSR) within an event mask is programmed with 11B. The result reported by the performance counter on each core will be nearly identical.

B.5.2 Ratio Interpretation

Ratios of two events are useful for analyzing various characteristics of a workload. It may be possible to acquire such ratios at multiple granularities, for example: (1) per-application thread, (2) per logical processor, (3) per core, and (4) per physical processor.

The first ratio is most useful from a software development perspective, but requires multi-threaded applications to manage processor affinity explicitly for each application thread. The other options provide insights on hardware utilization.

In general, collect measurements (for all events in a ratio) in the same run. This should be done because:

• If measuring ratios for a multi-threaded workload, getting results for all events in the same run enables you to understand which event counter values belong to each thread.
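As a rough sketch of how a tool might apply the guidance above when forming a ratio at the physical-processor level: counts from an event qualified for all cores (or all bus agents) are taken from one core, while core-specific counts are summed. The function names, the `core_specific_event` label, and the sample numbers are hypothetical; they only illustrate the arithmetic and are not measurements.

```python
def processor_count(per_core_counts, spans_all_cores):
    """Combine per-core counter readings into one physical-processor count.

    per_core_counts maps a core id to the value read from that core's counter.
    If the event was qualified to cover all cores or all bus agents, every core
    reports nearly the same value, so one reading is used instead of the sum.
    """
    values = [per_core_counts[core] for core in sorted(per_core_counts)]
    if not values:
        raise ValueError("no counter readings supplied")
    if spans_all_cores:
        return values[0]
    return sum(values)


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


# Hypothetical readings for two cores, collected in the same run.
readings = {
    "bus_drdy_clocks": {0: 1000, 1: 1002},      # unit mask 20H: all agents
    "core_specific_event": {0: 5000, 1: 3000},  # counted per core
}

bus_total = processor_count(readings["bus_drdy_clocks"], spans_all_cores=True)
core_total = processor_count(readings["core_specific_event"], spans_all_cores=False)
print(bus_total, core_total, ratio(bus_total, core_total))
```

Summing the two bus readings would roughly double-count the same bus activity, which is the mistake the text warns against.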
