13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual



USING PERFORMANCE MONITORING EVENTS

issuing micro-ops for execution, and Cycles_issuing_uops cycles that the RS is issuing micro-ops for execution. The latter component includes μops in the architected code path or in the speculative code path.

• Cycle composition of OOO execution: The out-of-order engine provides multiple execution units that can execute μops in parallel. If one execution unit stalls, it does not necessarily imply the program execution is stalled. Our methodology attempts to construct a cycle-composition view that approximates the progress of program execution. The three relevant metrics are: Cycles_stalled, Cycles_not_retiring_uops, and Cycles_retiring_uops.

• Execution stall analysis: From the cycle compositions of overall program execution, the programmer can narrow down the selection of performance events to further pinpoint unproductive interaction between the workload and a micro-architectural sub-system.

When cycles lost to a stalled microarchitectural sub-system, or to unproductive speculative execution, are identified, the programmer can use VTune Analyzer to correlate each significant performance impact to source code location. If the performance impact of stalls or misprediction is insignificant, VTune can also identify the source locations of hot functions, so the programmer can evaluate the benefits of vectorization on those hot functions.

B.6.1 Cycle Composition at Issue Port

Recent processor microarchitectures employ out-of-order engines that execute streams of μops natively, while decoding program instructions into μops in their front end. The metric Total_cycles alone is opaque with respect to decomposing cycles that are productive or non-productive for program execution. To establish a consistent cycle-based decomposition, we construct two metrics that can be measured using performance events available in processors based on Intel Core microarchitecture. These are:

• Cycles_not_issuing_uops: This can be measured by the event RS_UOPS_DISPATCHED, setting the INV bit and specifying a counter mask (CMASK) value of 1 in the target performance event select (IA32_PERFEVTSELx) MSR (see Chapter 18 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B). In VTune Analyzer, the special values for CMASK and INV are already configured for the VTune event name RS_UOPS_DISPATCHED.CYCLES_NONE.

• Cycles_issuing_uops: This can be measured using the event RS_UOPS_DISPATCHED, clearing the INV bit and specifying a counter mask (CMASK) value of 1 in the target performance event select MSR.

Note the cycle decomposition view here is approximate in nature; it does not distinguish specificities, such as whether the RS is full or empty, or transient situations of the RS being empty while some in-flight μops are getting retired.
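As an illustration of the INV and CMASK settings described above, the following Python sketch builds the two event-select values and splits a cycle total into its issuing and non-issuing parts. It is a minimal sketch, not a definitive implementation. The bit positions follow the architectural IA32_PERFEVTSELx layout. The helper names perfevtsel and issue_port_split are hypothetical. The event code 0xA0 with unit mask 0x00 for RS_UOPS_DISPATCHED is an assumption that should be checked against the event tables for the target processor. The sketch only computes values; it does not program any MSR.

```python
# Architectural IA32_PERFEVTSELx bit positions.
USR = 1 << 16   # count while in user mode (CPL > 0)
OS = 1 << 17    # count while in kernel mode (CPL = 0)
EN = 1 << 22    # enable the counter
INV = 1 << 23   # invert the CMASK comparison


def perfevtsel(event, umask=0, cmask=0, inv=False):
    """Return an event-select value (hypothetical helper).

    With CMASK = n and INV clear, the counter advances in cycles where the
    event count is >= n; with INV set, in cycles where it is < n.
    """
    for name, field in (("event", event), ("umask", umask), ("cmask", cmask)):
        if not 0 <= field <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    value = event | (umask << 8) | (cmask << 24) | USR | OS | EN
    if inv:
        value |= INV
    return value


def issue_port_split(cycles_not_issuing, cycles_issuing):
    """Return the two components as fractions of their sum (hypothetical helper)."""
    total = cycles_not_issuing + cycles_issuing
    if total == 0:
        raise ValueError("no cycles counted")
    return cycles_not_issuing / total, cycles_issuing / total


# ASSUMPTION: event code for RS_UOPS_DISPATCHED; verify for your processor.
RS_UOPS_DISPATCHED = 0xA0

# Cycles_issuing_uops: CMASK = 1, INV clear.
sel_issuing = perfevtsel(RS_UOPS_DISPATCHED, cmask=1)
# Cycles_not_issuing_uops: CMASK = 1, INV set.
sel_not_issuing = perfevtsel(RS_UOPS_DISPATCHED, cmask=1, inv=True)

print(f"{sel_issuing:#010x}")      # 0x014300a0
print(f"{sel_not_issuing:#010x}")  # 0x01c300a0
```

For example, made-up counts of 250 non-issuing cycles and 750 issuing cycles give issue_port_split(250, 750) == (0.25, 0.75). The two selections count complementary cycles, so their sum approximates Total_cycles over the same interval.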
