13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


USING PERFORMANCE MONITORING EVENTS

B.6.2 Cycle Composition of OOO Execution

In an OOO engine, speculative execution is an important part of making forward progress of the program. But speculative execution of μops in the shadow of a mispredicted code path represents unproductive work that consumes execution resources and execution bandwidth.

Cycles_not_issuing_uops, by definition, represents the cycles that the OOO engine is stalled (Cycles_stalled). As an approximation, this can be interpreted as the cycles that the program is not making forward progress.

The μops that are issued for execution do not necessarily end in retirement. Those μops that do not reach retirement do not help forward progress of program execution. Hence, a further approximation is made in the formalism of decomposition of Cycles_issuing_uops into:

• Cycles_non_retiring_uops: Although there isn't a direct event to measure the cycles associated with non-retiring μops, we will derive this metric from available performance events and several assumptions:
  - A constant issue rate of μops flowing through the issue port. Thus, we define: uops_rate = Dispatch_uops / Cycles_issuing_uops, where Dispatch_uops can be measured with RS_UOPS_DISPATCHED, clearing the INV bit and the CMASK.
  - We approximate the number of non-productive, non-retiring μops by [non_productive_uops = Dispatch_uops - executed_retired_uops], where executed_retired_uops represents productive μops contributing towards forward progress that consumed execution bandwidth.
  - The executed_retired_uops can be approximated by the sum of two contributions: num_retired_uops (measured by the event UOPS_RETIRED.ANY) and num_fused_uops (measured by the event UOPS_RETIRED.FUSED).
  Thus, Cycles_non_retiring_uops = non_productive_uops / uops_rate.
• Cycles_retiring_uops: This can be derived from Cycles_retiring_uops = num_retired_uops / uops_rate.

The cycle-decomposition methodology here does not distinguish situations where productive μops and non-productive μops may be dispatched in the same cycle into the OOO engine. This approximation may be reasonable because, heuristically, a high contribution of non-retiring μops likely correlates with congestion in the OOO engine, which subsequently causes the program to stall.

Evaluating these three components (Cycles_non_retiring_uops, Cycles_stalled, Cycles_retiring_uops) relative to Total_cycles can help steer tuning effort in the following directions:

• If the contribution from Cycles_non_retiring_uops is high, focusing on code layout and reducing branch mispredictions will be important.
• If both the contributions from Cycles_non_retiring_uops and Cycles_stalled are insignificant, the focus for performance tuning should be directed to vectorization or other techniques to improve retirement throughput of hot functions.
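The arithmetic of this decomposition can be sketched in a few lines of Python. This is only an illustration of the formulas above, not a measurement tool. The function name decompose_cycles, the CycleBreakdown type, and the input values in the usage example are hypothetical. The counts are assumed to have been collected beforehand (Dispatch_uops from RS_UOPS_DISPATCHED, num_retired_uops from UOPS_RETIRED.ANY, num_fused_uops from UOPS_RETIRED.FUSED).

```python
from dataclasses import dataclass


@dataclass
class CycleBreakdown:
    """Derived quantities from the cycle-decomposition formulas (hypothetical type)."""
    uops_rate: float
    non_productive_uops: float
    cycles_non_retiring_uops: float
    cycles_retiring_uops: float


def decompose_cycles(dispatch_uops, cycles_issuing_uops,
                     num_retired_uops, num_fused_uops):
    """Apply the approximations described in the text to raw event counts."""
    if dispatch_uops <= 0 or cycles_issuing_uops <= 0:
        raise ValueError("dispatch_uops and cycles_issuing_uops must be positive")

    # Assumed constant issue rate: uops_rate = Dispatch_uops / Cycles_issuing_uops
    uops_rate = dispatch_uops / cycles_issuing_uops

    # executed_retired_uops ~ num_retired_uops + num_fused_uops
    executed_retired_uops = num_retired_uops + num_fused_uops

    # non_productive_uops = Dispatch_uops - executed_retired_uops
    non_productive_uops = dispatch_uops - executed_retired_uops

    return CycleBreakdown(
        uops_rate=uops_rate,
        non_productive_uops=non_productive_uops,
        # Cycles_non_retiring_uops = non_productive_uops / uops_rate
        cycles_non_retiring_uops=non_productive_uops / uops_rate,
        # Cycles_retiring_uops = num_retired_uops / uops_rate
        cycles_retiring_uops=num_retired_uops / uops_rate,
    )


if __name__ == "__main__":
    # Made-up counts for illustration only; not measured data.
    result = decompose_cycles(dispatch_uops=1000, cycles_issuing_uops=500,
                              num_retired_uops=700, num_fused_uops=100)
    print(result)
```

Dividing cycles_non_retiring_uops, cycles_retiring_uops, and a separately measured Cycles_stalled by Total_cycles gives the relative contributions that the tuning guidance refers to.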
