13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual

USING PERFORMANCE MONITORING EVENTS

• If the contribution from Cycles_stalled is high, additional drill-down may be necessary to locate bottlenecks that lie deeper in the microarchitecture pipeline.

B.6.3 Drill-Down on Performance Stalls

In some situations, it may be useful to evaluate cycles lost to stalls associated with various stress points in the microarchitecture and sum up the contributions from each candidate stress point. This approach implies a very gross simplification and introduces complications that may be difficult to reconcile with the superscalar nature and buffering in an OOO engine.

Due to the variations of counting domains associated with different performance events, cycle-based estimation of performance impact at each stress point may carry different degrees of error due to over-estimation or under-estimation of exposures. Over-estimation is likely to occur when the overall performance impact for a given cause is estimated by multiplying the per-instance cost by an event count that measures the number of occurrences of that microarchitectural condition. Consequently, the sum of multiple contributions of lost cycles due to different stress points may exceed the more accurate metric Cycles_stalled.

However, an approach that sums up lost cycles associated with individual stress points may still be beneficial as an iterative indicator to measure the effectiveness of the code-tuning effort directed at fixing the performance impact of each stress point.
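The comparison described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a tool from this manual: the `StressPoint` type, the function names, and all counts and per-instance costs are hypothetical placeholders. It multiplies each event count by an assumed per-instance cost, sums the results, and compares the sum with a measured Cycles_stalled value.

```python
from dataclasses import dataclass


@dataclass
class StressPoint:
    name: str            # illustrative label for the stress point
    event_count: int     # occurrences reported by a performance event
    cost_cycles: float   # assumed per-instance cost, in cycles


def estimated_lost_cycles(points):
    """Sum per-instance cost times event count over all stress points."""
    return sum(p.event_count * p.cost_cycles for p in points)


def overestimation_ratio(points, cycles_stalled):
    """Ratio of the summed estimate to the measured Cycles_stalled.

    A value above 1.0 means the summed estimate exceeds Cycles_stalled,
    which is the over-estimation effect described in the text.
    """
    if cycles_stalled <= 0:
        raise ValueError("cycles_stalled must be positive")
    return estimated_lost_cycles(points) / cycles_stalled


# Hypothetical counts and costs, for illustration only.
points = [
    StressPoint("l2_miss", event_count=10_000, cost_cycles=200.0),
    StressPoint("l2_hit", event_count=50_000, cost_cycles=14.0),
]
ratio = overestimation_ratio(points, cycles_stalled=1_500_000)
```

Tracking how this ratio and the individual products change between tuning iterations is one way to use the summed estimate as the iterative indicator mentioned above, even though its absolute value may overshoot.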
The remainder of this sub-section discusses a few common causes of performance bottlenecks that can be counted by performance events and fixed by following the coding recommendations described in this manual.

The following items discuss several common stress points of the microarchitecture:

• L2 Miss Impact: An L2 load miss may expose the full latency of the memory subsystem. The latency of accessing system memory varies with different chipsets, and is generally on the order of more than a hundred cycles. Server chipsets tend to exhibit longer latency than desktop chipsets. The number of L2 cache miss references can be measured by MEM_LOAD_RETIRED.L2_LINE_MISS.

  An estimate of overall L2 miss impact obtained by multiplying system memory latency by the number of L2 misses ignores the OOO engine's ability to handle multiple outstanding load misses. Multiplying latency by the number of L2 misses implies that each L2 miss occurs serially.

  To improve the accuracy of estimating L2 miss impact, an alternative technique should also be considered, using the event BUS_REQUEST_OUTSTANDING with a CMASK value of 1. This alternative technique effectively measures the cycles during which the OOO engine is waiting for data from the outstanding bus read requests. It can overcome the over-estimation that results from multiplying memory latency by the number of L2 misses.

• L2 Hit Impact: Memory accesses from L2 will incur the cost of L2 latency (see Table 2-3). The number of cache line references of L2 hits can be measured by the
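For the L2 Miss Impact item above, the two estimates can be placed side by side. The following is a minimal sketch under stated assumptions: the function names and the sample numbers are hypothetical, the memory latency must be supplied by the reader for their own platform, and `outstanding_cycles` stands for the count collected from BUS_REQUEST_OUTSTANDING with CMASK set to 1. The ratio of the two estimates is shown only as a rough hint of how much the serial multiplication overshoots; it is not a metric defined in this manual.

```python
def serial_l2_miss_estimate(l2_misses, memory_latency_cycles):
    """Estimate that assumes every L2 miss is serviced serially.

    l2_misses: count from MEM_LOAD_RETIRED.L2_LINE_MISS.
    memory_latency_cycles: assumed system memory latency, in cycles.
    """
    return l2_misses * memory_latency_cycles


def serial_overshoot(l2_misses, memory_latency_cycles, outstanding_cycles):
    """Ratio of the serial estimate to the measured waiting cycles.

    outstanding_cycles: count from BUS_REQUEST_OUTSTANDING with CMASK=1.
    A ratio above 1.0 suggests that misses overlapped, so the serial
    multiplication over-estimates the exposed cost.
    """
    if outstanding_cycles <= 0:
        raise ValueError("outstanding_cycles must be positive")
    serial = serial_l2_miss_estimate(l2_misses, memory_latency_cycles)
    return serial / outstanding_cycles


# Hypothetical sample values, for illustration only.
serial = serial_l2_miss_estimate(10_000, 200)
overshoot = serial_overshoot(10_000, 200, outstanding_cycles=800_000)
```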
