13.07.2015 Views

Intel® 64 and IA-32 Architectures Optimization Reference Manual


USING PERFORMANCE MONITORING EVENTS

Frequent accesses to registers that cause partial stalls increase access latency and decrease performance. Partial Register Stalls Ratio is the percentage of cycles when partial stalls occur.

B.7.4.4 Partial Flag Stalls

22. Partial Flag Stalls Ratio: RAT_STALLS.FLAGS / CPU_CLK_UNHALTED.CORE

Partial flag stalls have a high penalty and they can be easily avoided. However, in some cases, Partial Flag Stalls Ratio might be high although there are no real flag stalls. There are a few instructions that partially modify the RFLAGS register and may cause partial flag stalls. The most popular are the shift instructions (SAR, SAL, SHR, and SHL) and the INC and DEC instructions.

B.7.4.5 Bypass Between Execution Domains

23. Delayed Bypass to FP Operation Rate: DELAYED_BYPASS.FP / CPU_CLK_UNHALTED.CORE

24. Delayed Bypass to SIMD Operation Rate: DELAYED_BYPASS.SIMD / CPU_CLK_UNHALTED.CORE

25. Delayed Bypass to Load Operation Rate: DELAYED_BYPASS.LOAD / CPU_CLK_UNHALTED.CORE

Domain bypass adds one cycle to instruction latency. To identify frequent domain bypasses in the code you can use the above ratios.

B.7.4.6 Floating Point Performance Ratios

26. Floating Point Instructions Ratio: X87_OPS_RETIRED.ANY / INST_RETIRED.ANY * 100

Significant floating-point activity indicates that specialized optimizations for floating-point algorithms may be applicable.

27. FP Assist Performance Impact: FP_ASSIST * 80 / CPU_CLK_UNHALTED.CORE * 100

Floating Point assist is activated for non-regular FP values like denormals and NaNs. FP assist is extremely slow compared to regular FP execution. Different assists incur different penalties. FP Assist Performance Impact estimates the overall impact.

28. Divider Busy: IDLE_DURING_DIV / CPU_CLK_UNHALTED.CORE * 100

A high value for the Divider Busy ratio indicates that the divider is busy and no other execution unit or load operation is in progress for many cycles. Using this ratio ignores L1 data cache misses and L2 cache misses that can be executed in parallel and hide the divider penalty.

29. Floating-Point Control Word Stall Ratio: RESOURCE_STALLS.FPCW / CPU_CLK_UNHALTED.CORE * 100
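
The ratios in items 22 through 29 are simple quotients of raw event counts, so they can be computed with a small helper once the counters have been collected. The following Python sketch is illustrative only: the function names, the dictionary layout, and the sample counter values are assumptions made for this example and do not come from any Intel tool. The event names and the constant 80 are taken from the formulas above; reading 80 as an approximate per-assist cycle cost is an interpretation, not something the text states.

```python
def ratio(event_count, base_count, scale=1.0):
    """Return event_count / base_count * scale, or 0.0 if base_count is zero."""
    if base_count == 0:
        return 0.0
    return event_count / base_count * scale


def compute_ratios(counters):
    """Compute ratios 22-29 from a dict of raw event counts (hypothetical layout)."""
    cycles = counters["CPU_CLK_UNHALTED.CORE"]
    instructions = counters["INST_RETIRED.ANY"]
    return {
        # 22: fraction of cycles (not scaled to a percentage in the formula)
        "partial_flag_stalls_ratio": ratio(counters["RAT_STALLS.FLAGS"], cycles),
        # 23-25: delayed bypass events per cycle
        "delayed_bypass_fp_rate": ratio(counters["DELAYED_BYPASS.FP"], cycles),
        "delayed_bypass_simd_rate": ratio(counters["DELAYED_BYPASS.SIMD"], cycles),
        "delayed_bypass_load_rate": ratio(counters["DELAYED_BYPASS.LOAD"], cycles),
        # 26: x87 operations as a percentage of retired instructions
        "fp_instructions_pct": ratio(counters["X87_OPS_RETIRED.ANY"], instructions, 100),
        # 27: FP_ASSIST * 80 / cycles * 100, exactly as written in the formula
        "fp_assist_impact_pct": ratio(counters["FP_ASSIST"] * 80, cycles, 100),
        # 28: percentage of cycles where only the divider is busy
        "divider_busy_pct": ratio(counters["IDLE_DURING_DIV"], cycles, 100),
        # 29: percentage of cycles stalled on the FP control word
        "fpcw_stall_pct": ratio(counters["RESOURCE_STALLS.FPCW"], cycles, 100),
    }


# Made-up sample counts, purely for illustration (not measured data).
sample = {
    "CPU_CLK_UNHALTED.CORE": 1_000_000,
    "INST_RETIRED.ANY": 1_500_000,
    "RAT_STALLS.FLAGS": 50_000,
    "DELAYED_BYPASS.FP": 10_000,
    "DELAYED_BYPASS.SIMD": 4_000,
    "DELAYED_BYPASS.LOAD": 2_000,
    "X87_OPS_RETIRED.ANY": 300_000,
    "FP_ASSIST": 250,
    "IDLE_DURING_DIV": 120_000,
    "RESOURCE_STALLS.FPCW": 5_000,
}

result = compute_ratios(sample)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Note that ratios 22 through 25 are plain fractions while 26 through 29 are scaled by 100, so the two groups should not be compared directly without converting one of them.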
