Commonly Used Metrics for Performance Analysis - Power.org

Metrics for Performance Analysis

Table 10-1  L1 cache data reloads events with their corresponding event groups and brief description  33
Table 10-2  Events sorted by Group  33
Table 11-1  Memory location events with their corresponding event groups and brief description  36
Table 12-1  Address translation events with their corresponding event groups and brief description  39
Table 12-2  Events sorted by Group  39
Table 13-1  Events for instruction statistics with their corresponding event groups and brief description  42
Table 13-2  Events sorted by Group  42
Table 14-1  Memory location events with their corresponding event groups and brief description  45
Table 15-1  L2 read-claim machine events with their corresponding event groups and brief description  46

Copyright ©2011 IBM Corporation Page 4 of 52


1 Performance Event Data for Application Optimization

First, this paper briefly covers the POWER7 execution pipeline and the PMU hardware. Then it introduces some AIX and Linux tools that can be used to collect hardware events. Finally, the paper discusses several useful sets of metrics.

The first step in optimizing an application is characterizing how well the application runs on a POWER7 system. The fundamental intensive metric used to characterize the performance of any given program/workload is CPI (Cycles Per Instruction) – the average number of clock cycles (or fractions of a cycle) needed to complete an instruction. CPI is best understood as a relative quantity. Lower is better, but that assumes that useful work is being done. For a given set of calculations (an execution path), the lower the CPI, the more effectively the processor hardware is being kept busy. Note that CPI is a measure of processor performance ("How busy is the system hardware?"), which is a narrower question than "Can a program be sped up?"

The CPI stack (also referred to as a "CPI stall analysis") hierarchically breaks down the CPI based on what the execution pipeline is doing (or not doing) at any given cycle, on a per-hardware-thread basis. It is used to answer "What are the main front-end and back-end delays encountered while executing?"

The CPI stack uses data from the PMU (Performance Monitoring Unit) hardware in the POWER7 chip. Focusing on the core (and not the "nest" – the subsystems that transfer data to and from memory), data access accounting is simplified: either the data is found in the L1 cache or it is not (and there is a processing delay). Performance data for disk I/O (and other "slow" hardware interrupts, such as networking) are excluded.

Many other metrics are also useful, both in characterizing how well an application runs on a POWER7 system and in gauging how efficiently the application uses the available hardware resources. These include metrics for memory bandwidth, L1 cache instruction and data behavior, branch prediction, data locality, address translation, flushes, and read-claim machines. While not an exhaustive list, these metrics cover several common areas of concern.
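As a concrete illustration, CPI is just the completed-cycle count divided by the completed-instruction count. The sketch below uses made-up counter values; on a real system the counts would come from events such as PM_CYC and PM_RUN_INST_CMPL, collected with a tool like hpmcount.

```shell
# Hypothetical PMU counts; on real hardware these would come from
# PM_CYC and PM_RUN_INST_CMPL via hpmcount or a similar tool.
cycles=5065878331
insts=2680000246
awk -v c="$cycles" -v i="$insts" 'BEGIN { printf "CPI = %.3f\n", c / i }'
```

A CPI of about 1.89 here corresponds to an IPC (instructions per cycle) of about 0.53.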


2 The Execution Pipeline

The POWER7 architecture, like most modern processors, can dispatch groups of machine instructions every cycle. Groups are dispatched into the execution pipeline in order and completed in order. Instructions or internal operations in a group can execute out of order, and all instructions in a group must finish before a group can complete ("retire"). This is the essence of the "out-of-order" nature of the POWER7 architecture.

Figure 2-1 is a standard block diagram of the POWER7 core's structure and the execution pipeline:

Figure 2-1  POWER7 core structure and execution pipeline details

Concentrating on the CPI stack, a simplified execution pipeline is presented in Figure 2-2.


Figure 2-2  A simplified view of the execution pipeline
(The figure shows the Fetch/Decode, Dispatch, Issue, Finish, and Completion stages. Each dispatched group gets a Global Completion Table entry with a group tag and per-instruction tags; the functional units report finishes by group tag and instruction tag, and the completion logic retires the oldest group once all of its finish indicators are set.)

The stages are:

1. FETCH/DECODE – instructions are fetched from the instruction cache.
2. DISPATCH – instructions are placed into groups (of up to 6 instructions) and sent to distributed issue queues.
   a. An entry for the instruction group is made in the Global Completion Table (GCT), which tracks every group that has been dispatched and is still executing somewhere in the core.
3. ISSUE – instructions (up to 8 at a time) are sent from the issue queues to their target functional units (e.g. LSU, VSX unit).
4. FINISH – instructions that were dispatched in order can execute and finish out of order from any functional unit. Up to 8 internal operations can finish in a cycle.
5. COMPLETION – an instruction group is marked complete when all of its member instructions have finished executing. A completed group is deallocated from the GCT.

Generally speaking, the CPI stack apportions the total CPU (compute cycle) time between three places in the execution pipeline on a per-thread basis:

1. Cycles where an instruction group completed. A group can contain one to six PPC instructions. When they have all finished, the group entry in the Global Completion Table is removed and the next group is eligible for completion. These cycles are associated with the COMPLETION stage.
2. Cycles where the GCT is empty. In this case no new instructions were dispatched and the pipeline is empty for that thread. These cycles are associated with the DISPATCH stage. These are also referred to as front-end delays.
3. Cycles where there are groups present in the GCT, but no group has completed. These are completion stall cycles, and most optimization work concerns finding what type of stall occurs and where significant amounts of stall cycles occur. These cycles are associated with the EXECUTE stage. These are also referred to as back-end delays.
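The three-way split can be sketched numerically. The counts below are invented for illustration; the buckets correspond to the PM_GRP_CMPL, PM_GCT_NOSLOT_CYC, and PM_CMPLU_STALL events described in section 6.

```shell
# Invented per-thread cycle counts; the three buckets should roughly
# sum to the total run cycles (PM_RUN_CYC).
run_cyc=1000000
grp_cmpl=400000     # cycles where a group completed  (COMPLETION)
gct_empty=150000    # GCT empty, nothing dispatched   (front-end)
stall=450000        # GCT occupied, nothing completed (back-end)
awk -v r="$run_cyc" -v c="$grp_cmpl" -v e="$gct_empty" -v s="$stall" \
    'BEGIN { printf "completion %.0f%%, gct-empty %.0f%%, stall %.0f%%\n",
             100*c/r, 100*e/r, 100*s/r }'
```

In a real analysis the stall bucket would then be broken down further by the PM_CMPLU_STALL_* events.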


3 The POWER7 PMU

The POWER7 processor has a built-in Performance Monitoring Unit (PMU), which is designed to provide instrumentation to aid in performance monitoring, workload characterization, system characterization, and code analysis.

The PMU comprises six thread-level Performance Monitor Counters (PMCs). PMC1–PMC4 are programmable, PMC5 counts non-idle completed instructions, and PMC6 counts non-idle cycles.


4 AIX Tools for Collecting PMC Data

The AIX tool hpmcount can be used to get CPI stack data for a complete workload. If the workload has a relatively uniform profile and takes long enough (at least 1 CPU-minute per group is recommended – 11 minutes for the complete set of CPI stack groups), hpmcount can be used to multiplex all CPI stack event groups in one run – the quickest way to get an idea of the events with the largest stalls.

Once the critical events have been isolated, the tprof tool can be used to give profiling information by event. By default, tprof will provide function-level counts, but it can also do microprofiling of events.

4.1 Profiling Tools

Profiling refers to charging CPU time to subroutines, and micro-profiling refers to charging CPU time to instructions. Profiling is frequently used in benchmarking and tuning activities to find the "hot spots" in code and data, identify performance-sensitive areas, and identify problem instructions and data regions. Several tools are available for profiling in UNIX in general, and AIX offers additional tools.

For many years, UNIX has included gprof, and this is also available in AIX. tprof is an AIX-only alternative which can provide profiling information from the original binaries.

4.1.1 Profiling with gprof/Xprofiler

To get gprof-compatible output, the binaries first need to be compiled with the added -pg option (additional options, such as an optimization level -On, can also be added):

xlc -pg -o myprog.exe myprog.c

or

xlf -pg -o myprog.exe myprog.f

When the program is executed, a gmon.out file is generated (or, for a parallel job, several gmon.out files are generated, one per task). To get the human-readable profile, run:

gprof myprog.exe gmon.out > myprog.gprof

or

gprof myprog.exe gmon.*.out > myprog.gprof

To get microprofiling information from gprof output, you need to use the Xprofiler tool. Full documentation for gprof can be found in the AIX documentation.

4.1.2 Profiling with tprof

Description from the man page:

    The tprof command reports CPU usage for individual programs and the system as
    a whole. This command is a useful tool for anyone with a Java, C, C++, or
    FORTRAN program that might be CPU-bound and who wants to know which
    sections of the program are most heavily using the CPU. The tprof command can
    charge CPU time to object files, processes, threads, subroutines (user mode,
    kernel mode and shared library) and even to source lines of programs or individual
    instructions.

tprof estimates the CPU time spent in a program, the kernel, a shared library, etc. by sampling the instructions every 10 milliseconds. When the sampling occurs, a "tic" is applied to the components running at that time. By sampling over several intervals, an estimate of the time spent in the various processes and daemons running on the system can be obtained.

For example:

tprof -x myprog

If present, the -x flag is always the last on the command line, and is followed by the name of a command and its arguments. This produces an additional file, myprog.prof, which contains the same sort of information as found in the gprof output. The output looks something like this:

Configuration information
=========================
System: AIX 5.3 Node: v20n18 Machine: 00C518EC4C00
Tprof command was:
tprof -x myprog
Trace command was:
/usr/bin/trace -ad -M -L 134785843 -T 500000 -j000,00A,001,002,003,38F,005,006,134,139,5A2,5A5,465,234, -o -
Total Samples = 2538
Traced Time = 5.01s (out of a total execution time of 5.01s)
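Since tprof samples at a fixed 10 ms interval, tick counts convert directly to estimated CPU seconds. As a sketch, a hypothetical routine charged 502 of the 2538 samples above would be estimated at:

```shell
# 502 ticks (a made-up count) at 10 ms per tick, out of 2538 total samples.
awk -v ticks=502 -v total=2538 'BEGIN {
    printf "%.2f s (%.1f%% of samples)\n", ticks * 0.01, 100 * ticks / total }'
```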


(Note: -qipa=level=0 can be used.) When linked with the -qlist and -qipa options, a file named "a.lst" will be generated which contains the assembly code of the entire executable after all optimizations are applied.

Below is an example of how to produce object-file micro-profiling using one of the NAS parallel benchmarks:

nproc=8
exe=lu.B.$nproc
tprof [-L a.lst] -m bin/$exe -Zusk -r $exe -x /bin/sh


active on a node at a time.

As with serial jobs, it is also possible to profile an application which is already running:

tprof -p DLPOLY.Y -S DIRECTORY -r dlpoly -u -l -x sleep 60

Here -p DLPOLY.Y selects the process name to be profiled, and -S selects the directory containing the executable.

If poe is being run interactively, then locating the node running the processes of the job is not a problem. If the job is being run using LoadLeveler, then it is likely that the tasks are not running on the node from which the job was submitted. The simplest way to find the relevant node is to run "llq", and then log on to the node specified in the "Running on" column. This is the first node selected by LoadLeveler, and will be running MPI task zero. If it is necessary to profile a specific MPI task, then this information would need to be extracted from the "Task Instance" information in a long LoadLeveler listing, "llq -l jobname".

Running tprof on a single node and extracting profiling information for all the tasks running on that node is normally useful and sufficient. There is no easy way to use tprof to profile all the tasks of a multi-node parallel job. For that situation, it is best to look into the PE Benchmarker toolkit for AIX (which is not discussed in this document).

4.2 Counting Hardware Events

4.2.1 hpmcount

Along with profiling data (see Section 4.1), performance counter data from the hardware performance monitor (HPM) can be used to characterize application performance on POWER7 hardware, possibly with the intent of finding performance bottlenecks and looking for opportunities to optimize the target code. Using hpmcount, one can monitor the activity of all POWER7 subsystems – FPU efficiency, stalls, cache usage, flop counts, etc. – during the application's execution on the system.

The command line syntax for the most useful options is illustrated by the following example:

hpmcount -d -o hpm.out -g pm_utilization:uk myprog

The options are as follows:

The -d option asks for more detail in the output.

The -o option names an output file string. The actual output file for this hpmcount run will start with "hpm.out" and add the process id and, in the case of a poe-based run, a task id for each hpmcount file. A serial run just uses "0000" for the task id.

The -g option lists a predefined group of events or a comma-separated list of event group names or numbers. When a comma-separated list of groups is used, the counter multiplexing mode is selected. Each event group can be qualified by a counting mode as follows:

event_group:counting_modes

where the counting mode can be user space (u), kernel space (k), and some other options that won't be as important here.

4.2.1.1 Multiplexing with hpmcount

When more than one group is supplied to the -g option, counter multiplexing is activated. Counting is distributed among the listed groups. The sum of the sampling times over all groups should be a close match to the total CPU time. In the following example:

hpmcount -g pm_utilization:uk,pm_vsu0:u myprog

the program execution time is split between the "pm_utilization" and the "pm_vsu0" groups.

4.2.1.2 Running hpmcount under the poe environment

Unlike tprof (see 4.1.2.2), hpmcount can be run for each task of a parallel job. Task binding, by using the launch utility or something similar, is recommended to increase accuracy and reproducibility. An example might be:

export MP_PROCS=8
export TARGET_CPU_LIST=-1
export HPM_EVENT_GROUP=pm_utilization:u
poe launch hpmcount -d -o hpm.out myprog

Note that additional environment variables needed by poe are implied. An alternative way of specifying the group(s) to be collected has been used here: setting the HPM_EVENT_GROUP variable.

4.2.1.3 libhpm

The libhpm library is part of the bos.pmapi.lib fileset distributed with the standard AIX release. It can be treated as a way to get hpmcount information for sections of code. For event counts resolved by code regions, calls to the libhpm.a routines can be a quicker way to collect a large group of events than profiling with tprof.

Using libhpm.a can be divided into 3 steps: modifying the source code base, building the executable, and running. libhpm.a supports calls from Fortran and C/C++ code; the examples here are in C.

1. There are 4 necessary calls that need to be inserted in the base source code:

hpmInit(1,"SomeString");  [Fortran: f_hpminit(,"string") ]
Insert this call at a point before any statistics are to be collected. Inserting it at the start of execution in the main program is the simplest course. The specified "SomeString" is arbitrary, but it seems most sensible to derive it from the program name.

hpmTerminate(1);  [Fortran: f_hpmterminate(,"string") ]
This is the matching call to hpmInit(). The argument (an integer) must agree with the first argument of the hpmInit() call. This call signals that HPM data collection has ended.

hpmStart(2,"String");  [Fortran: f_hpmstart(,"string") ]
Insert this call at the start of every code section where one wants to collect HPM data. The integer identifier has to be unique (there are usually many hpmStart() calls inside a program; all have to be uniquely identified). The string is user-defined, and it is recommended that it briefly describe the code block being measured. The string (along with the range of lines in the block) is reported as part of the libhpm output. There is a significant execution overhead from this call, so it is best to put it outside loop blocks.

hpmStop(2);  [Fortran: f_hpmstop() ]
This is the matching call to hpmStart(). The argument must agree with the first argument of the corresponding hpmStart() call. This call signals the end of HPM data collection for a section of code.

2. Link with the needed libraries:

xlc -o myprog_hpm myprog.o -lhpm -lpmapi -lm

3. Use an appropriate execution environment:

export HPM_OUTPUT_NAME=libhpm.out


export HPM_EVENT_GROUP=pm_utilization
./myprog_hpm

Note that myprog_hpm can be run as normal, without any call to hpmcount, but the hpmcount environment variables are still valid. The additional environment variable HPM_OUTPUT_NAME is strongly recommended to specify a file for the libhpm output; otherwise output goes to stdout.

Here is an excerpt from an example output from a libhpm run:

Total execution time of instrumented code (wall time): 4.166042 seconds

######## Resource Usage Statistics ########
Total amount of time in user mode            : 4.157466 seconds
Total amount of time in system mode          : 0.002418 seconds
Maximum resident set size                    : 6372 Kbytes
Average shared memory use in text segment    : 66 Kbytes*sec
Average unshared memory use in data segment  : 22228 Kbytes*sec
Number of page faults without I/O activity   : 1572
Number of page faults with I/O activity      : 0
Number of times process was swapped out      : 0
Number of times file system performed INPUT  : 0
Number of times file system performed OUTPUT : 0
Number of IPC messages sent                  : 0
Number of IPC messages received              : 0
Number of signals delivered                  : 0
Number of voluntary context switches         : 6
Number of involuntary context switches       : 6
####### End of Resource Statistics ########

Instrumented section: 1 - Label: Scalar - process: 1
file: test_mul_nn_hpm.c, lines: 118 123
Count: 1
Wall Clock Time: 1.587333 seconds
Total time in user mode: 1.5830869784375 seconds
Group: 0
Counting duration: 1.587304013 seconds
PM_RUN_CYC (Run cycles)                       : 5065878331
PM_INST_CMPL (Instructions completed)         : 2680000246
PM_INST_DISP (Instructions dispatched)        : 2816747278
PM_CYC (Processor cycles)                     : 5065878331
PM_RUN_INST_CMPL (Run instructions completed) : 2682544499
PM_RUN_CYC (Run cycles)                       : 5079851163
Utilization rate                              : 99.733 %
MIPS                                          : 1688.367 MIPS
Instructions per cycle                        : 0.529

Instrumented section: 2 - Label: VMX - process: 1
file: test_mul_nn_hpm.c, lines: 159 164
Count: 1
Wall Clock Time: 1.277721 seconds
Total time in user mode: 1.2739698684375 seconds
Group: 0
Counting duration: 1.277698308 seconds
PM_RUN_CYC (Run cycles)                       : 4076703579
PM_INST_CMPL (Instructions completed)         : 1610000247
PM_INST_DISP (Instructions dispatched)        : 1624541340
PM_CYC (Processor cycles)                     : 4076703579
PM_RUN_INST_CMPL (Run instructions completed) : 1612330036
PM_RUN_CYC (Run cycles)                       : 4089025245
Utilization rate                              : 99.706 %
MIPS                                          : 1260.056 MIPS
Instructions per cycle                        : 0.395

Instrumented section: 3 - Label: VMXopt - process: 1
file: test_mul_nn_hpm.c, lines: 197 202
Count: 1
Wall Clock Time: 1.28022 seconds
Total time in user mode: 1.2764352996875 seconds

4.3 Profiling Hardware Events with tprof

The -E flag enables event-based profiling. Its argument is one of the four software-based events (EMULATION, ALIGNMENT, ISLBMISS, DSLBMISS) or a Performance Monitor event (PM_*). By default, the profiling event is processor cycles. All Performance Monitor events are prefixed with PM_, such as PM_CYC for processor cycles or PM_INST_CMPL for instructions completed. The command

pmlist -g -1

lists all Performance Monitor events that are supported on the POWER7 processor. The events that make up the CPI stack are discussed in Section 6.

From the AIX 6 tprof command documentation:

-E [mode]
    Enables event-based profiling. The possible modes are:
    PM_event
        Specifies the hardware event to profile. If no mode is specified for the -E flag, the default event is processor cycles (PM_CYC).
    EMULATION
        Enables the emulation profiling mode.
    ALIGNMENT
        Enables the alignment profiling mode.
    ISLBMISS
        Enables the Instruction Segment Lookaside Buffer miss profiling mode.
    DSLBMISS
        Enables the Data Segment Lookaside Buffer miss profiling mode.

For example,

tprof -m bin/$exe -Zusk -r $exe -E PM_LD_MISS_L1 -x myprog

profiles the number of L1 D-cache misses across the source code.
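Reports like the libhpm excerpt in section 4.2.1.3 are plain text, so the PM_* counters can be pulled out with standard tools for further processing. A minimal sketch (the here-document stands in for a real report file):

```shell
# Extract "event name : value" pairs from libhpm-style output lines.
awk -F: '/PM_/ { gsub(/ *\(.*\)/, "", $1); gsub(/ /, "", $1);
                 gsub(/ /, "", $2); print $1, $2 }' <<'EOF'
PM_RUN_CYC (Run cycles) : 5065878331
PM_INST_CMPL (Instructions completed) : 2680000246
EOF
```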


5 Linux Tools for Collecting PMC Data

The Linux tools oprofile and perf can collect performance event counts. They provide similar information (event counts per function), and the two tools' command lines also have similar syntax.

5.1 perf

perf is a standard Linux profiling tool with direct support in the kernel. It is not usually built/installed by default, but for recent kernels (2.6.34+) it can be built from the kernel source tree directly at /usr/src/linux-/tools/perf/. Details and scripts for managing perf data collection are found on the developerWorks website.

Note: perf uses raw event codes to collect event data through the perf command line. The raw event code is the number of the PMC to use (PMC1-PMC4, or "0" for any counter) concatenated with the PMC event code given in the "Comprehensive PMU Event Reference: POWER7" document. For convenience, the "run_example.py" script described on the website allows the user to supply event names (mnemonics) instead of raw event codes.

The perf wiki is another good source of information.

5.2 oprofile

oprofile is a system-wide profiler for Linux, capable of profiling all running code at low overhead. It consists of a kernel driver and a daemon for collecting sample data, plus several post-processing tools for turning data into information. oprofile leverages the hardware performance counters of the CPU to enable profiling of a wide variety of interesting statistics, which can also be used for basic time-spent profiling.

More details on using oprofile are available in the Performance Guide for HPC Applications on the IBM Power 755 System. Though its syntax is different, its usage is analogous to that of perf, as described in section 5.1 above. The official oprofile website and its documentation on specifying events are further sources of information.
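The raw-event convention described in section 5.1 – a counter number concatenated with the event code – can be scripted. The event code below is a placeholder, not a real POWER7 code; actual codes come from the "Comprehensive PMU Event Reference: POWER7" document.

```shell
# Build a perf raw event token: PMC number ("0" = any counter) followed
# by the hex event code from the PMU reference (placeholder value here).
pmc=2
event_code=00f6
echo "r${pmc}${event_code}"
```

The resulting token would then be passed to perf, e.g. `perf stat -e r200f6 ./myprog` (assuming r200f6 were a valid code for the target machine).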


6 CPI Stack

This chapter lists and describes the events, metrics, and groups needed for the CPI stack analysis.

6.1 Descriptions of Events

There are 26 events, distributed across 11 event groups, needed to complete the POWER7 CPI stack. These include 2 architected events counting non-idle completed instructions and non-idle cycles.

This section provides more details on each event and how it is triggered. Since instruction groups have to complete in order, a group can't complete until all instructions in the group finish, so a stall is all of the cycles between two groups completing. The general rule is that if a completion stall has occurred, the last instruction to finish (regardless of how many instructions were stalled) is charged with the entire stall. The practical consequence of this is that eliminating a stall (e.g. an ERAT miss) from an execution pipeline may reveal a new stall (e.g. a data cache miss).

PM_1PLUS_PPC_CMPL – 1 or more PPC instructions finished
A group containing at least one PowerPC instruction completed. For microcoded instructions that span multiple groups, this will only occur once.

PM_CMPLU_STALL – No groups completed, GCT not empty
Those cycles where a thread was not completing any groups, but the group completion table had entries for that thread.

PM_CMPLU_STALL_BRU – Completion stall due to BRU
Following a completion stall (any period when no groups completed while the group completion table was not empty for that thread), the last instruction to finish before completion resumes was from the Branch Unit.

PM_CMPLU_STALL_DCACHE_MISS – Completion stall caused by D-cache miss
Following a completion stall, the last instruction to finish before completion resumes suffered a Data Cache Miss. A Data Cache Miss has higher priority than any other Load/Store delay, so if an instruction encounters multiple delays, only the Data Cache Miss will be reported and the entire delay period will be charged to the Data Cache Miss.

PM_CMPLU_STALL_DFU – Completion stall caused by Decimal Floating Point Unit
Following a completion stall, the last instruction to finish before completion resumes was from the Decimal Floating Point Unit.

PM_CMPLU_STALL_DIV – Completion stall caused by DIV instruction
Following a completion stall, the last instruction to finish before completion resumes was a fixed-point divide instruction.

PM_CMPLU_STALL_ERAT_MISS – Completion stall caused by ERAT miss
Following a completion stall, the last instruction to finish before completion resumes suffered an ERAT miss.

PM_CMPLU_STALL_FXU – Completion stall caused by FXU instruction
Following a completion stall, the last instruction to finish before completion resumes was from the Fixed Point Unit.

PM_CMPLU_STALL_IFU – Completion stall due to IFU
Following a completion stall, the last instruction to finish before completion resumes was from the Instruction Fetch Unit (either the Branch Unit or the CR unit).

PM_CMPLU_STALL_LSU – Completion stall caused by LSU instruction
Following a completion stall, the last instruction to finish before completion resumes was from the Load Store Unit.

PM_CMPLU_STALL_REJECT – Completion stall caused by reject
Following a completion stall, the last instruction to finish before completion resumes suffered a load/store reject.

PM_CMPLU_STALL_SCALAR – Completion stall caused by FPU instruction
Following a completion stall, the last instruction to finish before completion resumes was a scalar floating point instruction.

PM_CMPLU_STALL_SCALAR_LONG – Completion stall caused by long-latency scalar instruction
Following a completion stall, the last instruction to finish before completion resumes was a floating point divide or square root instruction.

PM_CMPLU_STALL_STORE – Completion stall due to store instruction
Following a completion stall, the last instruction to finish before completion resumes was a store. This generally happens when we run out of real SRQ entries, which prevents stores from issuing.

PM_CMPLU_STALL_THRD – Completion stall due to thread conflict (group ready to complete, but it was another thread's turn)
Following a completion stall, the thread could not complete a group because the completion port it is sharing was being used by another thread. In SMT4 mode, Thread0 and Thread2 share a completion port, and Thread1 and Thread3 share another completion port.

PM_CMPLU_STALL_VECTOR – Completion stall caused by vector instruction
Following a completion stall, the last instruction to finish before completion resumes was a vector instruction.

PM_CMPLU_STALL_VECTOR_LONG – Completion stall due to long-latency vector instruction
Following a completion stall, the last instruction to finish before completion resumes was a long-latency vector instruction.

PM_GCT_NOSLOT_BR_MPRED – GCT empty by branch misprediction
These cycles occur when the Global Completion Table has no slots from this thread because of a branch misprediction.

PM_GCT_NOSLOT_BR_MPRED_IC_MISS – GCT empty by branch misprediction + I-cache miss
These cycles occur when the Global Completion Table has no slots from this thread because of a branch misprediction and an instruction cache miss.

PM_GCT_NOSLOT_CYC – No itags assigned
These cycles occur when the Global Completion Table has no slots from this thread.

PM_GCT_NOSLOT_IC_MISS – GCT empty by I-cache miss
These cycles occur when the Global Completion Table has no slots from this thread because of an instruction cache miss.

PM_GRP_CMPL – Group completed


A group completed. Microcoded instructions that span multiple groups will generate this event once per group. This ensures that each completed group is counted.

PM_RUN_CYC: Run cycles
The processor cycles that are gated by the run latch. Operating systems use the run latch to indicate when they are doing useful work. The run latch is typically cleared in the OS idle loop. Gating by the run latch filters out the idle loop.

PM_RUN_INST_CMPL: Run instructions
Number of PowerPC instructions completed, gated by the run latch.

6.2 Metrics

The CPI stack analysis uses the following metrics:

BASE_COMPLETION_CPI: Base Completion Cycles
Formula: PM_1PLUS_PPC_CMPL / PM_RUN_INST_CMPL

COMPLETION_CPI: Cycles in which a Group Completed
Formula: PM_GRP_CMPL / PM_RUN_INST_CMPL

EXPANSION_OVERHEAD_CPI: Cycles due to the overhead of expansion
Formula: COMPLETION_CPI - BASE_COMPLETION_CPI

FXU_STALL_CPI: Cycles stalled by the Fixed-Point Unit
Formula: PM_CMPLU_STALL_FXU / PM_RUN_INST_CMPL

FXU_MULTI_CYC_CPI: Cycles stalled by FXU multi-cycle instructions
Formula: PM_CMPLU_STALL_DIV / PM_RUN_INST_CMPL

FXU_STALL_OTHER_CPI: Other cycles stalled by the FXU
Formula: FXU_STALL_CPI - FXU_MULTI_CYC_CPI

GCT_EMPTY_CPI: GCT empty cycles
Formula: PM_GCT_NOSLOT_CYC / PM_RUN_INST_CMPL

GCT_EMPTY_IC_MISS_CPI: Cycles GCT empty due to I-cache misses
Formula: PM_GCT_NOSLOT_IC_MISS / PM_RUN_INST_CMPL

GCT_EMPTY_BR_MPRED_CPI: Cycles GCT empty due to branch mispredicts
Formula: PM_GCT_NOSLOT_BR_MPRED / PM_RUN_INST_CMPL

GCT_EMPTY_BR_MPRED_IC_MISS_CPI: Cycles GCT empty due to branch mispredicts and I-cache misses
Formula: PM_GCT_NOSLOT_BR_MPRED_IC_MISS / PM_RUN_INST_CMPL

GCT_EMPTY_OTHER_CPI: Other GCT empty cycles
Formula: (PM_GCT_NOSLOT_CYC - PM_GCT_NOSLOT_IC_MISS - PM_GCT_NOSLOT_BR_MPRED - PM_GCT_NOSLOT_BR_MPRED_IC_MISS) / PM_RUN_INST_CMPL

IFU_STALL_CPI: Cycles stalled due to the Instruction Fetch Unit
Formula:
PM_CMPLU_STALL_IFU / PM_RUN_INST_CMPL

IFU_STALL_BRU_CPI: Cycles stalled by branches
Formula: PM_CMPLU_STALL_BRU / PM_RUN_INST_CMPL


IFU_STALL_OTHER_CPI: Cycles stalled by other IFU operations
Formula: IFU_STALL_CPI - IFU_STALL_BRU_CPI

LSU_STALL_CPI: Cycles stalled by the Load/Store Unit
Formula: PM_CMPLU_STALL_LSU / PM_RUN_INST_CMPL

LSU_STALL_REJECT_CPI: Cycles stalled by LSU rejects
Formula: PM_CMPLU_STALL_REJECT / PM_RUN_INST_CMPL

LSU_STALL_ERAT_MISS_CPI: Cycles stalled by ERAT translations
Formula: PM_CMPLU_STALL_ERAT_MISS / PM_RUN_INST_CMPL

LSU_STALL_REJECT_OTHER_CPI: Cycles stalled by other LSU rejects
Formula: LSU_STALL_REJECT_CPI - LSU_STALL_ERAT_MISS_CPI

LSU_STALL_DCACHE_MISS_CPI: Cycles stalled by data cache (L1) misses
Formula: PM_CMPLU_STALL_DCACHE_MISS / PM_RUN_INST_CMPL

LSU_STALL_STORE_CPI: Cycles stalled by data store (L1) misses
Formula: PM_CMPLU_STALL_STORE / PM_RUN_INST_CMPL

LSU_STALL_OTHER_CPI: Cycles stalled by other LSU operations
Formula: LSU_STALL_CPI - LSU_STALL_REJECT_CPI - LSU_STALL_DCACHE_MISS_CPI - LSU_STALL_STORE_CPI

OTHER_STALL_CPI: Other stall cycles
Formula: STALL_CPI - FXU_STALL_CPI - VSU_STALL_CPI - LSU_STALL_CPI - IFU_STALL_CPI - SMT_STALL_CPI

RUN_CPI: Total cycles
Formula: PM_RUN_CYC / PM_RUN_INST_CMPL

SMT_STALL_CPI: Cycles stalled due to Symmetric Multithreading
Formula: PM_CMPLU_STALL_THRD / PM_RUN_INST_CMPL

STALL_CPI: Completion stall cycles
Formula: PM_CMPLU_STALL / PM_RUN_INST_CMPL

VSU_STALL_CPI: Cycles stalled by the Vector-and-Scalar Unit
Formula: (PM_CMPLU_STALL_SCALAR + PM_CMPLU_STALL_VECTOR + PM_CMPLU_STALL_DFU) / PM_RUN_INST_CMPL

VSU_STALL_DFU_CPI: Cycles stalled by the Decimal Floating-Point Unit
Formula: PM_CMPLU_STALL_DFU / PM_RUN_INST_CMPL

VSU_STALL_SCALAR_CPI: Cycles stalled by VSU scalar operations
Formula: PM_CMPLU_STALL_SCALAR / PM_RUN_INST_CMPL

VSU_STALL_SCALAR_LONG_CPI: Cycles stalled by VSU scalar long operations
Formula: PM_CMPLU_STALL_SCALAR_LONG / PM_RUN_INST_CMPL

VSU_STALL_SCALAR_OTHER_CPI: Cycles stalled by other VSU scalar operations
Formula: VSU_STALL_SCALAR_CPI - VSU_STALL_SCALAR_LONG_CPI

VSU_STALL_VECTOR_CPI: Cycles stalled by VSU
vector operations
Formula: PM_CMPLU_STALL_VECTOR / PM_RUN_INST_CMPL


VSU_STALL_VECTOR_LONG_CPI: Cycles stalled by VSU vector long operations
Formula: PM_CMPLU_STALL_VECTOR_LONG / PM_RUN_INST_CMPL

VSU_STALL_VECTOR_OTHER_CPI: Cycles stalled by other VSU vector operations
Formula: VSU_STALL_VECTOR_CPI - VSU_STALL_VECTOR_LONG_CPI

Figure 6-1 shows how these metrics fit into the hierarchical breakdown of cycles on the POWER7 processor.

Figure 6-1 The breakdown of the CPI Stack

6.3 Event Groups

These events are listed in Table 6-1. The relevant events are listed alphabetically by name, and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

The list of groups needed (and the relevant events they include) is provided in Table 6-2.


Table 6-1 CPI stack events with their corresponding event groups and brief description

Event Name  Group Name  Event Description
PM_1PLUS_PPC_CMPL  pm_compat_cpi_1plus_ppc  1 or more PPC insts finished
PM_CMPLU_STALL  pm_cpi_stack5  No groups completed, GCT not empty
PM_CMPLU_STALL_BRU  pm_misc_17  Completion stall due to BRU
PM_CMPLU_STALL_DCACHE_MISS  pm_cpi_stack1  Completion stall caused by D-cache miss
PM_CMPLU_STALL_DFU  pm_cpi_stack6  Completion stall caused by Decimal Floating Point Unit
PM_CMPLU_STALL_DIV  pm_cpi_stack2  Completion stall caused by DIV instruction
PM_CMPLU_STALL_ERAT_MISS  pm_cpi_stack1  Completion stall caused by ERAT miss
PM_CMPLU_STALL_FXU  pm_cpi_stack2  Completion stall caused by FXU instruction
PM_CMPLU_STALL_IFU  pm_misc_16  Completion stall due to the IFU
PM_CMPLU_STALL_LSU  pm_cpi_stack3  Completion stall caused by LSU instruction
PM_CMPLU_STALL_REJECT  pm_cpi_stack3  Completion stall caused by reject
PM_CMPLU_STALL_SCALAR  pm_cpi_stack4  Completion stall caused by FPU instruction
PM_CMPLU_STALL_SCALAR_LONG  pm_cpi_stack4  Completion stall caused by long-latency scalar instruction
PM_CMPLU_STALL_STORE  pm_misc_15  Completion stall due to store instruction
PM_CMPLU_STALL_THRD  pm_cpi_stack6  Completion stall due to thread conflict.
Group ready to complete but it was another thread's turn
PM_CMPLU_STALL_VECTOR  pm_cpi_stack5  Completion stall caused by Vector instruction
PM_CMPLU_STALL_VECTOR_LONG  pm_misc_15  Completion stall due to long-latency vector instruction
PM_GCT_NOSLOT_BR_MPRED  pm_cpi_stack7  GCT empty by branch mispredict
PM_GCT_NOSLOT_BR_MPRED_IC_MISS  pm_cpi_stack6  GCT empty by branch mispredict + IC miss
PM_GCT_NOSLOT_CYC  pm_cpi_stack7  No itags assigned
PM_GCT_NOSLOT_IC_MISS  pm_cpi_stack7  GCT empty by I-cache miss
PM_GRP_CMPL  pm_cpi_stack2  Group completed
PM_RUN_CYC  pm_cpi_stack1  Run cycles
PM_RUN_INST_CMPL  pm_cpi_stack1  Run instructions

Table 6-2 Events sorted by Group

pm_cpi_stack1: PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_ERAT_MISS, PM_RUN_CYC, PM_RUN_INST_CMPL
pm_cpi_stack2: PM_CMPLU_STALL_DIV


pm_cpi_stack2 (continued): PM_CMPLU_STALL_FXU, PM_GRP_CMPL
pm_cpi_stack3: PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_REJECT
pm_cpi_stack4: PM_CMPLU_STALL_SCALAR, PM_CMPLU_STALL_SCALAR_LONG
pm_cpi_stack5: PM_CMPLU_STALL, PM_CMPLU_STALL_VECTOR
pm_cpi_stack6: PM_CMPLU_STALL_DFU, PM_CMPLU_STALL_THRD, PM_GCT_NOSLOT_BR_MPRED_IC_MISS
pm_cpi_stack7: PM_GCT_NOSLOT_BR_MPRED, PM_GCT_NOSLOT_CYC, PM_GCT_NOSLOT_IC_MISS
pm_compat_cpi_1plus_ppc: PM_1PLUS_PPC_CMPL
pm_misc_15: PM_CMPLU_STALL_STORE, PM_CMPLU_STALL_VECTOR_LONG
pm_misc_16: PM_CMPLU_STALL_IFU
pm_misc_17: PM_CMPLU_STALL_BRU
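The metric formulas in Section 6.2 are all simple ratios over PM_RUN_INST_CMPL, plus a few subtractive "other" buckets. A minimal sketch of computing part of the hierarchy from raw counts (the event counts below are made-up numbers for illustration, not from a real run):

```python
def cpi_stack(ev):
    """Compute a subset of the Section 6.2 CPI-stack metrics from raw
    PMU event counts (ev maps event name -> count)."""
    inst = ev["PM_RUN_INST_CMPL"]
    m = {}
    m["RUN_CPI"] = ev["PM_RUN_CYC"] / inst
    m["STALL_CPI"] = ev["PM_CMPLU_STALL"] / inst
    m["GCT_EMPTY_CPI"] = ev["PM_GCT_NOSLOT_CYC"] / inst
    m["COMPLETION_CPI"] = ev["PM_GRP_CMPL"] / inst
    m["FXU_STALL_CPI"] = ev["PM_CMPLU_STALL_FXU"] / inst
    m["FXU_MULTI_CYC_CPI"] = ev["PM_CMPLU_STALL_DIV"] / inst
    # "Other" buckets are differences of an enclosing metric and its children.
    m["FXU_STALL_OTHER_CPI"] = m["FXU_STALL_CPI"] - m["FXU_MULTI_CYC_CPI"]
    return m

counts = {  # hypothetical counts for illustration
    "PM_RUN_INST_CMPL": 1_000_000,
    "PM_RUN_CYC": 1_800_000,
    "PM_CMPLU_STALL": 700_000,
    "PM_GCT_NOSLOT_CYC": 200_000,
    "PM_GRP_CMPL": 900_000,
    "PM_CMPLU_STALL_FXU": 150_000,
    "PM_CMPLU_STALL_DIV": 50_000,
}
for name, value in cpi_stack(counts).items():
    print(f"{name:22s} {value:.3f}")
```

The remaining metrics in Section 6.2 follow the same two patterns (event / PM_RUN_INST_CMPL, or parent minus children) and can be added the same way.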


7 Memory Bandwidth

This chapter lists and describes the events, metrics and groups needed for calculating the average memory bandwidth used by an application. Memory bandwidth is often a critical limiting factor in application performance. A fully accurate measure of memory bandwidth would involve much more detailed monitoring of events across the entire memory fabric of the node; the method described here, though relatively simple to implement, is less accurate. When an application has a lot of "interventions" (data from any other on-chip cache), this method will over-estimate memory bandwidth.

The instrumentation used to collect memory bandwidth events exists in the memory controller's synchronous interface to the Powerbus. It is not associated with any particular core. Any statistics collected are an aggregate of all the off-chip memory requests from all the cores on the chip. Requests are in units of the cache line, 128 bytes. Note that these events exclude any requests that are satisfied by any of the POWER7 caches.

A typical POWER7 chip has either 1 or 2 memory controllers (MCs), and the first is always measured.
Where there are 2 MCs, the event counts of the first should be closely mirrored by the second, because the controller used to satisfy a memory access request is determined by the parity of the number of the cache line requested.

7.1 Descriptions of Events

PM_MEM0_RQ_DISP: The memory controller has dispatched a read operation from the Powerbus
The memory controller fetches a 128-byte cache line from the DIMMs and returns the data to the Powerbus. This event is incremented by all reads seen by this memory controller, so it is a sum of the reads from all threads on this chip which reference local physical memory and threads on adjacent chips that also access this chip's local memory (which they classify as "remote" memory reads).

PM_MEM0_WQ_DISP: The memory controller has dispatched a write operation from the Powerbus
The memory controller writes a 128-byte cache line to the DIMMs. This event is incremented by all writes seen by this memory controller, so it is a sum of the writes from all threads on this chip which reference local physical memory and threads on adjacent chips that also access this chip's local memory (which they classify as "remote" memory writes).

PM_RUN_CYC: Run cycles
Processor cycles gated by the run latch. Operating systems use the run latch to indicate when they are doing useful work. The run latch is typically cleared in the OS idle loop. Gating by the run latch filters out the idle loop.

The events above are routed from the memory controller across multiple asynchronous clock domains to the thread-level PMUs, so to prevent signal loss the counts need to be scaled appropriately.

7.2 Metrics
For a single MC:

Read_BW (GB/s): Read bandwidth in gigabytes/sec
Formula: (cacheline_size * fixed_prescalar * ReadQ dispatches) / 1e9 / secs
       = (128 * 8 * PM_MEM0_RQ_DISP) / 1e9 / secs

Write_BW (GB/s): Write bandwidth in gigabytes/sec
Formula: (cacheline_size * fixed_prescalar * WriteQ dispatches) / 1e9 / secs
       = (128 * 8 * PM_MEM0_WQ_DISP) / 1e9 / secs

"secs" is the number of elapsed seconds during which the events were counted. This is equivalent to using


PM_RUN_CYC / [machine clock speed].

If a POWER7 chip uses two controllers, multiply these rates by two.

7.3 Event Groups

Only one group is needed: pm_nest5.

Table 7-1 Memory bandwidth events with their corresponding event group and brief description

Event Name  Group Name  Event Description
PM_MEM0_RQ_DISP  pm_nest5  Nest events (MC0/MC1/PB/GX), Pair0 Bit1
PM_MEM0_WQ_DISP  pm_nest5  Nest events (MC0/MC1/PB/GX), Pair3 Bit1
PM_RUN_CYC  pm_nest5  Run cycles
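Putting the Section 7.2 formulas together, a minimal sketch of the bandwidth calculation (the dispatch counts and 3.0 GHz clock below are hypothetical, chosen only to exercise the arithmetic):

```python
CACHELINE_BYTES = 128   # POWER7 requests are in units of one 128-byte cache line
PRESCALAR = 8           # fixed prescalar applied to memory-controller events

def memory_bw_gbs(rq_disp, wq_disp, run_cyc, clock_hz, num_mcs=1):
    """Read/write bandwidth in GB/s per Section 7.2. Elapsed seconds
    are derived from PM_RUN_CYC and the machine clock speed."""
    secs = run_cyc / clock_hz
    read_bw = CACHELINE_BYTES * PRESCALAR * rq_disp / 1e9 / secs
    write_bw = CACHELINE_BYTES * PRESCALAR * wq_disp / 1e9 / secs
    # With two memory controllers the second closely mirrors the first,
    # so the chip total is the measured rate times the MC count.
    return read_bw * num_mcs, write_bw * num_mcs

# Hypothetical counts: 10^7 read and 4*10^6 write dispatches over one
# second of run cycles on a 3.0 GHz chip with two controllers.
rd, wr = memory_bw_gbs(10_000_000, 4_000_000, 3_000_000_000, 3.0e9, num_mcs=2)
print(f"read {rd:.2f} GB/s, write {wr:.2f} GB/s")
```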


8 L1 Instruction Cache Misses

This chapter lists and describes the events, metrics and groups needed for analyzing L1 instruction cache misses.

8.1 Descriptions of Events

There are 5 events, distributed across 2 event groups, needed for these POWER7 L1 I-cache miss metrics. These include 1 architected event counting non-idle completed instructions.

This section provides more details on each event and how it is triggered.

PM_INST_FROM_L2MISS: Instruction fetched missed L2
The processor's instruction cache was reloaded, but not from the local L2, due to a demand load.

PM_INST_FROM_L3MISS: Instruction fetched missed L3
The processor's instruction cache was reloaded, but not from the local L3, due to a demand load.

PM_INST_FROM_PREF: Instruction fetched from prefetch
The processor's instruction cache was reloaded with data because of an instruction prefetch.

PM_L1_ICACHE_MISS: Demand I-cache miss
An instruction fetch missed the L1 instruction cache.

PM_RUN_INST_CMPL: Run instructions
Number of PowerPC instructions completed, gated by the run latch.

8.2 Metrics

Instruction cache misses are characterized with the following metrics:

L1_Inst_Miss_Rate(%): Instruction cache miss rate (per run instruction) (%)
Formula: PM_L1_ICACHE_MISS * 100 / PM_RUN_INST_CMPL

L2_Inst_Miss_Rate(%): L2 instruction miss rate (per instruction) (%)
Formula: (PM_INST_FROM_L2MISS - PM_INST_FROM_PREF) * 100 / PM_RUN_INST_CMPL

L3_Inst_Miss_Rate(%): L3 instruction miss rate (per instruction) (%)
Formula: (PM_INST_FROM_L3MISS - PM_INST_FROM_PREF) * 100 / PM_RUN_INST_CMPL


8.3 Event Groups

These events are listed in Table 8-1. The relevant events are listed alphabetically by name, and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

Two event groups are needed to cover instruction cache events.

Table 8-1 L1 instruction cache events with their corresponding event groups and brief description

Event Name  Group Name  Event Description
PM_L1_ICACHE_MISS  pm_ic_miss  L1 I-cache miss count
PM_INST_FROM_L2MISS  pm_isource6  Instruction fetched missed L2
PM_INST_FROM_L3MISS  pm_isource6  Instruction fetched missed L3
PM_INST_FROM_PREF  pm_isource6  Instruction fetched from prefetch
PM_RUN_INST_CMPL  pm_ic_miss  Run instructions
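The Section 8.2 formulas can be sketched directly; the interesting detail is that prefetch-driven reloads are subtracted so the L2/L3 rates reflect only demand misses. The sample counts below are hypothetical, for illustration only:

```python
def icache_miss_rates(ev):
    """Section 8.2 instruction-cache miss rates, in percent per run
    instruction. Prefetch reloads are subtracted from the L2/L3 miss
    counts so those rates cover only demand fetches."""
    inst = ev["PM_RUN_INST_CMPL"]
    pref = ev["PM_INST_FROM_PREF"]
    return {
        "L1_Inst_Miss_Rate(%)": ev["PM_L1_ICACHE_MISS"] * 100 / inst,
        "L2_Inst_Miss_Rate(%)": (ev["PM_INST_FROM_L2MISS"] - pref) * 100 / inst,
        "L3_Inst_Miss_Rate(%)": (ev["PM_INST_FROM_L3MISS"] - pref) * 100 / inst,
    }

sample = {  # hypothetical counts for illustration
    "PM_RUN_INST_CMPL": 2_000_000,
    "PM_L1_ICACHE_MISS": 40_000,
    "PM_INST_FROM_L2MISS": 12_000,
    "PM_INST_FROM_L3MISS": 6_000,
    "PM_INST_FROM_PREF": 2_000,
}
print(icache_miss_rates(sample))
```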


9 Branch Prediction

This chapter lists and describes the events, metrics and groups needed to characterize branch misprediction and branches taken.

9.1 Descriptions of Events

There are 12 events, distributed across 4 event groups, needed for these POWER7 branch prediction metrics. These include 1 architected event counting non-idle completed instructions.

This section provides more details on each event and how it is triggered.

PM_BR_MPRED: Number of branch mispredicts
A branch instruction was mispredicted. This could have been a target prediction, a condition prediction, or both.

PM_BR_MPRED_CCACHE: Branch mispredict due to count cache prediction
A branch instruction target was incorrectly predicted due to count cache misprediction. This will result in a branch mispredict flush unless a flush is detected from an older instruction.

PM_BR_MPRED_CR: Branch mispredict - taken/not taken
A conditional branch instruction was incorrectly predicted as taken or not taken. The branch execution unit detects a branch mispredict because the CR value is opposite of the predicted value. This will result in a branch redirect flush if not overridden by a flush of an older instruction.

PM_BR_MPRED_LSTACK: Branch mispredict due to link stack
A branch instruction target was incorrectly predicted due to link stack misprediction. This will result in a branch mispredict flush unless a flush is detected from an older instruction.

PM_BR_MPRED_TA: Branch mispredict - target address
A branch instruction target was incorrectly predicted.
This will result in a branch mispredict flush unless a flush is detected from an older instruction.

PM_BR_PRED: Branch predictions made
A branch prediction was made, counted at branch execute time.

PM_BR_PRED_CCACHE: Count cache predictions
A branch instruction target prediction was made using the count cache.

PM_BR_PRED_CR: Branch predict - taken/not taken
A conditional branch instruction was predicted as taken or not taken.

PM_BR_PRED_LSTACK: Link stack predictions
A branch instruction target prediction was made using the link stack.

PM_BR_TAKEN: Branch taken
A branch instruction was taken. This could have been a conditional branch or an unconditional branch.

PM_BRU_FIN: Branch instruction finished
The branch unit finished a branch-type instruction.

PM_RUN_INST_CMPL: Run instructions
Number of PowerPC instructions completed, gated by the run latch.

9.2 Metrics

Branch prediction is characterized with the following metrics:


Branch_Mispredict_Rate(%): Branch mispredictions per instruction
Formula: PM_BR_MPRED / PM_RUN_INST_CMPL * 100

CR_Mispredict_Rate(%): CR mispredictions per instruction
Formula: PM_BR_MPRED_CR / PM_RUN_INST_CMPL * 100

TA_Mispredict_Rate(%): Branch target address mispredictions per instruction
Formula: PM_BR_MPRED_TA / PM_RUN_INST_CMPL * 100

LSTACK_Mispredict_Rate(%): Link stack branch mispredictions per instruction
Formula: (PM_BR_MPRED_TA - PM_BR_MPRED_CCACHE) / PM_RUN_INST_CMPL * 100

CCACHE_Mispredict_Rate(%): Count cache branch mispredictions per instruction
Formula: PM_BR_MPRED_CCACHE / PM_RUN_INST_CMPL * 100

Taken_Branches(%): % branches taken
Formula: PM_BR_TAKEN * 100 / PM_BRU_FIN

BR_Misprediction(%): Branch mispredictions per branch prediction
Formula: PM_BR_MPRED / PM_BR_PRED * 100

TA_Misprediction(%): Branch target address mispredictions per branch target address prediction
Formula: PM_BR_MPRED_TA / (PM_BR_PRED_CCACHE + PM_BR_PRED_LSTACK) * 100

CR_Misprediction(%): Branch CR mispredictions per branch CR prediction
Formula: PM_BR_MPRED_CR / PM_BR_PRED_CR * 100

CCACHE_Misprediction(%): Count cache branch mispredictions per count cache prediction
Formula: PM_BR_MPRED_CCACHE / PM_BR_PRED_CCACHE * 100

LSTACK_Misprediction(%): Link stack branch mispredictions per link stack prediction
Formula: (PM_BR_MPRED_TA - PM_BR_MPRED_CCACHE) / PM_BR_PRED_LSTACK * 100


9.3 Event Groups

These events are listed in Table 9-1. The relevant events are listed alphabetically by name, and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

The list of groups needed (and the relevant events they include) is provided in Table 9-2.

Table 9-1 Branch prediction events with their corresponding event groups and brief description

Event Name  Group Name  Event Description
PM_BR_MPRED  pm_branch3  Number of branch mispredicts
PM_BR_MPRED_CCACHE  pm_branch4  Branch mispredict due to count cache prediction
PM_BR_MPRED_CR  pm_branch4  Branch mispredict - taken/not taken
PM_BR_MPRED_LSTACK  pm_misc8  Branch mispredict due to link stack
PM_BR_MPRED_TA  pm_branch4  Branch mispredict - target address
PM_BR_PRED  pm_branch3  Branch predictions made
PM_BR_PRED_CCACHE  pm_branch2  Count cache predictions
PM_BR_PRED_CR  pm_branch2  Branch predict - taken/not taken
PM_BR_PRED_LSTACK  pm_branch2  Link stack predictions
PM_BR_TAKEN  pm_branch3  Branch taken
PM_BRU_FIN  pm_branch3  Branch instruction finished
PM_RUN_INST_CMPL  pm_branch3  Run instructions

Table 9-2 Events sorted by Group

pm_branch2: PM_BR_PRED_CCACHE, PM_BR_PRED_CR, PM_BR_PRED_LSTACK
pm_branch3: PM_BR_MPRED, PM_BR_PRED, PM_BR_TAKEN, PM_BRU_FIN, PM_RUN_INST_CMPL
pm_branch4: PM_BR_MPRED_CCACHE, PM_BR_MPRED_CR, PM_BR_MPRED_TA
pm_misc8: PM_BR_MPRED_LSTACK
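The Section 9.2 metrics come in two flavors: mispredicts normalized by completed instructions, and mispredicts normalized by predictions made. A minimal sketch of a few of them, with hypothetical counts chosen only for illustration:

```python
def branch_metrics(ev):
    """A few of the Section 9.2 branch metrics. Note the two views:
    mispredicts per completed instruction vs. per prediction made."""
    inst = ev["PM_RUN_INST_CMPL"]
    return {
        "Branch_Mispredict_Rate(%)": ev["PM_BR_MPRED"] / inst * 100,
        "BR_Misprediction(%)": ev["PM_BR_MPRED"] / ev["PM_BR_PRED"] * 100,
        "Taken_Branches(%)": ev["PM_BR_TAKEN"] * 100 / ev["PM_BRU_FIN"],
        # Link-stack mispredicts are inferred: target-address mispredicts
        # not attributed to the count cache.
        "LSTACK_Misprediction(%)":
            (ev["PM_BR_MPRED_TA"] - ev["PM_BR_MPRED_CCACHE"])
            / ev["PM_BR_PRED_LSTACK"] * 100,
    }

sample = {  # hypothetical counts for illustration
    "PM_RUN_INST_CMPL": 1_000_000,
    "PM_BR_MPRED": 8_000,
    "PM_BR_PRED": 200_000,
    "PM_BR_TAKEN": 120_000,
    "PM_BRU_FIN": 200_000,
    "PM_BR_MPRED_TA": 3_000,
    "PM_BR_MPRED_CCACHE": 1_000,
    "PM_BR_PRED_LSTACK": 50_000,
}
print(branch_metrics(sample))
```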


10 L1 Cache Data Reloads per Reference

10.1 Descriptions of Events

There are 15 events, distributed across 6 event groups, needed for these POWER7 L1 cache data reload metrics.

This section provides more details on each event and how it is triggered.

PM_DATA_FROM_DL2L3_MOD: Data loaded from distant L2 or L3 modified
The processor's data cache was reloaded with Modified (M) data from a chip's L2 or L3 on a different node (distant) due to a demand load.

PM_DATA_FROM_DL2L3_SHR: Data loaded from distant L2 or L3 shared
The processor's data cache was reloaded with Shared (S) data from a chip's L2 or L3 on a different node (distant) due to a demand load.

PM_DATA_FROM_DMEM: Data loaded from distant memory
The processor's data cache was reloaded with data from a chip's memory on a different node (distant) due to a demand load.

PM_DATA_FROM_L2: Data loaded from L2
The processor's data cache was reloaded with data from the local chiplet's L2 cache due to a demand load.

PM_DATA_FROM_L21_MOD: Data loaded from another L2 on same chip modified
The processor's data cache was reloaded with Modified (M) data from another L2 on the same chip due to a demand load.

PM_DATA_FROM_L21_SHR: Data loaded from another L2 on same chip shared
The processor's data cache was reloaded with Shared (S) data from another L2 on the same chip due to a demand load.

PM_DATA_FROM_L3: Data loaded from L3
The processor's data cache was reloaded with data from the local chiplet's L3 cache due to a demand load.

PM_DATA_FROM_L31_MOD: Data loaded from another L3 on same chip modified
The processor's data cache was reloaded with Modified (M) data from another L3 on the same chip due to a demand load.

PM_DATA_FROM_L31_SHR: Data loaded from another L3 on same chip shared
The processor's data cache was reloaded with Shared (S) data from another L3 on the same chip due to a demand load.

PM_DATA_FROM_LMEM: Data loaded from local memory
The
processor's data cache was reloaded with data from the local chip's memory due to a demand load.

PM_DATA_FROM_RL2L3_MOD: Data loaded from remote L2 or L3 modified
The processor's data cache was reloaded with Modified (M) data from a chip's L2 or L3 on the same node (remote) due to a demand load.

PM_DATA_FROM_RL2L3_SHR: Data loaded from remote L2 or L3 shared
The processor's data cache was reloaded with Shared (S) data from a chip's L2 or L3 on the same node (remote) due to a demand load.

PM_DATA_FROM_RMEM: Data loaded from remote memory
The processor's data cache was reloaded with data from another chip's memory on the same node due to a


demand load.

PM_L1_DCACHE_RELOAD_VALID: L1 reload data source valid
The data source information is valid; the data cache has been reloaded for demand loads. Reported once per cache line.

PM_LD_MISS_L1: Load missed L1
Load references that miss the level 1 data cache.

10.2 Metrics

The L1 cache data reloads are characterized with the following metrics:

dL1_Miss_Reloads(%): % of dL1 misses that result in a cache reload
Formula: PM_L1_DCACHE_RELOAD_VALID * 100 / PM_LD_MISS_L1

dL1_Reload_FROM_L2(%): % of dL1 reloads from L2
Formula: PM_DATA_FROM_L2 * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_L21_MOD(%): % of dL1 reloads from private L2, other core (Modified)
Formula: PM_DATA_FROM_L21_MOD * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_L21_SHR(%): % of dL1 reloads from private L2, other core (Shared)
Formula: PM_DATA_FROM_L21_SHR * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_L31_MOD(%): % of dL1 reloads from private L3, other core (Modified)
Formula: PM_DATA_FROM_L31_MOD * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_L31_SHR(%): % of dL1 reloads from private L3, other core (Shared)
Formula: PM_DATA_FROM_L31_SHR * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_L3(%): % of dL1 reloads from L3
Formula: PM_DATA_FROM_L3 * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_RL2L3_SHR(%): % of dL1 reloads from remote L2 or L3 (Shared)
Formula: PM_DATA_FROM_RL2L3_SHR * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_RL2L3_MOD(%): % of dL1 reloads from remote L2 or L3 (Modified)
Formula: PM_DATA_FROM_RL2L3_MOD * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_DL2L3_MOD(%): % of dL1 reloads from distant L2 or L3 (Modified)
Formula: PM_DATA_FROM_DL2L3_MOD * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_DL2L3_SHR(%): % of dL1 reloads from distant L2 or L3 (Shared)
Formula: PM_DATA_FROM_DL2L3_SHR * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_LMEM(%): % of dL1 reloads from local
memory
Formula: PM_DATA_FROM_LMEM * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_RMEM(%): % of dL1 reloads from remote memory
Formula: PM_DATA_FROM_RMEM * 100 / PM_L1_DCACHE_RELOAD_VALID

dL1_Reload_FROM_DMEM(%): % of dL1 reloads from distant memory
Formula: PM_DATA_FROM_DMEM * 100 / PM_L1_DCACHE_RELOAD_VALID
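All of the per-source metrics above share the same denominator, PM_L1_DCACHE_RELOAD_VALID, so together they form a distribution of where reloads came from and should sum to roughly 100%. A minimal sketch (the counts below are hypothetical and use only four of the fifteen sources for brevity):

```python
def reload_distribution(ev):
    """Percent of L1 data-cache reloads by source, per Section 10.2.
    Every PM_DATA_FROM_* event is divided by the common denominator
    PM_L1_DCACHE_RELOAD_VALID."""
    valid = ev["PM_L1_DCACHE_RELOAD_VALID"]
    sources = [k for k in ev if k.startswith("PM_DATA_FROM_")]
    return {src: ev[src] * 100 / valid for src in sources}

sample = {  # hypothetical counts for illustration
    "PM_L1_DCACHE_RELOAD_VALID": 100_000,
    "PM_DATA_FROM_L2": 70_000,
    "PM_DATA_FROM_L3": 20_000,
    "PM_DATA_FROM_LMEM": 8_000,
    "PM_DATA_FROM_RMEM": 2_000,
}
dist = reload_distribution(sample)
print(dist)
```

With a complete set of the fifteen source events, a total far from 100% would suggest a measurement problem or a source not being counted.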


10.3 Event Groups

These events are listed in Table 10-1. The relevant events are listed alphabetically by name, and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

The list of groups needed (and the relevant events they include) is provided in Table 10-2.

Table 10-1 L1 cache data reload events with their corresponding event groups and brief description

Event Name  Group Name  Event Description
PM_DATA_FROM_DL2L3_MOD  pm_dsource4  Data loaded from distant L2 or L3 modified
PM_DATA_FROM_DL2L3_SHR  pm_dsource5  Data loaded from distant L2 or L3 shared
PM_DATA_FROM_DMEM  pm_dsource3  Data loaded from distant memory
PM_DATA_FROM_L2  pm_dsource1  Data loaded from L2
PM_DATA_FROM_L21_MOD  pm_dsource3  Data loaded from another L2 on same chip modified
PM_DATA_FROM_L21_SHR  pm_dsource5  Data loaded from another L2 on same chip shared
PM_DATA_FROM_L3  pm_dsource1  Data loaded from L3
PM_DATA_FROM_L31_MOD  pm_dsource4  Data loaded from another L3 on same chip modified
PM_DATA_FROM_L31_SHR  pm_dsource5  Data loaded from another L3 on same chip shared
PM_DATA_FROM_LMEM  pm_dsource1  Data loaded from local memory
PM_DATA_FROM_RL2L3_MOD  pm_dsource6  Data loaded from remote L2 or L3 modified
PM_DATA_FROM_RL2L3_SHR  pm_dsource6  Data loaded from remote L2 or L3 shared
PM_DATA_FROM_RMEM  pm_dsource1  Data loaded from remote memory
PM_L1_DCACHE_RELOAD_VALID  pm_dsource9  L1 reload data source valid
PM_LD_MISS_L1  pm_dsource9  Load missed L1

Table 10-2 Events sorted by Group
pm_dsource1: PM_DATA_FROM_L2, PM_DATA_FROM_L3, PM_DATA_FROM_LMEM, PM_DATA_FROM_RMEM
pm_dsource3: PM_DATA_FROM_DMEM, PM_DATA_FROM_L21_MOD
pm_dsource4: PM_DATA_FROM_DL2L3_MOD, PM_DATA_FROM_L31_MOD
pm_dsource5: PM_DATA_FROM_DL2L3_SHR, PM_DATA_FROM_L21_SHR, PM_DATA_FROM_L31_SHR
pm_dsource6: PM_DATA_FROM_RL2L3_MOD, PM_DATA_FROM_RL2L3_SHR


pm_dsource9: PM_L1_DCACHE_RELOAD_VALID, PM_LD_MISS_L1


11 Memory Location

This chapter lists and describes the events, metrics and groups needed to characterize how data is distributed across memory subsystems.

Most users understand the term "node" to mean the same as an SMP node. But in the context of memory, "node" refers to the chips that are mounted with their associated memory onto the same board, which may be a subset of a larger SMP system. Thus, there can be more than one of these "nodes" in an SMP system.

11.1 Descriptions of Events

There are 3 events, distributed across 2 event groups, needed for these POWER7 memory location metrics.

This section provides more details on each event and how it is triggered.

PM_DATA_FROM_DMEM: Data loaded from distant memory
The processor's data cache was reloaded with data from a chip's memory on a different node (distant) due to a demand load.

PM_DATA_FROM_LMEM: Data loaded from local memory
The processor's data cache was reloaded with data from the local chip's memory due to a demand load.

PM_DATA_FROM_RMEM: Data loaded from remote memory
The processor's data cache was reloaded with data from another chip's memory on the same node due to a demand load.

11.2 Metrics

The location of data in memory is characterized with the following metrics:

MEM_LOCALITY: Memory locality
Formula: PM_DATA_FROM_LMEM / (PM_DATA_FROM_LMEM + PM_DATA_FROM_RMEM + PM_DATA_FROM_DMEM)

LD_LMEM_PER_LD_RMEM: Number of loads from local memory per load from remote memory
Formula: PM_DATA_FROM_LMEM / PM_DATA_FROM_RMEM

LD_LMEM_PER_LD_DMEM: Number of loads from local memory per load from distant memory
Formula: PM_DATA_FROM_LMEM / PM_DATA_FROM_DMEM

LD_LMEM_PER_LD_MEM: Number of loads from local memory per load from remote and distant memory
Formula: PM_DATA_FROM_LMEM / (PM_DATA_FROM_DMEM + PM_DATA_FROM_RMEM)

LD_RMEM_PER_LD_DMEM: Number of loads from remote memory per load from distant memory
Formula: PM_DATA_FROM_RMEM
/ PM_DATA_FROM_DMEM

11.3 Event Groups

These events are listed in Table 11-1. The relevant events are listed alphabetically by name, and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to


collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

Two event groups are needed to cover memory location events.

Table 11-1 Memory location events with their corresponding event groups and brief description

Event Name  Group Name  Event Description
PM_DATA_FROM_DMEM  pm_dsource3  Data loaded from distant memory
PM_DATA_FROM_LMEM  pm_dsource1  Data loaded from local memory
PM_DATA_FROM_RMEM  pm_dsource1  Data loaded from remote memory
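The MEM_LOCALITY metric from Section 11.2 is just the local fraction of all demand memory loads; 1.0 means every load that went to memory was satisfied locally. A minimal sketch with hypothetical counts:

```python
def mem_locality(lmem, rmem, dmem):
    """MEM_LOCALITY from Section 11.2: the fraction of demand memory
    loads satisfied from the local chip's memory (1.0 is ideal)."""
    return lmem / (lmem + rmem + dmem)

# Hypothetical counts: most reloads local, some remote, few distant.
loc = mem_locality(90_000, 8_000, 2_000)
print(f"memory locality: {loc:.2f}")
```

A low value here suggests that threads and their memory are not co-located, which tools like NUMA-aware placement or affinity settings may help address.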


12 Address Translation

This chapter lists and describes the events, metrics and groups needed to characterize address translation.

12.1 Descriptions of Events

There are 11 events, distributed across 4 event groups, needed for these POWER7 address translation metrics. These include 1 architected event counting non-idle completed instructions.

This section provides more details on each event and how it is triggered.

PM_DERAT_MISS_4K: DERAT misses for 4K page
The source page size of a DERAT reload is 4K.

PM_DERAT_MISS_64K: DERAT misses for 64K page
The source page size of a DERAT reload is 64K.

PM_DERAT_MISS_16M: DERAT misses for 16M page
The source page size of a DERAT reload is 16M.

PM_DERAT_MISS_16G: DERAT misses for 16G page
The source page size of a DERAT reload is 16G.

PM_DSLB_MISS: Data SLB miss - total of all segment sizes
An SLB miss for a data request occurred. SLB misses trap to the operating system to resolve. This is a total count for all segment sizes.

PM_DTLB_MISS: TLB reload valid
Data TLB misses, all page sizes.

PM_IERAT_MISS: IERAT miss (not implemented as DI on POWER6)
An entry was written into the IERAT as a result of an IERAT miss.

PM_ISLB_MISS: Instruction SLB miss - total of all segment sizes
An SLB miss for an instruction fetch has occurred. SLB misses trap to the operating system to resolve. This is a total count for all segment sizes.

PM_L1_ICACHE_MISS: Demand I-cache miss
An instruction fetch missed the L1 instruction cache.

PM_LSU_DERAT_MISS: DERAT reloaded due to a DERAT miss
Total DERAT misses. Requests that miss the DERAT are rejected and retried until the request hits in the ERAT. This may result in multiple ERAT misses for the same instruction.
Combined Unit 0 + 1.PM_RUN_INST_CMPLRun InstructionsNumber of <strong>Power</strong>PC instructions completed, gated by the run latch.12.2<strong>Metrics</strong>Address translation is characterized with the following metrics:DERAT_Miss_Rate(%)DERAT Miss Rate (per run instruction)(%)Formula: PM_LSU_DERAT_MISS * 100 / PM_RUN_INST_CMPLDERAT_4K_Miss_Rate(%)% DERAT miss rate <strong>for</strong> 4K page per instructionCopyright ©2011 IBM Corporation Page 37 of 52


Formula: PM_DERAT_MISS_4K * 100 / PM_RUN_INST_CMPL

DERAT_64K_Miss_Rate(%)
% DERAT miss rate for 64K page per instruction
Formula: PM_DERAT_MISS_64K * 100 / PM_RUN_INST_CMPL

DERAT_16M_Miss_Rate(%)
% DERAT miss rate for 16M page per instruction
Formula: PM_DERAT_MISS_16M * 100 / PM_RUN_INST_CMPL

DERAT_16G_Miss_Rate(%)
% DERAT miss rate for 16G page per instruction
Formula: PM_DERAT_MISS_16G * 100 / PM_RUN_INST_CMPL

DERAT_Miss_Ratio
DERAT miss ratio
Formula: PM_LSU_DERAT_MISS / PM_RUN_INST_CMPL

DERAT_4K_Miss_Ratio
DERAT miss ratio for 4K page
Formula: PM_DERAT_MISS_4K / PM_LSU_DERAT_MISS

DERAT_64K_Miss_Ratio
DERAT miss ratio for 64K page
Formula: PM_DERAT_MISS_64K / PM_LSU_DERAT_MISS

DERAT_16M_Miss_Ratio
DERAT miss ratio for 16M page
Formula: PM_DERAT_MISS_16M / PM_LSU_DERAT_MISS

DERAT_16G_Miss_Ratio
DERAT miss ratio for 16G page
Formula: PM_DERAT_MISS_16G / PM_LSU_DERAT_MISS

DSLB_Miss_Rate(%)
% DSLB miss rate per instruction
Formula: PM_DSLB_MISS * 100 / PM_RUN_INST_CMPL

DTLB_Miss_Rate(%)
% DTLB miss rate per instruction
Formula: PM_DTLB_MISS * 100 / PM_RUN_INST_CMPL

IERAT_Miss_Rate(%)
IERAT miss rate (%)
Formula: PM_IERAT_MISS * 100 / PM_RUN_INST_CMPL

ISLB_Miss_Rate(%)
% ISLB miss rate per instruction
Formula: PM_ISLB_MISS * 100 / PM_RUN_INST_CMPL

L1_Inst_Miss_Rate(%)
Instruction cache miss rate (per run instruction) (%)
Formula: PM_L1_ICACHE_MISS * 100 / PM_RUN_INST_CMPL
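To make the formulas concrete, here is a small Python sketch that evaluates a few of the address-translation metrics from raw counter values. The counts below are invented for illustration only; real values would come from hpmcount, perf, oprofile or tprof runs over the event groups described in Section 12.3.

```python
# Illustrative (made-up) PMU counts -- not measurements from a real system.
counts = {
    "PM_LSU_DERAT_MISS": 12_500,
    "PM_DERAT_MISS_4K": 9_000,
    "PM_DERAT_MISS_64K": 2_500,
    "PM_DERAT_MISS_16M": 900,
    "PM_DERAT_MISS_16G": 100,
    "PM_DTLB_MISS": 4_000,
    "PM_RUN_INST_CMPL": 50_000_000,
}

def rate_pct(event: str, counts: dict) -> float:
    """Rate metrics: event count as a percentage of run instructions."""
    return counts[event] * 100 / counts["PM_RUN_INST_CMPL"]

def derat_ratio(event: str, counts: dict) -> float:
    """Ratio metrics: share of all DERAT misses for one page size."""
    return counts[event] / counts["PM_LSU_DERAT_MISS"]

print(f"DERAT_Miss_Rate(%)  = {rate_pct('PM_LSU_DERAT_MISS', counts):.4f}")
print(f"DTLB_Miss_Rate(%)   = {rate_pct('PM_DTLB_MISS', counts):.4f}")
print(f"DERAT_4K_Miss_Ratio = {derat_ratio('PM_DERAT_MISS_4K', counts):.2f}")
```

Note that the four per-page-size DERAT ratios sum to 1 when all four page-size events are collected, which is a quick sanity check on a run.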


12.3 Event Groups

These events are listed in Table 12-1. The relevant events are listed alphabetically by name and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

The list of groups needed (and the relevant events they include) is provided in Table 12-2.

Table 12-1 Address translation events with their corresponding event groups and brief description

Event Name        | Group Name           | Event Description
PM_DERAT_MISS_4K  | pm_derat_miss1       | DERAT misses for 4K page
PM_DERAT_MISS_64K | pm_derat_miss1       | DERAT misses for 64K page
PM_DERAT_MISS_16M | pm_derat_miss1       | DERAT misses for 16M page
PM_DERAT_MISS_16G | pm_derat_miss1       | DERAT misses for 16G page
PM_DSLB_MISS      | pm_misc_miss5        | Data SLB Miss - Total of all segment sizes
PM_DTLB_MISS      | pm_id_miss_erat_tlab | TLB reload valid
PM_IERAT_MISS     | pm_misc_miss5        | IERAT Miss
PM_ISLB_MISS      | pm_misc_miss5        | Instruction SLB Miss - Total of all segment sizes
PM_L1_ICACHE_MISS | pm_id_miss_erat_l1   | Demand iCache Miss
PM_LSU_DERAT_MISS | pm_id_miss_erat_tlab | DERAT reloaded due to a DERAT miss
PM_RUN_INST_CMPL  | pm_derat_miss1       | Run instructions

Table 12-2 Events sorted by Group

Group Name           | Event Name
pm_derat_miss1       | PM_DERAT_MISS_4K
                     | PM_DERAT_MISS_64K
                     | PM_DERAT_MISS_16M
                     | PM_DERAT_MISS_16G
                     | PM_RUN_INST_CMPL
pm_id_miss_erat_l1   | PM_L1_ICACHE_MISS
pm_id_miss_erat_tlab | PM_DTLB_MISS
                     | PM_LSU_DERAT_MISS
pm_misc_miss5        | PM_DSLB_MISS
                     | PM_IERAT_MISS
                     | PM_ISLB_MISS


13 Instruction Statistics

This chapter lists and describes the events, metrics and groups needed to characterize how efficiently instructions are fetched.

13.1 Descriptions of Events

There are 15 events, distributed across 6 event groups, needed for these POWER7 instruction statistics.

This section provides more details on each event and how it is triggered.

PM_IC_PREF_WRITE
Instruction prefetch written into IL1
Count of instruction prefetch sectors written into the L1. Writes are always written as LRU and a subsequent read makes the entry MRU.

PM_INST_FROM_DL2L3_MOD
Instruction fetched from distant L2 or L3 modified
The processor's Instruction Cache was reloaded with Modified (M) data from a chip's L2 or L3 on a different node (Distant) due to a demand load.

PM_INST_FROM_DL2L3_SHR
Instruction fetched from distant L2 or L3 shared
The processor's Instruction Cache was reloaded with Shared (S) data from a chip's L2 or L3 on a different node (Distant) due to a demand load.

PM_INST_FROM_DMEM
Instruction fetched from distant memory
The processor's Instruction Cache was reloaded with data from a chip's memory on a different node (Distant) due to a demand load.

PM_INST_FROM_L2
Instruction fetched from L2
The processor's Instruction Cache was reloaded with data from the local chiplet's L2 cache due to a demand load.

PM_INST_FROM_L21_MOD
Instruction fetched from another L2 on same chip modified
The processor's Instruction Cache was reloaded with Modified (M) data from another L2 on the same chip due to a demand load.

PM_INST_FROM_L21_SHR
Instruction fetched from another L2 on same chip shared
The processor's Instruction Cache was reloaded with Shared (S) data from another L2 on the same chip due to a demand load.

PM_INST_FROM_L3
Instruction fetched from L3
The processor's Instruction Cache was reloaded with data from the local chiplet's L3 cache due to a demand load.

PM_INST_FROM_L31_MOD
Instruction fetched from another L3 on same chip modified
The processor's Instruction Cache was reloaded with Modified (M) data from another L3 on the same chip due to a demand load.

PM_INST_FROM_L31_SHR
Instruction fetched from another L3 on same chip shared
The processor's Instruction Cache was reloaded with Shared (S) data from another L3 on the same chip due to a demand load.

PM_INST_FROM_LMEM
Instruction fetched from local memory
The processor's Instruction Cache was reloaded with data from the local chip's memory due to a demand load.

PM_INST_FROM_RL2L3_MOD
Instruction fetched from remote L2 or L3 modified


The processor's Instruction Cache was reloaded with Modified (M) data from a chip's L2 or L3 on the same node (Remote) due to a demand load.

PM_INST_FROM_RL2L3_SHR
Instruction fetched from remote L2 or L3 shared
The processor's Instruction Cache was reloaded with Shared (S) data from a chip's L2 or L3 on the same node (Remote) due to a demand load.

PM_INST_FROM_RMEM
Instruction fetched from remote memory
The processor's Instruction Cache was reloaded with data from another chip's memory on the same node due to a demand load.

PM_L1_ICACHE_MISS
Demand iCache Miss
An instruction fetch missed the L1 instruction cache.

13.2 Metrics

Instruction usage is characterized with the following metrics:

ICACHE_PREF(%)
% of ICache reloads due to prefetch
Formula: PM_IC_PREF_WRITE * 100 / PM_L1_ICACHE_MISS

ICache_MISS_RELOAD
ICache fetches per ICache miss
Formula: (PM_L1_ICACHE_MISS - PM_IC_PREF_WRITE) / PM_L1_ICACHE_MISS

INST_FROM_L2(%)
% of ICache reloads from L2
Formula: PM_INST_FROM_L2 * 100 / PM_L1_ICACHE_MISS

INST_FROM_L21_MOD(%)
% of ICache reloads from Private L2, other core (Modified)
Formula: PM_INST_FROM_L21_MOD * 100 / PM_L1_ICACHE_MISS

INST_FROM_L21_SHR(%)
% of ICache reloads from Private L2, other core (Shared)
Formula: PM_INST_FROM_L21_SHR * 100 / PM_L1_ICACHE_MISS

INST_FROM_L31_MOD(%)
% of ICache reloads from Private L3, other core (Modified)
Formula: PM_INST_FROM_L31_MOD * 100 / PM_L1_ICACHE_MISS

INST_FROM_L31_SHR(%)
% of ICache reloads from Private L3, other core (Shared)
Formula: PM_INST_FROM_L31_SHR * 100 / PM_L1_ICACHE_MISS

INST_FROM_L3(%)
% of ICache reloads from L3
Formula: PM_INST_FROM_L3 * 100 / PM_L1_ICACHE_MISS

INST_FROM_RL2L3_SHR(%)
% of ICache reloads from Remote L2 or L3 (Shared)
Formula: PM_INST_FROM_RL2L3_SHR * 100 / PM_L1_ICACHE_MISS

INST_FROM_RL2L3_MOD(%)
% of ICache reloads from Remote L2 or L3 (Modified)
Formula: PM_INST_FROM_RL2L3_MOD * 100 / PM_L1_ICACHE_MISS

INST_FROM_DL2L3_MOD(%)
% of ICache reloads from Distant L2 or L3 (Modified)
Formula: PM_INST_FROM_DL2L3_MOD * 100 / PM_L1_ICACHE_MISS

INST_FROM_DL2L3_SHR(%)
% of ICache reloads from Distant L2 or L3 (Shared)
Formula: PM_INST_FROM_DL2L3_SHR * 100 / PM_L1_ICACHE_MISS

INST_FROM_LMEM(%)
% of ICache reloads from Local Memory
Formula: PM_INST_FROM_LMEM * 100 / PM_L1_ICACHE_MISS


INST_FROM_RMEM(%)
% of ICache reloads from Remote Memory
Formula: PM_INST_FROM_RMEM * 100 / PM_L1_ICACHE_MISS

INST_FROM_DMEM(%)
% of ICache reloads from Distant Memory
Formula: PM_INST_FROM_DMEM * 100 / PM_L1_ICACHE_MISS

13.3 Event Groups

These events are listed in Table 13-1. The relevant events are listed alphabetically by name and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

The list of groups needed (and the relevant events they include) is provided in Table 13-2.

Table 13-1 Events for instruction statistics with their corresponding event groups and brief description

Event Name             | Group Name         | Event Description
PM_IC_PREF_WRITE       | pm_prefetch        | Instruction prefetch written into IL1
PM_INST_FROM_DL2L3_MOD | pm_isource7        | Instruction fetched from distant L2 or L3 modified
PM_INST_FROM_DL2L3_SHR | pm_isource7        | Instruction fetched from distant L2 or L3 shared
PM_INST_FROM_DMEM      | pm_isource9        | Instruction fetched from distant memory
PM_INST_FROM_L2        | pm_isource1        | Instruction fetched from L2
PM_INST_FROM_L21_MOD   | pm_isource4        | Instruction fetched from another L2 on same chip modified
PM_INST_FROM_L21_SHR   | pm_isource4        | Instruction fetched from another L2 on same chip shared
PM_INST_FROM_L3        | pm_isource1        | Instruction fetched from L3
PM_INST_FROM_L31_MOD   | pm_isource4        | Instruction fetched from another L3 on same chip modified
PM_INST_FROM_L31_SHR   | pm_isource4        | Instruction fetched from another L3 on same chip shared
PM_INST_FROM_LMEM      | pm_isource9        | Instruction fetched from local memory
PM_INST_FROM_RL2L3_MOD | pm_isource7        | Instruction fetched from remote L2 or L3 modified
PM_INST_FROM_RL2L3_SHR | pm_isource7        | Instruction fetched from remote L2 or L3 shared
PM_INST_FROM_RMEM      | pm_isource9        | Instruction fetched from remote memory
PM_L1_ICACHE_MISS      | pm_id_miss_erat_l1 | Demand iCache Miss

Table 13-2 Events sorted by Group

Group Name         | Event Name
pm_id_miss_erat_l1 | PM_L1_ICACHE_MISS
pm_isource1        | PM_INST_FROM_L2
                   | PM_INST_FROM_L3
pm_isource4        | PM_INST_FROM_L21_MOD
                   | PM_INST_FROM_L21_SHR
                   | PM_INST_FROM_L31_MOD
                   | PM_INST_FROM_L31_SHR
pm_isource7        | PM_INST_FROM_DL2L3_MOD
                   | PM_INST_FROM_DL2L3_SHR
                   | PM_INST_FROM_RL2L3_MOD
                   | PM_INST_FROM_RL2L3_SHR
pm_isource9        | PM_INST_FROM_DMEM
                   | PM_INST_FROM_LMEM
                   | PM_INST_FROM_RMEM
pm_prefetch        | PM_IC_PREF_WRITE
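All of the PM_INST_FROM_* metrics in Section 13.2 share one shape: source count * 100 / PM_L1_ICACHE_MISS. The following short Python sketch, using invented counts for illustration only, shows how the reload-source breakdown can be computed and sanity-checked:

```python
# Illustrative (made-up) counts of demand ICache reloads by data source.
icache_miss = 10_000  # PM_L1_ICACHE_MISS
reload_sources = {
    "PM_INST_FROM_L2": 6_000,
    "PM_INST_FROM_L3": 2_500,
    "PM_INST_FROM_L21_SHR": 600,
    "PM_INST_FROM_L31_MOD": 400,
    "PM_INST_FROM_LMEM": 450,
    "PM_INST_FROM_RMEM": 50,
}

# Each metric is the source count as a percentage of all demand ICache misses.
pct = {ev: n * 100 / icache_miss for ev, n in reload_sources.items()}

for ev, p in sorted(pct.items(), key=lambda kv: -kv[1]):
    print(f"{ev:<22} {p:6.2f}% of ICache reloads")

# When every source event has been collected, the percentages should account
# for (nearly) all misses; a large shortfall suggests a missing event group.
print(f"covered: {sum(pct.values()):.1f}%")
```

Summing the percentages is a useful cross-check that all of the pm_isource groups were collected for the same workload.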


14 Flush Metrics

This chapter lists and describes the events, metrics and groups needed to characterize flushes.

14.1 Descriptions of Events

There are 3 events, distributed across 2 event groups, needed for these POWER7 flush metrics. These include 1 architected event counting non-idle completed instructions.

This section provides more details on each event and how it is triggered.

PM_FLUSH
Flush (any type)
A flush has occurred. This includes all types of flushes: branch mispredicts, load/store unit flushes, partial flushes and completion flushes.

PM_FLUSH_BR_MPRED
Flush caused by branch mispredict
A flush was caused by a branch mispredict.

PM_RUN_INST_CMPL
Run Instructions
Number of PowerPC instructions completed, gated by the run latch.

14.2 Metrics

Flushes are characterized with the following metrics:

Flush_Rate(%)
Flush rate (%)
Formula: PM_FLUSH * 100 / PM_RUN_INST_CMPL

Br_Mpred_Flush_Rate(%)
Branch mispredict flushes per instruction
Formula: PM_FLUSH_BR_MPRED * 100 / PM_RUN_INST_CMPL
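As a small sketch, with invented counter values for illustration only, the two flush metrics can be computed as:

```python
# Illustrative (made-up) counter values, not measurements.
pm_flush          = 200_000     # PM_FLUSH
pm_flush_br_mpred = 150_000     # PM_FLUSH_BR_MPRED
pm_run_inst_cmpl  = 80_000_000  # PM_RUN_INST_CMPL

flush_rate_pct    = pm_flush * 100 / pm_run_inst_cmpl           # Flush_Rate(%)
br_mpred_rate_pct = pm_flush_br_mpred * 100 / pm_run_inst_cmpl  # Br_Mpred_Flush_Rate(%)

print(f"Flush_Rate(%)          = {flush_rate_pct:.4f}")
print(f"Br_Mpred_Flush_Rate(%) = {br_mpred_rate_pct:.4f}")
```

Comparing the two rates also shows what share of all flushes is due to branch mispredicts (here 150,000 of 200,000, or 75%).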


14.3 Event Groups

These events are listed in Table 14-1. The relevant events are listed alphabetically by name and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

Two event groups are needed to cover the flush events.

Table 14-1 Flush events with their corresponding event groups and brief description

Event Name        | Group Name | Event Description
PM_FLUSH          | pm_flush1  | Flush (any type)
PM_FLUSH_BR_MPRED | pm_misc9   | Flush caused by branch mispredict
PM_RUN_INST_CMPL  | pm_flush1  | Run instructions


15 L2 Read-Claim Machine Metrics

This chapter lists and describes the events, metrics and groups needed to characterize how the L2 read-claim machines for each chip are utilized across the system. Each POWER7 system can be configured for node pump or system pump mode, so these metrics provide information for tuning the system appropriately.

15.1 Descriptions of Events

There are 2 events, distributed across 2 event groups, needed for these POWER7 L2 cache statistics.

This section provides more details on each event and how it is triggered.

PM_L2_NODE_PUMP
RC req that was a local (aka node) pump attempt
The L2 read-claim machine performed a local (chip) pump attempt. This event is delivered from the L2 domain, so it must be scaled accordingly (divide by 2).

PM_L2_SYS_PUMP
RC req that was a global (aka system) pump attempt
The L2 read-claim machine performed a global (system) pump attempt. This event is delivered from the L2 domain, so it must be scaled accordingly (divide by 2).

15.2 Metrics

L2 read-claim machine behavior is characterized with the following metrics:

L2_Node_Pumps(%)
L2 node pumps as a % of all L2 pumps
Formula: PM_L2_NODE_PUMP / (PM_L2_NODE_PUMP + PM_L2_SYS_PUMP) * 100

L2_Sys_Pumps(%)
L2 sys pumps as a % of all L2 pumps
Formula: PM_L2_SYS_PUMP / (PM_L2_NODE_PUMP + PM_L2_SYS_PUMP) * 100

15.3 Event Groups

These events are listed in Table 15-1. The relevant events are listed alphabetically by name and each row includes the name of a PMC group where the event can be found and a short description. The PMC group can be used with tools like hpmcount (see Section 4.2.1) to collect several events at once, minimizing the number of runs needed to collect a complete set of data. Events can be used with perf (see Section 5.1), oprofile (see Section 5.2) or tprof (see Section 4.3).

Two event groups are needed to cover the L2 read-claim events.

Table 15-1 L2 read-claim machine events with their corresponding event groups and brief description

Event Name      | Group Name  | Event Description
PM_L2_NODE_PUMP | pm_l2_misc5 | RC req that was a local (aka node) pump attempt
PM_L2_SYS_PUMP  | pm_l2_misc3 | RC req that was a global (aka system) pump attempt
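The pump metrics of Section 15.2 can be sketched in Python as follows. Because both pump events are delivered from the L2 domain, the raw counts are halved before use, per the event descriptions above. The counts are invented for illustration only.

```python
# Illustrative (made-up) raw counts as delivered from the L2 domain.
raw_node_pump = 90_000  # PM_L2_NODE_PUMP
raw_sys_pump  = 10_000  # PM_L2_SYS_PUMP

# Scale by 1/2 per the event descriptions above.
node_pump = raw_node_pump / 2
sys_pump  = raw_sys_pump / 2

total = node_pump + sys_pump
node_pct = node_pump / total * 100  # L2_Node_Pumps(%)
sys_pct  = sys_pump / total * 100   # L2_Sys_Pumps(%)

# The factor of 2 cancels in the percentages (numerator and denominator are
# both scaled), but it matters whenever absolute pump counts are reported.
print(f"L2_Node_Pumps(%) = {node_pct:.1f}")
print(f"L2_Sys_Pumps(%)  = {sys_pct:.1f}")
```

A high node-pump percentage indicates most read-claim requests are resolved within the node, which is the kind of evidence relevant when choosing between node pump and system pump mode.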


Appendix A: Acknowledgements

Some source material was adapted from the publicly available documentation of event definitions (evs/gps files).

Thanks to M.L. Pesantez and Alex Mericas, who provided many useful comments to improve this manuscript.


Appendix B: Abbreviations Used

BRU   Branch execution Unit
CRL   Condition Register Logical pipeline
DFU   Decimal Floating-point Unit
FPU   Floating-Point Unit
FXU   Fixed-Point Unit
IFU   Instruction Fetch Unit
LSU   Load/Store Unit
MC    Memory Controller
PMC   Performance Monitoring Counters
PMU   Performance Monitoring Unit
VMX   Vector Multimedia eXtensions
VSU   Vector-Scalar Unit


Appendix C: Notices

© IBM Corporation 2010
IBM Corporation
Marketing Communications, Systems Group
Route 100, Somers, New York 10589
Produced in the United States of America
March 2010, All Rights Reserved

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:


This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

More details can be found at the IBM Power Systems home page: http://www.ibm.com/systems/p


Appendix D: Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml.

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

1350, AIX 5L, AIX®, alphaWorks®, Ascendant®, BetaWorks, BladeCenter®, CICS®, Cool Blue, DB2®, developerWorks®, Domino®, EnergyScale, Enterprise Storage Server®, Enterprise Workload Manager, eServer, Express Portfolio, FlashCopy®, GDPS®, General Parallel File System, Geographically Dispersed Parallel Sysplex, Global Innovation Outlook, GPFS, HACMP, HiperSockets, HyperSwap, i5/OS®, IBM Process Reference Model for IT, IBM Systems Director Active Energy Manager, IBM®, IntelliStation®, Lotus Notes®, Lotus®, MQSeries®, MVS, Netfinity®, Notes®, OS/390®, Parallel Sysplex®, PartnerWorld®, POWER, POWER®, POWER4, POWER5, POWER6, POWER7, PowerExecutive, Power Systems, PowerPC®, PowerVM, PR/SM, pSeries®, QuickPlace®, RACF®, Rational Summit®, Rational Unified Process®, Rational®, Redbooks®, Redbooks (logo)®, RS/6000®, RUP®, S/390®, Sametime®, Summit Ascendant, Summit®, System i, System p, System Storage, System x, System z, System z10, System/360, System/370, Tivoli®, TotalStorage®, VM/ESA®, VSE/ESA, WebSphere®, Workplace, Workplace Messaging®, X-Architecture®, xSeries®, z/OS®, z/VM®, z10, zSeries®

The following terms are trademarks of other companies:

AltiVec is a trademark of Freescale Semiconductor, Inc.

AMD, AMD Opteron, the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc.

InfiniBand, and the InfiniBand design marks are trademarks and/or service marks of the InfiniBand Trade Association.

ITIL® is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.

IT Infrastructure Library® is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce.

Novell®, SUSE®, the Novell logo, and the N logo are registered trademarks of Novell, Inc. in the United States and other countries.

Oracle®, JD Edwards®, PeopleSoft®, Siebel®, and TopLink® are registered trademarks of Oracle Corporation and/or its affiliates.

SAP NetWeaver®, SAP R/3®, SAP®, and SAP logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries.

IQ, J2EE, Java, JDBC, Netra, Solaris, Sun, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Microsoft, Windows, Windows NT, Outlook, SQL Server, Windows Server, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Intel Xeon, Intel, Itanium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

QLogic and the QLogic logo are trademarks or registered trademarks of QLogic Corporation. SilverStorm is a trademark of QLogic Corporation.

SPEC® is a registered trademark of Standard Performance Evaluation Corporation. SPEC MPI® is a registered trademark of Standard Performance Evaluation Corporation.

UNIX® is a registered trademark of The Open Group in the United States and other countries.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.
