Performance Analysis of Idle Programs - Researcher - IBM

Performance Analysis of Idle Programs

Erik Altman, Matthew Arnold, Stephen Fink, Nick Mitchell
IBM T.J. Watson Research Center
{ealtman,marnold,sjfink,nickm}@us.ibm.com

Abstract

This paper presents an approach for performance analysis of modern enterprise-class server applications. In our experience, performance bottlenecks in these applications differ qualitatively from bottlenecks in smaller, stand-alone systems. Small applications and benchmarks often suffer from CPU-intensive hot spots. In contrast, enterprise-class multi-tier applications often suffer from problems that manifest not as hot spots, but as idle time indicating a lack of forward motion. Many factors can contribute to undesirable idle time, including locking problems, excessive system-level activities like garbage collection, various resource constraints, and problems driving load.

We present the design and methodology for WAIT, a tool to diagnose the root cause of idle time in server applications. Given lightweight samples of Java activity on a single tier, the tool can often pinpoint the primary bottleneck on a multi-tier system. The methodology centers on an informative abstraction of the states of idleness observed in a running program. This abstraction allows the tool to distinguish, for example, between hold-ups on a database machine, insufficient load, lock contention in application code, and a conventional bottleneck due to a hot method. To compute the abstraction, we present a simple expert system based on an extensible set of declarative rules.

WAIT can be deployed on the fly, without modifying or even restarting the application.
Many groups in IBM have applied the tool to diagnose performance problems in commercial systems, and we present a number of examples as case studies.

Categories and Subject Descriptors: Software [Software Engineering]: Metrics, Performance Measures

General Terms: Performance

Keywords: bottlenecks, performance analysis, multi-tier server applications, idle time

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
OOPSLA/SPLASH'10, October 17–21, 2010, Reno/Tahoe, Nevada, USA.
Copyright © 2010 ACM 978-1-4503-0203-6/10/10...$10.00

1. Introduction

Shun idleness. It is a rust that attaches itself to the most brilliant metals. (Voltaire)

This paper addresses the challenges of performance analysis for modern enterprise-class server applications. These systems run across multiple physical tiers, and their software comprises many components from different vendors and middleware stacks. Many of these applications support a high degree of concurrency, serving thousands or even millions of concurrent user requests. They support rich and frequent interactions with other systems, with no intervening human think time. Many server applications manipulate large data sets, requiring substantial network and disk infrastructure to support bandwidth requirements.

With these requirements and complexities, such applications face untold difficulties when attempting to scale for heavy production loads.
In our experience with dozens of industrial applications, every individual deployment introduces a unique set of challenges, due to issues specific to a particular configuration. Any change to key configuration parameters, such as machine topology, application parameters, code versions, and load characteristics, can cause severe performance problems due to unanticipated interactions.

Part of the challenge arises from the sheer diversity of potential pitfalls. Even a single process can suffer from any number of bottlenecks, including concurrency issues from thread locking behavior, excessive garbage collection load due to temporary object churn [9, 21, 24], and saturating the machine's memory bandwidth [24]. Any of these problems may appear as a serialization bottleneck, in that the application fails to use multiple threads effectively; however, one must drill down further to find the root cause. Other problems can arise from limited capacity of physical resources, including disk I/O and network links. A load balancer may not effectively distribute load to application clones. When performance testing, testers often encounter problems generating load effectively. In such cases, the primary bottleneck may be processor or memory saturation on a remote node, outside the system under test.

Furthermore, many profiling and performance understanding tools are inappropriate for commercial server environments. Many tools [2, 3, 5, 6, 11, 13, 14, 16, 18, 19] rely on restarting or instrumenting an application, which is often forbidden in commercial deployment environments. Similarly, many organizations will not deploy any unapproved monitoring agents, nor tolerate any significant perturbation of the running system.
In practice, diagnosing performance problems under such constraints resembles detective work, where the analyst must piece together clues from incomplete information.

Addressing performance analysis under these constraints, we present the design and methodology of a tool called WAIT. WAIT's methodology centers on Idle Time Analysis: rather than analyze what an application is doing, the analysis focuses on explaining idle time. Specifically, the tool tries to determine the root cause that leads to under-utilized processors. Additionally, the tool must present information in a way that is easily consumable, and operate under restrictions typical of production deployment scenarios.

The main contributions of this paper include:

• a hierarchical abstraction of execution state: We present a novel abstraction of concrete execution states, which provides a hierarchical


Figure 3. From information that includes monitor states and stack sample context inferences, WAIT infers a Wait State for every sample.

level details. In this way, the Wait States serve the same purpose as the conventional run states, but provide a richer semantics that helps identify bottlenecks. Figure 3 gives example inferences of Wait States. For the stack in Figure 4, the analysis would infer the Wait State "Network", as the stack matches the pattern shown in the middle example of Figure 3.

For each stack frame, we compute an abstraction called a Category that represents the code or activity being performed by that method invocation. For example, a method invocation could indicate application-level activity such as Database Query, Client Communication, or JDBC Overhead. We further label each stack sample with a Primary Category which best characterizes the activity being performed by an entire stack at the sampled moment in time.

The abstract model of activity forms a hierarchy: a Wait State gives a coarse but meaningful abstraction of a class of behaviors. For more information, one can "drill down" through the model to see finer distinctions based on Primary Categories, stacks of Categories, and finally concrete stack samples without abstraction. This hierarchy provides a natural model for an intuitive user interface, which provides a high-level overview and the ability to drill down through layers of abstraction to pinpoint relevant details.

Tool. We have implemented a tool based on the above abstraction, which is deployed as a service within IBM. The tool computes the abstraction described above based on a simple set of rules, defined declaratively by an expert based on knowledge of common methods in standard library and middleware stacks.
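The hierarchy described above might be modeled, in miniature, as follows. This is our own flattened simplification for illustration, not WAIT's actual code; the real Wait State tree of Figure 5(a) is richer, and the state and field names here are assumptions:

```java
import java.util.List;

// Minimal sketch of the hierarchical abstraction: each sampled thread
// maps to a Wait State plus a stack of Categories, one per frame, with
// a primary Category characterizing the whole stack. Names are ours.
public class AbstractState {
    // A flattened selection of Wait States from Figure 5(a).
    enum WaitState { RUNNABLE, WAITING, NATIVE_UNKNOWN,
                     AWAITING_DATA, CONTENTION, SLEEPING }

    final WaitState waitState;       // coarse progress status
    final List<String> categories;   // one label per frame, e.g. "JDBC Overhead"
    final String primaryCategory;    // best single characterization of the stack

    AbstractState(WaitState w, List<String> cats, String primary) {
        this.waitState = w;
        this.categories = cats;
        this.primaryCategory = primary;
    }

    public static void main(String[] args) {
        // A stack like Figure 4's might abstract to something like:
        AbstractState s = new AbstractState(
            WaitState.AWAITING_DATA,
            List.of("Network", "Database Query", "JDBC", "Servlet"),
            "JDBC");
        System.out.println(s.waitState + " / " + s.primaryCategory);
    }
}
```

Drilling down then means moving from `waitState`, to `primaryCategory`, to the full `categories` stack, and finally to the concrete sample.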
We present some statistics and case studies that indicate the methodology is practical, and successfully identifies diverse sources of idle time.

3. Hub Sampling

WAIT relies on samples of processor activity and of the state of threads in a JVM. The system typically takes samples from the hub process (e.g., application server) of a multi-tier application, but can also collect data from any standard Java environment. Despite collecting no data from the other tiers, information from a hub process frequently illuminates multi-tier bottlenecks.

We next describe how WAIT collects information from a Java hub, and discuss challenges due to limitations in available data.

3.1 Sampling Mechanisms

We have found a low barrier to entry to be a first-order requirement for tools in this space. Many, if not most, potential users will reject any changes to deployment scripts, root permissions, kernel changes, specific software versions, or specialized monitoring agents. Instead, WAIT collects samples of processor utilization, process utilization, and snapshots of Java activity using built-in mechanisms that are available on nearly every deployed Java system.
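As a concrete illustration, the per-platform built-in commands (vmstat, ps, typeperf, tasklist, kill -3) might be assembled along these lines. The exact flags below are our illustrative choices, not WAIT's actual invocations:

```java
import java.util.List;

// Illustrative mapping from (platform, granularity) to built-in
// sampling commands. Flags are our assumptions; WAIT's actual
// invocations may differ.
public class SamplingCommands {
    static List<String> utilizationCommand(boolean windows, boolean perProcess, long pid) {
        if (windows) {
            return perProcess
                ? List.of("tasklist", "/FI", "PID eq " + pid)
                : List.of("typeperf", "\\Processor(_Total)\\% Processor Time", "-sc", "1");
        }
        return perProcess
            ? List.of("ps", "-p", Long.toString(pid), "-o", "pcpu=")
            : List.of("vmstat", "1", "2");
    }

    // A javacore snapshot: SIGQUIT (signal 3) does not kill the JVM.
    static List<String> javaStateCommand(long pid) {
        return List.of("kill", "-3", Long.toString(pid));
    }

    public static void main(String[] args) {
        System.out.println(utilizationCommand(false, true, 1234));
        System.out.println(javaStateCommand(1234));
    }
}
```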
Table 1 summarizes the mechanisms by which the system collects data.

Figure 4. Part of a 105-deep call stack, from the DayTrader benchmark (a configuration of which is now part of the DaCapo Suite). Representative frames, leaf first:

java/net/SocketInputStream.socketRead0
java/net/SocketInputStream.read line 140
com/ibm/db2/jcc/t4/Reply.fill line 195
...
com/ibm/db2/jcc/am/PreparedStatement.executeQuery line 625
com/ibm/ws/rsadapter/jdbc/WSJdbcPreparedStatement.executeQuery line 707
org/apache/openjpa/lib/jdbc/DelegatingPreparedStatement.executeQuery line 264
...
org/apache/geronimo/samples/daytrader/ejb3/TradeSLSBBean.getQuote line 400
org/apache/geronimo/samples/daytrader/TradeAction.getQuote line 376
com/ibm/_jsp/_displayQuote._jspService line 92
...
org/apache/geronimo/samples/daytrader/web/TradeAppServlet.doGet line 78
javax/servlet/http/HttpServlet.service line 711


data                 | UNIX    | Windows
machine utilization  | vmstat  | typeperf
process utilization  | ps      | tasklist
Java state           | kill -3 | sendsignal

Table 1. The built-in mechanisms we use to sample the Java hub. Note that kill -3 does not terminate the signaled process; 3 is the numeric code for SIGQUIT.

sampling interval (seconds) | throughput slowdown
1                           | 59%
5                           | 24%
10                          | 13%
20                          | 8%
30                          | 2%
1000                        | unmeasurable

Table 2. Throughput perturbation versus sampling interval, from a document processing system running IBM's 1.5 JVM.

The system can also produce meaningful results with partial data. In practice, data sometimes arrives corrupted or prematurely terminated, due to a myriad of problems. For example, the target machine may run out of disk space while writing out data, the target JVM may have bugs in its data collection, or there may be simple user errors. If any of the sources of data described are incomplete, the system will produce the best possible analysis based on the data available.

Processor Utilization. Most operating systems support non-intrusive processor utilization sampling. We attempt to collect time series of processor utilization at three levels of granularity: (1) for the whole machine, (2) for the process being monitored, and (3) for the individual threads within the process. For example, on UNIX platforms we use vmstat and ps to collect this data.

Java Thread Activity. To monitor the state of Java threads, we rely on the support built into most commercial JVMs to dump "javacore" files. Our implementation currently supports (parses) the javacore format produced by IBM JVMs, and contains preliminary support for the HotSpot JVM. This data, originally intended to help diagnose process failures and deadlock, can be sampled by issuing a signal to a running JVM process.¹
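In simplified form, extracting thread samples from such a dump can be sketched as below. The line format here is a hypothetical simplification for illustration; real javacore files have a much richer grammar, and all names in this sketch are our own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified parser for thread entries in a javacore-like
// dump: extract each thread's name, reported run state, and stack frames.
public class JavacoreSketch {
    public record ThreadSample(String name, String runState, List<String> frames) {}

    // Assumed line shape: "ThreadName" state:Blocked, then "  at pkg/Class.method"
    private static final Pattern HEADER =
        Pattern.compile("^\"(?<name>[^\"]+)\"\\s+state:(?<state>\\w+)");

    public static List<ThreadSample> parse(String dump) {
        List<ThreadSample> out = new ArrayList<>();
        ThreadSample cur = null;
        for (String line : dump.split("\n")) {
            Matcher m = HEADER.matcher(line);
            if (m.find()) {
                cur = new ThreadSample(m.group("name"), m.group("state"), new ArrayList<>());
                out.add(cur);
            } else if (cur != null && line.startsWith("  at ")) {
                cur.frames().add(line.substring(5).trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String dump = "\"WebContainer : 0\" state:Blocked\n"
                    + "  at java/net/SocketInputStream.socketRead0\n"
                    + "  at com/ibm/db2/jcc/t4/Reply.fill\n";
        ThreadSample t = parse(dump).get(0);
        System.out.println(t.name() + " " + t.runState() + " " + t.frames().size());
    }
}
```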
Upon receiving this signal, the JVM stops running threads, and then writes out the information specified in Figure 2. The JVM forces threads to quiesce using the same "safepoint" mechanism that is used by other standard JVM mechanisms, such as the garbage collector.

IBM JVMs can produce javacore samples with fairly low perturbation. For a large application with several hundred threads with deep call stacks, writing out a javacore file can pause the application for several hundred milliseconds. As long as samples occur infrequently, writing javacores has a small effect on throughput, but an unavoidable hit on latency for requests in flight at the moment of sample. Table 2 shows that samples taken once every 30 seconds result in a 2% slowdown. For server applications, where operations repeat indefinitely, perturbation can be dialed down as low as needed, at least concerning throughput.

When the hub of a multi-tier application spans multiple processes, possibly running on multiple machines, we currently choose one at random. In the future, we will tackle multi-process environments more generally, including cloud environments. However, we note that in many enterprise workloads, the bottlenecks facing all JVMs are similar, and thus our approach provides substantial benefit even now.

¹ There are some nominal differences between this format and that provided by a Sun JVM. Also, Sun provides a finer granularity for data acquisition, via the jstack, jstat, and related family of commands.

3.2 Difficulties with Thread States

The run states provided by the JVM and operating system are often inconsistent or imprecise, due to several complications. The first problem is that many JVM implementations quiesce threads at safepoints before dumping the javacore. Threads that are already quiesced (e.g., waiting to acquire a monitor) will be reported correctly as having a conventional run state of Blocked.
However, any thread that was Runnable before triggering the dump will be reported to have a false run state of CondWait, since the thread was stopped by the JVM before writing the javacore file.

The boundary between the JVM and the operating system introduces further difficulties with thread run states. The JVM and OS each track the run state of a thread. The JVM may think a thread is Blocked, while the OS reports the same thread Runnable, in the midst of executing a spinlock. Spinning is sometimes a detail outside the JVM's jurisdiction, implemented in a native library called by the JVM. Similarly, the JVM may report a thread in a CondWait state, even though the thread is executing system code such as copying data out of network buffers or traversing directory entries in the filesystem implementation.

Even if conventional run states were perfectly accurate, they often help little in diagnosing the nature of a bottleneck. Consider the conventional CondWait run state. One such thread may be waiting at a join point, in a fork-join style of parallelism. Another thread, with the same CondWait run state, may be waiting for data from a remote source, such as a database. A third such thread may be a worker thread, idle only for want of work.

For these reasons, we instead compute on a richer thread state abstraction that distinguishes between these different types of states.

4. A Hierarchical Abstraction of Execution State

WAIT's analysis maps concrete program execution states into an abstract model, designed to illuminate root causes of idle time. In this section, we describe the concepts in the abstraction.
Section 5 later describes how the analyzer computes the abstraction, and how details of the abstraction hierarchy arise from a declarative specification.

The analysis maps each sampled thread into an abstract state, which consists of a pair of elements called the Wait State and a stack of Categories. A Wait State encapsulates the status of a thread regarding its potential to make forward progress, while a Category represents the code or activity being performed by a particular method invocation.

We next describe the hierarchical abstraction in more detail.

4.1 The Wait State Abstraction

The Wait State abstraction groups thread samples, assigning each sample a label representing common cases of forward progress (or the lack thereof). Figure 5(a) shows the hierarchy of Wait States, which cover all possible concrete thread states. The analysis maps each concrete thread sample into exactly one node in the tree.

The hierarchy has proven to be stable over time; we have not needed to add additional states beyond those shown in the hierarchy of Figure 5(a), even as usage of the WAIT tool has expanded to include diverse workloads.

At the coarsest level, the Wait State of a sampled thread indicates whether that thread is currently held up or making forward progress: Java threads may be either Waiting or Runnable. A third


Figure 5. (a) The Wait State Tree; (b) The Category Tree. Every concrete state, a sampled thread, maps to a pair of abstract states: a Wait State and a stack of Categories (one per frame). Every frame in a call stack has a default name based on the invoked package; e.g., com.MyBank.login() would be named MyBank Code. [Tree diagrams not reproduced.]

possibility covers a thread executing native (non-Java) code that the tool fails to characterize, in which case the thread is assigned Wait State Native Unknown.

For Java threads, the analysis partitions Waiting and Runnable into finer abstractions, which convey more information regarding sources of idle time. For example, a Waiting thread might be waiting for data from some source (Awaiting Data), blocked on lock contention (Contention), or have put itself to sleep (Sleeping). As shown in the figure, finer distinctions are also possible. Consider a Sleeping thread: this could be part of a polling loop that contains a call to Thread.sleep (Poll); it could be the join in a fork-join style of parallelism (Join); or it could be that the thread is a worker in a pool of threads, and is waiting for new work to arrive in the work queue (Awaiting Notification).

We claim that in many cases, distinctions in Wait States give a good first approximation of common sources of idle time in server applications. Furthermore, we claim that differences in Wait States indicate fundamentally different types of problems that lead to idle time. A server application suffering from low throughput due to insufficient load would have many threads in the Awaiting Notification state. The solution to this problem might, for example, be to tune the load balancer. A system that improperly uses Thread.sleep suffers from a problem of a completely different nature.
Similarly, having a preponderance of threads waiting for data from a database has radically different implications on the system health than many threads, also idle, suffering from lock contention.

Thus, the Wait State gives a high-level description of the root cause of idle time in an application. The second part of the abstraction, the Category stack, gives a finer abstraction for pinpointing root causes.

Figure 6. The call stack is mapped to a stack of Categories, from which a primary Category (Database Communication) is chosen.

4.2 The Category Abstraction

The Category abstraction assigns each stack frame a label representing the code or activity being performed by that method invocation. Category names provide a convenient abstraction that summarizes nested method invocations that implement larger units of functionality [10].

Note that since each stack frame maps to a Category, each stack will contain representatives from several Categories. To understand the behavior of many stacks at a glance, it is useful to assign each stack a primary Category, which represents the Category which provides the best high-level characterization of the activity of the entire stack. For example, in Figure 4, the JDBC Category would be chosen as the primary Category, based on priority logic that determines that the JDBC Category label conveys more information


com/mysql/jdbc/ConnectionImpl.commit ⇒ Database Commit
.../XATransactionWrapper.rollback ⇒ Database Rollback
.../SQLServerStatement.doExecuteStatement ⇒ Database Query
.../WSJdbcStatement.executeBatch ⇒ Database Batch
.../OracleResultSetImpl.next ⇒ Database Cursor

Figure 9. Rules that define aspects of the Database Category.

The clusters table in Figure 8 reveals that cluster c1 occurred 3592 times across all application samples. Viewing table blocked in Figure 8 indicates that in the first and last application samples, cluster c1 was waiting to enter the critical section guarded by monitor m2. Viewing the owned by table in turn reveals that m2 was owned by cluster c2. In other words, thread stack c1 was blocked on a monitor held by thread stack c2.

5.2 Category Analysis

WAIT relies on a simple pattern-matching system, driven by a set of rules characterizing well-known methods and packages, to determine the Category label for each stack frame. The system relies on a simple declarative specification of textual patterns that define Categories.

The declarative rules that define the Category analysis define two models. The first model is a Category Tree, such as the one shown in Figure 5(b). A Category Tree provides the namespace, inheritance structure, and prioritization of the Category abstractions that are available as method labels. The second model is a set of rules. Each rule maps a regular expression over method names to a node in the Category Tree. For example, Figure 9 shows rules that, in part, define Database activity. The rules distinguish between five aspects of database activity: queries, batch queries, commits, rollbacks, and iteration over result sets.
This example illustrates how it is easy to define a Category Tree that is more precise than the one shown in Figure 5(b).

Given these rules, the Category analysis is simple and straightforward. The analysis engine iterates over every frame of every call stack cluster, looking for the highest-priority rule that matches each frame. As described in Section 4, every method has an implicit Category: its package, which is assigned to the Category's Code Nickname. Thus, if no Category rule applies to a frame, then we form an ad hoc Category for that frame: a method P1/P2/P3/.../Class.Method receives the Code Nickname P2 Code.

5.3 Wait State Analysis

In addition to inferring Categories, WAIT infers a Wait State as illustrated in Figure 5(a). Our analysis to infer Wait States combines three sources of information: processor utilization, the concrete data model previously presented in Figure 8, and rules based on method names. The rules over method names, in some cases, require the inspection of multiple frames in a stack. This differs from the Category analysis, where each frame's Category is independent of other frames.

The main challenge in using method names to infer a Wait State concerns handling imperfect knowledge of an application's state. As discussed in Section 3.2, the true Wait State of a sampled thread is, in many cases, not knowable. To fill this knowledge gap, we use expert knowledge about the meaning of activities, based on method names. Fortunately, many aspects of Wait States depend on the meaning of native methods, and the use of native methods does not vary greatly from application to application.
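The Category machinery of Section 5.2 above — pattern rules (here, those of Figure 9) plus the package-based Code Nickname fallback — can be sketched as follows. The matching engine is our simplification; WAIT prioritizes rules via the Category Tree rather than list order:

```java
import java.util.List;
import java.util.regex.Pattern;

// Sketch of Category analysis: rules map regular expressions over
// method names to Category nodes (patterns follow Figure 9); unmatched
// frames fall back to the "P2 Code" nickname. Engine code is ours.
public class CategoryAnalysis {
    record Rule(Pattern pattern, String category) {}

    static final List<Rule> RULES = List.of(
        new Rule(Pattern.compile(".*ConnectionImpl\\.commit"), "Database Commit"),
        new Rule(Pattern.compile(".*XATransactionWrapper\\.rollback"), "Database Rollback"),
        new Rule(Pattern.compile(".*Statement\\.doExecuteStatement"), "Database Query"),
        new Rule(Pattern.compile(".*Statement\\.executeBatch"), "Database Batch"),
        new Rule(Pattern.compile(".*ResultSetImpl\\.next"), "Database Cursor"));

    // First matching rule wins in this simplification.
    static String categorize(String frame) {
        for (Rule r : RULES)
            if (r.pattern().matcher(frame).matches()) return r.category();
        return nickname(frame);  // ad hoc fallback Category
    }

    // P1/P2/.../Class.Method receives the Code Nickname "P2 Code".
    static String nickname(String frame) {
        String[] parts = frame.split("/");
        return (parts.length >= 2 ? parts[1] : parts[0]) + " Code";
    }

    public static void main(String[] args) {
        System.out.println(categorize("com/mysql/jdbc/ConnectionImpl.commit"));
        System.out.println(categorize("com/MyBank/login"));  // no rule: nickname
    }
}
```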
The design of Java seems actively to discourage the use of native methods. Nevertheless, our conclusions are always subject to the inevitable holes in the rule set.

The algorithm proceeds as a sieve, looking for the Wait State that can be inferred with the most certainty. The algorithm uses data from the concrete data model presented in Figure 8, as well as a set of rules over method names, specified declaratively with patterns (analogous to the Category Analysis).

The Wait State of a given call stack cluster c at sample index i is the first match found when traversing the following conditions, in order:

1. Deadlock: if this stack cluster participates in a cycle of lock contention in the monitor graph; i.e., there is a cycle in the Blocked and Owned By relations.

2. Lock Contention: if the stack cluster, at the moment in time of sample i, has an entry in the Blocked relation.

3. Awaiting Notification: if the stack cluster, at the moment in time of sample i, has an entry in the Waiting relation.

4. Spinlock: if the Wait State rule set defines a method in c that, with high certainty, implies the use of spinlocking. Many methods in the java.util.concurrent library fall in this high-certainty category.

5. Awaiting Data from Disk, Network: if the rule set matches c as a use of a filesystem or a network interface. Most such rules need only inspect the leaf invocation of the stack; e.g., a socketRead native invocation is a very good indication that this stack sample cluster is awaiting data from the network.² In some cases, requests for data are dispatched to a "stub/tie" method, as is common in LDAP or ORB implementations.

6. Executing Java Code: if the method invoked by the top of the stack is not a native method, then c is assumed to be Runnable, executing Java code.
Executing Native Code: if the method invoked by the top of the stack is a native method, and the rule set asserts that this native method is truly running, then we infer that the native method is Runnable. We treat native and Java invocations asymmetrically to increase robustness. A Java method, unless it participates in the monitor graph, is almost certain to be Runnable. The same cannot be said of native methods. In our experience, native methods more often than not serve the role of fetching data rather than executing code. Therefore, we require native methods to be whitelisted in order to be considered Runnable.

8. JVM Services: if c has no call stack, it is assumed to be executing native services. Compilation and garbage collection threads, spawned by the JVM itself, fall into this category. Even though these samples have no call stacks, and unreliable thread states, they participate in the monitor graph. Thus, unless they are observed to be in a Contention or Awaiting Notification state, we assume they are Runnable, executing JVM Services.

9. Poll, IOWait, Join Point: if there exists a rule that describes the native method at the top of the stack as one of these variants of Sleeping.

10. NativeUnknown: any call stack cluster with a native method at the top of the stack and not otherwise classified is placed into the NativeUnknown Wait State. This classification is in contrast to call stack clusters with Java leaf invocations, which are assumed to be Runnable. For robustness, the algorithm requires call stacks with native leaf invocations to be specified by rules to be in some particular Wait State. This allows tool users to quickly spot deficiencies in the rule set. In practice, we

2 This is likely, but not an absolute certainty.
The JVM and operating system spend some time copying network buffers, etc., as part of fetching remote data. We are currently working to incorporate native stacks into the rule set, which will allow for a more refined Wait State inference.
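The sieve above amounts to a chain of guarded returns. The following is a hypothetical, boiled-down sketch (the class, method, and flag-based interface are ours, not WAIT's): each of the ten conditions is assumed to have been pre-evaluated against the monitor graph and the rule set, and passed in as a flag.

```java
// Hypothetical sketch of the ten-step Wait State sieve. The real analysis
// consults the concrete data model and the declarative rules; here each
// condition arrives pre-evaluated, and the first match wins.
class WaitStateSieve {
    static String classify(boolean inDeadlockCycle,   // cycle in Blocked/Owned By
                           boolean blocked,           // entry in Blocked relation
                           boolean waiting,           // entry in Waiting relation
                           boolean spinRuleMatch,     // spinlock rule fired
                           boolean ioRuleMatch,       // disk/network rule fired
                           boolean hasStack,          // cluster has a call stack
                           boolean leafIsNative,      // top frame is native
                           boolean nativeWhitelisted, // native leaf known Runnable
                           boolean sleepRuleMatch) {  // poll/iowait/join rule fired
        if (inDeadlockCycle)               return "Deadlock";
        if (blocked)                       return "Lock Contention";
        if (waiting)                       return "Awaiting Notification";
        if (spinRuleMatch)                 return "Spinlock";
        if (ioRuleMatch)                   return "Awaiting Data";
        if (hasStack && !leafIsNative)     return "Executing Java Code";
        if (hasStack && nativeWhitelisted) return "Executing Native Code";
        if (!hasStack)                     return "JVM Services";
        if (sleepRuleMatch)                return "Sleeping"; // Poll, IOWait, Join Point
        return "NativeUnknown";            // native leaf, no rule matched
    }
}
```

Note how the ordering encodes the asymmetry described in step 7: a Java leaf falls through to Runnable by default, whereas a native leaf must be whitelisted or matched by a sleep rule, else it lands in NativeUnknown.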

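The deadlock test in step 1 is a cycle search over the monitor graph. A minimal sketch, under our own assumptions about the representation (thread-to-monitor edges from the Blocked relation, monitor-to-thread edges from Owned By):

```java
import java.util.*;

// Hypothetical monitor-graph cycle check for the Deadlock condition:
// a thread points at the monitor it is blocked on, and a monitor points
// at the thread that owns it; a cycle through the start node is deadlock.
class MonitorGraph {
    private final Map<String, List<String>> edges = new HashMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** True if following Blocked/Owned By edges from start returns to start. */
    boolean inCycle(String start) {
        return dfs(start, start, new HashSet<>());
    }

    private boolean dfs(String node, String target, Set<String> seen) {
        for (String next : edges.getOrDefault(node, List.of())) {
            if (next.equals(target)) return true;              // closed the cycle
            if (seen.add(next) && dfs(next, target, seen)) return true;
        }
        return false;
    }
}
```

For example, thread t1 blocked on monitor m1, owned by t2, which is blocked on m2, owned by t1, forms the classic two-party deadlock that this check detects.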

java/net/SocketInputStream.socketRead0 ⇒ Network
com/ibm/as400/NativeMethods.socketRead ⇒ Network
org/apache/.../OSNetworkSystem.write ⇒ Network
sun/nio/ch/SocketChannelImpl.write0 ⇒ Network
java/lang/Object.wait ⇒ %Condvar
java/util/concurrent/locks/ReentrantLock ⇒ %Condvar

(a) Declarative Rules Based on One Leaf Invocation

%Condvar & ObjectInputStream.readObject ⇒ Network

(b) A Declarative Rule Based on Several Frames in a Stack

Figure 10. Some example Wait State rules. (a) Simple rules specify a Wait State (e.g., Network) or an auxiliary tag (e.g., %Condvar) based on a single method frame. (b) Complex rules can use conjunctions of antecedents to match stacks which satisfy multiple conditions at different layers in the stack.

Wait State Rule                    # Rules
Waiting on Condition Variable        26
Native Runnable                      22
Awaiting Data from Disk, Network     16
Spinlock                             12

Table 3. The 76 Wait State rules, grouped according to Wait State. Only 5 of these rules use semantics from outside the Standard Library and Apache Commons.

have found that our handling of native methods is robust enough that this state fires rarely.

Rules for Wait States The syntax for declaring Wait State rules is more general than that for Category rules, which depend on exactly one method name.
In particular, rules can specify antecedents which depend on a conjunction of frame patterns appearing together in a single stack, as illustrated in Figure 10.

For convenience, the rules engine allows a declarative specification of tags: auxiliary labels which can be attached to a stack to encode information consumed by other rules which produce Wait States.

Figure 10(b) shows an example that matches against two frames, and relies on an auxiliary tag (%Condvar) which labels various manifestations of waiting on a condition variable.

5.4 Rule Coverage

A key hypothesis in any rule-based system is that a stable, and hopefully small, set of rules can achieve good coverage on a range of diverse inputs. This section evaluates this hypothesis for the WAIT rule-based analysis.

WAIT has been live within IBM for 10 months, and over that time has received over 1,200 submissions, 830 of which were submitted by 151 unique users who were not involved in any way with WAIT development. The submissions came from multiple divisions within IBM and cover a wide range of Java applications, including document processing systems, e-commerce systems, business analytics, software configuration and management, mashups, object caches, and collaboration services. To handle these reports we have encoded a total of 76 Wait State rules.

This set of rules has proven to be extremely stable. When our tool sees code from a new framework, these rules must change only to the extent that the framework interacts with the world outside of Java in new ways. In practice, we rarely see new flavors of crossings between Java application code and the native environment.
In the case of a new cryptographic library, or a new in-Java caching framework, none of these Wait State rules need change.

(a) Rules, by Category
Category               # Rules
Database                  72
Administrative            59
Client Communication      41
Disk, Network I/O         46
Waiting for Work          30
Marshalling               30
JEE                       22
Classloader               13
Logging                   12
LDAP                       6

(b) Database, by provider
Category    # Rules
DB2            18
MySQL          14
Oracle         12
Apache          8
SqlServer       6

Table 4. It typically takes only a small number of rules to cover the common Categories.

                 # Reports   # Thread Stacks   Category Rule   Package Nickname
WAIT team             378        605,460            87.7%            12.2%
External users        830      1,391,033            77.3%            22.7%
Total                1208      1,996,493            80.4%            19.5%

Table 5. Rule coverage statistics on WAIT reports submitted. Data shows what percent of thread stacks are labeled explicitly by the Category rules, vs. falling back to a nickname based on the Java package.

For the Category analysis we have found that only a small number of rules are necessary to capture a wide range of Categories. Table 4a characterizes most of the 387 Category rules that we have currently defined. For example, our current rule set covers five common JDBC libraries, including IBM DB2 and Microsoft SqlServer, with only 72 rules. The number of rules specific to a particular JDBC implementation lies on the order of 10–20, as shown in Table 4b. We have also found the rules to be stable across versions of any one implementation. For example, the same set of rules covers all known versions and platforms of the DB2 JDBC driver that we have tested.
This testing includes three versions of the code and four platforms.

It is difficult to prove that the existing rules are "sufficient" other than to observe the eventual success of the tool within IBM. One concrete metric to assess rule coverage is to observe how often a thread stack does not match any Category, and thus defaults to a nickname based on the Java package. Table 5 presents this metric for the WAIT submissions we have received so far. The 1208 submissions contain 1,996,493 thread samples; 80.5% of these stacks are labeled by our Category analysis. The remaining 19.5% fall back to being named by package. Ignoring the 378 submissions by WAIT team members, the coverage drops only slightly, to 77.3% of stacks labeled by rules and 22.7% by package name.

6. The WAIT Tool

Based on the abstractions and analyses presented throughout the paper, we have implemented WAIT as a software-as-a-service deployed in IBM. In this section, we describe the overall design of the tool, present the user interface, and discuss some implementation choices.
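The overall coverage figure follows from the per-group numbers in Table 5; a small arithmetic sketch (class and method names are ours) reproduces it, and shows that the 80.4% in the table and the 80.5% in the text are the same quantity under different rounding:

```java
// Consistency check for Table 5: weight each group's Category-rule
// coverage rate by its thread-stack count, then divide by total stacks.
class CoverageTotals {
    static double totalRuleCoverage() {
        double teamStacks = 605_460,   teamRate = 0.877;  // WAIT team row
        double extStacks  = 1_391_033, extRate  = 0.773;  // External users row
        double labeled = teamStacks * teamRate + extStacks * extRate;
        return labeled / (teamStacks + extStacks);        // about 0.8045
    }
}
```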


6.1 WAIT Tool Architecture

We designed WAIT with a primary requirement of a low barrier to entry: the tool must be simple and easy to use. Any complex install procedure or steep learning curve will inhibit adoption, particularly when targeting large commercial systems deployments.

To meet this goal, we implemented WAIT as a service. Using WAIT involves three steps:

1. Collect one or more javacores. This can be done manually, or by using a data collection script we provide that collects machine utilization and process utilization, as discussed in Section 3.1.

2. Upload the collected data to the WAIT server through a web interface.

3. View the report in a browser.

A service architecture offers the following advantages:

• Zero-install. The user can use WAIT without having to install any software.

• Easy to collaborate. A WAIT report can be shared and discussed by forwarding a single URL.

• Incrementally refined knowledge base. By having access to the data submitted to the WAIT service, the WAIT team can monitor the reports being generated and continually improve the knowledge base when existing rules prove insufficient. This incremental refinement approach has proved particularly beneficial during the early stages of development and allowed us to release the tool far earlier than would have been possible with a standalone tool.

• Cross-report analysis. Having access to a large number of reports allows looking for trends that may not stand out clearly in a single report.

The main disadvantage of a service-based tool is that it requires a network connection to the server, which is sometimes not available when diagnosing a problem at a customer site. There may also be privacy concerns with uploading the data to a central server, although these concerns are mitigated by hosting the server behind a corporate firewall.
One IBM organization has deployed a clone of the WAIT service on their own server, to satisfy stricter privacy requirements.

6.2 User Interface

Figure 11 shows a screenshot of a WAIT report being viewed in Mozilla Firefox. The report is intended to be scanned from top to bottom, as this order aligns with the logical progression of questions an expert would likely ask when diagnosing a performance problem.

Activity Summary The top portion of the report presents a high-level view of the application's behavior. The pie charts on the left present data averaged over the whole collection period, while timelines on the right show how the behavior changed over time. The top row shows the machine utilization during the collection period, breaking down the activity into four possible categories: Your Application (the Java program being monitored), Garbage Collection, Other Processes, and Idle. This overview appears first in the WAIT report because it represents the first property one usually checks when debugging a performance problem. In this particular report, the CPU utilization drops to zero roughly 1/3 of the way through the collection period, a common occurrence when problems arise in a multi-tier application.

The second and third rows report the Wait State (described in Section 4.1) of all threads found running in the JVM. The second row shows threads that are Runnable, while the third row shows threads that are Waiting. Each bar in the timeline represents the data from one javacore. This example shows as many as 65 Runnable threads for the first 8 javacores taken, at which point all runnable activity ceased, and the number of Waiting threads shot up to 140, all in Wait State Delayed by Remote Request.3
Skimming the top portion of the WAIT report enables a user to quickly see that CPU idle time follows from threads backing up in a remote request to another tier.

Category Viewer The lower left-hand pane of the WAIT report shows a breakdown of the most active Categories (technically, primary Categories from Section 4.2) executing in the application. Clicking on a pie slice or bar in the charts above causes the Category pane to drill down, showing the Category breakdown for the Wait State that was clicked.

This report shows that all but one of the threads in Wait State Delayed by Remote Request were executing Category Getting Data from Database. This indicates that the source of this problem stems from the database becoming slow or unresponsive. WAIT does not currently support further detailed diagnosis of problems from the database tier itself; WAIT's utility stems from the ease with which the user can narrow down the problem to the database, without having even looked at logs from the database machine.

Stack Viewer Glancing at the commonly occurring Wait States and Category activity often suffices to rapidly identify bottlenecks; however, WAIT provides one additional level of drilldown. Selecting a bar in the report opens a stack viewer pane to display all call stacks that match the selected Wait State and Category. Stacks are sorted by most common occurrence to help identify the most important bottlenecks. Having full stack samples available has proven valuable not only for understanding performance problems, but for fixing them.
The stacks allow mapping back to source code with full context and the exact lines of code where the backups are occurring. Passing this information on to the application developers is often sufficient for them to identify a fix.

The presence of thread stacks makes WAIT useful not only for analyzing waiting threads, but also for identifying program hot spots. Clicking on the Runnable Threads pie slice causes the Stack Viewer to display the most commonly occurring running threads. Browsing the top candidates often produces surprising results, such as seeing "logging activity" or "date formatter" appear near the top, suggesting wasted cycles and easy opportunities for streamlining the code.

Discussion The WAIT user interface takes an intentionally minimalistic approach, striving to present a small amount of semantically rich data to users rather than overloading them with mountains of raw data. Although the input to WAIT is significantly lighter weight and more sparse than that of many other tools, particularly those based on tracing, we have found that WAIT is often more effective for quick analysis of performance problems.

The pairing of WAIT's analyses together with drilldown to full stack traces has proven to be a powerful combination. The WAIT abstractions guide the user's focus in the right direction, and present a set of concrete thread stacks that can be used to confirm the hypotheses. If the WAIT analysis is incorrect, or is simply not trusted by a skeptical user, viewing the stacks quickly confirms or denies their suspicion.

3 This label corresponds to the Awaiting Data node in Figure 5(a). The UI does not always present the same text as used to define the underlying model, but the mapping is straightforward. For the remainder of this paper,
For the remainder <strong>of</strong> this paper,when discussing the UI, we simply refer to text labels as presented in theUI.


Figure 11. Example WAIT Report


Application   Samples   Threads Sampled   Zipped Input   Clustered Output
A1                  1         228              156kB            50kB
A2                  1         206              447kB            40kB
A3                  6          90              112kB            22kB
A4                 10         356              449kB            15kB
A5                499      27,638               27MB            56kB
A6                946     127,536              120MB           499kB

Table 6. Size of the data model communicated from the server to the client's browser. This data is a representative sample from actual IBM customer data that was submitted, by support personnel, to the WAIT service.

Analysis time (milliseconds)
      Firefox 3.6   Safari 4.05
A1        140            23
A2         90            16
A3         53            13
A4         68            28
A5        560           204
A6        720           110

Table 7. Execution time for the inference engine, as Javascript on a 2.4GHz Intel Core2 Duo with 4GB of memory, on MacOS 10.6.2. The applications, A1–A6, are the same as those in Table 6.

6.3 Implementation

WAIT is coded in a combination of Java and Javascript. The ETL step of parsing the raw data and producing the data model (Section 5.1) runs in Java and executes on the server once, when a report is created. The remaining analyses (Wait State analysis and Category analysis) run in Javascript and execute in the browser each time a report is loaded. This somewhat controversial design allows users to modify the rules or select alternate rules configurations without a round trip to the server. This design also allows WAIT reports, once generated, to be viewed in headless mode without a server present; the browser can load the report off a local disk and maintain full functionality of the report.

This design presents two performance constraints on the WAIT analysis: (1) the data models must be sufficiently small to avoid excessive network transfer delays, and (2) the analysis written in Javascript must be fast enough to avoid intrusive pauses, and moreover avoid Javascript timeouts. For example, Firefox 2.5 warns the
For example, Firefox 2.5 warns theuser that something might be wrong with the page after roughlyfive seconds <strong>of</strong> Javascript execution.Table 6 reports data model sizes for six representative <strong>IBM</strong> customerapplications, labeled A1–A6. The clustering and optimizationsteps described in Section 5.1 produce a significant reductionin space for the data model. For larger inputs, the clustered datamodel is multiple orders <strong>of</strong> magnitude smaller than the zipped rawdata input files. The largest submission received to date contained946 samples (Javacores), with a total <strong>of</strong> 127,536 stack samples. Thedata model for this report was still a manageable 499kB. More typically,we have found that the clustered output consumes fewer than50kB.The performance <strong>of</strong> the WAIT analyses is reasonable, eventhough executed within the browser. Table 7 shows the analysisruntime for the applications from Table 6. The table shows that,even with a large number <strong>of</strong> application samples (e.g. the 946 <strong>of</strong>application A6), the analysis completes in less than one second. Asdiscussed later in Section 7, most typical uses <strong>of</strong> the tool have inputswith a smaller number <strong>of</strong> samples, more similar to applicationsA1 through A4. For these typical cases, the analysis usually completesin less than 100ms. In general, analysis takes less time thangeneral widget-related rendering work.7. Case StudiesWe now show six additional case studies to demonstrate how WAITilluminates various performance problems. All these case studiesrepresent real problems discovered when analyzing <strong>IBM</strong> systems.To conserve space, we show only the relevant portions <strong>of</strong> the UIfor each case study.7.1 Lock Contention and DeadlockFigure 12 depicts a WAIT report on a 48-core system. 
The Waiting Threads timeline shows a sudden and sustained surge of Blocked on Monitor; i.e., threads seeking a lock and not receiving it, and thus being blocked from making progress. Looking at the Category breakdown suggests that the lock contention comes from miscellaneous APACHE OPENJPA code. The thread stacks from this category identify the location of the lock contention as line 364 of getMetaDataLocking().

Sometimes locking issues go beyond contention to the point of deadlock or livelock. When WAIT detects a cycle in the monitor graph, the threads' Wait State is Blocked on Deadlock. This situation appears in Figure 13. Looking at the thread stacks at the lower right of the report suggests that the threads are waiting for a lock in the logging infrastructure method SystemOutStream.processEvent(), line 277. Armed with this information, a programmer could look at the code and try to determine the reason for the deadlock.

7.2 Not Enough Load

The WAIT report in Figure 14 shows that the machine was 67% idle, and the timeline confirms that utilization was consistently low during the entire monitoring period. The Waiting Threads pie chart indicates that threads spend most of their time Delayed by Remote Request. Digging more deeply, the Category view indicates that the remote request on which the threads are delayed is client communication using an HTTP protocol. In this case, the performance problem is not with the server machine; rather, the amount of data being received from the client machine is not sufficient to keep the server busy. Indeed, it is possible that there is no problem in this situation, other than that the server may be over-provisioned for this workload.
WAIT cannot directly determine whether the client is generating a small amount of data or the network between the client and server is under-provisioned to handle the request traffic. To answer this question, a network utility such as netstat could be employed, or the CPU utilization of the client machines could be investigated.

7.3 Memory Leak

WAIT can also detect memory leaks. As shown in Figure 15, a memory leak can be detected by looking at just the first timeline, which shows garbage collection activity over time. In this example, the non-GC work initially dominates, but over time the garbage collection activity increases until it eventually consumes most of the non-idle CPU cycles. The large increase in garbage collection as time passes is strong evidence that the heap is inadequate for the amount of live memory; either the heap size is not appropriate for the workload, or the application has a memory leak.
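This reading of the timeline can be expressed as a simple heuristic. The sketch below is our illustration (the thresholds are arbitrary), not logic WAIT itself applies:

```java
// Flag a leak suspicion when the GC share of CPU grows markedly over the
// collection period: compare the average GC fraction over the later half
// of the samples against the earlier half.
class GcTrend {
    static boolean leakSuspected(double[] gcFraction) {
        int mid = gcFraction.length / 2;
        double early = 0, late = 0;
        for (int i = 0; i < mid; i++)                  early += gcFraction[i];
        for (int i = mid; i < gcFraction.length; i++)  late  += gcFraction[i];
        early /= mid;
        late  /= gcFraction.length - mid;
        // Thresholds chosen arbitrarily for illustration: GC share more
        // than doubled, and now consumes over 30% of the CPU.
        return late > 2 * early && late > 0.3;
    }
}
```

A rising sequence such as 5%, 10%, 40%, 60% trips the check; a flat 10% throughout does not.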


Figure 12. WAIT Report: Lock Contention

Figure 13. WAIT Report: Deadlock

7.4 Database Bottleneck

Figure 16 presents an example of a database bottleneck. Unlike Figure 11, where the database became completely unresponsive, in this case the database is simply struggling to keep up with the application server's requests. Over time, the server's utilization varies between approximately 10% and 85%, and these dips in utilization correlate roughly with the spikes in the number of threads in Waiting state Delayed by Remote Request and Category Getting Data from Database, thus pointing to the likely source of the delay.

Clicking on the orange bar for Getting Data from Database reveals the thread stacks that a developer can analyze to determine the key parts of the application delayed by the database, and try to reduce the load generated against the database. Alternatively, the database could be optimized or deployed on faster hardware.

7.5 Disk I/O Affecting Latency

Figure 17 shows a WAIT report where filesystem activity is limiting performance. The top two pie charts and timelines show that there is enough Java code activity to keep the CPUs well utilized. However, the Waiting activity shows a significant number of threads in Wait State Delayed by Disk I/O and Category Filesystem Metadata Operations, suggesting room for improvement with faster disks, or by restructuring the code to perform fewer disk operations.

Reducing these delays would clearly improve latency, since each transaction would spend less time waiting on disk I/O, but such improvement would have other benefits as well. Even though the four CPUs on this machine are currently well utilized, this application will likely scale poorly on larger machines. As the number of processors increases, the frequent disk access delays will eventually become a scaling bottleneck. WAIT can help identify these scaling limiters early in the development process.

8.
Related Work

Most conventional gprof-style performance analysis tools (e.g., [19, 20]) provide a profile which indicates in which methods a program spends time. Often, these tools provide a hierarchical tree view, in which an analyst can recursively drill down to find time spent in subroutines. These tools also often provide a summary view which indicates the sum totals of execution time over a run. Since performance characteristics vary over time, a summary view is often less helpful than snapshot views of profile data over smaller windows. Many tools (e.g., [12, 15, 20]) provide snapshot or window views.

A few previous works have applied abstraction to profile information, in order to infer higher-level notions of bottlenecks. Hollingsworth [7] identified classes of performance bottlenecks in parallel programs, and introduced a methodology to dynamically insert instrumentation to test for bottlenecks. Srinivas and Srinivasan [10] present a tool design that allows the user to define "Components", abstract names for code from particular packages or archives. This paper presents a new methodology to present layers of abstraction when monitoring performance, suitable for production enterprise deployment scenarios.

Figure 14. WAIT Report: Not Enough Load

Figure 15. WAIT Report: Memory Leak

A few previous works have been developed to mine log information in large systems [22, 23]. However, unlike the approach proposed here, both require access to source code that is often unavailable in enterprise environments. In addition, [22] employs machine learning to determine important characteristics to report, as opposed to the expert rule structure employed here; [23] uses analysis of source code control paths to determine precise causes of error conditions.

There are heavier-weight alternatives to sampling that we chose to avoid. Performance tools exist that collect detailed information about method invocations [2, 3, 5, 6, 11, 13, 19], or that stitch together an end-to-end trace of call flows across a multi-tier system [1, 4, 14, 16, 18].

We also note that performance experts have been using Java core dumps for many years, but generally in an ad hoc way, using their eyes and brains to summarize the data instead of having a tool to do so. Though other tools like Thread Analyzer [17] take Java cores as input, the tools of which we are aware are interactive, and require an expert user for effective results. As a result, existing Java core tools are generally not suitable for use by less expert users for quick triage.

Others have attempted to use method names to help diagnose software problems [8]. However, unlike our approach, which employs method names to diagnose functional issues in the system, [8] uses method name analysis to promote adherence to naming conventions and readable, maintainable software.

9.
Conclusion

We have presented the design and methodology behind WAIT, a tool designed to pinpoint performance bottlenecks in server applications based on idle time analysis. WAIT has been used widely throughout IBM, in both development and production deployments. Through these deployments, we have learned how to exploit ubiquitous sample-based monitoring to infer rich semantic information about server application performance. Although much of the literature explores intrusive monitoring based on instrumentation and monitoring agents, we have shown that effective performance analysis is possible with a lower barrier to entry.


Figure 16. WAIT Report: Database Bottleneck

References

[1] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In Symposium on Operating System Principles. ACM, 2003.

[2] W. P. Alexander, R. F. Berry, F. E. Levine, and R. J. Urquhart. A unifying approach to performance analysis in the Java environment. IBM Systems Journal, 39(1), 2000.

[3] G. Ammons, J.-D. Choi, M. Gupta, and N. Swamy. Finding and removing performance bottlenecks in large systems. In The European Conference on Object-Oriented Programming. Springer, 2004.

[4] B. Darmawan, R. Gummadavelli, S. Kuppusamy, C. Tan, D. Rintoul, H. Anglin, H. Chuan, A. Subhedar, A. Firtiyan, P. Nambiar, P. Lall, R. Dhall, and R. Pires. IBM Tivoli Composite Application Manager Family: Installation, Configuration, and Basic Usage. http://www.redbooks.ibm.com/abstracts/sg247151.html?Open.

[5] W. De Pauw, E. Jensen, N. Mitchell, G. Sevitsky, J. Vlissides, and J. Yang. Visualizing the execution of Java programs. In Software Visualization, State-of-the-art Survey, volume 2269 of Lecture Notes in Computer Science. Springer-Verlag, 2002.

[6] R. J. Hall. Cpprofj: Aspect-capable call path profiling of multithreaded Java applications. In Automated Software Engineering, pages 107–116. IEEE Computer Society Press, 2002.

[7] J. K. Hollingsworth. Finding Bottlenecks in Large-scale Parallel Programs. PhD thesis, University of Wisconsin, Aug. 1994.

[8] E. W. Host and B. M. Ostvold. Debugging method names. In The European Conference on Object-Oriented Programming. Springer, 2009.

[9] N. Mitchell, G. Sevitsky, and H. Srinivasan. Modeling runtime behavior in framework-based applications.
In The European Conference on Object-Oriented Programming, 2006.

[10] K. Srinivas and H. Srinivasan. Summarizing application performance from a components perspective. Foundations of Software Engineering, 30(5):136–145, 2005.

[11] Borland Software Corporation. OptimizeIt Enterprise Suite. http://www.borland.com/us/products/optimizeit, 2005.

[12] Compuware. Compuware Vantage Analyzer. http://www.compuware.com/solutions/e2e brochures factsheets.asp.

[13] Eclipse. Eclipse Test & Performance Tools Platform Project. http://www.eclipse.org/tptp.

[14] HP. HP Diagnostics for J2EE.

[15] IBM. Compuware Vantage Analyzer. http://alphaworks.ibm.com/tech/dcva4j/download.

[16] IBM. IBM OMEGAMON XE for WebSphere. http://www-01.ibm.com/software/tivoli/products/omegamon-xe-was.

[17] IBM. Thread and Monitor Dump Analyzer for Java. http://www.alphaworks.ibm.com/tech/jca.

[18] IBM. Tivoli Monitoring for Transaction Performance. http://www-01.ibm.com/software/tivoli/products/monitor-transaction.

[19] Sun Microsystems. HPROF JVM profiler. http://java.sun.com/developer/technicalArticles/Programming/HPROF.html, 2005.

[20] Yourkit LLC. Yourkit profiler. http://www.yourkit.com.

[21] G. Xu, M. Arnold, N. Mitchell, A. Rountev, and G. Sevitsky. Go with the flow: profiling copies to find runtime bloat. In PLDI '09: Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 419–430, New York, NY, USA, 2009. ACM.


Figure 17. WAIT Report: Good Throughput but Filesystem Bottleneck

[22] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large-scale system problems by mining console logs. In Symposium on Operating System Principles. ACM, 2009.

[23] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. SherLog: Error diagnosis by connecting clues from run-time logs. In Architectural Support for Programming Languages and Operating Systems, Mar. 2010.

[24] Y. Zhao, J. Shi, K. Zheng, H. Wang, H. Lin, and L. Shao. Allocation wall: a limiting factor of Java applications on emerging multi-core platforms. In OOPSLA '09: Proceedings of the 24th ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 361–376, New York, NY, USA, 2009. ACM.
