an empirical study of run-time coupling and cohesion software metrics

AN EMPIRICAL STUDY OF 

RUN-TIME COUPLING AND 

COHESION SOFTWARE 

METRICS 

Áine Mitchell 

Supervisor: Dr. James Power 

A Thesis presented for the degree of 

Doctor of Philosophy in Computer Science 

Department of Computer Science 

National University of Ireland, Maynooth 

Co. Kildare, Ireland 

October 2005

Dedicated to 

My parents Patrick and Ann Mitchell

An empirical study of run-time coupling and 

cohesion software metrics 

Áine Mitchell 

Submitted for the degree of Doctor of Philosophy 

Oct 2005 

Abstract 

The extent of coupling and cohesion in an object-oriented system has implications 

for its external quality. Various static coupling and cohesion metrics have been 

proposed and used in past empirical investigations, however none of these have 

taken the run-time properties of a program into account. As program behaviour is 

a function of its operational environment as well as the complexity of the source 

code, static metrics may fail to quantify all the underlying dimensions of coupling 

and cohesion. By considering both of these influences, one will acquire a more 

comprehensive understanding of the quality of critical components of a software 

system. We believe that any measurement of these attributes should include changes 

that take place at run-time. For this reason, in this work we address the utility of 

run-time coupling and cohesion complexity through the empirical evaluation of a 

selection of run-time measures for these properties. This study is carried out using 

a comprehensive selection of Java benchmark and real world programs. 

Our first case study investigates the influence of instruction coverage on the relationship 

between static and run-time coupling metrics. Our second case study defines 

a new run-time coupling metric that can be used to study object behaviour and 

investigates the ability of measures of run-time cohesion to predict such behaviour. 

Finally, we investigate whether run-time coupling metrics are good predictors of 

software fault-proneness in comparison to standard coverage measures. To the best 

of our knowledge this is the largest empirical study that has been performed to date 

on the run-time analysis of Java programs.

Declaration 

The work in this thesis is based on research carried out at the Department of Computer 

Science, in the National University of Ireland Maynooth, Co. Kildare, Ireland. 

No part of this thesis has been submitted elsewhere for any other degree or qualification 

and it is all my own work unless referenced to the contrary in the text. 

Signature:. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Date:. . . . . . . . . . . . . . . . . . . . . . . . 

Copyright c○ 2005 Áine Mitchell. 

“The copyright of this thesis rests with the author. No quotations from it should be 

published without the author’s prior written consent and information derived from 

it should be acknowledged”. 

iv

Acknowledgements 

I would like to thank my PhD adviser, Dr. James Power, for his advice, guidance, 

support, and encouragement throughout my PhD effort. 

A special thanks to my parents without whose continual support this work would 

not have been possible. 

I would also like to thank all my friends who were there for me throughout it all. 

This work has been funded by the Embark initiative, operated by the Irish 

Research Council for Science, Engineering and Technology (IRCSET). 

v

Contents 

Abstract 

Declaration 

Acknowledgements 

iii 

iv 

v 

1 Introduction 1 

1.1 Software Metrics and Complexity . . . . . . . . . . . . . . . . . . . . 1 

1.2 Traditional Measures of Complexity . . . . . . . . . . . . . . . . . . . 3 

1.3 Object-Oriented Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 4 

1.4 Definitions of Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . 5 

1.5 Definitions of Cohesion . . . . . . . . . . . . . . . . . . . . . . . . . . 6 

1.6 Static and Run-time Metrics . . . . . . . . . . . . . . . . . . . . . . . 7 

1.7 Factors Influencing Software Metrics . . . . . . . . . . . . . . . . . . 8 

1.7.1 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 

1.7.2 Metrics and Object Behaviour . . . . . . . . . . . . . . . . . . 9 

1.7.3 Metrics and Software Testing . . . . . . . . . . . . . . . . . . 9 

1.8 Aims of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 

1.9 Structure of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 

2 Literature Review 12 

2.1 Static Coupling Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 12 

2.1.1 Chidamber and Kemerer . . . . . . . . . . . . . . . . . . . . . 13 

2.1.2 Other Coupling Metrics . . . . . . . . . . . . . . . . . . . . . 14 

2.2 Frameworks for Static Coupling Measurement . . . . . . . . . . . . . 15 

vi

Contents 

vii 

2.2.1 Eder et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 

2.2.2 Hitz and Montazeri . . . . . . . . . . . . . . . . . . . . . . . . 16 

2.2.3 Briand et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 

2.2.4 Revised Framework by Briand et al. . . . . . . . . . . . . . . . 18 

2.3 Static Cohesion Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 18 

2.3.1 Chidamber and Kemerer . . . . . . . . . . . . . . . . . . . . . 19 

2.3.2 Other Cohesion Metrics . . . . . . . . . . . . . . . . . . . . . 20 

2.4 Frameworks for Static Cohesion Measurement . . . . . . . . . . . . . 21 

2.4.1 Eder et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 

2.4.2 Briand et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 

2.5 Run-time/Dynamic Coupling Metrics . . . . . . . . . . . . . . . . . . 23 

2.5.1 Yacoub et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 

2.5.2 Arisholm et al. . . . . . . . . . . . . . . . . . . . . . . . . . . 23 

2.6 Run-time/Dynamic Cohesion Metrics . . . . . . . . . . . . . . . . . . 25 

2.6.1 Gupta and Rao . . . . . . . . . . . . . . . . . . . . . . . . . . 25 

2.7 Other Studies of Dynamic Behaviour . . . . . . . . . . . . . . . . . . 25 

2.7.1 Dynamic Behaviour Studies . . . . . . . . . . . . . . . . . . . 25 

2.8 Coverage Metrics and Software Testing . . . . . . . . . . . . . . . . . 26 

2.8.1 Instruction Coverage . . . . . . . . . . . . . . . . . . . . . . . 27 

2.8.2 Alexander and Offutt . . . . . . . . . . . . . . . . . . . . . . . 27 

2.9 Previous Work by the Author . . . . . . . . . . . . . . . . . . . . . . 28 

2.10 Definition of Run-time Metrics . . . . . . . . . . . . . . . . . . . . . . 29 

2.10.1 Coupling Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 29 

2.10.2 Cohesion Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 31 

2.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 

3 Experimental Design 34 

3.1 Methods for Collecting Run-time Information . . . . . . . . . . . . . 34 

3.1.1 Instrumenting a Virtual Machine . . . . . . . . . . . . . . . . 34 

3.1.2 Sun’s Java Platform Debug Architecture (JPDA) . . . . . . . 35 

3.1.3 Bytecode Instrumentation . . . . . . . . . . . . . . . . . . . . 35 

3.2 Metrics Data Collection Tools (Design Objectives) . . . . . . . . . . . 35

Contents 

viii 

3.2.1 Class-Level Metrics Collection Tool (ClMet) . . . . . . . . . . 36 

3.2.2 Object-Level Metrics Collection Tool (ObMet) . . . . . . . . . 37 

3.2.3 Static Data Collection Tool (StatMet) . . . . . . . . . . . . . 38 

3.2.4 Coverage Data Collection Tool (InCov) . . . . . . . . . . . . . 39 

3.2.5 Fault Detection Study . . . . . . . . . . . . . . . . . . . . . . 40 

3.3 Test Case Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 

3.3.1 Benchmark Programs . . . . . . . . . . . . . . . . . . . . . . . 41 

3.3.2 Real-World Programs . . . . . . . . . . . . . . . . . . . . . . . 43 

3.3.3 Execution of Programs . . . . . . . . . . . . . . . . . . . . . . 45 

3.4 Statistical Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 

3.4.1 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . 45 

3.4.2 Normality Tests . . . . . . . . . . . . . . . . . . . . . . . . . . 47 

3.4.3 Normalising Transformations . . . . . . . . . . . . . . . . . . 48 

3.4.4 Pearson Correlation Test . . . . . . . . . . . . . . . . . . . . . 49 

3.4.5 T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 

3.4.6 Principal Component Analysis . . . . . . . . . . . . . . . . . . 50 

3.4.7 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 52 

3.4.8 Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . . 53 

3.4.9 Analysis of Variance (ANOVA) . . . . . . . . . . . . . . . . . 55 

3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 

4 Case Study 1: The Influence of Instruction Coverage on the Relationship 

Between Static and Run-time Coupling Metrics 57 

4.1 Goals and Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . 58 

4.2 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 

4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 

4.3.1 Experiment 1: To investigate the relationship between static 

and run-time coupling metrics . . . . . . . . . . . . . . . . . . 60 

4.3.2 Experiment 2: The influence of instruction coverage . . . . . . 62 

4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

Contents 

ix 

5 Case Study 2: The Impact of Run-time Cohesion on Object Behaviour 

69 



5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 

5.3.1 Experiment 1: To determine if objects from the same class 

behave differently at run-time from the point of view of coupling 74 

5.3.2 Experiment 2: The influence of cohesion on the N OC . . . . . 77 

5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 

6 Case Study 3: A Study of Run-time Coupling Metrics and Fault 

Detection 82 



6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 

6.3.1 Experiment 1: To examine the relationship between instruction 

coverage and fault detection. . . . . . . . . . . . . . . . . 85 

6.3.2 Experiment 2: To examine the relationship between run-time 

coupling metrics and fault detection. . . . . . . . . . . . . . . 87 

6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 

7 Conclusions 90 

7.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 

7.2 Applications of this Work . . . . . . . . . . . . . . . . . . . . . . . . 96 

7.3 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 

7.3.1 Internal Threats . . . . . . . . . . . . . . . . . . . . . . . . . . 96 

7.3.2 External Threats . . . . . . . . . . . . . . . . . . . . . . . . . 97 

7.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 

Appendix 100 

A Case Study 1: To Investigate the Influence of Instruction Coverage 

on the Relationship Between Static and Run-time Coupling Metric100

Contents 

x 

A.1 PCA Test Results for all programs. . . . . . . . . . . . . . . . . . . . 101 

A.1.1 SPECjvm98 Benchmark Suite . . . . . . . . . . . . . . . . . . 101 

A.1.2 JOlden Benchmark Suite . . . . . . . . . . . . . . . . . . . . . 101 

A.1.3 Real-World Programs, Velocity, Xalan and Ant . . . . . . . . 102 

A.2 Multiple linear regression results for all programs . . . . . . . . . . . 103 

A.2.1 SPECjvm98 Benchmark Suite . . . . . . . . . . . . . . . . . . 103 

A.2.2 JOlden Benchmark Suite . . . . . . . . . . . . . . . . . . . . . 104 

A.2.3 Real-World Programs, Velocity, Xalan and Ant . . . . . . . . 105 

B Case Study 2: The Impact of Run-time Cohesion on Object Behaviour 

106 

B.1 PCA Test Results for all programs. . . . . . . . . . . . . . . . . . . . 106 

B.1.1 JOlden Benchmark Suite . . . . . . . . . . . . . . . . . . . . . 106 

B.1.2 Real-World Programs, Velocity, Xalan and Ant . . . . . . . . 107 

B.2 Multiple linear regression results for all programs. . . . . . . . . . . . 107 

B.2.1 JOlden Benchmark Suite . . . . . . . . . . . . . . . . . . . . . 107 

B.2.2 Real-World Programs, Velocity, Xalan and Ant . . . . . . . . 108 

C Case Study 3: A Study of Run-time Coupling Metrics and Fault 

Detection 109 

C.1 Regression analysis results for real-world programs, Velocity, Xalan 

and Ant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 

C.1.1 For Class Mutants . . . . . . . . . . . . . . . . . . . . . . . . 110 

C.1.2 For Traditional Mutants . . . . . . . . . . . . . . . . . . . . . 111 

C.2 Regression analysis results for real-world programs, Velocity, Xalan 

and Ant. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 

C.2.1 For Class Mutants . . . . . . . . . . . . . . . . . . . . . . . . 111 

C.2.2 For Traditional Mutants . . . . . . . . . . . . . . . . . . . . . 112 

D Mutation operators in µJava 113

List of Figures 

1.1 The software quality model shows how different measures of internal 

quality can characterise the overall quality of a software product . . . 3 

3.1 Components of run-time class-level metrics collection tool, ClMet . . 37 

3.2 Components of run-time object-level metrics collection tool, ObMet . 38 

3.3 Components of static metrics collection tool, StatMet . . . . . . . . . 39 

3.4 Dendrogram: At the cutting line there are two clusters . . . . . . . . 54 

4.1 PCA test results for all programs for metrics in PC1, PC2 and PC3. 

In all graphs the bars represents the PCA value obtained for the 

corresponding metric. PC1 contains import level run-time metrics. 

PC2 contains the export level run-time metrics and PC3 contain the 

static CBO metric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 

4.2 Multiple linear regression results for class-level metrics (IC CC and 

EC CC). In both graphs the lighter bars represents the R 2 value for 

CBO, and the darker bars represents the R 2 value for CBO and I c 

combined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 

4.3 Multiple linear regression results for method-level metrics (IC CM 

and EC CM). In both graphs the lighter bars represents the R 2 

value for CBO, and the darker bars represents the R 2 value for CBO 

and I c combined. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 

5.1 C V of IC OC for classes from the programs studied. The bars represent 

the number of classes in each program that have C V in the 

corresponding range. . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 

xi


xii 

5.2 N OC results of cluster analysis. The bars represent the number of 

classes in each program that have the corresponding N OC value. . . . 76 

5.3 PCA Test Results for all programs for metrics in PC1 and PC2. In 

both graphs the bars represents the PCA value obtained for the corresponding 

metric. PC1 contains R LCOM and RW LCOM . PC2 contains 

S LCOM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 

5.4 Results from multiple linear regression where Y=N OC . The lighter 

bars represent the R 2 for S LCOM , and the darker bars represent the 

R 2 value for S LCOM and R LCOM combined. . . . . . . . . . . . . . . . 80 

6.1 Mutation test results for real-world programs Velocity, Xalan and 

Ant. In all graphs the bars represents the number of classes that 

exhibit a percentage mutant kill rate in the corresponding range. . . . 86 

6.2 Regression analysis results for the effectiveness of I c in predicting 

class and traditional-level mutations in real-world programs Velocity, 

Xalan and Ant. The bars represents the R 2 value for the run-time 

metric under consideration. . . . . . . . . . . . . . . . . . . . . . . . 87 

6.3 Regression analysis results for the effectiveness of run-time coupling 

metrics in predicting class-level mutations in real-world programs Velocity, 

Xalan and Ant. The bars represents the R 2 value for the 

run-time metric under consideration. . . . . . . . . . . . . . . . . . . 89 

7.1 Findings from case study one that show our run-time coupling metrics 

are not simply surrogate measures for static CBO and coverage plus 

static metrics are better predictors of run-time measures than static 

measure alone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 

7.2 Findings from case study two that show run-time object-level coupling 

measures can be used to identify objects that are exhibiting 

different behaviours at run-time and run-time cohesion measures are 

good predictors of this type of behaviour. . . . . . . . . . . . . . . . . 93


xiii 

7.3 Findings from case study three that show run-time coupling metrics 

are good predictors of class-type faults and instruction coverage is a 

good predictor of traditional faults in programs. . . . . . . . . . . . . 94

List of Tables 

2.1 Abbreviations for the dynamic coupling metrics of Arisholm et al. . . 24 

3.1 Description of the SPECjvm98 benchmarks . . . . . . . . . . . . . . . 42 

3.2 Description of the JOlden benchmarks . . . . . . . . . . . . . . . . . 43 

3.3 Programs used for each case study . . . . . . . . . . . . . . . . . . . . 45 

4.1 Descriptive statistic results for all programs . . . . . . . . . . . . . . 61 

5.1 Matrix of unique accesses per object, for objects BlackNode 1 , . . . , BlackNode 4 

to classes GreyNode, QuadTreeNode and WhiteNode . . . . . . . . . 72 

5.2 Descriptive statistic results for all programs . . . . . . . . . . . . . . 73 

D.1 Traditional-level mutation operators in µJava . . . . . . . . . . . . . 113 

D.2 Class-level mutation operators in µJava . . . . . . . . . . . . . . . . . 114 

xiv

Chapter 1 

Introduction 

Software metrics have become essential in some disciplines of software engineering. 

In forward engineering they are used to measure software quality and to estimate 

the cost and effort of software projects [40]. In the field of software evolution, 

metrics can be used for identifying stable or unstable parts of software systems, 

as well as identifying where refactorings can be applied or have been applied [32], 

and detecting increases or decreases of quality in the structure of evolving software 

systems. In the field of software re-engineering and reverse engineering, metrics are 

used for assessing the quality and complexity of software systems, and also to get a 

basic understanding and provide clues about sensitive parts of software systems [27]. 

1.1 Software Metrics and Complexity 

Software metrics evaluate different aspects of the complexity of a software product. 

Software complexity was originally defined as “a measurement of the resources that 

must be expended in developing, testing, debugging, maintenance, user training, 

operation, and correction of software products” [94]. Complexity has been characterised 

in terms of seven different levels, the correlation and interdependence of 

which will determine the overall level of complexity in a software product [44]. The 

levels are as follows: 

• Control Structure 

1

1.1. Software Metrics and Complexity 2 

• Module Coupling 

• Algorithm 

• Code 

• Nesting 

• Module Cohesion 

• Data Structure 

However, most metrics measure only one software complexity factor. These 

foundations of complexity will determine the internal quality of a product. 

Internal quality measures are those which are performed in terms of the software 

product itself and are measurable both during and after the creation of the software 

product. They have however, no inherent, practical meaning within themselves. To 

give them meaning they must be characterised in terms of the product’s external 

quality. 

External quality measures are evaluated with respect to how a product relates to 

its environment and are deemed to be inherently meaningful, such examples would 

be the maintainability or testability of a product. 

It should be noted that good internal quality is a requirement for good external 

quality. Figure 1.1 illustrates the software quality model which depicts the relationship 

between these measures. Much research has contributed models and measures 

of both internal software quality attributes and external attributes of a design. Although 

the relationships between these attributes is for the most part intuitive, e.g., 

more complex code will require greater effort to maintain, the precise functional form 

of those relationships can be less clear and is the subject of intense practical and 

research concern [31]. Empirical validation aims at demonstrating the usefulness 

of a measure in practice and is, therefore, a crucial activity to establish the overall 

validity of a measure [6]. Therefore it is the belief of the author that a well-designed 

empirical study serves to clarify and strengthen the observed relationships.

1.2. Traditional Measures of Complexity 3 

Metrics 

Complexity 

External 

Quality 

Quality 

In Use 

External 

Maintainability 

Coupling 

Testability 

Internal 

Cohesion 

Reusability 

Internal Quality 

Figure 1.1: The software quality model shows how different measures of internal 

quality can characterise the overall quality of a software product 

1.2 Traditional Measures of Complexity 

The earliest software measure, which was proposed in the late 1960s, is the Source 

Lines of Code (SLOC) metric, which is still used today. It is used to measure the 

amount of code in a software program. It is typically used to estimate the amount of 

effort that will be required to develop a program, as well as to estimate productivity 

or effort once the software is produced. Two major types of SLOC measures exist: 

physical SLOC and logical SLOC. Exact definitions of these measures vary. The 

most common definition of physical SLOC is a count of “non-blank, non-comment 

lines” in the text of the program’s source code. Logical SLOC measures attempt 

to measure the number of “statements”, however their specific definitions are tied 

to specific computer languages. Therefore, it is much easier to create tools that 

measure physical SLOC, and physical SLOC definitions are easier to explain. However, 

physical SLOC measures are sensitive to logically irrelevant formatting and 

style conventions, while logical SLOC is less sensitive to formatting and style con-

1.3. Object-Oriented Metrics 4 

ventions. 

The are a number of drawbacks of using a crude measure such as LOC as a surrogate 

measure for different notions of program size such as effort, functionality and 

complexity. The need for more discriminating measures became especially urgent 

with the increasing diversity of programming languages, as LOC in an assembly 

language is not comparable in effort, functionality, or complexity to an LOC in a 

high-level language [39]. 

Thus from the mid-1970s there was an increase in the number of different complexity 

metrics defined. Some of the more prevalent ones were Halstead’s software 

science metrics [47], which made an attempt to capture notions of size and complexity 

beyond simply counting lines of code. Although the work has had a lasting 

impact they are principally regarded as an example of confused and inadequate 

measurements [40]. 

McCabe defined a measure known as Cyclomatic Complexity [71]. It may be 

considered as a broad measure of soundness and confidence for a program. It measures 

the number of linearly-independent paths through a program module and it is 

intended to be independent of language and language format. 

Function points, which were pioneered by Albrecht [2] in 1977, are a measure 

of the size of computer applications and the projects that build them. The size is 

measured from a functional, or user, point of view. It is independent of the computer 

language, development methodology, technology or capability of the project team 

used to develop the application. The original metric has been augmented and refined 

to cover more than the original emphasis on business-related data processing. 

However as object-oriented techniques became more prevalent there was an increasing 

need for metrics that could correctly evaluate their properties. 

1.3 Object-Oriented Metrics 

Object-oriented design and development is becoming very popular in today’s software 

development environment. Object-oriented development requires not only a 

different approach to design and implementation but it requires a different approach

1.4. Definitions of Coupling 5 

to software metrics. Since object oriented technology uses objects and not algorithms 

as its fundamental building blocks, the approach to software metrics for object oriented 

programs must be different from the standard metrics set. Metrics, such as 

lines of code and cyclomatic complexity, have become accepted as standard for traditional 

functional/ procedural programs and were used to evaluate object-oriented 

environments at the beginning of the object-oriented design revolution. However, 

traditional metrics for procedural approaches are not adequate for evaluating objectoriented 

software, primarily because they are not designed to measure basic elements 

like classes, objects, polymorphism, and message-passing. Even when adjusted to 

syntactically analyse object-oriented software they can only capture a small part of 

such software and thus provide a weak quality indication [50, 65]. Since this time 

there have been many proposed object-oriented metrics in the literature. The question 

now is, which object-oriented metrics should a project use As the quality 

of object-oriented software, like other software, is a complex concept there can be 

no single, simple measure of software quality acceptable to everyone. To assess or 

improve software quality in you must define the aspects of quality in which you are 

interested, then decide how you are going to measure them. By defining quality 

in a measurable way, you make it easier for other people to understand your viewpoint 

and relate your notions to their own [60]. As illustrated by Chapter 2 some of 

the seminal methods of evaluating an object-oriented design are through the use of 

measures for coupling and cohesion. 

1.4 Definitions of Coupling 

Stevens et al. [95] first introduced coupling in the context of structured development 

techniques. They defined coupling as “the measure of the strength of association 

established by a connection from one module to another”. They stated that the 

stronger the coupling between modules, that is, the more inter-related they are, the 

more difficult these modules are to understand, change and correct and thus the 

more complex the resulting software system. 

Myers [82] refined the concept of coupling by defining six distinct levels of cou-

1.5. Definitions of Cohesion 6 

pling. However coupling could only be determined by hand as the definitions were 

neither precise nor prescriptive, leaving room for subjective interpretations of the 

levels. 

Constantine and Yourdon [29] also stated that the modularity of software design 

can be measured by coupling and cohesion. They stated that coupling between two 

units reflect the interconnections between units and that faults in one unit may 

affect the coupled unit. 

Page and Jones [89] ordered coupling into eight different levels according to their 

effects on the understandability, maintainability, modifiability and reusability of the 

coupled modules. 

Troy and Zweben [98] showed that coupling between units is a good indicator 

of the number of faults in software. However their study was based on subjective 

interpretation of design documents instead of real code. 

Offutt et al. [85] extended the eight levels of coupling to twelve thus providing a 

finer grained measure of coupling. They also described algorithms to automatically 

measure the coupling level between each pair of units in a program. The coupling 

levels are defined between pairs of units A and B. For each coupling level the parameters 

are classified by the way they are used. Uses are classified into computation 

uses (C-uses) [42], predicate uses (P-uses) and indirect uses (I-uses) [85]. A C-use 

occurs when a variable is used on the right side of an assignment statement, in an 

output statement, or a procedure call. A P-use occurs when a variable is used in a 

predicate statement. An I-use occurs when a variable is used in an assignment to 

another variable and the defined variable is later used in a predicate. The I-use is 

considered to be in the predicate rather than in the assignment. 

1.5 Definitions of Cohesion 

The cohesion of a module is the extent to which its individual components are 

needed to perform the same task [40]. Cohesion was first introduced within the 

context of module design by Stevens et al. [95]. In their definition, the cohesion of a 

module is measured by inspecting the association between all pairs of its processing

1.6. Static and Run-time Metrics 7 

elements. The term processing element was defined as an action performed by 

a module such as a statement, procedure call, or something which must be done 

in a module but which has not yet been reduced to code [29]. Their definition 

was informal thereby leaving it open for interpretation. They developed a scale of 

cohesion that provide an ordinal scale of measurement that describes the degree to 

which the actions performed by a module contribute to a unified function. There 

are seven categories of cohesion which range from the most desirable (functional) to 

least desirable (coincidental). They stated that it is possible for a module to exhibit 

more than one type of cohesion, in this case the module is categorized by its least 

desirable type of cohesion. In the principle of good software design it is desirable to 

have highly cohesive modules, preferably functional. 

Emerson [36,37] based his cohesion measure on a control flow graph representation 

of a module. The range of this complexity measure varies from 0 to 1. Emerson 

indicates that his method for computing cohesion is related to program slicing. He 

reclassifies the seven levels of cohesion into three. 

Ott and Thuss [88] used program slicing to evaluate their cohesion measurements. 

They reclassified the original seven levels of cohesion into four categories. 

Lakhotia [61] codified the natural language definitions of the seven levels of 

cohesion. He developed a method for computing cohesion based on an analysis of 

the variable dependence graphs of a module. Pairs of outputs were examined to 

identify any data or control dependences that exist between the two outputs. Rules 

were provided for determining the cohesion of the pairs. 

1.6 Static and Run-time Metrics 

A large number of metrics have been proposed to measure object-oriented design 

quality. Design metrics can be classified into two categories; static and runtime/dynamic. 

Static metrics measure what may happen when a program is executed 

and are said to quantify different aspects of the complexity of the source 

code. Run-time metrics measure what actually happens when a program is executed. 

They evaluate the source code’s run-time characteristics and behaviour as

1.7. Factors Influencing Software Metrics 8 

well as its complexity. 

Despite the rich body of research and practice in developing design quality metrics, 

there has been less emphasis on run-time metrics for object-oriented designs 

mainly due to the fact that a run-time code analysis is more expensive and complex 

to perform. [99]. However, due to polymorphism, dynamic binding, and the common 

presence of unused (dead) code in software, static coupling and cohesion measures 

do not perfectly reflect the actual situation taking place amongst classes at run-time. 

The complex dynamic behaviour of many real-time applications motivates a shift 

in interest from traditional static metrics to run-time metrics. In this work, we investigate 

whether useful information on design quality can be provided by run-time 

measures of coupling and cohesion over and above that which is given by simple 

static measures. This will determine if it is worthwhile to continue the investigation 

into run-time coupling and cohesion metrics and their relationship with the external 

quality. 

1.7 Factors Influencing Software Metrics 

This section discusses factors which affect software metrics, including coverage and 

object-level behaviour. The relationship with software testing is also discussed. 

1.7.1 Coverage 

When relating static and run-time measures, it is important to have a thorough understanding 

of the degree to which the analysed source code corresponds to the code 

that is actually executed. In this thesis, this relationship is studied using instruction 

coverage measures with regard to the influence of coverage on the relationship 

between static and dynamic metrics. It is proposed that coverage results have a significant 

influence on the relationship and thus should always be a measured, recorded 

factor in any such comparison.

1.8. Aims of Thesis 9 

1.7.2 Metrics and Object Behaviour 

To date little work has been done on the analysis of code at the object-level, that is 

the use of metrics to identify specific object behaviours. We identify this behaviour 

through the use of run-time object-level coupling metrics. Run-time object-level 

coupling quantifies the level of dependencies between objects in a system whereas 

run-time class-level coupling quantifies the level of dependencies between the classes 

that implement the methods or variables of the caller object and the receiver object 

[5]. The class of the object sending or receiving a message may be different from the 

class implementing the corresponding method due to the impact of inheritance. We 

also investigate the ability of run-time cohesion measures to predict such behaviour. 

1.7.3 Metrics and Software Testing 

Testing is one of the most effort-intensive activities during software development [7]. 

Much research is directed toward developing new and improved fault detection mechanisms. 

A number of papers have investigated the relationships between static design 

metrics and the detection of faults in object-oriented software [6, 15]. However, to 

date no work has been conducted on the correlation of run-time coupling metrics and 

fault detection. In this thesis, we investigate whether measures for run-time coupling 

are good predictors of fault-proneness, an important software quality attribute. 

1.8 Aims of Thesis 

In summary, the central aims of this thesis are to outline operational definitions for 

run-time class and object-level coupling and cohesion metrics suitable for evaluating 

the quality of an object-oriented application. The motivation for these measures 

is to complement existing measures that are based on static analysis by actually 

measuring coupling and cohesion at runtime. 

It is necessary to provide tools for accurately collecting such measures for Java 

systems effectively. Java was chosen as the target language for this analysis because 

Java is executed on a virtual machine which makes it relatively simple to 

collect run-time trace information in comparison to languages like C or C++. Java

1.9. Structure of Thesis 10 

also combines a wide range of language features found in different programming 

languages, for example, an object-oriented model, exception handling and garbage 

collection. Its features of portability, robustness, simplicity and security have made 

it increasingly popular within the software engineering community, underpinning its 

importance and providing a good selection of sample applications for study. 

Finally, a thorough empirical investigation using both Java benchmark and realworld 

programs needs to be performed. The objectives of this are: 

1. To assess the fundamental properties of the run-time measures and to investigate 

whether they are redundant with respect to the most commonly used 

coupling and cohesion measures, as defined by Chidamber and Kemerer [26]. 

2. To examine the influence of test case coverage on the relationship between 

static and run-time coupling metrics. Intuitively, one would expect the better 

the coverage of the test cases used, the better the static and run-time metrics 

should correlate. 

3. To investigate run-time object behaviour, that is, to determine if objects from 

the same class behave differently at run-time, through the use of object-level 

coupling metrics. 

4. To investigate run-time object behaviour using run-time measures for cohesion. 

5. To conduct a study investigating the correlation between run-time coupling 

measures and fault detection in object-oriented software. 

1.9 Structure of Thesis 

This thesis describes how coupling and cohesion can be defined and precisely measured 

based on the run-time analysis of systems. An empirical evaluation of the proposed 

run-time measures is reported using a selection of benchmarks and real-world 

Java applications. An investigation is conducted to determine if these measures are 

redundant with respect to their static counterparts. We also determine if coverage 

has a significant impact on the correlation between static and run-time metrics. We

1.9. Structure of Thesis 11 

examine object behaviour using a run-time object-level coupling metric and we investigate 

the relationship of run-time cohesion metrics on this. Finally, we study 

the fault detection capabilities of run-time coupling measures. 

Chapter 2 presents a literature survey of coupling and cohesion metrics and 

associated studies. Chapter 3 defines the run-time metrics used in this study and 

outlines the experimental tools and techniques. Chapter 4 presents a case study on 

the correlation between static and run-time coupling measures and the influence of 

coverage on this correlation. Chapter 5 discusses a case study on object behaviour 

and the impact of cohesion on this. Chapter 6 presents a case study on run-time 

coupling metrics and fault detection. Chapter 7 presents the final conclusions and 

discusses future work.

Chapter 2 

Literature Review 

In this chapter, a comprehensive survey and literature review of existing static and 

run-time/dynamic measures and frameworks for coupling and cohesion in objectoriented 

systems is presented. Previous work which describes a coupling based 

testing approach for object-oriented software is presented. Finally, the role coverage 

measures play in software testing is discussed. In Section 2.1 and 2.3, we present 

existing coupling and cohesion measures and discuss them. Sections 2.2 and 2.4, 

present alternative frameworks for coupling and cohesion. Measures for the run-time 

evaluation of coupling and cohesion are presented in Sections 2.5 and 2.6 respectively. 

Other work in studies of dynamic behaviour is described in Section 2.7. A discussion 

of coverage metrics and the role they play in software testing is presented in Section 

2.8. Previous work by the author is discussed in 2.9. Finally, a description of the 

run-time measures used in the subsequent case studies are provided in Section 2.10. 

2.1 Static Coupling Metrics 

There exists a large variety of measurements for coupling. A comprehensive review 

of existing measures performed by Briand et al. [13] found that more than thirty 

different measures of object-oriented coupling exist. The most prevalent ones are 

explained in the following subsections: 

12

2.1. Static Coupling Metrics 13 

2.1.1 Chidamber and Kemerer 

In their papers [25, 26] Chidamber and Kemerer propose and validate a set of six 

software metrics for object-oriented systems, including two measures for coupling. 

As these are the most accepted and widely used coupling metrics, we use these as 

the basis for our run-time coupling measures. 

Coupling Between Objects (CBO) 

They first define a measure CBO for a class as, “a count of the number of noninheritance 

related couples with other classes” [25]. An object of a class is coupled 

to another if the methods of one class use the methods or attributes of the other. 

They later revise this definition to state, “CBO for a class is a count of the number 

of other classes to which it is coupled ” [26]. A footnote added that “this includes 

coupling due to inheritance.” 

They state that coupling has an adverse effect on the maintenance, reuse and 

testing of a design and that excessive coupling between object classes is detrimental 

to modular design and prevents reuse. As the more independent a class is, the easier 

it is to reuse in another application. They state that inter-object class couples should 

be kept to a minimum in order to improve modularity and promote encapsulation. 

The larger the number of couples, the higher the sensitivity to changes in other parts 

of the design, making maintenance more difficult. A measure of coupling is useful 

to determine how complex the testing of various parts of a design are likely to be. 

The higher the inter-object class coupling the more rigorous the testing needs to be. 

Response for class (RFC): 

The response set (RS) of a class is a set of methods that can potentially be executed 

in response to a message received by an object of that class. RFC is simply the 

number of methods in the set, that is, RFC = #{RS}. A given method is counted 

only once. Since RFC specifically includes methods called from outside the class, 

it is also a measure of the potential communication between the class and other 

classes.

2.1. Static Coupling Metrics 14 

RS = M ∪ alli R i = [∪ i∈M R i ] (2.1) 

Equation 2.1 gives the response set for a class where R i is the set of methods 

called by the method i and M is the set of all methods in the class. 

If a large number of methods can be invoked in response to a message, the testing 

and debugging of the class becomes more complicated since it requires a greater level 

of understanding on the part of the tester. The complexity of a class increases with 

the number of methods that can be invoked from it. 

2.1.2 Other Coupling Metrics 

In their paper [63] Li and Henry identify a number of metrics that can predict the 

maintainability of a design. They define two measures, message passing coupling 

(MPC) and data abstraction coupling (DAC). MPC is defined as the number of 

send statements defined in a class. The number of send statements sent out from a 

class may indicate how dependent the implementation of the local methods is on the 

methods in other classes. MPC only counts invocations of methods of other classes, 

not its own. DAC is defined as “the number of abstract data types (ADT) defined 

in a class”. An ADT is defined in a class c if it is the type of an attribute of class c. 

It is also specified that “the number of variables having an ADT type may indicate 

the number of data structures dependent on the definitions of other classes”. 

Martin describes two coupling metrics that can be used to measure the quality of 

an object-oriented design in terms of the interdependence between the subsystems 

of that design [70]. Afferent Coupling (Ca) is the number of classes outside this 

category that depend upon classes within this category. Efferent Coupling (Ce) is 

the number of classes inside this category that depend upon classes outside this 

category. A category is a set of classes that belong together in the sense that 

they achieve some common goal. Martin does not specify exactly what constitutes 

dependencies between classes. 

Abreu et al. present a coupling metric known as Coupling Factor (COF) for the 

design quality evaluation of object-oriented software systems [1]. COF is the actual 

number of client-server relationships between classes that are not related via inher-

2.2. Frameworks for Static Coupling Measurement 15 

itance divided by the maximum possible number of such client-server relationships. 

It is normalised to range between 0 and 1 to allow for comparisons for systems of 

different sizes. It was not specified how to account for such factors as polymorphism 

and method overriding. 

Lee et al. measure coupling and cohesion of an object-oriented program based 

on information flow through programs [62]. They define a measure, Informationflow-based 

coupling (ICP), that counts for a method m of a class c, the number 

of methods that are invoked polymorphically from other classes, weighted by the 

number of parameters of the invoked method. This count can be scaled up to 

classes and subsystems. They go on to derive two more sets of measures which 

measure inheritance-based coupling (coupling to ancestor classes (IH-ICP)) and 

noninheritance-based coupling (coupling to unrelated classes (NIH-ICP)) and deduce 

that ICP is simply the sum of IH-ICP and NIH-ICP 

Briand et al. perform a comprehensive empirical validation of product measures, 

such as coupling and cohesion, in object-oriented systems and explore the probability 

of fault detection in system classes during testing [11]. They define a number of 

measures which count the number of class-attribute (CA), class-method (CM) and 

method-method (MM) interactions for each class. They take into account which 

class the interactions originate from or are directed at and the number of ancestor 

or other classes. A CA-interaction occurs from class c to class d if an attribute of 

class c is of type class d. A CM-interaction occurs from class c to class d if a newly 

defined method of class c has a parameter of type class d. An MM-interaction occurs 

from class c to class d if a method implemented at class c statically invokes a newly 

defined or overriding method of class d, or receives a pointer to such a method. This 

set has sixteen metrics in total. 

2.2 Frameworks for Static Coupling Measurement 

Several different authors describe frameworks to characterise different approaches to 

coupling and to assign relative strengths to different types of coupling. A framework 

defines what constitutes coupling. This is done in an attempt to determine the


potential use of coupling metrics and how different metrics complement each other. 

There are three existing frameworks: 

2.2.1 Eder et al. 

Eder et al. identify three different types of relationships [34]. These are, interaction 

relationships between methods, component relationships between classes, and inheritance 

between classes. These relationships are then used to derive three different 

dimensions of coupling which are classified according to different strengths: 

1. Interaction coupling: Two methods are said to be interaction coupled if i) one 

method invokes the other, or ii) they communicate via the sharing of data. 

There are seven types of interaction coupling. 

2. Component coupling: Two classes c and d are component coupled, if d is the 

type of either i) an attribute of c, or ii) an input or output parameter of a 

method of c, or iii) a local variable of a method of c, or iv) an input or output 

parameter of a method invoked within a method of c. There are four different 

degrees of component coupling. 

3. Inheritance coupling: two classes c and d are inheritance coupled, if one class 

is an ancestor of the other. There are four degrees of inheritance coupling. 

2.2.2 Hitz and Montazeri 

Hitz and Montazeri derive two different types of coupling, object and class-level coupling 

[52]. These are determined by the state of an object (value of its attributes at 

a given moment at runtime) and state of an object’s implementation (class interface 

and body at a given time in the development cycle). 

Class level coupling (CLC) results from state dependencies between two classes 

in a system during the development cycle. This can only be determined from a static 

analysis of the design documents or source code. This is important when considering 

maintenance and change dependencies as changes in one class may lead to changes 

in other classes which use it.


Object level coupling (OLC) results from state dependencies between two objects 

during the run-time of a system. This depends on concrete object structure at runtime, 

which in turn is determined by actual input data. Therefore, it is a function 

of design or source code and input data at run-time. This is relevant for run-time 

oriented activities such as testing and debugging. 

2.2.3 Briand et al. 

In the framework by Briand et al. coupling is constituted as interactions between 

classes [14]. The strength is determined by the type of the interaction (Class- 

Attribute, Class-Method, Method-Method), the relationship between the classes (Inheritance, 

Other) and the interaction’s locus of impact (Import/Client, Export/Server). 

They assign no strengths to the different kinds of interactions. There are three basic 

criteria in the framework which are as follows: 

1. Type of interaction: This determines the mechanism by which two classes are 

coupled. A class-attribute interaction is present if aggregation occurs, that is, 

if a class c is the type of an attribute of class d. A class-method interaction 

occurs if a class c is the type of a parameter of method m d of a class d, or if a 

class c is the return type of method m d . A method-method interaction occurs 

if a m d of a class d directly invokes a method m c , or a method m d receives via 

parameter a pointer to m c thereby invoking m c indirectly. 

2. Relationship: An inheritance relationship occurs if a class c is an ancestor of 

class d or vice versa. Friendship is present if a class c declares class d as its 

friend, which grants class d access to the non-public elements of c. There is 

another relationship when no inheritance or friendship relationship is present 

between classes c and d. 

3. Locus: If a class c is involved in an interaction with another class, a distinction 

is made between export and import coupling. Export is when a class c is the 

used class or server in the interaction. Import is when a class c is the using 

class or client in the interaction.

2.3. Static Cohesion Metrics 18 

2.2.4 Revised Framework by Briand et al. 

Briand et al. outline a new unified framework for coupling in object-oriented systems 

[13]. It is characterised based on the issues identified by comparing existing coupling 

frameworks. There are six different criteria in the framework and each criterion 

determines one basic aspect of the resulting measure. The criteria are as follows: 

1. The type of connection: This determines what constitutes coupling. It is the 

type of link between a client and a server item which could be an attribute, 

method, or class. 

2. The locus of impact: This is import or export coupling. Import coupling analyses 

attributes, methods, or classes in their role as clients of other attributes, 

methods, or classes. Export coupling analyses the attributes, methods, and 

classes in their role as servers to other attributes, methods or classes. 

3. The granularity of the measure: This is the domain of the measure, that is, 

what components are to be measured and how to count coupling connections. 

4. The Stability of server: Should both stable and unstable classes be included 

Classes can be, a) stable which are those that are not subject to change in the 

project at hand, for example classes imported from libraries, or b) unstable 

which are those which are subject to development or modification in the project 

at hand. 

5. Direct or indirect coupling: Should only direct connections be counted or 

should indirect connections also be taken into account 

6. Inheritance: Inheritance-based versus noninheritance-based coupling. Also 

how to account for polymorphism and how to assign attributes and methods 

to classes. 

2.3 Static Cohesion Metrics 

A large number of alternative measures are proposed for measuring cohesion. Briand 

et al. [12] carry out a broad survey on the current state of cohesion measurement


in object-oriented systems and find fifteen separate measurements of cohesion. A 

review of these measures is presented in the following subsections. 

2.3.1 Chidamber and Kemerer 

The Lack of Cohesion in Methods (LCOM1) measure was first suggested by Chidamber 

and Kemerer [25]. It is the most prevalently used cohesion measure today 

and therefore is used as the basis for the definition of our run-time cohesion measures. 

It is defined as “the degree of similarity of methods” and is theoretically 

based on the ontology of objects by Bunge [21]. Within this ontology, the similarity 

of things is defined as the set of properties that the things have in common. 

For a given class C with a number of methods, M 1 , M 2 , ..., M n , let {I i } be the 

set of instance variables accessed by the method M i . As there are n methods there 

will be n such sets, one set per method. The LCOM metric is then determined by 

counting the number of disjoint sets formed by the intersection of the n sets. 

However, this was found to be quite ambiguous and the pair later redefined their 

metric (LCOM2) [26]. For a class C 1 with n methods, M 1 , . . . , M n , let {I i } be the 

set of instance variables referenced by method M i . There are n such sets I 1 , ...I n . 

We can define two disjoint sets: 

P = {(I i , I j ) | I i ∩ I j = ∅} 

(2.2) 

Q = {(I i , I j ) | I i ∩ I j ≠ ∅} 

The lack of cohesion in methods is then defined from the cardinality of these sets 

by: 

LCOM = |P | − |Q|, 

(2.3) 

if |P | > |Q| or 0 otherwise 

LCOM is an inverse cohesion measure. An LCOM value of zero indicates a 

cohesive class. Cohesiveness of methods within a class is desirable as it promotes 

encapsulation. Any measure of disparateness of methods helps identify flaws in the 

design of classes. If the value is greater than zero this indicates that the class can 

be split into two or more classes, since its variables belong in disjoint sets. Low


cohesion is said to increase complexity, thereby increasing the likelihood of errors 

during the development process. 

2.3.2 Other Cohesion Metrics 

Briand et al. define a set of cohesion measures for object-based systems [16,17] which 

are adapted in [12] to object-oriented systems. For this adaption a class is viewed as 

a collection of data declarations and methods. A data declaration is a local, public 

type declaration, the class itself or public attributes. There can be data declaration 

interactions between classes, attributes, types of different classes and methods. 

They define the following measures; Ratio of Cohesive Interactions (RCI), Neutral 

Ratio of Cohesive Interactions (NRCI), Pessimistic Ratio of Cohesive Interactions 

(P RCI) and Optimistic Ratio of Cohesive Interactions (ORCI). 

Hitz and Montazeri base their cohesion measurements LCOM3, LCOM4 and 

C (Connectivity) on the work of Chidamber and Kemerer [51]. 

The cohesion measurements by Bieman and Kang are also based on the work of 

Chidamber and Kemerer [9]. They define measurements known as Tight Class Cohesion 

(T CC) and Loose Class Cohesion (LCC). These metrics also consider pairs 

of methods which use common attributes, however a distinction is made between 

methods which access attributes directly or indirectly. They also take inheritance 

into account, making suggestions on how to deal with inherited methods and inherited 

attributes. 

Lee et al. propose a set of cohesion measures based on the information flow 

through method invocations within a class [62]. For a method m implemented in a 

given class c, the cohesion of m is the number of invocations to other methods implemented 

in class c, weighted by the number of parameters of the invoked methods. 

The greater the number of parameters an invoked method has, the more information 

is passed, the stronger the link between the invoking and invoked method. The 

cohesion of a class is the sum of the cohesion of its methods. The cohesion of a set 

of classes is given by the sum of the cohesion of the classes in the set. 

Henderson-Sellers propose a cohesion measure (LCOM5) [49]. They state that 

a value of zero is obtained if each method of the class references every attribute

2.4. Frameworks for Static Cohesion Measurement 21 

of the class and they called this “perfect cohesion”. They also state that if each 

method of the class references only a single attribute, the measure yields one and 

that values between zero and one are to be interpreted as percentages of the perfect 

value. They do not state how to deal with inherited methods and attributes. 

2.4 Frameworks for Static Cohesion Measurement 

Two frameworks are defined in an attempt to outline what constitutes cohesion. 

Eder et al. define a framework which aims at providing qualitative criteria for 

cohesion and also assigns relative strengths to the different levels of cohesion they 

identify within this framework. 

A comprehensive framework based on a standard terminology and formalism is 

outlined by Briand et al. which can be used (i) to facilitate comparison of existing 

cohesion measures, (ii) to facilitate the evaluation and empirical validation of existing 

cohesion measures, and (iii) to support the definition of new cohesion measures 

and the selection of existing ones based on a particular goal of measurement. 

2.4.1 Eder et al. 

Eder et al. propose a framework aimed at providing comprehensive, qualitative 

criteria for cohesion in object-oriented systems [34]. They modify existing frameworks 

for cohesion in the procedural and object-based paradigm to the specifics of 

the object-oriented paradigm. They distinguish between three types of cohesion in 

an object-oriented system: method, class and inheritance cohesion and state that 

various degrees of cohesion exist for each type. 

Myers [83] classical definition of cohesion was applied to methods for their definition 

of method cohesion. Elements of a method are statements, local variables 

and attributes of the method’s class. They defined seven degrees of cohesion, based 

on the definition by Myers. From weakest to strongest, the degrees of method cohesion 

are coincidental, logical, temporal, communicational, sequential, procedural 

and functional. 

Class cohesion addresses the relationships between the elements of a class. The

2.4. Frameworks for Static Cohesion Measurement 22 

elements of a class are its non-inherited methods and non-inherited attributes. Eder 

et al. use a categorisation of cohesion for abstract data types by Embley and Woodfield 

[35] and adapt it to object-oriented systems. They define five degrees of class cohesion 

which are, from weakest to strongest, separable, multifaceted, non-delegated, 

concealed and model. 

Inheritance cohesion is similar to class cohesion in that it addresses the relationships 

between elements of a class. However, inheritance cohesion takes all the 

methods and attributes of a class into account, that is, both the inherited and noninherited. 

Inheritance cohesion is strong if inheritance has been used for the purpose 

of defining specialized children classes. Inheritance cohesion is weak if it has been 

used for the purpose of reusing code. The degrees of inheritance cohesion are the 

same as those for class cohesion. 

2.4.2 Briand et al. 

Briand et al. outline a new framework for cohesion in object-oriented systems [12] 

based on the issues identified by comparing the various approaches to measuring 

cohesion and the discussion of existing measures outlined in Section 2.3. The framework 

consists of five criteria, each criterion determining one basic aspect of the 

resulting measure. 

The five criteria of the framework are: 

1. The type of connection, that is, what makes a class cohesive. A connection 

within a class is a link between elements of the class which can be attributes, 

methods, or data declarations. 

2. Domain of the measure, this specifies the objects to be measured which can 

be methods, classes etc. 

3. They ask whether direct or also indirect connections should be counted. 

4. How to deal with inheritance, that is, how to assign attributes and methods 

to classes and how to account for polymorphism. 

5. How to account for access methods and constructors.

2.5. Run-time/Dynamic Coupling Metrics 23 

2.5 Run-time/Dynamic Coupling Metrics 

While there has been considerable work on static metrics there has been little research 

to date on run-time/dynamic coupling metrics. This section presents the two 

most relevant works. 

2.5.1 Yacoub et al. 

Yacoub et al. propose a set of dynamic coupling metrics designed to evaluate the 

change-proneness of a design [99]. These metrics are applied at the early development 

phase to determine design quality. The measures are calculated from 

executable object-oriented design models, which are used to model the application 

to be tested. They are based on execution scenarios, that is “the measurements are 

calculated for parts of the design model that are activated during the execution of a 

specific scenario triggered by an input stimulus.” A scenario is the context in which 

the metric is applicable. The scenarios are then extended to have an application 

scope. 

They define two metrics designed to measure the quality of designs at an early 

development phase. Export Object Coupling (EOC x (o i , o j )) for an object o i with 

respect to an object o j , is defined as the percentage of the number of messages sent 

from o i to o j with respect to the total number of messages exchanged during the 

execution of a scenario x. Import Object Coupling (IOC x (o i , o j )) for an object o i 

with respect to an object o j , is the percentage of the number of messages received by 

object o i that were sent by object o j with respect to the total number of messages 

exchanged during the execution of a scenario x. 

2.5.2 Arisholm et al. 

Arisholm et al. define and validate a number of dynamic coupling metrics that 

are listed in Table 2.1 [5]. Each dynamic coupling metric name starts with either 

I or E to distinguish between import coupling and export coupling, based on the 

direction of the method calls. The third letter C or O distinguishes whether entity 

of measurement is the object or the class. The remaining letter distinguish three

2.5. Run-time/Dynamic Coupling Metrics 24 

Variable 

IC CC 

IC CM 

IC CD 

EC CC 

EC CM 

EC CD 

IC OC 

IC OM 

IC OD 

EC OC 

EC OM 

EC OD 

Description 

Import, Class Level, Number of Distinct Classes 

Import, Class Level, Number of Distinct Methods 

Import, Class Level, Number of Dynamic Messages 

Export, Class Level, Number of Distinct Classes 

Export, Class Level, Number of Distinct Methods 

Export, Class Level, Number of Dynamic Messages 

Import, Object Level, Number of Distinct Classes 

Import, Object Level, Number of Distinct Methods 

Import, Object Level, Number of Dynamic Messages 

Export, Object Level, Number of Distinct Classes 

Export, Object Level, Number of Distinct Methods 

Export, Object Level, Number of Dynamic Messages 

Table 2.1: Abbreviations for the dynamic coupling metrics of Arisholm et al. 

types of coupling. The first metric, C, counts the number of distinct classes that 

a method in a given class/object uses or is used by. The second metric, M, counts 

the number of distinct methods invoked by each method in each class/object while 

the third metric, D, counts the total number of dynamic messages sent or received 

from one class/object to or from other classes/objects. 

Arisholm et al. study the relationship of these measures with the changeproneness 

of classes. They find that the dynamic coupling metrics capture additional 

properties compared to the static coupling metrics and are good predictors 

of the change-proneness of a class. Their study uses a single software system called 

Velocity executed with its associated test suite, to evaluate the dynamic coupling 

metrics. These test cases are found to originally have 70% method coverage, which 

is increased to 90% for the methods that “might contribute to coupling” through 

the removal of dead code. However, they did not study the impact of code coverage 

on their results nor were results given for programs other than versions of Velocity.

2.6. Run-time/Dynamic Cohesion Metrics 25 

2.6 Run-time/Dynamic Cohesion Metrics 

As is the case with the run-time coupling metrics, there has not been much research 

into run-time measures for cohesion. This section presents the only available study 

to date. 

2.6.1 Gupta and Rao 

Gupta and Rao conduct a study which measures module cohesion in legacy software 

[46]. Gupta and Rao compare statically calculated metrics against a program 

execution based approach of measuring the levels of module cohesion. The results 

from this study show that the static approach significantly overestimates the levels 

of cohesion present in the software tested. However, Gupta and Rao are considering 

programs written in C, where many features of object-oriented programs are not 

directly applicable. 

2.7 Other Studies of Dynamic Behaviour 

In this section we present a review of other work on studies into the dynamic behaviour 

of Java programs. While such research is not directly related to coupling 

and cohesion metrics, many of the issues and approaches to measurement are similar. 

Indeed, any research that performs both static and dynamic analyses of programs 

benefits from being viewed in the context of some overall perspective of the relationship 

between the static and dynamic data. 

2.7.1 Dynamic Behaviour Studies 

A number of studies of the dynamic behaviour of Java programs have been carried 

out, mostly for optimisation purposes. Issues such as bytecode usage [45] and memory 

utilisation [28] have been studied, along with a comprehensive set of dynamic 

measures relating to polymorphism, object creation and hot-spots [33]. However, 

none of this work directly addresses the calculation of standard software metrics at 

run-time.

2.8. Coverage Metrics and Software Testing 26 

The Sable group [33] seek to quantify the behaviour of programs with a concise 

and precisely defined set of metrics. They define a set of unambiguous, dynamic, robust 

and architecture-independent measures that can be used to categorise programs 

according to their dynamic behaviour in five areas which are size, data structure, 

memory use, concurrency, and polymorphism. Many of the measurements they 

record are of interest to the Java performance community as understanding the dynamic 

behaviour of programs is one important aspect in developing effective new 

strategies for optimising compilers and runtime systems. It is important to note 

that these are not typical software engineering metrics. 

2.8 Coverage Metrics and Software Testing 

Dynamic coverage measures are typically used in the field of software testing as 

an estimate of the effectiveness of a test suite [10, 72]. Measurements of structural 

coverage of code is a means of assessing the thoroughness of testing. The basis 

of software testing is that software functionality is characterised by its execution 

behaviour. In general, improved test coverage leads to improved fault coverage 

and improved software reliability [69]. There are a number of metrics available for 

measuring coverage, with increasing support from software tools. Such metrics do 

not constitute testing techniques, but can be used as a measure of the effectiveness 

of testing techniques. There are many different strategies for testing software, and 

there is no consensus among software engineers about which approach is preferable 

in a given situation. Test strategies fall into two categories [40]: 

• Black-box (closed-box) testing: The test cases are derived from the specification 

or requirements without reference to the code itself or its structure. 

• White-box (open-box) testing: The test cases are selected based on knowledge 

of the internal program structure. 

A number of coverage metrics are based on the traversal of paths through the 

control dataflow graph (CDFG) representing the system behaviour. Applying these 

metrics to the CDFG representing a single process is a well understood task. The

2.8. Coverage Metrics and Software Testing 27 

following coverage metrics are examples of white-box testing techniques and are 

based on the CDFG. 

2.8.1 Instruction Coverage 

Instruction coverage is the simplest structural coverage metric. It is achieved if 

every source language statement in the program is executed at least once. With 

this technique test cases are selected so that every program statement is executed 

at least once. It is also known as statement coverage, segment coverage [84], C1 [7] 

and basic block coverage. 

The main advantage of this measure is that it can be applied directly to object 

code and does not require processing source code. Performance profilers commonly 

implement this measure. The main disadvantage of statement coverage is that it is 

insensitive to some control structures. In summary, this measure is affected more 

by computational statements than by decisions. Due to its ubiquity this was chosen 

as the coverage measure that is used in the case studies in this thesis. There are 

however, a number of other methods for evaluating the coverage of a program, for 

example branch coverage, condition coverage, condition/decision coverage, modified 

condition/decision coverage and path coverage. 

2.8.2 Alexander and Offutt 

In their paper [3], Alexander and Offutt describe a coupling-based testing approach 

for analysing and testing the polymorphic relationships that occur in object-oriented 

software. The traditional notion of software coupling has been updated to apply 

to object-oriented software, handling the relationships of aggregation, inheritance 

and polymorphism. This allows the introduction of a new integration analysis and 

testing technique for data flow interactions within object-oriented software. The 

foundation of this technique is the coupling sequence, which is a new abstraction 

for representing state space interactions between pairs of method invocations. The 

coupling sequence provides the analytical focal point for methods under test and is 

the foundation for identifying and representing polymorphic relationships for both

2.9. Previous Work by the Author 28 

static and dynamic analysis. With this abstraction both testers and developers of 

object-oriented programs can analyse and better understand the interactions within 

their software. The application of these techniques can result in an increased ability 

to find faults and overall higher quality software. 

2.9 Previous Work by the Author 

A preliminary study was previously conducted on the issues involved in performing 

a run-time analysis of Java programs [74]. This study outlined the general 

principles involved in performing such an analysis. However, the results did not 

offer a justifiable basis for generalisation as the programs analysed were a set of 

Java microbenchmarks from the Java Grande Forum Benchmark Suite (JGFBS) 

and therefore not representative of real applications. The metrics used were also 

of a more primitive nature than the ones used in this study. Also, there was no 

investigation made into the perspective of the measures, that is, the influence of 

coverage, or the ability to predict external design quality. It did however provide 

an indication that the evaluation of software metrics at run-time can provide an 

interesting quantitative analysis of a program and that further research in this area 

is needed. 

The following papers have also been published: 

• In [77, 78] studies on the quantification of a variety of run-time class-level 

coupling metrics for object-oriented programs are described. 

• In [77,79] an empirical investigation into run-time metrics for cohesion is presented. 

• A study into a coverage analysis of Java benchmark suites is described in [20]. 

• An investigation into how object-level run-time metrics can be used to study 

coupling between objects is presented in [81]. 

• A study of the influence of coverage on the relationship between static and 

dynamic coupling metrics is described in [80].

2.10. Definition of Run-time Metrics 29 

2.10 Definition of Run-time Metrics 

This section outlines the run-time metrics used in the remainder of this thesis. 

Originally, it was decided to develop a number of run-time metrics for coupling 

and cohesion that parallel the standard static object-oriented measures defined by 

Chidamber and Kemerer [26]. Later Arishlom et al. defined a set of dynamic 

coupling metrics in their paper [5] which closely parallel ours, so for the ease of 

comparison it was decided to use their terminology and definitions for the coupling 

measures. 

The cohesion measures are all novel and are based on our own definitions. 

2.10.1 Coupling Metrics 

Three decision criteria are used to define and classify the run-time coupling measures. 

Firstly, a distinction is made as to whether the entity of measurement is 

the object or the class. Run-time object-level coupling quantifies the level of dependencies 

between objects in a system. Run-time class-level coupling quantifies the 

level of dependencies between the classes that implement the methods or variables of 

the caller object and the receiver object. The class of the object sending or receiving 

a message may be different from the class implementing the corresponding method 

due to the impact of inheritance. 

Second, the direction of coupling for a class or object is taken into account, 

as is outlined in previous static coupling frameworks [13]. This allows for the fact 

that in a coupling relationship a class may act as a client or a server, that is, it may 

access methods or instance variables from another class (import coupling) or it may 

have its own methods or instance variables used (export coupling). 

Finally the strength of the coupling relationship is assessed, that is the amount 

of association between the classes. To do this it is possible to count either: 

1. The number of distinct classes that a method in a given class uses or is used 

by. 

2. The number of distinct methods invoked by each method in each class.


3. The total number of dynamic messages sent or received from one class to or 

from other classes. 

Class-Level Metrics 

The following are metrics for evaluating class-level coupling: 

• IC CC: This determines the number of distinct classes accessed by a class at 

run-time. 

• IC CM: This determines the number of distinct methods accessed by a class 

at run-time. 

• IC CD: This determines the number of dynamic messages accessed by a class 

at run-time. 

• EC CC: This determines the number of distinct classes that are accessed by 

other classes at run-time. 

• EC CM: This determines the number of distinct methods that are accessed by 

other classes at run-time. 

• EC CD: This determines the number of dynamic messages that are accessed 

by other classes at run-time. 

Object-Level Metric 

To evaluate object-level coupling it was deemed necessary to define just one metric. 

Since we want to examine the behaviour of objects at run-time we require a measure 

that is based on a class rather than a method-level. Further, it was deemed necessary 

to evaluate only coupling at the import level, as we are interested in examining how 

classes use other classes at the object-level rather than how they are used by other 

classes, therefore export coupling for this measure was not evaluated. 

The following is a measure for evaluating object-level coupling: 

• IC OC: Import, Object-Level, Number of Distinct Classes: This measure will 

be some function of the static CBO measure, as this measure determines the


classes that can be theoretically accessed at run-time. This is a coarse-grained 

measure which will assess class-class coupling at the object-level. 

2.10.2 Cohesion Metrics 

The following run-time measures are based on the Chidamber and Kemerer static 

LCOM measure for cohesion as described in Section 2.3.1. However, a problem with 

the original definition for LCOM is its lack of discriminating power. Much of this 

arises from the criteria which states if |P | < |Q|, LCOM is automatically set to zero. 

The result of this is a large number of classes with an LCOM of zero so the metric 

has little discriminating power between these classes. In an attempt to correct this, 

for the purpose of this analysis, we modify the original definition to be: 

S LCOM = |P | 

|P |+|Q| 

(2.4) 

S LCOM can range in value from zero to one. This new definition allows for comparison 

across classes therefore we use this new version as a basis for the definition 

of the run-time metrics. As these are cohesion measures they are evaluated at the 

class-level only. 

Run-time Simple LCOM (R LCOM ) 

R LCOM 

is a direct extension of the static case, except that now we only count 

instance variables that are actually accessed at run-time. Thus, for a set of methods 

m 1 , . . . , m n , as before, let {Ii R } represent the set of instance variables referenced by 

method m i at run-time. Two disjoint sets are defined from this: 

We can then define R LCOM as: 

P R = {(I R i , I R j ) | I R i ∩ I R j = ∅} 

Q R = {(I R i , I R j ) | I R i ∩ I R j ≠ ∅} 

(2.5) 

R LCOM = |P R | 

|P R |+|Q R | 

(2.6)


We note that for any method m i , (I i − Ii R ) ≥ 0, and represents the number 

of instance variables mentioned in a method’s code, but not actually accessed at 

run-time. 

Run-time Call-Weighted LCOM (RW LCOM ) 

It is reasonable to suggest that a heavily accessed variable should make a greater contribution 

to class cohesion than one which is rarely accessed. However, the R LCOM 

metric does not distinguish between the degree of access to instance variables. Thus 

a second run-time measure RW LCOM is defined by weighting each instance variable 

by the number of times it is accessed at run-time. This metric assesses the strength 

of cohesion by taking the number of accesses into account. 

As before, consider a class with n methods, m 1 , . . . , m n , and let {I i } be the set 

of instance variables referenced by method m i . Define N i as the number of times 

method m i dynamically accesses instance variables from the set {I i }. 

Now define a call-weighted version of equation 2.2 by summing over the number 

of accesses: 

P W = 

Q W = 

∑ 

1≤i,j≤n 

∑ 

1≤i,j≤n 

{(N i + N j ) | I i ∩ I j = ∅} 

{(N i + N j ) | I i ∩ I j ≠ ∅} 

(2.7) 

where 

Following equation 2.6 we define: 

P W = 0, if {I 1 }, ..., {I n } = ∅ 

RW LCOM = |P W | 

|P W |+|Q W | 

(2.8) 

RW LCOM can range in value from zero to one. There is no direct relationship 

with S LCOM or R LCOM , as it is based on the “hotness” of a particular program.

2.11. Conclusion 33 

2.11 Conclusion 

This chapter outlined the most prevalent metrics for coupling and cohesion and discussed 

other work on studies into the dynamic behaviour of Java programs. Measures 

for dynamic coverage that are commonly used in the field of software testing 

were described. Work and publications by the author were outlined. Finally, a 

description of the run-time metrics used in this thesis were provided.

Chapter 3 

Experimental Design 

This chapter presents an overview of the tools and techniques used to carry out the 

run-time empirical evaluation of a set of Java programs together with a detailed 

description of the set of programs analysed. A review of the statistical techniques 

used to interpret the data is also given. 

3.1 Methods for Collecting Run-time Information 

There are a number of alternative techniques available for extracting run-time information 

from Java programs, each with their own advantages and disadvantages. 

3.1.1 Instrumenting a Virtual Machine 

There are several open-source implementations of the JVM available, for example 

Kaffe [58], Jikes [57] or the Sable VM [59]. As their source code is freely available 

this means that all aspects of a running Java program can be observed. However, 

due to the logging of bytecode instructions, instrumenting a JVM can result in a 

huge amount of data being generated for the simplest of programs. The source code 

organisation must be understood and the instrumentation has to be redone for each 

new version of the VM. There can also be compatibility issues when compared with 

the Java class libraries released by Sun. It has also been found that these VMs are 

not very robust. This was the method used for a preliminary study [74], however it 

was later discarded due to its many disadvantages. 

34

3.2. Metrics Data Collection Tools (Design Objectives) 35 

3.1.2 Sun’s Java Platform Debug Architecture (JPDA) 

Version 1.4 and later of the Java SDK supports a debugging architecture, the JPDA 

[96], that provides event notification for low level JVM operations. A trace program 

that handles these events can thus record information about the execution of a Java 

program. This method is faster than instrumenting a VM and is more robust. The 

same agent works with all VM’s supporting the JPDA and this is currently supported 

by both Sun and IBM (although there are some differences). This technique has 

proved useful in class-level metrics analysis. However, it is still very time consuming 

to generate a profile for a large application and it is difficult to conduct an objectlevel 

analysis using this approach. 

3.1.3 Bytecode Instrumentation 

This involves statically manipulating the bytecode to insert probes, or other tracking 

mechanisms, that record information at runtime. This provides the simplest 

approach to dynamic analysis since it does not require implementation specific 

knowledge of JVM internals, and imposes little overhead on the running program. 

Bytecode instrumentation can be performed using the publicly available Apache 

Bytecode Engineering Library (BCEL) [30]. This technique provides object-level 

accuracy and therefore was used in the object-level metrics analysis. 

3.2 Metrics Data Collection Tools (Design Objectives) 

The dynamic analysis of any program involves a huge amount of data processing. 

However, the level of performance of the collection mechanism was not considered 

to be a critical issue. It was only desirable that the analysis could be carried out in 

reasonable and practical time. The flexibility of the collection mechanism was a key 

issue, as it was necessary to be able to collect a wide variety of dynamic information.


3.2.1 Class-Level Metrics Collection Tool (ClMet) 

We have developed a tool for the collection of class-level metrics called ClMet, as 

illustrated by Figure 3.1, which utilises the JPDA. This is a multi-tiered debugging 

architecture contained within Sun Microsystem’s Java 2 SDK version 1.4. It consists 

of two interfaces, the Java Virtual Machine Debug Interface (JVMDI), and the Java 

Debug Interface (JDI), and a protocol, the Java Debug Wire Protocol (JDWP). 

The first layer of the JPDA, the JVMDI, is a programming interface implemented 

by the virtual machine. It provides a way to both inspect the state and control the 

execution of applications running in the JVM. The second layer, the JDWP, defines 

the format of information and requests transferred between the process being 

debugged and the debugger front-end which implements the JDI. The JDI, which 

comprises the third layer, defines information and requests at the user code level. It 

provides introspective access to a running virtual machine’s state, the class, array, 

interface, and primitive types, and instances of those types. While a tracer implementor 

could directly use the Java Debug Wire Protocol (JDWP) or Java Virtual 

Machine Debug Interface (JVMDI), this interface greatly facilitates the integration 

of tracing capabilities into development environments. This method was selected 

because of the ease with which it is possible to obtain specific information about 

the run-time behaviour of a program. 

In order to match objects against method calls it is necessary to model the 

execution stack of the JVM, as this information is not provided directly by the 

JPDA. We have implemented an EventTrace analyser class in Java, which carries 

out a stack based simulation of the entire execution in order to obtain information 

about the state of the execution stack. This class also implements a filter which 

allows the user to specify which events and which of their corresponding fields are 

to be captured for processing. This allows a high degree of flexibility in the collection 

of the dynamic trace data. 

The final component of our collection system is a Metrics class, which is responsible 

for calculating the desired metrics on the fly. It is also responsible for 

outputting the results in text format. The metrics to be calculated can be specified 

from the command line. The addition of the metrics class allows new metrics to be


Figure 3.1: Components of run-time class-level metrics collection tool, ClMet 

easily defined as the user need only interact with this class. 

3.2.2 Object-Level Metrics Collection Tool (ObMet) 

We have developed an object-level metrics collection tool called ObMet, which uses 

the BCEL and is based on the Gretel [53] coverage monitoring tool. 

The BCEL is an API which can be used to analyse, create, and manipulate 

(binary) Java class files. Classes are represented by BCEL objects which contain all 

the symbolic information of the given class, such as methods, fields and byte code 

instructions. Such objects can be read from an existing file, be transformed by a 

program and dumped to a file again. 

Figure 3.2 illustrates the components of ObMet. In the first stage the Instru-


A.class B.class C.class 

Instrumenter 


JVM 

Probe Table 

 

 

 

Probe hit 

reports 

(Binary file) 

Calculate 

Metrics 

Metrics 

Results 

Figure 3.2: Components of run-time object-level metrics collection tool, ObMet 

menter program takes a list of class files and instruments them. During this phase 

the BCEL inserts probes into these files to flag events like method calls or instance 

variable accesses. During instrumentation, the class files are changed in-place, and 

a file containing information on method and field accesses is created. Each method 

and field are given a unique index in this file. When the application is run, each 

probe records a “hit” in another file. The Metrics program then calculates the 

run-time measures utilising the information in these files. 

3.2.3 Static Data Collection Tool (StatMet) 

In order to calculate the static metrics it is necessary to convert the binary class 

files into a human readable format. The StatMet tool is based on the Gnoloo 

disassembler [38], which converts the class files into an Oolong source file. The 

Oolong language is an assembly language for the Java Virtual Machine and the 

resulting file will be nearly equivalent to the class file format but it will be suitable for 

human interpretation. The StatMet tool extends the disassembler with an additional



[Binary Format] 

Gnoloo 

[Disassembler] 

Oolong Code 

[Human Readable Code] 

Metrics 

Static Metrics 

Results 

Figure 3.3: Components of static metrics collection tool, StatMet 

metrics component which calculates the static metrics from the Oolong code. Figure 

3.3 illustrates the components of the StatMet tool. 

3.2.4 Coverage Data Collection Tool (InCov) 

In order to calculate the instruction coverage, it is necessary to record, for each 

instruction, whether or not it was executed. In fact, well-known techniques exist for 

identifying sequences of consecutive instructions, known as basic blocks, that somewhat 

reduce the instrumentation overhead. Nonetheless, since static code analysis 

is required to determine basic block entry points, it seemed most efficient to also 

instrument the bytecode during this analysis. 

The instrumentation framework uses the Apache Byte Code Engineering Library 

(BCEL) [30] along with the Gretel Residual Test Coverage Tool [53]. The Gretel 

tool statically works out the basic blocks in a Java class file and inserts a probe 

consisting of small sequence of bytecode instructions at each basic block. Whenever 

the basic block is executed, the probe code records a “hit” as a simple boolean value. 

The number of bytecode instructions in the basic block can then be used to calculate


instruction coverage. 

3.2.5 Fault Detection Study 

Mutation testing [48, 64] is a fault-based testing technique that measures the effectiveness 

of test cases. It was first introduced as a way of measuring the accuracy 

of test suites. It is based on the assumption that a program will be well tested if 

a majority of simple faults are detected and removed. Mutation testing measures 

how good a test is by inserting faults into the program under test. Each fault generates 

a new program, a mutant, that is slightly different from the original. These 

mutant versions of the program are created from the original program by applying 

mutation operators, which describe syntactic changes to the programming language. 

Test cases are used to execute these mutants with the goal of causing each mutant 

to produce incorrect output. The idea is that the tests are adequate if they distinguish 

the program from one or more mutants. The cost of mutation testing has 

always been a serious issue and many techniques proposed for implementing it have 

proved to be too slow for practical adoption. µJava is a tool created for performing 

mutation testing on Java programs. 

µJava 

µJava [66, 67] is a mutation system for Java programs. It automatically generates 

mutants for both traditional mutation testing and class-level mutation testing. It 

can test individual classes and packages of multiple classes. 

The method-level or traditional mutants are based on the selective operator set 

by Offutt et al. [87]. These (non-OO) mutants are all behavioural in nature. There 

are five traditional mutants in total. A description of these mutants can be found 

in Appendix D.1. 

The class-level mutation operators were designed for Java classes by Ma, Kwon 

and Offutt [68], and were in turn designed from a categorisation of object-oriented 

faults by Offutt, Alexander et al. [86]. The object-oriented mutants are created 

according to 23 operators that are specialised to object-oriented faults. Each of 

these can be catergorised based one of five language feature groups they are related

3.3. Test Case Programs 41 

to. The class-level mutants can also be divided into one of two types, behavioural 

mutants are those that change the behavior of the program while structural mutants 

are those that change the structure of the program. A detailed description of these 

mutants can be found in Appendix D.2. 

After creating mutants, µJava allows the tester to enter and run tests, and 

evaluates the mutation coverage of the tests. Test cases are then added in an attempt 

to “kill” the mutants by differentiating the output of the original program from the 

mutant programs. Tests are supplied by the users as sequences of method calls to 

the classes under test encapsulated in methods in separate classes. 

3.3 Test Case Programs 

An important technique used in the evaluation of object systems is benchmarking. A 

benchmark is a black-box test, even if the source code is available [73]. A benchmark 

should consists of two elements: 

• The structure of the persistent data. 

• The behaviour of an application accessing and manipulating the data. 

The process of using a benchmark to assess a particular object system involves executing 

or simulating the behaviour of the application while collecting data reflecting 

its performance [54]. A number of different Java benchmarks are available and those 

used in the course of this study are discussed in the following subsection. 

3.3.1 Benchmark Programs 

Benchmark suites are commonly used to measure performance and fulfill many of 

the required properties of a test suite. The following were used in this analysis. 

SPECjvm98 Benchmark Suite 

The SPECjvm98 benchmark suite [8] is typically used to study the architectural 

implications of a Java runtime environment. The benchmark suite consists of eight


Application 

Description 

201 compress A popular modified Lempel–Ziv method (LZW) compression 

program. 

202 jess JESS is the Java Expert Shell System and is based on 

NASAs popular CLIPS rule-based expert shell system. 

205 raytrace This is a raytracer that works on a scene depicting a dinosaur. 

209 db Data management software written by IBM. 

213 javac This is the Sun Microsystem Java compiler from the JDK 

1.0.2. 

222 mpegaudio This is an application that decompresses audio files that 

conform to the ISO MPEG Layer–3 audio specification. 

227 mtrt This is a variant of 205 raytrace. This is a dual–threaded 

program that ray traces an image. 

228 jack A Java parser generator from Sun Microsystems that is 

based on the Purdue Compiler Construction Tool Set (PC- 

CTS). This is an early version of what is now called 

JavaCC. 

Table 3.1: Description of the SPECjvm98 benchmarks 

Java programs which represent different classes of Java applications as illustrated 

by Table 3.1. 

These programs were run at the command line prompt and do not include graphics, 

AWT (graphical interfaces), or networking. The programs were run with a 100% 

size execution by specifying a problem size s100 at the command line. 

JOlden Benchmark Suite 

The original Olden benchmarks are a suite of pointer intensive C programs which 

have been translated into Java. They are small, synthetic programs but they were 

used as part of this study as each program exhibits a large volume of object creation.


Application 

bh 

bisort 

em3d 

health 

mst 

perimeter 

power 

treeadd 

tsp 

voronoi 

Description 

Solves the N-body problem using hierarchical methods. 

Sorts by creating two disjoint bitonic sequences and then 

merging them. 

Simulates the propagation electro-magnetic waves in a 3D 

object. 

Simulates the Columbian health care system. 

Computes the minimum spanning tree of a graph. 

Computes the perimeter of a set of quad-tree encoded 

raster images. 

Solves the Power System Optimization problem. 

Adds the values in a tree. 

Computes an estimate of the best hamiltonian circuit for 

the travelling salesman problem. 

Computes the Voronoi Diagram of a set of points. 

Table 3.2: Description of the JOlden benchmarks 

Table 3.2 gives a description of the programs [23]. 

There are a number of other benchmark suite available that could be used in this 

type of study which were excluded for various reasons. The DaCapo benchmark suite 

was excluded as it is still in its beta stage of development. The Java Grande Forum 

Benchmark Suite (JGFBS), which was used in a previous study [74], was excluded 

as the programs did not exhibit very high levels of coupling and cohesion at runtime. 

Other suite such as CaffineMark were excluded as these are microbenchmark 

programs therefore are not typical of real Java applications. 

3.3.2 Real-World Programs 

It was deemed desirable to include a number of real-world programs in the analysis 

to see if the results are scalable to actual programs. The following were chosen as 

they are all publicly available and so is their source code. They all come with a set


of pre-defined test cases that are also publicly available, thus defining both the static 

and dynamic context of our work. This contrasts with some other approaches which, 

at worst, can use arbitrary software packages, often proprietary, with an ad-hoc set 

of test inputs. 

Velocity 

Velocity (version 1.4.1) is an open-source software system that is part of the Apache 

Jakarta Project [55]. It is a Java-based template engine and it permits anyone to 

use a simple yet powerful template language to reference objects defined in Java 

code. It can be used to generate web pages, SQL, PostScript, and other outputs 

from template documents. It can be used either as a standalone utility or as an 

integrated component of other systems. The set of JUnit test cases supplied with 

the program were used to execute the program. 

Xalan-Java 

Xalan-Java (version 2.6.0) is an open-source software system that is part of the 

Apache XML Project [92]. It is an XSLT processor for transforming XML documents 

into HTML, text, or other XML document types. It implements XSL Transformations 

(XSLT) Version 1.0 and XML Path Language (XPath) Version 1.0. It 

can be used from the command line, in an applet or a servlet, or as a module in 

other program. A set of JUnit test cases supplied for the program were used for its 

execution. 

Ant 

Ant (version 1.6.1) is a Java-based build tool that is part of the Apache Ant Project 

[4]. It is similar to GNU Make but has the full portability of pure Java code. Instead 

of writing shell commands, as with Make, the configuration files are XML-based, 

calling out a target tree where various tasks are executed.

3.4. Statistical Techniques 45 

SPECjvm98 JOlden Velocity Xalan Ant 

Case Study 1: X X X X X 

Case Study 2: X X X X 

Case Study 3: X X X 

Table 3.3: Programs used for each case study 

3.3.3 Execution of Programs 

All the programs except those in the SPEC benchmark suite were compiled using 

the javac compiler from Sun’s SDK version 1.5.0 01, and all benchmarks were run 

using the client virtual machine from this SDK. The programs in the SPEC suite are 

distributed in class file format, and were not recompiled or otherwise modified. We 

note (in accordance with the license) that the SPEC programs were run individually, 

and thus none of these results are comparable with the standard SPECjvm98 metric. 

All benchmark suites include not just the programs themselves, but a test harness 

to ensure that results from different executions are comparable. Table 3.3 outlines 

the programs used for each case study. Not all programs were suitable for use in 

every case study and we defer the explanation of this to the relevant chapters. 

3.4 Statistical Techniques 

The following section presents a detailed review of the statistical techniques used in 

this study. 

3.4.1 Descriptive Statistics 

Descriptive statistics describe patterns and general trends in a data set. They also 

aid in explaining the results of more complex statistical techniques. For each case 

study a number of descriptive statistics were evaluated from the following: 

The Distribution or Mean (X) 

X = 

∑ X 

N 

(3.1)


The mean is the sum of all values (X) divided by the total number of values (N). 

The Standard Deviation (s) 

√ ∑(X 

s = √ − X) 

2 

var = 

N − 1 

(3.2) 

The standard deviation is a measure of the range of values in a set of numbers. 

It is used used as a measure of the dispersion or variation in a distribution. Simply 

put, it tells us how far a typical member of a sample or population is from the 

mean value of that sample or population. A large standard deviation suggests that 

a typical member is far away from the mean. A small standard deviation suggests 

that members are clustered closely around the mean. It is computed as the square 

root of the variance. 

Many statistical techniques assume that data is normally distributed. If that 

assumption can be justified, then 68% of the values are at most one standard deviation 

away from the mean, 95% of the values are at most two standard deviations 

away from the mean, and 99.7% of the values lie within three standard deviations 

of the mean. 

The Coefficient of Variation (C V ) 

C V = σ/µ ∗ 100 (3.3) 

C V 

measures the relative scatter in data with respect to the mean and is calculated 

by dividing the standard deviation by the mean. It has no units and can be 

expressed as a simple decimal value or reported as a percentage value. When the 

C V is small the data scatter relative to the mean is small. When the C V is large 

compared to the mean the amount of variation is large. Equation 3.3 defines the 

coefficient of variation as a percentage, where µ is the mean and σ is the standard 

deviation. 

Skewness 

skewness = 

∑ n 

i=1 (X i − X) 3 

(N − 1)s 3 (3.4)


Skewness is the tilt (or lack of it) in a distribution. It characterises the degree of 

asymmetry of a distribution around its mean. A distribution is symmetric if it looks 

the same to the left and right of the centre point. Equation 3.4 gives the formula 

for skewness for X 1 , X 2 , ..., X N , where X is the mean, s is the standard deviation 

and N is the number of data points 

Kurtosis 

∑ n 

i=1 

kurotsis = 

(X i − X) 4 

(3.5) 

(N − 1)s 4 

Kurtosis is the peakedness of a distribution. Equation 3.5 gives the formula for 

kurtosis for X 1 , X 2 , ..., X N ,. 

3.4.2 Normality Tests 

Many statistical procedures require that the data being analysed follow a normal 

data distribution. If this is not the case, then the computed statistics may be 

extremely misleading. Normal distributions take the form of a symmetric bellshaped 

curve. Normality can be visually assessed by looking at a histogram of 

frequencies, or by looking at a normal probability plot. 

A common rule-of-thumb test for normality is to get skewness and kurtosis, then 

divide these by the standard errors. Skew and kurtosis should be within the +2 

to -2 range when the data are normally distributed. Negative skew is left-leaning, 

positive skew right-leaning. Negative kurtosis indicates too many cases in the tails 

of the distribution. Positive kurtosis indicates too few cases in the tails. 

Shapiro-Wilk’s W Test 

W = (∑ n 

i−1 a ix (i) ) 2 

∑ n 

i−1 (x i − x) 2 (3.6) 

Formal tests such as the Shapiro-Wilk’s test may also be applied to assess 

whether the data is normally distributed. It calculates a W statistic that tests 

whether a random sample, x 1 , x 2 , ..., x n comes from a normal distribution. W may 

be thought of as the correlation between given data and their corresponding normal 

scores, with W = 1 when the given data are perfectly normal in distribution.


When W is significantly smaller than 1, the assumption of normality is not met. 

Shapiro-Wilks W is recommended for small and medium samples up to n = 2000. 

Equation 3.6 calculates the W statistic where x i are the ordered sample values and 

a i are the constants generated from the means, variances and covariances of the 

order statistics of a sample of size n from a normal distribution [90, 93]. 

Kolmogorov-Smirnov D Test or K-S Lilliefors test 

D = max l≤i≤N |F (y i ) − i N | (3.7) 

For larger samples, the Kolmogorov-Smirnov test is recommended. For a single 

sample of data, this test is used to test whether or not the sample of data is consistent 

with a specified distribution function. When there are two samples of data, it is 

used to test whether or not these two samples may reasonably be assumed to come 

from the same distribution. Equation 3.7 defines the test statistic, where F is the 

theoretical cumulative distribution of the distribution being tested which must be a 

continuous distribution. The hypothesis regarding the distributional form is rejected 

if the test statistic, D, is greater than the critical value obtained from a table. There 

are several variations of these tables in the literature [24]. 

3.4.3 Normalising Transformations 

There are a number of transformations that can be applied to approximate data 

to become normally distributed. To normalise right or positive skew, square roots, 

logarithmic, and inverse (1/x) transforms “pull in” outliers. Inverse transforms are 

stronger than logarithmic transforms which are stronger than roots. To correct left 

or negative skew, first subtract all values from the highest value plus 1, then apply 

square root, inverse, or logarithmic transforms. Power transforms can be used to 

correct both types of skew and finer adjustments can be made by adding a constant, 

C, in the transform of X: (X + C) P . Values of P less than one (roots) correct 

right skew, which is the common situation (using a power of 2/3 is common when 

attempting to normalise). Values of P greater than 1 (powers) correct left skew. 

For right skew, decreasing P decreases right skew. Too great a reduction in P will


overcorrect and cause left skew. 

When the best P is found, further refinements 

can be made by adjusting C. For right skew, for instance, subtracting C will decrease 

skew. Logarithmic transformations are appropriate to achieve symmetry in 

the central distribution when symmetry of the tails is not important. Square root 

transformations are used when symmetry in the tails is important. When both are 

important, a fourth root transform may work. 

3.4.4 Pearson Correlation Test 

R = 

n ∑ xy − ∑ x ∑ y 

√ 

([n 

∑ 

x2 − ( ∑ x) 2 ][n ∑ y 2 − ( ∑ y) 2 ]) 

(3.8) 

The Pearson or product moment correlation test is used to assess if there is a 

relationship between two or more variables, in other words it is a measure of the 

strength of the relationship between the variables. Having n pairs of data (x i , y i ), 

equation 3.8 computes the correlation coefficient (R). R is a number that summarises 

the direction and degree (closeness) of linear relations between two variables and is 

also known as the Pearson Product-Moment Correlation Coefficient. R can take 

values between -1 through 0 to +1. The sign (+ or -) of the correlation affects its 

interpretation. When the correlation is positive (R > 0), as the value of one variable 

increases, so does the other. The closer R is to zero the weaker the relationship. If 

a correlation is negative, when one variable increases, the other variable decreases. 

The following general categories indicate a quick way of interpreting a calculated R 

value [97]: 

• 0.0 to 0.2 Very weak to negligible correlation 

• 0.2 to 0.4 Weak, low correlation (not very significant) 

• 0.4 to 0.6 Moderate correlation 

• 0.7 to 0.9 Strong, high correlation 

• 0.9 to 1.0 Very strong correlation 

The results of such an analysis are displayed in a correlation matrix table. 

3.4.5 T-Test 

t = 

r 

√ 

[(1 − r2 )/(N − 2)] 

(3.9)


Any relationship between two variables should be assessed for its significance 

as well as its strength. A standard two tailed t-test is used to test for statistical 

significance as illustrated by equation 3.9. Coefficients are considered significant if 

the t-test p-value is below 0.05. This tells how unlikely a given correlation coefficient, 

r, will occur given no relationship in the population. Therefore the smaller the p- 

level, the more significant the relationship taking account of type I and type II 

errors. 

3.4.6 Principal Component Analysis 

Principal Component Analysis (PCA) is used to analyse the covariate structure 

of the metrics and to determine the underlying structural dimensions they capture. 

In other words PCA can tell if all the metrics are likely to be measuring the same 

class property. PCA usually generates a large number of principal components. The 

number will be decided based on the amount of variance explained by each component. 

A typical threshold would be retaining principal components with eigenvalues 

(variances) larger than 1.0. This is the Kaiser criterion. There are a number of 

stages involved in performing a PCA on a set of data: 

1. Select a data set, for example one with two dimensions x and y. 

2. Subtract the mean from each of the data dimensions. The mean subtracted 

is the average across each dimension, so all the x values have the mean x 

subtracted and all the y values will have y subtracted. This produces a data 

set whose mean is zero. 

3. Calculate the covariance matrix. Formula 3.10 gives the definition for a covariance 

matrix for a set of data with n dimensions, where C n×n is a matrix 

with n rows and n columns, and Dim x is the x th dimension. 

C n×n = (c i,j, , c i,j = cov(Dim i , Dim j )) (3.10) 

An n-dimensional data set will have 

n! 

(n−2!)∗2 

different covariance values. As 

the data we propose to use is two dimensional, the covariance matrix will be


2 × 2: 

⎛ 

C = ⎝ 

cov(x, x) cov(x, y) 

cov(y, x) cov(y, y) 

⎞ 

⎠ 

4. Calculate the eigenvectors and eigenvalues of the covariance matrix. They are 

both unit vectors, that is their lengths are both 1 and they are both closely 

related. These are important as they provide information about patterns in 

the data. 

5. Choosing components and forming a feature vector. In general, once eigenvectors 

are found from the covariance matrix, the next step is to order them 

by eigenvalue, highest to lowest. This gives you the components in order of 

significance. Some of the components of lesser significance can be ignored. If 

some components are left out, the final data set will have less dimensions than 

the original. To be precise, if there are originally n dimensions in the data, 

and n eigenvectors and eigenvalues are calculated, and the first p eigenvectors 

are chosen, then the final data set has only p dimensions. 

Forming a feature vector, which is just another name for a matrix of vectors, 

is constructed by taking the eigenvectors that you want to keep from the list 

of eigenvectors, and forming a matrix with these eigenvectors in the columns. 

F eatureV ector = (eig 1 eig 2 eig 3 ...eig n ) (3.11) 

6. Derive a new data set, for this we simply take the transpose of the vector and 

multiply it on the left of the original data set, transposed. 

F inalData = RowF eatureV ector ∗ RowDataAdjust (3.12) 

where RowFeatureVector is the matrix with the eigenvectors in the columns 

transposed so that the eigenvectors are now in the rows, with the most significant 

eigenvector at the top, and RowDataAdjust is the mean-adjusted data


transposed, that is, the data items are in each column, with each row holding 

a separate dimension. 

See [56] for further details on PCA. 

3.4.7 Cluster Analysis 

Cluster Analysis is a data exploratory statistical procedure that helps reveal associations 

and structures of data in a domain set [91]. A measure of proximity or 

similarity/dissimilarity is needed in order to determine groups from a complex data 

set. A wide variety of such measures exist but no consensus prevails over which is 

superior. For this study, two widely used dissimilarity measures, Pearson dissimilarity 

and Euclidean distance were chosen. The analysis was repeated using these 

two different measures in order to verify the results. 

Equation 3.13 defines the Pearson Dissimilarity, where µ x and µ y are the means 

of the first and second sets of data, and σ x and σ y are the standard deviations of 

the first and second sets of data. 

d(x, y) = 

∑ 

1 

n i x iy i − µ x µ y 

(3.13) 

σ x σ y 

Equation 3.14 defines the Euclidean Distance between two sets of data. 

∑ 

d(x, y) = √ n (x i − y i ) 2 (3.14) 

i 

The next step is to select the most suitable type of clustering algorithm for the 

analysis. The agglomerative hierarchical clustering (AHC) algorithm was chosen as 

being the most suitable for the specifications of the analysis. Also, it does not require 

the number of clusters the data should be grouped into be specified in advance. AHC 

algorithms start with singleton clusters, one for each entity. The most similar pair 

of clusters are merged, one pair at a time, until a single cluster remains. 

Throughout the cluster analysis, there is a symmetric matrix of dissimilarities 

maintained between the clusters. Once two clusters have been merged, it is necessary 

to generate the dissimilarity between the new cluster and every other cluster.


The unweighted pair group average linkage algorithm was employed here as it is 

theoretically the best method to use. This algorithm clusters objects based on the 

average distance between all pairs. 

Suppose we have three clusters A, B and C, with i being the distance between 

A and B, and j being the distance between B and C. If A and B are the most 

similar pair of entities and are joined together into a new cluster D, the method of 

calculating the new distance k between C and D is given by Equation 3.15. 

k = (i ∗ size(A) + j ∗ size(B))/(size(A) + size(B)) (3.15) 

The analysis was repeated using Ward’s method to verify the results. With 

this method cluster membership is assessed by calculating the total sum of squared 

deviations from the mean of a cluster. The criterion for fusion is that it should 

produce the smallest possible increase in the error sum of squares. 

The output of AHC is usually represented in a special type of tree structure called 

a dendrogram, as illustrated by Figure 3.4. Each branch of the tree represents a 

cluster and is drawn vertically to height where the cluster merges with neighbouring 

clusters. The cutting line is a line drawn horizontally across the dendrogram at a 

given dissimilarity level to determine the number of clusters. The cutting line is 

determined by constructing a histogram of node levels to find where the increase 

in dissimilarity is strongest, as then we have reached a level where we are grouping 

groups that are already homogeneous. The cutting line is selected before this level 

is reached. 

3.4.8 Regression Analysis 

The general computational problem that needs to be solved in linear regression 

analysis is to fit a straight line to a set of points [43]. When there is more than 

one independent variable, the regression procedures will estimate a linear equation 

of the form shown in Equation 3.16, where Y is the dependent variable, X i stands 

for a set of independent variables, a is a constant and each b i is the slope of the 

regression line. The constant a is also known as the intercept, and the slope as the 

regression coefficient.


Figure 3.4: Dendrogram: At the cutting line there are two clusters


Y = a + b 1 X 1 + b 2 X 2 + . . . + b p X p (3.16) 

The regression line expresses the best prediction of the dependent variable Y 

given the independent variables X i . However, usually there is substantial variation 

of the observed points around the fitted regression line. The deviation of a particular 

point from the line is known as the residual value. The smaller the variability of 

the residual values around the regression line relative to the overall variability, the 

better the prediction. In most cases the ratio will fall somewhere between 0.0 and 

1.0. If there is no relationship between the X and Y variables the ratio will be 

1.0, while if X and Y are perfectly related the ratio will be 0.0. The least squares 

method is employed to perform the regression. 

The R 2 or the coefficient of determination is 1.0 minus this ratio. The R 2 value 

is an indicator of how well the model fits the data. If we have an R 2 close to 1.0 this 

indicates that we have accounted for almost all of the variability with the variables 

specified in the model. 

The correlation coefficient R expresses the degree to which two or more independent 

variables are related to the dependent variable, and it is the square root of R 2 . 

R can assume values between -1 and +1. The sign (plus or minus) of the correlation 

coefficient interprets the direction of the relationship between the variables. If 

it is positive, then the relationship of this variable with the dependent variable is 

positive. If it is negative then the relationship is negative. If it is zero then there is 

no relationship between the variables. 

3.4.9 Analysis of Variance (ANOVA) 

ANOVA is used to test the significance of the variation in the dependent variable that 

can be attributed to the regression of one or more independent variables. The results 

enable us to determine whether or not the explanatory variables bring significant 

information to the model. ANOVA gives a statistical test of the null hypothesis H 0 , 

which is, there is no linear relationship between the variables versus the alternative 

hypothesis H 1 , which is, there is a relationship between the variables.


There are four parts to ANOVA results, the sum of squares, degrees of freedom, 

mean squares and the F test. Fisher’s F test, as given by Equation 3.17, is used 

to test whether the R 2 values are statistically significant. Values are deemed to be 

significant at p ≤ 0.05. 

F = R2 ∗ (N − K − 1) 

(1 − R 2 ) ∗ K 

(3.17) 

Here, K is the number of independent variables (two in our case) and N is the 

number of observed values. 


A detailed account of the tools and techniques needed to conduct the test case 

studies described in the next sections were outlined in this chapter. The programs 

evaluated in this work were discussed and an outline of the statistical techniques 

used to analyse the results were provided.

Chapter 4 

Case Study 1: The Influence of 

Instruction Coverage on the 

Relationship Between Static and 

Run-time Coupling Metrics 

When comparing static and run-time measures it is important to have a thorough 

understanding of the degree to which the analysed source code corresponds to the 

code that is actually executed. In this chapter this relationship is studied using 

instruction coverage measures with regard to the influence of coverage on the relationship 

between static and run-time metrics. It is proposed that coverage results 

have a significant influence on the relationship and thus should always be a measured, 

recorded factor in any such comparison. 

An empirical investigation is conducted using a set of six run-time metrics on 

seventeen Java benchmark and real-world programs. First, the differences in the 

underlying dimensions of coupling captured by the static versus the run-time metrics 

are assessed using principal component analysis. Subsequently, multiple regression 

analysis is used to study the predictive ability of the static CBO and instruction 

coverage data to extrapolate the run-time measures. 

57

4.1. Goals and Hypotheses 58 

4.1 Goals and Hypotheses 

The Goal Question Metric/MEtric DEfinition Approach (GQM/MEDEA) framework 

proposed by Briand et al. [18] was used to set up the experiments for this 

study. 

Goal: To investigate the relationship between static and run-time coupling metrics. 

Experiment 1: 

Perspective: We would expect some degree of correlation between the run-time 

measures for coupling and the static CBO metric. We use a number of statistical 

techniques, including principle component analysis to analyse the covariate structure 

of the metrics to determine if they are measuring the same class properties. 

Environment: We chose to evaluate a number of Java programs from well 

defined publicly-available benchmark suites as well as a number of open source realworld 

programs. 

Hypothesis: 

H 0 : Run-time measures for coupling are simply surrogate measures for the static 

CBO metric. 

H 1 : Run-time measures for coupling are not simply surrogate measures for the 

static CBO metric. 

Experiment 2: 

Goal: To examine the relationship between static CBO and run-time coverage 

metrics, particularly in the context of the influence of instruction coverage. 

Perspective: Intuitively, one would expect the better the coverage of the test 

cases used the greater the correlation between the static and run-time metrics. We 

use multiple regression analysis to determine if there is a significant correlation.

4.2. Experimental Design 59 

Environment: We chose to evaluate a number of Java programs from well 

defined publicly-available benchmark suites as well as a number of open source realworld 

programs. 

Hypothesis: 

H 0 : The coverage of the test cases used to evaluate a program has no influence 

on the relationship between static and run-time coupling metrics. 

H 1 : The coverage of the test cases used to evaluate a program has an influence 

on the relationship between static and run-time coupling metrics. 

4.2 Experimental Design 

In order to conduct the practical experiments underlying this study, it was necessary 

to select a suite of Java programs and measure: 

• the static CBO metric 

• the instruction coverage percentages: I C 

• the run-time coupling metrics: IC CC, EC CC, IC CM, EC EM, IC CD, EC CD 

The static metrics data collection tool StatMet, described in Section 3.2.3, was 

used to calculate CBO, while the InCov tool, outlined in Section 3.2.4, was used to 

determine the instruction coverage. The run-time metrics were evaluated using the 

ClMet tool, which is described in Chapter 3.2.1. 

The set of programs used in this study consist of the benchmark programs JOlden 

and SPECjvm98, as well as the real-world programs Velocity, Xalan and Ant. The 

SPECjvm98 suite was chosen as it is directly comparable to other studies that use 

Java software. The program mtrt was excluded from the investigation as it is multithreaded 

and therefore is not suitable for this type of analysis. The more synthetic 

JOlden programs were included to ensure that it considers programs that create 

significantly large populations of objects. Three of the programs from the JOlden 

suite BiSort, TreeAdd and TSP were omitted from the analysis as they contained

4.3. Results 60 

only two classes, therefore the results could not be further analysed. A selection of 

real programs were selected to ensure that the results were scalable to all types of 

programs. 

4.3 Results 

4.3.1 Experiment 1: To investigate the relationship between 

static and run-time coupling metrics 

For each program the distribution (mean) and variance (standard deviation) 

of each measure across the class is calculated. These statistics are used to select 

metrics that exhibit enough variance to merit further analysis, as a low variance 

metric would not differentiate classes very well and therefore would not be a useful 

predictor of external quality. Descriptive statistics also aid in explaining the results 

of the subsequent analysis. 

The descriptive statistic results for each program are summarised in Table 4.1. 

The metric values exhibit large variances which makes them suitable candidates for 

further analysis. 

Principal Component Analysis 

Principal Component Analysis (PCA) is used to investigate whether the runtime 

coupling metrics are not simply surrogate measures for static CBO. 

A similar study was carried out by Arisholm et al. using only the Velocity 

program [5]. The work in this chapter extends their work to include fourteen benchmark 

programs as well as three real-world programs in order to demonstrate the 

robustness of these results over a larger range and variety of programs. 

Appendix A.1 shows the results of the principal component analysis used to investigate 

the covariate structure of the static and run-time metrics. Using the Kaiser 

criterion to select the number of factors to retain shows that the metrics mostly capture 

three orthogonal dimensions in the sample space formed by all measures. In 

other words, the coupling is divided along three dimensions for each of the programs



201 compress 

202 jess 

205 raytrace 

209 db 

Mean 

SD 


SD 


SD 


SD 

CBO 6.24 6.2 

CBO 6.99 4.78 

CBO 7.25 7.51 

CBO 9.12 6.60 

IC CC 1.72 2.11 

IC CC 2.97 7.21 

IC CC 2.14 4.25 

IC CC 1.81 1.98 

IC CM 4.34 3.54 

IC CM 4.34 3.43 

IC CM 4.45 3.54 

IC CM 6.56 4.46 

IC CD 7.56 5.46 

IC CD 5.45 4.54 

IC CD 7.56 6.56 

IC CD 9.67 8.68 

EC CC 1.80 1.16 

EC CC 2.97 9.01 

EC CC 2.06 1.89 

EC CC 1.88 1.54 

EC CM 4.35 4.76 

EC CM 4.34 4.35 

EC CM 4.54 4.53 

EC CM 6.45 5.67 

EC CD 6.56 4.56 

EC CD 7.56 6.56 

EC CD 6.56 4.56 

EC CD 9.57 7.65 

213 javac 

Mean SD 

CBO 8.54 7.15 

IC CC 3.21 3.01 

IC CM 5.45 4.56 

IC CD 7.56 7.56 

EC CC 3.01 2.87 

EC CM 3.45 4.56 

EC CD 5.45 5.65 

222 mpegaudio 


CBO 5.75 4.90 

IC CC 2.60 2.36 

IC CM 4.54 3.56 

IC CD 7.56 6.56 

EC CC 2.60 2.70 

EC CM 5.45 4.56 

EC CD 5.87 5.46 

228 jack 


CBO 6.05 7.51 

IC CC 2.68 5.37 

IC CM 3.45 3.43 

IC CD 5.45 4.45 

EC CC 2.68 2.39 

EC CM 5.45 4.56 

EC CD 7.56 6.56 


BH 

Em3d 

Health 

MST 


SD 


SD 


SD 


SD 

CBO 5.22 3.40 

CBO 4.20 2.86 

CBO 3.43 3.46 

CBO 4.34 3.45 

IC CC 2.62 2.50 

IC CC 3.22 0.71 

IC CC 2.43 2.46 

IC CC 3.54 2.45 

IC CM 7.44 8.86 

IC CM 3.87 1.01 

IC CM 3.35 4.24 

IC CM 4.23 3.45 

IC CD 8.67 10.84 

IC CD 4.76 3.96 

IC CD 4.25 5.46 

IC CD 7.54 4.54 

EC CC 2.33 1.33 

EC CC 3.75 1.33 

EC CC 3.35 3.46 

EC CC 3.45 3.34 

EC CM 5.77 4.44 

EC CM 3.35 3.49 

EC CM 3.55 2.43 

EC CM 3.45 2.45 

EC CD 6.25 4.74 

EC CD 4.65 3.46 

EC CD 4.46 4.43 

EC CD 4.56 4.32 

Perimeter 


CBO 5.34 4.34 

IC CC 3.34 3.45 

IC CM 4.34 2.45 

IC CD 8.56 6.45 

EC CC 3.54 3.45 

EC CM 4.54 3.43 

EC CD 6.54 3.54 

Power 


CBO 4.50 2.54 

IC CC 1.32 0.45 

IC CM 5.23 2.23 

IC CD 5.64 2.56 

EC CC 1.54 1.45 

EC CM 4.12 4.56 

EC CD 4.67 5.35 

Voronoi 


CBO 5.43 3.46 

IC CC 2.43 1.45 

IC CM 4.54 0.45 

IC CD 7.45 3.46 

EC CC 3.45 3.46 

EC CM 4.45 2.45 

EC CD 5.36 2.46 

Real-World Programs 

Velocity 


CBO 7.59 7.57 

IC CC 4.27 7.11 

IC CM 8.45 10.87 

IC CD 20.45 32.14 

EC CC 3.85 4.30 

EC CM 7.54 9.45 

EC CD 25.45 28.45 

Xalan 


CBO 8.98 9.92 

IC CC 4.03 4.61 

IC CM 8.54 8.99 

IC CD 35.45 38.14 

EC CC 2.85 3.60 

EC CM 6.54 7.56 

EC CD 42.15 45.12 

Ant 


CBO 8.49 7.74 

IC CC 3.92 7.91 

IC CM 7.46 8.78 

IC CD 16.75 17.25 

EC CC 2.43 3.51 

EC CM 7.04 7.54 

EC CD 21.23 20.56 

Table 4.1: Descriptive statistic results for all programs


analysed. 

Analysing the definitions of the measures that exhibit high loadings in PC1, PC2 

and PC3 yields the following interpretation of the coupling dimensions: 

• P C1 = {IC CC, IC CD, IC CM}, the run-time import coupling metrics as 

illustrated by Figure 4.1(a). 

• P C2 = {EC CC, EC CD, EC CM}, the run-time export coupling metrics 

as illustrated by Figure 4.1(b). 

• P C3 = {CBO}, the static coupling metric as illustrated by Figure 4.1(c). 

Figure 4.1 summarises these results graphically. Overall the PCA results demonstrate 

that the run-time coupling metrics are not redundant with the static CBO 

metric and that they capture additional dimensions of coupling. This leads us to 

reject our null hypothesis H 0 , to say that run-time measures for coupling are not 

simply surrogate measures for the static CBO metric, suggesting that additional 

information over and above the information obtainable from the static CBO metrics 

can be extracted using run-time metrics. This confirms the findings of Arisholm et 

al. for the single Velocity program are applicable across a variety of programs. 

The results also indicate that the direction of coupling is a greater determining 

factor than the type of coupling, with P C1 containing the three import-based 

metrics and P C2 containing the three export-based metrics. 

4.3.2 Experiment 2: The influence of instruction coverage 

Multiple Regression Analysis 

Multiple regression analysis is used to test the hypothesis that instruction coverage 

of test cases used to evaluate a program has no influence on the relationship between 

static and run-time metrics. The two independent variables are thus the static CBO 

metric and the instruction coverage measure I c ; each of the six run-time coupling 

metrics in turn is then used as the dependent variable. A full list of these results 

can be found in Appendix A.2.


(a) Results from PCA for IC CC, IC CM and IC CD 

(b) Results from PCA for EC CC, EC CM and EC CD 

(c) Results from PCA for CBO 

Figure 4.1: PCA test results for all programs for metrics in PC1, PC2 and PC3. In 

all graphs the bars represents the PCA value obtained for the corresponding metric. 

PC1 contains import level run-time metrics. PC2 contains the export level run-time 

metrics and PC3 contain the static CBO metric.


First, all R values turned out to be positive for each of the programs used in 

this study. This means that there is a positive correlation between the dependent 

(run-time metric) and independent variables CBO and I c . Therefore as the values 

for CBO and I c increase or decrease so will the observed value for the run-time 

metric under consideration. 

Figures 4.2(a) and 4.2(b) give a pictorial view of the results from the multiple 

regression analysis for all programs for class-level run-time coupling, and Figures 

4.3(a) and 4.3(b) for method-level run-time coupling. The lighter bars represent the 

influence of CBO, while the the darker bars represent the influence of both CBO 

and I c . Therefore the difference between these gives the additional amount of the 

variation of the run-time metric that can be allocated to the influence of instruction 

coverage. 

Distinct Classes: IC CC and EC CC 

It is immediately apparent from Figures 4.2(a) and 4.2(b) that the instruction coverage 

is a significant influencing factor. For example, from Figure 4.2(a) it can be 

seen that in ten of the programs, I c accounts for an extra additional 20% variation. 

Two of the programs in Figure 4.2(a), MST and Voroni, that show little increase, 

already exhibit a high correlation with CBO alone that would have been difficult 

to improve on. While the increase is not uniform throughout the programs in Figure 

4.2(a), the overall data demonstrates that instruction coverage is an important 

contributory factor. 

Figure 4.2(b), representing the contribution of CBO and I c to export coupling 

measured at the class level, presents a sharper contrast. Here, the influence of I c 

is clearly a vital contributing factor, accounting for at least an extra 20% of the 

variation in eleven of the seventeen programs. The important factor here is that 

the overall contribution of CBO to export coupling is much lower than to import 

coupling, as can be seen from contrasting the lighter-shaded bars in Figure 4.2(a) 

with those in Figure 4.2(b). Thus classes with a high level of static coupling exhibit 

a higher level of import coupling at run-time. This indicates that the coupling being 

exercised at run-time is from classes behaving as clients, making use of other class


(a) Results from the multiple linear regression where Y = IC CC. 

(b) Results from the multiple linear regression where Y = EC CC. 

Figure 4.2: Multiple linear regression results for class-level metrics (IC CC and 

EC CC). In both graphs the lighter bars represents the R 2 value for CBO, and the 

darker bars represents the R 2 value for CBO and I c combined.


(a) Results from the multiple linear regression where Y = IC CM 

(b) Results from the multiple linear regression where Y = EC CM 

Figure 4.3: Multiple linear regression results for method-level metrics (IC CM and 

EC CM). In both graphs the lighter bars represents the R 2 value for CBO, and 

the darker bars represents the R 2 value for CBO and I c combined.


methods, rather than those behaving as servers, offering their methods for use by 

others. The greater influence of I c in export coupling results from there being less 

of a drop in its influence between IC CC and EC CC, suggesting that instruction 

coverage, as a predictor of coupling, is not as sensitive to the direction of that 

coupling. 

Distinct Methods: IC CM and EC CM 

The results for the IC CM and EC CM, illustrated by Figures 4.3(a) and 4.3(b), 

present a similar picture. Both of these run-time metrics are scaled by the number 

of methods involved in the coupling relationship. Given that CBO is defined on a 

class level, it does surprisingly well in influencing the IC CM metric. Instruction 

coverage is also defined at a class level, but nonetheless accounts for roughly an 

extra 20% of the variance for five programs, and roughly an extra 10% for five other 

programs. The drop between import and export coupling is accentuated here, but 

while Figure 4.3(b) shows CBO proving a bad predictor for EC CM, instruction 

coverage dramatically improves this for over half the programs studied. 

Overall, these results show that coverage has a significant impact on the correlation 

between static CBO and the four run-time coupling metrics defined for distinct 

classes and distinct methods. 

Run-time Messages: IC CD and EC CD 

The run-time metrics IC CD and EC CD did not exhibit a significant relationship 

for any of the programs under consideration and thus are not depicted graphically 

here. As these metrics are defined in terms of a count of the number of distinct times 

a method was executed, this result was not surprising. It is reasonable to postulate 

that such metrics might be more influenced by the “hotness” of a particular method, 

and the distribution of execution focus through the program, rather than instruction 

coverage data. This was the result we expected for the measures based on the number 

of dynamic method calls.



From our experimental data, using principal component analysis, we showed that 

run-time coupling metrics captured different properties than static CBO and therefore 

are not simply surrogate measures for CBO. This indicated that useful information 

beyond that which is provided by CBO may be obtained through the use of 

these run-time measures. 

Second, we found that the coverage of test cases used to evaluate a program had 

a significant impact on the correlation between CBO and run-time coupling metrics 

and thus should be a measured, recorded factor in any comparison made. We found 

that instruction coverage and CBO were a better predictor of the run-time metics 

based on distinct class (IC CC, EC CC) and distinct method counts (IC CM, 

EC CM) than CBO alone. The results in Appendix A.2 show the results from the 

Fishers F test which illustrate that all results were statistically significant at the 5% 

level of significance..

Chapter 5 

Case Study 2: The Impact of 

Run-time Cohesion on Object 

Behaviour 

In this study we present an investigation into the run-time behaviour of objects 

in Java programs and whether cohesion metrics are a good predictor of object behaviour. 

Based on the definition of static CBO it would be expected that objects 

derived from the same class would exhibit similar coupling behaviour, that is, that 

they would be coupled to the same classes and make the same accesses. It is unknown 

whether static CBO provides a true measure of coupling between objects, or 

whether it is restricted to being a measure of the level of coupling between classes. 

To this end, a measure, the Number of Object-Class Clusters (N OC ), is proposed 

in an attempt to analyse run-time object behaviour. This measure is derived from 

a statistical analysis of run-time object-level coupling metrics. Cluster analysis is 

used to group objects together based on the similarity of the accesses they make to 

other classes. Therefore one would expect objects from the same class to occupy the 

same cluster. If more than one cluster is found for a class then it is reasonable to 

postulate that the class has objects that are behaving differently at run-time from 

the point of view of coupling. A selection of programs are anaylsed to determine if 

this is the case. 

The second part of this study involves determining the predictive ability of cohe- 

69


sion metrics (both static and run-time) to forecast object behaviour, in other words, 

how well they indicate the N OC for a class. First, the differences in the underlying 

dimensions of cohesion captured by the static versus the run-time measures 

are assessed using principal component analysis. Subsequently, multiple regression 

analysis is used to study the predictive ability of cohesion metrics to extrapolate 

N OC for a class. We also wish to determine if a run-time definition of cohesion is a 

better predictor of N OC than the static S LCOM version alone. 


The GQM/MEDEA framework was used to set up the experiments for this study. 

Experiment 1: 

Goal: To determine if objects from the same class behave differently at runtime 

from the point of view of coupling. 

Perspective: We investigate the behaviour of objects at run-time with respect 

to coupling using a number of metrics which measure the level of coupling at different 

layers of granularity. We use a number of statistical techniques capable of 

separating objects from a class into groups based on their similarity. 

Environment: Since we are studying object behaviour, a set of Java programs 

which create a large number of objects at run-time are used. These are supplemented 

with a number of real-world programs to ensure the results are scalable to 

genuine programs. 

Hypothesis: 

H 0 : Objects from a class behave similarly at run-time from the point of view of 

coupling 

H 1 : Objects from a class behave differently at run-time from the point of view 

of coupling


Experiment 2: 

Goal: To determine if a run-time definition for cohesion gives any additional 

information about class behaviour over and above the standard static definition. 

Perspective: Within a highly cohesive class the components of the class are 

functionally related, with a class that exhibits low cohesion, they are not. Intuitively, 

one would expect the more cohesive the class the lower the N OC for a class. 

We use a number of statistical techniques, including PCA and regression analysis to 

determine if there is a significant correlation between static and run-time cohesion 

and N OC . We also wish to determine if run-time cohesion is a better predictor of 

N OC than the static version alone. 

Environment: Since we are studying object behaviour a set of Java programs 

which create a large number of objects at run-time are used. These are supplemented 

with a number of real-world programs to ensure the results are scalable to 

genuine programs. 

Hypothesis: 

H 0 

: Run-time cohesion metrics do not provide additional information about 

class behaviour over and above that provided by static S LCOM 

H 1 : Run-time cohesion metrics provide additional information about class behaviour 

over and above that provided by static S LCOM 


For this study it was necessary to calculate: 

• the run-time object-level coupling metric: IC OC 

• the Number of Object-Class Clusters: N OC 

• the static S LCOM


GreyNode QuadT reeNode W hiteNode 

BlackNode 1 0 2 0 




Table 5.1: Matrix of unique accesses per object, for objects 

BlackNode 1 , . . . , BlackNode 4 to classes GreyNode, QuadTreeNode and WhiteNode 

• the run-time cohesion metrics: R LCOM , RW LCOM 

IC OC was calculated using the object-level run-time metric analysis tool Ob- 

Met which is described in Chapter 3.2.2. In order to test the first hypothesis the 

coefficient of variation, C V , was calculated for the IC OC results to determine how 

the IC OC values varied across the objects of a class. If the C V for all classes under 

consideration is zero then this would lead us to accept the null hypothesis, H 0 , as all 

objects of this classes would be accessing the same variables. However, if there was 

variation in the IC OC values, C V > 0, this would lead us to reject H 0 and accept 

H 1 , as the objects would be behaving differently at run-time from the point of view 

of coupling. 

To determine the N OC for a class, one class is fixed and the distribution of 

unique accesses per object is determined. A matrix of such values for each class 

in the program under consideration is constructed. Table 5.1 gives an example of 

such a matrix, where we record the run-time coupling values for individual objects 

of class BlackNode, BlackNode 1 , . . . , BlackNode 4 , against the classes GreyNode, 

QuadTreeNode and WhiteNode. This data is statistically analysed using cluster 

analysis to evaluate the behaviour of the objects. This technique groups objects 

together based on their similarity. The number of clusters are determined and this 

becomes the N OC for that class. In order to accept H 0 we would expect objects from 

the same class to group together and occupy the same cluster, therefore expecting 

values of N OC to be 1. The formation of a number of different clusters, where N OC 

> 1, would lead us to reject H 0 and accept H 1 .



BH 

Em3d 

Health 

MST 





IC OC 1.83 2.74 IC OC 1 0.5 IC OC 2.5 1.84 IC OC 2 1.54 

N OC 2 0.52 N OC 6 − 

N OC 2.5 1.29 N OC 2.5 2.12 

S LCOM 0.317 0.30 S LCOM 0.317 0.223 S LCOM 0.318 0.223 S LCOM 0.163 0.283 

R LCOM 0.144 0.287 R LCOM 0.190 0.381 R LCOM 0.171 0.189 R LCOM 0.111 0.172 

RW LCOM 0.248 0.226 RW LCOM 0.472 0.572 RW LCOM 0.335 0.356 RW LCOM 0.252 0.154 

Perimeter 

Power 

Voronoi 


IC OC 2.25 2.6 

N OC 2.5 1.73 

S LCOM 0.136 0.275 

R LCOM 0.104 0.285 

RW LCOM 0.132 0.254 


IC OC 1.66 1.88 

N OC 2 1.73 

S LCOM 0.151 0.199 

R LCOM 0.083 0.204 

RW LCOM 0.155 0.134 


IC OC 2 2.12 

N OC 4.5 0.71 

S LCOM 0.373 0.238 

R LCOM 0.265 0.363 

RW LCOM 0.448 0.438 

Real-World Programs 

Velocity 


IC OC 6.14 7.21 

N OC 5.1 2.45 

S LCOM 0.314 0.385 

R LCOM 0.154 0.254 

RW LCOM 0.398 0.454 



IC OC 7.45 8.21 

N OC 6.7 3.45 

S LCOM 0.251 0.305 

R LCOM 0.198 0.241 

RW LCOM 0.354 0.484 

Ant 


IC OC 8.11 8.65 

N OC 7.2 2.56 

S LCOM 0.333 0.31 

R LCOM 0.247 0.208 

RW LCOM 0.387 0.355 

Table 5.2: Descriptive statistic results for all programs 

The static metrics data collection tool StatMet, described in Section 3.2.3, was 

used to calculate S LCOM . The ClMet tool, described in Chapter 3.2.1, was used to 

calculate the run-time cohesion metrics. 

The analysis was conducted on the programs from the JOlden benchmark suite 

as well as the real-world programs Velocity, Xalan and Ant. Three of the classes 

BiSort, TSP and TreeAdd contain too few class to perform PCA and regression 

analysis, therefore are excluded from further analysis. The SPECjvm98 benchmark 

programs that were used in the previous study were excluded from this analysis as 

they did not exhibit significant volumes of object creation. 

5.3 Results 

Table 5.2 summarises the descriptive statistic results for each program. The measures 

all exhibited large variances which makes them suitable candidates for further 

analysis.


5.3.1 Experiment 1: To determine if objects from the same 

class behave differently at run-time from the point of 

view of coupling 

IC OC Results 

The IC OC metric is used to investigate whether objects of the same class type are 

coupled to the same classes at run-time. The first thing to look at is the C V results 

for the IC OC metric, as depicted by Figure 5.1. If all objects from the same class 

are behaving in a similar fashion we would expect them to make accesses to the 

same classes at run-time. Consequently, there should be little or no variability in 

the IC OC values for objects from the same class, for example, two classes from 

BH had C V of 0. However, for the classes from the set of programs studied, the 

C V varied from 0% to 54.2%. In the cases where the C V > 0, we have classes with 

objects that are coupled to different classes at run-time. A class might create one 

group of objects that access one set of classes and another that access a different set. 

So we have a number of objects from the same class that are behaving differently at 

run-time at the class-class level. Due to these results, at the class-class level, we can 

reject H 0 and accept H 1 . One cannot observe such behaviour simply by calculating 

the static CBO value for that class. 

N OC Results 

Figure 5.2 illustrates the N OC values for the programs under consideration. The N OC 

values range from one to seven and the bars represent the number of classes from 

each program that exhibit that value. Since cluster analysis groups objects together 

based on the similarity of the accesses they make to other classes one would expect 

objects from the same class to occupy the same cluster (N OC = 1). This was the 

case for a large proportion of the classes under consideration, for example 50% of the 

classes from the program BH from the JOlden suite exhibited an N OC of 1. Similar 

results were obtained with the real-world programs with N OC = 1 for 51% of classes 

from Velocity, 49% from Xalan and 48% from Ant. However, there were instances 

where more than one cluster was found for a class, for example 50% of the classes


IC_OC C V Results 

50% - 60% 

40% - 50% 

Cv 

30% - 40% 

20% - 30% 

10% - 20% 

Ant 


Velocity 

Voronoi 

Power 

Perimeter 

MST 

Health 

Em3d 

BH 

0% -10% 

0% 

0 10 20 30 40 50 60 

No. Classes 

Figure 5.1: C V of IC OC for classes from the programs studied. The bars represent 

the number of classes in each program that have C V in the corresponding range.


Figure 5.2: N OC results of cluster analysis. The bars represent the number of classes 

in each program that have the corresponding N OC value. 

from Perimeter from the JOlden suite had N OC = 4. When more than one cluster 

is found we have the situtation where a single class is creating groups of objects 

that are exhibiting different behaviours at run-time. This leads us to reject H 0 and 

accept H 1 to state that objects from a class can behave differently at run-time from 

the point of view of coupling. 

Looking at Figures 5.1 and 5.2 there seems to be a relationship between the C V 

and the number of clusters with both graphs being markedly similar. In many cases 

a high C V leads to >1 clusters. Intuitively this would make sense as it is easy to 

see how variation in the number of classes used by an object would lead to variation 

in the variables they use and consequently leading to a number of groups of objects 

behaving differently. 

From these findings, it is suggested that the static CBO metric would be better 

defined as coupling between classes as it does not necessarily give a true measure of 

run-time coupling between objects.


5.3.2 Experiment 2: The influence of cohesion on the N OC 

The following statistical analysis is applied to determine first, if run-time cohesion 

metrics are redundant with respect to S LCOM and second, if cohesion metrics are 

good predictors of N OC . 

Principal Component Analysis 

Initially, we investigate the relationship between static and run-time cohesion metrics. 

We use PCA to determine if the static and run-time cohesion metrics are 

likely to be measuring the same class property, in other words it is used to examine 

whether the run-time cohesion metrics are not simply surrogate measures for static 

S LCOM . 

Appendix B.1 shows the results of the principal component analysis when all 

of the cohesion metrics are taken into consideration. Using the Kaiser criterion to 

select the number of factors to retain it is found that the metrics mostly capture 

two orthogonal dimensions in the sample space formed by all measures. In other 

words, cohesion is divided along two dimensions for each of the programs analysed. 

Analysing the definitions of the measures that exhibit high loadings in PC1 and 

PC2 yields the following interpretation of the cohesion dimensions: 

• P C1 = {S LCOM }, the static cohesion metric. 

• P C2 = {R LCOM , RW LCOM }, the run-time cohesion metrics. 

Figure 5.3 summarises these results graphically. The PCA findings from this 

study indicate that no significant information about the cohesiveness of a class 

can be gained by evaluating the RW LCOM instead of the simpler R LCOM , as both 

metrics belonged to the same principal component. This means not enough variance 

is captured by the RW LCOM that is not accounted for by R LCOM . 

However, the PCA results indicate that R LCOM is not redundant with respect to 

S LCOM and that it captures additional information about cohesion. The values show 

that R LCOM is not simply an alternative static measure. Clearly, the simple static 

calculation of S LCOM masks a considerable amount of detail available at run-time.


PCA Test Results for all Programs for Metrics in PC1 

1 

0.9 

0.8 

0.7 

0.6 

PCA Value 

0.5 

0.4 

RLCOM 

RWLCOM 

0.3 

0.2 

0.1 

0 

BH Em3d Health MST Perimeter Power Voronoi Velocity Xalan Ant 

Program 

PCA Test Results for all Programs for Metrics in PC2 

1 

0.9 

0.8 

0.7 

PCA Value 

0.6 

0.5 

0.4 

SLCOM 

0.3 

0.2 

0.1 

0 

BH Em3d Health MST Perimeter Power Voronoi Velocity Xalan Ant 

Program 

Figure 5.3: PCA Test Results for all programs for metrics in PC1 and PC2. In both 

graphs the bars represents the PCA value obtained for the corresponding metric. 

PC1 contains R LCOM and RW LCOM . PC2 contains S LCOM .


Multiple Regression Analysis 

Next we wish to discover if cohesion metrics are good predictors of object behaviour, 

that is, can they be used to deduct the N OC for a class. Multiple regression analysis 

is used for this purpose. In this case the dependent variable is the N OC , while the 

independent variables are the static S LCOM and the run-time R LCOM and RW LCOM 

cohesion metrics. Appendix B.2 gives the results from this analysis. 

First, the results show that there is a positive correlation between the N OC 

(dependent variable) and the static and run-time cohesion measures (independent 

variables), as all R values were positive. This means that as the value for S LCOM , 

R LCOM and RW LCOM increases/decreases so will the observed value for N OC . Intuitively 

this would make sense, as one would expect the more cohesive the class, that 

is the lower the LCOM value is, the more the class it is geared toward performing a 

single function. Therefore one would expect the number of clusters to be low also. 

Figure 5.4 summarises the results of the regression analysis for each of the programs 

analysed. The lighter bars represent the influence of S LCOM , while the darker 

bars depict the influence of both S LCOM and R LCOM . The difference between the two 

indicates the additional amount of variation that can be allocated to the run-time 

cohesion metric. 

It is apparent from this graph that the R LCOM is a significant factor influencing 

N OC , for example for the three real-world programs R LCOM accounts for approximately 

an additional 30% variation, while five of the benchmarks exhibit a similar 

result. For eight out of the ten programs studied R LCOM was a better predictor of 

N OC than S LCOM . 

Overall, these results show that cohesion metrics are a good predictor of N OC , 

with run-time cohesion being the superior metric. This leads us to reject our null 

hypothesis and state that run-time cohesion metrics provide additional information 

about class behaviour over and above that provided by static S LCOM . 

Only one program exhibited a significant result when using the RW LCOM measure, 

therefore the results have not been summarised graphically. This could be due 

to the fact that the metric is defined on a call-weighted basis, which may skew the 

results.


Figure 5.4: Results from multiple linear regression where Y=N OC . The lighter bars 

represent the R 2 for S LCOM , and the darker bars represent the R 2 value for S LCOM 

and R LCOM combined.



From this case study, we found that run-time object-level coupling metrics could be 

used to investigate object behaviour. Using the IC OC run-time coupling measure 

we discovered that objects from the same class exhibited different behaviours at 

run-time from the point of view of coupling. Object behaviour was identified by 

defining a new metric N OC which groups objects together based on their run-time 

coupling properties. 

We defined a number of metrics for evaluating run-time cohesion. First, we 

proved that these measures were not redundant with respect to the static LCOM 

measure and that they captured additional dimensions of cohesion. Next, we investigated 

the impact of run-time cohesion metrics on object behaviour using regression 

analysis and proved that these run-time cohesion metrics were good predictors of 

object behaviour, as identified by the N OC measure. Appendix B.2 gives the results 

from this analysis and shows the Fishers F test results which state that all results 

were statistically significant the 5% level of significance..

Chapter 6 

Case Study 3: A Study of 

Run-time Coupling Metrics and 

Fault Detection 

Fault-proneness detection is an interesting concept in many areas of software engineering 

research. Quality and maintenance effort control depend on the understanding 

of this concept. In previous years, a large volume of work has been performed 

in order to define suitable metrics and models for fault detection [6, 13, 19, 41]. 

Code coverage has been proposed to be an estimator for fault-proneness, but it 

remains a controversial topic which lacks support from empirical data [22]. In this 

case study we investigate whether instruction coverage is a significant predictor of 

fault-proneness, an important software quality indicator. This is done by taking 

a set of real-world programs, namely Velocity, Xalan and Ant, and introducing 

faults into them using the mutation system µJava. Two kinds of mutations are 

introduced separately into the programs, traditional and class-type mutations. We 

then determine the percentage mutants killed (M K ) by the set of test cases provided 

with the programs. Equation 6.1 gives the formula for M K . Regression analysis is 

applied to determine if instruction coverage is a good predictor of fault-proneness, 

which is defined as the M K for the class for each type of mutation. From previous 

work we expect instruction coverage to be a good predictor of non object-oriented 

or traditional-type mutants [69]. 

82


M K = 

Number of mutants killed 

Total number of mutants created ∗ 100 

1 

(6.1) 

Next, we empirically validate a set of six run-time object-oriented metrics in 

terms of their usefulness in predicting fault-proneness. We use regression analysis 

again to investigate the ability of these run-time measures in predicting M K for both 

types of mutations. From these two experiments we wish to discover if the run-time 

measures for coupling are better predictors of fault-proneness than the traditional 

coverage measure. 


The GQM/MEDEA framework was used to set up the experiments for this study. 

Experiment 1: 

Goal: To examine the relationship between coverage and fault detection, in the 

context of instruction coverage. 

Perspective: Code coverage has been proposed to be an estimator for testing 

effectiveness. Regression analysis is used to assess if coverage is a better indicator 

of fault-proneness in comparison to the run-time coupling metrics. In particular we 

investigate whether it is a better detector of traditional or class-type mutations in 

programs. 

Environment: We chose to evaluate a selection of open source real-world programs. 

Each program comes with its own set of JUnit test cases, thus defining both 

the static and dynamic context of our work. 

Hypothesis: 

H 0 : Coverage measures are poor detectors of faults in a program. 

H 1 : Coverage measures are good detectors of faults in a program.


Experiment 2: 

Goal: To examine the relationship between run-time coupling metrics and fault 

detection. 

Perspective: Previous work has shown the static coupling measure CBO is a 

good detector of faults in programs [13]. Intuitively, one would expect run-time coupling 

measures to give a better indication as they are based on an actual execution 

of the program. Regression analysis is used to determine if there is a significant 

correlation. 

Environment: We chose to evaluate a selection of open source real-world programs. 

Each program comes with its own set of JUnit test cases, thus defining both 

the static and dynamic context of our work. 

Hypothesis: 

H 0 : Run-time coupling metrics are poor detectors of faults in a program. 

H 1 : Run-time coupling metrics are good detectors of faults in a program. 


In order to conduct the practical experiments underlying this study, it was necessary 

to select a suite of Java programs and measure: 

• the instruction coverage percentages: I C 

• the mutation coverage of test cases (mutants killed (M K )) 

• the run-time coupling metrics: IC CC, EC CC, IC CM, EC EM, IC CD, EC CD 

The InCov tool, described in Chapter 3.2.4, was used to determine I C . The 

run-time measures were evaluated using the ClMet tool. The mutation system 

µJava, described in Chapter 3.2.5, was used to insert both traditional and class-


level mutants into the test case programs and to determine the M K rates of the test 

cases supplied with the programs. 

Three open source real-world programs Velocity, Xalan and Ant were evaluated in 

this study. The SPECjvm98 and JOlden benchmark programs used in the previous 

studies exhibited very poor mutant kill percentages when analysed (most classes 

exhibited 0% mutant kill rate) and therefore were excluded from futher analysis. 

6.3 Results 

Percentage Mutant Kill Rate Results 

Figure 6.1 gives the percentages of mutants killed upon the execution of the JUnit 

test cases supplied with the programs analysed. Looking at Figure 6.1(a) for the 

Velocity program, twenty-three classes exhibit a percentage kill rate of zero for the 

class-level mutants, while thirteen classes exhibit the same rate for the traditional 

mutants. At the other end of the spectrum for the class-level mutants, six classes 

exhibited a percentage kill rate of between 90% and 100%, while seven classes exhibited 

the same kill rate for the traditional mutants. 

In their paper [66] Offutt et al. created test cases for the set of programs they 

studied by hand so that 100% M K was achieved. To date no one has endeavoured 

to apply this mutation system to a set of real programs so there is no consensus on 

what a desirable M K rate would be. 

6.3.1 Experiment 1: To examine the relationship between 

instruction coverage and fault detection. 

Regression Analysis 

We investigate the statistical relationship between instruction code coverage and 

fault-proneness using regression analysis. The dependent variable is the percentage 

mutant kill rate M K , while the independent variable is the instruction coverage measure 

I c for each class. Both class and traditional mutants are evaluated separately. 

Appendix C.2 gives the results from this analysis.


(a) Results from Mutation Testing for Velocity. 

(b) Results from Mutation Testing for Xalan. 

(c) Results from Mutation Testing for Ant. 

Figure 6.1: Mutation test results for real-world programs Velocity, Xalan and Ant. 

In all graphs the bars represents the number of classes that exhibit a percentage 

mutant kill rate in the corresponding range.


Figure 6.2: Regression analysis results for the effectiveness of I c in predicting class 

and traditional-level mutations in real-world programs Velocity, Xalan and Ant. The 

bars represents the R 2 value for the run-time metric under consideration. 

Figure 6.2 depicts the results on the effects of instruction coverage on faultproneness 

for both types of mutations tested. For all of the programs tested I c 

proved to be a poor predictor of class-type mutants with the highest value being 

16.7% for Xalan. In contrast I c showed to be an effective indicator of traditional 

mutations with the values ranging from 64.5% to 78.9%. This was as we expected as 

coverage is not really effective in evaluating object-oriented type programs therefore 

we would not expect it to be good predictors of object-oriented type faults. 

6.3.2 Experiment 2: To examine the relationship between 

run-time coupling metrics and fault detection. 

Regression Analysis 

Regression analysis is used to determine the effectiveness of run-time coupling metrics 

in detecting faults in programs. The dependent variable is the percentage mu-


tant kill rate of the test cases used to execute the programs, while the independent 

variables are the six run-time coupling metrics. Both class and traditional mutants 

are evaluated separately. Appendix C.1 gives the results from this analysis. 

The traditional mutants did not show any relation with the run-time coupling 

measures, with only the IC CC metric for the Velocity program exhibiting a significant 

correlation. This is in contrast to the results from the previous experiment 

where I c proved to be a poor predictor of class-type mutants but a good predictor 

of traditional-type mutants. 

Figure 6.3 illustrates the results for the effectiveness of the run-time coupling 

metrics IC CC, IC CM, EC CC and EC CM in predicting the M K for class-level 

mutations for each of the programs analysed. For two of the programs Velocity and 

Xalan the IC CC measure provided the greatest predictor of M K at 69% and 59% 

respectively. For the Ant program, the EC CC metric had the highest value at 69%, 

however the IC CC value for this was also high at 60%. For all of the programs the 

EC CM measure was the poorest predictor. There were five categories of mutations 

introduced into the programs by µJava, as illustrated by Table D.2. We would 

expect the coupling measures to be a good predictor of those mutations based on 

inheritance, polymorphism and overloading. However, we would not expect such 

a relationship for those based on Java-specific features and common programming 

mistakes. The inclusion of these types of mutations may have negitavely skewed the 

results. 

None of the run-time metrics based on distinct method counts IC CD and EC CD 

exhibited a significant result and therefore have not been summarised graphically. 

As with the case in Section 4.3, this was expected and emphasises the significance 

of the predictive capabilities of the other metrics. 

Overall, one would expect this kind of result as the class-type mutants are objectoriented, 

while the traditional mutations are based on factors like operator replacement 

and therefore would not be expected to correlate strongly with coupling. This 

leads us to reject our null hypothesis for both experiments and state that run-time 

coupling metrics are good detectors of class-level faults, while coverage measures 

are good detectors of traditional-type faults in a program. We therefore postulate


Figure 6.3: Regression analysis results for the effectiveness of run-time coupling metrics 

in predicting class-level mutations in real-world programs Velocity, Xalan and 

Ant. The bars represents the R 2 value for the run-time metric under consideration. 

a possible utility for run-time coupling metrics for use in fault-proneness detection 

with regard to identifying faults in object-oriented programs. 


The results from this case study used regression analysis to show that run-time coupling 

metrics were good detectors of class-type faults in programs, while instruction 

coverage was a good detector of traditional-type mutants. Appendix C.1 illustrates 

these results and shows that all results were deemed to be statistically significant at 

the 5% level of significance. We therefore proposed the run-time coupling metrics as 

alternative measures for fault-detection useful for identifying object-oriented type 

faults in programs.

Chapter 7 

Conclusions 

In this thesis we presented an empirical investigation into run-time coupling and 

cohesion metrics. 

The first case study investigated the influence of instruction coverage on the relationship 

between static and run-time coupling metrics. An empirical investigation 

was conducted using the set of run-time metrics proposed by Arisholm et al. on 

a large set of Java programs. This set contained programs from the SPECjvm98 

and JOlden benchmark suites and also included three real-world programs Velocity, 

Xalan and Ant. 

The differences in the underlying dimensions of coupling captured by the static 

versus the run-time metrics were assessed using principal component analysis. Three 

components were identified which contained static CBO, the import-based run-time 

metrics, and the export-based run-time metrics. This established that the runtime 

metrics were not simply surrogate static measures, which made them suitable 

candidates for further analysis. 

A study into the predictive ability of the static CBO and instruction coverage 

data was then conducted using multiple regression analysis. The purpose of this was 

to show how well the static CBO metric and instruction coverage measure I c could 

predict the six run-time metrics under consideration. The PCA analysis placed 

import and export based coupling in different components, and this difference was 

also seen in the regression analysis. Both CBO and instruction coverage had less 

influence overall on the export-based metrics, EC CC and EC CM than on the 

90

Chapter 7. Conclusions 91 

Static 

Coupling 

Not surrogate 

Run-time 

Coupling 

Static 

Coupling 

+ 

Instruction 

Coverage 

Good predictor 


Coupling 

Figure 7.1: Findings from case study one that show our run-time coupling metrics 

are not simply surrogate measures for static CBO and coverage plus static metrics 

are better predictors of run-time measures than static measure alone. 

import-based run-time metrics, IC CC and IC CM. 

It was shown from the regression analysis that the combination of the static 

measure with instruction coverage gave a significantly better prediction of the runtime 

behavior of programs than the use of static metrics alone, for the class-based 

and method-based metrics. This suggested that the correlation between static and 

run-time was as much a factor of coverage as an intrinsic property of the metrics 

themselves. 

The results for the two run-time metrics based on distinct message counts, 

EC CD and EC CD were not within the chosen significance level, and thus no determination 

was made on the relationship for these metrics. Figure 7.1 summarises 

the finding from this study.


The second case study looked at run-time object behaviour and whether run-time 

cohesion metrics could be used to identify such behaviour. 

First, we looked at object behaviour in the context of coupling. We used the 

IC OC object-level metric, as defined by Arisholm et al. and defined a new measure 

N OC in an attempt to identify objects that were behaving differently at run-time 

from the point of view of coupling. We concluded that objects from the same class 

could behave differently at run-time from the point of view of coupling due to the 

fact that there were classes that exhibited variable C V values for IC OC and N OC 

values greater than one 

Subsequently, we looked at whether run-time cohesion metrics could be used to 

predict object behaviour, as defined by the N OC measure. First, we had to prove 

that the run-time cohesion metrics were not redundant with respect to static S LCOM . 

The relationship between static and run-time cohesion metrics was investigated using 

PCA. Two components were identified containing the static S LCOM and the run-time 

cohesion measures R LCOM , RW LCOM . This established that the run-time metrics 

were not simply surrogate static measures, making them suitable candidates for 

further analysis. 

Multiple regression analysis was used to discover if the cohesion metrics were 

good predictors of object behaviour. The purpose of this was to show how well the 

S LCOM metric and the run-time cohesion measures R LCOM , RW LCOM could predict 

N OC . Overall, the results showed that cohesion metrics were a good predictor of 

N OC , with run-time cohesion being the superior metric. This led us to conclude that 

run-time cohesion metrics provided additional information about class behaviour 

over and above that provided by S LCOM . Figure 7.2 depicts the results of this 

study. 

The third case study investigated whether instruction coverage was a good predictor 

of faults in a program. We used regression analysis to determine if this 

measure was related to M K , the mutation kill rate of the test cases used. It was 

found that I C was a good predictor of traditional-type faults but a poor predictor of 

class-type faults which verifies results from previous studies on coverage measures. 

Next, we analysed the extent to which the run-time coupling metrics were good


Class A 

IC_OC used to determine N OC 

N OC =3 

a 9 

Run-time Cohesion 

good predictor of N OC 

a 10 

a 11 

a 12 

a 5 

Task 3 

a 6 

a 1 a 7 

a 2 

a 3 

a 8 

Task 2 

Objects 

a 4 

Task 1 

Figure 7.2: Findings from case study two that show run-time object-level coupling 

measures can be used to identify objects that are exhibiting different behaviours 

at run-time and run-time cohesion measures are good predictors of this type of 

behaviour.

7.1. Contributions 94 

Fault Proneness 


Coupling 

Good predictor 

Instruction 

Coverage 

Class-type Mutations 

Traditional Mutations 

Figure 7.3: Findings from case study three that show run-time coupling metrics are 

good predictors of class-type faults and instruction coverage is a good predictor of 

traditional faults in programs. 

detectors of traditional and class-type faults in a program. Our results showed that 

the measures IC CC, IC CM, EC CC and EC CM were significantly related to M K 

when considering class-type mutations. The results for EC CD and EC CD, the 

two run-time metrics based on distinct message counts, were not within the chosen 

significance level, and thus no determination was made on the relationship for these 

metrics. 

The purpose of this study was to determine whether instruction coverage is a better 

predictor of fault-proneness than the run-time coupling measures. As we found 

these measures were superior in detecting object-oriented type faults in programs 

than simple measures of coverage, we proposed the run-time coupling metrics as an 

alternative measure for fault-proneness, useful for detecting faults in object-oriented 

software. Figure 7.3 illustrates the finding from this study. 

7.1 Contributions 

We have implemented the tools ClMet and ObMet that can be used to perform a 

class and object-level analysis of Java programs.

7.1. Contributions 95 

We use the definitions of Arishlom et al. for the set of run-time coupling metrics 

in this analysis. We had however defined our own set of run-time coupling metrics 

previous to the publication of this paper [75, 76]. However, due to the similarity in 

nature of the metrics we then switched to using their definitions for the sake of ease 

of comparison. 

We define a number of object-oriented run-time metrics for cohesion and we 

investigate their possible utility. To date no one has attempted to do this. 

We define a new measure N OC that can be used to study run-time objectbehaviour. 

To the best of our knowledge this is the largest empirical study that has been 

performed to date on the run-time analysis of Java programs. Previously, a study 

was carried out by Arisholm et al. on “Dynamic Coupling Measurement for Object- 

Oriented Software”, however this included a single program, Velocity, in the analysis. 

Our study looks at not only Velocity but also the real-world programs Xalan and 

Ant as well as seven benchmark programs from the SPECjvm98 suite and seven 

programs from the JOlden suite thus making it much wider in scope. 

The main findings from our study are as follows: 

• We showed run-time coupling metrics capture additional dimensions of coupling 

and are not simply surrogate measures for static CBO. Therefore, useful 

information above that provided by a simple static analysis may be acquired 

through the use of run-time metrics. 

• Coverage has a significant impact on the correlation between static CBO and 

run-time coupling metrics and should be a measured, recorded factor in any 

comparison. 

• Run-time object-level coupling metrics can be used to investigate object behaviour. 

Using such a measure we discovered that objects from the same class 

can behave differently at run-time from the point of view of coupling. 

• Run-time cohesion metrics are not redundant with respect to the static S LCOM 

measure and capture additional dimensions of cohesion.

7.2. Applications of this Work 96 

• Run-time cohesion metrics are good predictors of run-time object behaviour. 

• Run-time coupling metrics based on distinct class and distinct method counts 

are good predictors of class-type or object-oriented faults in programs but poor 

predictors of traditional-type mutations. 

• Coverage is a good predictor of traditional-type faults but a poor predictor of 

class-type faults in programs. 

7.2 Applications of this Work 

Much of the work on the dynamic analysis of Java programs has come from the 

language design and compiler community. The work in this thesis forms part of an 

increasing link between this community and the software engineering community, 

with an emphasis on collecting, analysing and comparing quantitative static and 

dynamic data. Other possible examples of this synthesis include relating studies 

of polymorphicity with testing inheritance relationships, or relating measures of 

program “hot-spots” with metrics based on distinct messages such as IC CD and 

EC CD. Run-time metrics may also have a role to play in areas of research such 

as reverse engineering and program comprehension, as they contribute to a better 

understanding of the behavior of code in its operational environment. 

7.3 Threats to Validity 

7.3.1 Internal Threats 

There are a number of factors which may potentially affect the validity of these runtime 

metrics. In this thesis we have chosen only to look at run-time definitions for 

coupling and cohesion which are based on the standard static definitions proposed by 

Chidamber and Kemerer. Their metric suite for analysing object-oriented software 

consists of three additional measurements for evaluating the depth of inheritance tree 

(DIT), the number of children (NOC) and the weighted methods per class (WMC).

7.3. Threats to Validity 97 

Our set of run-time measures should be expanded to include run-time definitions for 

these also to ensure that the set is fully comprehensive. 

The run-time metrics used in this study are rated based on how they perform 

in relation to static measurements for coupling and cohesion. However, no study 

has definitively shown that any measurement for coupling or cohesion provides any 

extra information on design quality over and above that which can be gained simply 

by evaluating the much simpler lines of code measure. 

7.3.2 External Threats 

A general problem with any type of run-time analysis is that the results are based on 

dynamic measurement and are thus tied to the inputs or test cases used. Therefore 

the use of different test cases may produce different results. Static measurements 

however will remain the same regardless of the set of test cases used to execute the 

program. 

The set of programs used in this study may not be representative of all classes 

of Java programs, as for example, no GUI based programs were included in this 

analysis. 

While the run-time analysis tools ClMet and ObMet made it easy to collect a 

wide variety of run-time information from a program and were easy to use, it was 

still quite time comsuming to perform a full analysis of a program. Allthough it was 

stated that performance was not really an issue in the design of these tools. if such 

a method of evaluating a program was to be marketed to industry the performance 

of the tools would have to be given more serious consdieration. 

Only one external quality attribute, fault detection, was investigated in this 

thesis. Further research need to be conducted to see how measures for coupling and 

cohesion do in predicting other important external quality attributes of a design 

such as maintainability, reusability or comprehensibility. 

The relationship between internal and external quality attribute is quite intituitive, 

for example, more complex code will require greater effort to maintain. However 

the precise functional form of this relationship is less clear and is the subject 

of intense and practical research concern. Using theories of cognition and problem-

7.4. Future Work 98 

solving to help us understand the effects of complexity on software is the subject of 

much current research [31]. 

7.4 Future Work 

Future work may involve extending the existing set of coupling and cohesion metrics 

to develop a comprehensive set of run-time object-oriented metrics that can 

intuitively quantify such aspects of object-oriented applications such as inheritance, 

dynamic binding and polymorphism. 

Currently there exists no set of benchmarks specifically designed for evaluating 

properties of object-oriented programming such as coupling and cohesion, it would 

be useful to design such a set of benchmarks for use in similar empirical studies. 

Futher research could involve designing a run-time profiling tool written in C++ 

rather than Java. Such a tool could utilise the JVMDI component of the JPDA 

directly and therefore would be dynamically linked with thw JVM at run-time. 

This would probably result in less performance overhead which would result in a 

reduction in the time taken to preform such an analysis. 

Another important aspect would be to further investigate the correlation between 

run-time metrics and external quality aspects of a design, including investigating 

the possibility of using hybrid models that use a combination of static and runtime 

metrics to evaluate a design. 

It would be interesting to conduct an industrial case study using real commercial 

software and data to further verify the results in this thesis. 

Other applications of run-time metrics should be investigated, for example they 

could be useful in determining where refactoring have been or could be applied or 

they could be used to aid in program comprehension. 

This study has focused solely on the evaluation of Java software, it would be 

important to investigate if the run-time metrics gave similar results if they were 

used to evaluate other types of object-oriented software, for example C sharp. 

Though the approach and results are of significance to the field, they can also be 

used as stepping stones to open up new ways to consider a wider set of internal qual-

7.4. Future Work 99 

ity attributes and their interrelationships, and their independent and interdependent 

effect relationship on external quality aspect of a design.

Appendix A 

Case Study 1: To Investigate the 

Influence of Instruction Coverage 

on the Relationship Between 

Static and Run-time Coupling 

Metric 

Appendix A.1 contains the PCA test results for the SPECjvm98 and JOlden suites 

and for the real-world programs Velocity, Xalan and Ant. Values deemed to be 

significant at the level p ≤ 0.05 are highlighted. 

Appendix A.2 contains the results from the multiple linear regression used to test 

the hypothesis H 0 , that coverage has no effect on the relationship between static 

and run-time metrics for the programs from the SPECjvm98 and JOlden suites and 

for the real-world programs Velocity, Xalan and Ant. All significant results are 

highlighted. 

100

A.1. PCA Test Results for all programs. 101 

A.1 PCA Test Results for all programs. 

A.1.1 


201 compress 

P C1 P C2 P C3 

CBO 0.113 0.014 0.712 

IC CC 0.865 0.065 0.186 

IC CM 0.766 0.154 0.097 

IC CD 0.866 0.073 0.100 

EC CC 0.023 0.873 0.176 

EC CM 0.143 0.799 0.035 

EC CD 0.098 0.834 0.096 

209 db 


CBO 0.012 0.163 0.843 

IC CC 0.893 0.088 0.002 

IC CM 0.923 0.004 0.000 

IC CD 0.976 0.003 0.013 

EC CC 0.178 0.763 0.002 

EC CM 0.110 0.793 0.027 

EC CD 0.087 0.823 0.017 

202 jess 


CBO 0.198 0.187 0.672 

IC CC 0.963 0.007 0.005 

IC CM 0.912 0.003 0.016 

IC CD 0.874 0.032 0.004 

EC CC 0.154 0.812 0.002 

EC CM 0.298 0.734 0.054 

EC CD 0.098 0.923 0.002 

213 javac 


CBO 0.187 0.000 0.973 

IC CC 0.633 0.083 0.184 

IC CM 0.834 0.033 0.023 

IC CD 0.723 0.143 0.002 

EC CC 0.138 0.834 0.004 

EC CM 0.078 0.734 0.012 

EC CD 0.067 0.759 0.034 

228 jack 


CBO 0.004 0.243 0.634 

IC CC 0.605 0.234 0.154 

IC CM 0.723 0.194 0.076 

IC CD 0.604 0.195 0.098 

EC CC 0.194 0.749 0.098 

EC CM 0.103 0.694 0.049 

EC CD 0.094 0.749 0.104 

205 raytrace 


CBO 0.123 0.087 0.723 

IC CC 0.834 0.021 0.019 

IC CM 0.912 0.017 0.008 

IC CD 0.896 0.103 0.001 

EC CC 0.198 0.763 0.003 

EC CM 0.125 0.709 0.017 

EC CD 0.097 0.821 0.002 

222 mpegaudio 


CBO 0.244 0.137 0.583 

IC CC 0.943 0.004 0.087 

IC CM 0.898 0.034 0.041 

IC CD 0.943 0.023 0.001 

EC CC 0.034 0.943 0.043 

EC CM 0.134 0.754 0.085 

EC CD 0.098 0.845 0.005 

A.1.2 


BH 


CBO 0.403 0.002 0.520 

IC CC 0.728 0.224 0.012 

IC CM 0.536 0.391 0.001 

IC CD 0.555 0.376 0.000 

EC CC 0.358 0.522 0.109 

EC CM 0.203 0.763 0.025 

EC CD 0.203 0.763 0.025 

MST 


CBO 0.000 0.013 0.972 

IC CC 0.900 0.063 0.032 

IC CM 0.956 0.010 0.026 

IC CD 0.941 0.012 0.027 

EC CC 0.356 0.609 0.033 

EC CM 0.121 0.877 0.001 

EC CD 0.118 0.881 0.000 

Em3d 


CBO 0.134 0.034 0.712 

IC CC 0.933 0.013 0.016 

IC CM 0.772 0.168 0.039 

IC CD 0.772 0.168 0.039 

EC CC 0.139 0.702 0.082 

EC CM 0.223 0.716 0.039 

EC CD 0.223 0.716 0.039 

Perimeter 


CBO 0.231 0.123 0.612 

IC CC 0.541 0.169 0.281 

IC CM 0.876 0.080 0.002 

IC CD 0.905 0.056 0.038 

EC CC 0.236 0.752 0.000 

EC CM 0.147 0.830 0.023 

EC CD 0.142 0.828 0.026 

Health 


CBO 0.238 0.187 0.521 

IC CC 0.956 0.005 0.017 

IC CM 0.936 0.024 0.010 

IC CD 0.940 0.028 0.009 

EC CC 0.076 0.831 0.086 

EC CM 0.070 0.919 0.002 

EC CD 0.065 0.794 0.003 

Power 


CBO 0.329 0.014 0.626 

IC CC 0.617 0.073 0.161 

IC CM 0.624 0.338 0.036 

IC CD 0.712 0.228 0.041 

EC CC 0.022 0.915 0.015 

EC CM 0.007 0.880 0.112 

EC CD 0.008 0.824 0.164

A.1. PCA Test Results for all programs. 102 

Voronoi 


CBO 0.198 0.213 0.526 

IC CC 0.718 0.123 0.069 

IC CM 0.812 0.088 0.134 

IC CD 0.773 0.176 0.141 

EC CC 0.043 0.911 0.005 

EC CM 0.067 0.934 0.004 

EC CD 0.148 0.834 0.054 

A.1.3 

Real-World Programs, Velocity, Xalan and Ant 

Velocity 


CBO 0.384 0.184 0.734 

IC CC 0.623 0.034 0.174 

IC CM 0.725 0.087 0.231 

IC CD 0.684 0.196 0.192 

EC CC 0.284 0.684 0.097 

EC CM 0.023 0.793 0.005 

EC CD 0.174 0.590 0.015 



CBO 0.316 0.174 0.586 

IC CC 0.824 0.184 0.183 

IC CM 0.890 0.284 0.284 

IC CD 0.795 0.003 0.194 

EC CC 0.013 0.834 0.164 

EC CM 0.284 0.793 0.023 

EC CD 0.384 0.823 0.154 

Ant 


CBO 0.125 0.254 0.687 

IC CC 0.874 0.125 0.125 

IC CM 0.789 0.231 0.012 

IC CD 0.801 0.324 0.214 

EC CC 0.214 0.789 0.124 

EC CM 0.141 0.785 0.054 

EC CD 0.123 0.754 0.014

A.2. Multiple linear regression results for all programs 103 

A.2 Multiple linear regression results for all programs 

A.2.1 


201 compress 

209 db 

Hypothesis Y R R 2 P > F Hypothesis Y R R 2 P > F 

H CBO IC CC 0.775 0.593 0.003 H CBO IC CC 0.419 0.178 0.0001 

H CBO,Ic IC CC 0.798 0.602 0.0001 H CBO,Ic IC CC 0.868 0.754 0.001 

H CBO EC CC 0.634 0.402 0.01 H CBO EC CC 0.567 0.322 0.002 

H CBO,Ic EC CC 0.870 0.759 0.007 H CBO,Ic EC CC 0.881 0.777 0.001 

H CBO IC CD 0.512 0.262 0.421 H CBO IC CD 0.691 0.478 0.522 

H CBO,I c IC CD 0.599 0.359 0.201 H CBO,I c IC CD 0.768 0.589 0.263 

H CBO EC CD 0.239 0.057 0.054 H CBO EC CD 0.312 0.097 0.609 

H CBO,I c EC CD 0.422 0.178 0.134 H CBO,I c EC CD 0.429 0.184 0.816 

H CBO IC CM 0.762 0.58 0.003 H CBO IC CM 0.582 0.338 0.003 

H CBO,Ic IC CM 0.885 0.784 0.006 H CBO,Ic IC CM 0.703 0.494 0.006 

H CBO EC CM 0.235 0.056 0.04 H CBO EC CM 0.313 0.098 0.019 

H CBO,Ic EC CM 0.58 0.336 0.035 H CBO,Ic EC CM 0.428 0.184 0.016 

202 jess 

Hypothesis Y R R 2 P > F 

H CBO IC CC 0.553 0.306 0.002 

H CBO,Ic IC CC 0.703 0.494 0.001 

H CBO EC CC 0.428 0.184 0.031 

H CBO,I c EC CC 0.567 0.322 0.023 

H CBO IC CD 0.765 0.586 0.145 

H CBO,I c IC CD 0.868 0.754 0.321 

H CBO EC CD 0.691 0.748 0.246 

H CBO,Ic EC CD 0.723 0.523 0.135 

H CBO IC CM 0.762 0.581 0.023 

H CBO,Ic IC CM 0.922 0.852 0.012 

H CBO EC CM 0.618 0.382 0.001 

H CBO,I c EC CM 0.645 0.416 0.002 

213 javac 


H CBO IC CC 0.535 0.286 0.005 

H CBO,Ic IC CC 0.748 0.559 0.002 

H CBO EC CC 0.443 0.196 0.004 

H CBO,I c EC CC 0.531 0.282 0.007 

H CBO IC CD 0.512 0.262 0.234 

H CBO,I c IC CD 0.606 0.367 0.176 

H CBO EC CD 0.872 0.76 0.765 

H CBO,Ic EC CD 0.922 0.85 0.567 

H CBO IC CM 0.553 0.306 0.034 

H CBO,Ic IC CM 0.76 0.577 0.024 

H CBO EC CM 0.321 0.107 0.042 

H CBO,I c EC CM 0.567 0.322 0.034 

205 raytrace 


H CBO IC CC 0.444 0.197 0.021 

H CBO,I c IC CC 0.659 0.434 0.002 

H CBO EC CC 0.59 0.349 0.043 

H CBO,I c EC CC 0.669 0.447 0.032 

H CBO IC CD 0.256 0.065 0.342 

H CBO,I c IC CD 0.36 0.13 0.365 

H CBO EC CD 0.239 0.057 0.123 

H CBO,Ic EC CD 0.363 0.132 0.432 

H CBO IC CM 0.443 0.196 0.034 

H CBO,I c IC CM 0.599 0.359 0.032 

H CBO EC CM 0.422 0.178 0.012 

H CBO,I c EC CM 0.632 0.399 0.032 

222 mpegaudio 


H CBO IC CC 0.174 0.032 0.003 

H CBO,I c IC CC 0.452 0.204 0.001 

H CBO EC CC 0.296 0.088 0.013 

H CBO,I c EC CC 0.635 0.403 0.006 

H CBO IC CD 0.734 0.538 0.165 

H CBO,I c IC CD 0.885 0.784 0.214 

H CBO EC CD 0.948 0.899 0.234 

H CBO,Ic EC CD 0.978 0.956 0.654 

H CBO IC CM 0.753 0.567 0.001 

H CBO,I c IC CM 0.769 0.592 0.002 

H CBO EC CM 0.533 0.284 0.021 

H CBO,I c EC CM 0.635 0.403 0.03 

228 jack 


H CBO IC CC 0.606 0.367 0.003 

H CBO,I c IC CC 0.966 0.933 0.012 

H CBO EC CC 0.512 0.262 0.002 

H CBO,I c EC CC 0.872 0.76 0.003 

H CBO IC CD 0.239 0.057 0.465 

H CBO,Ic IC CD 0.618 0.382 0.450 

H CBO EC CD 0.363 0.132 0.123 

H CBO,I c EC CD 0.419 0.178 0.576 

H CBO IC CM 0.585 0.343 0.013 

H CBO,I c IC CM 0.599 0.359 0.002 

H CBO EC CM 0.363 0.132 0.045 

H CBO,I c EC CM 0.417 0.174 0.032


A.2.2 


BH 


H CBO IC CC 0.531 0.282 0.038 

H CBO,I c IC CC 0.767 0.588 0.044 

H CBO EC CC 0.092 0.008 0.0001 

H CBO,I c EC CC 0.533 0.284 0.0001 

H CBO IC CD 0.431 0.185 0.247 

H CBO,I c IC CD 0.617 0.381 0.237 

H CBO EC CD 0.443 0.196 0.232 

H CBO,Ic EC CD 0.514 0.264 0.398 

H CBO IC CM 0.45 0.203 0.024 

H CBO,I c IC CM 0.635 0.403 0.013 

H CBO EC CM 0.443 0.196 0.032 

H CBO,I c EC CM 0.514 0.264 0.024 

MST 


H CBO IC CC 0.97 0.941 0.001 

H CBO,I c IC CC 0.972 0.945 0.0001 

H CBO EC CC 0.606 0.367 0.002 

H CBO,I c EC CC 0.76 0.577 0.001 

H CBO IC CD 0.966 0.933 0.200 

H CBO,I c IC CD 0.987 0.974 0.401 

H CBO EC CD 0.239 0.057 0.649 

H CBO,Ic EC CD 0.618 0.382 0.486 

H CBO IC CM 0.966 0.933 0.002 

H CBO,I c IC CM 0.987 0.974 0.004 

H CBO EC CM 0.239 0.057 0.049 

H CBO,I c EC CM 0.618 0.382 0.086 

Em3d 


H CBO IC CC 0.617 0.381 0.046 

H CBO,I c IC CC 0.748 0.659 0.001 

H CBO EC CC 0.262 0.069 0.03 

H CBO,I c EC CC 0.937 0.878 0.024 

H CBO IC CD 0.59 0.349 0.294 

H CBO,Ic IC CD 0.591 0.349 0.651 

H CBO EC CD 0.02 0.00 0.975 

H CBO,I c EC CD 0.626 0.392 0.608 

H CBO IC CM 0.59 0.349 0.194 

H CBO,I c IC CM 0.591 0.349 0.151 

H CBO EC CM 0.02 0.000 0.075 

H CBO,I c EC CM 0.626 0.392 0.008 

Perimeter 


H CBO IC CC 0.36 0.13 0.306 

H CBO,I c IC CC 0.422 0.178 0.503 

H CBO EC CC 0.095 0.009 0.194 

H CBO,I c EC CC 0.599 0.359 0.211 

H CBO IC CD 0.512 0.262 0.131 

H CBO,Ic IC CD 0.585 0.343 0.230 

H CBO EC CD 0.256 0.065 0.476 

H CBO,I c EC CD 0.58 0.336 0.238 

H CBO IC CM 0.645 0.416 0.044 

H CBO,I c IC CM 0.66 0.435 0.135 

H CBO EC CM 0.256 0.065 0.076 

H CBO,I c EC CM 0.58 0.336 0.038 

Health 


H CBO IC CC 0.601 0.372 0.04 

H CBO,I c IC CC 0.643 0.414 0.003 

H CBO EC CC 0.22 0.048 0.06 

H CBO,Ic EC CC 0.254 0.064 0.13 

H CBO IC CD 0.659 0.434 0.075 

H CBO,Ic IC CD 0.753 0.566 0.124 

H CBO EC CD 0.444 0.197 0.27 

H CBO,I c EC CD 0.535 0.286 0.431 

H CBO IC CM 0.669 0.447 0.07 

H CBO,I c IC CM 0.76 0.578 0.116 

H CBO EC CM 0.444 0.197 0.207 

H CBO,Ic EC CM 0.535 0.286 0.431 

Power 


H CBO IC CC 0.709 0.502 0.042 

H CBO,I c IC CC 0.713 0.508 0.001 

H CBO EC CC 0.635 0.404 0.011 

H CBO,Ic EC CC 0.872 0.76 0.001 

H CBO IC CD 0.104 0.011 0.844 

H CBO,Ic IC CD 0.723 0.523 0.329 

H CBO EC CD 0.363 0.132 0.479 

H CBO,I c EC CD 0.632 0.399 0.465 

H CBO IC CM 0.067 0.004 0.9 

H CBO,I c IC CM 0.638 0.407 0.456 

H CBO EC CM 0.417 0.174 0.010 

H CBO,Ic EC CM 0.673 0.453 0.005 

Voronoi 


H CBO IC CC 0.922 0.85 0.009 

H CBO,Ic IC CC 0.941 0.885 0.0001 

H CBO EC CC 0.553 0.306 0.255 

H CBO,Ic EC CC 0.561 0.314 0.568 

H CBO IC CD 0.762 0.58 0.078 

H CBO,I c IC CD 0.768 0.589 0.263 

H CBO EC CD 0.627 0.393 0.183 

H CBO,I c EC CD 0.636 0.405 0.459 

H CBO IC CM 0.765 0.586 0.076 

H CBO,Ic IC CM 0.77 0.594 0.059 

H CBO EC CM 0.627 0.393 0.083 

H CBO,Ic EC CM 0.636 0.405 0.029


A.2.3 


Velocity 


H CBO IC CC 0.515 0.265 0.0001 

H CBO,Ic IC CC 0.722 0.521 0.001 

H CBO EC CC 0.381 0.145 0.014 

H CBO,I c EC CC 0.617 0.381 0.025 

H CBO IC CD 0.595 0.354 0.254 

H CBO,I c IC CD 0.741 0.547 0.354 

H CBO EC CD 0.677 0.458 0.144 

H CBO,Ic EC CD 0.861 0.741 0.214 

H CBO IC CM 0.675 0.455 0.005 

H CBO,Ic IC CM 0.752 0.565 0.004 

H CBO EC CM 0.409 0.167 0.007 

H CBO,I c EC CM 0.506 0.256 0.01 



H CBO IC CC 0.453 0.205 0.002 

H CBO,I c IC CC 0.637 0.406 0.001 

H CBO EC CC 0.430 0.185 0.002 

H CBO,I c EC CC 0.570 0.325 0.004 

H CBO IC CD 0.709 0.502 0.547 

H CBO,Ic IC CD 0.892 0.796 0.214 

H CBO EC CD 0.830 0.689 0.114 

H CBO,Ic EC CD 0.857 0.735 0.147 

H CBO IC CM 0.652 0.425 0.006 

H CBO,I c IC CM 0.762 0.581 0.005 

H CBO EC CM 0.504 0.254 0.011 

H CBO,I c EC CM 0.624 0.389 0.007 

Ant 


H CBO IC CC 0.604 0.365 0.005 

H CBO,I c IC CC 0.765 0.585 0.006 

H CBO EC CC 0.453 0.205 0.014 

H CBO,Ic EC CC 0.636 0.405 0.018 

H CBO IC CD 0.597 0.356 0.154 

H CBO,Ic IC CD 0.698 0.487 0.198 

H CBO EC CD 0.518 0.268 0.287 

H CBO,I c EC CD 0.667 0.445 0.098 

H CBO IC CM 0.725 0.525 0.017 

H CBO,I c IC CM 0.784 0.615 0.025 

H CBO EC CM 0.451 0.204 0.042 

H CBO,I c EC CM 0.560 0.314 0.034

Appendix B 

Case Study 2: The Impact of 

Run-time Cohesion on Object 

Behaviour 

Appendix B.1 contains the PCA test results for the JOlden benchmark suite and for 

the real-world programs Velocity, Xalan and Ant. Values deemed to be significant 

at the level p ≤ 0.05 are highlighted. 

Appendix B.2 contains the results from the multiple linear regression used to test 

the hypothesis H 0 , that measures of run-time cohesion provide a better indication 

of N OC than a static measure alone for the JOlden benchmark programs and the 

real-world programs Velocity, Xalan and Ant. All significant results are highlighted. 

B.1 PCA Test Results for all programs. 

B.1.1 


BH 

P C1 P C2 

S LCOM 0.214 0.754 

R LCOM 0.714 0.214 

RW LCOM 0.721 0.101 

MST 

P C1 P C2 

S LCOM 0.251 0.712 

R LCOM 0.714 0.211 

RW LCOM 0.751 0.165 

Em3d 

P C1 P C2 

S LCOM 0.135 0.812 

R LCOM 0.841 0.014 

RW LCOM 0.814 0.014 

Perimeter 

P C1 P C2 

S LCOM 0.025 0.912 

R LCOM 0.874 0.145 

RW LCOM 0.768 0.121 

Health 

P C1 P C2 

S LCOM 0.122 0.789 

R LCOM 0.674 0.145 

RW LCOM 0.714 0.212 

Power 

P C1 P C2 

S LCOM 0.142 0.775 

R LCOM 0.654 0.154 

RW LCOM 0.698 0.177 

106

B.2. Multiple linear regression results for all programs. 107 

Voronoi 

P C1 P C2 

S LCOM 0.045 0.901 

R LCOM 0.854 0.104 

RW LCOM 0.868 0.021 

B.1.2 


Velocity 

P C1 P C2 

S LCOM 0.215 0.614 

R LCOM 0.814 0.124 

RW LCOM 0.751 0.165 


P C1 P C2 

S LCOM 0.315 0.554 

R LCOM 0.714 0.116 

RW LCOM 0.641 0.225 

Ant 

P C1 P C2 

S LCOM 0.114 0.712 

R LCOM 0.814 0.124 

RW LCOM 0.801 0.101 

B.2 Multiple linear regression results for all programs. 

B.2.1 


BH 


H SLCOM N OC 0.444 0.197 0.016 

H SLCOM ,R LCOM N OC 0.711 0.507 0.01 

H SLCOM N OC 0.105 0.012 0.452 

H SLCOM ,RW LCOM N OC 0.631 0.398 0.487 

Em3d 


H SLCOM N OC 0.365 0.134 0.006 


H SLCOM N OC 0.415 0.173 0.254 


Health 


H SLCOM N OC 0.518 0.268 0.012 


H SLCOM N OC 0.445 0.198 0.124 


MST 


H SLCOM N OC 0.235 0.056 0.025 


H SLCOM N OC 0.555 0.308 0.121 


Perimeter 


H SLCOM N OC 0.514 0.263 0.002 


H SLCOM N OC 0.366 0.135 0.048 


Power 


H SLCOM N OC 0.177 0.035 0.028 


H SLCOM N OC 0.445 0.198 0.214 


Voronoi 


H SLCOM N OC 0.523 0.273 0.004 


H SLCOM N OC 0.255 0.064 0.381 

H SLCOM ,RW LCOM N OC 0.333 0.129 0.358

B.2. Multiple linear regression results for all programs. 108 

B.2.2 


Velocity 


H SLCOM N OC 0.445 0.198 0.002 


H SLCOM N OC 0.363 0.132 0.456 




H SLCOM N OC 0.242 0.06 0.044 


H SLCOM N OC 0.722 0.523 0.287 


Ant 


H SLCOM N OC 0.455 0.207 0.0001 


H SLCOM N OC 0.633 0.401 0.214 

H SLCOM ,RW LCOM N OC 0.69 0.747 0.564

Appendix C 

Case Study 3: A Study of 

Run-time Coupling Metrics and 

Fault Detection 

Appendix C.1 contains the results from the regression analysis used to test the 

hypothesis H 0 , that run-time coupling metrics are poor detectors of faults in a 

program for the set of real-world programs Velocity, Xalan and Ant. 

Appendix C.2 presents the results to test the hypothesis H 0 , that coverage measures 

are poor detectors of faults in a program for the real-world programs Velocity, 

Xalan and Ant. All significant results are highlighted. 

109

C.1. Regression analysis results for real-world programs, Velocity, Xalan 

and Ant. 110 

C.1 Regression analysis results for real-world programs, 

Velocity, Xalan and Ant. 

C.1.1 

For Class Mutants 

Velocity 


H IC CC M K 0.830 0.689 0.002 

H IC CM M K 0.766 0.587 0.001 

H IC CD M K 0.684 0.468 0.006 

H EC CC M K 0.790 0.621 0.007 

H EC CM M K 0.754 0.569 0.411 

H EC CD M K 0.491 0.241 0.456 



H IC CC M K 0.767 0.589 0.003 

H IC CM M K 0.705 0.498 0.002 

H IC CC M K 0.710 0.504 0.001 

H EC CD M K 0.706 0.499 0.046 

H EC CM M K 0.706 0.499 0.254 

H EC CD M K 0.649 0.421 0.680 

Ant 


H IC CC M K 0.773 0.598 0.003 

H IC CM M K 0.708 0.501 0.005 

H IC CC M K 0.829 0.687 0.001 

H EC CD M K 0.749 0.561 0.075 

H EC CM M K 0.749 0.561 0.342 

H EC CD M K 0.463 0.214 0.127



C.1.2 

For Traditional Mutants 

Velocity 


H IC CC M K 0.570 0.325 0.048 

H IC CM M K 0.644 0.415 0.054 

H IC CD M K 0.375 0.141 0.065 

H EC CC M K 0.642 0.412 0.115 

H EC CM M K 0.463 0.214 0.256 

H EC CD M K 0.392 0.154 0.658 



H IC CC M K 0.598 0.358 0.091 

H IC CM M K 0.567 0.321 0.078 

H IC CC M K 0.463 0.214 0.154 

H EC CD M K 0.676 0.457 0.254 

H EC CM M K 0.459 0.211 0.254 

H EC CD M K 0.381 0.145 0.351 

Ant 


H IC CC M K 0.740 0.547 0.065 

H IC CM M K 0.649 0.421 0.085 

H IC CC M K 0.463 0.214 0.159 

H EC CD M K 0.577 0.333 0.241 

H EC CD M K 0.606 0.367 0.154 

H EC CM M K 0.536 0.287 0.054 

H EC CD M K 0.459 0.211 0.216 

C.2 Regression analysis results for real-world programs, 

Velocity, Xalan and Ant. 

C.2.1 

For Class Mutants 

Velocity 


H Ic M K 0.326 0.106 0.032 



H I c M K 0.409 0.167 0.004 

Ant 


H I c M K 0.344 0.118 0.005



C.2.2 

For Traditional Mutants 

Velocity 


H Ic M K 0.888 0.789 0.003 



H I c M K 0.803 0.645 0.024 

Ant 


H I c M K 0.836 0.699 0.019

Appendix D 

Mutation operators in µJava 

Table D.1 presents a description of the traditional-level mutation operators in µJava. 

Table D.2 presents a description of the class-level mutation operators in µJava. 

Operator 

ABS 

AOR 

LCR 

ROR 

UOI 

Description 

Absolute value insertion 

Arithmetic operator replacement 

Logical connector replacement 

Relational operator replacement 

Unary operator insertion 

Table D.1: Traditional-level mutation operators in µJava 

113

Appendix D. Mutation operators in µJava 114 

Language Feature Operator Description 

Inheritance: IHD Hiding variable deletion 

IHI Hiding variable delection 

IOD Overriding method deletion 

IOP Overriding method calling position change 

IOR Overriding method rename 

ISK super key word deletion 

IPC Explicit call of a parents constructor deletion 

Polymorphism: PNC new method call with child class type 

PMD Instance variable declaration with parent class type 

PPD Parameter variable declaration with child class type 

PRV Reference assignment with other comparable type 

Overloading: OMR Overloading method contents change 

OMD Overloading method deletion 

OAO Argument order change 

OAN Argument number change 

Java-specific features: JTD this keyword deletion 

JSC static modifier change 

JID Member variable initialization deletion 

JDC Java-supported default constructor creation 

Common programming EOA Reference assignment and content 

mistakes: 

assignment replacement 

EOC Reference comparison and content 

comparison replacement 

EAM Accessor method change 

EMM Modifier method change 

Table D.2: Class-level mutation operators in µJava

Bibliography 

[1] F. Abreu, M. Goulo, and R. Esteves. Toward the design quality evaluation of 

object-oriented software systems. In Fifth International Conference on Software 

Quality, pages 44–57, Austin, Texas, USA, Oct 1995. 

[2] A.J. Albrecht. Measuring application development. In IBM Applications Development 

joint SHARE/GUIDE symposium, pages 83–92, Monterey California, 

USA, 1979. 

[3] R.T. Alexander and J. Offutt. Coupling-based testing of O-O programs. The 

Journal of Universal Computer Science, 10(4):391–427, 2004. 

[4] The Apache Ant Project. Ant. http://ant.apache.org/. 

[5] E. Arisholm, L.C. Briand, and A. Foyen. Dynamic coupling measures for objectoriented 

software. IEEE Transactions on Software Engineering, 30(8):491–506, 

2004. 

[6] V.R. Basili, L.C. Briand, and W.L. Melo. A validation of object-oriented design 

metrics as quality indicators. IEEE Transactions on Software Engineering, 

22(10):751–761, October 1996. 

[7] B. Beizer. Software Testing Techniques. 2nd edition, Van Nostrand Reinhold, 

New York, USA, 1990. 

[8] Standard Performance Evaluation Corporation SPECjvm98 Benchmarks. 

http://www.spec.org/jvm98. 

115

Bibliography 116 

[9] J.M. Bieman and B.K. Kang. Cohesion and reuse in an object-oriented system. 

In ACM Symposium on Software Reusability, pages 295–262, Seattle, Washington, 

USA, 1995. 

[10] R. Binder. Testing Object Oriented Systems: Models, Patterns and Tools. Addison 

Wesley, Boston, Massachusetts, USA, October 1999. 

[11] L.C. Briand, J. Daly, V. Porter, and J. Wüst. A comprehensive empirical validation 

of product measures in object-oriented systems. Technical Report ISERN- 

98-07, Fraunhofer Institute for Experimental Software Engineering, Germany, 

1998. 

[12] L.C. Briand, J.W. Daly, and J.K. Wüst. A unified framework for cohesion 

measurement in object-oriented systems. Empirical Software Engineering: An 

International Journal, 3(1):65–117, 1998. 

[13] L.C. Briand, J.W. Daly, and J.K. Wüst. A unified framework for coupling 

measurement in object-oriented systems. IEEE Transactions on Software Engineering, 

25(1):91–121, Jan/Feb 1999. 

[14] L.C. Briand, P. Devanbu, and W. Melo. An investigation into coupling measures 

for C++. In 19th International Conference on Software Engineering, pages 

412–421, Boston, USA, May 1997. 

[15] L.C. Briand, W.L. Melo, and J. Wüst. Assessing the applicability of faultproneness 

models across object-oriented software projects. IEEE Transactions 

on Software Engineering, 28(7):706–720, 2002. 

[16] L.C. Briand, S. Morasca, and V. Basili. Measuring and assessing maintainability 

at the end of high-level design. In International Conference on Software 

Maintenance, pages 88–97, Montreal, Canada, 1993. 

[17] L.C. Briand, S. Morasca, and V. Basili. Defining and validating high-level design 

metrics. Technical Report CS-TR 3301, Department of Computer Science, 

University of Maryland, College Park, MD 20742, USA, 1994.


[18] L.C. Briand, S. Morasca, and V.R. Basili. An operational process for goaldriven 

definition of measures. IEEE Transactions on Software Engineering, 

28(12):1106–1125, December 2002. 

[19] L.C. Briand, J.K. Wüst, J.W. Daly, and V. Porter. Exploring the relationship 

between design measures and software quality in object-oriented systems. The 

Journal of Systems and Software, 51:245–273, 2000. 

[20] S. Brown, Á. Mitchell, and J.F. Power. A coverage analysis of Java benchmark 

suites. In IASTED International Conference on Software Engineering, pages 

144–150, Innsbruck, Austria, Feburary 15-17 2005. 

[21] M. Bunge. Treatise on Basic Philosophy: Ontology II: The World of Systems. 

Riedel, Boston, USA, 1972. 

[22] X. Cai and M.R. Lyu. The effect of code coverage on fault detection under 

different testing profiles. In First International Workshop on Advances in 

Model-based Testing, pages 1–7, St. Louis, Missouri, USA, 2005. 

[23] M.C. Carlisle and A. Rogers. Software caching and computation migration in 

olden. In ACM Symposium on Principles and Practice of Parallel Programming, 

pages 29–38, Santa Barbara, California, USA, July 1995. 

[24] I.M. Chakravarti, R.G. Laha, and J. Roy. Handbook of Methods of Applied 

Statistics, volume 1. John Wiley and Sons, New York, USA, 1967. 

[25] S.R. Chidamber and C.F. Kemerer. Towards a metrics suite for object-oriented 

design. In Object Oriented Programming Systems Languages and Applications, 

pages 197–211, Phoenix, Arizona, USA, November 1991. 

[26] S.R. Chidamber and C.F. Kemerer. A metrics suite for object-oriented design. 

IEEE Transactions on Software Engineering, 20(6):467–493, June 1994. 

[27] E.J. Chikofsky and J.H. Cross II. Reverse engineering and design recovery: A 

taxonomy. IEEE Software, 7(1):13–17, 1990.


[28] J. Choi, M. Gupta, M.J. Serrano, V.C. Sreedhar, and S.P. Midkiff. Stack 

allocation and synchronization optimizations for Java using escape analysis. 

ACM Transactions on Programming Languages and Systems, 25(6):876 – 910, 

November 2003. 

[29] L.L Constantine and E. Yourdon. Structured Design. Prentice-Hall, Englewood 

Cliffs, New Jersey, USA, 1979. 

[30] M. Dahm. Byte Code Engineering Library (BCEL), version 5.1, April 25 2004. 

http://jakarta.apache.org/bcel/. 

[31] D.P. Darcy, C.F. Kemerer, S.A. Slaughter, and T.A. Tomayko. The structural 

complexity of software: An experimental test. TOSE, 31(11):982–995, 2005. 

[32] S. Demeyer, S. Ducasse, and O. Nierstrasz. Finding refactorings via change 

metrics. In 15th ACM SIGPLAN conference on Object-oriented programming, 

systems, languages, and applications, pages 166–178, Minneapolis, Minnesota, 

USA, 2000. 

[33] B. Dufour, K. Driesen, L. J. Hendren, and C. Verbrugge. Dynamic metrics for 

Java. In Conference on Object-Oriented Programming Systems, Languages and 

Applications, pages 149–168, Anaheim, California, USA, October 26-30 2003. 

[34] J. Eder, G. Kappel, and M. Schrefl. Coupling and cohesion in object–oriented 

systems. Technical Report 2/93, Department of Information Systems, University 

of Linz, Linz, Austria, 1993. 

[35] D.W. Embley and S.N. Woodfield. Cohesion and coupling for abstract data 

types. In 6th International Phoenix Conference on Computers and Communications, 

pages 144–153, Phoenix, Arizona, USA, 1987. 

[36] T. J. Emerson. A discriminant metric for module cohesion. In 7th 1nternational 

Conference on Software Engineering, pages 294–303, Orlando, Florida, USA, 

1984.


[37] T. J. Emerson. Program testing, path coverage, and the cohesion metric. In 

Computer Software Application Conference, pages 421–431, Chicago, Illinois, 

USA, 1984. 

[38] J. Engel. Programming for the Java Virtual Machine. Addison-Wesley, California, 

USA, 1999. 

[39] N.E. Fenton and M. Neil. Software metrics: Successes, failures and new directions. 

The Journal of Systems and Software, 47:149–157, 1999. 

[40] N.E. Fenton and S.L. Pfleeger. Software Metrics: A Rigorous and Practical 

Approach. PWS Publishing Company, Boston, Massachusetts, USA, 1997. 

[41] F. Fioravanti and P. Nesi. A study on fault-proneness detection of objectoriented 

systems. In Fifth European Conference on Software Maintenance and 

Reengineering, pages 121–130, Lisbon, Portugal, 14-16 March 2001. 

[42] P.G. Frankl and E.J. Weyuker. An applicable family of data flow testing criteria. 

IEEE Transactions on Software Engineering, 14(10):1483–1498, 1988. 

[43] R.J. Freund and W.J. Wilson. Regression Analysis: Statistical Modeling of a 

Response Variable. Academic Press, 1998. 

[44] R.R. Gonzalez. A unified metric of software complexity: Measuring productivity, 

quality and value. The Journal of Systems and Software, 29(1):17–37, 

1995. 

[45] D. Gregg, J. Power, and J. Waldron. Platform independent dynamic Java virtual 

machine analysis: the Java Grande Forum benchmark suite. Concurrency 

and Computation: Practice and Experience, 15(3-5):459–484, March 2003. 

[46] N. Gupta and P. Rao. Program execution based module cohesion measurement. 

In 16th International Conference on Automated on Software Engineering, pages 

144–153, San Diego, USA, Nov 2001. 

[47] M. Halstead. Elements of Software Science. North-Holland, Amsterdam, 1977.


[48] R. G. Hamlet. Testing programs with the aid of a compiler. IEEE Transactions 

on Software Engineering, 3(4):279–290, 1977. 

[49] B. Henderson-Sellers. Software Metrics. Prentice Hall, Hemel Hempstaed, U.K., 

1996. 

[50] B. Henderson-Sellers and J. Edwards. Object-Oriented Knowledge: The Working 

Object (Book Two). Prentice Hall, Sydney, Australia, 1994. 

[51] M. Hitz and B. Montazeri. Measuring coupling and cohesion in object-oriented 

systems. In International Symposium on Applied Corporate Computing, pages 

25–27, Monterrey, Mexico, October 1995. 

[52] M. Hitz and B. Montazeri. Measuring product attributes of object-oriented 

systems. In Fifth European Software Engineering Conference, pages 124 – 136, 

Barcelona, Spain, September 1995. 

[53] C. Howells. Gretel: An open-source residual test coverage tool, June 2002. 

http://www.cs.uoregon.edu/research/perpetual/Software/Gretel/. 

[54] T.O. Humphries, A. Klauser, A.L. Wolf, and B.G. Zorn. An infrastructure for 

generating and sharing experimental workloads for persistent object systems. 

Software–Practice and Experience, 30:387–417, 2000. 

[55] Jakarta. The Apache Jakarta Project. http://jakarta.apache.org/. 

[56] I.T. Jolliffe. Principal Component Analysis. Springer Verlag, 2nd edition, 2002. 

[57] Jikes JVM. http://www-124.ibm.com/developerworks/oss/jikes/. 

[58] Kaffe JVM. http://www.kaffe.org/. 

[59] Sable JVM. http://sablevm.org/. 

[60] B. Kitchemham and S.L. Pfleeger. Software quality: The elusive target. IEEE 

Software, pages 12–21, 1996.


[61] A. Lakhotia. Rule-based approach to computing module cohesion. In 15th International 

Conference on Software Engineering, pages 35–44, Baltimore, Maryland, 

USA, 1993. 

[62] Y.S. Lee, B.S. Liang, S.F. Wu, and F.J. Wang. Measuring the coupling and 

cohesion of an object-oriented program based on information flow. In International 

Conference on Software Quality, pages 81–90, Maribor, Slovenia, 1995. 

[63] W. Li and S. Henry. Object-oriented metrics that predict maintainability. The 

Journal of Systems and Software, 23(2):111–122, 1993. 

[64] R.J. Lipton, R.A. DeMillo, and F.G. Sayward. Hints on test data selection: 

Help for the practicing programmer. IEEE Computer, 11(4):34–41, 1978. 

[65] M. Lorenz and J. Kidds. Object-Oriented Software Metrics. Prentice Hall 

Object-Oriented Series, Englewood Cliffs, USA, 1994. 

[66] Y. Ma, J. Offutt, and Y. Kwon. Mujava: An automated class mutation system. 

The Journal of Software Testing, Verification and Reliability, 15(2):97– 

133, June 2005. 

[67] Y.S. Ma, Y.R. Kwon, and J. Offutt. mujava. http://www.isse.gmu.edu/- 

faculty/ofut/mujava/. 

[68] Y.S. Ma, Y.R. Kwon, and J. Offutt. Inter-class mutation operators for java. 

In 13th International Symposium on Software Reliability Engineering, pages 

352–363, Annapolis, Maryland, USA, November 2002. 

[69] Y.K. Malaiya, M.N. Li, J.M. Bieman, and R. Karcich. Software reliability 

growth with test coverage. IEEE Transactions on Reliability, 51(4):420426, 

December 2002. 

[70] R. Martin. OO design quality metrics: An analysis of dependencies. In Proceedings 

Workshop on Pragmatic and Theoretical Directions in Object-Oriented 

Software Metrics, 1994.


[71] T. McCabe. A software complexity measure. IEEE Transactions on Software 

Engineering, 2(4):308–320, 1976. 

[72] J.D. McGregor and D.A. Sykes. A Practical Guide to Testing Object-oriented 

Software. Addison Wesley, March 2001. 

[73] P.C. Mehlitz. Performance analysis of Java implementations. In 

http://www.transvirtual.com/presentations/speed/index.html. 

[74] Á. Mitchell and J.F. Power. Masters thesis: Dynamic coupling and cohesion 

metrics for Java programs. Department of Computer Science, N.U.I. Maynooth, 

Co. Kildare, Ireland, Aug 2002. 

[75] Á. Mitchell and J.F. Power. Run-time cohesion metrics for the analysis of 

Java programs - preliminary results from the SPEC and Grande suites. Technical 

Report NUIM-CS-TR2003-08, Department of Computer Science, N.U.I. 

Maynooth, Co. Kildare, Ireland, April 2003. 

[76] Á. Mitchell and J.F. Power. Run-time coupling metrics for the analysis of 

Java programs - preliminary results from the SPEC and Grande suites. Technical 

Report NUIM-CS-TR2003-07, Department of Computer Science, N.U.I. 

Maynooth, Co. Kildare, Ireland, April 2003. 

[77] Á. Mitchell and J.F. Power. Toward a definition of run-time object-oriented 

metrics. In 7th ECOOP Workshop on Quantitative Approaches in Object- 

Oriented Software Engineering, Darmstadt, Germany, July 2003. 

[78] Á. Mitchell and J.F. Power. An empirical investigation into the dimensions of 

run-time coupling in java programs. In 3rd Conference on the Principles and 

Practice of Programming in Java, pages 9–14, Las Vegas, Nevada, USA, June 

16-18 2004. 

[79] Á. Mitchell and J.F. Power. Run-time cohesion metrics: An empirical investigation. 

In International Conference on Software Engineering Research and 

Practice, pages 532–537, Las Vegas, Nevada, USA, June 21-24 2004.


[80] Á. Mitchell and J.F. Power. A study of the influence of coverage on the relationship 

between static and dynamic coupling metrics. Science of Computer 

Programming Elsevier, Accepted for Publication, March 2005. 

[81] Á. Mitchell and J.F. Power. Using object-level run-time metrics to study coupling 

between objects. In ACM Symposium on Applied Computing, pages 1456– 

1463, Santa Fe, New Mexico, USA, March 13-17 2005. 

[82] G. Myers. Reliable Software Through Composite Design. Mason and Lipscomb 

Publishers, New York, USA, 1974. 

[83] G. Myers. Composite Structured Design. Van Nostrand Reinhold, New York, 

USA, 1978. 

[84] S. Ntafos. A comparison of some structural testing strategies. IEEE Transactions 

on Software Engineering, 14(6):868–874, June 1988. 

[85] A.J. Offutt, M.J. Harrold, and P. Kolte. A software metrics system for module 

coupling. The Journal of Systems and Software, 20(3):295–308, 1993. 

[86] J. Offutt, R. Alexander, Y. Wu, Q. Xiao, and C. Hutchinson. A fault model for 

subtype inheritance and polymorphism. In 12th International Symposium on 

Software Reliability Engineering, pages 84–93, Hong Kong, China, November 

2001. 

[87] J. Offutt, A. Lee, G. Rothermel, R. Untch, and C. Zapf. An experimental 

determination of sufficient mutation operators. ACM Transactions on Software 

Engineering Methodology, 5(2):99–118, April 1996. 

[88] L. M. Ott and J. J. Thuss. The relationship between slices and module cohesion. 

In 11th International Conference on Software Engineering, pages 198 – 204, 

Pittsburgh, Pennsylvania, USA, 1989. 

[89] M. Page-Jones. The Practical Guide to Structured Systems Design. Yourdon 

Press, New York, NY, 1980.


[90] A.V. Pearson and H.O. Hartley. Biometrica Tables for Statisticians, volume 2. 

Cambridge University Press, Cambridge, England, 1972. 

[91] S. Phattarsukol and P. Muenchaisri. Identifying candidate objects using hierarchical 

clustering analysis. In 8th Asia-Pacific Software Engineering Conference, 

pages 381–389, Macao, Japan, December 4-7 2001. 

[92] The Apache XML Project. Xalan. http://xml.apache.org/xalan-j//. 

[93] S.S. Shapiro and M.B. Wilk. An analysis of variance test for normality (complete 

samples). Biometrika, 52(3/4):591–611, 1965. 

[94] M.L Shooman. Software Engineering: Design, Reliability and Management. 

McGraw Hill, New York, USA, 1983. 

[95] W.P. Stevens, G.J. Myers, and L. L. Constantine. Structured design. IBM 

Systems Journal, 13(2):115–139, 1974. 

[96] Sun Microsystems, Inc. Java Platform Debug Architecture (JPDA). http://- 

java.sun.com/products/jpda. 

[97] TimeWeb. Correlation explained. http://www.bized.ac.uk/timeweb/crunching/- 

crunch relate expl.htm, 2002. 

[98] D.A. Troy and S.H. Zweben. Measuring the quality of structured designs. The 

Journal of Systems and Software, 2:112–120, 1981. 

[99] S.M. Yacoub, H.H. Ammar, and T. Robinson. Dynamic metrics for objectoriented 

designs. In Software Metrics Symposium, pages 50–61, Boca Raton, 

Florida, USA, Nov 4-6 1999. 

Note: All URL’s correct as of 30 th September 2005

an empirical study of run-time coupling and cohesion software metrics

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?