
are implemented on Weka 3.6.1 using its default settings, since this experiment is designed to answer the three research questions.

The second independent variable, TrainingSet, refers to the number of instances (i.e., executions) in a training set. As the number of available training instances may be associated with the scale of a program, we use the ratio between the number of training instances and the number of executable statements to represent TrainingSet, which is set to 20%, 40%, 50%, 60%, 80%, and 100% in our empirical study.
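For concreteness, the following minimal sketch (an illustration under an assumed example value of n, not code from the study) maps the TrainingSet ratios to the training-set sizes they imply for a subject with a given number of executable statements.

```python
# Illustrative sketch: deriving training-set sizes from the TrainingSet ratios.
# The example value of n (500 executable statements) is an assumption, not from the paper.
RATIOS = [0.20, 0.40, 0.50, 0.60, 0.80, 1.00]

def training_set_sizes(n_executable_statements):
    """Map each TrainingSet ratio to the number of training instances it implies."""
    return {ratio: round(ratio * n_executable_statements) for ratio in RATIOS}

print(training_set_sizes(500))  # {0.2: 100, 0.4: 200, 0.5: 250, 0.6: 300, 0.8: 400, 1.0: 500}
```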

The third independent variable, Attribute, refers to which type of execution data has been collected during the execution of the program and used to classify the execution data. In our empirical study, the value of the variable Attribute can be any of the following: statement count (abbreviated as #statements, which is the number of times each statement has been executed), statement coverage (abbreviated as ?statement, which is whether each statement has been executed), method count (abbreviated as #method, which is the number of times each method has been executed), method coverage (abbreviated as ?method, which is whether each method has been executed), and branch coverage (abbreviated as ?branch, which is whether each branch has been executed). To record the execution information on branches when running the program, we used the “gcov” command of GCC, which collects branches as predicates within a branch condition rather than the whole branch condition. Moreover, the true and false evaluations of a predicate are taken as different branches.
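To make these attribute types concrete, the sketch below shows one possible encoding of an execution's raw profile as a feature vector for each Attribute setting. The function, argument names, and profile layout are assumptions for illustration, not the actual Weka encoding used in the study.

```python
# Illustrative sketch (assumed representation, not the study's actual encoding):
# turn one execution's raw profile into a feature vector for each Attribute type.

def to_features(stmt_counts, method_counts, branch_outcomes, attribute):
    """stmt_counts / method_counts: execution counts per statement / per method.
    branch_outcomes: 0/1 flags, one per predicate outcome (the true and false
    outcomes of the same predicate are separate entries, as gcov reports them).
    """
    if attribute == "#statements":   # statement count
        return list(stmt_counts)
    if attribute == "?statement":    # statement coverage
        return [1 if c > 0 else 0 for c in stmt_counts]
    if attribute == "#method":       # method count
        return list(method_counts)
    if attribute == "?method":       # method coverage
        return [1 if c > 0 else 0 for c in method_counts]
    if attribute == "?branch":       # branch coverage
        return list(branch_outcomes)
    raise ValueError(f"unknown attribute type: {attribute}")
```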

The dependent variables in our empirical study are the results of execution-data classification measured by ROC analysis [4], since it has advantages over other measures such as precision-recall and accuracy. Specifically, our study uses the area under a ROC curve (usually abbreviated as AUC) to measure the performance of an execution-data classification approach. The AUC is always between 0 and 1; the bigger the AUC, the better the performance of the corresponding classification approach. Moreover, the AUC of a realistic classifier is always no less than 0.5, since the AUC of the random-guessing curve is 0.5.
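As a minimal sketch of how an AUC value can be computed from classified outcomes (using scikit-learn rather than the Weka tooling of the study, and with made-up labels and scores purely for illustration):

```python
# Minimal AUC sketch with made-up data; the study itself relies on ROC analysis [4] via Weka.
from sklearn.metrics import roc_auc_score

# 1 = failing execution, 0 = passing execution (assumed label convention).
actual_outcomes = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
# Classifier's estimated probability that each execution is failing.
failing_scores = [0.1, 0.3, 0.2, 0.8, 0.4, 0.9, 0.1, 0.2, 0.6, 0.3]

auc = roc_auc_score(actual_outcomes, failing_scores)
print(f"AUC = {auc:.2f}")  # 1.0 is perfect ranking; 0.5 corresponds to random guessing
```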

3.2. Process<br />

First, we ran each subject with its test cases, recording the five types of execution data as well as the outcome of each execution (i.e., whether it is passing or failing).

Second, we took all the test cases with their execution information (i.e., execution data and outcome) as instances and split all the instances into a training set and a testing set. Although all the test cases are labeled as either passing or failing, our experiment randomly selected some test cases into a training set and took the remaining test cases as a testing set. The training set is the set of test cases whose execution data and outcomes are taken as input to build a classifier, whereas the testing set is the set of test cases used to verify whether the outcome classified by the classifier is correct. Since all the test cases are known to be passing or failing in our empirical study before building the classifier, we can tell whether a classification is correct by comparing the classification result with the actual outcome.

Moreover, for each faulty program, our empirical study randomly selected n ∗ 20%, n ∗ 40%, n ∗ 50%, n ∗ 60%, n ∗ 80%, or n test cases from its whole test collection as a training set and took the remaining test cases as a testing set, where n is the number of executable statements of each subject shown in Table 1. To reduce the influence of random selection, we repeated each test-selection process 100 times.
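The repeated random split can be sketched as follows. This is an illustrative approximation under assumed data structures (test_cases as a list of labeled executions, n as the subject's executable-statement count), not the study's actual scripting around Weka.

```python
# Illustrative sketch of the repeated random training/testing split.
import random

RATIOS = [0.20, 0.40, 0.50, 0.60, 0.80, 1.00]
REPETITIONS = 100  # each selection is repeated 100 times to reduce random bias

def random_splits(test_cases, n):
    """Yield (training_set, testing_set) pairs for every ratio and every repetition."""
    for ratio in RATIOS:
        size = min(round(ratio * n), len(test_cases))
        for _ in range(REPETITIONS):
            shuffled = random.sample(test_cases, len(test_cases))
            yield shuffled[:size], shuffled[size:]
```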

Third, we applied each of the three machine-learning algorithms to each training set and recorded the classified outcomes of the test cases in the corresponding testing set. As the outcome of each test case is known, we know whether each classification is correct. We recorded the number of passing test cases that were correctly classified as passing, the number of passing test cases that were classified as failing, the number of failing test cases that were correctly classified as failing, and the number of failing test cases that were classified as passing. Then, for each faulty program and machine-learning algorithm, we calculated the corresponding AUC. As each subject has several faulty versions, we calculated their average AUC as the result of the corresponding subject. Our experiment was performed on an Intel E7300 Dual-Core processor at 2.66 GHz with 2 GB of memory.
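A hedged sketch of this bookkeeping step is given below. The classifier interface, data layout, and use of scikit-learn are assumptions for illustration; the study itself uses Weka's implementations of the three algorithms.

```python
# Illustrative sketch: record the four confusion counts for one testing set and
# average the AUC across a subject's faulty versions. Assumes outcome 1 = failing.
from sklearn.metrics import roc_auc_score

def evaluate_split(classifier, training_set, testing_set):
    """training_set / testing_set: lists of (feature_vector, outcome) pairs."""
    X_train, y_train = zip(*training_set)
    X_test, y_test = zip(*testing_set)
    classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)

    counts = {
        "pass_as_pass": sum(1 for p, a in zip(predicted, y_test) if a == 0 and p == 0),
        "pass_as_fail": sum(1 for p, a in zip(predicted, y_test) if a == 0 and p == 1),
        "fail_as_fail": sum(1 for p, a in zip(predicted, y_test) if a == 1 and p == 1),
        "fail_as_pass": sum(1 for p, a in zip(predicted, y_test) if a == 1 and p == 0),
    }
    # AUC from the classifier's estimated probability of "failing"; assumes both
    # classes occur in the training and testing sets.
    scores = classifier.predict_proba(X_test)[:, 1]
    return counts, roc_auc_score(y_test, scores)

def average_auc(auc_per_version):
    """Average the AUC over all faulty versions of one subject."""
    return sum(auc_per_version) / len(auc_per_version)
```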

3.3. Threats to Validity<br />

The construct threat of our empirical study lies in the measures, although ROC analysis has been widely used to evaluate classifiers in machine learning because it has advantages over accuracy, error rate, precision, and recall, and can decouple classifier performance from class skew and error costs [4]. The main external threat comes from the subjects and the seeded faults. Although these subjects, including the faults, have been widely used in the literature on software testing and analysis, we will perform more experiments on larger programs with real faults.

3.4. Results and Analysis<br />

In our empirical study, we exclude the faulty programs whose corresponding test collection has less than 5% failing test cases, because the percentage of correct classifications would be larger than 95% even if every test case were simply classified as “passing”. After excluding such biased data,
