SEKE 2012 Proceedings - Knowledge Systems Institute


Figure 1. Boxplots of average AUC results for various machine-learning algorithms

Table 2. Paired Samples T-Test on Machine-Learning Algorithms (α = 0.05)

Single-Fault Programs      t-value   Sig.    Result
SMO - RF                   -18.767   0.000   Reject
SMO - NB                   -14.576   0.000   Reject
RF - NB                      3.970   0.000   Reject

Multiple-Fault Programs    t-value   Sig.    Result
SMO - RF                   -30.602   0.000   Reject
SMO - NB                   -16.362   0.000   Reject
RF - NB                      6.950   0.000   Reject

Our empirical study has results for only six single-fault programs (excluding print tokens and schedule2) and seven multiple-fault programs (excluding Space). The study does not obtain execution-data classification results for Space based on its statement coverage or statement count because the maximal memory of the Java Virtual Machine is not large enough to classify executions that have a large number of attributes (i.e., the statement coverage or the statement count).

(1) RQ1: Machine-Learning Algorithms

Figure 1 shows the distribution of the average AUC of the execution-data classification approach with the same machine-learning algorithm for each single-fault subject. The horizontal axis represents the three machine-learning algorithms, whereas the vertical axis represents the average AUC. According to Figure 1, the classification approach with SMO is usually much less effective than the approach with RF. The classification approach with RF produces AUC results close to those of the approach with NB. Moreover, sometimes (e.g., for replace) the execution-data classification approach with RF is better than the approach with NB, whereas sometimes (e.g., for schedule) the latter is better than the former.
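As a concrete illustration of the AUC metric compared above, the area under the ROC curve measures how well a classifier ranks failing executions above passing ones. The following sketch uses scikit-learn's `roc_auc_score` on fabricated outcome labels and classifier scores; none of these numbers come from the study.

```python
# Illustrative only: compute AUC for a hypothetical execution-data
# classifier. Labels: 1 = failing execution, 0 = passing execution.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1, 0, 1, 1]                     # known outcomes
y_score = [0.1, 0.3, 0.35, 0.8, 0.6, 0.2, 0.9, 0.7]   # classifier scores

# Every failing execution is scored above every passing one here,
# so the ranking is perfect and the AUC is 1.0.
auc = roc_auc_score(y_true, y_score)
print(auc)  # -> 1.0
```

An AUC of 0.5 corresponds to random ranking; values closer to 1.0 indicate a classifier that reliably separates failing from passing executions.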

To further compare the three machine-learning algorithms, we performed a paired samples t-test on the average AUC results, comparing each pair of results of the execution-data classification approaches with the same subject, the same percentage of training instances, and the same type of execution data, but different machine-learning algorithms. The results are shown in Table 2. Here the t-test is performed separately on the results of single-fault programs and multiple-fault programs because the machine-learning-based execution-data classification approach is intuitively more effective at classifying an execution for single-fault programs than for multiple-fault programs, considering the impact of multiple faults.

According to this table, the hypothesis that neither RF nor NB is superior to the other in execution-data classification is rejected (for both single-fault and multiple-fault programs). Moreover, execution-data classification using RF outperforms the classification approach using NB, since the calculated t-value is positive. That is, for single-fault and multiple-fault programs, execution-data classification with RF is significantly better than the approach with NB. Similarly, execution-data classification with either RF or NB is significantly better than the approach with SMO.
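A paired samples t-test of the kind reported in Table 2 can be sketched with SciPy's `ttest_rel`. The AUC vectors below are fabricated for illustration, but the sign convention matches the table: for a pair A − B, a positive t-value means A's mean AUC is higher.

```python
# Sketch of a paired samples t-test over paired average AUC results
# (fabricated numbers; the real study pairs results sharing the same
# subject, training-set percentage, and execution-data type).
from scipy.stats import ttest_rel

auc_rf = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]  # hypothetical RF AUCs
auc_nb = [0.89, 0.86, 0.90, 0.88, 0.85, 0.91]  # paired NB AUCs

t, p = ttest_rel(auc_rf, auc_nb)
# t > 0 in the pair RF - NB means RF's mean AUC is higher;
# the null hypothesis is rejected when p < 0.05.
print(t > 0, p < 0.05)
```

Pairing matters here: each difference is taken within one configuration, so variation across subjects and settings does not inflate the variance the test sees.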

(2) RQ2: Training Set

According to the average AUC, as the number of training instances increases (i.e., variable TrainingSet increases from 20% to 100%), the average AUC of each subject usually increases, with only a few exceptions.

However, it is not practical to construct a classifier based on a large number of training instances (i.e., execution data with known outcomes) in execution-data classification. Thus, we are more interested in the experimental results of execution-data classification using a small number of training instances. According to the experimental results, as we increase the number of training instances, the average AUC results increase only slightly. That is, the execution-data classification approach fed with 20% of the training instances has AUC results close to those of the approach with 100% of the training instances, which produces reliable classification results. Consequently, for any subject whose number of executable statements is n, execution-data classification has been evaluated to be reliable even if the number of training instances is n ∗ 20%.
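The comparison of a classifier trained on 20% versus 100% of the available training instances can be sketched as follows. The dataset, classifier configuration, and split sizes are all hypothetical stand-ins (synthetic data from `make_classification`), not the study's actual subjects or coverage vectors.

```python
# Sketch: train a random-forest classifier on 20% vs 100% of the
# available training instances and compare held-out AUC.
# Synthetic data only; the real subjects use coverage/count attributes.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=40, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

aucs = {}
for frac in (0.2, 1.0):
    n = int(len(X_train) * frac)
    clf = RandomForestClassifier(random_state=0).fit(X_train[:n], y_train[:n])
    aucs[frac] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# On a reasonably separable dataset, the model trained on 20% of the
# instances typically already approaches the AUC of the 100% model.
print(aucs)
```

This mirrors the observation above: once the training set captures the distinction between passing and failing executions, adding more labeled instances yields only marginal AUC gains.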

(3) RQ3: Type of Execution Data

To show the influence of the type of execution data on the AUC results of execution-data classification, we draw boxplots for single-fault programs by statistically analyzing the average AUC results of the same type of execution data. For each program, the five boxplots for the various types of execution data show observable differences. For instance, the AUC results of execution-data classification with branch coverage usually fall within a smaller range than those of the approach with any of the other types of execution data. That is, execution-data classification with branch coverage is more stable than the approach with the other types of execution data, although some other types of execution data
