1) Classifiers: In this study, software defect prediction models are built with three well-known classification algorithms: naïve Bayes (NB), multilayer perceptron (MLP), and logistic regression (LR). None of the three learners has a built-in feature selection capability, and all are commonly used in software engineering and other data mining applications. All classifiers were implemented in the WEKA tool with default parameter settings [6], except for MLP. Based on preliminary research, the parameters of MLP were set as follows. The 'hiddenLayers' parameter was set to 3 to define a network with one hidden layer containing three nodes. The 'validationSetSize' parameter was set to 10, so that the classifier sets aside 10% of the training data as a validation set used to determine when to stop the iterative training process.
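As an illustration only, the sketch below reproduces this classifier configuration with scikit-learn analogues rather than the WEKA implementations actually used in the study; the library, function names, and parameter mappings (e.g., validation_fraction=0.1 for WEKA's validationSetSize=10) are our assumptions, not part of the original experiments.

    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    def build_classifiers(random_state=0):
        """Return the three learners, approximating the WEKA settings described above."""
        return {
            # NB and LR are left at default settings, as in the WEKA experiments.
            "NB": GaussianNB(),
            "LR": LogisticRegression(max_iter=1000),
            # MLP: one hidden layer of three nodes; 10% of the training data is
            # held out as a validation set for early stopping, mirroring
            # hiddenLayers=3 and validationSetSize=10 in WEKA.
            "MLP": MLPClassifier(hidden_layer_sizes=(3,),
                                 early_stopping=True,
                                 validation_fraction=0.1,
                                 random_state=random_state),
        }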

2) Performance Metric: Because traditional performance measures such as classification accuracy can give misleading results on imbalanced data, we use a performance metric that considers the ability of a classifier to differentiate between the two classes: the area under the ROC (Receiver Operating Characteristic) curve (AUC). A perfect classifier yields an AUC of 1. It has been shown that AUC has lower variance and is more reliable than other performance metrics such as precision, recall, and the F-measure [15]. Note that the metric used to measure the performance of the classifiers is completely independent of the metric used within the TBFS algorithm. AUC is used both to select the most predictive subset of features in TBFS and to evaluate the classification models constructed using this set of features.
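For illustration (not part of the original study), AUC is computed from the predicted class-membership probabilities rather than from hard class labels, which is why it does not depend on a fixed classification threshold. The names clf, X_test, and y_test below are placeholders.

    from sklearn.metrics import roc_auc_score

    # y_test holds the true binary labels (1 = defect-prone, 0 = not defect-prone);
    # the AUC is computed from the predicted probability of the positive class.
    proba = clf.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, proba)   # 1.0 for a perfect classifier, 0.5 for random guessing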

3) Experimental Results: During the experiments, ten runs of five-fold cross-validation were performed. First, we ranked the attributes using the eighteen rankers separately. Once the attributes were ranked, the top six attributes were selected to yield the final training data. After feature selection, we applied each classifier to the training datasets with the selected features, and then used AUC to evaluate the performance of the classification model. In total, 18 rankers × 4 datasets × 10 runs × 5 folds = 3,600 combinations of feature ranking techniques were employed, and correspondingly 3,600 × 3 classifiers = 10,800 classification models were built.
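A minimal sketch of this experimental loop, under the same assumptions as the earlier snippets, is given below. It reuses the hypothetical build_classifiers() helper and substitutes a standard univariate scorer for the paper's eighteen TBFS rankers, which are not available in common libraries; ranking is done inside each fold so the test data never influences which six attributes are kept.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    def run_experiment(X, y, n_runs=10, n_folds=5, k=6):
        """Ten runs of five-fold cross-validation with top-k feature selection."""
        scores = {name: [] for name in build_classifiers()}
        for run in range(n_runs):
            folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=run)
            for train_idx, test_idx in folds.split(X, y):
                # Rank attributes on the training folds only and keep the top k (six).
                # mutual_info_classif stands in for one of the paper's rankers.
                selector = SelectKBest(mutual_info_classif, k=k).fit(X[train_idx], y[train_idx])
                X_train, X_test = selector.transform(X[train_idx]), selector.transform(X[test_idx])
                for name, clf in build_classifiers(random_state=run).items():
                    clf.fit(X_train, y[train_idx])
                    proba = clf.predict_proba(X_test)[:, 1]
                    scores[name].append(roc_auc_score(y[test_idx], proba))
        # Average AUC over the 10 x 5 = 50 fold outcomes per learner.
        return {name: np.mean(vals) for name, vals in scores.items()}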

All the results are reported in Table VI through Table VIII. Note that each value presented in the tables is the average over the ten runs of five-fold cross-validation. From these tables, we can observe that although a given ranker might perform best in combination with one learner, this may not hold when other learners are used to evaluate the models. For example, S2N performed best on average in terms of AUC when the NB and LR classifiers were used; however, this was not the case for the MLP classifier, for which PO performed best. The results also demonstrate that, although no particular ranker dominates the others, S2N, PO, and PRC are most often the best techniques, while PR, OR, and GR are rarely optimal. Our recent study [5] shows that the reduced feature subsets can have better or similar prediction performance compared to the complete set of attributes (the original data set).

We also conducted a two-way ANalysis Of VAriance (ANOVA) F test [16] to statistically examine the various effects on the performance of the classification models. The two-way ANOVA test in this study includes two factors: the first represents the 18 rankers, and the second represents the three learners. In this ANOVA test, the results from all four datasets were taken into account together. We also performed multiple comparison tests using Tukey's honestly significant difference (HSD) criterion [16]. All tests of statistical significance utilize a significance level α of 5%. Both ANOVA and multiple comparison tests were implemented in
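This excerpt does not show which tool the authors used for these tests; a minimal sketch of an equivalent analysis with statsmodels is shown below, assuming the per-fold AUC values have been collected into records of the form (auc, ranker, learner), where the column names and the omission of an interaction term are our assumptions.

    import pandas as pd
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # records: placeholder list of (auc, ranker, learner) tuples, one per
    # cross-validation outcome pooled over all four datasets.
    results = pd.DataFrame(records, columns=["auc", "ranker", "learner"])

    # Two-way ANOVA with ranker and learner as the two factors (main effects only).
    model = ols("auc ~ C(ranker) + C(learner)", data=results).fit()
    print(anova_lm(model, typ=2))

    # Tukey's HSD multiple comparisons among the rankers at alpha = 0.05.
    print(pairwise_tukeyhsd(results["auc"], results["ranker"], alpha=0.05))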

TABLE VI
CLASSIFICATION PERFORMANCE, NB

Ranker  SP1     SP2     SP3     SP4     Average
CS      0.7846  0.8108  0.8184  0.7696  0.7958
GR      0.7346  0.7613  0.7808  0.7519  0.7571
IG      0.7831  0.8081  0.8118  0.7794  0.7956
RF      0.7879  0.8053  0.8305  0.7731  0.7992
RFW     0.7882  0.8081  0.8190  0.7735  0.7972
SU      0.7865  0.7729  0.7882  0.7592  0.7767
FM      0.7822  0.8074  0.8176  0.7731  0.7951
OR      0.7405  0.8060  0.7181  0.7558  0.7551
PO      0.7891  0.8071  0.8141  0.8023  0.8031
PR      0.7345  0.7963  0.7179  0.7605  0.7523
GI      0.7341  0.7982  0.7678  0.6997  0.7500
MI      0.7739  0.8010  0.8119  0.7788  0.7914
KS      0.7722  0.7750  0.8125  0.7588  0.7796
DV      0.7820  0.8099  0.8163  0.7874  0.7989
GM      0.7716  0.7740  0.8165  0.7586  0.7802
AUC     0.7685  0.8072  0.7947  0.7683  0.7847
PRC     0.7885  0.8131  0.8120  0.7953  0.8022
S2N     0.7995  0.8142  0.8067  0.8129  0.8083

TABLE VII
CLASSIFICATION PERFORMANCE, MLP

Ranker  SP1     SP2     SP3     SP4     Average
CS      0.7943  0.8117  0.8126  0.7914  0.8025
GR      0.7475  0.7545  0.7688  0.7464  0.7543
IG      0.7926  0.8099  0.8209  0.8103  0.8084
RF      0.7948  0.8119  0.8191  0.7619  0.7969
RFW     0.7955  0.8139  0.8303  0.7598  0.7999
SU      0.7875  0.7847  0.7843  0.7504  0.7767
FM      0.7917  0.8127  0.8163  0.7924  0.8033
OR      0.7665  0.8058  0.7244  0.7550  0.7630
PO      0.7955  0.8133  0.8261  0.8010  0.8090
PR      0.7666  0.7970  0.7299  0.7576  0.7628
GI      0.7660  0.7963  0.7813  0.7495  0.7733
MI      0.7855  0.7945  0.8249  0.7865  0.7978
KS      0.7836  0.7749  0.8201  0.7607  0.7848
DV      0.7923  0.8095  0.8180  0.7959  0.8039
GM      0.7812  0.7779  0.8150  0.7701  0.7861
AUC     0.7754  0.8099  0.8260  0.7769  0.7970
PRC     0.7977  0.8151  0.8105  0.7908  0.8035
S2N     0.8047  0.8039  0.8108  0.7972  0.8042

TABLE VIII
CLASSIFICATION PERFORMANCE, LR

Ranker  SP1     SP2     SP3     SP4     Average
CS      0.8021  0.8229  0.8354  0.8153  0.8189
GR      0.7688  0.7935  0.7805  0.7816  0.7811
IG      0.8014  0.8176  0.8361  0.8216  0.8192
RF      0.8103  0.8221  0.8354  0.8118  0.8199
RFW     0.8091  0.8233  0.8387  0.8142  0.8213
SU      0.7993  0.7909  0.8040  0.7802  0.7936
FM      0.8023  0.8235  0.8298  0.8088  0.8161
OR      0.7816  0.8158  0.7422  0.7768  0.7791
PO      0.8041  0.8255  0.8334  0.8202  0.8208
PR      0.7784  0.8053  0.7467  0.7808  0.7778
GI      0.7787  0.8062  0.7918  0.7780  0.7887
MI      0.7934  0.7983  0.8354  0.7942  0.8053
KS      0.7902  0.7788  0.8338  0.7705  0.7933
DV      0.8020  0.8186  0.8323  0.8164  0.8173
GM      0.7886  0.7755  0.8327  0.7799  0.7942
AUC     0.7845  0.8154  0.8318  0.7882  0.8049
PRC     0.8041  0.8265  0.8282  0.8123  0.8178
S2N     0.8176  0.8279  0.8336  0.8231  0.8256
