
Stability of Filter-Based Feature Selection Methods for Imbalanced Software Measurement Data

Kehan Gao
Eastern Connecticut State University
Willimantic, Connecticut 06226
gaok@easternct.edu

Taghi M. Khoshgoftaar
Florida Atlantic University
Boca Raton, Florida 33431
khoshgof@fau.edu

Amri Napolitano
Florida Atlantic University
Boca Raton, Florida 33431
amrifau@gmail.com

Abstract—Feature selection (FS) is necessary for software quality modeling, especially when a large number of software metrics are available in data repositories. Selecting a subset of features (software metrics) that best describe the class attribute (module's quality) can bring many benefits, such as reducing the training time of learners, improving the comprehensibility of the resulting classifier models, and facilitating software metrics collection, organization, and management. Another challenge of software measurement data is the presence of skewed or imbalanced distributions between the two types of modules (e.g., many more not-fault-prone modules than fault-prone modules found in those datasets). In this paper, we use data sampling to deal with this problem. Previous research usually evaluates FS techniques by comparing the performance of classifiers before and after the training data is modified. This study assesses FS techniques from a different perspective: stability. Stability is important because FS techniques that reliably produce the same features are more trustworthy. We consider six filter-based feature selection methods and six data sampling approaches. We also vary the number of features selected in the feature subsets. We want to examine the effect of data sampling approaches on the stability of FS when using the sampled data. The experiments were performed on nine datasets from a real-world software project. The results demonstrate that different FS techniques may have quite different stability behaviors. In addition, other factors, such as the sampling technique used and the number of attributes retained in the feature subset, may also greatly influence the stability results.

Index Terms—software defect prediction, software metrics, feature selection, data sampling, stability

I. INTRODUCTION

Software defect prediction is a process of building a classifier by using software metrics and fault data collected during a previous software project and then applying this classifier to predict the quality of new program modules (e.g., classify the program modules as either fault-prone (fp) or not-fault-prone (nfp)) [1]. The benefit of such prediction is that project resources can be strategically allocated to the program modules according to the prediction. For instance, intensive inspection and testing can first be applied to the potentially problematic modules, thereby improving the quality of the product.
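
As a minimal illustration of this workflow, consider the sketch below; the synthetic metric values, the module counts, and the decision-tree learner are assumptions chosen purely for illustration, not the setup used in this study.

```python
# Sketch of the defect-prediction workflow: train on a previous project's
# metrics and fault labels, then classify new modules as fp or nfp.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_prev = rng.random((200, 10))                     # placeholder software metrics
y_prev = (rng.random(200) < 0.1).astype(int)       # imbalanced labels: 1 = fp, 0 = nfp

model = DecisionTreeClassifier(random_state=0)
model.fit(X_prev, y_prev)                          # learn from historical data

X_new = rng.random((20, 10))                       # metrics for new program modules
predicted_quality = model.predict(X_new)           # 1 = fault-prone, 0 = not-fault-prone
```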

Two problems that often come with software measurement data are high-dimensionality and class imbalance. High-dimensionality refers to the situation where the number of available software metrics is too large to easily work with. Several problems may arise due to high-dimensionality, including longer learning time of a classification algorithm and a decline in prediction performance of a classification model. Class imbalance occurs when instances of one class in a dataset appear more frequently than instances of the other class. This phenomenon is especially prevalent in high-assurance and mission-critical software systems, where nfp modules dominate between the two types (fp and nfp) of modules in a given dataset. The primary weakness of such imbalanced data is that a traditional classification algorithm tends to classify fp modules as nfp, resulting in more customer-discovered faults that have serious consequences and high repair costs.

Feature selection and data sampling are often employed to deal with these problems. Feature selection (FS) is a process of choosing a subset of input variables by eliminating features with little or no predictive information. Although FS techniques have been studied in a variety of domains [2], [3] for many years, research on improving software defect prediction through metric (feature) selection has started only recently [4], [5]. Data sampling is a common technique to alter the relative proportion of the different types of modules, thereby achieving a more balanced dataset. Note that in this study, the training dataset is sampled to change the relative proportion of the nfp and fp modules before FS is performed.
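
The sketch below illustrates this sample-then-rank ordering; the mean-difference filter score, the 50:50 target ratio, and the undersampling routine are illustrative assumptions rather than the specific techniques evaluated in this paper.

```python
# Sketch of "sample first, then select features" for a binary fp/nfp dataset.
import numpy as np

def random_undersample(X, y, rng):
    """Randomly discard majority-class (nfp) modules until the classes are balanced."""
    fp_idx = np.where(y == 1)[0]
    nfp_idx = np.where(y == 0)[0]
    keep_nfp = rng.choice(nfp_idx, size=len(fp_idx), replace=False)
    keep = np.concatenate([fp_idx, keep_nfp])
    return X[keep], y[keep]

def rank_features(X, y):
    """Filter-based ranking: score each metric independently of any learner.
    Here the score is the absolute difference of class means (stand-in filter)."""
    scores = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(scores)[::-1]                # best metric first

rng = np.random.default_rng(0)
X = rng.random((500, 20))
y = (rng.random(500) < 0.1).astype(int)

X_bal, y_bal = random_undersample(X, y, rng)
ranking_sampled = rank_features(X_bal, y_bal)      # ranking after sampling
ranking_original = rank_features(X, y)             # ranking on the original data
```

Sampling only changes which modules the filter sees; the same ranking routine is then applied to both the original and the sampled training data.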

To evaluate a FS technique, most previous research focuses on comparing the performance of classification models before and after a specific FS technique is performed. In this paper, we use a different way to assess FS techniques: stability. Stability of a FS technique usually refers to the sensitivity of the technique to variations in the training set. Practitioners may prefer a FS algorithm that can produce consistent results despite such variations. For example, if a FS technique produces the same or similar results when using the entire training dataset or only half of it, a practitioner may save computation time by applying this FS technique to the smaller training dataset while still obtaining equally reliable results.
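
The half-versus-full example can be made concrete with a small sketch; the top-5 subset size and the overlap measure used below are assumptions made only to illustrate the notion of stability, not the stability metric of this study.

```python
# Sketch of a half-versus-full stability check for a filter-based selector.
import numpy as np

def top_k_features(X, y, k):
    """Select the k metrics whose class means differ most (stand-in filter)."""
    scores = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return set(np.argsort(scores)[::-1][:k])

rng = np.random.default_rng(1)
X = rng.random((400, 15))
y = (rng.random(400) < 0.15).astype(int)

half = rng.choice(len(y), size=len(y) // 2, replace=False)
full_subset = top_k_features(X, y, k=5)
half_subset = top_k_features(X[half], y[half], k=5)

# An overlap close to 1 suggests the filter is stable under this perturbation.
overlap = len(full_subset & half_subset) / 5
print(f"top-5 overlap between full and half training data: {overlap:.2f}")
```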

In this study, we are more interested in investigating the stability of FS techniques with respect to various data sampling approaches. The strategy we adopted is that the ranking of features from each sampled dataset is compared to the ranking from the original dataset from which it came. Those FS techniques that are able to produce consistent outputs with respect to the different perturbations (due to sampling) in the input data are considered stable (robust); in other words, they are insensitive to that particular data sampling technique. Since the purpose of data sampling here is to alter class proportions rather than to change the size of the training dataset, consistent feature rankings imply that the data sampling technique has little or no effect on the FS technique. To our knowledge, limited research has been done on studying the impact of data sampling on the stability of feature selection.
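
A sketch of this comparison strategy follows; random oversampling stands in for the sampling step, and Kendall's tau is an assumed ranking-similarity measure, not necessarily the stability metric used in this study.

```python
# Sketch: compare the feature ranking from a sampled dataset against the
# ranking from the original dataset it was derived from.
import numpy as np
from scipy.stats import kendalltau

def feature_scores(X, y):
    """Score each metric by the absolute difference of class means (stand-in filter)."""
    return np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

rng = np.random.default_rng(2)
X = rng.random((600, 12))
y = (rng.random(600) < 0.1).astype(int)

# Random oversampling: duplicate fp modules until the classes are balanced.
fp_idx = np.where(y == 1)[0]
extra = rng.choice(fp_idx, size=(y == 0).sum() - len(fp_idx), replace=True)
X_ros = np.vstack([X, X[extra]])
y_ros = np.concatenate([y, y[extra]])

# High rank correlation means the sampling barely perturbs the feature ranking.
tau, _ = kendalltau(feature_scores(X, y), feature_scores(X_ros, y_ros))
print(f"rank correlation between original and oversampled data: {tau:.2f}")
```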

The case study of this paper is performed on nine datasets from a real-world software project. We examine six filter-based FS methods, including five threshold-based techniques (mutual information (MI), Kolmogorov-Smirnov statistic (KS), geometric mean (GM), area under the ROC curve (AUC), and area under the precision-recall curve (PRC)) and the signal-to-noise ratio (S2N) approach. We employ three data sampling techniques (random undersampling (RUS), random oversampling (ROS), and synthetic minority oversampling (SMO)), each combined with two post-sampling class ratios. In addition, we vary from 2 to 10 the number of features retained in the feature subsets.
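
Of the rankers listed, the signal-to-noise ratio has a particularly compact form; the sketch below assumes the common two-class definition, S2N = (mean of fp - mean of nfp) / (std of fp + std of nfp) per metric, which may differ in detail from the variant evaluated in this paper.

```python
# Sketch of a signal-to-noise (S2N) feature ranker for binary fp/nfp data.
import numpy as np

def s2n_ranking(X, y):
    fp, nfp = X[y == 1], X[y == 0]
    s2n = (fp.mean(axis=0) - nfp.mean(axis=0)) / (fp.std(axis=0) + nfp.std(axis=0) + 1e-12)
    return np.argsort(np.abs(s2n))[::-1]           # most discriminative metric first

rng = np.random.default_rng(3)
X = rng.random((300, 10))
y = (rng.random(300) < 0.2).astype(int)
print("S2N ranking (feature indices):", s2n_ranking(X, y)[:5])
```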

