
An Empirical Study of Software Metric Selection Techniques for Defect Prediction

Huanjing Wang, Taghi M. Khoshgoftaar, Randall Wald, and Amri Napolitano
{huanjing.wang@wku.edu, khoshgof@fau.edu, rwald1@fau.edu, amrifau@gmail.com}

Abstract—In software engineering, a common classification problem is determining the quality of a software component, module, or release. To aid in this task, software metrics are collected at various stages of the software development cycle, and these metrics can be used to build a defect prediction model. However, not all metrics are relevant to defect prediction. One solution to finding the relevant metrics is the data preprocessing step known as feature selection. We present an empirical study in which we evaluate the similarity of eighteen different feature selection techniques and how the feature subsets chosen by each of these techniques perform in defect prediction. We look at similarity in addition to classification because many applications seek a diverse set of rankers, and similarity can be used to find which rankers are too close together to provide diversity. The classification models are trained using three commonly-used classifiers. The case study is based on software metrics and defect data collected from multiple releases of a large real-world software system. The results show that the feature rankers fall into a number of identifiable clusters in terms of similarity. In addition, the similarity clusters were somewhat predictive of the clusters based on classification ranking: rankers within a similarity cluster had similar classification performance, and thus ended up in the same or adjacent classification clusters. The reverse was not true, with some classification clusters containing multiple unrelated similarity clusters. Overall, we found that the signal-to-noise and ReliefF-W rankers selected good features while being dissimilar from one another, suggesting they are appropriate for choosing diverse but high-performance rankers.

I. INTRODUCTION

In the practice of software quality assurance, software metrics are often collected and associated with modules, which have their number of pre- and post-release defects recorded. A software defect prediction model is often built to ensure the quality of future software products or releases. However, not all software metrics are relevant for predicting the fault proneness of software modules. Software metrics selection (or feature selection) prior to training a defect prediction model can help separate relevant software metrics from irrelevant or redundant ones.

In this paper, we focus on feature selection of software metrics for defect prediction. During the past decade, numerous studies have examined feature selection with respect to classification performance, but very few have focused on the similarity of feature selection techniques. The purpose of studying this similarity is to make it easier to select a set of diverse rankers, ensuring that none of those chosen is so similar to the others as to provide no additional diversity in the collection of rankers. In this study, we perform similarity analysis on eighteen different feature selection techniques, eleven of which were recently developed and implemented by our research group. We evaluate the similarity of two filters on a dataset by measuring the consistency between the two feature subsets they choose. We also evaluate the effectiveness of defect predictors that estimate the quality of program modules, e.g., fault-prone (fp) or not-fault-prone (nfp). Three different classifiers (learners) are used to build our prediction models. The empirical validation of the similarity measure and of model performance was carried out through a case study of four consecutive releases of a very large telecommunications software system (denoted LLTS). To our knowledge, this is the first study to examine both the similarity and the classification performance of feature rankers in the software engineering domain.
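
To make the consistency idea concrete, the sketch below scores the agreement between two equal-size feature subsets using Kuncheva's consistency index, one standard choice for this kind of comparison. The exact measure used in the study is not stated in this excerpt, so the function and the example subsets are illustrative assumptions only.

```python
# A minimal sketch, assuming Kuncheva's consistency index as the agreement
# measure; the excerpt does not state the exact measure the authors used.

def consistency_index(subset_a, subset_b, n_features):
    """Kuncheva's consistency index for two equal-size feature subsets.

    Returns a value in (-1, 1]; 1 means identical subsets, and values
    near 0 mean no more overlap than expected by chance.
    """
    a, b = set(subset_a), set(subset_b)
    if len(a) != len(b):
        raise ValueError("subsets must have the same size")
    k = len(a)
    if k == 0 or k == n_features:
        raise ValueError("index is undefined for empty or full subsets")
    r = len(a & b)  # number of features the two rankers agree on
    return (r * n_features - k * k) / (k * (n_features - k))

# Hypothetical example: two rankers each keep 4 of 42 metrics, sharing 2.
print(consistency_index({1, 5, 9, 13}, {1, 5, 20, 30}, n_features=42))
# -> ~0.447, moderate agreement beyond what chance overlap would give
```

The correction term in the numerator is what distinguishes such an index from raw overlap: two rankers that each keep many features will share some purely by chance, and the index discounts that.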

The experimental results show that clustering the rankers by similarity and by classification performance gave different types of clusters. Similarity produced a number of smaller clusters, with four or five rankers being the largest cluster sizes found (one and two also being common sizes). Classification gave a smaller number of clusters, with the largest cluster (also the best-performing one) having nine members. In both cases, the clusters often contained feature rankers which would not immediately seem to have much in common; although this makes sense for classification performance, it is an intriguing result for similarity, indicating that the rankers may be more related than one might expect. We also found that the similarity groups were generally predictive of the classification groupings: members of a single similarity group are in the same or adjacent classification groups. Finally, we noted that signal-to-noise and ReliefF-W performed very well in terms of classification while choosing two or fewer features in common.

The rest of the paper is organized as follows. Section II reviews relevant literature on feature selection techniques. Section III provides detailed information about the eighteen feature selection techniques. Section IV describes the datasets used in the study and presents the similarity results and the model performance results, along with their analysis. Finally, Section V summarizes our conclusions and suggests directions for future work.

II. RELATED WORK

The main goal of feature selection is to select a subset of features that minimizes the prediction errors of classifiers. Feature selection can be broadly classified into feature ranking and feature subset selection. Feature ranking sorts the attributes according to their individual predictive power, while feature subset selection finds subsets of attributes that collectively have good predictive power. Feature selection techniques can also be categorized as filters and wrappers. Filters are algorithms in which a feature subset is selected without involving any learning algorithm. Wrappers are algorithms that use feedback from a learning algorithm to determine which feature(s) to include in building a classification model.
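
As a concrete illustration of the filter/wrapper distinction, the sketch below selects features both ways with scikit-learn: the filter ranks features by a statistic computed without consulting any learner, while the wrapper uses a learner's cross-validated accuracy as feedback at each step. The library, synthetic dataset, learner, and subset size are our own illustrative choices, not the configuration used in the paper.

```python
# A minimal sketch contrasting a filter with a wrapper in scikit-learn;
# dataset, learner, and subset size are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       f_classif)
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Filter: score each feature with a statistic (here the ANOVA F-score)
# computed without involving any learner, then keep the top k.
filter_sel = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("filter keeps:", sorted(filter_sel.get_support(indices=True)))

# Wrapper: grow the subset greedily, at each step adding the feature that
# most improves the cross-validated accuracy of an actual learner.
wrapper_sel = SequentialFeatureSelector(
    GaussianNB(), n_features_to_select=5, direction="forward", cv=5
).fit(X, y)
print("wrapper keeps:", sorted(wrapper_sel.get_support(indices=True)))
```

Because the wrapper re-trains the learner once per candidate feature at every step, it is far more expensive than the filter; the techniques compared in this study are filter-based rankers, which avoid that repeated model training.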

A number of papers have studied the use of feature selection techniques as a data preprocessing step. Guyon and Elisseeff [1] outline key approaches used for attribute selection, including feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods. A study by Liu and Yu [2] provides a comprehensive survey of feature selection algorithms and presents an integrated approach to intelligent feature selection. Jeffery et al. [3] compare the similarity between gene lists produced by 10 different feature selection methods, concluding that sample size clearly affects the ranked gene lists produced by different feature selection methods.
