
subsets. The empirical results demonstrate that S2N, AUC, and PRC show higher stability than the other filters. Moreover, the filters are more stable with the ROS sampling technique than with the other sampling methods. Finally, the stability of the FS techniques increases as the number of attributes retained in the feature subset increases.

The remainder of the paper is organized as follows: Section II discusses related work. Section III outlines the methods and techniques used in this paper. Section IV describes the nine datasets used in the case study. Section V presents the case study, including design, results, and analysis. Finally, we summarize our conclusions and provide suggestions for future work in Section VI.

II. RELATED WORK

Feature selection (FS), also known as attribute selection or variable selection, is the process of selecting a subset of the features that are useful in building a classifier. FS techniques can be divided into wrapper and filter categories [3]. Wrappers use a search algorithm to search through the space of possible features, evaluate each subset with a learning algorithm, and determine which features are finally selected for building a classifier. Filters use a simpler statistical measure, rather than a learning algorithm, to evaluate each subset or individual feature. Feature selection may also be categorized as ranking or subset selection [3]. Feature ranking scores the attributes based on their individual predictive power, while subset selection selects a subset of attributes that collectively have good predictive capability. In this study, the FS techniques used belong to the filter-based feature ranking category.

Class imbalance, which appears in various domains, is another significant problem in data mining. One effective method for alleviating the adverse effect of a skewed class distribution is sampling [6], [7]. While considerable work has been done on feature selection and data sampling separately, research investigating both together has begun only recently. Chen et al. [8] studied data row pruning (data sampling) and data column pruning (feature selection) in the context of software cost/effort estimation. However, the data sampling in their study was not aimed specifically at the class imbalance problem, and the classification models were not for binary problems.

To evaluate FS techniques, most existing research compares the classification behavior of models built with the selected features to that of models built with the complete set of features. Instead of using classification performance, the present work assesses FS techniques using stability. The stability of a FS algorithm is normally defined as the degree of consensus among the outputs of that FS method when it is applied to randomly selected subsets of the same input data. Lustgarten et al. [9] presented an adjusted stability measure that computes the robustness of a FS method with respect to random FS. Saeys et al. [10] assessed the robustness of FS techniques using the Spearman rank correlation coefficient and the Jaccard index. Abeel et al. [11] presented a general framework for stability analysis of FS techniques and showed that stability can be improved through ensemble FS. Alelyani et al. [12] jointly considered both sample-set similarity and feature-list similarity in stability assessment for FS algorithms.
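
To make the notion of stability concrete, the following is a minimal sketch of a subset-overlap stability measure in the spirit of the Jaccard-index approach of Saeys et al. [10]; it is not the exact measure from any of the cited works, and the ranking function rank_features, the sampling fraction, and the list length k are illustrative assumptions.

    import random

    def jaccard(a, b):
        # Jaccard similarity between two feature subsets.
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def stability(rank_features, instances, n_runs=10, frac=0.9, k=20):
        # Average pairwise Jaccard similarity of the top-k feature lists
        # that rank_features produces on random subsets of the data.
        top_k_lists = []
        for _ in range(n_runs):
            sample = random.sample(instances, int(frac * len(instances)))
            top_k_lists.append(rank_features(sample)[:k])
        pairs = [(i, j) for i in range(n_runs) for j in range(i + 1, n_runs)]
        return sum(jaccard(top_k_lists[i], top_k_lists[j])
                   for i, j in pairs) / len(pairs)

A score of 1 would mean the method selects identical top-k lists on every random subset; values near 0 indicate little consensus.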

III. METHODOLOGY

A. Filter-based feature ranking techniques

The procedure of filter-based feature ranking is to score each feature (attribute) according to a particular method (metric), allowing the selection of the best set of features. In this study, we use five threshold-based feature selection techniques and the signal-to-noise ratio method.

1) Threshold-based feature selection (TBFS) methods: The TBFS techniques were proposed by our research team and implemented within WEKA [2]. The procedure is shown in Algorithm 1. Each independent attribute is paired individually with the class attribute, and the resulting two-attribute dataset is evaluated using different performance metrics. More specifically, the TBFS procedure consists of two steps: (1) normalizing the attribute values so that they fall between 0 and 1; and (2) treating those values as the posterior probabilities from which to calculate classifier performance metrics.

Analogous to the procedure for calculating rates in a classification setting with a posterior probability, the true positive (TPR), true negative (TNR), false positive (FPR), and false negative (FNR) rates can be calculated at each threshold t ∈ [0, 1] relative to the normalized attribute X̂_j. Precision PRE(t) is defined as the fraction of the predicted-positive examples which are actually positive.

The feature rankers utilize five metrics: Mutual Information (MI), Kolmogorov-Smirnov Statistic (KS), Geometric Mean (GM), Area Under the ROC Curve (AUC), and Area Under the Precision-Recall Curve (PRC). Each value is computed in both directions: first treating instances above the threshold t as positive and those below as negative, then treating instances above the threshold as negative and those below as positive. The better result is used. The five metrics are calculated for each attribute individually, and attributes with higher values of MI, KS, GM, AUC, and PRC are determined to better predict the class attribute. In this manner, the attributes can be ranked from most to least predictive according to each of the five metrics. For more detailed information about these five metrics, please refer to the work of Dittman et al. [2].
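
As a concrete illustration, the sketch below applies the two TBFS steps with AUC as the metric; the other four metrics would replace the scoring call in step 2. The function names are ours for illustration only and are not taken from WEKA or from [2].

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def tbfs_auc(x, y):
        # x: 1-D array of raw values for one attribute
        # y: binary class labels (1 = positive, 0 = negative)
        # Step 1: normalize the attribute values to [0, 1].
        x_hat = (x - x.min()) / (x.max() - x.min())
        # Step 2: treat the normalized values as posterior probabilities
        # and score in both directions, keeping the better result.
        return max(roc_auc_score(y, x_hat), roc_auc_score(y, 1.0 - x_hat))

    def rank_attributes(X, y):
        # Rank attributes (columns of X) from most to least predictive.
        scores = [tbfs_auc(X[:, j], y) for j in range(X.shape[1])]
        return np.argsort(scores)[::-1]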

2) Signal-to-Noise Ratio (S2N) Technique: S2N represents how well a feature separates two classes. The equation for signal-to-noise is:

S2N = (μ_P − μ_N) / (σ_P + σ_N)    (1)

where μ_P and μ_N are the mean values of that particular attribute over all of the instances belonging to the positive class P and the negative class N, respectively, and σ_P and σ_N are the corresponding standard deviations of that attribute for each class. The larger the S2N ratio, the more relevant a feature is to the dataset [13].
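
A minimal sketch of Eq. (1) follows, assuming the attribute values and binary labels arrive as NumPy arrays; the function name is ours.

    import numpy as np

    def s2n(x, y):
        # x: 1-D array of values for one attribute
        # y: binary class labels (1 = positive, 0 = negative)
        pos, neg = x[y == 1], x[y == 0]
        # Eq. (1): (mu_P - mu_N) / (sigma_P + sigma_N)
        return (pos.mean() - neg.mean()) / (pos.std() + neg.std())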

B. Data Sampling Techniques

We present three sampling techniques here, which represent the major paradigms in data sampling: random undersampling, random oversampling, and intelligent oversampling.

1) Random Sampling Techniques: The two most common data sampling techniques are random oversampling (ROS) and random undersampling (RUS). Random oversampling duplicates randomly selected instances of the minority class. Random undersampling randomly discards instances from the majority class.
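
The following minimal sketch shows both techniques over plain Python lists of instances; the target sizes are illustrative parameters, not values from the paper.

    import random

    def random_oversample(minority, target_size):
        # ROS: duplicate randomly chosen minority instances until the
        # minority class grows to target_size.
        extra = [random.choice(minority)
                 for _ in range(target_size - len(minority))]
        return minority + extra

    def random_undersample(majority, target_size):
        # RUS: randomly keep only target_size majority instances.
        return random.sample(majority, target_size)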

2) Synthetic Minority Oversampling Technique: Chawla et al. proposed an intelligent oversampling method called the Synthetic Minority Oversampling Technique (SMOTE) [6]. SMOTE (denoted SMO in this work) adds new, artificial minority examples by interpolating between preexisting minority instances rather than simply duplicating original examples. The newly created instances cause the minority regions of the feature space to become fuller and more general.
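
A simplified SMOTE-style sketch appears below; it uses a brute-force nearest-neighbor search, treats minority instances as numeric vectors, and its parameter names (n_new, k) are ours rather than from [6].

    import random
    import numpy as np

    def smote(minority, n_new, k=5):
        # Create n_new synthetic minority instances, each interpolated
        # between a random minority instance and one of its k nearest
        # minority-class neighbors.
        minority = np.asarray(minority, dtype=float)
        synthetic = []
        for _ in range(n_new):
            i = random.randrange(len(minority))
            x = minority[i]
            # Brute-force k nearest neighbors of x (index 0 is x itself).
            dists = np.linalg.norm(minority - x, axis=1)
            neighbor = minority[random.choice(np.argsort(dists)[1:k + 1])]
            # The synthetic point lies on the segment between x and the
            # chosen neighbor.
            gap = random.random()
            synthetic.append(x + gap * (neighbor - x))
        return np.array(synthetic)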

