

Knowledge Analysis) is used. Weka is an open-source, Java-based data mining tool that provides preprocessing, classification, clustering, and many other machine learning methods through a graphical user interface. First, the dataset is loaded into the Weka Explorer and the data is discretized. Then all of the attribute selection algorithms are run.
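The same workflow can also be scripted against Weka's Java API instead of the Explorer GUI. The sketch below is a minimal illustration of the discretize-then-select pipeline; the file name data.arff, the last-attribute class position, and the choice of information gain as the evaluator are illustrative assumptions rather than details taken from the paper.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class FeatureSelectionSketch {
    public static void main(String[] args) throws Exception {
        // Load the dataset (hypothetical file name).
        Instances data = new DataSource("data.arff").getDataSet();

        // Discretize numeric attributes, mirroring the preprocessing
        // step performed in the Explorer.
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, discretize);
        discretized.setClassIndex(discretized.numAttributes() - 1);

        // Rank attributes by information gain; ReliefFAttributeEval or
        // CfsSubsetEval (with a subset search) can be swapped in to
        // reproduce the other selectors compared in this study.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(new Ranker());
        selector.SelectAttributes(discretized);

        // selectedAttributes() returns the chosen indices (class index last).
        for (int index : selector.selectedAttributes()) {
            System.out.println(discretized.attribute(index).name());
        }
    }
}
```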

performances in a statistical manner. Sensitivity i.e. true positive rate; measures the proportion of actual<br />

positives which are correctly identified. While specificity (1 – false positive probability) measures the<br />

proportion of negatives which are correctly identified. The algorithms with higher sensitivity and higher<br />

specificity are the ones that performs significantly better than the others.<br />
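In confusion-matrix terms the two measures are simple ratios; the following small sketch (with illustrative counts, not figures from the experiments) makes the definitions concrete.

```java
public class DiagnosticRates {
    // Sensitivity (true positive rate): proportion of actual positives
    // that are correctly identified.
    static double sensitivity(int truePositives, int falseNegatives) {
        return truePositives / (double) (truePositives + falseNegatives);
    }

    // Specificity (1 - false positive rate): proportion of actual
    // negatives that are correctly identified.
    static double specificity(int trueNegatives, int falsePositives) {
        return trueNegatives / (double) (trueNegatives + falsePositives);
    }

    public static void main(String[] args) {
        // Illustrative counts only.
        System.out.println("sensitivity = " + sensitivity(90, 10)); // 0.9
        System.out.println("specificity = " + specificity(80, 20)); // 0.8
    }
}
```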

5. Results and discussion

Our experiments applied five different feature selection algorithms to the simulation dataset, and the results were compared after noise, missing values, and multicollinearity were added to the data. The first set of experiments was designed to evaluate which of the following feature selection methods handles additive noise better: J48, Relief, information gain, consistency-based feature selection, and correlation-based feature selection.
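One plausible way to script the noise injection is Weka's AddNoise filter, which flips a given percentage of a nominal attribute's values. The sketch below is an assumption about how such a setup could look (the file name and the 5% level are illustrative); the paper does not state the authors' actual mechanism.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddNoise;

public class NoiseInjectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("simulation.arff").getDataSet();

        // Flip 5% of the values of the last (class) attribute; rerunning
        // with 2, 10, and 15 reproduces the noise levels studied below.
        AddNoise noise = new AddNoise();
        noise.setAttributeIndex("last");
        noise.setPercent(5);
        noise.setRandomSeed(1);
        noise.setInputFormat(data);
        Instances noisy = Filter.useFilter(data, noise);

        System.out.println(noisy.numInstances() + " instances, 5% of class values flipped");
    }
}
```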

For a feature selection algorithm to be called robust, it must have both a high sensitivity and a high specificity. Fig. 1 shows that the J48 algorithm performs well in terms of sensitivity, but there is a large drop in its specificity curve (Fig. 2), which makes the algorithm less desirable than the others. Without any noise, the information gain and ReliefF algorithms both perform well, while consistency-based feature evaluation has a low specificity and the J48 decision tree classification algorithm has a low sensitivity. The results for the Cfs, information gain, and Relief algorithms do not change as the noise level rises through 2%, 5%, 10%, and 15%. In contrast, for the J48 decision tree the specificity decreases and the sensitivity increases rapidly as the noise level grows. The sensitivity of consistency-based feature selection was 11% higher than the average sensitivity of the other methods; however, its specificity was 37% lower than the average. It is important to note that a method that maximizes both sensitivity and specificity is the one of interest. The best method, when both sensitivity and specificity are considered, was information gain: it outperformed the average of the other methods by 1% in sensitivity and 20.5% in specificity. The consistency-based subset evaluation and J48 algorithms cannot handle noise.

The next set of experiments was designed to evaluate which of the feature selection methods handles missing values better. Fig. 3 and Fig. 4 show the results for the different levels of missing values. In this case, when considering sensitivity and specificity, Relief and information gain proved to perform better than the other methods in our study, by 7.2% and 12.4% respectively. In contrast, the consistency-based feature selection and J48 algorithms cannot handle missing values effectively.

Our studies also showed that when multicollinearity was embedded into the dataset, without any noise or missing values, the correlation-based feature selection method outperformed the other methods.
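Multicollinearity can be embedded by appending an attribute that is a (near-)linear combination of existing ones; one hypothetical way to do this in Weka is the AddExpression filter, sketched below (the expression, attribute indices, and file name are illustrative assumptions, not the authors' setup).

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddExpression;

public class MulticollinearitySketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("simulation.arff").getDataSet();

        // Append an attribute that is a linear combination of the first
        // two attributes (a1, a2), making it collinear with them.
        AddExpression expr = new AddExpression();
        expr.setExpression("a1+2*a2");
        expr.setName("collinear");
        expr.setInputFormat(data);
        Instances augmented = Filter.useFilter(data, expr);

        System.out.println(augmented.numAttributes() + " attributes after augmentation");
    }
}
```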

In summary, Relief and information gain were the best in all three situations (noise, missing values, multicollinearity) when both sensitivity and specificity were considered.

Figure 1: Noise-sensitivity

