
Fig. 3. Stability comparisons over three groups of Eclipse datasets

The three groups of datasets show similar patterns: the FS methods had the highest stability performance on the Eclipse3 datasets, then on Eclipse2, and finally on Eclipse1. This is especially true when using the RUS35, RUS50, SMO35, and SMO50 sampling approaches.

VI. CONCLUSION

This paper presents a strategy that uses feature selection (FS) and data sampling together to cope with the high-dimensionality and class imbalance problems in the context of software defect prediction. Instead of assessing FS techniques by measuring classification performance after the training dataset is modified, this study focuses on another important property of FS: stability, specifically the sensitivity of an FS method when used with a data sampling technique. More stable FS techniques will reliably select the same features even after sampling has been applied, so practitioners can be more confident in those features.
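To make this notion concrete, stability can be quantified by comparing the feature subsets an FS method selects across repeated sampled versions of the same dataset. The minimal Python sketch below uses Kuncheva's consistency index as one common choice; it illustrates the general idea and is not necessarily the exact metric used in this paper.

    from itertools import combinations

    def kuncheva_index(a, b, n):
        # Consistency between two equal-size feature subsets drawn from
        # n total features (Kuncheva's index): 1 means identical subsets,
        # values near 0 mean overlap expected by chance alone.
        k = len(a)
        r = len(set(a) & set(b))
        return (r * n - k * k) / (k * (n - k))

    def average_stability(subsets, n):
        # Mean pairwise consistency of the subsets an FS method selects
        # on repeated sampled versions of the same dataset.
        pairs = list(combinations(subsets, 2))
        return sum(kuncheva_index(a, b, n) for a, b in pairs) / len(pairs)

    # Hypothetical example: subsets of 4 (out of 20) features chosen on
    # three different sampling runs of one dataset.
    runs = [{0, 3, 7, 9}, {0, 3, 8, 9}, {0, 2, 3, 9}]
    print(average_stability(runs, n=20))

A perfectly stable ranker would score 1 here; the closer the average falls to 0, the more the sampling run dictates which features are chosen.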

We examined six filter-based feature ranking techniques: five threshold-based feature selection methods (MI, KS, GR, AUC, and PRC) and the signal-to-noise ratio (S2N) method.
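As a brief illustration of one of these rankers, the sketch below implements the standard signal-to-noise score (the difference of a feature's class means divided by the sum of its class standard deviations); the paper's exact S2N formulation may differ in detail, and the label encoding here is an assumption.

    import math

    def s2n(values, labels):
        # Signal-to-noise ratio of a single feature:
        # (mean_fp - mean_nfp) / (std_fp + std_nfp).
        # Assumes labels 1 = fault-prone (fp), 0 = not fault-prone (nfp).
        fp  = [v for v, y in zip(values, labels) if y == 1]
        nfp = [v for v, y in zip(values, labels) if y == 0]
        mean = lambda xs: sum(xs) / len(xs)
        std = lambda xs: math.sqrt(sum((x - mean(xs)) ** 2 for x in xs) / len(xs))
        return (mean(fp) - mean(nfp)) / (std(fp) + std(nfp))

    def top_k_features(columns, labels, k):
        # Rank features (given as per-feature columns) by |S2N|
        # and keep the indices of the k best.
        scores = sorted(enumerate(columns),
                        key=lambda jc: -abs(s2n(jc[1], labels)))
        return [j for j, _ in scores[:k]]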

The three data sampling techniques adopted are random undersampling (RUS), random oversampling (ROS), and synthetic minority oversampling (SMO), each combined with two post-sampling class ratios (35:65 and 50:50). The experiments were performed on three groups of datasets from a real-world software project.
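To illustrate how a post-sampling class ratio is enforced, here is a minimal sketch of random undersampling with a hypothetical helper (the paper does not prescribe this implementation): majority-class instances are randomly discarded until the minority class reaches the target share.

    import random

    def random_undersample(majority, minority, minority_share=0.35, seed=0):
        # Keep all minority instances and randomly keep just enough
        # majority instances so that the minority class makes up
        # `minority_share` of the result (0.35 -> a 35:65 ratio,
        # 0.50 -> a 50:50 ratio).
        rng = random.Random(seed)
        n_min = len(minority)
        # minority_share = n_min / (n_min + n_maj)  =>  solve for n_maj
        n_maj = round(n_min * (1 - minority_share) / minority_share)
        kept = rng.sample(list(majority), min(n_maj, len(majority)))
        return kept + list(minority)

Random oversampling instead duplicates minority instances, and SMOTE synthesizes new minority instances from existing ones, until the same target ratio is reached.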

The results demonstrate that (1) S2N, AUC, and PRC had higher stability performance than the other rankers; (2) ROS35 and ROS50 produced higher stability values than the other sampling approaches; (3) a post-sampling class ratio of 35:65 between the fault-prone (fp) and not fault-prone (nfp) classes showed higher stability than the 50:50 ratio for the RUS and SMO sampling techniques; (4) stability performance generally increased with the number of attributes retained in the feature subset, especially when the dataset was relatively skewed; and (5) the less imbalanced original datasets (prior to sampling) were more likely to achieve higher stability performance when data sampling techniques (such as RUS or SMO) were applied to them.

Future work will include additional case studies with software measurement datasets of other software systems. In addition, different data sampling and FS techniques will be considered in future research.


