bbc 2015

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015 Abstract ID: P Poster 10th Benelux Bioinformatics Conference bbc 2015

P58. SFINX: STRAIGHTFORWARD FILTERING INDEX FOR AFFINITY PURIFICATION-MASS SPECTROMETRY DATA ANALYSIS

Kevin Titeca 1,2, Pieter Meysman 3,4, Kris Gevaert 1,2, Jan Tavernier 1,2, Kris Laukens 3,4, Lennart Martens 1,2 & Sven Eyckerman 1,2*.

Medical Biotechnology Center, VIB, B-9000 Ghent, Belgium 1; Department of Biochemistry, Ghent University, B-9000 Ghent, Belgium 2; Advanced Database Research and Modeling (ADReM), University of Antwerp, Belgium 3; Biomedical informatics research center Antwerpen (biomina), Belgium 4.

* sven.eyckerman@vib-ugent.be

Affinity purification-mass spectrometry (AP-MS) is one of the most common techniques for the analysis of protein-protein interactions, but inferring bona fide interactions from the resulting datasets remains notoriously difficult because of the many false positives. The ideal filter technique for these data is highly accurate, fast and user friendly without the need to rely on extensive parameter optimization or external databases, which also makes it reproducible and unbiased. Because none of the existing filter techniques combines all these features, we developed SFINX, the Straightforward Filtering INdeX. We here describe the SFINX algorithm and its performance on two independent AP-MS benchmark datasets. SFINX shows superior performance over the other approaches with accuracy increases of up to 20%, and is extremely fast. It does not require parameter optimization, and is absolutely independent of external resources. Both the algorithm and its website interface are highly intuitive with limited need for user input and the possibility of immediate network visualization and interpretation at http://sfinx.ugent.be/. SFINX might become essential in the toolbox of any scientist interested in user-friendly and highly accurate filtering of AP-MS data.
P59. MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION: AN EXTREMELY IMBALANCED BIG DATA PROBLEM

Isaac Triguero 1,2*, Sara del Río 3, Victoria López 3, Jaume Bacardit 4, José M. Benítez 3 & Francisco Herrera 3.

VIB Inflammation Research Center 1; Department of Respiratory Medicine, Ghent University 2; Department of Computer Science and Artificial Intelligence 3; School of Computing Science, Newcastle University 4.

* Isaac.Triguero@irc.vib-Ugent.be

The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. Rapid advances in biotechnology allow us to obtain and store large quantities of data about cells, proteins, genes, etc., all of which must be processed. Moreover, in many of these problems, such as contact map prediction, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, is not straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL'14 big data competition, which was concerned with the prediction of contact maps. Our methodology is composed of several MapReduce approaches to deal with large amounts of data. The results show that this model is well suited to tackling large-scale bioinformatics classification problems.

INTRODUCTION

The prediction of a protein's contact map is a crucial step in the prediction of the complete 3D structure of a protein. This is one of the most challenging bioinformatics tasks within the field of protein structure prediction because of the sparseness of the contacts (i.e. few positive examples) and the large amount of data extracted (i.e. millions of instances, GBs of disk space) from a few thousand proteins.
This is an imbalanced bioinformatics big data application, in which traditional machine learning techniques become ineffective and inefficient because of the sheer size of the problem. However, with the use of emerging cloud-based technologies, these techniques can be redesigned to extract valuable knowledge from such amounts of data. The ECBDL'14 competition (http://cruncher.ncl.ac.uk/bdcomp/) provided a data set that modeled the contact map prediction problem as a classification task. Concretely, the training data set consisted of 32 million instances, 631 attributes and 2 classes, with 98% negative examples, and it occupies about 56GB of disk space. In this work we describe the methodology with which we participated under the name 'Efdamis', and which ranked as the winning algorithm (Triguero et al, 2015).

METHODS

In the proposed methodology, we focused on the MapReduce (Dean et al, 2008) paradigm in order to manage this voluminous data set. We extended the applicability of some pre-processing and classification models to deal with large-scale problems. The methodology is composed of four main parts:

An oversampling approach: The goal is to balance the highly skewed class distribution of the problem by randomly replicating the instances of the minority class (del Río et al, 2014).

An evolutionary feature weighting method: Due to the relatively high number of features of the given problem, we developed a feature selection scheme for large-scale problems that improves the classification performance by detecting the most significant features (Triguero et al, 2012).

Building a learning model: As classifier, we focused on a scalable RandomForest algorithm.

Testing the model: Even the test data can be considered big data (2.9 million instances), so the testing phase was also deployed within a parallel approach.
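The oversampling step can be illustrated with a minimal single-machine sketch. It is not the authors' Hadoop implementation: the function names, the round-robin partitioning and the per-partition balancing rule are assumptions made for illustration only. Each "map" call replicates randomly chosen minority instances inside one partition, and the "reduce" step simply concatenates the partitions.

```python
import random
from itertools import chain


def split_into_partitions(instances, n_partitions):
    """Deal instances round-robin into n_partitions lists (a stand-in for data splits)."""
    return [instances[i::n_partitions] for i in range(n_partitions)]


def oversample_partition(partition, minority_label, rng):
    """Map step (hypothetical): replicate random minority instances until the
    minority class is as large as the majority class within this partition."""
    minority = [inst for inst in partition if inst[1] == minority_label]
    majority = [inst for inst in partition if inst[1] != minority_label]
    # Nothing to replicate, or already balanced: leave the partition unchanged.
    if not minority or len(minority) >= len(majority):
        return list(partition)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


def mapreduce_oversample(instances, minority_label, n_partitions=4, seed=0):
    """Run the map step on every partition, then concatenate (reduce step)."""
    rng = random.Random(seed)
    partitions = split_into_partitions(instances, n_partitions)
    mapped = [oversample_partition(p, minority_label, rng) for p in partitions]
    return list(chain.from_iterable(mapped))


# Toy data: (features, label) pairs with roughly 2% positives.
toy = [((i,), 1 if i % 49 == 0 else 0) for i in range(1000)]
balanced = mapreduce_oversample(toy, minority_label=1)
```

Balancing inside each partition keeps the map calls independent of one another, which is what makes the step easy to parallelize. A partition that happens to contain no minority instance is left unchanged in this sketch.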
RESULTS & DISCUSSION

Table 1 presents the final results of the top 5 participants in terms of True Positive Rate (TPR) and True Negative Rate (TNR). In this particular problem, balancing TPR and TNR emerged as a difficult challenge for most of the participants in the competition. In this sense, the use of scalable preprocessing techniques played an important role in improving the results of the RandomForest classifier. First, the designed oversampling approach prevented RandomForest from being biased towards the negative class. Second, our feature weighting approach allowed us to reduce the dimensionality of the problem by selecting the most relevant features. This resulted in better performance as well as a notable reduction of the time requirements.

Team          TPR      TNR      TPR * TNR
Efdamis       0.73043  0.73018  0.53335
ICOS          0.70321  0.73016  0.51345
UNSW          0.69916  0.72763  0.50873
HyperEns      0.64003  0.76338  0.48858
PUC-Rio_ICA   0.65709  0.71460  0.46956

TABLE 1: Comparison with the top 5 of the competition.

REFERENCES

Dean J., Ghemawat S., MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113.
del Río S., et al., On the use of MapReduce for imbalanced big data using random forest, Inf. Sci. 285 (2014) 112–137.
Triguero I., et al., Integrating a differential evolution feature weighting scheme into prototype generation, Neurocomputing 97 (2012) 332–343.
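As a note on the TPR * TNR column of Table 1: it is the product of the two rates, so a classifier only scores well when it does well on both classes. The following sketch shows how such a score could be computed from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the competition.

```python
def tpr_tnr_score(tp, fn, tn, fp):
    """Return (TPR, TNR, TPR * TNR) from confusion-matrix counts.

    tp/fn are the positive examples predicted correctly/incorrectly,
    tn/fp are the negative examples predicted correctly/incorrectly.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must contain at least one example")
    tpr = tp / (tp + fn)  # fraction of positives that were recovered
    tnr = tn / (tn + fp)  # fraction of negatives that were recovered
    return tpr, tnr, tpr * tnr


# Hypothetical counts for an imbalanced test set (100 positives, 4900 negatives).
tpr, tnr, score = tpr_tnr_score(tp=73, fn=27, tn=3577, fp=1323)
```

Predicting every instance as negative gives TNR = 1 but TPR = 0, so the product is 0; this is why the score penalizes classifiers that ignore the minority class.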