BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference, BBC 2015<br />
P59. MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION:<br />
AN EXTREMELY IMBALANCED BIG DATA PROBLEM<br />
Isaac Triguero 1,2* , Sara del Río 3 , Victoria López 3 , Jaume Bacardit 4 , José M. Benítez 3 & Francisco Herrera 3 .<br />
VIB Inflammation Research Center 1 ; Department of Respiratory Medicine, Ghent University 2 ; Department of Computer<br />
Science and Artificial Intelligence 3 ; School of Computing Science, Newcastle University 4 .<br />
* Isaac.Triguero@irc.vib-Ugent.be<br />
The application of data mining and machine learning techniques to biological and biomedical data continues to be a<br />
ubiquitous research theme in current bioinformatics. Rapid advances in biotechnology allow us to obtain and<br />
store large quantities of data about cells, proteins, genes, etc., that must then be processed. Moreover, in many of these<br />
problems, such as contact map prediction, it is difficult to collect representative positive examples. Learning under these<br />
circumstances, known as imbalanced big data classification, is not straightforward for most standard machine<br />
learning methods. In this work we describe the methodology that won the ECBDL'14 big data competition, which was<br />
concerned with the prediction of contact maps. Our methodology is composed of several MapReduce approaches that deal<br />
with large amounts of data. The results show that this model is well suited to tackling large-scale bioinformatics<br />
classification problems.<br />
INTRODUCTION<br />
The prediction of a protein’s contact map is a crucial step<br />
towards predicting the complete 3D structure of a protein.<br />
It is one of the most challenging bioinformatics tasks<br />
within the field of protein structure prediction because of<br />
the sparseness of the contacts (i.e. few positive examples)<br />
and the large amount of data extracted (i.e. millions of<br />
instances, gigabytes of disk space) from a few thousand<br />
proteins.<br />
This is an imbalanced big data problem in<br />
bioinformatics, in which traditional machine learning<br />
techniques become ineffective and inefficient due to<br />
the sheer size of the problem. However, with the use of<br />
emerging cloud-based technologies, these techniques<br />
can be redesigned to extract valuable knowledge from<br />
such amounts of data.<br />
The ECBDL’14 competition (http://cruncher.ncl.ac.uk/bdcomp/)<br />
provided a data set that modeled the contact<br />
map prediction problem as a classification task.<br />
Concretely, the training data set consisted of<br />
32 million instances with 631 attributes and 2 classes;<br />
98% of the instances are negative examples, and the<br />
set occupies about 56 GB of disk<br />
space.<br />
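As a quick check of how skewed this data set is, the following sketch derives approximate class counts from the figures quoted above. The exact per-class counts are not given in the text, so the results are rounded estimates and the variable names are illustrative only.

```python
# Rough class-balance arithmetic from the figures quoted in the text.
# Exact per-class counts are not given, so these are rounded estimates.
TOTAL_INSTANCES = 32_000_000   # "32 million instances"
NEGATIVE_PERCENT = 98          # "98% of negative examples"

negatives = TOTAL_INSTANCES * NEGATIVE_PERCENT // 100
positives = TOTAL_INSTANCES - negatives
imbalance_ratio = negatives / positives  # negatives per positive example

print(f"negatives ~ {negatives:,}")
print(f"positives ~ {positives:,}")
print(f"imbalance ratio ~ {imbalance_ratio:.0f}:1")
```

Under these rounded figures, a classifier that always answers "negative" would already be right about 98% of the time, which is why plain accuracy is not a useful measure for this problem.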
In this work we describe the methodology with which we<br />
participated under the name 'Efdamis', and which ranked<br />
first in the competition (Triguero et al, 2015).<br />
METHODS<br />
In the proposed methodology, we relied on the<br />
MapReduce paradigm (Dean and Ghemawat, 2008) to<br />
manage this voluminous data set. We extended<br />
several pre-processing and classification<br />
models so that they can handle large-scale problems. The<br />
methodology is composed of four main parts:<br />
• An oversampling approach: The goal is to balance the<br />
highly skewed class distribution of the problem by<br />
randomly replicating instances of the minority<br />
class (del Río et al, 2014).<br />
• An evolutionary feature weighting method: Due to the<br />
relatively high number of features in this problem,<br />
we developed a feature selection scheme for large-scale<br />
problems that improves classification<br />
performance by detecting the most significant features<br />
(Triguero et al, 2012).<br />
• Building a learning model: As the classifier, we used<br />
a scalable RandomForest algorithm.<br />
• Testing the model: Even the test data can be<br />
considered big data (2.9 million instances), so<br />
the testing phase was also run in<br />
parallel.<br />
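The oversampling step can be illustrated with a minimal single-machine sketch in the map/reduce style. This is not the authors' implementation (see del Río et al, 2014, for that); the function names, the round-robin partitioning, the per-partition balancing rule and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter

def split_into_partitions(instances, n_partitions):
    """Deal instances round-robin into chunks (a stand-in for distributed file splits)."""
    return [instances[i::n_partitions] for i in range(n_partitions)]

def map_oversample(partition, minority_label, rng):
    """Map step: keep every instance and add random copies of minority
    instances until both classes are equally frequent in this partition."""
    minority = [inst for inst in partition if inst[1] == minority_label]
    majority = [inst for inst in partition if inst[1] != minority_label]
    if not minority:               # nothing to replicate in this split
        return list(partition)
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + extra

def reduce_concat(mapped_partitions):
    """Reduce step: concatenate the balanced partitions into one data set."""
    return [inst for part in mapped_partitions for inst in part]

def oversample(instances, minority_label=1, n_partitions=4, seed=0):
    """Run the map step on each partition, then merge the results."""
    rng = random.Random(seed)
    parts = split_into_partitions(instances, n_partitions)
    return reduce_concat(map_oversample(p, minority_label, rng) for p in parts)

# Toy data (assumed): (feature_vector, label) pairs with about 2% positives.
toy = [((i,), 1 if i % 49 == 0 else 0) for i in range(1000)]
balanced = oversample(toy)
counts = Counter(label for _, label in balanced)
print(counts)
```

Because each partition is balanced independently, a split that happens to contain no minority instances is left unchanged in this sketch; a real deployment would need to decide how to handle that case.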
RESULTS & DISCUSSION<br />
Table 1 presents the final results of the top 5 participants<br />
in terms of True Positive Rate (TPR) and True Negative<br />
Rate (TNR). In this particular problem,<br />
balancing the TPR and TNR proved to be a difficult<br />
challenge for most participants in the competition.<br />
In this respect, the use of scalable preprocessing techniques<br />
played an important role in improving the results of the<br />
RandomForest classifier. First, the oversampling<br />
approach prevented RandomForest from being<br />
biased towards the negative class. Second, our feature weighting<br />
approach allowed us to reduce the<br />
dimensionality of the problem by selecting the most<br />
relevant features. This resulted in better performance<br />
as well as a notable reduction in running time.<br />
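One simple way a learned weight vector could be turned into a reduced feature set is by thresholding. The text does not state how features were selected from the weights, so the threshold rule, the weight values and the function names below are hypothetical.

```python
def select_features(weights, threshold):
    """Return the indices of features whose weight is at least `threshold`."""
    return [i for i, w in enumerate(weights) if w >= threshold]

def project(instance, kept):
    """Keep only the selected feature values of one instance."""
    return [instance[i] for i in kept]

# Hypothetical weights in [0, 1], one per feature.
weights = [0.91, 0.05, 0.40, 0.77, 0.12]
kept = select_features(weights, threshold=0.5)
reduced = project([10, 20, 30, 40, 50], kept)
print(kept, reduced)
```

Training on the projected instances shrinks every example, which is one plausible source of the running-time reduction mentioned above.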
Team          TPR       TNR       TPR × TNR<br />
Efdamis       0.73043   0.73018   0.53335<br />
ICOS          0.70321   0.73016   0.51345<br />
UNSW          0.69916   0.72763   0.50873<br />
HyperEns      0.64003   0.76338   0.48858<br />
PUC-Rio_ICA   0.65709   0.71460   0.46956<br />
TABLE 1: Comparison of the top 5 participants in the competition.<br />
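The last column of Table 1 is the product of the two rates. The sketch below computes both rates from confusion-matrix counts and checks the table's rows against that product; the helper name `rates` and the example counts are assumptions, while the table values are copied from above. Small differences in the last digit are expected because the published rates are themselves rounded.

```python
def rates(tp, fn, tn, fp):
    """Return (TPR, TNR, TPR * TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # fraction of positives correctly predicted
    tnr = tn / (tn + fp)   # fraction of negatives correctly predicted
    return tpr, tnr, tpr * tnr

# Values copied from Table 1: (TPR, TNR, reported product).
reported = {
    "Efdamis":     (0.73043, 0.73018, 0.53335),
    "ICOS":        (0.70321, 0.73016, 0.51345),
    "UNSW":        (0.69916, 0.72763, 0.50873),
    "HyperEns":    (0.64003, 0.76338, 0.48858),
    "PUC-Rio_ICA": (0.65709, 0.71460, 0.46956),
}

# Largest gap between the recomputed and the reported product.
max_gap = max(abs(tpr * tnr - prod) for tpr, tnr, prod in reported.values())
print(f"largest rounding gap: {max_gap:.6f}")

# A classifier that always predicts "negative" scores zero on this measure,
# even though its accuracy on a 98%-negative data set would be high.
always_negative = rates(tp=0, fn=10, tn=100, fp=0)
print(always_negative)
```

The product rewards classifiers that do well on both classes at once, which fits the emphasis on balancing TPR and TNR in the discussion.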
REFERENCES<br />
Dean J., Ghemawat S., MapReduce: simplified data processing on large<br />
clusters, Commun. ACM 51 (1) (2008) 107–113.<br />
del Río S., et al., On the use of MapReduce for imbalanced big data using<br />
random forest, Inf. Sci. 285 (2014) 112–137.<br />
Triguero I., et al., Integrating a differential evolution feature weighting<br />
scheme into prototype generation, Neurocomputing 97 (2012) 332–343.<br />