
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015

Abstract ID: P

Poster

10th Benelux Bioinformatics Conference bbc 2015

P59. MAPREDUCE APPROACHES FOR CONTACT MAP PREDICTION: AN EXTREMELY IMBALANCED BIG DATA PROBLEM

Isaac Triguero 1,2*, Sara del Río 3, Victoria López 3, Jaume Bacardit 4, José M. Benítez 3 & Francisco Herrera 3.

VIB Inflammation Research Center 1; Department of Respiratory Medicine, Ghent University 2; Department of Computer Science and Artificial Intelligence 3; School of Computing Science, Newcastle University 4.

* Isaac.Triguero@irc.vib-Ugent.be

The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. Rapid advances in biotechnology allow us to obtain and store large quantities of data about cells, proteins, genes, etc., that must then be processed. Moreover, in many of these problems, such as contact map prediction, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL'14 big data competition, which was concerned with the prediction of contact maps. Our methodology is composed of several MapReduce approaches designed to deal with large amounts of data. The results show that this model is well suited to tackling large-scale bioinformatics classification problems.

INTRODUCTION

The prediction of a protein’s contact map is a crucial step in predicting the complete 3D structure of a protein. It is one of the most challenging bioinformatics tasks within the field of protein structure prediction because of the sparseness of the contacts (i.e. few positive examples) and the great amount of data extracted (i.e. millions of instances, gigabytes of disk space) from a few thousand proteins.

This problem is an imbalanced big data application in bioinformatics, in which traditional machine learning techniques become ineffective and inefficient because of the sheer size of the problem. However, with the use of emerging cloud-based technologies, these techniques can be redesigned to extract valuable knowledge from such amounts of data.

The ECBDL’14 competition (http://cruncher.ncl.ac.uk/bdcomp/) provided a data set that modeled the contact map prediction problem as a classification task. Specifically, the training data set consisted of 32 million instances, 631 attributes and 2 classes, with 98% negative examples, and it occupies about 56 GB of disk space.
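To make the scale of the imbalance concrete, the figures above can be turned into approximate class counts. The following is a minimal sketch in Python; it treats the rounded values quoted in the text (32 million instances, 98% negatives) as exact, which is an assumption, and the variable names are illustrative only.

```python
# Rounded figures quoted in the text, treated here as exact (an assumption).
TOTAL_INSTANCES = 32_000_000
NEGATIVE_PERCENT = 98

# Integer arithmetic avoids floating-point noise in the counts.
negatives = TOTAL_INSTANCES * NEGATIVE_PERCENT // 100
positives = TOTAL_INSTANCES - negatives

# Imbalance ratio: number of negative examples per positive example.
imbalance_ratio = negatives / positives

print(f"negatives: {negatives:,}")
print(f"positives: {positives:,}")
print(f"imbalance ratio: {imbalance_ratio:.0f}:1")
```

Under these assumptions there are roughly 49 negative instances for every positive one, so a classifier that always predicts the negative class would be about 98% accurate while detecting no contacts at all.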

In this work we describe the methodology with which we participated under the name 'Efdamis' and which ranked as the winning algorithm (Triguero et al., 2015).

METHODS

In the proposed methodology, we focused on the MapReduce paradigm (Dean and Ghemawat, 2008) in order to manage this voluminous data set. We extended the applicability of some pre-processing and classification models to deal with large-scale problems. The methodology is composed of four main parts:

- An oversampling approach: The goal is to balance the highly skewed class distribution of the problem by randomly replicating the instances of the minority class (del Río et al., 2014).

- An evolutionary feature weighting method: Due to the relatively high number of features in the given problem, we developed a feature selection scheme for large-scale problems that improves the classification performance by detecting the most significant features (Triguero et al., 2012).

- Building a learning model: As the classifier, we focused on a scalable RandomForest algorithm.

- Testing the model: Even the test data can be considered big data (2.9 million instances), so the testing phase was also deployed within a parallel approach.
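The oversampling step can be illustrated with a small single-machine sketch in Python. It is not the authors' Hadoop implementation: the helper names (`map_oversample`, `reduce_concat`, `oversample`), the round-robin partitioning and the toy data are all assumptions made for illustration. Each map call balances one partition by randomly replicating minority instances, and the reduce step concatenates the balanced partitions.

```python
import random
from collections import Counter


def map_oversample(partition, minority_label, rng):
    """Map step (hypothetical): balance one partition by random replication."""
    minority = [row for row in partition if row[1] == minority_label]
    majority = [row for row in partition if row[1] != minority_label]
    deficit = len(majority) - len(minority)
    extra = []
    # A partition without any minority instance cannot be balanced and is
    # returned unchanged.
    if minority and deficit > 0:
        extra = [rng.choice(minority) for _ in range(deficit)]
    return partition + extra


def reduce_concat(mapped_partitions):
    """Reduce step (hypothetical): concatenate the balanced partitions."""
    merged = []
    for part in mapped_partitions:
        merged.extend(part)
    return merged


def oversample(dataset, n_partitions, minority_label=1, seed=0):
    """Split the data round-robin, balance each split, then merge."""
    rng = random.Random(seed)
    partitions = [dataset[i::n_partitions] for i in range(n_partitions)]
    return reduce_concat(
        map_oversample(part, minority_label, rng) for part in partitions
    )


# Toy data set of (features, label) pairs: 98 negatives and 2 positives.
toy = [((i,), 0) for i in range(98)] + [((i,), 1) for i in range(98, 100)]
balanced = oversample(toy, n_partitions=2)
counts = Counter(label for _, label in balanced)
print(counts)
```

In this toy run each of the two partitions holds 49 negatives and 1 positive, so the merged result contains 98 instances of each class. Balancing per partition keeps the map calls independent of one another, which is the property that lets the work be distributed.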

RESULTS & DISCUSSION

Table 1 presents the final results of the top 5 participants in terms of True Positive Rate (TPR) and True Negative Rate (TNR). In this particular problem, the need to balance the TPR and TNR emerged as a difficult challenge for most participants in the competition. In this sense, the use of scalable preprocessing techniques played an important role in improving the results of the RandomForest classifier. First, the designed oversampling approach allowed us to prevent RandomForest from being biased towards the negative class. Second, our feature weighting approach made it possible to reduce the dimensionality of the problem by selecting the most relevant features. This resulted in better performance as well as a notable reduction in the time requirements.

Team          TPR       TNR       TPR * TNR
Efdamis       0.73043   0.73018   0.53335
ICOS          0.70321   0.73016   0.51345
UNSW          0.69916   0.72763   0.50873
HyperEns      0.64003   0.76338   0.48858
PUC-Rio_ICA   0.65709   0.71460   0.46956

TABLE 1: Comparison with the top 5 of the competition.
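The TPR * TNR score in Table 1 can be recomputed from the reported rates, and the same measure shows why a classifier biased to the negative class scores poorly. The sketch below is in Python; the function name `tpr_tnr_product` and the confusion-matrix counts in the second part are hypothetical and not taken from the competition.

```python
def tpr_tnr_product(tp, fn, tn, fp):
    """Return (TPR, TNR, TPR * TNR) from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # fraction of positives correctly detected
    tnr = tn / (tn + fp)  # fraction of negatives correctly rejected
    return tpr, tnr, tpr * tnr


# Rates reported in Table 1: team -> (TPR, TNR, reported TPR * TNR).
table_1 = {
    "Efdamis": (0.73043, 0.73018, 0.53335),
    "ICOS": (0.70321, 0.73016, 0.51345),
    "UNSW": (0.69916, 0.72763, 0.50873),
    "HyperEns": (0.64003, 0.76338, 0.48858),
    "PUC-Rio_ICA": (0.65709, 0.71460, 0.46956),
}

# Recompute the product from the rounded rates in the table.
recomputed = {team: tpr * tnr for team, (tpr, tnr, _) in table_1.items()}
for team, (_, _, reported) in table_1.items():
    print(f"{team:12s} reported={reported:.5f} recomputed={recomputed[team]:.5f}")

# Hypothetical classifier that labels every instance as negative, evaluated
# on made-up counts with a 98% negative class.
majority_only = tpr_tnr_product(tp=0, fn=640, tn=31_360, fp=0)
print(majority_only)
```

The recomputed products agree with the reported column to within 1e-5, which is consistent with the published rates being rounded. The always-negative classifier reaches a TNR of 1 but a TPR of 0, so its product is 0 despite its high accuracy.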

REFERENCES

Dean J., Ghemawat S., MapReduce: simplified data processing on large clusters, Commun. ACM 51 (1) (2008) 107–113.

del Río S. et al., On the use of MapReduce for imbalanced big data using random forest, Inf. Sci. 285 (2014) 112–137.

Triguero I. et al., Integrating a differential evolution feature weighting scheme into prototype generation, Neurocomputing 97 (2012) 332–343.
