
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015

Abstract ID: P

Poster

10th Benelux Bioinformatics Conference bbc 2015

P5. BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE AND HIVE

Amin Ardeshirdavani 1*, Erika Souche 2, Martijn Oldenhof 3 & Yves Moreau 1.

KU Leuven ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics 1; KU Leuven Department of Human Genetics 2; KU Leuven Facilities for Research 3. *amin.ardeshirdavani@esat.kuleuven.be

Next Generation Sequencing (NGS) technologies allow the whole human genome to be sequenced and, among other applications, human genetic disorders to be studied efficiently. However, the resulting flood of sequencing data requires high computational power and an optimized programming structure for analysis. Many researchers use scale-out networks to approximate a supercomputer. In many use cases, Apache Hadoop and HBase have been used to coordinate distributed computation and to act as a storage platform, respectively. However, scale-out networks have rarely been used to handle gene variation data from NGS, except for sequencing read assembly. In our study, we propose a Big Data solution that integrates Apache Hadoop, HBase and Hive to efficiently analyze NGS output such as VCF files.

INTRODUCTION

The goal of this project is to bridge the gap between the massive volume of NGS data and limited data processing capacity. We propose a data processing and storage model specifically for NGS data and, to address this goal, we developed an application based on this model to test whether processing capacity is substantially increased. The target users of this application are researchers with intermediate-level computer skills. The new model should meet certain requirements: scalability, fault tolerance and high availability. The data import procedure should be fast and occupy as little storage as possible; it should also make querying data faster and possible from remote locations. To achieve these requirements, three open-source projects, Apache Hadoop, HBase and Hive, are integrated as the backbone, and on top of them an application with a user-friendly interface is developed to make this integration more straightforward.

METHODS

Generally, Hadoop provides distributed MapReduce data processing, HBase is the platform for storing complex structured data, and Hive retrieves data from HBase using Structured Query Language (SQL) syntax. Although Hadoop and HBase have become popular recently, the combination of Hadoop, HBase and Hive has rarely been implemented in the bioinformatics field.
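As a minimal sketch of this division of labor, a Hive external table can be mapped onto an HBase table through the HBaseStorageHandler and then queried with plain SQL. The table layout, column mapping and host name below are illustrative assumptions, not the actual schema used in this work:

# Hypothetical sketch: expose an HBase table of variants to Hive and query it with SQL.
from pyhive import hive

conn = hive.connect(host="hive-server.example.org", port=10000)
cur = conn.cursor()

# Map an existing HBase table "variants" into Hive (assumed column layout).
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS variants (
        rowkey STRING, ref STRING, alt STRING, qual DOUBLE
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,v:ref,v:alt,v:qual')
    TBLPROPERTIES ('hbase.table.name' = 'variants')
""")

# Ordinary SQL is translated by Hive into scans over the HBase store.
cur.execute("SELECT rowkey, ref, alt FROM variants WHERE qual > 30 LIMIT 10")
for row in cur.fetchall():
    print(row)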

Here we mainly discuss gene variation data analysis; application development therefore focuses on parsing and storing VCF (Variant Call Format) files. The application is designed to adapt dynamically to the VCF file structures produced by different variant callers. For example, UnifiedGenotyper calls SNPs and InDels separately, treating each variant as independent, whereas HaplotypeCaller calls variants using local assembly. For gene variation analysis, the VCF files of different samples need to be queried and the results should be exportable for further use. A VCF file for a single sample or a group of samples is typically large, so processing efficiency is crucial.
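One way such a parser could load records into HBase is shown in the sketch below; the row-key layout (chrom:pos:sample), the column family and the Thrift gateway host are assumptions for illustration, not the actual design of the application.

# Hypothetical sketch: parse VCF body lines and store one HBase row per variant
# via the Thrift-based happybase client.
import happybase

connection = happybase.Connection("hbase-thrift.example.org")  # Thrift gateway host (assumed)
table = connection.table("variants")

sample = "NA12878"  # illustrative sample name
with open("sample.vcf") as vcf:
    for line in vcf:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        chrom, pos, _vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
        row_key = "{}:{:09d}:{}".format(chrom, int(pos), sample)
        table.put(row_key.encode(), {
            b"v:ref": ref.encode(),
            b"v:alt": alt.encode(),
            b"v:qual": qual.encode(),
            b"v:filter": flt.encode(),
            b"v:info": info.encode(),
        })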

The model we have chosen is the integration of Hadoop, HBase and Hive: Hadoop is used for data processing, HBase for storage and Hive for querying. Since all of these projects need a distributed cluster to reach optimal performance, choosing a suitable architecture for our application is crucial. The cluster is the main processing and storage platform, while a single server outside the cluster acts as a client for users. Through this client, our application can connect remotely to the Hive server on behalf of researchers.
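A remote query session from such a client could look like the sketch below; the host name, the flattened table schema (chrom, pos, ref, alt, sample) and the specific filter are hypothetical and only illustrate querying several samples at once and exporting the result:

# Hypothetical sketch: query variants shared by several samples from a remote
# workstation and export the result for downstream use.
import csv
from pyhive import hive

conn = hive.connect(host="cluster-master.example.org", port=10000)
cur = conn.cursor()

# Variants on chromosome 1 seen in at least two of the selected samples
# (assumes an illustrative flattened "variants" table with a sample column).
cur.execute("""
    SELECT chrom, pos, ref, alt, COUNT(DISTINCT sample) AS n_samples
    FROM variants
    WHERE chrom = '1' AND sample IN ('NA12878', 'NA12891', 'NA12892')
    GROUP BY chrom, pos, ref, alt
    HAVING COUNT(DISTINCT sample) >= 2
""")

with open("shared_variants.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["chrom", "pos", "ref", "alt", "n_samples"])
    writer.writerows(cur.fetchall())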

RESULTS & DISCUSSION

Our tests clearly show that the Apache integration performs much better than a conventional SQL model when dealing with large VCF files, and its performance on small VCF files is also acceptable. We therefore conclude that this Apache integration is a good solution for this kind of file management. Our newly developed application H3 VCF, with its user-friendly interface, is a convenient tool for users without advanced IT knowledge, allowing them to use the integration to handle VCF files. Users can either build their own local computer cluster or use Amazon EMR to easily create a cluster with the Apache projects for a few dollars.
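For the Amazon EMR route, such a cluster can also be requested programmatically; the sketch below uses boto3 with an illustrative release label, instance types and default IAM roles, all of which are assumptions to adjust to one's own account and region:

# Hypothetical sketch: launch a small EMR cluster with Hadoop, HBase and Hive
# pre-installed. Release label, instance types and roles are illustrative.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="h3vcf-cluster",                  # illustrative cluster name
    ReleaseLabel="emr-5.0.0",              # a release bundling Hadoop, HBase and Hive
    Applications=[{"Name": "Hadoop"}, {"Name": "HBase"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",     # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",
)
print("Cluster ID:", response["JobFlowId"])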

