P5. BIG DATA SOLUTIONS FOR VARIANT DISCOVERY FROM LOW COVERAGE SEQUENCING DATA, BY INTEGRATION OF HADOOP, HBASE AND HIVE
Amin Ardeshirdavani 1*, Erika Souche 2, Martijn Oldenhof 3 & Yves Moreau 1.
KU Leuven ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics 1; KU Leuven Department of Human Genetics 2; KU Leuven Facilities for Research 3. *amin.ardeshirdavani@esat.kuleuven.be
Next Generation Sequencing (NGS) technologies allow whole human genome sequencing and thereby, among other applications, the efficient study of human genetic disorders. However, the resulting flood of sequencing data requires high computational power and an optimized programming architecture for its analysis. Many researchers therefore use scale-out networks of commodity machines in place of a supercomputer. In many such setups, Apache Hadoop coordinates distributed computation while HBase acts as the storage platform. However, scale-out networks have rarely been used to handle gene variation data from NGS, except for the assembly of sequencing reads. In this study, we propose a Big Data solution that integrates Apache Hadoop, HBase and Hive to efficiently analyze NGS output such as VCF files.
INTRODUCTION
The goal of this project is to bridge the gap between massive NGS data volumes and limited data-processing capacity. We propose a data processing and storage model tailored to NGS data and, to evaluate it, develop an application based on this model and test whether processing capacity increases substantially. The target users of this application are researchers with intermediate-level computer skills. The new model should meet several demands: scalability, fault tolerance and availability. Data import should be fast and occupy the smallest possible storage volume, and querying the data should be fast and possible from a remote location. To meet these demands, three open-source projects, Apache Hadoop, HBase and Hive, are integrated as the backbone, and on top of them an application with a user-friendly interface is developed to make the integration more straightforward.
METHODS
In general, Hadoop provides distributed MapReduce data processing, HBase serves as the platform for storing complex structured data, and Hive retrieves data from HBase using Structured Query Language (SQL)-like syntax. Although Hadoop and HBase have recently become popular, the combination of Hadoop, HBase and Hive has rarely been implemented in the bioinformatics field.
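The abstract does not give implementation details, but the standard way to expose an HBase table to Hive is Hive's HBase storage handler. The sketch below illustrates this pattern with the PyHive client; the hostname, table name (variants), column family (info) and columns are illustrative assumptions, not details taken from this work:

```python
from pyhive import hive  # assumed client; any Hive connector would do

# Connect to the Hive server running on the cluster (placeholder host/port).
conn = hive.Connection(host="hive-server.example.org", port=10000)
cursor = conn.cursor()

# Map an existing HBase table into Hive via the standard HBase storage handler.
# Table name, row-key layout and column family are assumptions for illustration.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS variants (
        rowkey STRING,   -- e.g. 'chrom:pos:ref:alt'
        qual   FLOAT,
        dp     INT
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
        'hbase.columns.mapping' = ':key,info:qual,info:dp'
    )
    TBLPROPERTIES ('hbase.table.name' = 'variants')
""")

# Once mapped, the HBase data can be filtered with plain SQL-like queries.
cursor.execute("SELECT rowkey, qual FROM variants WHERE dp > 20 LIMIT 10")
for row in cursor.fetchall():
    print(row)
```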
Here we focus on the analysis of gene variation data, so application development concentrates on parsing and storing VCF (Variant Call Format) files. The application is designed to adapt dynamically to the VCF file structures produced by different variant callers. For example, UnifiedGenotyper calls SNPs and indels separately, treating each variant as independent, whereas HaplotypeCaller calls variants using local assembly. For gene variation analysis, the VCF files of different samples need to be queried, and the results should be exportable for further use. Since a VCF file for a single sample, let alone a group of samples, is typically large, processing efficiency is crucial.
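As an illustration of the parsing and storage step (the abstract does not specify the actual schema), a minimal sketch might flatten each VCF record into an HBase row keyed by chromosome, position and alleles, here written with the happybase client; the row-key design and column names are assumptions:

```python
import happybase  # Python HBase client (assumed; the actual application may differ)

def vcf_record_to_row(line):
    """Flatten one data line of a VCF file into an (assumed) HBase row."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _vid, ref, alt, qual, flt = fields[:7]
    # Composite, zero-padded row key keeps variants of one region adjacent in HBase.
    row_key = f"{chrom}:{int(pos):010d}:{ref}:{alt}"
    return row_key.encode(), {
        b"info:qual": qual.encode(),
        b"info:filter": flt.encode(),
    }

connection = happybase.Connection("hbase-master.example.org")  # placeholder host
table = connection.table("variants")

# Batched puts keep the import fast, one of the demands stated above.
with open("sample.vcf") as vcf, table.batch(batch_size=1000) as batch:
    for line in vcf:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        row_key, columns = vcf_record_to_row(line)
        batch.put(row_key, columns)
```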
The model we have chosen is the integration of Hadoop, HBase and Hive: Hadoop is used for data processing, HBase for storage and Hive for querying. Since all of these projects need a distributed cluster for optimal performance, choosing a suitable architecture for our application is crucial. The cluster serves as the main processing and storage platform, while a single server outside the cluster acts as a client for users. Through this client, our application lets researchers connect remotely to the Hive server.
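A remote query from the client machine could then look like the following sketch (again with PyHive, continuing the assumed schema above; the sample_id column is likewise an assumption), for instance to find variants shared by several samples and export them for further use:

```python
from pyhive import hive

# The client machine outside the cluster connects to the Hive server remotely.
conn = hive.Connection(host="hive-server.example.org", port=10000)
cursor = conn.cursor()

# Hypothetical query: variants observed in more than one sample.
cursor.execute("""
    SELECT rowkey, COUNT(DISTINCT sample_id) AS n_samples
    FROM variants
    GROUP BY rowkey
    HAVING COUNT(DISTINCT sample_id) > 1
""")

# Export the result as a tab-separated file for downstream analysis.
with open("shared_variants.tsv", "w") as out:
    for rowkey, n_samples in cursor.fetchall():
        out.write(f"{rowkey}\t{n_samples}\n")
```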
RESULTS & DISCUSSION
Our tests clearly show that the Apache integration performs much better than a conventional SQL model when dealing with large VCF files, and its performance on small VCF files remains acceptable. We therefore conclude that the Apache integration is a good solution for this kind of file management. Our newly developed application, H3 VCF, offers a user-friendly interface so that users without advanced IT knowledge can conveniently use the integration to handle VCF files. Users can either build their own local compute cluster or use Amazon EMR to easily create a cluster with the Apache projects preinstalled for a few dollars.
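For reference, such a cluster can be launched on Amazon EMR programmatically; the sketch below uses boto3 with placeholder names, region and instance sizes, and is an illustration rather than the procedure used in this work:

```python
import boto3  # AWS SDK for Python

emr = boto3.client("emr", region_name="eu-west-1")  # placeholder region

# Launch a small cluster with Hadoop, HBase and Hive preinstalled.
# All names, sizes and roles below are illustrative assumptions.
response = emr.run_job_flow(
    Name="h3vcf-demo-cluster",
    ReleaseLabel="emr-4.6.0",  # pick a release that ships HBase alongside Hive
    Applications=[{"Name": "Hadoop"}, {"Name": "HBase"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster id:", response["JobFlowId"])
```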