
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />

Abstract ID: P<br />

Poster<br />

10th Benelux Bioinformatics Conference <strong>bbc</strong> <strong>2015</strong><br />

P17. TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS<br />

Bertrand Escaliere 1,2, Nicolas Simonis 1,3, Gianluca Bontempi 1,2 & Guillaume Smits 1,4.<br />

Interuniversity Institute of Bioinformatics in Brussels 1; Machine Learning Group, Université Libre de Bruxelles 2;<br />

Institut de Pathologie et de Génétique 3; Hôpital Universitaire des Enfants Reine Fabiola, Université Libre de Bruxelles 4.<br />

Optimizing NGS analysis software and pipelines is crucial for improving the discovery of (new) disease-causing<br />

variants. A better combination of existing tools and the right choice of parameters can lead to more specific and<br />

sensitive calling. Simulated datasets support the step-by-step development of new alignment or calling software. A<br />

simulator able to insert known human variants at a realistic minor allele frequency, and artificial variants in a tunable,<br />

controlled way, would overcome three limits to optimization: it gives complete knowledge of the input dataset, so that<br />

exact calling sensitivity and accuracy can be determined; it allows optimization on the appropriate population; and it<br />

makes it possible to test a pipeline dynamically, one variable at a time.<br />

INTRODUCTION<br />

Identifying the anomalies that cause genetic disorders is<br />

difficult. It can be hampered by the rarity of the disorder<br />

concerned, by its genetic heterogeneity, or by the<br />

phenotypic pleiotropy associated with anomalies in a<br />

single gene. Exome and genome sequencing have allowed<br />

the identification of the causes of many genetic diseases<br />

whose origin had remained inaccessible to the usual<br />

techniques of genetic research (Ng et al., 2009; Gilissen<br />

et al., 2012; Yang et al., 2013; Gilissen et al., 2014).<br />

Exome and genome sequencing data analysis pipelines<br />

consist of several steps (roughly: alignment, quality<br />

filtering, variant calling), and several software tools are<br />

available for each step. Evaluating and comparing those<br />

tools is crucial for improving pipeline accuracy. Exome<br />

and genome sequencing simulations should make it<br />

possible to determine the veracity of called variants<br />

(false positives and false negatives).<br />
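Because a simulated dataset comes with a complete list of inserted variants, false positives and false negatives can be counted by direct set comparison. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the `(chromosome, position, ref, alt)` variant key, the function name `evaluate_calls`, and the toy variants are all hypothetical, and real comparisons usually also need variant normalization.

```python
from typing import NamedTuple, Set, Tuple

# Assumption: a variant is identified by (chromosome, 1-based position, ref, alt).
Variant = Tuple[str, int, str, str]


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    sensitivity: float
    precision: float


def evaluate_calls(truth: Set[Variant], called: Set[Variant]) -> CallMetrics:
    """Compare a pipeline's call set against the simulated truth set."""
    tp = len(truth & called)   # inserted variants that were called
    fp = len(called - truth)   # called variants that were never inserted
    fn = len(truth - called)   # inserted variants that were missed
    sensitivity = tp / (tp + fn) if truth else float("nan")
    precision = tp / (tp + fp) if called else float("nan")
    return CallMetrics(tp, fp, fn, sensitivity, precision)


# Toy data, made up for illustration only.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "GA")}
called = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "GA"), ("chr3", 10, "T", "C")}
metrics = evaluate_calls(truth, called)
print(metrics)
```

In this toy case two of the three inserted variants are recovered and one call is spurious, so both sensitivity and precision are 2/3.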

METHODS<br />

We implemented TuneSIM, a wrapper around the NGS read<br />

simulator dwgsim (http://sourceforge.net/projects/dnaa/)<br />

that adds realistic mutations. Generated reads contain<br />

real mutations from the 1KG project and dbSNP138; read<br />

generation itself is delegated to dwgsim. To generate<br />

data that are as realistic as possible, we decided to<br />

preserve the haplotype block structure. We computed<br />

blocks with Plink (Purcell et al., 2007) from the VCF<br />

files of European individuals in 1KG project phase 3.<br />

For each block, we obtained the frequency of each<br />

combination of variants and used these frequencies for<br />

block selection. We also insert variants independently,<br />

using their frequencies in dbSNP (Smigielski et al.,<br />

2000). Using 33 in-house samples, we computed the global<br />

allele frequency distributions of variants in coding and<br />

non-coding regions, and we select variants according to<br />

those frequencies. A similar operation was performed for<br />

CNV insertion using 1KG data. We are developing a web<br />

interface that allows users to download existing<br />

generated datasets. After running their pipelines, they<br />

can upload their output and see the accuracy of their<br />

pipelines.<br />
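The two selection modes described above (choosing a variant combination per haplotype block according to its observed frequency, and inserting other variants independently according to their allele frequency) can be sketched as follows. This is an illustrative sketch only, not TuneSIM's actual code: the data layout, the function names, and the identifiers such as `block1` and `varA` are hypothetical, and a diploid genome would need two draws per block.

```python
import random
from typing import Dict, List, Tuple

# Assumption: each block lists (haplotype, frequency) pairs, where a haplotype
# is the tuple of variant IDs carried together on that block.
Haplotype = Tuple[str, ...]
BlockTable = Dict[str, List[Tuple[Haplotype, float]]]


def sample_block_haplotypes(blocks: BlockTable, rng: random.Random) -> Dict[str, Haplotype]:
    """Pick one variant combination per block, weighted by its frequency."""
    chosen = {}
    for block_id, options in blocks.items():
        haplotypes = [hap for hap, _ in options]
        weights = [freq for _, freq in options]
        chosen[block_id] = rng.choices(haplotypes, weights=weights, k=1)[0]
    return chosen


def sample_independent_variants(allele_freqs: Dict[str, float], rng: random.Random) -> List[str]:
    """Include each variant independently with probability equal to its allele frequency."""
    return [vid for vid, af in allele_freqs.items() if rng.random() < af]


# Toy inputs, made up for illustration only.
blocks = {
    "block1": [(("varA", "varB"), 0.6), (("varA",), 0.3), ((), 0.1)],
    "block2": [(("varC",), 0.2), ((), 0.8)],
}
allele_freqs = {"varD": 0.05, "varE": 0.4}

rng = random.Random(42)  # fixed seed so a simulated dataset can be regenerated
selected_blocks = sample_block_haplotypes(blocks, rng)
selected_singletons = sample_independent_variants(allele_freqs, rng)
print(selected_blocks, selected_singletons)
```

Sampling whole combinations per block keeps co-inherited variants together, which independent per-variant draws would not.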

RESULTS & DISCUSSION<br />

Simulations with different coverage and indel rates have<br />

been performed and analysed with different pipelines.<br />

Results will be presented.<br />
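Testing a pipeline one variable at a time, for example across coverage and indel rate, amounts to varying each simulation parameter around a fixed baseline. A minimal sketch of building such a grid is given below; the parameter names and values are hypothetical placeholders, not the settings used for the poster.

```python
from typing import Dict, List, Sequence


def one_at_a_time(baseline: Dict[str, float],
                  sweeps: Dict[str, Sequence[float]]) -> List[Dict[str, float]]:
    """Return the baseline plus configurations that each change a single parameter."""
    configs = [dict(baseline)]
    for name, values in sweeps.items():
        for value in values:
            if value == baseline[name]:
                continue  # the baseline itself is already in the list
            config = dict(baseline)
            config[name] = value
            configs.append(config)
    return configs


# Placeholder values for illustration only.
baseline = {"coverage": 30, "indel_fraction": 0.1}
sweeps = {"coverage": [10, 30, 60], "indel_fraction": [0.05, 0.1, 0.2]}
configs = one_at_a_time(baseline, sweeps)
for config in configs:
    print(config)
```

Each resulting configuration differs from the baseline in at most one parameter, so a change in calling accuracy can be attributed to that parameter alone.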

REFERENCES<br />

Gilissen, et al. (2012). Disease gene identification strategies for exome<br />

sequencing. Eur J Hum Genet, 20, 490–497.<br />

Gilissen, et al. (2014). Genome sequencing identifies major causes of<br />

severe intellectual disability. Nature, 511, 344–347.<br />

Ng, S. B., et al. (2009). Exome sequencing identifies the cause of a<br />

mendelian disorder. Nature Genetics, 42, 30–35.<br />

Purcell, et al. (2007). PLINK: a tool set for whole-genome association<br />

and population-based linkage analyses. American Journal of Human<br />

Genetics, 81, 559–575.<br />

Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP:<br />

a database of single nucleotide polymorphisms. Nucleic Acids<br />

Research, 28, 352–355.<br />

Yang, et al. (2013). Clinical Whole-Exome Sequencing for the Diagnosis<br />

of Mendelian Disorders. N Engl J Med, 369, 1502–1511.<br />
