BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference <strong>bbc</strong> <strong>2015</strong><br />
P17. TUNESIM - TUNABLE VARIANT SET SIMULATOR FOR NGS READS<br />
Bertrand Escaliere 1,2 , Nicolas Simonis 1,3 , Gianluca Bontempi 1,2 & Guillaume Smits 1,4 .<br />
Interuniversity Institute of Bioinformatics in Brussels 1 ; Machine Learning Group, Université Libre de Bruxelles 2 ; Institut<br />
de Pathologie et de Génétique 3 ; Hôpital Universitaire des Enfants Reine Fabiola, Université Libre de Bruxelles 4 .<br />
Optimizing NGS analysis software and pipelines is crucial for improving the discovery of (new) disease-causing<br />
variants. A better combination of existing tools and the right choice of parameters can lead to more specific and<br />
sensitive calling. Simulated datasets support the step-by-step development of new alignment or calling software. A<br />
simulator able to insert known human variants at a realistic minor allele frequency, and artificial variants in a tunable, controlled<br />
way, would overcome three limits to optimization by providing: complete knowledge of the input dataset, so that exact<br />
calling sensitivity and accuracy can be determined; optimization on the appropriate population; and the capacity to test a<br />
pipeline dynamically, one variable at a time.<br />
INTRODUCTION<br />
Identifying the anomalies that cause genetic disorders is<br />
difficult. It can be hampered by the rarity of the disorder<br />
concerned, by the genetic heterogeneity of the disorder, or by<br />
the phenotypic pleiotropy associated with anomalies in a<br />
single gene. Exome and genome sequencing have allowed the<br />
identification of the causes of many genetic diseases whose<br />
origin had remained inaccessible to the usual techniques of<br />
genetic research (Ng et al., 2009; Gilissen et al., 2012;<br />
Yang et al., 2013; Gilissen et al., 2014).<br />
Exome and genome sequencing data analysis<br />
pipelines consist of several steps (roughly:<br />
alignment, quality filtering, variant calling), and several<br />
software tools are available for each step. Evaluating and<br />
comparing these tools is crucial for improving<br />
pipeline accuracy. Exome and genome sequencing<br />
simulations should make it possible to determine the veracity of<br />
called variants (false positives and false negatives).<br />
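As an illustration of why a fully known input helps, the sketch below compares a set of called variants against the set of simulated (true) variants and counts false positives and false negatives. It is a minimal example and not part of TuneSIM: the function name `evaluate_calls`, the variant tuple layout and the toy variants are assumptions made for illustration, and variants are matched by exact equality only.

```python
from typing import NamedTuple, Set, Tuple

# Assumed representation: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


class CallingMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    sensitivity: float
    precision: float


def evaluate_calls(truth: Set[Variant], called: Set[Variant]) -> CallingMetrics:
    """Compare called variants with the simulated truth set (exact matching)."""
    tp = len(truth & called)   # simulated variants that were called
    fp = len(called - truth)   # called variants that were never simulated
    fn = len(truth - called)   # simulated variants that were missed
    sensitivity = tp / (tp + fn) if truth else 0.0
    precision = tp / (tp + fp) if called else 0.0
    return CallingMetrics(tp, fp, fn, sensitivity, precision)


# Toy data, invented for this example
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "GA")}
called = {("chr1", 100, "A", "G"), ("chr2", 40, "G", "GA"), ("chr3", 7, "T", "C")}

metrics = evaluate_calls(truth, called)
print(metrics)
```

In practice, indels and multi-allelic sites usually need normalization before an exact comparison like this is meaningful.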
METHODS<br />
We implemented TuneSIM, a wrapper around the NGS read simulator<br />
dwgsim (http://sourceforge.net/projects/dnaa/) that inserts<br />
realistic mutations. Generated reads contain<br />
real mutations from the 1KG project and dbSNP138. Read<br />
generation itself is delegated to the existing tool dwgsim. To<br />
generate data that are as realistic as possible, we decided to preserve<br />
the haplotype block structure. We computed blocks with<br />
Plink (Purcell et al., 2007) from the VCF files of European<br />
individuals in 1KG project phase 3. For each block, we<br />
obtained the frequency of each combination of variants and<br />
used these frequencies to select blocks. We also<br />
insert variants independently, according to their<br />
frequencies in dbSNP (Smigielski et al., 2000). Using 33<br />
in-house samples, we computed the global allele frequency<br />
distributions of variants in coding and non-coding regions,<br />
and we select variants according to those distributions.<br />
A similar procedure was applied to CNV insertion<br />
using 1KG data. We are developing a web interface<br />
that lets users download existing generated datasets.<br />
After running their pipelines, they can upload the output<br />
and see the accuracy of their pipelines.<br />
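The two selection steps described above (frequency-weighted choice of a variant combination per haplotype block, and independent insertion of other variants) can be sketched as follows. This is a minimal sketch under stated assumptions, not the TuneSIM implementation: the data layout, the function names and the variant identifiers are hypothetical, a single haplotype is drawn per block, and each independent variant is treated as a Bernoulli draw with its frequency as probability.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: block id -> list of (variant combination, frequency).
# A combination is the tuple of variant ids carried together on one haplotype.
BlockTable = Dict[str, Sequence[Tuple[Tuple[str, ...], float]]]


def sample_block_haplotypes(blocks: BlockTable, rng: random.Random) -> List[str]:
    """Draw one variant combination per block, weighted by its frequency."""
    chosen: List[str] = []
    for block_id in sorted(blocks):
        combinations, weights = zip(*blocks[block_id])
        picked = rng.choices(combinations, weights=weights, k=1)[0]
        chosen.extend(picked)
    return chosen


def sample_independent_variants(freqs: Dict[str, float], rng: random.Random) -> List[str]:
    """Include each variant independently with probability equal to its frequency."""
    return [variant for variant in sorted(freqs) if rng.random() < freqs[variant]]


# Toy inputs, invented for this example
blocks: BlockTable = {
    "block1": [(("rs1", "rs2"), 0.6), ((), 0.3), (("rs1",), 0.1)],
    "block2": [(("rs3",), 1.0)],
}
freqs = {"rs10": 1.0, "rs11": 0.0, "rs12": 0.25}

rng = random.Random(42)  # fixed seed so a simulated dataset can be regenerated
print(sample_block_haplotypes(blocks, rng))
print(sample_independent_variants(freqs, rng))
```

Sampling whole combinations rather than individual variants is what keeps co-inherited variants together. A diploid genome would need two draws per block.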
RESULTS & DISCUSSION<br />
Simulations with different coverage levels and indel rates have<br />
been performed and analysed with different pipelines.<br />
Results will be presented.<br />
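One way to organize such simulations so that a pipeline is tested one variable at a time is to start from a baseline configuration and change a single parameter per run. The sketch below is illustrative only: the parameter names `coverage` and `indel_fraction` and all values are hypothetical, and they are not the settings used in this work or actual dwgsim option names.

```python
from typing import Dict, List, Sequence


def one_at_a_time(
    baseline: Dict[str, float],
    alternatives: Dict[str, Sequence[float]],
) -> List[Dict[str, float]]:
    """Return the baseline run plus runs that each change exactly one parameter."""
    runs = [dict(baseline)]
    for name, values in alternatives.items():
        for value in values:
            if value == baseline[name]:
                continue  # already covered by the baseline run
            run = dict(baseline)
            run[name] = value
            runs.append(run)
    return runs


# Hypothetical simulation settings
baseline = {"coverage": 30, "indel_fraction": 0.10}
alternatives = {
    "coverage": [10, 30, 60],
    "indel_fraction": [0.05, 0.10, 0.20],
}

runs = one_at_a_time(baseline, alternatives)
for run in runs:
    print(run)
```

Because every run differs from the baseline in at most one setting, a change in calling accuracy can be attributed to that setting.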
REFERENCES<br />
Gilissen, C., et al. (2012). Disease gene identification strategies for exome<br />
sequencing. European Journal of Human Genetics, 20, 490–497.<br />
Gilissen, C., et al. (2014). Genome sequencing identifies major causes of<br />
severe intellectual disability. Nature, 511, 344–347.<br />
Ng, S. B., et al. (2009). Exome sequencing identifies the cause of a<br />
mendelian disorder. Nature Genetics, 42, 30–35.<br />
Purcell, S., et al. (2007). PLINK: a tool set for whole-genome association<br />
and population-based linkage analyses. American Journal of Human<br />
Genetics, 81, 559–575.<br />
Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP:<br />
a database of single nucleotide polymorphisms. Nucleic Acids<br />
Research, 28, 352–355.<br />
Yang, Y., et al. (2013). Clinical whole-exome sequencing for the diagnosis<br />
of mendelian disorders. New England Journal of Medicine, 369, 1502–1511.<br />