BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference <strong>bbc</strong> <strong>2015</strong><br />
P54. TOWARDS A BELGIAN REFERENCE SET<br />
Erika Souche 1* , Amin Ardeshirdavani 2 , Yves Moreau 2 , Gert Matthijs 1 & Joris Vermeesch 1 .<br />
Department of Human Genetics, KU Leuven 1 ; ESAT-STADIUS Center for Dynamical Systems, Signal Processing and<br />
Data Analytics, KU Leuven 2 . * Erika.souche@uzleuven.be<br />
Next-Generation Sequencing (NGS) is increasingly used to study and diagnose human disorders. Because the simultaneous<br />
sequencing of a large number of genes leads to the detection of a large number of variants, the bottleneck has moved<br />
from sequencing to variant interpretation and classification. Although publicly available databases of variant<br />
frequencies help distinguish causative mutations from common variants, they often lack population-specific variant<br />
frequencies. To circumvent this shortage of population-specific information, most genetic centers exploit their sequence<br />
data of unrelated and unaffected individuals to filter out common local variants. However, the<br />
files/databases are rarely shared and they are mainly based on whole exome data. In this project we demonstrate the<br />
utility of a local variant database generated from whole exome data, describe a procedure allowing the sharing of<br />
information between genetic centers and mine low coverage whole genome data for common variants.<br />
INTRODUCTION<br />
Next-Generation Sequencing (NGS) is increasingly used<br />
to study and diagnose human disorders. Because the<br />
simultaneous sequencing of a large number of genes leads<br />
to the detection of a large number of variants, the bottleneck<br />
has moved from sequencing to variant interpretation and<br />
classification. Publicly available databases of variant<br />
frequencies provided by, among others, the Exome<br />
Sequencing Project (ESP), the 1000 Genomes Project<br />
(McVean et al., 2012) or dbSNP (Sherry et al., 2001) help<br />
distinguish causative mutations from common variants,<br />
identifying up to 78% of variants as common for a Belgian<br />
exome. However, these data sets often lack population<br />
specific variant frequencies and are outperformed by<br />
databases of local variants. For example, using GoNL<br />
(The Genome of the Netherlands Consortium, 2014) alone<br />
allowed the identification of up to 85% of variants as<br />
common for the same Belgian exome. The fact that the<br />
GoNL is based on only 498 individuals further highlights<br />
the importance of building and using population specific<br />
databases.<br />
Such population specific data can be retrieved from locally<br />
sequenced individuals that underwent Whole Exome<br />
Sequencing (WES) or Whole Genome Sequencing (WGS).<br />
Storing only the frequencies and genotype counts of the<br />
variants provides a valuable tool for variant classification<br />
while no sensitive information on the individuals is<br />
included.<br />
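As an illustration of how such a frequency-only resource can be used, the sketch below splits the variants of one exome into common and rare sets by looking them up in a population frequency table. The 1% cutoff, the function name and the toy variants are assumptions made for this example and are not taken from the abstract.

```python
COMMON_AF = 0.01  # hypothetical cutoff for calling a variant "common"


def split_common(sample_variants, population_af, threshold=COMMON_AF):
    """Split variants into (common, rare) using population allele frequencies.

    sample_variants: iterable of (chrom, pos, ref, alt) keys.
    population_af:   dict mapping the same keys to an allele frequency.
    Variants absent from the table are treated as rare.
    """
    common, rare = [], []
    for variant in sample_variants:
        if population_af.get(variant, 0.0) >= threshold:
            common.append(variant)
        else:
            rare.append(variant)
    return common, rare


# Toy data, invented for the example.
population_af = {
    ("21", 1000, "A", "G"): 0.12,
    ("21", 2000, "C", "T"): 0.004,
    ("21", 3000, "G", "A"): 0.30,
}
exome = [
    ("21", 1000, "A", "G"),
    ("21", 2000, "C", "T"),
    ("21", 3000, "G", "A"),
    ("21", 4000, "T", "C"),
]
common, rare = split_common(exome, population_af)
fraction_common = len(common) / len(exome)
```

Only the lookup table is needed at filtering time, so no individual genotype has to leave the center that produced it.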
METHODS<br />
WES data of 350 unrelated and unaffected individuals<br />
were parsed. All samples were analysed in a similar<br />
way, i.e. reads were aligned to the reference genome with<br />
BWA (Li & Durbin, 2009) and genotyping was performed<br />
according to GATK best practices (McKenna et al., 2010;<br />
DePristo et al., 2011). All samples were genotyped at all<br />
polymorphic positions using GATK HaplotypeCaller and<br />
GenotypeGVCFs. For each position, samples with low<br />
quality genotype were considered as not genotyped and<br />
excluded from the genotype counts. The number of<br />
alternate alleles, allele counts and genotype counts were<br />
compiled in a population VCF file, in which individual<br />
genotypes are not accessible.<br />
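The per-site summary described above can be sketched as follows. This is a minimal illustration for a biallelic site, not the authors' pipeline: the genotype-quality cutoff of 20, the function name and the field names (AC, AN and AF follow common VCF usage) are assumptions.

```python
from collections import Counter

MIN_GQ = 20  # hypothetical genotype-quality cutoff; the abstract does not state one


def summarise_site(calls, min_gq=MIN_GQ):
    """Summarise one biallelic site from (genotype, GQ) pairs.

    Genotypes are alternate-allele dosages: 0 = hom-ref, 1 = het, 2 = hom-alt.
    Calls below min_gq are treated as not genotyped and left out of all counts.
    """
    kept = [gt for gt, gq in calls if gq >= min_gq]
    counts = Counter(kept)
    alt_count = counts[1] + 2 * counts[2]   # alternate alleles observed
    total_alleles = 2 * len(kept)           # diploid samples with a usable call
    return {
        "hom_ref": counts[0],
        "het": counts[1],
        "hom_alt": counts[2],
        "AC": alt_count,
        "AN": total_alleles,
        "AF": alt_count / total_alleles if total_alleles else None,
    }


# Five samples; the fourth call (GQ 5) is dropped as low quality.
site = summarise_site([(0, 99), (1, 45), (2, 60), (1, 5), (0, 30)])
```

The returned record holds only aggregate counts, which is what makes the resulting file shareable without exposing individual genotypes.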
Variant frequencies can also be extracted from low<br />
coverage WGS. As a pilot we processed the data of<br />
chromosome 21 of about 4,000 WGS samples. The mapping was<br />
performed with BWA (Li & Durbin, 2009) and the BAM<br />
files were merged in batches of 200 samples. All positions were<br />
genotyped using freebayes (Garrison & Marth, 2012).<br />
Genotype information of all locations outside low<br />
complexity regions was then compiled for all samples<br />
using the integration of Apache Hadoop, HBase and Hive<br />
(see poster “Big data solutions for variant discovery from<br />
low coverage sequencing data, by integration of Hadoop,<br />
Hbase and Hive”). Several models were then used to<br />
distinguish real variants from sequencing errors: the Minor<br />
Allele Frequency (MAF), the transition/transversion ratio,<br />
the expected number of loci with a MAF of 5%, etc.<br />
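Two of the quantities named above, the MAF and the transition/transversion (Ts/Tv) ratio, are simple to compute from a list of candidate single-nucleotide variants. The sketch below shows one way to do so; the function names and toy data are assumptions, and how the resulting values were used to separate real variants from errors is not specified in the abstract.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    af = alt_count / total_alleles
    return min(af, 1.0 - af)


def ts_tv_ratio(snvs):
    """Ts/Tv ratio for a list of (ref, alt) single-nucleotide variants."""
    ts = sum(1 for ref, alt in snvs if (ref, alt) in TRANSITIONS)
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")


# Toy candidate set, invented for the example: four transitions, two transversions.
candidates = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
ratio = ts_tv_ratio(candidates)
maf = minor_allele_frequency(alt_count=6, total_alleles=8)
```

Each base has one possible transition and two possible transversions, so a call set dominated by random errors would be expected to drift towards a ratio of 0.5, which is why the ratio is commonly used as a quality indicator.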
RESULTS & DISCUSSION<br />
We demonstrated the effect of our reference set on several<br />
exomes. The inclusion of only 350 individuals allowed the<br />
identification of about 3% additional common variants,<br />
not listed as common by ESP, dbSNP (Sherry et al., 2001),<br />
1000 Genomes (McVean et al., 2012) and GoNL (The<br />
Genome of the Netherlands Consortium, 2014). Since only<br />
the frequencies of the variants in the screened populations<br />
are reported, this file can easily be shared between<br />
laboratories. Besides, the procedure used to generate the<br />
population VCF file can easily be applied to several<br />
genetic centers in order to generate a common population<br />
VCF file, as planned within the BeMGI project.<br />
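Because each center reports only genotype counts, combining files amounts to summing counts per variant. The sketch below is a minimal illustration under that assumption; the data layout, function names and toy numbers are hypothetical and do not describe the BeMGI procedure.

```python
def merge_population_counts(*centres):
    """Sum per-variant genotype counts across centres.

    Each centre is a dict mapping (chrom, pos, ref, alt) to a tuple
    (hom_ref, het, hom_alt). A variant missing from a centre contributes
    nothing, since it is unknown whether it was genotyped there.
    """
    merged = {}
    for centre in centres:
        for key, counts in centre.items():
            previous = merged.get(key, (0, 0, 0))
            merged[key] = tuple(a + b for a, b in zip(previous, counts))
    return merged


def allele_frequency(counts):
    """Alternate allele frequency from (hom_ref, het, hom_alt) counts."""
    hom_ref, het, hom_alt = counts
    total_alleles = 2 * (hom_ref + het + hom_alt)
    return (het + 2 * hom_alt) / total_alleles if total_alleles else None


# Toy counts from two hypothetical centres.
centre_a = {("21", 1000, "A", "G"): (300, 45, 5)}
centre_b = {
    ("21", 1000, "A", "G"): (180, 18, 2),
    ("21", 2000, "C", "T"): (199, 1, 0),
}
merged = merge_population_counts(centre_a, centre_b)
shared_af = allele_frequency(merged[("21", 1000, "A", "G")])
```

Summing counts rather than averaging frequencies weights each center by the number of individuals it actually genotyped at that position.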
Finally, we expect that the data from WGS will further<br />
increase the performance of our reference set. A genome-wide<br />
variant frequency file from the local population will<br />
become worthwhile when WGS is routinely used in<br />
diagnostics.<br />
REFERENCES<br />
DePristo M et al. Nature Genetics 43, 491-498 (2011).<br />
Exome Variant Server, NHLBI Exome Sequencing Project (ESP), Seattle,<br />
WA (URL: http://evs.gs.washington.edu/EVS/).<br />
Garrison E & Marth G http://arxiv.org/abs/1207.3907 (2012).<br />
Li H & Durbin R Bioinformatics 25, 1754-60 (2009).<br />
McKenna A et al. Genome Research 20, 1297-303 (2010).<br />
McVean et al. Nature 491, 56–65 (2012).<br />
Sherry ST et al. Nucleic Acids Res. 29, 308-11 (2001).<br />
The Genome of the Netherlands Consortium. Nature Genetics 46,<br />
818–825 (2014).<br />