
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015

Abstract ID: P

Poster

10th Benelux Bioinformatics Conference bbc 2015

P54. TOWARDS A BELGIAN REFERENCE SET

Erika Souche 1*, Amin Ardeshirdavani 2, Yves Moreau 2, Gert Matthijs 1 & Joris Vermeesch 1.

Department of Human Genetics, KU Leuven 1; ESAT-STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven 2. * Erika.souche@uzleuven.be

Next-Generation Sequencing (NGS) is increasingly used to study and diagnose human disorders. Because the simultaneous sequencing of a large number of genes leads to the detection of a large number of variants, the bottleneck has moved from sequencing to variant interpretation and classification. Although publicly available databases of variant frequencies help distinguish causative mutations from common variants, they often lack population-specific variant frequencies. To circumvent this shortage of population-specific information, most genetic centers use their own sequence data from unrelated and unaffected individuals to filter out common local variants. However, these files/databases are rarely shared and are mainly based on whole exome data. In this project we demonstrate the utility of a local variant database generated from whole exome data, describe a procedure allowing the sharing of information between genetic centers, and mine low-coverage whole genome data for common variants.

INTRODUCTION

Next-Generation Sequencing (NGS) is increasingly used to study and diagnose human disorders. Because the simultaneous sequencing of a large number of genes leads to the detection of a large number of variants, the bottleneck has moved from sequencing to variant interpretation and classification. Publicly available databases of variant frequencies provided by, among others, the Exome Sequencing Project (ESP), the 1000 Genomes Project (McVean et al., 2012) or dbSNP (Sherry et al., 2001) help distinguish causative mutations from common variants, identifying up to 78% of variants as common for a Belgian exome. However, these data sets often lack population-specific variant frequencies and are outperformed by databases of local variants. For example, using GoNL (The Genome of the Netherlands Consortium, 2014) alone allowed the identification of up to 85% of variants as common for the same Belgian exome. The fact that GoNL is based on only 498 individuals further highlights the importance of building and using population-specific databases.

Such population-specific data can be retrieved from locally sequenced individuals who underwent Whole Exome Sequencing (WES) or Whole Genome Sequencing (WGS). Storing only the frequencies and genotype counts of the variants provides a valuable tool for variant classification without including any sensitive information on the individuals.

METHODS

WES data of 350 unrelated and unaffected individuals were parsed. All samples were analysed in a similar way, i.e. reads were aligned to the reference genome with BWA (Li & Durbin, 2009) and genotyping was performed according to GATK best practices (McKenna et al., 2010; DePristo et al., 2011). All samples were genotyped at all polymorphic positions using GATK HaplotypeCaller and GenotypeGVCFs. For each position, samples with a low-quality genotype were considered as not genotyped and excluded from the genotype counts. The number of alternate alleles, the allele counts and the genotype counts were compiled in a population VCF file, in which individual genotypes are not accessible.
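The aggregation step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `summarise_site`, the input layout and the genotype-quality cut-off `MIN_GQ` are assumptions made for illustration only (the abstract does not state the threshold used), and only biallelic sites are handled.

```python
from collections import Counter

# Hypothetical cut-off: the abstract does not say which quality threshold was used.
MIN_GQ = 20


def summarise_site(calls, min_gq=MIN_GQ):
    """Collapse per-sample calls at one biallelic site into population counts.

    `calls` is a list of (genotype, gq) pairs, where genotype is the number of
    alternate alleles carried (0, 1 or 2), or None when no call was made.
    """
    genotype_counts = Counter({0: 0, 1: 0, 2: 0})
    for genotype, gq in calls:
        # Missing or low-quality genotypes are treated as not genotyped.
        if genotype is None or gq < min_gq:
            continue
        genotype_counts[genotype] += 1

    n_genotyped = sum(genotype_counts.values())
    allele_number = 2 * n_genotyped  # diploid: two alleles per genotyped sample
    allele_count = genotype_counts[1] + 2 * genotype_counts[2]
    allele_freq = allele_count / allele_number if allele_number else None
    return {
        "AN": allele_number,
        "AC": allele_count,
        "AF": allele_freq,
        "hom_ref": genotype_counts[0],
        "het": genotype_counts[1],
        "hom_alt": genotype_counts[2],
    }


# Toy input: six samples, two of which fail the quality filter or have no call.
calls = [(0, 99), (1, 45), (2, 60), (1, 5), (None, 0), (0, 30)]
summary = summarise_site(calls)
print(summary)
```

Only the returned summary would be written to the shared file, so no per-individual genotype leaves the center.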

Variant frequencies can also be extracted from low-coverage WGS. As a pilot, we processed the chromosome 21 data of about 4,000 WGS samples. Mapping was performed with BWA (Li & Durbin, 2009) and the BAM files were merged in batches of 200 samples. All positions were genotyped using freebayes (Garrison & Marth, 2012). Genotype information for all locations outside low-complexity regions was then compiled for all samples using the integration of Apache Hadoop, HBase and Hive (see poster “Big data solutions for variant discovery from low coverage sequencing data, by integration of Hadoop, Hbase and Hive”). Several models were then used to distinguish real variants from sequencing errors, based on, among others, the Minor Allele Frequency (MAF), the transition/transversion ratio and the expected number of loci with a MAF of 5%.
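Two of the statistics named above, the MAF and the transition/transversion (Ts/Tv) ratio, can be computed as in the following Python sketch. The abstract does not describe how these quantities were combined into models, so the sketch only shows the statistics themselves; the function names are hypothetical and the input is assumed to contain single-nucleotide variants with distinct reference and alternate bases.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """A transition stays within purines (A<->G) or within pyrimidines (C<->T)."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES


def ts_tv_ratio(snvs):
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ts
    return ts / tv if tv else float("inf")


def minor_allele_frequency(alt_count, allele_number):
    """Frequency of the rarer of the two alleles at a biallelic site."""
    af = alt_count / allele_number
    return min(af, 1 - af)


# Toy candidate set: three transitions and one transversion.
snvs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
ratio = ts_tv_ratio(snvs)
maf = minor_allele_frequency(6, 8)
print(ratio, maf)
```

A candidate set dominated by sequencing errors would be expected to show a Ts/Tv ratio well below the value of roughly 2 commonly reported for human genome-wide variants, which is why this ratio can serve as a sanity check on a call set.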

RESULTS & DISCUSSION

We demonstrated the effect of our reference set on several exomes. The inclusion of only 350 individuals allowed the identification of about 3% additional common variants that were not listed as common by ESP, dbSNP (Sherry et al., 2001), 1000 Genomes (McVean et al., 2012) or GoNL (The Genome of the Netherlands Consortium, 2014). Since only the frequencies of the variants in the screened populations are reported, the population VCF file can easily be shared between laboratories. Moreover, the procedure used to generate the population VCF file can easily be applied in several genetic centers in order to generate a common population VCF file, as planned within the BeMGI project.
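The comparison reported above, i.e. how many more variants of an exome are flagged as common once a local frequency set is added to the public ones, can be sketched in Python as follows. The threshold `COMMON_AF`, the function names and the dictionaries are hypothetical, and the values are toy data rather than results from the study.

```python
# Hypothetical frequency threshold above which a variant is called "common".
COMMON_AF = 0.01


def is_common(variant, sources, threshold=COMMON_AF):
    """True if any frequency source lists the variant at or above the threshold."""
    return any(source.get(variant, 0.0) >= threshold for source in sources)


def fraction_common(variants, sources, threshold=COMMON_AF):
    """Fraction of an exome's variants flagged as common by the given sources."""
    if not variants:
        return 0.0
    n_common = sum(is_common(v, sources, threshold) for v in variants)
    return n_common / len(variants)


# Toy data: variants keyed by (chromosome, position, ref, alt).
exome = [("21", 100, "A", "G"), ("21", 200, "C", "T"),
         ("21", 300, "G", "A"), ("21", 400, "T", "C")]
public_db = {("21", 100, "A", "G"): 0.20, ("21", 200, "C", "T"): 0.05,
             ("21", 300, "G", "A"): 0.001}
local_db = {("21", 300, "G", "A"): 0.03}  # common locally, rare in public data

public_only = fraction_common(exome, [public_db])
with_local = fraction_common(exome, [public_db, local_db])
print(public_only, with_local)
```

Because each source is just a mapping from variant to frequency, a population VCF file received from another center could be added to `sources` without exchanging any individual-level data.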

Finally, we expect that the data from WGS will further increase the performance of our reference set. A genome-wide variant frequency file from the local population will become worthwhile when WGS is routinely used in diagnostics.

REFERENCES

DePristo M et al. Nature Genetics 43, 491-498 (2011).

Exome Variant Server, NHLBI Exome Sequencing Project (ESP), Seattle, WA (URL: http://evs.gs.washington.edu/EVS/).

Garrison E & Marth G. http://arxiv.org/abs/1207.3907 (2012).

Li H & Durbin R. Bioinformatics 25, 1754-1760 (2009).

McKenna A et al. Genome Research 20, 1297-1303 (2010).

McVean et al. Nature 491, 56-65 (2012).

Sherry ST et al. Nucleic Acids Res. 29, 308-311 (2001).

The Genome of the Netherlands Consortium. Nature Genetics 46, 818-825 (2014).
