BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015<br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference BBC 2015<br />
P9. MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING<br />
SCHEMES FOR BACTERIA<br />
Esther Camilo dos Reis, Dolf Michielsen, Hannes Pouseele*.<br />
Applied Maths NV, Keistraat 120, 9830 Sint-Martens-Latem, Belgium.<br />
INTRODUCTION<br />
As next-generation sequencing in general, and whole<br />
genome sequencing (WGS) in particular, is increasingly<br />
adopted in public health for routine surveillance tasks,<br />
there is a clear need to incorporate this new technology in<br />
the day-to-day operational workflow of a public health<br />
institute. As cluster detection based on WGS data is<br />
evolving into a commodity, thanks to technologies such as<br />
whole genome multi-locus sequence typing (wgMLST),<br />
the question remains how WGS-based data analysis<br />
can be used to build a human-friendly yet high-precision<br />
and epidemiologically consistent naming<br />
strategy for communication purposes.<br />
METHODS<br />
For various organisms, the use of so-called ‘SNP<br />
addresses’ (based on single nucleotide polymorphisms or<br />
SNPs) has been proposed to build up a hierarchical<br />
naming scheme (see [1], [2]). This idea relies on single<br />
linkage clustering of isolates at different levels of<br />
similarity or distance, hence leading to a hierarchical name.<br />
However, the main difficulties are defining the<br />
appropriate levels of similarity to cluster on, and the<br />
dependence of the naming scheme on the samples at hand.<br />
Moreover, SNP data might not be the best basis for such<br />
a scheme because of their relatively high volatility.<br />
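The hierarchical naming idea described above can be illustrated with a minimal sketch. It assumes a precomputed pairwise distance matrix (SNP or allele differences); the thresholds `[50, 10, 5]`, the toy matrix, and the function names are hypothetical and chosen only for illustration, not taken from [1] or [2]. Single linkage clustering is implemented here as connected components of the graph that links every pair of isolates at or below a threshold.

```python
from itertools import combinations


def single_linkage_labels(dist, threshold):
    """Label samples by single linkage: two samples share a cluster when a
    chain of pairs with distance <= threshold connects them."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if dist[i][j] <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Number the clusters in order of first appearance.
    ids = {}
    labels = []
    for i in range(n):
        root = find(i)
        if root not in ids:
            ids[root] = len(ids) + 1
        labels.append(ids[root])
    return labels


def hierarchical_names(dist, thresholds):
    """Concatenate cluster labels from the loosest to the strictest threshold."""
    levels = [single_linkage_labels(dist, t)
              for t in sorted(thresholds, reverse=True)]
    return [".".join(str(level[i]) for level in levels)
            for i in range(len(dist))]


# Toy distance matrix for five isolates (hypothetical values).
dist = [
    [0, 2, 8, 40, 41],
    [2, 0, 7, 42, 43],
    [8, 7, 0, 39, 40],
    [40, 42, 39, 0, 3],
    [41, 43, 40, 3, 0],
]
names = hierarchical_names(dist, thresholds=[50, 10, 5])
print(names)
```

Because single linkage clusters at a strict threshold are always nested inside those at a looser one, isolates that share a longer name prefix are more closely related. The sketch also makes the two difficulties visible: the names depend on the chosen thresholds and on which isolates are present in the matrix.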
In this work, we present a mathematical framework to<br />
define the levels of similarity upon which single linkage<br />
clustering makes sense. For this, we model the observed<br />
multimodal distribution of pairwise similarities between<br />
samples to obtain a theoretical model of the similarity<br />
distribution, and from there infer the most likely breaking<br />
points for stable similarity cutoffs. This is done in a data-independent<br />
manner, and is therefore applicable to SNP<br />
data, but also to wgMLST data and even gene presence-absence<br />
data. We assess the stability of the naming<br />
scheme by using a cross-validation approach.<br />
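The abstract does not specify the model used for the similarity distribution. As a rough illustration of the general idea only, the sketch below swaps in a simple stand-in: a Gaussian kernel density estimate of the pairwise distances, with the valleys (local minima) between modes taken as candidate cutoffs. The bandwidth, the grid resolution, the toy distances, and all function names are assumptions made for this example.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function evaluating a Gaussian kernel density estimate."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def valley_cutoffs(values, bandwidth, steps=200):
    """Return grid points where the estimated density has a strict local
    minimum, i.e. candidate cutoffs between modes of the distribution."""
    density = gaussian_kde(values, bandwidth)
    lo, hi = min(values), max(values)
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    dens = [density(x) for x in grid]
    return [
        grid[k]
        for k in range(1, steps)
        if dens[k] < dens[k - 1] and dens[k] < dens[k + 1]
    ]


# Toy pairwise distances with three modes (hypothetical values).
pairwise = [1, 2, 3, 3, 4, 5, 18, 19, 20, 21, 22, 58, 59, 60, 61, 62]
cutoffs = valley_cutoffs(pairwise, bandwidth=3.0)
print(cutoffs)
```

Since the procedure only consumes a list of pairwise distances, the same code applies unchanged to SNP, wgMLST, or gene presence-absence distances. A stability check in the spirit of the cross-validation mentioned above could repeat the estimate on random subsets of isolates and compare the resulting cutoffs.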
RESULTS & DISCUSSION<br />
We apply our methods to propose a wgMLST-based<br />
naming scheme for Listeria monocytogenes. Using a<br />
reference dataset covering the diversity within Listeria<br />
monocytogenes and an extensive dataset of over 4000<br />
isolates from real-time surveillance, we show the stability<br />
of the naming scheme and its epidemiological<br />
concordance.<br />
REFERENCES<br />
[1] Dallman T et al., Applying phylogenomics to understand the<br />
emergence of Shiga Toxin producing Escherichia coli<br />
O157:H7 strains causing severe human disease in the<br />
United Kingdom. Microbial Genomics. doi: 10.1099/mgen.0.000029<br />
[2] Coll F et al., PolyTB: A genomic variation map for Mycobacterium<br />
tuberculosis. Tuberculosis (Edinb). 2014 May; 94(3): 346–354. doi:<br />
10.1016/j.tube.2014.02.005<br />