BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015

Abstract ID: P

Poster

10th Benelux Bioinformatics Conference bbc 2015

P9. MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING SCHEMES FOR BACTERIA

Esther Camilo dos Reis, Dolf Michielsen, Hannes Pouseele*.

Applied Maths NV, Keistraat 120, 9830 Sint-Martens-Latem, Belgium.

INTRODUCTION

As next-generation sequencing in general, and whole genome sequencing (WGS) in particular, is increasingly adopted in public health for routine surveillance tasks, there is a clear need to incorporate this new technology in the day-to-day operational workflow of a public health institute. As cluster detection based on WGS data is evolving into a commodity, thanks to technologies such as whole genome multi-locus sequence typing (wgMLST), the question remains as to how WGS-based data analysis can be used to build up a human-friendly but high-precision and epidemiologically consistent naming strategy for communication purposes.

METHODS

For various organisms, the use of so-called ‘SNP addresses’ (based on single nucleotide polymorphisms or SNPs) has been proposed to build up a hierarchical naming scheme (see [1], [2]). This idea relies on single linkage clustering of isolates at different levels of similarity or distance, hence leading to a hierarchical name. However, the main difficulties here are defining the appropriate levels of similarity to cluster on, and the dependence of the naming scheme on the samples at hand. Moreover, the SNP approach might not provide the best type of data for this due to its relatively large volatility.
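The idea of a hierarchical name built from single linkage clustering at several distance levels can be sketched as follows. This is a minimal illustration, not the scheme used in [1] or [2]: the isolate names, the pairwise distances and the thresholds (100 and 5) are made up for the example, and the function names are hypothetical.

```python
from itertools import combinations


def single_linkage_labels(names, dist, threshold):
    """Cluster isolates by single linkage: two isolates share a cluster
    when a chain of pairs with distance <= threshold connects them."""
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if dist[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    # Number the clusters in order of first appearance.
    ids = {}
    return {n: ids.setdefault(find(n), len(ids) + 1) for n in names}


def hierarchical_addresses(names, dist, thresholds):
    """Concatenate the cluster numbers from the loosest to the strictest
    threshold into one dotted name per isolate."""
    levels = [single_linkage_labels(names, dist, t)
              for t in sorted(thresholds, reverse=True)]
    return {n: ".".join(str(level[n]) for level in levels) for n in names}


# Illustrative data only: pairwise distances between four isolates.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 40,
    frozenset(("B", "C")): 41,
    frozenset(("A", "D")): 200,
    frozenset(("B", "D")): 201,
    frozenset(("C", "D")): 198,
}
addresses = hierarchical_addresses(names, dist, thresholds=(100, 5))
print(addresses)
```

With these made-up values, A and B share the full name, C shares only the first level with them, and D differs at every level. Adding a new isolate can merge existing clusters, which is the sample dependence mentioned above.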

In this work, we present a mathematical framework to define the levels of similarity upon which single linkage clustering makes sense. For this, we model the observed multimodal distribution of pairwise similarities between samples to obtain a theoretical model of the similarity distribution, and from there infer the most likely breaking points for stable similarity cutoffs. This is done in a data-independent manner, and is therefore applicable to SNP data, but also to wgMLST data and even gene presence-absence data. We assess the stability of the naming scheme by using a cross-validation approach.
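One way to picture the step from a multimodal distribution of pairwise values to a cutoff is sketched below. The abstract does not specify the model, so everything here is an assumption made for illustration: a two-component one-dimensional Gaussian mixture fitted by expectation-maximisation, a cutoff placed where the two weighted densities are equal, and a small invented list of pairwise distances. The function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, iterations=200, min_sigma=1e-3):
    """Fit a two-component 1D Gaussian mixture by expectation-maximisation.
    Assumption: two modes are enough; real data may need more components."""
    mu = [min(values), max(values)]
    spread = (max(values) - min(values)) / 4.0 or 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and spread of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weight[k] = nk / len(values)
            mu[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, values)) / nk
            sigma[k] = max(math.sqrt(var), min_sigma)
    return weight, mu, sigma


def cutoff_between(weight, mu, sigma, steps=60):
    """Bisect between the two means for the point where the weighted
    densities are equal, comparing log densities to avoid underflow."""
    def log_diff(x):
        a = math.log(weight[0]) - math.log(sigma[0]) - 0.5 * ((x - mu[0]) / sigma[0]) ** 2
        b = math.log(weight[1]) - math.log(sigma[1]) - 0.5 * ((x - mu[1]) / sigma[1]) ** 2
        return a - b

    lo, hi = mu[0], mu[1]
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if log_diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Invented pairwise distances: one mode of close pairs, one of distant pairs.
distances = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8, 90, 95, 98, 100, 100, 102, 105, 110]
weight, mu, sigma = fit_two_gaussians(distances)
cutoff = cutoff_between(weight, mu, sigma)
print(mu, cutoff)
```

On this toy input the cutoff falls in the gap between the two modes. Because the procedure only sees a list of pairwise values, the same sketch would apply to SNP distances, wgMLST allele differences or presence-absence distances.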

RESULTS & DISCUSSION

We apply our methods to propose a wgMLST-based naming scheme for Listeria monocytogenes. Using a reference dataset of the diversity within Listeria monocytogenes, and an extensive dataset of over 4000 isolates from real-time surveillance, we show the stability of the naming scheme and its epidemiological concordance.

REFERENCES

[1] Dallman T et al., Applying phylogenomics to understand the emergence of Shiga Toxin producing Escherichia coli O157:H7 strains causing severe human disease in the United Kingdom. Microbial Genomics. doi: 10.1099/mgen.0.000029

[2] Coll F et al., PolyTB: A genomic variation map for Mycobacterium tuberculosis. Tuberculosis (Edinb). 2014 May; 94(3): 346–354. doi: 10.1016/j.tube.2014.02.005
