bbc 2015
BBC2015_booklet
BBC2015_booklet
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference <strong>bbc</strong> <strong>2015</strong><br />
P32. ON THE LZ DISTANCE FOR DEREPLICATING<br />
REDUNDANT PROKARYOTIC GENOMES<br />
Raphaël R. Léonard 1,2* , Damien Sirjacobs², Eric Sauvage 1 , Frédéric Kerff 1 & Denis Baurain².<br />
Centre for Protein Engineering, University of Liège 1 ; PhytoSYSTEMS, University of Liège 2 . * rleonard@doct.ulg.ac.be<br />
The fast-growing number of available prokaryotic genomes, along with their uneven taxonomic distribution, is a problem<br />
when trying to assemble broadly sampled genome sets for phylogenomics and comparative genomics. Indeed, most of<br />
the new genomes belong to the same subset of hyper-sampled phyla, such as Proteobacteria and Firmicutes, or even to<br />
single species, such as Escherichia coli (almost 2000 genomes as of Sept <strong>2015</strong>), while the continuous flow of newly<br />
discovered phyla prompts for regular updates. This situation makes it difficult to maintain sets of representative genomes<br />
combining lesser known phyla, for which only few species are available, and sound subsets of highly abundant phyla. An<br />
automated straightforward method is required but none are publicly available. The LZ distance, in conjunction with the<br />
quality of the annotations, can be used to create an automated approach for selecting a subset of representative genomes<br />
without redundancy. We are planning to release this tool on a website that will be made publicly available.<br />
INTRODUCTION<br />
The LZ distance (Lempel and Ziv, 1977; Otu and Sayood,<br />
2003) is inspired by compression algorithms, such as gzip<br />
or WinRAR. This distance, amongst others, has already<br />
been used in attempts to produce alignment-free<br />
phylogenetic trees (Bacha and Baurain, 2005; Hohl et al.<br />
2007), though the results were disappointing in such a<br />
context (due to the heterogeneity of the substitution<br />
process at large evolutionary scales). However, the LZ<br />
distance is likely to provide enough resolving power to<br />
identify groups of redundant genomes and to keep only<br />
one representative for each group.<br />
METHODS<br />
For each pair of genomes A and B, the LZ distance is<br />
computed from the gzip-compressed file lengths of the<br />
corresponding nucleotide assemblies s(A) and s(B) and of<br />
their concatenations s(A+B) and s(B+A). These distances,<br />
along with taxonomic information, are stored in a<br />
database.<br />
A clustering method is then applied to regroup the similar<br />
genomes into a user-specified number of groups. For each<br />
of these groups, a representative is chosen based on the<br />
quality of the genomic assemblies (chromosomes rather<br />
than scaffolds) and of the protein annotations (e.g., few<br />
rather than many “unknown proteins”).<br />
RESULTS & DISCUSSION<br />
Our method using the LZ distance is currently under<br />
development using the genomes from the release 28 of<br />
Ensembl Bacteria (ftp://ftp.ensemblgenomes.org/pub/<br />
bacteria/release-28/). It contains 20,950 unique<br />
prokaryotic genomes, composed of 286 Archaea and<br />
20,664 Bacteria. The three most represented phyla are the<br />
Proteobacteria (8642, of which 1980 E. coli), the<br />
Firmicutes (7766) and the Actinobacteria (2673). These<br />
genomes are already the result of a pre-processing step<br />
designed to remove extra assemblies for strains present in<br />
multiple copies (due to parallel sequencing or<br />
resequencing in different labs).<br />
We are working on different approaches for validating our<br />
dereplication method, based on (1) current taxonomy, (2)<br />
16S rRNA phylogeny, and (3) clustering using genomic<br />
signatures (Moreno-Hagelsieb et al. 2013).<br />
First, we compute a central measure of the taxonomic<br />
“purity” of all genome clusters, which reflects the amount<br />
of “mixture” at different taxonomic levels (phylum, class,<br />
order etc). A good clustering should regroup different<br />
genera (or species) without amalgamating distinct classes<br />
(or phyla). Second, we cut the branches of a large 16S<br />
rRNA tree based on the same genome collection to<br />
produce an equal number of groups to compare with our<br />
clustering method. We then compute a statistic of the<br />
overlap between the 16S subtrees and the LZ clusters. A<br />
good clustering should have a reasonable overlap with the<br />
gold standard that is the 16S rRNA tree. Third, using the<br />
same overlap metric, we compare the LZ clusters to<br />
clusters obtained using the genomic signature.<br />
Finally, an interactive tool will be made available through<br />
a website. It will allow the users to download precomputed<br />
sets of representative genomes for either the<br />
complete database or for taxonomic subsets. We are also<br />
planning to allow users to upload their own genomes to<br />
cluster them with the LZ method.<br />
REFERENCES<br />
Ziv, J. and a. Lempel. 1977. ‘A Universal Algorithm for Sequential Data<br />
Compression.’ IEEE Transactions on Information Theory 23.3.<br />
doi:10.1109/TIT.1977.1055714.<br />
Otu, H. H. and K. Sayood. 2003. ‘A New Sequence Distance Measure for<br />
Phylogenetic Tree Construction.’ Bioinformatics 19.16: 2122–2130.<br />
doi:10.1093/bioinformatics/btg295.<br />
Moreno-Hagelsieb, G., Z. Wang, S. Walsh and A. Elsherbiny. 2013.<br />
‘Phylogenomic Clustering for Selecting Non-Redundant Genomes<br />
for Comparative Genomics.’ Bioinformatics 29.1: 947–949.<br />
doi:10.1093/bioinformatics/btt064.<br />
Höhl, M. and M. a Ragan. 2007. ‘Is Multiple-Sequence Alignment<br />
Required for Accurate Inference of Phylogeny?’ Systematic biology<br />
56.2: 206–221. doi:10.1080/10635150701294741.<br />
Bacha, S. and Baurain, D. 2005. ‘Application of Lempel-Ziv complexity<br />
to alignment-free sequence comparison of protein families’.<br />
Benelux Bioinformatics Conference 2005.<br />
http://hdl.handle.net/2268/80179<br />
76