03.12.2015 Views

bbc 2015

BBC2015_booklet

BBC2015_booklet

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />

Abstract ID: P<br />

Poster<br />

10th Benelux Bioinformatics Conference <strong>bbc</strong> <strong>2015</strong><br />

P32. ON THE LZ DISTANCE FOR DEREPLICATING<br />

REDUNDANT PROKARYOTIC GENOMES<br />

Raphaël R. Léonard 1,2* , Damien Sirjacobs², Eric Sauvage 1 , Frédéric Kerff 1 & Denis Baurain².<br />

Centre for Protein Engineering, University of Liège 1 ; PhytoSYSTEMS, University of Liège 2 . * rleonard@doct.ulg.ac.be<br />

The fast-growing number of available prokaryotic genomes, along with their uneven taxonomic distribution, is a problem<br />

when trying to assemble broadly sampled genome sets for phylogenomics and comparative genomics. Indeed, most of<br />

the new genomes belong to the same subset of hyper-sampled phyla, such as Proteobacteria and Firmicutes, or even to<br />

single species, such as Escherichia coli (almost 2000 genomes as of Sept <strong>2015</strong>), while the continuous flow of newly<br />

discovered phyla prompts for regular updates. This situation makes it difficult to maintain sets of representative genomes<br />

combining lesser known phyla, for which only few species are available, and sound subsets of highly abundant phyla. An<br />

automated straightforward method is required but none are publicly available. The LZ distance, in conjunction with the<br />

quality of the annotations, can be used to create an automated approach for selecting a subset of representative genomes<br />

without redundancy. We are planning to release this tool on a website that will be made publicly available.<br />

INTRODUCTION<br />

The LZ distance (Lempel and Ziv, 1977; Otu and Sayood,<br />

2003) is inspired by compression algorithms, such as gzip<br />

or WinRAR. This distance, amongst others, has already<br />

been used in attempts to produce alignment-free<br />

phylogenetic trees (Bacha and Baurain, 2005; Hohl et al.<br />

2007), though the results were disappointing in such a<br />

context (due to the heterogeneity of the substitution<br />

process at large evolutionary scales). However, the LZ<br />

distance is likely to provide enough resolving power to<br />

identify groups of redundant genomes and to keep only<br />

one representative for each group.<br />

METHODS<br />

For each pair of genomes A and B, the LZ distance is<br />

computed from the gzip-compressed file lengths of the<br />

corresponding nucleotide assemblies s(A) and s(B) and of<br />

their concatenations s(A+B) and s(B+A). These distances,<br />

along with taxonomic information, are stored in a<br />

database.<br />

A clustering method is then applied to regroup the similar<br />

genomes into a user-specified number of groups. For each<br />

of these groups, a representative is chosen based on the<br />

quality of the genomic assemblies (chromosomes rather<br />

than scaffolds) and of the protein annotations (e.g., few<br />

rather than many “unknown proteins”).<br />

RESULTS & DISCUSSION<br />

Our method using the LZ distance is currently under<br />

development using the genomes from the release 28 of<br />

Ensembl Bacteria (ftp://ftp.ensemblgenomes.org/pub/<br />

bacteria/release-28/). It contains 20,950 unique<br />

prokaryotic genomes, composed of 286 Archaea and<br />

20,664 Bacteria. The three most represented phyla are the<br />

Proteobacteria (8642, of which 1980 E. coli), the<br />

Firmicutes (7766) and the Actinobacteria (2673). These<br />

genomes are already the result of a pre-processing step<br />

designed to remove extra assemblies for strains present in<br />

multiple copies (due to parallel sequencing or<br />

resequencing in different labs).<br />

We are working on different approaches for validating our<br />

dereplication method, based on (1) current taxonomy, (2)<br />

16S rRNA phylogeny, and (3) clustering using genomic<br />

signatures (Moreno-Hagelsieb et al. 2013).<br />

First, we compute a central measure of the taxonomic<br />

“purity” of all genome clusters, which reflects the amount<br />

of “mixture” at different taxonomic levels (phylum, class,<br />

order etc). A good clustering should regroup different<br />

genera (or species) without amalgamating distinct classes<br />

(or phyla). Second, we cut the branches of a large 16S<br />

rRNA tree based on the same genome collection to<br />

produce an equal number of groups to compare with our<br />

clustering method. We then compute a statistic of the<br />

overlap between the 16S subtrees and the LZ clusters. A<br />

good clustering should have a reasonable overlap with the<br />

gold standard that is the 16S rRNA tree. Third, using the<br />

same overlap metric, we compare the LZ clusters to<br />

clusters obtained using the genomic signature.<br />

Finally, an interactive tool will be made available through<br />

a website. It will allow the users to download precomputed<br />

sets of representative genomes for either the<br />

complete database or for taxonomic subsets. We are also<br />

planning to allow users to upload their own genomes to<br />

cluster them with the LZ method.<br />

REFERENCES<br />

Ziv, J. and a. Lempel. 1977. ‘A Universal Algorithm for Sequential Data<br />

Compression.’ IEEE Transactions on Information Theory 23.3.<br />

doi:10.1109/TIT.1977.1055714.<br />

Otu, H. H. and K. Sayood. 2003. ‘A New Sequence Distance Measure for<br />

Phylogenetic Tree Construction.’ Bioinformatics 19.16: 2122–2130.<br />

doi:10.1093/bioinformatics/btg295.<br />

Moreno-Hagelsieb, G., Z. Wang, S. Walsh and A. Elsherbiny. 2013.<br />

‘Phylogenomic Clustering for Selecting Non-Redundant Genomes<br />

for Comparative Genomics.’ Bioinformatics 29.1: 947–949.<br />

doi:10.1093/bioinformatics/btt064.<br />

Höhl, M. and M. a Ragan. 2007. ‘Is Multiple-Sequence Alignment<br />

Required for Accurate Inference of Phylogeny?’ Systematic biology<br />

56.2: 206–221. doi:10.1080/10635150701294741.<br />

Bacha, S. and Baurain, D. 2005. ‘Application of Lempel-Ziv complexity<br />

to alignment-free sequence comparison of protein families’.<br />

Benelux Bioinformatics Conference 2005.<br />

http://hdl.handle.net/2268/80179<br />

76

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!