bbc 2015

BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P
Poster
10th Benelux Bioinformatics Conference bbc 2015

P8. IDENTIFICATION OF NUMTS THROUGH NGS DATA

Vincent Branders 1,2*, Chedly Kastally 2 & Patrick Mardulyn 2.
Machine Learning Group, Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM), Université catholique de Louvain 1; Evolutionary Biology and Ecology, Université libre de Bruxelles 2.
* vincent.branders@uclouvain.be

Numts are copies of mitochondrial DNA sequences that have been transferred into the nuclear genome. Because they resemble mitochondrial DNA sequences, numts have led to many misinterpretations, ranging from overestimated diversity to a spurious association between cystic fibrosis and mitochondrial genome variation. To avoid the bias induced by numts, these sequences have to be identified. Current methodologies compare existing nuclear and mitochondrial sequences and search for similarities. The new Pacific Biosciences (PacBio) technology generates sequencing reads that span thousands of base pairs. This makes it possible to identify numts by looking for reads that contain a region similar to mitochondrial sequences, surrounded by regions highly dissimilar to them. It should allow the systematic identification of numts without a complete known nuclear reference.

INTRODUCTION
The transfer of DNA from mitochondria to the nucleus generates nuclear copies of mitochondrial DNA (numts). Numts have been found in many species, including yeasts, rodents and plants. Due to their similarity to mitochondrial DNA, numts are responsible for many misinterpretations, both in mitochondrial disease studies and in phylogenetic reconstructions (Hazkani-Covo et al., 2010). Numt variation has commonly been misreported as mitochondrial mutations in patients (Yao et al., 2008).
Moreover, DNA barcoding was found to overestimate the number of species when numts are coamplified (Song et al., 2008). Current methods identify such sequences by aligning mitochondrial sequences against the nuclear genome and identifying similar regions (Figure 1, left). The PacBio technology allows the sequencing of DNA fragments spanning thousands of base pairs. Reads of this length should allow numts to be identified in species that lack a complete nuclear reference, such as the insect Gonioctena intermedia. Indeed, it should be possible to use a mitochondrial assembly to identify PacBio reads in which a central region similar to the mitochondrial sequence is enclosed by nuclear regions that are dissimilar to it (Figure 1, right).

FIGURE 1. Identification of numts – Existing methods (left) and proposed method (right). Comparison of mitochondrial sequence to nuclear sequence (left) or long reads (right).

METHODS
The proposed approach aligns PacBio reads to a mitochondrial genome (here, de novo assemblies built from PacBio reads and Illumina HiSeq 2000 reads). In these long reads, a numt appears as a region similar to the mitochondrial genome flanked by regions that are not similar to it. We introduce several criteria to distinguish reads that presumably contain numts from reads of mitochondrial origin (Figure 2). The DNA sequences come from an insect (Gonioctena intermedia) with no reference genome.

FIGURE 2. Mitochondrial reads and numts with nuclear borders.

RESULTS & DISCUSSION
A systematic identification of potential numts is proposed: through alignments, we identify 10 mitochondrial reads and 34 reads carrying a potential numt for one particular mitochondrial region (the widely studied cytochrome oxidase I gene). This exploratory study highlights the usefulness of Pacific Biosciences data for identifying numts when no nuclear reference is available. The approach requires only PacBio reads and a mitochondrial assembly.
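The distinction between mitochondrial reads and reads carrying a potential numt can be illustrated with a minimal Python sketch. The abstract does not state the actual criteria, so everything specific here is an assumption made for illustration: the `Hit` record, the `classify_read` function, and the `min_flank` and `min_mito_fraction` thresholds are hypothetical, and each read is assumed to have a single aligned region whose coordinates come from some external aligner.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of a long read against the mitochondrial assembly."""
    read_id: str
    read_len: int  # total read length in bases
    start: int     # 0-based start of the aligned region on the read
    end: int       # exclusive end of the aligned region on the read


def classify_read(hit: Hit, min_flank: int = 500,
                  min_mito_fraction: float = 0.9) -> str:
    """Label a read from the position of its mitochondrial-like region.

    Both thresholds are hypothetical values, not the authors' criteria.
    """
    left_flank = hit.start                  # unaligned bases before the region
    right_flank = hit.read_len - hit.end    # unaligned bases after the region
    covered = (hit.end - hit.start) / hit.read_len

    # Almost the whole read aligns: treat it as a mitochondrial read.
    if covered >= min_mito_fraction:
        return "mitochondrial"
    # A mitochondrial-like core with long dissimilar flanks on both sides.
    if left_flank >= min_flank and right_flank >= min_flank:
        return "potential_numt"
    # Only one flank (or short flanks): cannot be decided by this rule.
    return "ambiguous"


hits = [
    Hit("read_1", 10_000, 50, 9_950),
    Hit("read_2", 10_000, 4_000, 5_500),
    Hit("read_3", 10_000, 0, 3_000),
]
labels = {h.read_id: classify_read(h) for h in hits}
print(labels)
```

Requiring dissimilar sequence on both sides of the aligned region is what separates a nuclear insertion from a read that merely ends inside the mitochondrial genome; reads with one flank only are left undecided in this sketch.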
The proposed approach is more efficient than identifying numts from short reads, which would require the complete reconstruction of both the mitochondrial and the nuclear genome. Systematic identification of numts in non-model organisms should help avoid misinterpretations in studies where numts could be a source of bias. Our current distinction between numts and mitochondrial reads is quite simple. A more detailed analysis of this distinction is a possible avenue for improvement.

REFERENCES
Hazkani-Covo E. et al. PLOS Genetics 6, 1-11 (2010).
Song H. et al. PNAS 105, 13486-13491 (2008).
Yao Y. G. et al. Journal of Medical Genetics 45, 769-772 (2008).

P9. MICROBIAL SEMANTICS: GENOME-WIDE HIGH-PRECISION NAMING SCHEMES FOR BACTERIA

Esther Camilo dos Reis, Dolf Michielsen, Hannes Pouseele*.
Applied Maths NV, Keistraat 120, 9830 Sint-Martens-Latem, Belgium.

INTRODUCTION
As next-generation sequencing in general, and whole genome sequencing (WGS) in particular, is increasingly adopted in public health for routine surveillance tasks, there is a clear need to incorporate this new technology into the day-to-day operational workflow of a public health institute. As cluster detection based on WGS data is evolving into a commodity, thanks to technologies such as whole genome multi-locus sequence typing (wgMLST), the question remains how WGS-based data analysis can be used to build a human-friendly yet high-precision and epidemiologically consistent naming strategy for communication purposes.

METHODS
For various organisms, the use of so-called ‘SNP addresses’ (based on single nucleotide polymorphisms or SNPs) has been proposed to build a hierarchical naming scheme (see [1], [2]). This idea relies on single linkage clustering of isolates at different levels of similarity or distance, which yields a hierarchical name. However, the main difficulties are defining the appropriate levels of similarity to cluster on and the dependence of the naming scheme on the samples at hand. Moreover, SNP data might not be the best type of data for this purpose because of its relatively large volatility. In this work, we present a mathematical framework to define the levels of similarity at which single linkage clustering makes sense. For this, we model the observed multimodal distribution of pairwise similarities between samples to obtain a theoretical model of the similarity distribution, and from there infer the most likely breaking points for stable similarity cutoffs.
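The hierarchical naming idea described above, single linkage clustering of isolates at several distance levels, can be illustrated with a minimal Python sketch. The function names, the toy distance matrix and the thresholds (50, 10, 2) are assumptions chosen for illustration only. In the abstract the cutoffs are inferred from the modelled similarity distribution, which this sketch does not attempt.

```python
def single_linkage_labels(dist, threshold):
    """Cluster items by single linkage: link any pair with dist <= threshold.

    Returns one cluster number per item, numbered from 1 in order of
    first appearance.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    numbering = {}
    labels = []
    for i in range(n):
        root = find(i)
        labels.append(numbering.setdefault(root, len(numbering) + 1))
    return labels


def hierarchical_names(dist, thresholds):
    """Build one dotted name per item, from the coarsest to the finest level."""
    levels = [single_linkage_labels(dist, t)
              for t in sorted(thresholds, reverse=True)]
    return [".".join(str(c) for c in item) for item in zip(*levels)]


# Toy pairwise distances (e.g. allele or SNP differences) for four isolates.
dist = [
    [0, 1, 8, 40],
    [1, 0, 8, 40],
    [8, 8, 0, 40],
    [40, 40, 40, 0],
]
names = hierarchical_names(dist, thresholds=[50, 10, 2])
print(names)  # ['1.1.1', '1.1.1', '1.1.2', '1.2.3']
```

Isolates that share a longer name prefix are linked at a tighter threshold. Because the cluster numbers depend on which isolates are present, the sketch also shows why the resulting names depend on the samples at hand.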
The cutoff inference is done in a data-independent manner and is therefore applicable to SNP data, but also to wgMLST data and even gene presence-absence data. We assess the stability of the naming scheme using a cross-validation approach.

RESULTS & DISCUSSION
We apply our methods to propose a wgMLST-based naming scheme for Listeria monocytogenes. Using a reference dataset covering the diversity within Listeria monocytogenes and an extensive dataset of over 4000 isolates from real-time surveillance, we show the stability of the naming scheme and its epidemiological concordance.

REFERENCES
[1] Dallman T et al., Applying phylogenomics to understand the emergence of Shiga Toxin producing Escherichia coli O157:H7 strains causing severe human disease in the United Kingdom. Microbial Genomics. doi: 10.1099/mgen.0.000029
[2] Coll F et al., PolyTB: A genomic variation map for Mycobacterium tuberculosis. Tuberculosis (Edinb). 2014 May; 94(3): 346–354. doi: 10.1016/j.tube.2014.02.005