Computational tools and Interoperability in Comparative ... - CBS

More documents

Recommendations

Info

166 literally millions of bacterial species, only a small proportion of these can be grown in the laboratory (Handelsman 2004). Bacteria (and Archaea) can be found almost anywhere in the environment: in the air, even in the International Space Station (Novikova et al. 2006), in thermal ducts found at great depths in the oceans (Alain et al. 2002; Vezzi et al. 2005), in the intestinal tracts of animals (Yan and Polk 2004; Backhed et al. 2005) and in soil and rocks, even thousands of meters deep (Torsvik et al. 1990). Bacteria live within unicellular eukaryotes, algae, plants or animals. This diversity is reflected in their physiology, morphology, metabolism and ecosystems. For example, from a physiological perspective, most intestinal bacteria such as Escherichia coli are motile by means of flagella, to overcome the peristalsis of the gut, whilst the soil bacterium Clostridium perfringens does not posses such motility machinery (Shimizu et al. 2002). From a metabolic perspective, the versatile Burkholderia cepacia (formerly Pseudomonas cepacia) can utilise approximately 100 different organic compounds as a sole energy source (Goldmann and Klinger 1986) compared to the strictly intracellular Mycobacterium tuberculosis which is dependent on only a few carbon sources produced by its involuntary host. From an inter-bacterial interaction perspective, sometimes bacteria cooperate. For example, Enterobacter cloacae and Pseudomonas mendocina positively interact to stimulate plant growth (Duponnois et al. 1999). On the other hand, there are also bacteria which not only “do not cooperate” but exhibit predatory behavior, such as Bdellovibrio bacteriovorus (Rendulic et al. 2004). As for bacteria–host interactions, for a given bacterial species both pathogenic and non-pathogenic strains can exist (Dobrindt and Hacker 2001; Penyalver and Lopez 1999), while other species may be exclusively parasitic (Goebel and Gross 2001), truly symbiotic (Gil et al. 2004) or commensal (Yan and Polk 2004) for their host. It is interesting to note that this diversity is somehow captured in the relatively small bacterial genomes. The first complete viral genome (φX174) was published in 1977 (Sanger et al. 1977). To put this into perspective, to sequence the 4.6-Mbp E. coli K-12 genome at that time (about a thousand base pairs (bp) could be sequenced per year in 1977) would take more than a thousand years to finish, and to sequence the human genome would take more than a million years to complete. The automation of sequencing methods, the invention of polymerase chain reaction (PCR) (Mullis et al. 1986) and the shotgun cloning procedure reduced costs and time, and provided the capability for large-scale sequencing. These developments together have led to the sequencing of the first complete bacterial genome (Fleischmann et al. 1995) almost 20 years after the sequencing of φX174. The choice of the first bacterium to be completely sequenced (H. influenzae Rd KW20) was based on the following reasons: (1) the genome size was thought to be ‘typical’ among bacteria (1.8 Mbp), (2) the G + C base composition was close to that of the human genome (38%) and (3) the bacterium had important human health implications. In the absence of procedures to produce a genetic map for the species, genome sequencing was proven to be a powereful alternative for genetic characterisation. This landmark work initiated the influx of genome sequence data which is now updated frequently and is publicly available. As of November 2005, there are more than 300 fully sequenced, publicly available bacterial genomes. Figure 1 shows this increase of sequence data over the past decade. 1 The total number of completed bacterial genome sequences has more than doubled over the past 2 years and, at the time of writing, there are 855 publicly listed bacterial and archaeal genome projects that are in various stages of progress. 2 In addition to new species, multiple strains of the same bacterial species are being sequenced. The amount of genomic data currently available has provided significant advances in our understanding of a number of important themes, including bacterial diversity, population characteristics, operon structure, mobile genetic elements (MGE) and horizontal gene transfer (HGT). It has also provided a number of challenges in understanding the ecology of, as yet, undiscovered bacterial worlds. The availability of whole genome sequences for pathogenic and commensal bacterial species has allowed a more detailed analysis of the complex interactions that occur with their plant or animal hosts. Figure 2a is a phylogenetic tree of 300 sequenced bacterial genomes (available at the time of writing). Many of these genomes are from pathogenic bacteria living in complex ecosystems, such as the spirochaete Brachyspira pilosicoli labelled in red in the phylogenetic tree shown in Fig. 2b. This bacterium attaches to enterocytes to form a “false brush border” in the colon. Most genome sequencing projects are currently carried out using automated applications of the sequencing technique developed by Sanger et al. (1973), but newly developed methodologies may enable even more rapid sequencing in the future. Two papers have been published about two different methods for high-throughput sequencing of bacterial genomes (Pennisi 2005). One method is essentially a “do-it-yourself kit”, which uses a laser confocal microscope and other “off-the-shelf” components to build a sequencing machine capable of sequencing an E. coli genome in less than a day (Shendure et al. 2005). The second method is a commercial machine, based on pyrosequencing methodologies to generate many short pieces of DNA; this method was used to sequence a bacterial genome within a few hours (Margulies et al. 2005). Although there are still some technical problems with both of these methods, it is clear that, in the near future, it will be possible to quickly sequence a bacterial genome at a considerably low cost. 1 Completed genome statistics obtained from the CBS atlas web pages http://www.cbs.dtu.dk/services/GenomeAtlas 2 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj
Fig. 1 Cumulative number of complete published sequenced bacterial genomes (bars) and total number of basepairs (line) over the past decade (1995–2005) Genomic information DNA codes for more than just proteins The quality of annotation of bacterial genomes varies, although a survey based on three different methods to predict the expected number of genes in a genome has found that it is likely that, for most bacterial genomes, around 20% of the genes annotated might not be “real” (Skovgaard et al. 2001). Furthermore, some “real” genes, based on proteomics experiments, which were not originally predicted have been detected, highlighting the dynamic nature of annotation and that genes are missed (Jaffe et al. 2004). Over-annotation of bacterial genomes is a problem but, unfortunately, this cannot be easily avoided. On the one hand, no one wants to miss a gene and, on the other hand, small genes can be quite difficult to predict, as a short open reading frame could easily occur by statistical chance (Skovgaard et al. 2001). There are currently several automated annotation systems and the BaSys system (Van Domselaar et al. 2005) provides a comprehensive annotation of a DNA sequence file. To conduct comparative genomics with several hundred genomes, quality databases are essential and the “GenomeAtlas” database, which was originally developed to store DNA structural information about the various sequenced genomes, is one example (Hallin and Ussery 2004). Approximately a hundred different features for each genome (such as percent AT, coding skew bias, length of genome and number of genes) are currently made available through http://www.cbs.dtu.dk/services/GenomeAtlas/. Duplication of essentials One of the features of genomic sequences that can be easily recognised is the presence of repeat sequences. The most obvious and extensive repeats present in many bacterial 167 genomes are the operons encoding the ribosomal RNA genes. These rRNA operons typically encode 16S and 23S rRNA separated by a short spacer, often followed by the 5S rRNA gene. All sequenced bacterial genomes possess at least one rRNA operon, and many (215 of 300) have two or more copies; the number of operons tends to correlate with bacterial division time. Thus, species that divide quickly (such as Bacillus cereus) have more copies of rRNA genes, so as to enable rapid production of ribosomes. In addition, species containing multiple rRNA operons appear to be more adaptable to changing environmental conditions (Acinas et al. 2004). The rRNA genes are a valuable tool for the estimation of taxonomic relationships (see Fig 2a). These genes evolve slowly, presumably because they play an essential role as the backbone of ribosomes while interacting with multiple proteins. Any changes in the shape (sequence) of rRNA would most likely be fatal. Multiple copies per genome of tRNA genes can also be found in some genomes, again tending to correlate with division time. However, for tRNAs, the duplication number is also dictated by the frequency with which particular codons are used (or vice versa, as cause and effect cannot be distinguished here). This enables a less obvious level of regulating gene activity: a gene using many codons for which only one tRNA gene is available will probably be translated at a rate-limiting step, whereas abundant proteins are more likely to use tRNAs for which multiple gene copies are available. This is the basis for the codon adaption index, which is a measure of the adaptation of a gene’s codon usage towards the optimal tRNA pool (Sharp and Li 1987). There are of course other duplications in bacterial genomes, some of which might appear at first glance to be less essential. For example, the ‘REP’ repetitive sequences frequently found in enterobacteriaceae can be used as unique identifiers of bacterial genomes (Tobes and Ramos 2005). It has been speculated that these repeats are meaningless, resulting from errors in replication, or that
Page 1 and 2:
Peter Fischer Hallin | 2009 Peter F
Page 4:
Preface This Ph.D. thesis is writte
Page 7 and 8:
thesis, the work is just being publ
Page 9 and 10:
ved at blive publiceret i Standards
Page 11 and 12:
viii
Page 13 and 14: Paper VI [Lagesen K, Hallin P] 1 ,
Page 15 and 16: xii
Page 17 and 18: 3.3.3 Refining E. coli and Shigella
Page 19 and 20: xvi
Page 21 and 22: xviii 2.17 Pan- and core-genome plo
Page 24 and 25: Chapter 1 Introduction Introduction
Page 26 and 27: Chapter 2 Comparative Genomics 2.1
Page 28 and 29: Comparative Genomics the publicly a
Page 30 and 31: Comparative Genomics source CDS tot
Page 32 and 33: Comparative Genomics 1 mysql -N -B
Page 34 and 35: Listing 2.8: R code to generate a 2
Page 36 and 37: 1st U C A G U 2nd position C A G 3r
Page 38 and 39: 1st U C A G U 2nd position C A G 3r
Page 40 and 41: Escherichia coli strain K-12, subst
Page 42 and 43: Comparative Genomics
Page 44 and 45: 3M 2.5M 3.5M 2.5M 2M 0M 2M 0.5M B.
Page 46 and 47: Streptococcus Escherichia Bacillus
Page 48 and 49: 2.4 Summary Comparative Genomics Th
Page 50 and 51: Comparative Genomics 2.5 Instant in
Page 52 and 53: ‘ReSourCe is he best online submi
Page 54 and 55: up to a total of 41 different E. co
Page 56 and 57: Fig. 2 Genes (or segments) from eac
Page 58 and 59: Fig. 5 BLASTatlas of Clostridium bo
Page 60 and 61: different applications, such as ide
Page 62 and 63: 1 Comparative Genomics 2.7 Paper II
Page 66 and 67: 168
Page 68 and 69: 170 resistance genes on mobile gene
Page 70 and 71: 172 involved in generating diversit
Page 72 and 73: 174 recipient DNA. A feature observ
Page 74 and 75: 176 Fig. 5 Genome length distributi
Page 76 and 77: 178
Page 78 and 79: 180 reasons why organisms remain un
Page 80 and 81: 182 A final problem has to do with
Page 82 and 83: 184 Middendorf B, Hochhut B, Leipol
Page 84 and 85: 1 Comparative Genomics 2.8 Paper II
Page 86 and 87: 2 O. N. Reva et al. Fig. 1. Genome
Page 88 and 89: 4 O. N. Reva et al. decrease of the
Page 90 and 91: 6 O. N. Reva et al. of which are kn
Page 92 and 93: 8 O. N. Reva et al. compiled into a
Page 94 and 95: 10 O. N. Reva et al. encoded by a c
Page 96 and 97: 12 O. N. Reva et al. systems and ef
Page 98 and 99: 1 2.9 Paper IV: The origins of Vibr
Page 100 and 101: phylogenies based on alternative ho
Page 102 and 103: Figure 1 Phylogenetic tree of the 1
Page 104 and 105: 25000 20000 15000 10000 5000 0 Pan
Page 106 and 107: Gap F 2M 2.5M Gap E 875k 750k 625k
Page 108 and 109: Table 2 A selection of genes locate
Page 110 and 111: Open Access This article is distrib
Page 112 and 113: 1 Comparative Genomics 2.10 Paper V
Page 114 and 115:
4314 74 Tools Abstract: Of the plet
Page 116 and 117:
4316 74 Tools Size distribution of
Page 118 and 119:
4318 74 Tools Genome atlas Intrinsi
Page 120 and 121:
4320 74 Tools Genome atlas Intrinsi
Page 122 and 123:
4322 74 Tools for Comparison of Bac
Page 124 and 125:
4324 74 Tools for Comparison of Bac
Page 126 and 127:
4326 74 Tools information, as genet
Page 128 and 129:
Chapter 3 rRNA operons and promoter
Page 130 and 131:
tuB murI Fis III Fis II Fis I UP -3
Page 132 and 133:
Bits 2.0 1.5 1.0 0.5 0.0 Bits 2.0 1
Page 134 and 135:
RNA operons and promoter analysis O
Page 136 and 137:
Bits 2.0 1.5 1.0 0.5 0.0 Bits T A T
Page 138 and 139:
Code Meaning Example C Coding CCCCC
Page 140 and 141:
RNA operons and promoter analysis 3
Page 142 and 143:
P2 -10 -35 UP P1 -10 -35 UP FIS FIS
Page 144 and 145:
1 rRNA operons and promoter analysi
Page 146 and 147:
Using HMMs also simplifies the use
Page 148 and 149:
Information content Information con
Page 150 and 151:
of the annotation. Some of the majo
Page 152 and 153:
where match states stop around 10 c
Page 154 and 155:
1 rRNA operons and promoter analysi
Page 156 and 157:
synthesis in flow cells to simultan
Page 158 and 159:
Read absence. A boolean where ‘on
Page 160 and 161:
Hallin, et al. Figure 4 | The dataf
Page 162 and 163:
Genome homology: Comparing multiple
Page 164 and 165:
ing platform-‐independent Java
Page 166 and 167:
34. Wang H, Noordewier M, Benham CJ
Page 168 and 169:
Chapter 4 Web Services and Interope
Page 170 and 171:
Web Services and Interoperability i
Page 172 and 173:
Page 174 and 175:
Page 176 and 177:
Page 178 and 179:
Chapter 5 Conclusion and perspectiv
Page 180 and 181:
Appendix A Appendix: Workshops, tea
Page 182 and 183:
Appendix B Appendix: Ph.D. study pl
Page 184 and 185:
Danmarks Tekniske Universitet AFI,
Page 186 and 187:
Danmarks Tekniske Universitet AFI,
Page 188 and 189:
Appendix C Appendix: Courses C.1 Gl
Page 190 and 191:
D.2 Sample output from queryGenomes
Page 192 and 193:
Appendix: Software 13 w a r n " $ o
Page 194 and 195:
Appendix: Software 109 m y ( $ m i
Page 196 and 197:
Appendix: Software 25 [ ] A l t e r
Page 198 and 199:
BIBLIOGRAPHY J. Rogers, P. F. Stadl
Page 200 and 201:
BIBLIOGRAPHY Q. Jin, Z. Yuan, J. Xu
Page 202 and 203:
BIBLIOGRAPHY Velicer, F.-J. Vorholt
show all

Computational tools and Interoperability in Comparative ... - CBS

Create successful ePaper yourself

Delete template?

Save as template?