Computational tools and Interoperability in Comparative ... - CBS

More documents

Recommendations

Info

Information content Information content Information content 0.0 1.0 2.0 0.0 1.0 2.0 0.0 1.0 2.0 RESULTS 0 20 40 60 80 100 120 140 0 50 100 150 0 50 100 150 Position in Alignment The predictions of the full HMM models have been compared first against annotations, then against the spotter models. Full model predictions versus annotation As Table 2 shows, the predictors appeared to be better at detecting bacterial rRNAs and less powerful for eukaryotic rRNAs. The highest accuracy was seen for the 16/18S rRNAs followed by the 23/28S. Two groups of rRNAs were particularly difficult to locate: the archaeal 5S and the eukaryotic 18S. The missing archaeal 5S were all from four euryarchaeotic genomes which are all anaerobic methane producers. The eukaryotic 18S that the predictors could not find were all from two genomes, Guillardia theta and Plasmodium falciparum. Closer evaluation revealed that several annotated rRNAs that lacked a matching prediction had actually been detected, but on the opposite strand. In eukaryotes, this was only seen with Arabidopsis thaliana 5S. In bacteria, most of the reverse predictions were 5S; in archaea, they were predominantly 16S and 23S. It should be noted that for all the reverse strand predictions the predicted start and stop positions agreed well with the annotation, indicating that they have been annotated on the wrong strand. Annotated rRNAs that lacked matching predictions in either direction are listed in Supplementary Table S4. Table 2 gives the number of predicted rRNAs that did not have a corresponding annotation: putative novel rRNAs. About 70% of them were 5S rRNAs, and only a 0.0 1.0 2.0 0.0 1.0 2.0 0.0 1.0 2.0 0 500 1000 1500 0 500 1000 1500 2000 2500 3000 0 1000 2000 3000 4000 5000 Position in Alignment Nucleic Acids Research, 2007, Vol. 35, No. 9 3103 A 5S, Archaea (n = 48) B 16S, Archaea (n = 76) C 23S, Archaea (n = 15) 0.0 1.0 2.0 0.0 1.0 2.0 0.0 1.0 2.0 few were archaeal. In bacteria, most of the novel rRNAs were found in Firmicutes and Gammaproteobacterias, although it should be noted that these two phyla are the two dominant groups and contain the bulk of the currently sequenced bacterial genomes. Among the eukaryotes, only A. thaliana had novel rRNAs. The scores of the new rRNA predictions did not significantly differ from those that were annotated, indicating that these are true rRNAs not yet annotated. The 5S is often omitted in the rRNA annotation; since the eukaryotic 5S is usually separated from the 18-28S sequence, they might be less visible to annotators. Start and stop deviations 0 500 1000 1500 2000 2500 3000 3500 D 5S, Bacteria (n = 360) E 16S, Bacteria (n = 743) F 23S, Bacteria (n = 127) 0 1000 2000 3000 4000 G 8S, Eukaryotes (n = 283) H 18S, Eukaryotes (n = 979) I 28S, Eukaryotes (n = 58) 0 1000 2000 3000 4000 5000 6000 7000 Position in Alignment Figure 1. The graphs show conservation in the alignments as measured by information content: C ¼ P i fi log 2ðfi=qiÞ where i sums over the four nucleotides, f i is the frequency of nucleotide i in the column and qi ¼ 1=4 is used as the background frequency. Ambiguous nucleotide symbols were evenly divided between the corresponding f i, gaps between all four nucleotides. The grey line represents the value for each position in the alignment, the black line is a running average over 75 nt around the current position, whereas the white dot indicates the center of the most conserved 75 nt region of the alignment. The differences between predicted and annotated start and stop positions are illustrated in Figure 2 and it shows that they agree well. The median of the start and stop prediction deviations were in most groups zero or very close to zero with more than half within 10 nucleotides. This was not the case for the eukaryotes. For eukaryotic 5S, only five genomes contained predictions with matching annotations. The predictions were uniform in length, whereas the annotations were more variable. The predictions that indicated a substantially shorter 5S than annotated were all in Schizosaccharomyces pombe: the average length of the annotations was 170 nt, whereas the corresponding predictions were all 114 nt. For eukaryotic 18S, however, predicted start and stop positions were very accurate, although many annotated 18S were missed.
3104 Nucleic Acids Research, 2007, Vol. 35, No. 9 Table 2. The number of rRNAs annotated and predicted in the genomes that were examined. Kingdom Type Annotated Same strand Other strand Not found Full model predictions Novel Archaea (n ¼ 27) 5S 56 (24) 43 (21) 1 (1) 12 (8) 47 (23) 4 (3) 16S 47 (25) 45 (25) 2 (2) 0 (0) 47 (27) 2 (2) 23S 47 (25) 44 (24) 2 (2) 1 (1) 46 (26) 2 (2) Bacteria (n ¼ 321) 5S 1205 (285) 1166 (285) 30 (16) 9 (5) 1339 (320) 173 (69) 16S 1172 (299) 1146 (299) 22 (12) 4 (4) 1237 (320) 91 (34) 23S 1197 (297) 1154 (291) 22 (13) 21 (12) 1248 (313) 94 (36) Eukaryotes (n ¼ 13) 5S 65 (7) 46 (6) 19 (1) 0 (0) 324 (9) 278 (5) 18S 13 (4) 6 (4) 0 (0) 7 (2) 13 (6) 7 (3) 28S 13 (5) 12 (4) 0 (0) 1 (1) 19 (7) 7 (3) The table gives the number of annotations, and splits this into those matching predictions on the same strand, on the other strand, and not found. The total number of full model predictions is given. Novel predictions are full model predictions not matching any annotation on the same strand, and include those annotated on the other strand. Numbers in parentheses indicate the number of genomes. It should be noted that the eukaryotic annotated count is somewhat uncertain due to ambiguous rRNA annotations. The genomes which were analyzed were from the GenomeAtlas database, a database over all available fully sequenced genomes. Archaea Bacteria Eukaryotes Start 1000 −100 −10 0 10 100 5S (43/1163/46) For eukaryotic 28S, only two genomes had predictions with matching annotations. One of them, Encephalitozoon cuniculi, had stop positions predicted once 1112 nt and twice 4797 nt downstream of the annotation, whereas the start position was accurately predicted. In the other genome, Guillardia theta, the start positions were uniformly predicted 110 nt upstream of the annotated position, but with the stop position quite accurately predicted. 1000 Stop 1000 −100 −10 0 10 100 1000 Start 1000 −100 −10 0 10 100 16/18S (44/1146/6) 1000 Stop 1000 −100 −10 0 10 100 1000 Start 1000 −100 −10 0 10 100 23/28S (42/1150/9) Stop 1000 −100 −10 0 10 100 1000 Figure 2. Deviation of start and stop positions between predicted and annotated RNA is presented as pairs of panels. The number of predictions among the archaea, bacteria and eukaryotes are denoted beneath the panel group heading. The zero position in each panel corresponds to the annotation start or stop position with predicted positions presented relative to these. The yellow dot indicates the median deviation and the black box the quartile range. The hinges on the side of the box extend from the side of the box to the data point that is closest to, but does not exceed, 1.5 times the interquartile range. The curves show the density of the distribution. Since rRNAs tend to be very similar within a genome, predictions within each genome generally had similar lengths. This similarity within genomes as well as within groups of closely related genomes caused multiple peaks in the distributions of endpoint deviations. An example of this can be seen in the bacterial 16S predictions where some of the predicted start and stop positions were clustered downstream of the annotation and where some of the predicted start positions were clustered upstream 1000
Page 1 and 2:
Peter Fischer Hallin | 2009 Peter F
Page 4:
Preface This Ph.D. thesis is writte
Page 7 and 8:
thesis, the work is just being publ
Page 9 and 10:
ved at blive publiceret i Standards
Page 11 and 12:
viii
Page 13 and 14:
Paper VI [Lagesen K, Hallin P] 1 ,
Page 15 and 16:
xii
Page 17 and 18:
3.3.3 Refining E. coli and Shigella
Page 19 and 20:
xvi
Page 21 and 22:
xviii 2.17 Pan- and core-genome plo
Page 24 and 25:
Chapter 1 Introduction Introduction
Page 26 and 27:
Chapter 2 Comparative Genomics 2.1
Page 28 and 29:
Comparative Genomics the publicly a
Page 30 and 31:
Comparative Genomics source CDS tot
Page 32 and 33:
Comparative Genomics 1 mysql -N -B
Page 34 and 35:
Listing 2.8: R code to generate a 2
Page 36 and 37:
1st U C A G U 2nd position C A G 3r
Page 38 and 39:
1st U C A G U 2nd position C A G 3r
Page 40 and 41:
Escherichia coli strain K-12, subst
Page 42 and 43:
Comparative Genomics
Page 44 and 45:
3M 2.5M 3.5M 2.5M 2M 0M 2M 0.5M B.
Page 46 and 47:
Streptococcus Escherichia Bacillus
Page 48 and 49:
2.4 Summary Comparative Genomics Th
Page 50 and 51:
Comparative Genomics 2.5 Instant in
Page 52 and 53:
‘ReSourCe is he best online submi
Page 54 and 55:
up to a total of 41 different E. co
Page 56 and 57:
Fig. 2 Genes (or segments) from eac
Page 58 and 59:
Fig. 5 BLASTatlas of Clostridium bo
Page 60 and 61:
different applications, such as ide
Page 62 and 63:
1 Comparative Genomics 2.7 Paper II
Page 64 and 65:
166 literally millions of bacterial
Page 66 and 67:
168
Page 68 and 69:
170 resistance genes on mobile gene
Page 70 and 71:
172 involved in generating diversit
Page 72 and 73:
174 recipient DNA. A feature observ
Page 74 and 75:
176 Fig. 5 Genome length distributi
Page 76 and 77:
178
Page 78 and 79:
180 reasons why organisms remain un
Page 80 and 81:
182 A final problem has to do with
Page 82 and 83:
184 Middendorf B, Hochhut B, Leipol
Page 84 and 85:
1 Comparative Genomics 2.8 Paper II
Page 86 and 87:
2 O. N. Reva et al. Fig. 1. Genome
Page 88 and 89:
4 O. N. Reva et al. decrease of the
Page 90 and 91:
6 O. N. Reva et al. of which are kn
Page 92 and 93:
8 O. N. Reva et al. compiled into a
Page 94 and 95:
10 O. N. Reva et al. encoded by a c
Page 96 and 97:
12 O. N. Reva et al. systems and ef
Page 98 and 99: 1 2.9 Paper IV: The origins of Vibr
Page 100 and 101: phylogenies based on alternative ho
Page 102 and 103: Figure 1 Phylogenetic tree of the 1
Page 104 and 105: 25000 20000 15000 10000 5000 0 Pan
Page 106 and 107: Gap F 2M 2.5M Gap E 875k 750k 625k
Page 108 and 109: Table 2 A selection of genes locate
Page 110 and 111: Open Access This article is distrib
Page 112 and 113: 1 Comparative Genomics 2.10 Paper V
Page 114 and 115: 4314 74 Tools Abstract: Of the plet
Page 116 and 117: 4316 74 Tools Size distribution of
Page 118 and 119: 4318 74 Tools Genome atlas Intrinsi
Page 120 and 121: 4320 74 Tools Genome atlas Intrinsi
Page 122 and 123: 4322 74 Tools for Comparison of Bac
Page 124 and 125: 4324 74 Tools for Comparison of Bac
Page 126 and 127: 4326 74 Tools information, as genet
Page 128 and 129: Chapter 3 rRNA operons and promoter
Page 130 and 131: tuB murI Fis III Fis II Fis I UP -3
Page 132 and 133: Bits 2.0 1.5 1.0 0.5 0.0 Bits 2.0 1
Page 134 and 135: RNA operons and promoter analysis O
Page 136 and 137: Bits 2.0 1.5 1.0 0.5 0.0 Bits T A T
Page 138 and 139: Code Meaning Example C Coding CCCCC
Page 140 and 141: RNA operons and promoter analysis 3
Page 142 and 143: P2 -10 -35 UP P1 -10 -35 UP FIS FIS
Page 144 and 145: 1 rRNA operons and promoter analysi
Page 146 and 147: Using HMMs also simplifies the use
Page 150 and 151: of the annotation. Some of the majo
Page 152 and 153: where match states stop around 10 c
Page 154 and 155: 1 rRNA operons and promoter analysi
Page 156 and 157: synthesis in flow cells to simultan
Page 158 and 159: Read absence. A boolean where ‘on
Page 160 and 161: Hallin, et al. Figure 4 | The dataf
Page 162 and 163: Genome homology: Comparing multiple
Page 164 and 165: ing platform-‐independent Java
Page 166 and 167: 34. Wang H, Noordewier M, Benham CJ
Page 168 and 169: Chapter 4 Web Services and Interope
Page 170 and 171: Web Services and Interoperability i
Page 178 and 179: Chapter 5 Conclusion and perspectiv
Page 180 and 181: Appendix A Appendix: Workshops, tea
Page 182 and 183: Appendix B Appendix: Ph.D. study pl
Page 184 and 185: Danmarks Tekniske Universitet AFI,
Page 186 and 187: Danmarks Tekniske Universitet AFI,
Page 188 and 189: Appendix C Appendix: Courses C.1 Gl
Page 190 and 191: D.2 Sample output from queryGenomes
Page 192 and 193: Appendix: Software 13 w a r n " $ o
Page 194 and 195: Appendix: Software 109 m y ( $ m i
Page 196 and 197: Appendix: Software 25 [ ] A l t e r
Page 198 and 199:
BIBLIOGRAPHY J. Rogers, P. F. Stadl
Page 200 and 201:
BIBLIOGRAPHY Q. Jin, Z. Yuan, J. Xu
Page 202 and 203:
BIBLIOGRAPHY Velicer, F.-J. Vorholt
show all

Computational tools and Interoperability in Comparative ... - CBS

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?