Computational tools and Interoperability in Comparative ... - CBS
Computational tools and Interoperability in Comparative ... - CBS
Computational tools and Interoperability in Comparative ... - CBS
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
Information content<br />
Information content<br />
Information content<br />
0.0 1.0 2.0<br />
0.0 1.0 2.0<br />
0.0 1.0 2.0<br />
RESULTS<br />
0 20 40 60 80 100 120 140<br />
0 50 100 150<br />
0 50 100 150<br />
Position <strong>in</strong> Alignment<br />
The predictions of the full HMM models have been<br />
compared first aga<strong>in</strong>st annotations, then aga<strong>in</strong>st the<br />
spotter models.<br />
Full model predictions versus annotation<br />
As Table 2 shows, the predictors appeared to be better<br />
at detect<strong>in</strong>g bacterial rRNAs <strong>and</strong> less powerful for<br />
eukaryotic rRNAs. The highest accuracy was seen for<br />
the 16/18S rRNAs followed by the 23/28S. Two groups of<br />
rRNAs were particularly difficult to locate: the archaeal<br />
5S <strong>and</strong> the eukaryotic 18S. The miss<strong>in</strong>g archaeal 5S were<br />
all from four euryarchaeotic genomes which are all<br />
anaerobic methane producers. The eukaryotic 18S that<br />
the predictors could not f<strong>in</strong>d were all from two genomes,<br />
Guillardia theta <strong>and</strong> Plasmodium falciparum.<br />
Closer evaluation revealed that several annotated<br />
rRNAs that lacked a match<strong>in</strong>g prediction had actually<br />
been detected, but on the opposite str<strong>and</strong>. In eukaryotes,<br />
this was only seen with Arabidopsis thaliana 5S.<br />
In bacteria, most of the reverse predictions were 5S; <strong>in</strong><br />
archaea, they were predom<strong>in</strong>antly 16S <strong>and</strong> 23S. It should<br />
be noted that for all the reverse str<strong>and</strong> predictions<br />
the predicted start <strong>and</strong> stop positions agreed well<br />
with the annotation, <strong>in</strong>dicat<strong>in</strong>g that they have been<br />
annotated on the wrong str<strong>and</strong>. Annotated rRNAs<br />
that lacked match<strong>in</strong>g predictions <strong>in</strong> either direction are<br />
listed <strong>in</strong> Supplementary Table S4.<br />
Table 2 gives the number of predicted rRNAs that did<br />
not have a correspond<strong>in</strong>g annotation: putative novel<br />
rRNAs. About 70% of them were 5S rRNAs, <strong>and</strong> only a<br />
0.0 1.0 2.0<br />
0.0 1.0 2.0<br />
0.0 1.0 2.0<br />
0 500 1000 1500<br />
0 500 1000 1500 2000 2500 3000<br />
0 1000 2000 3000 4000 5000<br />
Position <strong>in</strong> Alignment<br />
Nucleic Acids Research, 2007, Vol. 35, No. 9 3103<br />
A 5S, Archaea (n = 48) B 16S, Archaea (n = 76) C 23S, Archaea (n = 15)<br />
0.0 1.0 2.0<br />
0.0 1.0 2.0<br />
0.0 1.0 2.0<br />
few were archaeal. In bacteria, most of the novel rRNAs<br />
were found <strong>in</strong> Firmicutes <strong>and</strong> Gammaproteobacterias,<br />
although it should be noted that these two phyla are<br />
the two dom<strong>in</strong>ant groups <strong>and</strong> conta<strong>in</strong> the bulk of the<br />
currently sequenced bacterial genomes. Among the<br />
eukaryotes, only A. thaliana had novel rRNAs. The<br />
scores of the new rRNA predictions did not significantly<br />
differ from those that were annotated, <strong>in</strong>dicat<strong>in</strong>g that<br />
these are true rRNAs not yet annotated. The 5S is often<br />
omitted <strong>in</strong> the rRNA annotation; s<strong>in</strong>ce the eukaryotic 5S<br />
is usually separated from the 18-28S sequence, they might<br />
be less visible to annotators.<br />
Start <strong>and</strong> stop deviations<br />
0 500 1000 1500 2000 2500 3000 3500<br />
D 5S, Bacteria (n = 360) E 16S, Bacteria (n = 743) F 23S, Bacteria (n = 127)<br />
0 1000 2000 3000 4000<br />
G 8S, Eukaryotes (n = 283) H 18S, Eukaryotes (n = 979) I 28S, Eukaryotes (n = 58)<br />
0 1000 2000 3000 4000 5000 6000 7000<br />
Position <strong>in</strong> Alignment<br />
Figure 1. The graphs show conservation <strong>in</strong> the alignments as measured by <strong>in</strong>formation content: C ¼ P<br />
i fi log 2ðfi=qiÞ where i sums over the four<br />
nucleotides, f i is the frequency of nucleotide i <strong>in</strong> the column <strong>and</strong> qi ¼ 1=4 is used as the background frequency. Ambiguous nucleotide symbols were<br />
evenly divided between the correspond<strong>in</strong>g f i, gaps between all four nucleotides. The grey l<strong>in</strong>e represents the value for each position <strong>in</strong> the alignment,<br />
the black l<strong>in</strong>e is a runn<strong>in</strong>g average over 75 nt around the current position, whereas the white dot <strong>in</strong>dicates the center of the most conserved 75 nt<br />
region of the alignment.<br />
The differences between predicted <strong>and</strong> annotated start<br />
<strong>and</strong> stop positions are illustrated <strong>in</strong> Figure 2 <strong>and</strong> it shows<br />
that they agree well. The median of the start <strong>and</strong> stop<br />
prediction deviations were <strong>in</strong> most groups zero or very<br />
close to zero with more than half with<strong>in</strong> 10 nucleotides.<br />
This was not the case for the eukaryotes.<br />
For eukaryotic 5S, only five genomes conta<strong>in</strong>ed<br />
predictions with match<strong>in</strong>g annotations. The predictions<br />
were uniform <strong>in</strong> length, whereas the annotations<br />
were more variable. The predictions that <strong>in</strong>dicated a<br />
substantially shorter 5S than annotated were all <strong>in</strong><br />
Schizosaccharomyces pombe: the average length of the<br />
annotations was 170 nt, whereas the correspond<strong>in</strong>g<br />
predictions were all 114 nt. For eukaryotic 18S, however,<br />
predicted start <strong>and</strong> stop positions were very accurate,<br />
although many annotated 18S were missed.