29.07.2013 Views

Computational tools and Interoperability in Comparative ... - CBS

RESULTS

The predictions of the full HMM models were compared first against annotations, then against the spotter models.

Full model predictions versus annotation<br />

As Table 2 shows, the predictors appeared to be better at detecting bacterial rRNAs and less powerful for eukaryotic rRNAs. The highest accuracy was seen for the 16/18S rRNAs, followed by the 23/28S. Two groups of rRNAs were particularly difficult to locate: the archaeal 5S and the eukaryotic 18S. The missing archaeal 5S were all from four euryarchaeotic genomes, all of which are anaerobic methane producers. The eukaryotic 18S that the predictors could not find were all from two genomes, Guillardia theta and Plasmodium falciparum.

Closer evaluation revealed that several annotated rRNAs that lacked a matching prediction had actually been detected, but on the opposite strand. In eukaryotes, this was only seen with Arabidopsis thaliana 5S. In bacteria, most of the reverse predictions were 5S; in archaea, they were predominantly 16S and 23S. It should be noted that for all the reverse strand predictions the predicted start and stop positions agreed well with the annotation, indicating that these rRNAs have been annotated on the wrong strand. Annotated rRNAs that lacked matching predictions in either direction are listed in Supplementary Table S4.
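The comparison described above can be sketched as a small classifier that separates same-strand matches from opposite-strand matches. This is only an illustration: the `Feature` type, the `classify` function and the 10 nt `tolerance` are assumptions made for the example, since the text does not state the matching criterion that was actually used.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """A hypothetical rRNA feature: coordinates plus strand ("+" or "-")."""
    start: int
    stop: int
    strand: str


def classify(annotation: Feature, prediction: Feature, tolerance: int = 10) -> str:
    """Compare one annotation with one prediction.

    The tolerance (in nucleotides) is an assumed value, not taken from the paper.
    Returns "match", "opposite_strand" or "no_match".
    """
    coordinates_agree = (
        abs(annotation.start - prediction.start) <= tolerance
        and abs(annotation.stop - prediction.stop) <= tolerance
    )
    if not coordinates_agree:
        return "no_match"
    if annotation.strand == prediction.strand:
        return "match"
    # Same coordinates, other strand: a candidate wrong-strand annotation.
    return "opposite_strand"
```

A prediction whose coordinates agree with an annotation but whose strand differs is reported as `opposite_strand`, which corresponds to the cases discussed above where the annotation is likely to be on the wrong strand.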

Table 2 gives the number of predicted rRNAs that did not have a corresponding annotation: putative novel rRNAs. About 70% of them were 5S rRNAs, and only a few were archaeal. In bacteria, most of the novel rRNAs were found in Firmicutes and Gammaproteobacteria, although it should be noted that these two phyla are the dominant groups and contain the bulk of the currently sequenced bacterial genomes. Among the eukaryotes, only A. thaliana had novel rRNAs. The scores of the new rRNA predictions did not differ significantly from those of the annotated rRNAs, indicating that these are true rRNAs not yet annotated. The 5S is often omitted in the rRNA annotation; since the eukaryotic 5S is usually separated from the 18-28S sequence, it might be less visible to annotators.

Nucleic Acids Research, 2007, Vol. 35, No. 9 3103

[Figure 1, panels A to C: A 5S, Archaea (n = 48); B 16S, Archaea (n = 76); C 23S, Archaea (n = 15). Vertical axis: information content (0.0 to 2.0); horizontal axis: position in alignment.]

Start and stop deviations

[Figure 1, panels D to I: D 5S, Bacteria (n = 360); E 16S, Bacteria (n = 743); F 23S, Bacteria (n = 127); G 8S, Eukaryotes (n = 283); H 18S, Eukaryotes (n = 979); I 28S, Eukaryotes (n = 58). Horizontal axis: position in alignment.]

Figure 1. The graphs show conservation in the alignments as measured by information content: $C = \sum_i f_i \log_2(f_i / q_i)$, where $i$ sums over the four nucleotides, $f_i$ is the frequency of nucleotide $i$ in the column and $q_i = 1/4$ is used as the background frequency. Ambiguous nucleotide symbols were evenly divided between the corresponding $f_i$, gaps between all four nucleotides. The grey line represents the value for each position in the alignment, the black line is a running average over 75 nt around the current position, whereas the white dot indicates the center of the most conserved 75 nt region of the alignment.
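The information content measure in the caption of Figure 1 can be illustrated with a short sketch. It follows the stated definition (background frequency 1/4, ambiguous symbols split evenly, gaps spread over all four nucleotides). The function names, the IUPAC lookup table, the mapping of T to U and the shortening of the averaging window at the alignment ends are assumptions made for this example, not details taken from the paper.

```python
import math

# IUPAC symbols mapped to the nucleotides they can represent.
# Gap characters are spread over all four nucleotides, as in the caption.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
    "-": "ACGU", ".": "ACGU",
}


def column_information(column: str, background: float = 0.25) -> float:
    """Information content C = sum_i f_i * log2(f_i / q_i) of one alignment column."""
    counts = dict.fromkeys("ACGU", 0.0)
    for symbol in column.upper():
        targets = IUPAC[symbol]  # raises KeyError on unknown symbols
        for nucleotide in targets:
            counts[nucleotide] += 1.0 / len(targets)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty column: assumed to carry no information
    info = 0.0
    for count in counts.values():
        f = count / total
        if f > 0:  # terms with f = 0 contribute nothing
            info += f * math.log2(f / background)
    return info


def running_average(values, window: int = 75):
    """Centered running average; the window is shortened at the ends (an assumption)."""
    half = window // 2
    averaged = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        averaged.append(sum(values[lo:hi]) / (hi - lo))
    return averaged
```

A fully conserved column such as `"AAAA"` gives the maximum of 2 bits, while a column with equal proportions of all four nucleotides, or one consisting only of gaps, gives 0 bits.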

The differences between predicted and annotated start and stop positions are illustrated in Figure 2, which shows that they agree well. In most groups, the medians of the start and stop prediction deviations were zero or very close to zero, with more than half of the deviations within 10 nucleotides. This was not the case for the eukaryotes.

For eukaryotic 5S, only five genomes contained predictions with matching annotations. The predictions were uniform in length, whereas the annotations were more variable. The predictions that indicated a substantially shorter 5S than annotated were all in Schizosaccharomyces pombe: the average length of the annotations was 170 nt, whereas the corresponding predictions were all 114 nt. For eukaryotic 18S, however, predicted start and stop positions were very accurate, although many annotated 18S were missed.
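The summary statistics used in this section (the median deviation and the share of deviations within 10 nucleotides) can be computed as in the following minimal sketch. The function name `summarize_deviations` and the sign convention (predicted minus annotated position) are assumptions made for the example, and any coordinates passed to it here are hypothetical rather than data from the study.

```python
from statistics import median


def summarize_deviations(pairs, window: int = 10):
    """Summarize deviations between predicted and annotated positions.

    `pairs` is a non-empty list of (predicted, annotated) coordinates.
    Returns the median deviation and the fraction of deviations whose
    absolute value is at most `window` nucleotides.
    """
    deviations = [predicted - annotated for predicted, annotated in pairs]
    within = sum(1 for d in deviations if abs(d) <= window)
    return median(deviations), within / len(deviations)
```

Applied separately to start and stop positions for each rRNA group, this gives the two quantities compared in the text.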
