Computational tools and Interoperability in Comparative ... - CBS
Comparative Genomics
Source            CDS total  TP     FP   FN   3’ off   5’ off   Sens.  Shared
U00096 (present)  4,321      -      -    -    -        -        -      -
U00096 (2004)     4,254      4,172  82   109  1.02     -4.07    0.97   93%
Glimmer 3.02      4,476      4,174  302  125  -0.6     -24.09   0.97   87%
GeneMark-S 2.6    4,377      4,207  170  90   1.94     -20.17   0.98   91%
EasyGene 1.2      4,056      4,017  39   256  -0.28    -19.07   0.94   91%
Prodigal 1.1      4,332      4,200  132  97   0.54     -20.07   0.98   92%
Table 2.1: Performance of prokaryotic gene finders. An older GenBank record for E. coli K12
(U00096, 2004) has been included, and the reference for all comparisons is the most recent record,
shown at the top. The 3’ off and 5’ off columns give the number of base pairs by which a query
coordinate lies downstream (positive number) or upstream (negative number) of the reference coordinate.
The sensitivity is estimated by binary classification as $\frac{TP}{TP + FN}$, where $TP$ is the
number of proteins shared between reference and query and $FN$ is the number of proteins unique to
the reference, not found in the query. Calculating specificity (which requires a true negative count)
is difficult, as it is hard to identify regions of the chromosome that certainly do not contain
protein-coding genes (Larsen & Krogh, 2003). The rightmost column contains an estimate of the
percentage of protein families shared between the query and the reference genome. The number is
derived using the BLASTmatrix tool.
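The sensitivity column can be recomputed from the TP and FN counts in the table. The following is a minimal sketch, not part of the original analysis; the function name `sensitivity` and the underscore-joined row labels are chosen here for illustration only.

```shell
#!/bin/sh
# sensitivity TP FN -> prints TP / (TP + FN), rounded to two decimals.
# The function name is hypothetical; only the formula comes from the caption.
sensitivity() {
    awk -v tp="$1" -v fn="$2" 'BEGIN { printf "%.2f\n", tp / (tp + fn) }'
}

# TP and FN counts copied from Table 2.1 (thousands separators removed).
while read -r name tp fn; do
    printf '%s\t%s\n' "$name" "$(sensitivity "$tp" "$fn")"
done <<'EOF'
U00096_2004 4172 109
Glimmer_3.02 4174 125
GeneMark-S_2.6 4207 90
EasyGene_1.2 4017 256
Prodigal_1.1 4200 97
EOF
```

The printed values should agree with the Sens. column. In every query row of the table the CDS total also equals TP + FP, which offers a second consistency check on the counts.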
(U00096 from 2004) were compared pairwise to the latest version of the GenBank entry.
The number of unique genes in both reference and query genome was derived, and for each
overlapping pair of ORFs the average inaccuracy of the 3’ and 5’ ends was calculated
(Table 2.1). In addition, the encoded proteins were compared using the BLASTmatrix
tool, described in section 2.3.6. This allows estimation of the number of protein families
shared between the reference and the query genomes.
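The averaging of end-point offsets over overlapping ORF pairs might be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the ORF pairs are assumed to be matched already and supplied one per line as `ref_start ref_end query_start query_end strand`, the function name `mean_offsets` is hypothetical, and the coordinates in the example are made up. The sign convention follows the caption of Table 2.1 (downstream positive, upstream negative), interpreted relative to the direction of transcription.

```shell
#!/bin/sh
# mean_offsets: reads "ref_start ref_end query_start query_end strand" lines
# from standard input and prints "mean_3prime_offset mean_5prime_offset".
# Input layout and function name are assumptions made for this sketch.
mean_offsets() {
    awk '
    {
        rs = $1; re = $2; qs = $3; qe = $4
        if ($5 == "+") {          # forward strand: the 5-prime end is the start
            d5 = qs - rs; d3 = qe - re
        } else {                  # reverse strand: the 5-prime end is the end
            d5 = re - qe; d3 = rs - qs
        }
        sum5 += d5; sum3 += d3; n++
    }
    END { if (n > 0) printf "%.2f %.2f\n", sum3 / n, sum5 / n }'
}

# Two made-up ORF pairs: one on each strand.
mean_offsets <<'EOF'
100 400 97 400 +
1000 1600 1003 1609 -
EOF
```

A negative 5’ value, as seen for all gene finders in Table 2.1, then means that the query start codons lie on average upstream of the reference start codons.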
2.2.5 Finding tRNA and rRNA genes
The tool tRNAscan-SE (Lowe & Eddy, 1997) has been implemented in the CBS Genome
Atlas Database Web Service, and it predicts tRNA genes in contigs or genomes:
    # Download the fasta.inc.pl include file and the tRNAscan-SE client script
    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/trnascan.pl
    # Predict tRNA genes in mapped.fsa; results are written to mapped.trna.fsa
    perl trnascan.pl < mapped.fsa > mapped.trna.fsa
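A quick sanity check on the result is to count the predicted genes. The sketch below assumes that the client script writes FASTA with one record per predicted gene, which is suggested by the .fsa extension but not stated here; the function name `count_fasta_records` and the demonstration records are hypothetical.

```shell
#!/bin/sh
# count_fasta_records FILE -> number of header lines (lines starting with ">").
# Assumes FASTA output with one record per predicted gene.
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Demonstration on a small made-up file; names and sequences are placeholders.
demo=$(mktemp)
printf '>hypothetical_tRNA_1\nGCGGATTTAGCTCAGTTGGG\n>hypothetical_tRNA_2\nGGGTCGTTAGCTCAGTTGGT\n' > "$demo"
count_fasta_records "$demo"   # prints 2
rm -f "$demo"
```

Under the same assumption, `count_fasta_records mapped.trna.fsa` would report the number of tRNA genes predicted for the mapped sequence.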
The RNAmmer method (Paper VI, chapter 3) can be used to consistently annotate
rRNA genes in contigs and full genome sequences. This tool is implemented as a separate
Web Service at CBS. Please refer to http://www.cbs.dtu.dk/ws/RNAmmer for full documentation.
Listing 2.7 provides an example showing the usage of the RNAmmer
client script.
Listing 2.7: Running RNAmmer on a genome sequence
    # Download the fasta.inc.pl include file and the RNAmmer client script
    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
    wget http://www.cbs.dtu.dk/ws/RNAmmer/examples/rnammer.pl
    # Annotate rRNA genes in mapped.fsa; results are written to mapped.rrna.fsa
    perl rnammer.pl bac < mapped.fsa > mapped.rrna.fsa
2.3 Genome Comparisons
The previous section described some initial steps for annotating a bacterial genome,
which is required for further comparative studies. In this section, emphasis is placed
on comparing annotated genomes both at the proteome level and using metadata.