29.07.2013 Views

Computational tools and Interoperability in Comparative ... - CBS


Comparative Genomics

Source            CDS total  TP     FP   FN   3’ off  5’ off  Sens.  Shared
U00096 (present)  4,321      -      -    -    -       -       -      -
U00096 (2004)     4,254      4,172  82   109  1.02    -4.07   0.97   93%
Glimmer 3.02      4,476      4,174  302  125  -0.6    -24.09  0.97   87%
GeneMark-S 2.6    4,377      4,207  170  90   1.94    -20.17  0.98   91%
EasyGene 1.2      4,056      4,017  39   256  -0.28   -19.07  0.94   91%
Prodigal 1.1      4,332      4,200  132  97   0.54    -20.07  0.98   92%

Table 2.1: Performance of prokaryotic gene finders. An older GenBank record for E. coli K12 (U00096, 2002) has been included, and the reference for all comparisons is the most recent record, shown at the top. The 3’ and 5’ off columns give the number of base pairs by which a query coordinate lies downstream (positive number) or upstream (negative number) of the reference coordinate. The sensitivity is estimated by binary classification as $TP / (TP + FN)$, where $TP$ is the number of proteins shared between reference and query and $FN$ is the number of proteins unique to the reference, not found in the query. Calculating specificity (which requires a true negative count) is difficult, as it is hard to identify regions of the chromosome that certainly do not contain protein-coding genes (Larsen & Krogh, 2003). The rightmost column contains an estimate of the percentage of protein families shared between the query and the reference genome. This number is derived using the BLASTmatrix tool.
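As a quick check of the sensitivity definition in the caption, the following minimal Python sketch recomputes $TP / (TP + FN)$ from the counts in Table 2.1. The dictionary and function names are illustrative only and are not part of any CBS tool. In these rows the CDS total also equals TP + FP, which the loop prints alongside the sensitivity.

```python
# Counts copied from Table 2.1: (CDS total, TP, FP, FN, reported sensitivity).
# The names TABLE_2_1 and sensitivity are illustrative, not from the source.
TABLE_2_1 = {
    "U00096 (2004)": (4254, 4172, 82, 109, 0.97),
    "Glimmer 3.02": (4476, 4174, 302, 125, 0.97),
    "GeneMark-S 2.6": (4377, 4207, 170, 90, 0.98),
    "EasyGene 1.2": (4056, 4017, 39, 256, 0.94),
    "Prodigal 1.1": (4332, 4200, 132, 97, 0.98),
}


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of reference proteins that are also found in the query."""
    return tp / (tp + fn)


for source, (total, tp, fp, fn, reported) in TABLE_2_1.items():
    sens = sensitivity(tp, fn)
    print(f"{source:<15} TP+FP={tp + fp} (CDS total {total})  "
          f"sens={sens:.2f} (table: {reported:.2f})")
```

Rounded to two decimals, the recomputed values agree with the sensitivity column of the table.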

(U00096 from 2004) were compared pairwise to the latest version of the GenBank entry. The number of unique genes in both reference and query genome was derived, and for each overlapping pair of ORFs, the average inaccuracy of the 3’ and 5’ ends was calculated (table 2.1). In addition, the encoded proteins were compared using the BLASTmatrix tool, described in section 2.3.6. This allows estimation of the number of protein families shared between the reference and the query genomes.
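The end-offset comparison described above can be sketched as follows. This is not the procedure used to produce table 2.1; it is a minimal illustration under several assumptions: genes are given as (start, end, strand) with start < end, a pair "overlaps" when the intervals intersect on the same strand, and offsets follow the sign convention of the table (positive is downstream in the direction of transcription). The `Gene` type, the function names, and the coordinates are hypothetical.

```python
from statistics import mean
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    start: int   # leftmost coordinate on the chromosome
    end: int     # rightmost coordinate on the chromosome
    strand: int  # +1 for forward strand, -1 for reverse strand


def end_offsets(ref: Gene, query: Gene) -> Tuple[int, int]:
    """Return (5' offset, 3' offset) of query relative to ref.

    Positive values mean the query end lies downstream of the reference end.
    """
    if ref.strand == 1:
        return query.start - ref.start, query.end - ref.end
    # On the reverse strand the 5' end is the rightmost coordinate and
    # "downstream" means towards smaller coordinates.
    return ref.end - query.end, ref.start - query.start


def mean_offsets(refs: List[Gene], queries: List[Gene]) -> Tuple[float, float]:
    """Average 5' and 3' offsets over all same-strand overlapping pairs."""
    pairs = [
        end_offsets(r, q)
        for r in refs
        for q in queries
        if r.strand == q.strand and q.start <= r.end and r.start <= q.end
    ]
    return mean(p[0] for p in pairs), mean(p[1] for p in pairs)


# Made-up coordinates, purely for illustration.
reference = [Gene(100, 400, 1), Gene(900, 1500, -1)]
predicted = [Gene(91, 400, 1), Gene(900, 1470, -1), Gene(2000, 2300, 1)]
five_off, three_off = mean_offsets(reference, predicted)
print(f"mean 5' off = {five_off}, mean 3' off = {three_off}")
```

The quadratic pairing loop is chosen for clarity; for whole genomes, sorting both gene lists by coordinate and sweeping through them would scale better.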

2.2.5 Finding tRNA and rRNA genes

The tool tRNAscan-SE (Lowe & Eddy, 1997) has been implemented in the CBS Genome Atlas Database Web Service, and it predicts tRNA genes in contigs or genomes:

    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/trnascan.pl
    perl trnascan.pl < mapped.fsa > mapped.trna.fsa

The RNAmmer method (Paper VI, chapter 3) can be used to consistently annotate rRNA genes in contigs and full genome sequences. This tool is implemented as a separate Web Service at CBS. Please refer to http://www.cbs.dtu.dk/ws/RNAmmer for full documentation. In listing 2.7 an example is provided showing the usage of the RNAmmer client script.

Listing 2.7: Running RNAmmer on a genome sequence

    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
    wget http://www.cbs.dtu.dk/ws/RNAmmer/examples/rnammer.pl
    perl rnammer.pl bac < mapped.fsa > mapped.rrna.fsa
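Both client scripts write their predictions to a file with an .fsa extension. Assuming these files are FASTA with one record per predicted gene (an assumption; the output format is not specified here), a short Python sketch like the one below could be used to count the predicted genes. The function name and the example records are made up for illustration.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Made-up records standing in for a prediction file such as mapped.trna.fsa.
example = [">gene_1", "GGGCTATTAGCTCAG", ">gene_2", "GCCCGGATAGCTCAG"]
print(count_fasta_records(example))  # prints 2

# With a real file (hypothetical path), pass the open handle directly:
# with open("mapped.trna.fsa") as handle:
#     print(count_fasta_records(handle))
```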

2.3 Genome Comparisons<br />

The previous section described some initial steps for annotating a bacterial genome, which are required for further comparative studies. In this section, emphasis will be placed on comparing annotated genomes, both at the proteome level and using meta-data.
