Computational tools and Interoperability in Comparative Genomics

PhD thesis | Peter Fischer Hallin | 2009
Center for Biological Sequence Analysis
Department of Systems Biology
Technical University of Denmark

[Cover figure: BLAST matrix of 10 Campylobacter proteomes. Each cell shows a percentage together with the underlying ratio of counts, for example 57.2 % (1,123 / 1,965). Colour scales: homology between proteomes, 21.2 % to 84.7 %; homology within proteomes, 1.5 % to 4.0 %. Proteomes shown:]

• Campylobacter concisus 13826: 2,080 proteins, 1,972 families
• Campylobacter curvus 525.92: 1,931 proteins, 1,885 families
• Campylobacter fetus subsp. fetus 82-40: 1,719 proteins, 1,665 families
• Campylobacter hominis ATCC BAA-381: 1,687 proteins, 1,623 families
• Campylobacter jejuni RM1221: 1,838 proteins, 1,780 families
• Campylobacter jejuni subsp. doylei 269.97: 1,731 proteins, 1,650 families
• Campylobacter jejuni subsp. jejuni 81-176: 1,758 proteins, 1,702 families
• Campylobacter jejuni subsp. jejuni 81116: 1,626 proteins, 1,585 families
• Campylobacter jejuni subsp. jejuni NCTC 11168: 1,624 proteins, 1,581 families
• Campylobacter lari RM2100: 1,546 proteins, 1,494 families


To my family. Thank you Susanne for your endless support and for giving us two wonderful boys, Oliver and Victor.


Preface

This Ph.D. thesis is written for the Department of Systems Biology, Technical University of Denmark, as part of the Life Science programme and as a requirement for obtaining the Ph.D. degree.

The work was supported through the EMBRACE project, which is funded by the European Commission within the Sixth Framework Programme under the area of “Life sciences, genomics and biotechnology for health”, contract number LSGH-CT-2004-512092.

Part of the work was supported through a grant from the Danish Natural Science Research Council, contract number 26-06-0349, entitled “Comparative Genomics of Campylobacter jejuni”.

The work was carried out at the Center for Biological Sequence Analysis (CBS), Department of Systems Biology, under the supervision of Associate Professor David W. Ussery.

The work on bacterial promoters was carried out during an external stay at the University of California, Davis (UC Davis Genome Center), under the supervision of Professor Craig J. Benham, and was supported through an NSF Research Grant, contract number DBI-0416764.

Lyngby, 28 September, 2009

Peter Fischer Hallin

Cover illustration

The background of the cover shows a “BLAST atlas” of Burkholderia pseudomallei strain 1710b compared with 22 other Burkholderia genomes. The top panel, under the title, shows the P1/P2 rrnB promoter region of E. coli, mapped to different DNA properties. The panel below is a “BLAST matrix” of 10 different Campylobacter strains, showing the overall proteome similarity.
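
The cells of the cover's BLAST matrix pair a percentage with two counts, such as 57.2 % next to 1,123 / 1,965. As a minimal sketch, assuming the percentage is simply the first count divided by the second (an assumption checked here only against three cells read off the cover figure, not a description of how the BLAST matrix software itself works), the values can be recomputed in Python. The function name `cell_percentage` is hypothetical.

```python
def cell_percentage(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * hits / total, 1)


# (first count, second count, percentage printed on the cover figure)
COVER_CELLS = [
    (34, 1494, 2.3),
    (1123, 1965, 57.2),
    (1448, 1709, 84.7),
]

for hits, total, printed in COVER_CELLS:
    computed = cell_percentage(hits, total)
    print(f"{hits} / {total}: computed {computed} %, printed {printed} %")
```

For these three cells the recomputed value matches the printed one, which supports reading each percentage as a plain ratio of the two counts beside it.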



Abstract

The scientific community is witnessing an explosion in both the number and the complexity of DNA sequencing projects. As sequencing equipment becomes more reliable, faster and less expensive, new possibilities for applying the technology are opening up. The early genome sequencing projects, dating back almost 15 years, presented only individual microbial strains, and the large efforts and scientific achievements at that time merited publication in high-ranking journals. Today, however, projects like the Human Microbiome Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) take sequencing into a new era, studying the genomes and ecological niches of entire populations consisting of thousands of microorganisms. These initiatives create a demand for new analysis tools to process and derive knowledge from the wealth of genomic information.

This thesis describes the development of new tools and methods to study these types of data. When the genomes of characterized strains and environmental samples are sequenced, the ribosomal RNA genes are commonly chosen as a starting point to describe the phylogeny and diversity. The rRNA genes are often interpreted as an ‘evolutionary chronometer’, and the RNAmmer software was developed as a tool to quickly and consistently identify the rRNA genes, allowing for large-scale phylogenetic analysis of complex data sets. RNAmmer solved earlier problems with gene boundary accuracy that are observed when BLAST approaches are used to map rRNA genes. The ability to accurately map the start of rRNA transcripts has allowed the investigation of promoter structures of these highly expressed operons, and a promoter analysis in E. coli K12 is demonstrated by applying a mathematical model of the energetics involved in DNA helix opening.

But a single gene, such as the 16S rRNA, cannot by its nature describe the phenotype or the full coding potential of an organism. This thesis describes the development of the BLASTatlas tool, a visualization tool that gives an overview of similarities and differences between any number of genomes, metagenomic samples or sequence databases from the viewpoint of a reference genome. This software has proved to be a powerful tool to study the localization and gain/loss of gene clusters, such as pathogenicity islands in virulent organisms. The tool has been used in several research projects and collaborations and was described as a cover article in Molecular BioSystems in 2008, and highlighted in the journal Chemical Biology. Despite the usefulness of this tool, it became obvious that a web-based version, more “biologist friendly” and with zooming capability, was needed. This led to the GeneWiz browser, which was developed in a joint effort with the IT staff at CBS. The tool enables the user to interactively zoom from a global chromosomal scale down to the nucleotide, while maintaining the overview of all data being presented in the plot. It features disproportional zooming as known from Google Maps. At the time of writing this thesis, the work is just being published in the second issue of the SIGS journal (Standards in Genomic Sciences).

Since I started my Ph.D. project, a total of 630 prokaryotic genomes have been sequenced and published. This represents on average about four genomes per week! As we gain knowledge from this vast amount of data, new prediction methods become available, allowing for the generation of even more data; examples include predicting sigma factor genes, chromosomal replication starts, and secretion systems. This combination of new sequence data and new predictions squares the problem: How do we deal with the challenge that more and more genomic material must be processed through more and more bioinformatic tools? And how is this flow of information formalized and automated, allowing bioinformaticians to programmatically submit comparisons of any genome to any prediction method anywhere in the world? The need for interoperable and programmable interfaces for these resources is now widely recognized, and machine-to-machine communication through Web Services has gained acceptance. But challenges lie ahead during the transition from web-browser-centric thinking towards interoperability and service-oriented architecture (SOA). During my Ph.D. work, a number of significant contributions to both implementations and server infrastructure have provided remote users access to CBS prediction servers and databases. This work has been presented both during the general meetings of the EU project (EMBRACE) initiating these efforts and during various workshops teaching the usage of Web Services and Comparative Genomics.
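
The two quantitative claims in the paragraph above, the sequencing rate and the way genomes and tools multiply, can be checked with a little arithmetic. The sketch below is illustrative only: the three-year project length is taken from the thesis's Danish summary, the 52-week year is a simplification, the tool counts are hypothetical, and reading "squares the problem" as "number of genomes times number of tools" is an assumed interpretation rather than something the text defines.

```python
WEEKS_PER_YEAR = 52  # simplification: ignores the extra day or two per year


def genomes_per_week(n_genomes: int, years: float) -> float:
    """Average number of genomes published per week over the given period."""
    return n_genomes / (years * WEEKS_PER_YEAR)


def analysis_jobs(n_genomes: int, n_tools: int) -> int:
    """Number of (genome, tool) pairs if every genome is run through every tool."""
    return n_genomes * n_tools


# 630 genomes over an assumed three-year project
rate = genomes_per_week(630, 3)
print(f"{rate:.1f} genomes per week")

# Hypothetical tool counts, to show how the workload grows in both directions
for n_tools in (1, 10, 50):
    print(f"{n_tools} tools -> {analysis_jobs(630, n_tools)} jobs")
```

Under these assumptions the rate comes out at roughly four genomes per week, in line with the figure quoted in the abstract, and the job count grows with both the number of genomes and the number of tools.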



Resumé

The scientific community is witnessing an explosion in both the number and the complexity of genome sequencing projects. As sequencing equipment becomes faster, more reliable and also cheaper, new possibilities for applying the technology open up. The first genome projects, which date back almost 15 years, presented only individual bacterial strains, and the great effort together with the scientific results led to publications in high-ranking journals. Today, projects such as the Human Microbiome Project (HMP), the Human Gut Microbiome Initiative (HGMI) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) have brought genome sequencing into a new era by characterizing thousands of reference genomes and entire ecosystems consisting of thousands of species. These initiatives will call for new analysis tools to process this flood of information and turn it into knowledge.

This thesis describes methods and tools for studying these types of data. When characterized strains and samples are sequenced, the ribosomal RNA is often chosen as the starting point for describing phylogeny and diversity. Ribosomal RNA is often used as an ‘evolutionary chronometer’, and the program RNAmmer was developed as a tool to identify rRNA genes quickly and consistently, which enables more comprehensive phylogenetic analyses of complex data sets. RNAmmer has solved earlier problems in establishing the exact annotation of the genes, problems that occurred with BLAST-based methods. The ability to map rRNA genes accurately has allowed the promoter structures of these highly expressed operons to be examined. Subsequently, an existing mathematical energy model of DNA opening has been applied to carry out a promoter analysis of the P1/P2 system in E. coli K12.

But a single gene, such as the 16S rRNA, is by its very nature unable to describe the phenotype of a whole organism or its full coding potential. This thesis describes the BLASTatlas method, a visualization tool that gives an overview of the similarity between an arbitrary number of genomes, metagenomic samples or sequence databases, taking a reference genome as the starting point. This software has proved to be an effective tool for studying individual genes or groups of genes that are conserved or have been lost in, for example, disease-causing microorganisms. The tool has been used in connection with several research projects and collaborations, and the method was published as a cover article in the May 2008 issue of Environmental Microbiology. It became clear, however, that the lack of an interactive aspect made the tool difficult for biologists to use. This led to the development of the program GeneWiz Browser, which was developed in collaboration with IT staff at CBS. The tool makes it possible for the user to zoom interactively from the global genome down to the individual nucleotide while keeping an overview of all the data presented in the diagram. The program uses disproportional scaling as known from, for example, Google Maps. The work is currently being published in Standards in Genomic Sciences.

Since the start of my three-year Ph.D. project, a total of 630 prokaryotic organisms have been fully sequenced and published. This corresponds to an average of three genomes per week! As we gain new knowledge from these large amounts of data, new prediction methods are published for, for example, sigma factors, chromosomal replication, and secretion systems. This duality underlines the problem: How do we respond to the challenge that more and more genomic material must be processed with the help of more and more bioinformatic tools? And how can this stream of information be formalized and automated in such a way that bioinformaticians and biologists can programmatically run comparisons of any genome against any prediction method anywhere in the world? The need for interoperable and programmable interfaces to these resources is now generally recognized, and computer-to-computer communication through Web Services has gained ground. But ahead lie challenges in the transition from a web-browser-focused mindset towards interoperability and Service Oriented Architecture, known as SOA. In my Ph.D. work, a number of significant contributions in the form of implementations and infrastructure have given external users of various CBS tools and databases programmable access via Web Services. These contributions have been presented both at general meetings of the EMBRACE EU project and at various workshops on the use of Web Services.



Acknowledgments

I would like to express my deep gratitude to my supervisor, Prof. David Ussery, for his support during my Ph.D. project. It has been a great pleasure to work with him during my time at CBS, and I will miss the time spent organizing workshops and preparing for conferences. Thanks to Prof. and center director Søren Brunak for creating a unique and inspiring environment at CBS, which enabled this project.

I would like to extend my heartfelt gratitude to Craig and Marcia Benham for their incredible hospitality and openness towards our family during my research visit at the University of California, Davis in 2007.

I would like to thank a great colleague and friend of mine, Tim T. Binnewies, for support during conferences, manuscript preparations and our daily collaborations; it has been a pleasure to work with Tim. Thanks to Karin Lagesen for great research collaboration during the development of RNAmmer, and to Hanni Willenbrock for great collaboration and for driving numerous publications. I would also like to thank all the people I worked with during the development of the ENCODE pipeline: Ramneek Gupta, Thomas Blicher, Haakan Svensson, Henrik Nielsen, Rasmus Wernersson, Morten Bo Johansen and Eleonora Kulberkyte.

Special thanks to Hans-Henrik Stærfeldt for valuable feedback and for all the inspiring and productive sessions finalizing GeneWiz Browser and composing web services software. Special thanks to Kristoffer Rapacki for being a great travel companion, for always finding solutions, and for the many fruitful discussions we have had; I hope there will be more. I would like to thank the numerous people with whom I have had the pleasure of working during research projects and courses.

Thanks to former center administrators Johanne Keiding and Anne Christensen, current center administrator Dorthe Kjærsgaard, Lone Boesen and Malene Beck for your extraordinary efforts in keeping the CBS engine running efficiently. Lone Boesen deserves special praise for smoothly arranging and handling travel details for my many trips abroad, across five continents.



Publications and manuscripts

Publications included in this thesis are listed in the order they appear. All other articles are sorted by publication date, descending. For papers with five or more citations, this number is indicated.

Paper I
Hallin PF, Binnewies TT, Ussery DW. The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4:363-71 (2008).

Paper II
Binnewies TT, Motro Y, Hallin PF, Lund O, Dunn D, La T, Hampson DJ, Bellgard M, Wassenaar TM, Ussery DW. Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 6:165-85 (2006) - 56 citations.

Paper III
Reva ON, Hallin PF, Willenbrock H, Sicheritz-Ponten T, Tummler B, Ussery DW. Global features of the Alcanivorax borkumensis SK2 genome. Environ Microbiol 10:614-25 (2008).

Paper IV
Vesth T, Hallin PF, Snipen L, Lagesen K, Wassenaar TM, Ussery DW. The origins of Vibrio species. Microbial Ecology (2009) doi:10.1007/s00248-009-9596-7.

Paper V
Wassenaar TM, Binnewies TT, Hallin PF, Ussery DW. Tools for comparison of bacterial genomes. Book chapter, Microbiology of Hydrocarbons, Oils, Lipids, and Derived Compounds, Springer-Verlag, Heidelberg, Germany, 2009.

Paper VI
[Lagesen K, Hallin P]¹, Rodland EA, Stærfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-8 (2007) - 8 citations².

¹ Both authors contributed equally.
² Additionally 8 citations for the first 8 GEBA genomes published in the SIGS journal; being part of a standard pipeline, RNAmmer will be cited in future GEBA articles.

Paper VII
Hallin PF, Stærfeldt H, Rotenberg E, Binnewies TT, Benham CJ, Ussery DW. GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes. Standards in Genomic Sciences 1:204-215 (2009) doi:10.4056/sigs.28177.

Papers not included

Contributions have been made to the following papers during my PhD project.

• Miller WG, Parker CT, Rubenfield M, Mendz GL, Wosten MM, Ussery DW, Stolz JF, Binnewies TT, Hallin PF, Wang G, Malek JA, Rogosin A, Stanker LH, Mandrell RE. The complete genome sequence and analysis of the human pathogen Arcobacter butzleri. PLoS ONE 2:e1358 (2007)
• Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW. Characterization of probiotic Escherichia coli isolates with a novel pan-genome microarray. Genome Biol 8:R267 (2007)

Earlier papers, 2004–2006

• Worning P, Jensen LJ, Hallin PF, Stærfeldt HH, Ussery DW. Origin of replication in circular prokaryotic chromosomes. Environ Microbiol 8:353-61 (2006) - 28 citations
• Kiil K, Binnewies TT, Sicheritz-Ponten T, Willenbrock H, Hallin PF, Wassenaar TM, Ussery DW. Genome update: sigma factors in 240 bacterial genomes. Microbiology 151:3147-50 (2005)
• Bendtsen JD, Binnewies TT, Hallin PF, Ussery DW. Genome update: prediction of membrane proteins in prokaryotic genomes. Microbiology 151:2119-21 (2005)
• Bendtsen JD, Binnewies TT, Hallin PF, Sicheritz-Ponten T, Ussery DW. Genome update: prediction of secreted proteins in 225 bacterial proteomes. Microbiology 151:1725-7 (2005)
• Binnewies TT, Bendtsen JD, Hallin PF, Nielsen N, Wassenaar TM, Pedersen MB, Klemm P, Ussery DW. Genome Update: Protein secretion systems in 225 bacterial genomes. Microbiology 151:1013-6 (2005)
• Hallin PF, Nielsen N, Devine KM, Binnewies TT, Willenbrock H, Ussery DW. Genome update: base skews in 200+ bacterial chromosomes. Microbiology 151:633-7 (2005)
• Willenbrock H, Binnewies TT, Hallin PF, Ussery DW. Genome update: 2D clustering of bacterial genomes. Microbiology 151:333-6 (2005)
• Binnewies TT, Hallin PF, Stærfeldt HH, Ussery DW. Genome Update: proteome comparisons. Microbiology 151:1-4 (2005)
• Hallin PF, Ussery DW. CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 20:3682-6 (2004) - 37 citations
• Hallin PF, Coenye T, Binnewies TT, Jarmer H, Stærfeldt HH, Ussery DW. Genome update: correlation of bacterial genomic properties. Microbiology 150:3899-903 (2004)
• Ussery DW, Binnewies TT, Gouveia-Oliveira R, Jarmer H, Hallin PF. Genome update: DNA repeats in bacterial genomes. Microbiology 150:3519-21 (2004) - 11 citations
• Hallin PF, Binnewies TT, Ussery DW. Genome update: chromosome atlases. Microbiology 150:3091-3 (2004)
• Ussery DW, Tindbaek N, Hallin PF. Genome update: promoter profiles. Microbiology 150:2791-3 (2004)
• Ussery DW, Jensen MS, Poulsen TR, Hallin PF. Genome update: alignment of bacterial chromosomes. Microbiology 150:2491-3 (2004)
• Ussery DW, Hallin PF. Genome Update: annotation quality in sequenced microbial genomes. Microbiology 150:2015-7 (2004) - 8 citations
• Ussery DW, Hallin PF, Lagesen K, Wassenaar TM. Genome update: tRNAs in sequenced microbial genomes. Microbiology 150:1603-6 (2004)
• Ussery DW, Hallin PF, Lagesen K, Coenye T. Genome update: rRNAs in sequenced microbial genomes. Microbiology 150:1113-5 (2004)
• Ussery DW, Hallin PF. Genome Update: AT content in sequenced prokaryotic genomes. Microbiology 150:749-52 (2004) - 8 citations
• Ussery DW, Hallin PF. Genome update: Length distributions of sequenced prokaryotic genomes. Microbiology 150:513-6 (2004)



Contents

List of Figures
1 Introduction
2 Comparative Genomics
2.1 Introduction
2.2 The genome annotation pipeline
2.2.1 fetchgbk: Obtaining existing public genomes from GenBank
2.2.2 Other ways to acquire genome information
2.2.3 Tools contigsort and contigmap
2.2.4 Finding protein encoding genes in prokaryotes
2.2.5 Finding tRNA and rRNA genes
2.3 Genome Comparisons
2.3.1 Box-and-whiskers plot
2.3.2 heatmap - 2D clustering
2.3.3 Codon usage and chromosomal base composition
2.3.4 CodonPlot: visualizing codon usage
2.3.5 Base composition and DNA repair
2.3.6 BLASTmatrix - proteome comparison
2.3.7 BLASTatlas - visualizing whole-genome homology
2.3.8 CorePlot - plotting the core- and pan-genomes of species
2.4 Summary
2.5 Instant insight: Reading the genetic atlas
2.6 Paper I: The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology
2.7 Paper II: Ten years of bacterial genome sequencing: comparative-genomics-based discoveries
2.8 Paper III: Global features of the Alcanivorax borkumensis SK2 genome
2.9 Paper IV: The origins of Vibrio species
2.10 Paper V: Tools for comparison of bacterial genomes

3 rRNA operons and promoter analysis
3.1 Introduction
3.2 P1 and P2 promoters in E. coli
3.3 Conservation of regulatory elements
3.3.1 Modeling the P1 and P2 in selected enterics
3.3.2 Iterating weight matrix frequencies
3.3.3 Refining E. coli and Shigella models
3.4 DNA melting and SIDD energy
3.4.1 codesearch: Mapping numerical data to genome annotations
3.5 The genomic context: visualizing operons and DNA properties
3.6 Visualizing sequencing quality using gwBrowser
3.6.1 Visualizing the P1 and P2 structure using gwBrowser
3.7 Summary
3.8 Paper VI: RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences
3.9 Paper VII: GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes

4 Web Services and Interoperability in Genomics
4.1 Introduction
4.2 Interoperability
4.2.1 SOAP based Web Services
4.3 EMBRACE: An EU initiative for enhanced interoperability
4.3.1 Quasi - a light-weight SOAP server
4.3.2 quasi mktemp - From template to Web Service
4.4 ENCODE pipeline: applying Web Services
4.4.1 Collecting Web Services clients in EPipe
4.4.2 Mapping Pfam annotations to protein structure: mecA
5 Conclusion and perspectives

A Appendix: Workshops, teaching, and conferences
A.1 Lectures and Presentations
A.1.1 DTU Course 27101: Framework Course in Biotechnology and Food Sciences
A.1.2 Comparative Microbial Genomics Workshop
A.1.3 Comparative Microbial Genomics and Taxonomy
A.1.4 EMBRACE Workshop on Client Side Scripting for Web Services
A.1.5 EMBRACE Workshop on Bioinformatics of Immunology
A.1.6 EMBRACE 3rd AGM: Implementation of web services
A.1.7 EMBRACE Workshop on Perl, SQL and Web Services
A.2 Workshops and meetings
A.2.1 EMBRACE Workshop: SOAP web services
A.2.2 EUCOMM Bioinformatics Training Course
A.2.3 EMBRACE Workshop: Modern computer tools for the biosciences
A.2.4 EMBRACE 3rd Annual General Meeting
A.2.5 EMBRACE Workshop: Deploying Web Services for Biological Sequence Annotation
A.2.6 EMBRACE 4th Annual General Meeting
A.2.7 Technical discussion of EMBRACE registry
A.2.8 EMBRACE meeting: Discussion of standard data types
A.3 Conferences
A.3.1 Conference: Metagenomics, July 2007, San Diego U.S.A.
A.3.2 Conference: ASM Biodefense 2007, February 2007, Washington U.S.A.

B Appendix: Ph.D. study plan

C Appendix: Courses
C.1 Global regulatory networks in microorganisms
C.2 Protein Structure and Computational Biology
C.3 Biological Sequence Analysis
C.4 Comparative Genome Analysis
C.5 Doctoral seminar on business economics for academic entrepreneurs
C.6 ECTS summary

D Appendix: Software
D.1 fetchgbk manual
D.2 Sample output from queryGenomes
D.3 BLASTatlas configurations
D.3.1 file blast.cfg
D.3.2 file custom.cfg
D.4 BLASTmatrix example
D.5 iscan source code
D.6 quasi mktemp manual

Bibliography



List of Figures

2.1 Mapping of multiple contigs to a backbone genome. C. jejuni str. NCTC 11168 is used as backbone for mapping contigs of C. jejuni str. 260.94. Blue and red blocks represent direct and reverse hits, respectively. Panel (a) shows unmapped whereas panel (b) shows mapped contigs.
2.2 Construction of a box-and-whiskers plot. Notches are an estimate of the 95% confidence interval.
2.3 Genome size of all public prokaryotic genomes.
2.4 Average AT content of all public prokaryotic genomes.
2.5 2D-clustering showing 87 Enterobacteriaceae.
2.6 Codon and amino acid usage of Buchnera aphidicola Cc (79.8% AT), Klebsiella pneumoniae NTUH-K2044 (42.3% AT), and E. coli K12 (49.2% AT). Rightmost column shows the nucleotide bias of the three codon positions.
2.7 AT content profile 400 bp upstream and downstream of annotated translation starts in Buchnera aphidicola Cc.
2.8 Deamination of cytosine (C) into uracil (U)
2.9 Construction of the BLASTmatrix diagram. Proteome similarity between three E. coli genomes. Lower part of the diagram corresponds to intra-proteome similarity.
2.10 Proteome similarity between ten Campylobacter species. Color encoding corresponds to percentage of shared protein families.
2.11 Proteome comparison of 32 Vibrionaceae genomes. Environmental V. cholerae strains lacking the cholera enterotoxin genes are highlighted in bright green, whilst pathogenic V. cholerae strains are shown in dark green.
2.12 Mapping of pairwise alignment to a reference genome. Mismatches, conservative mismatches and perfect matches contribute to the overall map with 0.0, 0.5, and 1.0, respectively. Gaps within the reference protein, corresponding to missing features of the reference protein, cannot be mapped and are hence excluded.
2.13 Inclusion of multiple organisms using the BLASTatlas method. Each track corresponds to a pairwise comparison against the reference chromosome.
2.14 Comparison of B. pseudomallei 1710b chromosome I and II against all public Burkholderia genomes.
2.15 A phylome atlas of Alcanivorax borkumensis, comparing the proteome against all γ-, α-, β-, δ-, and ε-proteobacteria available at the time of publishing.
2.16 Count of genomes and species divided by genera. Source: CBS Genome Atlas Database as of 2009-09-11.

2.17 Pan- and core-genome plot of 10 Campylobacter genomes. For the data currently available, there seems to exist an equilibrium at close to 600 protein families.
2.18 CorePlot output for 32 Vibrio genomes.
3.1 The transcription of bacterial genes.
3.2 The promoter structure of the rrnB operon in E. coli.
3.3 The –10 and –35 hexamers of the E. coli σ70 promoter correspond to the motifs being located on opposite sides of the DNA helix. Deletions or insertions in the spacing cause a shift of approx. 36° per nucleotide.
3.4 Logo plots showing the initial weight matrices used for searching E. coli and Shigella genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), and FIS binding motif (d).
3.5 Neighbor-joining tree of the first 1k bases of all 16S rRNA genes of Yersinia, Salmonella, Shigella, and E. coli
3.6 Profiles showing the maximum Ri(tot) scores of the initial weight matrices applied to E. coli and Shigella: Unadjusted P1 scores (a), Adjusted P1 scores (b), Unadjusted P2 scores (c), and Adjusted P2 scores (d)
3.7 Logos showing the base composition of P1 and P2 of E. coli genomes, as identified by initial P1 and P2 scan: P1 –10 hexamer (a), P1 –35 hexamer (b), P1 UP element (c), P1 FIS binding motif (d), P2 –10 hexamer (e), P2 –35 hexamer (f), P2 UP element (g)
3.8 Average profiles of SIDD energy calculated at five different helix densities -0.025, -0.035, -0.045, and -0.055. All genes have been aligned at the translation start.
3.9 E. coli and Shigella rrnB energy landscape visualized using the heatmap function. Each vertical column corresponds to a promoter sequence, whereas the horizontal rows represent average values over 10 bp within each sequence. Coordinates labeled on the horizontal rows are relative to the 16S rRNA gene start. The upper heatmaps show P1 whereas the lower heatmaps show P2. Leftmost heatmaps show P1/P2 model scores in green, whereas rightmost heatmaps show the SIDD energy in blue.
3.10 Principal workflow of gwBrowser data exchange.
3.11 Mapping qualities of sequencing reads to a reference genome while accounting for the uniqueness of the read.
3.12 A zoom of the P1 P2 tandem promoter system upstream of the rrnB operon of E. coli K12.

4.1 Screen shot of NCBI Entrez Genome projects web page
4.2 Schematic layout of a simple SOAP resource, where WSDL and schemas reside on the same server. WSDL and schemas are read and interpreted by the SOAP client in order to compose the outgoing request and parse the incoming server response.
4.3 Schematic layout of the ENCODE pipeline, EPipe. The main program ensures that as much as possible is dispatched in parallel. Modules may either be alignment dependent or not. If the alignment is required to predict the protein features, the module is not launched until the alignment algorithm has finished. Modules may either return global features of the entire protein (e.g. cellular localization), or return positional features (e.g. phosphorylation sites).
4.4 The input web page of EPipe: Upper part defines sequence upload and alignment method, and lower part selects which modules / methods to run. When applicable, gene ontologies have been added to each feature and feature values (light green boxes).
4.5 The mecA encoded protein (EEV85461) shows homology to PDB entry 1VQQ (Lim & Strynadka, 2002). Top panel shows the EPipe structure browser, which allows rotation in steps of 90 degrees. Lower panel shows a post-processing of the PyMol script, generated by EPipe.


Chapter 1

Introduction

Since the publication of the first complete bacterial genome sequence in 1995, close to a thousand prokaryotes have been fully sequenced and made publicly available. These data represent large efforts by many scientists and technicians, closing gaps in the chromosomal sequences and providing detailed gene annotations. These genome projects constitute a valuable collection of prokaryotic diversity and they serve as an indispensable resource for comparative studies when novel features of newly discovered organisms are identified.

We are, however, witnessing a transition phase as genome sequencing becomes a trivial step carried out by any researcher or company in need of a better characterization of an organism. Sequencing equipment and the capability of assembling an entire genome will likely follow the same path as any other technological advance the world has seen. Telephones, cars, aeroplanes, and computers all started as costly and clumsy attempts, and ended up as mainstream, affordable, and efficient products that are taken for granted. Nothing will prevent sequencing technology from following the same path, and it will likely end up as a tiny desktop instrument on a doctor's table next to the blood pressure measuring device. But the decreasing novelty of presenting a new genome sequence could cause a decline in the number of published genomes in the near future, leading to less control and organization of these data, with fewer demands on data integrity, sequencing and annotation quality.

Some major issues arise as massive amounts of genomic data become a reality. There are signs that our ability to process and analyze genomic data is being overtaken by the technological developments of the sequencing equipment. For example, over the past twenty-five years, GenBank has grown roughly 100,000-fold, whereas computer processing power, following Moore's law, has grown "only" 1,000-fold. The overwhelming amount of data generated by modern sequencing machines constitutes a tough challenge for most biologists, and although efforts are constantly being made to improve gene prediction and genome assembly software, these steps are not yet functioning in a scalable and unsupervised fashion. Further, post-annotation steps deriving knowledge from predicted genes remain one of the biggest challenges. How do we transform contigs of nucleotide sequences into knowledge to derive the phenotype of the organism?
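The two growth factors quoted in this paragraph can be turned into implied doubling times. The following is only a back-of-the-envelope sketch: it assumes steady exponential growth over the full 25 years, which is an assumption made here for illustration, and it takes the factors F = 100,000 and F = 1,000 directly from the text.

```latex
T_{2} = \frac{25\ \text{yr}}{\log_{2} F}
\qquad\Longrightarrow\qquad
T_{2}^{\text{GenBank}} = \frac{25}{\log_{2} 10^{5}} \approx \frac{25}{16.6} \approx 1.5\ \text{yr},
\qquad
T_{2}^{\text{processing}} = \frac{25}{\log_{2} 10^{3}} \approx \frac{25}{9.97} \approx 2.5\ \text{yr}
```

Under this assumption the sequence data double roughly every 18 months and processing power roughly every 30 months, so the ratio between the two grows by a factor of 10^5 / 10^3 = 100 over the period.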

As more prokaryotic genomes are being sequenced, there are now a number of species for which multiple strains are sequenced. Roughly one fourth of all prokaryotic projects belong to species for which five or more strains are available. As this coverage of diversity increases, we may begin to answer some key questions with better confidence. How do we define core sets of genes? Can we estimate the size of the pan-genome? Which features are novel in selected strains and are these features regionally conserved within the chromosomes? To answer these questions, there is a fundamental need to visualize and overview the similarities and differences between larger numbers of genomes. Obtaining such an overview allows some questions concerning gene acquisition and chromosomal organization to be answered. The development and refinement of the BLASTatlas method done during this Ph.D. project is an essential step forward enabling these types of analysis, and the method is now offered as an online service by CBS. This work led to a publication in 2008, describing the BLASTatlas method.

In chapter 2 a number of tools are described, which can assist rapid analysis of genomes, genomic contigs, and larger collections of genomes in order to assess their similarity. Enabling local and web based genome analysis tools for the novice user remains a critical point for the success of future sequencing projects. In chapter 3 the RNAmmer tool was used as a starting point to study the E. coli rrn tandem promoters. This work presents useful tools to model and visualize promoter conservation in genomes. The exchange of genomic data between users, sequencing centers, repositories, and tool providers currently lacks standardization and interoperability. The lack of a formal way to exchange genomic data is a limiting factor as to how we in the future may exploit the wave of new genomic material being generated. Chapter 4 of this thesis describes a number of efforts made during this Ph.D. project to provide interoperability and programmatic access to prediction methods, genomic visualization methods, and management of data standards. The outcome of this work has led CBS to adapt tools and server infrastructure, thereby sharing its many tools in a way that allows programmers to insert sophisticated prediction methods directly in their own programming environment.


Chapter 2

Comparative Genomics

2.1 Introduction

This chapter covers work for five publications. The first paper (I) describes the BLASTatlas method developed to compare and visualize the homology between a reference genome and any number of other genomes, collections of genomes, metagenomic sequences, or databases as a single graphic. The method has been used in connection with various research projects including the publication of the Arcobacter butzleri RM4018 genome (Miller et al., 2007), in computer exercises (see chapter 4 and appendix A.1), and as an analysis tool for publications made during the project (papers II-V).

A number of smaller unpublished methods, including BLASTmatrix, CorePlot, and CodonPlot, have been written and used as in-house tools. The BLASTmatrix software derives unique and shared protein families for any number of proteomes. This enables the viewer to obtain the similarity between any pair of organisms included in the comparison. The tool was first used in Jensen et al. (2005), and also in other papers including paper II. An improved version of the BLASTmatrix tool is used in paper IV. The BLASTmatrix software generates all-against-all BLAST (Basic Local Alignment Search Tool, Altschul et al. (1997)) comparisons of a number of selected proteomes. When comparing multiple species of the same genus, these BLAST results can be reused by the CorePlot program to estimate the size of the core- and pan-genome. Finally, the CodonPlot program was written to visualize the codon and amino acid usage of an organism. The CodonPlot results contributed to papers II, III, and V.
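The ideas of shared protein families and of core- and pan-genome sizes can be illustrated with plain set operations. The following is a minimal sketch and not the BLASTmatrix or CorePlot implementation: it assumes that every proteome has already been reduced to a set of family identifiers that are comparable across genomes (how families are derived from the BLAST results is not covered here), and the function names and toy identifiers are hypothetical.

```python
from typing import List, Set, Tuple


def shared_family_percentage(a: Set[str], b: Set[str]) -> Tuple[int, int, float]:
    """Return (shared, total, percent) for two sets of protein-family IDs.

    'total' is the number of distinct families found in either proteome.
    """
    shared = len(a & b)
    total = len(a | b)
    percent = 100.0 * shared / total if total else 0.0
    return shared, total, percent


def core_pan_curve(proteomes: List[Set[str]]) -> List[Tuple[int, int]]:
    """Add genomes one by one and record (core_size, pan_size) after each.

    The core is the running intersection of family sets, the pan-genome
    the running union. The result depends on the order of the input.
    """
    curve: List[Tuple[int, int]] = []
    core: Set[str] = set()
    pan: Set[str] = set()
    for i, families in enumerate(proteomes):
        core = set(families) if i == 0 else core & families
        pan |= families
        curve.append((len(core), len(pan)))
    return curve


# Toy family identifiers, invented for illustration only.
g1 = {"f1", "f2", "f3", "f4"}
g2 = {"f2", "f3", "f4", "f5"}
g3 = {"f3", "f4", "f6"}
print(shared_family_percentage(g1, g2))
print(core_pan_curve([g1, g2, g3]))
```

With real data the core typically shrinks and the pan-genome grows as genomes are added, which is the behaviour a core/pan plot is meant to show.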

The development of an interactive web based genome browser (gwBrowser) has allowed a broader application of the atlas visualization method, including analysis of sequencing reads and promoter regions. This work is described in chapter 3.

2.2 The genome annotation pipeline

Having assembled the reads of a sequencing project, the biologist is often presented with an incomplete mapping of the chromosome, with gaps and a large number of contigs (contiguous pieces of DNA). The quality of the assembly originating from most modern high-throughput techniques can be negatively affected by a number of factors such as short or insufficient reads, elevated error rates near the end of the reads, DNA repeats on the chromosome, inadequate assembly tools etc. This section describes tools to analyze both complete genome data (single-contig) as well as preliminary data generated by pyrosequencing machines (multiple contigs). Most tools that are presented here are stored on the CBS servers at /home/people/pfh/scripts/.

2.2.1 fetchgbk: Obtaining existing public genomes from GenBank

Without robust access to prior knowledge about existing genomes, it is hard to draw conclusions about a novel genome sequence. The tool fetchgbk was made to download the most recent GenBank entries via NCBI using either individual accession numbers (GenBank and RefSeq), ranges thereof, or the NCBI project ID, whereby all replicons of an organism can be obtained. Listing 2.1 shows common usage of the program and appendix D.1 includes the manual.

Listing 2.1: Usage of fetchgbk

# download a single genbank record
fetchgbk -a CP000896
# download a single refseq entry
fetchgbk -a NZ_ABIZ00000000
# download a range of RefSeq entries
fetchgbk -a NZ_ABIH01000001-NZ_ABIH01000038
# just listing refseq accession numbers of a project
fetchgbk -p 12997 -d refseq -l
# download all replicons of a project (RefSeq)
fetchgbk -p 19391 -d refseq
# download all replicons of a project (GenBank)
fetchgbk -p 19391 -d genbank
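For readers without access to fetchgbk, a comparable download can be sketched against the public NCBI E-utilities service. This is not how fetchgbk itself is implemented (its internals are documented in appendix D.1, not here); it is an illustrative sketch that assumes the `efetch` endpoint with the `nuccore` database, and the function names are hypothetical. The range expansion assumes both accessions share a prefix and end in a fixed-width number.

```python
import re
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint (a public NCBI service, not part of fetchgbk).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def expand_accession_range(first: str, last: str) -> List[str]:
    """Expand e.g. NZ_ABIH01000001 .. NZ_ABIH01000038 into every accession."""
    pattern = re.compile(r"^(.*?)(\d+)$")
    m1, m2 = pattern.match(first), pattern.match(last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError("accessions must share a prefix and end in digits")
    width = len(m1.group(2))
    start, stop = int(m1.group(2)), int(m2.group(2))
    if len(m2.group(2)) != width or stop < start:
        raise ValueError("accession numbers must have equal width and be ordered")
    # Re-apply the zero padding so the accessions keep their original width.
    return [f"{m1.group(1)}{n:0{width}d}" for n in range(start, stop + 1)]


def efetch_url(accession: str) -> str:
    """Build the efetch URL for one nucleotide record in GenBank flat-file format."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "gbwithparts",  # GenBank flat file including the sequence
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


def fetch_genbank(accession: str) -> str:
    """Download one record as text (requires network access)."""
    with urlopen(efetch_url(accession)) as response:
        return response.read().decode("utf-8")
```

A range such as the one in the listing could then be fetched with `for acc in expand_accession_range("NZ_ABIH01000001", "NZ_ABIH01000038"): fetch_genbank(acc)`, ideally with a short pause between requests to respect NCBI usage limits.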

2.2.2 Other ways to acquire genome information

The GenBank records maintained in the CBS Genome Atlas Database (Hallin & Ussery, 2004) are regularly synchronized against NCBI Entrez (see http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi). The raw sequence data can be downloaded from this database using the Web Services client scripts getSeq, getOrfs, and getProt. Example scripts can be downloaded and run as separate commands (listing 2.2) or integrated into larger workflows, in other programming languages if needed.

Listing 2.2: Accessing Genome Atlas Database through Web Services.

# download prerequisites
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getseq.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getprot.pl
wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/getorfs.pl

# obtain full genome sequence of genbank entry
perl getseq.pl CP000550 > CP000550.fsa

# obtain translations of genbank entry
perl getprot.pl CP000550 > CP000550.proteins.fsa

# obtain open reading frames of genbank entry
perl getorfs.pl CP000550 > CP000550.orfs.fsa

The CBS Genome Atlas Database contains an index of genome meta-data, such as organism name, NCBI Project ID, replicon, genome size, number of coding genes, tRNA genes, rRNA genes, the base composition, and average values of various DNA properties such as intrinsic curvature (Bolshoy et al., 1991) and stacking energy (Satchwell et al., 1986). For more information on the Web Services implementation, see section 4.2.1, and for full documentation please refer to http://www.cbs.dtu.dk/ws/GenomeAtlas. Listing 2.3 shows an example of how to use queryGenomes to obtain AT content and gene count for the publicly available Vibrio genomes. The output of the command is listed in appendix D.2.

Listing 2.3: Using queryGenomes to obtain genome meta-data.

    # download client script
    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/querygenomes.pl

    # download XML::Compile helper script
    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl

    # extract AT content and number of genes for all Vibrio genomes
    perl querygenomes.pl -hideMerged -organism vibrio -output ATCONTENT,NGENES

2.2.3 Tools contigsort and contigmap

For some applications in the analysis of unfinished or partially sequenced genomes, it is desirable to obtain approximate coordinates of the contigs within the complete chromosome. The contigsort program was written for this purpose. It accepts any number of entries (contigs) in one FASTA file together with a backbone sequence, given as a single contig, in a second FASTA file. The entries of the contig file are then mapped to the backbone sequence using nucleotide BLAST, assuming at least one significant hit. The tool then sorts all contigs by the backbone coordinate of the center point of each alignment. Contigs spanning the origin of circular backbones are automatically split in two.
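The sorting step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the contigsort implementation: the `Hit` record, the rule of keeping only the highest-scoring hit per contig, and the example coordinates are all hypothetical, and the splitting of contigs that span the origin is not covered.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    contig: str    # contig identifier
    start: int     # alignment start on the backbone
    end: int       # alignment end on the backbone (may be < start for reverse hits)
    score: float   # alignment score, used here to pick one hit per contig

def sort_contigs(hits):
    """Return contig names ordered by the backbone midpoint of their best hit."""
    best = {}
    for hit in hits:
        # Assumption: a contig is placed by its single highest-scoring hit.
        if hit.contig not in best or hit.score > best[hit.contig].score:
            best[hit.contig] = hit

    def midpoint(hit):
        # Center point of the alignment on the backbone, strand-independent.
        return (hit.start + hit.end) / 2

    return [h.contig for h in sorted(best.values(), key=midpoint)]

# Hypothetical alignment coordinates for three contigs.
hits = [
    Hit("contig_A", 900_000, 1_200_000, 5000.0),
    Hit("contig_B", 10_000, 250_000, 4200.0),
    Hit("contig_B", 1_500_000, 1_500_400, 80.0),  # weaker secondary hit, ignored
    Hit("contig_C", 700_000, 400_000, 3900.0),    # reverse-strand hit
]
print(sort_contigs(hits))
```

Using the midpoint rather than the start coordinate makes the ordering independent of the strand on which a contig aligns.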

The tool genomemap was written to visualize homology between two genome sequences. Each genome may consist of one or more contigs, and all contigs are aligned using BLASTN. This tool allows a user to validate the output of the backbone mapping from contigsort. The plot generated is similar to that produced by the Artemis Comparison Tool (ACT) (Rutherford et al., 2000); however, the output of genomemap is a vector graphic file (PostScript), and multiple sequence entries are allowed within each of the two compared sequences.

Example: Campylobacter jejuni str. 260.94<br />

The 10 contigs of the currently unpublished sequence of Campylobacter jejuni str. 260.94 (GenBank accession no. AANK01000001-AANK01000010) were downloaded and converted into a FASTA format file. The program saco_convert is an in-house program at CBS which converts between different sequence formats. In the example provided, Campylobacter jejuni str. NCTC 11168 (Parkhill et al., 2000) is used as the backbone (see listing 2.4).

Listing 2.4: Using contigsort to map assembled contigs to a backbone.

    set path = (~pfh/scripts/contigsort ~pfh/scripts/fetchgbk $path)
    fetchgbk -a AANK01000001-AANK01000010 > AANK.gbk
    saco_convert -I genbank -O fasta AANK.gbk > AANK.fsa
    fetchgbk -a AL111168 > AL111168.gbk
    saco_convert -I genbank -O fasta AL111168.gbk > AL111168.fsa
    contigsort -c -i AANK.fsa -b AL111168.fsa > mapped.fsa

To visualize the result of the contig mapping, the mapped and un-mapped contigs were processed by contigmap. The output from the comparison is a PostScript document (figure 2.1 and listing 2.5).

Figure 2.1: Mapping of multiple contigs to a backbone genome. C. jejuni str. NCTC 11168 is used as the backbone for mapping the contigs of C. jejuni str. 260.94. Blue and red blocks represent direct and reverse hits, respectively. Panel (a) shows the un-mapped contigs, whereas panel (b) shows the mapped contigs.

Listing 2.5: Using contigmap to draw homology between contigs and reference genome

    set path = (~pfh/scripts/contigmap $path)
    contigmap AL111168.fsa AANK.fsa > AANK-raw.ps
    contigmap AL111168.fsa mapped.fsa > AANK-mapped.ps

2.2.4 Finding protein encoding genes in prokaryotes

A crucial step in implementing any genome pipeline is gene finding. Successful gene calling enables a number of downstream analyses, such as translation of ORFs into protein sequences, finding of potentially novel genes, annotation of protein function by homology searches, assignment of functional domains, and detection of signal peptides to derive the secretome. To both reveal novel protein sequences and draw conclusions about the overall proteome, it is therefore essential that the gene calling can be trusted. Several public prokaryotic gene predictors are available, such as Glimmer3 (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi, Delcher et al. (1999)), GeneMarkS (http://exon.biology.gatech.edu/, Besemer et al. (2001)), EasyGene (http://www.cbs.dtu.dk/services/EasyGene/, Larsen & Krogh (2003)), and Prodigal (unpublished, http://compbio.ornl.gov/prodigal). Prodigal is a recent development, and despite its high speed and simplicity it provides promising results. It has been implemented as part of the CBS Genome Atlas Database Web Services. Code examples showing the usage of the Prodigal client scripts are provided in listing 2.6.

Listing 2.6: Using Prodigal for ORF prediction. Note that 6pack is an internal CBS tool used for translation of ORFs.

    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/xml-compile.pl
    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/prodigal.pl
    perl prodigal.pl -ta 11 -fasta < mapped.fsa > mapped.orfs.fsa
    6pack -1 < mapped.orfs.fsa > mapped.proteins.fsa

Assessing annotation quality

All four gene finders listed above were applied to the latest version of the E. coli strain K-12 isolate MG1655 genome sequence (U00096, 28 July 2009; Blattner et al., 1997). These predictions, together with an older annotation of the same GenBank entry (U00096 from 2004), were compared pairwise to the latest version of the GenBank entry. The number of unique genes in both reference and query genome was derived and, for each overlapping pair of ORFs, the average inaccuracy of the 3' and 5' ends was calculated (table 2.1). In addition, the encoded proteins were compared using the BLASTmatrix tool, described in section 2.3.6. This allows estimation of the number of protein families shared between the reference and the query genomes.

    source             CDS total   TP      FP    FN    3'off   5'off    sens.   shared
    U00096 (present)   4,321       -       -     -     -       -        -       -
    U00096 (2004)      4,254       4,172   82    109   1.02    -4.07    0.97    93%
    Glimmer 3.02       4,476       4,174   302   125   -0.6    -24.09   0.97    87%
    GeneMark-S 2.6     4,377       4,207   170   90    1.94    -20.17   0.98    91%
    EasyGene 1.2       4,056       4,017   39    256   -0.28   -19.07   0.94    91%
    Prodigal 1.1       4,332       4,200   132   97    0.54    -20.07   0.98    92%

Table 2.1: Performance of prokaryotic gene finders. An older GenBank record for E. coli K12 (U00096, 2002) has been included, and the reference for all comparisons is the most recent record, shown at the top. The 3' and 5' off columns give the number of base pairs by which a query coordinate lies downstream (positive number) or upstream (negative number) of the reference coordinate. The sensitivity is estimated by binary classification as $TP/(TP+FN)$, where $TP$ is the number of proteins shared between reference and query and $FN$ is the number of proteins unique to the reference, not found in the query. Calculating specificity (which requires a true negative count) is difficult, as it is hard to identify regions of the chromosome that for certain do not contain protein coding genes (Larsen & Krogh, 2003). The rightmost column contains an estimate of the percentage of protein families shared between the query and the reference genome. The number is derived using the BLASTmatrix tool.
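The sensitivity column of table 2.1 can be recomputed from the TP and FN counts. The short Python sketch below does this for the five query annotations, using only the numbers given in the table; the function and variable names are illustrative. It also confirms that TP + FP equals the CDS total of each query.

```python
def sensitivity(tp, fn):
    """Sensitivity of a query annotation: TP / (TP + FN)."""
    return tp / (tp + fn)

# (CDS total, TP, FP, FN) as listed in table 2.1.
table = {
    "U00096 (2004)": (4254, 4172, 82, 109),
    "Glimmer 3.02": (4476, 4174, 302, 125),
    "GeneMark-S 2.6": (4377, 4207, 170, 90),
    "EasyGene 1.2": (4056, 4017, 39, 256),
    "Prodigal 1.1": (4332, 4200, 132, 97),
}

for source, (cds_total, tp, fp, fn) in table.items():
    # Every predicted CDS is either shared with the reference (TP) or not (FP).
    assert tp + fp == cds_total
    print(f"{source:15s} sensitivity = {sensitivity(tp, fn):.2f}")
```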

2.2.5 Finding tRNA and rRNA genes

The tool tRNAscan-SE (Lowe & Eddy, 1997) has been implemented in the CBS Genome Atlas Database Web Service, and it predicts tRNA genes in contigs or genomes:

    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/trnascan.pl
    perl trnascan.pl < mapped.fsa > mapped.trna.fsa

The RNAmmer method (Paper VI, chapter 3) can be used to consistently annotate rRNA genes in contigs and full genome sequences. This tool is implemented as a separate Web Service at CBS. Please refer to http://www.cbs.dtu.dk/ws/RNAmmer for full documentation. Listing 2.7 provides an example showing the usage of the RNAmmer client script.

Listing 2.7: Running RNAmmer on a genome sequence

    wget http://www.cbs.dtu.dk/ws/GenomeAtlas/examples/fasta.inc.pl
    wget http://www.cbs.dtu.dk/ws/RNAmmer/examples/rnammer.pl
    perl rnammer.pl bac < mapped.fsa > mapped.rrna.fsa

2.3 Genome Comparisons<br />

The previous section described some initial steps for annotating a bacterial genome, which is required for further comparative studies. In this section, emphasis is placed on comparing annotated genomes both at the proteome level and using meta-data. The tools presented here have all been used widely during course activities and research projects.

Figure 2.2: Construction of a box-and-whiskers plot. The box spans Q1 to Q3 (the IQR) with the median marked inside; each whisker ends at an observed data point not exceeding 1.5 × IQR from the box; mild outliers lie between 1.5 and 3.0 IQR, and extreme outliers more than 3 IQR, away from Q1 and Q3. Notches are an estimate of the 95% confidence interval.

2.3.1 Box-and-whiskers plot

As the number of sequenced bacterial genomes grew from only two in 1995 to close to a thousand at the time of writing, enough data became available to sample various genomic properties among the different phylogenetic groups. The box-and-whiskers plot (Tukey, 1977) is a useful tool for visualizing such differences. The plot shows a box between the first and the third quartile (figure 2.2). The distance between Q1 and Q3 is called the interquartile range (IQR), and each whisker extends to the most extreme observation that lies no more than 1.5 × IQR from the box. A line drawn within the box represents the median. Observations between 1.5 × IQR and 3.0 × IQR from the box are denoted "mild" outliers, whereas observations more than 3.0 × IQR away are extreme outliers. Notches are sometimes drawn to denote the confidence interval. In the R implementation of the box-and-whiskers plot, the 95% confidence interval of the median is approximated by $\pm 1.58 \times \mathrm{IQR} / \sqrt{N}$ around the median. When comparing two or more distributions, non-overlapping notches mark significant differences.
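The quantities that make up a box-and-whiskers plot can be computed directly. The following Python sketch is an illustration, not the R implementation: it uses the "inclusive" quartile definition from Python's `statistics` module, which can differ slightly from the hinges R uses, and it takes the notch half-width as 1.58 × IQR/√N, the constant documented for R's `boxplot.stats`. The function name `box_stats` and the example data are hypothetical.

```python
import math
import statistics

def box_stats(values):
    """Compute box, whisker, outlier and notch values for one distribution."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    # Whiskers end at observed data points inside the 1.5 x IQR fences.
    inside = [x for x in data if inner_low <= x <= inner_high]
    mild = [x for x in data
            if outer_low <= x < inner_low or inner_high < x <= outer_high]
    extreme = [x for x in data if x < outer_low or x > outer_high]
    half_notch = 1.58 * iqr / math.sqrt(len(data))
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "mild_outliers": mild, "extreme_outliers": extreme,
        "notch": (median - half_notch, median + half_notch),
    }

# Hypothetical observations with one clearly deviating value.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
for name, value in stats.items():
    print(name, value)
```

Two distributions whose `notch` intervals do not overlap would be read as having significantly different medians.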

Distribution of genome size and base composition in prokaryotes

To examine the base composition and genome size for different phylogenetic groups, the CBS Genome Atlas Database can be queried, grouping replicons into projects and summing or averaging within each project. Although this is only possible from within CBS, the commands are listed below.


    mysql -N -B -D genomeatlas3_cur -e "select p.grp, concat('#', color), ord, sum(length), \
      concat(organism_name, '/', segment_name, '/', genbank) \
      from atlasdb as a, genbank_complete_prj as p, genbank_complete_seq as s, phyla as ph \
      where s.genbank = a.accession and s.pid = p.pid and segment_name not like 'genome%' \
      and ph.phyla = p.grp group by s.pid" > length.tbl
    set N = `wc -l < length.tbl`
    ~pfh/scripts/boxplot -main "Size distribution of Prokaryotic genomes (N = $N)" < length.tbl > length.ps
    mysql -N -B -D genomeatlas3_cur -e "select p.grp, concat('#', color), ord, sum(atcontent * length) / sum(length), \
      concat(organism_name, '/', segment_name, '/', genbank) \
      from atlasdb as a, genbank_complete_prj as p, genbank_complete_seq as s, phyla as ph \
      where s.genbank = a.accession and s.pid = p.pid and segment_name not like 'genome%' \
      and ph.phyla = p.grp group by s.pid" > atcontent.tbl
    ~pfh/scripts/boxplot -main "AT content distribution of Prokaryotic genomes (N = $N)" < atcontent.tbl > atcontent.ps
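The two queries aggregate replicons per project: the total genome size is the sum of the replicon lengths, and the AT content is the length-weighted average, sum(atcontent × length) / sum(length). The Python sketch below reproduces that aggregation on a few made-up replicons; the function name `summarize_projects` and all numbers are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict

def summarize_projects(replicons):
    """Aggregate (project_id, length_bp, at_fraction) tuples per project.

    Returns {project_id: (total_length, length_weighted_at_fraction)}.
    """
    total_length = defaultdict(int)
    at_bases = defaultdict(float)
    for project_id, length, at_fraction in replicons:
        total_length[project_id] += length
        # Weight each replicon's AT fraction by its length.
        at_bases[project_id] += at_fraction * length
    return {
        pid: (total_length[pid], at_bases[pid] / total_length[pid])
        for pid in total_length
    }

# Hypothetical data: project 1 has a chromosome and a plasmid.
replicons = [
    (1, 4_000_000, 0.50),
    (1, 1_000_000, 0.60),
    (2, 2_000_000, 0.40),
]
for pid, (length, at) in summarize_projects(replicons).items():
    print(pid, length, round(at, 3))
```

Weighting by length keeps a small plasmid with unusual base composition from shifting the project average as much as the chromosome does.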

The tables generated by the MySQL query can be read by the boxplot program, which is a Perl wrapper for the R command boxplot, and a PostScript document is generated. Figure 2.3 shows the total genome length (including all replicons) of all published prokaryotic genomes, divided into phyla. The confidence interval appears wide for many groups, reflecting a high intra-phylum variation. However, for a number of phyla the difference is significant. The β-proteobacteria tend to have longer chromosomes than, for example, the firmicutes, the α-proteobacteria, and the cyanobacteria. It is also evident that the δ-proteobacterium Sorangium cellulosum Soce56 represents the longest genome (13,033,779 nt; Schneiker et al., 2007), but that this is an outlier not representative of the entire phylum. The shortest bacterial genome published so far is that of the α-proteobacterium Candidatus Hodgkinia cicadicola Dsem (143,795 nt; McCutcheon et al., 2009). Thus, the difference between the smallest and the largest is close to 100-fold. The plot in figure 2.4 shows the fraction of AT for the prokaryotic genomes, ranging from 25% for the δ-proteobacterium Anaeromyxobacter dehalogenans 2CP-C (Sanford et al., 2002) to 83% for Candidatus Carsonella ruddii PV (Nakabachi et al., 2006).

2.3.2 heatmap - 2D clustering

A way to increase the dimensionality when visualizing genomic properties is to use a so-called heatmap or 2D clustering. Instead of looking at a single property at a time (e.g. length or AT content), multiple features may be included in the same plot. The axis is replaced with a color transformation of the data, and different normalization methods may be applied. In the example below, a comparison is made for 87 Enterobacteriaceae, covering among others the genera Escherichia, Salmonella, Yersinia, Shigella, Buchnera, and Klebsiella. The CBS Genome Atlas Database is queried for features such as tRNA and rRNA gene count, total coding genes, genome size, AT content, simple genomic repeats, local direct repeats, base pairs per gene, and coding fraction of the genome. The plot is shown in figure 2.5 and the R code for producing the plot is shown in listing 2.8. The data have been normalized to allow for comparison. Features and organisms are hierarchically clustered to group organisms with similar properties and to group properties that correlate within the organisms.
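Features such as genome size and AT content live on very different scales, so some normalization is needed before they can share one color scale. The text does not state which normalization was applied; the Python sketch below shows one common choice, a per-feature z-score, purely as an assumption. The function name `zscore_columns` and the toy matrix are hypothetical.

```python
import statistics

def zscore_columns(rows):
    """Scale each column (feature) to mean 0 and sample standard deviation 1.

    rows: list of equal-length lists, one row per genome.
    Assumption: z-scoring stands in for the unspecified normalization.
    """
    scaled_columns = []
    for column in zip(*rows):
        mean = statistics.fmean(column)
        sd = statistics.stdev(column)
        # A constant feature carries no information; map it to 0.
        scaled_columns.append(
            [(x - mean) / sd if sd > 0 else 0.0 for x in column]
        )
    return [list(row) for row in zip(*scaled_columns)]

# Toy matrix: three genomes (rows) and two features (columns).
matrix = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]
for row in zscore_columns(matrix):
    print(row)
```

After scaling, both toy features span the same range even though their raw values differ tenfold, which is the property a shared color key relies on.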

Figure 2.3: Genome size of all public prokaryotic genomes. Box-and-whiskers plot titled "Size distribution of Prokaryotic genomes (N = 932)", with the genome size axis running from 0.0e+00 to 1.2e+07 and the positions of Buchnera, E. coli, Salmonella, and Yersinia marked. Groups shown: Crenarchaeota (n=23), Euryarchaeota (n=39), Nanoarchaeota (n=1), Acidobacteria (n=3), Actinobacteria (n=68), Aquificae (n=5), Bacteroidetes/Chlorobi (n=26), Chlamydiae/Verrucomicrobia (n=14), Chloroflexi (n=10), Cyanobacteria (n=36), Deinococcus-Thermus (n=5), Firmicutes (n=191), Fusobacteria (n=1), Planctomycetes (n=1), Alphaproteobacteria (n=114), Betaproteobacteria (n=70), Gammaproteobacteria (n=226), Deltaproteobacteria (n=29), Epsilonproteobacteria (n=25), Spirochaetes (n=18), Thermotogae (n=10), Other Archaea (n=1), Other Bacteria (n=16).

Figure 2.4: Average AT content of all public prokaryotic genomes. Box-and-whiskers plot titled "AT content distribution of Prokaryotic genomes (N = 932)", with the AT fraction axis running from 0.3 to 0.8 and the positions of E. coli, Salmonella, Buchnera, and Yersinia marked.


Listing 2.8: R code to generate a 2D clustering graphic

    library(gplots)
    postscript(file='output.ps')
    data


Figure 2.5: 2D-clustering showing 87 Enterobacteriaceae. Heatmap of the genomes against the features TRNA_SCAN_COUNT, LENGTH, NGENES, RNAMMER_SSU_COUNT, ATCONTENT, LOC_DIR_REPEAT, LOC_INV_REPEAT, SR_PERCENT, CODING_FRACTION, and BPPRGENE; the color key spans values from -1 to 1.


           2nd position
    1st    U          C          A           G          3rd
    U      56 Phe     31 Ser     41 Tyr      12 Cys     U
           2 Phe      1 Ser      2 Tyr       1 Cys      C
           79 Leu     22 Ser     3 Stop      0 Stop     A
           5 Leu      1 Ser      0 Stop      8 Trp      G
    C      7 Leu      13 Pro     17 His      7 Arg      U
           0 Leu      1 Pro      1 His       0 Arg      C
           5 Leu      12 Pro     25 Gln      5 Arg      A
           0 Leu      2 Pro      2 Gln       0 Arg      G
    A      79 Ile     18 Thr     75 Asn      12 Ser     U
           4 Ile      1 Thr      6 Asn       1 Ser      C
           51 Ile     20 Thr     131 Lys     18 Arg     A
           18 Met     1 Thr      6 Lys       1 Arg      G
    G      18 Val     16 Ala     33 Asp      18 Gly     U
           1 Val      1 Ala      2 Asp       1 Gly      C
           18 Val     15 Ala     41 Glu      27 Gly     A
           1 Val      1 Ala      2 Glu       2 Gly      G

Table 2.2: Codon usage in Buchnera aphidicola Cc. Frequencies are measured per thousand. A total of 354,219 base pairs are examined in 360 ORFs (5 ORFs rejected due to possible frame shifts).

codons may be replaced to encode both identical and similar amino acids to adjust the overall base composition.
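A codon usage table such as table 2.2 expresses each codon count per thousand codons. The Python sketch below shows that calculation on toy sequences. It is an illustration only: the function name `codon_usage_per_thousand` is hypothetical, and rejecting ORFs whose length is not a multiple of three is an assumed stand-in for the "possible frame shift" criterion, which the text does not define.

```python
from collections import Counter

def codon_usage_per_thousand(orfs):
    """Return {codon: frequency per 1000 codons} over all accepted ORFs."""
    counts = Counter()
    for sequence in orfs:
        rna = sequence.upper().replace("T", "U")
        if len(rna) % 3 != 0:
            # Assumed rejection rule: incomplete final codon suggests a frame shift.
            continue
        counts.update(rna[i:i + 3] for i in range(0, len(rna), 3))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000 * n / total for codon, n in counts.items()}

# Toy input: the second sequence is rejected (length 7).
usage = codon_usage_per_thousand(["ATGAAAAAATAA", "ATGTTTT"])
for codon, per_thousand in sorted(usage.items()):
    print(codon, per_thousand)
```

By construction the frequencies sum to 1000, as do the 64 entries of table 2.2.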

2.3.4 CodonPlot: visualizing codon usage

A rose plot diagram (Ussery et al., 2004; Binnewies et al., 2006) may be used to make a graphical representation of codon and amino acid usage. In the codon rose plot, all 64 codons are listed along the perimeter and the frequency of each codon is drawn on a radial scale. The 64 codons are sorted in the order AUGC, first by the last letter (XX[AUGC]), then by the second letter (X[AUGC]X), and finally by the first letter ([AUGC]XX). The result is four quadrants, with codons ending with A or U in the right half, and codons ending with C or G in the left half. This allows an easy overview of biases in the third position.
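The sort order described above can be sketched in a few lines of Python. This is an illustration only, not the CodonPlot source: the function name `rose_plot_order` is hypothetical, and the base order AUGC is assumed to apply to all three codon positions.

```python
from itertools import product

BASE_ORDER = "AUGC"  # assumed rank of each base, taken from the text


def rose_plot_order():
    """Return the 64 codons sorted by third, then second, then first letter."""
    rank = {base: i for i, base in enumerate(BASE_ORDER)}
    codons = ["".join(triplet) for triplet in product(BASE_ORDER, repeat=3)]
    return sorted(codons, key=lambda c: (rank[c[2]], rank[c[1]], rank[c[0]]))


codons = rose_plot_order()
# Each run of 16 consecutive codons shares its third letter (one quadrant).
quadrants = [codons[i:i + 16] for i in range(0, 64, 16)]
print([quadrant[0] for quadrant in quadrants])
```

With this ordering the first two quadrants hold the codons ending in A and U, and the last two hold those ending in G and C, which matches the right and left halves described in the text.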

For the amino acid rose plot, all 20 amino acids are drawn along the perimeter with their frequencies shown radially. Here, the amino acids are grouped according to their chemical properties. In addition to the rose plot, information content can be applied to measure the bias within each of the three positions of the codon. These codon analyses are shown in figure 2.6 for three different enteric genomes: the AT rich Buchnera aphidicola Cc (79.8% AT), an E. coli strain K-12 (49.2% AT), and a somewhat GC rich Klebsiella pneumoniae NTUH-K2044 (42.3% AT). The bias in B. aphidicola is striking, with a strong preference for A and U at the third position. This variation results in a periodic fluctuation of AT content when aligning all open reading frames (ORFs) to the translation start and extracting 400 base pairs up- and downstream, as shown in figure 2.7. The red line represents a 3 point running average, which quickly approaches zero in the coding region. Gray lines represent the raw average values.
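The per-position information content mentioned above can be sketched as follows. The text does not give the exact formula, so this sketch assumes the common definition I = 2 − H, where H is the Shannon entropy (in bits) of the four bases at one codon position and the background is uniform. The function name and the toy counts are made up for illustration.

```python
import math
from collections import Counter


def position_information(codon_counts):
    """Information content (bits) at each of the three codon positions.

    codon_counts maps codon strings such as "AAA" to observed counts.
    Assumes I = 2 - H, with H the Shannon entropy of the four bases.
    """
    info = []
    for pos in range(3):
        base_counts = Counter()
        for codon, n in codon_counts.items():
            base_counts[codon[pos]] += n
        total = sum(base_counts.values())
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in base_counts.values() if n > 0
        )
        info.append(2.0 - entropy)
    return info


# Toy counts (made up): the third position favours A and U.
toy = {"GCA": 40, "GCU": 40, "GCG": 10, "GCC": 10}
info = position_information(toy)
print([round(bits, 3) for bits in info])
```

A position where all four bases are equally frequent gives 0 bits, and a position that is always the same base gives 2 bits.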


[Figure 2.6, panels (a) to (i), one row per genome: Buchnera_aphidicola_Cc (a, b, c), Ecoli_K12 (d, e, f), and Klebsiella_pneumoniae_NTUH-K2044 (g, h, i). Panels (a), (d), (g): amino acid usage rose plots; panels (b), (e), (h): codon usage rose plots, both with a radial frequency scale; panels (c), (f), (i): nucleotide bias at codon positions 1, 2, and 3, in bits.]

Figure 2.6: Codon and amino acid usage of Buchnera aphidicola Cc (79.8% AT), Klebsiella pneumoniae NTUH-K2044 (42.3% AT), and E. coli K12 (49.2% AT). Rightmost column shows the nucleotide bias of the three codon positions.



1st  2nd: U     2nd: C     2nd: A     2nd: G     3rd
U     19 Phe      4 Ser     14 Tyr      3 Cys    U
U     19 Phe     11 Ser     13 Tyr      8 Cys    C
U      6 Leu      4 Ser      2 Stop     1 Stop   A
U      7 Leu     12 Ser      0 Stop    16 Trp    G
C      8 Leu      5 Pro     12 His     13 Arg    U
C     16 Leu      8 Pro     11 His     31 Arg    C
C      3 Leu      4 Pro      7 Gln      3 Arg    A
C     72 Leu     30 Pro     38 Gln     10 Arg    G
A     20 Ile      5 Thr     12 Asn      4 Ser    U
A     33 Ile     31 Thr     22 Asn     22 Ser    C
A      3 Ile      3 Thr     24 Lys      2 Arg    A
A     27 Met     13 Thr     13 Lys      1 Arg    G
G     10 Val     10 Ala     26 Asp     13 Gly    U
G     21 Val     44 Ala     24 Asp     43 Gly    C
G      7 Val      8 Ala     27 Glu      6 Gly    A
G     33 Val     43 Ala     27 Glu     14 Gly    G

Table 2.3: Codon usage in Klebsiella pneumoniae NTUH-K2044. Frequencies are measured per thousand. A total of 4,697,097 base pairs are examined in 5,006 ORFs.
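A per-thousand codon table of this kind can be computed roughly as sketched below. The function name is hypothetical, and the rejection rule (skipping ORFs whose length is not a multiple of three) is an assumption that only loosely mirrors the frame-shift filtering mentioned for Table 2.2; the text does not say how frame shifts were detected.

```python
from collections import Counter


def codon_usage_per_thousand(orfs):
    """Count codons over all ORFs and scale to occurrences per 1,000 codons.

    orfs: iterable of in-frame coding sequences (DNA or RNA, 5' to 3').
    ORFs whose length is not a multiple of three are skipped (assumed rule).
    """
    counts = Counter()
    rejected = 0
    for seq in orfs:
        seq = seq.upper().replace("T", "U")
        if len(seq) % 3 != 0:
            rejected += 1
            continue
        counts.update(seq[i:i + 3] for i in range(0, len(seq), 3))
    total = sum(counts.values())
    if total == 0:
        return {}, rejected
    usage = {codon: 1000.0 * n / total for codon, n in counts.items()}
    return usage, rejected


# Toy ORFs (made up); the third one has a length of 5 and is rejected.
usage, rejected = codon_usage_per_thousand(
    ["ATGAAAAAATAA", "ATGTTTAAATAA", "ATGAA"]
)
print(round(usage["AAA"]), rejected)
```

In the toy input, AAA accounts for 3 of the 8 counted codons, so its value is 375 per thousand.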

[Figure 2.7 plot: Buchnera_aphidicola_Cc, AT content. Horizontal axis: distance from translation start (−400 to 400). Vertical axis: Z-score (−2.0 to 0.0).]

Figure 2.7: AT content profile 400 bp upstream and downstream of annotated translation starts in Buchnera aphidicola Cc.
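The effect of the 3 point running average on a signal with a period of three bases can be illustrated with a small sketch. The function names and toy windows are made up, and the Z-score scaling used in the figure is left out because the text does not define it; only the raw AT fraction per aligned position and its running mean are computed.

```python
def at_profile(windows):
    """Fraction of A/T at each column of equally long, start-aligned windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    return [
        sum(w[i] in "ATat" for w in windows) / len(windows)
        for i in range(length)
    ]


def running_average(values, width=3):
    """Running mean over `width` points; incomplete end windows are dropped."""
    return [
        sum(values[i:i + width]) / width
        for i in range(len(values) - width + 1)
    ]


# Toy windows (made up) in which every third base is A or T.
windows = ["GCAGCTGCA", "CGTCCAGGT", "GGACGTCCA"]
raw = at_profile(windows)
smooth = running_average(raw)
print(raw)
print(smooth)
```

In the toy data the raw profile alternates with a period of three, while every 3 point window contains exactly one AT rich column, so the smoothed profile is flat. This is the kind of cancellation that flattens the averaged line inside the coding region.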


Figure 2.8: Deamination of cytosine (C) into uracil (U)

2.3.5 Base composition and DNA repair

Klebsiella is often found in plant products, root surfaces and living trees, fresh vegetables, and foods with high content of sugars and acids, such as frozen orange juice concentrate. Klebsiella pneumoniae can cause urinary tract infections, and the NTUH-K2044 strain was isolated from a patient with liver abscess and meningitis. The ecological niches in which Klebsiella lives are diverse but share the property of being rich in energy and nitrogen. Nitrogen-fixing aerobic bacteria are known to have higher chromosomal GC content (McEwan et al., 1998), explained by the nitrogen requirement to replicate the chromosome; an AT base pair contains 7 nitrogen atoms whereas a GC pair contains 8 nitrogen atoms.
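As a rough worked example of this argument, the nitrogen cost per base pair can be written as a function of the GC fraction. The symbols n and g are introduced here for illustration only; g is taken as one minus the AT fractions quoted in section 2.3.4, and only the nitrogen atoms of the bases themselves are counted.

```latex
\[
  n(g) = 7\,(1 - g) + 8\,g = 7 + g
  \quad \text{nitrogen atoms per base pair}
\]
\[
  n(0.202) \approx 7.20 \ \text{(B. aphidicola Cc)}, \qquad
  n(0.577) \approx 7.58 \ \text{(K. pneumoniae NTUH-K2044)}, \qquad
  \frac{7.577}{7.202} \approx 1.05
\]
```

Under these assumptions the GC rich genome needs about 5% more base nitrogen per base pair than the AT rich one.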

Cytosine is prone to mutation caused by spontaneous deamination into uracil (Visnes et al., 2009) (figure 2.8). In E. coli the two enzymes uracil N-glycosylase and apurinic (AP) endonuclease are responsible for the repair of this mutation. However, in Buchnera aphidicola Cc, which has a small, reduced genome, these two enzymes are absent (confirmed by protein BLAST). Negative selection is likely to occur in organisms that combine a high chromosomal GC content with the lack of a functional repair mechanism. Hence, the base composition of a bacterial genome is by no means random, and adjusting the overall GC content through evolution may be yet another way to adapt to the environment.

2.3.6 BLASTmatrix - proteome comparison

The BLASTmatrix tool allows for visualization of proteome similarity between larger numbers of organisms. For each of the pairwise combinations of proteomes, a BLAST is performed. Two proteins are declared homologous when at least 50% of the protein is aligned and at least 50% of the residues within the alignment are conserved. For a report of proteome A against proteome B, all homologous proteins are then grouped into families, and the similarity between A and B is calculated as the number of families in which both organism A and organism B are represented. The BLAST report is cached, based on MD5 checksums of the proteomes. This enables the tool to efficiently reuse previous results when organisms are added to a comparison. This is repeated for all $\sum_{j=1}^{N} j = N(N+1)/2$ combinations of the N proteomes, and for each combination a square is drawn containing the following information: the similarity as a percentage of all families of A and B, the number of shared families, and the total number of families. A small example matrix is shown in figure 2.9. The percentage is used to color-code the square to allow for an easier overview of larger comparisons.
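The family grouping and the similarity measure can be sketched as follows. This is not the BLASTmatrix implementation: the text does not say how homologous proteins are clustered, so single-linkage grouping with union-find is assumed here, and all names and the toy input are hypothetical. The list of homologous pairs stands in for BLAST hits that already passed the 50% alignment and 50% conservation thresholds.

```python
def group_families(proteins, homologous_pairs):
    """Single-linkage grouping of proteins into families using union-find."""
    parent = {p: p for p in proteins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in homologous_pairs:
        parent[find(a)] = find(b)

    families = {}
    for p in proteins:
        families.setdefault(find(p), set()).add(p)
    return list(families.values())


def similarity(families, organism_of):
    """Return (shared, total, percent) for a comparison of two proteomes."""
    shared = sum(
        1 for fam in families if len({organism_of[p] for p in fam}) == 2
    )
    total = len(families)
    return shared, total, 100.0 * shared / total


# Toy input (made up): three proteins from organism A, three from B.
organism_of = {"a1": "A", "a2": "A", "a3": "A",
               "b1": "B", "b2": "B", "b3": "B"}
pairs = [("a1", "b1"), ("a2", "a3"), ("a3", "b2")]
families = group_families(list(organism_of), pairs)
shared, total, percent = similarity(families, organism_of)
print(shared, total, round(percent, 1))
```

In the toy input the proteins fall into three families, two of which contain members from both organisms, so two of three families count as shared.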

The software requires a configuration in XML as its first argument. Appendix D.4 provides a Perl script which automatically constructs a configuration that compares all published Campylobacter proteomes by querying the Genome Atlas Database. The output of BLASTmatrix for this configuration is shown in figure 2.10.

The software has been used in different publications (Binnewies et al., 2005, 2006) and has been updated a number of times since. The older versions contained both BLAST directions and showed the number of shared proteins, making the diagram redundant. The recent version avoids this by instead plotting the shared families, which renders the plot symmetrical across the diagonal. This allows the lower triangle to be removed.
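Because the shared-family measure is symmetrical, only one triangle of the matrix (plus the diagonal) has to be computed, and the MD5-based caching described earlier lets the results be reused when a proteome is added. The sketch below illustrates the idea only; the key format, the function names, and the stand-in `fake_blast` are hypothetical and are not taken from the tool.

```python
import hashlib
from itertools import combinations_with_replacement


def proteome_checksum(fasta_text):
    """MD5 hex digest of a proteome's FASTA text."""
    return hashlib.md5(fasta_text.encode("utf-8")).hexdigest()


def cache_key(fasta_a, fasta_b):
    """Order-independent key for one comparison (hypothetical format)."""
    digests = sorted((proteome_checksum(fasta_a), proteome_checksum(fasta_b)))
    return "-".join(digests)


def compare_all(proteomes, run_blast, cache):
    """Fill one triangle of the matrix, diagonal included, reusing the cache."""
    results = {}
    for a, b in combinations_with_replacement(sorted(proteomes), 2):
        key = cache_key(proteomes[a], proteomes[b])
        if key not in cache:
            cache[key] = run_blast(proteomes[a], proteomes[b])
        results[(a, b)] = cache[key]
    return results


calls = []


def fake_blast(fasta_a, fasta_b):
    """Stand-in for the real comparison; it only records that it was called."""
    calls.append((fasta_a, fasta_b))
    return len(fasta_a) + len(fasta_b)


cache = {}
proteomes = {"A": ">p1\nMKV\n", "B": ">p2\nMKL\n", "C": ">p3\nMAA\n"}
first = compare_all(proteomes, fake_blast, cache)   # N = 3: 6 comparisons
proteomes["D"] = ">p4\nMGG\n"
second = compare_all(proteomes, fake_blast, cache)  # only 4 new comparisons
print(len(first), len(second), len(calls))
```

For N = 3 the loop visits 6 combinations and for N = 4 it visits 10, in line with N(N+1)/2; the second run only computes the 4 combinations that involve the added proteome.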



[Figure 2.9 diagram: three Escherichia coli strain K-12 proteomes: substrain DH10B (4,126 proteins, 3,797 families), substrain W3110 (4,226 proteins, 3,965 families), and substrain MG1655 (4,150 proteins, 3,912 families). Squares between proteomes: 95.3 % (3,843 / 4,034), 93.1 % (3,742 / 4,020), and 91.5 % (3,685 / 4,027). Squares within proteomes: 4.3 % (167 / 3,912), 4.3 % (170 / 3,965), and 6.4 % (242 / 3,797).]

Figure 2.9: Construction of the BLASTmatrix diagram. Proteome similarity between three E. coli genomes. Lower part of the diagram corresponds to intra-proteome similarity.
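As a check on how the three numbers in a square relate, take the square reading 95.3 % and 3,843 / 4,034 together with the family counts 3,912 and 3,965 given for two of the proteomes. That this square belongs to exactly those two proteomes is inferred from the arithmetic below, not stated in the figure.

```latex
\[
  \frac{3{,}843}{4{,}034} \approx 0.953 = 95.3\,\%
\]
\[
  3{,}912 + 3{,}965 - 3{,}843 = 4{,}034
\]
```

The second line suggests that the total is the number of families of A plus the number of families of B minus the shared families. The same relation holds for 3,797 + 3,965 − 3,742 = 4,020, but it gives 4,024 rather than 4,027 for the remaining square, so it is best read as an observation about these numbers rather than as the tool's definition.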

[Figure 2.10 diagram: each square gives a percentage and the number of shared / total families. The between-proteome squares are not reproduced here because their positions cannot be recovered. Color keys: "Homology between proteomes" and "Homology within proteomes". Proteomes compared, with the within-proteome square of each:
Campylobacter concisus 13826: 2,080 proteins, 1,972 families; 3.5 % (69 / 1,972)
Campylobacter curvus 525.92: 1,931 proteins, 1,885 families; 1.8 % (34 / 1,885)
Campylobacter fetus subsp. fetus 82-40: 1,719 proteins, 1,665 families; 1.5 % (25 / 1,665)
Campylobacter hominis ATCC BAA-381: 1,687 proteins, 1,623 families; 2.0 % (33 / 1,623)
Campylobacter jejuni RM1221: 1,838 proteins, 1,780 families; 2.3 % (41 / 1,780)
Campylobacter jejuni subsp. doylei 269.97: 1,731 proteins, 1,650 families; 4.0 % (66 / 1,650)
Campylobacter jejuni subsp. jejuni 81-176: 1,758 proteins, 1,702 families; 2.3 % (39 / 1,702)
Campylobacter jejuni subsp. jejuni 81116: 1,626 proteins, 1,585 families; 1.5 % (24 / 1,585)
Campylobacter jejuni subsp. jejuni NCTC 11168: 1,624 proteins, 1,581 families; 1.7 % (27 / 1,581)
Campylobacter lari RM2100: 1,546 proteins, 1,494 families; 2.3 % (34 / 1,494)]

Figure 2.10: Proteome similarity between ten Campylobacter species. Color encoding corresponds to percentage of shared protein families.


[BLASTmatrix diagram: each square gives a percentage and the number of shared / total families; the individual values are not reproduced here because their positions cannot be recovered. Proteome labels visible in the plot: A.salmonicida LFI1238, V.species Ex25, V.campbellii AND4, V.harveyi BAA1116, V.shilonii AK1, P.profundum SS9, V.parahaemolyticus 2210633, V.parahaemolyticus 16, V.vulnificus CMCP6, V.vulnificus YJ016, V.species MED222, V.splendidus LGP32, V.fischeri ES114, V.fischeri MJ11, V.cholerae MO10, V.cholerae BX330286, V.cholerae RC9, V.cholerae MJ1236, V.cholerae B33VCE, V.cholerae 2740-80, V.cholerae AM-19226, V.cholerae MZO-2, V.cholerae 12129, V.cholerae TM11079-80, V.cholerae TMA21, V.cholerae VL426, V.cholerae 1587, V.cholerae N16961, V.cholerae 0395 TEDA, V.cholerae 0395 TIGR, V.cholerae V52, V.cholerae M66-2.]

83 / 3,442<br />

V.shilonii AK1<br />

V.harveyi BAA1116<br />

2.1 %<br />

72 / 3,454<br />

V.campbellii AND4<br />

V.species Ex25<br />

2.2 %<br />

73 / 3,311<br />

A.salmonicida LFI1238<br />

V.fischeri MJ11<br />

2.5 %<br />

84 / 3,305<br />

V.fischeri ES114<br />

V.splendidus LGP32<br />

2.8 %<br />

99 / 3,586<br />

V.species MED222<br />

V.vulnificus YJ016<br />

3.5 %<br />

125 / 3,567<br />

V.vulnificus CMCP6<br />

V.parahaemolyticus 16<br />

2.6 %<br />

92 / 3,593<br />

V.parahaemolyticus 2210633<br />

V.cholerae VL426<br />

3.0 %<br />

109 / 3,575<br />

V.cholerae TMA21<br />

V.cholerae TM11079-80<br />

2.8 %<br />

102 / 3,619<br />

V.cholerae 12129<br />

V.cholerae MZO-2<br />

2.9 %<br />

100 / 3,429<br />

V.cholerae AM-19226<br />

V.cholerae 1587<br />

1.8 %<br />

59 / 3,353<br />

30.0 %<br />

V.cholerae 2740-80<br />

V.cholerae B33VCE<br />

2.8 %<br />

99 / 3,560<br />

0.0 %<br />

Homology between proteomes<br />

Homology with<strong>in</strong> proteomes<br />

V.cholerae MJ1236<br />

V.cholerae RC9<br />

3.3 %<br />

120 / 3,599<br />

V.cholerae BX330286<br />

V.cholerae MO10<br />

4.3 %<br />

157 / 3,665<br />

V.cholerae M66-2<br />

V.cholerae V52<br />

4.2 %<br />

155 / 3,729<br />

V.cholerae 0395 TIGR<br />

V.cholerae 0395 TEDA<br />

3.0 %<br />

110 / 3,665<br />

90.0 %<br />

6.0 %<br />

V.cholerae N16961<br />

Figure 2.11: Proteome comparison of 32 Vibrionaceae genomes. Environmental V. cholerae strains lacking the cholera enterotoxin genes are highlighted in bright green, whilst the genomes of pathogenic V. cholerae strains are shown in dark green.

Large similarities between environmental and pathogenic V. cholerae

The BLAST matrix shown in figure 2.11 includes environmental and pathogenic strains of V. cholerae. The figure shows that the V. cholerae strains share a large number of genes, both within and between these two groups.

Intra- vs. inter-proteome similarity

The lower row of the diagram shows the special case of an organism compared against itself, which gives the intra-proteome similarity. If not dealt with separately, this part would appear as 100% similar, since the proteome is BLASTed against itself. However, all self-matching proteins are excluded, leaving this part to reflect the paralogs of the organism. This part also has a separate color encoding (red), whereas the inter-proteome comparison is coded green (see figure 2.10).
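The handling of self-matches and the percentage shown in a matrix cell can be sketched as follows. This is only an illustration and not the BLASTmatrix implementation: the function names and the representation of hits as (query, subject) pairs are assumptions. The counts in the example are those printed in one of the intra-proteome cells of figure 2.11.

```python
def drop_self_matches(hits):
    """Remove hits where a protein matches itself, leaving candidate paralogs.

    `hits` is an assumed representation: a list of (query_id, subject_id) pairs.
    """
    return [(query, subject) for query, subject in hits if query != subject]


def percent_homology(shared, total):
    """Percentage printed in a matrix cell, rounded to one decimal."""
    return round(100.0 * shared / total, 1)


# Hypothetical hits from BLASTing a proteome against itself.
hits = [("p1", "p1"), ("p1", "p2"), ("p2", "p2"), ("p3", "p3")]
paralog_hits = drop_self_matches(hits)

# Counts taken from one intra-proteome cell of figure 2.11: 243 / 4,897.
cell_value = percent_homology(243, 4897)
```

Without the self-match filter every protein would find itself, and the intra-proteome cell would trivially read 100 %.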

2.3.7 BLASTatlas - visualizing whole-genome homology

The BLASTmatrix tool described earlier condenses the similarity between two proteomes into a single number. This simplification allows for an all-against-all comparison, but it lacks detailed information on which genes are conserved and where they are located. The BLASTatlas method overcomes these issues by comparing the proteomes to a single reference chromosome. Once a representative chromosome has been selected, all ORFs or proteins of that reference are BLASTed against each of the proteomes to be included in the comparison. The best alignment against each proteome, regardless of its significance, is mapped back to the reference genome. A numerical value of zero is mapped at mismatches or gaps, 0.5 at conservative mismatches, and one at matches. This method has proved powerful because it answers several questions in one diagram: Which reference proteins are found in which query genomes? How well are they conserved? And is there any correlation between the conservation of neighboring genes, such as within larger genomic islands? Figure 2.12 depicts the remapping of a protein-protein alignment back to the reference genome.

Figure 2.12: Mapping of a pairwise alignment to a reference genome. Mismatches, conservative mismatches and perfect matches contribute 0.0, 0.5, and 1.0, respectively, to the overall map. Gaps within the reference protein, corresponding to missing features of the reference protein, cannot be mapped and are hence excluded.

Figure 2.13: Inclusion of multiple organisms using the BLASTatlas method. Each track corresponds to a pairwise comparison against the reference chromosome.
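The scoring rule of figure 2.12 can be illustrated with a minimal sketch. It is not the BLASTatlas code. It assumes that the alignment is available as the aligned reference string plus the midline of a standard blastp text report, where a letter marks an identity, `+` a conservative substitution and a space a mismatch or gap. The function name and its parameters are hypothetical, and the sketch works in protein coordinates, whereas the tool maps the values onwards to genome coordinates.

```python
from typing import List


def map_alignment(ref_aln: str, midline: str, ref_start: int, ref_length: int) -> List[float]:
    """Map one pairwise alignment back onto the reference protein.

    ref_aln    -- aligned reference sequence, '-' marks a gap in the reference
    midline    -- blastp midline: letter = match, '+' = conservative, ' ' = mismatch/gap
    ref_start  -- 0-based position in the reference where the alignment starts
    ref_length -- length of the reference protein
    """
    if len(ref_aln) != len(midline):
        raise ValueError("aligned reference and midline must have equal length")
    scores = [0.0] * ref_length          # unaligned positions stay at 0.0
    position = ref_start
    for ref_char, mid_char in zip(ref_aln, midline):
        if ref_char == "-":
            continue                     # gap in the reference: nothing to map to
        if mid_char == "+":
            scores[position] = 0.5       # conservative mismatch
        elif mid_char != " ":
            scores[position] = 1.0       # identical residue
        position += 1                    # mismatch or query gap keeps 0.0
    return scores


# The fourth alignment column is an insertion in the query and is skipped.
example = map_alignment("MKT-LV", "M+T LV", ref_start=0, ref_length=5)
```

Repeating this for the best hit of every reference protein, one query proteome at a time, yields one numerical track per proteome.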

The result of the mapping step is a list of the same length as the reference genome. BLASTatlas then uses the GeneWiz software (Pedersen et al., 2000) to visualize this numerical data. GeneWiz applies a smoothing, and each bin is then encoded into a color representation using either a fixed range or a dynamic range, given as n standard deviations around the average. Each genome included in the comparison is plotted as an individual track. The tool is offered as a Web Service (see chapter 4). A general client script can be obtained from the online documentation at http://www.cbs.dtu.dk/ws/BLASTatlas. The client script produces a PostScript plot as output. In the next sections, examples are provided that demonstrate the flexibility of the tool.
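The two colour-scaling modes can be sketched as below. This is an assumption-laden illustration and not GeneWiz itself: simple bin averaging stands in for the smoothing step, whose actual algorithm is not described here, and the function names and the 0-to-1 intensity scale are hypothetical.

```python
from statistics import mean, pstdev


def bin_means(values, bin_size):
    """Reduce a per-position list to one average value per bin."""
    return [mean(values[i:i + bin_size]) for i in range(0, len(values), bin_size)]


def intensity(binned, fixed_range=None, n_dev=3.0):
    """Scale binned values to a colour intensity between 0.0 and 1.0.

    fixed_range -- (low, high) for a fixed scale, or None for a dynamic scale
    n_dev       -- dynamic scale spans n_dev standard deviations around the average
    """
    if fixed_range is not None:
        low, high = fixed_range
    else:
        average, deviation = mean(binned), pstdev(binned)
        low, high = average - n_dev * deviation, average + n_dev * deviation
    if high == low:
        return [0.5] * len(binned)       # flat data: use the middle of the scale
    return [min(1.0, max(0.0, (value - low) / (high - low))) for value in binned]


per_position = [1.0, 1.0, 0.0, 0.0, 0.5, 0.5]
binned = bin_means(per_position, bin_size=2)
fixed = intensity([0.0, 0.4, 0.8, 1.0], fixed_range=(0.0, 0.8))
```

A fixed range makes tracks directly comparable with each other, while a dynamic range emphasizes the regions that deviate most from the average of a single track.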

Gene loss in Burkholderia species

A comparative study aimed at mapping pathogenicity islands or gene losses among different bacterial genomes can benefit from the graphical representation provided by the BLASTatlas method. The genus Burkholderia covers a number of important animal and human pathogens known to cause melioidosis (B. pseudomallei) and pulmonary infection in CF patients (B. cepacia), whereas B. thailandensis, which is closely related to B. pseudomallei, rarely gives rise to disease in humans (Brett et al., 1998; Smith et al., 1997). All publicly available and fully sequenced Burkholderia genomes are compared to chromosomes I and II of B. pseudomallei 1710b. The code listing below describes how the comparison was made. It demonstrates the flexibility of the tool, which allows for easy automation by reading simple configuration files, in this case generated by a MySQL query. The output configuration file is listed in appendix D.3.

# let mysql construct the blast configuration file
mysql --raw -B -N -e 'select concat("legend:",replace(organism_name,"Burkholderia","B."),"\nprogram:blastp\ncolor:",if(organism_name like "%pseudomal%","101010_000009",if(organism_name like "%mallei%","101010_000900",if(organism_name like "%cenocep%","101010_080000",if(organism_name like "%ambi%","101010_020002",if(organism_name like "%thailand%","101010_000900","101010_050505"))))),"\nrange:0.0,0.8\nsource:files/",pid,".fsa\n") from genomeatlas3_cur.genbank_complete_prj where organism_name like "burkhold%" and organism_name not like "%1710b%" order by organism_name;' > blast.cfg

# copy genbank files of chr I and II
foreach acc ( CP000124 CP000125 )
  cp /home/databases/genomeatlasdb-3.0_cur/data/$acc/$acc.gbk .
  saco_convert -I genbank -O annotation $acc.gbk > $acc.ann
  saco_extract -I genbank -O fasta -t $acc.gbk > $acc.proteins.fsa
  saco_convert -I genbank -O fasta $acc.gbk > $acc.fsa
end

# run the BLASTatlas client script on both chromosomes
perl BLASTatlas -modus circle -ref CP000124.fsa -proteins CP000124.proteins.fsa \
  -ann CP000124.ann -blastcfg blast.cfg --dnap="Percent AT,GC Skew" \
  -title "B. pseudomallei 1710b, chr I" > burkholderia_chrI.ps
perl BLASTatlas -modus circle -ref CP000125.fsa -proteins CP000125.proteins.fsa \
  -ann CP000125.ann -blastcfg blast.cfg --dnap="Percent AT,GC Skew" \
  -title "B. pseudomallei 1710b, chr II" > burkholderia_chrII.ps
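The configuration entries produced by the MySQL query above can also be generated without a database. The sketch below mirrors the logic of the query, one entry per genome with the keys legend, program, color, range and source. The function names are hypothetical, the project identifier in the example is invented, and the exact file format expected by the client script should be checked against appendix D.3.

```python
# Colour rules in the same order as the nested if() calls of the query.
# The order matters: "pseudomallei" also contains "mallei".
COLOR_RULES = [
    ("pseudomal", "101010_000009"),
    ("mallei", "101010_000900"),
    ("cenocep", "101010_080000"),
    ("ambi", "101010_020002"),
    ("thailand", "101010_000900"),
]
DEFAULT_COLOR = "101010_050505"


def color_for(organism_name):
    """Return the colour code of the first rule matching the organism name."""
    lowered = organism_name.lower()      # lower-casing for a case-insensitive match
    for fragment, color in COLOR_RULES:
        if fragment in lowered:
            return color
    return DEFAULT_COLOR


def config_entry(organism_name, pid):
    """Build one track entry of the BLAST configuration file."""
    legend = organism_name.replace("Burkholderia", "B.")
    return (
        f"legend:{legend}\n"
        "program:blastp\n"
        f"color:{color_for(organism_name)}\n"
        "range:0.0,0.8\n"
        f"source:files/{pid}.fsa\n"
    )


# The project id 12345 is invented for illustration only.
entry = config_entry("Burkholderia mallei SAVP1", 12345)
```

Writing the concatenated entries of all genomes to blast.cfg would give the same kind of file as the query, with one track per genome.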

The plots of the two chromosomes are shown in figure 2.14. The other B. pseudomallei genomes stand out as three dark blue tracks, representing high homology within the species. Both B. thailandensis and B. mallei display large chromosomal deletions when compared to B. pseudomallei. However, the more scattered nature of the gene loss observed in B. thailandensis suggests that B. mallei evolved from B. pseudomallei through the loss of larger regions (Ong et al., 2004). These deletions are evident from the atlases shown in figure 2.14, and they show a strong preference for chromosome II. Ong and co-workers report that deletions in chromosome II account for 70% and 61% of the total gene loss in B. mallei and B. thailandensis, respectively.

The Alcanivorax phylome BLASTatlas

Tracks on the BLASTatlas are not limited to single genomes or proteomes. The sequence files specified for a given track are converted into a BLAST database, and the reference genome is searched against the database of each track. A track may therefore just as well be a collection of genomes, an entire phylum or even SwissProt. In Paper III a 'phylome' atlas was constructed for the oil-degrading marine bacterium Alcanivorax borkumensis (Reva et al., 2008). Here, tracks were constructed by collecting all proteins of all published bacterial genomes, of all proteobacteria, and of all γ-, α-, β-, δ-, and ε-proteobacteria (see figure 2.15). The phylome atlas reveals no or very few homologs in δ- and ε-proteobacteria and some homologs in α- and β-proteobacteria, whereas the highest sequence homology was identified among γ-proteobacteria.
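How genomes could be pooled into such nested tracks is sketched below. The data structure, the function name and the genome identifiers are hypothetical, and the sketch says nothing about how Paper III actually assembled its sequence files. It only illustrates that one genome can contribute to several tracks at once.

```python
from collections import defaultdict

# Track names follow the taxonomic groups named in the text.
TRACKS = [
    "Bacteria", "Proteobacteria", "Gammaproteobacteria", "Alphaproteobacteria",
    "Betaproteobacteria", "Deltaproteobacteria", "Epsilonproteobacteria",
]


def build_tracks(genomes):
    """Assign each genome to every track named in its lineage.

    `genomes` is an iterable of (genome_id, lineage) pairs, where lineage
    is a list of taxon names from domain downwards.
    """
    tracks = defaultdict(list)
    for genome_id, lineage in genomes:
        for taxon in lineage:
            if taxon in TRACKS:
                tracks[taxon].append(genome_id)
    return dict(tracks)


# Invented genome identifiers, for illustration only.
genomes = [
    ("genomeA", ["Bacteria", "Proteobacteria", "Gammaproteobacteria"]),
    ("genomeB", ["Bacteria", "Proteobacteria", "Epsilonproteobacteria"]),
    ("genomeC", ["Bacteria", "Firmicutes"]),
]
tracks = build_tracks(genomes)
```

The protein files of all genomes assigned to a track would then be concatenated into the sequence file for that track.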

[Figure 2.14 plots: circular BLAST atlases of B. pseudomallei 1710b chr I (4,126,292 bp) and chr II (3,181,762 bp), resolution 1273. Each atlas has one BLAST track per genome, with range 0.00 to 0.80: B. ambifaria AMMD, B. ambifaria MC40-6, B. cenocepacia AU 1054, B. cenocepacia HI2424, B. cenocepacia J2315, B. cenocepacia MC0-3, B. glumae BGR1, B. mallei ATCC 23344, B. mallei NCTC 10229, B. mallei NCTC 10247, B. mallei SAVP1, B. multivorans ATCC 17616, B. phymatum STM815, B. phytofirmans PsJN, B. pseudomallei 1106a, B. pseudomallei 668, B. pseudomallei K96243, B. sp. 383, B. thailandensis E264, B. vietnamiensis G4, B. xenovorans LB400. Further tracks: annotations (CDS +, CDS -, rRNA, tRNA), Percent AT (0.21 to 0.42) and GC Skew (-0.09 to 0.09). Source: Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

Figure 2.14: Comparison of B. pseudomallei 1710b chromosome I and II against all public Burkholderia genomes.


[Figure 2.15 plot: phylome atlas of A. borkumensis (3,120,143 bp), resolution 1249. BLAST tracks and ranges: Bacteria, Proteobacteria and gamma (0.00 to 0.50); alpha, beta, delta and epsilon (0.00 to 0.30). Further tracks: annotations (CDS +, CDS -, rRNA, tRNA) and Percent AT (0.40 to 0.51). Source: Center for Biological Sequence Analysis, http://www.cbs.dtu.dk/]

Figure 2.15: A phylome atlas of Alcanivorax borkumensis, comparing the proteome against all γ-, α-, β-, δ-, and ε-proteobacteria available at the time of publishing.


[Figure 2.16 plot: counts of strains and species (axis 0 to 50) for the genera Streptococcus, Escherichia, Bacillus, Clostridium, Burkholderia, Mycobacterium, Candidatus, Staphylococcus, Shewanella and Mycoplasma.]

Figure 2.16: Count of genomes and species divided by genera. Source: CBS Genome Atlas Database as of 2009-09-11.

2.3.8 CorePlot - plotting the core- and pan-genomes of species

There are a number of bacterial genera for which numerous strains and species are fully sequenced. Streptococcus (43 strains), Escherichia (29 strains), and Bacillus (25 strains) are the most highly represented genera among the Bacteria (Genome Atlas Database, 2009-09-11). Figure 2.16 shows the genome and species counts of the 10 most sampled genera. The increased depth at which bacterial genera are sequenced has previously been used to estimate the core- and pan-genome by fitting an exponentially decaying function. An often used approach is to perform either a limited or a full permutation of the genome order (Lefebure & Stanhope, 2007; Tettelin et al., 2005). This provides an error estimate for every step at which a genome is added. An alternative method was developed during the Ph.D. project. It derives the protein families by grouping homologous proteins, but uses a fixed order of genomes. Homologs are generated by pairwise protein BLAST between proteomes followed by a grouping of all significant alignments (50% alignment length and 50% conservation within the alignment). The method can re-use cached BLAST reports from the BLASTmatrix method. The example below uses the same proteome files as were generated in the BLASTmatrix example (section 2.3.6 and appendix D.4), and it demonstrates how a MySQL query can be used as configuration for the CorePlot program.

mysql -N -B -e "select organism_name, concat(pid,'.proteins.fsa') from genomeatlas3_cur.genbank_complete_prj where organism_name like 'campylobacter%' order by organism_name" > table.dat
perl ~pfh/scripts/coregenome/coregenome-2.3 < table.dat > core.ps

Both the BLASTmatrix and the coregenome scripts access the same MySQL caching databases, so the user does not have to worry about how results are cached and shared between the two programs. Figure 2.17 shows the core- and pan-genome plot generated by the program.
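The grouping of significant alignments into protein families can be sketched as follows. This is not the coregenome script. It assumes that "50% alignment length" means that the alignment covers at least half of the longer of the two proteins and that "50% conservation" refers to identical residues within the alignment. Both readings are assumptions, as are the function names. Single-linkage grouping with a union-find structure is used here as one plausible way of forming families.

```python
def is_significant(aln_length, identities, query_length, subject_length):
    """Apply the assumed 50 % / 50 % rule to one BLAST alignment."""
    longest = max(query_length, subject_length)
    return aln_length >= 0.5 * longest and identities >= 0.5 * aln_length


def group_families(proteins, significant_pairs):
    """Group proteins into families: any significant hit joins two families."""
    parent = {protein: protein for protein in proteins}

    def find(protein):
        # Follow parent links to the family representative, compressing the path.
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]
            protein = parent[protein]
        return protein

    for first, second in significant_pairs:
        parent[find(first)] = find(second)

    families = {}
    for protein in proteins:
        families.setdefault(find(protein), set()).add(protein)
    return sorted(families.values(), key=sorted)


# Hypothetical protein identifiers: a, b and c are linked, d stays alone.
families = group_families(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
```

With single linkage, two proteins end up in the same family as soon as a chain of significant hits connects them, even if they do not hit each other directly.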

By using a fixed genome order, it is possible to compare multiple species within the same plot and to reveal varying slopes of the pan- and core-genome graphs. Figure 2.17 shows that the first 5 strains come from distinct species, giving rise to a steep increase of the pan-genome and a reduction of the core genome. The following five genomes come from C. jejuni, and the curves appear to flatten out at a core size of 600 proteins and a pan-genome size of 5,200 proteins. In figure 2.18 a larger core- and pan-genome plot for Vibrio species is shown (paper IV).
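The accumulation of the core- and pan-genome over a fixed genome order can be sketched in a few lines. The sketch assumes that every genome has already been reduced to a set of protein family identifiers that are consistent across genomes. The function name and the toy data are hypothetical and do not reproduce the CorePlot output.

```python
def core_pan_curve(genomes):
    """Walk through the genomes in the given order and track family counts.

    `genomes` is a list of (name, set_of_family_ids) pairs.
    Returns one (name, new_families, core_size, pan_size) tuple per genome.
    """
    pan = set()
    core = None
    rows = []
    for name, families in genomes:
        new_families = families - pan            # families not seen in earlier genomes
        pan |= families                          # pan-genome: union so far
        core = set(families) if core is None else core & families  # core: intersection so far
        rows.append((name, len(new_families), len(core), len(pan)))
    return rows


# Toy data with invented family identifiers.
curve = core_pan_curve([
    ("genome 1", {1, 2, 3}),
    ("genome 2", {2, 3, 4}),
    ("genome 3", {3, 5}),
])
```

Because the order is fixed, the pan-genome can only grow and the core genome can only shrink along the curve, and a genome from a new species typically shows up as a jump in both.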

[Figure 2.17 plot: axes are genome number (1 to 10) and count (0 to 7000). Series: New genes, New gene families, Core genome, Pan genome. Genome order: 1: Campylobacter concisus 13826; 2: Campylobacter curvus 525.92; 3: Campylobacter fetus subsp. fetus 8240; 4: Campylobacter hominis ATCC BAA381; 5: Campylobacter jejuni RM1221; 6: Campylobacter jejuni subsp. doylei 269.97; 7: Campylobacter jejuni subsp. jejuni 81176; 8: Campylobacter jejuni subsp. jejuni 81116; 9: Campylobacter jejuni subsp. jejuni NCTC 11168; 10: Campylobacter lari RM2100.]

Figure 2.17: Pan- and core-genome plot of 10 Campylobacter genomes. For the data currently available, there seems to exist an equilibrium at close to 600 protein families.

[Figure 2.18 reproduces a page from paper IV. The reproduced text begins and ends mid-sentence:]

pan-genome (blue line) increases, and the number of conserved gene families (red line) in the core genome decreases, albeit at a lower rate. This is because every genome can add many novel (and frequently different) genes to the pan-genome but only decreases the core genome with a few genes that are absent in that particular strain but that were conserved in the previously given genomes. The pan-genome curve increases with a relatively steep slope when a novel species is added, as is obvious when one V. parahaemolyticus genome is added after the 18th V. cholerae. A stable plateau can be seen for the pan genome of the V. cholerae genomes around 6500 genes, whereas the core genome steadily decreases to approximately 1000 genes for these 32 genomes. A. salmonicida, although not a member of the Vibrio genus, does not add significantly more genes to the pan genome than the other Vibrio species do, in contrast to P. profundum which produces a sharp increase in the pan genome, as does, interestingly, V. shilonii. Note that there are approximately 20,000 total gene families within the 30 sequenced Vibrionaceae genomes.

In fact, the small jump seen in the pan genome of V. cholerae when adding the 11th genome (figure 3) is caused by the difference between the two subclusters of V. cholerae seen in the pan-genome family tree (figure 2). Note that the 10th strain (V. cholerae 2740-80) behaves as an outlier in all the figures shown; although documented as an environmental isolate, this appears closer to the clinical isolates, in terms of overall genomic properties.

[Plot: one position per genome for the 32 Vibrionaceae genomes, count axis 0 to 25000. Series: Pan genome, Core genome, New gene families.]

Figure 3. Pan- and core-genome plot of the 32 Vibrionaceae genomes. V. cholerae strains that do not cause cholera are highlighted in bright green. Colours are the same as in Figure 2.

BLAST comparison visualized in a BLAST matrix

A BLAST matrix provides a visual overview of reciprocal pairwise whole genome comparisons (figure 4). The stronger a matrix cell is colored, the more similarity was

Figure 2.18: CorePlot output for 32 Vibrio genomes.


2.4 Summary

This chapter presents a number of comparative genomics and visualization tools used in a genome annotation and analysis pipeline. Visualization methods have been shown to help draw biological conclusions about adaptation to environmental niches, pathogenic properties, and many other genomic properties, including proteome similarity. Obtaining an overview of the large amount of genomic data is a constant challenge that will need more attention in the future, as genome sequencing becomes more and more common. How can one visualize a comparison of a thousand genomes? Soon there will be a need to compare sets of thousands of genomes.


2.5 Instant <strong>in</strong>sight: Read<strong>in</strong>g the genetic atlas



2.6 Paper I: The genome BLASTatlas - a GeneWiz extension<br />

for visualization of whole-genome homology<br />

Molecular BioSystems, Volume 4, Number 5, May 2008, Pages 353–444
www.molecularbiosystems.org
ISSN 1742-206X
Indexed in MEDLINE

HIGHLIGHT: Peter F. Hallin et al., The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology
REVIEW: Eric C. Greene et al., The importance of surfaces in single-molecule bioscience


HIGHLIGHT www.rsc.org/molecularbiosystems | Molecular BioSystems<br />

The genome BLASTatlas—a GeneWiz<br />

extension for visualization of whole-genome<br />

homology<br />

Peter F. Hall<strong>in</strong>, Tim T. B<strong>in</strong>newies* <strong>and</strong> David W. Ussery<br />

DOI: 10.1039/b717118h<br />

The development of fast and inexpensive methods for sequencing bacterial genomes has led to a wealth of data, often with many genomes being sequenced from the same species or closely related organisms. Thus, there is a need for visualization methods that allow easy comparison of many sequenced genomes to a defined reference strain. The BLASTatlas is one such tool, useful for mapping and visualizing whole-genome homology of genes and proteins within a reference strain compared to other strains or species of one or more prokaryotic organisms. We provide examples of BLASTatlases, including the Clostridium tetani plasmid pE88, where homologues for toxin genes can be easily visualized in other sequenced Clostridium genomes, and a Clostridium botulinum genome compared to 14 other Clostridium genomes. DNA structural information is also included in the atlas to visualize the DNA chromosomal context of regions. Additional information can be added to these plots, and as an example we have added circles showing the probability of the DNA helix opening up under superhelical tension. The tool is SOAP compliant, and WSDL (web services description language) files are located on our website (http://www.cbs.dtu.dk/ws/BLASTatlas), where programming examples are available in Perl. By providing an interoperable method to carry out whole-genome visualization of homology, this service offers bioinformaticians as well as biologists an easy-to-adopt workflow that can be called directly from the programming language of the user, hence enabling automation of repeated tasks. This tool can be relevant in many pangenomic as well as metagenomic studies, by giving a quick overview of clusters of insertion sites, genomic islands and overall homology between a reference sequence and a data set.

Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, 2800 Lyngby, Denmark. E-mail: pfh@cbs.dtu.dk; tim@cbs.dtu.dk; dave@cbs.dtu.dk

Peter F. Hallin was born in Odense, Denmark, and is currently a PhD student at CBS, DTU. Tim T. Binnewies grew up in Kiel, Germany, and obtained his PhD from the Technical University of Denmark; he is currently working for Roche Diagnostics AG in Switzerland. David W. Ussery was born and raised in Springdale, Arkansas. Since 1998, he has been leader of the Comparative Genomics group at CBS.

This journal is © The Royal Society of Chemistry 2008. Mol. BioSyst., 2008, 4, 363–371

Background

It has been more than 10 years since the sequencing of the first bacterial genome (ref. 1, US patent number 6,528,289), and sequence data are currently available for more than a thousand genomes. With so many genome sequences, multiple sequences exist for several bacterial species; for example, at the time of writing, 10 different Escherichia coli genomes have been fully sequenced and published, and draft sequences for another 31 genomes are available, adding up to a total of 41 different E. coli genomes (according to the National Center for Biotechnology Information, NCBI Entrez, 12-Feb-2008). Table 1 lists the 20 most represented prokaryotic genera in terms of numbers of fully sequenced genomes, based on a recent count in Entrez Genome Projects, although these numbers will change quickly as more genomes are added on a regular basis. Thus, analysis of multiple genomes of the same organism (the "pangenome") is now possible, and as more metagenomic datasets are published (see for example the projects listed on the GOLD web pages, ref. 24), there is a need for a graphical representation of how these new data compare to existing reference strains or model organisms.

We have developed a visualization method, called "BLASTatlas", for showing mapped alignments of BLAST searches of a reference sequence against one or more databases, onto the reference genome. Early implementations of a similar method (refs. 2–4) accounted for the statistical significance (E-value) of each hit by color coding the expectation value [-log(E)] of the alignment. This gives a uniform color throughout the alignment (gene or protein) but shows no information about amino acid conservation within regions of the alignment. At the level of a bacterial chromosome this makes little difference, but when one zooms in to the level of individual genes, shading the entire gene based on the E-value gives no information about regions within a gene (such as functional domains) which might be strongly conserved whilst other parts of the gene have little sequence homology in other genomes. We have refined the BLASTatlas method to map each individual amino acid residue or nucleotide back to the reference genome sequence from which the coding sequence was derived. Instead of colour-coding the significance of the entire hit, this method maps the conservation of the individual bases or amino acids. Tools such as the Artemis Comparison Tool (ACT, ref. 5) allow detailed viewing of complete BLAST results, and this is an excellent graphical method for comparison of two genomes. ACT can also be extended to compare two genomes to a reference placed in the middle. In contrast, the BLASTatlas method can compare many genomes to the same reference, and can provide a quick overview of chromosomal regions of gene conservation across many genomes.

As can be seen from Table 1, for many of the heavily sampled genera there are further genome projects in the pipeline which will produce even more sequences than are currently available, and there is a need for methods for efficient comparison of these genomes, giving an overview of general trends in the data. The BLASTatlas allows the comparison of many genomes to a reference sequence. The current limit is about 60 genomes. There are two levels of comparison: the first is a one-page map of the whole chromosome, and the second zooms in on a particular region of interest, allowing the visualization of regions of conservation within individual genes. The color-coding represents identical amino acids (or nucleic acids), based on a pairwise alignment of all protein coding regions, with the best match for each gene in the reference genome shown. Thus, combining both levels, it is possible to get a global overview of the whole chromosome, and then to quickly identify gene conservation (or lack thereof) in regions of interest, at the level of conservation of individual amino acid residues.

Table 1 The number of species and NCBI Entrez Project IDs of the 20 most represented genera in the Entrez Genome Projects Database (ref. 13), as accessed on 21 October 2007. The first number counts only completed projects; the number in brackets counts both ongoing and completed projects. Candidate genera have been excluded from this count.

Genus | Projects | Species
Streptococcus | 26 [63] | 8 [15]
Burkholderia | 15 [55] | 8 [15]
Bacillus | 16 [48] | 9 [16]
Clostridium | 14 [43] | 9 [22]
Vibrio | 7 [35] | 5 [14]
Mycobacterium | 16 [30] | 9 [14]
Salmonella | 5 [30] | 2 [3]
Listeria | 4 [29] | 3 [6]
Escherichia | 10 [27] | 1 [1]
Mycoplasma | 13 [25] | 11 [17]
Shewanella | 14 [24] | 10 [15]
Pseudomonas | 13 [23] | 7 [8]
Yersinia | 9 [23] | 3 [7]
Haemophilus | 6 [23] | 3 [4]
Staphylococcus | 17 [22] | 4 [5]
Synechococcus | 10 [21] | 2 [2]
Campylobacter | 9 [20] | 5 [9]
Francisella | 7 [16] | 1 [2]
Lactobacillus | 11 [15] | 10 [12]
Rickettsia | 10 [15] | 9 [12]

Clostridium botulinum is an important human pathogen and the causative agent of botulism, in which botulinum neurotoxin (BoNT) disrupts nerve function and can give rise to fatal paralysis of the respiratory muscles. The genes encoding BoNT components are clustered on the bacterial chromosome (group I and II strains), on prophages (group III strains) or on plasmids (group IV strains). Group I strains encode type A, B and F toxins, group II strains produce type B, E and F toxins, group III strains encode type C and D toxins, and group IV strains produce type G toxin (ref. 6). We use the BLASTatlas method to show the overall genome homology of the C. botulinum strain F Langeland, compared to all currently available and fully sequenced strains of the Clostridium genus.

Methods

The BLASTatlas method takes all annotated coding sequences (or proteins) of a reference genome and compares each of them with one or more genomes. The total genome sequence for each organism is represented by a database, which can contain any number of DNA or protein sequences. BLAST searches with a non-stringent E-value cut-off of 0.01 are used to identify the best alignments between each reference protein and the database (genome) in question. Once identified, the single best pairwise alignment for each of the reference sequences is obtained and included in the map.
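The best-hit selection described above can be sketched as follows. This is a minimal illustration and not the service's own code: the `Hit` record, the function name `best_hits` and the choice of the highest bit score as the definition of "best" are assumptions made for the example; only the E-value cut-off of 0.01 is taken from the text.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout for this sketch)."""
    query_id: str      # reference sequence (the BLAST "query")
    subject_id: str    # sequence found in the lane's database
    evalue: float
    bit_score: float


def best_hits(hits: Iterable[Hit], max_evalue: float = 0.01) -> Dict[str, Hit]:
    """Keep, per reference sequence, the single best hit within the cut-off."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # fails the E-value cut-off
        current = best.get(hit.query_id)
        # Assumption: "best" means the highest bit score.
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query_id] = hit
    return best
```

Reference sequences without any hit below the cut-off are simply absent from the result, which corresponds to an empty stretch in the lane.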

The reference genome of a given comparison has a fixed size, whereas the sequences to be compared can be thought of as simply a "pile of proteins", ranging in size from a small phage, to a single genome, an entire metagenomic sample, or even existing large BLAST databases such as UniProt. It is important to emphasize that each protein in the reference genome is compared to all the proteins in the comparison set, regardless of orientation or location. The BLASTatlas method uses the software BLASTALL v. 2.2.11 for the search, and in BLAST terminology the reference genome constitutes the 'query', whereas each other genome in the comparison (e.g., a lane or circle in the atlas) corresponds to the 'database'. We define a lane as a visual representation of mapped database hits (individual residue matches) on to the reference genome. A lane can have a box filter (smoothing) applied within each of the smallest visible units of the atlas (the resolution of the graphical representation). A single BLASTatlas may contain several lanes; currently the upper limit is around 60 circles.
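One way to read the box filter mentioned above is as an average over each smallest visible unit of the plot. The sketch below assumes non-overlapping windows of a fixed width and a plain arithmetic mean; the function name `box_filter` and the `width` parameter are hypothetical, and the actual smoothing used by the software may differ.

```python
from typing import List, Sequence


def box_filter(scores: Sequence[float], width: int) -> List[float]:
    """Average consecutive, non-overlapping windows of `width` positions."""
    if width < 1:
        raise ValueError("width must be at least 1")
    smoothed: List[float] = []
    for start in range(0, len(scores), width):
        window = scores[start:start + width]  # the last window may be shorter
        smoothed.append(sum(window) / len(window))
    return smoothed
```

For a genome of several million positions drawn on a single page, `width` would be roughly the number of bases represented by one drawable unit.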

The input requires a file containing the genome sequence, including all annotated coding sequences (comprising protein start, stop and direction) for the reference genome. The four programs 'BLASTp', 'BLASTn', 'BLASTx' and 'tBLASTn' can be used for each lane of the BLASTatlas, although of course the appropriate sequences (DNA or protein) must be provided. For example, when using 'BLASTn' or 'tBLASTn' in a lane, the required DNA sequence can be a set of open reading frames (ORFs), chromosomal contigs, entire genome sequences or even environmental (metagenomic) samples. In a pairwise fashion, the sequence of the reference is BLASTed against each database defined by the user, employing the specified BLAST algorithm.
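The pairing between sequence types and the four programs named above follows the standard BLAST conventions and can be written down as a small lookup. The function name and the "dna"/"protein" labels are illustrative choices for this sketch, not parameters of the web service.

```python
def choose_blast_program(query_type: str, database_type: str) -> str:
    """Return the BLAST program matching the query and database sequence types."""
    programs = {
        ("protein", "protein"): "blastp",   # protein query, protein database
        ("dna", "dna"): "blastn",           # DNA query, DNA database
        ("dna", "protein"): "blastx",       # translated DNA query, protein database
        ("protein", "dna"): "tblastn",      # protein query, translated DNA database
    }
    try:
        return programs[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: {query_type!r} vs {database_type!r}"
        ) from None
```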

Interpretation of BLAST alignments

For each of the sequences defined in the reference, only the best hit in each database is stored. For these hits, the alignments are mapped on to the reference genome. When aligning two DNA sequences, the map shows one of four possible states for each position: match, mismatch, gap in query (reference genome), and gap in database (lane). Only a match contributes to the overall score, with a value of 1, whereas mismatches and gaps in the database get a score of zero. When aligning two protein sequences, an additional state is introduced for conservative mismatches, indicating that two amino acids have similar physical–chemical properties; such a state receives a score of 0.5. Match and gap states of protein alignments are defined as for DNA alignments. Gaps in the reference sequence have no corresponding coordinate and are therefore ignored (see Fig. 1). In the BLASTatlas context, a map is an array of match scores. The array has the same length as the reference genome, with each position along a gene having a value of 0, 0.5 or 1. It should be noted that intergenic regions (and ncRNAs, including tRNAs and rRNAs) have values of 0, because BLASTatlases only compare protein-encoding genes. We use this as a control, checking for example that the rRNA operons are visualized as "gaps" throughout all the lanes.
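The scoring rules above can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the BLASTatlas source: the function names are hypothetical, the set of conservative amino acid pairs is supplied by the caller, each protein residue is assumed to cover three nucleotides, and `start` is a 0-based coordinate of the leftmost aligned nucleotide.

```python
from typing import AbstractSet, FrozenSet, List

GAP = "-"


def residue_scores(ref_aln: str, db_aln: str,
                   conservative: AbstractSet[FrozenSet[str]] = frozenset()
                   ) -> List[float]:
    """Score each reference position of a pairwise alignment as 0, 0.5 or 1."""
    if len(ref_aln) != len(db_aln):
        raise ValueError("aligned sequences must have equal length")
    scores: List[float] = []
    for ref_char, db_char in zip(ref_aln, db_aln):
        if ref_char == GAP:
            continue                      # gap in reference: no coordinate, ignored
        if db_char == GAP:
            scores.append(0.0)            # gap in database: non-conserved
        elif ref_char == db_char:
            scores.append(1.0)            # match
        elif frozenset((ref_char, db_char)) in conservative:
            scores.append(0.5)            # conservative mismatch (proteins only)
        else:
            scores.append(0.0)            # mismatch
    return scores


def place_on_genome(genome_map: List[float], scores: List[float],
                    start: int, strand: int = 1, width: int = 3) -> None:
    """Write per-residue scores into the genome-length map, in place."""
    expanded = [score for score in scores for _ in range(width)]
    if strand < 0:
        expanded.reverse()                # minus-strand genes run right to left
    end = start + len(expanded)
    if start < 0 or end > len(genome_map):
        raise ValueError("aligned region falls outside the reference genome")
    genome_map[start:end] = expanded
```

Starting from a map filled with zeros, positions that are never written (intergenic regions, rRNA and tRNA genes) keep the value 0, which is the behaviour used as a control in the text. For DNA alignments `width` would be 1 and `conservative` left empty.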

For each database defined, there will be a corresponding BLAST map within the atlas (see Fig. 2). Each database entry of the BLAST searches must contain a legend text for the lane, a colour code range and a scaling method. For the colours, an upper and a lower colour are required, whereas the middle colour is usually grey; all colours are defined as RGB integers ranging from 0 to 10. The scale can be either fixed, such as ranging from 0 to 1, or scaled using any number of standard deviations around the average.
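The two scaling options can be sketched as follows. The function names, the method labels "fixed" and "std", and the use of the population standard deviation are assumptions for illustration; the text only states that a lane is scaled either over a fixed range or over a number of standard deviations around the average.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def scale_limits(values: Sequence[float], method: str = "fixed",
                 fixed_range: Tuple[float, float] = (0.0, 1.0),
                 n_std: float = 3.0) -> Tuple[float, float]:
    """Return the (lower, upper) data values mapped to the two end colours."""
    if method == "fixed":
        return fixed_range
    if method == "std":
        average = mean(values)
        spread = n_std * pstdev(values)   # assumption: population std. deviation
        return (average - spread, average + spread)
    raise ValueError(f"unknown scaling method: {method!r}")


def normalise(value: float, limits: Tuple[float, float]) -> float:
    """Map a value to 0..1 within the limits, clamping values outside them."""
    low, high = limits
    if high == low:
        return 0.5                        # degenerate range: use the middle colour
    return min(1.0, max(0.0, (value - low) / (high - low)))
```

A normalised value of 0.5 would then correspond to the middle (usually grey) colour, with 0 and 1 corresponding to the lower and upper colours.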

DNA properties

The BLASTatlas method allows users to add structural as well as base composition information to the atlas by using the 'DNAparameters' element in the request. These can be, for example, DNA structural properties (ref. 7) such as intrinsic curvature (ref. 8), global or local repeats (ref. 9) or other measures of base composition (ref. 10). A list of the properties currently pre-computed can be obtained via the online documentation and type declarations of the web services description. The DNA property lanes are usually added near the center of the atlas (or at the lowest part when seen from the outermost circle).

Custom properties

In addition to the standard DNA properties and BLAST maps, the web service provides a method for adding custom data, for example gene expression values, to the atlas, using the 'customMap' element in the request. Data must be provided in the form of comma-separated strings, with each position in the list corresponding to a genomic position. When defining custom data lanes, the colour ranges, scaling method and legend text must be provided.
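Preparing such a comma-separated string from per-position values might look like the sketch below. The function name is hypothetical, and the check that exactly one value is supplied per genomic position is an assumption based on the description above rather than a documented requirement of the 'customMap' element.

```python
from typing import Sequence


def custom_map_string(values: Sequence[float], genome_length: int) -> str:
    """Serialise one value per genomic position as a comma-separated string."""
    if len(values) != genome_length:
        raise ValueError(
            f"expected {genome_length} values, got {len(values)}"
        )
    # The "g" format drops trailing zeros, keeping the string compact.
    return ",".join(format(value, "g") for value in values)
```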

Fig. 1 Mapping of protein–protein alignment to DNA. Panel A: mismatches and perfect matches are assigned a score of 0 and 1, respectively. Conservative mismatches are assigned a score of 0.5. In the case of DNA alignment, only scores of 0 and 1 are possible. Panel B: gaps in the database sequence are rendered as non-conserved areas (filled with zeros). Panel C: gaps in the reference sequence are neglected, since they have no corresponding region in the reference genome into which they can be mapped.

Fig. 2 Genes (or segments) from each genome are compared with a reference gene, as shown in the left panel; a pairwise comparison is made using one of the BLAST algorithms. On the right is shown the "remapping", or the representation of each of the BLAST runs on the left, mapped onto the chromosomal sequence. Note that gaps in the reference gene (grey) are not included in the colored maps of the atlas.

Visualization

Details such as the atlas title and the geometry (linear or circular representation) are necessary for the final visualization. Once the BLAST searches have been carried out and remapped to the reference genome, and custom data and DNA properties have been collected, an XML configuration file is composed which contains all these data and the layout of the atlas. This file is then sent to the GeneWiz software (ref. 7), which produces a PostScript document that is then base64 encoded to allow transport via XML. This part of the process takes place on the server and requires no user interaction. An example atlas of a plasmid is shown in Fig. 3, and will be discussed in more detail below.
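On the client side, the base64-encoded PostScript returned by the service has to be decoded before it can be saved or viewed. A minimal sketch, assuming the encoded document has already been extracted from the XML response as a string (the function name is hypothetical):

```python
import base64


def decode_postscript(encoded: str) -> bytes:
    """Decode a base64 string and check that it looks like a PostScript file."""
    data = base64.b64decode(encoded)
    # PostScript documents conventionally begin with the "%!" marker.
    if not data.startswith(b"%!"):
        raise ValueError("decoded payload does not look like PostScript")
    return data
```

The returned bytes can be written to a file opened in binary mode and passed to any PostScript viewer or converter.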

Web services implementation

A WSDL (web services description language) file has been written which describes the operations (runAtlas, pollQueue, fetchAtlasResult) and their input requirements. The file can be downloaded. All input/output objects are defined in a separate XSD (XML schema definition) file within the WSDL file, which comprises information and type restrictions applicable to the request. This serves as documentation of the objects as well as a way to validate a request before it is submitted. Unfortunately, validation support in the Perl modules is not yet optimal, whereas this option is well implemented in tools like soapUI (http://www.soapui.org/). It should be stressed that, until better validation support can be implemented, users should be careful to format the input parameters correctly before sending the request.
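Until schema validation is available in the client, a simple pre-submission check of each lane definition can catch the most common formatting mistakes. The sketch below is only an illustration: the dictionary keys are hypothetical names for the legend text, colour range and scaling method required per lane, and the 0 to 10 colour range is the one stated in the text; it does not replace validation against the XSD.

```python
from typing import Dict, List

# Hypothetical field names for the items each lane must define.
REQUIRED_FIELDS = ("legend", "lower_colour", "upper_colour", "scaling")


def lane_problems(lane: Dict[str, object], max_component: int = 10) -> List[str]:
    """Return a list of human-readable problems found in one lane definition."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in lane]
    for name in ("lower_colour", "upper_colour"):
        colour = lane.get(name)
        if colour is None:
            continue  # already reported as missing
        valid = (isinstance(colour, (tuple, list)) and len(colour) == 3 and
                 all(isinstance(c, int) and 0 <= c <= max_component
                     for c in colour))
        if not valid:
            problems.append(
                f"{name} must be three integers in 0..{max_component}")
    return problems
```

An empty list means no problems were found by these checks.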

Fig. 3 BLASTatlas of pE88, a small plasmid of Clostridium tetani strain E88, GenBank accession number AF528097. The DNA parameters percent AT, GC skew, global direct repeats and global inverted repeats are included in the innermost lanes. BLAST lanes of all complete genome sequences of the Clostridium genomes (see Table 1), including plasmids, are included in the outermost lanes. As examples of custom lanes, the free energy (G, blue, kcal mol^-1) and the probability (P, red) measures of stress-induced DNA duplex destabilization (SIDD) sites are included in the lanes between the DNA properties and the BLAST lanes (ref. 23). SIDD calculations were obtained from the SIDDbase WebService (http://www.cbs.dtu.dk/ws/SIDDbase). The request XML used to construct this plot can be downloaded from the example section of the service homepage, http://www.cbs.dtu.dk/ws/BLASTatlas. As expected, there is full homology of all coding regions between the plasmid and all replicons of C. tetani E88 (black lane just outside the annotations); however, there appears to be limited conservation of these pE88 genes throughout the genomes of other Clostridium strains.



Table 2 A list of all strains and their accession numbers used in this comparison. Each row represents an NCBI Entrez sequencing project. The number of base pairs (Size) and protein coding genes (Proteins) are sums within each project. C. botulinum str. F Langeland is used as the reference of the comparison.

Species | Segments | Size | Proteins
C. acetobutylicum ATCC 824 (ref. 14) | Entrez Project 77: Chromosome: AE001437, Plasmid pSOL1: AE001438 | 4,132,880 | 3,848
C. beijerinckii NCIMB 8052 (unpublished) | Entrez Project 12637: Chromosome: CP000721 | 6,000,632 | 5,020
C. botulinum A str. ATCC 19397 (unpublished) | Entrez Project 19517: Chromosome: CP000726 | 3,863,450 | 3,552
C. botulinum A str. ATCC 3502 (ref. 6) | Entrez Project 193: Chromosome: AM412317, Plasmid pBOT3502: AM412318 | 3,903,260 | 3,671
C. botulinum A str. Hall (unpublished) | Entrez Project 19521: Chromosome: CP000727 | 3,760,560 | 3,407
C. botulinum F str. (unpublished) | Entrez Project 19519: Chromosome: CP000728, Plasmid pCLI: CP000729 | 4,012,918 | 3,659
C. difficile 630 (ref. 15) | Entrez Project 78: Chromosome: AM180355, Plasmid pCD630: AM180356 | 4,298,133 | 3,787
C. kluyveri DSM 555 (unpublished) | Entrez Project 19065: Chromosome: CP000673, Plasmid pCKL555A: CP000674 | 4,023,800 | 3,913
C. novyi NT (ref. 16) | Entrez Project 16820: Chromosome: CP000382 | 2,547,720 | 2,325
C. perfringens ATCC 13124 (ref. 25) | Entrez Project 304: Chromosome: CP000246 | 3,256,683 | 2,876
C. perfringens SM101 (ref. 17) | Entrez Project 12521: Chromosome: CP000312, Plasmid 1: CP000313, Plasmid 2: CP000314, Viral segment phage phiSM101: CP000315 | 2,960,088 | 2,631
C. perfringens str. 13 (ref. 18) | Entrez Project 79: Chromosome: BA000016, Plasmid pCP13: AP003515 | 3,085,740 | 2,723
C. tetani E88 (ref. 19) | Entrez Project 81: Chromosome: AE015927, Plasmid pE88: AF528097 | 2,873,333 | 2,432
C. thermocellum ATCC 27405 (unpublished) | Entrez Project 314: Chromosome: CP000568 | 3,843,301 | 3,191
Clostridium phage (ref. 20) | Phage c-st: AP008983 | 185,683 | 198

Web services workflow

A workflow was written in Perl (v5.8.7), employing SOAP::Lite (v0.69), which reads the FASTA files of the database strains listed in Table 3 and produces a BLASTatlas using the C. botulinum strain F Langeland as reference. The script uses the online web service (see Fig. 4). The BLASTatlas figure produced by this workflow is shown in Fig. 5.
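The submit, poll and fetch pattern behind such a workflow can be sketched independently of any SOAP library. In the example below the three callables stand in for the service operations and must be supplied by the caller; their names, the polling interval and the limit on the number of polls are assumptions for illustration, and only the "FINISHED" status string is taken from the description of the service.

```python
import time
from typing import Any, Callable


def run_job(submit: Callable[[Any], str],
            poll: Callable[[str], str],
            fetch: Callable[[str], Any],
            request: Any,
            interval: float = 10.0,
            max_polls: int = 360,
            sleep: Callable[[float], None] = time.sleep) -> Any:
    """Submit a request, wait until the job is finished, then fetch the result."""
    job_id = submit(request)              # the service returns a job identifier
    for _ in range(max_polls):
        if poll(job_id) == "FINISHED":
            return fetch(job_id)
        sleep(interval)                   # wait before asking again
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing the operations in as functions keeps the loop testable without network access; in a real client they would wrap the corresponding web service calls.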

Fig. 4 Workflow description: a Perl script was written for handling the assembly of the SOAP envelope and contacting various other web services operations. (A) Obtaining the genome sequence: using the getSeq operation of the GenomeAtlas Web Services (v.3.3), the genome sequence of the reference genome is obtained as one continuous string. (B) Obtaining atlas annotations: annotated CDS, rRNA and tRNA features of the GenBank record of the reference genome are obtained using the getFeatures operation; these are the features which will be printed in a separate lane on the atlas. (C) Obtaining ORF annotations of the reference genome: again using the getFeatures operation, all coding sequences and their translations are obtained. (D) Obtaining databases: FASTA files containing proteins and ORFs of the database genomes to be added as lanes are read. The outputs of A–F are assembled into a single SOAP request, including configurations of the atlas. (E) Polling the queue: once the job has been submitted, a 32-character hex string is returned for identifying the job, which can be used by the operation pollQueue to see the status of the job. (F + G) Obtaining the result: once a status "FINISHED" is obtained from pollQueue, the job id can be submitted to fetchResult and the resulting PostScript image is returned.

Results

Fig. 3 represents a BLASTatlas for plasmid pE88 from Clostridium tetani strain E88. The homology of genes in the plasmid to other sequenced genomes is shown in the circles; additional "custom lanes" represent chromosomal regions predicted to open under superhelical stress. The chromosomal locations of the genes encoding colT and tetR are labelled in the figure. Notice that these two proteins contain regions of homology that are found in most of the Clostridium proteomes searched. Since the C. tetani plasmid is included in the genome sequence (black circle in the figure), all the genes are found in this genome (solid black), whereas most of the other Clostridium proteomes contain some weak homology but in general lack most of the plasmid-encoded genes. Thus, this gives a quick overview of gene conservation of a plasmid compared to many sequenced genomes of the same genus.

To demonstrate this for an entire bacterial<br />

genome (which is millions of bp <strong>in</strong><br />

size, compared to a small/B75 000 bp<br />

plasmid, shown <strong>in</strong> Fig. 3), we have used<br />

the genome sequence of C. botul<strong>in</strong>um<br />

stra<strong>in</strong> F Langel<strong>and</strong>, the largest of the<br />

C. botul<strong>in</strong>um genomes, to build a prote<strong>in</strong><br />

BLASTatlas of all publicly available<br />

fully sequenced Clostridia genomes, <strong>in</strong>clud<strong>in</strong>g<br />

all chromosomes, plasmids <strong>and</strong><br />

phages (see Fig. 5). Each lane of the atlas<br />

corresponds to a sequenc<strong>in</strong>g project that<br />

conta<strong>in</strong>s the ma<strong>in</strong> chromosome plus any<br />

This journal is c The Royal Society of Chemistry 2008 Mol. BioSyst., 2008, 4, 363–371 | 367


Fig. 5 BLASTatlas of Clostridium botulinum F strain Langeland. Lanes show genome homology to (starting from the outermost lane): C. acetobutylicum ATCC 824, C. beijerinckii NCIMB 8052, C. botulinum A str. ATCC 19397, C. botulinum A ATCC 3502, C. botulinum A str. Hall, C. difficile 630, C. kluyveri DSM 555, C. novyi NT, C. perfringens ATCC 13124, C. perfringens SM101, C. perfringens str. 13, C. tetani E88, C. thermocellum ATCC 27405 and the Clostridium phage c-st genome. Inside the annotation circle are shown global direct repeats, global inverted repeats, stacking energy and percent AT. Blue and red annotations are coding sequences on the plus and minus strands, whereas green and turquoise are rRNA and tRNA genes, respectively. The two toxin components NTNH and BoNT/A1 that are identified on phage c-st are present in the reference genome at positions 880 kb and 883 kb, respectively (marked ‘cst’). Their presence is visible as a thin blue band on the c-st BLAST lane. The lower part of the figure shows a zoom of the region around 2635 kb, an example of a gene cluster that appears to be conserved throughout the C. botulinum strains and partly within C. difficile 630.<br />

phages or plasmids present in the genome. The proteins encoded by the 185 kb neurotoxin-converting bacteriophage c-st are labelled, as is a region that is enlarged in the second panel of Fig. 5. The accession numbers, total size and total number of genes within each lane are listed in Table 2.<br />

Several items of interest can be seen in Fig. 5. First, the rRNA operons are readily visible near the top of the chromosome map, labelled turquoise; these rRNA operons are more GC rich (hence less red in the innermost lane), have direct and inverted repeats (the next two lanes), and are not shown in the proteome comparison lanes (since these genes do not encode proteins).<br />
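The percent AT lane mentioned above can be illustrated with a small sketch. This is not the code used to draw the atlas: it assumes simple non-overlapping linear windows, and the function name `percent_at`, the window size and the toy sequence are all hypothetical. It only shows why a GC-rich stretch, such as an rRNA operon, would appear as a dip in AT content.

```python
def percent_at(sequence: str, window: int, step: int) -> list[float]:
    """Return the percentage of A+T bases in each window along a DNA sequence.

    Assumption: linear (non-circular) windows; a real atlas may smooth
    or wrap around the origin of a circular chromosome.
    """
    seq = sequence.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        at_count = chunk.count("A") + chunk.count("T")
        values.append(100.0 * at_count / window)
    return values


# Toy sequence (hypothetical): AT-rich flank, GC-rich block, AT-rich flank.
toy = "ATATATATAT" + "GCGCGCGCGC" + "TTAATTAATT"
profile = percent_at(toy, window=10, step=10)
print(profile)
```

In this toy case the middle window has no A or T at all, which is the kind of contrast the innermost lane of Fig. 5 encodes as colour.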

As expected, the circle representing the c-st phage shows few matches to the C. botulinum genome at the protein level. In general, the two other C. botulinum genomes (both in blue) have the highest similarity to the reference C. botulinum genome, which is itself also shown as a circle. The reference lane serves as an internal control: all of the proteins should show a match in this lane, since the reference genome is blasted against itself. Another interesting observation is that the upper left-hand part of the genome seems to have more homology to other Clostridium genomes than the rest of the genome, showing in particular many matches to the C. perfringens genomes (green circles).<br />

Application in metagenomics<br />

The genus Prochlorococcus belongs to the cyanobacteria and is one of the most abundant photosynthetic organisms of the ocean. It plays an important role in the planet’s carbon cycle and has adapted to the light and oxygen conditions present at the various depths (ref. 11). As of the end of January 2008, eleven Prochlorococcus marinus genomes are publicly available, and we have included all proteins encoded by these genomes together with the seven metagenomic read collections from the ALOHA station near Hawaii (ref. 12), as shown in Table 3. P. marinus strain MIT 9303 has the largest genome of all<br />



Table 3 A list of all strains/sample names and their accession numbers used in the metagenomic comparison. The list is sorted by sampling depth<br />

Source | Size in bp (proteins or contigs) | Origin | Accession/sample | Ref. | Depth<br />

P. marinus str. MIT 9515 | 1 704 176 (1906 proteins) | Tropical Pacific | CP000552 | Unpublished | Surface<br />

P. marinus str. MIT 9215 | 1 738 790 (1983 proteins) | Equatorial Pacific | CP000825 | Unpublished | Surface<br />

P. marinus str. MED4 | 1 657 990 (1936 proteins) | Mediterranean Sea | BX548174 | 21 | 4 m<br />

JGI_SMPL_HF10_10-07-02 | 7 482 668 (7842 contigs) | North Pacific Subtropical Gyre | - | 12 | 10 m<br />

P. marinus str. NATL1A | 1 864 731 (2193 proteins) | North Atlantic | CP000553 | Unpublished | 30 m<br />

P. marinus str. NATL2A | 1 842 899 (2163 proteins) | North Atlantic | CP000095 | Unpublished | 30 m<br />

P. marinus str. AS9601 | 1 669 886 (1921 proteins) | Arabian Sea | CP000551 | Unpublished | 50 m<br />

JGI_SMPL_HF70_10-07-02 | 10 828 386 (10 999 contigs) | North Pacific Subtropical Gyre | - | 12 | 70 m<br />

P. marinus str. MIT 9211 | 1 688 963 (1855 proteins) | Equatorial Pacific | CP000878 | 21 | 83 m<br />

P. marinus str. MIT 9301 | 1 641 879 (1907 proteins) | Sargasso Sea | CP000576 | Unpublished | 90 m<br />

P. marinus str. MIT 9303 | 2 682 675 (2997 proteins) | Sargasso Sea | CP000554 | Unpublished | 100 m<br />

P. marinus str. SS120 | 1 751 080 (1882 proteins) | Sargasso Sea | AE017126 | 22 | 120 m<br />

JGI_SMPL_HF130_10-06-02 | 6 091 784 (6812 contigs) | North Pacific Subtropical Gyre | - | 12 | 130 m<br />

P. marinus str. MIT 9312 | 1 709 204 (1962 proteins) | Equatorial Pacific | CP000111 | Unpublished | 135 m<br />

P. marinus str. MIT 9313 | 2 410 873 (2273 proteins) | Gulf Stream | BX548175 | 21 | 135 m<br />

JGI_SMPL_HF200_10-06-02 | 7 829 659 (8286 contigs) | North Pacific Subtropical Gyre | - | 12 | 200 m<br />

JGI_SMPL_HF500_10-06-02 | 8 764 642 (9027 contigs) | North Pacific Subtropical Gyre | - | 12 | 500 m<br />

JGI_SMPL_HF770_12-21-03 | 11 811 597 (11 479 contigs) | North Pacific Subtropical Gyre | - | 12 | 770 m<br />

JGI_SMPL_HF4000_12-21-03 | 11 028 821 (11 229 contigs) | North Pacific Subtropical Gyre | - | 12 | 4000 m<br />

currently available sequences (2.7 Mb) and was therefore used as the reference in this comparison. BLAST hits between the reference and the encoded proteins of all the P. marinus genomes included were generated with the BLASTp algorithm, whereas hits between the reference proteins and the DNA reads of the metagenomic samples were generated using the tBLASTn algorithm. tBLASTn was used to avoid the gene prediction step for the metagenomic samples and to allow a rough estimate of the coding potential of these samples. All lanes are sorted according to the water depth at which the samples were collected (see Fig. 6). The Perl code for constructing this plot using web services is provided on the service homepage.<br />
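The lane set-up described above (BLASTp against annotated proteomes, tBLASTn against unannotated DNA reads, lanes ordered by depth) can be sketched as follows. This is an illustrative Python sketch, not the Perl web-service client referred to in the text; the `Lane` class, the `blast_program` helper and the encoding of "Surface" as 0 m are assumptions made for the example. Names and depths are taken from a few rows of Table 3.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    name: str
    kind: str        # "protein" for annotated genomes, "nucleotide" for raw reads/contigs
    depth_m: float   # sampling depth in metres; "Surface" is encoded as 0 (assumption)


def blast_program(lane: Lane) -> str:
    """Pick the BLAST flavour for a protein query against this lane's database."""
    if lane.kind == "protein":
        return "blastp"    # protein query vs. protein database
    if lane.kind == "nucleotide":
        return "tblastn"   # protein query vs. translated nucleotide database
    raise ValueError(f"unknown database kind: {lane.kind}")


# A few rows from Table 3, deliberately out of order.
lanes = [
    Lane("JGI_SMPL_HF4000_12-21-03", "nucleotide", 4000),
    Lane("P. marinus str. MED4", "protein", 4),
    Lane("JGI_SMPL_HF10_10-07-02", "nucleotide", 10),
    Lane("P. marinus str. MIT 9515", "protein", 0),
    Lane("P. marinus str. SS120", "protein", 120),
]

# Shallowest sample first, matching the outermost-to-innermost order of Fig. 6.
ordered = sorted(lanes, key=lambda lane: lane.depth_m)
plan = [(lane.name, blast_program(lane)) for lane in ordered]
for name, program in plan:
    print(f"{program:8s} {name}")
```

Using tBLASTn for the metagenomic lanes means no gene calls are needed for the reads, at the cost of searching all six translation frames.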

Discussion<br />

The BLASTatlas method can assist biologists in finding regions along the chromosome that are conserved (or not). This information is useful for several<br />

Fig. 6 BLASTatlas showing fully sequenced Prochlorococcus genomes (green) and the seven ALOHA metagenomic samples (blue). Outermost lanes represent samples closer to the ocean surface.<br />



different applications, such as identifying phage insertion sites and loss of important genetic material. The method can even be scaled down to individual nucleotides or amino acid residues. However, it is unable to deal with sequences (or parts thereof) that are not found in the reference genome. A good compromise is often to use the largest chromosome of a species as the reference; in addition, it can be useful to rebuild the maps using different reference genomes. Besides this limitation, the fact that all coordinates are mapped back to the reference causes the coordinates of the database genomes to “get lost”, in that only the best match is displayed, regardless of its chromosomal location in the database genome. Other aspects of genome homology, such as gene synteny, cannot be addressed effectively by this tool. However, it is possible to use an additional circle to plot gene order conservation along the chromosome.<br />
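The point about database coordinates "getting lost" can be made concrete with a minimal sketch. It assumes that each lane keeps only the highest-scoring hit per reference protein; the tuple layout, the function names `best_hits` and `lane_values`, and the example scores are hypothetical and are not taken from the BLASTatlas implementation.

```python
def best_hits(hits):
    """Collapse BLAST hits to the single best hit per reference protein.

    hits: iterable of (ref_protein, db_protein, db_start, bitscore).
    Returns {ref_protein: (db_protein, db_start, bitscore)}.
    """
    best = {}
    for ref, db, db_start, score in hits:
        if ref not in best or score > best[ref][2]:
            best[ref] = (db, db_start, score)
    return best


def lane_values(ref_order, best):
    """Score per reference gene, in reference order; 0.0 where nothing matched."""
    return [best[ref][2] if ref in best else 0.0 for ref in ref_order]


# Hypothetical hits: refA matches two database proteins far apart in the
# database genome; refB has no match; refC has one weak match.
hits = [
    ("refA", "db7", 5_000, 210.0),
    ("refA", "db2", 90_000, 350.0),
    ("refC", "db9", 1_200, 80.0),
]
ref_order = ["refA", "refB", "refC"]

best = best_hits(hits)
lane = lane_values(ref_order, best)
print(lane)
```

The lane retains the scores in reference order only; the `db_start` positions (5 000 versus 90 000 for refA) play no part in what is drawn, which is why synteny cannot be read off the atlas without an additional lane.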

Currently, we see the BLASTatlas as an intermediate stage in the analysis of many genomes of similar species. Soon there will be a need to compare hundreds or thousands of genome sequences, and developing new methods that scale to such numbers is becoming ever more important.<br />

Acknowledgements<br />

The authors would like to thank Hans Henrik Stærfeldt for assistance with server-side programs and Kristoffer Rapacki for assistance with web services data types. The work was supported by a grant from the European Union through the EMBRACE Network of Excellence, contract number LSHG-CT-2004-512092, and a grant from the Danish Center for Scientific Computing (DCSC).<br />

References<br />

1 R. D. Fleischmann, M. D. Adams, O. White, R. A. Clayton, E. F. Kirkness, A. R. Kerlavage, C. J. Bult, J. F. Tomb, B. A. Dougherty, J. M. Merrick, J. McKenney, G. Sutton, W. FitzHugh, C. Fields, J. D. Gocayne, J. Scott, R. Shirley, L. I. Liu, A. Glodek, J. M. Kelley, J. F. Weidman, C. A. Phillips, T. Spriggs, E. Hedblom, M. D. Cotton, T. R. Utterback, M. C. Hanna, D. T. Nguyen, D. M. Saudek, R. C. Brandon, L. D. Fine, J. L. Fritchman, J. L. Fuhrmann, N. S. M. Geoghagen, C. L. Gnehm, L. A. McDonald, K. V. Small, C. M. Fraser, H. O. Smith and J. C. Venter, Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd, Science, 1995, 269(5223), 496–512.<br />

2 L. J. Jensen, M. Skovgaard, T. Sicheritz-Ponten, M. K. Jorgensen, C. Lundegaard, C. C. Pedersen, N. Petersen and D. Ussery, Analysis of two large functionally uncharacterized regions in the Methanopyrus kandleri AV19 genome, BMC Genomics, 2003, 4, 12.<br />

3 L. J. Jensen, M. Skovgaard, T. Sicheritz-Ponten, N. T. Hansen, H. Johansson, M. K. Jørgensen, K. Kiil, P. F. Hallin and D. Ussery, Comparative genomics of four Pseudomonas species, in The Pseudomonads Vol. I. Genomics, Life Style and Molecular Architecture, ed. J. L. Ramos, Kluwer Academic/Plenum Publishers, New York, 2004, ch. 5, pp. 139–164.<br />

4 P. F. Hallin, T. T. Binnewies and D. W. Ussery, Genome update: chromosome atlases, Microbiology (Reading, U. K.), 2004, 150, 3091–3093.<br />

5 T. J. Carver, K. M. Rutherford, M. Berriman, M. A. Rajandream, B. G. Barrell and J. Parkhill, ACT: the Artemis Comparison Tool, Bioinformatics, 2005, 21, 3422–3423.<br />

6 M. Sebaihia, M. W. Peck, N. P. Minton, N. R. Thomson, M. T. Holden, W. J. Mitchell, A. T. Carter, S. D. Bentley, D. R. Mason, L. Crossman, C. J. Paul, A. Ivens, M. H. Wells-Bennik, I. J. Davis, A. M. Cerdeno-Tarraga, C. Churcher, M. A. Quail, T. Chillingworth, T. Feltwell, A. Fraser, I. Goodhead, Z. Hance, K. Jagels, N. Larke, M. Maddison, S. Moule, K. Mungall, H. Norbertczak, E. Rabbinowitsch, M. Sanders, M. Simmonds, B. White, S. Whithead and J. Parkhill, Genome sequence of a proteolytic (Group I) Clostridium botulinum strain Hall A and comparative analysis of the clostridial genomes, Genome Res., 2007, 17, 1082–1092.<br />

7 A. G. Pedersen, L. J. Jensen, S. Brunak, H. H. Staerfeldt and D. W. Ussery, A DNA structural atlas for Escherichia coli, J. Mol. Biol., 2000, 299, 907–930.<br />

8 E. S. Shpigelman, E. N. Trifonov and A. Bolshoy, Curvature: software for the analysis of curved DNA, CABIOS, Comput. Appl. Biosci., 1993, 9, 435–440.<br />

9 M. Skovgaard, L. J. Jensen, C. Friis, H. H. Stærfeldt, P. Worning, S. Brunak and D. Ussery, The Atlas Visualisation of Genome-wide Information, Methods Microbiol., 2002, 33, 49–63.<br />

10 L. J. Jensen, C. Friis and D. W. Ussery, Three Views of Microbial Genomes, Res. Microbiol., 1999, 150, 773–777.<br />

11 M. B. Sullivan, M. L. Coleman, P. Weigele, F. Rohwer and S. W. Chisholm, Three Prochlorococcus cyanophage Genomes: Signature Features and Ecological Interpretations, PLoS Biol., 2005, 3, e144; PMID: 15828858.<br />

12 E. F. DeLong, C. M. Preston, T. Mincer, V. Rich, S. J. Hallam, N.-U. Frigaard, A. Martinez, M. B. Sullivan, R. Edwards, B. R. Brito, S. W. Chisholm and D. M. Karl, Community Genomics Among Stratified Microbial Assemblages in the Ocean’s Interior, Science, 2006, 311(5760), 496–503.<br />

13 D. L. Wheeler, T. Barrett, D. A. Benson, S. H. Bryant, K. Canese, V. Chetvernin, D. M. Church, M. DiCuccio, R. Edgar, S. Federhen, L. Y. Geer, Y. Kapustin, O. Khovayko, D. Landsman, D. J. Lipman, T. L. Madden, D. R. Maglott, J. Ostell, V. Miller, K. D. Pruitt, G. D. Schuler, E. Sequeira, S. T. Sherry, K. Sirotkin, A. Souvorov, G. Starchenko, R. L. Tatusov, T. A. Tatusova, L. Wagner and E. Yaschenko, Database Resources of the National Center for Biotechnology Information, Nucleic Acids Res., 2007, 35, D5–D12.<br />

14 J. Nolling, G. Breton, M. V. Omelchenko, K. S. Makarova, Q. Zeng, R. Gibson, H. M. Lee, J. Dubois, D. Qiu, J. Hitti, Y. I. Wolf, R. L. Tatusov, F. Sabathe, L. Doucette-Stamm, P. Soucaille, M. J. Daly, G. N. Bennett, E. V. Koonin and D. R. Smith, Genome Sequence and Comparative Analysis of the Solvent-producing Bacterium Clostridium acetobutylicum, J. Bacteriol., 2001, 183, 4823–4838.<br />

15 M. Sebaihia, B. W. Wren, P. Mullany, N. F. Fairweather, N. Minton, R. Stabler, N. R. Thomson, A. P. Roberts, A. M. Cerdeno-Tarraga, H. Wang, M. T. Holden, A. Wright, C. Churcher, M. A. Quail, S. Baker, N. Bason, K. Brooks, T. Chillingworth, A. Cronin, P. Davis, L. Dowd, A. Fraser, T. Feltwell, Z. Hance, S. Holroyd, K. Jagels, S. Moule, K. Mungall, C. Price, E. Rabbinowitsch, S. Sharp, M. Simmonds, K. Stevens, L. Unwin, S. Whithead, B. Dupuy, G. Dougan, B. Barrell and J. Parkhill, The Multidrug-resistant Human Pathogen Clostridium difficile has a Highly Mobile, Mosaic Genome, Nat. Genet., 2006, 38, 779–786.<br />

16 C. Bettegowda, X. Huang, J. Lin, I. Cheong, M. Kohli, S. A. Szabo, X. Zhang, L. A. Diaz, Jr, V. E. Velculescu, G. Parmigiani, K. W. Kinzler, B. Vogelstein and S. Zhou, The Genome and Transcriptomes of the Anti-tumor Agent Clostridium novyi-NT, Nat. Biotechnol., 2006, 24, 1573–1580.<br />

17 G. S. Myers, D. A. Rasko, J. K. Cheung, J. Ravel, R. Seshadri, R. T. DeBoy, Q. Ren, J. Varga, M. M. Awad, L. M. Brinkac, S. C. Daugherty, D. H. Haft, R. J. Dodson, R. Madupu, W. C. Nelson, N. J. Rosovitz, S. A. Sullivan, H. Khouri, G. I. Dimitrov, K. L. Watkins, S. Mulligan, J. Benton, D. Radune, D. J. Fisher, H. S. Atkins, T. Hiscox, B. H. Jost, S. J. Billington, J. G. Songer, B. A. McClane, R. W. Titball, J. I. Rood, S. B. Melville and I. T. Paulsen, Skewed Genomic Variability in Strains of the Toxigenic Bacterial Pathogen, Clostridium perfringens, Genome Res., 2006, 16, 1031–1040.<br />

18 T. Shimizu, K. Ohtani, H. Hirakawa, K. Ohshima, A. Yamashita, T. Shiba, N. Ogasawara, M. Hattori, S. Kuhara and H. Hayashi, Complete Genome Sequence of Clostridium perfringens, an Anaerobic Flesh-eater, Proc. Natl. Acad. Sci. U. S. A., 2002, 99, 996–1001.<br />

19 H. Bruggemann, S. Baumer, W. F. Fricke, A. Wiezer, H. Liesegang, I. Decker, C. Herzberg, R. Martinez-Arias, R. Merkl, A. Henne and G. Gottschalk, The Genome Sequence of Clostridium tetani, the Causative Agent of Tetanus Disease, Proc. Natl. Acad. Sci. U. S. A., 2003, 100, 1316–1321.<br />

20 Y. Sakaguchi, T. Hayashi, K. Kurokawa, K. Nakayama, K. Oshima, Y. Fujinaga, M. Ohnishi, E. Ohtsubo, M. Hattori and K. Oguma, The Genome Sequence of Clostridium botulinum Type C Neurotoxin Converting Phage and the Molecular Mechanisms of Unstable Lysogeny, Proc. Natl. Acad. Sci. U. S. A., 2005, 102, 17472–17477.<br />

21 G. Rocap, F. W. Larimer, J. Lamerdin, S. Malfatti, P. Chain, N. A. Ahlgren, A. Arellano, M. Coleman, L. Hauser, W. R. Hess, Z. I. Johnson, M. Land, D. Lindell, A. F. Post, W. Regala, M. Shah, S. L. Shaw, C. Steglich, M. B. Sullivan, C. S. Ting, A. Tolonen, E. A. Webb, E. R. Zinser and S. W. Chisholm, Genome Divergence in Two Prochlorococcus ecotypes Reflects Oceanic Niche Differentiation, Nature, 2003, 424, 1042–1047.<br />

22 A. Dufresne, M. Salanoubat, F. Partensky, F. Artiguenave, I. M. Axmann, V. Barbe, S. Duprat, M. Y. Galperin, E. V. Koonin, F. Le Gall, K. S. Makarova, M. Ostrowski, S. Oztas, C. Robert, I. B. Rogozin, D. J. Scanlan, N. Tandeau de Marsac, J. Weissenbach, P. Wincker, Y. I. Wolf and W. R. Hess, Genome Sequence of the Cyanobacterium Prochlorococcus marinus SS120, a Nearly Minimal Oxyphototrophic Genome, Proc. Natl. Acad. Sci. U. S. A., 2003, 100, 9647–9649.<br />

23 C. J. Benham and C. Bi, The Analysis of Stress-induced Duplex Destabilization in Long Genomic DNA Sequences, J. Comput. Biol., 2004, 11, 519–543.<br />

24 K. Liolios, N. Tavernarakis, P. Hugenholtz and N. C. Kyrpides, The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide, Nucleic Acids Res., 2006, 34, D332–D334.<br />

25 J. I. Rood and S. T. Cole, Molecular genetics and pathogenesis of Clostridium perfringens, Microbiol. Rev., 1991, 55, 621–648.<br />



1<br />

Comparative Genomics<br />

2.7 Paper II: Ten years of bacterial genome sequencing: comparative-genomics-based discoveries


Funct Integr Genomics (2006) 6: 165–185<br />

DOI 10.1007/s10142-006-0027-2<br />

REVIEW<br />

Tim T. Binnewies · Yair Motro · Peter F. Hallin · Ole Lund · David Dunn · Tom La · David J. Hampson · Matthew Bellgard · Trudy M. Wassenaar · David W. Ussery<br />

Ten years of bacterial genome sequencing: comparative-genomics-based discoveries<br />

Received: 20 January 2006 / Revised: 24 February 2006 / Accepted: 7 March 2006 / Published online: 12 May 2006<br />

© Springer-Verlag 2006<br />

T. T. Binnewies · P. F. Hallin · O. Lund · D. W. Ussery (*)<br />

Center for Biological Sequence Analysis, Technical University of Denmark, 2800 Lyngby, Denmark<br />

e-mail: dave@cbs.dtu.dk<br />

Y. Motro · D. Dunn · M. Bellgard<br />

Center for Bioinformatics and Biological Computing, Murdoch University, Murdoch, Western Australia 6150, Australia<br />

T. La · D. J. Hampson<br />

School of Veterinary and Biomedical Sciences, Murdoch University, Murdoch, Western Australia 6150, Australia<br />

T. M. Wassenaar<br />

Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany<br />

Abstract It has been more than 10 years since the first bacterial genome sequence was published. Hundreds of bacterial genome sequences are now available for comparative genomics, and searching a given protein against more than a thousand genomes will soon be possible. This review addresses a relatively straightforward question: “What have we learned from this vast amount of new genomic data?” Perhaps one of the most important lessons has been that genetic diversity, at the level of large-scale variation amongst even genomes of the same species, is far greater than was thought. The classical textbook view of evolution relying on the relatively slow accumulation of mutational events at the level of individual bases scattered throughout the genome has changed. One of the most obvious conclusions from examining the sequences from several hundred bacterial genomes is the enormous amount of diversity, even in different genomes from the same bacterial species. This diversity is generated by a variety of mechanisms, including mobile genetic elements and bacteriophages. An examination of the 20 Escherichia coli genomes sequenced so far dramatically illustrates this, with the genome size ranging from 4.6 to 5.5 Mbp; much of the variation appears to be of phage origin. This review also addresses mobile genetic elements, including pathogenicity islands and the structure of transposable elements. There are at least 20 different methods available to compare bacterial genomes. Metagenomics offers the chance to study genomic sequences found in ecosystems, including genomes of species that are difficult to culture. It has become clear that a genome sequence represents more than just a collection of gene sequences for an organism and that information concerning the environment and growth conditions for the organism is important for interpretation of the genomic data. The newly proposed Minimal Information about a Genome Sequence standard has been developed to obtain this information.<br />

Keywords Bacterial genomics · Comparative genomics · Bioinformatics · Genomic diversity · Molecular evolution<br />

Introduction<br />

The year 1995 marked the publication of two human pathogenic bacterial genome sequences: Haemophilus influenzae (Fleischmann et al. 1995, US patent number 6,528,289) and Mycoplasma genitalium (Fraser et al. 1995, US patent number 6,537,773). Since then, more than 300 bacterial genomes have been fully sequenced and become publicly available, including the sequence of a virulent form of H. influenzae (Harrison et al. 2005); the original H. influenzae strain sequenced in 1995 was from an isolate that does not cause disease. Although the majority of these several hundred genomes are from pathogenic organisms, some environmental bacterial genome sequences have also become available. This review article provides a brief overview of sequenced bacterial genomes, their genomic diversity and some of the insights gained from analysis of this vast amount of data.<br />

Bacteria are microscopic unicellular prokaryotes that inhabit a wide variety of environmental niches, broadly distributed in three ecosystems: the soil, marine environments and other living organisms. Although there are literally millions of bacterial species, only a small proportion of these can be grown in the laboratory (Handelsman 2004). Bacteria (and Archaea) can be found almost anywhere in the environment: in the air, even in the International Space Station (Novikova et al. 2006), in thermal ducts found at great depths in the oceans (Alain et al. 2002; Vezzi et al. 2005), in the intestinal tracts of animals (Yan and Polk 2004; Backhed et al. 2005) and in soil and rocks, even thousands of meters deep (Torsvik et al. 1990). Bacteria also live within unicellular eukaryotes, algae, plants or animals. This diversity is reflected in their physiology, morphology, metabolism and ecosystems. For example, from a physiological perspective, most intestinal bacteria such as Escherichia coli are motile by means of flagella, to overcome the peristalsis of the gut, whilst the soil bacterium Clostridium perfringens does not possess such motility machinery (Shimizu et al. 2002). From a metabolic perspective, the versatile Burkholderia cepacia (formerly Pseudomonas cepacia) can utilise approximately 100 different organic compounds as a sole energy source (Goldmann and Klinger 1986), whereas the strictly intracellular Mycobacterium tuberculosis depends on only a few carbon sources produced by its involuntary host. From an inter-bacterial interaction perspective, bacteria sometimes cooperate. For example, Enterobacter cloacae and Pseudomonas mendocina interact positively to stimulate plant growth (Duponnois et al. 1999). On the other hand, there are also bacteria that not only “do not cooperate” but exhibit predatory behavior, such as Bdellovibrio bacteriovorus (Rendulic et al. 2004). As for bacteria–host interactions, both pathogenic and non-pathogenic strains can exist for a given bacterial species (Dobrindt and Hacker 2001; Penyalver and Lopez 1999), while other species may be exclusively parasitic (Goebel and Gross 2001), truly symbiotic (Gil et al. 2004) or commensal (Yan and Polk 2004) for their host. It is interesting to note that this diversity is somehow captured in the relatively small bacterial genomes.<br />

The first complete viral genome (φX174) was published in 1977 (Sanger et al. 1977). To put this into perspective, sequencing the 4.6-Mbp E. coli K-12 genome at that time (about a thousand base pairs (bp) could be sequenced per year in 1977) would have taken more than a thousand years, and sequencing the human genome would have taken more than a million years. The automation of sequencing methods, the invention of the polymerase chain reaction (PCR) (Mullis et al. 1986) and the shotgun cloning procedure reduced costs and time, and provided the capability for large-scale sequencing. Together, these developments led to the sequencing of the first complete bacterial genome (Fleischmann et al. 1995) almost 20 years after the sequencing of φX174. The choice of the first bacterium to be completely sequenced (H. influenzae Rd KW20) was based on the following reasons: (1) the genome size (1.8 Mbp) was thought to be ‘typical’ among bacteria, (2) the G + C base composition (38%) was close to that of the human genome and (3) the bacterium had important human health implications. In the absence of procedures to produce a genetic map for the species, genome sequencing proved to be a powerful alternative for genetic characterisation. This landmark work initiated the influx of genome sequence data which is now updated frequently and is publicly available. As of November 2005, there are more than 300 fully sequenced, publicly available bacterial genomes. Figure 1 shows this increase of sequence data over the past decade. 1
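The sequencing-time estimate given earlier in this paragraph can be reproduced with a few lines of arithmetic. The sketch below takes the 1977 throughput of about 1,000 bp per year and the 4.6-Mbp E. coli K-12 genome size from the text; the human genome size of roughly 3 × 10^9 bp is an assumed round figure that the text does not state, and the function name is purely illustrative.

```python
# Back-of-the-envelope sequencing times at the 1977 rate quoted in the text.
BP_PER_YEAR_1977 = 1_000          # "about a thousand base pairs ... per year"
ECOLI_K12_BP = 4_600_000          # 4.6 Mbp, from the text
HUMAN_GENOME_BP = 3_000_000_000   # assumed round figure, not given in the text


def years_to_sequence(genome_bp: int, bp_per_year: int = BP_PER_YEAR_1977) -> float:
    """Return the number of years needed at a constant throughput."""
    return genome_bp / bp_per_year


ecoli_years = years_to_sequence(ECOLI_K12_BP)
human_years = years_to_sequence(HUMAN_GENOME_BP)
print(f"E. coli K-12: about {ecoli_years:,.0f} years")
print(f"Human genome: about {human_years:,.0f} years")
```

Under these assumptions both results are consistent with the "more than a thousand" and "more than a million" years quoted in the text.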

The total number of completed bacterial genome sequences has more than doubled over the past 2 years and, at the time of writing, there are 855 publicly listed bacterial and archaeal genome projects that are in various stages of progress. 2 In addition to new species, multiple strains of the same bacterial species are being sequenced. The amount of genomic data currently available has provided significant advances in our understanding of a number of important themes, including bacterial diversity, population characteristics, operon structure, mobile genetic elements (MGE) and horizontal gene transfer (HGT). It has also provided a number of challenges in understanding the ecology of, as yet, undiscovered bacterial worlds. The availability of whole genome sequences for pathogenic and commensal bacterial species has allowed a more detailed analysis of the complex interactions that occur with their plant or animal hosts. Figure 2a is a phylogenetic tree of 300 sequenced bacterial genomes (available at the time of writing). Many of these genomes are from pathogenic bacteria living in complex ecosystems, such as the spirochaete Brachyspira pilosicoli labelled in red in the phylogenetic tree shown in Fig. 2b. This bacterium attaches to enterocytes to form a “false brush border” in the colon.

Most genome sequencing projects are currently carried out using automated applications of the sequencing technique developed by Sanger et al. (1973), but newly developed methodologies may enable even more rapid sequencing in the future. Two papers have been published describing two different methods for high-throughput sequencing of bacterial genomes (Pennisi 2005). One method is essentially a “do-it-yourself kit”, which uses a laser confocal microscope and other “off-the-shelf” components to build a sequencing machine capable of sequencing an E. coli genome in less than a day (Shendure et al. 2005). The second method is a commercial machine, based on pyrosequencing methodologies to generate many short pieces of DNA; this method was used to sequence a bacterial genome within a few hours (Margulies et al. 2005). Although there are still some technical problems with both of these methods, it is clear that, in the near future, it will be possible to sequence a bacterial genome quickly and at considerably lower cost.

1 Completed genome statistics obtained from the CBS atlas web pages http://www.cbs.dtu.dk/services/GenomeAtlas

2 http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj


Fig. 1 Cumulative number of complete published sequenced bacterial genomes (bars) and total number of base pairs (line) over the past decade (1995–2005)

Genomic information

DNA codes for more than just proteins

The quality of annotation of bacterial genomes varies. A survey based on three different methods to predict the expected number of genes in a genome found that, for most bacterial genomes, around 20% of the annotated genes might not be “real” (Skovgaard et al. 2001). Conversely, proteomics experiments have detected some “real” genes which were not originally predicted, highlighting the dynamic nature of annotation and showing that genes are also missed (Jaffe et al. 2004). Over-annotation of bacterial genomes is a problem but, unfortunately, it cannot be easily avoided. On the one hand, no one wants to miss a gene and, on the other hand, small genes can be quite difficult to predict, as a short open reading frame could easily occur by statistical chance (Skovgaard et al. 2001).
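To illustrate why short open reading frames arise by chance, the sketch below computes the probability that a run of random codons contains no stop codon. It assumes, purely for illustration, that all 64 codons are equally likely (real genomes have biased base composition, so the numbers are only indicative). The function name is hypothetical, and this is not the method used by Skovgaard et al.

```python
# Probability that n consecutive random codons contain no stop codon,
# under the simplifying assumption that all 64 codons are equally likely.
STOP_CODONS = 3      # TAA, TAG and TGA in the standard genetic code
TOTAL_CODONS = 64


def prob_no_stop(n_codons: int) -> float:
    """Chance that a random stretch of n_codons is free of stop codons."""
    p_sense = (TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS
    return p_sense ** n_codons


for n in (30, 100, 300):
    print(f"{n:>3} codons: P(no stop) = {prob_no_stop(n):.2e}")
```

Under this assumption a stop-free stretch of 30 codons is quite common, whereas one of 300 codons is very unlikely, which is why the shortest predicted genes are the hardest to trust.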

There are currently several automated annotation systems; the BaSys system (Van Domselaar et al. 2005), for example, provides a comprehensive annotation of a DNA sequence file. To conduct comparative genomics with several hundred genomes, quality databases are essential, and the “GenomeAtlas” database, which was originally developed to store DNA structural information about the various sequenced genomes, is one example (Hallin and Ussery 2004). Approximately a hundred different features for each genome (such as percent AT, coding skew bias, length of genome and number of genes) are currently made available through http://www.cbs.dtu.dk/services/GenomeAtlas/.
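Two of the simpler per-genome features mentioned above, genome length and percent AT, can be computed directly from a sequence string. The following is a minimal sketch only; the function name and the toy sequence are hypothetical, and it does not reproduce how the GenomeAtlas database derives its values.

```python
def percent_at(sequence: str) -> float:
    """Percentage of A and T among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * at / (at + gc)


toy_sequence = "ATGAAATTTGGGCCCTAA"  # made-up example, not a real genome
print(len(toy_sequence), "bp,", round(percent_at(toy_sequence), 1), "% AT")
```

Ambiguous bases such as N are ignored in the denominator here; that is a design choice of this sketch, and other conventions are possible.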

Duplication of essentials

One of the features of genomic sequences that can be easily recognised is the presence of repeat sequences. The most obvious and extensive repeats present in many bacterial genomes are the operons encoding the ribosomal RNA genes. These rRNA operons typically encode 16S and 23S rRNA separated by a short spacer, often followed by the 5S rRNA gene. All sequenced bacterial genomes possess at least one rRNA operon, and many (215 of 300) have two or more copies; the number of operons tends to correlate with bacterial division time. Thus, species that divide quickly (such as Bacillus cereus) have more copies of rRNA genes, so as to enable rapid production of ribosomes. In addition, species containing multiple rRNA operons appear to be more adaptable to changing environmental conditions (Acinas et al. 2004). The rRNA genes are a valuable tool for the estimation of taxonomic relationships (see Fig. 2a). These genes evolve slowly, presumably because they play an essential role as the backbone of ribosomes while interacting with multiple proteins. Any changes in the shape (sequence) of rRNA would most likely be fatal.

Multiple copies of tRNA genes can also be found in some genomes, again tending to correlate with division time. However, for tRNAs, the copy number is also dictated by the frequency with which particular codons are used (or vice versa, as cause and effect cannot be distinguished here). This enables a less obvious level of regulating gene activity: a gene using many codons for which only one tRNA gene is available will probably be translated slowly, with translation becoming the rate-limiting step, whereas abundant proteins are more likely to use tRNAs for which multiple gene copies are available. This is the basis for the codon adaptation index, which is a measure of the adaptation of a gene’s codon usage towards the optimal tRNA pool (Sharp and Li 1987).
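As a rough illustration of the codon adaptation index (CAI), the sketch below follows the usual formulation attributed to Sharp and Li (1987): each codon receives a relative adaptiveness w, its frequency in a reference set of highly expressed genes divided by that of the most used synonymous codon, and the CAI of a gene is the geometric mean of w over its codons. The reference counts, the two-amino-acid codon table and the function names are all hypothetical; a real calculation would use a full codon table and genome-specific reference counts, and would need to handle codons with zero counts.

```python
import math
from collections import Counter

# Hypothetical codon counts from a reference set of highly expressed genes,
# restricted to two amino acids to keep the example small.
SYNONYMOUS = {
    "Lys": ["AAA", "AAG"],
    "Leu": ["CTG", "CTA", "CTT", "CTC", "TTA", "TTG"],
}
REFERENCE_COUNTS = Counter({
    "AAA": 80, "AAG": 20,
    "CTG": 100, "CTA": 5, "CTT": 10, "CTC": 10, "TTA": 10, "TTG": 15,
})


def relative_adaptiveness(counts, synonymous):
    """w = count of a codon / count of the most used synonymous codon."""
    weights = {}
    for codons in synonymous.values():
        top = max(counts[c] for c in codons)
        for c in codons:
            weights[c] = counts[c] / top
    return weights


def cai(gene: str, weights) -> float:
    """Geometric mean of w over the codons of `gene` that have a weight."""
    usable = len(gene) - len(gene) % 3          # ignore a trailing partial codon
    codons = [gene[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in gene")
    return math.exp(sum(logs) / len(logs))


weights = relative_adaptiveness(REFERENCE_COUNTS, SYNONYMOUS)
print(cai("AAACTG", weights))   # uses only the most frequent codons
print(cai("AAGCTA", weights))   # uses rarer synonymous codons
```

A gene built only from the preferred codons scores 1, and the score falls as rarer synonymous codons are used.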

There are of course other duplications in bacterial genomes, some of which might appear at first glance to be less essential. For example, the ‘REP’ repetitive sequences frequently found in Enterobacteriaceae can be used as unique identifiers of bacterial genomes (Tobes and Ramos 2005). It has been speculated that these repeats are meaningless, resulting from errors in replication, or that they may be a part of mobile elements that are able to translocate and duplicate themselves. These could alternatively be non-functional ‘molecular fossils’ of previous insertion events. Finally, it could well be that these repeats serve some as yet undiscovered useful purpose. It is possible, for example, that repetitive sequences and insertion sequence elements (ISs) contribute to genome plasticity through structural changes based on homologous recombination (Kennedy et al. 2001; Fraser-Liggett 2005).

Fig. 2 a Phylogenetic tree of 287 sequenced bacterial genomes, based on alignments of the 16S rRNA gene sequence. The phyla are colour-coded; a more detailed view, with names of all the organisms, can be found in the supplemental information: http://www.cbs.dtu.dk/services/GenomeAtlas/suppl/FIG10yr/. b Photomicrograph showing a dense fringe of anaerobic spirochaetes (B. pilosicoli) attached by one cell end to the luminal surface of human colonic enterocytes, forming a “false brush border”. Besides that of humans, B. pilosicoli colonises the large intestine of a variety of mammals and birds, causing diarrhoea and reduced growth rates. Genomic sequence from B. pilosicoli is being analysed to assist in understanding the genetic basis of this dense colonisation, including patterns of gene expression underlying the complex interactions that occur between individual bacterial cells and the colonised enterocytes. The photograph is courtesy of Dr. W. Bastiaan DeBoer, University of Western Australia, Perth, Western Australia

A brief history of bacterial operons

Much of the early classical work in microbiology was done with E. coli, as this bacterium is relatively easy to culture in the laboratory. As more and more genetic information was gathered, it came to be considered a ‘typical’ bacterium, although E. coli is no more typical of bacteria than a rabbit is of all eukaryotic organisms. More than 40 years ago, a model was proposed for gene regulation of the catabolism of lactose in E. coli (Jacob et al. 1960; Jacob and Monod 1961). The model described an operon as a cluster of genes with related functions (encoding, in this case, enzymes required for lactose degradation). This operon structure neatly allows regulation of gene expression by the concentration of lactose (Lewis et al. 1996; Reznikoff 1992). With the continuous expression of one small protein (a repressor), wasteful expression of several other catabolic enzymes in the absence of lactose is prevented.

Since the discovery of the lac operon, many more catabolic operons have been discovered, with positive and negative feedback strategies, and these illustrate the biological need to use resources as efficiently as possible. Many, if not all, bacterial genomes indeed display clusters of genes involved in a single process (be it jointly transcribed and regulated, as in classical operons, or with separate promoters and regulators), but the degree of operon gene organisation and gene clustering differs between species. In some bacteria, such as Helicobacter pylori, operons are relatively unconserved, and genes involved in one cellular process can be dispersed throughout the genome (Tomb et al. 1997; Alm and Trust 1999), although more recent work suggests that there are perhaps more operons in H. pylori than previously thought (Price et al. 2005). There are currently many resources for prediction of operons (Rogozin et al. 2004; Rosenfeld et al. 2004; Alm et al. 2005; Janga et al. 2005; Nishi et al. 2005; Price et al. 2005; Vallenet et al. 2006), including several databases, such as the Operon Database (Okuda et al. 2006), RegulonDB (Salgado et al. 2006a,b) and GeneChords (Zheng et al. 2005).

How did the first operon evolve? Historically, three models have been proposed for the origins of gene clusters. The first model, which dates back to 1945, proposed the clustering of genes to be the direct result of gene duplication and evolution (Horowitz 1945, 1965). Gene duplication can occur during replication and, as a duplicated gene has more freedom to mutate, this is believed to be a classical mechanism for novel enzymes to evolve (Lazcano et al. 1995). However, although all genes within an operon may be involved in a single metabolic process, their function and structure can vary considerably, and a phylogenetic relationship between them is not always likely.

The second model proposed for the evolution of operons is that coregulation of genes under a common promoter could provide a selective advantage (Jacob et al. 1960). However, we now know that coregulation of genes that are not physically linked is in fact possible. Furthermore, this model does not really provide a gradual step-by-step mechanism for the evolution of operons.

The third model for the evolution of an operon is that pre-existing genes moved together because of the selective advantage of having genes involved in the same biochemical pathways or processes physically close to each other. This hypothesis allows for structurally distinct genes to be part of one operon. The model requires both variation and frequent recombination and has been proposed as an explanation of the clustering of genes in bacteriophage genomes (Stahl and Murray 1966; Juhala et al. 2000).

In addition to these three views, there are other alternatives. Gene clustering may be of selective advantage in the case of horizontal gene transfer (see section below) and, based on this idea, a fourth mechanism, the ‘selfish operon’ model, was proposed (Lawrence and Roth 1996). This view has recently been called into question, based on the physical clustering of essential genes in the E. coli K-12 genome (Pal and Hurst 2004). Two other alternatives for operon evolution deal with chromatin structure and the physical location of genes in bacterial chromosomes, where transcription and translation are coupled (Pal and Hurst 2004). It is quite possible that there is in fact no one “correct” mechanism, and that different mechanisms are involved at the same time. For example, the selective advantage of gene clustering during horizontal gene transfer is exemplified by the clustering of multiple antibiotic resistance genes on mobile genetic elements (Carattoli 2001). In the era of antibiotic use, such genes are under strong selective pressure and are frequently passed on between bacteria by means of mobile elements. Whether these have directly contributed to the spread of catabolic and other operons between bacterial species is currently not known.

What separates genes in a genome?

In comparison to genes, the non-coding part of genomes receives far less attention. Some genomes are more densely packed than others. The average coding density is about 90%, ranging from 95% for Pelagibacter ubique (Giovannoni et al. 2005) to 51% for Sodalis glossinidius (Toh et al. 2006). Bacterial genes are not spliced as they are in eukaryotes; that is, introns are absent from nearly all bacterial genes. The sequences separating genes (intergenic regions) can be thought of as spacers where information on regulation of transcription can be stored, although sometimes these intergenic regions can also be more than regulatory and spacer domains.
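Coding density as used above is the fraction of the genome covered by genes. A minimal sketch of how it might be computed from gene coordinates follows; the coordinates are invented, the function name is hypothetical, and the sketch assumes a linear coordinate system in which no gene spans the origin of a circular chromosome.

```python
def coding_density(genes, genome_length):
    """Fraction of the genome covered by at least one gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping genes are merged so that no base is counted twice.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(genes):
        if current_end is None or start > current_end + 1:
            # Gap before this gene: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent gene: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length


toy_genes = [(1, 300), (250, 600), (701, 900)]   # hypothetical coordinates
print(f"{coding_density(toy_genes, 1000):.0%}")
```

Merging overlapping genes before summing matters in compact bacterial genomes, where neighbouring genes often overlap by a few bases.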

Intergenic regions in the E. coli K-12 chromosome have been suggested to contain the sequences for several hundred small RNA genes which are transcribed but do not code for proteins (Chen et al. 2002). Many of these small RNAs act as regulators (Gottesman 2005).

In general, the intergenic regions of bacterial genomes are more AT-rich, will melt more readily, are more curved and are more rigid than the chromosomal average (Pedersen et al. 2000; Hallin and Ussery 2004). This is true for nearly all of the several hundred bacterial genomes sequenced, regardless of AT content. These characteristics make sense in terms of the mechanical properties needed for initiating transcription.

Generation of genomic diversity in bacteria

Genomic diversity is far greater than expected

The view in many textbooks of biological diversity and evolution often envisions clonal bacteria which slowly evolve through the gradual accumulation of single-nucleotide changes. There might occasionally be a rare event where a new gene is duplicated but, in general, it has been commonly thought that if one were to sequence two different strains of a common bacterium like E. coli, the sequences would, for the most part, be similar and the two strains would share most (perhaps 90% or more) of their genes. At the time of writing, there are 20 different E. coli genomes which have been either completely sequenced or at least sequenced with an expected coverage of greater than 99% of the genome. Table 1 lists these genomes, and one of the surprising observations is the diversity just in size of the main chromosome, ranging from 5.5 to 4.6 Mbp; that is, close to a million base pairs present in some E. coli strains are missing in others. Furthermore, if one were to pick any one of these 20 strains, there would be more than a hundred genes which are unique to that strain and are not found in the other 19 E. coli genomes. Studies have indicated that much of this diversity comes from bacteriophages (Ohnishi et al. 2001).

Table 1 Current E. coli genomes sequenced or in progress

Escherichia coli strain | Length (bp) | Number of genes | Number of tRNAs | Number of rRNAs | Number of contigs | Accession number
O157_EDL93 | 5,528,445 | 5,349 | 100 | 7 | 1 | AE005174
E22 | 5,516,16 | 4,788 | NA | NA | 109 | AAJV00000000
O157_RIMD0509952 | 5,498,450 | 5,361 | 103 | 7 | 1 | BA000007
E110019 | 5,384,084 | 4,746 | NA | NA | 119 | AAJW00000000
B171 | 5,299,753 | 4,467 | NA | NA | 159 | AAJX00000000
53638 | 5,289,471 | 4,783 | NA | NA | 119 | AAKB00000000
042 | 5,241,977 | 4,899 | 93 | 7 | 2 | Sanger Institute (unpublished)
CFT073 | 5,231,428 | 5,379 | 89 | 7 | 1 | AE014075
H10407 | ~5,208,000 | ~5,000 | NA | NA | 225 | Sanger Institute (unpublished)
F11 | 5,206,906 | 4,467 | NA | NA | 88 | AAJU00000000
B7A | 5,202,558 | 4,637 | NA | NA | 198 | AAJT00000000
NMEC RS218 | 5,089,235 | ~4,900 | NA | NA | 1 | Uni. Wisc. (unpublished)
E2348 | 5,072,200 | 4,594 | 71 | 7 | 4 | Sanger Institute (unpublished)
E24377A | 4,980,187 | 4,254 | 97 | 6 | 1 | AAJZ00000000
UPEC 536 | ~4,900,000 | ~4,800 | NA | NA | 1 | Uni. Würzburg (unpublished)
101NA1 | 4,880,380 | 4,238 | NA | NA | 70 | AAMK00000000
HS | 4,643,538 | 3,689 | 89 | 6 | 1 | AAJY00000000
K-12_W3110 | 4,641,433 | 4,390 | 88 | 7 | 1 | AP009048
K-12_MG1655 | 4,639,675 | 4,254 | 88 | 7 | 1 | U00096
B03 | 4,629,810 | 4,387 | 86 | 6 | 1 | CNRS France (unpublished)

NA: currently not annotated
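The spread in chromosome size quoted above can be checked directly against the lengths listed in Table 1. The short sketch below uses only the entries whose length is given exactly in the table (the approximate "~" values and the E22 entry, whose length appears incomplete in this copy, are left out); the variable names are illustrative.

```python
# Chromosome lengths (bp) copied from Table 1, exact entries only.
lengths_bp = {
    "O157_EDL93": 5_528_445, "O157_RIMD0509952": 5_498_450,
    "E110019": 5_384_084, "B171": 5_299_753, "53638": 5_289_471,
    "042": 5_241_977, "CFT073": 5_231_428, "F11": 5_206_906,
    "B7A": 5_202_558, "NMEC RS218": 5_089_235, "E2348": 5_072_200,
    "E24377A": 4_980_187, "101NA1": 4_880_380, "HS": 4_643_538,
    "K-12_W3110": 4_641_433, "K-12_MG1655": 4_639_675, "B03": 4_629_810,
}

largest = max(lengths_bp, key=lengths_bp.get)
smallest = min(lengths_bp, key=lengths_bp.get)
spread_bp = lengths_bp[largest] - lengths_bp[smallest]

print(f"largest:  {largest} ({lengths_bp[largest]:,} bp)")
print(f"smallest: {smallest} ({lengths_bp[smallest]:,} bp)")
print(f"spread:   {spread_bp:,} bp")
```

The difference between the largest and smallest listed chromosomes is just under 0.9 Mbp, in line with the "close to a million base pairs" stated in the text.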

Gene order conservation

When comparing bacterial genomes, two features are frequently analysed: gene presence and gene order. The presence or absence of genes is particularly interesting when two closely related species or strains that have different phenotypes, such as a pathogenic and a commensal strain of the same species, are compared (Hayashi et al. 2001). As for the actual process leading to the difference, the direction of the insertion/deletion event is not always clear, so the event is generally described with the neutral term indel (INsertion/DELetion).
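A presence/absence comparison of the kind described above reduces to set operations once the genes of each strain have been grouped into families. The sketch below uses invented family identifiers for a hypothetical pathogenic and commensal strain; how genes are assigned to families (for example by sequence similarity) is outside its scope.

```python
# Hypothetical gene-family identifiers for two strains of one species.
pathogen = {"famA", "famB", "famC", "toxin1", "adhesin2"}
commensal = {"famA", "famB", "famC", "famD"}

shared = pathogen & commensal            # families present in both strains
only_pathogen = pathogen - commensal     # candidate indels, direction unknown
only_commensal = commensal - pathogen

# Share of each strain's families that is also found in the other strain.
shared_fraction_pathogen = len(shared) / len(pathogen)
shared_fraction_commensal = len(shared) / len(commensal)

print("shared:", sorted(shared))
print("only in pathogen:", sorted(only_pathogen))
print("only in commensal:", sorted(only_commensal))
print(f"shared fraction: {shared_fraction_pathogen:.0%} / {shared_fraction_commensal:.0%}")
```

Note that the comparison only shows that a family is present in one strain and absent from the other; it does not say whether the difference arose by insertion or by deletion, which is why the neutral term indel is used.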

Table 2 Types of mobile genetic elements found <strong>in</strong> bacterial genomes<br />

There are various models of how the gene order with<strong>in</strong><br />

operons may have changed throughout evolution. It may be<br />

that the gene order <strong>in</strong> ancient ancestral operons has been<br />

ma<strong>in</strong>ta<strong>in</strong>ed, such that all (or many) of the operons <strong>in</strong><br />

studied genomes would be expected to have a similar gene<br />

structure. However, this view has been contradicted by data<br />

from whole genome studies. Exam<strong>in</strong><strong>in</strong>g the stability of<br />

operon structures over evolutionary distance shows that the<br />

majority of the gene orders with<strong>in</strong> operons could be<br />

shuffled frequently dur<strong>in</strong>g evolution, with the ribosomal<br />

prote<strong>in</strong> operons as an exception (Itoh et al. 1999). Such<br />

observations support the alternative possibility that operons<br />

are multiple evolutionary <strong>in</strong>ventions. A more recent<br />

study has exam<strong>in</strong>ed the evolution of the histid<strong>in</strong>e operon <strong>in</strong><br />

Proteobacteria <strong>and</strong> found evidence for <strong>in</strong>deed a gradual<br />

merg<strong>in</strong>g of genes with similar function <strong>in</strong>to operons, at<br />

least <strong>in</strong> this case (Fani et al. 2005).<br />

Comparisons of gene order can also be <strong>in</strong>formative of<br />

chromosomal translocations <strong>and</strong> <strong>in</strong>versions, which frequently<br />

happen <strong>in</strong> bacterial genomes (Kuwahara et al.<br />

2004). Such events are mostly neutral <strong>in</strong> terms of<br />

evolution, as they do not change the total genetic content<br />

of the cell, but translocations <strong>and</strong> <strong>in</strong>versions frequently<br />

co<strong>in</strong>cide with <strong>in</strong>sertions or deletions. Any of these<br />

processes can result from <strong>in</strong>accurate excision of mobile<br />

genetic elements <strong>and</strong>, as such elements are frequently<br />

MGE Description References<br />

Plasmids Circular, self-replicat<strong>in</strong>g DNA molecules that exist <strong>in</strong> cells as extra-chromosomal<br />

replicons. Some plasmids can <strong>in</strong>sert <strong>in</strong>to the chromosome.<br />

(Dobr<strong>in</strong>dt et al. 2004)<br />

Transposons DNA molecules that frequently change their chromosomal localisation, either<br />

with<strong>in</strong> or between replicons. They usually code for a transposase <strong>and</strong> some other<br />

genes (such as antibiotic resistance genes), <strong>and</strong> are flanked by <strong>in</strong>verted repeat<br />

DNA sequences.<br />

(Dobr<strong>in</strong>dt et al. 2004)<br />

Conjugative Transposons that also carry genes related to plasmid-encoded conjugation, thus, (Dobr<strong>in</strong>dt et al. 2004)<br />

transposons provid<strong>in</strong>g the ability to transfer between cells via conjugation<br />

Bacteriophages | Prokaryote-infecting viruses, which can modify the host genome by coding new functions or by modifying existing functions. They are also capable of inserting into the genome (prophages). These are also agents of HGT. | Dobrindt et al. 2004<br />

Integrons | Genetic elements composed of a gene encoding an integrase (int gene; excises and integrates the gene cassettes from and into the integron), gene cassettes (become part of the integron upon integration; consist of a promoterless gene and a recombination site termed attC) and an integration site for the gene cassettes (attI gene) | Fluit and Schmitz 2004; Holmes et al. 2003; Peters et al. 2001<br />

Insertion sequence elements | Small, genetically compact DNA sequences, generally less than 2.5 kbp in length, encoding functions involved in their translocation, and transpose both within and between genomes. IS elements are a subset of a general group of elements named transposable elements. These transposable elements are defined as elements of DNA segments that carry the genes required for this process (and, in some cases, other genes), and consequently move about chromosomes and, more generally, genomes. | Mahillon et al. 1999; Ou et al. 2006<br />

Genomic islands | Large chromosomal regions that contain a cluster of functionally related genes, an operon or a number of operons, flanked by direct repeat sequences, and located near an integrase or transposase gene and a tRNA gene. | Dobrindt et al. 2004<br />


<strong>in</strong>volved <strong>in</strong> generat<strong>in</strong>g diversity <strong>in</strong> bacteria, they deserve to<br />

be treated <strong>in</strong> a separate section.<br />

Mobile genetic elements<br />

MGEs are genomic elements that are capable of translocat<strong>in</strong>g<br />

themselves with<strong>in</strong> or between genomes. When mov<strong>in</strong>g<br />

to a new genome, they may confer a new characteristic on<br />

the recipient. Their size ranges from hundreds of base pairs<br />

to more than 100 kbp. Plasmids, transposons, conjugative<br />

transposons, bacteriophages, <strong>in</strong>tegrons, <strong>in</strong>sertion sequence<br />

elements <strong>and</strong> genomic isl<strong>and</strong>s (GEIs) are all considered<br />

MGEs (Table 2). Bacteriophages are the most sophisticated,<br />

as they produce their own prote<strong>in</strong> coat to protect the<br />

genetic material (which can be DNA or RNA). Conjugative<br />

transposons <strong>in</strong>duce conjugation between cells, a process <strong>in</strong><br />

which cellular membranes merge to produce a bridge<br />

through which the transposon can move. Some plasmids<br />

can also <strong>in</strong>duce conjugation (a transposon always encodes<br />

transposase whereas a conjugative plasmid replicates<br />

without <strong>in</strong>tegration <strong>in</strong> the chromosome). Some of the<br />

def<strong>in</strong>itions for the various MGEs partly overlap, as <strong>in</strong>deed<br />

these terms are flexible. For <strong>in</strong>stance, transposons can<br />

<strong>in</strong>tegrate <strong>in</strong> plasmids, <strong>and</strong> bacteriophages may conta<strong>in</strong><br />

<strong>in</strong>sertion sequence elements (Burrus <strong>and</strong> Waldor 2004).<br />

MGEs constitute potentially foreign DNA located <strong>in</strong> a<br />

conceptual ‘flexible’ gene pool, from where ‘donated’<br />

DNA is made available for recipient cells. Once the MGE<br />

is transferred <strong>in</strong>to the recipient cell, the DNA will either<br />

<strong>in</strong>sert <strong>in</strong>to a region on the chromosome or it will start to<br />

use its own replication machinery. If the MGE is<br />

<strong>in</strong>tegrated <strong>in</strong>to the genome, for example, like a pathogenicity<br />

isl<strong>and</strong> (PAI), the genes (or operon) will start to be<br />

expressed, thus add<strong>in</strong>g a new characteristic to the cell. The<br />

MGE may later initiate ‘donation’ of DNA either to a next<br />

recipient (for which the trigger is as yet unknown) or to the<br />

flexible gene pool, perhaps tak<strong>in</strong>g with it a ‘new’ or<br />

additional gene or function. The <strong>in</strong>tegrated MGE may also<br />

become immobile as a result of chromosomal re-arrangements,<br />

duplications or sequence <strong>in</strong>sertions/deletions. In the<br />

case of such rendered immobility, the <strong>in</strong>tegrated MGE<br />

becomes a permanent genomic element or genomic isl<strong>and</strong>.<br />

At a later stage, the genomic isl<strong>and</strong> may be modified <strong>and</strong><br />

rendered mobile aga<strong>in</strong>, mak<strong>in</strong>g it available for transfer to<br />

the flexible gene pool once aga<strong>in</strong>.<br />

As the MGEs listed in Table 2 would merit a review paper of their own, this review focuses on two, namely, insertion sequence elements and GEIs. These two MGEs are of particular interest because our knowledge of them has improved dramatically as a direct result of genome sequence availability, and because of their impact on the diversity of bacteria.<br />

Insertion sequence elements<br />

IS elements are small DNA sequences, generally less than<br />

2.5 kb in length, that encode functions involved in their own<br />

translocation and can transpose both within and between<br />

genomes (Mahillon et al. 1999). IS elements were<br />

orig<strong>in</strong>ally described as a subset of transposable elements<br />

(Prescott et al. 1999). IS elements are the simplest form of<br />

MGE <strong>and</strong> a key component of a majority of the more<br />

complex transposable elements, found both <strong>in</strong> bacterial <strong>and</strong><br />

eukaryotic genomes. A number of reviews deal with IS<br />

elements <strong>in</strong> greater depth (van Belkum et al. 1998;<br />

Mahillon et al. 1999; Galun 2003).<br />

An IS conta<strong>in</strong>s a transposase gene, flanked by term<strong>in</strong>al<br />

<strong>in</strong>verted repeats (the sequence of one flank is encoded on<br />

the opposite str<strong>and</strong> of the other flank). One of these repeats<br />

classically conta<strong>in</strong>s the promoter for the transposase gene<br />

(Fig. 3; Galun 2003). The IS elements are also flanked by<br />

short, directly repeated sequences, which are generated <strong>in</strong><br />

the recipient DNA as a result of <strong>in</strong>sertion.<br />
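The two structural features just described, terminal inverted repeats and the short direct repeats created on insertion, can be illustrated with a minimal Python sketch. The sequences, the function names and the 3 bp duplication length are toy assumptions chosen for illustration, not values taken from any real IS family.

```python
# Toy illustration of IS element structure; all sequences are invented.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element):
    """Count how many bases at the left end match the reverse
    complement of the right end (the terminal inverted repeat)."""
    rc = reverse_complement(element)
    n = 0
    while n < len(element) // 2 and element[n] == rc[n]:
        n += 1
    return n


def insert_with_target_duplication(target, pos, element, dup_len):
    """Insert element at pos and duplicate dup_len target bases,
    so that the same short sequence flanks the element on both sides."""
    return target[:pos + dup_len] + element + target[pos:]


IRL = "GGTCTGAA"                 # hypothetical left inverted repeat
CORE = "ATGCCCAAACGC"            # stands in for the transposase gene
IS_ELEMENT = IRL + CORE + reverse_complement(IRL)

tir_len = terminal_inverted_repeat_length(IS_ELEMENT)
inserted = insert_with_target_duplication("AAACGTTTT", 3, IS_ELEMENT, 3)

print(tir_len)    # length of the inverted repeat found at the ends
print(inserted)   # the target bases CGT now flank the element on both sides
```

In this sketch the right inverted repeat is simply the reverse complement of the left one; real inverted repeats are often imperfect, so a practical search would tolerate mismatches.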

The activity of transposable elements <strong>in</strong> genomes was<br />

first noted by McCl<strong>in</strong>tock (1950) <strong>in</strong> maize, although at that<br />

time the mechanism beh<strong>in</strong>d the observed genetic changes<br />

was not understood. Starl<strong>in</strong>ger <strong>and</strong> Saedler (1976) provided<br />

the first review of IS elements <strong>in</strong> bacterial genomes. As<br />

noted by Lupski <strong>and</strong> We<strong>in</strong>stock (1992), the first ISs were<br />

classified before their function, orig<strong>in</strong> <strong>and</strong> dispersion<br />

mechanisms were understood. The present genomic era<br />

has resulted <strong>in</strong> advances <strong>in</strong> their classification, underst<strong>and</strong><strong>in</strong>g<br />

of mechanisms of dispersion <strong>and</strong> identification of<br />

their role <strong>in</strong> evolution (van Belkum et al. 1998; Mahillon et<br />

al. 1999). Although the classical ISs are considered to be<br />

evolutionarily neutral, as each can translocate only its own<br />

transposase, they are the means by which genomic isl<strong>and</strong>s<br />

(for example PAIs <strong>and</strong> metabolic isl<strong>and</strong>s) are transferred,<br />

<strong>and</strong> they also play a role <strong>in</strong> plasmid <strong>in</strong>tegration (Rocha et al.<br />

1999). Variation <strong>in</strong> the excision of ISs promotes genome<br />

rearrangements (<strong>in</strong>clud<strong>in</strong>g deletions, <strong>in</strong>versions <strong>and</strong> replicon<br />

fusions; Mahillon et al. 1999). Antibiotic resistance<br />

genes are frequently spread with<strong>in</strong> bacterial populations<br />

with the aid of ISs, which gives these simple elements<br />

cl<strong>in</strong>ical relevance. F<strong>in</strong>ally, <strong>in</strong> special cases, IS elements can<br />

<strong>in</strong>directly cause antigenic variation, a process <strong>in</strong> which a<br />

gene is switched off <strong>and</strong> on <strong>in</strong> a reversible manner with<strong>in</strong> a<br />

bacterial population (Talarico et al. 2005). IS sequences that are present in the first part of a gene can cause slippage during replication, as DNA polymerase has difficulties with correct replication of short multiple repeats. The result can be a frame shift with consequential inactivation, but the next frame shift can restore gene function. Such slippage can also vary the distance between a promoter and its gene and, thus, the activity of that promoter. Examples involving genes with a role in pathogenicity, antigenic variation of surface-exposed proteins and environmental adaptation have been described (van Belkum et al. 1998; Rocha et al. 1999).<br />

Fig. 3 Organisation of a typical insertion sequence. The IS is represented as an open box in which the terminal inverted repeats are shown as blue boxes labelled IRL (left IR) and IRR (right IR). An open reading frame encoding the transposase (grey box) is located in the IS. WXY boxes flanking the IS represent short directly repeated sequences generated in the target DNA as a consequence of insertion. The transposase promoter is localised in IRL<br />
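The reversible on/off switching caused by replication slippage can be made concrete with a small Python sketch. It assumes a hypothetical gene in which a run of G residues follows the start codon; the gene sequence and the repeat counts are invented, and only the standard genetic code is taken as given.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def gene_with_tract(copies):
    """Hypothetical gene: start codon, a poly-G tract, then the rest of the gene."""
    return "ATG" + "G" * copies + "CTTGACAAACGTTAA"


on = translate(gene_with_tract(9))    # tract length keeps the reading frame
off = translate(gene_with_tract(10))  # one extra G shifts the frame

print(on)   # full-length toy product
print(off)  # truncated product after the frame shift
```

Losing the extra G again (or gaining two more, so that the tract grows by a full codon) puts the downstream sequence back in frame, which is the sense in which the next frame shift can restore gene function.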

Monitoring of these elements has provided insights into molecular processes in bacterial genomes and into the nature of IS elements. For example, understanding the regulatory mechanisms of IS elements has shown the importance of the compromise that IS elements (and MGEs in general) strike between maintaining a stable host genome and endangering the survival of the host through excessive transposition activity (Nagy and Chandler 2004). It has<br />

also been suggested that IS expansion occurs dur<strong>in</strong>g an<br />

evolutionary bottleneck, which reduces effective population<br />

size <strong>and</strong> the degree of <strong>in</strong>traspecies competition<br />

(Parkhill et al. 2003).<br />

Genomic isl<strong>and</strong>s<br />

GEIs, also referred to as <strong>in</strong>tegrative <strong>and</strong> conjugative<br />

elements or ICEl<strong>and</strong>s (van der Meer <strong>and</strong> Sentchilo 2003),<br />

are large chromosomal regions that cluster functionally<br />

related genes, are flanked by direct repeat sequences <strong>and</strong><br />

are located near an <strong>in</strong>tegrase or transposase gene <strong>and</strong> often<br />

also near a tRNA gene. Furthermore, GEIs must have a GC<br />

composition different from the rest of the genome. GEIs<br />

<strong>in</strong>clude pathogenicity isl<strong>and</strong>s, symbiosis isl<strong>and</strong>s (SYIs),<br />

metabolic isl<strong>and</strong>s (MEIs), antibiotic resistance isl<strong>and</strong>s<br />

(REIs) <strong>and</strong> secretion system isl<strong>and</strong>s (SEIs) (Zhang <strong>and</strong><br />

Zhang 2004). This remarkable variety of GEIs demonstrates<br />

the power of horizontal gene transfer, as they are<br />

believed to be the result of <strong>in</strong>terspecies DNA transfer. With<br />

multiple genes neatly clustered <strong>in</strong> functional groups<br />

<strong>in</strong>clud<strong>in</strong>g all necessary regulatory <strong>and</strong> secretory genes,<br />

the power of transferr<strong>in</strong>g such ‘adaptive genetic bombs’<br />

can be easily imag<strong>in</strong>ed.<br />
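The GC-composition criterion mentioned above lends itself to a simple screen: compute the GC fraction in windows along the chromosome and flag windows that deviate from the genome-wide value. The sketch below is a minimal illustration in Python; the window size, the threshold and the toy genome are arbitrary assumptions, and a real island search would combine this signal with the other criteria (flanking repeats, integrase genes, tRNA genes).

```python
def gc_fraction(seq):
    """Fraction of G and C bases in an upper-case DNA string."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome, window, threshold):
    """Return (start, gc) for non-overlapping windows whose GC fraction
    differs from the genome-wide GC fraction by more than threshold."""
    genome_gc = gc_fraction(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, window):
        gc = gc_fraction(genome[start:start + window])
        if abs(gc - genome_gc) > threshold:
            hits.append((start, gc))
    return hits


# Toy genome: a 50% GC backbone with an AT-rich "island" in the middle.
backbone = "ACGT" * 50
island = "AT" * 50
toy_genome = backbone + island + backbone

hits = atypical_windows(toy_genome, window=100, threshold=0.2)
print(hits)  # windows flagged as compositionally atypical
```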

Genome sequences have revealed that GEIs are common in bacteria as a result of successful horizontal transfers of DNA from a donor genome to a recipient genome. In most cases, the nature of the donor is unfortunately unknown. Even when an identified GEI bears a high resemblance to a section of another sequenced organism, one should not assume (though this mistake has frequently been made) that the GEI was directly received from that other organism. The transfer could well have involved a third unidentified species, serving either as an intermediate between the first two or as the donor for the others. These possibilities are frequently not recognised, as people can be misled by the available genome sequences and are not sufficiently aware of all those bacterial genomes for which we currently lack sequence information.<br />

The discovery of abundant genomic islands is strengthening the concept of a bacterial genome being quite dynamic and consisting of a backbone genome supplemented with adaptive genome modules, which may or may not be present in a given strain of the species (Fraser-Liggett 2005). All modules available to the species (but never all present in one strain) would comprise the gene pool of that organism. This concept clearly does not apply to strictly clonal species, in which all isolates or strains closely resemble each other (as is the case, for instance, with Bacillus anthracis), but it better describes the situation for frequently observed highly diverse species, such as E. coli or Streptomyces. Nevertheless, the timescale at which these events take place should not be ignored. Genomes are the sum of thousands of years of evolution. Observations of evolutionary events taking place in ‘real time’ are still relatively rare.<br />

Pathogenicity islands<br />

PAIs are now considered a subtype of genomic islands but were among the earliest islands to be described. PAIs harbour pathogenicity-related genes, thus potentially conferring a pathogenic phenotype on a recipient genome. Figure 4 illustrates a generalised model of a PAI. As with other GEIs, PAIs are commonly inserted into tRNA genes, which may be preferred sites of insertion due to their relative conservation and redundancy (Dobrindt et al. 2004). PAIs are flanked by direct repeat sequences and contain an integrase gene that enables integration into the recipient DNA. A feature observed for many PAIs (and originally included in their definition although not always present) is the presence of a type III secretion system, a set of genes building an apparatus to specifically inject virulence factors into the host cell (Jores et al. 2004). Numerous investigations have identified and analysed PAIs (McGillivary et al. 2005; Middendorf et al. 2004; Paulsen et al. 2003; Schneider et al. 2004; Zubrzycki 2004; Schmidt and Hensel 2004).<br />

Fig. 4 Generalised diagrammatic representation of a pathogenicity island. Commonly inserted into a tRNA gene sequence, flanked by direct repeat sequences, containing an integrase (int) gene, commonly containing insertion sequence elements, and harbouring functional genes (with virulence associated properties), which may be organised into an operon structure. Sometimes, a type III secretion system is also present<br />

Horizontal gene transfer <strong>and</strong> restriction modification<br />

systems<br />

Evidence of HGT (also referred to as lateral gene transfer,<br />

LGT) dates back more than 30 years (Falkow 1975), with<br />

the f<strong>in</strong>d<strong>in</strong>g of transposable elements. Although such events<br />

were considered only exceptional cases at that time, it is<br />

now evident that HGT events can make a substantial<br />

contribution to the generation of genetic diversity. As with<br />

all other features, the degree of horizontal transfer varies<br />

amongst species. Ochman et al. (2000) assessed 19<br />

completely sequenced bacterial genomes <strong>and</strong> reported<br />

that the proportion of foreign proteins varies from 0%<br />

(Mycoplasma genitalium) to about 17% (Synechocystis<br />

spp). These f<strong>in</strong>d<strong>in</strong>gs were supported by others <strong>in</strong>clud<strong>in</strong>g<br />

Dufraigne et al. (2005). Ortutay et al. (2003) undertook a<br />

genomic-scale phylogenetic analysis of prote<strong>in</strong>-encod<strong>in</strong>g<br />

genes from five closely related Chlamydia spp <strong>and</strong><br />

identified a set of sequences that have arisen via HGT since<br />

the divergence of the Chlamydia lineage. These data<br />

illustrate the significant role of HGT <strong>in</strong> the evolution of<br />

particular bacterial species. It is not surpris<strong>in</strong>g that obligate<br />

<strong>in</strong>tracellular pathogens show less evidence of recent HGT:<br />

they will not easily encounter other bacterial species with<br />

which to share DNA.<br />

Doolittle (1999a) listed three observations that can only<br />

be expla<strong>in</strong>ed by HGT. The first observation is that<br />

phylogenetic trees based on <strong>in</strong>dividual prote<strong>in</strong>-cod<strong>in</strong>g<br />

genes frequently differ substantially from the rRNA tree<br />

or from each other. The second observation comes from<br />

analysis, with<strong>in</strong> a genome, of variation <strong>in</strong> G + C content,<br />

codon usage <strong>and</strong> gene order. The third observation is a<br />

result of between-genome comparisons, which show that<br />

all genomes conta<strong>in</strong> particular genes that are more similar<br />

to homologues <strong>in</strong> distant genomes than to homologues <strong>in</strong><br />

closer relatives or <strong>in</strong>deed that are absent from all known<br />

genomes of closer relatives. Combining this evidence,<br />

Doolittle (1999b) proposed an alternative to the tree of life<br />

to describe the evolutionary history of liv<strong>in</strong>g organisms.<br />

His model of a web-like structure takes <strong>in</strong>to account the<br />

<strong>in</strong>fluence of HGT, where <strong>in</strong>teractions occur between<br />

ancestral organisms <strong>and</strong> descendants (branches) as well<br />

as between branches. A similar concept of a biological<br />

network has been further explored by Kun<strong>in</strong> et al. (2005).<br />

Such a concept is difficult to work with, <strong>and</strong> currently<br />

many microbiologists still accept a tree-like phylogenetic<br />

relationship, at least for an artificial ‘backbone’ of the<br />

species. Independent of the source (stra<strong>in</strong> or species) of the<br />

genes, phylogenetic trees can <strong>in</strong>deed be correctly produced<br />

for many genes <strong>and</strong> gene families <strong>and</strong> may describe<br />

evolutionary relationships that do not date back very far.<br />

Go<strong>in</strong>g back further <strong>in</strong> time, the vertical l<strong>in</strong>eages become<br />

weaker <strong>and</strong> the phylogenetic trees are less mean<strong>in</strong>gful. The<br />

paradoxical conclusion is that, by elucidating more of the<br />

evolutionary history of bacteria, their history has become<br />

less clear.<br />

If it is really true that horizontal gene transfer is so<br />

general, how is it still possible to recognise bacterial<br />

species? First, HGT is not so frequent that it can be easily<br />

observed as DNA exchange <strong>in</strong> ‘real time’ (other than the<br />

uptake of plasmids, spread of antibiotic resistance genes or<br />

transfection of phages). Evidence for past HGT events can<br />

be seen <strong>in</strong> many bacterial genomes <strong>and</strong> exemplifies its<br />

importance <strong>in</strong> evolution but, without a time scale, the<br />

frequency of such events cannot be estimated. Second,<br />

there are barriers that restrict HGT. It is obvious that not all<br />

bacteria share the same gene pool <strong>and</strong> only bacteria that<br />

share an ecological niche are likely to encounter <strong>and</strong> share<br />

each other’s DNA. Even under circumstances that favour<br />

DNA exchange, <strong>in</strong>ternal factors restrict the success of<br />

HGT, notably bacteriophage specificity, plasmid <strong>in</strong>compatibility,<br />

<strong>and</strong> the activity of restriction modification (RM)<br />

systems. Finally, not all putatively horizontally transferred genes in E. coli<br />

are actually translated into proteins, perhaps because of<br />

incompatibility of the translational machinery (Taoka et al.<br />

2004).<br />

The discovery of restriction enzymes which could cleave<br />

specific DNA sequences provided the basis for driv<strong>in</strong>g the<br />

“biotechnology revolution” <strong>in</strong> the 1970s. RM systems are<br />

popular <strong>in</strong> molecular genetics <strong>and</strong> are rout<strong>in</strong>ely used by<br />

most molecular biology laboratories throughout the world.<br />

The RM systems encode a modification enzyme that<br />

chemically modifies a specific short DNA sequence <strong>and</strong> a<br />

restriction endonuclease that will digest the DNA at that<br />

same specific recognition sequence unless the sequence has<br />

been modified (usually by methylation). Bacterial species<br />

(<strong>and</strong> frequently stra<strong>in</strong>s with<strong>in</strong> a species) all have their own<br />

comb<strong>in</strong>ation of RM systems (Roberts et al. 2005).<br />

Incom<strong>in</strong>g DNA with a different modification pattern will<br />

be recognised by the endonuclease of the recipient stra<strong>in</strong>,<br />

<strong>and</strong> the fate of such DNA is to be degraded. This is seen as<br />

a serious restriction for the spread of DNA through<br />

populations unless their RM systems are compatible.<br />
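The logic of an RM system, in which the endonuclease cuts at its recognition sequence unless that site has been modified, can be sketched in a few lines of Python. The example uses the EcoRI recognition sequence GAATTC, which is cut after the first G; everything else (the toy DNA, the representation of methylation as a set of site positions, the function names) is an illustrative assumption rather than a model of any particular organism.

```python
def find_sites(dna, site):
    """Return the start positions of every occurrence of site in dna."""
    positions = []
    pos = dna.find(site)
    while pos != -1:
        positions.append(pos)
        pos = dna.find(site, pos + 1)
    return positions


def methylate(dna, site):
    """Host modification enzyme: mark every recognition site as protected."""
    return set(find_sites(dna, site))


def digest(dna, site, methylated, cut_offset=1):
    """Restriction endonuclease: cut at each unprotected recognition site."""
    fragments = []
    start = 0
    for pos in find_sites(dna, site):
        if pos not in methylated:
            fragments.append(dna[start:pos + cut_offset])
            start = pos + cut_offset
    fragments.append(dna[start:])
    return fragments


SITE = "GAATTC"                      # EcoRI recognition sequence
incoming = "AAGAATTCCCGAATTCTT"      # toy DNA carrying two sites

# DNA from a donor without the matching modification is cut into pieces.
foreign = digest(incoming, SITE, methylated=set())
# DNA modified by a compatible system passes through intact.
compatible = digest(incoming, SITE, methylated=methylate(incoming, SITE))

print(foreign)
print(compatible)
```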

The analysis of RM systems at a comparative genomics<br />

level (particularly the type II restriction endonucleases) has<br />

shown the dynamic state of the respective genes (Lin et al.<br />

2001) and raised a number of questions about the view that RM<br />

genes restrict gene flow. For example, H. pylori and<br />

Campylobacter jejuni are competent to take up DNA <strong>and</strong><br />

have a large set of genes to ma<strong>in</strong>ta<strong>in</strong> this property. The<br />

dynamic nature of the H. pylori genome <strong>and</strong> its natural<br />

competence is consistent with the weakly clonal population<br />

structure of H. pylori. Nevertheless, studies on H. pylori<br />

identified at least eight type II RM systems across two<br />

stra<strong>in</strong>s with an active restriction endonuclease <strong>and</strong><br />

methylase (Kong et al. 2000; L<strong>in</strong> et al. 2001). In addition,<br />

there were several active methylase genes without an active


endonuclease. The occurrence of RM systems that are not<br />

shared between the stra<strong>in</strong>s suggests that new RM systems<br />

are readily acquired <strong>and</strong> subsequently lost as a result of<br />

mutation or recombination (Lin et al. 2001). It is difficult to<br />

envisage, however, that these systems pose barriers to gene flow,<br />

given the dynamic population structure. RM genes<br />

possibly have other advantages for the cell. For methylation<br />

genes miss<strong>in</strong>g their match<strong>in</strong>g restriction gene, it has been<br />

suggested that they may be used for regulat<strong>in</strong>g gene<br />

expression (as for DAM methylation <strong>in</strong> E. coli; Lobner-<br />

Olesen et al. 2005; Robb<strong>in</strong>s-Manke et al. 2005) <strong>and</strong> for<br />

keep<strong>in</strong>g track of which parts of the chromosome have been<br />

recently replicated (Maas 2004).<br />

Methods for compar<strong>in</strong>g bacterial genomes<br />

There are at least 20 methods to compare bacterial genomes, as shown in Table 3. Some methods are more commonly used than others, and it is beyond the scope of this review to provide a detailed analysis of each method. A few of these methods are discussed in this section.<br />

Table 3 Approaches to comparing bacterial genomes<br />

Level | Method | Reference<br />
Genome | Chromosome alignment | Carver et al. 2005<br />
Genome | AT content in the genome and upstream of genes | Ussery and Hallin 2004a<br />
Genome | Oligomer bias on leading or lagging strands | Worning et al. 2006<br />
Genome | Repeats (local and global) | Ussery et al. 2004a<br />
Genome | Periodicity of DNA structural properties | Worning et al. 2000<br />
Genome | Length comparison | Ussery and Hallin 2004b<br />
Genome | Promoter analysis | Ussery et al. 2004d<br />
Transcriptome | Organisation of rRNA operons | Ussery et al. 2004b<br />
Transcriptome | tRNAs and codon usage | Ussery et al. 2004c<br />
Transcriptome | Third nucleotide position bias in codon usage | Ussery et al. 2004c<br />
Transcriptome | Annotation quality | Skovgaard et al. 2001<br />
Proteome | Amino acid usage | Ussery et al. 2004c<br />
Proteome | BLAST atlases | Hallin et al. 2004a<br />
Proteome | BLAST matrices | Binnewies et al. 2004<br />
Proteome | Sigma factors | Kiil et al. 2005a<br />
Proteome | Transcription factors | Kummerfeld 2006<br />
Proteome | Secreted proteins | Bendtsen et al. 2005a<br />
Proteome | Membrane proteins | Bendtsen et al. 2005b<br />
Proteome | 2-D correlation of properties | Willenbrock et al. 2005<br />
Proteome | Two component signal transduction systems | Kiil et al. 2005b<br />

Chromosome alignment and size comparison<br />

Perhaps one of the easiest ways to compare genomes is by their sizes, as shown in Fig. 5. Although different phyla have different average sizes, it must be kept in mind that many of the phyla currently have few representatives and that there is a strong economic bias towards sequencing the smallest genomes, so the sizes shown here for the sequenced genomes could well be smaller than those that exist in natural ecosystems. Another way of comparing chromosomes is to do a simple alignment of the DNA sequences. Alignment programmes come in two forms. One is downloaded and run on a local computer, such as the Sanger Centre’s (Cambridge, UK) Artemis Comparison Tool (ACT, Carver et al. 2005); the other is web-based, such as “WebACT”, a web-based version of ACT with pre-computed comparisons between several hundred bacterial genomes. The latter might be easier to use for those biologists who are less computationally inclined (Abbott et al. 2005).<br />

AT content in genomes and promoter analysis<br />

Another relatively easy method to compare genomes is by their AT content, which ranges from 78% (Wigglesworthia glossinidia) to 27% (Clavibacter michiganensis) for the 300 genomes sequenced at the time of writing. When the variation of the AT content within a given genome is examined, in addition to the average for the whole genome, two general trends can be seen for nearly all of the bacterial genomes. First, on a more global chromosomal level, there is a tendency for the region around the origin of DNA replication to be more GC rich (i.e. less AT rich) and the region around the replication terminus to be more AT rich (Hallin et al. 2004b). Second, the average AT content of the DNA about 400 bp upstream of the translation start site, taken over all the genes in a genome, is higher than that of the 400 bp downstream (Hallin et al. 2004b). This makes sense in that the DNA will need to melt more easily in order for transcription to start.<br />
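The comparison of AT content upstream and downstream of translation start sites, described in the AT content section, can be sketched as follows in Python. This is a minimal illustration under simplifying assumptions: genes are on the forward strand only, start positions are supplied by the caller, and the toy genome and the 10 bp window stand in for a real annotation and the much larger windows used in the text.

```python
from statistics import mean


def at_fraction(seq):
    """Fraction of A and T bases in an upper-case DNA string."""
    return (seq.count("A") + seq.count("T")) / len(seq)


def upstream_downstream_at(genome, starts, window):
    """Average AT fraction in the window before and after each start site.
    Start sites too close to either end of the sequence are skipped."""
    upstream, downstream = [], []
    for start in starts:
        if start - window < 0 or start + window > len(genome):
            continue
        upstream.append(at_fraction(genome[start - window:start]))
        downstream.append(at_fraction(genome[start:start + window]))
    return mean(upstream), mean(downstream)


# Toy genome: two genes, each preceded by an AT-rich non-coding stretch.
toy_genome = "ATATATATAT" + "ATGGCGCGCC" + "AAAATTTTGC" + "ATGCCCGGGC"
toy_starts = [10, 30]  # positions of the two ATG start codons

up_at, down_at = upstream_downstream_at(toy_genome, toy_starts, window=10)
print(up_at, down_at)
```

For a real genome the start positions would come from the annotation, and genes on the reverse strand would need their flanking windows reverse-complemented before counting.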


Fig. 5 Genome length distribution for 287 bacterial chromosomes,<br />

shown as a box-and-whiskers plot for each phylum. The number of<br />

chromosomes <strong>in</strong> each phylum is shown on the axis. Most of the<br />

bacterial genomes shown are either Proteobacteria (156 genomes) or<br />

tRNAs, codon usage and amino acid usage<br />

As mentioned above, the 200 bp upstream of translation<br />

start sites is more AT rich, on average, than the 200 bp<br />

downstream. However, if the unsmoothed data is exam<strong>in</strong>ed<br />

(the grey l<strong>in</strong>es <strong>in</strong> Fig. 6, panel a), there is much “noise” <strong>in</strong><br />

the cod<strong>in</strong>g sequence, compared to the upstream, noncod<strong>in</strong>g<br />

DNA. This is due to bias <strong>in</strong> codon usage, as shown <strong>in</strong><br />

Fig. 6, panel b. The genome for a given organism will tend<br />

to show a preference towards certain codons, which can be<br />

seen as a bias in the third codon position (Fig. 6, panel c).<br />

Finally, these codon biases are also in part affected by<br />

which amino acids an organism uses, as shown in panel d<br />

of Fig. 6. The amino acid usage of different E. coli<br />

proteomes differs: for example, E. coli K-12 shows the same<br />

amino acid usage as Salmonella enterica LT2, while the<br />

usage in E. coli O157 resembles that of Shigella flexneri.<br />

Thus, two different E. coli genomes can have quite<br />

different am<strong>in</strong>o acid usage (which might not be that<br />

surpris<strong>in</strong>g <strong>in</strong> view of the differences between stra<strong>in</strong>s of this<br />

species, see Table 1).<br />

BLAST atlases<br />

The GenomeAtlas is a method to visualise structural<br />

features of an entire bacterial genome sequence as one plot.<br />

The plots are created us<strong>in</strong>g the “GeneWiz” programme,<br />

Firmicutes (70). At the time of writ<strong>in</strong>g, the largest complete bacterial<br />

genome sequenced is that of Burkholderia xenovorans, which is<br />

consists of 9,703,676 bp with<strong>in</strong> two chromosomes, <strong>and</strong> the smallest<br />

is that of M. genitalium genome of 580,074 bp<br />

developed at <strong>CBS</strong> (Pedersen et al. 2000). A more recent<br />

extension of this method is the development of the<br />

“genome BLAST atlas”, <strong>in</strong> which genes from different<br />

genomes are blasted aga<strong>in</strong>st a reference genome <strong>and</strong><br />

visualised us<strong>in</strong>g an atlas plot. BLAST atlases can provide<br />

additional contextual <strong>in</strong>formation about regions which<br />

conta<strong>in</strong> few conserved genes. For example, a new genome<br />

might have a few small isl<strong>and</strong>s of unique prote<strong>in</strong>s, <strong>and</strong><br />

these regions might be more AT rich or might be expected<br />

to be potentially highly expressed, based on chromosomal<br />

structural <strong>in</strong>formation also provided <strong>in</strong> the plots. As<br />

mentioned above, when the 20 E. coli sequenced genomes<br />

<strong>in</strong> Table 1 are compared, an enormous amount of diversity<br />

is found. A BLAST atlas for E.coli 0157 is shown <strong>in</strong> Fig 7a.<br />

Several regions of the chromosome have “holes” represent<strong>in</strong>g<br />

large segments of miss<strong>in</strong>g genes <strong>in</strong> some organisms,<br />

compared to the reference genome. In a sense, this<br />

<strong>in</strong>formation is somewhat similar to that obta<strong>in</strong>ed by the<br />

ACT plots mentioned above, although now the comparisons<br />

are be<strong>in</strong>g made at the level of presence/absence of<br />

clusters of prote<strong>in</strong>s. In Fig. 7b, some of the regions<br />

conta<strong>in</strong><strong>in</strong>g gaps are more AT rich, some conta<strong>in</strong> repeats <strong>and</strong><br />

a few (marked) conta<strong>in</strong> genes that might be highly<br />

expressed, based on chromat<strong>in</strong> properties. Thus, this tool<br />

can give a quick overview of the comparison of many<br />

genomes.<br />
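The “holes” described above can be located programmatically once each reference gene has been marked as having, or lacking, a BLAST hit in another genome. The following is a minimal sketch under stated assumptions, not the GeneWiz or BLAST atlas implementation: the function name `find_gaps`, the `min_genes` threshold and the toy hit list are all hypothetical, and the circular nature of the chromosome is ignored.

```python
def find_gaps(has_hit, min_genes=3):
    """Return (start, end) index pairs (inclusive) for runs of at least
    min_genes consecutive reference genes without a BLAST hit.

    has_hit is a list of booleans in chromosomal gene order (hypothetical
    input; in practice it would be derived from parsed BLAST reports).
    """
    gaps = []
    run_start = None
    for i, hit in enumerate(has_hit):
        if not hit and run_start is None:
            run_start = i                      # a run of missing genes begins
        elif hit and run_start is not None:
            if i - run_start >= min_genes:     # keep only sufficiently long runs
                gaps.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the last gene (linear view of the genome).
    if run_start is not None and len(has_hit) - run_start >= min_genes:
        gaps.append((run_start, len(has_hit) - 1))
    return gaps


# Toy example: True means the reference gene has a hit in the other genome.
hits = [True, True, False, False, False, False,
        True, False, True, False, False, False]
print(find_gaps(hits))  # [(2, 5), (9, 11)]
```

Isolated missing genes (such as index 7 in the toy list) are ignored, so only larger islands are reported, in line with the focus on large segments of missing genes.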

In Fig. 7a, the gaps correspond to regions of the E. coli O157 genome whose genes are missing from the other strains. Similar patterns can be seen for many other bacterial genomes. For example, in Fig. 7b, there are four large gaps in the C. jejuni RM1221 genome compared to other epsilon Proteobacteria. These correspond to phage insertion sites in C. jejuni RM1221, as described in the original genome sequence publication (Fouts et al. 2005). Similar results have been observed for Streptococcus (Hallin et al. 2004a). In all three of these cases, there are large regions which contain many genes that are missing in other genomes of the same species. These clusters of genes often contain evidence that they came from phages, which appears to be an efficient method of bringing new DNA into a genome.

Fig. 6 Genomic properties of Streptomyces coelicolor A3(2). a Comparison of AT content upstream and downstream of all 7,825 genes; the genes are all oriented in the same direction and aligned such that the translation start site is in the middle. Z-scores (standard deviations from the chromosomal average) are plotted, as described previously (Ussery and Hallin 2004a). b Codon usage of the same set of 7,825 genes. The frequency of occurrence of each of the 64 codons is plotted in a star plot; note that most codons have a relatively low frequency of usage. c Biases in codon position are plotted as frequencies; note that there is a strong tendency for Cs and Gs in the third position. d Amino acid usage of each of the 20 amino acids for the entire S. coelicolor proteome is plotted as a frequency of the total; the amino acids in this plot are grouped according to their properties; for example, all the aliphatic amino acids (A, V, L, I and G) are together and there is a general trend for this proteome to favour aliphatic amino acids, with the exception of isoleucine. The three star plots are as described previously (Ussery et al. 2004c)
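The codon statistics summarised in Fig. 6 (panels b and c) can be illustrated with a short calculation. This is a sketch only, assuming an in-frame coding sequence given as a plain string; the function names and the toy sequence are hypothetical and are not taken from the tools used to draw the figure.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in one coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3           # drop any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def gc_by_codon_position(cds):
    """Return the G+C fraction at codon positions 1, 2 and 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    fractions = []
    for pos in range(3):
        bases = cds[pos:usable:3]              # every third base from this offset
        fractions.append(sum(b in "GC" for b in bases) / len(bases))
    return fractions


# Toy coding sequence (hypothetical): ATG GCC GAC GTC CTG GGC TAA
toy_cds = "ATGGCCGACGTCCTGGGCTAA"
usage = codon_usage(toy_cds)
gc1, gc2, gc3 = gc_by_codon_position(toy_cds)
print(sorted(usage))
print(round(gc1, 2), round(gc2, 2), round(gc3, 2))
```

In a GC-rich genome such as S. coelicolor, the third-position value would be expected to be the highest of the three, which is the bias shown in panel c. The functions assume a non-empty sequence of at least one full codon.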




Fig. 7 Genome BLAST atlases. The outer circles represent BLAST hits of a given genome (named in the legend) to the reference genome (named in the center of the atlas). The colours are scaled such that good BLAST hits (E = 10^-40) are darkly shaded, whilst regions containing no hits are shown in light grey, as described previously (Hallin et al. 2004a). a Genome BLAST atlas of E. coli O157 EDL933 vs four other sequenced E. coli strains (the four outermost circles; the genomes are, going from the outermost towards the center, E. coli K-12 MG1655, E. coli K-12 W3110, E. coli CFT073 and E. coli O157 RIMD0509952). b Genome BLAST atlas of C. jejuni vs other epsilon Proteobacteria

BLAST matrices<br />

Figure 7a,b illustrates the use of BLAST atlases to compare genome sequences. However, with several hundred genomes available, there is a need for a faster way of getting an overview of genome similarity. One method is the use of reciprocal hits, that is, to BLAST all the proteins encoded in a genome of interest against those in another genome (Binnewies et al. 2004). First, the genomes of interest are selected (e.g. all genomes of Proteobacteria), then a BLAST matrix can be displayed from this selection. The results are pre-generated and the system keeps track of sequence updates by generating MD5 checksums of all sequences and the combinations in which they have been BLASTed. The MD5 checksum (also termed a message digest) is a 32-digit hexadecimal string that is, for practical purposes, unique to an input string, e.g. a genomic sequence. The system maintains an all-against-all BLAST database, updating only the missing comparisons; that is, changing the sequence of a record or inserting a new record will cause a BLAST run of that sequence against all the existing sequences of the database.

Fig. 8 The BLAST table shows the overall protein homology between all combinations of the five available Vibrio sequences. Only hits containing at least 80% of the length of the gene and with an E-value of 1×10 or better are counted. The diagonal (red/pink) indicates the fraction of proteins that have homologous hits within the proteome itself; the fraction is similar in all genomes, and the intensity is shown by the red colour, scaled from ~24% (grey) to ~27% (red). Note that the largest genome also has the highest fraction of internal homologs. The green area for the rest of the table, on each side of the diagonal, shows the number of proteins that have homologous hits between different Vibrio genomes. As before, the fraction is indicated by the intensity of the colour (green), scaled from ~57% (grey) to ~83% (green). In general, it is clear that these organisms share a high percentage of their genes with the other Vibrio species, which should be expected because they are from the same genus

By having multiple genomes in a given selection, an all-against-all BLAST matrix can be presented showing the percentage of genes that are shared between sequences, both on a protein and on a nucleotide level. Each such percentage is supplied with a link to give a full listing from the BLAST report. Fig. 8 shows an example of such a BLAST matrix, with the diagonal (in red) reflecting the internal homologues of a given genome. The boxes are colour-coded such that the intensity represents the fraction of hits (Binnewies et al. 2004).
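The checksum-based bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions rather than the system's actual code: the function names, the in-memory `done` set and the toy sequences are hypothetical, and a real system would store the digests in a database and launch BLAST for each pending pair.

```python
import hashlib
from itertools import product


def md5_digest(sequence):
    """Return the 32-character hexadecimal MD5 digest of a sequence."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def pending_comparisons(sequences, done):
    """List the (query, subject) name pairs that still need a BLAST run.

    sequences: dict mapping a record name to its sequence.
    done: set of (query_md5, subject_md5) pairs that were already BLASTed.
    """
    digests = {name: md5_digest(seq) for name, seq in sequences.items()}
    return [(q, s) for q, s in product(sequences, repeat=2)
            if (digests[q], digests[s]) not in done]


# Toy records (hypothetical); every combination has already been compared.
records = {"genomeA": "ATGCATGC", "genomeB": "GGCCTTAA"}
done = {(md5_digest(q), md5_digest(s))
        for q in records.values() for s in records.values()}
print(pending_comparisons(records, done))   # []

# Changing one sequence changes its digest, so only comparisons that
# involve the changed record are scheduled again.
records["genomeB"] = "GGCCTTAG"
print(pending_comparisons(records, done))
```

Keying completed runs on digests rather than on record names means that a renamed but unchanged sequence does not trigger new comparisons, whereas any change to the sequence itself does.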

Meta-genomics: comparison of all the genomes in an ecosystem

The term “metagenomics” is used for genome sequencing projects in which many organisms are sequenced at once by shotgun cloning of all DNA present in a sample (Handelsman 2004). This enables the investigation of microbial ecosystems containing microbes that are not (presently) culturable in pure form (Handelsman 2004). The reasons why organisms remain uncultured can be practical (e.g. thermophilic bacteria grow at a temperature above the melting point of agar), physiological (e.g. extremophiles that grow in pure culture can have very different properties from those observed in their true environment) or biological (symbiotic life forms cannot be cultured in microbiologically pure form). The first genome sequence obtained from a non-culturable bacterium was indeed that of Buchnera aphidicola, a symbiont of aphids. This sequence was not obtained by meta-genomics at the total genome DNA level but rather at the rRNA level. Cell counts compared to plate counts showed that the latter can be wrong by orders of magnitude: many viable bacteria fail to grow on solid culture medium. The isolation of bulk RNA and the subsequent determination of rRNA sequences using specific primers allowed qualitative analysis to be performed for identifying novel bacterial species or ribotypes present in an ecosystem (Olsen et al. 1986). The application of PCR improved the sensitivity of such approaches, but the limitation to rRNA sequences confined analyses to phylogenetic information only, and little further knowledge was obtained about the new species. Metagenomics can be used to generate complete or fragmented genome sequences of organisms that might be abundant in nature but are not easily culturable.

The acid mine drainage sequencing project has shown the potential of meta-genomics (Tyson et al. 2004). The mine water of the Richmond mine is covered with a biofilm of bacteria despite its hostile environment: an extremely acidic pH (between 0 and 1), high concentrations of metal ions, including copper, zinc and arsenic, and the absence of carbon or nitrogen sources (other than from air). The biofilm was composed of relatively few organisms, enabling the sequencing of shotgun-cloned DNA and the sorting of fragments according to their G + C content into nearly complete bacterial genomes. A dominant bacterial genus, Leptospirillum, was identified, and a less abundant Sulfobacillus spp. and some Archaea were also present. The findings greatly improved understanding of this ecosystem. The predominant bacteria were responsible for nitrogen and carbon fixation (Leptospirillum group III), whereas several species were able to generate energy from iron oxidation (Ferroplasma and Leptospirillum spp.). As each sequenced DNA fragment in this approach is obtained from a different individual (whereas in classical genome sequencing all DNA is obtained from one clone), information on polymorphisms also becomes available. As more complex ecosystems are studied, the puzzle of genome assembly becomes more difficult due to the presence of more species, genomic rearrangements and horizontal gene transfer events.
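The idea of sorting fragments by G + C content can be illustrated with a small example. This is a sketch only and does not reproduce the procedure of Tyson et al. (2004): the function names, the single bin boundary of 0.5 and the toy fragments are hypothetical.

```python
def gc_content(seq):
    """Return the G+C fraction of a nucleotide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def bin_by_gc(fragments, boundaries):
    """Group fragment names into bins delimited by G+C fraction boundaries.

    fragments: dict mapping a fragment name to its sequence.
    boundaries: G+C fractions separating the bins; bin 0 lies below the
    lowest boundary, and the last bin at or above the highest one.
    """
    bins = {i: [] for i in range(len(boundaries) + 1)}
    for name, seq in fragments.items():
        gc = gc_content(seq)
        index = sum(gc >= b for b in boundaries)   # number of boundaries passed
        bins[index].append(name)
    return bins


# Toy fragments (hypothetical), with G+C fractions of 0.2, 0.8 and 0.4.
fragments = {
    "frag1": "ATATATGCAT",
    "frag2": "GCGCGCATGC",
    "frag3": "ATGCATGCAT",
}
print(bin_by_gc(fragments, [0.5]))
# {0: ['frag1', 'frag3'], 1: ['frag2']}
```

Such binning is only informative when the organisms in the sample differ clearly in base composition, which fits the low-diversity biofilm described above.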

The largest attempt so far at metagenomics was initiated by C. Venter to sequence the microbial ecosystem in the Sargasso Sea (Venter et al. 2004). Seawater was sampled by filtering to specifically recover bacterial (and not viral or amoebal) DNA. Over 1 billion base pairs of sequence were generated, which were attributed to at least 1,800 species. As the abundance of individual species determines their coverage in shotgun cloning, this coverage (or rather the mean of its Poisson distribution) was used to sort DNA scaffolds (a scaffold is a reconstructed genomic region), and oligonucleotide frequencies were used to refine this sorting. Although the complexity of the investigated ecosystem did not allow complete assembly of individual genomes, the scaffolds belonging to the most abundant species could be attributed to Burkholderia and Shewanella-like species. As with the acid mine drainage project, polymorphisms were detected with varying frequencies. In fact, the dataset ranged from organisms belonging to a single, clonal species (few polymorphisms) to a population continuum in which some clonal complexes could be recognised. These observations illustrate the ‘unnatural’ approach of studying only pure bacterial cultures that have a strict clonal structure, in contrast to natural environments where the population structure is much more fluid and the concept of clones or species is more elusive. The most impressive output of the Sargasso Sea study is the number of individual genes that were identified (69,901). Among the surprising findings was that rhodopsin (a bacterial light-driven proton pump) was abundant outside the proteobacteria, where it had previously been identified. The finding of many genes involved in phosphate uptake and utilisation of poly- and pyrophosphates is puzzling, as the marine environment is extremely phosphate-limited.
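Sorting scaffolds by the mean of a Poisson coverage distribution can be sketched as a likelihood comparison. The example below is an assumption-laden illustration, not the method of Venter et al. (2004): the candidate mean depths, the function names and the toy read depths are hypothetical. For a depth k and mean m, the Poisson log-probability is k ln(m) - m - ln(k!).

```python
import math


def poisson_log_pmf(k, mean):
    """Log-probability of observing depth k under a Poisson with this mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def assign_scaffold(depths, candidate_means):
    """Return the index of the candidate mean coverage that best explains
    the per-position read depths of one scaffold (maximum likelihood)."""
    scores = [sum(poisson_log_pmf(k, m) for k in depths)
              for m in candidate_means]
    return max(range(len(scores)), key=scores.__getitem__)


# Two hypothetical abundance classes: a rare organism (mean depth 3)
# and an abundant one (mean depth 20).
means = [3.0, 20.0]
print(assign_scaffold([2, 3, 4, 3, 2], means))    # 0
print(assign_scaffold([18, 22, 19, 21], means))   # 1
```

Scaffolds from organisms of similar abundance cannot be separated this way, which is consistent with the use of oligonucleotide frequencies as a second criterion.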

The challenge to analyse the complex communities of a nutrient-rich environment was taken up by Tringe and Rubin (2005). One sample that was analysed was derived from agricultural soil and three were from marine whale carcasses. First, rRNA libraries were generated by PCR to investigate the microbial diversity. The soil sample (DNA obtained from 5 g of surface clay loam from land that had been used for livestock) was extremely rich in species, with at least 847 ribotypes detected, representing over 12 phyla. The whale samples (two bone parts and one biofilm covering a whale carcass) were less diverse but still contained between 25 and 150 ribotypes. Although the assembly of sequences obtained from shotgun libraries was not possible, the genes that were identified on the sequenced library clones demonstrated that approximately half of the predicted proteins had similarities (homologs) in existing gene databases. Plotting the number of novel gene families against the amount of generated sequence suggested that, for the soil sample, few novel orthologues were found after sequencing 25 Mbp. The functions of predicted proteins from the sequences were naturally diverse, but for the soil sample, potassium channelling systems were overrepresented, whereas for the whale samples sodium ion exporters were abundant, which fits with the abundance of these two ions in the two environments, respectively.

Metagenomics analyses will continue to see databases expanding, with the interpretation and assembly of raw data becoming more complete. The human gastrointestinal tract, for example, is the target of a metagenomics sequencing project (Mongodin et al. 2005). It is apparent that each individual carries a large variety of microflora, probably acquired early in life (and which may have health consequences even though these organisms are not pathogenic), as well as bacterial microheterogeneity that was not recognised previously. Contrary to the common belief that Firmicutes and Bacteroides would be the most abundant microbes present in the human gut, it appears that Actinobacteria and Archaea may be more prominent (Mongodin et al. 2005). The intestinal microflora of obese mice differs considerably from that of lean animals, an observation in support of the view that the microbiota of mammals are good indicators (be it cause or effect) of their health status (Ley et al. 2005). There are clearly many microbial communities to be analysed and compared using metagenomics.

Application: computational vaccine development

Vaccines remain an extremely important tool for controlling infectious diseases of humans and animals, although they are only available for about 10% of the microorganisms known to be harmful to humans (Lund et al. 2005). Traditional vaccines typically have incorporated whole live attenuated or killed microorganisms, but, particularly for use in humans, such vaccines now have limited application due to concerns about safety, efficacy and/or ease of production. Much recent work, therefore, has focused on developing vaccines composed of prominent immunogenic parts of microorganisms (subunit vaccines) or genes encoding these components (genetic vaccines, Ellis 1999). For bacterial vaccine discovery, these newer approaches have been greatly assisted by the recent availability of whole genomic sequence data, which has allowed a new approach to vaccine development called “reverse vaccinology” (Rappuoli 2001).

In reverse vaccinology, bioinformatics tools are used to undertake comprehensive in silico screening of genomic sequence to identify genes encoding proteins that have desirable characteristics. The power of this process has increased as more and more genomic sequences that encode proteins of known function become available in the databases for comparative analysis. Targets for consideration for use in vaccines include genes encoding outer membrane proteins or lipoproteins, transmembrane domains or export signal peptides, and proteins with homologies to bacterial factors already known to be involved in virulence or pathogenicity. Surface-exposed or secreted proteins as well as virulence factors such as toxins or adhesive factors are likely to induce an immune response that may be protective (Zagursky and Russell 2001). In this way, large numbers of potential vaccine components can be identified from a whole (or partial) genome sequence. This approach was first taken for the human pathogen Neisseria meningitidis serogroup B, with 600 open reading frames (ORFs) of potential interest initially being identified (Pizza et al. 2000). Recombinant proteins from 350 ORFs were eventually produced and, after screening for distribution in different serotypes, stability, immunogenicity and cross-protection, 15 were selected as potential subunit vaccine candidates. This same approach to vaccine discovery is now being taken for a number of important human and animal pathogens (Serruto et al. 2004). Reverse vaccinology allows rapid identification of a large number of potential subunit vaccine candidates, many of which would not have been recognised by more traditional approaches. It is complemented by the use of microarrays to analyse gene expression and of proteomic approaches to study protein expression and distribution, and can be focused further by the use of computer algorithms that scan and identify sequences encoding specific epitopes involved in immunogenicity (reviewed in Lund et al. 2002; see also, for a review, Theoretical Biology and Biophysics Group, Los Alamos National Laboratory [http://www.hiv.lanl.gov/content/immunology/pdf/2002/1/Lund2002.pdf]). These algorithms have been strengthened by the availability of full genomic sequences for many pathogens.
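The in silico screening step described above amounts to filtering annotated ORFs on predicted features. The sketch below assumes that such predictions already exist for each ORF; the class `OrfAnnotation`, its field names, the selection rule and the toy records are hypothetical and are not taken from Pizza et al. (2000) or from any specific prediction tool.

```python
from dataclasses import dataclass


@dataclass
class OrfAnnotation:
    """Predicted features of one ORF (hypothetical record layout)."""
    orf_id: str
    has_signal_peptide: bool      # export signal peptide predicted
    transmembrane_helices: int    # number of predicted transmembrane domains
    is_lipoprotein: bool
    virulence_homolog: bool       # homology to a known virulence factor


def is_candidate(orf):
    """Keep ORFs predicted to be surface-exposed or secreted, or that
    resemble known virulence factors."""
    surface_or_secreted = (orf.has_signal_peptide
                           or orf.is_lipoprotein
                           or orf.transmembrane_helices > 0)
    return surface_or_secreted or orf.virulence_homolog


# Toy annotations (hypothetical).
orfs = [
    OrfAnnotation("orf0001", True, 0, False, False),
    OrfAnnotation("orf0002", False, 0, False, False),
    OrfAnnotation("orf0003", False, 0, False, True),
    OrfAnnotation("orf0004", False, 2, False, False),
]
candidates = [orf.orf_id for orf in orfs if is_candidate(orf)]
print(candidates)  # ['orf0001', 'orf0003', 'orf0004']
```

A filter of this kind only narrows the list; as the text notes, the selected proteins still have to be expressed and tested for stability, immunogenicity and cross-protection.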

Prediction methods have been developed for the three main types of epitopes, those targeting B cells, helper T lymphocytes and cytotoxic T lymphocytes, and improved methods are constantly being developed. Thus, it is possible to take a genome sequence, use some predictors as described above and select potential peptide sequences for construction of vaccines. These vaccines can be based either on chemically synthesised peptides or on DNA. Peptides can be used directly or used to construct a “polytope”, which is a composite protein made from individual epitopes.

Intellectual property rights: who owns the genome sequence?

This review started by giving the US patent numbers for the first two genomes sequenced. This final section will briefly discuss some of the issues facing researchers working with genomic data. At the time of writing, ten whole genome patents have been granted, with more patents being applied for (O’Malley et al. 2005). Some of these patents include the use of the sequence in silico and clearly raise a number of issues related to freedom to operate in research. In addition, the enforcement of the patents could be difficult, with many bioinformatic tools being developed in the public domain.

Another related difficulty has to do with using or analysing genome sequences before they are presented in scientific publications. Now that it is possible to sequence a bacterial genome in an afternoon and have a GenBank file a day or two later, the time gap between having the sequence publicly available and having the paper in print can be several years. Some public granting agencies have pushed hard for the data to be made available as soon as possible for people to search for their particular gene of interest. On the other hand, it is also understandable that the individuals who have actually sequenced the genomes need some lead time to analyse their data. With high-throughput bioinformatic techniques, it is possible, for example, for some groups to do in a few days what would take other groups months (or years) to complete.



A final problem has to do with obtaining basic information about the strain used for sequencing a genome. For example, what was the strain isolated from? What was the growth temperature or culture medium pH for the culture that the genomic DNA was derived from? What is the doubling time of this organism under these conditions? These are all important pieces of data, but they are often missing in genome publications. A “minimal information about a genome sequence” standard has recently been proposed (Field and Hughes 2005), which is in the same spirit as the MIAME standard for microarray experiments. In the future, it could well be that something resembling a GenBank file with additional biological information will be the “publication” for a bacterial genome sequence, as genome sequencing becomes ever cheaper and easier to perform. Overall, it is important that genome sequence information is released into the public domain in a timely manner so that global scientific progress can be maintained.

Acknowledgements DWU, PFH and TTB are supported by grants from the Danish Research Foundation. We are grateful to the Sanger Center for allowing prepublication access to the sequences for the E. coli 042 genome (the DNA sequence and annotation files were downloaded from the Sanger web site http://www.sanger.ac.uk/).

References<br />

Abbott JC, Aanensen DM, Rutherford K, Butcher S, Spratt BG<br />

(2005) WebACT—an onl<strong>in</strong>e companion for the Artemis<br />

Comparison Tool. Bio<strong>in</strong>formatics 21(18):3665–3666<br />

Ac<strong>in</strong>as SG, Marcel<strong>in</strong>o LA, Klepac-Ceraj V, Polz MF (2004)<br />

Divergence <strong>and</strong> redundancy of 16S rRNA sequences <strong>in</strong> genomes<br />

with multiple rrn operons. J Bacteriol 186(9):2629–2635<br />

Ala<strong>in</strong> K, Querellou J, Lesongeur F, Pignet P, Crassous P, Raguenes G,<br />

Cueff V, Cambon-Bonavita M-A (2002) Caminibacter hydrogeniphilus<br />

gen. nov., sp. nov., a novel thermophilic, hydrogen-oxidizing<br />

bacterium isolated from an East Pacific Rise<br />

hydrothermal vent. Int J Syst Evol Microbiol 52:1317–1323<br />

Alm EJ, Huang KH, Price MN, Koche RP, Keller K, Dubchak IL,<br />

Ark<strong>in</strong> AP (2005) The MicrobesOnl<strong>in</strong>e Web site for comparative<br />

genomics. Genome Res 15(7):1015–1022<br />

Alm RA, Trust TJ (1999) Analysis of the genetic diversity of<br />

Helicobacter pylori: the tale of two genomes. J Mol Med 77<br />

(12):834–846 (Review)<br />

Backhed F, Ley RE, Sonnenburg JL, Peterson DA, Gordon JI (2005)<br />

Host–bacterial mutualism <strong>in</strong> the human <strong>in</strong>test<strong>in</strong>e. Science 307<br />

(5717):1915–1920<br />

Bendtsen JD, B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Sicheritz-Ponten T, Ussery<br />

DW (2005a) Genome update: prediction of secreted prote<strong>in</strong>s <strong>in</strong><br />

225 bacterial proteomes. Microbiology 151(Pt 6):1725–1727<br />

Bendtsen JD, B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Ussery DW (2005b)<br />

Genome update: prediction of membrane prote<strong>in</strong>s <strong>in</strong> prokaryotic<br />

genomes. Microbiology 151(Pt 7):2119–2121<br />

B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Staerfeldt HH, Ussery DW (2004) Genome<br />

update: proteome comparisons. Microbiology 151(Pt 1):1–4<br />

Burrus V, Waldor MK (2004) Shap<strong>in</strong>g bacterial genomes with<br />

<strong>in</strong>tegrative <strong>and</strong> conjugative elements. Res Microbiol 155<br />

(5):376–386<br />

Carattoli A (2001) Importance of <strong>in</strong>tegrons <strong>in</strong> the diffusion of<br />

resistance. Vet Res 32(3–4):243–259<br />

Carver TJ, Rutherford KM, Berriman M, Raj<strong>and</strong>ream MA, Barrell<br />

BG, Parkhill J (2005) ACT: the Artemis Comparison Tool.<br />

Bio<strong>in</strong>formatics 21(16):3422–3423<br />

Footnote 3: http://www.ucl.ac.uk/wibr/services/docs/miamiv1.doc<br />

Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, Ecker DJ,<br />

Blyn LB (2002) A bio<strong>in</strong>formatics based approach to discover<br />

small RNA genes <strong>in</strong> the Escherichia coli genome. Biosystems<br />

65(2–3):157–177<br />

Dobrindt U, Hacker J (2001) Whole genome plasticity in pathogenic<br />

bacteria. Curr Opin Microbiol 4(5):550–557<br />

Dobr<strong>in</strong>dt U, Hochhut B, Hentschel U, Hacker J (2004) Genomic<br />

isl<strong>and</strong>s <strong>in</strong> pathogenic <strong>and</strong> environmental microorganisms. Nat<br />

Rev Microbiol (2):414–424<br />

Doolittle WF (1999a) Lateral genomics. Trends Cell Biol 9(12):<br />

M5–M8<br />

Doolittle WF (1999b) Phylogenetic classification and the universal<br />

tree. Science 284(5423):2124–2129<br />

Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P (2005)<br />

Detection and characterisation of horizontal transfers in<br />

prokaryotes using genomic signature. Nucleic Acids Res 33<br />

(1):e6<br />

Duponnois R, Ba AM, Mateille T (1999) Beneficial effects of<br />

Enterobacter cloacae and Pseudomonas mendocina for biocontrol<br />

of Meloidogyne incognita with the endospore-forming<br />

bacterium Pasteuria penetrans. Nematology 1(1):95–101<br />

Ellis RW (1999) New technologies for mak<strong>in</strong>g vacc<strong>in</strong>es. Vacc<strong>in</strong>e 17<br />

(13–14):1596–1604<br />

Falkow S (1975) Infectious multiple drug resistance. Pion Limited,<br />

London, Engl<strong>and</strong><br />

Fani R, Brilli M, Lio P (2005) The orig<strong>in</strong> <strong>and</strong> evolution of operons:<br />

the piecewise build<strong>in</strong>g of the proteobacterial histid<strong>in</strong>e operon.<br />

J Mol Evol 60(3):378–390<br />

Field D, Hughes J (2005) Catalogu<strong>in</strong>g our current genome<br />

collection. Microbiology 151(Pt 4):1016–1019<br />

Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF,<br />

Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM,<br />

McKenney K, Sutton G, FitzHugh W, Fields C, Gocayne JD, Scott<br />

J, Shirley R, Liu LI, Glodek A, Kelley JM, Weidman JF, Phillips<br />

CA, Spriggs T, Hedblom E, Cotton MD, Utterback TR, Hanna<br />

MC, Nguyen DT, Saudek DM, Brandon RC, Fine LD, Fritchman<br />

JL, Fuhrmann JL, Geoghagen NSM, Gnehm CL, McDonald LA,<br />

Small KV, Fraser CM, Smith HO, Venter JC (1995) Whole-genome<br />

random sequencing and assembly of Haemophilus<br />

influenzae Rd. Science 269(5223):496–498, 507–512<br />

Fluit AC, Schmitz F-J (2004) Resistance <strong>in</strong>tegrons <strong>and</strong> super<strong>in</strong>tegrons.<br />

Cl<strong>in</strong> Microbiol Infect 10:272–288<br />

Fouts DE, Mongod<strong>in</strong> EF, M<strong>and</strong>rell RE, Miller WG, Rasko DA,<br />

Ravel J, Br<strong>in</strong>kac LM, DeBoy RT, Parker CT, Daugherty SC,<br />

Dodson RJ, Durk<strong>in</strong> AS, Madupu R, Sullivan SA, Shetty JU,<br />

Ayodeji MA, Shvartsbeyn A, Schatz MC, Badger JH, Fraser<br />

CM, Nelson KE (2005) Major structural differences <strong>and</strong> novel<br />

potential virulence mechanisms from the genomes of multiple<br />

Campylobacter species. PLoS Biol 3(1):e15<br />

Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA,<br />

Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM,<br />

Fritchman RD, Weidman JF, Small KV, S<strong>and</strong>usky M,<br />

Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips<br />

CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC,<br />

Lucier TS, Peterson SN, Smith HO, Hutchison CA 3rd, Venter<br />

JC (1995) The m<strong>in</strong>imal gene complement of Mycoplasma<br />

genitalium. Science 270(5235):397–403<br />

Fraser-Liggett CM (2005) Insights on biology <strong>and</strong> evolution from<br />

microbial genome sequenc<strong>in</strong>g. Genome Res 15:1603–1610<br />

Galun E (2003) Transposable elements: a guide to the perplexed <strong>and</strong><br />

the novice. Kluwer Academic, Dordrecht, The Netherl<strong>and</strong>s, pp<br />

25–73<br />

Gil R, Latorre A, Moya A (2004) Bacterial endosymbionts of <strong>in</strong>sects:<br />

<strong>in</strong>sights from comparative genomics. Environ Microbiol 6<br />

(11):1109–1122<br />

Giovannoni SJ, Tripp HJ, Givan S, Podar M, Verg<strong>in</strong> KL, Baptista D,<br />

Bibbs L, Eads J, Richardson TH, Noordewier M, Rappe MS,<br />

Short JM, Carr<strong>in</strong>gton JC, Mathur EJ (2005) Genome<br />

streaml<strong>in</strong><strong>in</strong>g <strong>in</strong> a cosmopolitan oceanic bacterium. Science<br />

309(5738):1242–1245


Goebel W, Gross R (2001) Intracellular survival strategies of mutualistic<br />

and parasitic prokaryotes. Trends Microbiol 9(6):267–273<br />

Goldmann DA, Kl<strong>in</strong>ger JD (1986) Pseudomonas cepacia:<br />

biology, mechanisms of virulence, epidemiology. J Pediatr<br />

108(5 Pt 2):806–812<br />

Gottesman S (2005) Micros for microbes: non-cod<strong>in</strong>g regulatory<br />

RNAs <strong>in</strong> bacteria. Trends Genet 7:399–404<br />

Hall<strong>in</strong> PF, Ussery DW (2004) <strong>CBS</strong> genome atlas database: a dynamic<br />

storage for bio<strong>in</strong>formatic results <strong>and</strong> sequence data. Bio<strong>in</strong>formatics<br />

20(18):3682–3686<br />

Hall<strong>in</strong> PF, B<strong>in</strong>newies TT, Ussery DW (2004a) Genome update:<br />

chromosome atlases. Microbiology 150(Pt 10):3091–3093<br />

Hallin PF, Coenye T, Binnewies TT, Jarmer H, Staerfeldt HH, Ussery<br />

DW (2004b) Genome update: correlation of bacterial genomic<br />

properties. Microbiology 150(Pt 12):3899–3903<br />

H<strong>and</strong>elsman J (2004) Metagenomics: application of genomics to<br />

uncultured microorganisms. Microbiol Mol Biol Rev 68:669–685<br />

Harrison A, Dyer DW, Gillaspy A, Ray WC, Mungur R, Carson<br />

MB, Zhong H, Gipson J, Gipson M, Johnson LS, Lewis L,<br />

Bakaletz LO, Munson RS Jr (2005) Genomic sequence of an<br />

otitis media isolate of nontypeable Haemophilus <strong>in</strong>fluenzae:<br />

comparative study with H. <strong>in</strong>fluenzae serotype d, stra<strong>in</strong> KW20.<br />

J Bacteriol 187(13):4627–4636<br />

Hayashi T, Mak<strong>in</strong>o K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama<br />

K, Han CG, Ohtsubo E, Nakayama K, Murata T, Tanaka M,<br />

Tobe T, Iida T, Takami H, Honda T, Sasakawa C, Ogasawara N,<br />

Yasunaga T, Kuhara S, Shiba T, Hattori M, Sh<strong>in</strong>agawa H<br />

(2001) Complete genome sequence of enterohemorrhagic<br />

Escherichia coli O157:H7 <strong>and</strong> genomic comparison with a<br />

laboratory stra<strong>in</strong> K-12. DNA Res 8:11–22<br />

Holmes AJ, Gill<strong>in</strong>gs MR, Nield BS, Mabbutt BC, Nevala<strong>in</strong>en KM,<br />

Stokes HW (2003) The gene cassette metagenome is a basic<br />

resource for bacterial genome evolution. Environ Microbiol 5<br />

(5):383–394<br />

Horowitz NH (1945) On the evolution of biochemical synthesis.<br />

Proc Natl Acad Sci U S A 31:153–157<br />

Horowitz NH (1965) The evolution of biochemical synthesis—<br />

retrospect <strong>and</strong> prospect. In: Bryson V, Vogel HJ (eds) Evolv<strong>in</strong>g<br />

genes <strong>and</strong> prote<strong>in</strong>s. Academic, New York, pp 15–23<br />

Itoh T, Takemoto K, Mori H, Gojobori T (1999) Evolutionary<br />

<strong>in</strong>stability of operon structures disclosed by sequence comparisons<br />

of complete microbial genomes. Mol Biol Evol 3:332–346<br />

Jacob F, Monod J (1961) Genetic regulatory mechanisms <strong>in</strong> the<br />

synthesis of prote<strong>in</strong>s. J Mol Biol 3:318–356<br />

Jacob F, Perr<strong>in</strong> D, Sanchez C, Monod J (1960) Operon: a group of<br />

genes with the expression coord<strong>in</strong>ated by an operator. C R<br />

Hebd Seances Acad Sci 250:1727–1729<br />

Jaffe JD, Stange-Thomann N, Smith C, DeCaprio D, Fisher S,<br />

Butler J, Calvo S, Elk<strong>in</strong>s T, FitzGerald MG, Hafez N, Kodira<br />

CD, Major J, Wang S, Wilk<strong>in</strong>son J, Nicol R, Nusbaum C,<br />

Birren B, Berg HC, Church GM (2004) The complete genome<br />

<strong>and</strong> proteome of Mycoplasma mobile. Genome Res 14<br />

(8):1447–1461<br />

Janga SC, Collado-Vides J, Moreno-Hagelsieb G (2005) Nebulon: a<br />

system for the <strong>in</strong>ference of functional relationships of gene<br />

products from the rearrangement of predicted operons. Nucleic<br />

Acids Res 33(8):2521–2530<br />

Jores J, Rumer L, Wieler LH (2004) Impact of the locus of enterocyte<br />

effacement pathogenicity isl<strong>and</strong> on the evolution of pathogenic<br />

Escherichia coli. Int J Med Microbiol 294(2–3):103–113<br />

(Review)<br />

Juhala RJ, Ford ME, Duda RL, Youlton A, Hatfull GF, Hendrix RW<br />

(2000) Genomic sequences of bacteriophages HK97 <strong>and</strong><br />

HK022: pervasive genetic mosaicism <strong>in</strong> the lambdoid bacteriophages.<br />

J Mol Biol 299(1):27–51<br />

Kennedy SP, Ng WV, Salzberg SL, Hood L, DasSarma S (2001)<br />

Underst<strong>and</strong><strong>in</strong>g the adaptation of Halobacterium species NRC-1<br />

to its extreme environment through computational analysis of<br />

its genome sequence. Genome Res 11:1641–1650<br />

Kiil K, B<strong>in</strong>newies TT, Sicheritz-Ponten T, Willenbrock H, Hall<strong>in</strong> PF,<br />

Wassenaar TM, Ussery DW (2005a) Genome update: sigma factors<br />

<strong>in</strong> 240 bacterial genomes. Microbiology 151(Pt 10):3147–3150<br />


Kiil K, Ferchaud JB, David C, B<strong>in</strong>newies TT, Wu H, Sicheritz-<br />

Ponten T, Willenbrock H, Ussery DW (2005b) Genome update:<br />

distribution of two-component transduction systems <strong>in</strong> 250<br />

bacterial genomes. Microbiology 151(Pt 11):3447–3452<br />

Kong H, L<strong>in</strong> L-F, Porter N, Stickel S, Byrd D, Posfai J, Roberts RJ<br />

(2000) Functional analysis of putative restriction–modification<br />

system genes <strong>in</strong> the Helicobacter pylori J99 genome. Nucleic<br />

Acids Res 28:3216–3223<br />

Kummerfeld SK, Teichmann SA (2006) DBD: a transcription factor<br />

prediction database. Nucleic Acids Res 34(Database issue):<br />

D74–D81<br />

Kun<strong>in</strong> V, Goldovsky L, Darzentas N, Ouzounis CA (2005) The net<br />

of life: reconstruct<strong>in</strong>g the microbial phylogenetic network.<br />

Genome Res 15(7):954–959<br />

Kuwahara T, Yamashita A, Hirakawa H, Nakayama H, Toh H,<br />

Okada N, Kuhara S, Hattori M, Hayashi T, Ohnishi Y (2004)<br />

Genomic analysis of Bacteroides fragilis reveals extensive<br />

DNA <strong>in</strong>versions regulat<strong>in</strong>g cell surface adaptation. Proc Natl<br />

Acad Sci U S A 101(41):14919–14924<br />

Lawrence JG, Roth JR (1996) Selfish operons: horizontal transfer<br />

may drive the evolution of gene clusters. Genetics 143<br />

(4):1843–1860<br />

Lazcano A, Diaz-Villagomez E, Mills T, Oro J (1995) On the levels of<br />

enzymatic substrate specificity: implications for the early<br />

evolution of metabolic pathways. Adv Space Res 15(3):345–356<br />

Lewis M, Chang G, Horton NC, Kercher MA, Pace HC,<br />

Schumacher MA, Brennan RG, Lu P (1996) Crystal structure<br />

of the lactose operon repressor <strong>and</strong> its complexes with DNA<br />

<strong>and</strong> <strong>in</strong>ducer. Science 271(5253):1247–1254<br />

Ley RE, Backhed F, Turnbaugh P, Lozupone CA, Knight RD,<br />

Gordon JI (2005) Obesity alters gut microbial ecology. Proc<br />

Natl Acad Sci U S A 102(31):11070–11075<br />

L<strong>in</strong> L-F, Posfai J, Roberts RJ, Kong H (2001) <strong>Comparative</strong><br />

genomics of the restriction–modification systems <strong>in</strong> Helicobacter<br />

pylori. Proc Natl Acad Sci U S A 98:2740–2745<br />

Lobner-Olesen A, Skovgaard O, Mar<strong>in</strong>us MG (2005) Dam methylation:<br />

coord<strong>in</strong>at<strong>in</strong>g cellular processes. Curr Op<strong>in</strong> Microbiol 8<br />

(2):154–160<br />

Lund O, Nielsen M, Kesmir C, Christensen JK, Lundegaard C,<br />

Worning P, Brunak S (2002) Web-based tools for vaccine<br />

design. In: Korber BT, Br<strong>and</strong>er C, Haynes BF, Koup R, Kuiken<br />

C, Moore JP, Walker BD, Watk<strong>in</strong>s D (eds) HIV molecular<br />

immunology. Los Alamos, NM, pp 45–51<br />

Lund O, Nielsen M, Lundegaard C, Kesmir C, Brunak S (2005)<br />

Immunological bioinformatics. MIT, Cambridge, Massachusetts<br />

Lupski JR, We<strong>in</strong>stock GM (1992) Short, <strong>in</strong>terspersed repetitive<br />

DNA sequences <strong>in</strong> prokaryotic genomes. J Bacteriol 174<br />

(14):4525–4529<br />

Maas R (2004) Prereplicative pur<strong>in</strong>e methylation <strong>and</strong> postreplicative<br />

demethylation <strong>in</strong> each DNA duplication of the Escherichia coli<br />

replication cycle. J Biol Chem 279(49):51568–51573<br />

Mahillon J, Leonard C, Ch<strong>and</strong>ler M (1999) IS elements as<br />

constituents of bacterial genomes. Res Microbiol 150:675–687<br />

Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben<br />

LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du<br />

L, Fierro JM, Gomes XV, Godw<strong>in</strong> BC, He W, Helgesen S, Ho<br />

CH, Irzyk GP, J<strong>and</strong>o SC, Alenquer ML, Jarvie TP, Jirage KB,<br />

Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei<br />

M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE,<br />

McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R,<br />

Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson<br />

JW, Sr<strong>in</strong>ivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer<br />

GA, Wang SH, Wang Y, We<strong>in</strong>er MP, Yu P, Begley RF,<br />

Rothberg JM (2005) Genome sequenc<strong>in</strong>g <strong>in</strong> microfabricated<br />

high-density picolitre reactors. Nature 437(7057):376–380<br />

McCl<strong>in</strong>tock B (1950) The orig<strong>in</strong> <strong>and</strong> behavior of mutable loci <strong>in</strong><br />

maize. Proc Natl Acad Sci U S A 36(6):344–355<br />

McGillivary G, Tomaras AP, Rhodes ER, Actis LA (2005) Clon<strong>in</strong>g<br />

<strong>and</strong> sequenc<strong>in</strong>g of a genomic isl<strong>and</strong> found <strong>in</strong> the Brazilian<br />

purpuric fever clone of Haemophilus <strong>in</strong>fluenzae biogroup<br />

aegyptius. Infect Immun 73(4):1927–1938



Middendorf B, Hochhut B, Leipold K, Dobr<strong>in</strong>dt U, Blum-Oehler G,<br />

Hacker J (2004) Instability of pathogenicity isl<strong>and</strong>s <strong>in</strong><br />

uropathogenic Escherichia coli 536. J Bacteriol 186<br />

(10):3086–3096<br />

Mongod<strong>in</strong> EF, Emerson JB, Nelson KE (2005) Microbial metagenomics.<br />

Genome Biol 6(10):347<br />

Mullis K, Faloona F, Scharf S, Saiki R, Horn G, Erlich H (1986)<br />

Specific enzymatic amplification of DNA <strong>in</strong> vitro: the<br />

polymerase cha<strong>in</strong> reaction. Cold Spr<strong>in</strong>g Harb Symp Quant<br />

Biol 51(Pt 1):263–273<br />

Nagy Z, Ch<strong>and</strong>ler M (2004) Regulation of transposition <strong>in</strong> bacteria.<br />

Res Microbiol 155:387–398<br />

Nishi T, Ikemura T, Kanaya S (2005) GeneLook: a novel ab <strong>in</strong>itio<br />

gene identification system suitable for automated annotation of<br />

prokaryotic sequences. Gene 346:115–125<br />

Novikova N, De Boever P, Poddubko S, Deshevaya E, Polikarpov<br />

N, Rakova N, Con<strong>in</strong>x I, Mergeay M (2006) Survey of<br />

environmental biocontam<strong>in</strong>ation on board the International<br />

Space Station. Res Microbiol 157(1):5–12<br />

Ochman H, Lawrence JG, Groisman EA (2000) Lateral gene transfer<br />

<strong>and</strong> the nature of bacterial evolution. Nature 405:299–304<br />

Ohnishi M, Kurokawa K, Hayashi T (2001) Diversification of<br />

Escherichia coli genomes: are bacteriophages the major<br />

contributors? Trends Microbiol 9:481–485<br />

Okuda S, Katayama T, Kawashima S, Goto S, Kanehisa M (2006)<br />

ODB: a database of operons accumulating known operons<br />

across multiple genomes. Nucleic Acids Res 34(Database<br />

issue):D358–362<br />

Olsen GJ, Lane DJ, Giovannoni SJ, Pace NR, Stahl DA (1986)<br />

Microbial ecology <strong>and</strong> evolution: a ribosomal RNA approach.<br />

Annu Rev Microbiol 40:337–365<br />

O’Malley MA, Bostanci A, Calvert J (2005) Whole-genome<br />

patent<strong>in</strong>g. Nat Rev Genet 6(6):502–506<br />

Ortutay C, Gaspari Z, Toth G, Jager E, Vida G, Orosz L, Vellai T<br />

(2003) Speciation <strong>in</strong> Chlamydia: genome-wide phylogenetic<br />

analyses identified a reliable set of acquired genes. J Mol Evol<br />

57:672–680<br />

Ou HY, Chen LL, Lonnen J, Chaudhuri RR, Thani AB, Smith R,<br />

Garton NJ, H<strong>in</strong>ton J, Pallen M, Barer MR, Rajakumar K (2006)<br />

A novel strategy for the identification of genomic isl<strong>and</strong>s by<br />

comparative analysis of the contents <strong>and</strong> contexts of tRNA sites<br />

<strong>in</strong> closely related bacteria. Nucleic Acids Res 34(1):e3<br />

Pal C, Hurst LD (2004) Evidence aga<strong>in</strong>st the selfish operon theory.<br />

Trends Genet 20(6):232–234<br />

Parkhill J, Sebaihia M, Preston A, Murphy LD, Thomson N, Harris<br />

DE, Holden MT, Churcher CM, Bentley SD, Mungall KL,<br />

Cerdeno-Tarraga AM, Temple L, James K, Harris B, Quail MA,<br />

Achtman M, Atk<strong>in</strong> R, Baker S, Basham D, Bason N,<br />

Cherevach I, Chill<strong>in</strong>gworth T, Coll<strong>in</strong>s M, Cron<strong>in</strong> A, Davis P,<br />

Doggett J, Feltwell T, Goble A, Haml<strong>in</strong> N, Hauser H, Holroyd<br />

S, Jagels K, Leather S, Moule S, Norberczak H, O’Neil S,<br />

Ormond D, Price C, Rabb<strong>in</strong>owitsch E, Rutter S, S<strong>and</strong>ers M,<br />

Saunders D, Seeger K, Sharp S, Simmonds M, Skelton J,<br />

Squares R, Squares S, Stevens K, Unw<strong>in</strong> L, Whitehead S,<br />

Barrell BG, Maskell DJ (2003) <strong>Comparative</strong> analysis of the<br />

genome sequences of Bordetella pertussis, Bordetella parapertussis<br />

<strong>and</strong> Bordetella bronchiseptica. Nat Genet 35(1):32–40<br />

Paulsen IT, Banerjei L, Myers GSA, Nelson KE, Seshadri R, Read TD,<br />

Fouts DE, Eisen JA, Gill SR, Heidelberg JF, Tettelin H, Dodson<br />

RJ, Umayam L, Br<strong>in</strong>kac L, Beanan M, Daugherty S, DeBoy RT,<br />

Durk<strong>in</strong> S, Kolonay J, Madupu R, Nelson W, Vamathevan J, Tran<br />

B, Upton J, Hansen T, Shetty J, Khouri H, Utterback T, Radune D,<br />

Ketchum KA, Dougherty BA, Fraser CM (2003) Role of mobile<br />

DNA <strong>in</strong> the evolution of vancomyc<strong>in</strong>-resistant Enterococcus<br />

faecalis. Science 299(5615):2071–2074<br />

Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW<br />

(2000) A DNA structural atlas for Escherichia coli. J Mol Biol<br />

299(4):907–930<br />

Pennisi E (2005) Biochemistry. Cut-rate genomes on the horizon?<br />

Science 309(5736):862<br />

Penyalver R, Lopez MM (1999) Cocolonization of the rhizosphere<br />

by pathogenic Agrobacterium strains and nonpathogenic strains<br />

K84 <strong>and</strong> K1026, used for crown gall biocontrol. Appl Environ<br />

Microbiol 65(5):1936–1940<br />

Peters EDJ, Leverste<strong>in</strong>-Van Hall MA, Box ATA, Verhoef J, Fluit AC<br />

(2001) Novel gene cassettes <strong>and</strong> <strong>in</strong>tegrons. Antimicrob Agents<br />

Chemother 45(10):2961–2964<br />

Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B,<br />

Com<strong>and</strong>ucci M, Jenn<strong>in</strong>gs GT, Baldi L, Bartol<strong>in</strong>i E, Capecchi<br />

B, Galeotti CL, Luzzi E, Manetti R, Marchetti E, Mora M, Nuti<br />

S, Ratti G, Sant<strong>in</strong>i L, Sav<strong>in</strong>o S, Scarselli M, Storni E, Zuo P,<br />

Broeker M, Hundt E, Knapp B, Blair E, Mason T, Tettel<strong>in</strong> H,<br />

Hood DW, Jeffries AC, Saunders NJ, Granoff DM, Venter JC,<br />

Moxon ER, Gr<strong>and</strong>i G, Rappuoli R (2000) Identification of<br />

vacc<strong>in</strong>e c<strong>and</strong>idates aga<strong>in</strong>st serogroup B men<strong>in</strong>gococcus by<br />

whole-genome sequenc<strong>in</strong>g. Science 287:1816–1820<br />

Prescott L, Harvey JP, Kle<strong>in</strong> DA (1999) Microbiology, 4th edn.<br />

McGraw-Hill, New York, USA<br />

Price MN, Huang KH, Alm EJ, Ark<strong>in</strong> AP (2005) A novel method<br />

for accurate operon predictions <strong>in</strong> all sequenced prokaryotes.<br />

Nucleic Acids Res 33(3):880–892<br />

Rappuoli R (2001) Reverse vacc<strong>in</strong>ology, a genome-based approach<br />

to vacc<strong>in</strong>e development. Vacc<strong>in</strong>e 19:2688–2691<br />

Rendulic S, Jagtap P, Ros<strong>in</strong>us A, Epp<strong>in</strong>ger M, Baar C, Lanz C,<br />

Keller H, Lambert C, Evans KJ, Goesmann A, Meyer F,<br />

Sockett RE, Schuster SC (2004) A predator unmasked: life<br />

cycle of Bdellovibrio bacteriovorus from a genomic perspective.<br />

Science 303(5658):689–692<br />

Reznikoff WS (1992) The lactose operon-controll<strong>in</strong>g elements:<br />

a complex paradigm. Mol Microbiol 6(17):2419–2422<br />

Robbins-Manke JL, Zdraveski ZZ, Marinus M, Essigmann JM<br />

(2005) Analysis of global gene expression and double-strand-break<br />

formation in DNA adenine methyltransferase- and<br />

mismatch repair-deficient Escherichia coli. J Bacteriol 187<br />

(20):7027–7037<br />

Roberts RJ, Vincze T, Posfai J, Macelis D (2005) REBASE—<br />

restriction enzymes <strong>and</strong> DNA methyl transferases. Nucleic<br />

Acids Res 33:D230–D232<br />

Rocha EPC, Danch<strong>in</strong> A, Viari A (1999) Functional <strong>and</strong> evolutionary<br />

role of long repeats <strong>in</strong> prokaryotes. Res Microbiol 150:725–733<br />

Rogoz<strong>in</strong> IB, Makarova KS, Wolf YI, Koon<strong>in</strong> EV (2004) <strong>Computational</strong><br />

approaches for the analysis of gene neighbourhoods <strong>in</strong><br />

prokaryotic genomes. Brief Bio<strong>in</strong>form 5(2):131–149<br />

Rosenfeld JA, Sarkar IN, Planet PJ, Figurski DH, DeSalle R (2004)<br />

ORFcurator: molecular curation of genes <strong>and</strong> gene clusters <strong>in</strong><br />

prokaryotic organisms. Bio<strong>in</strong>formatics 20(18):3462–3465<br />

Salgado H, Gama-Castro S, Peralta-Gil M, Diaz-Peredo E, Sanchez-<br />

Solano F, Santos-Zavaleta A, Mart<strong>in</strong>ez-Flores I, Jimenez-Jac<strong>in</strong>to<br />

V, Bonavides-Mart<strong>in</strong>ez C, Segura-Salazar J, Mart<strong>in</strong>ez-Antonio<br />

A, Collado-Vides J (2006a) RegulonDB (version 5.0): Escherichia<br />

coli K-12 transcriptional regulatory network, operon<br />

organization, <strong>and</strong> growth conditions. Nucleic Acids Res 34<br />

(Database issue):D394–D397<br />

Salgado H, Santos-Zavaleta A, Gama-Castro S, Peralta-Gil M,<br />

Penaloza-Sp<strong>in</strong>ola MI, Mart<strong>in</strong>ez-Antonio A, Karp PD, Collado-<br />

Vides J (2006b) The comprehensive updated regulatory<br />

network of Escherichia coli K-12. BMC Bio<strong>in</strong>formatics 7(1):5<br />

Sanger F, Donelson JE, Coulson AR, Kossel H, Fischer D (1973)<br />

Use of DNA polymerase I primed by a synthetic oligonucleotide<br />

to determine a nucleotide sequence in phage f1 DNA. Proc<br />

Natl Acad Sci U S A 70(4):1209–1213<br />

Sanger F, Air GM, Barrell BG, Brown NL, Coulson AR, Fiddes CA,<br />

Hutchison CA, Slocombe PM, Smith M (1977) Nucleotide<br />

sequence of bacteriophage phi X174 DNA. Nature 265<br />

(5596):687–695


Schmidt H, Hensel M (2004) Pathogenicity isl<strong>and</strong>s <strong>in</strong> bacterial<br />

pathogenesis. Cl<strong>in</strong> Microbiol Rev 17(1):14–56<br />

Schneider G, Dobr<strong>in</strong>dt U, Bruggemann H, Nagy G, Janke B, Blum-<br />

Oehler G, Buchrieser C, Gottschalk G, Emody L, Hacker J<br />

(2004) The pathogenicity isl<strong>and</strong>-associated K15 capsule determ<strong>in</strong>ant<br />

exhibits a novel genetic structure <strong>and</strong> correlates with<br />

virulence <strong>in</strong> uropathogenic Escherichia coli stra<strong>in</strong> 536. Infect<br />

Immun 72(10):5993–6001<br />

Serruto D, Adu-Bobie J, Capecchi B, Rappuoli R, Pizza M,<br />

Masignani V (2004) Biotechnology <strong>and</strong> vacc<strong>in</strong>es: application<br />

of functional genomics to Neisseria men<strong>in</strong>gitidis <strong>and</strong> other<br />

bacterial pathogens. J Biotechnol 113:15–32<br />

Sharp PM, Li WH (1987) The codon adaptation <strong>in</strong>dex—a measure<br />

of directional synonymous codon usage bias, <strong>and</strong> its potential<br />

applications. Nucleic Acids Res 15(3):1281–1295<br />

Shendure J, Porreca GJ, Reppas NB, L<strong>in</strong> X, McCutcheon JP,<br />

Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM<br />

(2005) Accurate multiplex polony sequenc<strong>in</strong>g of an evolved<br />

bacterial genome. Science 309(5741):1728–1732<br />

Shimizu T, Ohtani K, Hirakawa H, Ohshima K, Yamashita A, Shiba<br />

T, Ogasawara N, Hattori M, Kuhara S, Hayashi H (2002)<br />

Complete genome sequence of Clostridium perfr<strong>in</strong>gens, an<br />

anaerobic flesh-eater. Proc Natl Acad Sci U S A 99(2):996–1001<br />

Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On<br />

the total number of genes <strong>and</strong> their length distribution <strong>in</strong><br />

complete microbial genomes. Trends Genet 17(8):425–428<br />

Stahl FW, Murray NE (1966) The evolution of gene clusters <strong>and</strong><br />

genetic circularity <strong>in</strong> microorganisms. Genetics 53(3):569–576<br />

Starl<strong>in</strong>ger P, Saedler H (1976) IS-elements <strong>in</strong> microorganisms. Curr<br />

Top Microbiol Immunol 75:111–152<br />

Talarico S, Cave MD, Marrs CF, Foxman B, Zhang L, Yang Z (2005)<br />

Variation of the Mycobacterium tuberculosis PE_PGRS 33 gene<br />

among cl<strong>in</strong>ical isolates. J Cl<strong>in</strong> Microbiol 43(10):4954–4960<br />

Taoka M, Yamauchi Y, Sh<strong>in</strong>kawa T, Kaji H, Motohashi W,<br />

Nakayama H, Takahashi N, Isobe T (2004) Only a small<br />

subset of the horizontally transferred chromosomal genes <strong>in</strong><br />

Escherichia coli are translated <strong>in</strong>to prote<strong>in</strong>s. Mol Cell<br />

Proteomics 3(8):780–787<br />

Tobes R, Ramos JL (2005) REP code: def<strong>in</strong><strong>in</strong>g bacterial identity <strong>in</strong><br />

extragenic space. Environ Microbiol 7(2):225–228<br />

Toh H, Weiss BL, Perk<strong>in</strong> SA, Yamashita A, Oshima K, Hattori M,<br />

Aksoy S (2006) Massive genome erosion <strong>and</strong> functional<br />

adaptations provide <strong>in</strong>sights <strong>in</strong>to the symbiotic lifestyle of<br />

Sodalis glossinidius in the tsetse host. Genome Res 16:149–156<br />

Tomb JF, White O, Kerlavage AR, Clayton RA, Sutton GG,<br />

Fleischmann RD, Ketchum KA, Klenk HP, Gill S, Dougherty<br />

BA, Nelson K, Quackenbush J, Zhou L, Kirkness EF, Peterson<br />

S, Loftus B, Richardson D, Dodson R, Khalak HG, Glodek A,<br />

McKenney K, Fitzegerald LM, Lee N, Adams MD, Hickey EK,<br />

Berg DE, Gocayne JD, Utterback TR, Peterson JD, Kelley JM,<br />

Cotton MD, Weidman JM, Fujii C, Bowman C, Watthey L,<br />

Wall<strong>in</strong> E, Hayes WS, Borodovsky M, Karp PD, Smith HO,<br />

Fraser CM, Venter JC (1997) The complete genome sequence<br />

of the gastric pathogen Helicobacter pylori. Nature 388<br />

(6642):539–547<br />

Torsvik V, Salte K, Sorheim R, Goksoyr J (1990) Comparison of<br />

phenotypic diversity <strong>and</strong> DNA heterogeneity <strong>in</strong> a population of<br />

soil bacteria. Appl Environ Microbiol 56:776–781<br />

Tr<strong>in</strong>ge SG, Rub<strong>in</strong> EM (2005) Metagenomics: DNA sequenc<strong>in</strong>g of<br />

environmental samples. Nat Rev Genet 6(11):805–814<br />

Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ,<br />

Richardson PM, Solovyev VV, Rub<strong>in</strong> EM, Rokhsar DS,<br />

Banfield JF (2004) Community structure <strong>and</strong> metabolism<br />

through reconstruction of microbial genomes from the environment.<br />

Nature 428(6978):37–43<br />

185<br />

Ussery DW, Hall<strong>in</strong> PF (2004a) Genome update: AT content <strong>in</strong><br />

sequenced prokaryotic genomes. Microbiology 150(Pt 4):749–752<br />

Ussery DW, Hall<strong>in</strong> PF (2004b) Genome update: length distributions of<br />

sequenced prokaryotic genomes. Microbiology 150(Pt 3):513–516<br />

Ussery DW, B<strong>in</strong>newies TT, Gouveia-Oliveira R, Jarmer H, Hall<strong>in</strong><br />

PF (2004a) Genome update: DNA repeats <strong>in</strong> bacterial genomes.<br />

Microbiology 150(Pt 11):3519–3521<br />

Ussery DW, Hall<strong>in</strong> PF, Lagesen K, Coenye T (2004b) Genome<br />

update: rRNAs <strong>in</strong> sequenced microbial genomes. Microbiology<br />

150(Pt 5):1113–1115<br />

Ussery DW, Hall<strong>in</strong> PF, Lagesen K, Wassenaar TM (2004c) Genome<br />

update: tRNAs <strong>in</strong> sequenced microbial genomes. Microbiology<br />

150(Pt 6):1603–1606<br />

Ussery DW, T<strong>in</strong>dbaek N, Hall<strong>in</strong> PF (2004d) Genome update:<br />

promoter profiles. Microbiology 150(Pt 9):2791–2793<br />

Vallenet D, Labarre L, Rouy Z, Barbe V, Bocs S, Cruveiller S, Lajus<br />

A, Pascal G, Scarpelli C, Medigue C (2006) MaGe: a microbial<br />

genome annotation system supported by synteny results.<br />

Nucleic Acids Res 34(1):53–65<br />

van Belkum A, Scherer S, van Alphen L, Verbrugh H (1998) Short<br />

sequence DNA repeats <strong>in</strong> prokaryotic genomes. Microbiol Mol<br />

Biol Rev 62(2):275–293<br />

van der Meer JR, Sentchilo V (2003) Genomic isl<strong>and</strong>s <strong>and</strong> the<br />

evolution of catabolic pathways <strong>in</strong> bacteria. Curr Op<strong>in</strong><br />

Biotechnol 14:248–254<br />

Van Domselaar GH, Stothard P, Shrivastava S, Cruz JA, Guo A, Dong<br />

X, Lu P, Szafron D, Gre<strong>in</strong>er R, Wishart DS (2005) BASys: a web<br />

server for automated bacterial genome annotation. Nucleic<br />

Acids Res 33(Web Server issue):W455–W459<br />

Venter JC, Rem<strong>in</strong>gton K, Heidelberg JF, Halpern AL, Rusch D,<br />

Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE,<br />

Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson<br />

J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C,<br />

Rogers YH, Smith HO (2004) Environmental genome shotgun<br />

sequenc<strong>in</strong>g of the Sargasso Sea. Science 304(5667):66–74<br />

Vezzi A, Campanaro S, D’Angelo M, Simonato F, Vitulo N, Lauro<br />

FM, Cestaro A, Malacrida G, Simionati B, Cannata N,<br />

Romualdi C, Bartlett DH, Valle G (2005) Life at depth:<br />

Photobacterium profundum genome sequence <strong>and</strong> expression<br />

analysis. Science 307(5714):1459–1461<br />

Willenbrock H, B<strong>in</strong>newies TT, Hall<strong>in</strong> PF, Ussery DW (2005) Genome<br />

update: 2D cluster<strong>in</strong>g of bacterial genomes. Microbiology 151<br />

(Pt 2):333–336<br />

Worn<strong>in</strong>g P, Jensen LJ, Nelson KE, Brunak S, Ussery DW (2000)<br />

Structural analysis of DNA sequence: evidence for lateral gene<br />

transfer <strong>in</strong> Thermotoga maritima. Nucleic Acids Res 28<br />

(3):706–709<br />

Worn<strong>in</strong>g P, Jensen LJ, Hall<strong>in</strong> PF, Stærfeldt H-H, Ussery DW (2006)<br />

Orig<strong>in</strong> of replication <strong>in</strong> circular prokaryotic chromosomes.<br />

Environ Microbiol (In press)<br />

Yan F, Polk DB (2004) Commensal bacteria <strong>in</strong> the gut: learn<strong>in</strong>g who<br />

our friends are. Curr Op<strong>in</strong> Gastroenterol 20(6):565–571<br />

Zagursky RJ, Russell D (2001) Bio<strong>in</strong>formatics: use <strong>in</strong> bacterial<br />

vacc<strong>in</strong>e discovery. Biotechniques 31:636–659<br />

Zhang R, Zhang CT (2004) A systematic method to identify<br />

genomic isl<strong>and</strong>s <strong>and</strong> its applications <strong>in</strong> analyz<strong>in</strong>g the genomes<br />

of Corynebacterium glutamicum <strong>and</strong> Vibrio vulnificus CMCP6<br />

chromosome I. Bio<strong>in</strong>formatics 20(5):612–622<br />

Zheng Y, Anton BP, Roberts RJ, Kasif S (2005) Phylogenetic<br />

detection of conserved gene clusters <strong>in</strong> microbial genomes.<br />

BMC Bio<strong>in</strong>formatics 6:243<br />

Zubrzycki IZ (2004) Analysis of the products of genes encompassed<br />

by the theoretically predicted pathogenicity isl<strong>and</strong>s of Mycobacterium<br />

tuberculosis <strong>and</strong> Mycobacterium bovis. Prote<strong>in</strong>s:<br />

Struct, Funct, Bio<strong>in</strong>f 54:563–568


2.8 Paper III: Global features of the Alcanivorax borkumensis SK2 genome


Environmental Microbiology (2007) doi:10.1111/j.1462-2920.2007.01483.x

Global features of the Alcanivorax borkumensis SK2 genome

Oleg N. Reva,1,3 Peter F. Hallin,2 Hanni Willenbrock,2 Thomas Sicheritz-Ponten,2 Burkhard Tümmler1 and David W. Ussery2

1 Klinische Forschergruppe, OE6711, Medizinische Hochschule Hannover, Carl-Neuberg-Strasse 1, D-30625 Hannover, Germany.

2 Center for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark.

3 Biochemistry Department, University of Pretoria, Lynnwood Road, Hillcrest, 0002 Pretoria, South Africa.

Summary

The global feature of the completely sequenced Alcanivorax borkumensis SK2 type strain chromosome is its symmetry and homogeneity. The origin and terminus of replication are located opposite to each other in the chromosome and are discerned with high signal-to-noise ratios by maximal oligonucleotide usage biases on the leading and lagging strand. Genomic DNA structure is rather uniform throughout the chromosome with respect to intrinsic curvature, position preference or base stacking energy. The orthologs and paralogs of A. borkumensis genes with the highest sequence homology were found in most cases among γ-Proteobacteria, with Acinetobacter and P. aeruginosa as closest relatives. A. borkumensis shares a similar oligonucleotide usage and promoter structure with the Pseudomonadales. A comparatively low number of only 18 genome islands with atypical oligonucleotide usage was detected in the A. borkumensis chromosome. The gene clusters that confer the assimilation of aliphatic hydrocarbons are localized in two genome islands which were probably acquired from an ancestor of the Yersinia lineage, whereas the alk genes of Pseudomonas putida still exhibit the typical Alcanivorax oligonucleotide signature, indicating a complex evolution of this major hydrocarbonoclastic trait.

Received 8 August, 2007; accepted 26 September, 2007.

*For correspondence. E-mail tuemmler.burkhard@mh-hannover.de; Tel. (+49) 511 5322920; Fax (+49) 511 5326723.

Introduction

Alcanivorax borkumensis strain SK2 is a cosmopolitan oil-degrading oligotrophic marine γ-proteobacterium (Yakimov et al., 1998). The SK2 strain is the paradigm for hydrocarbonoclastic bacteria that are specialized for hydrocarbon degradation but have an otherwise highly restricted substrate spectrum, being capable of utilizing only a few organic acids such as pyruvate, but not simple sugars, for growth (Yakimov et al., 1998; Sabirova et al., 2006). A. borkumensis is present in low abundance in unpolluted environments, but it rapidly becomes the dominant bacterium in oil-polluted open ocean and coastal waters, where it can constitute 80–90% of the oil-degrading microbial community (Harayama et al., 1999; Kasai et al., 2001; 2002; Syutsubo et al., 2001; Röling et al., 2002; Hara et al., 2003; McKew et al., 2007a,b).

The genome of A. borkumensis was recently sequenced and annotated (Schneiker et al., 2006). In this paper, we perform a genome-wide comparative genomics analysis and a detailed characterization of the global features of the A. borkumensis strain SK2 genome. This work on A. borkumensis strain SK2 aimed to visualize the prospective potential of genome linguistic approaches for functional and comparative analysis of bacterial genomes.

Results and discussion

© 2007 The Authors
Journal compilation © 2007 Society for Applied Microbiology and Blackwell Publishing Ltd

DNA structure and highly expressed genes

The genome atlas (Fig. 1) shows a combination of some general informative properties of the chromosome. These are structural features (intrinsic curvature, stacking energy and position preference), repeat properties (global direct and inverted repeats) and the main base composition features (GC skew and percent AT). Stacking energy measures helix rigidity and position preference is a flexibility measure (Jensen et al., 1999; Pedersen et al., 2000). Regions that exhibit low position preference correlate with an enrichment of highly expressed genes (Dlakic et al., 2004; Willenbrock and Ussery, 2007). Examples in A. borkumensis are the rrn operons, the genes encoding ribosomal proteins and the gene cluster labelled rpoC on the atlas, which among others encodes RNA polymerase subunits. Low position preference was found to correlate with high codon adaptation indices as the common measure for highly expressed genes (Willenbrock et al., 2006), indicating that the local DNA structure is an important determinant of codon usage and gene expression. Moreover, intrinsic curvature is often encountered upstream of highly expressed genes (Skovgaard et al., 2002), which correlates well with the fact that promoter DNA tends to be more curved than DNA in coding regions (Pedersen et al., 2000).

Fig. 1. Genome Atlas of A. borkumensis SK2 showing different structural parameters and the distribution of global repeats, GC skew and A + T contents. Colour intensity increases with the deviation from the average. Values close to the average are shaded very light grey; values with more than 3 standard deviations from the average are most strongly coloured.

The chromosome is rather homogeneous in all analysed structural features. The number of repeats is low, and the terminus of replication is opposite to the origin of replication as indicated by GC skew (Ussery et al., 2002). The three rRNA operons organized in the order 16S-23S-5S are located in three areas with low position preference (green marks in the 3rd circle) and possible upstream regions with high intrinsic curvature (blue in the 1st circle) near 0.4–0.5 Mb (two regions) and 2.25 Mb (one region).
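The use of GC skew to place the origin and terminus of replication can be illustrated with a short sketch. It assumes the conventional per-window definition (G − C)/(G + C) and the common heuristic that the cumulative skew is lowest near the origin and highest near the terminus. The function names, the window size and the toy sequence are illustrative assumptions and are not taken from the paper.

```python
def gc_skew_windows(seq, window=10000, step=10000):
    """Return (start, skew) pairs with skew = (G - C) / (G + C) per window."""
    seq = seq.upper()
    skews = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C get a neutral skew of 0.
        skews.append((start, (g - c) / (g + c) if (g + c) else 0.0))
    return skews


def cumulative_skew_extremes(skews):
    """Return the window starts where the cumulative skew is minimal and maximal."""
    total = 0.0
    cumulative = []
    for start, skew in skews:
        total += skew
        cumulative.append((total, start))
    return min(cumulative)[1], max(cumulative)[1]


# Toy sequence (hypothetical): a C-rich half followed by a G-rich half.
toy = "ACCC" * 500 + "AGGG" * 500
low, high = cumulative_skew_extremes(gc_skew_windows(toy, window=400, step=400))
print("cumulative skew minimum at window start", low, "and maximum at", high)
```

On a real chromosome the two extremes would be expected roughly half a genome apart when origin and terminus lie opposite each other, which is the symmetry described above.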

Phylogenomics by sequence homology

The genome of A. borkumensis was compared with existing sequence information in other Proteobacteria by constructing phylogenetic trees for each amino acid sequence and the organisms in which a similar gene existed. By extracting the phylogenomic information of the resulting 1919 phylogenetic trees, a phylome atlas could be constructed (Fig. 2). In most cases the orthologs and paralogs with the highest sequence homology were found among γ-Proteobacteria. A substantial proportion of A. borkumensis genes had their closest homologues in α- and β-Proteobacteria, but no closest homologue was detected in δ- and ε-Proteobacteria. Inspection of the collected phylogenetic connections revealed that the most closely related organisms are Acinetobacter sp. and Pseudomonas aeruginosa, although in trees where both Pseudomonas and Acinetobacter are present, A. borkumensis tends to cluster more often with the latter. No obvious horizontal gene transfers seem to have taken place. Regions around 350 000 and 450 000 bp are very ‘pure’ γ-proteobacteria regions.

Fig. 2. Phylome Atlas of A. borkumensis SK2 genes indicating their closest bacterial homologues. Each of the concentric circles represents a taxonomic group as described in the figure legend on the right, with the outermost circle corresponding to the top-most feature, and the innermost circle corresponding to the bottom-most feature. Light bands indicate A. borkumensis SK2 genes with no homologue in the respective taxonomic group.

Genome analysis of oligonucleotide usage

Oligonucleotide usage (OU) has been shown to be a genome-specific signature (Pride et al., 2003; Reva and Tümmler, 2004). Genomic regions termed the ‘core sequences’ are characterized by OU patterns being similar to the global pattern of the chromosome. However, many loci with alternative OU patterns typically contribute in total more than 10% of a bacterial genome. These loci with atypical OU patterns comprise heterogeneous subsets of parasitic and recent foreign DNA, ancient genes for ribosomal constituents (RNAs and proteins), multidomain genes and non-coding sequences with multiple tandem repeats (Reva and Tümmler, 2005). Hence laterally transferred gene islands can be reliably identified in complete genomes by their atypical oligonucleotide usage (Reva and Tümmler, 2005; Chen et al., 2007; Klockgether et al., 2007). Here, we focused on tetranucleotide usage (TU) parameters because the 256 different tetranucleotide words are optimal to differentiate bacterial genome sequences by the frequency and informativeness of the individual element. TU patterns represent the deviations of tetranucleotide word counts in a given sequence from an equiprobable distribution. Selection and counterselection of the oligonucleotide words are driven by their stereochemical properties such as base stacking energy, propeller twist angle, protein deformability, bendability and position preference (Reva and Tümmler, 2004). By permutation analysis, the 256 tetranucleotides were assigned to 39 equivalence classes, each of which is characterized by the same values for the five properties mentioned above (Baldi and Baisnee, 2000). Words of the same equivalence class tend to occur at similar frequencies in a nucleotide sequence (Reva and Tümmler, 2004). Oligonucleotide usage conservation reflects to some extent the phylogeny of microorganisms (Pride et al., 2003; Teeling et al., 2004).
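The idea of a TU pattern as the deviation of word counts from an equiprobable distribution can be sketched as follows. This is a minimal illustration under stated assumptions: the expected count of every word is taken as the total number of counted words divided by 256, and the deviation is expressed as (observed − expected)/expected. The exact normalization used by the authors is defined in their Experimental procedures, which are not reproduced here, and the function name is hypothetical.

```python
from itertools import product

# All 4**4 = 256 tetranucleotide words over the DNA alphabet.
WORDS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_deviation(seq):
    """Return {word: (observed - expected) / expected} for all 256 words.

    The expected count assumes an equiprobable distribution of words
    (an assumption of this sketch, not necessarily the authors' formula).
    """
    seq = seq.upper()
    counts = dict.fromkeys(WORDS, 0)
    total = 0
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:  # skips words containing N or other symbols
            counts[word] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid tetranucleotides")
    expected = total / 256
    return {w: (counts[w] - expected) / expected for w in WORDS}


pattern = tetranucleotide_deviation("ACGT" * 65)
over = sorted(w for w, d in pattern.items() if d > 0)
print("over-represented words:", over)
```

Words that never occur receive a deviation of −1, and the deviations over all 256 words sum to zero, so the pattern describes relative over- and under-representation rather than absolute counts.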

Phylogenomics by tetranucleotide usage analysis

TU patterns were calculated for all sequenced genomes of γ-Proteobacteria. Four examples of TU patterns determined for A. borkumensis SK2, Pseudomonas putida KT2440, Escherichia coli K-12 and Shewanella oneidensis MR-1 are shown in Fig. 3. Tetranucleotide words were grouped by the equivalence classes and sorted in order of decreasing base stacking energy. Figure 4 visualizes the phylogenetic relationships differentiated by TU patterns of 29 γ-Proteobacterial taxa, each of which is represented by not more than a single sequenced strain. A. borkumensis forms a cluster with Pseudomonas, Methylococcus, Xanthomonas and Xylella (Fig. 4). Despite the variation in GC-content, from 52 to 54% in Xylella and Alcanivorax to more than 65% in Xanthomonas and Pseudomonas, the TU patterns of these microorganisms are similar and separated from other γ-Proteobacteria. There is an abundance of GC-rich tetranucleotides with high base stacking energy in the sequence of A. borkumensis SK2 (words belonging to equivalence classes 37–39, 30 and 27) that is similar to the TU pattern of P. putida KT2440 (Fig. 3). Words of the AT-rich classes 7, 10, 13 and 32 are significantly underrepresented in both species. The major difference between TU patterns is the abundance of poly A and poly T stretches (words of class 1) in A. borkumensis in correspondence with its lower GC-content of 54.7%. Although E. coli and S. oneidensis share a similar GC content with A. borkumensis, their tetranucleotide usage is different from Alcanivorax. The parity of GC with AT in the genome correlates with a balanced use of GC-rich and AT-rich words with high and low base stacking energy. In contrast, words with intermediate values of the base stacking energy (classes 25, 31, 36 and 29) are mostly underrepresented (Fig. 3). The data suggest that oligonucleotide usage drives GC-content and not vice versa. To give another example: the GC-rich words of class 21 are rare in all γ-Proteobacteria irrespective of their GC-content (Fig. 3), but these words are overrepresented in α-Proteobacteria (Agrobacterium, Bordetella, Caulobacter, Rhizobium).

Fig. 3. Tetranucleotide usage patterns of A. borkumensis SK2, P. putida KT2440, E. coli K12 MG1655 and S. oneidensis MR-1. The deviation Δw of observed from expected counts is shown for all 256 tetranucleotide words (16 × 16 cells) by colour code (right bar). Tetranucleotides are grouped into 39 classes of equivalent structural features (Baldi and Baisnee, 2000) and sorted by decreasing base stacking energy row-by-row starting at the upper left corner (class 39). The words corresponding to the cells in colour plots are shown in the table in the lower part of the figure.

Fig. 4. Tree of the similarity of TU patterns of completely sequenced γ-Proteobacteria strains. Distance D-values (see Experimental procedures) between two TU patterns were calculated, and the tree was constructed from the distance matrix of all D-values by the minimum evolution neighbour-joining method (Saitou and Nei, 1987).

Anomalous local TU patterns in the A. borkumensis genome

A. borkumensis shares a common taxonomic group with Pseudomonas, Methylococcus, Xanthomonas and Xylella. Although the TU patterns are genome-specific signatures, the oligonucleotide usage may vary locally in segments made up by horizontally acquired elements, phylogenetically ancient genes such as rRNAs or genes with peculiar codon usage (Reva and Tümmler, 2004; 2005). In other words, anomalous local TU patterns can be expected for the most recent and the most ancient genes. Local TU patterns were calculated in 8 kbp long overlapping sliding windows in steps of 2 kbp. Distances D between local and global TU patterns are shown in Fig. 5. The 18 regions with D-values above the 95% confidence interval are listed in Table 1.
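The sliding-window scan described above can be sketched in a few lines. The window length (8 kbp) and step (2 kbp) follow the text; everything else is an assumption made for illustration. In particular, the distance below is a simple stand-in (half the summed absolute difference of word frequencies, in percent) and not the D-value defined in the authors' Experimental procedures, and the cutoff is approximated as mean + 1.96 standard deviations of all window distances. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def tu_frequencies(seq):
    """Relative frequencies of tetranucleotide words made of A, C, G, T only."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    counts = {w: n for w, n in counts.items() if set(w) <= set("ACGT")}
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def pattern_distance(p, q):
    """Stand-in distance in percent (0 = identical, 100 = no shared words)."""
    words = set(p) | set(q)
    return 50.0 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)


def atypical_windows(seq, window=8000, step=2000, z=1.96):
    """Return ([(start, distance), ...] above the cutoff, cutoff)."""
    seq = seq.upper()
    global_pattern = tu_frequencies(seq)
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        local = tu_frequencies(seq[start:start + window])
        scores.append((start, pattern_distance(local, global_pattern)))
    if not scores:  # sequence shorter than one window
        return [], 0.0
    values = [d for _, d in scores]
    # Assumed approximation of the upper border of a 95% interval.
    cutoff = mean(values) + z * pstdev(values)
    return [(s, d) for s, d in scores if d > cutoff], cutoff


# Toy example (hypothetical): an AT-rich insert inside a uniform background.
background = "ACGTGCTA" * 250
island = "AATTAAAT" * 50
flagged, cutoff = atypical_windows(background + island + background,
                                   window=400, step=100)
print([start for start, _ in flagged])
```

Overlapping windows that exceed the cutoff would normally be merged into a single region before annotation, which is how a scan of this kind leads to a short list of islands such as the 18 regions in Table 1.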

Three clusters with anomalous D-values encode ribosomal RNAs that belong to the most ancient and conserved elements of all bacterial genomes. All the other 15 regions with atypical TU most likely were recently acquired, three of which contain transposase genes. In total 11 transposases were annotated in the A. borkumensis SK2 genome but for five of them no significant deviations of the local TU patterns were detected in adjacent regions. If inserted mobile elements have lost their mobility due to disruptive mutations, they undergo an amelioration process that smooths the differences in oligonucleotide usage between inserts and the host genome, and thus they can no longer be detected by anomalous TU patterns (Pride et al., 2003).

Five regions with high D-values (Fig. 5) only encode<br />

hypothetical prote<strong>in</strong>s (Table 1). One further region conta<strong>in</strong>s<br />

genes of the type II secretion system <strong>and</strong> two<br />

regions encode type IV pili biogenesis prote<strong>in</strong>s the latter<br />

©2007TheAuthors<br />

Journal compilation © 2007 Society for Applied Microbiology <strong>and</strong> Blackwell Publish<strong>in</strong>g Ltd, Environmental Microbiology


6 O. N. Reva et al.<br />

of which are known to have spread among proteobacteria<br />

by horizontal transfer with the orig<strong>in</strong>al codon usage <strong>and</strong><br />

GC content be<strong>in</strong>g reta<strong>in</strong>ed (Spangenberg et al., 1997).<br />

The most extended region with high D-values encodes<br />

a cluster of genes for glycosyltransferases <strong>and</strong> polysaccharide<br />

biosynthesis prote<strong>in</strong>s (Abo_858-Abo_880:<br />

1 018 000–1 060 000 bp) characterized by the second<br />

largest D-value <strong>and</strong> low GC-content (m<strong>in</strong>imum 45% GC).<br />

The region term<strong>in</strong>ates abruptly after Abo_880 at an AsntRNA<br />

gene. The TU pattern of the locus was compared<br />

with those of 177 sequenced bacterial chromosomes, 316<br />

plasmids <strong>and</strong> 104 phages (Reva <strong>and</strong> Tümmler, 2004).<br />

The pattern was distant from all analysed sequences. The<br />

best hit of D = 34.9% was observed for the 5833 bp large<br />

bacteriophage Pf3 that <strong>in</strong>fects P. aerug<strong>in</strong>osa harbour<strong>in</strong>g<br />

the RP1 plasmid (Luiten et al., 1985). A stretch of 1550 bp<br />

Table 1. Chromosomal regions of A. borkumensis with atypical TU patterns.<br />

Coord<strong>in</strong>ates<br />

Left Right<br />

D a (%) Annotation<br />

Fig. 5. Deviations of TU patterns <strong>in</strong> local<br />

regions of A. borkumensis SK2 chromosome.<br />

Local TU patterns were determ<strong>in</strong>ed <strong>in</strong> 8 kbp<br />

slid<strong>in</strong>g w<strong>in</strong>dow <strong>in</strong> steps of 2 kbp. D, the<br />

distance betweeen local <strong>and</strong> chromosomal<br />

tetranucleotide patterns as def<strong>in</strong>ed <strong>in</strong><br />

Experimental procedures, is plotted versus<br />

the coord<strong>in</strong>ates of the chromosome start<strong>in</strong>g<br />

from the putative replication orig<strong>in</strong>.The upper<br />

border of the 95% confidence <strong>in</strong>terval of<br />

D-values is shown by the horizontal l<strong>in</strong>e.<br />

upstream of the tRNA gene is 48% identical <strong>in</strong> nucleotide<br />

sequence with the Pf3 sequence (2344-4078 bp).<br />

Accord<strong>in</strong>g to this <strong>in</strong> silico f<strong>in</strong>d<strong>in</strong>g we propose that this<br />

gene isl<strong>and</strong> was captured from a phage that typically<br />

target the 3′-end of a tRNA gene (Dobr<strong>in</strong>dt et al.,<br />

2004).<br />

The alkB genes encod<strong>in</strong>g the degradation of alkanes<br />

which is the prom<strong>in</strong>ent name-giv<strong>in</strong>g feature of the taxon<br />

Alcanivorax, are located <strong>in</strong> two isl<strong>and</strong>s (Schneiker et al.,<br />

2006) with anomalous TU patterns (Table 1). Very close<br />

homologues were identified <strong>in</strong> mar<strong>in</strong>e bacteria <strong>and</strong><br />

Pseudomonas species (Schneiker et al., 2006). The<br />

alkane hydroxylase gene cluster is widely distributed<br />

among hydrocarbon-utiliz<strong>in</strong>g g-Proteobacteria due to its<br />

possible horizontal transfer (van Beilen et al., 2001;<br />

2004). The role of these genes <strong>in</strong> the degradation of<br />

126 000 140 000 42.20 Abo_114–120: lysR transcriptional regulator, haloacid dehalogenase hydrolase, amiC amidase, gntR<br />

transcriptional regulator, alkB2 alkane monooxygenase, type I pili biogenesis prote<strong>in</strong>s<br />

190 000 198 000 40.47 Abo_172–178: ilvD-1 dihydroxy-acid dehydratase, conserved hypothetical prote<strong>in</strong>s,<br />

long-cha<strong>in</strong>-fatty-acid-CoA ligase, acyl-CoA dehydrogenases<br />

234 000 245 000 47.95 Abo_209–214: conserved hypothetical prote<strong>in</strong>s, transposase, type II secretion system prote<strong>in</strong>s<br />

400 000 408 000 49.42 first operon for rRNAs<br />

502 000 510 000 46.26 Abo_439–446: ispA lipoprote<strong>in</strong> signal peptidase, fkpB peptidyl-prolyl cis-trans isomerase, ispH<br />

hydroxymethylbutenyl pyrophosphate reductase, type IV pili biogenesis prote<strong>in</strong>s, conserved<br />

hypothetical prote<strong>in</strong>s<br />

526 000 534 000 43.41 second operon for rRNAs<br />

670 000 678 000 40.29 Abo_581–583: type IV pili biogenesis prote<strong>in</strong>s<br />

792 000 800 000 43.00 Abo_2680–2681: hypothetical prote<strong>in</strong>s<br />

1 020 000 1 056 000 50.43 Abo_859–878: polysaccharide biosynthesis prote<strong>in</strong>s<br />

1 742 000 1 750 000 40.88 Abo_1439: periplasmic b<strong>in</strong>d<strong>in</strong>g doma<strong>in</strong>/transglycosylase SLTdoma<strong>in</strong> fusion<br />

1 892 000 1 900 000 46.32 Abo_2841–2847: hypothetical prote<strong>in</strong>s<br />

2 026 000 2 034 000 41.90 Abo_1668–1671: conserved hypothetical prote<strong>in</strong>s, 3 transposases, siderophore biosynthesis prote<strong>in</strong>,<br />

glycosyl transferase<br />

2 088 000 2 096 000 40.65 Abo_ 1707–1708: conserved hypothetical prote<strong>in</strong>s<br />

2 146 000 2 154 000 47.05 Abo_2897–2905: iscA iron-b<strong>in</strong>d<strong>in</strong>g prote<strong>in</strong> IscA, metal-sulfur cluster biosynthetic enzyme, sufE Fe-S<br />

metabolism associated doma<strong>in</strong> prote<strong>in</strong>, iscS cyste<strong>in</strong>e desulfurase, rrf2 family prote<strong>in</strong>, hypothetical<br />

prote<strong>in</strong>s, SIR2-like transcriptional silencer<br />

2 254 000 2 262 000 49.71 third operon for rRNAs<br />

2 364 000 2 372 000 52.56 Abo_1942: penicill<strong>in</strong>-b<strong>in</strong>d<strong>in</strong>g prote<strong>in</strong>, hypothetical prote<strong>in</strong>s, 2 transposases<br />

2 632 000 2 640 000 40.17 Abo_2979–2984: hypothetical prote<strong>in</strong>s<br />

3 060 000 3 076 000 42.94 Abo_2516–3066: Na+/H+ antiporter, alkS alkB1GHJ regulator, alkB1 alkane monooxygenase,<br />

alkG rubredox<strong>in</strong>, aldH aldehyde dehydrogenase, hypothetical prote<strong>in</strong>s<br />

a. D, distance between local and chromosomal TU patterns as defined in Experimental procedures.<br />

© 2007 The Authors<br />

Journal compilation © 2007 Society for Applied Microbiology <strong>and</strong> Blackwell Publish<strong>in</strong>g Ltd, Environmental Microbiology


short-cha<strong>in</strong> n-alkanes by A. borkumensis SK2 <strong>and</strong> AP1<br />

was experimentally proven (Smits et al., 2002; Hara et al.,<br />

2004; Sabirova et al., 2006). Interest<strong>in</strong>gly, the two regions<br />

comprising the alkS, alkB1, alkG and aldH alkane degradation<br />

genes and the alkB2 gene and transcriptional<br />

regulators, respectively (Table 1), are as similar to each<br />

other <strong>in</strong> their TU patterns (D = 34.3%) as each of them<br />

is to Yers<strong>in</strong>ia pestis (D = 32.2% for alkB1, D = 33.4%<br />

for alkB2), Yers<strong>in</strong>ia enterocolitica (D = 29.5% for alkB1,<br />

D = 34.4% for alkB2) <strong>and</strong> Shewanella oneidensis MR-1<br />

(D = 32.5% for alkB1, D = 42.4% for alkB2). These data<br />

suggest that the alkB1 and alkB2 genes were delivered<br />

to A. borkumensis from an ancestor of the Yers<strong>in</strong>ia<br />

l<strong>in</strong>eage. The AlkB1 am<strong>in</strong>o acid sequences of A. borkumensis<br />

stra<strong>in</strong>s AP1 <strong>and</strong> SK2 are highly homologous to<br />

that of P. putida stra<strong>in</strong>s P1 <strong>and</strong> GPO1 (van Beilen et al.,<br />

2001; 2004; Smits et al., 2002; Hara et al., 2004), but their<br />

TU patterns are not particularly similar (D = 37.1%). Surprisingly,<br />

the TU pattern of the alkB cluster of P. putida<br />

is significantly more similar to the global TU pattern of<br />

the whole A. borkumensis chromosome (16.7%, stra<strong>in</strong><br />

GPO1, 19%, stra<strong>in</strong> P1), but more distant from the<br />

P. putida KT2440 chromosome (30.1% <strong>and</strong> 30.3%).<br />

D-values of 17 or 19% are within the first quartile (0–26%),<br />

far below the median value of 28.4% for local TU patterns<br />

of the A. borkumensis chromosome (Fig. 5), indicating<br />

that the P. putida alkB gene behaves as if it were part of<br />

the Alcanivorax core genome. We note the striking phenomenon<br />

that there was convergent evolution of the<br />

coding sequence of the catabolic alk transposon in<br />

Alcanivorax and Pseudomonas, but that the genes<br />

retained the oligonucleotide signature of their donors,<br />

most likely Alcanivorax for Pseudomonas and Yersinia-like<br />

organisms for Alcanivorax.<br />

<strong>Comparative</strong> genomics of Alcanivorax borkumensis 7<br />

Orig<strong>in</strong> of replication<br />

The GC skew plotted <strong>in</strong> the seventh circle of the genome<br />

atlas (Fig. 1) reflects a general bias of pur<strong>in</strong>es towards the<br />

leading strand of DNA replication; however, it has almost<br />

no correlation to the structural properties of DNA<br />

(Skovgaard et al., 2002). The GC skew is often useful<br />

when locat<strong>in</strong>g the orig<strong>in</strong> <strong>and</strong> term<strong>in</strong>us of replication<br />

(Jensen et al., 1999).<br />
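As an illustration of the quantity discussed above, the following is a minimal sketch of a windowed GC skew calculation, using the common definition (G - C) / (G + C). It is not the atlas software used in the paper; the function names and the window size are assumptions chosen for the example. The cumulative skew is included because its extrema are often used as candidate positions for the origin and the terminus.

```python
def gc_skew(seq: str, window: int = 1000) -> list[float]:
    """GC skew (G - C) / (G + C) in consecutive, non-overlapping windows.

    The window size is an illustrative assumption, not a value from the paper.
    A trailing partial window is ignored.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows without any G or C get a skew of 0 to avoid division by zero.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def cumulative_skew(skews: list[float]) -> list[float]:
    """Running sum of the windowed skew values."""
    total = 0.0
    out = []
    for s in skews:
        total += s
        out.append(total)
    return out
```

On a real chromosome, the windows in which the cumulative curve reaches its minimum and maximum would be inspected as candidate replication landmarks; which extremum corresponds to the origin has to be confirmed with other evidence.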

The circle is blue on the right side <strong>and</strong> purple on the left<br />

side. The two big gaps of colours <strong>in</strong> the top <strong>and</strong> <strong>in</strong> the<br />

bottom of the circle may be the orig<strong>in</strong> <strong>and</strong> the term<strong>in</strong>us of<br />

replication. This may also be visualized more clearly <strong>in</strong> the<br />

orig<strong>in</strong> plot (Fig. 6) (Worn<strong>in</strong>g et al., 2006). Here, the difference<br />

between hypothetical lead<strong>in</strong>g <strong>and</strong> lagg<strong>in</strong>g str<strong>and</strong> is<br />

plotted (red) for various positions on the chromosome.<br />

The peaks indicating maximal oligonucleotide skew correspond<br />

to origin and terminus. The terminus was identified<br />

as the peak showing low G/C weighted strand bias<br />

at position 1 502 000 bp. The origin was identified as the<br />

other peak at position 3 118 000 bp. The signal-to-noise ratio of<br />

14.0 was among the top 10% of sequenced Proteobacteria,<br />

indicating a large difference between leading and<br />

lagging strand, which makes the prediction of the origin very<br />

confident.<br />

Structural analysis of promoter regions<br />

Structural features of the genomic DNA may <strong>in</strong>dicate promoter<br />

regions, as promoters normally have high curvature,<br />

melt easily <strong>and</strong> are more rigid. The DNA structural<br />

parameters mentioned earlier (position preference, stack<strong>in</strong>g<br />

energy, <strong>and</strong> <strong>in</strong>tr<strong>in</strong>sic curvature) together with AT<br />

content <strong>and</strong> DNAse sensitivity (Brukner et al., 1995) were<br />

Fig. 6. Localization of the orig<strong>in</strong> <strong>and</strong> the<br />

term<strong>in</strong>us of replication <strong>in</strong> the A. borkumensis<br />

SK2 chromosome derived from str<strong>and</strong> bias<br />

curves: the median oligonucleotide skew<br />

curve (red), the GC weighted median (green)<br />

<strong>and</strong> the AT weighted median (blue) (Worn<strong>in</strong>g<br />

et al., 2006).<br />



8 O. N. Reva et al.<br />

compiled <strong>in</strong>to a structural profile of all upstream regions of<br />

A. borkumensis (see section Experimental procedures).<br />

The profile uses z-scores to measure how the average<br />

values of the properties vary from -400 bp to +400 bp<br />

around the translation start (Fig. 7). A. borkumensis has<br />

a coding density of only 87%, which results in wider<br />

intergenic spacers, and this appears to give rise to a<br />

larger and wider peak of curvature, stacking energy and<br />

AT content (Fig. 7A). For comparison we also analysed<br />

the promoter profile of another ocean bacterium, C<strong>and</strong>idatus<br />

Pelagibacter ubique HTCC1062 (Giovannoni et al.,<br />

2005), an example of a highly streaml<strong>in</strong>ed genome with a<br />

cod<strong>in</strong>g density of 96%. Here we observed a much weaker<br />

curvature signal, <strong>and</strong> the distribution of stack<strong>in</strong>g energy<br />

and AT content was narrower and had higher maxima<br />

(Fig. 7B).<br />

Next, the probability of open<strong>in</strong>g dur<strong>in</strong>g stress-<strong>in</strong>duced<br />

DNA duplex destabilization was computed by us<strong>in</strong>g the<br />

program SIDD (Wang et al., 2004), cover<strong>in</strong>g five different<br />

values of the super-helical density σ = {-0.025, -0.035,<br />

-0.045, -0.055, -0.065}. As negative super-coiling becomes<br />

stronger, the probability of opening increases at lower (more negative)<br />

super-helical densities in A. borkumensis (Fig. 7C). In<br />

contrast, a narrower SIDD profile that exhibits only<br />

m<strong>in</strong>or dependence on super-helical density (Fig. 7D),<br />

was calculated for the C<strong>and</strong>idatus Pelagibacter ubique<br />

HTCC1062 genome.<br />

The structural profile for the promoter regions of<br />

A. borkumensis was compared with that of closely related<br />

species as found above (see Fig. 4). Generally, it looked<br />

more like the promoter profile of members of the<br />

Pseudomonadales than the general comparison organism,<br />

E. coli. Moreover, the promoter profile was very different<br />

compared with the promoter profile of X. fastidiosa<br />

strains, even though they were very similar with regard<br />

to their TU profile (see Fig. 4). The promoter profiles for<br />

the above-mentioned organisms may be found at our<br />

website (http://www.cbs.dtu.dk/services/GenomeAtlas/).<br />

Am<strong>in</strong>o acid <strong>and</strong> codon usage<br />

We have exam<strong>in</strong>ed the codon <strong>and</strong> am<strong>in</strong>o acid usage of<br />

A. borkumensis <strong>and</strong> compared this with both the usage of<br />

bacteria <strong>in</strong> general <strong>and</strong> of 16 oceanic bacteria (Entrez<br />

project IDs 230, 10 645, 12 530, 13 233, 13 239, 13 282,<br />

13 642, 13 643, 13 654, 13 655, 13 902, 13 906, 13 910,<br />

13 911, 13 989, 15 660) (Willenbrock et al., 2006). In<br />

Fig. 8, the codon usage plot of A. borkumensis is<br />

superimposed on the cumulative plot of all completely<br />

sequenced bacteria <strong>in</strong> public databases (N = 518,<br />

Fig. 8A) or of that of 16 oceanic bacteria (Fig. 8B).<br />

A few codons are differentially utilized <strong>in</strong> A. borkumensis<br />

(GUC, CUG), but all values are with<strong>in</strong> the range of three<br />

st<strong>and</strong>ard deviations. In other words, codon usage of<br />

A. borkumensis resides with<strong>in</strong> the typical range of<br />

eubacteria.<br />

Interest<strong>in</strong>gly, the sequenced oceanic bacteria share a<br />

very similar am<strong>in</strong>o acid usage (Fig. 8D), whereas broad<br />

variations thereof were noted amongst all sequenced<br />

bacteria that represent the whole spectrum of habitats<br />

(Fig. 8C). A. borkumensis roughly follows the profile of the<br />

oceanic bacteria, although cyste<strong>in</strong>e, tryptophan, leuc<strong>in</strong>e,<br />

proline, arginine and serine are under-utilized, and glutamic<br />

acid, lysine, phenylalanine, histidine, methionine and<br />

tyrosine are over-utilized, all exceeding the three-standard<br />

deviation boundaries.<br />
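The three-standard-deviation criterion used in this comparison can be sketched as follows. This is only an illustration: the function name and the data layout (one dictionary of normalized frequencies per genome) are assumptions, and the use of the sample standard deviation is a choice made for the example rather than something stated in the text.

```python
from statistics import mean, stdev


def outside_three_sd(target: dict[str, float],
                     references: list[dict[str, float]]) -> dict[str, float]:
    """Return z-scores of items whose usage in `target` lies more than
    three standard deviations from the mean of the reference genomes.

    `target` and each entry of `references` map a codon or amino acid
    to its normalized frequency in one genome (hypothetical layout).
    """
    flagged = {}
    for item, value in target.items():
        ref_values = [ref.get(item, 0.0) for ref in references]
        mu = mean(ref_values)
        sd = stdev(ref_values)  # sample standard deviation (assumption)
        if sd > 0 and abs(value - mu) > 3 * sd:
            flagged[item] = (value - mu) / sd
    return flagged
```

A positive z-score in the result corresponds to over-utilization relative to the reference set, a negative one to under-utilization.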

Conclusion<br />

Fig. 7. Profile of structural properties of<br />

promoter regions (A <strong>and</strong> B) <strong>and</strong> probabilities<br />

of open<strong>in</strong>g dur<strong>in</strong>g stress-<strong>in</strong>duced DNA duplex<br />

destabilization at various super-helical<br />

densities (C <strong>and</strong> D) <strong>in</strong> the A. borkumensis<br />

SK2 (A <strong>and</strong> C) <strong>and</strong> C<strong>and</strong>idatus Pelagibacter<br />

ubique HTCC1062 (B <strong>and</strong> D) chromosomes.<br />

Each annotated gene was aligned at the<br />

translation start site <strong>and</strong> the average values<br />

for the SIDD probabilities, AT-content, position<br />

preference, stack<strong>in</strong>g energy, <strong>in</strong>tr<strong>in</strong>sic<br />

curvature <strong>and</strong> DNase sensitivity were<br />

calculated at each position <strong>in</strong> the alignment.<br />

The values were subsequently converted <strong>in</strong>to<br />

z-scores, us<strong>in</strong>g the average <strong>and</strong> st<strong>and</strong>ard<br />

deviation of the entire chromosome. Values<br />

are smoothed over a 5 bp w<strong>in</strong>dow.<br />

Inspection of the collected phylogenetic connections<br />

revealed that the most closely related organisms are<br />

Ac<strong>in</strong>etobacter sp. <strong>and</strong> Pseudomonas aerug<strong>in</strong>osa,<br />



although <strong>in</strong> trees where both Pseudomonas <strong>and</strong> Ac<strong>in</strong>etobacter<br />

are present, A. borkumensis tends to cluster more<br />

often with the latter one.<br />

The major structural feature of the A. borkumensis<br />

chromosome is its symmetry <strong>and</strong> homogeneity. The<br />

genome conta<strong>in</strong>s only very few regions with extraord<strong>in</strong>arily<br />

low or high curvature, position preference or base<br />

stack<strong>in</strong>g energy. The chromosomal frame is symmetric:<br />

The orig<strong>in</strong> <strong>and</strong> the term<strong>in</strong>us of replication are located<br />

opposite to each other <strong>in</strong> the chromosome <strong>and</strong> are clearly<br />

discerned by maxima of oligonucleotide usage biases<br />

between lead<strong>in</strong>g <strong>and</strong> lagg<strong>in</strong>g str<strong>and</strong>.<br />

The genetic repertoire of A. borkumensis is most similar<br />

to that of Ac<strong>in</strong>etobacter <strong>and</strong> P. aerug<strong>in</strong>osa. Moreover,<br />


Fig. 8. Codon usage (A <strong>and</strong> B) <strong>and</strong> am<strong>in</strong>o acid usage (C <strong>and</strong> D) of A. borkumensis SK2 compared with those of 518 completely sequenced<br />

bacteria (A and C) or with those of 16 sequenced oceanic bacteria (B and D). Frequencies of amino acids and codons were counted for each<br />

genome <strong>and</strong> normalized. Mean value (grey l<strong>in</strong>e) <strong>and</strong> three st<strong>and</strong>ard deviations (grey solid area) represent the global usage of <strong>in</strong>dividual<br />

codons (A <strong>and</strong> B) <strong>and</strong> am<strong>in</strong>o acids (C <strong>and</strong> D) <strong>in</strong> the 518 (A <strong>and</strong> C) or 16 (B <strong>and</strong> D) reference genomes. The red l<strong>in</strong>e (A <strong>and</strong> B) shows the<br />

codon usage <strong>and</strong> the blue l<strong>in</strong>e (C <strong>and</strong> D) shows the am<strong>in</strong>o acid usage of A. borkumensis.<br />

A. borkumensis shares a similar oligonucleotide usage<br />

with the Xanthomonadales <strong>and</strong> Pseudomonadales <strong>in</strong>dicat<strong>in</strong>g<br />

close phylogenetic relationships with these orders<br />

<strong>in</strong> accordance with 16S rDNA sequence relatedness<br />

(Schneiker et al., 2006). Amongst this subgroup of completely<br />

sequenced genomes, the A. borkumensis chromosome<br />

harbours the lowest relative number of genome<br />

islands with atypical tetranucleotide usage. P. putida<br />

KT2440, for example, carries threefold more islands per<br />

megabase in its chromosome (Weinel et al., 2002). Interestingly,<br />

one of the three enzyme systems that are<br />

upregulated <strong>in</strong> alkane-grown cells (Sabirova et al., 2006),<br />

the well-known alkB1 cluster, is encoded by genome<br />

isl<strong>and</strong>s. The molecular evolution of the alk genes that are<br />


encoded by a catabolic transposon (van Beilen et al.,<br />

2001) is remarkable: the Alcanivorax genes were probably<br />

acquired from the Yers<strong>in</strong>ia l<strong>in</strong>eage, whereas the<br />

P. putida genes exhibit the typical Alcanivorax tetranucleotide<br />

signature. Horizontal gene transfer was relevant to<br />

confer the – probably – most important metabolic trait to<br />

A. borkumensis, but otherwise the stable seawater habitat<br />

apparently did not favour the shuffl<strong>in</strong>g <strong>and</strong> exchange<br />

of genes with other taxa. Instead a symmetric <strong>and</strong><br />

structurally homogeneous chromosome evolved that<br />

lacks numerous metabolic traits (Yakimov et al., 1998;<br />

Schneiker et al., 2006) found in its versatile Pseudomonas<br />

relatives which are endowed with twofold larger chromosomes<br />

(Stover et al., 2000; Nelson et al., 2002).<br />

Experimental procedures<br />

Genomic sequence<br />

The comparative genomics analyses were based on the<br />

genomic sequence of A. borkumensis SK2 (Golysh<strong>in</strong> et al.,<br />

2003) <strong>and</strong> its annotation (Schneiker et al., 2006).<br />

Atlas visualization<br />

Atlases, developed <strong>in</strong> house, make it possible to visualize<br />

correlations between position dependent <strong>in</strong>formation conta<strong>in</strong>ed<br />

with<strong>in</strong> a chromosome. Circular graphical representations<br />

of the entire A. borkumensis genome were created<br />

us<strong>in</strong>g the atlas visualization tool, GeneWiz. Each feature,<br />

such as AT content is represented by a separate circle <strong>in</strong> the<br />

atlas. Typically, mean values are pictured <strong>in</strong> grey <strong>and</strong> extreme<br />

values are highlighted <strong>in</strong> a user def<strong>in</strong>ed colour (Pedersen<br />

et al., 2000).<br />

Phylome atlas. For each am<strong>in</strong>o acid sequence, phylogenetic<br />

trees were automatically constructed as described <strong>in</strong><br />

Sicheritz-Ponten <strong>and</strong> Andersson (2001). The phylogenomic<br />

<strong>in</strong>formation of the result<strong>in</strong>g 1919 phylogenetic trees was<br />

extracted <strong>and</strong> analysed <strong>in</strong> the PyPhy system.<br />

Genome atlas. The genome atlas is a comb<strong>in</strong>ation of some<br />

general <strong>in</strong>formative properties. These are some structural<br />

features (<strong>in</strong>tr<strong>in</strong>sic curvature, stack<strong>in</strong>g energy <strong>and</strong> position<br />

preference), some repeat properties (global direct <strong>and</strong><br />

<strong>in</strong>verted repeats) <strong>and</strong> the ma<strong>in</strong> base composition features<br />

(GC skew <strong>and</strong> percent AT).<br />

Intr<strong>in</strong>sic curvature was calculated us<strong>in</strong>g the CURVATURE<br />

software (Shpigelman et al., 1993). Stack<strong>in</strong>g energy of a<br />

DNA segment was determ<strong>in</strong>ed by the method of Ornste<strong>in</strong> <strong>and</strong><br />

colleagues (1978). Position preference was based on a tr<strong>in</strong>ucleotide<br />

model that estimates the helix flexibility (Satchwell<br />

et al., 1986). Base composition is generally divided <strong>in</strong>to AT<br />

content <strong>and</strong> GC skews. Both were calculated from the nucleotide<br />

sequence. Global direct <strong>and</strong> <strong>in</strong>verted repeats were<br />

found us<strong>in</strong>g variations of an algorithm that f<strong>in</strong>ds the highest<br />

degree of homology for a 15 bp repeat with<strong>in</strong> a w<strong>in</strong>dow of<br />

length 100 bp (Jensen et al., 1999).<br />

Codon <strong>and</strong> am<strong>in</strong>o acid usage<br />

Codon <strong>and</strong> am<strong>in</strong>o acid usage were calculated from all cod<strong>in</strong>g<br />

regions <strong>in</strong> the genome as annotated <strong>in</strong> the GenBank entries.<br />

The relative synonymous codon usage was calculated by<br />

compar<strong>in</strong>g the codon distribution from a set of highly<br />

expressed genes with a background distribution estimated<br />

from the codon usage of all cod<strong>in</strong>g regions <strong>in</strong> the genome<br />

(Willenbrock et al., 2006). In order to identify a set of constitutively<br />

highly expressed genes <strong>in</strong> A. borkumensis, the reference<br />

set of 27 very highly expressed Escherichia coli genes<br />

orig<strong>in</strong>ally compiled by Sharp <strong>and</strong> Li (1986) was aligned at the<br />

prote<strong>in</strong> level aga<strong>in</strong>st all genes annotated <strong>in</strong> the GenBank<br />

entry us<strong>in</strong>g BLASTP version 2.2.9 (Altschul et al., 1997). For<br />

each of these very highly expressed genes, the gene with the<br />

best alignment was added to a set of very highly expressed<br />

genes if it had an E-value below 10⁻⁶.<br />
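The counting step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: the function names are hypothetical, the input is assumed to be a list of in-frame coding sequences, and the standard amino acid assignments of the genetic code are assumed to apply.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the conventional TCAG order (amino acid
# assignments are the same in the bacterial translation table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds_list: list[str]) -> dict[str, float]:
    """Normalized codon frequencies over all coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def amino_acid_usage(codon_freq: dict[str, float]) -> dict[str, float]:
    """Normalized amino acid frequencies derived from codon frequencies."""
    usage = Counter()
    for codon, freq in codon_freq.items():
        aa = CODON_TABLE[codon]
        if aa != "*":  # stop codons do not contribute to amino acid usage
            usage[aa] += freq
    total = sum(usage.values())
    return {aa: f / total for aa, f in usage.items()} if total else {}
```

The comparison of highly expressed genes with the genomic background would then apply `codon_usage` to each of the two gene sets separately.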

TU patterns<br />

Overlapp<strong>in</strong>g tetranucleotide words were counted <strong>in</strong> the bacterial<br />

nucleotide sequences by shift<strong>in</strong>g the w<strong>in</strong>dow <strong>in</strong> steps of<br />

1 nucleotide. The total word number <strong>in</strong> a circular sequence<br />

equals the sequence length. The observed counts of words<br />

(Co) were compared with the expected counts of words (Ce).<br />

Assum<strong>in</strong>g the same distribution frequency for all words irrespective<br />

of their composition <strong>and</strong> sequence mononucleotide<br />

content, Ce matches the ratio of the sequence length to the<br />

number of different tetranucleotide words Nw (256 for<br />

tetranucleotides).<br />
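The counting scheme described in this paragraph can be sketched as follows. The function names are hypothetical, and the sequence is assumed to be circular and to contain only the letters A, C, G and T.

```python
from collections import Counter


def count_words_circular(seq: str, k: int = 4) -> Counter:
    """Count overlapping k-mers in a circular sequence, shifting by 1 nt.

    The sequence is extended by its first k - 1 bases so that words
    spanning the junction are counted; the total number of words then
    equals the sequence length.
    """
    seq = seq.upper()
    extended = seq + seq[:k - 1]
    return Counter(extended[i:i + k] for i in range(len(seq)))


def expected_count(seq_len: int, k: int = 4) -> float:
    """Expected count per word if all 4**k words were equally frequent."""
    return seq_len / 4 ** k
```

For tetranucleotides, `expected_count` divides the sequence length by 256, matching the definition of Ce given above.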

The deviation Δw of observed from expected counts is<br />

given by<br />

$$\Delta_w = (C_o - C_e) \times C_o^{-1}$$<br />

For the comparison of sequences by TU patterns, the words<br />

in each sequence were ranked by Δw values. Rank numbers<br />

<strong>in</strong>stead of word counts were used to simplify pattern comparison<br />

<strong>and</strong> to remove sequence length bias.<br />

The distance D between two patterns was calculated as<br />

the sum of absolute distances between ranks of identical<br />

words <strong>in</strong> patterns i <strong>and</strong> j as follows <strong>and</strong> expressed as a<br />

percent of the possible maximal distance:<br />

$$D(\%) = 100 \times \frac{\sum_w \left| \mathrm{rank}_{w,i} - \mathrm{rank}_{w,j} \right|}{D_{\max}}$$<br />

where<br />

$$D_{\max} = \frac{N_w (N_w - 1)}{2}$$<br />

Dmax is the maximal distance that is theoretically possible<br />

between two patterns. For TU patterns Nw is 256. For more<br />

<strong>in</strong>formation about methods of oligonucleotide usage statistics<br />

see Reva <strong>and</strong> Tümmler (2004; 2005).<br />
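The procedure in this section (count words, compute the deviation of observed from expected counts, rank the words, sum the absolute rank differences) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, words that never occur are given a deviation of minus infinity because the deviation is undefined for a zero count, and tied words receive the average of their ranks.

```python
from collections import Counter
from itertools import product

WORDS = ["".join(p) for p in product("ACGT", repeat=4)]  # Nw = 256


def deviations(seq: str) -> dict[str, float]:
    """Deviation (Co - Ce) / Co for every tetranucleotide of a circular sequence."""
    seq = seq.upper()
    extended = seq + seq[:3]
    counts = Counter(extended[i:i + 4] for i in range(len(seq)))
    ce = len(seq) / len(WORDS)
    # Assumption: absent words (Co = 0) are placed at the bottom of the ranking.
    return {w: (counts[w] - ce) / counts[w] if counts[w] else float("-inf")
            for w in WORDS}


def ranks(values: dict[str, float]) -> dict[str, float]:
    """Rank words by value (1 = lowest); ties get the average rank (assumption)."""
    ordered = sorted(values, key=values.get)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for word in ordered[i:j + 1]:
            result[word] = average
        i = j + 1
    return result


def tu_distance(seq_i: str, seq_j: str) -> float:
    """Distance D (%) between the TU patterns of two sequences."""
    rank_i, rank_j = ranks(deviations(seq_i)), ranks(deviations(seq_j))
    n_w = len(WORDS)
    d_max = n_w * (n_w - 1) / 2  # Dmax as defined in the text
    total = sum(abs(rank_i[w] - rank_j[w]) for w in WORDS)
    return 100 * total / d_max
```

Because ranks rather than counts are compared, two sequences of very different length can be passed to `tu_distance` directly.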

Orig<strong>in</strong> plot<br />

The orig<strong>in</strong> plot was constructed as described <strong>in</strong> Worn<strong>in</strong>g<br />

<strong>and</strong> colleagues (2006). In brief, the difference between a<br />

hypothetical lead<strong>in</strong>g <strong>and</strong> lagg<strong>in</strong>g str<strong>and</strong> is plotted for various<br />

positions on the chromosome. The frequencies of all<br />



oligonucleotides from 2-mers to 8-mers on the lead<strong>in</strong>g <strong>and</strong><br />

lagging strands in a 60% window were counted and the information<br />

content was calculated <strong>and</strong> summarized over all<br />

oligos for every putative orig<strong>in</strong>. The G/C <strong>and</strong> A/T weighted<br />

str<strong>and</strong> bias were <strong>in</strong>cluded to dist<strong>in</strong>guish between orig<strong>in</strong> <strong>and</strong><br />

term<strong>in</strong>us.<br />

Structural profile of the promoter region<br />

Each annotated gene was aligned at the translation start site<br />

<strong>and</strong> the average values for five DNA structural features<br />

(AT content, position preference, stack<strong>in</strong>g energy, <strong>in</strong>tr<strong>in</strong>sic<br />

curvature, DNase sensitivity; see chapter on Genome Atlas)<br />

were calculated at each position <strong>in</strong> the alignment. The values<br />

were subsequently centered, scaled and smoothed within<br />

a 5 bp w<strong>in</strong>dow us<strong>in</strong>g Gaussian smooth<strong>in</strong>g.<br />
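The profile construction described above can be sketched as follows. This is a simplified illustration: the function names are hypothetical, all genes are assumed to lie on the forward strand, genes closer than the flank length to either end of the sequence are skipped, and the width of the Gaussian kernel (`sigma`) is an assumed value that the text does not specify.

```python
import math
from statistics import mean, pstdev


def positional_profile(track: list[float], starts: list[int],
                       flank: int = 400) -> list[float]:
    """Average a per-base property over windows aligned at translation starts."""
    width = 2 * flank + 1
    sums = [0.0] * width
    used = 0
    for s in starts:
        if s - flank < 0 or s + flank >= len(track):
            continue  # assumption: ignore genes too close to the sequence ends
        for k in range(width):
            sums[k] += track[s - flank + k]
        used += 1
    if used == 0:
        raise ValueError("no start site has a complete flanking window")
    return [x / used for x in sums]


def to_z_scores(profile: list[float], track: list[float]) -> list[float]:
    """Center and scale the profile by the chromosome-wide mean and SD."""
    mu, sd = mean(track), pstdev(track)
    return [(x - mu) / sd for x in profile]


def gaussian_smooth(values: list[float], window: int = 5,
                    sigma: float = 1.0) -> list[float]:
    """Gaussian-weighted moving average over `window` positions."""
    half = window // 2
    offsets = range(-half, half + 1)
    weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in offsets]
    out = []
    for i in range(len(values)):
        num = den = 0.0
        for d, w in zip(offsets, weights):
            j = i + d
            if 0 <= j < len(values):  # truncate the kernel at the edges
                num += w * values[j]
                den += w
        out.append(num / den)
    return out
```

A full profile for one property would chain the three steps: `gaussian_smooth(to_z_scores(positional_profile(track, starts), track))`.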

Acknowledgements<br />

The analysis has been performed with<strong>in</strong> the frame of the<br />

‘Task Force Genome L<strong>in</strong>guistics’ of the competence<br />

network ‘Genome Research on Bacteria Relevant for Agriculture,<br />

Environment <strong>and</strong> Biotechnology’ funded by the<br />

Federal M<strong>in</strong>istry of Education <strong>and</strong> Research (BMBF),<br />

Germany (Contracts 031U213D <strong>and</strong> 031U113D). We thank<br />

Peter Golysh<strong>in</strong>, Vitor Mart<strong>in</strong>s dos Santos <strong>and</strong> Kenneth N.<br />

Timmis, Helmholtz Center for Infection Research, Braunschweig,<br />

for stimulat<strong>in</strong>g discussions dur<strong>in</strong>g the <strong>in</strong>itiation of the<br />

study <strong>and</strong> Olaf Kaiser, Lehrstuhl für Genetik, Universität<br />

Bielefeld, for the provision of sequence data at an early<br />

stage of the sequenc<strong>in</strong>g project. O.R. has been a recipient<br />

of a postdoctoral stipend of the DFG-sponsored International<br />

Tra<strong>in</strong><strong>in</strong>g Group ‘Pseudomonas: Pathogenicity <strong>and</strong><br />

Biotechnology’.<br />

References<br />

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J.,<br />

Zhang, Z., Miller, W., <strong>and</strong> Lipman, D.J. (1997) Gapped<br />

BLAST <strong>and</strong> PSI-BLAST: a new generation of prote<strong>in</strong><br />

database search programs. Nucleic Acids Res 25: 3389–<br />

3402.<br />

Baldi, P., <strong>and</strong> Baisnee, P.F. (2000) Sequence analysis by<br />

additive scales: DNA structure for sequences <strong>and</strong> repeats<br />

of all lengths. Bio<strong>in</strong>formatics 16: 865–889.<br />

van Beilen, J.B., Panke, S., Lucch<strong>in</strong>i, S., Franch<strong>in</strong>i, A.G.,<br />

Rothlisberger, M., <strong>and</strong> Witholt, B. (2001) Analysis of<br />

Pseudomonas putida alkane-degradation gene clusters<br />

<strong>and</strong> flank<strong>in</strong>g <strong>in</strong>sertion sequences: evolution <strong>and</strong> regulation<br />

of the alk genes. Microbiology 147: 1621–1630.<br />

van Beilen, J.B., Mar<strong>in</strong>, M.M., Smits, T.H.M., Röthlisberger,<br />

M., Franch<strong>in</strong>i, A.G., Witholt, B., <strong>and</strong> Rojo, F. (2004)<br />

Characterization of two alkane hydroxylase genes from<br />

the mar<strong>in</strong>e hydrocarbonoclastic bacterium Alcanivorax<br />

borkumensis. Environ Microbiol 6: 264–273.<br />

Brukner, I., Sanchez, R., Suck, D., <strong>and</strong> Pongor, S. (1995)<br />

Sequence-dependent bend<strong>in</strong>g propensity of DNA as<br />

revealed by DNase I: parameters for tr<strong>in</strong>ucleotides. EMBO<br />

J 14: 1812–1818.<br />


Chen, X.-H., Koumoutsi, A., Scholz, R., Eisenreich, A.,<br />

Schneider, K., Schneider, I., et al. (2007) <strong>Comparative</strong><br />

analysis of the complete genome sequence of the plant<br />

growth promot<strong>in</strong>g Bacillus amyloliquefaciens FZB42.<br />

Nat Biotechnol 25: 1007–1014.<br />

Dlakic, M., Ussery, D., <strong>and</strong> Brunak, S. (2004) DNA bendability<br />

<strong>and</strong> nucleosome position<strong>in</strong>g <strong>in</strong> transcriptional<br />

regulation. In DNA Conformation <strong>and</strong> Transcription.<br />

Ohyama, T. (ed.). Aust<strong>in</strong>, TX: L<strong>and</strong>es Bioscience, pp. 198–<br />

211.<br />

Dobr<strong>in</strong>dt, U., Hochhut, B., Hentschel, U., <strong>and</strong> Hacker, J.<br />

(2004) Genomic isl<strong>and</strong>s <strong>in</strong> pathogenic <strong>and</strong> environmental<br />

microorganisms. Nat Rev Microbiol 2: 414–424.<br />

Giovannoni, S.J., Tripp, H.J., Givan, S., Podar, M., Verg<strong>in</strong>,<br />

K.L., Baptista, D., et al. (2005) Genome streaml<strong>in</strong><strong>in</strong>g <strong>in</strong> a<br />

cosmopolitan oceanic bacterium. Science 309: 1242–<br />

1245.<br />

Golysh<strong>in</strong>, P.N., Mart<strong>in</strong>s Dos Santos, V.A., Kaiser, O., Ferrer,<br />

M., Sabirova, Y.S., Lunsdorf, H., et al. (2003) Genome<br />

sequence completed of Alcanivorax borkumensis, a<br />

hydrocarbon-degrad<strong>in</strong>g bacterium that plays a global role<br />

<strong>in</strong> oil removal from mar<strong>in</strong>e systems. J Biotechnol 106:<br />

215–220.<br />

Hara, A., Syutsubo, K., <strong>and</strong> Harayama, S. (2003) Alcanivorax<br />

which prevails <strong>in</strong> oil-contam<strong>in</strong>ated seawater exhibits broad<br />

substrate specificity for alkane degradation. Environ<br />

Microbiol 5: 746–753.<br />

Hara, A., Baik, S.H., Syutsubo, K., Misawa, N., Smits, T.H.,<br />

van Beilen, J.B., <strong>and</strong> Harayama, S. (2004) Clon<strong>in</strong>g <strong>and</strong><br />

functional analysis of alkB genes <strong>in</strong> Alcanivorax borkumensis<br />

SK2. Environ Microbiol 6: 191–197.<br />

Harayama, S., Kishira, H., Kasai, Y., <strong>and</strong> Shutsubo, K. (1999)<br />

Petroleum biodegradation in marine environments. J Mol<br />

Microbiol Biotechnol 1: 63–70.<br />

Jensen, L.J., Friis, C., <strong>and</strong> Ussery, D.W. (1999) Three<br />

views of microbial genomes. Res Microbiol 150: 773–<br />

777.<br />

Kasai, Y., Kishira, H., Sasaki, I., Syutsubo, K., Watanabe, K.,<br />

and Harayama, S. (2002) Predominant growth of Alcanivorax<br />

stra<strong>in</strong>s <strong>in</strong> oil-contam<strong>in</strong>ated <strong>and</strong> nutrient-supplemented sea<br />

water. Environ Microbiol 4: 141–147.<br />

Kasai, Y., Kishira, H., Syutsubo, K., <strong>and</strong> Harayama, S. (2001)<br />

Molecular detection of mar<strong>in</strong>e bacterial populations on<br />

beaches contam<strong>in</strong>ated by the Nakhodka tanker oilaccident.<br />

Environ Microbiol 3: 246–255.<br />

Klockgether, J., Würdemann, D., Reva, O., Wiehlmann, L.,<br />

<strong>and</strong> Tümmler, B. (2007) Diversity of the abundant<br />

pKLC102/PAGI-2 family of genomic isl<strong>and</strong>s <strong>in</strong> Pseudomonas<br />

aerug<strong>in</strong>osa. J Bacteriol 189: 2443–2459.<br />

Luiten, R.G., Putterman, D.G., Schoenmakers, J.G.,<br />

Kon<strong>in</strong>gs, R.N., <strong>and</strong> Day, L.A. (1985) Nucleotide sequence<br />

of the genome of Pf3, an IncP-1 plasmid-specific filamentous<br />

bacteriophage of Pseudomonas aerug<strong>in</strong>osa. J Virol<br />

56: 268–276.<br />

McKew, B.A., Coulon, F., Osborn, A.M., Timmis, K.N., <strong>and</strong><br />

McGenity, T.J. (2007a) Determ<strong>in</strong><strong>in</strong>g the identity <strong>and</strong> roles<br />

of oil-metaboliz<strong>in</strong>g mar<strong>in</strong>e bacteria from the Thames<br />

estuary, UK. Environ Microbiol 9: 165–176.<br />

McKew, B.A., Coulon, F., Yakimov, M.M., Denaro, R., Genovese,<br />

M., Smith, C.J., et al. (2007b) Efficacy of <strong>in</strong>tervention<br />

strategies for bioremediation of crude oil <strong>in</strong> mar<strong>in</strong>e<br />

©2007TheAuthors<br />

Journal compilation © 2007 Society for Applied Microbiology <strong>and</strong> Blackwell Publish<strong>in</strong>g Ltd, Environmental Microbiology


12 O. N. Reva et al.<br />

systems <strong>and</strong> effects on <strong>in</strong>digenous hydrocarbonoclastic<br />

bacteria. Environ Microbiol 9: 1562–1571.<br />

Nelson, K.E., We<strong>in</strong>el, C., Paulsen, I.T., Dodson, R.J., Hilbert,<br />

H., Mart<strong>in</strong>s dos Santos, V.A., et al. (2002) Complete<br />

genome sequence <strong>and</strong> comparative analysis of the metabolically<br />

versatile Pseudomonas putida KT2440. Environ<br />

Microbiol 4: 799–808.<br />

Ornste<strong>in</strong>, R., Re<strong>in</strong>, R., Breen, D., <strong>and</strong> MacElroy, R. (1978)<br />

An optimized potential function for the calculation of<br />

nucleic acid <strong>in</strong>teraction energies. Biopolymers 17: 2341–<br />

2360.<br />

Pedersen, A.G., Jensen, L.J., Brunak, S., Staerfeldt, H.H.,<br />

<strong>and</strong> Ussery, D.W. (2000) A DNA structural atlas for<br />

Escherichia coli. J Mol Biol 299: 907–930.<br />

Pride, D.T., Me<strong>in</strong>ersmann, R.J., Wassenaar, T.M., <strong>and</strong><br />

Blaser, M.J. (2003) Evolutionary implications of microbial<br />

genome tetranucleotide frequency biases. Genome Res<br />

13: 145–158.<br />

Reva, O.N., <strong>and</strong> Tümmler, B. (2004) Global features of<br />

sequences of bacterial chromosomes, plasmids <strong>and</strong><br />

phages revealed by analysis of oligonucleotide usage<br />

patterns. BMC Bio<strong>in</strong>formatics 5: 90.<br />

Reva, O.N., <strong>and</strong> Tümmler, B. (2005) Differentiation of regions<br />

with atypical oligonucleotide composition <strong>in</strong> bacterial<br />

genomes. BMC Bio<strong>in</strong>formatics 6: 251.<br />

Röl<strong>in</strong>g, W.F., Milner, M.G., Jones, D.M., Lee, K., Daniel, F.,<br />

Swannell, R.J., et al. (2002) Robust hydrocarbon degradation<br />

<strong>and</strong> dynamics of bacterial communities dur<strong>in</strong>g nutrient<br />

– enhanced oil spill bioremediation. Appl Environ Microbiol<br />

68: 5537–5548.<br />

Sabirova, J.S., Ferrer, M., Regenhardt, D., Timmis, K.N., <strong>and</strong><br />

Golysh<strong>in</strong>, P.N. (2006) Proteomic <strong>in</strong>sights <strong>in</strong>to metabolic<br />

adaptations <strong>in</strong> Alcanivorax borkumensis <strong>in</strong>duced by alkane<br />

utilization. J Bacteriol 188: 3763–3773.<br />

Saitou, N., <strong>and</strong> Nei, M. (1987) The neighbor-jo<strong>in</strong><strong>in</strong>g method:<br />

a new method for reconstruct<strong>in</strong>g phylogenetic trees. Mol<br />

Biol Evol 4: 406–425.<br />

Satchwell, S.C., Drew, H.R., <strong>and</strong> Travers, A.A. (1986)<br />

Sequence periodicities <strong>in</strong> chicken nucleosome core DNA.<br />

J Mol Biol 191: 659–675.<br />

Schneiker, S., Mart<strong>in</strong>s dos Santos, V.A., Bartels, D., Bekel,<br />

T., Brecht, M., Buhrmester, J., et al. (2006) Genome<br />

sequence of the ubiquitous hydrocarbon-degrad<strong>in</strong>g mar<strong>in</strong>e<br />

bacterium Alcanivorax borkumensis. Nat Biotechnol 24:<br />

997–1004.<br />

Sharp, P.M., <strong>and</strong> Li, W.H. (1986) Codon usage <strong>in</strong> regulatory<br />

genes <strong>in</strong> Escherichia coli does not reflect selection for ‘rare’<br />

codons. Nucleic Acids Res 14: 7737–7749.<br />

Shpigelman, E.S., Trifonov, E.N., <strong>and</strong> Bolshoy, A. (1993)<br />

CURVATURE: software for the analysis of curved DNA.<br />

Comput Appl Biosci 9: 435–440.<br />

Sicheritz-Ponten, T., <strong>and</strong> Andersson, S.G. (2001) A phyloge-<br />

nomic approach to microbial evolution. Nucleic Acids Res<br />

29: 545–552.<br />

Skovgaard, M., Jensen, L.J., Friis, C., Stærfeldt, H.H.,<br />

Worn<strong>in</strong>g, P., Brunak, S., <strong>and</strong> Ussery, D.W. (2002) The<br />

atlas visualisation of genome-wide <strong>in</strong>formation. In Methods<br />

<strong>in</strong> Microbiology. Wren, B., <strong>and</strong> Dorrell, N. (eds). London,<br />

UK: Academic Press, pp. 49–63.<br />

Smits, T.H., Balada, S.B., Witholt, B., <strong>and</strong> van Beilen, J.B.<br />

(2002) Functional analysis of alkane hydroxylases from<br />

gram-negative <strong>and</strong> gram-positive bacteria. J Bacteriol 184:<br />

1733–1742.<br />

Spangenberg, C., Fislage, R., Röml<strong>in</strong>g, U., <strong>and</strong> Tümmler, B.<br />

(1997) Disrespectful type IV pil<strong>in</strong>s. Mol Microbiol 25: 203–<br />

204.<br />

Stover, C.K., Pham, X.Q., Erw<strong>in</strong>, A.L., Mizoguchi, S.D., Warrener,<br />

P., Hickey, M.J., et al. (2000) Complete genome<br />

sequence of Pseudomonas aerug<strong>in</strong>osa PA01, an opportunistic<br />

pathogen. Nature 406: 959–964.<br />

Syutsubo, K., Kishira, H., <strong>and</strong> Harayama, S. (2001) Development<br />

of specific oliogonucleotide probes for the identification<br />

<strong>and</strong> <strong>in</strong> situ defection of hydrocarbon – degrad<strong>in</strong>g<br />

Alcanivorax stra<strong>in</strong>s. Environ Microbiol 3: 371–379.<br />

Teel<strong>in</strong>g, H., Meyerdierks, A., Bauer, M., Amann, R., <strong>and</strong><br />

Glockner, F.O. (2004) Application of tetranucleotide<br />

frequencies for the assignment of genomic fragments.<br />

Environ Microbiol 6: 938–947.<br />

Ussery, D., Soumpasis, D.M., Brunak, S., Staerfeldt, H.H.,<br />

Worn<strong>in</strong>g, P., <strong>and</strong> Krogh, A. (2002) Bias of pur<strong>in</strong>e stretches<br />

<strong>in</strong> sequenced chromosomes. Comput Chem 26: 531–541.<br />

Wang, H., Noordewier, M., <strong>and</strong> Benham, C.J. (2004) Stress-<br />

Induced DNA Duplex destabilization (SIDD) <strong>in</strong> the E. coli<br />

genome: SIDD sites are closely associated with promoters.<br />

Genome Res 14: 1575–1584.<br />

We<strong>in</strong>el, C., Nelson, K.E., <strong>and</strong> Tümmler, B. (2002) Global<br />

features of the Pseudomonas putida KT2440 genome<br />

sequence. Environ Microbiol 4: 809–818.<br />

Willenbrock, H., <strong>and</strong> Ussery, D.W. (2007) Prediction of highly<br />

expressed genes <strong>in</strong> microbes based on chromat<strong>in</strong><br />

accessibility. BMC Mol Biol 8: 11.<br />

Willenbrock, H., Friis, C., Juncker, A.S., <strong>and</strong> Ussery, D.W.<br />

(2006) An environmental signature for 323 microbial<br />

genomes based on codon adaptation <strong>in</strong>dices. Genome Biol<br />

7: R114.<br />

Worn<strong>in</strong>g, P., Jensen, L.J., Hall<strong>in</strong>, P.F., Staerfeldt, H.H., <strong>and</strong><br />

Ussery, D.W. (2006) Orig<strong>in</strong> of replication <strong>in</strong> circular<br />

prokaryotic chromosomes. Environ Microbiol 8: 353–<br />

361.<br />

Yakimov, M.M., Golysh<strong>in</strong>, P.N., Lang, S., Moore, E.R.,<br />

Abraham, W.R., Lunsdorf, H., <strong>and</strong> Timmis, K.N. (1998)<br />

Alcanivorax borkumensis General nov., sp. nov., a new,<br />

hydrocarbon-degrad<strong>in</strong>g <strong>and</strong> surfactant-produc<strong>in</strong>g mar<strong>in</strong>e<br />

bacterium. Int J Syst Bacteriol 48: 339–348.<br />

©2007TheAuthors<br />

Journal compilation © 2007 Society for Applied Microbiology <strong>and</strong> Blackwell Publish<strong>in</strong>g Ltd, Environmental Microbiology


2.9 Paper IV: The origins of Vibrio species


Microb Ecol
DOI 10.1007/s00248-009-9596-7

MINIREVIEWS

On the Origins of a Vibrio Species

Tammi Vesth & Trudy M. Wassenaar & Peter F. Hallin & Lars Snipen & Karin Lagesen & David W. Ussery

Received: 3 July 2009 / Accepted: 17 September 2009
© The Author(s) 2009. This article is published with open access at Springerlink.com

Abstract Thirty-two genome sequences of various Vibrionaceae members are compared, with emphasis on what makes V. cholerae unique. As few as 1,000 gene families are conserved across all the Vibrionaceae genomes analysed; this number roughly doubles for gene families conserved within the species V. cholerae. Of these, approximately 200 gene families that cluster at various locations of the genome are not found in other sequenced Vibrionaceae; these are possibly unique to the V. cholerae species. By comparing gene family content of the analysed genomes, the relatedness to a particular species is identified for two unspeciated genomes. Conversely, two genomes presumably belonging to the same species have suspiciously dissimilar gene family content. We are able to identify a number of genes that are conserved in, and unique to, V. cholerae. Some of these genes may be crucial to the niche adaptation of this species.

T. Vesth : T. M. Wassenaar : P. F. Hallin : L. Snipen : K. Lagesen : D. W. Ussery (*)
Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Building 208, 2800 Kgs. Lyngby, Denmark
e-mail: dave@cbs.dtu.dk

T. M. Wassenaar
Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany

P. F. Hallin
Novozymes A/S, Krogshøjvej 36, 2880 Bagsværd, Denmark

L. Snipen
Biostatistics, Department of Chemistry, Biotechnology, and Food Sciences, Norwegian University of Life Sciences, Ås, Norway

K. Lagesen
Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, University of Oslo, Oslo, Norway

Introduction

The species concept for bacteria has long been under siege from several angles, and now with thousands of bacterial genomes being sequenced, the disputes have intensified [8]. One frequently used definition of a bacterial species is “a category that circumscribes a (preferably) genomically coherent group of individual isolates/strains sharing a high degree of similarity in (many) independent features, comparatively tested under highly standardized conditions” [12]. Such independent features are usually phenotypes that can easily be tested. For a new species to be defined, amongst other criteria, inter-species DNA–DNA hybridisation has to be below 70%, although this rule is not without its limitations [18]. In the late 1970s and 1980s, the 16S rRNA gene sequence was introduced as a molecular clock that could be used to infer phylogenetic relationships [50]. Ideally, isolates belonging to the same species have identical or nearly identical 16S rRNA genes, and these differ from those of isolates belonging to different species [32, 44]. In practice, this is not always the case. Examples exist of different species sharing identical rRNA genes (for instance, E. coli and Shigella [37], which are even placed in different genera); in addition, isolates of one species can have rRNA genes that differ beyond the 97% identity threshold considered to demarcate species [4]. Lateral transfer of genetic material (to which ribosomal genes are believed to be resistant) destroys the phylogenetic relationship, so that phylogenies based on alternative housekeeping genes can differ from a 16S rRNA tree and frequently are not even in accordance with each other. Such observations question the validity of a phylogenetic tree as the most suitable model for bacterial ancestry, when multiple genetic transfers would produce a network-like evolutionary structure [6]. On the other hand, it is observed that lateral gene transfer is most frequent between genetically related members sharing a similar base content and occupying the same ecological niche [29]. Nevertheless, a core of genes can be recognised that produce coherent phylogenetic trees, though these may not represent the species’ complete evolutionary history, as they comprise only a minor fraction of the genetic content of the organism [35].

Whether a tree or a network describes phylogeny more accurately, in either case a bacterial species may be considered as a cloud of isolates having a higher level of genetic similarity to each other than to organisms belonging to a different species. When such clouds have fuzzy and overlapping borders, the species concept falls apart, but that applies only to certain cases [7]. Since 16S rRNA genes are not informative on the level of diversity within a species, the 'density' of a cloud of isolates making up a species cannot be determined by this gene. Those genes shared by all isolates belonging to one species comprise the core genome of that species [39], and the degree of diversity in the remaining non-core genes determines the density of the species cloud.

We hypothesised that certain genes can be recognised as specific to a particular species, being conserved in that species but not present in related species. We tested our hypothesis with complete genome sequences of the bacterial family Vibrionaceae, which belongs to the γ-Proteobacteria and comprises eight genera. Most available genome sequences belong to the genus Vibrio. This genus contains 51 recognised species [10, 46], which are mainly found in marine environments, frequently living in association with marine organisms such as corals, fish, squid or zooplankton. Most of them are symbionts and only a few are human pathogens, notably particular serotypes of V. cholerae (producing cholera), Vibrio parahaemolyticus (causing gastroenteritis) and V. vulnificus (causing wound infections) [46]. Other Vibrionaceae, including V. vulnificus, Aliivibrio salmonicida and V. harveyi, are fish or shellfish pathogens and have major economic impact. Photobacterium profundum, representing another genus within the Vibrionaceae, was also included.

The gene content of 32 available sequenced Vibrionaceae genomes was compared and the results were analysed in various ways. The data allowed us to identify possible V. cholerae-specific genes, since this species was represented by 18 genomes, a sufficient number to test conservation both within the species and across species. We found that a two-component signal transduction pathway is conserved in V. cholerae but is not found outside this species. Our findings further indicated that possibly a relatively small set of genes could confer niche specialisation, allowing V. cholerae to adapt to a unique environment, so that over time V. cholerae has become a distinct species.

Materials and Methods

Genomes and Gene Annotations Used

Publicly available genome sequences of Vibrionaceae were selected that were provided in fewer than 300 contigs and in which a full-length 16S rRNA sequence could be found using the rRNA gene finder RNAmmer [19]. The 32 genome sequences included are shown in Table 1.

The gene annotations as provided in GenBank were used, except for those genomes marked “Easygene” in Table 1, for which protein annotation was not available in the RefSeq file at the time of analysis; for these we used EasyGene [20] to identify the genes. As a control, an available GenBank annotation was compared to a generated EasyGene annotation to confirm that the number of identified genes was comparable.

Ribosomal RNA Analysis

RNAmmer [19] was used to identify 16S rRNA sequences within the 32 genomes. Sequences were considered reliable if they were between 1,400 and 1,700 nucleotides long and had an RNAmmer score above 1,800. In cases where the program found multiple and variable 16S sequences within a genome, one of these (with a satisfactory RNAmmer score) was arbitrarily chosen. The sequences were aligned using PRANK [23, 24], and the program MEGA4 was used to construct a phylogenetic tree [45]. Within MEGA4, the tree was created using the Neighbor-Joining method with the uniform rate Jukes–Cantor distance measure and the complete-delete option. Five hundred resamplings were done to find the bootstrap values.
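The reliability filter and the distance measure named above can be illustrated with a short sketch. This is not the RNAmmer or MEGA4 implementation; the function names are hypothetical, the thresholds are the ones quoted in the text, and the Jukes–Cantor formula d = -3/4 ln(1 - 4p/3) is the standard uniform-rate form, applied here after removing every alignment column that contains a gap (our reading of the complete-delete option).

```python
import math

def is_reliable_16s(length, score, min_len=1400, max_len=1700, min_score=1800):
    """Filter from the text: 1,400 to 1,700 nt long and a score above 1,800."""
    return min_len <= length <= max_len and score > min_score

def complete_deletion(alignment):
    """Drop every alignment column in which at least one sequence has a gap."""
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] != "-" for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]

def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3), p = fraction of differing sites."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned and non-empty")
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy alignment (made-up sequences, not data from the paper)
toy = complete_deletion(["ACGT-ACGTA", "ACGTTACGTT", "ACCT-ACGTA"])
d01 = jukes_cantor(toy[0], toy[1])
```

A pairwise matrix of such distances is what a Neighbor-Joining implementation would take as input; the tree construction and bootstrap resampling themselves are left out of the sketch.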

Table 1 Vibrionaceae genomes used in this analysis

| GPID | Organism | Contigs | Accession/GenBank | Status | No. of genes | Ref. |
|---|---|---|---|---|---|---|
| 36 | V. cholerae N16961ᵃ | 2 | AE003852.1 | Fully sequenced | 3,828 | [15] |
| 15667 | V. cholerae O395 TIGRᵃ | 2 | CP000626.1 | Fully sequenced | 3,875 | [11] |
| 32853 | V. cholerae O395 TEDAᵃ | 2 | CP001235.1 | Fully sequenced | 3,934 | [49] |
| 33555 | V. cholerae MJ-1236ᵃ | 2 | CP001485.1 | Fully sequenced | 3,774 | [31] |
| 15666 | V. cholerae MO10ᵃ | 153 | NZ_AAKF00000000 | Unfinished (Easygene) | 3,421 | [5] |
| 15670 | V. cholerae V52ᵃ | 268 | NZ_AAKJ00000000 | Unfinished (NCBI) | 3,815 | [16] |
| 33559 | V. cholerae BX330286ᵃ | 8 | NZ_ACIA00000000 | Unfinished (NCBI) | 3,632 | [31] |
| 33557 | V. cholerae B33ᵃ | 17 | NZ_ACHZ00000000 | Unfinished (NCBI) | 3,748 | [31] |
| 33553 | V. cholerae RC9ᵃ | 11 | NZ_ACHX00000000 | Unfinished (NCBI) | 3,811 | [31] |
| 32851 | V. cholerae M66-2 | 2 | CP001233.1 | Fully sequenced | 3,693 | [49] |
| 18495 | V. cholerae MZO-2 | 162 | NZ_AAWF00000000 | Unfinished (NCBI) | 3,425 | [16] |
| 18265 | V. cholerae 1587 | 254 | NZ_AAUR00000000 | Unfinished (NCBI) | 3,758 | [16] |
| 18253 | V. cholerae 2740-80 | 257 | NZ_AAUT00000000 | Unfinished (NCBI) | 3,771 | [16] |
| 17723 | V. cholerae AM-19226 | 154 | NZ_AATY00000000 | Unfinished (Easygene) | 3,407 | [33] |
| 33561 | V. cholerae 12129 | 12 | NZ_ACFQ00000000 | Unfinished (NCBI) | 3,574 | [31] |
| 33549 | V. cholerae VL426 | 5 | NZ_ACHV00000000 | Unfinished (NCBI) | 3,461 | [31] |
| 33579 | V. cholerae TM 11079-80 | 35 | NZ_ACHW00000000 | Unfinished (NCBI) | 3,621 | [31] |
| 33551 | V. cholerae TMA 21 | 20 | NZ_ACHY00000000 | Unfinished (NCBI) | 3,600 | [31] |
| 13564 | V. campbellii AND4 | 143 | NZ_ABGR00000000 | Unfinished (NCBI) | 3,935 | [13] |
| 19857 | V. harveyi BAA-1116 | 3 | CP000789.1 | Fully sequenced | 6,064 | [1] |
| 349 | V. vulnificus CMCP6 | 2 | AE016795.2 | Fully sequenced | 4,538 | [38] |
| 1430 | V. vulnificus YJ016 | 3 | BA000037.2 | Fully sequenced | 5,028 | [3] |
| 19397 | V. shilonii AK1 | 158 | NZ_ABCH00000000 | Unfinished (NCBI) | 5,360 | [41] |
| 15693 | Vibrio sp. Ex25 | 222 | NZ_AAKK00000000 | Unfinished (Easygene) | 4,004 | [16] |
| 13616 | Vibrio sp. MED222 | 99 | NZ_AAND00000000 | Unfinished (NCBI) | 4,590 | [36] |
| 32815 | V. splendidus LGP32 | 2 | FM954973.1 | Fully sequenced | 4,434 | [27] |
| 19395 | V. parahaemolyticus 16 | 78 | NZ_ACCV00000000 | Unfinished (Easygene) | 3,780 | [9] |
| 360 | V. parahaemolyticus 2210633 | 2 | BA000031.2 | Fully sequenced | 4,832 | [25] |
| 12986 | A. fischeri ES114 | 3 | CP000020.1 | Fully sequenced | 3,823 | [42] |
| 19393 | A. fischeri MJ11 | 3 | CP001133.1 | Fully sequenced | 4,039 | [26] |
| 30703 | A. salmonicida LFI1238 | 6 | FM178379.1 | Fully sequenced | 4,284 | [17] |
| 13128 | P. profundum SS9 | 3 | CR354531.1 | Fully sequenced | 5,480 | [48] |

GPID: genome project identifier at NCBI. Contigs: the number of contiguous sequences, which for a completely sequenced genome is at least two (for two chromosomes) and can be up to six when plasmids are present. Unfinished sequences are represented by multiple contigs per chromosome.
ᵃ Strains containing the genes encoding the cholera enterotoxin subunits are indicated

Pan-Genome Family Clustering

A clustering of the genomes by shared gene families from the Vibrio pan-genome was constructed from BLASTP similarities obtained with default settings. A BLASTP hit was considered significant if the alignment produced at least 50% identity for at least 50% of the length of the longer gene (either query or subject). Using this criterion, each pair of genes producing a significant reciprocal best hit was scored as belonging to the same gene family. A genome matrix was constructed, containing one row for each genome and one column for each gene family. Cell (i, j) in this matrix is 1 if genome i has a member in gene family j, and 0 otherwise. A hierarchical clustering with average linkage, based on the Manhattan distance between genomes, was then performed. Two trees were made, one with more weight given to gene families present in most (90%, or between 27 and 30) Vibrio genomes (“stabilome”), and the other with more weight given to gene families present in only a few (two, three, or four) genomes (“mobilome”). Thus, the original Boolean matrix is scaled differently in each case, depending on the number of genomes in each gene family [44]. For both trees, singletons (families found in only one genome) were excluded.
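The matrix construction, weighting and clustering described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the text does not give the actual weighting scheme, so the weights used here (1.0 for emphasised families, 0.1 for the others, 0 for singletons) are hypothetical, as is the normalisation of the Manhattan distance by the total weight and every function name.

```python
from itertools import combinations

def presence_matrix(genome_families, families):
    """One row per genome, one column per family: 1 if the genome has a member."""
    return {g: [1 if f in fams else 0 for f in families]
            for g, fams in genome_families.items()}

def family_weights(matrix, emphasise, n_genomes):
    """Hypothetical weights: 1.0 for emphasised families, 0.1 otherwise, 0 for singletons."""
    counts = [sum(col) for col in zip(*matrix.values())]
    weights = []
    for c in counts:
        if c <= 1:
            weights.append(0.0)  # singletons are excluded from both trees
        elif emphasise == "stabilome":
            weights.append(1.0 if c >= 0.9 * n_genomes else 0.1)
        else:  # "mobilome": families found in two, three or four genomes
            weights.append(1.0 if 2 <= c <= 4 else 0.1)
    return weights

def relative_manhattan(row_a, row_b, weights):
    """Weighted Manhattan distance, scaled by the total weight (an assumption)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("no informative gene families")
    return sum(w * abs(a - b) for a, b, w in zip(row_a, row_b, weights)) / total

def average_linkage(names, dist):
    """Naive average-linkage clustering; dist maps frozenset({a, b}) to a distance."""
    def between(c1, c2):
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((c1, c2, between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Toy input (made-up family identifiers, not data from the paper)
genomes = {"A": {"f1", "f2", "f3"}, "B": {"f1", "f2", "f4"}, "C": {"f1", "f5"}}
fams = sorted(set().union(*genomes.values()))
matrix = presence_matrix(genomes, fams)
weights = family_weights(matrix, "mobilome", len(genomes))
dist = {frozenset((a, b)): relative_manhattan(matrix[a], matrix[b], weights)
        for a, b in combinations(matrix, 2)}
merges = average_linkage(list(matrix), dist)
```

For 32 genomes the naive loop is fast enough; a library routine for hierarchical clustering would be the usual choice for larger inputs.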

Figure 1 Phylogenetic tree of the 16S rRNA gene extracted from 32 sequenced Vibrio genomes listed in Table 1. Environmental V. cholerae lacking the cholera enterotoxin genes are highlighted in bright green, whilst pathogenic V. cholerae genomes are in dark green. Further colouring was used for species for which two genomes are represented

Pan- and Core Genome Analysis

The results of the BLAST analysis were also used to construct a pan- and core genome plot as follows. Based on clusterings from the pan-genome family tree, an ordered set of genomes was constructed with V. cholerae genomes at the start. For the first chosen genome, all BLAST hits found in the second genome were recorded and the cumulative number of gene families (as defined above) now recognised in total was plotted for the pan-genome. The number of gene families with at least one representative gene in both genomes was plotted for the core genome. A running total is plotted for the pan-genome, which increases as more genomes are added, whilst the core genome, representing conserved gene families, slowly decreases with the addition of more genomes.
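The running totals described above amount to a cumulative union (pan-genome) and a cumulative intersection (core genome) over the ordered genomes. A minimal sketch, assuming each genome has already been reduced to a set of gene family identifiers; the function name and the toy data are hypothetical:

```python
def pan_core_curve(ordered_genomes):
    """ordered_genomes: list of (name, set of family ids) in plotting order.

    Returns one (name, pan_size, core_size) tuple per genome added.
    """
    pan = set()
    core = None
    curve = []
    for name, families in ordered_genomes:
        pan |= families                      # families seen in any genome so far
        core = set(families) if core is None else core & families  # seen in all so far
        curve.append((name, len(pan), len(core)))
    return curve

# Toy input (made-up family identifiers, not data from the paper)
curve = pan_core_curve([
    ("g1", {1, 2, 3}),
    ("g2", {2, 3, 4}),
    ("g3", {3, 5}),
])
```

By construction the pan-genome count never decreases and the core count never increases as genomes are added, which matches the behaviour described for the plot.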

Whole-Genome BLAST Analysis and Construction of a BLAST Matrix

The predicted genes of every genome (annotated or found by EasyGene) were translated, and every gene was compared by BLASTP against every other genome and against its own genome. In the latter case, the hit to self was ignored. The 50/50 rule for BLAST hits as described above was used. If these requirements were met, genes were combined in a gene family. The BLAST results were visualised in a BLAST matrix [2], which summarises the results of genomic pairwise comparisons and reports, both as percentage and as absolute numbers, the number of reciprocal BLAST hits as a fraction of the total number of gene families found in the two genomes. For easier visual inspection, the cells in the matrix are coloured darker as the fraction of similarity increases. Hits identified within a genome are differently coloured.
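As a rough illustration of the 50/50 rule, the reciprocal best hit criterion and the content of one matrix cell, consider the sketch below. It is not the published BLAST matrix code: the function names, the input dictionaries and the cell formatting are assumptions made for the example.

```python
def passes_50_50(identity_pct, aln_len, query_len, subject_len):
    """At least 50% identity over at least 50% of the longer of the two sequences."""
    return identity_pct >= 50.0 and aln_len >= 0.5 * max(query_len, subject_len)

def reciprocal_best_hits(best_ab, best_ba):
    """best_ab maps each gene of genome A to its best significant hit in B; best_ba is the reverse."""
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}

def matrix_cell(shared_families, total_families):
    """Report shared families as a percentage and as absolute numbers."""
    pct = 100.0 * shared_families / total_families
    return f"{pct:.1f} % ({shared_families:,} / {total_families:,})"

# Toy input (made-up gene identifiers)
pairs = reciprocal_best_hits({"a1": "b1", "a2": "b2"}, {"b1": "a1", "b2": "a3"})
cell = matrix_cell(1, 2)
```

Here `total_families` stands for the number of gene families found in the two genomes combined, following the description in the text.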

BLAST Atlas

BLAST results were also visualised in a BLAST atlas, this time showing, for all genes in the reference genome V. cholerae N16961, their best hit in all other genomes, again with a threshold of 50% identity over at least 50% of the length of the query protein. The atlas displays the hits as they are located in the reference strain [14]. The BLAST scores obtained for each queried gene are plotted, so that conserved and variable regions are located with respect to the reference genome. Note that genes absent in the reference genome are not shown in the lanes of the query genomes.
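The data behind such an atlas can be thought of as one lane per query genome, holding a score for every reference gene in chromosomal order. The sketch below is an assumption-laden illustration, not the BLAST atlas software: the hit records, their field names and the use of 0.0 for "no qualifying hit" are all hypothetical.

```python
def atlas_lanes(reference_genes, hits, query_genomes):
    """Build one lane of best-hit scores per query genome.

    reference_genes: list of (gene_id, protein_length) in chromosomal order.
    hits: list of dicts with keys ref_gene, genome, identity_pct, aln_len, bitscore.
    """
    lengths = dict(reference_genes)
    best = {g: {gene: 0.0 for gene, _ in reference_genes} for g in query_genomes}
    for h in hits:
        gene, genome = h["ref_gene"], h["genome"]
        if genome not in best or gene not in lengths:
            continue  # genes absent from the reference never appear in a lane
        # Threshold from the text: 50% identity over 50% of the query protein length
        ok = h["identity_pct"] >= 50.0 and h["aln_len"] >= 0.5 * lengths[gene]
        if ok and h["bitscore"] > best[genome][gene]:
            best[genome][gene] = h["bitscore"]
    return {g: [lane[gene] for gene, _ in reference_genes] for g, lane in best.items()}

# Toy input (made-up gene identifiers and scores)
ref = [("geneA", 200), ("geneB", 100)]
toy_hits = [
    {"ref_gene": "geneA", "genome": "Q1", "identity_pct": 80.0, "aln_len": 150, "bitscore": 310.0},
    {"ref_gene": "geneA", "genome": "Q1", "identity_pct": 95.0, "aln_len": 90, "bitscore": 400.0},
    {"ref_gene": "geneB", "genome": "Q1", "identity_pct": 45.0, "aln_len": 100, "bitscore": 120.0},
    {"ref_gene": "geneX", "genome": "Q1", "identity_pct": 99.0, "aln_len": 100, "bitscore": 500.0},
]
lanes = atlas_lanes(ref, toy_hits, ["Q1", "Q2"])
```

Because the lanes are indexed by reference gene, a query genome with no qualifying hit simply gets a lane of zeros, and genes that exist only in a query genome are dropped, as noted in the text.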

Results

Ribosomal RNA Analysis

A phylogenetic tree based on the 16S rRNA gene extracted from the 32 analysed Vibrionaceae genomes is shown in Fig. 1. The 18 V. cholerae genomes form a tight subcluster, quite distant from the other species. Above this in the figure, another subcluster comprising eight genomes representing at least six species is recognised, and within this cluster the two V. parahaemolyticus genes are not found on the same branch. A third cluster, a bit further removed, includes Aliivibrio fischeri and A. salmonicida as well as V. splendidus and Vibrio sp. MED222; the gene of Photobacterium profundum is the most distant.

Pan-Genome Family Trees<br />

99<br />

99<br />

100<br />

48<br />

100<br />

100<br />

98<br />

98<br />

67<br />

Vibrio harveyi ATCC BAA1116<br />

Vibrio parahaemolyticus RIMD2210633<br />

Vibrio vulnificus CMCP6<br />

Vibrio vulnificus YJ016<br />

Vibrio sp MED222<br />

Vibrio splendidus LGP32<br />

Vibrio shilonii AK1<br />

Vibrio sp Ex25<br />

Vibrio parahaemolyticus 16<br />

Vibrio campbellii AND4<br />

Aliivibrio fischeri<br />

Aliivibrio fischeri<br />

Aliivibrio salmonicida LFI1238<br />

Photobacterium profundum SS9<br />

Vibrio cholerae 1587<br />

Vibrio cholerae AM 19226<br />

Vibrio cholerae MO10<br />

Vibrio cholerae B33VCE<br />

Vibrio cholerae MJ1236<br />

Vibrio cholerae RC9<br />

Vibrio cholerae BX330286<br />

Vibrio cholerae M662<br />

Vibrio cholerae O395 TEDA<br />

Vibrio cholerae N16961<br />

Vibrio cholerae O395 TIGR<br />

Vibrio cholerae 12129<br />

Vibrio cholerae TMA21<br />

Vibrio cholerae V52<br />

Vibrio cholerae TM1107980<br />

Vibrio cholerae 2740 80<br />

Vibrio cholerae VL426<br />

Vibrio cholerae MZO 2<br />

Starting with a database containing the total set of all Vibrio<br />

gene families, a profile of matching gene families was<br />

constructed for each individual genome. This was stored as<br />

a matrix containing a column for each gene family and a<br />

row for each genome. Each entry is a 1 or a 0,<br />

representing the presence or absence of the gene family.<br />

This matrix was weighted to emphasise either the genes<br />

found in most genomes (the “stabilome”) or in only a few<br />

genomes (the “mobilome”); from these weighted matrices,<br />

clustering on gene family content yielded the trees shown<br />

in Fig. 2. Shorter distances represent genomes with many<br />

gene families in common, and larger distances reflect<br />

genomes with fewer gene families in common. As<br />

expected, in both trees, genomes from the same species<br />

cluster together, and the depth of resolution within a<br />

species is considerably better than can be seen in the 16S<br />

rRNA tree in Fig. 1. Similarity between the unspeciated<br />

Figure 2 Pan-genome family clustering of the 32 Vibrio genome<br />

sequences. The two plots represent weighted values for genes present<br />

in at least 90% of the genomes (stabilome) or genes found in only a<br />

[Figure 2 panel: Vibrio, mobilome; scale: relative Manhattan distance, 0.20 to 0.00]<br />

Vibrio isolate MED222 and V. splendidus is suggested by<br />

their close clustering; this is a connection also suggested by<br />

others [21]. Note that the unspeciated Vibrio isolate Ex25<br />

and V. parahaemolyticus 2210633 cluster together in the<br />

mobilome tree, but are more distant in the stabilome. This<br />

implies that the genes shared between these two genomes<br />

are among the less common genes within the Vibrio genomes<br />

examined here. As already indicated by the 16S rRNA<br />

tree, the two V. parahaemolyticus isolates are quite<br />

dissimilar, and appear on separate branches. The Aliivibrio<br />

cluster is placed within Vibrio genomes in both the<br />

stabilome and the mobilome, as was the case for their 16S<br />

rRNA gene. P. profundum is not such an outlier as in the<br />

16S rRNA tree, and in the stabilome it is even positioned<br />

close to the Aliivibrio genomes. Zooming in on the genomes<br />

of V. cholerae, a division into two subclusters can be seen;<br />

these clusters correspond to environmental vs. clinical<br />

isolates (with the exception of V52 in the stabilome).<br />
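The profile matrix and the weighted distances behind the two trees can be sketched on toy data. This is a minimal illustration, not the authors' pipeline: the text does not give the weighting formula, so the sketch assumes that the stabilome weight of a gene family is the fraction of genomes carrying it and the mobilome weight is one minus that fraction. The genome names, the profiles and the helper functions `family_frequencies`, `weighted_manhattan` and `distance_table` are all hypothetical.

```python
from itertools import combinations

# Toy presence/absence profiles: one row per genome, one column per gene family.
# Genome names and values are invented for illustration only.
profiles = {
    "genome_A": [1, 1, 1, 0, 0],
    "genome_B": [1, 1, 1, 1, 0],
    "genome_C": [1, 1, 0, 0, 1],
}


def family_frequencies(profiles):
    """Fraction of genomes in which each gene family is present."""
    rows = list(profiles.values())
    return [sum(column) / len(rows) for column in zip(*rows)]


def weighted_manhattan(a, b, weights):
    """Weighted Manhattan distance between two profiles, scaled to 0..1."""
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights)) / total


def distance_table(profiles, weights):
    """Pairwise relative distances for every pair of genomes."""
    return {
        (g1, g2): weighted_manhattan(profiles[g1], profiles[g2], weights)
        for g1, g2 in combinations(profiles, 2)
    }


freq = family_frequencies(profiles)
stabilome_weights = freq                     # assumed: common families count more
mobilome_weights = [1.0 - f for f in freq]   # assumed: rare families count more

stabilome_dist = distance_table(profiles, stabilome_weights)
mobilome_dist = distance_table(profiles, mobilome_weights)

for pair in stabilome_dist:
    print(pair, round(stabilome_dist[pair], 3), round(mobilome_dist[pair], 3))
```

A hierarchical clustering method applied to either distance table would then give a tree; the sketch stops at the distances because the clustering method is not named in this section.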

Pan- and Core Genome Plot<br />


BLAST results were analysed to construct a pan-genome,<br />

which is a hypothetical collection of all the gene families<br />

that are found in the investigated genomes [28]. The core<br />

genome was constructed from all gene families that were<br />

represented at least once in every genome. Thus, the gene<br />

families conserved in all genomes represent their core<br />

genome; adding the remaining gene families produces the<br />

few (two to four) genomes (mobilome). The colours highlighting the<br />

species are the same as in Fig. 1


Figure 3 Pan- and core genome plot of the 32 Vibrionaceae genomes. The colours highlighting species are the same as in Fig. 1<br />

[Figure 3 series: pan genome, core genome, new gene families; vertical scale: 0 to 25,000]<br />

pan-genome. The resulting pan- and core genome plot is<br />

shown in Fig. 3. The genomes start with the documented<br />

clinical isolates of V. cholerae and then follow the order<br />

suggested by the pan-genome family clustering (Fig. 2),<br />

although genomes from the same species were kept<br />

together (the two V. parahaemolyticus genomes were split<br />

in the trees). As more genomes are added to the plot, the<br />

number of gene families in the pan-genome (blue line)<br />

increases, and the number of conserved gene families (red<br />

line) in the core genome decreases, albeit at a lower rate.<br />

This is because every genome can add many novel (and<br />

frequently different) genes to the pan-genome but only<br />

reduces the core genome by the few genes that are absent<br />


T. Vesth et al.<br />

in that particular strain but that were conserved in the<br />

previously analysed genomes. The pan-genome curve<br />

increases with a relatively steep slope when a novel species<br />

is added, as is obvious when a V. parahaemolyticus genome<br />

is added after the last V. cholerae. A stable plateau can be<br />

seen for the pan-genome of V. cholerae around 6,500 genes.<br />

Nevertheless, a small increase occurs when adding V.<br />

cholerae 1587; this is caused by the difference between<br />

the two subclusters of V. cholerae seen in Fig. 2. V.<br />

cholerae strain 2740-80 behaves atypically in all the figures<br />

shown; although documented as an environmental isolate, it<br />

appears closer to the clinical isolates, in terms of overall<br />

genomic properties.


Orig<strong>in</strong>s of V. cholerae<br />

Figure 4 BLAST matrix of<br />

the 32 Vibrionaceae genomes.<br />

The colours highlighting the<br />

species are the same as in Fig. 1.<br />

Since the reciprocal similarity<br />

(reported as percent) is not<br />

readable at this resolution, every<br />

matrix cell is coloured using the<br />

scales as indicated. The bottom<br />

row identifies hits (other than<br />

hits-to-self) found within a genome.<br />

Four matrix cells reporting<br />

high pairwise similarities are<br />

outlined; their numbers are<br />

specified in the text<br />

[Figure 4 colour scales: homology between proteomes, 30.0% to 90.0%; homology within proteomes, 0.0% to 6.0%]<br />


[Figure 5 labels: V. cholerae O1 El Tor N16961 chromosome 1 (2,961,149 bp) and chromosome 2 (1,072,310 bp); marked regions: gaps A to G and the superintegron. Lanes from the outer to the inner circle: the genomes from P. profundum SS9 (outermost) to V. cholerae N16961, then genes on the positive strand, genes on the negative strand, stacking energy, position preference, global direct repeats and GC skew]<br />


When the first genome of A. fischeri is added, which is<br />

not a member of the Vibrio genus, it does not add<br />

significantly more novel genes to the pan-genome than<br />

Vibrio genomes did. This contrasts with P. profundum,<br />

which produces a sharp increase in the pan-genome, as<br />

does, interestingly, V. shilonii. Note that there are approximately<br />

20,200 total gene families within the 32 sequenced<br />

Vibrionaceae genomes, whereas the core genome decreases<br />

to approximately 1,000 gene families.<br />
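The way the pan- and core genome curves arise can be sketched with plain set operations: as genomes are added in order, the pan-genome is the running union of gene families, the core genome is the running intersection, and the new gene families are those not seen in any earlier genome. This is a minimal sketch under the assumption that each genome has already been reduced to a set of gene family identifiers; the function `pan_core_curve`, the strain names and the family identifiers are hypothetical.

```python
def pan_core_curve(genomes):
    """Accumulate pan- and core genome sizes over an ordered list of genomes.

    genomes: list of (name, set of gene family identifiers), in plot order.
    Returns one tuple per genome: (name, pan size, core size, new families).
    """
    pan = set()
    core = None
    rows = []
    for name, families in genomes:
        new = families - pan          # families not seen in any earlier genome
        pan |= families               # pan-genome: union of everything so far
        # core genome: families present in every genome added so far
        core = set(families) if core is None else core & families
        rows.append((name, len(pan), len(core), len(new)))
    return rows


# Invented toy input; the real analysis uses 32 genomes and thousands of families.
genomes = [
    ("strain_1", {"f1", "f2", "f3", "f4"}),
    ("strain_2", {"f1", "f2", "f3", "f5"}),
    ("strain_3", {"f1", "f2", "f6", "f7", "f8"}),
]

curve = pan_core_curve(genomes)
for row in curve:
    print(row)
```

In this toy input the third genome adds three new families but removes only one from the core, which mirrors the observation that the pan-genome grows faster than the core genome shrinks.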

BLAST Comparison Visualised in a BLAST Matrix

A BLAST matrix provides a visual overview of reciprocal pairwise whole-genome comparisons, as shown in Fig. 4. The more strongly a matrix cell is coloured, the greater the similarity detected between the gene content of two genomes. As can be seen in the lower right triangle, all V. cholerae genomes are highly similar, with similarity ranging between 64% and 93% for any given pair of genomes. No statistically significant difference was observed when comparing clinical isolates to environmental isolates. The two A. fischeri and the two V. vulnificus genomes also share a high degree of identity within their species (75% and 67%, respectively), visible at the bottom of the matrix. In contrast, the two V. parahaemolyticus genomes share only 35% identity, which is not higher than the similarity detected between genomes of different species. With 72% similarity, isolate MED222 most closely matches V. splendidus, and with 65% isolate EX25 again shares most similarity with V. parahaemolyticus 2210633.
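How a single matrix cell might be computed can be sketched as below. The sketch assumes that a cell reports the gene families shared by two genomes as a percentage of all families found in the pair; this definition is an assumption, since the exact criterion for a reciprocal hit is not given in this section. The genome labels and family IDs are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def matrix_cell(a: Set[str], b: Set[str]) -> Tuple[int, int, float]:
    """Return (shared families, total families, percentage) for one genome pair."""
    shared = len(a & b)
    total = len(a | b)
    percentage = 100.0 * shared / total if total else 0.0
    return shared, total, percentage


def blast_matrix(
    families: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], Tuple[int, int, float]]:
    """Compute one cell for every unordered pair of genomes."""
    return {
        (x, y): matrix_cell(families[x], families[y])
        for x, y in combinations(sorted(families), 2)
    }


# Hypothetical toy input: genome label -> set of gene-family IDs.
toy_families = {
    "A": {"f1", "f2", "f3", "f4"},
    "B": {"f1", "f2", "f3", "f5"},
    "C": {"f1", "f6"},
}
cells = blast_matrix(toy_families)
for pair, (shared, total, pct) in cells.items():
    print(pair, f"{pct:.1f} %", f"{shared} / {total}")
```

Under this definition, two isolates of the same species would be expected to give a high percentage, whereas a pair such as the two V. parahaemolyticus genomes described above would give a value closer to that of an inter-species comparison.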

BLAST Atlas

A BLAST atlas was constructed using V. cholerae N16961 (O1, El Tor) as the reference genome, shown in Fig. 5. The best BLAST hits identified in the query genomes are plotted in the lanes around the reference genome, with different colours for different species. In general, chromosome 1 is more strongly conserved than chromosome 2. A large part of chromosome 2 of N16961 displays very little conservation in the other genomes; this area represents a superintegron [40] that contains the V. cholerae-specific repeat (VCR) sequences, as well as a high number of gene cassettes. The repeat sequences are visible as black boxes in the repeat lane of the reference genome (second inner lane). Although all V. cholerae genomes contain a superintegron, its genes are very diverse between isolates [34], which explains the lack of BLAST hits in this region.

Figure 5 BLAST atlas with V. cholerae strain N16961 as a reference strain, showing chromosomes 1 (top) and 2 (bottom). The best BLAST hits identified with genes from N16961 in the other V. cholerae genomes are represented in dark red, for the location as it appears in N16961. BLAST hits in the other genomes are shown in various colours as indicated to the right. Major areas conserved in V. cholerae but not in other Vibrionaceae are identified as gap B, gap C, gap D and gap F in green; areas that are found in toxigenic V. cholerae only are marked black as gap A, gap E and gap G. The superintegron on chromosome 2 of V. cholerae is also indicated.

Several regions of the atlas have been highlighted. Gaps B, C, D and F on chromosome 1 (indicated in green) contain genes that are conserved in the represented genomes of V. cholerae but not in the other Vibrionaceae. The gaps marked A, E and G indicate regions that are specific to the toxigenic, clinical isolates only. Annotated V. cholerae-specific genes present in all these regions are listed in Table 2 (hypothetical genes are excluded). Genes specific to toxigenic V. cholerae identified in gap A include, amongst others, biosynthesis genes for the toxin co-regulated pilus (which is required for transmission of the prophage CTXΦ carrying the enterotoxin genes), as well as genes encoding citrate lyase. Note that the genes in gap A are also found in the environmental isolate V. cholerae 2740-80.
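The presence/absence logic that separates the two kinds of gap can be sketched as a simple set-based rule. This is an illustration under stated assumptions, not the procedure used to draw the atlas: each reference gene is assumed to come with the set of genomes in which a BLAST hit was found, and all genome labels and category names below are hypothetical.

```python
from typing import Set


def classify_gene(
    hits: Set[str],
    toxigenic: Set[str],
    other_cholerae: Set[str],
    other_species: Set[str],
) -> str:
    """Classify one reference gene by the genomes in which it has a hit."""
    if hits & other_species:
        return "shared"            # also found outside V. cholerae
    if toxigenic <= hits and other_cholerae <= hits:
        return "species-specific"  # pattern described for gaps B, C, D and F
    if toxigenic <= hits and not (hits & other_cholerae):
        return "toxigenic-only"    # pattern described for gaps A, E and G
    return "variable"              # anything in between


# Hypothetical genome groups.
toxigenic = {"tox_1", "tox_2"}
other_cholerae = {"env_1"}
other_species = {"other_1"}

print(classify_gene({"tox_1", "tox_2", "env_1"}, toxigenic, other_cholerae, other_species))
print(classify_gene({"tox_1", "tox_2"}, toxigenic, other_cholerae, other_species))
```

A strict rule of this kind would label the gap A genes "variable", because, as noted above, they also occur in the environmental isolate V. cholerae 2740-80; a practical analysis would probably need to tolerate such exceptions.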

Gap B contains a number of outer membrane protein genes involved in sugar modification that are found in all V. cholerae genomes. Genes from gap C encoding a histidine kinase two-component signal transduction regulatory system are also conserved within the species, as are genes in gaps D and F, which are involved in chemotaxis and possibly multidrug resistance.

Gap E, containing genes conserved in toxigenic strains only, holds the prophage CTXΦ that contains the genes encoding cholera enterotoxin subunits A and B; this enterotoxin is responsible for the excessive, watery diarrhoea typical of cholera. Upon binding to target cell GM1 gangliosides, the enterotoxin enters the cell and stimulates adenylate cyclase by ADP ribosylation. The resultant increased cyclic AMP levels induce excessive electrolyte movement and secretion of sodium and water [43]. Strain M66-2, which lacks the prophage CTXΦ and the enterotoxin genes, is believed to be a precursor of the seventh pandemic V. cholerae [11]. Gap E also bears the RTX toxin operon, which encodes a pore-forming cytotoxin [22]. An RTX toxin is also present in environmental isolate 2740-80 and in V. vulnificus.

Gap G on chromosome 2 consists of a set of five genes, all in the same orientation, in a putative operon, flanked by genes on the complementary strand. This appears to be a remnant of a mobile element, as these genes are flanked by a transposase gene on the 3′ end, and there is a small global repeat on the 5′ end. Only the first two of the five genes have an assigned function, with the first gene being a GMP reductase, and the second a putative DNA methyltransferase. The remaining three genes are hypothetical, but their strikingly strong conservation in all pathogenic strains and complete absence of homologues in the other Vibrio genomes strongly point towards a potential biological significance.

Table 2 A selection of genes located in the gaps marked in Fig. 5

Coordinates  Annotation

Gap A (850000–913000)
852903–851557  Citrate/sodium symporter
853165–854235  Citrate (pro-3S)-lyase ligase
854287–854583  Citrate lyase subunit gamma
854565–855455  Citrate lyase, beta subunit
855391–856995  Citrate lyase, alpha subunit
856992–857528  citX protein
857506–858447  citG protein
869812–866873  Helicase-related protein
870391–869813  Tellurite resistance protein-related
871298–870819  Transcriptional regulator, putative
873242–874225  Transposase, putative
876974–880015  ToxR-activated gene A protein
881390–884728  Inner membrane protein, putative
885773–886267  tagD protein
888405–886543  Toxin co-regulated pilus biosynthesis
888846–889511  Toxin co-regulated pilus biosynthesis
889496–889906  Toxin co-regulated pilus biosynthesis
890449–891123  Toxin co-regulated pilin
891203–892495  Toxin co-regulated pilus biosynthesis
892495–892947  Toxin co-regulated pilus biosynthesis
892950–894419  Toxin co-regulated pilus biosynthesis
894412–894867  Toxin co-regulated pilus biosynthesis
894855–895691  Toxin co-regulated pilus biosynthesis
895707–896165  Toxin co-regulated pilus biosynthesis
896155–897666  Toxin co-regulated pilus biosynthesis
897641–898663  Toxin co-regulated pilus biosynthesis
898673–899689  Toxin co-regulated pilus biosynthesis
899896–900726  TCP pilus virulence regulatory protein
900726–901487  Leader peptidase TcpJ
901494–903374  Accessory colonization factor AcfB
903380–904150  Accessory colonization factor AcfC
904648–905556  tagE protein
906206–905559  Accessory colonization factor AcfA
914124–912856  Phage family integrase

Gap B (975000–1010000)
978644–979144  Phosphotyrosine protein phosphatase
981833–982387  Serine acetyltransferase-related protein
982384–983532  Exopolysacch. biosynth protein EpsF
983529–984938  Polysacch. export protein, putative (gfcE)
986166–986597  Serine acetyltransferase-related protein
986597–987937  capK protein, putative
987913–989010  Polysaccharide biosynthesis protein, putative
1001910–1002437  Polysaccharide export-related protein (gfcE)
1002462–1004675  Putative exopolysacch. biosynth protein

Gap C (1130000–1160000)
1139646–1142912  Chitinase, putative
1147856–1148998  Response regulator
1149033–1149398  Response regulator
1149990–1151309  Sensory box sensor histidine kinase
1151321–1152625  Sensor histidine kinase
1152625–1154235  Response regulator
1154252–1155595  Response regulator
1157228–1155624  Sensor histidine kinase
1158044–1157232  Periplasmic binding protein-related

Gap D (1478000–1520000)
2086826–2087584  CDP-diacylglycerol-glyc.-3phosph-3-phosphatidyltransferase
2087587–2088519  Phosphatidate cytidylyltransferase
2094741–2095604  PvcB protein
2098112–2097183  LysR family transcriptional regulator
2098432–2100258  pvcA protein
2117923–2119977  Methyl-accepting chemotaxis protein
2120575–2120030  Transcriptional regulator
2120663–2121826  Benzoate transport protein

Gap E (1537000–1587500)
1541452–1543170  Sensor histidine kinase/response regulator
1545396–1543231  Toxin secretion transporter, putative
1546802–1545399  RTX toxin transporter
1548919–1546757  RTX toxin transporter
1549662–1550123  RTX toxin activating protein
1550108–1563784  RTX toxin RtxA
1564376–1564152  RstC protein
1564844–1564470  RstB1 protein
1565901–1564822  RstA1 protein
1566027–1566365  Transcriptional repressor RstR
1567341–1566967  Cholera enterotoxin, B subunit
1568114–1567338  Cholera enterotoxin, A subunit
1569412–1568213  Zona occludens toxin
1569702–1569409  Accessory cholera enterotoxin
1571241–1570993  Colonization factor
1571760–1571377  RstB2 protein
1572817–1571738  RstA1 protein
1572943–1573281  Transcriptional repressor RstR
1577272–1575704  Phage replication protein Cri
1582123–1580555  Phage replication protein Cri
1583160–1583513  Transposase OrfAB, subunit A
1583510–1584382  Transposase OrfAB, subunit B

Gap F (1896000–1956000)
1896092–1897327  Phage family integrase
1900831–1898009  Helicase, putative
1903632–1902898  Chemotaxis protein MotB-related
1908858–1905790  Type I restriction enzyme HsdR
1916009–1913628  DNA methylase HsdM, putative
1933231–1935654  Neuraminidase
1936007–1935801  Transcriptional regulator
1936121–1936597  DNA repair protein RadC, putative
1938391–1937519  Transposase OrfAB, subunit B
1938732–1938388  Transposase OrfAB, subunit A
1941671–1941351  Transcriptional regulator, putative
1942032–1941658  Middle operon regulator-related
1944457–1943306  eha protein

Gap G (chromosome II, 21300–223000)
213207–214250  GMP reductase
214574–215725  DNA methyltransferase
220262–219825  IS1004 transposase

All gene annotations are taken from the reference genome V. cholerae strain N16961. Hypothetical proteins were excluded. Gaps A, E and G are conserved in pathogenic strains, whereas gaps B, C, D and F are conserved in all V. cholerae genomes analysed (Figure 1).
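For readers who want to work with the Table 2 entries programmatically, a row can be parsed as sketched below. The sketch rests on an assumption that the table itself does not state: that a coordinate pair listed high-to-low denotes a gene on the reverse strand. The `Feature` type and `parse_row` function are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple


class Feature(NamedTuple):
    start: int       # lower coordinate
    end: int         # higher coordinate
    strand: str      # "+" or "-" (assumed from coordinate order)
    length: int      # span in base pairs, inclusive
    annotation: str


# Two integers separated by an en dash or hyphen, then free-text annotation.
ROW = re.compile(r"^(\d+)\s*[–-]\s*(\d+)\s+(.+)$")


def parse_row(line: str) -> Feature:
    """Parse one 'first–second annotation' row of the table."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"not a coordinate row: {line!r}")
    first, second = int(match.group(1)), int(match.group(2))
    # Assumption: descending coordinates mean the reverse strand.
    strand = "+" if first <= second else "-"
    low, high = sorted((first, second))
    return Feature(low, high, strand, high - low + 1, match.group(3).strip())


symporter = parse_row("852903–851557 Citrate/sodium symporter")
reductase = parse_row("213207–214250 GMP reductase")
print(symporter)
print(reductase)
```

Gap header lines such as "Gap A (850000–913000)" do not match the row pattern and raise a `ValueError`, so a caller would need to handle them separately.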

Discussion

The recent availability of many Vibrionaceae genomes, including a substantial number of V. cholerae genomes, makes it possible to take a closer look at the similarities and differences of species within the genus Vibrio, and to examine, on a genome scale, what distinguishes V. cholerae from the other Vibrio species. Since not all V. cholerae isolates are pathogenic, the presence of the prophage-borne cholera enterotoxin, the main virulence factor for cholera, is not a suitable marker for this species. We attempted to identify a set of V. cholerae-specific genes, and also explored the internal diversity within the V. cholerae genomes that have been sequenced to date.

On a phylogenetic tree based on the 16S ribosomal RNA gene, those isolates that do not belong to the genus Vibrio were positioned as outliers, as expected. This tree further indicated the most closely resembling 16S rRNA sequence for the two sequenced Vibrio strains that are currently not assigned to a species. It was observed that the two sequenced V. parahaemolyticus strains were not placed together. The complete gene content of each genome was next compared by BLAST and the results were pooled into gene families, which were subjected to cluster analysis. This provided evidence that the 18 V. cholerae genomes fall into two subclusters, one mainly containing clinical isolates and the other environmental isolates.

The gene family clustering, subsequent pan-genome analysis and the pairwise BLAST results, as summarised in the BLAST matrix, all supported the relatedness of Vibrio species Ex25 to V. parahaemolyticus 2210633 but not to V. parahaemolyticus 16. This latter genome was quite different from V. parahaemolyticus 2210633 in all analyses. Although it is possible that the species V. parahaemolyticus is far more genetically diverse than V. cholerae, A. fischeri or V. vulnificus, an alternative explanation is that one of the sequenced isolates is incorrectly named as V. parahaemolyticus. The similarity between Vibrio species MED222 and V. splendidus based on gene families is in agreement with their related 16S rRNA genes and published data [21]. However, in contrast to what the ribosomal gene suggests, our whole-genome comparison indicates that the three Aliivibrio genomes (A. salmonicida and two A. fischeri) are not so different from Vibrio after all. Their recent placement in the genus Aliivibrio, a decision based on five genes (the 16S rRNA gene and four housekeeping genes) and phenotypical characteristics [47], appears not to reflect the whole-genome picture presented here.

The BLAST results were graphically summarised in a BLAST atlas, which visualised V. cholerae-specific gene clusters. These coded for polysaccharide biosynthesis enzymes, response regulators and chemotaxis proteins, amongst others. In addition, a V. cholerae-specific histidine kinase two-component signal transduction regulatory system was identified. The two-component signal transduction pathway is a powerful regulatory system that allows bacteria to adapt to a particular ecological niche. There is a precedent for this claim, as the introduction of a single regulatory protein in Vibrio fischeri strain MJ11 has been shown to specifically enable colonization of the squid Euprymna scolopes [26].

As expected, the main differences observed between V. cholerae clinical isolates and the environmental strains are due to genes related to virulence. Two exceptions are the presence of a number of virulence genes in the environmental strain V. cholerae 2740-80 and the absence of enterotoxin genes in clinical isolate M66-2. It has already been suggested that M66-2 might be a predecessor of pandemic, enterotoxic V. cholerae [11]. From sequence comparison of four housekeeping genes, it was concluded that V. cholerae 2740-80 is intermediary between toxigenic and non-toxigenic isolates [30]. This view is confirmed by the data presented here, although we propose to consider the possibility that the isolate arose from a pandemic clone that has lost the CTXΦ prophage, rather than being a precursor of a pathogen.

In conclusion, several different methods of genome comparison have yielded a picture of V. cholerae genomes as forming a distinct cluster compared to related species, and a relatively small number of genes might be responsible for environmental niche adaptation and hence for the generation of this distinct species. Likely candidates include multiple two-component signal transduction regulatory proteins as well as chemotaxis proteins.

Acknowledgements We would like to thank Tim Binnewies for early work on this project, and also the Danish Research Councils and the DTU Globalization funds for financial support.


Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

References

1. Bassler B et al. (2007) CP000789.1: Direct submission to GenBank

2. Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2005) Genome update: proteome comparisons. Microbiol 151:1–4

3. Chen CY, Wu KM, Chang YC, Chang CH, Tsai HC, Liao TL, Liu YM, Chen HJ, Shen AB, Li JC, Su TL, Shao CP, Lee CT, Hor LI, Tsai SF (2003) Comparative genome analysis of Vibrio vulnificus, a marine pathogen. Genome Res 13:2577–2587

4. Clayton RA, Sutton G, Hinkle PS, Bult C, Fields C (1995) Intraspecific variation in small-subunit rRNA sequences in GenBank: why single sequences may not adequately represent prokaryotic taxa. Int J Syst Bacteriol 45:595–599

5. Colwell R, Grim CJ, Young S, Jaffe D, Gnerre S, Berlin A, Heiman D, Hepburn T, Shea T, Sykes S, Alvarado L, Kodira C, Heidelberg J, Lander E, Galagan J, Nusbaum C, Birren B (2008) NZ_AAKF00000000: Direct submission to GenBank

6. Doolittle WF (1995) Phylogenetic classification and the universal tree. Science 284:2124–2129

7. Doolittle WF, Papke RT (2006) Genomics and the bacterial species problem. Genome Biol 7:116

8. Doolittle WF, Zhaxybayeva O (2009) On the origin of prokaryotic species. Genome Res 19:744–756

9. Edwards R, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2008) NZ_ACCV00000000: Direct submission to GenBank

10. Farmer JJ, Janda JM (2005) Vibrionaceae. In: Bergey’s manual of systematic bacteriology, 2nd edn, vol 2 part B. Springer, New York, pp 491–546

11. Feng L, Reeves PR, Lan R, Ren Y, Gao C, Zhou Z, Ren Y, Cheng J, Wang W, Wang J, Qian W, Li D, Wang L (2008) A recalibrated molecular clock and independent origins for the cholera pandemic clones. PLoS ONE 3:e4053

12. Gevers D, Cohan FM, Lawrence JG, Sprat BG, Coeyne T, Feil EJ, Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, Swings J (2005) Re-evaluating prokaryotic species. Nat Rev Microbiol 3:733–739

13. Hagstrom A, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2007) NZ_ABGR00000000: Direct submission to GenBank

14. Hallin PF, Binnewies TT, Ussery DW (2008) The genome BLASTatlas - a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4:363–371

15. Heidelberg JF, Eisen JA, Nelson WC, Clayton RA, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Umayam L, Gill SR, Nelson KE, Read TD, Tettelin H, Richardson D, Ermolaeva MD, Vamathevan J, Bass S, Qin H, Dragoi I, Sellers P, McDonald L, Utterback T, Fleishmann RD, Nierman WC, White O, Salzberg SL, Smith HO, Colwell RR, Mekalanos JJ, Venter JC, Fraser CM (2000) DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae. Nature 406:477–483

16. Heidelberg J, Sebastian Y. NZ_AAKJ00000000, NZ_AAUT00000000, NZ_AAKK00000000, NZ_AAUR00000000, NZ_AAWF00000000: Direct submission to GenBank

17. Hjerde E, Lorentzen MS, Holden MT, Seeger K, Paulsen S, Bason N, Churcher C, Harris D, Norbertczak H, Quail MA, Sanders S, Thurston S, Parkhill J, Willassen NP, Thomson NR (2008) The genome sequence of the fish pathogen Aliivibrio salmonicida strain LFI1238 shows extensive evidence of gene decay. BMC Genomics 9:616

18. Konstantinidis T, Ramette A, Tiedje JA (2006) The bacterial species definition in the genomic era. Phil Trans R Soc B 361:1929–1940

19. Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100–3108

20. Larsen TS, Krogh A (2003) EasyGene - a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 4:29

21. Le Roux F, Zouine M, Chakroun N, Binesse J, Saulnier D, Bouchier C, Zidane N, Ma L, Rusniok C, Lajus A, Buchrieser C, Médigue C, Polz MF, Mazel D (2009) Genome sequence of Vibrio splendidus: an abundant planctonic marine species with a large genotypic diversity. Environ Microbiol 11:1959–1970

22. Lin W, Fullner KJ, Clayton R, Sexton JA, Rogers MB, Calia KE, Calderwood SB, Fraser C, Mekalanos JJ (1999) Identification of a Vibrio cholerae RTX toxin gene cluster that is tightly linked to the cholera toxin prophage. Proc Natl Acad Sci U S A 96:1071–1076

23. Loytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 102:10557–10562

24. Loytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632–1635

25. Makino K, Oshima K, Kurokawa K, Yokoyama K, Uda T, Tagomori K, Iijima Y, Najima M, Nakano M, Yamashita A, Kubota Y, Kimura S, Yasunaga T, Honda T, Shinagawa H, Hattori M, Iida T (2003) Genome sequence of Vibrio parahaemolyticus: a pathogenic mechanism distinct from that of V. cholerae. Lancet 361:743–749

26. Mandel MJ, Wollenberg MS, Stabb EV, Visick KL, Ruby EG (2009) A single regulatory gene is sufficient to alter bacterial host range. Nature 458:215–218

27. Mazel D, Le Roux F (2008) FM954973.1: Direct submission to GenBank

28. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R (2005) The microbial pan-genome. Curr Opin Genet Dev 15:589–594

29. Medrano-Soto A, Moreno-Hagelsieb G, Vinuesa P, Christen JA, Collado-Vides J (2001) Successful lateral transfer requires codon usage compatibility between foreign genes and recipient genomes. Mol Biol Evol 21:1884–1894

30. Mohapatra SS, Ramachandran D, Mantri CK, Colwell RR, Singh DV (2009) Determination of relationships among non-toxigenic Vibrio cholerae O1 biotype El Tor strains from housekeeping gene sequences and ribotype patterns. Res Microbiol 160:57–62

31. Munk A, Tapia R, Green L, Rogers Y, Detter JC, Bruce D, Brettin TS, Colwell R, Grim C, Vonstein V, Bartels D. CP001485.1, NZ_ACHV00000000, NZ_ACHY00000000, NZ_ACHW00000000, NZ_ACHX00000000, NZ_ACHZ00000000, NZ_ACIA00000000, NZ_ACFQ00000000: Direct submission to GenBank

32. Murray RG, Stackebrandt E (1995) Taxonomic note: implementation of the provisional status Candidatus for incompletely described procaryotes. Int J Syst Bacteriol 45:186–187

33. Nierman WC (2006) NZ_AATY00000000: Direct submission to GenBank

34. Pang B, Yan M, Cui Z, Ye X, Diao B, Ren Y, Gao S, Zhang L, Kan B (2007) Genetic diversity of toxigenic and nontoxigenic Vibrio cholerae serogroups O1 and O139 revealed by array-based comparative genomic hybridization. J Bacteriol 189:4837–4879

35. Philippe H, Douady CJ (2003) Horizontal gene transfer and phylogenetics. Curr Opin Microbiol 6:498–505



36. Pinhassi J, Pedros-Alio C, Ferriera S, Johnson J, Kravitz S, Halpern A, Remington K, Beeson K, Tran B, Rogers Y-H, Friedman R, Venter JC (2006) NZ_AAND00000000: Direct submission to GenBank

37. Pupo GM, Lan R, Reeves PR (2000) Multiple independent origins of Shigella clones of Escherichia coli and convergent evolution of many of their characteristics. Proc Natl Acad Sci U S A 97:10567–10572

38. Rhee JH, Kim SY, Chung SS, Lee SE, Choy HE (2002) AE016795.2: Direct submission to GenBank

39. Riley MA, Lizotte-Waniewski M (2009) Population genomics and the bacterial species concept. Methods Mol Biol 532:367–377

40. Rowe-Magnus DA, Guérout AM, Mazel D (1999) Superintegrons. Res Microbiol 150:641–651

41. Rosenberg E, Ferriera S, Johnson J, Kravitz S, Beeson K, Sutton G, Rogers Y-H, Friedman R, Frazier M, Venter JC (2006) NZ_ABCH00000000: Direct submission to GenBank

42. Ruby EG, Urbanowski M, Campbell J, Dunn A, Faini M, Gunsalus R, Lostroh P, Lupp C, McCann J, Millikan D, Schaefer A, Stabb E, Stevens A, Visick K, Whistler C, Greenberg EP (2005) Complete genome sequence of Vibrio fischeri: a symbiotic bacterium with pathogenic congeners. Proc Natl Acad Sci U S A 102:3004–3009

43. Sánchez J, Holmgren J (2005) Virulence factors, pathogenesis and vaccine protection in cholera and ETEC diarrhoea. Curr Opin Immunol 17:388–398

44. Stackebrandt E, Frederiksen W, Garrity GM, Grimont PA, Kämpfer P, Maiden MC, Nesme X, Rosselló-Mora R, Swings J, Trüper HG, Vauterin L, Ward AC, Whitman WB (2002) Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 52:1043–1047

45. Tamura K, Dudley J, Nei M, Kumar S (2007) MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol 24:1596–1599

46. Thompson FL, Iida T, Swings J (2004) Biodiversity of vibrios. Microbiol Mol Biol Rev 68:403–431

47. Urbanczyk H, Ast JC, Higgins MJ, Carson J, Dunlap PV (2007) Reclassification of Vibrio fischeri, Vibrio logei, Vibrio salmonicida and Vibrio wodanis as Aliivibrio fischeri gen. nov., comb. nov., Aliivibrio logei comb. nov., Aliivibrio salmonicida comb. nov. and Aliivibrio wodanis comb. nov. Int J Syst Evol Microbiol 57:2823–2829

48. Vezzi A, Campanaro S, D'Angelo M, Simonato F, Vitulo N, Lauro FM, Cestaro A, Malacrida G, Simionati B, Cannata N, Romualdi C, Bartlett DH, Valle G (2005) Life at depth: Photobacterium profundum genome sequence and expression analysis. Science 30:1459–1461

49. Wang L, Feng L, Reeves P, Lan R, Ren Y, Gao C, Zhou Z, Ren Y, Wang W (2008) CP001233.1, CP001235.1: Direct submission to GenBank

50. Woese CR (1987) Bacterial evolution. Microbiol Rev 51:221–271



2.10 Paper V: Tools for comparison of bacterial genomes


74 Tools for Comparison of Bacterial Genomes

T. M. Wassenaar (1,2), T. T. Binnewies (1,3), P. F. Hallin (1), D. W. Ussery (1,*)

1 Center for Biological Sequence Analysis, Technical University of Denmark, Kgs. Lyngby, Denmark
2 Molecular Microbiology and Genomics Consultants, Zotzenheim, Germany
3 Roche Diagnostics Ltd., Advanced Systems Group, Global Platforms & Support, Rotkreuz, Switzerland
* dave@cbs.dtu.dk

1 Introduction
2 Genomic DNA Sequence Comparisons
3 Visualization of Genomic Data: The Genome Atlas
4 Whole Genome Alignment Methods
5 Comparing the Coding Fraction of Genomes
6 Codon Usage Comparisons
7 Protein Sequence Comparisons
8 Gene Synteny and Genome Islands
9 Minimal Information About a Genome Sequence
10 Research Needs

K. N. Timmis (ed.), Handbook of Hydrocarbon and Lipid Microbiology, DOI 10.1007/978-3-540-77587-4_337, © Springer-Verlag Berlin Heidelberg, 2010



Abstract: Of the plethora of bioinformatics tools available, this chapter describes some that are useful for comparing complete genome sequences. Comparisons of genome length, base composition, gene density, numbers of tRNA and rRNA genes, and codon usage can provide useful biological insights. Examples are given of a Genome Atlas plot, which summarizes many features of a single genome, and of a BLAST Atlas, in which multiple genomes can be combined. A table of web services for useful tools is provided.

1 Introduction

Presently, about 900 bacterial and archaeal genomes have been fully sequenced and made publicly available (see footnote 1), and their number more than doubled last year. Approximately 40% of the sequenced genomes were obtained from environmental (terrestrial and marine) organisms. In addition, metagenomic projects are now producing vast amounts of sequence. Here we provide a brief overview of methods to compare sequenced bacterial genomes. Of the many methods available (Binnewies et al., 2006), Table 1 lists several that we find useful. It is beyond the scope of this review to provide a detailed analysis of these methods, and the list is far from complete. The tools discussed here provide information on fundamental biological features and can be used to compare a few or large numbers of genomes. They are easy to use and produce results that are easy to interpret and can be represented graphically. The latter is an important quality of any sequence analysis tool that deals with genomes, because the input data are so large and complex.

2 Genomic DNA Sequence Comparisons

A genome can consist of more than one DNA molecule. Approximately 10% of the bacterial genomes sequenced so far have more than one chromosome. By definition a genome includes all chromosomes (and plasmids) that constitute an organism’s total DNA. Chromosomes are essential, single-copy, independently replicating DNA molecules present in each member of the species. Some species contain plasmids; these are frequently strain-specific and sometimes (incorrectly, in our opinion) omitted from a genome sequence.

At the time of writing, the largest bacterial genome sequenced is that of Solibacter usitatus (strain Ellin 6076), a soil bacterium belonging to the Acidobacteria. It consists of a single chromosome of 9.97 mega base pairs (Mbp). The smallest bacterial genome known is that of Carsonella ruddii (PV), an endosymbiont of a plant sap-feeding insect, with a mere 159,662 bp. Genome size is a rough indicator of biological adaptive potential, so it is no surprise that soil bacteria have bigger genomes, as they have to adapt to environmental variation, whereas the protective niche of an endosymbiont allows for a small genome.

1 Completed genome statistics obtained from the NCBI Genome Project web pages: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi

Table 1
Methods for comparison of bacterial genomes

Method | URL | References
Length, %GC | http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi | Wheeler et al. (2007)
Chromosome alignment (ACT) | http://www.sanger.ac.uk/Software/ACT/ and http://www.webact.org/WebACT/home | Carver et al. (2005)
Chromosome alignment (MUMMER) | http://mummer.sourceforge.net | Kurtz et al. (2004)
Repeats – various | http://www.cbs.dtu.dk/services/GenomeAtlas | Ussery et al. (2004)
Repeats – tetranucleotides | http://www.megx.net/tetra | Teeling et al. (2004)
Repeats – short, tandem | http://minisatellites.u-psud.fr/GPMS/default.php | Denoeud and Vergnaud (2004)
Repeats – VNTRs | http://vntr.csie.ntu.edu.tw | Chang et al. (2007)
Replication Origins | http://www.cbs.dtu.dk/services/GenomeAtlas | Worning et al. (2006)
Noncoding RNAs | http://rfam.sanger.ac.uk | Griffiths-Jones et al. (2005)
rRNAs | http://www.cbs.dtu.dk/services/RNAmmer | Lagesen et al. (2007)
Genome Atlas | http://www.cbs.dtu.dk/services/GenomeAtlas | Hallin and Ussery (2004)
BLAST Atlas (zoomable) | http://www.cbs.dtu.dk/services/gwBrowser | UPDATE! Hallin and Ussery (2004)
"Genome Properties" | http://cmr.tigr.org/tigr-scripts/CMR/shared/GenomePropertiesHomePage.cgi | Selengut et al. (2007)

The genome size of an organism is easy to calculate and tabulate. Figure 1a gives a graphical representation of genome size variation within bacterial phyla. A "box and whiskers" plot as shown in Fig. 1 visualizes the distribution of a property that can be expressed as a numerical value, such as length, %GC, number of genes, etc. Such plots show the spread of the data and are made as follows: the values are sorted and divided into two equal parts, separated by the median, which is marked as a bar in the middle of the distribution. A box is drawn to cover the range where the middle 50% of the data lie (excluding the first 25% and the last 25% of the data). The "whiskers" are the hatched lines that connect the lowest (left) and highest (right) values, with the exception of outlier points, which are shown as individual dots. Outliers are defined as data points that lie more than 1.5 times the range of the box away from the box.
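The box-and-whiskers construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: quartiles are taken as the medians of the lower and upper halves of the sorted data (one of several common conventions; the chapter does not say which one was used), outlier distance is measured from the box edges, and the function name `box_stats` and the example values are hypothetical.

```python
from statistics import median

def box_stats(values):
    """Return median, quartiles, whisker ends and outliers for a box plot.

    Assumes at least two values. Quartiles are the medians of the lower and
    upper halves of the sorted data (an assumed convention).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    med = median(data)
    q1 = median(data[:half])            # lower half (median excluded if n is odd)
    q3 = median(data[half + n % 2:])    # upper half
    box = q3 - q1                       # range of the box (middle 50% of the data)
    low_fence = q1 - 1.5 * box
    high_fence = q3 + 1.5 * box
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],       # lowest value that is not an outlier
        "whisker_high": inside[-1],     # highest value that is not an outlier
        "outliers": outliers,
    }

# Hypothetical values, for illustration only (not genome sizes from Fig. 1).
example = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
```

In the example, the value 30 lies beyond the upper fence and would be drawn as an individual dot, while the right whisker would end at 8.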

The base composition of genomes, i.e., their %GC content (or %AT; the two add up to 100%), can also be compared, as shown in Fig. 1b. The GC content of a genome can range from 17% in C. ruddii to 75% in Anaeromyxobacter dehalogenans. The smallest genome is also the most AT rich, and many of the larger genomes are quite GC rich. It is not clear whether a biological force lies behind this correlation, although it has been observed that the ecological niche an organism occupies roughly correlates with both genome size and GC content (Foerstner et al., 2005, Musto et al., 2006).

Figure 1
(a) Box and whisker plot of genome length distribution for 779 bacterial chromosomes, grouped by phyla (axis: genome size in Mbp). The phylum and the number of chromosomes included are indicated at the left. Each phylum is colored according to our GenomeAtlas website. (b) The distribution of average chromosomal AT content for the same set of bacterial genomes (axis: AT content in percent).
Panel titles: "Size distribution of prokaryotic genomes (N = 779)" and "AT content distribution of prokaryotic genomes (N = 779)".
Phyla shown: Crenarchaeota (n = 16), Euryarchaeota (n = 35), Nanoarchaeota (n = 1), Acidobacteria (n = 2), Actinobacteria (n = 55), Aquificae (n = 3), Bacteroidetes/Chlorobi (n = 26), Chlamydiae/Verrucomicrobia (n = 13), Chloroflexi (n = 7), Cyanobacteria (n = 33), Deinococcus/Thermus (n = 4), Firmicutes (n = 155), Fusobacteria (n = 1), Planctomycetes (n = 1), Alphaproteobacteria (n = 94), Betaproteobacteria (n = 61), Gammaproteobacteria (n = 191), Deltaproteobacteria (n = 21), Epsilonproteobacteria (n = 22), Spirochaetes (n = 16), Thermotogae (n = 8), other archaea (n = 1), other bacteria (n = 13).

In addition to the average GC content for a whole genome, local variation within a given genome can be examined, and this reveals two general trends for almost all bacterial genomes. First, at the global, chromosomal level, a large region flanking the origin of DNA replication tends to be more GC rich, and the region around the replication terminus is usually more AT rich. AT-rich sequences melt more easily than GC-rich sequences, due in part to the extra hydrogen bond present in a GC base pair. Counterintuitively, this would make the origin of replication the least likely place for replication to start. However, within this "large region" around the origin, which spans approximately 5% of the chromosome, there is a short stretch of more AT-rich base pairs where the replication bubble opens up. Second, zooming in to the level of genes, the average GC content of intergenic regions is generally lower than that of coding sequences. These regions melt more readily and are more curved and more rigid than the chromosomal average, which enables gene expression (Pedersen et al., 2000, Ussery and Hallin, 2004). This is true for nearly all of the bacterial genomes sequenced, regardless of GC content.

To calculate relative or local %GC, a window has to be defined (say, 100 base pairs) for which the %GC is calculated. This window is then moved along the genome in single-nucleotide steps, and the %GC is assigned to the middle position of each window. These scores can then be represented graphically. A web-based tool for this is available at the Genome Atlas website (see footnote 2), in which local %GC can be visualized by color codes as discussed below.
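The sliding-window calculation of local %GC described above can be illustrated with a short Python sketch. It is not the Genome Atlas implementation: the function name `local_gc` is hypothetical, the sequence is treated as linear (a circular chromosome would need the window to wrap around the end), and positions are 0-based.

```python
def local_gc(sequence, window=100):
    """Return a list of (centre_position, percent_gc), one entry per window.

    The window moves along the sequence one nucleotide at a time and each
    score is assigned to the middle of its window (0-based position).
    """
    seq = sequence.upper()
    if len(seq) < window:
        return []
    # Count G and C once for the first window.
    gc = sum(1 for base in seq[:window] if base in "GC")
    scores = [(window // 2, 100.0 * gc / window)]
    for start in range(1, len(seq) - window + 1):
        # Update the count incrementally: one base leaves, one base enters.
        if seq[start - 1] in "GC":
            gc -= 1
        if seq[start + window - 1] in "GC":
            gc += 1
        scores.append((start + window // 2, 100.0 * gc / window))
    return scores
```

Updating the count incrementally keeps the cost proportional to the sequence length rather than to length times window size, which matters for sequences of several Mbp.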

3 Visualization of Genomic Data: The Genome Atlas

Genome atlases are circular plots of chromosomes or plasmids (a linear version is available when applicable) on which general properties of the DNA molecule are plotted as colors. Genome atlases are available from our web server (see footnote 2) for many of the currently sequenced bacterial genomes. Figure 2 shows a Genome Atlas for the chromosome of Geobacillus kaustophilus strain HTA426 (a thermophilic Firmicute that also contains a plasmid of 4.8 kb). This isolate was obtained from a deep sea sediment of the Mariana Trench in the Pacific Ocean (Takami et al., 2004a, b). Its genome is 3.5 Mbp long and contains 52.1% GC. G. kaustophilus has been suggested to provide a possible solution for paraffin deposition problems in oil production (Sood and Lal, 2008).

2 http://www.cbs.dtu.dk/services/GenomeAtlas/

Figure 2
Genome atlas of the main chromosome of Geobacillus kaustophilus HTA426 (3,544,776 bp). See text for further explanation.
Lanes in the legend: intrinsic curvature (0.17 to 0.22), stacking energy (-9.03 to -7.55), position preference (0.14 to 0.17), annotations (CDS +, CDS -, rRNA, tRNA), global direct repeats (5.00 to 7.50), global inverted repeats (5.00 to 7.50), GC skew (-0.15 to 0.14), percent AT (0.20 to 0.80). Resolution: 1418.

A Genome Atlas maps four different aspects of the chromosomal DNA sequence in lanes arranged in a standard manner: DNA structural features are represented in the three outer lanes, all coding sequences are indicated in the next lane, two kinds of repeats are mapped in the next two lanes, and base composition properties are plotted in the two innermost lanes (Jensen et al., 1999). The scale in the center corresponds to the sequence numbering in GenBank. The DNA structural features of the three outermost circles are based on the physicochemical properties of the DNA helix. Protein-coding genes oriented clockwise are shown in blue, and those on the other strand (counterclockwise) in red. The tRNA and rRNA genes have their own colors. The clockwise strand corresponds to the sequence stored in GenBank (genes on the other strand are annotated there as "complement"). To identify global repeats (sequences that are repeated somewhere else on the chromosome) we search for the best match of a 100 bp window against the entire chromosome. Searching on the positive strand yields direct repeats (both sequences run in the same direction), whilst searching on the negative strand yields inverted repeats (the two repeat units run in opposite directions). For most of the general properties summarized in a Genome Atlas (structural properties, repeats, base composition), dedicated atlases are also available in which more features are given (such as local and simple repeats in a Repeat Atlas, or base composition in a Base Atlas). Such specialized atlases are explained in detail in a book that we recently produced (Ussery et al., 2008).
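The distinction between direct and inverted global repeats can be made concrete with a small Python sketch. Note the simplification: the text describes a best-match search of a 100 bp window against the whole chromosome, whereas the code below only finds exact matches of fixed-length words, which is a much cruder stand-in chosen to keep the example short. The function names and the word length parameter `k` are hypothetical.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def find_repeats(sequence, k=100):
    """Return (direct, inverted) lists of (pos_a, pos_b) with pos_a < pos_b.

    Exact-match simplification: a direct repeat is a k-mer occurring twice on
    the given strand; an inverted repeat is a k-mer whose reverse complement
    also occurs on the given strand. Positions are 0-based window starts.
    """
    seq = sequence.upper()
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    direct, inverted = [], []
    for word, starts in positions.items():
        # Same word found again in the same orientation.
        direct.extend((a, b) for n, a in enumerate(starts) for b in starts[n + 1:])
        # Same word found on the opposite strand.
        partners = positions.get(reverse_complement(word), [])
        inverted.extend((a, b) for a in starts for b in partners if a < b)
    return sorted(direct), sorted(inverted)
```

A real implementation would tolerate mismatches and report a score per window, as the best-match wording in the text implies.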

As can be seen in Fig. 2, the genes in this chromosome strongly favor one strand: the positive strand for the first (right) half and the negative strand for the second (left) half of the chromosome. In each half, the favored strand is the leading strand during replication. Replication starts at the origin (the 12 o’clock position here) and proceeds in both directions along the circle, each fork with a leading and a lagging strand, until the bubble reaches the terminus, at 6 o’clock, and the ends are joined. The positive strand represented by a genome sequence is therefore the leading strand only for the first half, up to the terminus. Reading across the terminus along the same strand, one enters the lagging strand. Gene preference for the leading strand is a general feature of Firmicutes and of some other bacteria.

In Fig. 2 the two outer lanes identify some regions with strong structural properties (for instance the region around 2 o’clock, indicated by a black line). The strong curvature observed there (blue in the outermost lane), in a region where the DNA would melt easily (red in the second lane), suggests that this region contains genes that are highly expressed.

There are a number of global repeats, notably in the first quarter of the chromosome. Note that the ribosomal RNA genes (light blue in the annotation lane) are located here, as indicated by the arrows, and these are picked up as global repeats, as indeed they are repeated genes. The GC skew lane shows the bias of G’s towards one strand or the other, averaged over a 10,000 bp window. In contrast to many Firmicutes with a strong GC skew, this genome has only a weak GC skew (the right half is light blue and the left half is light pink). The innermost circle colors the local AT content where it deviates by more than three standard deviations from the global average. Note the light red color around the 2 o’clock region: this local deviation in AT content is related to the structural features located here.
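As a rough illustration of the two innermost lanes, the Python sketch below computes GC skew and AT fraction per window and flags windows whose AT content deviates strongly from the average. Several points are assumptions rather than statements about the Genome Atlas software: GC skew is computed with the commonly used formula (G - C)/(G + C), windows are consecutive and non-overlapping, the standard deviation is taken over the window values, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def window_stats(sequence, window=10000):
    """Return (start, gc_skew, at_fraction) for consecutive windows.

    GC skew is (G - C) / (G + C), an assumed but widely used definition;
    it is set to 0.0 for a window without any G or C.
    """
    seq = sequence.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        skew = (g - c) / (g + c) if g + c else 0.0
        rows.append((start, skew, at / window))
    return rows

def at_outliers(rows, n_sd=3.0):
    """Return start positions of windows whose AT fraction lies more than
    n_sd standard deviations from the mean over all windows."""
    at_values = [at for _, _, at in rows]
    avg, sd = mean(at_values), pstdev(at_values)
    return [start for start, _, at in rows if sd and abs(at - avg) > n_sd * sd]
```

A positive skew in a window means an excess of G over C on the strand being read; plotted around the circle, a change of sign would be expected near the origin and terminus in genomes with a strong skew.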

The Genome Atlas of the archaeon Methanosarcina acetivorans, shown in Fig. 3, tells a different story. This strictly anaerobic organism produces methane so efficiently that it is held responsible for virtually all biogenic methane. It can also oxidize CO to CO2 (Lessner et al., 2006). Strain C2A (the type strain of the species) was isolated from a marine sediment (Galagan et al., 2002). Its genome is 5.7 Mbp long and contains 42.7% GC. The Genome Atlas shows that its genes are evenly distributed over the two strands, and a GC skew is absent. Instead, the lower quarter of the genome contains many strong structural features. The genome contains only three rRNA gene copies (indicated by arrows), one of which is located on the negative strand (but as discussed above, this is actually the leading strand, as is preferred for nearly all bacterial rRNA genes). Many other global repeats are visible, notably in the region around 1.2 Mbp, which is strongly curved and easily melted, and is slightly more AT rich than the rest of the genome. Here, the important carbon-monoxide dehydrogenase gene locus is present, as are multiple transposases, which could be an indication of horizontally acquired DNA. The genome is relatively poorly annotated, with many genes given as "predicted protein" only, which is not uncommon for archaeal genomes.

In conclusion, a Genome Atlas combines a number of features in one single figure that summarizes a very long and detailed story about a chromosome or plasmid.

Figure 3
Genome atlas of the main chromosome of the archaeon Methanosarcina acetivorans C2A (5,751,492 bp).
Lanes in the legend: intrinsic curvature (0.18 to 0.24), stacking energy (-8.10 to -7.21), position preference (0.13 to 0.15), annotations (CDS +, CDS -, rRNA, tRNA), global direct repeats (5.00 to 7.50), global inverted repeats (5.00 to 7.50), GC skew (-0.03 to 0.02), percent AT (0.20 to 0.80). Resolution: 2301.

4 Whole Genome Alignment Methods

Another way to compare genomes is based on alignment of nucleotide or amino acid sequences. Sequence alignment is a common way to identify similarities, and BLAST (Basic Local Alignment Search Tool) is the most commonly used tool (Altschul et al., 1990). However, BLAST is not automatically suitable for large DNA input segments such as complete genomes. A more suitable program for aligning sequences in the megabase range is MUMmer, developed at TIGR, of which version 3 is now publicly available (Kurtz et al., 2004). This method has recently been extended to include the average nucleotide identity in the conserved core genes of a set of genomes (Deloger et al., 2009). Graphical representation of the resulting alignment then becomes an issue, and specific tools have been designed to align genome sequences and visualize the alignments. The Artemis Comparison Tool (ACT) is worth mentioning; two versions are available: a downloadable version for use on a local computer (Carver et al., 2005) and a web-based version with pre-computed comparisons between several hundred bacterial genomes (see footnote 3). BLAST results of entire bacterial chromosomes against each other have also been used to construct phylogenetic trees (Henz et al., 2005). BLAST comparisons are treated in Section 7 of this chapter.

5 Comparing the Coding Fraction of Genomes

The typical coding density for a bacterial genome is about 90%, ranging from 95% for Pelagibacter ubique (a marine alphaproteobacterium that is among the most numerous bacteria in the world) (Giovannoni et al., 2005) to around 75% for M. acetivorans. Intracellular bacteria can have a coding density as low as 50%. The majority of bacterial DNA thus codes for genes, which are mostly not spliced, so that introns are absent (with very few exceptions). However, not every open reading frame is a gene, and it appears that many bacterial genomes are over-annotated, predicting 10–15% more genes than are real (Skovgaard et al., 2001). These over-annotated genes are frequently short open reading frames. In addition, genes can be missed in the annotation. A frequent mistake is that genes are annotated on the wrong strand, which can happen if the reading frame is open in either direction. The intergenic regions separating genes regulate transcription, and in intracellular bacteria they frequently contain pseudogenes or repeats. Genes not coding for proteins include tRNA and rRNA genes, and some parts of intergenic regions can be transcribed into stable RNAs that do not code for proteins. E. coli contains several hundred small non-coding RNA (ncRNA) genes (Chen et al., 2002) that can act as regulators (Gottesman, 2005). Their role in environmental bacteria is virtually unexplored.
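Coding density as used above can be computed directly from annotated gene coordinates. The following Python sketch shows one way to do this, assuming 1-based inclusive coordinates as in GenBank feature tables and counting overlapping genes only once; it ignores genes that span the origin of a circular chromosome. The function name `coding_density` and the example coordinates are hypothetical.

```python
def coding_density(genes, genome_length):
    """Fraction of the genome covered by at least one annotated gene.

    genes: iterable of (start, end) pairs, 1-based and inclusive.
    Overlapping genes are merged so that shared bases are counted once.
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(genes):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlap with the current interval: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / genome_length

# Hypothetical annotation of a 1,000 bp sequence with two overlapping genes.
density = coding_density([(1, 300), (250, 600), (701, 900)], 1000)
```

Because over-annotated short open reading frames inflate the covered fraction, a value computed this way is only as reliable as the annotation it is based on.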

Although tRNA and rRNA genes are essential to life, they are sometimes missed in the annotation of a genome, a rather embarrassing omission, or occasionally annotated on the wrong strand (Lagesen et al., 2007). The number and location of rRNA operons in a genome can say something about an organism. It appears that organisms with short doubling times have larger numbers of rRNA and tRNA genes. Comparing Figs. 2 and 3, it is likely that G. kaustophilus, with 9 rRNA copies, nearly all located close to the origin of replication (which boosts expression during replication as their copy number increases), can divide more quickly than M. acetivorans, which has only three copies. Some really fast-growing bacteria can have 14 or more rRNA copies, as can be seen in our list of genomes (see footnote 4).

3 http://www.webact.org/WebACT/home

4 www.cbs.dtu.dk/services/GenomeAtlas/



6 Codon Usage Comparisons<br />

Once the genes of a given genome have been defined, their codon usage can be analyzed. Since<br />
the genetic code is redundant, with up to 6 codons per amino acid, synonymous codons are used at<br />
different frequencies. Much of the redundancy in the genetic code is due to third base<br />
variation. Figure 4 displays the codon usage for three prokaryotic genomes: Methanosphaera<br />
stadtmanae (27.6% GC), an archaeal methanogen that uses methanol and hydrogen to<br />
produce methane; Desulfitobacterium hafniense (47.4% GC), a Firmicute that efficiently<br />
dehalogenates tetrachloroethene and polychloroethanes; and Anaeromyxobacter dehalogenans<br />
(75% GC). The latter species, the first myxobacterium to be grown as a pure culture, can use<br />
ortho-substituted mono- and dichlorinated phenols. The frequency of each possible codon is plotted<br />
in a wheel plot in the upper part of the figure, arranged so that codons sharing the same third base<br />
fall in the same quarter. The bias in codon usage towards the third position can also be seen in the<br />
sequence logo plots in the lower part of Fig. 4. From both graphics it is evident that genomic<br />
GC content strongly affects codon use (or the other way round). Based on a genome’s bias in<br />
codon usage, it is possible to predict its likely environmental niche (Willenbrock et al., 2006).<br />
Moreover, it is known that amino acid usage (not shown here) depends on environment, based<br />
on analysis of metagenomic samples (Musto et al., 2006; Foerstner et al., 2005).<br />
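The codon frequencies and third-position bias discussed above can be illustrated with a minimal Python sketch. It is not the code behind Figure 4: the function names and the toy sequences are invented for illustration, and it assumes in-frame coding sequences without ambiguous bases.<br />

```python
from collections import Counter


def codon_counts(cds):
    """Count codons in one coding sequence, read in frame from position 0."""
    cds = cds.upper().replace("U", "T")   # accept RNA or DNA spelling
    usable = len(cds) - len(cds) % 3      # ignore a trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def pooled_counts(cds_list):
    """Add up the codon counts of all genes of a genome."""
    total = Counter()
    for cds in cds_list:
        total.update(codon_counts(cds))
    return total


def codon_frequencies(cds_list):
    """Relative frequency of every codon observed in the gene set."""
    total = pooled_counts(cds_list)
    n = sum(total.values())
    return {codon: count / n for codon, count in total.items()}


def gc3(cds_list):
    """Fraction of codons whose third base is G or C."""
    total = pooled_counts(cds_list)
    n = sum(total.values())
    return sum(c for codon, c in total.items() if codon[2] in "GC") / n


# Made-up toy genes, only to show the calls.
genes = ["ATGAAAGGCTAA", "ATGGCGAAATAA"]
freqs = codon_frequencies(genes)
third_base_gc = gc3(genes)
```

Applied to all annotated genes of a genome, the frequencies would correspond to the values plotted in a wheel plot, and the third-base GC fraction gives a single number for the bias visible in the sequence logos.<br />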

7 Prote<strong>in</strong> Sequence Comparisons<br />

One can compare each individual gene in a given genome by BLAST against a set of genomes.<br />
This produces a huge amount of data that can be graphically represented in a BLAST Matrix<br />
(Binnewies et al., 2005; Ussery et al., 2009). A BLAST Matrix is not symmetrical, as the<br />
outcome is determined by which genome is used as the query. The diagonal of a BLAST<br />
Matrix represents a BLAST of a genome against itself. The self-match (the gene finding itself) is<br />
discarded, so the reported scores reflect internal homologues present in a given genome.<br />
Most of these are derived from gene duplication and are thus paralogs.<br />
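Why such a matrix is asymmetric can be shown with a small Python sketch. This is a simplification and not the published BLAST Matrix method: it assumes the BLAST results have already been reduced to triples (query genome, query protein, subject genome) for significant hits with self-matches removed, and all names and data are hypothetical.<br />

```python
def blast_matrix(proteomes, hits):
    """Fraction of each query genome's proteins that have a hit in each subject.

    proteomes: dict mapping genome name -> set of protein identifiers
    hits: set of (query_genome, query_protein, subject_genome) triples,
          assumed to hold significant hits only, self-matches excluded
    """
    matrix = {}
    for query, proteins in proteomes.items():
        matrix[query] = {}
        for subject in proteomes:
            matched = {p for (qg, p, sg) in hits if qg == query and sg == subject}
            matrix[query][subject] = len(matched & proteins) / len(proteins)
    return matrix


# Hypothetical toy data: genome A has four proteins, genome B has two.
proteomes = {"A": {"a1", "a2", "a3", "a4"}, "B": {"b1", "b2"}}
hits = {
    ("A", "a1", "B"), ("A", "a2", "B"),   # two A proteins find a homologue in B
    ("B", "b1", "A"), ("B", "b2", "A"),   # both B proteins find a homologue in A
    ("A", "a3", "A"),                     # a3 hits another A protein (paralog)
}
m = blast_matrix(proteomes, hits)
```

Here the cell for query A against B is 0.5 while the cell for query B against A is 1.0, because the denominators differ; the diagonal cell for A is non-zero only because of the internal homologue.<br />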

When more information needs to be visualized, a BLAST Atlas is helpful. Such an atlas uses<br />
one genome as a reference against which the gene conservation of other genomes is plotted<br />
(Hallin and Ussery, 2004; Skovgaard et al., 2002). In this case gene location refers only to the<br />
location in the reference genome, which of course can be varied in multiple BLAST Atlases.<br />
A BLAST Atlas is also a suitable platform to visualize metagenomic data. So far, we have<br />
not dealt with metagenomics extensively, mainly because this approach very rarely results in<br />
completely assembled microbial genomes. But for a BLAST Atlas that is not a problem,<br />
as one can combine all the metagenomic DNA in one lane, thereby ignoring from which<br />
organism the detected genes originated. All obtained BLAST hits are plotted around a<br />
reference genome. An example of a BLAST Atlas is given in Fig. 5, centered around<br />
Pelotomaculum thermopropionicum, a thermophilic, syntrophic Firmicute that can utilize<br />
1-butanol, 1-propanol, 1-pentanol or 1,3-propanediol as a carbon source. Note that despite<br />
the high number of lanes, conserved and variable genes can still easily be inspected visually.<br />
From compacting a single genome into a Genome Atlas, we have now moved several levels up<br />
and compact multiple genomes into a single atlas. In Fig. 5, the P. thermopropionicum<br />
genome is compared to many species of Clostridia, as well as other bacteria. Unfortunately,<br />
very few BLAST hits were found with the metagenomic samples, so there is very little color in<br />
those three lanes. Compared to well characterized genomes (like E. coli), relatively few hits are


[Figure 4 image. Panels: Methanosphaera stadtmanae DSM 3091 (72% AT), Desulfitobacterium hafniense Y51 (53% AT) and Anaeromyxobacter dehalogenans 2CP-C (25% AT). Top: wheel plots of the frequency of all 64 codons, radial axis 0.1 to 0.6. Bottom: sequence logos over the 1st, 2nd and 3rd codon positions.]<br />

Figure 4<br />
Frequency wheel plots of codon usage (top) and sequence logo plots (bottom) of Anaeromyxobacter dehalogenans (left), Desulfitobacterium hafniense<br />
(middle) and Methanosphaera stadtmanae (right).


[Figure 5 image. Circular BLAST Atlas of P. thermopropionicum SI (3,025,375 bp; scale marks every 0.5 Mbp). Legend lanes: 2 Alkaliphilus species; Bacillus fragilis; 17 Clostridium species; 4 Desulfitobacterium species; E. coli K-12; 6 other species belonging to Clostridia.]<br />

Figure 5<br />
BLAST Atlas with Pelotomaculum thermopropionicum as the reference genome. Around this the<br />
BLAST hits of 31 genomes of other bacteria are added as listed to the right, from the outermost<br />
circle (top in the legend) to the innermost circle of the bacterial genomes (bottom of legend).<br />
The outermost lane shows the hits of P. thermopropionicum in the UniProt database (which<br />
does not contain all annotated genes as it requires biological evidence of a gene product).<br />
The next three lanes are metagenomic DNA samples from...[Dave specify] and next follow<br />
30 genomes of other bacteria as listed to the right.<br />

found in other genomes, as indicated by the lack of strong color in most of the lanes in Figure 5.<br />
This is probably a reflection of the huge diversity in DNA content in such samples, reducing<br />
the chance of a BLAST hit. It is a sobering thought that there is still so little we know, and so<br />
much that remains to be discovered in the microbial world.<br />

There are many methods being developed which utilize sets of conserved genes and gene<br />
families in related organisms to cluster organisms into groups; these groups can represent<br />
known taxonomic relationships. For example, certain genes might be common to a set of<br />
organisms growing in a particular ecological niche. Some examples of such regions along the<br />
chromosome can be seen in the BLAST Atlas plots where genomes of related organisms of<br />
different species are compared.


8 Gene Synteny <strong>and</strong> Genome Isl<strong>and</strong>s<br />

A comparison of genes present, absent or diverged between genomes usually ignores gene synteny:<br />
the position at which such genes are found. The term was coined for eukaryotes to describe genes<br />
that are located on the same chromosome; in bacterial genomes it is usually the local neighboring genes,<br />
their order and their direction that are compared. The closer two organisms are, the more likely<br />
gene synteny is to be conserved (between genomes of the same genus, species, subspecies or<br />
phylogenetic clade, in increasing order). Gene synteny is destroyed by inversions (changing the<br />
direction of one or several genes), translocations (changing the position of genes) and insertion<br />
and deletion events. All of these can result from mistakes during replication, or from the activity of<br />
self-replicating mobile elements, such as bacteriophages, integrons and transposons.<br />
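One simple way to quantify local synteny, given here only as an illustrative Python sketch and not as a method used in this chapter, is to count how many oriented neighbor pairs of one genome are also neighbors in another. The sketch assumes that genes have already been assigned to shared families, treats the chromosome as linear, and uses an invented four-gene example.<br />

```python
def adjacencies(genome):
    """Set of oriented neighbor pairs, stored in a strand-independent form.

    genome: list of (gene_family, strand) tuples in chromosomal order,
            with strand given as +1 or -1.
    """
    pairs = set()
    for (g1, s1), (g2, s2) in zip(genome, genome[1:]):
        forward = ((g1, s1), (g2, s2))
        reverse = ((g2, -s2), (g1, -s1))   # same neighbors read from the other strand
        pairs.add(min(forward, reverse))   # pick one canonical representation
    return pairs


def synteny_fraction(a, b):
    """Fraction of neighbor pairs of genome a that are conserved in genome b."""
    adj_a, adj_b = adjacencies(a), adjacencies(b)
    if not adj_a:
        return 0.0
    return len(adj_a & adj_b) / len(adj_a)


# Hypothetical example: in genome b the dnaN-recF block is inverted.
a = [("dnaA", 1), ("dnaN", 1), ("recF", 1), ("gyrB", 1)]
b = [("dnaA", 1), ("recF", -1), ("dnaN", -1), ("gyrB", 1)]
conserved = synteny_fraction(a, b)
```

In this example only the pair inside the inverted block survives, so one of three neighbor pairs is conserved: an inversion breaks synteny at its two end points while leaving the internal neighborhood intact.<br />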

The events that affect gene synteny and the point mutations that accumulate during<br />
replication are the two major forces that increase genetic diversity; selection of those organisms<br />
that are fittest to survive particular conditions decreases diversity. Evolution further<br />
depends on changes in such selective conditions. With a slow but steady re-shuffling of<br />
genes by evolutionary processes, a pattern emerges of a genetic “backbone” of genes whose<br />
location is relatively conserved between genomes of reasonable genetic distance, and groups of<br />
“cluttered” genes that are far more variable, in what have been termed “genome islands.”<br />
Genome islands usually contain genes that are all involved in a particular phenotypic process.<br />
Examples are pathogenicity islands, symbiosis islands, metabolic islands and magnetosome<br />
islands. Specific cases include the sulfur metabolism islands discovered in metagenomic sequences from<br />
marine sediments (Mussmann et al., 2005) and the magnetosome island containing all genes<br />
that produce the intracellular organelle enabling magnetotactic bacteria to orient themselves<br />
along magnetic field lines (Richter et al., 2007). The evolutionary advantage of genome islands<br />
is obvious. They can be regarded as genetic “building blocks”; when transferred from one<br />
organism to the next, they can confer a complete phenotypic trait on the acceptor, enabling,<br />
for instance, adaptation to a novel ecological niche.<br />

9 M<strong>in</strong>imal Information About a Genome Sequence<br />

Genome sequences are stored in public databases such as GenBank under their biological<br />
names (preceded by “candidatus” for undecided taxonomic position), or by a code of<br />
numbers and letters for unculturable organisms that have not been classified. Unfortunately,<br />
other relevant information is often lacking. It has become apparent that biological and<br />
environmental data are important, and a standard for “Minimal Information about a<br />
Genome Sequence” has recently been proposed (Field et al., 2008). The Genomic Standards Consortium<br />
5 (GSC, http://gensc.org) promotes the standardization of genome sequencing descriptions<br />
and their exchange and integration in the scientific community. Overall, it is important<br />
that genome sequence information is released into the public domain in a timely manner so<br />
that global scientific progress can be maintained.<br />

10 Research Needs<br />

For very few environmental species multiple genome sequences are available. From genomic<br />
intra-species comparisons of pathogenic bacteria we know that these provide an extra layer of<br />
information, as genetic diversity within a bacterial species can be enormous. When multiple<br />
genomes are available for a species we can define its core genome (all genes that are present in<br />
all genomes of that species), its pan-genome (all genes that have been found in that species)<br />
and its dispensable genes, which are responsible for the variation between isolates. Multiple<br />
genomes per species, together with more metagenomic data and more archaeal genome<br />
sequences, are our most urgent data needs. The research tools for analysis of the<br />
genomes are available. Generate the sequences and the feast can begin.<br />
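The definitions of core genome, pan-genome and dispensable genes translate directly into set operations. The Python sketch below assumes that the genes of each isolate have already been grouped into gene families by some clustering step; the isolate names and family identifiers are made up.<br />

```python
def core_pan_dispensable(genomes):
    """Return (core, pan, dispensable) gene family sets for one species.

    genomes: dict mapping isolate name -> set of gene family identifiers
    """
    family_sets = list(genomes.values())
    core = set.intersection(*family_sets)   # families present in every isolate
    pan = set.union(*family_sets)           # families seen in at least one isolate
    dispensable = pan - core                # families missing from some isolate
    return core, pan, dispensable


# Hypothetical isolates of one species.
isolates = {
    "isolate_1": {"famA", "famB", "famC"},
    "isolate_2": {"famA", "famB", "famD"},
    "isolate_3": {"famA", "famB", "famC", "famE"},
}
core, pan, dispensable = core_pan_dispensable(isolates)
```

With each added genome the core can only shrink or stay the same and the pan-genome can only grow or stay the same, which is why several genomes per species are needed before either estimate stabilizes.<br />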

References<br />

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.<br />
Binnewies TT, Hallin PF, Staerfeldt HH, Ussery DW (2005) Genome update: proteome comparisons. Microbiology 151: 1–4.<br />
Binnewies TT, et al. (2006) Ten years of bacterial genome sequencing: comparative-genomics-based discoveries. Funct Integr Genomics 6: 165–185.<br />
Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J (2005) ACT: the Artemis Comparison Tool. Bioinformatics 21: 3422–3423.<br />
Chang CH, Chang YC, Underwood A, Chiou CS, Kao CY (2007) VNTRDB: a bacterial variable number tandem repeat locus database. Nucleic Acids Res 35: D416–D421.<br />
Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, Ecker DJ, Blyn LB (2002) A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems 65: 157–177.<br />
Deloger M, El Karoui M, Petit MA (2009) A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol 191: 91–99.<br />
Denoeud F, Vergnaud G (2004) Identification of polymorphic tandem repeats by direct comparison of genome sequence from different bacterial strains: a web-based resource. BMC Bioinformatics 5: 4.<br />
Field D, et al. (2008) The minimum information about a genome sequence (MIGS) specification. Nature Biotechnol 26: 541–547.<br />
Foerstner KU, von Mering C, Hooper SD, Bork P (2005) Environments shape the nucleotide composition of genomes. EMBO Rep 6: 1208–1213.<br />
Galagan JE, et al. (2002) The genome of M. acetivorans reveals extensive metabolic and physiological diversity. Genome Res 12: 532–542.<br />
Giovannoni SJ, et al. (2005) Genome streamlining in a cosmopolitan oceanic bacterium. Science 309: 1242–1245.<br />
Gottesman S (2005) Micros for microbes: non-coding regulatory RNAs in bacteria. Trends Genet 21: 399–404.<br />
Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res 33: D121–D124.<br />
Hallin PF, Binnewies TT, Ussery DW (2008) The genome BLAST atlas - a GeneWiz extension for visualization of whole-genome homology. Mol Biosyst 4: 363–371.<br />
Hallin PF, Ussery DW (2004) CBS Genome Atlas Database: a dynamic storage for bioinformatic results and sequence data. Bioinformatics 20: 3682–3686.<br />
Henz SR, Huson DH, Auch AF, Nieselt-Struwe K, Schuster SC (2005) Whole-genome prokaryotic phylogeny. Bioinformatics 21: 2329–2335.<br />
Jensen LJ, Friis C, Ussery DW (1999) Three views of microbial genomes. Res Microbiol 150: 773–777.<br />
Kurtz S, Philippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL (2004) Versatile and open software for comparing large genomes. Genome Biol 5: R12.<br />
Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35: 3100–3108.<br />
Lessner DJ, et al. (2006) An unconventional pathway for reduction of CO2 to methane in CO-grown Methanosarcina acetivorans revealed by proteomics. Proc Natl Acad Sci USA 103: 17921–17926.<br />
Mussmann M, Richter M, Lombardot T, Meyerdierks A, Kuever J, Kube M, Glöckner FO, Amann R (2005) Clustered genes related to sulfate respiration in uncultured prokaryotes support the theory of their concomitant horizontal transfer. J Bacteriol 187: 7126–7137.<br />
Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F, Bernardi G (2006) Genomic GC level, optimal growth temperature, and genome size in prokaryotes. Biochem Biophys Res Commun 347: 1–3.<br />
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW (2000) A DNA structural atlas for Escherichia coli. J Mol Biol 299: 907–930.<br />
Richter M, Kube M, Bazylinski DA, Lombardot T, Glöckner FO, Reinhardt R, Schüler D (2007) Comparative genome analysis of four magnetotactic bacteria reveals a complex set of group-specific genes implicated in magnetosome biomineralization and function. J Bacteriol 189: 4899–4910.<br />
Selengut JD, et al. (2007) TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res 35: D260–D264.<br />
Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001) On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17: 425–428.<br />
Skovgaard M, Jensen LJ, Friis C, Stærfeldt HH, Worning P, Brunak S, Ussery D (2002) The atlas visualisation of genome-wide information. Meth Microbiol 33: 49–63.<br />
Sood N, Lal B (2008) Isolation and characterization of a potential paraffin-wax degrading thermophilic bacterial strain Geobacillus kaustophilus TERI NSM for application in oil wells with paraffin deposition problems. Chemosphere 70: 1445–1451.<br />
Takami H, et al. (2004a) Genomic characterization of thermophilic Geobacillus species isolated from the deepest sea mud of the Mariana Trench. Extremophiles 8: 351–356.<br />
Takami H, et al. (2004b) Thermoadaptation trait revealed by the genome sequence of thermophilic Geobacillus kaustophilus. Nucleic Acids Res 32: 6292–6303.<br />
Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO (2004) TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 5: 163.<br />
Ussery DW, Hallin PF (2004) Genome update: AT content in sequenced prokaryotic genomes. Microbiology 150: 749–752.<br />
Ussery DW, Borini S, Wassenaar TM (2009) Computing for Comparative Microbial Genomics: Bioinformatics for Microbiologists (Computational series). London: Springer.<br />
Wheeler DL, et al. (2007) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 35: D5–D12.<br />
Willenbrock H, Friis C, Friis AS, Ussery DW (2006) An environmental signature for 323 microbial genomes based on codon adaptation indices. Genome Biol 7: R114.<br />
Worning P, Jensen LJ, Hallin PF, Staerfeldt HH, Ussery DW (2006) Origin of replication in circular prokaryotic chromosomes. Environ Microbiol 8: 353–361.


Chapter 3<br />

rRNA operons and promoter analysis<br />

3.1 Introduction<br />

This chapter covers two papers (VI and VII), dealing with rRNA localization within the<br />
genome and with analysis of the promoter region upstream of rRNA operons. The RNAmmer<br />
tool (Lagesen et al., 2007) presented in paper VI was motivated by the lack of a software<br />
tool able to annotate ribosomal RNA (rRNA) genes in prokaryotes accurately and consistently.<br />
BLAST strategies are widely used for this, as the rRNA genes are highly<br />
conserved. However, homology search methods often produce less accurate gene boundaries,<br />
as they fail to account for the observed variation in some regions. Hidden Markov<br />
Model (HMM) strategies, such as RNAmmer, can take into account conserved stem-loop<br />
structures, greatly improving the accuracy of prediction of the full-length rRNA genes.<br />
Particular detail will be given to the E. coli rRNA operons in terms of promoter predictions,<br />
since much experimental information is available about this system. An application<br />
of the gwBrowser as a tool for visualization of promoter regions upstream of the rRNA<br />
operons in E. coli concludes the chapter. The gwBrowser effort is currently being published<br />
in the Standards in Genomic Sciences journal. The P1 and P2 prediction tools are<br />
still under development and have not been published.<br />

Encoding the central structure of the ribosome, the 5S, 16S, and 23S rRNA genes are<br />
essential for protein synthesis and are transcribed at high levels. In E. coli the rrn operons<br />
are regulated by a tandem promoter system. With abundant transcription, the system is<br />
favorable for studying the mechanisms of highly expressed genes and for establishing a connection<br />
to the physical properties of the DNA. In this work, the SIDD (stress-induced duplex<br />
destabilization) energy (Wang et al., 2004; Wang & Benham, 2008) was used to measure the<br />
energy required to melt the DNA helix near the promoter region. The work was carried out<br />
during my visit to Professor Craig Benham’s lab at UC Davis in fall 2007.<br />

3.2 P1 <strong>and</strong> P2 promoters <strong>in</strong> E. coli<br />

The seven rRNA operons of E. coli are regulated by the two promoters P1 and P2,<br />
where P1 is active predominantly during exponential growth whereas P2 is active during<br />
stationary phase (Hirvonen et al., 2001; Murray & Gourse, 2004). Apart from the –10 and<br />
–35 hexamers, the P1 site contains between 3 and 5 FIS (Factor for Inversion Stimulation)<br />
binding sites and an UP element. FIS has been reported to increase transcription in<br />
vivo 4- to 10-fold in this system (Bokal et al., 1995).<br />


Conservation of regulatory elements<br />

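[Figure 3.1 image: schematic of the RNA polymerase holoenzyme (σ factor, two α subunits, β and β′ subunits) bound at the –35 and –10 regions upstream of the transcription start site (+1) and the CDS.]<br />

Figure 3.1: The transcription of bacterial genes.<br />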

The first step in transcription occurs when the sigma factor binds to the –10 and<br />
–35 regions, followed by wrapping of the DNA template around the large RNA polymerase<br />
holoenzyme complex, causing a bend in the DNA molecule (figure 3.1). Roughly 150 bp of<br />
DNA is wrapped around the polymerase, forming a constrained supercoil. The wrapping<br />
interaction with the two α-subunits is particularly important for the correct orientation<br />
of the DNA with respect to the promoter sites and for transcription initiation.<br />

Binding of the FIS protein can strongly bend the DNA and, if properly spaced, greatly<br />
facilitate the wrapping of the DNA around the α-subunits. FIS binds through<br />
a helix-turn-helix structure and recognizes a symmetric 15-nucleotide motif<br />
(Hengen et al., 1997). The stress induced when FIS binds to the DNA helix<br />
causes a bend that destabilizes the helix, lowering the energy required for melting further<br />
downstream (Wang & Benham, 2008; Bokal et al., 1995). Being highly expressed<br />
during exponential phase, FIS ensures an increased activity of P1 compared with P2. In an<br />
E. coli strain lacking the FIS protein, the P2 promoter is more active during exponential<br />
growth, and the same study suggests that FIS has a repressive effect on P2 (Liebig & Wagner,<br />
1995). Both P1 and P2 contain an UP element that binds to the RNA polymerase α C-terminal<br />
domain (αCTD). This work aims at applying an information content method to<br />
the P1 and P2 system, accounting for the helical spacing between these regulatory elements as<br />
well as the conservation of the motifs. The tandem promoter system is depicted in figure<br />
3.2.<br />

3.3 Conservation of regulatory elements<br />

Information content is widely used in bioinformatics to find and rank independent motifs<br />
as an alternative to machine learning approaches. Shultzaberger and co-workers have expanded<br />
earlier applications of information content by describing the helical facing between<br />
regulatory elements on the DNA strand (Shultzaberger et al., 2007). This framework allows<br />
for an additive combination of aligned weight matrices and their spacing to<br />
produce a final score for the entire structure. In the σ70 promoter, consisting<br />
of the –10 and –35 hexamers, the spacing corresponds to the two boxes being located on opposite<br />
sides of the DNA helix (see figure 3.3).<br />

Changing the spacing will likely disrupt binding by RNA polymerase.<br />
This is accounted for by applying a cosine function to the distance score (see equation 3.2).<br />
Shultzaberger’s equations were used to model the P1 and P2 system.<br />
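The cosine-based distance score referred to above (the accessibility n(d) of equation 3.2 and the gap surprisal GS(d) of equation 3.3) can be sketched in Python as follows. This is an illustration rather than the code used in this work. Two points are assumptions: N is taken as the sum of n(d) over every integer spacing between the minimum and maximum, and the range 13 to 19 nt with center 16 nt is read from Figure 3.2 as the spacing between the –35 and –10 hexamers. The sign follows equation 3.3 as printed.<br />

```python
import math

HELIX_TURN = 10.6  # w, length of one helical turn of B-form DNA (value from the text)


def accessibility(d, center, w=HELIX_TURN):
    """n(d) = 1 + cos(2*pi/w * (d - c)); equals 2 at the optimal spacing c."""
    return 1.0 + math.cos(2.0 * math.pi / w * (d - center))


def gap_surprisal(d, center, d_min, d_max, w=HELIX_TURN):
    """GS(d) = log2(n(d) / N), with N assumed to be the sum of n over d_min..d_max."""
    total = sum(accessibility(x, center, w) for x in range(d_min, d_max + 1))
    return math.log2(accessibility(d, center, w) / total)


# Spacing range assumed from Figure 3.2: min 13 nt, center 16 nt, max 19 nt.
D_MIN, CENTER, D_MAX = 13, 16, 19
scores = {d: gap_surprisal(d, CENTER, D_MIN, D_MAX) for d in range(D_MIN, D_MAX + 1)}
```

The score is highest at the center spacing and falls off symmetrically as the second element rotates away from the preferred face of the helix.<br />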

To score a given query sequence of length L aga<strong>in</strong>st a weight matrix, a b × p matrix<br />

is first generated by align<strong>in</strong>g the query sequence <strong>and</strong> the matrix. This provides all Rb,p<br />

[Figure: schematic of the rrnB promoter region. Labelled elements: Fis III, Fis II, Fis I, UP, –35 and –10 for P1; –35 and –10 for P2; followed by the 16S, tRNA (Glu), 23S and 5S genes, with the gene labels tuB, murI and murB. Spacer windows given in the figure (min/center/max): –4/2/4 nt, 0/3/6 nt and 13/16/19 nt.]<br />
Figure 3.2: The promoter structure of the rrnB operon in E. coli.<br />

[Figure: relative orientation of the –35 and –10 hexamers around the DNA helix.]<br />
Figure 3.3: The –10 and –35 hexamers of the E. coli σ70 promoter are located on opposite sides of the DNA helix. Deletions or insertions in the spacer cause a shift of approx. 36° per nucleotide.<br />

values.<br />
\[
R_{b,p} = \log_2(4) + \log_2\frac{n_{b,p}}{N}, \qquad R_{tot} = \sum_{p=1}^{L} R_{B,p} \tag{3.1}
\]<br />

–where b ∈ {A, T, G, C} iterates through the four bases, p denotes the position in the alignment, L is the length of the alignment (or width of the matrix), nb,p is the number of occurrences of base b at position p, and B denotes the nucleotide at position p in the query sequence. Shultzaberger and co-workers account for the helical facing by introducing the accessibility, n(d) (equation 3.2), and the gap surprisal, GS(d) (equation 3.3).<br />
\[
n(d) = 1 + \cos\left[\frac{2\pi}{w}(d - c)\right] \tag{3.2}
\]<br />

–where c is the center distance between two binding sites (i.e. the optimal spacing), d is the query distance, and w = 10.6 nt is the length of one helical turn of B-form DNA. Finally, this gives GS(d) as follows:<br />
\[
GS(d) = \log_2\frac{n(d)}{N} \tag{3.3}
\]<br />

–where N is the sum of all n(d) (see equation 3.4). The sign of GS(d) was changed from the original equation described by Shultzaberger and co-workers to allow all scores to be combined by addition.<br />
\[
N = \sum_{d=min}^{max} n(d) \tag{3.4}
\]<br />

–where min and max are the boundaries of the window examined. Finally, summing all Ri and GS(d) values gives the total information of all motifs and all spacers (see equation 3.5).<br />
\[
R_i(tot) = R_i(m_1) + GS(d, m_1) + R_i(m_2) + \dots + GS(d, m_{n-1}) + R_i(m_n) \tag{3.5}
\]<br />
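Equations 3.1 to 3.4 can be illustrated with a short Python sketch. It is not the code used in this work: the function names are hypothetical, N in equation 3.1 is assumed to be the number of aligned sites, and bases never observed at a position are given a score of minus infinity instead of a pseudocount.

```python
import math

BASES = "ATGC"

def weight_matrix(sites):
    """Build R[p][b] = log2(4) + log2(n_bp / N) from equal-length aligned sites.

    Assumption: N is the number of aligned sites.
    """
    n_sites = len(sites)
    width = len(sites[0])
    matrix = []
    for p in range(width):
        column = {}
        for b in BASES:
            count = sum(1 for s in sites if s[p] == b)
            # A base never seen at this position gets -inf (no pseudocount).
            column[b] = 2.0 + math.log2(count / n_sites) if count else float("-inf")
        matrix.append(column)
    return matrix

def motif_score(matrix, segment):
    """R_tot: sum of R[p][B] over the bases B of the query segment (eq. 3.1)."""
    return sum(col[base] for col, base in zip(matrix, segment))

def accessibility(d, center, turn=10.6):
    """n(d) = 1 + cos(2*pi/w * (d - c)) (eq. 3.2)."""
    return 1.0 + math.cos(2.0 * math.pi / turn * (d - center))

def gap_surprisal(d, center, d_min, d_max, turn=10.6):
    """GS(d) = log2(n(d) / N), with N summed over the window (eq. 3.3, 3.4)."""
    total = sum(accessibility(x, center, turn) for x in range(d_min, d_max + 1))
    return math.log2(accessibility(d, center, turn) / total)
```

With these helpers, the total of equation 3.5 for a two-element model is `motif_score(m10, box10) + gap_surprisal(d, 16, 13, 19) + motif_score(m35, box35)`, where `m10`, `m35`, `box10` and `box35` stand for the two matrices and the aligned query segments.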

3.3.1 Modeling P1 and P2 in selected enterics<br />

Existing experimentally verified –10 and –35 hexamers (Huerta & Collado-Vides, 2003) were converted into Rb,p matrices together with data for known UP elements (Estrem et al., 1998) and FIS binding sites (Hengen et al., 1997). Figure 3.4 shows logo plots of the information content of these studies. The initial weight matrices formed the basis for iteratively building the final information model of the P1 and P2 promoter structure, using the following procedure:<br />

1. Collect the E. coli and Shigella genomes<br />
2. Find the rRNA genes and extract the upstream sequences<br />
3. Apply models based on literature weight matrices<br />
4. Refine weight matrices according to observations<br />
5. Formulate final model<br />


[Figure: four sequence logos, information content in bits (0.0–2.0) against position. Consensus letters legible in the logos: TATAAT, TTGACA, TGAAATTTTTTTTTGAAAAGTA and ATTGGTYAAAWTTTRACCAAT.]<br />
Figure 3.4: Logo plots showing the initial weight matrices used for searching E. coli and Shigella genomes: –10 hexamer (a), –35 hexamer (b), UP element (c), and FIS binding motif (d).<br />

The 16S rRNA genes of all E. coli and Shigella genomes were annotated using RNAmmer. For the list of genomes, see table 3.1. All 16S rRNA genes were aligned using clustalw (Thompson et al., 1994) and a neighbor-joining tree was constructed (see figure 3.5). The figure shows additional Salmonella and Yersinia genomes for comparison.<br />


[Figure: neighbor-joining tree. Leaf labels: Escherichia coli 536; Escherichia coli APEC O1; Escherichia coli CFT073; Shigella sonnei Ss046; Shigella boydii Sb227; Shigella flexneri 2a str. 301; Shigella flexneri 2a str. 2457T; Escherichia coli UTI89; Escherichia coli K12; Escherichia coli O157:H7 EDL933; Escherichia coli O157:H7 str. Sakai; Escherichia coli W3110; Shigella dysenteriae Sd197; Salmonella enterica subsp. enterica serovar Choleraesuis str. SC-B67; Salmonella enterica subsp. enterica serovar Paratyphi A str. ATCC 9150; Salmonella enterica subsp. enterica serovar Typhi Ty2; Salmonella enterica subsp. enterica serovar Typhi str. CT18; Salmonella typhimurium LT2; Yersinia pestis Antiqua; Yersinia pestis CO92; Yersinia pestis KIM; Yersinia pestis Nepal516; Yersinia pestis Pestoides F; Yersinia pestis biovar Microtus str. 91001; Yersinia pseudotuberculosis IP 32953.]<br />
Figure 3.5: Neighbor-joining tree of the first 1,000 bases of all 16S rRNA genes of Yersinia, Salmonella, Shigella, and E. coli<br />

Organism | Accession | Reference<br />
Escherichia coli 101-1 | AAMK00000000 | (unpublished)<br />
Escherichia coli 53638 | AAKB00000000 | (unpublished)<br />
Escherichia coli 536 | CP000247 | (Brzuszkiewicz et al., 2006)<br />
Escherichia coli APEC O1 | CP000468 | (Johnson et al., 2007)<br />
Escherichia coli B171 | AAJX00000000 | (unpublished)<br />
Escherichia coli B7A | AAJT00000000 | (unpublished)<br />
Escherichia coli B | AAWW00000000 | (unpublished)<br />
Escherichia coli CFT073 | AE014075 | (Welch et al., 2002)<br />
Escherichia coli E110019 | AAJW00000000 | (unpublished)<br />
Escherichia coli E22 | AAJV00000000 | (unpublished)<br />
Escherichia coli F11 | AAJU00000000 | (unpublished)<br />
Escherichia coli K12 | U00096 | (Blattner et al., 1997)<br />
Escherichia coli O157:H7 EDL933 | AE005174 | (Perna et al., 2001)<br />
Escherichia coli O157:H7 str. Sakai | BA000007 | (Hayashi et al., 2001)<br />
Escherichia coli SECEC SMS-3-5 | ABAQ00000000 | (unpublished)<br />
Escherichia coli UTI89 | CP000243 | (Chen et al., 2006)<br />
Escherichia coli W3110 | AP009048 | (Hayashi et al., 2006)<br />
Shigella boydii CDC 3083-94 | AAKA00000000 | (unpublished)<br />
Shigella boydii Sb227 | CP000036 | (Yang et al., 2005)<br />
Shigella dysenteriae 1012 | AAMJ00000000 | (unpublished)<br />
Shigella dysenteriae Sd197 | CP000034 | (Yang et al., 2005)<br />
Shigella flexneri 2a str. 2457T | AE014073 | (Liao et al., 2003)<br />
Shigella flexneri 2a str. 301 | AE005674 | (Jin et al., 2002)<br />
Shigella sonnei Ss046 | CP000038 | (Yang et al., 2005)<br />

Table 3.1: Escherichia coli and Shigella genomes available at the time of the work (October 2007)<br />

[Figure: four profile plots of Ri against position relative to the 16S gene start (−500 to 0). Panel titles: "P1: Raw combined scores, −10, −35, UP (E. coli) (N=63)" (a); "P1: Adjusted combined scores, −10, −35, UP (E. coli) (N=63)" (b); "P2: Raw combined scores, −10, −35, UP (E. coli) (N=63)" (c); "P2: Adjusted combined scores, −10, −35, UP (E. coli) (N=63)" (d).]<br />
Figure 3.6: Profiles showing the maximum Ri(tot) scores of the initial weight matrices applied to E. coli and Shigella: Unadjusted P1 scores (a), Adjusted P1 scores (b), Unadjusted P2 scores (c), and Adjusted P2 scores (d)<br />

3.3.2 Iterating weight matrix frequencies<br />

The program iscan was developed to query a given DNA sequence and, for every position in this sequence, calculate the maximum Ri(tot) that can be obtained by trying out different spacing configurations within a specified window. The iscan algorithm aligns the first matrix with the query (in this case the –10 hexamer) and tries all distances between 13 and 19 nucleotides towards the –35 hexamer, using 16 nucleotides as the center. Then the program locks the optimal of those distances and continues with the next box (in this case the UP element) until all elements have been included. For source code, see appendix D.5. The spacing configuration of the two models is shown in figure ??.<br />
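The greedy strategy described above can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the iscan source from appendix D.5: the function names and the tuple layout of `upstream` are hypothetical, weight matrices are lists of per-position dictionaries, and the sequence is assumed to be oriented so that each further element lies at a lower index than the previous one.

```python
import math

def motif_score(matrix, segment):
    """Sum of matrix[p][base] over the segment; unknown bases score -inf."""
    return sum(col.get(base, float("-inf")) for col, base in zip(matrix, segment))

def gap_surprisal(d, center, d_min, d_max, turn=10.6):
    """GS(d) = log2(n(d) / N) with n(d) = 1 + cos(2*pi/w * (d - c))."""
    n = lambda x: 1.0 + math.cos(2.0 * math.pi / turn * (x - center))
    return math.log2(n(d) / sum(n(x) for x in range(d_min, d_max + 1)))

def greedy_scan(seq, anchor, upstream):
    """For each anchor position return (best total, chosen spacings) or None.

    anchor   -- weight matrix of the first element (e.g. the -10 hexamer)
    upstream -- list of (matrix, d_min, center, d_max), nearest element first
    """
    results = []
    for start in range(len(seq) - len(anchor) + 1):
        total = motif_score(anchor, seq[start:start + len(anchor)])
        left = start                  # left edge of the element placed last
        spacings = []
        for matrix, d_min, center, d_max in upstream:
            best = None
            for d in range(d_min, d_max + 1):
                begin = left - d - len(matrix)
                if begin < 0:
                    continue          # element would fall off the sequence
                score = (motif_score(matrix, seq[begin:begin + len(matrix)])
                         + gap_surprisal(d, center, d_min, d_max))
                if best is None or score > best[0]:
                    best = (score, d, begin)
            if best is None:          # no spacing fits at this position
                total = None
                break
            total += best[0]
            spacings.append(best[1])
            left = best[2]            # lock this spacing and move on
        results.append(None if total is None else (total, spacings))
    return results
```

Because each spacing is locked before the next element is tried, the search is linear in the number of elements, at the cost of possibly missing a combination that is only optimal when all spacings are considered jointly.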

The maximum Ri(tot) values of all operons were stacked, and the average and standard deviation were plotted as a function of position. Because the distance between P1/P2 and the 16S gene varies slightly, the unadjusted plots appear noisy. Shifting each profile slightly so that the local maxima around P1 and P2 are aligned makes the P1 and P2 model scores sharper (see figure 3.6).<br />
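A minimal Python sketch of this adjustment is given below. The function names, the search window `lo`..`hi` and the use of the population standard deviation are assumptions made for illustration; the text does not specify these details.

```python
from statistics import mean, pstdev

def peak_index(profile, lo, hi):
    """Index of the highest score within profile[lo:hi]."""
    return max(range(lo, hi), key=lambda i: profile[i])

def stacked_summary(profiles, lo, hi, flank):
    """Shift each profile so its local peak sits in the same column, then
    return per-column (mean, standard deviation) over all profiles."""
    windows = []
    for profile in profiles:
        peak = peak_index(profile, lo, hi)
        if peak - flank < 0 or peak + flank >= len(profile):
            continue  # not enough flanking positions; skipped in this sketch
        windows.append(profile[peak - flank:peak + flank + 1])
    return [(mean(column), pstdev(column)) for column in zip(*windows)]
```

Without the shift, peaks that sit one or two positions apart in different operons are averaged with low-scoring neighbours, which flattens the mean and inflates the standard deviation.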

3.3.3 Refining E. coli and Shigella models<br />

All peaks of Ri(tot) around the regions of P1 and P2 were collected, and the P1 and P2 models were refined by adjusting matrix parameters according to the observed base frequencies in the hits obtained. The resulting logo plots are shown in figure 3.7.<br />

[Figure: seven sequence logos, information content in bits (0.0–2.0) against position. Consensus letters legible in the logos: TATAAT, TCAAAAAATTATTTAAAATTTC, TTTGCTTGAAAAATGAGCGGT, TATTAT, TCAGAAAAAGAAAGCAAAAAAA, TTGTCA and TTGACT.]<br />
Figure 3.7: Logos showing the base composition of P1 and P2 of E. coli genomes, as identified by the initial P1 and P2 scan: P1 –10 hexamer (a), P1 –35 hexamer (b), P1 UP element (c), P1 FIS binding motif (d), P2 –10 hexamer (e), P2 –35 hexamer (f), P2 UP element (g)<br />

[Figure: plot titled "U00096: SIDD measure − free energy"; Z-score (−0.8 to 0.0) against distance from translation start (−400 to 400), with one curve each for s = −0.025, −0.035, −0.045 and −0.055.]<br />
Figure 3.8: Average profiles of SIDD energy calculated at four different superhelix densities: -0.025, -0.035, -0.045, and -0.055. All genes have been aligned at the translation start.<br />

3.4 DNA melting and SIDD energy<br />

An algorithm developed by Benham and co-workers (Wang & Benham, 2008; Wang et al., 2004) estimates the SIDD energy, which is the free energy required to open the DNA helix under different superhelix densities. When observing the SIDD energy 400 nucleotides on each side of the translation start of all coding sequences in E. coli K12 (accession U00096), a clear drop in the energy requirement is visible. The drop originates from the transcription start rather than the translation start, which explains the broad appearance of the curve. Figure 3.8 plots the SIDD energy values at different superhelix densities (-0.025, -0.035, -0.045, and -0.055). The graph represents Z-scores showing how the average SIDD energy at a given relative position compares with the average and standard deviation of the entire chromosome. Z-scores below zero correspond to SIDD energies lower than the chromosome average, i.e. to DNA that melts more easily.<br />
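The Z-score profile can be sketched in Python as shown below. The function name is hypothetical, the population standard deviation is an assumption, and for brevity the sketch treats all genes as lying on the forward strand (genes on the reverse strand would need their offsets mirrored).

```python
from statistics import mean, pstdev

def zscore_profile(energy, starts, flank):
    """Average energy at each offset around the given start positions,
    expressed as a Z-score against the whole-chromosome mean and SD.

    energy -- per-nucleotide SIDD energies for the chromosome
    starts -- 0-based start positions (forward strand only in this sketch)
    """
    mu = mean(energy)
    sigma = pstdev(energy)
    profile = []
    for offset in range(-flank, flank + 1):
        values = [energy[s + offset] for s in starts
                  if 0 <= s + offset < len(energy)]
        profile.append((mean(values) - mu) / sigma)
    return profile
```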

3.4.1 codesearch: Mapping numerical data to genome annotations<br />

The codesearch tool was written to enable searches for various annotation patterns of a genome and to map numerical data relative to these annotations. The tool requires a pre-generated codefile which condenses all annotations of the genome into a single string, corresponding to one character per nucleotide position (see table 3.2). The tool allows the user to provide a regular expression to search in the pre-generated codefile.<br />

A list of numerical data pertaining to the individual nucleotides of the genome can then be included. When defined, codesearch will extract the numerical values corresponding to the regions matching the pattern. The output of codesearch is divided into two tab-separated columns: the first column contains the genomic region where the pattern has matched, the other column contains either the sequence as a string (when a sequence file is given with -seq) or the numerical values (when a data file is given with -dat).<br />

114


Code Mean<strong>in</strong>g Example<br />

C Cod<strong>in</strong>g CCCCCCCCCCCCC<br />

> Annotation start on forward str<strong>and</strong> .....>CCCC...<br />

< Annotation start on reverse str<strong>and</strong> ...CCCCTTT.....<br />

t 5S rRNA ..tttssss......<br />

l 23S rRNA ...lllllcodesearch −cod U00096 . cod . gz −seq U00096 . fsa −pat ’(.{5 ,5} > s {1 ,1}) ’<br />

2 223773..223779 AAATTGA<br />

3 3939833..3939839 AAATTGA<br />

4 4033556..4033562 AAATTGA<br />

5 4164684..4164690 AAATTGA<br />

6 4206172..4206178 AAATTGA<br />

7 3426782..3426776 ATTGAAG<br />

8 2729177..2729171 ATTGAAG<br />

9 >codesearch −cod U00096 . cod . gz −dat U00096 . sidd35 . gz : 1 , 4 −pat<br />

’(.{5 ,5} > s {1 ,1}) ’\<br />

10 −format ’%0.2f ’ | tab2tbl −−w<strong>in</strong>dow = ’ −5 ,2 ’ −org ’ E . coli K12 ’ −col<br />

blue<br />

11 def org col −5 −4 −3 −2 −1 1 2<br />

12 223773..223779 E . coli K12 blue 7.93 7.93 7.94 8.00 8.26 8.28 8.37<br />

13 3939833..3939839 E . coli K12 blue 7.91 7.90 7.92 7.99 8.25 8.28 8.36<br />

14 4033556..4033562 E . coli K12 blue 7.83 7.83 7.85 7.92 8.19 8.22 8.32<br />

15 4164684..4164690 E . coli K12 blue 7.85 7.85 7.87 7.95 8.21 8.25 8.34<br />

16 4206172..4206178 E . coli K12 blue 7.91 7.91 7.92 7.99 8.26 8.28 8.37<br />

17 3426782..3426776 E . coli K12 blue 7.91 7.93 7.99 8.26 8.28 8.37 8.73<br />

18 2729177..2729171 E . coli K12 blue 7.91 7.93 7.99 8.26 8.28 8.37 8.72<br />
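The idea behind the codefile search can be illustrated with a small Python sketch. It is not the codesearch implementation: the function name and the toy codefile are hypothetical, only forward-strand matches are handled, and the per-nucleotide track may be either the genome sequence or a list of numbers of the same length as the codefile.

```python
import re

def code_search(codefile, pattern, track):
    """Yield (region, values) for every match of pattern in the codefile.

    codefile -- one annotation character per nucleotide position
    track    -- per-nucleotide data of the same length (sequence or numbers)
    """
    for match in re.finditer(pattern, codefile):
        begin, end = match.span()          # 0-based, end-exclusive
        region = f"{begin + 1}..{end}"     # reported 1-based, inclusive
        yield region, track[begin:end]

# Toy example: five unannotated positions, an annotation start, one more code.
codefile = "....." + ">ssss" + "....."
sequence = "AAATTGAAGAGTTTG"
hits = list(code_search(codefile, r".{5}>s", sequence))
```

Since the codefile and the data track share the same coordinates, a match in the annotation string translates directly into a slice of the sequence or of the numerical data.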

Using heatmap to generate energy landscape<br />

The R function heatmap, described in chapter 2, was used to compare both SIDD profiles and the profiles of P1/P2 model scores. All promoter sequences were aligned first according to the peak score of the P1 model (near the expected site of P1) and second according to the peak score of the P2 model (near the expected site of P2). In figure 3.9 the model scores are visualized in the green heatmaps on the left, whereas the rightmost heatmaps contain the SIDD energies (blue) of the aligned promoter sequences. This analysis shows that a deep drop in the SIDD energy occurs near the P1 site for approximately half of the promoter sequences.<br />

115


DNA melt<strong>in</strong>g <strong>and</strong> SIDD energy<br />

P1<br />

-10 box<br />

16S rRNA +1<br />

P2<br />

-10 box<br />

16S rRNA +1<br />

−500<br />

−490<br />

−480<br />

−470<br />

−460<br />

−450<br />

−440<br />

−430<br />

−420<br />

−410<br />

−400<br />

−390<br />

−380<br />

−370<br />

−360<br />

−350<br />

−340<br />

−330<br />

−320<br />

−310<br />

−300<br />

−290<br />

−280<br />

−270<br />

−260<br />

−250<br />

−240<br />

−230<br />

−220<br />

−210<br />

−200<br />

−190<br />

−180<br />

−170<br />

−160<br />

−150<br />

−140<br />

−130<br />

−120<br />

−110<br />

−100<br />

−90<br />

−80<br />

−70<br />

−60<br />

−50<br />

−40<br />

−30<br />

−20<br />

−10<br />

+1<br />

+10<br />

+20<br />

+30<br />

+40<br />

+50<br />

−500<br />

−490<br />

−480<br />

−470<br />

−460<br />

−450<br />

−440<br />

−430<br />

−420<br />

−410<br />

−400<br />

−390<br />

−380<br />

−370<br />

−360<br />

−350<br />

−340<br />

−330<br />

−320<br />

−310<br />

−300<br />

−290<br />

−280<br />

−270<br />

−260<br />

−250<br />

−240<br />

−230<br />

−220<br />

−210<br />

−200<br />

−190<br />

−180<br />

−170<br />

−160<br />

−150<br />

−140<br />

−130<br />

−120<br />

−110<br />

−100<br />

−90<br />

−80<br />

−70<br />

−60<br />

−50<br />

−40<br />

−30<br />

−20<br />

−10<br />

+1<br />

+10<br />

+20<br />

+30<br />

+40<br />

+50<br />

-22 34<br />

Promotor sequences<br />

Model score (bits)<br />

500<br />

490<br />

480<br />

470<br />

460<br />

450<br />

440<br />

430<br />

420<br />

410<br />

400<br />

390<br />

380<br />

370<br />

360<br />

350<br />

340<br />

330<br />

320<br />

310<br />

300<br />

290<br />

280<br />

270<br />

260<br />

250<br />

240<br />

230<br />

220<br />

210<br />

200<br />

190<br />

180<br />

170<br />

160<br />

150<br />

140<br />

130<br />

120<br />

110<br />

100<br />

90<br />

80<br />

70<br />

60<br />

50<br />

40<br />

30<br />

20<br />

10<br />

+1<br />

+10<br />

+20<br />

+30<br />

+40<br />

+50<br />

+60<br />

500<br />

490<br />

480<br />

470<br />

460<br />

450<br />

440<br />

430<br />

420<br />

410<br />

400<br />

390<br />

380<br />

370<br />

360<br />

350<br />

340<br />

330<br />

320<br />

310<br />

300<br />

290<br />

280<br />

270<br />

260<br />

250<br />

240<br />

230<br />

220<br />

210<br />

200<br />

190<br />

180<br />

170<br />

160<br />

150<br />

140<br />

130<br />

120<br />

110<br />

100<br />

90<br />

80<br />

70<br />

60<br />

50<br />

40<br />

30<br />

20<br />

10<br />

+1<br />

+10<br />

+20<br />

+30<br />

+40<br />

+50<br />

SIDD energy (kcal/mol<br />

5.8 10.0<br />

Gaps are appended<br />

to each promotor<br />

region to adjust to<br />

maxima of the P1/P2<br />

model scores<br />

Figure 3.9: E. coli <strong>and</strong> Shigella rrnB energy l<strong>and</strong>scape visualized us<strong>in</strong>g the heatmap function.<br />

Each vertical column corresponds to a promotor sequence, whereas the horizontal rows represent<br />

average values over 10 bp with<strong>in</strong> each sequence. Coord<strong>in</strong>ates labeled on the horizontal rows are<br />

relative to the 16S rRNA gene start. The upper heatmaps show P1 whereas the lower heatmaps<br />

show P2. Leftmost heatmaps show P1/P2 model scores <strong>in</strong> green, whereas rightmost heatmaps<br />

show the SIDD energy <strong>in</strong> blue.<br />

116


RNA operons <strong>and</strong> promoter analysis<br />

3.5 The genomic context: visualizing operons and DNA properties<br />

During the thesis work, this author has been involved in the development of a next generation genome browser to replace the older GeneWiz software developed at CBS (Pedersen et al., 2000; Jensen et al., 1999). The old GeneWiz is still used by the BLASTatlas service to generate the static atlas graphic. The goal with the new version was to create an interactive and platform-independent program that would allow the user to zoom from a global genomic scale down to the nucleotide level. The basic principles of transforming numerical data into a color coded representation remained identical to the GeneWiz method. However, the old GeneWiz software required several minutes to regenerate a plot, and the challenge was to provide an efficient data flow that would allow this regeneration in fractions of a second. Eva Rotenberg and Hans Henrik Stærfeldt from CBS authored the gwBrowser Java code which handles the plotting, whereas this author has been responsible for the server side software.<br />

For the fast visualization to be possible, all numerical data that are plotted must be pre-binned and accessible for all of the zoom levels. A system was established which could contain these pre-binned data for a number of genomes using a MySQL database. The first solution involved a single large table, with fields corresponding to genome id, position, zoom level, field, and value. It quickly proved unfeasible. Since storing all zoom levels for a genome of length N requires 2×N records, a rough estimate shows that 1,000 genomes of 3 Mb with 20 different DNA properties (fields) require 120 billion database records (1,000 × 2 × 3,000,000 × 20). Maintaining search indexes of this size and preventing table locks during updates made this solution impossible. A different solution, splitting each genome into its own table, solved many speed issues but did not perform satisfactorily.<br />

Instead, data are stored in binary files, one file per genome and zoom level. All values are written as fixed-width data, and using memory mapping the server can quickly obtain data within the file knowing the coordinates of the window. The listing below shows how the client retrieves data for the genome id AL111168GENOMEatlas, from position 1 to 37,473 bp, at zoom level 5. Figure 3.10 shows the workflow of the gwBrowser software. For further details on this tool, please refer to paper VII. The software is now available via http://www.cbs.dtu.dk/services/gwBrowser.<br />

set server = http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi<br />
curl $server"?d=AL111168GENOMEatlas&m=d&f=dnap0&b=1&e=37473&l=5&z=false"<br />
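The benefit of fixed-width records combined with memory mapping can be shown with a short Python sketch. The record layout (one little-endian 32-bit float per bin), the file name and the function names are assumptions for illustration only; the actual gwBrowser file format is not described here.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<f")   # assumed layout: one little-endian float32 per bin

def write_bins(path, values):
    """Write one fixed-width record per bin."""
    with open(path, "wb") as handle:
        for value in values:
            handle.write(RECORD.pack(value))

def read_window(path, first_bin, last_bin):
    """Return bins first_bin..last_bin (0-based, inclusive) via a memory map."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            offset = first_bin * RECORD.size
            count = last_bin - first_bin + 1
            chunk = view[offset:offset + count * RECORD.size]
    return [record[0] for record in RECORD.iter_unpack(chunk)]

# Usage: bins of one hypothetical zoom level, written once and read by window.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "genome.zoom5.bin")
    write_bins(path, [0.5, 1.5, 2.5, 3.5])
    window = read_window(path, 1, 2)
```

Because every record has the same width, the byte offset of any bin follows directly from its index, so a window can be read without an index structure or a scan of the file.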

3.6 Visualizing sequencing quality using gwBrowser<br />

Modern high-throughput sequencing techniques currently lack sufficient read lengths to span many repetitive elements of genomes, especially the rRNA genes mentioned above. To assess how well a given set of reads can close a genome sequence, a method was developed which accounts for both the quality scores and the uniqueness of the reads. The concept of the method is to map the qualities of all reads back to a reference genome and apply a weight to the qualities according to the uniqueness of the reads. Reads that have multiple hits throughout the genome will contribute little, whereas reads that are specific will contribute fully. Figure 3.11 shows the principle of the method, which was integrated into the gwBrowser software.<br />
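The principle can be sketched in Python as follows. The weighting used here, one divided by the number of hits of a read, is an assumption chosen for simplicity and is not necessarily the weight used in this work; the function name and input layout are likewise hypothetical. As in figure 3.11, the maximum weighted quality is kept for each genome position.

```python
def weighted_quality_track(genome_length, reads):
    """Per-position maximum of uniqueness-weighted read qualities.

    reads -- iterable of (qualities, hit_starts); hit_starts are the 0-based
             genome positions where the read aligns
    """
    track = [0.0] * genome_length
    for qualities, hit_starts in reads:
        if not hit_starts:
            continue                      # unmapped read contributes nothing
        weight = 1.0 / len(hit_starts)    # assumed weight: unique read -> 1.0
        for start in hit_starts:
            for i, quality in enumerate(qualities):
                position = start + i
                if position < genome_length:
                    track[position] = max(track[position], weight * quality)
    return track
```

A read of quality 40 that maps to two places thus contributes 20 at each location, while a uniquely mapped read of quality 30 contributes its full 30.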

[Figure: workflow diagram with a client side and a server side. Client side: configure and submit atlas (reference genome, annotations, sequencing reads, query genomes, custom numerical data); wait for processing; browser applet; editing of atlas layout (XML). Server side: main server with XML configuration; data binning of zoom levels; binned data; browser server. The applet sends requests (atlas ID, zoom level, window, field name ...) and receives the returned data.]<br />
Figure 3.10: Principal workflow of gwBrowser data exchange.<br />

[Figure: three steps. (1) Aligning the read sequence to the genome, giving hits H1 ... Hr with scores S1 ... Sr. (2) Mapping the quality scores qr(i) to the genome and applying the weight, giving q’r(i). (3) From all positions in the genome, obtaining the maximum uniqueness value derived from the mapped reads. Finally, all maximum values are plotted on the reference genome using GeneWiz Browser; the marked band in the example shows a region with low uniqueness. Tracks listed in the example: Weighted coverage, Sequence agreement, Max unique qual, Information Content, Read absence, Annotations (CDS+, CDS−, rRNA, tRNA), Intrinsic Curvature, Stacking Energy, Position Preference, Global Direct Repeats, Global Inverted Repeats, GC Skew, Percent AT.]<br />
Figure 3.11: Mapping qualities of sequencing reads to a reference genome while accounting for the uniqueness of the read.<br />

118


[Figure 3.12, panels: E. coli K12 MG1655 rRNA operons (rrnA, rrnB, rrnC, rrnD, rrnE, rrnG, rrnH); the rrnB tandem promoters P1 and P2, each with -10, -35 and UP elements, and three FIS sites; tracks for SIDD (s: -0.055, -0.045, -0.035), annotations (CDS+, CDS-, rRNA, tRNA), intrinsic curvature, stacking energy, position preference, GC skew and percent AT.]<br />

Figure 3.12: A zoom of the P1/P2 tandem promoter system upstream of the rrnB operon of E. coli K12.<br />

3.6.1 Visualizing the P1 and P2 structure using gwBrowser<br />

The gwBrowser tool allows the user to append various types of annotations, such as TSS marks, boxes and arrows, once the binning step has finished. This makes it possible to visualize promoter structures like the P1/P2 system and to integrate them with various DNA properties. The gwBrowser tool was applied to the E. coli rrnB promoter system to correlate the annotated regulatory elements with the SIDD energy (Wang et al., 2004; Wang & Benham, 2008) (see figure 3.12).<br />

The plot in figure 3.12 shows a drop in free energy upstream of P1 and P2, which from an energetic viewpoint explains the high transcription rate. The transcription factor FIS stimulates transcription at several promoters; for example, the binding of FIS at the leuV promoter (Ross et al., 1999) has been suggested to transmit the superhelical destabilization downstream to the point where the RNAP twists and opens the helix (Wang et al., 2004). This model may also be valid for the rrnB P1 promoter, as the activities of leuV and rrnB P1 are comparable (Bauer et al., 1988).<br />

3.7 Summary<br />

Ribosomal RNA genes play an important role in the cell and can be highly transcribed: often more than 90% of the total transcripts in rapidly growing bacterial cells are from rRNA genes. rRNA genes are also important in determining taxonomy. Because correctly finding the start and stop positions of the rRNA genes is difficult with BLAST searches, we have developed RNAmmer to find the rRNA genes. Once the genes are mapped, further studies, such as promoter profiling, can be done. The gwBrowser allows one to zoom in on particular areas of the chromosome and, in the case of rRNA promoters, to map important structural properties of the DNA in the promoter region.<br />




3.8 Paper VI: RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences<br />



3100–3108 Nucleic Acids Research, 2007, Vol. 35, No. 9 Published online 22 April 2007<br />

doi:10.1093/nar/gkm160<br />

RNAmmer: consistent and rapid annotation of ribosomal RNA genes<br />

Karin Lagesen 1,2,*, Peter Hallin 3, Einar Andreas Rødland 1,2,4,5, Hans-Henrik Stærfeldt 3, Torbjørn Rognes 1,2,4 and David W. Ussery 1,2,3<br />

1 Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, University of Oslo, NO-0027 Oslo, Norway, 2 Centre for Molecular Biology and Neuroscience and Institute of Medical Microbiology, Rikshospitalet-Radiumhospitalet Medical Centre, NO-0027 Oslo, Norway, 3 Center for Biological Sequence Analysis, Biocentrum-DTU, Technical University of Denmark, DK-2800 Lyngby, Denmark, 4 Department of Informatics, University of Oslo, PO Box 1080 Blindern, NO-0316 Oslo, Norway and 5 Norwegian Computing Center, PO Box 114 Blindern, NO-0314 Oslo, Norway<br />

Received December 1, 2006; Revised and Accepted March 2, 2007<br />

ABSTRACT<br />

The publication of a complete genome sequence is usually accompanied by annotations of its genes. In contrast to protein coding genes, genes for ribosomal RNA (rRNA) are often poorly or inconsistently annotated. This makes comparative studies based on rRNA genes difficult. We have therefore created computational predictors for the major rRNA species from all kingdoms of life and compiled them into a program called RNAmmer. The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project. A pre-screening step makes the method fast with little loss of sensitivity, enabling the analysis of a complete bacterial genome in less than a minute. Results from running RNAmmer on a large set of genomes indicate that the location of rRNAs can be predicted with a very high level of accuracy. Novel, unannotated rRNAs are also predicted in many genomes. The software as well as the genome analysis results are available at the CBS web server.<br />

INTRODUCTION<br />

Ribosomes are the molecular machines which form the connection between nucleic acids and proteins in all living organisms. The ribosome’s dependence on ribosomal RNAs (rRNAs) for its function has caused them to be conserved at both the sequence and the structure level. Because of this, rRNAs are often used in comparative studies such as phylogenetic inference. Comparative studies have become more popular as more genomes have been completely sequenced, but can potentially become complicated when some of the genes they are based on are poorly annotated or not annotated at all. Unfortunately, this is often a problem with rRNAs as genome annotation pipelines usually do not include tools specific for rRNA detection. Instead, rRNAs are often located by sequence similarity searches such as BLAST. Although such searches may give reasonable answers due to the high level of sequence conservation in the core regions of the genes, using such results for annotation purposes can be problematic. The validity of the search results depends on the program and database used. Changing one or both of these can drastically change the results. Genomic databases have grown exponentially over the past two decades and search programs have as a consequence had to undergo constant revisions in order to meet the requirements of the research community. Thus, the results of a search done today are probably very different from those produced several years ago. An added complication is that the most commonly used database search methods have poor performance for noncoding RNAs. A recent study comparing several different methods for predicting noncoding RNAs, including rRNAs, found that the most commonly used methods gave the most inaccurate results (1).<br />

*To whom correspondence should be addressed. Tel: +4722844786; Email: karin.lagesen@medisin.uio.no<br />

Through our work on the GenomeAtlas database (2), we have seen the results of poor annotation of rRNAs. Some genomes do not have any rRNAs annotated at all, whereas other genomes seem to have rRNAs annotated on the wrong strand. We initially tried to do systematic BLAST (3) searches, but it proved difficult to maintain consistency throughout this process. The high level of sequence conservation among the rRNAs enabled us to create hidden Markov models (HMMs) from structural alignments. Such models are more capable of capturing the sequence variation that is inherently present in the rRNA gene families than simple BLAST searches. Using HMMs also simplifies the use of common criteria for prediction assessment. A library of HMMs was constructed and the program RNAmmer was developed to make use of this library. RNAmmer is available through the CBS web site, as a web service or as a stand-alone package. It has been tested on all published genomes and gives accurate predictions of rRNAs. The program also has the added benefit of producing results that are comparable between genomes.<br />

© 2007 The Author(s)<br />

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.<br />

Our work has focused on three of the major rRNA species. The ribosome consists of two subunits, the small and the large subunit, which pair up to form the functional ribosome. The rRNAs present in prokaryotes are the 5S and 23S in the large subunit, and the 16S in the small subunit. In eukaryotes, 5S, 5.8S and 28S rRNA exist in the large subunit, and 18S rRNA in the small subunit. The 5.8S is not considered in this work. There are substantial sequence and secondary structure similarities between eukaryotic and prokaryotic rRNAs; however, the eukaryotic rRNAs commonly have longer stems and larger loops than those of the prokaryotes. The subunits are composed of both RNAs and proteins. Since their discovery in the early 1950s, it has been debated whether ribosomal function should be credited to the rRNAs or the proteins. Recent crystal studies have revealed that protein synthesis to a large extent is dependent on the rRNAs (4–7) and this has most likely been instrumental for their high level of conservation.<br />

In prokaryotes, the 16S, 23S and 5S rRNAs are commonly transcribed together, while the 18S, 28S and 5.8S rRNAs form a transcriptional unit in eukaryotes. Eukaryotic 5S rRNAs commonly appear in highly duplicated tandem repeats (8). In most organisms, there are several copies of the rRNA transcription unit, and although as much as 11% sequence divergence has been observed between units within the same genome, the difference is usually less than 1% (9). In several cases, segments are also edited out of the transcribed rRNA. These segments may be introns that after splicing leave a continuous rRNA, or they can be intervening sequences (IVS) that leave a fragmented rRNA which is still functional within the ribosome structure (10). Introns are most prevalent in eukaryotes and archaea, while intervening sequences have been seen in eukaryotes and bacteria. Introns are predominantly found within conserved sequences close to tRNA and mRNA-binding sites (10), whereas intervening sequences are ordinarily seen in hypervariable regions (11).<br />

METHODS AND MATERIALS<br />

Using HMMs to find new members of a sequence family requires reliable multiple alignments. The 16S/18S and 23S/28S rRNA alignments were retrieved from the European ribosomal RNA database (ERRD) (12). In this database, annotated large and small subunit ribosomal RNA sequences from the EMBL nucleotide database with a length of at least 70% of their estimated full length have been aligned. Multiple alignments of 5S rRNAs were retrieved from the 5S Ribosomal RNA Database (13). Data from both databases were downloaded on October 27, 2005. The alignments are all structural alignments, i.e. aligned using secondary structure information gained from comparative sequence analysis. The 5S alignments were already divided into separate alignments for archaeal, bacterial and eukaryotic sequences, whereas the ERRD data were not. The alignments for 16/18S and 23/28S rRNAs were divided into the same groups as the 5S data to provide kingdom-specific predictors. The data were stored in a MySQL database for easier handling.<br />

The ERRD data contained sequences from ‘environmental samples’. These were excluded since there was little information about them. The 5S were generally around 120 nt long, the 16/18S around 1500 nt and the 23/28S around 3000 nt long, all with no obvious outliers. The length of the eukaryotic rRNAs varied substantially, more than those of bacterial and archaeal rRNAs, but no sequences in the alignments seemed obviously wrong.<br />

The sequences were divided into phylogenetic groups to help with further analysis. Due to sequencing bias, some phylogenetic groups dominated the data sets. Such a skew could potentially cause the predictors to be less sensitive on underrepresented phylogenetic groups. Among the bacteria, 82% of the sequences were from three phyla: Actinobacteria, Firmicutes and Proteobacteria. Around 70% of the archaeal sequences were from Euryarchaeota; among the eukaryotes, the Streptophyta comprised 15% of the data. Several of the sequences also proved to be very similar. Therefore, redundancy reduction inspired by Hobohm's second algorithm (14) was performed. This algorithm starts with a sorted list of the number of neighbors each sequence has. An all-against-all comparison between the sequences is performed and neighborship is judged by the level of similarity found. Similarity was measured by $\mathrm{Score} = \sum_{i,j} n_{ij} S_{ij} / (N - g)$, where $i$ and $j$ sum over the four nucleotides, $n_{ij}$ counts the number of aligned nucleotide pairs $(i, j)$, $N$ is the length of the sequence and $g$ is the number of gap-only positions; $S_{ij}$ refers to the scoring matrix EDNAFULL created by Todd Lowe. The maximum similarity level allowed was set to ensure that each phylum was represented. Similarity graphs were formed for each group, with the sequences as vertices and edges between similar sequences. The sequence with the highest connectivity and its edges were deleted from the graph, and this was repeated until no edges remained. At the end, all removed sequences were checked to see if they had any edges to vertices in the remaining set. If not, they were reinstated. This procedure was implemented as a C program.<br />
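The redundancy reduction described above can be sketched in a few lines of Python. This is not the authors' C program; it is a minimal illustration under stated assumptions: the score uses only the unambiguous A/C/G/T entries of EDNAFULL (5 for a match, -4 for a mismatch), ties between equally connected sequences are broken by input order, and removed sequences are checked for reinstatement in the order they were removed. The names `similarity_score` and `reduce_redundancy` are hypothetical.

```python
def similarity_score(seq_a, seq_b, match=5, mismatch=-4):
    """Score = sum_ij n_ij * S_ij / (N - g) for two rows of one alignment.

    Only A/C/G/T pairs contribute to the sum (assumption: ambiguous symbols
    and nucleotide-gap pairs add nothing). g counts gap-only positions.
    """
    assert len(seq_a) == len(seq_b)
    total, gap_only = 0, 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            gap_only += 1
        elif a in "ACGT" and b in "ACGT":
            total += match if a == b else mismatch
    compared = len(seq_a) - gap_only
    return total / compared if compared else 0.0


def reduce_redundancy(names, is_similar):
    """Greedy graph pruning: drop the most connected sequence until no edges remain."""
    neighbours = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if is_similar(a, b):
                neighbours[a].add(b)
                neighbours[b].add(a)
    original = {name: set(adj) for name, adj in neighbours.items()}

    removed = []
    while neighbours:
        worst = max(neighbours, key=lambda n: len(neighbours[n]))
        if not neighbours[worst]:
            break  # no edges left in the graph
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
        removed.append(worst)

    kept = set(neighbours)
    # Reinstate removed sequences that have no neighbour in the kept set.
    for name in removed:
        if not (original[name] & kept):
            kept.add(name)
    return kept
```

In practice `is_similar` would wrap `similarity_score` with a threshold chosen per group; the text only says that the threshold was set so that each phylum remained represented, so no specific value is assumed here.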

Sequences in ERRD may contain ambiguous nucleotide symbols representing nucleotides that have not been uniquely determined. These occur more frequently in bacteria and eukaryotes than in archaea, and primarily at both ends of the alignment: in 16/18S, predominantly at the end; in 23/28S, predominantly at the beginning. In the latter case, this was mostly due to the high prevalence of gaps at the end of the alignment. As we found that ambiguous nucleotides at the ends reduced the ability to predict start and stop positions accurately, we decided to remove all sequences with five or more ambiguous nucleotides in either end of the sequence. A summary of the number of sequences removed during curation of the alignments is shown in Table 1.<br />

Table 1. The initial number of rRNA sequences and the number of sequences excluded for different reasons.<br />

Kingdom | Type | Initial count | Environmental samples | Incomplete sequences | Redundancy reduction | Total in HMM<br />
Archaea | 5S | 58 | 0 | 0 | 10 | 48<br />
Archaea | 16S | 589 | 239 | 471 | 287 | 76<br />
Archaea | 23S | 37 | 0 | 18 | 8 | 15<br />
Bacteria | 5S | 461 | 0 | 0 | 101 | 360<br />
Bacteria | 16S | 12 107 | 1429 | 10 723 | 2485 | 743<br />
Bacteria | 23S | 398 | 0 | 155 | 130 | 127<br />
Eukaryotes | 5S | 316 | 0 | 0 | 33 | 283<br />
Eukaryotes | 18S | 6585 | 24 | 5222 | 836 | 979<br />
Eukaryotes | 28S | 157 | 0 | 91 | 8 | 58<br />

Environmental samples were excluded due to lack of phylogenetic information. Sequences with too many unknown nucleotides in either end of the sequence were excluded to improve HMM accuracy. Redundancy reduction was performed to reduce bias. Note that these groups may overlap. The last column indicates the number of sequences used to build each HMM.<br />
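The end-filtering rule (drop sequences with five or more ambiguous nucleotides at either end) could look roughly like the sketch below. The text does not say how many residues count as an "end", so the `window` parameter is purely an assumption for illustration, as are the function names.

```python
UNAMBIGUOUS = set("ACGTU")
GAP_SYMBOLS = set("-.")


def count_ambiguous(segment):
    """Count symbols that are neither unambiguous nucleotides nor gaps."""
    return sum(
        1 for ch in segment.upper()
        if ch not in UNAMBIGUOUS and ch not in GAP_SYMBOLS
    )


def has_clean_ends(aligned_seq, window=50, max_ambiguous=4):
    """True if both ends carry at most max_ambiguous ambiguous symbols.

    window is a hypothetical choice of how many residues form an 'end';
    max_ambiguous=4 mirrors the 'five or more are removed' rule in the text.
    """
    residues = "".join(ch for ch in aligned_seq if ch not in GAP_SYMBOLS)
    head, tail = residues[:window], residues[-window:]
    return (count_ambiguous(head) <= max_ambiguous
            and count_ambiguous(tail) <= max_ambiguous)
```

A curated alignment would then keep only the rows for which `has_clean_ends` returns `True`.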

The software package HMMer (15) version 2.3.2 was used to create HMMs from alignments where all columns containing only gaps had been removed. It was configured for nucleotides, and to compensate for skews in the nucleotide distribution a custom null model for each alignment was used. Although redundancy reduction had been performed, the Henikoff position-based weighting scheme (16) was used to reduce any remaining biases. When using the HMMs to search genome sequences, the default alignment method was used: a match must span the entire model, and several matches may be found within one sequence.<br />

With the aim of increasing the search speed, we determined the 75 most conserved consecutive columns in each alignment, as illustrated in Figure 1, and produced ‘spotter’ HMMs based on these. Since searches with the smaller spotter models would be considerably faster, we wanted to investigate the possibility of using the spotter to pre-screen for candidates, using the full HMMs only on regions surrounding the spotter hits. Spotter and full model searches were done separately. Spotter and full model predictions were matched based on whether they had overlapping nucleotides on the same strand. A linear regression was used to express spotter score in terms of full model score. Variation was estimated as linear in full model score with non-positive regression coefficients. Least squares estimates were used in both cases. Spotter scores were assumed to be missing when negative and, hence, assumed to follow a truncated normal distribution; expected scores and square deviations were used to replace missing values in the two regressions. From this model, we computed the lowest full model score, T99, for which there was at least a 99% likelihood of getting a corresponding spotter hit, and the likelihood, Pmin, that a full model hit with the lowest found score should have a corresponding spotter hit.<br />
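To make the choice of the spotter region concrete, the following sketch computes the per-column information content used in Figure 1, C = sum_i f_i log2(f_i / q_i) with q_i = 1/4, and then locates the window of consecutive columns with the highest total. Ambiguity symbols are split evenly over the nucleotides they stand for and gaps over all four, as the figure caption describes. The IUPAC table, the handling of unknown symbols and the function names are assumptions of this example, not part of RNAmmer.

```python
import math

# IUPAC nucleotide codes; gaps are spread over all four nucleotides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "ACGT", ".": "ACGT",
}


def column_information(column):
    """Information content (bits) of one alignment column, background 1/4."""
    counts = dict.fromkeys("ACGT", 0.0)
    for symbol in column.upper():
        targets = IUPAC.get(symbol, "ACGT")  # unknown symbols treated like N
        for base in targets:
            counts[base] += 1.0 / len(targets)
    total = sum(counts.values())
    info = 0.0
    for count in counts.values():
        freq = count / total
        if freq > 0:
            info += freq * math.log2(freq / 0.25)
    return info


def most_conserved_window(alignment, width=75):
    """Return (start, per-column information) for the best window of `width` columns."""
    n_cols = len(alignment[0])
    info = [column_information("".join(row[i] for row in alignment))
            for i in range(n_cols)]
    best_start, running = 0, sum(info[:width])
    best_sum = running
    for start in range(1, n_cols - width + 1):
        # Slide the window one column to the right.
        running += info[start + width - 1] - info[start - 1]
        if running > best_sum:
            best_start, best_sum = start, running
    return best_start, info
```

With `width=75` the returned start index marks the columns from which a spotter alignment could be cut; how those columns were then turned into an HMM is described in the text and not reproduced here.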

Both the full HMMs and the spotter HMMs were run on all fully sequenced genomes found in the Genome Atlas database (listed in Supplementary Table S1). All predictions with non-negative score and E-value at most 100 were reported. Only full model hits with E-value < 0.01 were accepted as reliable hits, but none with E-value between 0.01 and 100 were reported. As rRNAs within a genome tend to be very similar, usually with at least 99% identity, different full model hits within a genome corresponding to actual rRNAs should be expected to have similar scores. However, we found a substantial number of hits with far lower scores which we assume to be pseudogenes, truncated rRNAs or otherwise nonfunctional rRNA copies. To ensure that these did not have an adverse effect on the analyses, we excluded full model hits having a score less than 80% of the maximal score in that genome. These are listed in Supplementary Table S2.<br />
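The two filters in this paragraph (the E-value cut-off and the 80% score rule) can be sketched as follows. The hit layout (dictionaries with `genome`, `kind`, `score` and `evalue` keys) is hypothetical, and the sketch assumes that the maximal score is taken per genome and per rRNA type, since scores of 5S and 23S models are not on a comparable scale; the text itself only says "in that genome".

```python
def filter_hits(hits, max_evalue=0.01, min_fraction=0.8):
    """Keep reliable full model hits scoring at least 80% of the best hit.

    hits: list of dicts with keys 'genome', 'kind', 'score', 'evalue'
    (an assumed layout for this example). The best score is tracked per
    (genome, kind) pair, which is an interpretation of the text.
    """
    reliable = [h for h in hits
                if h["score"] >= 0 and h["evalue"] < max_evalue]

    best = {}
    for h in reliable:
        key = (h["genome"], h["kind"])
        best[key] = max(best.get(key, h["score"]), h["score"])

    return [h for h in reliable
            if h["score"] >= min_fraction * best[(h["genome"], h["kind"])]]
```

Hits dropped by the second filter correspond to the low-scoring candidates that the authors interpret as pseudogenes or truncated copies.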

Annotations of rRNAs were obtained from GenBank. Unfortunately, rRNAs have not been annotated in a uniform manner and it was often unclear exactly what was annotated. In some cases, both the separate rRNAs and the full operon were annotated. In all such cases, the operons were longer than 5000 nt, and all annotations longer than that were thus excluded. In our experience, this affected only operons. In other cases, different pieces of the same gene had been annotated as separate entities. Thus, some predictions matched several annotation entries; these are listed in Supplementary Table S3. A prediction was considered to match an annotation if they were on the same strand and the length of their overlap was at least half the length of the shorter of the two; it was considered to be annotated if it matched at least one annotation. The deviation between annotated and predicted start and stop positions was also examined, but predictions with multiple matching annotations were excluded from this comparison.<br />
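The matching criterion above translates directly into code. The sketch below assumes 1-based inclusive coordinates and a simple `Feature` tuple; both are illustrative choices rather than anything prescribed by the paper.

```python
from collections import namedtuple

# 1-based, inclusive coordinates (an assumption of this sketch).
Feature = namedtuple("Feature", "start end strand")


def length(feature):
    return feature.end - feature.start + 1


def overlap(a, b):
    """Number of positions shared by two features, ignoring strand."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def matches(prediction, annotation):
    """Same strand and overlap of at least half the shorter feature."""
    if prediction.strand != annotation.strand:
        return False
    shorter = min(length(prediction), length(annotation))
    return 2 * overlap(prediction, annotation) >= shorter


def match_predictions(predictions, annotations, max_annotation_length=5000):
    """Map each prediction to its matching annotations, skipping operon-sized ones."""
    usable = [a for a in annotations if length(a) <= max_annotation_length]
    return {p: [a for a in usable if matches(p, a)] for p in predictions}
```

A prediction with an empty list would count as a putative novel rRNA, one with exactly one match can be used for the start and stop comparison, and one with several matches would be set aside, following the rules in the paragraph.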

Additional analyses were performed for experimentally verified 16S in Anaplasma marginale St. Maries (M60313), Chlamydia muridarum Nigg (D85718), Escherichia coli K12 MG1655 (J01695), Sulfolobus tokodaii St. 7 (AB022438), Thermus thermophilus HB8 (X07998) and Nitrobacter hamburgensis X14 (L11663). Computational speed was assessed on M. capricolum ATCC 27343 (CP000123), Solibacter usitatus Ellin6076 (CP000473) and Sargasso Sea data (AACY01000001-AACY01811372). All test searches reported were performed on an SGI Altix 3000 machine using one 1.3 GHz Itanium 2 processor.<br />


RESULTS<br />

The predictions of the full HMM models have been compared first against annotations, then against the spotter models.<br />

Full model predictions versus annotation<br />

As Table 2 shows, the predictors appeared to be better at detecting bacterial rRNAs and less powerful for eukaryotic rRNAs. The highest accuracy was seen for the 16/18S rRNAs followed by the 23/28S. Two groups of rRNAs were particularly difficult to locate: the archaeal 5S and the eukaryotic 18S. The missing archaeal 5S were all from four euryarchaeotic genomes which are all anaerobic methane producers. The eukaryotic 18S that the predictors could not find were all from two genomes, Guillardia theta and Plasmodium falciparum.<br />

Closer evaluation revealed that several annotated rRNAs that lacked a matching prediction had actually been detected, but on the opposite strand. In eukaryotes, this was only seen with Arabidopsis thaliana 5S. In bacteria, most of the reverse predictions were 5S; in archaea, they were predominantly 16S and 23S. It should be noted that for all the reverse strand predictions the predicted start and stop positions agreed well with the annotation, indicating that they have been annotated on the wrong strand. Annotated rRNAs that lacked matching predictions in either direction are listed in Supplementary Table S4.<br />

Table 2 gives the number of predicted rRNAs that did not have a corresponding annotation: putative novel rRNAs. About 70% of them were 5S rRNAs, and only a few were archaeal. In bacteria, most of the novel rRNAs were found in Firmicutes and Gammaproteobacteria, although it should be noted that these two phyla are the two dominant groups and contain the bulk of the currently sequenced bacterial genomes. Among the eukaryotes, only A. thaliana had novel rRNAs. The scores of the new rRNA predictions did not significantly differ from those that were annotated, indicating that these are true rRNAs not yet annotated. The 5S is often omitted in the rRNA annotation; since the eukaryotic 5S is usually separated from the 18-28S sequence, they might be less visible to annotators.<br />

Start and stop deviations<br />

[Figure 1, panels: A, 5S, Archaea (n = 48); B, 16S, Archaea (n = 76); C, 23S, Archaea (n = 15); D, 5S, Bacteria (n = 360); E, 16S, Bacteria (n = 743); F, 23S, Bacteria (n = 127); G, 5S, Eukaryotes (n = 283); H, 18S, Eukaryotes (n = 979); I, 28S, Eukaryotes (n = 58). Axes: information content (0.0 to 2.0) against position in alignment.]<br />

Figure 1. The graphs show conservation in the alignments as measured by information content: $C = \sum_i f_i \log_2(f_i / q_i)$, where $i$ sums over the four nucleotides, $f_i$ is the frequency of nucleotide $i$ in the column and $q_i = 1/4$ is used as the background frequency. Ambiguous nucleotide symbols were evenly divided between the corresponding $f_i$, gaps between all four nucleotides. The grey line represents the value for each position in the alignment, the black line is a running average over 75 nt around the current position, whereas the white dot indicates the center of the most conserved 75 nt region of the alignment.<br />

The differences between predicted and annotated start and stop positions are illustrated in Figure 2, which shows that they agree well. The medians of the start and stop prediction deviations were in most groups zero or very close to zero, with more than half within 10 nucleotides. This was not the case for the eukaryotes.<br />

For eukaryotic 5S, only five genomes contained predictions with matching annotations. The predictions were uniform in length, whereas the annotations were more variable. The predictions that indicated a substantially shorter 5S than annotated were all in Schizosaccharomyces pombe: the average length of the annotations was 170 nt, whereas the corresponding predictions were all 114 nt. For eukaryotic 18S, however, predicted start and stop positions were very accurate, although many annotated 18S were missed.<br />


3104 Nucleic Acids Research, 2007, Vol. 35, No. 9<br />

Table 2. The number of rRNAs annotated and predicted in the genomes that were examined.

Kingdom              Type  Annotated   Same strand  Other strand  Not found  Full model predictions  Novel
Archaea (n = 27)     5S    56 (24)     43 (21)      1 (1)         12 (8)     47 (23)                 4 (3)
                     16S   47 (25)     45 (25)      2 (2)         0 (0)      47 (27)                 2 (2)
                     23S   47 (25)     44 (24)      2 (2)         1 (1)      46 (26)                 2 (2)
Bacteria (n = 321)   5S    1205 (285)  1166 (285)   30 (16)       9 (5)      1339 (320)              173 (69)
                     16S   1172 (299)  1146 (299)   22 (12)       4 (4)      1237 (320)              91 (34)
                     23S   1197 (297)  1154 (291)   22 (13)       21 (12)    1248 (313)              94 (36)
Eukaryotes (n = 13)  5S    65 (7)      46 (6)       19 (1)        0 (0)      324 (9)                 278 (5)
                     18S   13 (4)      6 (4)        0 (0)         7 (2)      13 (6)                  7 (3)
                     28S   13 (5)      12 (4)       0 (0)         1 (1)      19 (7)                  7 (3)

The table gives the number of annotations and splits this into those matching predictions on the same strand, on the other strand, and not found. The total number of full model predictions is given. Novel predictions are full model predictions not matching any annotation on the same strand, and include those annotated on the other strand. Numbers in parentheses indicate the number of genomes. It should be noted that the eukaryotic annotated count is somewhat uncertain due to ambiguous rRNA annotations. The genomes analyzed were from the GenomeAtlas database, a database of all available fully sequenced genomes.
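The relationships between the columns of Table 2 can be checked with a few lines of Python. This is only a sketch: the dictionary layout and the helper names `row_is_consistent` and `same_strand_fraction` are introduced here for illustration, and the identity "novel = full model predictions minus same-strand matches" is an observation that holds for the counts in this table rather than a rule stated by the authors.

```python
# Table 2 counts per (kingdom, rRNA type):
# (annotated, same strand, other strand, not found, full model predictions, novel)
TABLE2 = {
    ("Archaea", "5S"): (56, 43, 1, 12, 47, 4),
    ("Archaea", "16S"): (47, 45, 2, 0, 47, 2),
    ("Archaea", "23S"): (47, 44, 2, 1, 46, 2),
    ("Bacteria", "5S"): (1205, 1166, 30, 9, 1339, 173),
    ("Bacteria", "16S"): (1172, 1146, 22, 4, 1237, 91),
    ("Bacteria", "23S"): (1197, 1154, 22, 21, 1248, 94),
    ("Eukaryotes", "5S"): (65, 46, 19, 0, 324, 278),
    ("Eukaryotes", "18S"): (13, 6, 0, 7, 13, 7),
    ("Eukaryotes", "28S"): (13, 12, 0, 1, 19, 7),
}


def row_is_consistent(row):
    """Check that annotations split exactly and that novel = full - same strand."""
    annotated, same, other, not_found, full, novel = row
    return annotated == same + other + not_found and novel == full - same


def same_strand_fraction(row):
    """Fraction of annotated rRNAs recovered on the annotated strand."""
    annotated, same = row[0], row[1]
    return same / annotated


for key, row in TABLE2.items():
    status = "ok" if row_is_consistent(row) else "inconsistent"
    print(key, status, f"{same_strand_fraction(row):.2f}")
```

Every row passes both checks, which suggests that the table was transcribed correctly from the flattened text.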

[Figure 2, legend and first panel group: Archaea, Bacteria, Eukaryotes; 5S (43/1163/46), Start panel.]

For eukaryotic 28S, only two genomes had predictions with matching annotations. In one of them, Encephalitozoon cuniculi, the stop position was predicted once 1112 nt and twice 4797 nt downstream of the annotation, whereas the start position was accurately predicted. In the other genome, Guillardia theta, the start positions were uniformly predicted 110 nt upstream of the annotated position, while the stop positions were predicted quite accurately.

[Figure 2, remaining panels: 5S Stop; 16/18S (44/1146/6) Start and Stop; 23/28S (42/1150/9) Start and Stop. Axis ticks in each panel: −100, −10, 0, 10, 100, 1000.]

Figure 2. Deviation of start and stop positions between predicted and annotated rRNA, presented as pairs of panels. The numbers of predictions among the archaea, bacteria and eukaryotes are given beneath each panel group heading. The zero position in each panel corresponds to the annotated start or stop position, with predicted positions presented relative to it. The yellow dot indicates the median deviation and the black box the interquartile range. The hinges extend from each side of the box to the data point that is closest to, but does not exceed, 1.5 times the interquartile range. The curves show the density of the distribution.
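The summary statistics drawn in Figure 2 (median, quartile box and hinges) can be sketched in Python as follows. The sketch assumes that the 1.5 × interquartile-range limits are measured from the box edges, which is the usual Tukey convention, and that quartiles are computed with the inclusive method of `statistics.quantiles`; the text does not state which plotting software or quartile definition was used, and the function name `box_summary` and the example deviations are hypothetical.

```python
import statistics


def box_summary(deviations):
    """Return median, quartiles and hinge ends for a list of deviations (nt).

    Needs at least two data points.
    """
    data = sorted(deviations)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr
    high_limit = q3 + 1.5 * iqr
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        # Hinges end at the most extreme data points still inside the limits.
        "lower_hinge": min(x for x in data if x >= low_limit),
        "upper_hinge": max(x for x in data if x <= high_limit),
    }


# Hypothetical deviations: predicted minus annotated position, in nt.
example = [-3, -1, 0, 0, 1, 2, 40]
print(box_summary(example))
```

In this made-up example the deviation of 40 nt lies beyond the upper limit, so the upper hinge stops at 2 nt and the outlier would only be visible in the density curve.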

Since rRNAs tend to be very similar within a genome, predictions within each genome generally had similar lengths. This similarity within genomes, as well as within groups of closely related genomes, caused multiple peaks in the distributions of endpoint deviations. An example can be seen in the bacterial 16S predictions, where some of the predicted start and stop positions were clustered downstream of the annotation and some of the predicted start positions were clustered upstream of the annotation. Major contributors to the upstream peak in the start positions were different Streptococcus pyogenes strains, Bacillus genomes and Yersinia pestis genomes. These, in addition to Streptococcus agalactiae strains and Vibrio parahaemolyticus, were also prevalent in the downstream peak of the stop positions. There was also a downstream peak in the start positions, caused mainly by Staphylococcus aureus, Bacillus cereus and several Escherichia coli relatives.

Most of the start and stop deviations did not exceed 100 nt. However, there were a few cases of deviations exceeding 1000 nt, and these are not shown in the figure. This was the case for eukaryotic 28S and was mainly due to the three previously described stop positions predicted considerably downstream of the annotated stop position. In the two longer predictions from E. cuniculi, this was due to the HMM placing the last 100 nt of the prediction further downstream to achieve a better score. Such inserts would most likely not appear when the spotter model is used first, since the inserted sequence would be too long for the search window. To test this, a truncated version of the sequence was run through the predictor. The stop position was then accurately predicted. This phenomenon also explains some cases among the bacterial 16S predictions where the start position was placed very far upstream of the annotation. There were 27 rRNAs whose start position was predicted anywhere from 13 000 to 40 000 nt upstream of the annotated start position. All but one of these were Firmicutes, mostly Streptococci and Staphylococci. Closer study of the sequences revealed that the misplaced start positions were again due to long sequences being inserted near the start of the rRNA, indicating that the first part of the HMM had been misplaced in the same manner as for the E. cuniculi stop predictions. To test whether these were the same kind of inserts, a region ending in the same place as the predictions but starting 10 000 nt earlier was run through the full model predictor. This led to the bacterial 16S rRNAs being predicted with a deviation in start and stop positions on par with what was otherwise seen.

Comparison to experimentally verified rRNAs<br />

Annotations were often ambiguous and were considered unreliable. Where annotations and RNAmmer predictions disagree, it is not a priori clear which of the two is correct. We therefore selected some genomes with experimentally verified rRNAs to further assess the accuracy of the start and stop predictions. The genomes we examined were Anaplasma marginale Str. Maries, Chlamydia muridarum Nigg, Escherichia coli K12 MG1655, Sulfolobus tokodaii Str. 7, Thermus thermophilus HB8 and Nitrobacter hamburgensis X14. According to the NCBI database, these genomes all had complete 16S sequences, and the accompanying literature stated that they were experimentally determined. When the positions of these rRNAs were checked with BLAST against the genome, some discrepancies were found. We therefore used the BLAST results when comparing annotated rRNAs to predictions.


In total, there were 14 copies of the six 16S sequences, and all of them were found by our predictions. Stop predictions were more accurate than start predictions. In all but four cases, the start position was predicted to be 7 nt downstream of the annotated start position. In A. marginale and S. tokodaii, the predicted start position was the same as the annotation, and both entries from C. muridarum were predicted to start 3 nt downstream of the annotated start position. In N. hamburgensis the start position was, in contrast to the other cases, predicted 7 nt upstream of the annotated start position. The stop positions of all but three predictions coincided with the annotation. In N. hamburgensis the predicted stop was 9 nt downstream, whereas in S. tokodaii and A. marginale the predicted stop was 1 nt downstream of the annotation. Thus, all predictions were within 10 nt of the annotated start and stop positions.

Comparison to RFAM<br />

RFAM is a database of RNA families which incorporates secondary structure in its analyses. We compared our results with the 5S rRNA predictions of RFAM (17,18) for a selection of twenty prokaryotic genomes listed in Supplementary Table S5. A total of 55 5S rRNAs were annotated in these genomes. RNAmmer found 53 of them, while 54 were found in RFAM. In three of the genomes, both methods predicted a 5S to within a few nucleotides of the annotated position, but both placed it on the other strand. Both predictors identified three new 5S rRNAs within these genomes, at approximately the same positions. Two of these new 5S rRNAs followed another annotated 5S rRNA, resembling a tandem repeat. In most cases, both methods placed the start position a few nucleotides downstream of the annotation, whereas the stop position was more evenly distributed around the annotated position. RNAmmer generally predicted rRNAs to be a nucleotide or two shorter than RFAM, usually at the start of the genes.

Spotter pre-screening

Table 3 shows that, with the exception of archaeal 5S, no full model hits were missed by the spotter model. The spotter also produced relatively few false positives, except for the eukaryotic 5S.

Minimum, maximum, quartile and median scores for all the full model predictions are shown in Table 3, giving some indication of the range of scores that rRNAs can be expected to have. The table also includes the threshold T99 and the likelihood Pmin, which indicate that all full model predictions were expected to have corresponding spotter model predictions, except for some among the archaeal 5S.

Table 3. Evaluation of spotter and full model predictions.

Kingdom  Type  Number of model predictions  Full model scores  T99  Pmin

Based on the relatively stable lengths of the different types of rRNAs and of the corresponding full model hits, and on the position of the spotter hit within them, we decided on window sizes around spotter model hits to use when the spotter model is used first. These were chosen to be 300 nt for the 5S rRNA, 5000 nt for the 16/18S and 9000 nt for the 23/28S. As these are roughly three times the length of the corresponding rRNAs, we consider rRNA sequences unlikely to extend beyond these windows.

Computational speed

Searching Mycoplasma capricolum ATCC27343, about 1 Mbp, for bacterial 16S took 14 minutes using the full HMM. Using the spotter to screen the sequence, and then the full model on the spotter hits, reduced the time to 16 seconds. Search times are expected to increase proportionally to the genome size; when the spotter model is used to screen the sequence, search time will also increase with the number of spotter hits.

Time differences between searching long and short sequences were examined by searching through the complete sequence of Solibacter usitatus Ellin6076 and through the Sargasso Sea environmental samples (19). Searching the S. usitatus genome, about 10 Mbp, took 48 seconds per Mbp. Two copies from each rRNA family were found. The Sargasso Sea samples consisted of 811 372 entries totaling over 800 Mbp. On this set the search speed was 407 seconds per Mbp. The article (19) accompanying this set indicated 1164 small subunit rRNA genes (16/18S) or fragments of genes; we found only 332, but our predictors are not able to find fragments of rRNAs. In addition, we found 562 5S and 68 23S sequences.
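The timings reported in this section can be put side by side with a short calculation. The Python sketch below only restates figures given in the text; the variable and function names are introduced here, and the mean entry length is a lower bound because the Sargasso Sea set is described as totaling "over" 800 Mbp.

```python
def speedup(slow_seconds, fast_seconds):
    """Ratio of two running times."""
    return slow_seconds / fast_seconds


# M. capricolum, bacterial 16S: full HMM versus spotter followed by full HMM.
full_hmm_seconds = 14 * 60
spotter_seconds = 16
print(speedup(full_hmm_seconds, spotter_seconds))  # 52.5

# Search rates in seconds per Mbp for one long genome and many short entries.
usitatus_rate = 48
sargasso_rate = 407
print(round(sargasso_rate / usitatus_rate, 1))

# Lower bound on the mean length of a Sargasso Sea entry, in bp.
sargasso_entries = 811_372
sargasso_bases = 800_000_000
print(round(sargasso_bases / sargasso_entries))
```

The spotter thus gave a speedup of about 50-fold on a single 1 Mbp genome, whereas the per-Mbp cost on the environmental set, whose entries average roughly 1 kbp, was more than eight times that of the 10 Mbp genome.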

Table 3 (continued)

                   Number of model predictions   Full model scores
Kingdom     Type   Full   Spotter   FPS          Min      Q1       Med      Q3       Max      T99    Pmin
Archaea     5S     47     35        7            2.9      12.7     20.0     35.3     50.6     34.9   0.69
            16S    47     47        0            1180.8   1891.9   1937.9   2004.0   2096.5   50     1.0
            23S    46     46        1            2240.7   2714.1   2870.7   3155.3   3267.3   50     1.0
Bacteria    5S     1339   1339      123          39.9     77.7     89.5     94.6     109.6    14.0   1.0
            16S    1237   1237      31           721.9    1905.5   1989.4   2058.7   2148.5   50     1.0
            23S    1248   1248      20           2502.8   3267.8   3586.5   3690.7   3876.1   50     1.0
Eukaryotes  5S     324    324       251          43.9     51.1     53.9     74.3     82.2     50     1.0
            18S    13     13        14           625.3    625.3    1733.1   1777.5   1777.6   50     1.0
            28S    19     19        5            1434.2   2904.7   3225.0   3335.9   3380.9   50     1.0

This table shows the total number of full model predictions, the number of spotter predictions that had matching full model predictions and the number of false positive spotter model predictions. The characteristics of the full model prediction score distributions are shown. FPS denotes the number of false positive spotter predictions. T99 refers to the lowest score a full model prediction could have while still being detected with 99% probability by a spotter model with positive score. Pmin is the probability that a spotter with positive score would find a full model prediction with the minimum score indicated. The lowest full model score can be used as a lower limit on which results could be expected to be real.

DISCUSSION

Our aim has been to enable high-throughput searches for rRNA while producing accurate and consistent predictions suitable for comparative analyses. For this purpose, we have developed the RNAmmer package, which relies on HMMs for both speed and accuracy. The HMMs were made using HMMer (15), which from a multiple alignment produces an HMM where match states represent columns with a specific nucleotide distribution, corresponding deletion states represent the possibility of gaps, and insertion states represent columns with large numbers of gaps; transition probabilities between the states indicate how likely each of the states is. HMMs thus differ from sequence alignments in that the likelihood of insertions and deletions may vary along the sequence. When searching a sequence with an HMM, the score indicates how well the sequence segment matches the model. The information content of a position, which reflects the nucleotide distribution and the likelihood of gaps, indicates how well that position is conserved. A good match to the HMM may come either from a highly conserved region, which may well be short, or from a longer region with only weak conservation. We find both cases. Bacterial 16S are detected despite almost half of the nucleotides being assigned to insert states, as other regions are highly conserved. For archaeal 23S, however, the information content of each position is low, but the sequence is long and there are few allowed insert states. These aspects can also explain cases of poor performance, both of the full model and of the spotter model.

The low information content in the eukaryotic 5S and 18S alignments indicates that these sequences are more divergent than archaeal and bacterial 5S and 16S. In addition, 40% of the 5S and 75% of the 18S alignment give rise to insert states in the HMM. Thus, there is little for the HMM to recognize. Furthermore, many of the missed 18S rRNAs were from Cryptophyta, a phylum which makes up only 0.6% of the alignment data. The archaeal 5S show the same characteristics as the eukaryotic 5S and 18S, which most likely explains the low performance for these rRNAs. The scores for archaeal 5S hits were generally low, and the spotter score comes from only a 75 nt part of the sequence, giving it an even lower score and causing it to miss 12 of the full model hits. It is notable, however, that these were the only cases missed by the spotter model: with the exception of archaeal 5S, our analyses show that the spotter should be able to detect rRNAs unless they have diverged much further than those in our data.

Columns at the beginning and end of the multiple alignments often have low conservation and many gaps. Such columns are generally accommodated in the HMM as insert states, but HMMer ignores them at the beginning and end of the alignment. An example is the 5S, where match states stop around 10 columns from the end of the alignments, effectively causing the HMM to predict the last conserved nucleotide of the consensus sequence rather than the stop of the rRNAs. Hence, it is not uncommon for the stop position of the 5S to be predicted up to 10 nt downstream of the annotated stop position.

These effects can also explain the endpoint accuracy that was seen when we compared our results to experimentally determined 16S sequences. We tried to find sequences where the ends had been experimentally verified by RACE or PCR, but such rRNAs proved difficult to find. All the ones we selected were sequenced, but it is uncertain to what extent the authors had tried to determine the ends. For these experimentally determined rRNAs, predictions did show better agreement with annotation than predictions in general, although this is not sufficient to conclude that our predictions are more accurate. Our stop predictions were very accurate, but more deviation was seen in the start predictions. These results could reflect more variation at the beginning of the alignments, which, as in the 5S case, could effectively cause the HMM to predict the outermost conserved nucleotide of the consensus sequence rather than the true end of the rRNAs.

In some cases, larger endpoint deviations occur. This can happen when one of the ends of the model finds a better match in a different part of the sequence. Insertion states sometimes allow the HMM to insert long gap regions and thus find a matching stop position far from the rest of the sequence. As shown for the bacterial 16S sequences that displayed this phenomenon, this is less of a problem when the spotter model is employed. The window searched around the spotter hit would most likely be too short to accommodate such an insert, and the model would match the proper sequence.
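How a search window around a spotter hit limits the region seen by the full model can be sketched as follows. The window sizes are those given in the Results (300 nt, 5000 nt and 9000 nt); everything else is an assumption made for illustration: the function name `windows_around_hits`, centering the window on the hit position, clipping at the sequence ends and merging overlapping windows are not described in the text and may differ from what RNAmmer actually does.

```python
# Window sizes in nt, as chosen in the Results section.
WINDOW_SIZE = {"5S": 300, "16/18S": 5000, "23/28S": 9000}


def windows_around_hits(hit_positions, rrna_type, sequence_length):
    """Return sorted, merged (start, end) windows around spotter hits.

    Coordinates are 0-based and half-open. Centering each window on the hit
    is an assumption of this sketch, not a documented RNAmmer behaviour.
    """
    half = WINDOW_SIZE[rrna_type] // 2
    spans = sorted(
        (max(0, position - half), min(sequence_length, position + half))
        for position in hit_positions
    )
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching windows are searched as one region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical 5S spotter hits on a 6000 nt sequence.
print(windows_around_hits([100, 200, 5000], "5S", 6000))
# [(0, 350), (4850, 5150)]
```

Because the full model only sees the sequence inside each window, an insert of several thousand nucleotides cannot be bridged, which is the effect described above.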

For fragmented rRNAs, long gap regions may be correctly predicted. This was seen for Coxiella burnetii 23S, where our prediction has the same start position as annotated but the predicted stop position is 1884 nt downstream of GenBank's stop position. However, according to Entrez Gene, this rRNA appears in four pieces and with the same stop position as ours, suggesting that in some cases 'too long' predictions might actually be correct. These cases should normally not be masked when using the spotter, unless inserts between the fragments would make the rRNA exceed the window size.

The HMM produced by HMMer requires time of order $O(NM)$ to search a sequence of length $N$ using a model with $M$ states, $M$ being proportional to the length of the multiple alignment. However, the speed is increased by using a 75 nt long spotter model to pre-screen the sequence, which requires time of order $O(N)$, and then running the full HMM on windows around each spotter hit, which requires time of order $O(KM^2)$ for $K$ spotter hits and window size proportional to $M$. The benefit of using the spotter is clearly illustrated in the M. capricolum searches. However, the time difference between the S. usitatus and the Sargasso Sea data searches shows that the spotter might lose its advantage when dealing with many shorter sequences.
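The orders of growth above can be turned into a rough cost model. The Python sketch below counts abstract "cell updates" rather than seconds, and several of its ingredients are assumptions: the model size of 1500 states, the window factor of 3, and clipping the window to the sequence length are illustrative choices, not values taken from RNAmmer. It offers one possible reading of why pre-screening helps less on short sequences; per-sequence overhead, which the model ignores, may matter as much.

```python
SPOTTER_STATES = 75  # length of the spotter model in nt, from the text


def full_model_cost(n, m):
    """Cost of scanning a sequence of length n with an m-state HMM: O(NM)."""
    return n * m


def spotter_then_full_cost(n, m, k, window_factor=3):
    """Cost of a spotter scan, O(N), plus k windowed full-model scans, O(K M^2).

    The window is assumed to be window_factor * m long and cannot exceed
    the sequence itself.
    """
    window = min(window_factor * m, n)
    return n * SPOTTER_STATES + k * window * m


M = 1500  # hypothetical number of states for a 16S-sized model

# One 1 Mbp genome with two spotter hits.
genome_ratio = full_model_cost(1_000_000, M) / spotter_then_full_cost(1_000_000, M, 2)
print(round(genome_ratio, 1))

# One 1 kbp read with a single spotter hit: the window covers the whole read,
# so the spotter scan is pure overhead.
print(spotter_then_full_cost(1000, M, 1) > full_model_cost(1000, M))  # True
```

Under these assumptions the spotter pays off when the sequence is much longer than the window, and stops paying off once a single window spans the whole sequence.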


There are other approaches to predicting non-coding RNA. One commonly used method is sequence alignment, e.g. BLAST (3), Paralign (20) or FASTA (21). Another is based on structure-sensitive Stochastic Context Free Grammars (SCFG) (22), which form the basis of the tRNA prediction program tRNAscan-SE (23) and of Infernal (24), which is used when creating RFAM. While the sequence alignment methods are very fast, they are not particularly suited for prediction of non-coding RNA (1). Infernal, however, has a general worst case running time of order $O(MN^3)$, which is prohibitive. The RFAM database (17,18), which includes 5S and the 5′ domain of 16S, uses BLAST to pre-screen genome sequences, followed by Infernal; despite being a more efficient approach than the general SCFG, it does not analyze the entire 16S. A search for 5S in a 1 Mbp genome using Infernal took 4 hours 45 minutes: roughly 1000 times as much as the 16 seconds used by RNAmmer for the much larger 16S model. A time-saving approach to SCFGs could be to use the RaveNna (25) package, which can convert an RFAM SCFG to an HMM. This drastically reduces the running time; however, its usefulness would be limited since no models for the larger rRNAs are available. Another factor is that the 5S found by RaveNna (26) which were not already in RFAM were all in organellar sequences, which are not analyzed by RNAmmer. For further comparisons and comments on these different methods, we refer to (1).

The RNAmmer program is available as a traditional HTML-based prediction server at http://www.cbs.dtu.dk/services/RNAmmer as well as through a SOAP-based web service. It is also available for download through the same site.

SUPPLEMENTARY DATA<br />

Supplementary Data is available at NAR online.

ACKNOWLEDGEMENTS<br />

We are grateful for funding from EMBIO at the University of Oslo, the Research Council of Norway and the Danish Center for Scientific Computing. This work was also supported by a grant from the European Union through the EMBRACE Network of Excellence, contract number LSHG-CT-2004-512092. We would also like to thank our colleagues for critical reading of the manuscript. Funding to pay the Open Access publication charge was provided by the Research Council of Norway.

Conflict of interest statement. None declared.

REFERENCES<br />

1. Freyhult,E., Bollback,J. and Gardner,P. (2007) Exploring genomic dark matter: a critical assessment of the performance of homology search methods on noncoding RNA. Genome Res., 17, 117–125.
2. Pedersen,A., Jensen,L., Brunak,S., Staerfeldt,H. and Ussery,D. (2000) A DNA structural atlas for Escherichia coli. J. Mol. Biol., 299, 907–930.
3. Altschul,S., Gish,W., Miller,W., Myers,E. and Lipman,D. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
4. Wimberly,B., Brodersen,D., Clemons,W. Jr., Morgan-Warren,R., Carter,A., Vonrhein,C., Hartsch,T. and Ramakrishnan,V. (2000) Structure of the 30S ribosomal subunit. Nature, 407, 327–339.
5. Schluenzen,F., Tocilj,A., Zarivach,R., Harms,J., Gluehmann,M., Janell,D., Bashan,A., Bartels,H., Agmon,I. et al. (2000) Structure of functionally activated small ribosomal subunit at 3.3 angstroms resolution. Cell, 102, 615–623.
6. Nissen,P., Hansen,J., Ban,N., Moore,P. and Steitz,T. (2000) The structural basis of ribosome activity in peptide bond synthesis. Science, 289, 920–930.
7. Yusupov,M., Yusupova,G., Baucom,A., Lieberman,K., Earnest,T., Cate,J. and Noller,H. (2001) Crystal structure of the ribosome at 5.5 Å resolution. Science, 292, 883–896.
8. Srivastava,A. and Schlessinger,D. (1991) Structure and organization of ribosomal DNA. Biochimie, 73, 631–638.
9. Acinas,S., Marcelino,L., Klepac-Ceraj,V. and Polz,M. (2004) Divergence and redundancy of 16S rRNA sequences in genomes with multiple rrn operons. J. Bacteriol., 186, 2629–2635.
10. Jackson,S., Cannone,J., Lee,J., Gutell,R. and Woodson,S. (2002) Distribution of rRNA introns in the three-dimensional structure of the ribosome. J. Mol. Biol., 323, 35–52.
11. Evguenieva-Hackenberg,E. (2005) Bacterial ribosomal RNA in pieces. Mol. Microbiol., 57, 318–325.
12. Wuyts,J., Perriere,G. and Van De Peer,Y. (2004) The European ribosomal RNA database. Nucleic Acids Res., 32 (Database issue), D101–D103.
13. Szymanski,M., Barciszewska,M., Erdmann,V. and Barciszewski,J. (2002) 5S Ribosomal RNA database. Nucleic Acids Res., 30, 176–178.
14. Hobohm,U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417.
15. Eddy,S. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
16. Henikoff,S. and Henikoff,J. (1994) Position-based sequence weights. J. Mol. Biol., 243, 574–578.
17. Griffiths-Jones,S., Moxon,S., Marshall,M., Khanna,A., Eddy,S. and Bateman,A. (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res., 33 (Database issue), D121–D124.
18. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and Eddy,S. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441.
19. Venter,J., Remington,K., Heidelberg,J., Halpern,A., Rusch,D., Eisen,J., Wu,D., Paulsen,I., Nelson,K. et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304, 66–74.
20. Rognes,T. (2001) ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res., 29, 1647–1652.
21. Pearson,W. and Lipman,D. (1988) Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85, 2444–2448.
22. Durbin,R., Eddy,S.R., Krogh,A. and Mitchison,G. (2000) Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.
23. Lowe,T. and Eddy,S. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res., 25, 955–964.
24. Eddy,S. (2002) A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3, 18.
25. Weinberg,Z. and Ruzzo,W. (2006) Sequence-based heuristics for faster annotation of non-coding RNA families. Bioinformatics, 22(1).
26. Weinberg,Z. and Ruzzo,W.L. (2004) In RECOMB 04: Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, ACM Press, pp. 243–251.


1<br />

rRNA operons <strong>and</strong> promoter analysis<br />

3.9 Paper VII: GeneWiz browser: An Interactive Tool for<br />

Visualiz<strong>in</strong>g Sequenced Chromosomes<br />



Standards in Genomic Sciences (2009) 1: 204-215 DOI:10.4056/sigs.28177<br />

GeneWiz browser: An Interactive Tool for Visualizing Sequenced Chromosomes<br />

Peter F. Hallin 1, Hans-Henrik Stærfeldt 1, Eva Rotenberg 1,2, Tim T. Binnewies 1,3, Craig J. Benham 4, and David W. Ussery 1<br />

1 Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.<br />

2 Lersoe Parkalle 37, 2TV, 2100 Copenhagen, Denmark<br />

3 Roche Diagnostics Ltd., CH-6343 Rotkreuz, Switzerland<br />

4 UC Davis Genome Center, University of California, Davis, California, U.S.A.<br />

We present an interactive web application for visualizing genomic data of prokaryotic chromosomes. The tool (GeneWiz browser) allows users to carry out various analyses such as mapping alignments of homologous genes to other genomes, mapping of short sequencing reads to a reference chromosome, and calculating DNA properties such as curvature or stacking energy along the chromosome. The GeneWiz browser produces an interactive graphic that enables zooming from a global scale down to single nucleotides, without changing the size of the plot. Its ability to disproportionally zoom provides optimal readability and increased functionality compared to other browsers. The tool allows the user to select the display of various genomic features, color setting and data ranges. Custom numerical data can be added to the plot allowing, for example, visualization of gene expression and regulation data. Further, standard atlases are pre-generated for all prokaryotic genomes available in GenBank, providing a fast overview of all available genomes, including recently deposited genome sequences. The tool is available online from http://www.cbs.dtu.dk/services/gwBrowser. Supplemental material including interactive atlases is available online at http://www.cbs.dtu.dk/services/gwBrowser/suppl/.<br />

Introduction<br />

The development of fast and inexpensive genome sequencing technologies has led to the generation of vast amounts of genomic information. As genomic sequencing becomes both more powerful and affordable, the handling and analysis of the generated data produce novel challenges and shift the focus away from the discovery process towards technical considerations of handling, storing and analyzing sequence data. An important step when exploring a new genome is to compare it to existing sequences, in order to identify both novel and conserved features. Many automated computational methods are available that attempt to derive protein function from sequence [1-3]. In a metagenomic study by Harrington and co-workers it was estimated that 76% of the examined protein coding genes could be assigned a function. However, to assess predictions for individual genes, visualization remains critical to provide the biologist with an overview of the genomic context. Are genes of interest situated in clusters? In operons? How are they regulated? How does their DNA base composition compare with that of the rest of the genome? In order to display such features both on a genome scale and in close-up down to the level of nucleotides, we developed the GeneWiz browser, which is based on the ‘Genome Atlas’ concept [4,5]. This tool can also display local DNA structural properties, so that regulatory or repeat regions can easily be identified and interpreted in a chromosomal context.<br />

During development of the GeneWiz browser, it became apparent that novel sequencing technology creates a further demand. The current generation of sequencing instruments utilizes primed<br />

The Genomic St<strong>and</strong>ards Consortium


synthesis in flow cells to simultaneously obtain the sequences of millions of different DNA templates, an approach that changed the field of DNA sequencing [6,7]. Flow sequencing, also known as sequencing by synthesis (SBS) on a solid surface, tracks nucleotides as they are added to a growing DNA strand [8]. SBS is used by high-throughput sequencing systems which have become commercially available in the past two years. Examples include the sequencer GS Titanium (commercialized by 454/Roche); Genome Analyser GA-II (Solexa/Illumina); and SOLiD 3 system (Applied Biosystems).<br />

These developments have increased the speed of sequencing while significantly reducing its cost [9,10]. This much higher throughput provides greater coverage, but at the cost of much shorter read lengths: from 50 bases with SOLiD 3 to 75 bases with Illumina GA II. Even reads of 500 bases obtained with the 454-Titanium are still shorter than read lengths typically obtained using the Sanger method [9,11]. The output from modern high-throughput sequencing equipment challenges the assembly software by generating shorter and ambiguous reads. Processing of this flood of sequence data has rapidly become a bottleneck, and developing the necessary skills and tools will most likely be a driving factor in the execution of second-generation sequencing [12]. As a first step in this development, it needs to be determined to what extent assembly of short-read sequences can be trusted, an assessment for which the GeneWiz browser can also be used.<br />

Methods<br />

Our method of visualization is based on color-encoded lanes to display numerical information on a genome atlas similar to GeneWiz [4,5]. The color encoding can be done either using a linear scale with a fixed minimum and maximum range, or a dynamic scale of standard deviations. Using the latter, color intensity decreases as data approach average values, thereby emphasizing regions of significant variation. The web interface is divided into four optional sections, to address various biological viewpoints of chromosomes: 1) DNA properties; 2) mapping of homologous genes by BLAST; 3) mapping of short sequencing reads; 4) custom lanes such as Single Nucleotide Polymorphism (SNP) or microarray data. The output of each method is a numerical vector of length corresponding to that of the reference sequence, and the methods used for this construction are described in detail below.<br />
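The two color-encoding modes can be illustrated with a minimal Python sketch. It is not the GeneWiz implementation: the function names are hypothetical, intensities are expressed on a 0 to 1 scale, and the saturation point of three standard deviations (`max_dev`) is an assumed parameter that the text does not specify.

```python
def linear_intensity(value, lo, hi):
    """Map a value onto [0, 1] using a fixed minimum/maximum range (clamped)."""
    if hi == lo:
        return 0.0
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def deviation_intensity(value, mean, stddev, max_dev=3.0):
    """Intensity grows with the distance from the mean in standard deviations.

    Values at the mean give 0 (neutral color); values max_dev or more
    standard deviations away give 1. max_dev is an assumed cut-off.
    """
    if stddev == 0:
        return 0.0
    return min(1.0, abs(value - mean) / (max_dev * stddev))
```

With the deviation-based scale, a lane that is mostly close to its genome-wide average stays pale, so that only regions of strong deviation stand out.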

Read quality assessment<br />

Gene duplications, rRNA operons and other repetitive chromosomal regions are known to cause difficulties during the assembly of short reads [13]. To assess the degree of ambiguity of sequencing reads, a method was developed that derives the uniqueness of all reads, accounting for both the read quality and the match to the reference genome.<br />

Sequence reads from Illumina and 454 are reported with base qualities: a per-nucleotide measure that denotes the credibility of the base calls. A method was derived which condenses these qualities into values per position in the reference genome and calculates the following information: uniqueness-weighted quality, information content, sequence agreement, and repeat-weighted coverage (see methods). These estimates provide a preliminary overview of regions that may appear problematic to assemble. In general, low uniqueness is found in the gaps between the assembled contigs generated by the default assembly tools from a given sequence dataset, as will be demonstrated below. A high score of uniqueness-weighted quality indicates that the base is uniquely identified by a read and that it has a high base quality in that read. The approach is illustrated in Figure 1.<br />

From the mapping, five different parameters were calculated, which together summarize the trustworthiness of the reads given the assembly:<br />

Weighted coverage. Under the assumption that all reads would map only once ($H_r = 1$), the coverage $c(i)$ can be calculated as the number of alignments $R$ mapped at position $i$. A weighted coverage $c'(i) = w_{r,h}$ (see equation below) is used to correct for higher coverage artificially introduced by repeats:<br />

http://st<strong>and</strong>ards<strong>in</strong>genomics.org 205



Figure 1 | Mapp<strong>in</strong>g reads to a reference genome account<strong>in</strong>g for uniqueness. In step 1, each read is<br />

aligned aga<strong>in</strong>st the reference genome. In the second step, the quality of each read is weighted accord<strong>in</strong>g<br />

to the uniqueness of the hit. A read giv<strong>in</strong>g rise to two hits S 1 <strong>and</strong> S 2 <strong>in</strong> the reference genome<br />

will be weighted proportionally with the relative alignment scores; if scores are identical, the<br />

mapp<strong>in</strong>g of S 1 <strong>and</strong> S 2 will be applied a weight of w=0.5 (see equation below). Step 3 maps the<br />

weighted qualities back to the reference genome so that each genomic position conta<strong>in</strong>s an array<br />

of weighted qualities. Once all reads are mapped, <strong>in</strong> step 4 only the maximum weighted quality<br />

value is kept <strong>and</strong>, step 5, the maximum weighted quality scores are color coded to reveal regions<br />

of low uniqueness.<br />
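Steps 2 to 4 of Figure 1 can be sketched in Python as follows. This is a hedged illustration, not the published code. It assumes that the weight of a hit is its alignment score divided by the summed scores of all retained hits of the same read (which reproduces the w=0.5 case in the caption), that the weighted coverage is the sum of these weights per position, and that alignments are ungapped. All function and variable names are hypothetical.

```python
from collections import defaultdict


def hit_weights(scores):
    """Weight each hit of one read by its share of the summed alignment scores."""
    total = float(sum(scores))
    return [s / total for s in scores]


def map_read(read_quals, hits, max_quality, coverage):
    """Fold one read into the per-position tracks.

    read_quals: base qualities along the read.
    hits: list of (start, score); start is a 0-based reference position
          (ungapped alignment is assumed to keep the sketch short).
    """
    weights = hit_weights([score for _, score in hits])
    for (start, _), w in zip(hits, weights):
        for offset, q in enumerate(read_quals):
            pos = start + offset
            coverage[pos] += w  # repeat-corrected (weighted) coverage
            max_quality[pos] = max(max_quality[pos], q * w)  # keep the maximum only


max_quality = defaultdict(float)
coverage = defaultdict(float)
# A read hitting two repeat copies with identical scores: each hit gets w = 0.5.
map_read([30, 30, 20], [(0, 50.0), (10, 50.0)], max_quality, coverage)
# A uniquely mapping read: w = 1.0.
map_read([40, 40, 40], [(1, 60.0)], max_quality, coverage)
```

Positions that are reached only by the ambiguous read keep half of its base quality, whereas positions also covered by the unique read take the full quality of that read.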

Uniqueness-weighted quality. This measure corresponds to the base qualities obtained from the reads that are mapped to the reference genome, weighted by the uniqueness of the read. Consider read $r$, which has a quality profile $q_r(i)$, where $i$ is the position in the read. The read is aligned to the reference genome by BLAST, and all $H_r$ hits are included when the following criteria are met: the BLAST score $S_h$ of hit $h$ is greater than or equal to $S_0$ (optionally provided by the user); $S_h \geq S_1 \cdot x$, where $S_1$ is the score of the first (best) hit and $x \in [0;1]$ is a constant provided by the user; and the E-value is equal to or less than a threshold specified by the user. The weighted quality $q'_r(i)$ is derived from $q_r(i)$ and the weight of the hit (Figure 1). From all the $q'_r(i)$ values obtained at each position in the genome, the maximum uniqueness-weighted quality is chosen when all reads have been mapped.<br />

Information content provides a number in bits of information [14] representing to what degree the reads agree: zero bits means equal distribution of A, T, G and C at a given position and 2 bits means complete conservation of a single base. The value is plotted on a color scale whereby low information (random distribution, least expected) is given in dark colors, and high information (high conservation, most expected) as light or neutral color. This measure may be useful for visualizing single nucleotide polymorphisms.<br />
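The information content described above matches the usual Shannon formulation, 2 minus the entropy of the base distribution at a position. The sketch below assumes exactly that, without any small-sample correction; the function name and the handling of positions without reads are assumptions.

```python
import math
from collections import Counter


def information_content(bases):
    """Return 2 - H (in bits) for the bases aligned at one reference position."""
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no aligned reads: treated as uninformative (assumption)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 2.0 - entropy
```

A position where the reads split evenly between two bases, for example `information_content("AACC")`, yields about one bit, which is the kind of intermediate value a heterogeneous site or a SNP would produce.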



Read absence. A boolean where ‘one’ <strong>in</strong>dicates<br />

complete absence of aligned reads.<br />

Visualization of whole-genome homology<br />

The BLASTatlas method [15] derives a map of per-nucleotide numbers on a reference genome to visualize the matches in the alignment between the reference genome and a query. The query can constitute any number of genomic contigs, scaffolds, full genomes, or collections thereof. This provides a method to identify regions of a reference genome that are conserved throughout multiple samples, as well as those that are unique. The BLASTatlas method is integrated into the GeneWiz browser software to facilitate a user-friendly interface. According to the BLAST algorithm chosen, DNA or protein sequences of the reference are aligned with the best match in the query (using either blastp, blastn, tblastn, or blastx). The alignment is then mapped back to the reference genome. A match adds a 'one' whereas a mismatch adds a 'zero' at each position along the chromosome. These ones and zeros translate into smooth color zones due to binning.<br />
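How a single alignment becomes a per-nucleotide vector of ones and zeros, and how binning smooths it, can be sketched as follows. This covers only a nucleotide alignment (as from blastn) with `-` as the gap character; protein-level alignments would first need mapping to codon positions. The function names and the choice of the bin mean as the binned value are assumptions.

```python
def match_vector(ref_aln, query_aln, ref_start, ref_length):
    """Per-nucleotide 0/1 vector for one pairwise alignment on the reference."""
    vec = [0] * ref_length
    pos = ref_start  # 0-based reference position of the first aligned base
    for r, q in zip(ref_aln, query_aln):
        if r == "-":
            continue  # insertion in the query consumes no reference position
        if r == q:
            vec[pos] = 1
        pos += 1
    return vec


def bin_means(values, bin_size):
    """Average consecutive bins; fractions of matches give graded colors."""
    return [
        sum(values[i:i + bin_size]) / len(values[i:i + bin_size])
        for i in range(0, len(values), bin_size)
    ]


vec = match_vector("ACG-T", "ACCAT", ref_start=2, ref_length=8)
binned = bin_means(vec, 4)
```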

DNA properties <strong>and</strong> DNA destabilization<br />

Through the web interface it is currently possible to select from 36 different nucleotide composition and DNA structural properties [4,5,16-22]. In addition to this, calculations of so-called SIDD energy estimates are provided, offering an approximation of promoter regions. This method estimates the free energy required to open the DNA helix, calculated at superhelical densities of -0.035, -0.044, and -0.055, using the SIDD algorithm [23]. All of these parameters can be applied in any combination to any of the prokaryotic genomes available from the web interface, or to a custom sequence provided by the user. Alternatively, the parameters may be applied as collections forming 8 standard atlases: Genome-, Base-, Structure-, Cruciform-, A-DNA-, Z-DNA-, the Repeat-atlas, and finally the SIDD atlas, which is introduced in this manuscript (Figure 3).<br />

Figure 3 Configuration <strong>and</strong> references for pre-def<strong>in</strong>ed groups of DNA sequence- <strong>and</strong> structural<br />

properties: Genome-, Base-, Structure-, Cruciform-, A-DNA-, Z-DNA-, Repeat-, <strong>and</strong> SIDD-atlas.<br />

Custom data<br />

A designated section of the GeneWiz browser is assigned for custom data. It allows the user to provide a per-nucleotide list of numerical values along with a desired color and data range. Although not presented here, this allows for visualization of additional information such as microarray data that has been pre-processed by the user, by mapping gene expression, regulation change, or p-values back to genomic coordinates. In addition to the main genome annotation covering CDSs, tRNAs, and rRNAs, the user may specify miscellaneous and pseudo-gene annotations separately. A button allows the query of selected reference genomes against a replicate of pseudogenes.org [24]. Other annotations of possible pseudogenes can be added, such as GenePRIMP output (geneprimp.jgi-psf.org/).<br />
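The pre-processing step mentioned for microarray data, mapping per-gene values back to genomic coordinates, could look like the following sketch. The function name, the 1-based inclusive coordinate convention, and the default value for intergenic positions are assumptions; the upload format expected by the browser is not described here and is not modeled.

```python
def per_nucleotide_lane(genome_length, genes, default=0.0):
    """Expand per-gene values into one value per nucleotide.

    genes: iterable of (start, end, value) with 1-based inclusive coordinates.
    Positions outside any gene keep the default value.
    """
    lane = [default] * genome_length
    for start, end, value in genes:
        for pos in range(start - 1, end):
            lane[pos] = value
    return lane


lane = per_nucleotide_lane(10, [(2, 4, 1.5), (8, 10, -2.0)])
```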

Dynamic visualization<br />

The GeneWiz browser allows dynamic disproportional zooming, meaning that zooming occurs nearly instantly when requested by the user, by redrawing all the components such as tracks, legends, marks and text for every view. This allows the browser to scale the plot to make use of the entire plotting area, by not rescaling all parts of the plot equally. For example, zooming 10× will stretch a data lane tenfold along the genome position axis, whereas the lane height and the distance to the neighboring lane remain constant. The dynamic nature of the GeneWiz browser requires pre-binning of data for each zoom level, all of which are stored on a central server; for improved efficiency only data requested by the user are sent. Storing per-nucleotide information as table records in a database (e.g. MySQL) proved unfeasible, as the number of records per genome runs into the millions, and the construction of indexes would be very time consuming. Instead, a memory mapping technique was chosen that allows the server to obtain the values directly from binary files when provided with the zoom window and level, for any chromosome in the database. (Examples are provided as supplemental data, http://www.cbs.dtu.dk/services/gwBrowser/suppl/).<br />
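The combination of pre-binning and memory-mapped binary files can be illustrated with a short Python sketch. The actual server uses a compiled C program and its file layout is not given in the text, so everything below is assumed: one little-endian 32-bit float per bin, the bin mean as the stored value, and hypothetical function and file names.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<f")  # one 32-bit float per bin (assumed layout)


def write_level(path, values, bin_size):
    """Pre-bin a per-nucleotide vector (mean per bin) and store it as raw floats."""
    with open(path, "wb") as fh:
        for i in range(0, len(values), bin_size):
            chunk = values[i:i + bin_size]
            fh.write(RECORD.pack(sum(chunk) / len(chunk)))


def read_window(path, begin, end, bin_size):
    """Return the bins overlapping the 0-based window [begin, end).

    The file is memory-mapped, so only the requested records are touched.
    """
    first, last = begin // bin_size, (end - 1) // bin_size
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [RECORD.unpack_from(mm, i * RECORD.size)[0]
                    for i in range(first, last + 1)]


tmp_dir = tempfile.mkdtemp()
level_path = os.path.join(tmp_dir, "lane.level2.bin")
write_level(level_path, [0, 2, 4, 6, 8, 10, 12, 14], bin_size=2)
window = read_window(level_path, begin=2, end=6, bin_size=2)
```

Because each record has a fixed size, the offset of any bin follows directly from the window coordinates and the zoom level, which is what makes an index unnecessary.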

The client is written as a Java applet that obtains the data remotely from the server (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi). The browser server is written in Perl/CGI, while a compiled C program handles the access to the binary data files. The options currently supported are listed in Table 2.<br />

Table 2 GeneWiz Browser server options.<br />

Option | Description<br />

d | The unique identifier for the atlas<br />

ft | Feature type (e.g. CDS, rRNA, tRNA) when returning annotations<br />

f | Data field to return<br />

b | Begin of window<br />

e | End of window<br />

l | Zoom level<br />

z | Enable zlib compression of output<br />

m=i | Return the genome length<br />

m=avg/stddev/min/max | Return aggregate data for window/genome<br />

m=d | Return data values provided field, window and zoom level<br />

m=c | Return colors provided two or three-step ranges<br />

m=n | Return nucleotides provided the window<br />

m=a | Return annotations (used together with option ‘ft’)<br />

These options (Table 2) can be incorporated into a single URL. For example, one could request all [...] http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-[...] [...]ations are described in the XML record, which can be downloaded from the web (http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/fetchxml.cgi?AL111168GENOMEatlas). Further examples are provided in the supplemental data section.<br />

The GeneWiz workflow and data displayed<br />

The GeneWiz browser plots and provides disproportional zooming for data pertaining to features and genes as well as numerical data associated with each nucleotide. The disproportional capability of the GeneWiz browser implies that all components (legends, tracks, marks, etc.) are regenerated for every view requested by the user. Figure 4 outlines the GeneWiz browser workflow.<br />

When submitting a job via the web interface, the request is assigned a job identifier, under which all data lanes and configurations are kept. After the job has been processed the user may alter lane order, colors, ranges, and append various types of marks to the plot. The layout of a given browser instance is governed by an XML file, located on the server. When generating the graphical representation of the genome, the client Java program will make requests to the server to acquire aggregated values, such as the averages, standard deviations, minima, and maxima as well as lane data and annotations.<br />
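A client request of the kind described, combining several Table 2 options in one URL, might be assembled as in the sketch below. The server address is taken from the text; everything else is an assumption made for illustration: that the options are passed as ordinary key=value query parameters, that the atlas identifier from the fetchxml example is a valid value for option d, and that `field_name` stands in for a real data field. The sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

SERVER = "http://ws.cbs.dtu.dk/cgi-bin/gwBrowser-0.91/server.cgi"


def build_request(atlas_id, mode, **options):
    """Combine Table 2 options into a single request URL (assumed syntax)."""
    params = {"d": atlas_id, "m": mode}
    params.update(options)
    return SERVER + "?" + urlencode(params)


# Hypothetical request: average of one data field over a 10 kb window.
url = build_request("AL111168GENOMEatlas", "avg",
                    f="field_name", b=1, e=10000, l=0)
```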


Figure 4 | The dataflow of the GeneWiz browser service. 1) The selected reference genome <strong>and</strong> the<br />

lanes to be <strong>in</strong>cluded are def<strong>in</strong>ed via the web <strong>in</strong>terface. 2) The request is sent to the analysis server<br />

that h<strong>and</strong>les the calculations. 3) When the job is f<strong>in</strong>ished, the web page redirects to the applet<br />

viewer that allows the user to navigate <strong>and</strong> edit the plot layout.<br />

Premade atlases<br />

The genome sequences stored in the CBS Genome Atlas Database [25] are synchronized with NCBI Entrez genome projects and have been pre-processed for all of the eight standard atlases mentioned above. This allows the user to select from currently 1,636 pre-binned replicons from 864 prokaryotic sequencing projects, searchable by replicon name, GenBank accession number, or organism name (http://www.cbs.dtu.dk/services/gwBrowser/precalc/).<br />

Results<br />

Evaluation of re-sequenc<strong>in</strong>g quality<br />

Three re-sequenced bacterial genomes were examined: one genome sequence was generated using the Illumina GA technology, whereas two genome sequences were generated utilizing the 454-Titanium technology (Table 3). The public sequence was selected as reference for mapping the re-sequencing reads using the GeneWiz browser tool. The randomness in fragmentation was estimated by comparing the experimental data with in-silico digestions, generated at 40X coverage using read lengths between 30 and 5,000 bp. A good correspondence between the in-silico and experimental reads suggests little bias towards certain chromosomal regions (Figure 5, panel A). The assembled contigs provided by 454 (C. jejuni and E. coli) are mapped to the reference genome using BLAST and annotated in the perimeter of the atlases (two leftmost atlases in Figure 5, panel A+B). The detailed atlases of the experimental data (true reads) are shown in Figure 5, panel B. Panel C shows quality/count of reads plotted as a function of read position. Note that the read quality decreases with increasing distance from the beginning of the read.<br />
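An in-silico digestion at a given coverage can be approximated by sampling fixed-length reads at uniformly random start positions. The sketch below is an assumption about how such a simulation might be set up, not the procedure used for Figure 5: it treats the chromosome as linear, uses a single read length per run, and all names are hypothetical.

```python
import random


def simulate_reads(genome, read_length, coverage, seed=0):
    """Sample fixed-length reads uniformly until the target coverage is reached."""
    rng = random.Random(seed)
    n_reads = (coverage * len(genome)) // read_length
    last_start = len(genome) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_length]) for s in starts]


toy_genome = "ACGT" * 250  # 1,000 bp toy sequence
reads = simulate_reads(toy_genome, read_length=50, coverage=40)
```

Running the same mapping on simulated and experimental reads then allows the two uniqueness profiles to be compared lane by lane.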


Table 3 Sequencing details of three bacterial genomes, two of which were re-sequenced using 454-Titanium and one with Illumina GA technology.<br />

Genome | E. coli K12 MG1655 | C. jejuni NCTC11168 | S. typhi Ty2<br />

Strain id | ATCC: 700926D-5 | ATCC: 700819D-5 | ERA000001<br />

Technology | 454-Titanium | 454-Titanium | Illumina GA II<br />

Read count | 538,784 | 502,438 | 1,650,370<br />

Avg read length (std. dev) | 522 (53) | 598 (75) | 51 (0)<br />

Truncated length | 600 | 600 | 35<br />

Coverage | 61X | 183X | 18X<br />

Genome size | 4,639,675 bp | 1,641,481 bp | 4,791,961 bp<br />

Accession and original reference | U00096 [26] | AL111168 [27] | AE014613 [28]<br />
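The coverage values in Table 3 are consistent with the usual estimate of read count times average read length divided by genome size. The short check below uses only the numbers from the table; the formula itself is the standard approximation and is assumed rather than quoted from the paper.

```python
# (read count, average read length in bases, genome size in bp) from Table 3
datasets = {
    "E. coli K12 MG1655": (538_784, 522, 4_639_675),
    "C. jejuni NCTC11168": (502_438, 598, 1_641_481),
    "S. typhi Ty2": (1_650_370, 51, 4_791_961),
}


def coverage(read_count, avg_read_length, genome_size):
    """Average number of reads covering each base of the genome."""
    return read_count * avg_read_length / genome_size


fold = {name: round(coverage(*row)) for name, row in datasets.items()}
```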

Figure 5 | Panel A: The maximum uniqueness quality is shown for the actual reads (green-to-blue<br />

lane) plotted <strong>in</strong> the outermost lanes, us<strong>in</strong>g the published genome as a reference. The follow<strong>in</strong>g<br />

lanes show <strong>in</strong>-silico digestions at 40 X coverage (red-to-blue lane), us<strong>in</strong>g read lengths 30, 50, 70,<br />

200, 500, 1,000, 1,000, <strong>and</strong> 5,000 bases. Panel B shows the weighted coverage, agreement with<br />

reference, maximum uniqueness quality, <strong>in</strong>formation content, read absence, <strong>and</strong> AT content. All<br />

six plots can be accessed for zoom<strong>in</strong>g via the supplemental data section. Panel C displays the read<br />

count (green, secondary ord<strong>in</strong>ate) <strong>and</strong> read quality (red, primary ord<strong>in</strong>ate) as a function of read<br />

length. Note that read counts differ with<strong>in</strong> the three datasets, result<strong>in</strong>g <strong>in</strong> different scales on the<br />

secondary ord<strong>in</strong>ate. For the two 454-Titanium sets (C. jejuni <strong>and</strong> E. coli K12), an assembly was<br />

provided which allows a mapp<strong>in</strong>g of contigs to the reference genome. These marks are shown <strong>in</strong><br />

gray <strong>in</strong> the perimeter of these plots. Red marks <strong>in</strong>dicate contigs with two or more hits <strong>in</strong> the reference.<br />



Genome homology: Compar<strong>in</strong>g multiple<br />

Burkholderia species<br />

A comparative study aimed at mapping, for example,<br />
pathogenic islands or gene losses among different<br />
bacterial genomes can benefit from a graphical<br />
representation provided by the BLASTatlas<br />
method. The genus Burkholderia covers a number<br />
of important animal and human pathogens<br />
known to cause melioidosis (B. pseudomallei) and<br />
pulmonary infection in cystic fibrosis (CF) patients<br />
(B. cepacia), whereas B. thailandensis, which is<br />
closely related to B. pseudomallei, rarely gives rise<br />
to disease in humans [29,30]. Both B.<br />
thailandensis and B. mallei display large chromosomal<br />
deletions when compared to B. pseudomallei.<br />

However, the more scattered nature of the<br />

Hall<strong>in</strong>, et al.<br />

gene loss observed in B. thailandensis suggests<br />
that B. mallei evolved from B. pseudomallei<br />
through the loss of larger regions [31]. These deletions<br />
are evident from the atlas shown in Figure 6,<br />
where the two chromosomes of Burkholderia<br />
pseudomallei 1710b are used as BLASTatlas reference<br />
in a comparison with 14 publicly available<br />
Burkholderia genomes (B. thailandensis plus all<br />
species having two or more strains sequenced, see<br />
supplemental data). In addition, the deletions show a<br />
strong preference for chromosome II. Ong and co-workers<br />
report that deletions in chromosome II account for 70% and 61% of the<br />
total gene loss in B. mallei and B. thailandensis,<br />
respectively [31].<br />

Figure 6 | BLASTatlas of Burkholderia pseudomallei 1710b chromosomes I+II compared with 14<br />
Burkholderia genomes. From the outermost circles inwards: B. ambifaria (2, purple), B. cenocepacia<br />
(4, red), B. thailandensis (1, green) 10774, B. mallei (4, green), and B. pseudomallei (3, blue). Innermost<br />
circles show percent AT and CG skew. Note that, to allow visual comparison between B.<br />
thailandensis and B. mallei, both species are colored green: the outermost green lane corresponds<br />
to the single B. thailandensis, whereas the remaining four green lanes are all B. mallei. GenBank<br />
accession numbers as well as interactive plots are available through the supplemental data section.<br />

http://st<strong>and</strong>ards<strong>in</strong>genomics.org 211


GeneWiz browser<br />

The SIDD atlas: Annotation of regulatory<br />

elements<br />

The browser application enables the user to append<br />
various annotation marks such as transcription<br />
start site arrows, gene labels, and boxes. A<br />
final example illustrates how these marks can be<br />
used to integrate known regulatory elements with<br />
DNA properties and gene annotations to draw a<br />
more complete picture of a promoter region. The<br />
regulatory elements of the E. coli K12 MG1655 rrn<br />
operons [32] have been annotated in a standard<br />
SIDD atlas, providing a visualization of the P1/P2<br />
promoter structure (Figure 7). A zoom of the promoter<br />
region reveals a strong SIDD site near the<br />
predominant P1 promoter, approximately 40 bp<br />
upstream of the P1 transcription start site. The<br />
transcription factor FIS stimulates transcription at<br />
several promoters; for example, the binding of<br />
FIS at the leuV promoter [33] has been suggested<br />
to transmit the superhelical destabilization downstream<br />
to the point where the RNAP twists and<br />
opens the helix [34]. This model may also be valid for<br />
the rrnB P1 promoter, as the activities of leuV<br />
and rrnB P1 are comparable [35].<br />

Figure 7 | A zoom upstream of the E. coli K12 MG1655 rrnB operon. The three outermost lanes<br />
show SIDD at three superhelix densities of sigma=-0.055, -0.045, and -0.035. The lower free energy<br />
required to melt the helix can be observed near the UP element of P1, for the SIDD lane at sigma<br />
= -0.045. The atlas is available for zooming in the supplemental data section.<br />

Discussion<br />

Visualization of the multidimensional information<br />
that is represented by a single genome sequence<br />
remains complex. An indispensable property of a<br />
genome visualization tool is that it must be zoomable,<br />
so that information can be interpreted at<br />
varying scales. Two recently published methods,<br />
the DNAPlotter [36] and the Genome Projector<br />
[37], both enable the user to build circular plots of<br />
numerical data related to genes as well as graphs<br />
of numerical data pertaining to the nucleotides.<br />
These tools create static graphics and allow only<br />
proportional zooming, which makes the plot<br />
hard to interpret when zooming in too far. Both of<br />
these tools allow visualization of individual<br />
genomes, but do not allow easy comparison across<br />
multiple genomes. As new genome sequences<br />
become available with increasing ease, it is essential to be<br />
able to quickly compare other genomes to a reference.<br />

A number of other tools approach genome visualization<br />
from different angles: GenomeDiagram [38]<br />
and Circos [39] are command line programs generating<br />
publication-quality static images and vector<br />
graphics. Although these tools allow comparison<br />
of other genomes, are flexible, and allow visualization<br />
of numerical data, they lack an interactive<br />
layer.<br />

The GeneWiz browser described here uses disproportional<br />
zooming to overcome these limitations. From a<br />
technical perspective, the choice of programming<br />
language for writing graphical browsers is of importance.<br />
There are obvious advantages of providing<br />
platform-independent Java software like that<br />
of the GeneWiz browser, but often this is at the<br />
cost of performance. Nevertheless, our tool demonstrates<br />
the usefulness of a genome browser<br />
that relies on interactive, true disproportional<br />
zooming to visualize annotated genes and features<br />
as well as numerical data provided at single-nucleotide<br />
resolution. By building a comprehensive<br />
tool that is both scalable and flexible, we have<br />
shown how different types of genomic data can be<br />
integrated into a single, easily navigated graphic<br />
that can be annotated further by the user.<br />

Author contributions<br />

P.F.H. wrote the paper and composed the web<br />
interfaces, as well as most parts of the server back<br />
end. H.H.S. wrote the C code of the data binning<br />
and retrieval software and contributed to the Java<br />
Applet; E.R. wrote the majority of the Java Applet<br />
code and formulation of the XML configurations.<br />
T.T.B. provided source data and analysis of C. jejuni<br />
and E. coli sequencing reads and C.J.B. assisted in<br />
writing the paper (paragraphs on SIDD energy).<br />
D.W.U. assisted in writing the paper, supervised<br />
the project and provided ideas for figures and<br />
analysis. All authors have read and made corrections<br />
to the manuscript.<br />

Acknowledgements<br />

This work is funded in part by grants from the Danish<br />
Center for Scientific Computing, NSF Research Grant<br />
DBI-0416764, The Danish Research Council grant 26-06-0349,<br />
and the EU EMBRACE network of Excellence,<br />
contract number LSHG-CT-2004-512092. We thank<br />
Mark Driscoll and Marcel Margulies from 454 Life<br />
Sciences for providing the data for C. jejuni and E. coli<br />
and Julian Parkhill at the Sanger Institute for providing<br />
the S. typhi sequencing data. We thank also Dr. Trudy<br />
Wassenaar and Dr. Lars Juhl Jensen for making suggestions<br />
to the manuscript.<br />

References<br />

1. Harrington ED, Singh AH, Doerks T, Letunic I, von Mering C, Jensen LJ, Raes J, Bork P. Quantitative assessment of protein function prediction from metagenomics shotgun sequences. Proc Natl Acad Sci USA 2007; 104:13913-13918. PubMed doi:10.1073/pnas.0702636104<br />

2. Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Staerfeldt HH, Rapacki K, Workman C et al. Prediction of human protein function from post-translational modifications and localization features. J Mol Biol 2002; 319:1257-1265. PubMed doi:10.1016/S0022-2836(02)00379-0<br />

3. Friedberg I. Automated protein function prediction--the genomic challenge. Brief Bioinform 2006; 7:225. PubMed doi:10.1093/bib/bbl004<br />

4. Jensen LJ, Friis C, Ussery DW. Three views of microbial genomes. Res Microbiol 1999; 150:773-777. PubMed doi:10.1016/S0923-2508(99)00116-3<br />

5. Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW. A DNA structural atlas for Escherichia coli. J Mol Biol 2000; 299:907-930. PubMed doi:10.1006/jmbi.2000.3787<br />

6. Hall N. Advanced sequencing technologies and their wider impact in microbiology. J Exp Biol 2007; 210:1518-1525. PubMed doi:10.1242/jeb.001370<br />

7. Holt RA, Jones SJ. The new paradigm of flow cell<br />

sequenc<strong>in</strong>g. Genome Res 2008; 18:839-846.<br />

PubMed doi:10.1101/gr.073262.107<br />

8. Käller M, Lundeberg J, Ahmadian A. Arrayed<br />

identification of DNA signatures. Expert Rev Mol<br />

Diagn 2007; 7:65-76. PubMed<br />

doi:10.1586/14737159.7.1.65<br />

9. Gupta PK. S<strong>in</strong>gle-molecule DNA sequenc<strong>in</strong>g<br />

technologies for future genomics research. Trends<br />

Biotechnol 2008; 26:602-611. PubMed<br />

doi:10.1016/j.tibtech.2008.07.003<br />

10. Shendure J, Ji H. Next-generation DNA sequenc<strong>in</strong>g.<br />

Nat Biotechnol 2008; 26:1135-1145.<br />

PubMed doi:10.1038/nbt1486<br />

11. Smith DR, Quinlan AR, Peckham HE, Makowsky<br />
K, Tao W, Woolf B, Shen L, Donahue WF, Tusneem<br />
N, Stromberg MP et al. Rapid whole-genome<br />
mutational profiling using next-generation<br />
sequencing technologies. Genome Res<br />
2008; 18:1638-1642. PubMed<br />
doi:10.1101/gr.077776.108<br />

12. L<strong>in</strong> F, Schröder H, Schmidt B. Solv<strong>in</strong>g the Bottleneck<br />

Problem <strong>in</strong> Bio<strong>in</strong>formatics Comput<strong>in</strong>g: An<br />

Architectural Perspective. J VLSI Signal Process<br />

2007; 48:185-188. doi:10.1007/s11265-007-<br />

0088-z<br />

13. Phillippy AM, Schatz MC, Pop M. Genome assembly<br />

forensics: f<strong>in</strong>d<strong>in</strong>g the elusive mis-<br />


assembly. Genome Biol 2008; 9:R55. PubMed<br />

doi:10.1186/gb-2008-9-3-r55<br />

14. Tolstrup N, Rouzé P, Brunak S. A branch po<strong>in</strong>t<br />

consensus from Arabidopsis found by noncircular<br />

analysis allows for better prediction of<br />

acceptor sites. Nucleic Acids Res 1997; 25:3159-<br />

3163. PubMed doi:10.1093/nar/25.15.3159<br />

15. Hall<strong>in</strong> PF, B<strong>in</strong>newies TT, Ussery DW. The genome<br />

BLASTatlas-a GeneWiz extension for visualization<br />

of whole-genome homology. Mol Biosyst<br />

2008; 4:363-371. PubMed<br />

doi:10.1039/b717118h<br />

16. Bolshoy A, McNamara P, Harr<strong>in</strong>gton RE, Trifonov<br />

EN. Curved DNA without A-A: experimental estimation<br />

of all 16 DNA wedge angles. Proc Natl<br />

Acad Sci USA 1991; 88:2312-2316. PubMed<br />

doi:10.1073/pnas.88.6.2312<br />

17. Brukner I, Sánchez R, Suck D, Pongor S. Sequence-dependent<br />

bend<strong>in</strong>g propensity of DNA as<br />

revealed by DNase I: parameters for tr<strong>in</strong>ucleotides.<br />

EMBO J 1995; 14:1812-1818. PubMed<br />

18. van Noort V, Worn<strong>in</strong>g P, Ussery DW, Rosche<br />

WA, S<strong>in</strong>den RR. Str<strong>and</strong> misalignments lead to quasipal<strong>in</strong>drome<br />

correction. Trends Genet 2003;<br />

19:365-369. PubMed doi:10.1016/S0168-<br />

9525(03)00136-7<br />

19. Olson WK, Gor<strong>in</strong> AA, Lu XJ, Hock LM, Zhurk<strong>in</strong><br />

VB. DNA sequence-dependent deformability deduced<br />

from prote<strong>in</strong>-DNA crystal complexes. Proc<br />

Natl Acad Sci USA 1998; 95:11163-11168.<br />

PubMed doi:10.1073/pnas.95.19.11163<br />

20. Ornste<strong>in</strong> RL, Re<strong>in</strong> R, Breen DL, MacElroy RD. An<br />

optimized potential function for the calculation of<br />

nucleic acid <strong>in</strong>teraction energies. I- Base stack<strong>in</strong>g.<br />

Biopolymers 1978; 17:2341-2360.<br />

doi:10.1002/bip.1978.360171005<br />

21. Satchwell SC, Drew HR, Travers AA. Sequence<br />

periodicities <strong>in</strong> chicken nucleosome core DNA. J<br />

Mol Biol 1986; 191:659-675. PubMed<br />

doi:10.1016/0022-2836(86)90452-3<br />

22. Ussery D, Soumpasis DM, Brunak S, Staerfeldt<br />

HH, Worn<strong>in</strong>g P, Krogh A. Bias of pur<strong>in</strong>e stretches<br />

<strong>in</strong> sequenced chromosomes. Comput Chem<br />

2002; 26:531-541. PubMed doi:10.1016/S0097-<br />

8485(02)00013-X<br />

23. Wang H, Benham CJ. Superhelical destabilization<br />

<strong>in</strong> regulatory regions of stress response genes.<br />

PLOS Comput Biol 2008; 4:e17. PubMed<br />

doi:10.1371/journal.pcbi.0040017<br />

24. Karro JE, Yan Y, Zheng D, Zhang Z, Carriero N,<br />
Cayting P, Harrison P, Gerstein M. Pseudogene.org:<br />
a comprehensive database and comparison<br />
platform for pseudogene annotation. Nucleic<br />
Acids Res 2007; 35:D55-D60. PubMed<br />
doi:10.1093/nar/gkl851<br />

25. Hall<strong>in</strong> PF, Ussery DW. <strong>CBS</strong> Genome Atlas Database:<br />

a dynamic storage for bio<strong>in</strong>formatic results<br />

<strong>and</strong> sequence data. Bio<strong>in</strong>formatics 2004;<br />

20:3682-3686. PubMed<br />

doi:10.1093/bio<strong>in</strong>formatics/bth423<br />

26. Blattner FR, Plunkett G, Bloch CA, Perna NT,<br />

Burl<strong>and</strong> V, Riley M, Collado-Vides J, Glasner JD,<br />

Rode CK, Mayhew GF et al. The complete genome<br />

sequence of Escherichia coli K-12. Science<br />

1997; 277:1453-1462. PubMed<br />

doi:10.1126/science.277.5331.1453<br />

27. Parkhill J, Wren BW, Mungall K, Ketley JM,<br />

Churcher C, Basham D, Chill<strong>in</strong>gworth T, Davies<br />

RM, Feltwell T, Holroyd S et al. The genome sequence<br />

of the food-borne pathogen Campylobacter<br />

jejuni reveals hypervariable sequences. Nature<br />

2000; 403:665-668. PubMed<br />

doi:10.1038/35001088<br />

28. Deng W, Liou SR, Plunkett G, Mayhew GF, Rose<br />

DJ, Burl<strong>and</strong> V, Kodoyianni V, Schwartz DC,<br />

Blattner FR. <strong>Comparative</strong> genomics of Salmonella<br />

enterica serovar Typhi stra<strong>in</strong>s Ty2 <strong>and</strong> CT18. J<br />

Bacteriol 2003; 185:2330-2337. PubMed<br />

doi:10.1128/JB.185.7.2330-2337.2003<br />

29. Brett PJ, DeShazer D, Woods DE. Burkholderia<br />

thail<strong>and</strong>ensis sp. nov., a Burkholderia pseudomallei-like<br />

species. Int J Syst Bacteriol 1998; 48:317-<br />

320. PubMed<br />

30. Smith MD, Angus BJ, Wuthiekanun V, White NJ.<br />

Arab<strong>in</strong>ose assimilation def<strong>in</strong>es a nonvirulent biotype<br />

of Burkholderia pseudomallei. Infect Immun<br />

1997; 65:4319-4321. PubMed<br />

31. Ong C, Ooi CH, Wang D, Chong H, Ng KC, Rodrigues<br />

F, Lee MA, Tan P. Patterns of large-scale<br />

genomic variation <strong>in</strong> virulent <strong>and</strong> avirulent Burkholderia<br />

species. Genome Res 2004; 14:2295-<br />

2307. PubMed doi:10.1101/gr.1608904<br />

32. Hirvonen CA, Ross W, Wozniak CE, Marasco E,<br />

Anthony JR, Aiyar SE, Newburn VH, Gourse RL.<br />

Contributions of UP elements <strong>and</strong> the transcription<br />

factor FIS to expression from the seven rrn P1<br />

promoters <strong>in</strong> Escherichia coli. J Bacteriol 2001;<br />

183:6305-6314. PubMed<br />

doi:10.1128/JB.183.21.6305-6314.2001<br />

33. Ross W, Salomon J, Holmes WM, Gourse RL.<br />

Activation of Escherichia coli leuV transcription<br />

by FIS. J Bacteriol 1999; 181:3864-3868. PubMed<br />



34. Wang H, Noordewier M, Benham CJ. Stress-induced<br />
DNA duplex destabilization (SIDD) in<br />
the E. coli genome: SIDD sites are closely associated<br />
with promoters. Genome Res 2004;<br />
14:1575-1584. PubMed doi:10.1101/gr.2080004<br />

35. Bauer BF, Kar EG, Elford RM, Holmes WM. Sequence<br />

determ<strong>in</strong>ants for promoter strength <strong>in</strong> the<br />

leuV operon of Escherichia coli. Gene 1988;<br />

63:123-134. PubMed doi:10.1016/0378-<br />

1119(88)90551-3<br />

36. Carver T, Thomson N, Bleasby A, Berriman M,<br />

Parkhill J. DNAPlotter: circular <strong>and</strong> l<strong>in</strong>ear <strong>in</strong>teractive<br />

genome visualization. Bio<strong>in</strong>formatics 2009;<br />

25:119-120. PubMed<br />

doi:10.1093/bio<strong>in</strong>formatics/btn578<br />


37. Arakawa K, Tamaki S, Kono N, Kido N, Ikegami<br />

K, Ogawa R, Tomita M. Genome Projector:<br />

zoomable genome map with multiple views. BMC<br />

Bio<strong>in</strong>formatics 2009; 10:31. PubMed<br />

doi:10.1186/1471-2105-10-31<br />

38. Pritchard L, White JA, Birch PR, Toth IK. GenomeDiagram:<br />

a python package for the visualization<br />

of large-scale genomic data. Bio<strong>in</strong>formatics<br />

2006; 22:616-617. PubMed<br />

doi:10.1093/bio<strong>in</strong>formatics/btk021<br />

39. Krzyw<strong>in</strong>ski M, Sche<strong>in</strong> J, Birol I, Connors J, Gascoyne<br />

R, Horsman D, Jones SJ, Marra MA. Circos:<br />

an <strong>in</strong>formation aesthetic for comparative genomics.<br />

Genome Res 2009; 19:1639-1645. PubMed<br />

doi:10.1101/gr.092759.109<br />



Paper VII: GeneWiz browser: An Interactive Tool for Visualiz<strong>in</strong>g Sequenced Chromosomes<br />



Chapter 4<br />

Web Services <strong>and</strong> <strong>Interoperability</strong> <strong>in</strong> Genomics<br />


This chapter describes work done in connection with the EU project EMBRACE. The deliverables<br />
defined for CBS included both outreach obligations and implementation<br />
tasks of providing tools and databases through Web Services. This author’s contributions<br />
reflect this duality: there was a responsibility for developing the server infrastructure for<br />
hosting Web Services while also teaching their use and design concepts on several occasions<br />
(see appendix A.1). CBS is now using this work to integrate all major prediction<br />
servers under the same Web Services umbrella. There are currently 17 services offered<br />
using this technology 1 . The work on Web Services has laid the foundation for creating<br />
an online resource like BLASTatlas (paper I). Further, the RNAmmer tool (paper VI) is offered<br />
both as a traditional web interface and through Web Services, and these implementations<br />
demonstrate the usefulness of programmatic access to tools.<br />

4.1 Introduction<br />

Over the past decade, the internet has undoubtedly revolutionized the way information<br />
is exchanged in modern society. Bank transactions, digital road maps and<br />
satellite images, email, news articles, and social networks are now hard<br />
to imagine without a digitally connected world. Biological and bioinformatic information<br />
is no exception, as it relies on the internet for the transport of sequence data,<br />
experimental results, scientific articles, etc. Both the volume and the complexity of biological<br />
information increase day by day. As new experimental techniques become available, new<br />
types of data, as well as new ways of combining them, are introduced. For decades, the<br />
exchange of biological information over the internet has been in the form of human-readable<br />
HTML documents (HyperText Markup Language) or flat files residing on FTP servers<br />
(File Transfer Protocol). HTML was designed to host static information<br />
presented by a server to a human being using a browser. Today, computers are required<br />
to digest huge amounts of information with less human involvement, and more<br />
advanced technologies are needed. To successfully integrate the vast amounts of<br />
data provided by the life science community, interoperability remains a key issue. It<br />
may seem unrealistic to reach a point where every biologist and bioinformatician has<br />
the world’s biological databases and tools accessible programmatically from<br />
their favorite programming language. However, with the current technologies in Web<br />
Services, an interoperable life science community may not be far away. When connected,<br />
the communities will be able to exchange not only data but also services such as tools<br />
for predicting protein function, performing sequence alignments, or gene finding.<br />

1 BLASTatlas, EasyGene, EPipe, GeneWiz, GenomeAtlas, hERG, MaxAlign, NetChop, NetCTL, NetGlycate,<br />
NetNGlyc, NetOGlyc, NetPhos, RNAmmer, SIDDbase, SignalP, and TMHMM<br />

Figure 4.1: Screen shot of NCBI Entrez Genome projects web page<br />

4.2 <strong>Interoperability</strong><br />

The term ’interoperability’ is defined as the ability of two or more systems to exchange<br />
and make use of information (IEEE, http://www.ieee.org). Whether systems can be<br />
said to be ’interoperable’ depends on how one interprets ’make use of’. Consider the list of<br />
full prokaryotic genome sequences, maintained by NCBI at<br />
http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, as shown in figure 4.1.<br />

To automatically retrieve this list, one may write a parser to transform the HTML<br />
into computer-readable text. Apart from being overly sensitive to changes in the HTML<br />
document, such a parser will lack the knowledge behind the data, since the format is<br />
neither typed nor structured. It is only when interpreted by an internet browser and presented<br />
graphically to a human that this information makes any sense. In other words, both sender and receiver<br />
must have knowledge about the information that is exchanged before they<br />
can be said to be interoperable. There are two aspects of interoperability. First, there must<br />
be agreement on the format in which data is exchanged. Whether this is structured<br />
XML or any arbitrary format, the server must return the format expected by the client<br />
upon a request. Second, a description and understanding of the content of the data being<br />
exchanged is required when building client-side code and objects in Web Services.<br />
Without knowledge of the exact data types, the programming environment (e.g. C, Java,<br />
Perl) cannot declare the objects with proper variable types.<br />
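The two aspects can be illustrated with a minimal Python sketch. The XML fragment, its element<br />
names (genome, organism, accession, size_bp) and the GenomeRecord class are hypothetical and are<br />
not taken from any CBS or NCBI service; the values echo the C. jejuni entry used elsewhere in this<br />
text. Once the format is agreed upon and the types are known, the client can declare typed<br />
objects instead of scraping presentation markup.<br />

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical structured record; the element names are illustrative assumptions.
RECORD = """
<genome>
  <organism>Campylobacter jejuni subsp. jejuni NCTC 11168</organism>
  <accession>AL111168</accession>
  <size_bp>1641481</size_bp>
</genome>
"""


@dataclass
class GenomeRecord:
    """Client-side object whose field types mirror the agreed data description."""
    organism: str
    accession: str
    size_bp: int


def parse_genome(xml_text: str) -> GenomeRecord:
    """Map the agreed-upon XML structure onto a typed object."""
    root = ET.fromstring(xml_text)
    return GenomeRecord(
        organism=root.findtext("organism"),
        accession=root.findtext("accession"),
        # The declared type tells the client to convert the text into an integer.
        size_bp=int(root.findtext("size_bp")),
    )


record = parse_genome(RECORD)
```

A change in page layout would break an HTML scraper, whereas this parser depends only on the<br />
agreed element names and types.<br />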


Listing 4.1: Abbreviated input to the queryGenomes operation of the Genome Atlas Database<br />
3.0 web service<br />
[XML markup of the listing not recoverable; surviving element values: AL111168, yes]<br />

4.2.1 SOAP based Web Services<br />

SOAP (originally an acronym for Simple Object Access Protocol, an expansion dropped as of version 1.2) is to a large<br />
extent an agreed-upon standard describing a protocol for exchanging information in structured<br />
XML messages (eXtensible Markup Language). The protocol was recommended by the<br />
W3C (World Wide Web Consortium) in 2003, and describes the messaging format between<br />
a client and a server; the messages are in most cases transported over HTTP. Listings 4.1<br />
and 4.2 show an example request and response from the CBS Genome Atlas Database 3.0 Web<br />
Service, using the operation queryGenomes to query the database for a GenBank<br />
accession number.<br />
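As a rough illustration of such a request, the following Python sketch builds a SOAP 1.1 envelope<br />
and prepares (without sending) an HTTP POST carrying it. The service namespace, the endpoint URL<br />
and the parameter name accession are assumptions made for this example and are not the actual<br />
definitions of the CBS service; only the envelope namespace is the standard SOAP 1.1 one.<br />

```python
import urllib.request
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:genomeatlas"                  # hypothetical service namespace
ENDPOINT = "http://ws.example.org/genomeatlas"          # hypothetical endpoint URL


def build_envelope(operation: str, params: dict) -> bytes:
    """Return a SOAP envelope whose body holds one element named after the operation."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    wrapper = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(wrapper, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def build_http_request(operation: str, params: dict) -> urllib.request.Request:
    """Prepare the HTTP POST that would transport the SOAP message."""
    return urllib.request.Request(
        ENDPOINT,
        data=build_envelope(operation, params),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": '""',  # SOAP 1.1 header; an empty value leaves the intent unspecified
        },
        method="POST",
    )


http_request = build_http_request("queryGenomes", {"accession": "AL111168"})
# urllib.request.urlopen(http_request) would submit it to a real endpoint.
```

In practice a SOAP client library would generate the envelope from the service description rather<br />
than assemble it by hand.<br />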

The SOAP messages are XML structures consisting of a SOAP envelope, which in turn<br />
consists of a header (not included here) and a body. A special envelope style called<br />
’wrapped’ is used for the CBS services, meaning that the content of both request and response<br />
is wrapped by an element named according to the operation issued (here queryGenomes).<br />
This enables the server to easily dispatch the message to the proper internal code. The<br />
SOAP protocol forms the basic language for exchanging messages over HTTP but does not<br />
describe the structure of the messages exchanged by a given resource, nor does it explain<br />
its functionality. The WSDL (Web Services Description Language) file closes this gap by<br />
defining the information that enables a user or computer to communicate with the resource.<br />
The WSDL declares all the operations supported by a resource and the composition of the<br />
XML structures allowed by the operations. Finally, the WSDL defines the endpoint URL<br />
to which the request SOAP message is submitted. The essential data of the WSDL are<br />
the descriptions of the XML structures, formulated in the XSD language (XML Schema<br />
Definition). The schema for the request of the queryGenomes operation is shown in<br />
listing 4.3. Figure 4.2 shows a schematic drawing of a SOAP resource.<br />
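The dispatch step made possible by the wrapped style can be sketched in Python as follows. This<br />
is a simplified illustration and not the CBS server code: the handler, the service namespace and<br />
the parameter name accession are hypothetical. The server reads the name of the first element in<br />
the SOAP body and uses it to select the internal routine.<br />

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on qualified names."""
    return tag.rsplit("}", 1)[-1]


def query_genomes(wrapper: ET.Element) -> dict:
    """Hypothetical handler: simply collect the parameters found in the wrapper."""
    return {local_name(child.tag): (child.text or "") for child in wrapper}


# Operation name -> internal code; the table is an assumption for this sketch.
HANDLERS = {"queryGenomes": query_genomes}


def dispatch(message: bytes):
    """Select the handler from the name of the wrapper element in the SOAP body."""
    envelope = ET.fromstring(message)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    if body is None or len(body) == 0:
        raise ValueError("SOAP body is missing or empty")
    wrapper = body[0]
    operation = local_name(wrapper.tag)
    handler = HANDLERS.get(operation)
    if handler is None:
        raise ValueError(f"unsupported operation: {operation}")
    return operation, handler(wrapper)


# Hypothetical request message in the wrapped style.
MESSAGE = b"""<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <queryGenomes xmlns="urn:example:genomeatlas">
      <accession>AL111168</accession>
    </queryGenomes>
  </soap:Body>
</soap:Envelope>"""

operation, params = dispatch(MESSAGE)
```

Because the operation name travels inside the message itself, a single endpoint URL can serve all<br />
operations declared in the WSDL.<br />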

4.3 EMBRACE: An EU initiative to enhance interoperability<br />

The EMBRACE Network of Excellence is a project funded by the European Commission under<br />
the sixth framework programme (FP6). The intention of the EMBRACE project was<br />
partly to integrate the major tools and databases within the life science communities. A<br />
technology recommendation workgroup within EMBRACE has investigated which current<br />
technologies could form the basis of the integration, and it has recommended SOAP based<br />


Listing 4.2: Abbreviated output from the queryGenomes operation of the Genome Atlas Database<br />
3.0 web service<br />
[XML markup of the listing not recoverable; surviving element values, in order: Bacteria, Epsilonproteobacteria,<br />
8, Campylobacter jejuni subsp. jejuni NCTC 11168, AL111168, NC_002163, Chromosome]<br />



Web Services <strong>and</strong> <strong>Interoperability</strong> <strong>in</strong> Genomics<br />

Listing 4.3: XSD entry of the queryGenomes request message

[XML schema markup not recoverable]

Figure 4.2: Schematic layout of a simple SOAP resource, where WSDL and schemas reside on the same server. WSDL and schemas are read and interpreted by the SOAP client in order to compose the outgoing request and parse the incoming server response.

[Figure labels: client user / computer; SOAP client; SOAP request and response; HTTP server with endpoint, WSDL and schemas; WSDL and schema files downloaded by client in XML]


Web Services described by WSDL files, where data structures are typed using the XSD format.

4.3.1 Quasi - a light-weight SOAP server<br />

A main obstacle for many SOAP servers and clients is the computational overhead and memory consumption involved in parsing large and complex XML structures. For the BLASTatlas service, this was a limiting factor. With the conventional SOAP::Lite package, submitting a request required more memory than is available in a modern desktop computer and took around 20 minutes just to prepare the message. Once the message was submitted, the server incurred the same overhead to parse the incoming XML. The XML::Compile package for Perl proved superior as a client framework. On the server side, however, the demand for speed, flexibility and custom adjustment led to the development of a light-weight SOAP server called ’quasi’ (’QUite A Soap Implementation’ or ’QUAsi Soap Implementation’). Apart from speed, it has further advantages:

• The server can be launched both remotely and locally. The latter allows quick and easy testing of services by reading a SOAP message from STDIN.

• The XML parsing method (e.g. XML::Simple or XML::Twig) may be chosen independently for each operation, and parsing may even be postponed until after the job has been placed in the queue and the job id returned. This is an advantage for very big messages.

• Control over the code stack enables much faster implementation of custom functionality.
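The second point, deferring the XML parse until after the job id has been handed back, can be sketched as follows. This is an illustration in Python rather than the Perl of quasi itself, and the names `submit` and `run_next_job` as well as the in-memory queue are assumptions made for this example, not part of quasi.

```python
import queue
import uuid
import xml.etree.ElementTree as ET

jobs = queue.Queue()   # pending jobs: (job id, raw unparsed message)
results = {}           # job id -> result of the finished job


def submit(raw_message):
    """Queue the raw SOAP message untouched and return a job id at once."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, raw_message))
    return job_id


def run_next_job():
    """Worker step: only now pay the cost of parsing the XML."""
    job_id, raw_message = jobs.get()
    root = ET.fromstring(raw_message)
    results[job_id] = sum(1 for _ in root.iter())  # placeholder work: count elements
    jobs.task_done()
    return job_id


job_id = submit("<request><sequence>ATGC</sequence></request>")
pending_before = job_id in results   # nothing has been parsed at this point
run_next_job()
```

For local testing in the spirit of the first point, the message could instead be read with `sys.stdin.read()` and passed to `submit`.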

4.3.2 quasi mktemp - From template to Web Service<br />

To make implementation easier still, a template creator was written which reads an example Web Service from a standard CBS template. The user provides the name and version of the service, and the tool prepares an entire installation of the service on the servers. The template creator provides the following:

• WSDL and XSD files created automatically for the given name and version of the service, and placed in the proper location of the file system

• An example directory with a working Perl example using the service

• Built-in templates for both synchronous and asynchronous access

• The proper entry in the central services database table

• A web page, available once the template creator has run, describing the service and providing links to the WSDL and XSD files as well as WSDL-embedded documentation

When designing Web Services, it is not a trivial task to keep track of namespaces, declarations of input/output objects, operation names etc. The feedback received so far for this tool indicates that functioning examples clearly reduce the chance of mistakes. The manual for the software is found in appendix D.6.


4.4 ENCODE pipeline: applying Web Services

ENCODE (the Encyclopedia Of DNA Elements) was launched in September 2003 by the National Human Genome Research Institute. The goal was to identify all functional elements in the human genome sequence. In the pilot phase, 1 percent (30 Mb) of the human genome, drawn from 44 selected regions, was analysed by ENCODE consortium researchers (Birney et al., 2007).

GENCODE is a sub-project of ENCODE which seeks to identify all protein-coding genes in the ENCODE selected regions. For each protein-coding gene this means the delineation of a complete mRNA sequence for at least one splice isoform, and often for a number of additional alternative splice forms. The contributions from the BioSapiens partners focus on information from a protein annotation perspective. Special attention is given to alternative splicing and the putative effect it has on functional diversification of genes.

In the pilot phase of the BioSapiens project, the properties of the coding sequences for the 44 regions were analyzed by the BioSapiens partners separately. The results from the individual groups were collected and the main findings were published (Tress et al., 2007). Furthermore, the entire collection of annotations created by all partners was made available as supplementary material for the publication.

In the current phase of the BioSapiens project, the goal is to scale up the annotation approach applied to the pilot ENCODE sequences to cover 100% of the human genome, including all isoforms. For the scale-up, the ENCODE Pipeline (EPipe) was constructed (this BioSapiens deliverable). EPipe is a WWW service that allows researchers to compare functional annotations for all splice variants of a given gene automatically, or alternatively to analyse mutated sequence variants containing SNPs. The author of this thesis has been responsible for the development of the main parts of the EPipe software as well as for implementing a large part of the modules (feature predictors). The EPipe project is an ongoing effort which has involved a number of people during its development.

4.4.1 Collecting Web Services clients in EPipe

EPipe uses a number of local and remote resources for protein feature prediction. The ability of EPipe to connect to remote resources via Web Services is incorporated within the individual modules. This gives a great deal of flexibility as to which resources to support (e.g. BioMoby, SOAP etc). The pipeline is shown in figure 4.3.

EPipe itself is offered both as a SOAP web service (http://www.cbs.dtu.dk/ws/EPipe) and as a traditional web interface (http://www.cbs.dtu.dk/services/EPipe). The input page of the web interface is shown in figure 4.4.

4.4.2 Mapping Pfam annotations to protein structure: mecA

In Staphylococcus aureus the mecA gene encodes a penicillin-binding protein (PBP2a), resulting in methicillin resistance (Ender et al., 2009). The EPipe software can be used to map a range of relevant features onto the protein structure in order to visualize differences between homologs of this protein. In this example, however, a single MecR1 protein from Staphylococcus aureus strain A5937 (GenBank accession no. EEV85461) is processed. Figure 4.5 shows the structure browser of EPipe, which lets the user browse the predicted features by showing their mapping onto the protein structure. Here, the three Pfam domains Transpeptidase, MecA_N, and PBP_dimer appear as significant hits.


[Figure labels: Input sequences; Cache filter; BLAST against PDB individually; alignment; modules I to IV (positional and non-positional features); alignment-dependent module X; Map feature coordinates to alignment; Map features onto best structure; XML of all results; Render images in parallel and present to output pages; Table of non-positional features; Conclusion table; Plot alignment and positions having different feature configuration; Plot alignment and features with remapped coordinates; Similarity in feature space]

Figure 4.3: Schematic layout of the ENCODE pipeline, EPipe. The main program ensures that as much as possible is dispatched in parallel. Modules may be either alignment dependent or not. If the alignment is required to predict the protein features, the module is not launched until the alignment algorithm has finished. Modules may return either global features of the entire protein (e.g. cellular localization) or positional features (e.g. phosphorylation sites).
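The dispatch rule described for figure 4.3 can be sketched with a thread pool: alignment-independent modules start immediately, while alignment-dependent modules are submitted only once the alignment has finished. The module functions below are placeholders invented for this illustration and do not correspond to actual EPipe modules.

```python
from concurrent.futures import ThreadPoolExecutor


def align(sequences):
    """Placeholder alignment: pad sequences to equal length with gaps."""
    width = max(len(s) for s in sequences)
    return [s.ljust(width, "-") for s in sequences]


def length_module(sequences):
    """Alignment-independent module returning a global (non-positional) feature."""
    return [len(s) for s in sequences]


def gap_module(alignment):
    """Alignment-dependent module returning positional features (gap columns)."""
    return [i for i in range(len(alignment[0])) if any(row[i] == "-" for row in alignment)]


def run_pipeline(sequences):
    with ThreadPoolExecutor() as pool:
        alignment_job = pool.submit(align, sequences)
        # Independent modules are dispatched right away, alongside the alignment.
        independent = {"length": pool.submit(length_module, sequences)}
        # Dependent modules wait for the alignment result before they are launched.
        alignment = alignment_job.result()
        dependent = {"gaps": pool.submit(gap_module, alignment)}
        jobs = {**independent, **dependent}
        return {name: job.result() for name, job in jobs.items()}


features = run_pipeline(["MKV", "MKVLA"])
```

The same split between global and positional return values appears here in miniature: `length_module` yields one value per protein, `gap_module` yields positions along the alignment.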


Figure 4.4: The input web page of EPipe. The upper part defines sequence upload and alignment method, and the lower part selects which modules / methods to run. When applicable, gene ontologies have been added to each feature and feature value (light green boxes).


Figure 4.5: The mecA encoded protein (EEV85461) shows homology to PDB entry 1VQQ (Lim & Strynadka, 2002). The top panel shows the EPipe structure browser, which allows rotation in steps of 90 degrees. The lower panel shows a post-processed rendering of the PyMol script generated by EPipe.



Chapter 5<br />

Conclusion and perspectives

This thesis has presented a number of comparative genomics tools that have been used throughout different research projects and peer-reviewed publications. The aim has been to provide methods that enable the scientist to keep up with the increasing speed at which genome sequences are published. Visualization plays a key role, and finding better ways to present sequence information in a condensed and intuitive way is essential for deriving knowledge from the large number of bacterial strains being sequenced.

Information content has previously been used to quantify conservation of DNA motifs, and a recent extension of this information framework has made it possible to model complete promoters such as the P1/P2 system described in this work. The models shown here are to a large extent specific to E. coli P1/P2 sites. However, the design of the matrix and spacing configuration format of the iscan tool allows for a much broader application. The tool may be used to test different hypotheses of promoter configurations across a broader range of organisms by estimating the promoter conservation as a single comparable measure. Work remains to implement benchmarking and to examine other promoter systems.
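For reference, the information content of one position in an aligned set of DNA sites is commonly computed as 2 bits minus the Shannon entropy of the observed base frequencies. The sketch below shows that standard calculation without small-sample correction; it is not the iscan implementation, and the function names and example sites are chosen for this example only.

```python
import math
from collections import Counter


def column_information(column):
    """Information content (bits) of one alignment column of DNA bases."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA


def motif_information(sites):
    """Per-position information content for equally long aligned sites."""
    return [column_information(col) for col in zip(*sites)]


sites = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]
profile = motif_information(sites)
```

A fully conserved position scores 2 bits and a position with all four bases equally frequent scores 0 bits, so summing the profile gives a single conservation measure for the whole site.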

Since the start of the Human Genome Project (HGP) in 1990 there have been large investments to develop and improve sequencing technology. The present stage, where a bacterial genome can be sequenced for a few thousand dollars within a few hours, is a result of years of competition and investments in genome projects. There are no signs that achievements in sequencing technology will stop here. The concept of sequencing single DNA molecules in real time has long been an ultimate goal within genomics and DNA sequencing. It has been demonstrated how a DNA synthesis reaction can be monitored in real time by immobilizing a DNA polymerase within a small (20 zeptoliter) well (Eid et al., 2009). If the technology reaches a final product, it may well start a new era in comparative genomics. Once it is possible to obtain a genome sequence at the same rate as DNA replication itself, and at superior read lengths, sophisticated software must be implemented for the downstream processing. The technology can boost the quality of metagenomic sequencing and solve the current issues of proper assembly of these data sets.

The BLASTatlas tool presented in this thesis incorporates a number of software packages to calculate different DNA properties as well as scripts for mapping sequence alignments to a reference genome. The number of dependencies makes it difficult to package the software and install it on other computer systems. For sharing such complex tools among scientists, Web Services play an important role, and it has been demonstrated how analysis and visualization methods can be offered using this technology. At first glance the traditional web interfaces seem more user-friendly. However, implementing interoperable methods like BLASTatlas forces a process in which the communication is formalized and defined in every detail. This allows direct integration into the user’s programming environment, which scales significantly better. Making one or two comparisons using a web interface will in most cases be faster than using the Web Services counterpart. The true advantages are achieved when analyses are repeated, possibly hundreds of times, and when linking input/output between different remote resources. Integration of biological data using SOAP-based Web Services is gaining acceptance. When the technology has matured it will undoubtedly enhance the way biological information is exploited by allowing seamless flow between, for example, public sequence databases, repositories of experimental data and bioinformatic prediction servers.



Appendix A<br />

Appendix: Workshops, teaching, and conferences

A.1 Lectures and Presentations

A.1.1 DTU Course 27101: Framework Course in Biotechnology and Food Sciences

Taught in autumn 2008 by Prof. David Ussery, this course featured weekly computer exercises throughout the semester and projects requiring computer work. I planned and supervised the exercises and assisted the students doing project work. See also: http://www.cbs.dtu.dk/dtucourse/genomics27101.php

A.1.2 Comparative Microbial Genomics Workshop

Held June 2nd - 6th 2008, Bangkok, Thailand. I assisted in planning the workshop, lectured on rRNA operon structure, web services, and genome visualization methods, and was responsible for computer exercises. Web page: http://www.cbs.dtu.dk/courses/thaiworkshop08/programme.php

A.1.3 Comparative Microbial Genomics and Taxonomy

Held August 14th - 18th 2006, Petropolis, Brazil. I assisted in planning the workshop and was responsible for computer exercises. See also: http://www.cbs.dtu.dk/courses/brazilworkshop/programme.php

A.1.4 EMBRACE Workshop on Client Side Scripting for Web Services

Work package D5.2.X2. Held February 6th - 8th 2008, CBS. Responsible for computer exercises and lectures. See also: http://www.cbs.dtu.dk/courses/embrace/2008-02-06/

A.1.5 EMBRACE Workshop on Bioinformatics of Immunology

Work package D5.2.6. Held January 24th - 26th 2007, CBS. Responsible for computer exercises and lectures. See also: http://www.cbs.dtu.dk/courses/embrace/2007-01-24/

A.1.6 EMBRACE 3rd AGM: Implementation of web services

Presentation held April 23rd 2007 at the CNRS Institute of Biology and Chemistry of Proteins in Lyon, France.


A.1.7 EMBRACE Workshop on Perl, SQL and Web Services

Scheduled for November 16th - 20th 2009. See also: http://www.cbs.dtu.dk/courses/embrace/2009-11-16/

A.2 Workshops and meetings

A.2.1 EMBRACE Workshop: SOAP web services

April 2006, Bergen, Norway

A.2.2 EUCOMM Bioinformatics Training Course

February 2007, Hinxton, United Kingdom

A.2.3 EMBRACE Workshop: Modern computer tools for the biosciences

March 2007, Uppsala, Sweden

A.2.4 EMBRACE 3rd Annual General Meeting

April 2007, Lyon, France

A.2.5 EMBRACE Workshop: Deploying Web Services for Biological Sequence Annotation

May 2007, Geneva, Switzerland

A.2.6 EMBRACE 4th Annual General Meeting

April 2008, Heidelberg, Germany

A.2.7 Technical discussion of EMBRACE registry

June 2008, Amsterdam, Holland

A.2.8 EMBRACE meeting: Discussion of standard data types

January 2009, Bergen, Norway

A.3 Conferences

A.3.1 Conference: Metagenomics, July 2007, San Diego, U.S.A.

Binnewies TT, Hallin PF, Sellami N, Ussery DW. Prediction of Pathogenicity Networks in Bacterial Genomes

A.3.2 Conference: ASM Biodefense 2007, February 2007, Washington, U.S.A.

Poster: Hallin PF and Binnewies TT. Gene organization of RNA genes and secretion system components of the Sargasso Sea environmental samples



Appendix B<br />

Appendix: Ph.D. study plan<br />



Technical University of Denmark, AFI, PhD programme
September 2005

The study plan below has been accepted by the student and the supervisor

Main supervisor's signature, local no., Student's signature

PhD study plan

Name of PhD student: Peter Fischer Hallin

Cpr.-nr.: 160877 2053<br />

PhD programme: Bioinformatics

Department: BioCentrum

Start date: March 1 2006

End date: February 2009

Main supervisor (title, name, department, phone): Associate professor David W. Ussery
BioCentrum-DTU, Technical University of Denmark, Building 301, DK-2800 Lyngby, Denmark
E-mail address: dave@cbs.dtu.dk
Phone (direct): (+45) 45 25 24 88

Co-supervisor (title, name, institution/company): Guest Researcher Gertrude Maria Wassenaar
BioCentrum-DTU, Technical University of Denmark, Building 301, DK-2800 Lyngby, Denmark
E-mail address: trudy@cbs.dtu.dk
Phone (direct): (+45) 45 25 24 77

Date: 18-11-2007

Title of the study: DNA Structural Analysis and Transcript Prediction in Prokaryotic genomes


Main topic of the study:

The goal of this project is to obtain a better understanding of the structural mechanisms that are involved in the initiation of transcription of DNA in prokaryotic genomes and to use this information to make better and more consistent transcript predictions. We have presented a database (Hallin and Ussery 2004) which holds several kinds of information for each of the over 300 fully sequenced prokaryotic genomes that are currently available. Different research groups have made efforts to gather sequence data and analyses of the fully sequenced microbial genomes that are being published.

Currently we rely on the authors' annotation of genome sequences when comparative genomics is applied to our data sets. However, different authors use different tools, approaches and criteria during the annotation process. There are examples of genomes that are predicted to be 50-100% over-annotated (Skovgaard et al. 2001). Once reliable and automated processes for predicting transcriptomes are established, comparative analysis can be applied to the entire collection of organisms. It is envisioned that the users of our website will be able to browse any piece of DNA interactively to look for structural properties and repeats.

_________________

Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A (2001). On the total number of genes and their length distribution in complete microbial genomes. Trends Genet. 17:425-8.

Hallin PF, Ussery DW (2004). CBS Genome Atlas Database: A dynamic storage for bioinformatic results and sequence data. Bioinformatics 20:3682-3686.

(The content, aims and means of the scientific part of the project are described here. If the description exceeds one A4 page, a short summary is given here with a reference to the full description, which is enclosed as an appendix.)

External research stay:

Professor Craig John Benham, University of California, Davis. Benham's research focuses on mathematical modelling of DNA destabilization and prediction of the opening of the DNA molecule during a transcription event. His strong mathematical approach is novel and would contribute significantly to our prediction methods and could possibly help explain biological / experimental results. The idea is that Craig Benham's calculations will be integrated into the prediction algorithms that are a major topic of my project.

A 12-week internship in Craig Benham's lab is scheduled for October-December to integrate SIDD predictions (Stress Induced DNA Duplex Destabilization) with CBS databases and to prepare 1-2 manuscripts on SIDD measures on a global prokaryotic scale.


(The research environments outside DTU where the PhD student plans to stay are listed here. If specific agreements have been made, this is stated. For each stay the estimated time (e.g. in weeks) is given, and the total time for external stays is stated.)

Course part:

Courses at DTU
External courses
Courses credit-transferred in connection with enrolment:

Biological Sequence Analysis, PhD, 12 ECTS [OK]
27802 Metabolic Engineering and Systems Biology, PhD, 5 ECTS, F1A
27725 Globale regulatoriske netværk i mikroorganismer (Global regulatory networks in microorganisms), MSc, 5 ECTS, F2B
27617 Protein structure and computational biology, MSc, 5 ECTS, F5A
27041 Introduction to Systems Biology, MSc, 5 ECTS, E3A

(For courses not listed in the study handbook, a description of the academic content must be enclosed. The expected course/educational activities of the study are listed here. For each part the estimated number of ECTS points is given; these must total approximately 30 ECTS points. 30 ECTS points correspond to approximately 840 hours of work.)

Formidl<strong>in</strong>gsdelen ( <strong>in</strong>kl.<br />

pligtarbejde):<br />

I have spent a total of about a month's time prepar<strong>in</strong>g <strong>and</strong> assist<strong>in</strong>g<br />

<strong>in</strong> computer exercises for the <strong>CBS</strong> course <strong>Comparative</strong> Microbial<br />

Genomics <strong>and</strong> Taxonomy (Petropolis, Brazil, Aug. 2006,<br />

http://www.cbs.dtu.dk/courses/brazilworkshop) <strong>and</strong> <strong>in</strong> prepar<strong>in</strong>g<br />

<strong>and</strong> giv<strong>in</strong>g talks at several meet<strong>in</strong>gs.<br />

Exercises <strong>in</strong> course ”Biological Sequence Analysis” (<strong>CBS</strong> –DTU)<br />

1 hrs. Presentation, Modern computer <strong>tools</strong> for the biosciences<br />

(Uppsala, Sweden) Presentation: Embrace workshop on<br />

bio<strong>in</strong>formtics of Immunology (<strong>CBS</strong> – DTU) Presentation: Web<br />

Services implementation on <strong>CBS</strong>: Third Anual General Meet<strong>in</strong>g of<br />

EMBRACE, (Lyon France).<br />

I plan to put <strong>in</strong> an additional month of work for giv<strong>in</strong>g <strong>and</strong><br />

prepar<strong>in</strong>g presentations <strong>and</strong> lectures for a one week workshop to be<br />

3


Danmarks Tekniske Universitet AFI, Ph.d.-uddannelse<br />

September 2005<br />

Ph.d.-studerendes navn: Peter Fischer Hall<strong>in</strong><br />

Cpr.-nr.: 160877 2053<br />

held at CBS in February 2008: http://www.cbs.dtu.dk/courses/embrace/2008-02-06/programme.php. Lectures and exercises will be adjusted to cover promoter analysis using the EMBRACE technology. We intend to use graphical as well as statistical approaches to characterize promoter signatures of prokaryotic genomes. These are core topics of the thesis.<br />

Poster presentation at Metagenomics 2007, San Diego: “Gene organization of RNA genes and secretion system components of the Sargasso Sea environmental samples”<br />

(State here both the expected dissemination activities of the study and the assigned mandatory work. For each part, give the estimated time spent (e.g. in weeks), which in total must correspond to 3 months).<br />

Schedule:<br />

1st half year (March 06 – August 06)<br />
Publication on rRNA gene predictor (RNAmmer). Comparative Microbial Genomics workshop in Brazil. Meetings and work for CBS in connection with EMBRACE.<br />

2nd half year (September 06 – Feb 07)<br />
Lactococcus microarray project with Chr. Hansen. Book chapter on Comparative Genomics, editor Dawn Field. EMBRACE meetings and workshops.<br />

3rd half year (March 07 – August 07)<br />
Follow-up article on RNAmmer and rRNA/tRNA operons.<br />

4th half year (September 07 – Feb 08)<br />
(Oct-Dec) Internship, Craig Benham: Davis, California.<br />
Include work from Craig Benham's lab in the RNAmmer follow-up manuscript and prepare a SIDDbase application note and an article on SIDD measures in prokaryotic promoter sequences.<br />
Prepare manuscripts<br />

5th half year (March 08 – August 08)<br />
Course: Global regulatory networks in microorganisms (F2B)<br />
Course: Protein structure and computational biology (F5A)<br />
Course: 1 week May/June: 27802 Metabolic Engineering and Systems Biology<br />
Thesis writing + Prepare manuscripts<br />

6th half year (September 08 – Feb 09)<br />
Course: Introduction to Systems Biology<br />
Thesis writing<br />

(The schedule should contain dates/periods for all significant activities in connection with the PhD programme. It is important that the schedule is complete. It may be attached as an appendix).<br />

Brief description of the form of supervision:<br />

It may, among other things, be agreed how often supervision takes place in the form of meetings or written feedback<br />


Patents/innovation: Is it likely that technologies or software that can be patented will be developed during the project?<br />

If Yes<br />

Yes x No<br />

Brief account of the methods used to train the PhD student in the innovation-related aspects<br />

Other:<br />

(Other matters of importance for the assessment of the study plan may be stated here).<br />



Appendix C<br />

Appendix: Courses<br />

C.1 Global regulatory networks in microorganisms<br />

DTU course 27725, ECTS 5, M.Sc. level.<br />

C.2 Protein Structure and Computational Biology<br />

DTU course 27617, ECTS 5, M.Sc. level.<br />

C.3 Biological Sequence Analysis<br />

DTU course 27803, ECTS 12.5, PhD level.<br />

C.4 Comparative Genome Analysis<br />

Copenhagen University, Department of Biology, ECTS 5.<br />


C.5 Doctorial seminar on business economics for academic entrepreneurs<br />

Aarhus School of Business, University of Aarhus, ECTS 3, PhD level.<br />

C.6 ECTS summary<br />

Total ECTS is 30.5 of which 15.5 at PhD level.<br />



Appendix D<br />

Appendix: Software<br />

D.1 fetchgbk manual<br />

SYNOPSIS<br />
fetchgbk - downloads genbank/refseq records in genbank format, specifying either accession numbers, accession ranges, or project id.<br />
fetchgbk (-h) (-p [PROJECT_ID]) (-a [ACCESSION/RANGE]) (-d [DATABASE])<br />

DESCRIPTION<br />
When defining the project id, using the -p option, option -a is ignored and all accession numbers for all segments of that project are fetched from the project.<br />
When using the -p option, the -d option is in effect, allowing you to control which database to use (refseq/genbank).<br />
When using the -a option, the program will retrieve only that accession (or range of accessions). It will ignore the -d option. The program prints genbank format data to stdout. Option -l is used to show only a TAB separated list showing accession and segment name.<br />

VERSION<br />
2008-08-15: version 1.0 created / pfh<br />

-p [number]<br />
The NCBI Genome Project number, like what can be found here: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. This option overrules the -a option.<br />
-a [accession no. or accession number range]<br />
When using this option, the program is instructed to download only this record (or these records, if a range is defined). The -d option is ignored.<br />
-d [genbank/refseq]<br />
Choice of database. Has effect only when using option -p.<br />
-l<br />
Boolean, instructing the program not to show genbank records, but only list segment names for each accession.<br />
-h<br />
Showing this help page<br />

EXAMPLES<br />
fetchgbk -p 19391 -d refseq | grep LOCUS<br />
fetchgbk -p 19391 -d genbank | grep LOCUS<br />
fetchgbk -a NZ_ABIZ00000000 | grep LOCUS<br />
fetchgbk -a NZ_ABIH01000001-NZ_ABIH01000038 | grep LOCUS<br />
fetchgbk -a CP000896 | grep LOCUS<br />
fetchgbk -p 12997 -d refseq -l<br />

AUTHOR<br />
Peter Fischer Hallin, August 2008, pfh@cbs.dtu.dk<br />
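The manual states that -a accepts an accession range but does not say how a range is turned into individual records. The following is a minimal sketch of one way this could be done, assuming that both endpoints share the same letter prefix and the same number of digits, as in the example NZ_ABIH01000001-NZ_ABIH01000038. The function name expand_accession_range is hypothetical and is not taken from fetchgbk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of fetchgbk): expand an argument such as
# "NZ_ABIH01000001-NZ_ABIH01000038" into the individual accession numbers.
# A single accession is returned unchanged.
sub expand_accession_range {
    my ($arg) = @_;
    # A prefix of letters/underscores followed by digits, on both sides of the hyphen.
    return ($arg) unless $arg =~ /^([A-Za-z_]+)(\d+)-([A-Za-z_]+)(\d+)$/;
    my ($prefix, $from, $prefix2, $to) = ($1, $2, $3, $4);
    die "range endpoints differ in prefix or width: $arg\n"
        unless $prefix eq $prefix2 and length($from) == length($to);
    my $width = length $from;
    # Adding 0 forces a numeric range even when the digits start with "0".
    return map { sprintf '%s%0*d', $prefix, $width, $_ } ($from + 0 .. $to + 0);
}

my @accessions = expand_accession_range('NZ_ABIH01000001-NZ_ABIH01000038');
printf "%d accessions, first %s, last %s\n",
    scalar @accessions, $accessions[0], $accessions[-1];
```

Zero padding is restored with the `%0*d` format so that the expanded names keep the width of the original accession.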



D.2 Sample output from queryGenomes<br />

As output from listing 2.3.<br />


#kingdom	phyla	pid	organism	genbank	refseq	segment	color	ATCONTENT	NGENES<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178379	NC_011312	Chromosome 1	ffdd44	0.6077	3069<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178380	NC_011313	Chromosome 2	ffdd44	0.6176	1105<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178382	NC_011314	Plasmid pVSAL320	ffdd44	0.6271	32<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178381	NC_011311	Plasmid pVSAL840	ffdd44	0.5993	72<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178383	NC_011315	Plasmid pVAL43	ffdd44	0.619	3<br />
Bacteria	Gammaproteobacteria	30703	Aliivibrio salmonicida LFI1238	FM178384	NC_011316	Plasmid pVSAL43	ffdd44	0.6439	3<br />
Bacteria	Deltaproteobacteria	9637	Bdellovibrio bacteriovorus HD100	BX842601	NC_005363	Chromosome	ffdd44	0.4935	3583<br />
Bacteria	Gammaproteobacteria	28329	Cellvibrio japonicus Ueda107	CP000934	NC_010995	Chromosome	ffdd44	0.4801	3754<br />
Bacteria	Bacteroidetes/Chlorobi	12607	Chlorobium phaeovibrioides DSM 265	CP000607	NC_009337	Chromosome	ffbb55	0.4701	1753<br />
Bacteria	Deltaproteobacteria	29493	Desulfovibrio desulfuricans subsp. desulfuricans str. ATCC 27774	CP001358	NC_011883	Chromosome	ffdd44	0.4193	2356<br />
Bacteria	Deltaproteobacteria	329	Desulfovibrio desulfuricans subsp. desulfuricans str. G20	CP000112	NC_007519	Chromosome	ffdd44	0.4216	3775<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010906	NC_012795	Plasmid pDMC2	ffdd44	0.6283	10<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010904	NC_012796	Chromosome	ffdd44	0.3723	4629<br />
Bacteria	Deltaproteobacteria	32105	Desulfovibrio magneticus RS-1	AP010905	NC_012797	Plasmid pDMC1	ffdd44	0.4197	65<br />
Bacteria	Deltaproteobacteria	29541	Desulfovibrio salexigens DSM 2638	CP001649	NC_012881	Chromosome	ffdd44	0.5291	3807<br />
Bacteria	Deltaproteobacteria	17227	Desulfovibrio vulgaris DP4	CP000528	NC_008741	Plasmid pDVUL01	ffdd44	0.3431	150<br />
Bacteria	Deltaproteobacteria	17227	Desulfovibrio vulgaris DP4	CP000527	NC_008751	Chromosome	ffdd44	0.3699	2941<br />
Bacteria	Deltaproteobacteria	27731	Desulfovibrio vulgaris str. Miyazaki F	CP001197	NC_011769	Chromosome	ffdd44	0.3289	3180<br />
Bacteria	Deltaproteobacteria	51	Desulfovibrio vulgaris str. Hildenborough	AE017285	NC_002937	Chromosome	ffdd44	0.3686	3379<br />
Bacteria	Deltaproteobacteria	51	Desulfovibrio vulgaris str. Hildenborough	AE017286	NC_005863	Megaplasmid	ffdd44	0.3432	152<br />
Bacteria	Other Bacteria	30733	Thermodesulfovibrio yellowstonii DSM 11347	CP001147	NC_011296	Chromosome	888888	0.6587	2033<br />
Bacteria	Gammaproteobacteria	29177	Thioalkalivibrio sp. HL-EbGR7	CP001339	NC_011901	Chromosome	ffdd44	0.3494	3283<br />
Bacteria	Gammaproteobacteria	32851	Vibrio cholerae M66-2	CP001233	NC_012578	Chromosome I	ffdd44	0.5217	2650<br />
Bacteria	Gammaproteobacteria	32851	Vibrio cholerae M66-2	CP001234	NC_012580	Chromosome II	ffdd44	0.5296	1043<br />
Bacteria	Gammaproteobacteria	33555	Vibrio cholerae MJ-1236	CP001485	NC_012668	Chromosome 1	ffdd44	0.5248	2770<br />
Bacteria	Gammaproteobacteria	33555	Vibrio cholerae MJ-1236	CP001486	NC_012667	Chromosome 2	ffdd44	0.5325	1004<br />
Bacteria	Gammaproteobacteria	36	Vibrio cholerae O1 biovar El Tor str. N16961	AE003852	NC_002505	Chromosome I	ffdd44	0.523	2736<br />
Bacteria	Gammaproteobacteria	36	Vibrio cholerae O1 biovar El Tor str. N16961	AE003853	NC_002506	Chromosome II	ffdd44	0.5309	1092<br />
Bacteria	Gammaproteobacteria	15667	Vibrio cholerae O395	CP000626	NC_009456	Chromosome 1	ffdd44	0.5312	1133<br />
Bacteria	Gammaproteobacteria	15667	Vibrio cholerae O395	CP000627	NC_009457	Chromosome 2	ffdd44	0.5222	2742<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000020	NC_006840	Chromosome I	ffdd44	0.6104	2575<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000021	NC_006841	Chromosome II	ffdd44	0.6298	1172<br />
Bacteria	Gammaproteobacteria	12986	Vibrio fischeri ES114	CP000022	NC_006842	Plasmid pES100	ffdd44	0.6158	55<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001133	NC_011186	Chromosome II	ffdd44	0.6275	1254<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001134	NC_011185	Plasmid pMJ100	ffdd44	0.652	195<br />
Bacteria	Gammaproteobacteria	19393	Vibrio fischeri MJ11	CP001139	NC_011184	Chromosome I	ffdd44	0.6112	2590<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000791	NC_009777	Plasmid pVIBHAR	ffdd44	0.5621	120<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000789	NC_009783	Chromosome I	ffdd44	0.5445	3570<br />
Bacteria	Gammaproteobacteria	19857	Vibrio harveyi ATCC BAA-1116	CP000790	NC_009784	Chromosome II	ffdd44	0.5473	2374<br />
Bacteria	Gammaproteobacteria	360	Vibrio parahaemolyticus RIMD 2210633	BA000031	NC_004603	Chromosome I	ffdd44	0.5461	3080<br />
Bacteria	Gammaproteobacteria	360	Vibrio parahaemolyticus RIMD 2210633	BA000032	NC_004605	Chromosome II	ffdd44	0.5465	1752<br />
Bacteria	Gammaproteobacteria	32815	Vibrio splendidus LGP32	FM954973	NC_011744	Chromosome 2	ffdd44	0.5636	1486<br />
Bacteria	Gammaproteobacteria	32815	Vibrio splendidus LGP32	FM954972	NC_011753	Chromosome 1	ffdd44	0.5596	2950<br />
Bacteria	Gammaproteobacteria	349	Vibrio vulnificus CMCP6	AE016795	NC_004459	Chromosome I	ffdd44	0.5355	2973<br />
Bacteria	Gammaproteobacteria	349	Vibrio vulnificus CMCP6	AE016796	NC_004460	Chromosome II	ffdd44	0.5288	1565<br />


Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	BA000037	NC_005139	Chromosome I	ffdd44	0.5359	3262<br />
Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	BA000038	NC_005140	Chromosome II	ffdd44	0.5279	1697<br />
Bacteria	Gammaproteobacteria	1430	Vibrio vulnificus YJ016	AP005352	NC_005128	Plasmid pYJ016	ffdd44	0.5507	69<br />
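Output of this kind is easy to post-process. The sketch below assumes that the columns are separated by single TAB characters and appear in the order given by the header line; it sums the gene counts of all segments belonging to the same project id. The function names parse_query_genomes and genes_per_project are illustrative only and do not come from queryGenomes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed column order, taken from the header line of the sample output.
my @COLS = qw(kingdom phyla pid organism genbank refseq segment color ATCONTENT NGENES);

# Turn TAB-separated lines into hash references; '#' lines and blank lines are skipped.
sub parse_query_genomes {
    my @records;
    for (@_) {
        my $line = $_;              # work on a copy, leave the caller's data untouched
        $line =~ s/\r?\n\z//;
        next if $line =~ /^#/ or $line !~ /\S/;
        my %rec;
        @rec{@COLS} = split /\t/, $line;
        push @records, \%rec;
    }
    return @records;
}

# Sum NGENES over all segments (chromosomes, plasmids) sharing a project id.
sub genes_per_project {
    my %total;
    $total{ $_->{pid} } += $_->{NGENES} for @_;
    return %total;
}

# Four rows copied from the sample output above.
my @sample = (
    "#kingdom\tphyla\tpid\torganism\tgenbank\trefseq\tsegment\tcolor\tATCONTENT\tNGENES\n",
    join("\t", qw(Bacteria Deltaproteobacteria 9637), 'Bdellovibrio bacteriovorus HD100',
         qw(BX842601 NC_005363 Chromosome ffdd44 0.4935 3583)) . "\n",
    join("\t", qw(Bacteria Gammaproteobacteria 12986), 'Vibrio fischeri ES114',
         qw(CP000020 NC_006840), 'Chromosome I', qw(ffdd44 0.6104 2575)) . "\n",
    join("\t", qw(Bacteria Gammaproteobacteria 12986), 'Vibrio fischeri ES114',
         qw(CP000021 NC_006841), 'Chromosome II', qw(ffdd44 0.6298 1172)) . "\n",
    join("\t", qw(Bacteria Gammaproteobacteria 12986), 'Vibrio fischeri ES114',
         qw(CP000022 NC_006842), 'Plasmid pES100', qw(ffdd44 0.6158 55)) . "\n",
);

my %total = genes_per_project(parse_query_genomes(@sample));
printf "pid %s: %d genes\n", $_, $total{$_} for sort { $a <=> $b } keys %total;
```

Keying on the project id rather than the organism name keeps all replicons of one genome together even when names are spelled inconsistently.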

D.3 BLASTatlas configurations<br />

D.3.1 file blast.cfg<br />

legend: B. ambifaria AMMD<br />
program: blastp<br />
color: 101010_020002<br />
range: 0.0,0.8<br />
source: files/13490.fsa<br />
<br />
legend: B. ambifaria MC40-6<br />
program: blastp<br />
color: 101010_020002<br />
range: 0.0,0.8<br />
source: files/17411.fsa<br />
<br />
legend: B. cenocepacia AU 1054<br />
program: blastp<br />
color: 101010_080000<br />
range: 0.0,0.8<br />
source: files/13919.fsa<br />
<br />
legend: B. cenocepacia HI2424<br />
program: blastp<br />
color: 101010_080000<br />
range: 0.0,0.8<br />
source: files/13918.fsa<br />
<br />
legend: B. cenocepacia J2315<br />
program: blastp<br />
color: 101010_080000<br />
range: 0.0,0.8<br />
source: files/339.fsa<br />
<br />
legend: B. cenocepacia MC0-3<br />
program: blastp<br />
color: 101010_080000<br />
range: 0.0,0.8<br />
source: files/17929.fsa<br />
<br />
legend: B. glumae BGR1<br />
program: blastp<br />
color: 101010_050505<br />
range: 0.0,0.8<br />
source: files/33901.fsa<br />
<br />
......<br />
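The configuration above consists of blank-line separated stanzas of `key: value` pairs, one stanza per lane. As a rough illustration of that layout, and not of how BLASTatlas itself reads the file, the following sketch splits such text into one hash per stanza. The function name parse_stanzas is an assumption made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative parser: one hash reference per blank-line separated stanza.
# Only the first colon on a line separates key and value, so values such as
# "9:10" are kept intact.
sub parse_stanzas {
    my ($text) = @_;
    my @stanzas;
    for my $block (split /\n\s*\n/, $text) {
        my %entry;
        for my $line (split /\n/, $block) {
            next unless $line =~ /^\s*(\w+)\s*:\s*(.*?)\s*$/;
            $entry{$1} = $2;
        }
        push @stanzas, \%entry if %entry;
    }
    return @stanzas;
}

# The first two stanzas of blast.cfg as shown above.
my $cfg = <<'END';
legend: B. ambifaria AMMD
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/13490.fsa

legend: B. ambifaria MC40-6
program: blastp
color: 101010_020002
range: 0.0,0.8
source: files/17411.fsa
END

for my $lane (parse_stanzas($cfg)) {
    printf "%-22s %s on %s\n", $lane->{legend}, $lane->{program}, $lane->{source};
}
```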

D.3.2 file custom.cfg<br />

legend: SIDD @ -0.035<br />
color: 000010_101010<br />
range: 9:10<br />
boxfilter: 5000<br />
source: gunzip -c BX571966-57a2f2c2e11ca0dd8cd74493d667d4d6-3173005.sidd--0.035-c-10-c.out.gz | cut -f4 |<br />

D.4 BLASTmatrix example<br />

This Perl script constructs an XML configuration file by looking up the Genome Atlas Database through MySQL. It queries for all Campylobacter strains currently available.<br />

#!/usr/bin/perl<br />
use strict;<br />
<br />
my $SACO_EXTRACT = "/usr/cbs/bio/bin/linux64/saco_extract";<br />
my %colors = (lari => '0,104,139', jejuni => '0,139,69', hominis => '66,66,111', fetus => '139,101,8', curvus => '140,23,23', concisus => '205,173,0');<br />
<br />
my $sources = ""; # holds the sources part of the configuration - replace into DATA section<br />
<br />
open ORGANISM, "mysql -N -B -e \"select pid,organism_name from genomeatlas3_cur.genbank_complete_prj where organism_name like 'campylobacter%' order by organism_name\" |" or die $!;<br />
while (<ORGANISM>) {<br />
  chomp;<br />
  my ($pid, $organism_name) = split /\t/;<br />


  warn "$organism_name (pid $pid)\n";<br />
  my ($genus, $species, $strain) = ($1, $2, $3) if $organism_name =~ /(\S+) (\S+) (.*)/;<br />
  my $color = "100,100,100";<br />
  $color = $colors{$species} if defined $colors{$species};<br />
  $sources .= "<br />
  <entry><br />
    <source>./$pid.proteins.fsa</source><br />
    <title>$genus $species</title><br />
    <subtitle>$strain</subtitle><br />
    <group>$species</group><br />
    <color>$color</color><br />
  </entry><br />
";<br />
  open PID, "> $pid.proteins.fsa" or die $!;<br />
  open ACCESSION, "mysql -N -B -e \"select genbank,segment_name from genomeatlas3_cur.genbank_complete_seq where pid = $pid and segment_name not like 'genome%'\" |";<br />
  while (<ACCESSION>) {<br />
    chomp;<br />
    my ($genbank, $segment_name) = split /\t/;<br />
    chomp $genbank;<br />
    warn "adding $segment_name (accession $genbank)\n";<br />
    my $gbk = "/home/databases/genomeatlasdb-3.0_cur/data/$genbank/$genbank.gbk";<br />
    open PROT, "$SACO_EXTRACT -I genbank -O fasta -t < $gbk 2> /dev/null |" or die $!;<br />
    while (<PROT>) {<br />
      print PID;<br />
    }<br />
    close PROT;<br />
  }<br />
  close ACCESSION;<br />
  close PID;<br />
}<br />
close ORGANISM;<br />
warn "dumping xml config on stdout ...\n";<br />
while (<DATA>) {<br />
  s//$sources/g;<br />
  print;<br />
}<br />
<br />

__DATA__<br />
<br />
<br />
Proteome comparison of Campylobacter species<br />
-<br />
<br />
<br />
auto<br />
auto<br />
<br />
0.9<br />
0.9<br />
0.9<br />
<br />
<br />
0.975<br />
0<br />
0<br />
<br />
<br />
<br />
auto<br />
auto<br />
<br />
0.9<br />
0.9<br />
0.9<br />
<br />
<br />
0<br />
0.975<br />
0<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
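The script above derives the group and color of each proteome from the organism name. That step can be looked at in isolation. The sketch below assumes, as the script does, that the first word is the genus, the second word the species, and the remainder the strain; species without an entry in the color table fall back to grey (100,100,100). The function name describe_organism is hypothetical, and the last test name in the usage loop is made up to exercise the fallback.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Color table copied from the BLASTmatrix script above (RGB triplets).
my %colors = (
    lari    => '0,104,139', jejuni => '0,139,69',  hominis  => '66,66,111',
    fetus   => '139,101,8', curvus => '140,23,23', concisus => '205,173,0',
);

# Hypothetical helper: split "Genus species strain..." and look up the color.
# Returns a hash reference, or nothing if the name has fewer than two words.
sub describe_organism {
    my ($name) = @_;
    return unless $name =~ /^(\S+)\s+(\S+)\s*(.*)$/;
    my ($genus, $species, $strain) = ($1, $2, $3);
    my $color = exists $colors{$species} ? $colors{$species} : '100,100,100';
    return { genus => $genus, species => $species, strain => $strain, color => $color };
}

for my $name ('Campylobacter jejuni subsp. jejuni 81-176',
              'Campylobacter hominis ATCC BAA-381',
              'Campylobacter madeupspecies X1') {   # invented name, tests the fallback
    my $o = describe_organism($name);
    printf "%s | %s | %s | %s\n", @{$o}{qw(genus species strain color)};
}
```

Anchoring the pattern and using an explicit `=~` match makes the helper return nothing for one-word names instead of silently reusing stale capture groups.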

D.5 iscan source code<br />

#!/usr/bin/perl<br />
use strict;<br />
<br />
my $pwm;<br />
my %matrix;<br />
my $spacer;<br />
my @PWM;<br />
my $pi = 3.14159265;<br />
<br />
# read the model files # include supported recursively (NO CHECK FOR LOOPS!)<br />
my %setup;<br />
my @LINES;<br />


if (defined $ARGV[0]) {<br />
  @LINES = read_mod($ARGV[0]);<br />
} else {<br />
  while (<DATA>) {<br />
    print;<br />
  }<br />
  close DATA;<br />
  die "no model provided. template model dumped\n";<br />
}<br />
<br />
my $pwmid = -1;<br />
print "# this is the model:\n";<br />
foreach (@LINES) {<br />
  print "# $_\n";<br />
  if (/^\[pwm\]\s*=\s*(.*)/) {<br />
    $pwmid++;<br />
    push @PWM, "$pwmid:$1";<br />
  }<br />
  my $pwm = $PWM[$#PWM];<br />
  $setup{$pwm}{$1} = $2 if /^(\w+)\s*=\s*([\.\-0-9]+)/;<br />
  next unless /^\[([ATGC]+)\]/;<br />
  my @F = split /[\s\t]+/;<br />
  shift @F;<br />
  err("pwm not defined") unless defined $pwm;<br />
  @{$matrix{$pwm}{$1}} = @F;<br />
  $matrix{$pwm}{count}[$_] += $F[$_] foreach (0 .. $#F);<br />
}<br />
<br />
# make a lookup table of distance information measure<br />
my %SPACER_LOOKUP;<br />
foreach my $spacer (keys %setup) {<br />
  my $min = $setup{$spacer}{min};<br />
  my $max = $setup{$spacer}{max};<br />
  my $center = $setup{$spacer}{center};<br />
  printf "# parsing accessibility for $spacer (min=$min, max=$max, center=$center)\n";<br />
  my $n = 0;<br />
  $n += 1 + cos(((2*$pi)/10.6) * ($_ - $center)) foreach ($min .. $max);<br />
  foreach my $d ($min .. $max) {<br />
    if ($center eq "") {<br />
      $SPACER_LOOKUP{$d}{$min}{$max}{$center} = 0;<br />
    } else {<br />
      $SPACER_LOOKUP{$d}{$min}{$max}{$center} = -(-log((1 + cos(((2*$pi)/10.6) * ($d - $center))) / $n) / log(2));<br />
    }<br />
    printf "# d=%d, score=%0.2f\n", $d, $SPACER_LOOKUP{$d}{$min}{$max}{$center};<br />
  }<br />
}<br />
<br />
# compute matrix based on frequencies<br />
foreach my $pwm (keys %matrix) {<br />
  print "# preparing matrix '$pwm'\n";<br />
  foreach my $letter (qw/A T G C/) {<br />
    print "# [$letter]";<br />
    foreach my $i (0 .. $#{$matrix{$pwm}{A}}) {<br />
      my $i1 = "-";<br />
      my $i2 = sprintf('%5s', '-');<br />
      if ($matrix{$pwm}{$letter}[$i] > 0) {<br />
        $i1 = 2 + log($matrix{$pwm}{$letter}[$i] / $matrix{$pwm}{count}[$i]) / log(2) - 0;<br />
        $i2 = sprintf('%5s', sprintf('%0.2f', $i1));<br />
      }<br />
      $matrix{$pwm}{$letter}[$i] = $i1 if $i1 ne "-";<br />
      print "\t$i2";<br />
    }<br />
    print "\n";<br />
  }<br />
}<br />
<br />
# loop over all sequences in input<br />
my @inp = &read_fasta;<br />
foreach my $s (0 .. $#inp) {<br />
  my $seq = $inp[$s]->{seq};<br />
  printf "# SEQUENCE %s\n", $inp[$s]->{id};<br />
  printf "# %d bp\n", length($seq);<br />
  my %LEN;<br />
  my %BIT;<br />
  foreach my $pwm (@PWM) {<br />
    print "# generating bit scores for matrix '$pwm'\n";<br />
    @{$BIT{$pwm}} = &scan($seq, %{$matrix{$pwm}});<br />
    $LEN{$pwm} = scalar(@{$matrix{$pwm}{A}});<br />
    printf "# %d elements in array\n", scalar(@{$BIT{$pwm}});<br />
  }<br />
  foreach my $p (0 .. (length($seq) - $LEN{$PWM[0]})) {<br />
    printf "# considering position %d (root model)\n", $p + 1;<br />
    # find the score of the initial matrix, for this given position<br />
    my $w = $setup{$PWM[0]}{weight};<br />
    my $fsi = $BIT{$PWM[0]}[$p] * $w;<br />
    my $offset = $p;<br />
    my $signal = substr($seq, $offset, $LEN{$PWM[0]});<br />
    my $s = sprintf "%s\t%0.2f", $signal, $fsi;<br />
<br />
    foreach my $pwm_index (1 .. $#PWM) {<br />
      my $pwm = $PWM[$pwm_index];<br />
      my $w = $setup{$pwm}{weight};<br />
<br />
      # get the spacing details for the upstream spacer<br />
      my $prev_pwm = $PWM[$pwm_index - 1];<br />
<br />


            my ($min, $max, $center) = ($setup{$prev_pwm}{min},
                $setup{$prev_pwm}{max}, $setup{$prev_pwm}{center});

            my $opt_spacer;
            my $opt_unit_score;

            # calculate unit scores for each of the spacing configurations.
            # A unit is the spacer and the following matrix. We search for the
            # spacer giving rise to the highest unit score

            printf "# adjusting spacer downstream of '$pwm'\n";

            foreach my $spacer ($min .. $max) {
                # don't continue if the offset goes below zero...
                last if $offset - $LEN{$pwm} - $spacer < 0;
                # skip positions scoring below the threshold, if one is defined
                next if defined $setup{$pwm}{threshold} and
                    $BIT{$pwm}[$offset - $LEN{$pwm} - $spacer] * $w < $setup{$pwm}{threshold};

                # if no optimal spacer is declared yet (e.g. because this is
                # the first round) then do it now
                $opt_spacer = $spacer unless defined $opt_spacer;
                my $test_unit_score = $BIT{$pwm}[$offset - $LEN{$pwm} - $spacer] * $w
                    + $SPACER_LOOKUP{$spacer}{$min}{$max}{$center};
                printf "# spacer: %d, score: %0.1f (%0.1f + %0.1f)\n", $spacer, $test_unit_score,
                    $BIT{$pwm}[$offset - $LEN{$pwm} - $spacer],
                    $SPACER_LOOKUP{$spacer}{$min}{$max}{$center};
                $opt_unit_score = $test_unit_score unless defined $opt_unit_score;
                if ($test_unit_score > $opt_unit_score) {
                    $opt_spacer     = $spacer;
                    $opt_unit_score = $test_unit_score;
                }
            } # foreach my $spacer

            # offset is where the current pwm starts
            $offset = $offset - $LEN{$pwm} - $opt_spacer;

            printf "# new offset %d\n", $offset;

            if (!defined $opt_unit_score) {
                printf "# unable to determine spacer\n";
                $s .= sprintf "\t-\t%s\t-", ('-' x $LEN{$pwm});
                next;
            } else {
                printf "# spacer $opt_spacer chosen, unit '%s' gives score %0.1f\n", $pwm,
                    $opt_unit_score;
                $fsi += $opt_unit_score;
                my $signal = substr($seq, $offset, $LEN{$pwm});
                $s .= sprintf "\t%d\t%s\t%0.2f", $opt_spacer, $signal, $fsi;
            }
        } # foreach my $pwm index
        # print the final bit score
        printf "%d\t%0.2f\t%s\t\n", ($p + 1), $fsi, $s;
    } # my $p = 0
} # for ($s = 0....



#######################################
# HELPER FUNCTIONS
#######################################


# scan using a matrix of information
sub scan {
    my @a;
    my ($s, %m) = @_;
    my $ma = $#{$m{A}};
    foreach my $p (0 .. (length($s) - $#{$m{A}} - 1)) {
        my $Ri = 0;
        $Ri += $m{substr($s, $p + $_, 1)}[$_] foreach (0 .. $ma);
        push @a, $Ri;
    }
    # return a list having n-l+1 elements, where n is the sequence length and
    # l is the matrix size (for -10: hexamer, l=6)
    return @a;
}

###############################################################
# spacer bit score calculations coordinates are shifted 6bp
###############################################################

sub read_mod {
    my @ret;
    my $fn = $_[0];
    my $i;
    open $i, '<', $fn or err("unable to open file '$fn': $!\n");
    while (readline($i)) {
        chomp;
        if (/^#\s*include\s*(.*)/) {
            my @a = read_mod($1);
            push @ret, @a;
        } else {
            next if /^[\s\#]+/;
            next unless /^\S+/;
            push @ret, $_;
        }
    }
    close $i;


    return @ret;
}

sub read_fasta {
    my @fasta; # contains all
    my $id = -1;
    while ( ) {
        chomp;
        if (/^>(.*)/) {
            $id++;
            $fasta[$id]->{id} = $1;
        } elsif (/^([A-Za-z]+)/) {
            $fasta[$id]->{seq} .= $1;
        }
    }
    return @fasta;
}

sub err {
    print $_[0];
    exit 1;
}
exit 0;


__DATA__
[pwm]=-10 region
weight=1
[A] 0 63 0 63 63 0
[T] 63 0 63 0 0 63
[G] 0 0 0 0 0 0
[C] 0 0 0 0 0 0
[spacer]
min=13
center=16
max=19
[pwm]=-35 region
weight=1
[A] 0 0 0 0 0 36
[T] 63 63 0 54 0 9
[G] 0 0 63 0 18 9
[C] 0 0 0 9 45 9
[spacer]
min=0
center=3
max=6
[pwm]=UP
weight=0.5
[A] 18 0 45 27 45 54 54 54 18 9 45 9 2 9 18 45 54 45 9 2 0 9
[T] 45 11 0 0 18 0 9 9 36 45 18 54 45 45 27 9 9 18 54 54 63 17
[G] 0 9 18 36 0 0 0 0 9 9 0 0 0 9 9 0 0 0 0 7 0 0
[C] 0 43 0 0 0 9 0 0 0 0 0 0 16 0 9 9 0 0 0 0 0 37
[spacer]
min=-4
center=2
max=4
[pwm]=FIS
weight=0.5
threshold=0
[A] 26 27 16 0 18 9 0 29 54 54 54 45 42 3 2 36 7 2 18 22 16
[T] 36 36 45 0 0 38 43 0 0 0 9 0 18 45 0 0 0 0 1 0 45
[G] 1 0 2 63 18 7 20 34 9 9 0 18 3 13 45 0 54 0 44 41 0
[C] 0 0 0 0 27 9 0 0 0 0 0 0 0 2 16 27 2 61 0 0 2
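The configuration format and the window scan used in the listing can be illustrated with a short, self-contained sketch. It is not part of the original script: the parser, the `%matrix` layout, the helper name `scan_matrix` and the test sequence are assumptions made for illustration, while the matrix values are copied from the -10 region block above.

```perl
use strict;
use warnings;

# A fragment in the same format as the __DATA__ block of the listing.
my $config = <<'END';
[pwm]=-10 region
weight=1
[A] 0 63 0 63 63 0
[T] 63 0 63 0 0 63
[G] 0 0 0 0 0 0
[C] 0 0 0 0 0 0
[spacer]
min=13
center=16
max=19
END

# Parse key=value settings into %setup and per-base rows into %matrix.
# This hash layout is an assumption modelled on the names used in the listing.
my (%setup, %matrix, $pwm);
foreach my $line (split /\n/, $config) {
    if    ($line =~ /^\[pwm\]=(.+)$/)        { $pwm = $1 }
    elsif ($line =~ /^\[([ATGC])\]\s+(.+)$/) { $matrix{$pwm}{$1} = [ split ' ', $2 ] }
    elsif ($line =~ /^(\w+)=(\S+)$/)         { $setup{$pwm}{$1} = $2 }
}

# Score every window of the sequence against the matrix (same idea as scan()).
sub scan_matrix {
    my ($seq, $m) = @_;
    my $last = $#{ $m->{A} };    # index of the last matrix column
    my @scores;
    foreach my $p (0 .. length($seq) - $last - 1) {
        my $score = 0;
        $score += $m->{ substr($seq, $p + $_, 1) }[$_] foreach 0 .. $last;
        push @scores, $score;
    }
    return @scores;
}

# Hypothetical test sequence with the hexamer TATAAT starting at offset 4.
my $seq    = 'GGCCTATAATGGCC';
my @scores = scan_matrix($seq, $matrix{'-10 region'});

# Pick the highest-scoring window.
my ($best) = sort { $scores[$b] <=> $scores[$a] } 0 .. $#scores;
printf "best hit at %d: %s (score %d)\n", $best, substr($seq, $best, 6), $scores[$best];
```

For this 14 bp sequence and the 6-column matrix the scan yields 14 - 6 + 1 = 9 window scores, and the window at offset 4 receives the maximum of 378 (six columns of 63).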

D.6 quasi mktemp manual<br />

NAME
    quasi_mktemp - create a template CBS Web Service implementation

SYNOPSIS
    perl quasi_mktempl [-n SERVICENAME] [-v VERSION] [-w WSNUMBER] (-f) (-remove) (-t TEMPLATENAME)

DESCRIPTION
    This script creates a functional template SOAP Web Service implementation under Quasi,
    including a working example. The object types this service receives/generates are the
    CBS standard sequence data object / annotation data object.

    The following elements are created by the program:

    * WSDL file, with proper namespaces and operation(s)
    * An XSD included by the WSDL
    * A directory in /usr/opt/www/cgi-bin/CBS/soap/ws/quasi/ containing the Perl module (module.pm)
    * A directory in /usr/opt/www/pub/CBS/ws/ containing the XSD, WSDL and example files.
    * An entry in mysql.WebServices.services
    * An index.php and include.html located in /usr/opt/www/pub/CBS/ws/[SERVICENAME]

    To-do list, once you have created the template:

    [ ] Alter the WSDL so it contains the operations you need
    [ ] Alter the XSD so all operation data types are defined


    [ ] Alter the file module.pm and possibly wrapper.pl, located in
        /usr/opt/www/cgi-bin/soap/ws/quasi/[SERVICE]/[WS]/
    [ ] Alter the example so that it contains a relevant example for your service.
    [ ] Alter the include.html so that it describes the usage of the example script
    [ ] Once you are happy with the implementation, remove the flag "internal_only" from
        mysql.WebServices.services and change the description of your service as desired
        (in field 'description')

OPTIONS
    -n SERVICENAME
        Case-sensitive service name, e.g. SignalP

    -v VERSION
        The version of the service in the form X.Y, e.g. 1.2

    -w WSNUMBER
        This is the implementation number for this service and version. The number
        starts at zero.

    -f
        Forces overwriting existing files

    -remove
        Removes all files pertaining to this service/version/implementation - be careful!

    -t TEMPLATE
        New templates can be installed. Use option -t list to list all templates

AUTHOR
    Peter Fischer Hallin, pfh@cbs.dtu.dk, September 2008

SEE ALSO
    /usr/opt/quaq/
    /usr/opt/www/cgi-bin/CBS/soap/ws/quasi.cgi

AUTHOR
    Peter Hallin 2008-09-15, pfh@cbs.dtu.dk
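
The options documented in the manual could be parsed and checked along the following lines. This is a minimal sketch using the core module Getopt::Long, not the actual quasi_mktemp source: the helper name `parse_mktemp_args`, the error messages and the assumption that -n, -v and -w are mandatory are all hypothetical.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse the documented options and validate them.
sub parse_mktemp_args {
    my @argv = @_;
    my %opt = (f => 0, remove => 0);
    GetOptionsFromArray(
        \@argv,
        'n=s'    => \$opt{n},         # case-sensitive service name, e.g. SignalP
        'v=s'    => \$opt{v},         # version in the form X.Y, e.g. 1.2
        'w=i'    => \$opt{w},         # implementation number, starting at zero
        'f'      => \$opt{f},         # force overwriting existing files
        'remove' => \$opt{remove},    # remove files for this service/version/implementation
        't=s'    => \$opt{t},         # template name, or "list"
    ) or die "invalid options\n";

    # Assumption: service name, version and implementation number are required.
    die "-n SERVICENAME is required\n" unless defined $opt{n};
    die "-v must have the form X.Y\n"
        unless defined $opt{v} && $opt{v} =~ /^\d+\.\d+$/;
    die "-w must be zero or greater\n"
        unless defined $opt{w} && $opt{w} >= 0;
    return %opt;
}

# Example call using the values given in the OPTIONS section.
my %opt = parse_mktemp_args(qw(-n SignalP -v 1.2 -w 0));
printf "service=%s version=%s ws=%d\n", @opt{qw(n v w)};
```

Validating the X.Y version format before any files are written keeps a mistyped version from creating a stray service directory, which matters because -remove deletes everything belonging to a service/version/implementation triple.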

173


BIBLIOGRAPHY<br />

Bibliography<br />

S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, & D. J.<br />

Lipman (1997). ‘Gapped blast <strong>and</strong> psi–blast: a new generation of prote<strong>in</strong> database<br />

searchprograms.’ Nucleic Acids Res 25:3389–402.<br />

B. F. Bauer, E. G. Kar, R. M. Elford, & W. M. Holmes (1988). ‘Sequence determ<strong>in</strong>ants<br />

for promoter strength <strong>in</strong> the leuv operon of Escherichia coli.’ Gene 63:123–34.<br />

J. Besemer, A. Lomsadze, & M. Borodovsky (2001). ‘GeneMarks: a self–tra<strong>in</strong><strong>in</strong>g method<br />

for prediction of gene starts <strong>in</strong> microbial genomes. Implications for f<strong>in</strong>d<strong>in</strong>g sequence<br />

motifs <strong>in</strong> regulatory regions.’ Nucleic Acids Res 29:2607–18.<br />

T. T. B<strong>in</strong>newies, P. F. Hall<strong>in</strong>, H.-H. Staerfeldt, & D. W. Ussery (2005). ‘Genome Update:<br />

proteome comparisons.’ Microbiology 151:1–4.<br />

T. T. B<strong>in</strong>newies, Y. Motro, P. F. Hall<strong>in</strong>, O. Lund, D. Dunn, T. La, D. J. Hampson,<br />

M. Bellgard, T. M. Wassenaar, & D. W. Ussery (2006). ‘Ten years of bacterial genome<br />

sequenc<strong>in</strong>g: comparative–genomics–baseddiscoveries.’ Funct Integr Genomics 6:165–85.<br />

E. Birney, J. A. Stamatoyannopoulos, A. Dutta, R. Guigo, T. R. G<strong>in</strong>geras, E. H. Margulies,<br />

Z. Weng, M. Snyder, E. T. Dermitzakis, R. E. Thurman, M. S. Kuehn, C. M.<br />

Taylor, S. Neph, C. M. Koch, S. Asthana, A. Malhotra, I. Adzhubei, J. A. Greenbaum,<br />

R. M. Andrews, P. Flicek, P. J. Boyle, H. Cao, N. P. Carter, G. K. Clell<strong>and</strong>, S. Davis,<br />

N. Day, P. Dhami, S. C. Dillon, M. O. Dorschner, H. Fiegler, P. G. Giresi, J. Goldy,<br />

M. Hawrylycz, A. Haydock, R. Humbert, K. D. James, B. E. Johnson, E. M. Johnson,<br />

T. T. Frum, E. R. Rosenzweig, N. Karnani, K. Lee, G. C. Lefebvre, P. A. Navas, F. Neri,<br />

S. C. J. Parker, P. J. Sabo, R. S<strong>and</strong>strom, A. Shafer, D. Vetrie, M. Weaver, S. Wilcox,<br />

M. Yu, F. S. Coll<strong>in</strong>s, J. Dekker, J. D. Lieb, T. D. Tullius, G. E. Crawford, S. Sunyaev,<br />

W. S. Noble, I. Dunham, F. Denoeud, A. Reymond, P. Kapranov, J. Rozowsky,<br />

D. Zheng, R. Castelo, A. Frankish, J. Harrow, S. Ghosh, A. S<strong>and</strong>el<strong>in</strong>, I. L. Hofacker,<br />

R. Baertsch, D. Keefe, S. Dike, J. Cheng, H. A. Hirsch, E. A. Sek<strong>in</strong>ger, J. Lagarde,<br />

J. F. Abril, A. Shahab, C. Flamm, C. Fried, J. Hackermuller, J. Hertel, M. L<strong>in</strong>demeyer,<br />

K. Missal, A. Tanzer, S. Washietl, J. Korbel, O. Emanuelsson, J. S. Pedersen, N. Holroyd,<br />

R. Taylor, D. Swarbreck, N. Matthews, M. C. Dickson, D. J. Thomas, M. T.<br />

Weirauch, J. Gilbert, J. Drenkow, I. Bell, X. Zhao, K. G. Sr<strong>in</strong>ivasan, W.-K. Sung, H. S.<br />

Ooi, K. P. Chiu, S. Foissac, T. Alioto, M. Brent, L. Pachter, M. L. Tress, A. Valencia,<br />

S. W. Choo, C. Y. Choo, C. Ucla, C. Manzano, C. Wyss, E. Cheung, T. G. Clark,<br />

J. B. Brown, M. Ganesh, S. Patel, H. Tammana, J. Chrast, C. N. Henrichsen, C. Kai,<br />

J. Kawai, U. Nagalakshmi, J. Wu, Z. Lian, J. Lian, P. Newburger, X. Zhang, P. Bickel,<br />

J. S. Mattick, P. Carn<strong>in</strong>ci, Y. Hayashizaki, S. Weissman, T. Hubbard, R. M. Myers,<br />

174


BIBLIOGRAPHY<br />

J. Rogers, P. F. Stadler, T. M. Lowe, C.-L. Wei, Y. Ruan, K. Struhl, M. Gerste<strong>in</strong>, S. E.<br />

Antonarakis, Y. Fu, E. D. Green, U. Karaoz, A. Siepel, J. Taylor, L. A. Liefer, K. A.<br />

Wetterstr<strong>and</strong>, P. J. Good, E. A. Fe<strong>in</strong>gold, M. S. Guyer, G. M. Cooper, G. Asimenos,<br />

C. N. Dewey, M. Hou, S. Nikolaev, J. I. Montoya-Burgos, A. Loytynoja, S. Whelan,<br />

F. Pardi, T. Mass<strong>in</strong>gham, H. Huang, N. R. Zhang, I. Holmes, J. C. Mullik<strong>in</strong>, A. Ureta-<br />

Vidal, B. Paten, M. Ser<strong>in</strong>ghaus, D. Church, K. Rosenbloom, W. J. Kent, E. A. Stone,<br />

S. Batzoglou, N. Goldman, R. C. Hardison, D. Haussler, W. Miller, A. Sidow, N. D.<br />

Tr<strong>in</strong>kle<strong>in</strong>, Z. D. Zhang, L. Barrera, R. Stuart, D. C. K<strong>in</strong>g, A. Ameur, S. Enroth, M. C.<br />

Bieda, J. Kim, A. A. Bh<strong>in</strong>ge, N. Jiang, J. Liu, F. Yao, V. B. Vega, C. W. H. Lee,<br />

P. Ng, A. Shahab, A. Yang, Z. Moqtaderi, Z. Zhu, X. Xu, S. Squazzo, M. J. Oberley,<br />

D. Inman, M. A. S<strong>in</strong>ger, T. A. Richmond, K. J. Munn, A. Rada-Iglesias, O. Wallerman,<br />

J. Komorowski, J. C. Fowler, P. Couttet, A. W. Bruce, O. M. Dovey, P. D. Ellis, C. F.<br />

Langford, D. A. Nix, G. Euskirchen, S. Hartman, A. E. Urban, P. Kraus, S. Van Calcar,<br />

N. He<strong>in</strong>tzman, T. H. Kim, K. Wang, C. Qu, G. Hon, R. Luna, C. K. Glass, M. G. Rosenfeld,<br />

S. F. Aldred, S. J. Cooper, A. Halees, J. M. L<strong>in</strong>, H. P. Shulha, X. Zhang, M. Xu,<br />

J. N. S. Haidar, Y. Yu, Y. Ruan, V. R. Iyer, R. D. Green, C. Wadelius, P. J. Farnham,<br />

B. Ren, R. A. Harte, A. S. H<strong>in</strong>richs, H. Trumbower, H. Clawson, J. Hillman-Jackson,<br />

A. S. Zweig, K. Smith, A. Thakkapallayil, G. Barber, R. M. Kuhn, D. Karolchik, L. Armengol,<br />

C. P. Bird, P. I. W. de Bakker, A. D. Kern, N. Lopez-Bigas, J. D. Mart<strong>in</strong>, B. E.<br />

Stranger, A. Woodroffe, E. Davydov, A. Dimas, E. Eyras, I. B. Hallgrimsdottir, J. Huppert,<br />

M. C. Zody, G. R. Abecasis, X. Estivill, G. G. Bouffard, X. Guan, N. F. Hansen,<br />

J. R. Idol, V. V. B. Maduro, B. Maskeri, J. C. McDowell, M. Park, P. J. Thomas, A. C.<br />

Young, R. W. Blakesley, D. M. Muzny, E. Sodergren, D. A. Wheeler, K. C. Worley,<br />

H. Jiang, G. M. We<strong>in</strong>stock, R. A. Gibbs, T. Graves, R. Fulton, E. R. Mardis, R. K.<br />

Wilson, M. Clamp, J. Cuff, S. Gnerre, D. B. Jaffe, J. L. Chang, K. L<strong>in</strong>dblad-Toh, E. S.<br />

L<strong>and</strong>er, M. Koriab<strong>in</strong>e, M. Nefedov, K. Osoegawa, Y. Yosh<strong>in</strong>aga, B. Zhu, & P. J. de Jong<br />

(2007). ‘Identification <strong>and</strong> analysis of functional elements <strong>in</strong> 1of the human genome by<br />

the encode pilot project.’ Nature 447:799–816.<br />

F. R. Blattner, G. r. Plunkett, C. A. Bloch, N. T. Perna, V. Burl<strong>and</strong>, M. Riley, J. Collado-<br />

Vides, J. D. Glasner, C. K. Rode, G. F. Mayhew, J. Gregor, N. W. Davis, H. A.<br />

Kirkpatrick, M. A. Goeden, D. J. Rose, B. Mau, & Y. Shao (1997). ‘The complete<br />

genome sequence of Escherichia coli k–12.’ Science 277:1453–62.<br />

A. J. t. Bokal, W. Ross, & R. L. Gourse (1995). ‘The transcriptional activator prote<strong>in</strong> fis:<br />

Dna <strong>in</strong>teractions <strong>and</strong>cooperative <strong>in</strong>teractions with rna polymerase at the Escherichia<br />

coli rrnbp1 promoter.’ J Mol Biol 245:197–207.<br />

A. Bolshoy, P. McNamara, R. E. Harr<strong>in</strong>gton, & E. N. Trifonov (1991). ‘Curved dna<br />

without a–a: experimental estimation of all 16 dna wedgeangles.’ Proc Natl Acad Sci U<br />

S A 88:2312–6.<br />

P. J. Brett, D. DeShazer, & D. E. Woods (1998). ‘Burkholderia thail<strong>and</strong>ensis sp. nov., a<br />

Burkholderia pseudomallei–likespecies.’ Int J Syst Bacteriol 48:317–20.<br />

E. Brzuszkiewicz, H. Bruggemann, H. Liesegang, M. Emmerth, T. Olschlager, G. Nagy,<br />

K. Albermann, C. Wagner, C. Buchrieser, L. Emody, G. Gottschalk, J. Hacker, & U. Dobr<strong>in</strong>dt<br />

(2006). ‘How to become a uropathogen: comparative genomic analysis ofextra<strong>in</strong>test<strong>in</strong>al<br />

pathogenic Escherichia coli stra<strong>in</strong>s.’ Proc Natl Acad Sci U S A 103:12879–84.<br />

S. L. Chen, C.-S. Hung, J. Xu, C. S. Reigstad, V. Magr<strong>in</strong>i, A. Sabo, D. Blasiar, T. Bieri,<br />

R. R. Meyer, P. Ozersky, J. R. Armstrong, R. S. Fulton, J. P. Latreille, J. Spieth, T. M.<br />

175


BIBLIOGRAPHY<br />

Hooton, E. R. Mardis, S. J. Hultgren, & J. I. Gordon (2006). ‘Identification of genes<br />

subject to positive selection <strong>in</strong> uropathogenicstra<strong>in</strong>s of Escherichia coli: a comparative<br />

genomics approach.’ Proc Natl Acad Sci U S A 103:5977–82.<br />

A. L. Delcher, D. Harmon, S. Kasif, O. White, & S. L. Salzberg (1999). ‘Improved microbial<br />

gene identification with glimmer.’ Nucleic Acids Res 27:4636–41.<br />

J. Eid, A. Fehr, J. Gray, K. Luong, J. Lyle, G. Otto, P. Peluso, D. Rank, P. Baybayan,<br />

B. Bettman, A. Bibillo, K. Bjornson, B. Chaudhuri, F. Christians, R. Cicero, S. Clark,<br />

R. Dalal, A. Dew<strong>in</strong>ter, J. Dixon, M. Foquet, A. Gaertner, P. Hardenbol, C. He<strong>in</strong>er,<br />

K. Hester, D. Holden, G. Kearns, X. Kong, R. Kuse, Y. Lacroix, S. L<strong>in</strong>, P. Lundquist,<br />

C. Ma, P. Marks, M. Maxham, D. Murphy, I. Park, T. Pham, M. Phillips, J. Roy,<br />

R. Sebra, G. Shen, J. Sorenson, A. Tomaney, K. Travers, M. Trulson, J. Vieceli, J. Wegener,<br />

D. Wu, A. Yang, D. Zaccar<strong>in</strong>, P. Zhao, F. Zhong, J. Korlach, & S. Turner (2009).<br />

‘Real–time dna sequenc<strong>in</strong>g from s<strong>in</strong>gle polymerase molecules.’ Science 323:133–8.<br />

M. Ender, B. Berger-Bachi, & N. McCallum (2009). ‘A novel dna–b<strong>in</strong>d<strong>in</strong>g prote<strong>in</strong> modulat<strong>in</strong>g<br />

methicill<strong>in</strong> resistance <strong>in</strong> Staphylococcus aureus.’ BMC Microbiol 9:15.<br />

S. T. Estrem, T. Gaal, W. Ross, & R. L. Gourse (1998). ‘Identification of an up element<br />

consensus sequence for bacterialpromoters.’ Proc Natl Acad Sci U S A 95:9761–6.<br />

P. F. Hall<strong>in</strong> & D. W. Ussery (2004). ‘Cbs Genome Atlas Database: a dynamic storage for<br />

bio<strong>in</strong>formatic results <strong>and</strong> sequence data.’ Bio<strong>in</strong>formatics 20:3682–6.<br />

K. Hayashi, N. Morooka, Y. Yamamoto, K. Fujita, K. Isono, S. Choi, E. Ohtsubo, T. Baba,<br />

B. L. Wanner, H. Mori, & T. Horiuchi (2006). ‘Highly accurate genome sequences of<br />

Escherichia coli k–12 stra<strong>in</strong>s mg1655<strong>and</strong> w3110.’ Mol Syst Biol 2:2006.0007.<br />

T. Hayashi, K. Mak<strong>in</strong>o, M. Ohnishi, K. Kurokawa, K. Ishii, K. Yokoyama, C. G. Han,<br />

E. Ohtsubo, K. Nakayama, T. Murata, M. Tanaka, T. Tobe, T. Iida, H. Takami,<br />

T. Honda, C. Sasakawa, N. Ogasawara, T. Yasunaga, S. Kuhara, T. Shiba, M. Hattori,<br />

& H. Sh<strong>in</strong>agawa (2001). ‘Complete genome sequence of enterohemorrhagic Escherichia<br />

coli o157:h7 <strong>and</strong>genomic comparison with a laboratory stra<strong>in</strong> k–12.’ DNA Res 8:11–22.<br />

P. N. Hengen, S. L. Bartram, L. E. Stewart, & T. D. Schneider (1997). ‘Information<br />

analysis of Fis b<strong>in</strong>d<strong>in</strong>g sites.’ Nucleic Acids Res 25:4994–5002.<br />

C. A. Hirvonen, W. Ross, C. E. Wozniak, E. Marasco, J. R. Anthony, S. E. Aiyar, V. H.<br />

Newburn, & R. L. Gourse (2001). ‘Contributions of up elements <strong>and</strong> the transcription<br />

factor fis toexpression from the seven rrn p1 promoters <strong>in</strong> Escherichia coli.’ J Bacteriol<br />

183:6305–14.<br />

A. M. Huerta & J. Collado-Vides (2003). ‘Sigma70 promoters <strong>in</strong> Escherichia coli: specific<br />

transcription <strong>in</strong> denseregions of overlapp<strong>in</strong>g promoter–like signals.’ J Mol Biol 333:261–<br />

78.<br />

L. J. Jensen, C. Friis, & D. W. Ussery (1999). ‘Three views of microbial genomes.’ Res<br />

Microbiol 150:773–7.<br />

L. J. Jensen, M. Skovgaard, T. Sicheritz-Ponten, N. T. Hansen, H. Johansson, M. K.<br />

Joergensen, K. Kiil, P. F. Hall<strong>in</strong>, & D. Ussery (2005). THE PSEUDOMONADS VOL<br />

I. GENOMICS, LIFE STYLE AND MOLECULAR ARCHITECTURE, vol. 1, chap.<br />

Chapter 5: <strong>Comparative</strong> genomics of four Pseudomonas species, pp. 139–164. Kluwer<br />

Academic / Plenum Publishers, New York.<br />

176


BIBLIOGRAPHY<br />

Q. J<strong>in</strong>, Z. Yuan, J. Xu, Y. Wang, Y. Shen, W. Lu, J. Wang, H. Liu, J. Yang, F. Yang,<br />

X. Zhang, J. Zhang, G. Yang, H. Wu, D. Qu, J. Dong, L. Sun, Y. Xue, A. Zhao, Y. Gao,<br />

J. Zhu, B. Kan, K. D<strong>in</strong>g, S. Chen, H. Cheng, Z. Yao, B. He, R. Chen, D. Ma, B. Qiang,<br />

Y. Wen, Y. Hou, & J. Yu (2002). ‘Genome sequence of Shigella flexneri 2a: <strong>in</strong>sights<br />

<strong>in</strong>to pathogenicitythrough comparison with genomes of Escherichia coli k12 <strong>and</strong> o157.’<br />

Nucleic Acids Res 30:4432–41.<br />

T. J. Johnson, S. Kariyawasam, Y. Wannemuehler, P. Mangiamele, S. J. Johnson,<br />

C. Doetkott, J. A. Skyberg, A. M. Lynne, J. R. Johnson, & L. K. Nolan (2007). ‘The<br />

genome sequence of avian pathogenic Escherichia coli stra<strong>in</strong> o1:k1:h7shares strong similarities<br />

with human extra<strong>in</strong>test<strong>in</strong>al pathogenic e. coligenomes.’ J Bacteriol 189:3228–36.<br />

J. Kyte & R. F. Doolittle (1982). ‘A simple method for display<strong>in</strong>g the hydropathic character<br />

of a prote<strong>in</strong>.’ J Mol Biol 157:105–32.<br />

K. Lagesen, P. Hall<strong>in</strong>, E. A. Rodl<strong>and</strong>, H.-H. Staerfeldt, T. Rognes, & D. W. Ussery (2007).<br />

‘RNAmmer: consistent <strong>and</strong> rapid annotation of ribosomal rna genes.’ Nucleic Acids Res<br />

35:3100–8.<br />

T. S. Larsen & A. Krogh (2003). ‘EasyGene–a prokaryotic gene f<strong>in</strong>der that ranks ORFs<br />

by statistical significance.’ BMC Bio<strong>in</strong>formatics 4:21.<br />

T. Lefebure & M. J. Stanhope (2007). ‘Evolution of the core <strong>and</strong> pan–genome of Streptococcus:<br />

positive selection, recomb<strong>in</strong>ation, <strong>and</strong> genome composition.’ Genome Biol<br />

8:R71.<br />

X. Liao, T. Y<strong>in</strong>g, H. Wang, J. Wang, Z. Shi, E. Feng, K. Wei, Y. Wang, X. Zhang,<br />

L. Huang, G. Su, & P. Huang (2003). ‘A two–dimensional proteome map of Shigella<br />

flexneri.’ Electrophoresis 24:2864–82.<br />

B. Liebig & R. Wagner (1995). ‘Effects of different growth conditions on the <strong>in</strong> vivo<br />

activity of thet<strong>and</strong>em Escherichia coli ribosomal rna promoters p1 <strong>and</strong> p2.’ Mol Gen<br />

Genet 249:328–35.<br />

D. Lim & N. C. J. Strynadka (2002). ‘Structural basis for the beta lactam resistance of<br />

pbp2a from methicill<strong>in</strong>–resistant Staphylococcus aureus.’ Nat Struct Biol 9:870–6.<br />

T. M. Lowe & S. R. Eddy (1997). ‘tRNAscan–se: a program for improved detection of<br />

transfer rna genes <strong>in</strong>genomic sequence.’ Nucleic Acids Res 25:955–64.<br />

J. P. McCutcheon, B. R. McDonald, & N. A. Moran (2009). ‘Orig<strong>in</strong> of an alternative<br />

genetic code <strong>in</strong> the extremely small <strong>and</strong> gc–rich genome of a bacterial symbiont.’ PLoS<br />

Genet 5:e1000565.<br />

C. E. McEwan, D. Gatherer, & N. R. McEwan (1998). ‘Nitrogen–fix<strong>in</strong>g aerobic bacteria<br />

have higher genomic gc content than non–fix<strong>in</strong>g species with<strong>in</strong> the same genus.’<br />

Hereditas 128:173–8.<br />

W. G. Miller, C. T. Parker, M. Rubenfield, G. L. Mendz, M. M. S. M. Wosten, D. W.<br />

Ussery, J. F. Stolz, T. T. B<strong>in</strong>newies, P. F. Hall<strong>in</strong>, G. Wang, J. A. Malek, A. Rogos<strong>in</strong>,<br />

L. H. Stanker, & R. E. M<strong>and</strong>rell (2007). ‘The complete genome sequence <strong>and</strong> analysis<br />

of the epsilonproteobacteriumArcobacter butzleri.’ PLoS One 2:e1358.<br />

H. D. Murray & R. L. Gourse (2004). ‘Unique roles of the rrn p2 rrna promoters <strong>in</strong><br />

Escherichia coli.’ Mol Microbiol 52:1375–87.<br />

177


BIBLIOGRAPHY<br />

A. Nakabachi, A. Yamashita, H. Toh, H. Ishikawa, H. E. Dunbar, N. A. Moran, & M. Hattori<br />

(2006). ‘The 160–kilobase genome of the bacterial endosymbiont Carsonella.’ Science<br />

314:267.<br />

C. Ong, C. H. Ooi, D. Wang, H. Chong, K. C. Ng, F. Rodrigues, M. A. Lee, & P. Tan<br />

(2004). ‘Patterns of large–scale genomic variation <strong>in</strong> virulent <strong>and</strong> avirulentBurkholderia<br />

species.’ Genome Res 14:2295–307.<br />

J. Parkhill, B. W. Wren, K. Mungall, J. M. Ketley, C. Churcher, D. Basham, T. Chill<strong>in</strong>gworth,<br />

R. M. Davies, T. Feltwell, S. Holroyd, K. Jagels, A. V. Karlyshev, S. Moule,<br />

M. J. Pallen, C. W. Penn, M. A. Quail, M. A. Raj<strong>and</strong>ream, K. M. Rutherford, A. H. van<br />

Vliet, S. Whitehead, & B. G. Barrell (2000). ‘The genome sequence of the food–borne<br />

pathogen Campylobacter jejunireveals hypervariable sequences.’ Nature 403:665–8.<br />

A. G. Pedersen, L. J. Jensen, S. Brunak, H. H. Staerfeldt, & D. W. Ussery (2000). ‘A dna<br />

structural atlas for Escherichia coli.’ J Mol Biol 299:907–30.<br />

V. Perez-Brocal, R. Gil, S. Ramos, A. Lamelas, M. Postigo, J. M. Michelena, F. J. Silva,<br />

A. Moya, & A. Latorre (2006). ‘A small microbial genome: the end of a long symbiotic<br />

relationship?’ Science 314:312–3.<br />

N. T. Perna, G. r. Plunkett, V. Burl<strong>and</strong>, B. Mau, J. D. Glasner, D. J. Rose, G. F. Mayhew,<br />

P. S. Evans, J. Gregor, H. A. Kirkpatrick, G. Posfai, J. Hackett, S. Kl<strong>in</strong>k, A. Bout<strong>in</strong>,<br />

Y. Shao, L. Miller, E. J. Grotbeck, N. W. Davis, A. Lim, E. T. Dimalanta, K. D.<br />

Potamousis, J. Apodaca, T. S. Anantharaman, J. L<strong>in</strong>, G. Yen, D. C. Schwartz, R. A.<br />

Welch, & F. R. Blattner (2001). ‘Genome sequence of enterohaemorrhagic Escherichia<br />

coli o157:h7.’ Nature 409:529–33.<br />

O. N. Reva, P. F. Hall<strong>in</strong>, H. Willenbrock, T. Sicheritz-Ponten, B. Tummler, & D. W.<br />

Ussery (2008). ‘Global features of the Alcanivorax borkumensis sk2 genome.’ Environ<br />

Microbiol 10:614–25.<br />

E. P. C. Rocha (2004). ‘Codon usage bias from trna‘s po<strong>in</strong>t of view: redundancy, specialization,<br />

<strong>and</strong> efficient decod<strong>in</strong>g for translation optimization.’ Genome Res 14:2279–86.<br />

W. Ross, J. Salomon, W. M. Holmes, & R. L. Gourse (1999). ‘Activation of Escherichia<br />

coli leuv transcription by fis.’ J Bacteriol 181:3864–8.<br />

K. Rutherford, J. Parkhill, J. Crook, T. Horsnell, P. Rice, M. A. Raj<strong>and</strong>ream, & B. Barrell<br />

(2000). ‘Artemis: sequence visualization <strong>and</strong> annotation.’ Bio<strong>in</strong>formatics 16:944–5.<br />

R. A. Sanford, J. R. Cole, & J. M. Tiedje (2002). ‘Characterization <strong>and</strong> description of<br />

Anaeromyxobacter dehalogenans gen. nov., sp. nov., an aryl–halorespir<strong>in</strong>g facultative<br />

anaerobic myxobacterium.’ Appl Environ Microbiol 68:893–900.<br />

S. C. Satchwell, H. R. Drew, & A. A. Travers (1986). ‘Sequence periodicities <strong>in</strong> chicken<br />

nucleosome core dna.’ J Mol Biol 191:659–75.<br />

S. Schneiker, O. Perlova, O. Kaiser, K. Gerth, A. Alici, M. O. Altmeyer, D. Bartels,<br />

T. Bekel, S. Beyer, E. Bode, H. B. Bode, C. J. Bolten, J. V. Choudhuri, S. Doss,<br />

Y. A. Elnakady, B. Frank, L. Gaigalat, A. Goesmann, C. Groeger, F. Gross, L. Jelsbak,<br />

L. Jelsbak, J. Kal<strong>in</strong>owski, C. Kegler, T. Knauber, S. Konietzny, M. Kopp, L. Krause,<br />

D. Krug, B. L<strong>in</strong>ke, T. Mahmud, R. Mart<strong>in</strong>ez-Arias, A. C. McHardy, M. Merai, F. Meyer,<br />

S. Mormann, J. Munoz-Dorado, J. Perez, S. Pradella, S. Rachid, G. Raddatz, F. Rosenau,<br />

C. Ruckert, F. Sasse, M. Scharfe, S. C. Schuster, G. Suen, A. Treuner-Lange, G. J.<br />

178


BIBLIOGRAPHY<br />

Velicer, F.-J. Vorholter, K. J. Weissman, R. D. Welch, S. C. Wenzel, D. E. Whitworth,<br />

S. Wilhelm, C. Wittmann, H. Blocker, A. Puhler, & R. Muller (2007). ‘Complete genome<br />

sequence of the myxobacterium Sorangium cellulosum.’ Nat Biotechnol 25:1281–9.<br />

R. K. Shultzaberger, Z. Chen, K. A. Lewis, & T. D. Schneider (2007). ‘Anatomy of<br />

Escherichia coli sigma70 promoters.’ Nucleic Acids Res 35:771–88.<br />

M. D. Smith, B. J. Angus, V. Wuthiekanun, & N. J. White (1997). ‘Arab<strong>in</strong>ose assimilation<br />

def<strong>in</strong>es a nonvirulent biotype of Burkholderiapseudomallei.’ Infect Immun 65:4319–21.<br />

H. Tettel<strong>in</strong>, V. Masignani, M. J. Cieslewicz, C. Donati, D. Med<strong>in</strong>i, N. L. Ward, S. V.<br />

Angiuoli, J. Crabtree, A. L. Jones, A. S. Durk<strong>in</strong>, R. T. Deboy, T. M. Davidsen, M. Mora,<br />

M. Scarselli, I. Margarit y Ros, J. D. Peterson, C. R. Hauser, J. P. Sundaram, W. C.<br />

Nelson, R. Madupu, L. M. Br<strong>in</strong>kac, R. J. Dodson, M. J. Rosovitz, S. A. Sullivan,<br />

S. C. Daugherty, D. H. Haft, J. Selengut, M. L. Gw<strong>in</strong>n, L. Zhou, N. Zafar, H. Khouri,<br />

D. Radune, G. Dimitrov, K. Watk<strong>in</strong>s, K. J. B. O’Connor, S. Smith, T. R. Utterback,<br />

O. White, C. E. Rubens, G. Gr<strong>and</strong>i, L. C. Madoff, D. L. Kasper, J. L. Telford, M. R.<br />

Wessels, R. Rappuoli, & C. M. Fraser (2005). ‘Genome analysis of multiple pathogenic<br />

isolates of Streptococcus agalactiae: implications for the microbial “pan–genome“.’ Proc<br />

Natl Acad Sci U S A 102:13950–5.<br />

J. D. Thompson, D. G. Higg<strong>in</strong>s, & T. J. Gibson (1994). ‘Clustal w: improv<strong>in</strong>g the sensitivity<br />

of progressive multiple sequencealignment through sequence weight<strong>in</strong>g, position–<br />

specific gap penalties <strong>and</strong>weight matrix choice.’ Nucleic Acids Res 22:4673–80.<br />

H. Toh, B. L. Weiss, S. A. H. Perkin, A. Yamashita, K. Oshima, M. Hattori, & S. Aksoy (2006). ‘Massive genome erosion and functional adaptations provide insights into the symbiotic lifestyle of Sodalis glossinidius in the tsetse host.’ Genome Res 16:149–56.

M. L. Tress, P. L. Martelli, A. Frankish, G. A. Reeves, J. J. Wesselink, C. Yeats, P. I. Olason, M. Albrecht, H. Hegyi, A. Giorgetti, D. Raimondo, J. Lagarde, R. A. Laskowski, G. Lopez, M. I. Sadowski, J. D. Watson, P. Fariselli, I. Rossi, A. Nagy, W. Kai, Z. Storling, M. Orsini, Y. Assenov, H. Blankenburg, C. Huthmacher, F. Ramirez, A. Schlicker, F. Denoeud, P. Jones, S. Kerrien, S. Orchard, S. E. Antonarakis, A. Reymond, E. Birney, S. Brunak, R. Casadio, R. Guigo, J. Harrow, H. Hermjakob, D. T. Jones, T. Lengauer, C. A. Orengo, L. Patthy, J. M. Thornton, A. Tramontano, & A. Valencia (2007). ‘The implications of alternative splicing in the ENCODE protein complement.’ Proc Natl Acad Sci U S A 104:5495–500.

J. W. Tukey (1977). Exploratory Data Analysis. Addison-Wesley.

D. W. Ussery, P. F. Hallin, K. Lagesen, & T. M. Wassenaar (2004). ‘Genome update: tRNAs in sequenced microbial genomes.’ Microbiology 150:1603–6.

T. Visnes, B. Doseth, H. S. Pettersen, L. Hagen, M. M. L. Sousa, M. Akbari, M. Otterlei, B. Kavli, G. Slupphaug, & H. E. Krokan (2009). ‘Uracil in DNA and its processing by different DNA glycosylases.’ Philos Trans R Soc Lond B Biol Sci 364:563–8.

H. Wang & C. J. Benham (2008). ‘Superhelical destabilization in regulatory regions of stress response genes.’ PLoS Comput Biol 4:e17.

H. Wang, M. Noordewier, & C. J. Benham (2004). ‘Stress–induced DNA duplex destabilization (SIDD) in the E. coli genome: SIDD sites are closely associated with promoters.’ Genome Res 14:1575–84.


R. A. Welch, V. Burland, G. r. Plunkett, P. Redford, P. Roesch, D. Rasko, E. L. Buckles, S.-R. Liou, A. Boutin, J. Hackett, D. Stroud, G. F. Mayhew, D. J. Rose, S. Zhou, D. C. Schwartz, N. T. Perna, H. L. T. Mobley, M. S. Donnenberg, & F. R. Blattner (2002). ‘Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli.’ Proc Natl Acad Sci U S A 99:17020–4.

H. Willenbrock, C. Friis, A. S. Juncker, & D. W. Ussery (2006). ‘An environmental signature for 323 microbial genomes based on codon adaptation indices.’ Genome Biol 7:R114.

K.-M. Wu, L.-H. Li, J.-J. Yan, N. Tsao, T.-L. Liao, H.-C. Tsai, C.-P. Fung, H.-J. Chen, Y.-M. Liu, J.-T. Wang, C.-T. Fang, S.-C. Chang, H.-Y. Shu, T.-T. Liu, Y.-T. Chen, Y.-R. Shiau, T.-L. Lauderdale, I.-J. Su, R. Kirby, & S.-F. Tsai (2009). ‘Genome sequencing and comparative analysis of Klebsiella pneumoniae NTUH–K2044, a strain causing liver abscess and meningitis.’ J Bacteriol 191:4492–501.

F. Yang, J. Yang, X. Zhang, L. Chen, Y. Jiang, Y. Yan, X. Tang, J. Wang, Z. Xiong, J. Dong, Y. Xue, Y. Zhu, X. Xu, L. Sun, S. Chen, H. Nie, J. Peng, J. Xu, Y. Wang, Z. Yuan, Y. Wen, Z. Yao, Y. Shen, B. Qiang, Y. Hou, J. Yu, & Q. Jin (2005). ‘Genome dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery.’ Nucleic Acids Res 33:6445–58.
