29.07.2013

Computational tools and Interoperability in Comparative ... - CBS



Chapter 1

Introduction

Since the publication of the first complete bacterial genome sequence in 1995, close to a thousand prokaryotes have been fully sequenced and made publicly available. These data represent large efforts by many scientists and technicians, closing gaps in the chromosomal sequences and providing detailed gene annotations. These genome projects constitute a valuable collection of prokaryotic diversity, and they serve as an indispensable resource for comparative studies when novel features of newly discovered organisms are identified.

We are, however, witnessing a transition phase as genome sequencing becomes a trivial step carried out by any researcher or company in need of a better characterization of an organism. Sequencing equipment and the capability of assembling an entire genome will likely follow the same path as other technological advances. Telephones, cars, aeroplanes, and computers all started as costly and clumsy attempts and ended up as mainstream, affordable, and efficient products that are taken for granted. Nothing will prevent sequencing technology from following the same path, and it will likely end up as a tiny desktop instrument on a doctor’s table next to the blood pressure measuring device. But the decreasing novelty of presenting a new genome sequence could cause a decline in the number of published genomes in the near future, resulting in less control and organization of these data, with fewer demands on data integrity and on sequencing and annotation quality.

Some major issues arise as massive amounts of genomic data become a reality. There are signs that our ability to process and analyze genomic data is being overtaken by the technological developments of the sequencing equipment. For example, over the past twenty-five years, GenBank has grown roughly 100,000-fold, whereas computer processing power, following Moore’s law, has grown “only” about 1,000-fold. The overwhelming data generated by modern sequencing machines constitute tough challenges for most biologists, and although efforts are constantly being made to improve gene prediction and genome assembly software, these steps do not yet function in a scalable and unsupervised fashion. Further, the post-annotation steps of deriving knowledge from predicted genes remain one of the biggest challenges. How do we transform contigs of nucleotide sequences into knowledge from which the phenotype of the organism can be derived?
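The growth comparison above can be made concrete by converting each fold increase into an implied doubling time. The following is a minimal sketch, not part of the original text: it uses only the two figures quoted above (100,000-fold and 1,000-fold over twenty-five years) and assumes, for illustration, steady exponential growth across the whole period. The function name `doubling_time` is hypothetical.

```python
import math


def doubling_time(fold_increase: float, years: float) -> float:
    """Years per doubling implied by a fold increase over a time span.

    Assumes steady exponential growth, which is a simplification.
    """
    return years / math.log2(fold_increase)


YEARS = 25                 # time span quoted in the text
GENBANK_FOLD = 100_000     # GenBank growth quoted in the text
COMPUTE_FOLD = 1_000       # processing power growth quoted in the text

genbank_doubling = doubling_time(GENBANK_FOLD, YEARS)
compute_doubling = doubling_time(COMPUTE_FOLD, YEARS)

# How much more data there is per unit of processing power after 25 years.
data_per_compute = GENBANK_FOLD / COMPUTE_FOLD

print(f"GenBank doubling time: {genbank_doubling:.2f} years")
print(f"Compute doubling time: {compute_doubling:.2f} years")
print(f"Data per unit of compute grew {data_per_compute:.0f}-fold")
```

Under these assumptions the sequence data double roughly every 1.5 years while processing power doubles roughly every 2.5 years, so the amount of data per unit of compute has grown about 100-fold over the period.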

As more prokaryotic genomes are being sequenced, there are now a number of species for which multiple strains have been sequenced. Roughly one fourth of all prokaryotic projects fall within species where five or more strains are available. As this coverage of diversity increases, we may begin to answer some key questions with better confidence. How do we define core sets of genes? Can we estimate the size of the pan genome? Which features are novel in selected strains, and are these features regionally conserved within the chromosomes? To answer these questions, there is a fundamental need to visualize and overview the similarities and differences between larger numbers of genomes. Obtaining such an overview allows some questions concerning gene acquisition and chromosomal
