What Bioinformatics - Analysis and Management of Complex Systems

What Bioinformatics Is All About 

Jürgen Sühnel 

jsuehnel@fli-leibniz.de 

Complex Ideas II, 

June 20, 2008, Altes Schloss Dornburg

Fritz Lipmann 

The Journal of Biological Chemistry 

186, 235, 1950

Complex Biological Systems 

Any function performed by a system that is not the result of a single part in the system, 

but rather is the result of interacting parts in the system, is an emergent property. 

Systems that have emergent properties are said to be irreducible. 

A system is said to be complex if its emergent properties are unpredictable ??. 

Living 'systems' are complex systems. 

The recognition that complex systems, especially life, are truly understood from 

knowledge of the interactions of their component parts is fundamental to 

Systems Biology. 

However, the identification and analysis of a system's parts remains an important aspect of 

research into complex systems.

Molecular Biology

Molecules of Life 

(CARBOHYDRATES)

One- and Three-Dimensional (3D) Structures of Biopolymers 

sequence, primary structure 

(3D) structure

Pathways, Networks, Systems

Eisenberg et al., Nature 2000, 405, 823-826.

Technological Development: Microarray Techniques, … 

Automation 

Parallelization 

Miniaturization

New Scientific Goals 

• Sequencing of complete genomes 

• Determination of interaction patterns of all proteins in a 

cell 

• Determination of all 3D structures of a genome 

(Structural Genomics) 

• Determination of the mRNA or protein pattern of all genes 

• High-throughput technologies (drug screening, …) 

•...

Genome Sizes

Name                                 Base pairs      Genes      Note

Phi-X 174                            5,386           10         E. coli virus
Mycoplasma genitalium                580,073         483
E. coli                              4,639,221       4,337      bacterium
Arabidopsis thaliana (thale cress)   115,409,949     25,498     small plant genome
Human                                3.3 x 10^9      ~25,000
Rice (Nipponbare)                    ~430,000,000    ~50,000

Down syndrome, 

trisomy 21 

FLI

Genomic Sequence 

Human chromosome 14, 

Long arm (FastA format) 

Chromosome 14 is characterized by a heterochromatic 

short arm that contains essentially ribosomal RNA genes, 

and a euchromatic long arm in which most, if not all, of the 

protein-coding genes are located. The finished sequence 

of human chromosome 14 comprises 87,410,661 base 

pairs, representing 100% of its euchromatic 

portion, in a single continuous segment covering the 

entire long arm with no gaps. Two loci of crucial 

importance for the immune system, as well 

as more than 60 disease genes, have been 

localized so far on chromosome 14. 

We identified 1,050 genes and gene fragments, 

and 393 pseudogenes.

It's not just the genes: epigenetics.

Data Explosion

The First Complete Genomes (of free-living organisms)

Data Explosion

PDB Content Growth 

Year     New structures per year     Per day

1993     698                         ~2
2003     4181                        ~11
2004     5212                        ~14
2005     5402                        ~15
2006     6541                        ~18

(no theoretical structures) 
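The per-day column follows from the yearly counts by simple division. A minimal sketch, assuming a 365-day year (the function name `per_day` is only illustrative):

```python
# New PDB structures per year, as listed on the slide.
new_structures_per_year = {1993: 698, 2003: 4181, 2004: 5212, 2005: 5402, 2006: 6541}


def per_day(per_year: int, days_in_year: int = 365) -> float:
    """Average number of new structures per day (assumes a 365-day year)."""
    return per_year / days_in_year


for year, count in new_structures_per_year.items():
    print(f"{year}: {count} per year, about {per_day(count):.1f} per day")
```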

Protein Data Bank – 

3D Structure Database of Biological Macromolecules 

More Than 1000 Biological Databases 

Sequences 

Genome Browsers 

Jena Prokaryotic 

Genome Viewer 

3D Structures 

Protein Domain 

And Motif Classification 

Interactions and Networks 

Disease Information 

Other 

Tandem Splice Site 

Database

What is Bioinformatics? 

Bioinformatics 

is a new discipline that covers all aspects of 

Acquisition 

Storage 

Processing 

Analysis 

Interpretation 

of biological data.

What is Bioinformatics? 

Bioinformatics? 

Computational Biology? 

Theoretical Biophysics / Theoretical Biochemistry? 

Mathematical Biology? 

Systems Biology? 

Integrative Biology? 

Theoretical Biology? 

The developments described may lead to a paradigm shift in biology. 

Reductionism vs. Holism?

The Role of Theory in Biology: Little Impact Thus Far 

Sydney Brenner 

Princeton University 2003: 

The Watson Lecture: Biology in the Era of Complete Genomes 

ETH Zürich 2006: 

Pauli Lecture: Theoretical Biology in the Next Decade 

The close interplay between theory, modeling, and experiment has dominated 

many other branches of science, particularly physics and astrophysics, 

but it had little impact on biology until now. 

I used to refer to the Journal of Theoretical Biology as the cure to insomnia. 

No longer will I be able to say that. The genome and its vast accumulation of data 

have the potential to change all of that by opening the doors for scientists with 

more analytical and theoretical bents.

The Role of Theory in Biology: An Analogy 

Johannes Kepler used Tycho Brahe's detailed astronomical observations to derive his laws of 

planetary motion (Kepler's laws). 

In a certain sense, bioinformatics and theoretical biology are still awaiting the discovery of 

their own Kepler's laws.

How Bioinformatics Contributes to Genome Analysis 

Assembly of overlapping genome fragments 

D. W. Mount: Bioinformatics, 

Cold Spring Harbor Laboratory Press, 2001. 

Gene identification 

Gene annotation 

function 1 function 2 

function 3 

Data integration from 

collaborative projects 

Database development 

Comparative genomics 

? 

mutation ?

Sequence Comparison 

Similar sequences are often homologous and then tend to have a similar structure and function. 

Typical task: 

• A relatively small, newly sequenced genome has 1000 protein-coding genes 

• Scan a database with 6 million entries for sequence similarity for each of the 

1000 protein sequences (pairwise alignment allowing for gaps), 

i.e. about 6 x 10^9 pairwise comparisons 

For the gapped alignment of two sequences of length 100 there are 

more than 10^75 possible arrangements. 
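The order of magnitude can be checked with a short count. This is a sketch under one stated assumption: an alignment is treated as a path through a grid with three possible steps (aligned pair, gap in the first sequence, gap in the second), which gives the Delannoy number. Other counting conventions give different totals; the function name is illustrative.

```python
def count_alignments(m: int, n: int) -> int:
    """Count gapped alignments of two sequences of lengths m and n.

    Assumption: every alignment is a path through an (m+1) x (n+1) grid that
    uses three moves (aligned pair, gap in sequence 1, gap in sequence 2),
    so the result is the Delannoy number D(m, n).
    """
    prev = [1] * (n + 1)              # first row: only gaps, one way each
    for _ in range(m):
        curr = [1]                    # first column: only gaps, one way
        for j in range(1, n + 1):
            # paths arriving from above, from the left, and diagonally
            curr.append(prev[j] + curr[j - 1] + prev[j - 1])
        prev = curr
    return prev[n]


total = count_alignments(100, 100)
print(f"{total:.3e} possible gapped alignments")
```

Exhaustive enumeration is therefore out of the question, which is why dynamic programming and heuristic methods are used instead.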

Alignment by 

• Visual inspection (dot plots) 

• Dynamic programming (optimal alignment) 

• Heuristic methods (BLAST, FASTA)
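As an illustration of the dynamic programming approach, the following is a minimal sketch of global alignment scoring in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example; real searches use substitution matrices and affine gap penalties, and this sketch returns only the optimal score, not the alignment itself.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of a and b (simple linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of: aligned pair, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))
```

The table has (len(a)+1) x (len(b)+1) cells, so the optimum is found in time proportional to the product of the sequence lengths rather than by enumerating all arrangements.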

Dot Plot 

D. W. Mount: Bioinformatics, Cold Spring Harbor Laboratory Press, 2001.
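A dot plot can be sketched in a few lines: one sequence runs along the rows, the other along the columns, and a mark is set wherever the two windows are identical. The window parameter and the text output are assumptions made for this illustration; similar regions show up as diagonal runs of marks.

```python
def dot_plot(a: str, b: str, window: int = 1) -> list[str]:
    """Return one text row per position in a; '*' marks identical windows."""
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            # Compare the window starting at i in a with the one at j in b.
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows


for line in dot_plot("GATTACA", "GATTTACA", window=2):
    print(line)
```

Larger windows suppress isolated chance matches at the cost of missing short similar stretches.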

Local BLAST Alignment 

DNA repair endonuclease XPF

Local BLAST Alignment

Romualdi A et al.; GenColors: annotation and comparative genomics of prokaryotes made easy. Methods Mol Biol. 2007;395:75-96.

Single Nucleotide Polymorphisms (SNPs) 

Genetic basis for differences between individuals: 

SNP density in the complete genome: 1 per 1.91 kb 

SNP density within genes: 1 per 1.08 kb 

Genes cover only 5% of the complete genome. 

So, most of the SNPs occur outside of genes. 

SNPs within genes or in the surrounding region 

may predispose individuals for a certain disease. 

SNP analyses can possibly also explain 

the different responses of individuals to drugs.
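The conclusion that most SNPs lie outside of genes follows directly from the two densities and the 5% gene coverage. A small sketch of that arithmetic, using only the numbers from the slide (the function name is illustrative; the genome size cancels out):

```python
def fraction_of_snps_in_genes(spacing_genome_kb: float,
                              spacing_genes_kb: float,
                              gene_fraction: float) -> float:
    """Share of all SNPs that fall within genes."""
    density_genome = 1.0 / spacing_genome_kb   # SNPs per kb, whole genome
    density_genes = 1.0 / spacing_genes_kb     # SNPs per kb, within genes
    return gene_fraction * density_genes / density_genome


in_genes = fraction_of_snps_in_genes(1.91, 1.08, 0.05)
print(f"within genes: {in_genes:.1%}, outside genes: {1 - in_genes:.1%}")
```

Although the density is higher within genes, fewer than one in ten SNPs is located there.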

The International HapMap Project is a partnership of scientists and funding agencies 

to develop a public resource that will help researchers to find genes associated with 

human disease and response to pharmaceuticals. 

Sites in the genome where the DNA sequences of many individuals differ by a single base are 

called single nucleotide polymorphisms (SNPs). For example, some people may have a 

chromosome with an A at a particular site where others have a chromosome with a G. 

Each form is called an allele. 

Each person has two copies of all chromosomes except the sex chromosomes. 

The set of alleles that a person has is called a genotype. For this SNP a person could 

have the genotype AA, AG, or GG. The term genotype can refer to the SNP alleles that a 

person has at a particular SNP, or for many SNPs across the genome. A method that 

discovers what genotype a person has is called genotyping. 
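The three genotypes named above are simply the unordered pairs of the two alleles. A minimal sketch (the helper name is hypothetical):

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles: str) -> list[str]:
    """All unordered allele pairs a person can carry at one SNP."""
    pairs = combinations_with_replacement(sorted(alleles), 2)
    return ["".join(pair) for pair in pairs]


print(possible_genotypes("AG"))
```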

About 10 million SNPs exist in human populations, where the rarer SNP allele has a frequency 

of at least 1%. Alleles of SNPs that are close together tend to be inherited together. 

A set of associated SNP alleles in a region of a chromosome is called a "haplotype". 

Most chromosome regions have only a few common haplotypes (each with a frequency of 

at least 5%), which account for most of the variation from person to person in a population. 

A chromosome region may contain many SNPs, but only a few "tag" SNPs can provide most of 

the information on the pattern of genetic variation in the region.
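The idea of common haplotypes and tag SNPs can be illustrated with a toy calculation. Everything in this sketch is hypothetical: the sample of 40 chromosomes, the four SNP sites, and the brute-force search, which is not the method used by the HapMap Project and only works for small examples.

```python
from collections import Counter
from itertools import combinations


def common_haplotypes(observed, min_frequency=0.05):
    """Return {haplotype: frequency} for haplotypes at or above min_frequency."""
    counts = Counter(observed)
    total = len(observed)
    return {h: c / total for h, c in counts.items() if c / total >= min_frequency}


def tag_snps(haplotypes):
    """Smallest set of SNP positions whose alleles tell all haplotypes apart."""
    n_sites = len(haplotypes[0])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(h[s] for s in sites) for h in haplotypes}
            if len(signatures) == len(haplotypes):
                return sites
    return tuple(range(n_sites))


# Hypothetical sample: 40 chromosomes typed at four neighbouring SNPs.
sample = ["ACGT"] * 18 + ["GTAC"] * 15 + ["ACAT"] * 6 + ["GCGT"] * 1
common = common_haplotypes(sample)
print(common)
print(tag_snps(sorted(common)))
```

In this toy sample two of the four sites are enough to identify each common haplotype, so only those two would need to be genotyped.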

Genetic Diseases 

monogenic 

polygenic 

Mucoviscidosis, 

Cystic fibrosis 

CFTR gene (chromosome 7) 

Discovered in 1989 

Cancer 

Oncogenes 

Tumor suppressor genes

Common Genetic Disorders

Drug research is 

the search for a needle in a haystack. 

www.kubinyi.de

Costs in Drug Research 

Cost for discovering and developing a new drug: 

several hundred million euros up to € 1 billion (average € 802 million) 

Time to market: 

10 – 15 years

Drug Development 

Virtual Ligand Screening

Drug Development

Structure-based 

Design: Virtual Screening 

Virtual Screening: 

Select subsets of compounds for assay that are more likely to contain 

active hits than a sample chosen at random 

Time Scales: 

Docking of 1 compound: 30 s (SGI R10000 processor) 

Docking of the 1.1 million data set: 6 days (64-processor SGI ORIGIN) 
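The two time scales are consistent with each other, as a quick estimate shows. The sketch assumes ideal linear speed-up and that each of the 64 processors docks one compound in 30 s; both are assumptions, not statements from the slide.

```python
SECONDS_PER_DAY = 86_400


def screening_days(n_compounds: int, seconds_per_compound: float,
                   n_processors: int = 1) -> float:
    """Wall-clock days for docking a library (assumes perfect parallel scaling)."""
    total_seconds = n_compounds * seconds_per_compound
    return total_seconds / n_processors / SECONDS_PER_DAY


serial_days = screening_days(1_100_000, 30)
parallel_days = screening_days(1_100_000, 30, n_processors=64)
print(f"1 processor: {serial_days:.0f} days, 64 processors: {parallel_days:.1f} days")
```

On a single processor the same screen would take more than a year, which is why virtual screening depends on parallel hardware.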

ACD-SC: Database from Molecular Design Ltd. 

Agonists: Known active compounds 

Docking of ligands to the estrogen receptor 

(nuclear hormone receptor)

Structure-based 

Design: Virtual Screening

Computer Power for Bioinformatics/Computational Biology

Blue Gene is an IBM Research project dedicated to exploring the 

frontiers in supercomputing: 

in computer architecture, 

in the software required to program and control massively parallel systems, 

and in the use of computation to advance our understanding of important biological processes 

such as protein folding. 

The Blue Gene/L machine was designed and built in collaboration with the Department of 

Energy's NNSA/Lawrence Livermore National Laboratory in California, and the LLNL system 

has a peak speed of 596 Teraflops. Blue Gene systems occupy the #1 (LLNL Blue Gene/L) 

and a total of 4 of the top 10 positions in the TOP500 supercomputer list announced in November 2007.

Computer Power for Bioinformatics/Computational Biology

Protein Folding by Molecular Dynamics 

Fastest-folding protein yet discovered 

(submicrosecond folding) 

PDB ID: 2f4k


Text Mining: Duplication of Scientific Articles

http://discovery.swmed.edu/dejavu/

Methods in Bioinformatics and Computational Biology 

• Pairwise and Multiple Sequence Alignment 

• Assembly of Genome Fragments 

• Gene Prediction and Annotation 

• Genome Analysis and Functional Genomics (Proteomics) 

• Phylogenetic Analysis 

• Analysis, Classification and Prediction of Nucleic Acid and Protein Structures 

• Drug Design 

• Analysis of Data Obtained by Microarray Technologies 

• Data/Text Mining (Databases / Web) 

• Network Analysis (Systems Biology) 

Database Technologies | Pattern Recognition | Visualization Techniques | Statistics | 

Quantum Chemistry | Molecular Dynamics | Artificial Intelligence Methods | ... 

HTML, XML (specialized markup languages), Perl, Python, CGI, C, Java, ...

Ethical Issues 

Genes are the organism's blueprint. 

Therefore, any information, analysis or manipulation 

is of particular importance for each individual. 

Biobanks 

Genetic fingerprints 

Preimplantation diagnosis 

…

Genetic fingerprinting, DNA typing 

Alec Jeffreys, discoverer of genetic fingerprinting 

Identification of genome regions specific to an individual, 

currently in introns (STRs, short tandem repeats) 

• Linking individuals to crime scenes 

• Identification of blood relationships (paternity tests) 

• Identification of accident victims
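The quantity compared in STR typing is the number of consecutive copies of a short motif at a given locus. A minimal sketch of counting such repeats in a sequence; the motif and the example sequence are hypothetical, and real DNA typing is based on fragment length measurements at defined loci rather than on raw string matching.

```python
import re


def longest_tandem_repeat(sequence: str, motif: str) -> int:
    """Largest number of consecutive copies of motif found in sequence."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Hypothetical fragment with a short tandem repeat of the motif GATA.
print(longest_tandem_repeat("TTGATAGATAGATACC", "GATA"))
```

Two samples match at a locus when their repeat counts agree; comparing several loci makes a chance match increasingly unlikely.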

Genetic fingerprinting, DNA typing

Biobanks

What Are the Most Important Current Contributions of Bioinformatics to Biology? 

Functional Prediction by Sequence Comparison? 

BLAST 

(October 18, 2005) 

(June 19, 2008)


Sequences 

Genome Browsers 

Jena Prokaryotic 

Genome Viewer 

3D Structures 

Protein Domain 

And Motif Classification 

Interactions and Networks 

Disease Information 

Other 

Databases


New Conceptual Ideas 

Eisenberg et al., Nature 2000, 405, 823-826.

Outlook 

Improved automated data validation procedures 

Better interoperability of databases, data integration 

Analysis of intergenic regions (promoter sites, microRNAs) 

Comparative genomics 

Epigenetics 

Analysis of more complex systems – Systems Biology 

Integration of different scientific disciplines towards 

a real Computational and Theoretical Biology

Outlook 

A warning: 

Do not move too fast. Otherwise you will have only a loose connection to reality.

Aim 

Information 

Knowledge 

Close connection between experimental and computational/theoretical approaches

www.fli-leibniz.de/jcb/ 

Coordinator / Spokesman: J. Sühnel (2001-2007) | S. Schuster (2008 - …) 

Manager: K. Wagner (2001-2007) | L. Blei (2008 - …)

JCB Members (24 research labs + 3 companies) 

Bacterial Genetics (S. Brantl, FSU, Biology and Pharmacy) 

Biocomputing (J. Sühnel, FLI) 

Bioinformatics (R. Backofen, ALU) 

Bioinformatics (S. Böcker, FSU, Mathematics and Computer Science) 

Bioinformatics (S. Schuster, FSU, Biology and Pharmacy) 

Bioinformatics and Population Genetics (K. Schmid, MPICE) 

Bioinformatics / Pattern Recognition (U. Möller, HKI) 

Biophysics / Bioinformatics (A. H. Gitter, FHJ) 

Bio-Systems Analysis (P. Dittrich, FSU, Mathematics and Computer Science) 

Computational Neuroscience (H. Witte, FSU, Medicine) 

Entomology (D. Heckel, MPICE) 

Experimental Rheumatology (R. Kinne, FSU, Medicine) 

General Botany (M. Mittag, Biology and Pharmacy) 

Genetics (G. Theissen, FSU, Biology and Pharmacy) 

Genome Analysis (M. Platzer, FLI) 

Language and Information Engineering Lab at Jena University – 

JULIE Lab (U. Hahn, FSU, Philosophy) 

Medical Engineering and Biotechnology (A. Voss, FHJ) 

Molecular and Applied Microbiology / Systems Biology (R. Guthke, HKI) 

Biomolecular NMR Spectroscopy (M. Görlach, FLI) 

Molecular Cell Biology (R. Wetzker, FSU, Medicine) 

Practical Computer Science I (C. Beckstein, FSU, Mathematics and Computer Science) 

Practical Computer Science II (E.G. Schukat-Talamazzini, FSU, Mathematics and Computer Science) 

Protein Crystallography (M. Than, FLI) 

Theoretical Computer Science (R. Niedermeier, FSU, Mathematics and Computer Science) 

BioControl Jena GmbH 

Clondiag Chip Technologies GmbH 

SIRS-Lab

Publications, Talks, Posters, Diploma/PhD Theses 

JCB Publications (experimental and theoretical) 

                                            JCB groups/companies    BMBF/JCB funded authors

Peer-reviewed journal articles:             404 (203)               137 (56)
Contributions to conference proceedings:    51 (25)                 39 (19)
(most of them also peer-reviewed)
Book contributions:                         20 (13)                 8 (3)
Books:                                      4 (1)                   -
Other (not peer-reviewed) publications:     31 (25)                 14 (12)
Total:                                      510 (266)               198 (90)

Talks (outside Jena): more than 200 

Posters: more than 250 

Diploma Theses: 18 (15 finished) 

PhD Theses: 22 (7 finished) 

Last update: 15/02/2007 (figures in brackets: 21/02/2005)
