
What Bioinformatics - Analyse und Management komplexer Systeme


What Bioinformatics Is All About<br />

Jürgen Sühnel<br />

jsuehnel@fli-leibniz.de<br />


Complex Ideas II,<br />

June 20, 2008, Altes Schloss Dornburg


Fritz Lipmann<br />

The Journal of Biological Chemistry<br />

186, 235, 1950


Complex Biological Systems<br />

Any function performed by a system that is not the result of a single part in the system,<br />

but rather is the result of interacting parts in the system, is an emergent property.<br />

Systems that have emergent properties are said to be irreducible.<br />

A system is said to be complex if its emergent properties are unpredictable (?).<br />

Living 'systems' are complex systems.<br />

The recognition that complex systems, especially life, are truly understood from<br />

knowledge of the interactions of their component parts is fundamental to<br />

Systems Biology.<br />

However, the identification and analysis of a system's parts is an important aspect of<br />

research into complex systems.


Molecular Biology


Molecules of Life<br />

(CARBOHYDRATES)


One- and Three-Dimensional (3D) Structures of Biopolymers<br />

sequence, primary structure<br />

(3D) structure


Pathways, Networks, Systems<br />

Eisenberg et al., Nature 2000, 405, 823-826.


Technological Development: Microarray Techniques, …<br />

Automation<br />

Parallelization<br />

Miniaturization


New Scientific Goals<br />

• Sequencing of complete genomes<br />

• Determination of the interaction patterns of all proteins in a cell<br />

• Determination of the 3D structures of all proteins encoded by a genome (Structural Genomics)<br />

• Determination of the mRNA or protein pattern of all genes<br />

• High-throughput technologies (drug screening, …)<br />

•...


Genome Sizes<br />

Name | Base pairs | Genes | Remarks<br />

Phi-X 174 | 5,386 | 10 | E. coli virus<br />

Mycoplasma genitalium | 580,073 | 483 |<br />

E. coli | 4,639,221 | 4,337 | bacterium<br />

Arabidopsis thaliana (thale cress) | 115,409,949 | 25,498 | small plant genome<br />

Human | 3.3 × 10^9 | ~25,000 |<br />

Rice (Nipponbare) | ~430,000,000 | ~50,000 |


Down syndrome,<br />

trisomy 21<br />

FLI


Genomic Sequence<br />

Human chromosome 14,<br />

Long arm (FastA format)<br />

Chromosome 14 is characterized by a heterochromatic<br />

short arm that contains essentially ribosomal RNA genes,<br />

and a euchromatic long arm in which most, if not all, of the<br />

protein-coding genes are located. The finished sequence<br />

of human chromosome 14 comprises 87,410,661 base<br />

pairs, representing 100% of its euchromatic<br />

portion, in a single continuous segment covering the<br />

entire long arm with no gaps. Two loci of crucial<br />

importance for the immune system, as well<br />

as more than 60 disease genes, have been<br />

localized so far on chromosome 14.<br />

We identified 1,050 genes and gene fragments,<br />

and 393 pseudogenes.
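Because the slide shows the sequence in FASTA format, a minimal reader for that format may help as an illustration. The sketch below assumes the common convention of a header line starting with ">" followed by one or more sequence lines; the function name `parse_fasta` and the two short records are made up for this example and are not chromosome 14 data.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated and normalised to upper case.
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical two-record input (illustrative only).
example = """>fragment_1 made-up example
ACGTACGT
ttgaca
>fragment_2
GGCC
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

For a sequence as large as a whole chromosome arm, a streaming reader that yields one record at a time would be preferable to building a dictionary in memory.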


It's not just the genes: epigenetics.


Data Explosion


The First Complete Genomes (of free-living organisms)


PDB Content Growth<br />

Year | New structures per year | New structures per day<br />

1993 | 698 | ~2<br />

2003 | 4181 | ~11<br />

2004 | 5212 | ~14<br />

2005 | 5402 | ~15<br />

2006 | 6541 | ~18<br />

(no theoretical structures)<br />

Protein Data Bank –<br />

3D Structure Database of Biological Macromolecules<br />

Start
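The per-day figures on the PDB growth slide can be reproduced by dividing the yearly counts by the number of days in a year. The short sketch below assumes 365 days for every year and rounds to the nearest integer; the variable names are chosen for this example.

```python
# Yearly numbers of new PDB structures as given on the slide.
new_structures_per_year = {1993: 698, 2003: 4181, 2004: 5212, 2005: 5402, 2006: 6541}

# Assumption: 365 days per year, rounded to the nearest whole structure.
per_day = {year: round(count / 365) for year, count in new_structures_per_year.items()}

for year, rate in per_day.items():
    print(f"{year}: ~{rate} new structures per day")
```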


More Than 1000 Biological Databases<br />

Sequences<br />

Genome Browsers<br />

Jena Prokaryotic Genome Viewer<br />

3D Structures<br />

Protein Domain and Motif Classification<br />

Interactions and Networks<br />

Disease Information<br />

Other<br />

Tandem Splice Site Database


What is Bioinformatics?<br />

Bioinformatics<br />

is a new discipline that covers all aspects of<br />

Acquisition<br />

Storage<br />

Processing<br />

Analysis<br />

Interpretation<br />

of biological data.


What is Bioinformatics?<br />

Bioinformatics ?<br />

Computational Biology ?<br />

Theoretical Biophysics / Theoretical Biochemistry ?<br />

Mathematical Biology ?<br />

Systems Biology ?<br />

Integrative Biology ?<br />

Theoretical Biology ?<br />

The developments described may lead to a paradigm change in biology.<br />

Reductionism vs. Holism ?


The Role of Theory in Biology: Little Impact Thus Far<br />

Sydney Brenner<br />

Princeton University 2003:<br />

The Watson Lecture: Biology in the Era of Complete Genomes<br />

ETH Zürich 2006:<br />

Pauli-Vorlesung: Theoretical Biology in the Next Decade<br />

The close interplay between theory, modeling, and experiment has dominated<br />

many other branches of science, particularly physics and astrophysics,<br />

but it had little impact on biology until now.<br />

I used to refer to the Journal of Theoretical Biology as the cure to insomnia.<br />

No longer will I be able to say that. The genome and its vast accumulation of data<br />

have the potential to change all of that by opening the doors for scientists with<br />

more analytical and theoretical bents.


The Role of Theory in Biology: An Analogy<br />

Johannes Kepler used Tycho Brahe's detailed astronomical observations to develop his theories of<br />

astronomy (Kepler's laws).<br />

In a certain sense, bioinformatics and theoretical biology are still<br />

awaiting the discovery of their own Kepler's laws.


How Bioinformatics Contributes to Genome Analysis<br />

Assembly of overlapping genome fragments<br />

Gene identification<br />

Gene annotation (function 1, function 2, function 3)<br />

Data integration from collaborative projects<br />

Database development<br />

Comparative genomics (mutation?)<br />

D. W. Mount: Bioinformatics, Cold Spring Harbor Laboratory Press, 2001.


Sequence Comparison<br />

Similar sequences are often homologous and tend to have a similar structure and function.<br />

Typical task:<br />

• A relatively small, newly sequenced genome has 1000 protein-coding genes<br />

• Scan a database with 6 million entries for sequence similarity for each of the 1000 protein sequences (pairwise alignment allowing for gaps)<br />

For the gapped alignment of two sequences of length 100 there are more than 10^75 possible arrangements.<br />
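The size of this search space can be illustrated with a short counting sketch. It assumes one common way of counting gapped alignments, in which every alignment column is either a residue pair, a gap in the first sequence, or a gap in the second; whether the figure on the slide was derived in exactly this way is an assumption. The function name `count_alignments` is chosen for this example.

```python
def count_alignments(m: int, n: int) -> int:
    """Count gapped alignments of two sequences of lengths m and n.

    Every alignment ends in one of three column types, which gives the
    recurrence f(i, j) = f(i-1, j) + f(i, j-1) + f(i-1, j-1),
    with f(i, 0) = f(0, j) = 1 (only gaps remain).
    """
    row = [1] * (n + 1)  # f(0, j) for all j
    for _ in range(m):
        new_row = [1]  # f(i, 0)
        for j in range(1, n + 1):
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[n]


total = count_alignments(100, 100)
print(f"about 10^{len(str(total)) - 1} alignments")
```

Under this counting scheme the result for two sequences of length 100 exceeds 10^75, which is why exhaustive enumeration is not an option and dynamic programming or heuristics are used instead.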

Alignment by<br />

• Visual inspection (dot plots)<br />

• Dynamic programming (optimal alignment)<br />

• Heuristic methods (BLAST, FASTA)
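To make the dynamic programming option concrete, here is a minimal sketch of a global alignment score in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, linear gap penalty -2) and the function name are assumptions for illustration; real protein comparisons use substitution matrices and usually affine gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a) + 1) × (len(b) + 1) cells, so the optimal score is found in time proportional to the product of the sequence lengths. Heuristic tools such as BLAST and FASTA trade this guarantee of optimality for speed when scanning large databases.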


Dot Plot<br />

D. W. Mount: Bioinformatics, Cold Spring Harbor Laboratory Press, 2001.
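A dot plot can be sketched in a few lines: one sequence is written along the rows, the other along the columns, and a mark is placed wherever the two agree. The function below is an illustrative text version; the name `dot_plot` and the `window` parameter (number of consecutive residues that must match) are assumptions for this example and are not taken from the cited book.

```python
from typing import List


def dot_plot(a: str, b: str, window: int = 1) -> List[str]:
    """Return a text dot plot: '*' where a[i:i+window] equals b[j:j+window]."""
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows


# Hypothetical short sequences; similar regions show up as diagonal runs.
for line in dot_plot("GATTACA", "GATGACA", window=2):
    print(line)
```

Larger windows suppress random single-residue matches, so that diagonals corresponding to genuinely similar regions become easier to see.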


Local BLAST Alignment<br />

DNA repair endonuclease XPF


Romualdi A et al.; GenColors: annotation and comparative genomics of prokaryotes made easy. Methods Mol Biol. 2007;395:75-96.


Single Nucleotide Polymorphisms (SNPs)<br />

Genetic basis for differences between individuals:<br />

SNP density in the complete genome: 1 per 1.91 kilobases<br />

SNP density within genes: 1 per 1.08 kilobases<br />

Genes cover only 5% of the complete genome.<br />

So, most of the SNPs occur outside of genes.<br />

SNPs within genes or in the surrounding region may predispose individuals to a certain disease.<br />

SNP analyses can possibly also explain the different responses of individuals to drugs.
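The conclusion that most SNPs lie outside genes follows from the three numbers on the slide. The sketch below works through the arithmetic, assuming that the 5% refers to the fraction of the genome's sequence length covered by genes; the variable names are chosen for this example.

```python
genome_kb_per_snp = 1.91  # one SNP per 1.91 kilobases, genome-wide
gene_kb_per_snp = 1.08    # one SNP per 1.08 kilobases within genes
gene_fraction = 0.05      # assumed: share of the genome length covered by genes

# SNPs in genes / all SNPs
#   = (gene_fraction * L / gene_kb_per_snp) / (L / genome_kb_per_snp)
snp_fraction_in_genes = gene_fraction * genome_kb_per_snp / gene_kb_per_snp

print(f"SNPs within genes:  {snp_fraction_in_genes:.1%}")
print(f"SNPs outside genes: {1 - snp_fraction_in_genes:.1%}")
```

So although SNPs are somewhat denser within genes than in the genome as a whole, under these assumptions fewer than one in ten SNPs falls inside a gene.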


The International HapMap Project is a partnership of scientists and funding agencies<br />

to develop a public resource that will help researchers to find genes associated with<br />

human disease and response to pharmaceuticals.<br />

Sites in the genome where the DNA sequences of many individuals differ by a single base are<br />

called single nucleotide polymorphisms (SNPs). For example, some people may have a<br />

chromosome with an A at a particular site where others have a chromosome with a G.<br />

Each form is called an allele.<br />

Each person has two copies of all chromosomes except the sex chromosomes.<br />

The set of alleles that a person has is called a genotype. For this SNP a person could<br />

have the genotype AA, AG, or GG. The term genotype can refer to the SNP alleles that a<br />

person has at a particular SNP, or for many SNPs across the genome. A method that<br />

discovers what genotype a person has is called genotyping.<br />

About 10 million SNPs exist in human populations, where the rarer SNP allele has a frequency<br />

of at least 1%. Alleles of SNPs that are close together tend to be inherited together.<br />

A set of associated SNP alleles in a region of a chromosome is called a "haplotype".<br />

Most chromosome regions have only a few common haplotypes (each with a frequency of<br />

at least 5%), which account for most of the variation from person to person in a population.<br />

A chromosome region may contain many SNPs, but only a few "tag" SNPs can provide most of<br />

the information on the pattern of genetic variation in the region.
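The terms allele, genotype and allele frequency can be illustrated with a small sketch. It enumerates the possible genotypes for a SNP with two alleles and estimates allele frequencies from a list of genotypes. The sample data and function names are hypothetical, and the 1% threshold is the one quoted in the text above.

```python
from collections import Counter
from itertools import combinations_with_replacement
from typing import Dict, List


def possible_genotypes(alleles: str) -> List[str]:
    """All unordered pairs of alleles, e.g. 'AG' -> ['AA', 'AG', 'GG']."""
    return ["".join(pair) for pair in combinations_with_replacement(sorted(alleles), 2)]


def allele_frequencies(genotypes: List[str]) -> Dict[str, float]:
    """Fraction of each allele among all chromosomes in the sample."""
    counts = Counter("".join(genotypes))
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


def is_common_snp(genotypes: List[str], threshold: float = 0.01) -> bool:
    """True if at least two alleles occur and the rarer one reaches the threshold."""
    freqs = allele_frequencies(genotypes)
    return len(freqs) >= 2 and min(freqs.values()) >= threshold


# Hypothetical genotypes of four people at one SNP (eight chromosomes).
sample = ["AA", "AG", "GG", "AA"]
print(possible_genotypes("AG"))
print(allele_frequencies(sample))
print(is_common_snp(sample))
```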


Genetic Diseases<br />

Monogenic:<br />

Mucoviscidosis (cystic fibrosis)<br />

CFTR gene (chromosome 7)<br />

Discovered in 1989<br />

Polygenic:<br />

Cancer<br />

Oncogenes<br />

Tumor suppressor genes


Common Genetic Disorders


Drug research is<br />

the search for a needle in a haystack.<br />

www.kubinyi.de


Costs in Drug Research<br />

Cost of discovering and developing a new drug:<br />

from several hundred million up to € 1,000 million (average € 802 million)<br />

Time to market:<br />

10 – 15 years


Drug Development<br />

Virtual Ligand Screening


Structure-based Design: Virtual Screening<br />

Virtual Screening:<br />

Select subsets of compounds for assay that are more likely to contain active hits than a sample chosen at random<br />

Time Scales:<br />

Docking of 1 compound: 30 s (SGI R10000 processor)<br />

Docking of the 1.1 million data set: 6 days (64-processor SGI ORIGIN)<br />

ACD-SC: Database from Molecular Design Ltd.<br />

Agonists: Known active compounds<br />

Docking of ligands to the estrogen receptor (nuclear hormone receptor)
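The two time scales are consistent with each other, as a quick calculation shows. The sketch assumes that each of the 64 processors docks compounds at the single-processor rate of 30 s per compound and that the work is distributed without overhead; both are simplifying assumptions, and the variable names are chosen for this example.

```python
seconds_per_compound = 30
compounds = 1_100_000
processors = 64
seconds_per_day = 24 * 60 * 60

# Total processor time, then wall-clock time with ideal parallel scaling.
days_single = seconds_per_compound * compounds / seconds_per_day
days_parallel = days_single / processors

print(f"1 processor:   {days_single:.0f} days")
print(f"64 processors: {days_parallel:.1f} days")
```

On a single processor the screen would take more than a year, which is why virtual screening of this size relies on parallel hardware.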


Computer Power for Bioinformatics/Computational Biology<br />

Blue Gene is an IBM Research project dedicated to exploring the<br />

frontiers in supercomputing:<br />

in computer architecture,<br />

in the software required to program and control massively parallel systems,<br />

and in the use of computation to advance our understanding of important biological processes<br />

such as protein folding.<br />

The Blue Gene/L machine was designed and built in collaboration with the Department of<br />

Energy's NNSA/Lawrence Livermore National Laboratory in California, and the LLNL system<br />

has a peak speed of 596 Teraflops. Blue Gene systems occupy the #1 (LLNL Blue Gene/L)<br />

and a total of 4 of the top 10 positions in the TOP500 supercomputer list announced in November 2007.


Protein Folding by Molecular Dynamics<br />

Fastest-folding protein yet discovered<br />

(submicrosecond folding)<br />

PDB ID: 2f4k


Text Mining: Duplication of Scientific Articles<br />

http://discovery.swmed.edu/dejavu/


Methods in Bioinformatics and Computational Biology<br />

• Pairwise and Multiple Sequence Alignment<br />

• Assembly of Genome Fragments<br />

• Gene Prediction and Annotation<br />

• Genome Analysis and Functional Genomics (Proteomics)<br />

• Phylogenetic Analysis<br />

• Analysis, Classification and Prediction of Nucleic Acid and Protein Structures<br />

• Drug Design<br />

• Analysis of Data Obtained by Microarray Technologies<br />

• Data/Text Mining (Databases / Web)<br />

• Network Analysis (Systems Biology)<br />

Database Technologies | Pattern Recognition | Visualization Techniques | Statistics |<br />

Quantum Chemistry | Molecular Dynamics | Artificial Intelligence Methods | ...<br />

HTML, XML (specialized markup languages), Perl, Python, CGI, C, Java, ...


Ethical Issues<br />

Genes are the organism's blueprint.<br />

Therefore, any genetic information, analysis or manipulation<br />

is of particular importance to each individual.<br />

Biobanks<br />

Genetic fingerprints<br />

Preimplantation diagnosis<br />


Genetic fingerprinting, DNA typing<br />

Alec Jeffreys, discoverer of genetic fingerprinting<br />

Identification of genomic regions specific to an individual,<br />

currently in introns (STRs, short tandem repeats)<br />

• Linking individuals to crime scenes<br />

• Identification of blood relationships (paternity tests)<br />

• Identification of accident victims
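The idea behind STR-based typing can be sketched as follows: individuals differ in how many times a short motif is repeated in tandem at a given locus, so the repeat count serves as a marker. The motif, the sequences and the function name below are hypothetical; real DNA typing measures fragment lengths at a standardized set of loci rather than scanning raw sequence like this.

```python
import re


def longest_tandem_repeat(sequence: str, motif: str) -> int:
    """Largest number of consecutive copies of motif found in sequence."""
    runs = re.findall(f"(?:{re.escape(motif)})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Hypothetical alleles of one STR locus in two individuals.
person_1 = "TTGATAGATAGATACC"
person_2 = "TTGATAGATAGATAGATAGATACC"

print(longest_tandem_repeat(person_1, "GATA"))
print(longest_tandem_repeat(person_2, "GATA"))
```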


Biobanks


What are the Most Important Current Contributions of Bioinformatics to Biology?<br />

Functional Prediction by Sequence Comparison ?<br />

BLAST<br />

(October 18, 2005)<br />

(June 19, 2008)


What are the Most Important Current Contributions of Bioinformatics to Biology?<br />

Sequences<br />

Genome Browsers<br />

Jena Prokaryotic Genome Viewer<br />

3D Structures<br />

Protein Domain and Motif Classification<br />

Interactions and Networks<br />

Disease Information<br />

Other<br />

Databases


What are the Most Important Current Contributions of Bioinformatics to Biology?<br />

New Conceptual Ideas<br />

Eisenberg et al., Nature 2000, 405, 823-826.


Outlook<br />

Improved automated data validation procedures<br />

Better interoperability of databases, data integration<br />

Analysis of intergenic regions (promoter sites, microRNAs)<br />

Comparative genomics<br />

Epigenetics<br />

Analysis of more complex systems – Systems Biology<br />

Integration of different scientific disciplines towards a real Computational and Theoretical Biology


Outlook<br />

A warning:<br />

Do not move too fast. Otherwise you will have only a loose connection to reality.


Aim<br />

Information<br />

Knowledge<br />

Close connection between experimental and computational/theoretical approaches


www.fli-leibniz.de/jcb/<br />

Coordinator / Spokesman: J. Sühnel (2001-2007) | S. Schuster (2008 - …)<br />

Manager: K. Wagner (2001-2007) | L. Blei (2008 - …)


JCB Members (24 research labs + 3 companies)<br />

Bacterial Genetics (S. Brantl, FSU, Biology and Pharmacy)<br />

Biocomputing (J. Sühnel, FLI)<br />

Bioinformatics (R. Backofen, ALU)<br />

Bioinformatics (S. Böcker, FSU, Mathematics and Computer Science)<br />

Bioinformatics (S. Schuster, FSU, Biology and Pharmacy)<br />

Bioinformatics and Population Genetics (K. Schmid, MPICE)<br />

Bioinformatics / Pattern Recognition (U. Möller, HKI)<br />

Biophysics / Bioinformatics (A. H. Gitter, FHJ)<br />

Bio-Systems Analysis (P. Dittrich, FSU, Mathematics and Computer Science)<br />

Computational Neuroscience (H. Witte, FSU, Medicine)<br />

Entomology (D. Heckel, MPICE)<br />

Experimental Rheumatology (R. Kinne, FSU, Medicine)<br />

General Botany (M. Mittag, Biology and Pharmacy)<br />

Genetics (G. Theissen, FSU, Biology and Pharmacy)<br />

Genome Analysis (M. Platzer, FLI)<br />

Language and Information Engineering Lab at Jena University – JULIE Lab (U. Hahn, FSU, Philosophy)<br />

Medical Engineering and Biotechnology (A. Voss, FHJ)<br />

Molecular and Applied Microbiology / Systems Biology (R. Guthke, HKI)<br />

Biomolecular NMR Spectroscopy (M. Görlach, FLI)<br />

Molecular Cell Biology (R. Wetzker, FSU, Medicine)<br />

Practical Computer Science I (C. Beckstein, FSU, Mathematics and Computer Science)<br />

Practical Computer Science II (E.G. Schukat-Talamazzini, FSU, Mathematics and Computer Science)<br />

Protein Crystallography (M. Than, FLI)<br />

Theoretical Computer Science (R. Niedermeier, FSU, Mathematics and Computer Science)<br />

BioControl Jena GmbH<br />

Clondiag Chip Technologies GmbH<br />

SIRS-Lab


Publications, Talks, Posters, Diploma/PhD Theses<br />

JCB Publications (experimental and theoretical)<br />

Category | JCB groups/companies | BMBF/JCB-funded authors<br />

Peer-reviewed journal articles | 404 (203) | 137 (56)<br />

Contributions to conference proceedings (most of them also peer-reviewed) | 51 (25) | 39 (19)<br />

Book contributions | 20 (13) | 8 (3)<br />

Books | 4 (1) | -<br />

Other (not peer-reviewed) publications | 31 (25) | 14 (12)<br />

Total | 510 (266) | 198 (90)<br />

Talks (outside Jena): more than 200<br />

Posters: more than 250<br />

Diploma Theses: 18 (15 finished)<br />

PhD Theses: 22 (7 finished)<br />

Last update: 15/02/2007 (in brackets: 21/02/2005)
