What Bioinformatics - Analyse und Management komplexer Systeme
<strong>What</strong> <strong>Bioinformatics</strong> Is All About<br />
Jürgen Sühnel<br />
jsuehnel@fli-leibniz.de<br />
Complex Ideas II,<br />
June 20, 2008, Altes Schloss Dornburg
Fritz Lipmann<br />
The Journal of Biological Chemistry<br />
186, 235, 1950
Complex Biological Systems<br />
Any function performed by a system that is not the result of a single part in the system,<br />
but rather is the result of interacting parts in the system, is an emergent property.<br />
Systems that have emergent properties are said to be irreducible.<br />
A system is said to be complex if its emergent properties are unpredictable ??.<br />
Living 'systems' are complex systems.<br />
The recognition that complex systems, especially life, are truly understood from<br />
knowledge of the interactions of their component parts is fundamental to<br />
Systems Biology.<br />
However, the identification and analysis of a system's parts is an important aspect of<br />
research into complex systems.
Molecular Biology
Molecules of Life<br />
(CARBOHYDRATES)
One- and Three-Dimensional (3D) Structures of Biopolymers<br />
sequence, primary structure<br />
(3D) structure
Pathways, Networks, Systems<br />
Eisenberg et al., Nature 2000, 405, 823-826.
Technological Development: Microarray Techniques, …<br />
Automation<br />
Parallelization<br />
Miniaturization
New Scientific Goals<br />
• Sequencing of complete genomes<br />
• Determination of interaction patterns of all proteins in a<br />
cell<br />
• Determination of all 3D-structures of a genome<br />
(Structural Genomics)<br />
• Determination of the mRNA or protein pattern of all genes<br />
• High-throughput technologies (drug screening, …)<br />
•...
Genome Sizes<br />
Name | Base pairs | Genes | Note<br />
Phi-X 174 | 5,386 | 10 | E. coli virus<br />
Mycoplasma genitalium | 580,073 | 483 |<br />
E. coli | 4,639,221 | 4,337 | bacterium<br />
Arabidopsis thaliana (thale cress) | 115,409,949 | 25,498 | small plant genome<br />
Human | 3.3 × 10^9 | ~25,000 |<br />
Rice (Nipponbare) | ~430,000,000 | ~50,000 |
Down syndrome,<br />
trisomy 21<br />
FLI
Genomic Sequence<br />
Human chromosome 14,<br />
Long arm (FastA format)<br />
Chromosome 14 is characterized by a heterochromatic<br />
short arm that contains essentially ribosomal RNA genes,<br />
and a euchromatic long arm in which most, if not all, of the<br />
protein-coding genes are located. The finished sequence<br />
of human chromosome 14 comprises 87,410,661 base<br />
pairs, representing 100% of its euchromatic<br />
portion, in a single continuous segment covering the<br />
entire long arm with no gaps. Two loci of crucial<br />
importance for the immune system, as well<br />
as more than 60 disease genes, have been<br />
localized so far on chromosome 14.<br />
We identified 1,050 genes and gene fragments,<br />
and 393 pseudogenes.
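The FastA format mentioned above stores each sequence as a header line starting with `>` followed by one or more lines of sequence. The following is a minimal sketch of a reader for that layout, assuming plain text input; the function names `parse_fasta` and `gc_content` and the toy record are hypothetical and not part of the original slides.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from record identifier to sequence (upper case)."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first word after '>' as the identifier.
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            # Sequence line: may be wrapped over many lines, so collect parts.
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases A, C, G, T."""
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


# Hypothetical toy record, NOT real chromosome 14 sequence.
example = """>toy_record illustrative only
ACGTNNAC
ggcc
""".splitlines()

sequences = parse_fasta(example)
print(sequences)
print(gc_content(sequences["toy_record"]))
```

For a chromosome-sized file one would read the file object line by line rather than holding a string in memory; the parsing logic stays the same.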
It's not just the genes: epigenetics.
Data Explosion
The First Complete Genomes (of free-living organisms)
Data Explosion
PDB Content Growth<br />
Year | New structures per year | Per day<br />
1993 | 698 | ~2<br />
2003 | 4181 | ~11<br />
2004 | 5212 | ~14<br />
2005 | 5402 | ~15<br />
2006 | 6541 | ~18<br />
(no theoretical structures)<br />
Protein Data Bank –<br />
3D Structure Database of Biological Macromolecules<br />
More Than 1000 Biological Databases<br />
Sequences<br />
Genome Browsers<br />
Jena Prokaryotic<br />
Genome Viewer<br />
3D Structures<br />
Protein Domain<br />
And Motif Classification<br />
Interactions and Networks<br />
Disease Information<br />
Other<br />
Tandem Splice Site<br />
Database
<strong>What</strong> is <strong>Bioinformatics</strong> ?<br />
<strong>Bioinformatics</strong><br />
is a new discipline that covers all aspects of<br />
Acquisition<br />
Storage<br />
Processing<br />
Analysis<br />
Interpretation<br />
of biological data.
<strong>What</strong> is <strong>Bioinformatics</strong> ?<br />
<strong>Bioinformatics</strong> ?<br />
Computational Biology ?<br />
Theoretical Biophysics / Theoretical Biochemistry ?<br />
Mathematical Biology ?<br />
Systems Biology ?<br />
Integrative Biology ?<br />
Theoretical Biology ?<br />
The developments described may lead to a paradigm change in biology.<br />
Reductionism vs. Holism ?
The Role of Theory in Biology: Little Impact Thus Far<br />
Sydney Brenner<br />
Princeton University 2003:<br />
The Watson Lecture: Biology in the Era of Complete Genomes<br />
ETH Zürich 2006:<br />
Pauli-Vorlesung: Theoretical Biology in the Next Decade<br />
The close interplay between theory, modeling, and experiment has dominated<br />
many other branches of science, particularly physics and astrophysics,<br />
but it had little impact on biology until now.<br />
I used to refer to the Journal of Theoretical Biology as the cure to insomnia.<br />
No longer will I be able to say that. The genome and its vast accumulation of data<br />
have the potential to change all of that by opening the doors for scientists with<br />
more analytical and theoretical bents.
The Role of Theory in Biology: An Analogy<br />
Johannes Kepler used Tycho's detailed astronomical information to develop his theories of<br />
astronomy (Kepler Laws).<br />
In a certain sense bioinformatics and theoretical biology are currently in a situation that still<br />
awaits the discovery of the Kepler Laws.
How <strong>Bioinformatics</strong> Contributes to Genome Analysis<br />
Assembly of overlapping genome fragments<br />
D. W. Mount: <strong>Bioinformatics</strong>,<br />
Cold Spring Harbor Laboratory Press, 2001.<br />
Gene identification<br />
Gene annotation<br />
function 1 function 2<br />
function 3<br />
Data integration from<br />
collaborative projects<br />
Database development<br />
Comparative genomics<br />
?<br />
mutation ?
Sequence Comparison<br />
Similar (and therefore presumably homologous) sequences usually have a similar structure and function.<br />
Typical task:<br />
• A relatively small newly sequenced genome has 1000 protein coding<br />
genes<br />
• Scan a database with 6 million entries for sequence similarity for<br />
each of the 1000 protein sequences<br />
(pairwise alignment allowing for gaps)<br />
For the gapped alignment of two sequences of length 100 there are<br />
more than 10^75 possible arrangements.<br />
Alignment by<br />
• Visual inspection (dot plots)<br />
• Dynamic programming (optimal alignment)<br />
• Heuristic methods (BLAST, FASTA)
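The numbers on this slide can be made concrete with a short sketch. It assumes that the "possible arrangements" are counted as lattice paths with match, insertion and deletion steps (the Delannoy numbers), which is one common way of counting gapped alignments and may not be the exact count the slide had in mind. The scoring values in `global_alignment_score` (match +1, mismatch -1, linear gap -2) are illustrative assumptions, and the function is a plain Needleman-Wunsch style dynamic program, not the BLAST or FASTA heuristic.

```python
def delannoy(m: int, n: int) -> int:
    """Count paths from (0, 0) to (m, n) using steps right, down and diagonal.

    Each path corresponds to one gapped alignment of sequences of
    lengths m and n under this counting convention.
    """
    row = [1] * (n + 1)
    for _ in range(m):
        new_row = [1]
        for j in range(1, n + 1):
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[n]


def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score by dynamic programming.

    Only two rows of the score matrix are kept, so memory is O(len(b))
    while time is O(len(a) * len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        curr = [i * gap]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Size of the search task described on the slide.
pairwise_comparisons = 1000 * 6_000_000

print(f"alignments of two length-100 sequences: {delannoy(100, 100):.2e}")
print(f"database comparisons: {pairwise_comparisons:.1e}")
print(global_alignment_score("ACGT", "AGT"))
```

Dynamic programming finds the optimal alignment in about 100 × 100 steps for two sequences of length 100 instead of enumerating all arrangements. With several billion comparisons per genome, faster heuristic methods are still attractive.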
Dot Plot<br />
D. W. Mount: <strong>Bioinformatics</strong>, Cold Spring Harbor Laboratory Press, 2001.
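A dot plot can be sketched in a few lines: one sequence runs along the rows, the other along the columns, and a mark is placed wherever the residues (or short windows of residues) agree. The function name `dot_plot` and the `window` parameter are hypothetical; practical tools usually add a mismatch tolerance within the window, which is omitted here.

```python
from typing import List


def dot_plot(a: str, b: str, window: int = 1) -> List[str]:
    """Return the rows of a dot matrix.

    Row i, column j holds '*' when the window of length `window`
    starting at a[i] is identical to the one starting at b[j].
    """
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows


# Similar regions show up as diagonal runs of '*'.
for line in dot_plot("GATTACA", "GATCACA", window=2):
    print(line)
```

A larger window suppresses the random single-residue matches that otherwise clutter the plot, at the cost of hiding short or imperfect similarities.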
Local BLAST Alignment<br />
DNA repair endonuclease XPF
Romualdi A et al.; GenColors: annotation and comparative genomics of prokaryotes made easy. Methods Mol Biol. 2007;395:75-96.
Single Nucleotide Polymorphisms (SNPs)<br />
Genetic basis for differences between individuals:<br />
SNP density in the complete genome: 1 per 1.91 kb<br />
SNP density within genes: 1 per 1.08 kb<br />
Genes cover only 5% of the complete genome.<br />
So, most of the SNPs occur outside of genes.<br />
SNPs within genes or in the surrounding region<br />
may predispose individuals to a certain disease.<br />
SNP analyses can possibly also explain<br />
the different responses of individuals to drugs.
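The statement that most SNPs lie outside of genes follows from the three figures on this slide. The short calculation below is a sketch that takes those figures at face value; the function name `genic_snp_fraction` is hypothetical.

```python
# Figures taken from the slide.
SNP_SPACING_GENOME_KB = 1.91   # one SNP per 1.91 kb, genome-wide
SNP_SPACING_GENES_KB = 1.08    # one SNP per 1.08 kb, within genes
GENE_FRACTION = 0.05           # genes cover 5% of the genome


def genic_snp_fraction(gene_fraction: float,
                       spacing_genes_kb: float,
                       spacing_genome_kb: float) -> float:
    """Fraction of all SNPs that fall inside genes.

    SNPs in genes  = gene_fraction * genome_length / spacing_genes
    SNPs in genome = genome_length / spacing_genome
    The genome length cancels in the ratio.
    """
    return gene_fraction * spacing_genome_kb / spacing_genes_kb


inside = genic_snp_fraction(GENE_FRACTION, SNP_SPACING_GENES_KB, SNP_SPACING_GENOME_KB)
outside = 1.0 - inside
print(f"SNPs inside genes:  {inside:.1%}")
print(f"SNPs outside genes: {outside:.1%}")
```

Although SNPs are denser within genes, genes cover so little of the genome that only roughly 9% of all SNPs fall inside them.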
The International HapMap Project is a partnership of scientists and funding agencies<br />
to develop a public resource that will help researchers to find genes associated with<br />
human disease and response to pharmaceuticals.<br />
Sites in the genome where the DNA sequences of many individuals differ by a single base are<br />
called single nucleotide polymorphisms (SNPs). For example, some people may have a<br />
chromosome with an A at a particular site where others have a chromosome with a G.<br />
Each form is called an allele.<br />
Each person has two copies of all chromosomes except the sex chromosomes.<br />
The set of alleles that a person has is called a genotype. For this SNP a person could<br />
have the genotype AA, AG, or GG. The term genotype can refer to the SNP alleles that a<br />
person has at a particular SNP, or for many SNPs across the genome. A method that<br />
discovers what genotype a person has is called genotyping.<br />
About 10 million SNPs exist in human populations, where the rarer SNP allele has a frequency<br />
of at least 1%. Alleles of SNPs that are close together tend to be inherited together.<br />
A set of associated SNP alleles in a region of a chromosome is called a "haplotype".<br />
Most chromosome regions have only a few common haplotypes (each with a frequency of<br />
at least 5%), which account for most of the variation from person to person in a population.<br />
A chromosome region may contain many SNPs, but only a few "tag" SNPs can provide most of<br />
the information on the pattern of genetic variation in the region.
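The terms allele, genotype and the 1% frequency threshold can be illustrated with a small sketch. The sample of genotypes below is invented for illustration and is not HapMap data; the function names are hypothetical as well.

```python
from collections import Counter
from typing import Dict, List


def allele_frequencies(genotypes: List[str]) -> Dict[str, float]:
    """Allele frequencies from two-letter genotypes such as 'AA', 'AG', 'GG'."""
    counts = Counter()
    for genotype in genotypes:
        # Each person carries two alleles at the site, one per chromosome copy.
        counts.update(genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_common_snp(genotypes: List[str], threshold: float = 0.01) -> bool:
    """True if at least two alleles occur and the rarer one reaches the threshold."""
    freqs = allele_frequencies(genotypes)
    return len(freqs) >= 2 and min(freqs.values()) >= threshold


# Hypothetical sample of 100 people genotyped at one SNP.
sample = ["AA"] * 60 + ["AG"] * 30 + ["GG"] * 10

frequencies = allele_frequencies(sample)
print(frequencies)
print(is_common_snp(sample))
```

In this invented sample the G allele has a frequency of 25%, so the site would count as a common SNP under the 1% criterion quoted above.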
Genetic Diseases<br />
monogenic<br />
polygenic<br />
Mucoviscidosis<br />
(cystic fibrosis)<br />
CFTR gene (chromosome 7)<br />
Discovered in 1989<br />
Cancer<br />
Oncogenes<br />
Tumor suppressor genes
Common Genetic Disorders
Drug research is<br />
the search for a needle in a haystack.<br />
www.kubinyi.de
Costs in Drug Research<br />
Cost for discovering and developing a new drug:<br />
several € 100 million up to € 1000 million (average € 802 M)<br />
Time to market:<br />
10 – 15 years
Drug Development<br />
Virtual Ligand Screening
Structure-based<br />
Design: Virtual Screening<br />
Virtual Screening:<br />
Select subsets of compounds for assay that are more likely to contain<br />
active hits than a sample chosen at random<br />
Time Scales:<br />
Docking of 1 compound: 30 s (SGI R10000 processor)<br />
Docking of the 1.1 million data set: 6 days (64-processor SGI ORIGIN)<br />
ACD-SC: Database from Molecular Design Ltd.<br />
Agonists: Known active compounds<br />
Docking of ligands to the estrogen receptor<br />
(nuclear hormone receptor)
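The two time scales quoted on this slide are consistent with each other, as a short back-of-the-envelope check shows. The sketch assumes perfect parallel scaling and that each of the 64 processors needs the same 30 s per compound as the single processor; both are simplifying assumptions, not statements from the slide.

```python
SECONDS_PER_COMPOUND = 30      # docking time for one compound (from the slide)
N_COMPOUNDS = 1_100_000        # size of the data set (from the slide)
N_PROCESSORS = 64              # processors in the SGI ORIGIN (from the slide)
SECONDS_PER_DAY = 86_400


def wall_clock_days(n_compounds: int, seconds_each: float, n_processors: int) -> float:
    """Elapsed days when the docking runs are spread evenly over all processors."""
    cpu_seconds = n_compounds * seconds_each
    return cpu_seconds / n_processors / SECONDS_PER_DAY


single_cpu_days = wall_clock_days(N_COMPOUNDS, SECONDS_PER_COMPOUND, 1)
parallel_days = wall_clock_days(N_COMPOUNDS, SECONDS_PER_COMPOUND, N_PROCESSORS)
print(f"one processor:  {single_cpu_days:.0f} days")
print(f"64 processors:  {parallel_days:.1f} days")
```

On a single processor the screen would take more than a year, which is why docking runs of this size are distributed over many processors.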
Computer Power for <strong>Bioinformatics</strong>/Computational Biology<br />
Blue Gene is an IBM Research project dedicated to exploring the<br />
frontiers in supercomputing:<br />
in computer architecture,<br />
in the software required to program and control massively parallel systems,<br />
and in the use of computation to advance our understanding of important biological processes<br />
such as protein folding.<br />
The Blue Gene/L machine was designed and built in collaboration with the Department of<br />
Energy's NNSA/Lawrence Livermore National Laboratory in California, and the LLNL system<br />
has a peak speed of 596 Teraflops. Blue Gene systems occupy the #1 (LLNL Blue Gene/L)<br />
and a total of 4 of the top 10 positions in the TOP500 supercomputer list announced in November 2007.
Protein Folding by Molecular Dynamics<br />
Fastest-folding protein yet discovered<br />
(submicrosecond folding)<br />
PDB ID: 2f4k
Text Mining: Duplication of Scientific Articles<br />
http://discovery.swmed.edu/dejavu/
Methods in <strong>Bioinformatics</strong> and Computational Biology<br />
•Pairwise and Multiple Sequence Alignment<br />
•Assembly of Genome Fragments<br />
•Gene Prediction and Annotation<br />
•Genome Analysis and Functional Genomics (Proteomics)<br />
•Phylogenetic Analysis<br />
•Analysis, Classification and Prediction of Nucleic Acid and Protein Structures<br />
•Drug Design<br />
•Analysis of Data Obtained by Microarray Technologies<br />
•Data/Text Mining (Databases / Web)<br />
•Network Analysis (Systems Biology)<br />
Database Technologies | Pattern Recognition | Visualization Techniques | Statistics |<br />
Quantum Chemistry | Molecular Dynamics | Artificial Intelligence Methods | ...<br />
HTML, XML (specialized markup languages), Perl, Python, CGI, C, Java, ...
Ethical Issues<br />
Genes are the organism's blueprint.<br />
Therefore, any information, analysis or manipulation<br />
is of particular importance for each individual.<br />
Biobanks<br />
Genetic fingerprints<br />
Preimplantation diagnosis<br />
…
Genetic fingerprinting, DNA typing<br />
Alec Jeffreys, discoverer of genetic fingerprinting<br />
Identification of genome regions specific to an individual,<br />
currently in introns (STR: short tandem repeats)<br />
• Attribution of individuals to crime scenes<br />
• Identification of blood relationships<br />
(paternity test)<br />
• Identification of accident victims
Biobanks
<strong>What</strong> are the Most Important Current Contributions of <strong>Bioinformatics</strong> to Biology ?<br />
Functional Prediction by Sequence Comparison ?<br />
BLAST<br />
(October 18, 2005)<br />
(June 19, 2008)
<strong>What</strong> are the Most Important Current Contributions of <strong>Bioinformatics</strong> to Biology ?<br />
Sequences<br />
Genome Browsers<br />
Jena Prokaryotic<br />
Genome Viewer<br />
3D Structures<br />
Protein Domain<br />
And Motif Classification<br />
Interactions and Networks<br />
Disease Information<br />
Other<br />
Databases
<strong>What</strong> are the Most Important Current Contributions of <strong>Bioinformatics</strong> to Biology ?<br />
New Conceptual Ideas<br />
Eisenberg et al., Nature 2000, 405, 823-826.
Outlook<br />
Improved automated data validation procedures<br />
Better interoperability of databases, data integration<br />
Analysis of intergenic regions (promoter sites, microRNAs)<br />
Comparative genomics<br />
Epigenetics<br />
Analysis of more complex systems – Systems Biology<br />
Integration of different scientific disciplines towards<br />
a real Computational and Theoretical Biology
Outlook<br />
A warning:<br />
Do not be too fast. Otherwise you will have only a loose connection to reality.
Aim<br />
Information<br />
Knowledge<br />
Close connection between experimental and computational/theoretical approaches
www.fli-leibniz.de/jcb/<br />
Coordinator / Spokesman: J. Sühnel (2001-2007) | S. Schuster (2008 - …)<br />
Manager: K. Wagner (2001-2007) | L. Blei (2008 - …)
JCB Members (24 research labs + 3 companies)<br />
Bacterial Genetics (S. Brantl, FSU, Biology and Pharmacy)<br />
Biocomputing (J. Sühnel, FLI)<br />
<strong>Bioinformatics</strong> (R. Backofen, ALU)<br />
<strong>Bioinformatics</strong> (S. Böcker, FSU, Mathematics and Computer Science)<br />
<strong>Bioinformatics</strong> (S. Schuster, FSU, Biology and Pharmacy)<br />
<strong>Bioinformatics</strong> and Population Genetics (K. Schmid, MPICE)<br />
<strong>Bioinformatics</strong> / Pattern Recognition (U. Möller, HKI)<br />
Biophysics / <strong>Bioinformatics</strong> (A. H. Gitter, FHJ)<br />
Bio-Systems Analysis (P. Dittrich, FSU, Mathematics and Computer Science)<br />
Computational Neuroscience (H. Witte, FSU, Medicine)<br />
Entomology (D. Heckel, MPICE)<br />
Experimental Rheumatology (R. Kinne, FSU, Medicine)<br />
General Botany (M. Mittag, Biology and Pharmacy)<br />
Genetics (G. Theissen, FSU, Biology and Pharmacy)<br />
Genome Analysis (M. Platzer, FLI)<br />
Language and Information Engineering Lab at Jena University –<br />
JULIE Lab (U. Hahn, FSU, Philosophy)<br />
Medical Engineering and Biotechnology (A. Voss, FHJ)<br />
Molecular and Applied Microbiology / Systems Biology (R. Guthke, HKI)<br />
Biomolecular NMR Spectroscopy (M. Görlach, FLI)<br />
Molecular Cell Biology (R. Wetzker, FSU; Medicine)<br />
Practical Computer Science I (C. Beckstein, FSU, Mathematics and Computer Science)<br />
Practical Computer Science II (E.G. Schukat-Talamazzini, FSU, Mathematics and Computer Science)<br />
Protein Crystallography (M. Than, FLI)<br />
Theoretical Computer Science (R. Niedermeier, FSU, Mathematics and Computer Science)<br />
<br />
<br />
<br />
BioControl Jena GmbH<br />
Clondiag Chip Technologies GmbH<br />
SIRS-Lab
Publications, Talks, Posters, Diploma/PhD Theses<br />
JCB Publications (experimental and theoretical)<br />
Category | JCB groups/companies | BMBF/JCB funded authors<br />
Peer-reviewed journal articles | 404 (203) | 137 (56)<br />
Contributions to conference proceedings (most of them also peer-reviewed) | 51 (25) | 39 (19)<br />
Book contributions | 20 (13) | 8 (3)<br />
Books | 4 (1) | -<br />
Other (not peer-reviewed) publications | 31 (25) | 14 (12)<br />
Total | 510 (266) | 198 (90)<br />
Talks (outside Jena): more than 200<br />
Posters: more than 250<br />
Diploma Theses: 18 (15 finished)<br />
PhD Theses: 22 (7 finished)<br />
Last update: 15/02/2007 (in brackets: 21/02/2005)