Grand Challenges in Computational Biology - Project web sites - Inria

Grand Challenges in Computational Biology
Kimmen Sjölander, UC Berkeley
CITRIS-INRIA workshop, 24 May 2011

• Reconstructing the Tree of Life
• Prediction of biological pathways and networks
• Human microbiome and metagenome dataset analysis
• Infectious disease: new drugs and diagnostics; pharmacogenomics
• Interpreting genetic variation

Supported in part by a grant from the DOE Systems Biology Knowledgebase
• Phylogenomic predictions of function and structure for microbial genomes and metagenomes.
• Simultaneous functional and taxonomic annotation of environmental sequences and human microbiome data.

This work is supported by: the US DOE Systems Biology Knowledgebase, the NSF Microbial Genome Sequencing Program, a Presidential Early Career Award for Scientists and Engineers from the NSF, and an R01 from the NHGRI (NIH).


The expanding genomics universe
• The situation now: huge quantities of noisy, error-ridden and poorly connected data
  – Experimental data are sparse: ~1% of sequences have experimental support for their assigned functions
  – Errors abound: up to 25% of sequences are mis-annotated [1, 2]
  – The one-time static annotation protocol does not allow annotations to be modified in the light of new evidence [3]
  – Expert knowledge is critical to detecting and correcting annotation errors
• But manual annotation is expensive and does not scale to the quantity of sequences being produced

1. Schnoes et al., "Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies," PLoS Computational Biology 2009
2. Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," Bioinformatics 2004
3. Salzberg, "Genome re-annotation: a wiki solution?" Genome Biology 2007


Increasing the specificity of function prediction requires the integration of heterogeneous data & bioinformatics methods
• Homology & orthology prediction
• Genome neighbors
• Expression data
• Localization information
• 3D structure
• Yeast-2-hybrid data
• Phylogenetic profiles
• Pull-down assays
• Site-directed mutagenesis
• Text-mining (co-occurrence in an abstract)
• Etc.

Eisenberg et al., "Protein function in the post-genomic era," Nature 2000
Sjölander, K., "Phylogenomic inference of protein molecular function: advances and challenges," Bioinformatics 2004
Matthews et al., "Identification of Potential Interaction Networks Using Sequence-Based Searches for Conserved Protein-Protein Interactions or 'Interologs'," Genome Research 2001
Troyanskaya et al., "A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae)," PNAS 2003
Myers et al., "Discovery of biological networks from diverse functional genomic data," Genome Biology 2005


Data is not the same thing as information


Biologists who need to use bioinformatics tools are divided by a huge gulf from the computer scientists who are creating these tools


Automatic protein function prediction using a hyper-dimensional network


Hyperdimensional information network for data integration, navigation & community annotation

Nodes: genes/proteins
Edges: different types of connection between genes (e.g., orthology, similar structure, interaction, disease association, regulated by, adjacent in metabolic network, genome neighbor, etc.). Edges have weights proportional to confidence.

Experimental data can enter at any point in the graph, and be propagated to neighboring nodes based on learned rules:
• Biological process for one gene can be made available to genome neighbors
• A protein-protein interaction between two genes in one species can be used to infer the corresponding interaction between their orthologs in another
• Roles in a pathway (e.g., EC number) known for one gene can be assigned to an ortholog
• Participation in a biological process can be inferred based on genome neighbors
• 3D structure information can be propagated to all homologs
• Protein structure information can be propagated to all homologs

Biologists can: subscribe to news feeds arriving at their selected nodes, upload data, attach links to their papers, manually curate biological functions.
Manual annotations from biologists will need to be weighted according to estimated confidence.
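The propagation idea on this slide can be illustrated with a minimal Python sketch. It is not the implementation behind the talk: the class name `AnnotationNetwork`, the hand-written `RULES` table (the slide speaks of learned rules), the gene names, the confidence threshold, and the choice to multiply annotation confidence by edge weight are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rules: which edge types each kind of annotation may travel along.
# The slide describes learned rules; these are written by hand for illustration.
RULES = {
    "ec_number": {"orthology"},
    "biological_process": {"genome_neighbor"},
    "structure": {"homology"},
}


class AnnotationNetwork:
    def __init__(self):
        # gene -> list of (neighbor, edge_type, confidence weight in [0, 1])
        self.edges = defaultdict(list)
        # gene -> {(kind, value): confidence}
        self.annotations = defaultdict(dict)

    def add_edge(self, a, b, edge_type, weight):
        """Add an undirected, typed edge weighted by confidence."""
        self.edges[a].append((b, edge_type, weight))
        self.edges[b].append((a, edge_type, weight))

    def annotate(self, gene, kind, value, confidence=1.0):
        """Attach an annotation, keeping the highest confidence seen so far."""
        key = (kind, value)
        previous = self.annotations[gene].get(key, 0.0)
        self.annotations[gene][key] = max(previous, confidence)

    def propagate(self, rules, min_confidence=0.5):
        """Run one pass: copy annotations to direct neighbors along permitted edge types."""
        updates = []
        for gene, annots in self.annotations.items():
            for (kind, value), conf in annots.items():
                for neighbor, edge_type, weight in self.edges.get(gene, []):
                    inferred = conf * weight  # assumed confidence decay per hop
                    if edge_type in rules.get(kind, ()) and inferred >= min_confidence:
                        updates.append((neighbor, kind, value, inferred))
        # Apply after the scan so the dictionaries are not modified while iterating.
        for neighbor, kind, value, inferred in updates:
            self.annotate(neighbor, kind, value, inferred)


net = AnnotationNetwork()
net.add_edge("geneA", "geneB", "orthology", 0.9)
net.add_edge("geneA", "geneC", "genome_neighbor", 0.6)
net.annotate("geneA", "ec_number", "EC 2.7.1.1", confidence=1.0)
net.annotate("geneA", "biological_process", "glycolysis", confidence=0.8)
net.propagate(RULES)
print(dict(net.annotations))
```

In this toy run the EC number reaches the ortholog `geneB`, while the biological process does not reach `geneC` because the inferred confidence falls below the threshold. The same confidence field could carry the weighting of manual annotations mentioned on the slide.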


Phylogenomic tools for investigating and interpreting (meta)genome datasets
(DOE Systems Biology Knowledgebase grant)

Challenges in metagenome data analysis:
• Most tools designed for these data answer only "What species are present?" and do not answer the question "What's going on?" (what processes & pathways are represented)
• Sequences are fragmentary and noisy, presenting additional challenges to bioinformatics methods
• Huge datasets (in the millions of reads)

Blaser, "Harnessing the power of the human microbiome," PNAS 2010
Committee on Metagenomics: Challenges and Functional Applications, National Research Council, "The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet," 2007


SNP prioritization and interpreting human genetic variation
Prediction of biological pathways and network alignment

SNPs occurring in coding regions of the genome can be prioritized for investigation based on:
• Predicted biological process or function of gene containing SNP
• Predicted interactions (hubs of networks) of gene containing SNP
• Impact of mutation at that site (INTREPID and Discern methods)
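One simple way to combine the three criteria above is a weighted score, sketched here in Python. Everything in it is an assumption for illustration: the `CodingSnp` fields, the weights, the hub threshold, and the SNP identifiers are hypothetical, and `site_impact` is a placeholder for a per-site score, not the actual output format of INTREPID or Discern.

```python
from dataclasses import dataclass


@dataclass
class CodingSnp:
    snp_id: str                # hypothetical identifier
    function_relevance: float  # relevance of the gene's predicted function, 0..1
    network_degree: int        # number of predicted interaction partners of the gene
    site_impact: float         # placeholder per-site impact score, 0..1


def priority(snp, weights=(0.4, 0.2, 0.4), hub_degree=20):
    """Weighted sum of the three criteria; weights and hub_degree are assumed values."""
    w_function, w_hub, w_site = weights
    # Genes with hub_degree or more partners count as full hubs.
    hub_score = min(snp.network_degree / hub_degree, 1.0)
    return (w_function * snp.function_relevance
            + w_hub * hub_score
            + w_site * snp.site_impact)


snps = [
    CodingSnp("snp_c", function_relevance=0.5, network_degree=10, site_impact=0.1),
    CodingSnp("snp_a", function_relevance=0.9, network_degree=30, site_impact=0.8),
    CodingSnp("snp_b", function_relevance=0.2, network_degree=2, site_impact=0.9),
]

# Highest-priority SNPs first.
ranked = sorted(snps, key=priority, reverse=True)
for snp in ranked:
    print(snp.snp_id, round(priority(snp), 2))
```

A linear score is only one option; the weights would need to be tuned or learned against SNPs of known effect before the ranking means anything.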


PhyloFacts Pathogen Commons
• Drug target identification & prioritization
• Development of accurate diagnostics

TB collaborations
• UC Berkeley Center for Emerging and Neglected Diseases (Tom Alber, Lee Riley, others)
• Royal Institute of Tropical Diseases, Amsterdam, Netherlands (Richard Anthony)
• Institute of Bioinformatics, Bangalore, India (Akhilesh Pande)
• IISc, Bangalore, India (Nagasuma Chandra)


Was there really life before the web?
How can we bring this to biology?
