
The Genom of Homo sapiens.pdf



Ontologies for Biologists: A Community Model for the Annotation of Genomic Data

M. ASHBURNER,* C.J. MUNGALL,†‡ AND S.E. LEWIS‡

*Department of Genetics, University of Cambridge & EMBL-EBI, Hinxton, Cambridge, United Kingdom; †Howard Hughes Medical Institute, and ‡University of California, Berkeley, California

Genomics has made biology into an information science.
David Botstein, June, 2003

We celebrate the 50th anniversary of the discovery of the structure of DNA in 1953. The nature of the genetic code became the theoretical preoccupation of the ensuing decade, and immediately the language of information science entered biology (see Kay 2000). The discoveries of the genetic code, both its structure (syntax) and content (semantics), and of mechanisms for error recovery during its transmission, were followed, in the next two decades, by the development of methods to readily determine DNA sequences, to clone specific DNA sequences in bacterial hosts, and to amplify sequences in vitro by the polymerase chain reaction. Within 40 years or so of Watson and Crick’s discovery, the first complete sequences of bacterial genomes were determined; today we celebrate the completion of the human genome. These advances brought a revolution that has affected even the most conservative fields of biology (see, e.g., Hebert et al. 2003).

None of these advances, except the very first, would have been possible without the application of methods from computer science, a field whose growth and maturity have closely paralleled that of molecular biology. Computational methods were introduced to biology in the early 1950s, for the calculation of Fourier summations in protein crystallography (Bennett and Kendrew 1952; Huxley 1990). However, it was the need to capture, store, assemble, and analyze DNA sequence data (Staden 1980) and to correlate them with other biological knowledge that motivated a new field of science: bioinformatics.

The first attempts to collect protein and nucleic acid sequence data were published as slim printed books (Dayhoff et al. 1965; Croft 1973; Barrell and Clark 1974). The development of public computer files of sequence data soon followed, with the establishment of the PIR Database in 1980, the EMBL Data Library and GenBank in 1982, and the DDBJ in 1986.

Few can doubt that, without the international nucleic sequence data library as a common, freely available, depository of all public sequence data, neither the achievement of sequencing the human genome, nor its analysis, would have been possible. MEDLINE, Swiss-Prot, and PDB are equally core databases that also are very broad in content (“horizontal”) and play an absolutely central role enabling genomic research. In addition, there are an increasing number of “vertical” databases, narrow in content, but covering this content in great detail. These include the model organism databases and databases for a restricted class of objects; for example, transcription factors or eukaryotic promoters.
In the last decade or so, model organism databases have been developed as community projects for all of the biological organisms commonly used in research, and the number of specialist databases for biological objects has mushroomed. This is illustrated by the growth, from 24 in 1993 to 129 in 2003, in the number of databases that have contributed to the annual database issue of Nucleic Acids Research.

The proliferation of databases in this general field led many, in the early 1990s, to agonize about questions of database “interoperability” and to attempt to develop “federations” of databases (see, e.g., Fasman 1994) or database warehouses (e.g., the Integrated Genome Database [Ritter 1994]; see Davidson et al. 1995) or to dictate to database builders a common technical solution to database design. These attempts failed, with the only possible exception being the success of ACeDB in a number of communities (Durbin and Thierry-Mieg 1991). Yet the problems that the multiplicity of databases posed to both bench biologists and computational biologists have not gone away; indeed, they have worsened. They have worsened for two reasons: One is that the number of databases (and the amount of data) has simply increased, the other is that biologists have increasingly realized the importance of knowledge from a variety of organisms other than their own favorite model.
Mouse biologists found that they needed information about genes from Drosophila or yeast; human geneticists needed information about genes in Caenorhabditis elegans or zebrafish.

One strategy to overcome some of the problems resulting from the dispersion of biological knowledge is to exploit insights from related fields. This is the approach taken by the many different databases participating in the Gene Ontology project. This project began in mid-1998 as a collaboration of three model organism databases (those for Saccharomyces cerevisiae, Drosophila, and mouse) to solve a very well defined problem: “How can gene products be described in a biologi-

Cold Spring Harbor Symposia on Quantitative Biology, Volume LXVIII. © 2003 Cold Spring Harbor Laboratory Press 0-87969-709-1/04.
