The Genom of Homo sapiens.pdf
Ontologies for Biologists: A Community Model for the Annotation of Genomic Data

M. ASHBURNER,* C.J. MUNGALL,†‡ AND S.E. LEWIS‡

*Department of Genetics, University of Cambridge & EMBL-EBI, Hinxton, Cambridge, United Kingdom; †Howard Hughes Medical Institute, and ‡University of California, Berkeley, California

Genomics has made biology into an information science.
David Botstein, June, 2003

We celebrate the 50th anniversary of the discovery of the structure of DNA in 1953. The nature of the genetic code became the theoretical preoccupation of the ensuing decade, and immediately the language of information science entered biology (see Kay 2000). The discoveries of the genetic code, both its structure (syntax) and content (semantics), and of mechanisms for error recovery during its transmission, were followed, in the next two decades, by the development of methods to readily determine DNA sequences, to clone specific DNA sequences in bacterial hosts, and to amplify sequences in vitro by the polymerase chain reaction. Within 40 years or so of Watson and Crick’s discovery, the first complete sequences of bacterial genomes were determined; today we celebrate the completion of the human genome. These advances brought a revolution that has affected even the most conservative fields of biology (see, e.g., Hebert et al.
2003).

None of these advances, except the very first, would have been possible without the application of methods from computer science, a field whose growth and maturity have closely paralleled that of molecular biology. Computational methods were introduced to biology in the early 1950s, for the calculation of Fourier summations in protein crystallography (Bennett and Kendrew 1952; Huxley 1990). However, it was the need to capture, store, assemble, and analyze DNA sequence data (Staden 1980) and to correlate them with other biological knowledge that motivated a new field of science: bioinformatics. The first attempts to collect protein and nucleic acid sequence data were published as slim printed books (Dayhoff et al. 1965; Croft 1973; Barrell and Clark 1974). The development of public computer files of sequence data soon followed, with the establishment of the PIR Database in 1980, the EMBL Data Library and GenBank in 1982, and the DDBJ in 1986.

Few can doubt that, without the international nucleic sequence data library as a common, freely available, depository of all public sequence data, neither the achievement of sequencing the human genome, nor its analysis, would have been possible. MEDLINE, Swiss-Prot, and PDB are equally core databases that also are very broad in content (“horizontal”) and play an absolutely central role enabling genomic research. In addition, there are an increasing number of “vertical” databases, narrow in content, but covering this content in great detail. These include the model organism databases and databases for a restricted class of objects; for example, transcription factors or eukaryotic promoters.
In the last decade or so, model organism databases have been developed as community projects for all of the biological organisms commonly used in research, and the number of specialist databases for biological objects has mushroomed. This is illustrated by the growth, from 24 in 1993 to 129 in 2003, in the number of databases that have contributed to the annual database issue of Nucleic Acids Research.

The proliferation of databases in this general field led many, in the early 1990s, to agonize about questions of database “interoperability” and to attempt to develop “federations” of databases (see, e.g., Fasman 1994) or database warehouses (e.g., the Integrated Genome Database [Ritter 1994]; see Davidson et al. 1995) or to dictate to database builders a common technical solution to database design. These attempts failed, with the only possible exception being the success of ACeDB in a number of communities (Durbin and Thierry-Mieg 1991). Yet the problems that the multiplicity of databases posed to both bench biologists and computational biologists have not gone away; indeed, they have worsened. They have worsened for two reasons: One is that the number of databases (and the amount of data) has simply increased; the other is that biologists have increasingly realized the importance of knowledge from a variety of organisms other than their own favorite model.
Mouse biologists found that they needed information about genes from Drosophila or yeast; human geneticists needed information about genes in Caenorhabditis elegans or zebrafish.

Cold Spring Harbor Symposia on Quantitative Biology, Volume LXVIII. © 2003 Cold Spring Harbor Laboratory Press 0-87969-709-1/04. 227

One strategy to overcome some of the problems resulting from the dispersion of biological knowledge is to exploit insights from related fields. This is the approach taken by the many different databases participating in the Gene Ontology project. This project began in mid-1998 as a collaboration of three model organism databases (those for Saccharomyces cerevisiae, Drosophila, and mouse) to solve a very well defined problem: “How can gene products be described in a biologi-