The Genom of Homo sapiens.pdf


Ensembl: A Genome Infrastructure

E. BIRNEY AND THE ENSEMBL TEAM

EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD; and Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom

The genome sequence of any organism is an invaluable resource for molecular biologists. Experiments are either trivial to design or greatly enhanced by the knowledge of the genome sequence; linkage analysis leads directly to a set of genes in the critical region, and association studies can be designed to any region (including, in theory, the entire genome). At least as important as the ease of experimental design is the fact that the genome is essentially complete, and therefore, researchers can be confident that the aspects of biology they are studying must be present in the genome sequence in some manner. The fact that all biology is somehow associated with some aspect of the genome, that the genome is complete and essentially unchanging, means that it is a foundational resource for biology. The generation of the human (Lander et al. 2001) and mouse (Waterston et al. 2002) genome sequences provided landmarks for the understanding of human biology.

There is a catch. The genome sequence of large organisms is itself a large, unwieldy data set and is opaque to analysis: For all of the arguments of completeness of the genome, if we can’t ascribe at least some function to parts of the sequence, it becomes effectively unusable. In addition to this scientific problem of knowing which sequences are functional or not, there is the somewhat mundane problem of simply handling the data size.
This engineering problem is compounded due to the number of genomes now sequenced and the churn rate of genome and cDNA sequence.

Ensembl is designed to overcome these problems and therefore to make genomes far more useful to a broad range of audiences (Clamp et al. 2003). Ensembl focuses on providing information to three classes of users: (1) Bench biologists, who generally are focused on one or two genes and want a user-friendly, graphical web-based system to access the genome. The Ensembl web site, www.ensembl.org, is focused on this user. (2) Mid-scale functional genomics users, who are working with sets of genes, either due to positional cloning or expression analysis. These users often need their own “slice and dice” data-extraction routines. The EnsMart data-mining system (described below) is focused on this user. (3) Large-scale genomics groups and other bioinformatics groups. We have found that by simply being open in terms of both software and data, we are able to satisfy most of this group’s needs.

Ensembl is not the only group analyzing and displaying these genomes. The UCSC group under David Haussler and the NCBI group are both very active in this area. We enjoy healthy competition with these groups and collaborate on the underlying data resources, ensuring, for example, that the assembly is common between all three sites.

RESULTS

Table 1 outlines the genomes which Ensembl displays. We make a distinction between genomes where we predict genes and genomes where the gene structures are provided by another group. Notice that for all genomes there have been multiple gene builds, in each case marshaling a set of data resources (e.g., cDNA and EST data sets) and tuning the gene prediction process for each genome.
A gene build takes information from three sources. cDNAs generated from the target genome are reconciled back onto the genome sequence by a specific “best in genome” process. Then, remaining pieces of the genome which show strong protein similarity to genes in other organisms are used to generate “novel” genes via the program genewise. Finally, EST sequences are mapped back to the genome, clustered, and then a minimum set of transcripts which represent the clustered ESTs are generated. Depending on the species, the EST data are sometimes merged with the main cDNA and protein build (e.g., in Anopheles gambiae) and sometimes are kept separate (e.g., in Homo sapiens). This is due to the difference in EST quality in different genomes, with the large, error-prone human EST set proving the hardest to use.

The Ensembl web site (www.ensembl.org) is designed for biological researchers to quickly orient themselves on the genome and design experiments on the basis of the genome information. Our two main displays are focused on genomic sequence and gene products. Increasingly, we have discovered that more and more people want to work with subsets of gene products encoded by a particular genome. This “set” working behavior is catered to by the EnsMart data-mining tool available both through the Web and as a downloadable command-line tool.

Assessing the quality of our gene prediction is hard because we use all available data at any point in our build process, and new experimental cDNA approaches tend to focus on currently undiscovered cases. Assessments via overlap to dual-genome predictors, which use only the human and mouse genome as input (and therefore should be unbiased toward other cDNA or EST evidence), suggest that there are around another 1,000 protein-coding genes outside of Ensembl to predict (Guigó et al. 2003).

Cold Spring Harbor Symposia on Quantitative Biology, Volume LXVIII. © 2003 Cold Spring Harbor Laboratory Press 0-87969-709-1/04. 213
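
The overlap-based assessment mentioned above can be illustrated with a small sketch. This is not the procedure used by Ensembl or by Guigó et al.; it only shows one plausible way to count predictions from an independent gene finder that overlap no gene in a reference set. The `Interval` type, the function names, and the toy coordinates are all hypothetical, and the sketch assumes that strand is ignored and that sharing a single base counts as an overlap.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical gene locus: chromosome plus 1-based inclusive coordinates."""
    chrom: str
    start: int
    end: int


def merge_by_chrom(intervals):
    """Group intervals by chromosome and merge the overlapping ones."""
    grouped = defaultdict(list)
    for iv in intervals:
        grouped[iv.chrom].append((iv.start, iv.end))
    merged = {}
    for chrom, spans in grouped.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1]:  # overlaps the current merged span
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged


def unmatched_predictions(reference, predictions):
    """Return the predictions that overlap no reference interval."""
    merged = merge_by_chrom(reference)
    starts = {c: [s for s, _ in spans] for c, spans in merged.items()}
    missing = []
    for p in predictions:
        spans = merged.get(p.chrom, [])
        # Last merged span that starts at or before the prediction's end.
        i = bisect_right(starts.get(p.chrom, []), p.end) - 1
        if i < 0 or spans[i][1] < p.start:
            missing.append(p)
    return missing


if __name__ == "__main__":
    # Toy coordinates, invented for illustration only.
    reference = [Interval("1", 100, 500), Interval("1", 400, 900),
                 Interval("2", 50, 80)]
    predicted = [Interval("1", 850, 1200), Interval("1", 1500, 1800),
                 Interval("2", 81, 120), Interval("3", 10, 20)]
    for p in unmatched_predictions(reference, predicted):
        print(p)
```

Merging the reference intervals first keeps each lookup to a single binary search per prediction. A real comparison would likely also consider strand, exon-level overlap, and a minimum overlap length, none of which is modeled here.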
