13.07.2015 Views

The Genom of Homo sapiens.pdf




214 BIRNEY ET AL.

Table 1. General Information on Species Activities

Species                   Genome size  Gene number  Exons/gene  Builds to date  Assembly authority  % Web hits

In-house genebuilds
Homo sapiens              3.23 Gb      24037        8.7         11              NCBI                57
Mus musculus              2.50 Gb      24948        8.7         4               NCBI                20
Rattus norvegicus         2.56 Gb      23751        7.9         3               RGSC a              9
Danio rerio               1.57 Gb      20062        7.9         5               Sanger Institute    7
Anopheles gambiae         278 Mb       14707        4.0         4               IAGP b              1
Caenorhabditis briggsae   106 Mb       11884        7.2         1               Sanger Institute    1

Genebuilds performed outside Ensembl
Fugu rubripes             390 Mb       35180        4.7         2               IFGC c              1
Drosophila melanogaster   128 Mb       13525        4.6         N/A             FlyBase             1
Caenorhabditis elegans    103 Mb       19988        6.2         N/A             WormBase            3

a RGSC = Rat Genome Sequencing Consortium: Baylor College of Medicine, Celera Genomics, Genome Therapeutics, The Institute for Genome Research, The University of British Columbia.
b IAGP = International Anopheles Genome Project: EBI/Sanger Institute, Celera Genomics, Genoscope, University of Notre Dame, EMBL, Institut Pasteur, IMBB, TIGR.
c IFGC = International Fugu Genome Consortium: Institute of Molecular and Cell Biology (Singapore), Joint Genome Institute, Human Genome Mapping Project Resource Centre, Institute for Systems Biology.

Other estimates have run closer to 4,000 new genes.

There has been a healthy debate on the final number of genes in the human genome, which I was foolish enough to open a book on in 2000, taking bets for the number of genes in the human genome. The rules of the bet (written in 2000) were that we would declare a winner in 2003. Sadly, our estimates of the gene number do not have a close enough margin of error to come to any firm number; however, I was saved by the fact that the distribution of bets is centered around 50,000 (median number in the betting pool was 52,689). This large overbetting has meant that, although we are not certain of the precise number, there were only three individuals who placed even close to estimated bets below 30,000 protein-coding genes. We duly split the betting pool between these three participants without having to fix on a particular number.

Ensembl also calculates orthology relationships between the protein-coding gene sets from the different genomes. Each genome pair has a number of unmatched genes that have no obvious orthologs in the other species (complex lineage-specific duplications are allowed). We call these unmatched genes "orphans." Table 2 shows the number of orphans found between different comparisons, and the total number of orphans across all comparisons in each species. The presence of orphans is a combination of fast-evolving genes that have lost the clear protein similarity signatures to assign orthology solely on protein sequence; the fact that none of the genomes is finished; errors in the gene prediction process on genomes, in particular, the presence of pseudogenes; and finally, erroneously submitted cDNA information (for example, genomic contamination that has a significant open reading frame).
By sampling random sets of orphan sequences, we believe that nearly all orphans are due to either misappropriate pseudogene classification (classifying a pseudogene as a real gene) or cloning artifacts from library sequencing (for example, cloning projects that used differential display to clone cancer-specific genes). These

Table 2. Comparative Analysis of Ensembl Gene Structure Predictions

                                       Number of orphans per species
Pairwise comparison                  Hs     Mm     Rn     Dr   Fr a   Dm b     Ag   Ce c     Cb
Hs:Mm                              2723   2237      -      -      -      -      -      -      -
Hs:Rn                              2845      -   1261      -      -      -      -      -      -
Hs:Dr                              5465      -      -   1333      -      -      -      -      -
Hs:Fr                              4835      -      -      -  13016      -      -      -      -
Mm:Rn                                 -   1929    717      -      -      -      -      -      -
Mm:Dr                                 -   5544      -   1510      -      -      -      -      -
Mm:Fr                                 -   4696      -      -  13271      -      -      -      -
Rn:Dr                                 -      -   4340   1613      -      -      -      -      -
Rn:Fr                                 -      -   3147      -  13376      -      -      -      -
Dm:Ag                                 -      -      -      -      -   3043   3747      -      -
Ce:Cb                                 -      -      -      -      -      -      -   4035     22
Orphans across all comparisons     2416   1423    375    719  10397   3043   3747   4035     22

Numbers represent the number of orphans both in pairwise comparisons and in terms of comparison to all closely related species.
a Fugu rubripes, b Drosophila melanogaster, and c Caenorhabditis elegans gene structures are imported from external sources.
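The orphan counts in Table 2 can be illustrated with a minimal Python sketch. It is not the Ensembl pipeline: the gene identifiers, the ortholog pairs, and the function names `pairwise_orphans` and `orphans_across_all` are all hypothetical. It also assumes that "orphans across all comparisons" means genes left unmatched in every comparison involving that species, a reading consistent with Table 2, where the across-all figure never exceeds any pairwise figure for the same species.

```python
from typing import Iterable, Set, Tuple

# An ortholog pair links a gene in the first species to a gene in the second.
OrthologPairs = Set[Tuple[str, str]]


def pairwise_orphans(genes: Set[str], ortholog_pairs: OrthologPairs) -> Set[str]:
    """Return genes of the first species with no ortholog in the second.

    A gene may appear in several pairs (lineage-specific duplications),
    so appearing in any pair is enough to count as matched.
    """
    matched = {first for first, _ in ortholog_pairs}
    return genes - matched


def orphans_across_all(
    genes: Set[str], comparisons: Iterable[OrthologPairs]
) -> Set[str]:
    """Return genes that are orphans in every comparison (assumed reading)."""
    remaining = set(genes)
    for ortholog_pairs in comparisons:
        remaining &= pairwise_orphans(genes, ortholog_pairs)
    return remaining


# Hypothetical toy data: four "human" genes and two pairwise comparisons.
human_genes = {"hs1", "hs2", "hs3", "hs4"}
hs_mm = {("hs1", "mm1"), ("hs1", "mm1b"), ("hs2", "mm2")}  # hs1 maps to a duplicated pair
hs_rn = {("hs1", "rn1"), ("hs3", "rn3")}

orphans_hs_mm = pairwise_orphans(human_genes, hs_mm)
orphans_hs_rn = pairwise_orphans(human_genes, hs_rn)
orphans_all = orphans_across_all(human_genes, [hs_mm, hs_rn])

print(sorted(orphans_hs_mm))  # ['hs3', 'hs4']
print(sorted(orphans_hs_rn))  # ['hs2', 'hs4']
print(sorted(orphans_all))    # ['hs4']
```

Set difference gives the per-pair orphans, and intersecting those sets across comparisons gives the across-all figure. This is why, under this reading, a species compared against several others ends up with fewer orphans overall than in any single comparison.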
