12.07.2015 Views

Initial sequencing and analysis of the human genome - Vitagenes

Initial sequencing and analysis of the human genome - Vitagenes

Initial sequencing and analysis of the human genome - Vitagenes

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

articlesbetween ®sh <strong>and</strong> <strong>human</strong>). These analyses also suggest around30,000 <strong>human</strong> genes.Extrapolations have also been made from <strong>the</strong> gene counts forchromosomes 21 <strong>and</strong> 22 (refs 93, 94), adjusted for differencesin gene densities on <strong>the</strong>se chromosomes, as inferred from ESTmapping. These estimates are between 30,500 <strong>and</strong> 35,500, dependingon <strong>the</strong> precise assumptions used 286 .Insights from invertebrates. The worm <strong>and</strong> ¯y <strong>genome</strong>s contain alarge proportion <strong>of</strong> novel genes (around 50% <strong>of</strong> worm genes <strong>and</strong>30% <strong>of</strong> ¯y genes), in <strong>the</strong> sense <strong>of</strong> showing no signi®cant similarity toorganisms outside <strong>the</strong>ir phylum 293±295 . Such genes may have beenpresent in <strong>the</strong> original eukaryotic ancestor, but were subsequentlylost from <strong>the</strong> lineages <strong>of</strong> <strong>the</strong> o<strong>the</strong>r eukaryotes for which sequence isavailable; <strong>the</strong>y may be rapidly diverging genes, so that it is dif®cult torecognize homologues solely on <strong>the</strong> basis <strong>of</strong> sequence; <strong>the</strong>y mayrepresent true innovations developed within <strong>the</strong> lineage; or <strong>the</strong>ymay represent acquisitions by horizontal transfer. Whatever <strong>the</strong>irorigin, <strong>the</strong>se genes tend to have different biological properties fromhighly conserved genes. In particular, <strong>the</strong>y tend to have low expressionlevels as assayed both by direct studies <strong>and</strong> by a paucity <strong>of</strong>corresponding ESTs, <strong>and</strong> are less likely to produce a visible phenotypein loss-<strong>of</strong>-function genetic experiments 294,296 .Gene prediction. Current gene prediction methods employ combinations<strong>of</strong> three basic approaches: direct evidence <strong>of</strong> transcriptionprovided by ESTs or mRNAs 297±299 ; indirect evidence based onsequence similarity to previously identi®ed genes <strong>and</strong> proteins 300,301 ;<strong>and</strong> ab initio recognition <strong>of</strong> groups <strong>of</strong> exons on <strong>the</strong> basis <strong>of</strong> hiddenMarkov models (HMMs) that combine statistical informationabout splice sites, coding bias <strong>and</strong> exon <strong>and</strong> intron lengths (forexample, Genscan 275 , Genie 302,303 <strong>and</strong> FGENES 304 ).The ®rst approach relies on direct experimental data, but issubject to artefacts arising from contaminating ESTs derived fromunspliced mRNAs, genomic DNA contamination <strong>and</strong> nongenictranscription (for example, from <strong>the</strong> promoter <strong>of</strong> a transposableelement). The ®rst two problems can be mitigated by comparingtranscripts with <strong>the</strong> genomic sequence <strong>and</strong> using only those thatshow clear evidence <strong>of</strong> splicing. This solution, however, tends todiscard evidence from genes with long terminal exons or singleexons. The second approach tends correctly to identify gene-derivedsequences, although some <strong>of</strong> <strong>the</strong>se may be pseudogenes. However, itobviously cannot identify truly novel genes that have no sequencesimilarity to known genes. The third approach would suf®ce alone ifone could accurately de®ne <strong>the</strong> features used by cells for generecognition, but our current underst<strong>and</strong>ing is insuf®cient to doso. The sensitivity <strong>and</strong> speci®city <strong>of</strong> ab initio predictions are greatlyaffected by <strong>the</strong> signal-to-noise ratio. Such methods are moreaccurate in <strong>the</strong> ¯y <strong>and</strong> worm than in <strong>human</strong>. In ¯y, ab initiomethods can correctly predict around 90% <strong>of</strong> individual exons <strong>and</strong>can correctly predict all coding exons <strong>of</strong> a gene in about 40% <strong>of</strong>cases 303 . For <strong>human</strong>, <strong>the</strong> comparable ®gures are only about 70% <strong>and</strong>20%, respectively 94,305 . These estimates may be optimistic, owing to<strong>the</strong> design <strong>of</strong> <strong>the</strong> tests used.In any collection <strong>of</strong> gene predictions, we can expect to see variouserrors. Some gene predictions may represent partial genes, because<strong>of</strong> inability to detect some portions <strong>of</strong> a gene (incomplete sensitivity)or to connect all <strong>the</strong> components <strong>of</strong> a gene (fragmentation);some may be gene fusions; <strong>and</strong> o<strong>the</strong>rs may be spurious predictions(incomplete speci®city) resulting from chance matches or pseudogenes.Creating an initial gene index. We set out to create an initialintegrated gene index (IGI) <strong>and</strong> an associated integrated proteinindex (IPI) for <strong>the</strong> <strong>human</strong> <strong>genome</strong>. We describe <strong>the</strong> results obtainedfrom a version <strong>of</strong> <strong>the</strong> draft <strong>genome</strong> sequence based on <strong>the</strong> sequencedata available in July 2000, to allow time for detailed <strong>analysis</strong> <strong>of</strong> <strong>the</strong>gene <strong>and</strong> protein content. The additional sequence data that hassince become available will affect <strong>the</strong> results quantitatively, but areunlikely to change <strong>the</strong> conclusions qualitatively.We began with predictions produced by <strong>the</strong> Ensembl system 306 .Ensembl starts with ab initio predictions produced by Genscan 275<strong>and</strong> <strong>the</strong>n attempts to con®rm <strong>the</strong>m by virtue <strong>of</strong> similarity toproteins, mRNAs, ESTs <strong>and</strong> protein motifs (contained in <strong>the</strong>Pfam database 307 ) from any organism. In particular, it con®rmsintrons if <strong>the</strong>y are bridged by matches <strong>and</strong> exons if <strong>the</strong>y are ¯ankedby con®rmed introns. It <strong>the</strong>n attempts to extend protein matchesusing <strong>the</strong> GeneWise computer program 308 . Because it requirescon®rmatory evidence to support each gene component, it frequentlyproduces partial gene predictions. In addition, when <strong>the</strong>reis evidence <strong>of</strong> alternative splicing, it reports multiple overlappingtranscripts. In total, Ensembl produced 35,500 gene predictionswith 44,860 transcripts.To reduce fragmentation, we next merged Ensembl-based genepredictions with overlapping gene predictions from ano<strong>the</strong>rprogram, Genie 302 . Genie starts with mRNA or EST matches <strong>and</strong>employs an HMM to extend <strong>the</strong>se matches by using ab initiostatistical approaches. To avoid fragmentation, it attempts to linkinformation from 59 <strong>and</strong> 39 ESTs from <strong>the</strong> same cDNA clone <strong>and</strong><strong>the</strong>reby to produce a complete coding sequence from an initial ATGto a stop codon. As a result, it may generate complete genes moreaccurately than Ensembl in cases where <strong>the</strong>re is extensive ESTsupport. (Genie also generates potential alternative transcripts,but we used only <strong>the</strong> longest transcript in each group.) Wemerged 15,437 Ensembl predictions into 9,526 clusters, <strong>and</strong> <strong>the</strong>longest transcript in each cluster (from ei<strong>the</strong>r Genie or Ensembl)was taken as <strong>the</strong> representative.Next, we merged <strong>the</strong>se results with known genes contained in <strong>the</strong>RefSeq (version <strong>of</strong> 29 September 2000), SWISSPROT (release 39.6<strong>of</strong> 30 August 2000) <strong>and</strong> TrEMBL databases (TrEMBL release 14.17<strong>of</strong> 1 October 2000, TrEMBL_new <strong>of</strong> 1 October 2000). Incorporating<strong>the</strong>se sequences gave rise to overlapping sequences because <strong>of</strong>alternative splice forms <strong>and</strong> partial sequences. To construct anonredundant set, we selected <strong>the</strong> longest sequence from eachoverlapping set by using direct protein comparison <strong>and</strong> by mapping<strong>the</strong> gene predictions back onto <strong>the</strong> <strong>genome</strong> to construct <strong>the</strong> overlappingsets. This may occasionally remove some close paralogues in<strong>the</strong> event that <strong>the</strong> correct genomic location has not yet beensequenced, but this number is expected to be small.Finally, we searched <strong>the</strong> set to eliminate any genes derived fromcontaminating bacterial sequences, recognized by virtue <strong>of</strong> nearidentity to known bacterial plasmids, transposons <strong>and</strong> chromosomalgenes. Although most instances <strong>of</strong> such contamination hadbeen removed in <strong>the</strong> assembly process, a few cases had slippedthrough <strong>and</strong> were removed at this stage.The process resulted in version 1 <strong>of</strong> <strong>the</strong> IGI (IGI.1). Thecomposition <strong>of</strong> <strong>the</strong> corresponding IPI.1 protein set, obtained bytranslating IGI.1, is given in Table 22. There are 31,778 proteinpredictions, with 14,882 from known genes, 4,057 predictions fromEnsembl merged with Genie <strong>and</strong> 12,839 predictions from Ensemblalone. The average lengths are 469 amino acids for <strong>the</strong> knownproteins, 443 amino acids for protein predictions from <strong>the</strong>Ensembl±Genie merge, <strong>and</strong> 187 amino acids for those fromEnsembl alone. (The smaller average size for <strong>the</strong> predictions fromEnsembl alone re¯ects its tendency to predict partial genes where<strong>the</strong>re is supporting evidence for only part <strong>of</strong> <strong>the</strong> gene; <strong>the</strong> remainder<strong>of</strong> <strong>the</strong> gene will <strong>of</strong>ten not be predicted at all, ra<strong>the</strong>r than included aspart <strong>of</strong> ano<strong>the</strong>r prediction. Accordingly, <strong>the</strong> smaller size cannot beused to estimate <strong>the</strong> rate <strong>of</strong> fragmentation in such predictions.)The set corresponds to fewer than 31,000 actual genes, becausesome genes are fragmented into more than one partial prediction<strong>and</strong> some predictions may be spurious or correspond to pseudogenes.As discussed below, our best estimate is that IGI.1 includesabout 24,500 true genes.Evaluation <strong>of</strong> IGI/IPI. We used several approaches to evaluate <strong>the</strong>sensitivity, speci®city <strong>and</strong> fragmentation <strong>of</strong> <strong>the</strong> IGI/IPI set.Comparison with `new' known genes. One approach was to examineNATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 899© 2001 Macmillan Magazines Ltd

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!