12.07.2015 Views

Initial sequencing and analysis of the human genome - Vitagenes



overprediction rate of 30% for gene predictions in this expanded set, the analysis above suggests that the IGI+ set contains about 28,000 true genes and yields an estimate of about 32,000 human genes. We are investigating ways to filter the expanded set, to produce an IGI with the advantage of the increased sensitivity resulting from combining multiple gene prediction programs without the corresponding loss of specificity. Meanwhile, the IGI+ set can be used by researchers searching for genes that cannot be found in the IGI.

Some classes of genes may have been missed by all of the gene-finding methods. Genes could be missed if they are expressed at low levels or in rare tissues (being absent or very under-represented in EST and mRNA databases) and have sequences that evolve rapidly (being hard to detect by protein homology and genome comparison). Both the worm and fly gene sets contain a substantial number of such genes (refs 293, 294). Single-exon genes encoding small proteins may also have been missed, because EST evidence that supports them cannot be distinguished from genomic contamination in the EST dataset and because homology may be hard to detect for small proteins (ref. 310).

The human thus appears to have only about twice as many genes as worm or fly. However, human genes differ in important respects from those in worm and fly.
They are spread out over much larger regions of genomic DNA, and they are used to construct more alternative transcripts. This may result in perhaps five times as many primary protein products in the human as in the worm or fly.

The predicted gene and protein sets described here are clearly far from final. Nonetheless, they provide a valuable starting point for experimental and computational research. The predictions will improve progressively as the sequence is finished, as further confirmatory evidence becomes available (particularly from other vertebrate genome sequences, such as those of mouse and T. nigroviridis), and as computational methods improve. We intend to create and release updated versions of the IGI and IPI regularly, until they converge to a final accurate list of every human gene. The gene predictions will be linked to RefSeq, HUGO and SWISSPROT identifiers where available, and tracking identifiers between versions will be included, so that individual genes under study can be traced forwards as the human sequence is completed.

Comparative proteome analysis

Knowledge of the human proteome will provide unprecedented opportunities for studies of human gene function. Often clues will be provided by sequence similarity with proteins of known function in model organisms.
Such initial observations must then be followed up by detailed studies to establish the actual function of these molecules in humans.

For example, 35 proteins are known to be involved in the vacuolar protein-sorting machinery in yeast. Human genes encoding homologues can be found in the draft human sequence for 34 of these yeast proteins, but precise relationships are not always clear. In nine cases there appears to be a single clear human orthologue (a gene that arose as a consequence of speciation); in 12 cases there are matches to a family of human paralogues (genes that arose owing to intra-genome duplication); and in 13 cases there are matches to specific protein domains (refs 311–314). Hundreds of similar stories emerge from the draft sequence, but each merits a detailed interpretation in context. To treat these subjects properly, there will be many following studies, the first of which appear in accompanying papers (refs 315–323).

Here, we aim to take a more global perspective on the content of the human proteome by comparing it with the proteomes of yeast, worm, fly and mustard weed.
Such comparisons shed useful light on the commonalities and differences among these eukaryotes (refs 294, 324, 325). The analysis is necessarily preliminary, because of the imperfect nature of the human sequence, uncertainties in the gene and protein sets for all of the multicellular organisms considered and our incomplete knowledge of protein structures. Nonetheless, some general patterns emerge. These include insights into fundamental mechanisms that create functional diversity, including invention of protein domains, expansion of protein and domain families, evolution of new protein architectures and horizontal transfer of genes. Other mechanisms, such as alternative splicing, post-translational modification and complex regulatory networks, are also crucial in generating diversity but are much harder to discern from the primary sequence. We will not attempt to consider the effects of alternative splicing on proteins; we will consider only a single splice form from each gene in the various organisms, even when multiple splice forms are known.

Functional and evolutionary classification. We began by classifying the human proteome on the basis of functional categories and evolutionary conservation. We used the InterPro annotation protocol to identify conserved biochemical and cellular processes. InterPro is a tool for combining sequence-pattern information from four databases.
The first two databases, PRINTS (ref. 326) and Prosite (ref. 327), primarily contain information about motifs corresponding to specific family subtypes, such as type II receptor tyrosine kinases (RTK-II) in particular or tyrosine kinases in general. The second two databases, Pfam (ref. 307) and Prosite Profile (ref. 327), contain information (in the form of profiles or HMMs) about families of structural domains, for example protein kinase domains. InterPro integrates the motif and domain assignments into a hierarchical classification system; so a protein might be classified at the most detailed level as being an RTK-II, at a more general level as being a kinase specific for tyrosine, and at a still more general level as being a protein kinase. The complete hierarchy of InterPro entries is described at http://www.ebi.ac.uk/interpro/. We collapsed the InterPro entries into 12 broad categories, each reflecting a set of cellular functions.

The InterPro families are partly the product of human judgement and reflect the current state of biological and evolutionary knowledge. The system is a valuable way to gain insight into large collections of proteins, but not all proteins can be classified at present. The proportion of the yeast, worm, fly and mustard weed protein sets that is assigned to at least one InterPro family is, for each organism, about 50% (Table 23; refs 307, 326, 327).

About 40% of the predicted human proteins in the IPI could be assigned to InterPro entries and functional categories.
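
As an illustration only, the hierarchical classification described above can be sketched as a child-to-parent map that is walked from the most detailed entry to the most general one. The labels are taken from the RTK-II example in the text; they are not real InterPro accessions, and the names `PARENT`, `lineage` and `most_general` are hypothetical helpers rather than part of any InterPro software.

```python
# Toy hierarchy built from the RTK-II example in the text.
# Assumption: plain labels stand in for InterPro entries; these are
# not real InterPro accessions.
PARENT = {
    "RTK-II": "tyrosine kinase",
    "tyrosine kinase": "protein kinase",
    "protein kinase": None,  # top of this toy hierarchy
}


def lineage(entry, parent=PARENT):
    """Return the entries from most specific to most general."""
    chain = []
    while entry is not None:
        chain.append(entry)
        entry = parent[entry]  # step up one level in the hierarchy
    return chain


def most_general(entry, parent=PARENT):
    """Return the broadest classification reachable from an entry."""
    return lineage(entry, parent)[-1]


if __name__ == "__main__":
    print(" -> ".join(lineage("RTK-II")))
```

Collapsing entries into broad functional categories, as the text describes, would amount to a further mapping from the top-level entries to category names; that mapping is not reproduced here because the 12 categories are not listed in this passage.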
On the basis of these assignments, we could compare organisms according to the number of proteins in each category (Fig. 37). Compared with the two invertebrates, humans appear to have many proteins involved in cytoskeleton, defence and immunity, and transcription and translation. These expansions are clearly related to aspects of vertebrate physiology. Humans also have many more proteins that are classified as falling into more than one functional category (426 in human versus 80 in worm and 57 in fly, data not shown). Interestingly, 32% of these are transmembrane receptors.

We obtained further insight into the evolutionary conservation of proteins by comparing each sequence to the complete nonredundant database of protein sequences maintained at NCBI, using the BLASTP computer program (ref. 328) and then breaking down the matches according to organismal taxonomy (Fig. 38). Overall, 74% of the proteins had significant matches to known proteins.

Such classifications are based on the presence of clearly detectable homologues in existing databases. Many of these genes have surely evolved from genes that were present in common ancestors but have since diverged substantially. Indeed, one can detect more distant relationships by using sensitive computer programs that can recognize weakly conserved features. Using PSI-BLAST, we can recognize probable nonvertebrate homologues for about 45% of the 'vertebrate-specific' set.
Nonetheless, the classification is useful for gaining insights into the commonalities and differences among the proteomes of different organisms.

Probable horizontal transfer. An interesting category is a set of 223 proteins that have significant similarity to proteins from bacteria, but no comparable similarity to proteins from yeast, worm, fly and

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com 901
© 2001 Macmillan Magazines Ltd
