Initial sequencing and analysis of the human genome - Vitagenes

More documents

Recommendations

Info

articlesappear to have considerably less interest in this sense. About 60% ofthe olfactory receptors in the draft genome sequence have disruptedORFs and appear to be pseudogenes, consistent with recentreports 344,346 suggesting massive functional gene loss in the last 10Myr 347,348 . Interestingly, there appears to be a much higher proportionof intact genes among class I than class II olfactory receptors,suggesting functional importance.Vertebrates are not unique in employing gene family expansion.For many domain types, expansions appear to have occurredindependently in each of the major eukaryotic lineages. A goodexample is the classical C2H2 family of zinc ®nger domains, whichhave expanded independently in the yeast, worm, ¯y and humanlineages (Fig. 45). These independent expansions have resulted innumerous C2H2 zinc ®nger domain-containing proteins that arespeci®c to each lineage. In ¯ies, the important components of theC2H2 zinc ®nger expansion are architectures in which it is combinedwith the POZ domain and the C4DM domain (a metalbindingdomain found only in ¯y). In humans, the most prevalentexpansions are combinations of the C2H2 zinc ®nger with POZ(independent of the one in insects) and the vertebrate-speci®cKRAB and SCAN domains.The homeodomain is similarly expanded in all animals and ispresent in both architectures that are conserved and lineage-speci®carchitectures (Fig. 45). This indicates that the ancestral animalprobably encoded a signi®cant number of homeodomain proteins,but subsequent evolution involved multiple, independent expansionsand domain shuf¯ing after lineages diverged. Thus, the mostprevalent transcription factor families are different in worm, ¯y andhuman (Fig. 45). This has major biological implications becausetranscription factors are critical in animal development and differentiation.The emergence of major variations in the developmentalbody plans that accompanied the early radiation of the animals 349could have been driven by lineage-speci®c proliferation of suchtranscription factors. Beyond these large expansions of proteinfamilies, protein components of particular functional systemssuch as the cell death signalling system show a general increase indiversity and numbers in the vertebrates relative to other animals.For example, there are greater numbers of and more novel architecturesin cell death regulatory proteins such as BCL-2, TNFR andNFkB from vertebrates.Conclusion. Five lines of evidence point to an increase in thecomplexity of the proteome from the single-celled yeast to themulticellular invertebrates and to vertebrates such as the human.Speci®cally, the human contains greater numbers of genes, domainand protein families, paralogues, multidomain proteins withmultiple functions, and domain architectures. According to thesemeasures, the relatively greater complexity of the human proteomeis a consequence not simply of its larger size, but also of large-scaleprotein innovation.An important question is the extent to which the greaterphenotypic complexity of vertebrates can be explained simply bytwo- or threefold increases in proteome complexity. The realexplanation may lie in combinatorial ampli®cation of thesemodest differences, by mechanisms that include alternative splicing,post-translational modi®cation and cellular regulatory networks.The potential numbers of different proteins and protein±proteininteractions are vast, and their actual numbers cannot readily bediscerned from the genome sequence. Elucidating such systemlevelproperties presents one of the great challenges for modernbiology.Table 25 The most populous InterPro families in the human proteome and other speciesInterPro IDNo. ofgenesHuman Fly Worm Yeast Mustard weedRankNo. ofgenesRankNo. ofgenesRankNo. ofgenesIPR003006 765 (1) 140 (9) 64 (34) 0 (na) 0 (na) Immunoglobulin domainPR000822 706 (2) 357 (1) 151 (10) 48 (7) 115 (20) C2H2 zinc ®ngerIPR000719 575 (3) 319 (2) 437 (2) 121 (1) 1049 (1) Eukaryotic protein kinaseIPR000276 569 (4) 97 (14) 358 (3) 0 (na) 16 (84) Rhodopsin-like GPCR superfamilyIPR001687 433 (5) 198 (4) 183 (7) 97 (2) 331 (5) P-loop motifIPR000477 350 (6) 10 (65) 50 (41) 6 (36) 80 (35) Reverse transcriptase (RNA-dependent DNA polymerase)IPR000504 300 (7) 157 (6) 96 (21) 54 (6) 255 (8) rrm domainIPR001680 277 (8) 162 (5) 102 (19) 91 (3) 210 (10) G-protein b WD-40 repeatsIPR002110 276 (9) 105 (13) 107 (17) 19 (23) 120 (18) Ankyrin repeatIPR001356 267 (10) 148 (7) 109 (15) 9 (33) 118 (19) Homeobox domainIPR001849 252 (11) 77 (22) 71 (31) 27 (17) 27 (73) PH domainIPR002048 242 (12) 111 (12) 81 (25) 15 (27) 167 (12) EF-hand familyIPR000561 222 (13) 81 (20) 113 (14) 0 (na) 17 (83) EGF-like domainIPR001452 215 (14) 72 (23) 62 (35) 25 (18) 3 (97) SH3 domainIPR001841 210 (15) 114 (11) 126 (12) 35 (12) 379 (4) RING ®ngerIPR001611 188 (16) 115 (10) 54 (38) 7 (35) 392 (2) Leucine-rich repeatIPR001909 171 (17) 0 (na) 0 (na) 0 (na) 0 (na) KRAB boxIPR001777 165 (18) 63 (27) 51 (40) 2 (40) 4 (96) Fibronectin type III domainIPR001478 162 (19) 70 (24) 66 (33) 2 (40) 15 (85) PDZ domainIPR001650 155 (20) 87 (17) 78 (27) 79 (4) 148 (13) Helicase C-terminal domainIPR001440 150 (21) 86 (18) 46 (43) 36 (11) 125 (17) TPR repeatIPR002216 133 (22) 65 (26) 99 (20) 2 (40) 31 (69) Ion transport proteinIPR001092 131 (23) 84 (19) 41 (46) 7 (35) 106 (24) Helix±loop±helix DNA-binding domainIPR000008 123 (24) 43 (34) 36 (49) 9 (33) 82 (34) C2 domainIPR001664 119 (25) 4 (71) 22 (63) 1 (41) 2 (98) SH2 domainIPR001254 118 (26) 210 (3) 12 (73) 1 (41) 15 (85) Serine protease, trypsin familyIPR002126 114 (27) 19 (56) 16 (69) 0 (na) 0 (na) Cadherin domainIPR000210 113 (28) 78 (21) 117 (13) 1 (41) 54 (50) BTB/POZ domainIPR000387 112 (29) 35 (40) 108 (16) 12 (30) 21 (79) Tyrosine-speci®c protein phosphatase and dual speci®cityprotein phosphatase familyIPR000087 106 (30) 18 (57) 169 (9) 0 (na) 5 (95) Collagen triple helix repeatIPR000379 94 (31) 141 (8) 134 (11) 40 (10) 194 (11) Esterase/lipase/thioesteraseIPR000910 89 (32) 38 (38) 18 (67) 8 (34) 18 (82) HMG1/2 (high mobility group) boxIPR000130 87 (33) 56 (29) 92 (22) 8 (34) 12 (88) Neutral zinc metallopeptidaseIPR001965 84 (34) 37 (39) 24 (61) 16 (26) 71 (39) PHD-®ngerIPR000636 83 (35) 32 (43) 24 (61) 1 (41) 14 (86) Cation channels (non-ligand gated)IPR001781 81 (36) 38 (38) 36 (49) 4 (38) 8 (92) LIM domainIPR002035 81 (36) 8 (67) 45 (44) 3 (39) 17 (83) VWA domainIPR001715 80 (37) 33 (42) 30 (55) 3 (39) 18 (82) Calponin homology domainIPR000198 77 (38) 20 (55) 20 (65) 10 (32) 9 (91) RhoGAP domain...................................................................................................................................................................................................................................................................................................................................................................Forty most populous Interpro families found in the human proteome compared with equivalent numbers from other species. na, not applicable (used when there are no proteins in an organism in that family).RankNo. ofgenesRankNATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd907
articlesSegmental history of the human genomeIn bacteria, genomic segments often convey important informationabout function: genes located close to one another often encodeproteins in a common pathway and are regulated in a commonoperon. In mammals, genes found close to each other only rarelyhave common functions, but they are still interesting because theyhave a common history. In fact, the study of genomic segments canshed light on biological events as long as 500 Myr ago and as recentlyas 20,000 years ago.Conserved segments between human and mouseHumans and mice shared a common ancestor about 100 Myr ago.Despite the 200 Myr of evolutionary distance between the species, asigni®cant fraction of genes show synteny between the two, beingpreserved within conserved segments. Genes tightly linked in onemammalian species tend to be linked in others. In fact, conservedsegments have been observed in even more distant species: humansshow conserved segments with ®sh 350,351 and even with invertebratessuch as ¯y and worm 352 . In general, the likelihood that a syntenicrelationship will be disrupted correlates with the physical distancebetween the loci and the evolutionary distance between the species.Studying conserved segments between human and mouse hasseveral uses. First, conservation of gene order has been used toidentify likely orthologues between the species, particularly wheninvestigating disease phenotypes. Second, the study of conservedsegments among genomes helps us to deduce evolutionary ancestry.Fly (number of genes)4003503002502820058 71501032117 15 12 9100 36437 23 2129 20 1334 2211351933 18 145027 263130242516600 100 200 300 400 500 600 700 800 900Human (number of genes)Figure 44 Relative expansions of protein families between human and ¯y. These datahave not been normalized for proteomic size differences. Blue line, equality betweennormalized family sizes in the two organisms. Green line, equality between unnormalizedfamily sizes. Numbered InterPro entries: (1) immunoglobulin domain [IPR003006]; (2) zinc®nger, C2H2 type [IPR000822]; (3) eukaryotic protein kinase [IPR000719]; (4) rhodopsinlikeGPCR superfamily [IPR000276]; (5) ATP/GTP-binding site motif A (P-loop)[IPR001687]; (6) reverse transcriptase (RNA-dependent DNA polymerase) [IPR000477];(7) RNA-binding region RNP-1 (RNA recognition motif) [IPR000504]; (8) G-proteinb WD-40 repeats [IPR001680]; (9) ankyrin repeat [IPR002110]; (10) homeobox domain[IPR001356]; (11) PH domain [IPR001849]; (12) EF-hand family [IPR002048]; (13) EGFlikedomain [IPR000561]; (14) Src homology 3 (SH3) domain [IPR001452]; (15) RING®nger [IPR001841]; (16) KRAB box [IPR001909]; (17) leucine-rich repeat [IPR001611];(18) ®bronectin type III domain [IPR001777]; (19) PDZ domain (also known as DHR orGLGF) [IPR001478]; (20) TPR repeat [IPR001440]; (21) helicase C-terminal domain[IPR001650]; (22) ion transport protein [IPR002216]; (23) helix±loop±helix DNA-bindingdomain [IPR001092]; (24) cadherin domain [IPR002126]; (25) intermediate ®lamentproteins [IPR001664]; (26) C2 domain [IPR000008]; (27) Src homology 2 (SH2) domain[IPR000980]; (28) serine proteases, trypsin family [IPR001254]; (29) BTB/POZ domain[IPR000210]; (30) tyrosine-speci®c protein phosphatase and dual speci®city proteinphosphatase family [IPR000387]; (31) collagen triple helix repeat [IPR000087]; (32)esterase/lipase/thioesterase [IPR000379]; (33) neutral zinc metallopeptidases, zincbindingregion [IPR000130]; (34) ATP-binding transport protein, 2nd P-loop motif[IPR001051]; (35) ABC transporters family [IPR001617]; (36) cytochrome P450 enzyme[IPR001128]; (37) insect cuticle protein [IPR000618].32And third, detailed comparative maps may assist in the assembly ofthe mouse sequence, using the human sequence as a scaffold.Two types of linkage conservation are commonly described 353 .`Conserved synteny' indicates that at least two genes that reside on acommon chromosome in one species are also located on a commonchromosome in the other species. Syntenic loci are said to lie in a`conserved segment' when not only the chromosomal position butthe linear order of the loci has been preserved, without interruptionby other chromosomal rearrangements.An initial survey of homologous loci in human and mouse 354suggested that the total number of conserved segments would beabout 180. Subsequent estimates based on increasingly detailedcomparative maps have remained close to this projection 353,355,356(http://www.informatics.jax.org). The distribution of segmentlengths has corresponded reasonably well to the truncated negativeexponential curve predicted by the random breakage model 357 .The availability of a draft human genome sequence allows the ®rstglobal human±mouse comparison in which human physical distancescan be measured in Mb, rather than cM or orthologous genecounts. We identi®ed likely orthologues by reciprocal comparisonof the human and mouse mRNAs in the LocusLink database, usingmegaBLAST. For each orthologous pair, we mapped the location ofthe human gene in the draft genome sequence and then checked thelocation of the mouse gene in the Mouse Genome Informaticsdatabase (http://www.informatics.jax.org). Using a conservativethreshold, we identi®ed 3,920 orthologous pairs in which thehuman gene could be mapped on the draft genome sequence withhigh con®dence. Of these, 2,998 corresponding mouse genes had aknown position in the mouse genome. We then searched forde®nitive conserved segments, de®ned as human regions containingorthologues of at least two genes from the same mouse chromosomeregion (, 15 cM) without interruption by segments from otherchromosomes.We identi®ed 183 de®nitive conserved segments (Fig. 46). Theaverage segment length was 15.4 Mb, with the largest segment being90.5 Mb and the smallest 24 kb. There were also 141 `singletons',segments that contained only a single locus; these are not counted inthe statistics. Although some of these could be short conservedsegments, they could also re¯ect incorrect choices of orthologues orproblems with the human or mouse maps. Because of this conservativeapproach, the observed number of de®nitive segments islikely be lower than the correct total. One piece of evidence for thisconclusion comes from a more detailed analysis on human chromosome7 (ref. 358), which identi®ed 20 conserved segments, ofwhich three were singletons. Our analysis revealed only 13 de®nitivesegments on this chromosome, with nine singletons.The frequency of observing a particular gene count in a conservedsegment is plotted on a logarithmic scale in Fig. 47. If chromosomalbreaks occur in a random fashion (as has been proposed) anddifferences in gene density are ignored, a roughly straight lineshould result. There is a clear excess for n = 1, suggesting that 50%or more of the singletons are indeed artefactual. Thus, we estimatethat true number of conserved segments is around 190±230, in goodagreement with the original Nadeau±Taylor prediction 354 .Figure 48 shows a plot of the frequency of lengths of conservedsegments, where the x-axis scale is shown in Mb. As before, there is afair amount of scatter in the data for the larger segments (where thenumbers are small), but the trend appears to be consistent with arandom breakage model.We attempted to ascertain whether the breakpoint regions haveany special characteristics. This analysis was complicated by imprecisionin the positioning of these breaks, which will tend to blur anyrelationships. With 2,998 orthologues, the average interval withinwhich a break is known to have occurred is about 1.1 Mb. Wecompared the aggregate features of these breakpoint intervals withthe genome as a whole. The mean gene density was lower inbreakpoint regions than in the conserved segments (13.8 versus908 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
Page 2 and 3: articlesGenome Sequencing Centres (
Page 4 and 5: articleswell as targeted regions of
Page 6 and 7: articlesultimate goal of a complete
Page 8 and 9: articlesthat were used to `seed' th
Page 11 and 12: articlesFingerprint clone contigPic
Page 13 and 14: articlesof the end of the contig, s
Page 15 and 16: articles®ngerprint map. However, m
Page 17 and 18: articlesnucleotides to collections
Page 19 and 20: articlesTable 10 Number of CpG isla
Page 21 and 22: articlesClasses of interspersed rep
Page 23 and 24: articlesposable elements in categor
Page 25 and 26: articlesProportion of genome compri
Page 27 and 28: articlesgenomes 184±187 . By study
Page 29 and 30: like termini show the typical diver
Page 31 and 32: articlesvelocardiofacial/DiGeorge a
Page 33 and 34: articlesaSum of aligned bases (kb)c
Page 35 and 36: articlesthe IAC anticodon for a Val
Page 37 and 38: articlesTable 21 Characteristics of
Page 39 and 40: articles30% to 50%. The correlation
Page 41 and 42: articlesTable 22 Properties of the
Page 43 and 44: articlesTable 23 Properties of geno
Page 45 and 46: articlesmultifunctional proteins an
Page 47: articlesaYW, F, VBr Br Ba Br Br Br
Page 51 and 52: articlesThe most extreme mechanism
Page 53 and 54: articlesinclude the DiGeorge/veloca
Page 55 and 56: articlesbecause of the availability
Page 57 and 58: articlesNature 380, 152±154 (1996)
Page 59 and 60: articles265. Sorensen, P. D. & Fred
Page 61 and 62: articles443. Roth, F. P., Hughes, J

Initial sequencing and analysis of the human genome - Vitagenes

Create successful ePaper yourself

Delete template?

Save as template?