Initial sequencing and analysis of the human genome - Vitagenes


analysis to attempt to avoid misassemblies. The second is the 'hierarchical shotgun sequencing' approach (Fig. 2), also referred to as 'map-based', 'BAC-based' or 'clone-by-clone'. This approach involves generating and organizing a set of large-insert clones (typically 100–200 kb each) covering the genome and separately performing shotgun sequencing on appropriately chosen clones. Because the sequence information is local, the issue of long-range misassembly is eliminated and the risk of short-range misassembly is reduced. One caveat is that some large-insert clones may suffer rearrangement, although this risk can be reduced by appropriate quality-control measures involving clone fingerprints (see below).

The two methods are likely to entail similar costs for producing finished sequence of a mammalian genome. The hierarchical approach has a higher initial cost than the whole-genome approach, owing to the need to create a map of clones (about 1% of the total cost of sequencing) and to sequence overlaps between clones.
On the other hand, the whole-genome approach is likely to require much greater work and expense in the final stage of producing a finished sequence, because of the challenge of resolving misassemblies. Both methods must also deal with cloning biases, resulting in under-representation of some regions in either large-insert or small-insert clone libraries.

There was lively scientific debate over whether the human genome sequencing effort should employ whole-genome or hierarchical shotgun sequencing. Weber and Myers 58 stimulated these discussions with a specific proposal for a whole-genome shotgun approach, together with an analysis suggesting that the method could work and be more efficient. Green 59 challenged these conclusions and argued that the potential benefits did not outweigh the likely risks.

In the end, we concluded that the human genome sequencing effort should employ the hierarchical approach for several reasons. First, it was prudent to use the approach for the first project to sequence a repeat-rich genome.
With the hierarchical approach, the ultimate frequency of misassembly in the finished product would probably be lower than with the whole-genome approach, in which it would be more difficult to identify regions in which the assembly was incorrect.

Second, it was prudent to use the approach in dealing with an outbred organism, such as the human. In the whole-genome shotgun method, sequence would necessarily come from two different copies of the human genome. Accurate sequence assembly could be complicated by sequence variation between these two copies: both SNPs (which occur at a rate of 1 per 1,300 bases) and larger-scale structural heterozygosity (which has been documented in human chromosomes). In the hierarchical shotgun method, each large-insert clone is derived from a single haplotype.

Third, the hierarchical method would be better able to deal with inevitable cloning biases, because it would more readily allow targeting of additional sequencing to under-represented regions. And fourth, it was better suited to a project shared among members of a diverse international consortium, because it allowed work and responsibility to be easily distributed.
As the ultimate goal has always been to create a high-quality, finished sequence to serve as a foundation for biomedical research, we reasoned that the advantages of this more conservative approach outweighed the additional cost, if any.

A biotechnology company, Celera Genomics, has chosen to incorporate the whole-genome shotgun approach into its own efforts to sequence the human genome. Their plan 60,61 uses a mixed strategy, involving combining some coverage with whole-genome shotgun data generated by the company together with the publicly available hierarchical shotgun data generated by the International Human Genome Sequencing Consortium. If the raw sequence reads from the whole-genome shotgun component are made available, it may be possible to evaluate the extent to which the sequence of the human genome can be assembled without the need for clone-based information. Such analysis may help to refine sequencing strategies for other large genomes.

Technology for large-scale sequencing

Sequencing the human genome depended on many technological improvements in the production and analysis of sequence data. Key innovations were developed both within and outside the Human Genome Project.
Laboratory innovations included four-colour fluorescence-based sequence detection 62, improved fluorescent dyes 63–66, dye-labelled terminators 67, polymerases specifically designed for sequencing 68–70, cycle sequencing 71 and capillary gel electrophoresis 72–74. These studies contributed to substantial improvements in the automation, quality and throughput of collecting raw DNA sequence 75,76.

There were also important advances in the development of software packages for the analysis of sequence data. The PHRED software package 77,78 introduced the concept of assigning a 'base-quality score' to each base, on the basis of the probability of an erroneous call. These quality scores make it possible to monitor raw data quality and also assist in determining whether two similar sequences truly overlap. The PHRAP computer package (http://bozeman.mbt.washington.edu/phrap.docs/phrap.html) then systematically assembles the sequence data using the base-quality scores. The program assigns 'assembly-quality scores' to each base in the assembled sequence, providing an objective criterion to guide sequence finishing. The quality scores were based on and validated by extensive experimental data.

Another key innovation for scaling up sequencing was the development by several centres of automated methods for sample preparation.
This typically involved creating new biochemical protocols suitable for automation, followed by construction of appropriate robotic systems.

Coordination and public data sharing

The Human Genome Project adopted two important principles with regard to human sequencing. The first was that the collaboration would be open to centres from any nation. Although potentially less efficient, in a narrow economic sense, than a centralized approach involving a few large factories, the inclusive approach was strongly favoured because we felt that the human genome sequence is the common heritage of all humanity and the work should transcend national boundaries, and we believed that scientific progress was best assured by a diversity of approaches. The collaboration was coordinated through periodic international meetings (referred to as 'Bermuda meetings' after the venue of the first three gatherings) and regular telephone conferences. Work was shared flexibly among the centres, with some groups focusing on particular chromosomes and others contributing in a genome-wide fashion.

The second principle was rapid and unrestricted data release. The centres adopted a policy that all genomic sequence data should be made publicly available without restriction within 24 hours of assembly 79,80.
Pre-publication data releases had been pioneered in mapping projects in the worm 11 and mouse genomes 30,81 and were prominently adopted in the sequencing of the worm, providing a direct model for the human sequencing efforts. We believed that scientific progress would be most rapidly advanced by immediate and free availability of the human genome sequence. The explosion of scientific work based on the publicly available sequence data in both academia and industry has confirmed this judgement.

Generating the draft genome sequence

Generating a draft sequence of the human genome involved three steps: selecting the BAC clones to be sequenced, sequencing them and assembling the individual sequenced clones into an overall draft genome sequence. A glossary of terms related to genome sequencing and assembly is provided in Box 1.

The draft genome sequence is a dynamic product, which is regularly updated as additional data accumulate en route to the

864 NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd
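
As an illustration of the base-quality scores that the technology section above attributes to PHRED, the following minimal Python sketch converts between an error probability and a quality score. It assumes the standard Phred convention Q = -10 log10(P), which this excerpt does not itself state. The function names (`phred_quality`, `error_probability`, `expected_errors`) are hypothetical and are not part of the PHRED or PHRAP software.

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality score for an estimated probability of an erroneous base call.

    Assumes the standard Phred convention Q = -10 * log10(P).
    """
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse mapping: error probability implied by a quality score."""
    return 10.0 ** (-quality / 10.0)


def expected_errors(qualities: list[float]) -> float:
    """Expected number of miscalled bases in a read, given per-base scores."""
    return sum(error_probability(q) for q in qualities)


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    read_qualities = [10.0, 20.0, 30.0]
    print(phred_quality(0.001))
    print(expected_errors(read_qualities))
```

Under this convention a score of 20 corresponds to a 1-in-100 chance of a wrong call and a score of 30 to 1 in 1,000. Summing the implied probabilities gives a rough expected error count for a read, one simple way to monitor raw data quality.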
