12.07.2015 Views

Initial sequencing and analysis of the human genome - Vitagenes

Initial sequencing and analysis of the human genome - Vitagenes

Initial sequencing and analysis of the human genome - Vitagenes

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

articlesthat were used to `seed' <strong>the</strong> <strong>sequencing</strong> <strong>of</strong> new regions. The small®ngerprint clone contigs were extended or merged with o<strong>the</strong>rs as<strong>the</strong> map matured.The clones that make up <strong>the</strong> draft <strong>genome</strong> sequence <strong>the</strong>refore donot constitute a minimally overlapping setÐ<strong>the</strong>re is overlap <strong>and</strong>redundancy in places. The cost <strong>of</strong> using suboptimal overlaps wasjusti®ed by <strong>the</strong> bene®t <strong>of</strong> earlier availability <strong>of</strong> <strong>the</strong> draft <strong>genome</strong>sequence data. Minimizing <strong>the</strong> overlap between adjacent cloneswould have required completing <strong>the</strong> physical map before undertakinglarge-scale <strong>sequencing</strong>. In addition, <strong>the</strong> overlaps betweenBAC clones provide a rich collection <strong>of</strong> SNPs. More than 1.4 millionSNPs have already been identi®ed from clone overlaps <strong>and</strong> o<strong>the</strong>rsequence comparisons 97 .Because <strong>the</strong> <strong>sequencing</strong> project was shared among twenty centresin six countries, it was important to coordinate selection <strong>of</strong> clonesacross <strong>the</strong> centres. Most centres focused on particular chromosomesor, in some cases, larger regions <strong>of</strong> <strong>the</strong> <strong>genome</strong>. We also maintaineda clone registry to track selected clones <strong>and</strong> <strong>the</strong>ir progress. In laterphases, <strong>the</strong> global map provided an integrated view <strong>of</strong> <strong>the</strong> data fromall centres, facilitating <strong>the</strong> distribution <strong>of</strong> effort to maximize coverage<strong>of</strong> <strong>the</strong> <strong>genome</strong>. Before performing extensive <strong>sequencing</strong> on aFigure 3 The automated production line for sample preparation at <strong>the</strong> WhiteheadInstitute, Center for Genome Research. The system consists <strong>of</strong> custom-designed factorystyleconveyor belt robots that perform all functions from purifying DNA from bacterialcultures through setting up <strong>and</strong> purifying <strong>sequencing</strong> reactions.Sequence (Mb)5,0004,5004,0003,5003,0002,5002,0001,5001,0005000Jan-96Apr-96Jul-96FinishedUnfinished (draft <strong>and</strong> pre-draft)Oct-96Jan-97Apr-97Jul-97Oct-97Jan-98Apr-98MonthJul-98Oct-98Jan-99Apr-99Jul-99Oct-99Jan-00Apr-00Jul-00Oct-00Figure 4 Total amount <strong>of</strong> <strong>human</strong> sequence in <strong>the</strong> High Throughput Genome Sequence(HTGS) division <strong>of</strong> GenBank. The total is <strong>the</strong> sum <strong>of</strong> ®nished sequence (red) <strong>and</strong> un®nished(draft plus predraft) sequence (yellow).clone, several centres routinely examined an initial sample <strong>of</strong> 96 rawsequence reads from each subclone library to evaluate possibleoverlap with previously sequenced clones.SequencingThe selected clones were subjected to shotgun <strong>sequencing</strong>. Although<strong>the</strong> basic approach <strong>of</strong> shotgun <strong>sequencing</strong> is well established, <strong>the</strong>details <strong>of</strong> implementation varied among <strong>the</strong> centres. For example,<strong>the</strong>re were differences in <strong>the</strong> average insert size <strong>of</strong> <strong>the</strong> shotgunlibraries, in <strong>the</strong> use <strong>of</strong> single-str<strong>and</strong>ed or double-str<strong>and</strong>ed cloningvectors, <strong>and</strong> in <strong>sequencing</strong> from one end or both ends <strong>of</strong> each insert.Centres differed in <strong>the</strong> ¯uorescent labels employed <strong>and</strong> in <strong>the</strong> degreeto which <strong>the</strong>y used dye-primers or dye-terminators. The sequencedetectors included both slab gel- <strong>and</strong> capillary-based devices.Detailed protocols are available on <strong>the</strong> web sites <strong>of</strong> many <strong>of</strong> <strong>the</strong>individual centres (URLs can be found at www.nhgri.nih.gov/<strong>genome</strong>_hub). The extent <strong>of</strong> automation also varied greatlyamong <strong>the</strong> centres, with <strong>the</strong> most aggressive automation effortsresulting in factory-style systems able to process more than 100,000<strong>sequencing</strong> reactions in 12 hours (Fig. 3). In addition, centresdiffered in <strong>the</strong> amount <strong>of</strong> raw sequence data typically obtained foreach clone (so-called half-shotgun, full shotgun <strong>and</strong> ®nishedsequence). Sequence information from <strong>the</strong> different centres couldbe directly integrated despite this diversity, because <strong>the</strong> data wereanalysed by a common computational procedure. Raw sequencetraces were processed <strong>and</strong> assembled with <strong>the</strong> PHRED <strong>and</strong> PHRAPs<strong>of</strong>tware packages 77,78 (P. Green, unpublished). All assembled contigs<strong>of</strong> more than 2 kb were deposited in public databases within24 hours <strong>of</strong> assembly.The overall <strong>sequencing</strong> output rose sharply during production(Fig. 4). Following installation <strong>of</strong> new sequence detectors beginningin June 1999, <strong>sequencing</strong> capacity <strong>and</strong> output rose approximatelyeightfold in eight months to nearly 7 million samples processed permonth, with little or no drop in success rate (ratio <strong>of</strong> useable readsto attempted reads). By June 2000, <strong>the</strong> centres were producing rawsequence at a rate equivalent to onefold coverage <strong>of</strong> <strong>the</strong> entire<strong>human</strong> <strong>genome</strong> in less than six weeks. This corresponded to acontinuous throughput exceeding 1,000 nucleotides per second,24 hours per day, seven days per week. This scale-up resulted in aconcomitant increase in <strong>the</strong> sequence available in <strong>the</strong> publicdatabases (Fig. 4).A version <strong>of</strong> <strong>the</strong> draft <strong>genome</strong> sequence was prepared on <strong>the</strong> basis<strong>of</strong> <strong>the</strong> map <strong>and</strong> sequence data available on 7 October 2000. For thisversion, <strong>the</strong> mapping effort had assembled <strong>the</strong> ®ngerprinted BACsinto 1,246 ®ngerprint clone contigs. The <strong>sequencing</strong> effort hadsequenced <strong>and</strong> assembled 29,298 overlapping BACs <strong>and</strong> o<strong>the</strong>r largeinsertclones (Table 2), comprising a total length <strong>of</strong> 4.26 gigabases(Gb). This resulted from around 23 Gb <strong>of</strong> underlying raw shotgunsequence data, or about 7.5-fold coverage averaged across <strong>the</strong><strong>genome</strong> (including both draft <strong>and</strong> ®nished sequence). The variouscontributions to <strong>the</strong> total amount <strong>of</strong> sequence deposited in <strong>the</strong>HTGS division <strong>of</strong> GenBank are given in Table 3.Table 2 Total <strong>genome</strong> sequence from <strong>the</strong> collection <strong>of</strong> sequenced clones, bysequence statusSequencestatusNumber <strong>of</strong>clonesTotal clonelength (Mb)Averagenumber <strong>of</strong>sequencereads per kb*Averagesequencedepth²Total amount<strong>of</strong> rawsequence (Mb)Finished 8,277 897 20±25 8±12 9,085Draft 18,969 3,097 12 4.5 13,395Predraft 2,052 267 6 2.5 667Total 23,147.............................................................................................................................................................................* The average number <strong>of</strong> reads per kb was estimated based on information provided by each<strong>sequencing</strong> centre. This number differed among <strong>sequencing</strong> centres, based on <strong>the</strong> actual protocolsused.² The average depth in high quality bases ($99% accuracy) was estimated from informationprovided by each <strong>sequencing</strong> centre. The average varies among <strong>the</strong> centres, <strong>and</strong> <strong>the</strong> number mayvary considerably for clones with <strong>the</strong> same <strong>sequencing</strong> status. For draft clones in <strong>the</strong> publicdatabases (keyword: HTGS_draft), <strong>the</strong> number can be computed from <strong>the</strong> quality scores listed in<strong>the</strong> database entry.NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd867

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!