
Initial sequencing and analysis of the human genome - Vitagenes



fingerprint map. However, many involve STSs that have been localized on only one or two of the previous maps or that occur as isolated discrepancies in conflict with several flanking STSs. Many of these cases are probably due to errors in the previous maps (with error rates for individual maps estimated at 1–2%; ref. 100). Others may be due to incorrect assignment of the STSs to the draft genome sequence (by the electronic polymerase chain reaction (e-PCR) computer program) or to database entries that contain sequence data from more than one clone (owing to cross-contamination).

Graphical views of the independent data sets were particularly useful in detecting problems with order or orientation (Fig. 5). Areas of conflict were reviewed and corrected if supported by the underlying data. In the version discussed here, there were 41 sequenced clones falling in 14 sequenced-clone contigs with STS content information from multiple maps that disagreed with the flanking clones or sequenced-clone contigs; the placement of these clones thus remains suspect. Four of these instances suggest errors in the fingerprint map, whereas the others suggest errors in the layout of sequenced clones. These cases are being investigated and will be corrected in future versions.

Assembly of the sequenced clones.
We assessed the accuracy of the assembly by using a set of 148 draft clones comprising 22.4 Mb for which finished sequence subsequently became available (ref. 104). The initial sequence contigs lack information about order and orientation, and GigAssembler attempts to use linking data to infer such information as far as possible (ref. 104). Starting with initial sequence contigs that were unordered and unoriented, the program placed 90% of the initial sequence contigs in the correct orientation and 85% in the correct order with respect to one another. In a separate test, GigAssembler was tested on simulated draft data produced from finished sequence on chromosome 22 and similar results were obtained.

Some problems remain at all levels. First, errors in the initial sequence contigs persist in the merged sequence contigs built from them and can cause difficulties in the assembly of the draft genome sequence. Second, GigAssembler may fail to merge some overlapping sequences because of poor data quality, allelic differences or misassemblies of the initial sequence contigs; this may result in apparent local duplication of a sequence. We have estimated by various methods the amount of such artefactual duplication in the assembly from these and other sources to be about 100 Mb. On the other hand, nearby duplicated sequences may occasionally be incorrectly merged.
Some sequenced clones remain incorrectly placed on the layout, as discussed above, and others (<0.5%) remain unplaced. The fingerprint map has undoubtedly failed to resolve some closely related duplicated regions, such as the Williams region and several highly repetitive subtelomeric and pericentric regions (see below). Detailed examination and sequence finishing may be required to sort out these regions precisely, as has been done with chromosome Y (ref. 89). Finally, small sequenced-clone contigs with limited or no STS landmark content remain difficult to place. Full utilization of the higher resolution radiation hybrid map (the TNG map) may help in this (ref. 95).

Table 9 Distribution of PHRAP scores in the draft genome sequence

PHRAP score    Percentage of bases in the draft genome sequence
0–9            0.6
10–19          1.3
20–29          2.2
30–39          4.8
40–49          8.1
50–59          8.7
60–69          9.0
70–79          12.1
80–89          17.3
>90            35.9

PHRAP scores are a logarithmically based representation of the error probability. A PHRAP score of X corresponds to an error probability of 10^(-X/10). Thus, PHRAP scores of 20, 30 and 40 correspond to accuracy of 99%, 99.9% and 99.99%, respectively. PHRAP scores are derived from quality scores of the underlying sequence reads used in sequence assembly. See http://www.genome.washington.edu/UWGC/analysistools/phrap.htm.
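The conversion given in the note to Table 9 can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original analysis: the function names are hypothetical, and treating the last bin (">90") as starting at a score of 90 is an assumption made here so that the bins can be summed.

```python
# Percentage of bases per PHRAP score bin, copied from Table 9.
# Each key is the lower bound of a bin. Treating the last bin (">90")
# as starting at 90 is an assumption made for this sketch.
TABLE_9 = {0: 0.6, 10: 1.3, 20: 2.2, 30: 4.8, 40: 8.1,
           50: 8.7, 60: 9.0, 70: 12.1, 80: 17.3, 90: 35.9}


def phrap_error_probability(score: float) -> float:
    """Error probability 10^(-X/10) for a PHRAP score X."""
    return 10 ** (-score / 10)


def phrap_accuracy_percent(score: float) -> float:
    """Per-base accuracy, in percent, implied by a PHRAP score."""
    return 100 * (1 - phrap_error_probability(score))


def percent_of_bases_at_or_above(min_score: int) -> float:
    """Sum the Table 9 percentages for bins whose lower bound is >= min_score."""
    return sum(pct for lower, pct in TABLE_9.items() if lower >= min_score)


if __name__ == "__main__":
    for score in (20, 30, 40):
        print(f"PHRAP {score}: {phrap_accuracy_percent(score):.2f}% accuracy")
    share = percent_of_bases_at_or_above(40)
    print(f"{share:.1f}% of bases fall in bins with a score of 40 or more")
```

Under these assumptions, the bins with scores of 40 and above sum to 91.1% of bases, that is, bases with an implied accuracy of at least 99.99%.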
Future targeted FISH experiments and increased map continuity will also facilitate positioning of these sequences.

Genome coverage

We next assessed the nature of the gaps within the draft genome sequence, and attempted to estimate the fraction of the human genome not represented within the current version.

Gaps in draft genome sequence coverage. There are three types of gap in the draft genome sequence: gaps within unfinished sequenced clones; gaps between sequenced-clone contigs, but within fingerprint clone contigs; and gaps between fingerprint clone contigs. The first two types are relatively straightforward to close simply by performing additional sequencing and finishing on already identified clones. Closing the third type may require screening of additional large-insert clone libraries and possibly new technologies for the most recalcitrant regions. We consider these three cases in turn.

We estimated the size of gaps within draft clones by studying instances in which there was substantial overlap between a draft clone and a finished clone, as described above.
The average gap size in these draft sequenced clones was 554 bp, although the precise estimate was sensitive to certain assumptions in the analysis. Assuming that the sequence gaps in the draft genome sequence are fairly represented by this sample, about 80 Mb or about 3% (likely range 2–4%) of sequence may lie in the 145,514 gaps within draft sequenced clones.

The gaps between sequenced-clone contigs but within fingerprint clone contigs are more difficult to evaluate directly, because the draft genome sequence flanking many of the gaps is often not precisely aligned with the fingerprinted clones. However, most are much smaller than a single BAC. In fact, nearly three-quarters of these gaps are bridged by one or more individual BACs, as indicated by linking information from BAC end sequences. We measured the sizes of a subset of gaps directly by examining restriction fragment fingerprints of overlapping clones. A study of 157 'bridged' gaps and 55 'unbridged' gaps gave an average gap size of 25 kb. Allowing for the possibility that these gaps may not be fully representative and that some restriction fragments are not included in the calculation, a more conservative estimate of gap size would be 35 kb. This would indicate that about 150 Mb or 5% of the human genome may reside in the 4,076 gaps between sequenced-clone contigs.
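The totals quoted for the first two gap types follow from multiplying the number of gaps by the average gap size. The short Python sketch below reproduces that arithmetic from the figures in the text; the function and variable names are hypothetical and chosen only for illustration.

```python
def total_gap_mb(gap_count: int, mean_gap_bp: float) -> float:
    """Total sequence lying in gaps, in megabases (1 Mb = 1,000,000 bp)."""
    return gap_count * mean_gap_bp / 1_000_000


# Gaps within draft sequenced clones: 145,514 gaps averaging 554 bp.
within_clone_mb = total_gap_mb(145_514, 554)

# Gaps between sequenced-clone contigs: 4,076 gaps, evaluated with the
# measured 25 kb average and with the more conservative 35 kb estimate.
between_contig_measured_mb = total_gap_mb(4_076, 25_000)
between_contig_conservative_mb = total_gap_mb(4_076, 35_000)

if __name__ == "__main__":
    print(f"Within draft clones:         {within_clone_mb:.1f} Mb")
    print(f"Between contigs (25 kb avg): {between_contig_measured_mb:.1f} Mb")
    print(f"Between contigs (35 kb avg): {between_contig_conservative_mb:.1f} Mb")
```

The products come to roughly 80.6 Mb for gaps within draft clones and 142.7 Mb for gaps between sequenced-clone contigs at 35 kb each, in line with the rounded figures of about 80 Mb and about 150 Mb.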
This sequence should be readily obtained as the clones spanning them are sequenced.

The size of the gaps between fingerprint clone contigs was estimated by comparing the fingerprint maps to the essentially completed chromosomes 21 and 22. The analysis shows that the fingerprinted BAC clones in the global database cover 97–98% of the sequenced portions of those chromosomes (ref. 86). The published sequences of these chromosomes also contain a few small gaps (5 and 11, respectively) amounting to some 1.6% of the euchromatic sequence, and do not include the heterochromatic portion. This suggests that the gaps between contigs in the fingerprint map contain about 4% of the euchromatic genome. Experience with closure of such gaps on chromosomes 20 and 7 suggests that many of these gaps are less than one clone in length and will be closed by clones from other libraries.
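One plausible reading of the "about 4%" figure is that it adds the 2–3% of sequenced euchromatin not covered by fingerprinted BACs to the 1.6% that is missing from the published chromosome sequences. The text does not spell out this calculation, so the Python sketch below is an interpretation under two assumptions: the percentages are simply added, and the remaining sequence gaps are also absent from the fingerprint map. The function name is hypothetical.

```python
def fingerprint_gap_percent_range(coverage_low: float,
                                  coverage_high: float,
                                  sequence_gap_percent: float) -> tuple[float, float]:
    """Return (low, high) percent of euchromatin estimated to lie in fingerprint-map gaps.

    Assumes simple addition of the uncovered share of sequenced regions and
    the share missing from the published sequences (an interpretation, not a
    calculation stated in the text).
    """
    uncovered_low = 100 - coverage_high   # best case: highest BAC coverage
    uncovered_high = 100 - coverage_low   # worst case: lowest BAC coverage
    return (uncovered_low + sequence_gap_percent,
            uncovered_high + sequence_gap_percent)


if __name__ == "__main__":
    low, high = fingerprint_gap_percent_range(97, 98, 1.6)
    print(f"Estimated share in fingerprint-map gaps: {low:.1f}% to {high:.1f}%")
```

Under these assumptions the range is 3.6% to 4.6%, which is consistent with the rounded value of about 4%.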
However, recovery of sequence from these gaps represents the most challenging aspect of producing a complete finished sequence of the human genome.

As another measure of the representation of the BAC libraries, Riethman (ref. 109) has found BAC or cosmid clones that link to telomeric half-YACs or to the telomeric sequence itself for 40 of the 41 nonsatellite telomeres. Thus, the fingerprint map appears to have no substantial gaps in these regions. Many of the pericentric regions are also represented, but analysis is less complete here (see below).

Representation of random raw sequences. In another approach to measuring coverage, we compared a collection of random raw sequence reads to the existing draft genome sequence. In principle,

874 © 2001 Macmillan Magazines Ltd NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
