12.07.2015 Views

Initial sequencing and analysis of the human genome - Vitagenes



the fraction of reads matching the draft genome sequence should provide an estimate of genome coverage. In practice, the comparison is complicated by the need to allow for repeat sequences, the imperfect sequence quality of both the raw sequence and the draft genome sequence, and the possibility of polymorphism. Nonetheless, the analysis provides a reasonable view of the extent to which the genome is represented in the draft genome sequence and the public databases.

We compared the raw sequence reads against both the sequences used in the construction of the draft genome sequence and all of GenBank using the BLAST computer program. Of the 5,615 raw sequence reads analysed (each containing at least 100 bp of contiguous non-repetitive sequence), 4,924 had a match of ≥ 97% identity with a sequenced clone, indicating that 88 ± 1.5% of the genome was represented in sequenced clones. The estimate is subject to various uncertainties. Most serious is the proportion of repeat sequence in the remainder of the genome.
If <strong>the</strong> unsequencedportion <strong>of</strong> <strong>the</strong> <strong>genome</strong> is unusually rich in repeated sequence,we would underestimate its size (although <strong>the</strong> excess would becomprised <strong>of</strong> repeated sequence).We examined those raw sequences that failed to match bycomparing <strong>the</strong>m to <strong>the</strong> o<strong>the</strong>r publicly available sequence resources.Fifty (0.9%) had matches in public databases containing cDNAsequences, STSs <strong>and</strong> similar data. An additional 276 (or 43% <strong>of</strong> <strong>the</strong>remaining raw sequence) had matches to <strong>the</strong> whole-<strong>genome</strong> shotgunreads discussed above (consistent with <strong>the</strong> idea that <strong>the</strong>se readscover about half <strong>of</strong> <strong>the</strong> <strong>genome</strong>).We also examined <strong>the</strong> extent <strong>of</strong> <strong>genome</strong> coverage by aligning <strong>the</strong>cDNA sequences for genes in <strong>the</strong> RefSeq dataset 110 to <strong>the</strong> draft<strong>genome</strong> sequence. We found that 88% <strong>of</strong> <strong>the</strong> bases <strong>of</strong> <strong>the</strong>se cDNAscould be aligned to <strong>the</strong> draft <strong>genome</strong> sequence at high stringency (atleast 98% identity). 
(A few of the alignments with either the random raw sequence reads or the cDNAs may be to a highly similar region in the genome, but such matches should affect the estimate of genome coverage by considerably less than 1%, based on the estimated extent of duplication within the genome (see below).)

These results indicate that about 88% of the human genome is represented in the draft genome sequence and about 94% in the combined publicly available sequence databases. The figure of 88% agrees well with our independent estimates above that about 3%, 5% and 4% of the genome reside in the three types of gap in the draft genome sequence.

Finally, a small experimental check was performed by screening a large-insert clone library with probes corresponding to 16 of the whole genome shotgun reads that failed to match the draft genome sequence. Five hybridized to many clones from different fingerprint clone contigs and were discarded as being repetitive.
Of the remaining eleven, two fell within sequenced clones (presumably within sequence gaps of the first type), eight fell in fingerprint clone contigs but between sequenced clones (gaps of the second type) and one failed to identify clones in the fingerprint map (gaps of the third type) but did identify clones in another large-insert library. Although these numbers are small, they are consistent with the view that much of the remaining genome sequence lies within already identified clones in the current map.

Estimates of genome and chromosome sizes. Informed by this analysis of genome coverage, we proceeded to estimate the sizes of the genome and each of the chromosomes (Table 8). Beginning with the current assigned sequence for each chromosome, we corrected for the known gaps on the basis of their estimated sizes (see above). We attempted to account for the sizes of centromeres and heterochromatin, neither of which is well represented in the draft sequence. Finally, we corrected for around 100 Mb of artefactual duplication in the assembly.
We arrived at a total human genome size estimate of around 3,200 Mb, which compares favourably with previous estimates based on DNA content.

We also independently estimated the size of the euchromatic portion of the genome by determining the fraction of the 5,615 random raw sequences that matched the finished portion of the human genome (whose total length is known with greater precision). Twenty-nine per cent of these raw sequences found a match among 835 Mb of nonredundant finished sequence. This leads to an estimate of the euchromatic genome size of 2.9 Gb. This agrees reasonably with the prediction above based on the length of the draft genome sequence (Table 8).

Update. The results above reflect the data on 7 October 2000. New data are continually being added, with improvements being made to the physical map, new clones being sequenced to close gaps and draft clones progressing to full shotgun coverage and finishing. The draft genome sequence will be regularly reassembled and publicly released.

Currently, the physical map has been refined such that the number of fingerprint clone contigs has fallen from 1,246 to 965; this reflects the elimination of some artefactual contigs and the closure of some gaps.
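The euchromatic size estimate and the gap-based coverage figure reported earlier in this section both reduce to simple proportions. The sketch below reproduces that arithmetic under the assumption that the random reads sample the euchromatic genome uniformly. The function name is hypothetical and the code illustrates the reasoning only, not the authors' actual calculation.

```python
def size_from_sampled_fraction(known_length_mb, matched_fraction):
    """Estimate a total length from a region of known length.

    If a fraction f of uniformly sampled reads falls in a region of
    known length L, the total length is approximately L / f.
    """
    if not 0 < matched_fraction <= 1:
        raise ValueError("matched_fraction must be in (0, 1]")
    return known_length_mb / matched_fraction


# Figures from the text: 29% of reads matched 835 Mb of finished sequence.
euchromatic_mb = size_from_sampled_fraction(835, 0.29)

# Figures from the text: about 3%, 5% and 4% of the genome lie in the
# three types of gap, so the rest should be present in the draft sequence.
gap_fractions = (0.03, 0.05, 0.04)
coverage_from_gaps = 1 - sum(gap_fractions)

print(f"euchromatic genome size: {euchromatic_mb / 1000:.1f} Gb")
print(f"coverage implied by gap estimates: {coverage_from_gaps:.0%}")
```

With the reported inputs this gives about 2.9 Gb for the euchromatic genome and 88% coverage from the gap estimates, in line with the values stated in the text.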
The sequence coverage has risen such that 90% of the human genome is now represented in the sequenced clones and more than 94% is represented in the combined publicly available sequence databases. The total amount of finished sequence is now around 1 Gb.

Broad genomic landscape

What biological insights can be gleaned from the draft sequence? In this section, we consider very large-scale features of the draft genome sequence: the distribution of GC content, CpG islands and recombination rates, and the repeat content and gene content of the human genome. The draft genome sequence makes it possible to integrate these features and others at scales ranging from individual

Figure 10 Screen shot from UCSC Draft Human Genome Browser. See http://genome.ucsc.edu/.

Figure 11 Screen shot from the Genome Browser of Project Ensembl. See http://www.ensembl.org.

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd 875
