12.07.2015 Views

Initial sequencing and analysis of the human genome - Vitagenes

Initial sequencing and analysis of the human genome - Vitagenes

Initial sequencing and analysis of the human genome - Vitagenes

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

articleslong introns are more likely than short introns to be interrupted bygaps in <strong>the</strong> draft <strong>genome</strong> sequence. None<strong>the</strong>less, <strong>the</strong>re may be someresidual bias against long genes <strong>and</strong> long introns.There is considerable variation in overall gene size <strong>and</strong> intron size,with both distributions having very long tails. Many genes are over100 kb long, <strong>the</strong> largest known example being <strong>the</strong> dystrophin gene(DMD) at 2.4 Mb. The variation in <strong>the</strong> size distribution <strong>of</strong> codingsequences <strong>and</strong> exons is less extreme, although <strong>the</strong>re are still someremarkable outliers. The titin gene 276 has <strong>the</strong> longest currentlyknown coding sequence at 80,780 bp; it also has <strong>the</strong> largestnumber <strong>of</strong> exons (178) <strong>and</strong> longest single exon (17,106 bp).It is instructive to compare <strong>the</strong> properties <strong>of</strong> <strong>human</strong> genes withthose from worm <strong>and</strong> ¯y. For all three organisms, <strong>the</strong> typical length<strong>of</strong> a coding sequence is similar (1,311 bp for worm, 1,497 bp for ¯y<strong>and</strong> 1,340 bp for <strong>human</strong>), <strong>and</strong> most internal exons fall within acommon peak between 50 <strong>and</strong> 200 bp (Fig. 35a). However, <strong>the</strong>worm <strong>and</strong> ¯y exon distributions have a fatter tail, resulting in alarger mean size for internal exons (218 bp for worm versus 145 bpfor <strong>human</strong>). The conservation <strong>of</strong> preferred exon size across all threespecies supports suggestions <strong>of</strong> a conserved exon-based component<strong>of</strong> <strong>the</strong> splicing machinery 277 . Intriguingly, <strong>the</strong> few extremely short<strong>human</strong> exons show an unusual base composition. In 42 detected<strong>human</strong> exons <strong>of</strong> less than 19 bp, <strong>the</strong> nucleotide frequencies <strong>of</strong> A, G,T <strong>and</strong> C are 39, 33, 15 <strong>and</strong> 12%, respectively, showing a strongpurine bias. Purine-rich sequences may enhance splicing 278,279 , <strong>and</strong> itis possible that such sequences are required or strongly selected forto ensure correct splicing <strong>of</strong> very short exons. Previous studies haveshown that short exons require intronic, but not exonic, splicingenhancers 280 .In contrast to <strong>the</strong> exons, <strong>the</strong> intron size distributions differsubstantially among <strong>the</strong> three species (Fig. 35b, c). The worm <strong>and</strong>¯y each have a reasonably tight distribution, with most introns near<strong>the</strong> preferred minimum intron length (47 bp for worm, 59 bp for¯y) <strong>and</strong> an extended tail (overall average length <strong>of</strong> 267 bp for worm<strong>and</strong> 487 bp for ¯y). Intron size is much more variable in <strong>human</strong>s,with a peak at 87 bp but a very long tail resulting in a mean <strong>of</strong> morethan 3,300 bp. The variation in intron size results in great variationin gene size.The variation in gene size <strong>and</strong> intron size can partly be explainedby <strong>the</strong> fact that GC-rich regions tend to be gene-dense with manycompact genes, whereas AT-rich regions tend to be gene-poor withmany sprawling genes containing large introns. The correlation <strong>of</strong>gene density with GC content is shown in Fig. 36a, b; <strong>the</strong> relativedensity increases more than tenfold as GC content increases fromaFrequency0.200.180.160.140.120.100.080.060.040.020.000.300.320.340.360.380.400.420.440.460.480.500.520.540.560.580.600.620.640.66GC contentGenomeGenes0.680.70b 20Relative gene density1816141210864200.300.320.340.360.380.400.420.440.460.480.500.520.54GC content0.560.580.600.620.640.660.680.70c2,500140Median intron length2,0001,5001,0001201008060Median exon length50040200IntronExon0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65GC content0Figure 36 GC content. a, Distribution <strong>of</strong> GC content in genes <strong>and</strong> in <strong>the</strong> <strong>genome</strong>. For9,315 known genes mapped to <strong>the</strong> draft <strong>genome</strong> sequence, <strong>the</strong> local GC content wascalculated in a window covering ei<strong>the</strong>r <strong>the</strong> whole alignment or 20,000 bp centred around<strong>the</strong> midpoint <strong>of</strong> <strong>the</strong> alignment, whichever was larger. Ns in <strong>the</strong> sequence were notcounted. GC content for <strong>the</strong> <strong>genome</strong> was calculated for adjacent nonoverlapping 20,000-bp windows across <strong>the</strong> sequence. Both <strong>the</strong> gene <strong>and</strong> <strong>genome</strong> distributions have beennormalized to sum to one. b, Gene density as a function <strong>of</strong> GC content, obtained by taking<strong>the</strong> ratio <strong>of</strong> <strong>the</strong> data in a. Values are less accurate at higher GC levels because <strong>the</strong>denominator is small. c, Dependence <strong>of</strong> mean exon <strong>and</strong> intron lengths on GC content. Forexons <strong>and</strong> introns, <strong>the</strong> local GC content was derived from alignments to ®nished sequenceonly, <strong>and</strong> were calculated from windows covering <strong>the</strong> feature or 10,000 bp centred on <strong>the</strong>feature, whichever was larger.NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com © 2001 Macmillan Magazines Ltd897

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!