13.07.2015 Views

The Genom of Homo sapiens.pdf


VARIATION ON CHROMOSOME 7

expected, given the methods used to build the set, but nonetheless their conservation strongly supports their validity.

Since we had not used ESTs in building the gene set, we used them as an independent measure of the representation of the set. We found 41,399 spliced ESTs with their best match to Chromosome 7. Of these, 93% at least partially overlapped an exon of the gene set, and an additional 1% lay near or within existing genes, suggesting that they might represent alternative splice forms or exons missing from the predicted genes. The remainder lacked significant open reading frames, and none satisfied the reciprocal match criteria used in making the gene predictions. Only 5% of the remainder had any match to mouse sequence. These unmatched spliced ESTs may nonetheless represent missed genes, although at present there is little corroborating evidence that they derive from protein-coding genes.

PSEUDOGENES

We attempted to identify pseudogenes directly, adapting a method used previously (Waterston et al. 2002; Zdobnov et al. 2002). We inspected all the intervals between known and predicted genes for sequence that yielded translation products with similarity to known proteins. Altogether we identified 941 such regions. More than one pseudogene may lie in any one interval, and old, largely degraded pseudogenes would be missed using the thresholds used here. As a result, we probably have undercounted pseudogenes.

We then evaluated the validity of our classifications to determine how often likely pseudogenes were included in the gene set and how many excluded genes were later found in the pseudogene set. We reasoned that genes should largely be under purifying selection, and pseudogenes should be subject to neutral drift.
These differences in evolutionary pressures would produce differences in the ratio of nonsynonymous to synonymous substitutions (Ka/Ks ratio) in the coding portion of the genes or pseudogenes (Ohta and Ina 1995). Positive selection acting on genes will increase the Ka/Ks ratio, but generally the positive selection is limited to specific domains. Only rarely will positive selection act so broadly across a gene as to elevate the Ka/Ks ratio to near or above that of neutrally evolving sequence.

Of the 941 regions identified as containing likely pseudogenes, nearly all (97% ± 3%) had Ka/Ks ratios consistent with neutrally evolving sequence, supporting our classification. As with the mouse genome analysis, a significant fraction of the predicted pseudogenes (33%) had as yet no disruption to the reading frame. Virtually all the predicted pseudogenes could be aligned to another region of the human genome with higher sequence identity than to any region of the mouse genome, consistent with an origin after the mouse–human divergence.

We also attempted to classify the pseudogenes by origin, by using the orthologous mouse region for related sequence. For 88% (573/654) of the identified pseudogenes, no related sequence in the orthologous mouse region was found; these are likely to represent processed pseudogenes and are broadly distributed through the chromosome. Another 12% (81/654) did have related sequence in the orthologous mouse region, suggesting they were derived by segmental duplication. Indeed, these lie predominantly in segmentally duplicated regions of the chromosome.

We carried out the same analysis on the 1,152 members of the gene set. Only 5% ± 3% had a ratio consistent with neutral selection, suggesting the set is relatively free of pseudogenes.
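
The Ka/Ks reasoning above can be illustrated with a minimal sketch. It assumes that per-region estimates of Ka and Ks are already available from some alignment-based method. The cutoff value, the `Region` type, and the function names are hypothetical and chosen only for illustration; the text does not state the threshold or software the authors used.

```python
from dataclasses import dataclass

# ASSUMPTION: illustrative cutoff only. Neutrally evolving sequence is expected
# to have Ka/Ks near 1, and sequence under purifying selection well below 1.
# The value actually used in the analysis is not given in the text.
NEUTRAL_CUTOFF = 0.5


@dataclass
class Region:
    name: str
    ka: float  # nonsynonymous substitutions per nonsynonymous site
    ks: float  # synonymous substitutions per synonymous site


def ka_ks_ratio(region: Region) -> float:
    """Return Ka/Ks, refusing regions with no synonymous change."""
    if region.ks <= 0:
        raise ValueError(f"Ks must be positive for region {region.name}")
    return region.ka / region.ks


def classify(region: Region, cutoff: float = NEUTRAL_CUTOFF) -> str:
    """Label a region as evolving neutrally or under purifying selection."""
    if ka_ks_ratio(region) >= cutoff:
        return "neutral-like"
    return "constrained"


def fraction_neutral(regions: list[Region], cutoff: float = NEUTRAL_CUTOFF) -> float:
    """Fraction of regions whose ratio is consistent with neutral evolution."""
    if not regions:
        raise ValueError("no regions supplied")
    hits = sum(1 for r in regions if classify(r, cutoff) == "neutral-like")
    return hits / len(regions)
```

Under these assumptions, a set dominated by pseudogenes would give a value of `fraction_neutral` close to 1, and a clean gene set a value close to 0, which is the contrast the two percentages in the text describe.
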
The total of 1,152 genes is a relatively modest number. Extrapolating to the genome, this would suggest that the human genome contains some 25,000 genes. The total number of genes is only slightly more than the number of pseudogenes found and is about 40% less than the number of genes predicted in another analysis of Chromosome 7 sequence (Scherer et al. 2003). Our approach has been deliberately conservative, but several points of our analysis suggest that our count is fairly accurate. By Ka/Ks analysis, only a few (0–60) of the pseudogenes are likely to be functional. The gene set covers the vast majority of the ESTs that have their best match to Chromosome 7, and those few that fall outside the gene set do not seem likely to be protein-coding. Perhaps much of the difference between our estimate and that of others lies in our treatment of pseudogenes.

CONCLUSION

Our initial analysis of the content of human Chromosome 7 illustrates some of the challenges that lie ahead, even with an accurate, complete sequence. It also suggests some avenues available for understanding the genome.

An immediate goal must be defining the parts list, that is, all the functional elements of the genome. At present, obtaining even the protein-coding gene set remains a difficult and complex task. The available experimentally determined cDNA sequences remain incomplete, and both these and, to a lesser extent, the genome sequence contain errors.
Alignment of the cDNA sequences to the genome is not always straightforward and is complicated by polymorphism. Gene prediction programs give only partial answers and are often confounded by the abundant pseudogenes in the human genome. Comparative sequence analysis, using the mouse as the informative sequence, improves the accuracy of exon prediction in new programs such as TwinScan and FGENESH2. Further processing of the results exploiting the conserved synteny between mouse and human to establish orthologous relationships helps substantially in distinguishing genes from pseudogenes. As the sequences of additional mammalian genomes become available over the next few years, the description of the gene set will become increasingly accurate and complete. These additional sequences will also facilitate the identification of other functional sequences, such as those for noncoding RNAs and regulation of gene expression. This pathway, combined with ongoing experimental testing and validation, holds the promise of a relatively complete parts list in just a few years’ time.

Understanding how those parts function and how they contribute to human disease when they fail to function
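
The counts quoted in the pseudogene and gene-number discussion can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text (941 regions, 1,152 genes, 573/654 and 81/654, "about 40% less", "some 25,000 genes"). The helper name `count_range` is hypothetical, and the derived values are implied approximations, not numbers reported by the authors.

```python
PSEUDOGENE_REGIONS = 941      # regions with likely pseudogenes (from the text)
GENE_SET = 1152               # members of the gene set (from the text)
GENOME_ESTIMATE = 25_000      # extrapolated genome-wide gene count (from the text)


def count_range(total: int, percent: float, margin: float) -> tuple[float, float]:
    """Convert 'percent ± margin' of a total into a (low, high) count range."""
    low = max(0.0, percent - margin)
    high = min(100.0, percent + margin)
    return (total * low / 100.0, total * high / 100.0)


# 97% ± 3% of pseudogene regions look neutral, so 3% ± 3% do not.
# This is the range behind the "only a few (0-60)" statement.
possibly_functional = count_range(PSEUDOGENE_REGIONS, 3.0, 3.0)

# 5% ± 3% of the gene set has a ratio consistent with neutral evolution.
possibly_pseudogenes_in_gene_set = count_range(GENE_SET, 5.0, 3.0)

# Origin classes among the 654 pseudogenes that were classified.
processed_pct = 100.0 * 573 / 654
duplicated_pct = 100.0 * 81 / 654

# "About 40% less" than the other analysis implies roughly this many genes
# in that analysis (an implied figure, not one quoted in the text).
implied_other_count = GENE_SET / (1.0 - 0.40)

# Share of all genes that Chromosome 7 would hold under the 25,000 estimate.
implied_chr7_share_pct = 100.0 * GENE_SET / GENOME_ESTIMATE

if __name__ == "__main__":
    print("possibly functional pseudogenes:", possibly_functional)
    print("possible pseudogenes in gene set:", possibly_pseudogenes_in_gene_set)
    print("processed / duplicated (%):", round(processed_pct), round(duplicated_pct))
    print("implied count in other analysis:", round(implied_other_count))
    print("implied Chromosome 7 share (%):", round(implied_chr7_share_pct, 1))
```

The upper bound of the first range is close to the 60 quoted in the text, and the two origin percentages round to the 88% and 12% given there, so the reported figures are internally consistent under this reading.
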
