13.07.2015 Views

The Genom of Homo sapiens.pdf


18 WATERSTON ET AL.

Table 3. Genes with Disruptive Variants

Gene ID                   cDNA         Genomic             BACS     DNA PDR
RefSeq ID   Gene name     sequence     sequence     Codon  gs/cdna  gs/het/cDNA
gi13994299  TMPIT         cttCAGaac    CttTAGacc    42     3:3      0:0:24
NM_031925                 L Q N        L * N
gi15706488  hypothetical  gaatgCGAgcc  gaatgTGAgcc  2      nd       2:11:9
BC012777                  M R A        M * A
gi16554448  zonadhesin    ccgtTGTcgg   CcgtT*Ttcgg  1,922  nd       9:7:7
NM_003386                 R C R        R F G

To avoid pseudogenes and other false positives in the Chromosome 7 predicted gene set, we developed a protocol that exploited newly available draft genome sequence of the mouse. We reasoned that recently derived pseudogenes are problematic because they most closely resemble functional genes, whereas older pseudogenes have degraded through neutral drift and are less likely to be confused with genes. With 75 million years separating mouse and human from their last common ancestor, only pseudogenes that arose after the split would present problems. Since processed pseudogenes in general do not insert near their source gene, any processed pseudogene that arose in one organism thus would not have a counterpart in the corresponding portion of the other genome. (For convenience, we call regions in the mouse and human genomes that descended from a common ancestor orthologous, similar to the usage that has been adopted for genes.) Thus, for regions where the orthologous relationships have been established between the two genomes (and this covers 90% of the human genome), genes are likely to have counterparts in both organisms, whereas processed pseudogenes will not. Furthermore, having arisen since the divergence of the two species, the closest relative of the pseudogene will lie within the same genome, rather than in the second genome.

The situation is less clear with pseudogenes arising from duplication.
Duplications may be at a distance, in which case orthology may be a useful discriminator, but often they are near and even tandemly arranged with respect to the source gene. In some cases, both copies may be functional members of a gene family, with one or both copies adopting new functions. Nonetheless, those copies that become pseudogenes may be recognized, because the copy may be incomplete or may have drifted significantly from the original, often with frameshift mutations or other chain-termination mutations disrupting the reading frame. The functional gene will remain most similar to the orthologous gene in the second organism.

With this screening procedure in mind, we set about to obtain a comprehensive set of putative genes, reasoning that false positives could be recognized and removed by careful examination of the orthologous regions between mouse and human. We used three different gene prediction programs, FGENESH2 (Solovyev 2001), TwinScan (Korf et al. 2001), and GeneWise (Birney and Durbin 2000), to recognize as many elements as possible. The first two use comparative sequence information (without using orthology) to modify ab initio predictions and improve accuracy, whereas GeneWise uses protein homologies to seed predictions. We used the mouse sequence as the informant sequence for TwinScan and FGENESH2 and all available protein predictions for GeneWise. TwinScan produced 1,350 gene models, FGENESH2 3,793 models, and GeneWise 22,326 models. (GeneWise predicts multiple, alternatively spliced models in any given region.) The combined output included 90% of all known exons and at least one exon from 98% of the known genes.

Although the combined output from the three programs is thus reasonably sensitive, there are undoubtedly a large number of false-positive predictions, including many pseudogenes.
To eliminate most of these unwanted predictions, we used each prediction to search the orthologous region of the mouse genome for a matching gene. In turn, the matching mouse genes were used to search for matches in the orthologous region of the human genome. The matches in the orthologous regions had to be the best or close to the best in the entire genome. Furthermore, single-exon genes were removed if they had matches to multi-exon genes in either genome. Generally, at any one site only the best reciprocally matching pair was kept (near-best for local duplications). This left us with 728 FGENESH2, 284 TwinScan, and 400 GeneWise predictions. To eliminate redundancy between the various gene models, for each region we accepted only one prediction, taking in order known genes, FGENESH2, TwinScan, and GeneWise (the order was based on the performance of the three programs against known genes) and giving priority to models with the best reciprocal matches. Models that showed signs of nonfunctionality (early truncation compared to closely related genes, absence of introns in gene models with closely related genes with exons) and high homology to L1/reverse transcriptase were also removed. This yielded an additional 545 predicted genes.

We examined this gene set in several ways to determine how complete and accurate it might be. On average, the predicted genes, when compared to the known genes, have fewer exons (6.7 vs. 9.5), have fewer coding bases (1,231 vs. 1,457), and span less of the genome (28.5 kb vs. 61.4 kb), suggesting that the predicted genes either lack terminal exons or include some fragmented genes that reduce the average. A high fraction of both known genes (92%) and predicted genes (95%) have reciprocal best matches in the mouse genome at the orthologous position, whereas the others have near-best matches. This is

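The reciprocal-match screen and the one-prediction-per-region rule described in the text can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The `Hit` record, the `near_best` score ratio, and both function names are hypothetical, and alignment scores are assumed to come from some external search step. The text does not say exactly how program order and reciprocal-match quality were combined, so the sketch assumes program order comes first and match score breaks ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str         # gene model used as the search query
    target: str        # gene model it matched in the other genome
    score: float       # alignment score, higher is better (assumed)
    orthologous: bool  # True if the target lies in the region orthologous to the query


def best_hits(hits, near_best=1.0):
    """For each query, keep orthologous-region targets whose score is at
    least `near_best` times the query's best score anywhere in the genome."""
    top = {}
    for h in hits:
        top[h.query] = max(top.get(h.query, float("-inf")), h.score)
    kept = {}
    for h in hits:
        if h.orthologous and h.score >= near_best * top[h.query]:
            kept.setdefault(h.query, set()).add(h.target)
    return kept


def reciprocal_pairs(human_to_mouse, mouse_to_human, near_best=1.0):
    """Return (human, mouse) pairs that pass the screen in both directions."""
    fwd = best_hits(human_to_mouse, near_best)
    rev = best_hits(mouse_to_human, near_best)
    return {(h, m) for h, targets in fwd.items()
            for m in targets if h in rev.get(m, set())}


# Acceptance order stated in the text: known genes first, then the three programs.
PRIORITY = ["known", "FGENESH2", "TwinScan", "GeneWise"]


def pick_one_per_region(models):
    """models: iterable of (region, source, name, match_score) tuples.
    Keeps one model per region: the earliest source in PRIORITY, with the
    higher match score as tie-breaker (an assumption of this sketch)."""
    chosen = {}
    for region, source, name, score in models:
        key = (PRIORITY.index(source), -score)
        if region not in chosen or key < chosen[region][0]:
            chosen[region] = (key, name)
    return {region: name for region, (_, name) in chosen.items()}
```

In this sketch, a processed pseudogene that arose after the human-mouse split is rejected because its best match in the other genome lies outside the orthologous region. Setting `near_best` below 1.0 admits the near-best matches that the text allows for local duplications.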