13.07.2015 Views

The Genom of Homo sapiens.pdf



16 WATERSTON ET AL.

there are unlikely to be many large, recently duplicated segments of the genome yet to be identified in regions where the above criteria have been used to judge overlaps.

Second, there are multiple regions of the genome where there is locally high variation, sometimes extending over tens of kilobases. Whether these are present because of balanced selection or because they represent by chance the tail end of the distribution from the last coalescent is uncertain (Charlesworth et al. 1997).

We have examined the regions for features that might suggest a basis for such balancing selection. Fifteen were in the region of some gene, including the known genes CRYPTIC, TSSC1, LRP1B, and GYPE. However, others contain no known features. Regardless of the basis, if variation has been accumulating at neutral rates, these regions have been maintained independently in the population since well before the founding of the species and, in some cases, almost equal to the time of the chimp–human divergence.

GENES

Finding genes in the draft human genome sequence was challenging, and the results were often inconsistent (Hogenesch et al. 2001). Even with finished, highly accurate genomic sequence, the extensive amount of noncoding sequence and the abundant pseudogenes present a significant challenge to gene prediction. The concerted efforts to obtain full-length sequences for cDNAs have been invaluable in annotation (Strausberg et al. 2002), providing direct evidence for an increasing number of genes. However, as yet, the collections for human remain incomplete and contain errors. Furthermore, the assignment of the sequence to its source in the genome may be complicated by the presence of close paralogs, recently formed pseudogenes, and polymorphisms.
Another potentially powerful complementary approach comes from comparative sequence analysis, where conserved features such as genes stand out against neutrally evolving sequence. The recent release of a draft mouse sequence (Waterston et al. 2002) provides an opportunity to apply this approach genome-wide to human. As the first of many other mammalian sequences that will become available in the next few years, it provides a taste of the power to come from comparative analysis.

In annotating the finished sequence of Chromosome 7, we combined cDNAs, gene models from gene prediction programs, and comparison with the mouse sequence to derive a gene set that accurately reflects available experimental evidence and offers improved discrimination between genes and pseudogenes. In the course of these studies, we uncovered genes with apparently active and inactive forms present in the population. Such variants might be usefully exploited to learn more about the role of the gene in human biology.

KNOWN GENES: cDNAs

We began our efforts to define the genes on Chromosome 7 with the extensive RefSeq (Pruitt and Maglott 2001) and the MGC (Strausberg et al. 2002) cDNA collections. We aligned all 14,769 RefSeq and 10,047 MGC sequences against the entire genome sequence, allowing each cDNA to confirm only a single genomic locus. Spliced forms were favored over single or minimally spliced forms to avoid processed pseudogenes, and only the single match with the highest percent identity was kept to avoid confusion with recent paralogs.

In aligning these cDNAs against the genome, we noticed certain discrepancies that complicated interpretation. Alignment of the cDNAs to the genome, although often straightforward, was erroneous in many cases using any of the several programs. In our experience, Spidey (Wheelan et al. 2001) gave the most reliable results, aligning about 65% of the cDNAs and 79% of their exons accurately against the genomic sequence. But for any given gene, BLAT (Kent 2002) or EST_GENOME (Mott 1997) may give a better alignment (where the quality of the alignment is judged by minimizing base substitutions between the cDNA and genomic sequence and the avoidance of noncanonical splice sites). Every case was manually reviewed, and in some cases, manual review uncovered better alignments than did any of the available programs.

Even after exhaustive efforts to obtain an optimal alignment, some problems remained. For example, 23 mRNAs had no similarity to any mouse gene, and the translation product had no similarity to any known protein. Although these could be true genes, it seems more likely that they represent untranslated segments of bona fide genes. Nonetheless, these were kept in the current gene set. Eight others contained only a single, very short open reading frame (
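
The rule described above for assigning each cDNA to a single genomic locus (spliced placements preferred over single or minimally spliced ones, then the highest percent identity) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Alignment` record, its field names, the function names, and the `min_spliced_exons` cutoff are all assumptions made for illustration, since the text does not say how "minimally spliced" was defined.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Alignment:
    """One candidate placement of a cDNA on the genome (hypothetical record)."""
    cdna_id: str             # cDNA accession
    locus: str               # label for the genomic locus
    exon_count: int          # number of exons in this placement
    percent_identity: float  # identity between cDNA and genomic sequence


def rank(aln: Alignment, min_spliced_exons: int) -> Tuple[bool, float]:
    # Spliced placements sort above single or minimally spliced ones;
    # percent identity breaks ties within each group.
    return (aln.exon_count >= min_spliced_exons, aln.percent_identity)


def pick_single_locus(
    alignments: Iterable[Alignment],
    min_spliced_exons: int = 3,  # assumed cutoff, not taken from the text
) -> Dict[str, Alignment]:
    """Keep exactly one placement per cDNA, using the ranking above."""
    best: Dict[str, Alignment] = {}
    for aln in alignments:
        current = best.get(aln.cdna_id)
        if current is None or rank(aln, min_spliced_exons) > rank(current, min_spliced_exons):
            best[aln.cdna_id] = aln
    return best


candidates = [
    Alignment("cdna_1", "locus_A", 1, 99.9),  # unspliced: possible processed pseudogene
    Alignment("cdna_1", "locus_B", 6, 99.5),  # spliced, highest identity among spliced
    Alignment("cdna_1", "locus_C", 6, 97.8),  # spliced: possible recent paralog
    Alignment("cdna_2", "locus_D", 1, 98.0),  # only placement available for this cDNA
]
chosen = pick_single_locus(candidates)
```

In this toy input, the unspliced placement at `locus_A` has the highest identity but still loses to the spliced placement at `locus_B`, which mirrors the stated aim of avoiding processed pseudogenes. A cDNA with only an unspliced placement is still kept, because the rule expresses a preference rather than a filter.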
