13.07.2015 Views

The Genom of Homo sapiens.pdf


GENE EXPRESSION PROFILING IN C. ELEGANS

Figure 2. Building the conceptual transcriptome. Conceptual transcripts were assembled with known UTRs for genes with EST coverage and predicted UTRs for other genes based on the distribution of known UTR lengths. Introns were excised from the coding DNA and UTRs. Predicted 3´UTRs were adjusted according to potential polyadenylation signals, and both 3´ and 5´UTRs were truncated where required to avoid overlapping other genes. In some cases, overestimated 3´UTR lengths were detected by abundant experimentally observed SAGE tags occurring at the penultimate NlaIII site (position 2). These predicted UTRs were truncated accordingly.

3´-most NlaIII site, mRNAs with a cut site in their 3´UTR would be missed if coding sequences alone were used to map tags. For the 12,272 gene models lacking confirmed 3´UTRs, the untranslated regions of processed transcripts were predicted using a method modified from that of Pleasance et al. (2003). UTR lengths were estimated based on size distributions that cover 95% of known UTRs. About 5,550 of the predicted 3´UTRs include a NlaIII site. Because the highest frequency SAGE tag for a transcript occurs at the first tag position, we used pooled SAGE data from more than a million SAGE tags to further refine the 3´UTR predictions for 1,449 gene models.

To determine how many transcripts we can identify, a meta-library of ~1.8 million tags was constructed by pooling all of the SAGE libraries (excluding longSAGE) in Table 2. A “specific” tag is defined as a tag that uniquely matches to a single gene or that can be resolved to a single gene by taking the lowest position match. To minimize the potential impact of sequencing errors, only tags with a cumulative phred score of 20 (Ewing and Green 1998) were considered.
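The definition of a “specific” tag can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the toy sequences, and the gene names are hypothetical. The sketch assumes that a 14-bp tag consists of the NlaIII recognition sequence CATG plus the 10 bases downstream, that tag positions are numbered from the 3´ end (position 1 is the 3´-most site), and that "lowest position match" means the tag is assigned to a gene only when exactly one gene carries it at the lowest observed position.

```python
from collections import defaultdict

NLAIII_SITE = "CATG"  # NlaIII recognition sequence
TAG_LENGTH = 14       # assumption: CATG plus the 10 bases downstream


def tags_by_position(transcript):
    """Return {position: tag}; position 1 is the 3'-most NlaIII site."""
    starts = []
    i = transcript.find(NLAIII_SITE)
    while i != -1:
        # Simplification: skip sites too close to the 3' end for a full tag.
        if i + TAG_LENGTH <= len(transcript):
            starts.append(i)
        i = transcript.find(NLAIII_SITE, i + 1)
    # Number the sites from the 3' end of the transcript.
    return {pos: transcript[s:s + TAG_LENGTH]
            for pos, s in enumerate(reversed(starts), start=1)}


def build_tag_index(transcripts):
    """Map each tag to the (position, gene) pairs at which it occurs."""
    index = defaultdict(list)
    for gene, seq in transcripts.items():
        for pos, tag in tags_by_position(seq).items():
            index[tag].append((pos, gene))
    return index


def resolve_specific(tag, index):
    """Return the gene a tag resolves to, or None if it is not specific."""
    hits = index.get(tag, [])
    genes = {gene for _, gene in hits}
    if len(genes) == 1:
        return next(iter(genes))  # tag matches a single gene
    if not hits:
        return None               # tag matches nothing
    lowest = min(pos for pos, _ in hits)
    best = {gene for pos, gene in hits if pos == lowest}
    # Resolved only if one gene holds the tag at the lowest position.
    return next(iter(best)) if len(best) == 1 else None


# Toy conceptual transcripts (hypothetical sequences, written 5' to 3').
transcripts = {
    "geneA": "TT" + "CATG" + "A" * 10 + "GG" + "CATG" + "C" * 10 + "TT",
    "geneB": "GG" + "CATG" + "C" * 10 + "AA" + "CATG" + "G" * 10 + "AA",
    "geneC": "AA" + "CATG" + "G" * 10 + "TT",
}
index = build_tag_index(transcripts)

print(resolve_specific("CATG" + "A" * 10, index))  # geneA (unique match)
print(resolve_specific("CATG" + "C" * 10, index))  # geneA (position 1 beats geneB position 2)
print(resolve_specific("CATG" + "G" * 10, index))  # None (position 1 in both geneB and geneC)
```

The third lookup stays unresolved because two genes carry the same tag at position 1, which is the kind of ambiguity that related genes with similar 3´ ends would produce.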
A score of Phred20 corresponds to a 99% probability that a base is called correctly. In this case, the score represents the average sequence quality of the entire tag sequence. A total of 26,682 specific tags corresponding to mRNAs for nuclear genes were observed. The total number of genes whose expression was detected by a SAGE tag for at least one transcript was 14,661. A distinct advantage of the SAGE technique is its ability to discriminate between alternative splice variants. Indeed, 7,073 (49%) of the detected genes are represented by two or more tags. A subset of just 1,126 (8%) of these genes have previously observed alternative splice variants documented in WormBase. Even among these previously well-studied genes, over 800 have multiple tags, potentially representing previously unobserved splice variants.

A Comparison of Short (14 bp) Versus Long (21 bp) SAGE Tags

Until very recently, all SAGE libraries were constructed using the tagging enzyme BsmFI, which generates a 14-bp tag. Theoretically, a 14-bp tag is sufficient to unambiguously identify any gene in the C. elegans genome. In practice, not all tags map unambiguously to a single location. Two factors contribute to this ambiguity. First, there are multigene families stemming from ancestral sequence duplications; these related genes can share similar 3´ ends. Second, there appears to be some sequence compositional bias in 3´UTRs that tend to be AT-
