13.07.2015 Views

The Genom of Homo sapiens.pdf


MOUSE GENOME ENCYCLOPEDIA PROJECT 199

[...]tained 33,994 unique sequences. We had intentionally avoided redundant efforts to sequence the known genes present in public databases, so by adding 44,122 redundant sequences of cDNA known in public databases, we established a database carrying 36,830 unique sequences. This we named Representative Transcript and Protein Set (RTPS). RTPSs are the unique FL cDNA sequences encoding protein coding or protein noncoding transcripts supported by physical clones.

In the process of selecting the 60,770 sequences to subject to full-stretch sequencing, many unique sequences, contained by 1,916,592 clones of the original master bank, were missed. Using the new version (MGSCv3) of the mouse genome sequence database that was provided by the Mouse Genome Sequence Consortium (MGSC) as a part of the collaboration, it was concluded that at least 43,000 unique transcripts are encoded in the mouse genome. It was estimated that the total number of “genes,” a word that should be used very carefully (see the next section), is around 60,000.

TRANSCRIPTIONAL UNIT

To analyze the total number of genes for subsequent research, words with ambiguous definitions, such as “gene,” “locus,” and so on, should be avoided. In the genome community, the word “gene” is used for “the protein-coding genes (or loci)”. This was tacitly defined due to the convenience of computer-assisted exon prediction, but genome sequences do not give any proof of the existence of transcripts. Computer prediction of exons is based on open reading frames (ORF).
However, in transcriptome analysis, we meet a lot of “noncoding genes,” which should also be incorporated into the transcriptome. Thus, we should not use the word “gene” or “locus” to discuss a genomic region which is transcribed, the so-called “gene” in the past definition.

For this reason, we coined a new term, transcriptional unit (TU). A TU is defined as a segment of the genome flanked by the most distal exons from which transcripts are generated. This term is purely computational and unequivocal. Transcripts sharing any exon are encoded by a single TU. If two transcripts do not share any single exon, these two transcripts are in different TUs, even if one is localized in the intron of the other. Where two transcripts encode the sense and antisense strands, respectively, in the same region of genomic DNA, these two genomic segments are different TUs. Thus, by the definition of the new term “TU,” we could count the number of so-called “genes” (Okazaki et al. 2002).

Figure 5. Contribution of RIKEN FANTOM2 clones in the world. The FANTOM2 collection covers 90% of all TUs published in the world.

CONTRIBUTION OF RIKEN FANTOM2 CLONES INTERNATIONALLY (APRIL 2003)

After our contribution to the world-wide mouse gene (TU) database, the number in the RTPS has increased to 37,086. As shown in Figure 5, the RIKEN FANTOM2 collection covers 90% of the total TUs (33,459/37,086); although the remaining 10%, including many known TUs, were covered by our master bank, they were not sequenced, to avoid redundant effort. A significant proportion (24% of TUs; 8,898/37,086) of the RTPS is also covered by the Mammalian Gene Collection (MGC) supported by the National Institutes of Health (Strausberg et al. 2002).
From this analysis, we estimate that a significant proportion of mouse TUs have still not been covered, much less the whole transcriptome, which includes variant transcripts with alternative splicing and differential 5′ and 3′ sites.

FUNCTIONAL CLASSIFICATION OF MOUSE TRANSCRIPTOME AND NONPROTEIN CODING TRANSCRIPTS

All sequence data were classified by level of homology [...]

Figure 6. The breakdown of RIKEN FANTOM2 clones to the categories defined below. (1) Directly assigned by MGI, (2) Mouse DNA Hit (>98% ID, >100 bp) Complete, (3) Mouse DNA Hit (>98% ID, >100 bp) Partial, (4) Protein Hit (>98% ID, >100% length, mouse) Complete, (5) Protein Hit (>85% ID, >90% length) Complete, (6) Protein Hit (>85% ID, >90% length) Partial, (7) Protein Hit (>70% ID, >70% length) Complete, (8) Protein Hit (>70% ID, >70% length) Partial, (9) Protein Hit (>50% ID, >50% length) Complete, (10) Protein Hit (>50% ID, >50% length) Partial, (11) Inferred from TIGR/UniGene Cluster, (12) Inferred from UniGene Cluster, (13) Inferred from TIGR Cluster, (14) InterPro domain/motif containing, (15) MDS domain/motif containing, (16) SCOP structural domain/motif containing, (17) hypothetical protein, (18) EST hit, (19) unclassifiable.
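
The TU definition given in the TRANSCRIPTIONAL UNIT section (transcripts sharing any exon belong to one TU; intronic and antisense transcripts without a shared exon do not) lends itself to a simple clustering procedure. The following is only a minimal sketch and not the authors' pipeline. It assumes that "sharing an exon" means identical exon coordinates on the same chromosome and strand, which the text does not specify. The names `Exon`, `group_into_tus`, `tu_span`, and the toy transcripts `tA` to `tD` are hypothetical and introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Exon(NamedTuple):
    chrom: str
    strand: str  # "+" or "-"
    start: int   # inclusive genomic coordinate
    end: int     # inclusive genomic coordinate


def group_into_tus(transcripts: Dict[str, List[Exon]]) -> List[Set[str]]:
    """Group transcript IDs so that transcripts sharing any exon end up together.

    Assumption: two transcripts share an exon only if the exon records are
    identical (same chromosome, strand, start, and end).
    """
    parent = {tid: tid for tid in transcripts}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: Dict[Exon, str] = {}
    for tid, exons in transcripts.items():
        for exon in exons:
            if exon in first_seen:
                # Merge this transcript with the one that already uses the exon.
                root_a, root_b = find(first_seen[exon]), find(tid)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_seen[exon] = tid

    groups: Dict[str, Set[str]] = defaultdict(set)
    for tid in transcripts:
        groups[find(tid)].add(tid)
    return list(groups.values())


def tu_span(transcripts: Dict[str, List[Exon]], members: Set[str]) -> Tuple[str, str, int, int]:
    """Return (chrom, strand, start, end) flanked by the most distal exons of a TU."""
    exons = [e for tid in members for e in transcripts[tid]]
    return (exons[0].chrom, exons[0].strand,
            min(e.start for e in exons), max(e.end for e in exons))


# Hypothetical toy input illustrating the three cases named in the text.
transcripts = {
    "tA": [Exon("chr1", "+", 100, 200), Exon("chr1", "+", 500, 600)],
    "tB": [Exon("chr1", "+", 500, 600), Exon("chr1", "+", 900, 1000)],  # shares an exon with tA
    "tC": [Exon("chr1", "+", 300, 400)],  # lies in tA's intron, shares no exon
    "tD": [Exon("chr1", "-", 100, 200)],  # antisense to tA's first exon
}
tus = group_into_tus(transcripts)
for members in sorted(tus, key=sorted):
    print(sorted(members), tu_span(transcripts, members))
```

In this toy input, `tA` and `tB` merge into one TU spanning positions 100 to 1000, while the intronic transcript `tC` and the antisense transcript `tD` each form their own TU, matching the three cases described in the text. A variant that treats partially overlapping exons as shared would need an interval-overlap test in place of the exact-match lookup.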
