13.07.2015 Views

The Genom of Homo sapiens.pdf


to known genes. As shown in Figure 6, 20% of all FANTOM2 FL cDNA sequences are identical to known mouse transcripts, and 14% show various levels (85%, 70%, and 50%) of homology with protein sequences from other organisms. Surprisingly, the remaining FANTOM2 sequences are novel. These novel sequences can be classified into four categories. The first category consists of sequences containing InterPro motifs (Apweiler et al. 2001), MDS domains (Kawaji et al. 2002), or SCOP structural domains (Gough et al. 2001). The second category constitutes sequences with ORFs of significant size (more than 100 amino acids) that code for hypothetical proteins but contain no recognizable motifs or domains. The sequences in the third category have no ORFs but hybridize to ESTs reported in public databases. The final category contains totally unknown sequences with no ORFs, known motifs, or homology with ESTs.

The first step in FANTOM annotation was identification of ORFs in each sequence. However, sequences with no ORFs were found more frequently than expected. FANTOM2 cDNAs contained 20,487 protein-coding sequences and 16,599 noncoding sequences; this is surprising, given that only around 100 noncoding RNAs (other than tRNAs) were known before FANTOM2 was developed (Fig. 6). A new “continent” of functional noncoding RNAs has been discovered; previously the protein continent has been better explored in the examination of final gene expression. Artifacts including unspliced cDNAs (despite efforts to prepare cytoplasmic RNA only) and genomic contamination were present among 16,599 noncoding RNAs.
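
The ORF criterion used above (an ORF of more than 100 amino acids) can be illustrated with a minimal Python sketch. This is not the FANTOM annotation pipeline; it assumes the cDNA is given in sense orientation, scans only the three forward reading frames, counts an ORF from the first ATG to the next in-frame stop codon, and ignores ORFs that run off the end of the sequence. The function names and the constant MIN_ORF_AA are hypothetical.

```python
# Illustrative sketch only: a simple longest-ORF scan, not the FANTOM pipeline.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
MIN_ORF_AA = 100  # hypothetical name; threshold taken from "more than 100 amino acids"


def longest_orf_aa(seq: str) -> int:
    """Return the length (in amino acids) of the longest ATG-to-stop ORF
    found in the three forward reading frames of a DNA sequence."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Codons from the ATG up to, but not including, the stop codon.
                best = max(best, (i - start) // 3)
                start = None
    return best


def has_significant_orf(seq: str, min_aa: int = MIN_ORF_AA) -> bool:
    """True if the longest ORF is strictly longer than min_aa amino acids."""
    return longest_orf_aa(seq) > min_aa
```

A sequence that fails this test would still need the motif and EST comparisons described above before it could be assigned to the third or fourth category.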
A significant population of these noncoding RNAs are likely not to be “junk”: on average, each noncoding RNA is spliced from three exons, and CpG islands or CG-rich sequences are preferentially located at the 5′ ends of the noncoding RNAs. Since 32,000 protein-coding sequences are predicted from the completed human genome sequence, more than several thousand protein-coding sequences are missing from FANTOM2. However, the prediction of so-called “genes” from the human genome sequence missed a major population (at least several thousand) of noncoding RNAs. Analysis of the functions of these noncoding RNAs is an important task for future research.

LARGE-SCALE MAPPING OF FL cDNA ONTO THE MOUSE GENOME SEQUENCE

To clarify the chromosomal distribution of the transcripts, we mapped the FL cDNAs onto the mouse genome draft sequence (MGSCv3) (Waterston et al. 2002). Mapping was possible in 32,568 out of 33,047 cDNA sequences. The remaining 479 sequences did not show any homology with mouse genome sequences. One possible explanation for this is the exclusive use of the female mouse in generating the genome database, so the database does not contain the Y chromosome sequence. Additionally, the mouse genome sequence database is still only in draft form, with 3% of the genome currently unsequenced. The data on the chromosomal locations of the FL cDNAs constitute a powerful tool that will facilitate the positional candidate approach to identifying the genes responsible for specific phenotypes in mutant mice and in human disease.
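
The positional candidate approach mentioned above can be sketched as an interval query over mapped cDNA coordinates. The following Python fragment is a minimal illustration under stated assumptions, not the authors' software: the MappedCdna record, the function name, the coordinate convention (1-based, inclusive) and the idea of passing marker positions as plain integers are all hypothetical.

```python
# Illustrative sketch only: list mapped cDNAs lying between two flanking markers.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MappedCdna:
    """Hypothetical record for one cDNA mapped onto the genome."""
    clone_id: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def candidates_between(mapped: List[MappedCdna], chrom: str,
                       marker_a: int, marker_b: int) -> List[str]:
    """Return clone IDs of cDNAs on `chrom` that overlap the interval
    bounded by the two flanking marker positions (given in either order)."""
    lo, hi = sorted((marker_a, marker_b))
    return [m.clone_id for m in mapped
            if m.chrom == chrom and m.start <= hi and m.end >= lo]
```

In practice the resulting list would be narrowed further with expression data or other evidence; the sketch only shows the coordinate filter.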
We developed a computer software package, GENOMAPPER, that can list candidate genes when given the flanking marker, expression sites, and, if possible, protein–protein interactions.

SENSE AND ANTISENSE RNA PAIRS

In the FANTOM2 set, we found an unexpectedly large number of pairs of sense and antisense transcripts. These were discovered through the mapping of cDNA onto the genomic sequence. The pairs include all combinations of coding and noncoding RNAs, and spliced and unspliced sequences. Natural sense and antisense transcripts may regulate gene expression in various ways. For example, the antisense imprinter RNA (AIR) was reported as the noncoding, intronless transcript controlling the imprinted expression of Igf2r (Sleutels et al. 2002). When AIR is disrupted, imprinted expression is heavily perturbed. In some of the sense–antisense pairs, antisense-strand RNA may function to repress the function of the sense-strand RNA by RNAi, the suppression of translation, or other mechanisms. Investigation of the functions of sense–antisense pairs should be an interesting direction for future research.

CDS ANNOTATION AND ANALYSIS OF PROTEIN-CODING SEQUENCE

The first step of annotation, the identification of the coding sequence, addresses many questions. Is there a significant ORF? Is this ORF protein coding or noncoding? If it is protein coding, is the transcript spliced? Which ORF is the real one? Are there any frameshift errors, or initiation and termination codon errors? Is the cDNA full-length or truncated? Which ATG is the initiation codon? We developed a computer software package, Protein Coding Region Estimator (ProCrest), to answer these questions (Y. Hayashizaki et al., unpubl.).
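
One of the questions above, which ATG is the initiation codon, is commonly approached by examining the sequence context around each ATG. The Python sketch below is a simplified heuristic and makes no claim about how ProCrest works: it assumes the widely used rule of thumb that a purine at position -3 and a G at position +4 (relative to the A of the ATG as +1) are the most informative positions of the Kozak consensus. The function names and the three-level labels are hypothetical choices for illustration.

```python
# Illustrative sketch only: a simplified Kozak-context check, not ProCrest.
from typing import List, Tuple


def kozak_strength(seq: str, atg_index: int) -> str:
    """Classify the context of the ATG starting at `atg_index` (0-based).

    Counts two features: a purine (A or G) at position -3 and a G at
    position +4. Two hits give "strong", one "adequate", none "weak".
    Positions that fall outside the sequence count as misses.
    """
    seq = seq.upper()
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    minus3 = seq[atg_index - 3] if atg_index >= 3 else ""
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else ""
    hits = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return ("weak", "adequate", "strong")[hits]


def rank_start_codons(seq: str) -> List[Tuple[int, str]]:
    """Return (index, strength) for every ATG in the sequence, 5'-most first."""
    seq = seq.upper()
    return [(i, kozak_strength(seq, i))
            for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]
```

A context check of this kind is only one line of evidence; reading-frame length and codon usage would normally be weighed alongside it.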
ProCrest calculates the amino acid frequency, tandem amino acid weight matrix, degeneracy of genetic code, tRNA anticodon usage (wobble rules: [U, C], A, G), bias of base contents (G,C and A,G), Kozak consensus, and polyadenylation sites for a given sequence.

Using this software, we identified sequences as protein coding or noncoding and identified CDSs. With the resulting CDS database, functional analysis of the protein sequences predicted by ProCrest was possible. Gene ontology (GO) analysis, motif analysis with InterPro and/or Pfam, and gene family analysis were also performed on the CDSs. The software developed by Matsuda et al. (Kawaji et al. 2002) was used to find new motifs in FANTOM2 clones, resulting in the discovery of 10 new putative motifs (Okazaki et al. 2002).

DYNAMIC VARIATION OF TRANSCRIPTS PRODUCED FROM STATIC GENOME SEQUENCE
