Eukaryotic Gene Prediction - Rice Genome Annotation Project

Gene Structure

Signal Sensors
A signal sensor evaluates fixed-length features in DNA.
  Start codons
  Stop codons
  Donor sites
  Acceptor sites
  Promoters
  Poly-A signals
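As a rough illustration of what a signal sensor looks at, the sketch below reports every position where a fixed-length motif occurs on the forward strand. It is a deliberate simplification: the motif table `SIGNALS` and the function `find_signals` are hypothetical names introduced here, only the consensus start codon, stop codons, and GT/AG splice dinucleotides are covered, and real sensors typically score a window of sequence rather than match an exact string.

```python
import re

# Fixed-length signals treated as exact motifs (an assumption made for
# illustration; real sensors usually score a window, e.g. with a weight matrix).
SIGNALS = {
    "start_codon": "ATG",
    "stop_codon": "TAA|TAG|TGA",
    "donor_site": "GT",
    "acceptor_site": "AG",
}

def find_signals(seq):
    """Return {signal_name: [0-based start positions]} on the forward strand."""
    seq = seq.upper()
    hits = {}
    for name, pattern in SIGNALS.items():
        # A lookahead is used so that overlapping occurrences are all reported.
        hits[name] = [m.start() for m in re.finditer(f"(?=({pattern}))", seq)]
    return hits
```

Most hits from such a scan are false positives, which is why gene finders combine signal sensors with a model of the whole gene structure.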

Gene Prediction Approaches
Intrinsic (ab initio): GENSCAN, FGENESH, GeneMark.hmm, GlimmerM, Genie
Extrinsic (similarity-based):
  Spliced alignment: GenomeScan, EuGene, FGENESH+, FGENESH_C, GeneId+, etc.
  Genomic comparison: TwinScan, TWAIN, SLAM, SGP, FGENESH_2, etc.
Integrated: GeneScope, GeneMachine, JIGSAW, RiceGAAS, Ensembl, EVM, etc.

ab initio Gene Prediction
Adopt a rigorous probabilistic model of sequence structure and choose the most probable parse according to that probabilistic model.
Pros:
  Fast and efficient
  Remarkable accuracy at the nucleotide level
Cons:
  Less than 50% accuracy at the gene level
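To make "choose the most probable parse" concrete, here is a minimal sketch of Viterbi decoding on a toy two-state hidden Markov model. Everything in it is an assumption made for illustration: the states (`N` for non-coding, `C` for coding), the probability tables, and the function name `viterbi` are invented, and the numbers are not trained parameters. Real gene finders use far richer state models (exons by phase, introns, splice sites, and so on).

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy labels)

# All probabilities are made-up illustration values, not trained parameters.
START = {"N": 0.9, "C": 0.1}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "C": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def viterbi(seq):
    """Return the most probable state path for a non-empty seq under the toy HMM."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With these toy parameters, a GC-rich stretch flanked by A-rich sequence is labelled coding only when it is long enough to pay for the two state switches.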

Development of a Gene Finder
Build the model
Train the model to generate the related parameters
Predict/Evaluate

Imperfect Model
[Figure: GT…AG intron diagram, with a 1-bp intron example]

Accuracy Evaluation
Nucleotide level
Exon level
Gene level

Nucleotide/Base Level
Prediction accuracy per base (coding/non-coding)
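Per-base accuracy is usually summarised as sensitivity and specificity. The sketch below assumes the convention common in gene-finding benchmarks, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP); the slides do not state a formula, so treat the definitions, the `C`/`N` label encoding, and the function name `nucleotide_accuracy` as assumptions.

```python
def nucleotide_accuracy(true_labels, pred_labels):
    """Per-base sensitivity and specificity.

    Both arguments are equal-length strings with one label per base:
    'C' for coding, 'N' for non-coding (an encoding assumed for this sketch).
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label strings must have the same length")
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == "C" and p == "C" for t, p in pairs)  # coding, predicted coding
    fp = sum(t == "N" and p == "C" for t, p in pairs)  # non-coding, predicted coding
    fn = sum(t == "C" and p == "N" for t, p in pairs)  # coding, missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity
```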

Exon Level
Prediction accuracy with respect to exact prediction of exon start and end points
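At the exon level, a prediction counts only if both boundaries match exactly. A minimal sketch, assuming exons are represented as `(start, end)` tuples on the same strand and coordinate system (the representation and the function name `exon_accuracy` are hypothetical):

```python
def exon_accuracy(true_exons, pred_exons):
    """Exon-level sensitivity and specificity based on exact boundary matches.

    Each exon is a (start, end) tuple; an exon is correct only when both
    coordinates agree with an annotated exon.
    """
    true_set, pred_set = set(true_exons), set(pred_exons)
    exact = len(true_set & pred_set)
    sensitivity = exact / len(true_set) if true_set else 0.0
    specificity = exact / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity
```

A predicted exon that overlaps the true exon but misses a boundary by even a few bases scores zero here, which is one reason exon-level figures are lower than nucleotide-level figures.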

Gene/Protein Level
Prediction accuracy with respect to the protein product encoded by the predicted gene

A Simple Calculation
Given accuracy x at the exon level, the accuracy of the prediction at the gene level is:
P = P(all exons correctly predicted) = x^n,
where n is the number of exons in the gene.
Typically, x
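The calculation above can be sketched in a few lines. The formula treats each exon as an independent trial with the same success probability x, and the value 0.8 used in the usage note is an arbitrary illustration, not a figure from the slides; the function name `gene_level_accuracy` is likewise hypothetical.

```python
def gene_level_accuracy(x, n):
    """Probability that all n exons of a gene are predicted correctly.

    Assumes every exon is predicted independently with the same
    exon-level accuracy x, so P = x ** n.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be a probability between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return x ** n
```

For example, `gene_level_accuracy(0.8, 5)` is about 0.33, which shows how quickly gene-level accuracy falls as the exon count grows even when exon-level accuracy looks respectable.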

Performance
Species-specific setting
  GC content
  Gene density
  Gene/Exon/Intron length distribution
  Codon usage
Benchmark
  Training data set
  Test data set
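Two of the species-specific parameters listed above, GC content and codon usage, are straightforward to estimate from training sequences. The sketch below is a minimal illustration; the function names `gc_content` and `codon_usage` are hypothetical, and the codon table is computed as simple relative frequencies over one in-frame coding sequence rather than per amino acid.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence.

    Trailing bases that do not fill a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: count / total for codon, count in counts.items()}
```

In practice these statistics would be pooled over the whole training set and kept separate from the test set, in line with the benchmark split on the slide.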

Maize Gene Prediction

Gene Finders

Accuracy

Challenges of Intrinsic Approaches
Alternative splicing
Nested/overlapped genes
Extremely long/short genes
Extremely long introns
Extremely short exons
Non-canonical introns
Frame-shift errors
Split start codons (that is, the start codon is split by an intron in the genomic sequence)
UTR introns
Non-ATG triplet as the start codon
Polycistronic genes

Gene Prediction Approaches
Intrinsic (ab initio): GENSCAN, FGENESH, GeneMark.hmm, GlimmerM, Genie
Extrinsic (similarity-based):
  Spliced alignment: GenomeScan, EuGene, FGENESH+, FGENESH_C, GeneId+, etc.
  Genomic comparison: TwinScan, TWAIN, SLAM, SGP, FGENESH_2, etc.
Integrated: GeneScope, GeneMachine, JIGSAW, RiceGAAS, Ensembl, etc.

Similarity-based Gene Prediction
EST/cDNA spliced alignment
Protein spliced alignment
Genomic comparison
  Intra-genomic
  Inter-genomic

EST/cDNA Spliced Alignment

Pros and Cons
Pros:
  High accuracy
Cons:
  Unavailability or incompleteness of transcript sequence data
  Extra computation to generate alignments
  Diverse sequence quality
  Incomplete full-length cDNA
  Contamination
  Incorrect sequence orientations

Genomic Comparison
Microsynteny between M. truncatula and Arabidopsis
Hongyan et al., 2003

Gene Structure of Syntenic and non-Syntenic Homologous Genes
Hongyan et al., 2003

Comparative Analysis of Cereal Gene Structures

Comparative Analysis of Cereal Gene Promoters

Pros and Cons
Pros:
  Aid to identify low expressed genes
  Identify genes in multiple species simultaneously
  Aid to identify transcription factor binding sites
  Uncover non-protein coding genes
Cons:
  Performance will depend on the evolutionary distance between the compared sequences.
  Exon/intron boundaries may not be conserved

Tiling Array

ARTADE: ARabidopsis Tiling-Array-based Detection of Exons

Gene Prediction Approaches
Intrinsic (ab initio): GENSCAN, FGENESH, GeneMark.hmm, GlimmerM, Genie
Extrinsic (similarity-based):
  Spliced alignment: GenomeScan, EuGene, FGENESH+, FGENESH_C, GeneId+, etc.
  Genomic comparison: TwinScan, TWAIN, SLAM, SGP, FGENESH_2, etc.
Integrated: GeneScope, GeneMachine, JIGSAW (combiner), RiceGAAS, Ensembl, etc.

Gene Discovery via Multiple Gene Finders
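One simple way to picture combining several gene finders is a weighted per-base vote. The sketch below is only that, an illustration: it is not the algorithm used by JIGSAW, EVM, or any other combiner named in these slides, and the function name `consensus_coding`, the `C`/`N` label encoding, and the default threshold are all assumptions.

```python
def consensus_coding(predictions, weights=None, threshold=0.5):
    """Combine per-base coding calls from several gene finders by weighted vote.

    predictions: list of equal-length strings, one per gene finder, with
                 'C' for coding and 'N' for non-coding at each base.
    weights:     optional list of positive weights, one per gene finder.
    A base is called coding when the weighted fraction of 'C' votes
    exceeds threshold.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    calls = []
    for i in range(length):
        support = sum(w for p, w in zip(predictions, weights) if p[i] == "C")
        calls.append("C" if support / total > threshold else "N")
    return "".join(calls)
```

A per-base vote can produce gene structures that no single program predicted and that are not valid gene models, which is why practical combiners work on whole exons and introns and weight each evidence type.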

EVM

TIGR Rice Genome Annotation Pipeline

RiceGAAS

Ensembl Gene Prediction Procedure

Summary
Nothing is perfect
Each gene identification approach has its own features and limitations;
Genome annotation is an on-going process, and the accuracy is being improved along with the accumulation of the evidence data;
[Figure labels: tRNA, snoRNA]

Case Study

Sorghum-Rice Synteny and EST Read Pair

Create a Gene Model

Expression Data
Data Type: Data Source
EST/FL-cDNA: PASA/Manual curation
Peptide: Koller et al., PNAS, 2002 (6,296 peptides/2,528 fgenesh models)
MPSS: Blake Meyers (http://mpss.udel.edu/rice/)
SAGE: 126,663 tags from MGOS (http://www.mgosdb.org/sage/)
Microarray: NSF Rice Oligonucleotide Array project (http://www.ricearrary.org)
Tiling array: Deng lab, Yale University

Expression Data in Gbrowse

MPSS Sequencing Technology
I. Library construction (Brenner et al., PNAS 97:1665-70)
  1) Cut mRNA-derived product w/ DpnII
  2) Ligate MmeI adapter
  3) Cut to capture 21-22 bp “signature”
  4) Add ‘DNA barcode’, amplify & capture on beads
  Each bead contains the amplified product derived from the 3’ end of a single transcript.
II. Loading the ‘flow cell’
III. Sequencing of tags (Brenner et al., Nat. Biotech. 18:630-4)
  1) Add adaptors
  2) Sequence by hybridization (16 cycles for 4 bp)
  3) Digest with Type IIS enzyme to uncover next 4 bases, repeat cycle

Ovary and mature stigma

Refine Gene Structure

“Have no fear of perfection - you'll never reach it.” - Salvador Dalí
