K - Computational Intelligence and Bioinformatics Research Group

cib.uco.es


Section 3.1: Gene structure prediction

Computational Intelligence and

Bioinformatics Research Group

(www.cibrg.org)

University of Córdoba


Outline






Introduction

Approaches

Case studies



Geneparser

CONTRAST

Challenges and open problems

Examples



The biological model



Gene structure prediction



Objective: Determine complete gene structure

UTRs, exons, introns, splice sites, …

Problems

Signals are subtle and misleading

Evidence must pile up

Pseudogenes and other artifacts

Uncommon genes


Some simplifications



Non-canonical sites are usually ignored



TIS always ATG

Alternative splicing not considered


Specific line of research to identify alternative splicing

Non-coding regions are usually ignored


5'-UTRs and 3'-UTRs



First step

Finding the evidence



Finding the evidence



Content sensors

Try to classify a DNA region into types, e.g. coding vs. non-coding

Signal sensors

Try to identify functional sites


Content sensors



Extrinsic content sensors

Exploit a sufficient similarity between a sequence region and a DNA sequence or protein

Methods: Smith-Waterman (S-W), BLAST, FASTA

Intrinsic content sensors

Exploit the regularities imposed in coding regions


Extrinsic content sensors


Similarities with three different types of sequences




Protein sequences

50% of all exons, UTRs are lost

Transcripts: ESTs and cDNA

Most reliable, coding and non-coding parts

Genomic DNA

Coding regions more conserved than non-coding


Extrinsic content sensors



Strengths

Accumulated preexisting biological data

Biologically relevant predictions

A single hit is enough to predict a gene

Weaknesses

Nothing new can be found

Limits of the region of similarity are fuzzy

Small exons are easily missed

Biased towards highly expressed genes


Intrinsic content sensors





G+C content


Introns are more A+T rich than exons

Codon composition

Hexamer frequency

Found to be the most discriminant: SORFIND, GenView2, MZEF, GeneParser

In general, k-mer composition

Base occurrence periodicity
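The G+C content sensor listed above can be illustrated with a minimal sketch. The function names `gc_content` and `sliding_gc` and the window parameters are assumptions made for this example; they are not taken from any of the programs cited on the slide.

```python
def gc_content(seq: str) -> float:
    """Fraction of G or C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def sliding_gc(seq: str, window: int, step: int = 1) -> list[float]:
    """G+C fraction of every window of the given length along seq."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    # Toy sequence: an A+T-rich stretch followed by a G+C-rich stretch.
    print(sliding_gc("AATTAAGGCCGG", window=4, step=4))
```

A real content sensor would compare such window values against distributions learned from known exons and introns rather than reading them directly.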


k-mer composition






Can be modeled using a k-order Markov model, the building block of Hidden Markov Models (HMMs)

Stochastic model that assumes:

The probability of occurrence of a nucleotide depends on the k previous nucleotides

A probability is constructed from a training set:

P(X | k previous nucleotides)

A greater k requires a larger number of coding sequences and more time to build the model

Many programs: GeneMark, Genscan, EuGène, Glimmer, GlimmerM, GeneMark.hmm
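As a rough illustration of how P(X | k previous nucleotides) could be estimated from a training set and used as a coding vs. non-coding score, here is a minimal sketch. The names `train_markov` and `log_odds`, the add-one pseudocount and the toy training sequences are all assumptions for this example; the programs listed above use more elaborate models.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, k, pseudocount=1.0):
    """Estimate P(X | k previous nucleotides) from training sequences.

    Returns a function prob(context, base). The pseudocount (an assumption
    of this sketch) keeps unseen contexts from giving zero probabilities.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, k):
    """Sum over positions of ln(P_coding / P_noncoding); > 0 favours coding."""
    return sum(
        math.log(coding(seq[i - k:i], seq[i]) / noncoding(seq[i - k:i], seq[i]))
        for i in range(k, len(seq))
    )


if __name__ == "__main__":
    k = 2
    coding = train_markov(["ATGGCCGCCGCC"], k)       # toy "coding" set
    noncoding = train_markov(["ATATATATATAT"], k)    # toy "non-coding" set
    print(log_odds("GCCGCC", coding, noncoding, k))
    print(log_odds("ATATAT", coding, noncoding, k))
```

The number of contexts grows as 4^k, which is why a larger k needs more training sequences, as the slide notes.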


Hidden Markov Models



Many more sophisticated methods have been proposed




Interpolated Markov models (IMMs)

Interpolated context models

Generalized Markov models (GMMs)

Different models can be used for different regions



One model for coding regions


More models depending on G+C content

One model for non-coding regions


One model for 5'-UTRs and 3'-UTRs, one model for introns, one model for intergenic regions


VEIL: Viterbi Exon-Intron Locator




Contains 9 hidden states or features

Each state is a complex internal Markovian model of the feature

Features:

Exons, introns, intergenic regions, splice sites, etc.

VEIL architecture (figure; state labels): Upstream, Start Codon, 3’ Splice Site, Exon, Intron, Stop Codon, 5’ Splice Site, Downstream, 5’ Poly-A Site

Exon HMM model

• Enter: start codon or intron (3’ Splice Site)

• Exit: 5’ Splice Site or one of the three stop codons (taa, tag, tga)


Signal sensors




First approach


Look for a consensus sequence, with possible variations, obtained from multiple alignment: SPLICEVIEW and SplicePredictor

More flexible: Positional Weight Matrices (PWMs)

0-order Markov model

Can be improved with a neural network: NetPlantGene, NetGene2, NNSplice

More sophisticated

HMMs for capturing possible dependencies: VEIL, MORGAN, MZEF


Signal sensors



Classification methods:


2-class problems

Many available methods





Neural networks

Decision trees

Support vector machines (SVMs)

K-nearest neighbors



Second step

Combine the evidence to predict gene structure


Combining the evidence



Old programs output a collection of exons

New programs:



Combine the evidence to obtain a whole gene

Alternative splicing can also be offered



Combining the evidence




The number of possible genes grows exponentially with the number of putative exons

Constraints

There are no overlapping exons (callout: not in eukaryotes; found in complementary sequences and in introns)

Coding exons must be frame compatible

No in-frame stop codons

Still a huge number of possible genes

Most common technique: Dynamic programming


Methods for gene structure construction



Extrinsic methods


Use of expression data: ESTs and cDNA

Intrinsic methods



Ab-initio methods


DNA sequences only from the genome to predict

De-novo methods


DNA sequence from other genomes (informant genomes)


Extrinsic methods



Based on homologies

Example: BlastX on NG_013087



Extrinsic methods





Also called “spliced alignment” programs

Variations of the Smith-Waterman algorithm:

A signal allows the opening (donor) or closing (acceptor) of a gap

Pioneer: Procrustes (named after the Greek rogue, son of Poseidon, killed by Theseus)

Align a genomic sequence with a protein

The protein must be given by the user (e.g. found with BLAST)

Similar programs

GeneWise, PredictGenes, ORFgene, ALN


Extrinsic methods (ii)


Other approaches




Perform an alignment with a cDNA database: AAT, GeneSeqer, SIM4, Spidey

Very reliable way of identifying exons

TIS and STOP codons difficult to find exactly

SYNCOD uses a Monte Carlo procedure to test different possibilities

Monte Carlo procedure:

1. Define a domain of possible inputs.

2. Generate inputs randomly from a probability distribution over the domain.

3. Perform a deterministic computation on the inputs.

4. Aggregate the results.

EST driven programs: EbEST, Est2genomic, TAP, PAGAN

✔ ESTs are highly redundant and prone to errors

✔ Very good for 3'-UTRs


Homology based prediction programs

[Mathé et al., 2002]

[Mathé et al., 2002] C. Mathé, M.-F. Sagot, T. Schiex and P. Rouzé, “Current methods of gene prediction, their strengths and weaknesses”, Nucleic Acids Research 30(19), 4103–4117, 2002.


Intrinsic approaches



Ab initio programs



A sequence and a target genome

Sometimes refers to all intrinsic methods

De novo programs


A sequence, a target genome, and one or more informant genomes


De Novo programs




Combine information of different (informant) genomes

Examples



ROSETTA, CEM, TWINSCAN, N-SCAN, SLAM, SGP, EvoGene, ExoniPhy, DOG-FISH, CONTRAST

Major challenge


Combine predictions from different informants



More than one informant






Make use of the conserved region across species

A target genome and one or more additional genomes

For example:



Target genome: human

Additional: mouse

Not too close and not too distant

Experimentally: more than 2 are not dramatically more useful


Ab-initio gene prediction




Advantages

Able to obtain “new” genes

Problem

Evidence is unclear, misleading, etc.

Different measures must be combined

Two basic statistics

Content sensors

Site sensors


Assembling the gene



Usually use dynamic programming

Dynamic programming

Dynamic programming is a method for efficiently solving a broad range of search and optimization problems which exhibit the characteristics of overlapping subproblems and optimal substructure.

Overlapping subproblems: a problem is said to have overlapping subproblems if the problem can be broken down into subproblems which are reused several times.

Optimal substructure: a problem is said to have optimal substructure if an optimal solution can be constructed efficiently from optimal solutions to its subproblems.


Assembling the gene


Two approaches



Signal based: The gene structure is defined by a succession of signals separated by homogeneous regions

Exon based: The gene structure is defined by an assembly of coding segments

The gene assembly is separated from the coding segments prediction


Exon-based methods





The goal is to find the highest scoring genes

The gene score is a function (usually a sum) of the segment scores

This allows complex scores for the segments

Frequent approach: Optimal path in a directed acyclic graph




Vertices: Exons

Edges: Compatibilities between exons

GenView2, GAP3, FGENE, DAGGER, GeneGenerator
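The optimal-path formulation above can be sketched with a small dynamic program. In this hedged example the vertices are candidate exons with precomputed scores and an edge exists when one exon ends before the next starts; frame compatibility is deliberately left out. The names `Exon` and `best_chain` and the toy scores are assumptions, not the algorithm of any listed program.

```python
from typing import NamedTuple, Optional


class Exon(NamedTuple):
    start: int
    end: int       # inclusive
    score: float


def best_chain(exons: list[Exon]) -> tuple[float, list[Exon]]:
    """Highest-scoring chain of non-overlapping exons (sum of exon scores).

    Exons sorted by end coordinate form a topological order of the DAG,
    so one left-to-right pass with back-pointers finds the optimal path.
    """
    order = sorted(exons, key=lambda e: e.end)
    if not order:
        return 0.0, []
    best: list[float] = []            # best[i]: best chain score ending at order[i]
    prev: list[Optional[int]] = []    # back-pointer to the previous exon
    for i, exon in enumerate(order):
        score, pointer = exon.score, None
        for j in range(i):
            # Compatibility edge: exon j must end before exon i starts.
            if order[j].end < exon.start and best[j] + exon.score > score:
                score, pointer = best[j] + exon.score, j
        best.append(score)
        prev.append(pointer)
    i = max(range(len(order)), key=best.__getitem__)
    total = best[i]
    chain = []
    while i is not None:
        chain.append(order[i])
        i = prev[i]
    return total, chain[::-1]


if __name__ == "__main__":
    candidates = [Exon(1, 10, 5.0), Exon(5, 20, 6.0),
                  Exon(15, 30, 4.0), Exon(35, 50, 3.0)]
    print(best_chain(candidates))
```

Because each exon is scored independently before assembly, the exon score can be arbitrarily complex, which is the advantage the slide points out.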



Signal-based methods





The assembly is produced directly from the set of detected signals

Can also be stated as searching for a path in a directed acyclic graph (DAG)

Example of a DAG

An HMM is used, and the most likely gene structure is obtained with the Viterbi algorithm

Method used in EuGène, ECOPARSE, Genscan, Genie, GeneMark.hmm, FGENESH, GRPL, HMMgene, VEIL


Measures of evidence


Content sensors: Coding vs. non-coding regions



Intrinsic (well... not really, they must be learned)

In-frame hexamers

Local compositional complexity

Intron/exon length distribution

Bulk hexamers

Hidden Markov models

Extrinsic

BLAST scores

BLAST similarities


Extrinsic content sensors




Similarity between a sequence region and DNA or protein sequences


Advantage: Information about function

Heuristic methods for comparison


FASTA and BLAST

Three kinds of sequences




Protein sequences

Transcripts

Genomic DNA



Similarity with protein sequences



Pros:

Proteins are conserved among species: 50% of sequences have a high similarity score

Cons:

Nothing new can be found

Poor quality of stored sequences

Difficult to obtain the exact gene structure

UTRs cannot be identified


Similarity with transcripts


Sequences obtained from RNA



cDNA: Complete clone from an RNA sequence

Better but more expensive

ESTs (expressed sequence tags): One-shot short sequences

Enable identification of exons and non-coding exons

Hints of alternative splicing

✔ Cons:

Only local information

Difficult to assign to a specific gene


Expressed sequences tags (ESTs)



Similarity with genomic DNA




Genomic DNA


Coding regions are more conserved than non-coding

Intra-genomic or inter-genomic approach

Problem



Similarity may not cover entire exons, only the most conserved parts

It can extend to introns and UTRs when genomes are evolutionarily close


In frame hexamers


Measures two different effects



Codon usage bias in coding regions

Correlation between neighboring codons

\[
\mathrm{IF}_6(i,j)=\max\left\{
\sum_{k=0,3,6,\dots,\,j-6}\ln\frac{f_k}{F_k},\;
\sum_{k=1,4,7,\dots,\,j-6}\ln\frac{f_k}{F_k},\;
\sum_{k=2,5,8,\dots,\,j-6}\ln\frac{f_k}{F_k}
\right\}
\]

$f_k$: frequency of hexamer k in the coding region

$F_k$: frequency of hexamer k in all the sequences
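The in-frame hexamer measure can be sketched as follows, treating the input string as the segment being scored and taking the best of the three reading frames. The function name `in_frame_hexamer_score`, the dictionary representation of the frequency tables, and the `floor` value used for hexamers missing from a table are assumptions of this sketch.

```python
import math


def in_frame_hexamer_score(seq, f, F, floor=1e-6):
    """Return (score, frame): the maximum over frames 0, 1, 2 of
    sum ln(f[h] / F[h]) for hexamers h taken every 3 bases.

    f: hexamer frequencies in coding regions (dict: hexamer -> frequency)
    F: hexamer frequencies in all sequences (dict: hexamer -> frequency)
    floor: assumed frequency for hexamers absent from a table.
    """
    best_score, best_frame = None, None
    for frame in range(3):
        total = 0.0
        # The last hexamer starts at len(seq) - 6.
        for k in range(frame, len(seq) - 5, 3):
            h = seq[k:k + 6]
            total += math.log(f.get(h, floor) / F.get(h, floor))
        if best_score is None or total > best_score:
            best_score, best_frame = total, frame
    return best_score, best_frame


if __name__ == "__main__":
    f = {"ATGGCC": 0.02, "GCCATG": 0.02}   # toy coding frequencies
    F = {"ATGGCC": 0.01, "GCCATG": 0.01}   # toy overall frequencies
    print(in_frame_hexamer_score("ATGGCCATG", f, F))
```

Stepping by three bases is what makes the measure sensitive to codon usage bias and to correlations between neighbouring codons.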


In frame hexamers



Average mutual information



Local compositional complexity (LCC)




Non-coding regions contain repetitive sequences (“simple sequence” DNA)

Coding regions are more informationally rich

Example: Entropy H for oligonucleotides of length L

\[
H=-\sum_{k\in\{A,C,G,T\}}\frac{N_k}{L}\log_2\frac{N_k}{L}
\]

$N_k$: number of times base k occurs in the oligonucleotide
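The entropy measure above can be computed directly. This is a minimal sketch; the function names `entropy` and `sliding_entropy` are assumptions, and bases that do not occur in the oligonucleotide are skipped, following the usual convention that 0 log 0 = 0.

```python
import math
from collections import Counter


def entropy(oligo: str) -> float:
    """Shannon entropy H (bits) of the base composition of an oligonucleotide."""
    length = len(oligo)
    if length == 0:
        return 0.0
    counts = Counter(oligo.upper())
    # Bases with N_k = 0 contribute nothing to the sum.
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


def sliding_entropy(seq: str, L: int) -> list[float]:
    """Entropy of every length-L window along seq."""
    return [entropy(seq[i:i + L]) for i in range(len(seq) - L + 1)]


if __name__ == "__main__":
    print(entropy("AAAAAAAA"))   # repetitive "simple sequence" DNA
    print(entropy("ACGTACGT"))   # all four bases equally frequent
```

H ranges from 0 bits for a homopolymer to 2 bits when all four bases are equally frequent.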


Local compositional complexity



Intron and exon length distributions



Length distributions differ significantly for

5' exons

Internal exons

3' exons

Introns

These distributions only apply to the coding part of the exon


Hidden Markov Models




Stochastic model that assumes:



The appearance of a base, {A, T, G, C}, at a given position depends only on the k previous nucleotides


P(X|k previous nucleotides) defines the model

k is the order of the model

The model is learned from the training data

Many different models


HMM, Interpolated Markov model (IMM), Generalized HMM



HMM

x_m(i) = probability of being in state m at position i;

H(m, y_i) = probability of emitting character y_i in state m;

Φ_mk = probability of transition from state k to m.
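Using the notation above, a Viterbi decoder can be sketched in a few lines. This is an illustrative toy, not the model of any cited program: the two states ("exon", "intron"), all probabilities, and the `start` distribution are assumptions, and `x` here stores the log-probability of the best path ending in each state rather than the state probability itself.

```python
import math


def viterbi(y, states, start, Phi, H):
    """Most probable state path for the observed sequence y.

    start[m]  : assumed initial probability of state m
    Phi[m][k] : probability of transition from state k to state m
    H[m][c]   : probability of emitting character c in state m
    All probabilities must be non-zero (logs are taken).
    """
    x = [{m: math.log(start[m]) + math.log(H[m][y[0]]) for m in states}]
    back = []
    for c in y[1:]:
        column, pointer = {}, {}
        for m in states:
            k_best = max(states, key=lambda k: x[-1][k] + math.log(Phi[m][k]))
            column[m] = x[-1][k_best] + math.log(Phi[m][k_best]) + math.log(H[m][c])
            pointer[m] = k_best
        x.append(column)
        back.append(pointer)
    # Trace back from the best final state.
    m = max(states, key=lambda s: x[-1][s])
    path = [m]
    for pointer in reversed(back):
        m = pointer[m]
        path.append(m)
    return path[::-1]


if __name__ == "__main__":
    states = ("exon", "intron")
    start = {"exon": 0.5, "intron": 0.5}
    Phi = {"exon": {"exon": 0.9, "intron": 0.1},
           "intron": {"exon": 0.1, "intron": 0.9}}
    H = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
         "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
    print(viterbi("GCGCGCATATAT", states, start, Phi, H))
```

Working in log space avoids numerical underflow on long genomic sequences.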


HMM realistic model



Signal sensors


Signal (site) sensors


Site signals for: Start codon, donors and acceptors, stop codon

Other sites: Promoters, poly(A) site, 3'-UTRs, ...

Methods

Alignment with consensus sequences

Positional weight matrices


Consensus sequences


Conserved motifs in certain functional sites



Positional weight matrix (PWM)



0-order Markov model

Method


Choose an interval window

     -1     0    +1    +2
A   0.23  0.99  0.27  0.23
T   0.11  0.01  0.43  0.20
G   0.48  0.00  0.05  0.20
C   0.18  0.00  0.25  0.27
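A PWM such as the one above can be used to scan a sequence for candidate sites. The sketch below copies the matrix values from the slide and scores each window as a log-odds ratio against a uniform background. The uniform background of 0.25, the small `pseudo` count added to avoid log(0), and the function names are assumptions of this example.

```python
import math

# Matrix values as given on the slide, columns for positions -1, 0, +1, +2.
PWM = {
    "A": (0.23, 0.99, 0.27, 0.23),
    "T": (0.11, 0.01, 0.43, 0.20),
    "G": (0.48, 0.00, 0.05, 0.20),
    "C": (0.18, 0.00, 0.25, 0.27),
}
WIDTH = 4


def pwm_log_odds(window, pwm=PWM, background=0.25, pseudo=1e-3):
    """Log-odds score of one window; positions are treated as independent
    (0-order model). pseudo is an assumed pseudocount for zero entries."""
    if len(window) != WIDTH:
        raise ValueError("window must have the same width as the matrix")
    return sum(math.log((pwm[base][i] + pseudo) / background)
               for i, base in enumerate(window.upper()))


def scan(seq):
    """Score every window of the sequence; returns (offset, score) pairs."""
    return [(i, pwm_log_odds(seq[i:i + WIDTH]))
            for i in range(len(seq) - WIDTH + 1)]


if __name__ == "__main__":
    hits = scan("CCGATGCC")
    print(max(hits, key=lambda hit: hit[1]))
```

In practice a threshold on the score, chosen on a validation set, decides which windows are reported as putative sites.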


Site statistics


Measure repetitive sequences



Gene structure prediction




Evidence must be combined in a gene structure

Combinations grow exponentially

One advantage




Gene structure must be:

[TIS – donor] – [acceptor – donor]* – [acceptor – STOP]

or

[TIS – STOP]

Also:

No overlapping exons

No in-frame stop codons
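The structural constraints above can be checked mechanically. In this sketch a candidate gene is given as an ordered list of signal names, which is encoded into single letters and matched against a regular expression equivalent to the two allowed patterns. The signal names, the letter encoding and both function names are assumptions made for illustration.

```python
import re

# T = TIS, D = donor, A = acceptor, S = STOP.
# [TIS-donor]-[acceptor-donor]*-[acceptor-STOP] or [TIS-STOP]
# is equivalent to: T, any number of (D A) pairs, then S.
VALID_GENE = re.compile(r"T(DA)*S")
CODE = {"TIS": "T", "donor": "D", "acceptor": "A", "STOP": "S"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_structure(signals):
    """True if the ordered list of signal names forms a legal gene structure."""
    try:
        encoded = "".join(CODE[s] for s in signals)
    except KeyError:
        return False
    return VALID_GENE.fullmatch(encoded) is not None


def has_in_frame_stop(cds):
    """True if a stop codon occurs in frame before the final codon of the
    spliced coding sequence."""
    cds = cds.upper()
    return any(cds[i:i + 3] in STOP_CODONS for i in range(0, len(cds) - 3, 3))


if __name__ == "__main__":
    print(is_valid_structure(["TIS", "donor", "acceptor", "STOP"]))
    print(has_in_frame_stop("ATGTAAGGGTGA"))
```

Constraints like these prune the exponential number of exon combinations before any scoring is done.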


Integrated approaches


Combine intrinsic and extrinsic methods



Intrinsic methods



Content sensors

Signal sensors

Extrinsic methods


Content sensors based on similarities: BLAST



Ab initio gene prediction programs (some of them with homology)


Current trend



Combination of gene recognition programs

Motivation: Few exons are missed by all programs


A complex problem by itself



Simple approach: OR, AND and Majority voting

Complex approaches: HMMs, Bayesian frameworks, etc.
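The simple combination rules (OR, AND, majority voting) can be sketched at the nucleotide level. Each program's output is assumed to be a 0/1 list marking coding positions; the function name `combine` and this representation are assumptions of the example.

```python
def combine(predictions, rule="majority"):
    """Combine per-nucleotide 0/1 predictions from several programs.

    predictions: list of equal-length 0/1 lists, one per program.
    rule: "or" (any program), "and" (all programs) or "majority".
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    n = len(predictions)
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    thresholds = {"or": 1, "and": n, "majority": n // 2 + 1}
    if rule not in thresholds:
        raise ValueError(f"unknown rule: {rule}")
    votes = [sum(p[i] for p in predictions) for i in range(length)]
    return [int(v >= thresholds[rule]) for v in votes]


if __name__ == "__main__":
    programs = [[1, 1, 0, 0],
                [1, 0, 0, 0],
                [1, 1, 1, 0]]
    for rule in ("or", "and", "majority"):
        print(rule, combine(programs, rule))
```

OR tends to raise sensitivity at the cost of specificity and AND does the opposite, which is one reason the more complex frameworks on the slide are used in practice.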



Combination of programs


Some results



Case studies



Geneparser

CONTRAST



Geneparser




One of the first programs for complete gene recognition

Combines




Content statistics

BLAST similarity scores

Site statistics

Construction of the gene structure


Optimization of the global score using dynamic programming



Geneparser

Content statistics

In-frame hexamers

Local compositional complexity

Intron and exon length distribution

Bulk hexamers

BLAST similarity scores

Site statistics

Search matrix for each site


Geneparser



Dynamic programming

Score of a sequence


Recursive rule


Problem: Relative weight of each measure (NN)



CONTRAST: A two stage approach


Based on two steps [Gross, 2007]

(1) Classifier to recognize boundaries: start codon, stop codon, splice sites

(2) Model for constructing the gene

Global model of gene structure

Conditional random field: Probability of a sequence x being of type y

[Gross, 2007] S. S. Gross, Ch. B. Do, M. Sirota and S. Batzoglou, “CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction”, Genome Biology 8(12), 269.1–269.16, 2007.


Site recognition


Uses a support vector machine



Evaluation of programs



Two levels of evaluation



Coding nucleotide sequence

Exonic structure

Imbalanced class problem


Testing error not a good measure

Example: Some sequences have 99% non-coding portion



Nucleotide level



Specificity: a non-standard measure is used. TN is much larger than FP, so the standard Sp would be too large

Standard measure:

\[
Sp=\frac{TN}{TN+FP}
\]

ACP: Average conditional probability
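The nucleotide-level measures can be sketched as below. The standard specificity follows the formula on the slide. The gene-finding specificity TP / (TP + FP) and the ACP as the average of the four conditional probabilities follow the definitions commonly attributed to the evaluation literature cited in this deck [Burset and Guigó, 1996]; treat the exact form, the averaging over defined terms only, and all function names as assumptions of this sketch.

```python
def confusion(true, pred):
    """Count TP, FP, TN, FN over two equal-length 0/1 sequences (1 = coding)."""
    if len(true) != len(pred):
        raise ValueError("sequences must have the same length")
    tp = sum(1 for t, p in zip(true, pred) if t and p)
    fp = sum(1 for t, p in zip(true, pred) if not t and p)
    tn = sum(1 for t, p in zip(true, pred) if not t and not p)
    fn = sum(1 for t, p in zip(true, pred) if t and not p)
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Ratio, or None when the denominator is zero (measure undefined)."""
    return numerator / denominator if denominator else None


def nucleotide_metrics(true, pred):
    tp, fp, tn, fn = confusion(true, pred)
    terms = [ratio(tp, tp + fn),   # sensitivity
             ratio(tp, tp + fp),   # specificity as used in gene finding
             ratio(tn, tn + fp),   # standard specificity
             ratio(tn, tn + fn)]
    defined = [t for t in terms if t is not None]
    return {
        "Sn": terms[0],
        "Sp_gene_finding": terms[1],
        "Sp_standard": terms[2],
        "ACP": sum(defined) / len(defined) if defined else None,
    }


if __name__ == "__main__":
    # Imbalanced toy case: 1 coding base in 100, program predicts nothing.
    true = [1] + [0] * 99
    pred = [0] * 100
    print(nucleotide_metrics(true, pred))
```

The toy case shows why the standard specificity is uninformative here: a program that predicts nothing still reaches Sp = 1.0 when almost every nucleotide is non-coding.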


Exonic level




Comparison of predicted and true exons

Definition of a correct exon


Usually, perfect correspondence

Two new concepts



Missing exons (ME)

Wrong exons (WE)
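Exon-level evaluation can be sketched as follows, assuming the usual convention that a predicted exon is correct only if both boundaries match exactly, that a missing exon is a true exon with no overlap with any predicted exon, and that a wrong exon is a predicted exon with no overlap with any true exon. Exons are represented as (start, end) tuples with inclusive coordinates; this representation and the function names are assumptions.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level(true_exons, pred_exons):
    """Exon-level sensitivity, specificity, missing (ME) and wrong (WE) exons."""
    true_set, pred_set = set(true_exons), set(pred_exons)
    correct = len(true_set & pred_set)          # perfect correspondence only
    missing = sum(1 for t in true_set
                  if not any(overlaps(t, p) for p in pred_set))
    wrong = sum(1 for p in pred_set
                if not any(overlaps(p, t) for t in true_set))
    return {
        "Sn": correct / len(true_set) if true_set else None,
        "Sp": correct / len(pred_set) if pred_set else None,
        "ME": missing / len(true_set) if true_set else None,
        "WE": wrong / len(pred_set) if pred_set else None,
    }


if __name__ == "__main__":
    true_exons = [(1, 10), (20, 30), (40, 50)]
    pred_exons = [(1, 10), (22, 30), (60, 70)]
    print(exon_level(true_exons, pred_exons))
```

In the toy example the exon predicted at (22, 30) overlaps a true exon but has a wrong boundary, so it counts as neither correct, missing, nor wrong.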



Exonic level (i)



Protein level



Compares the predicted protein and the actual protein

Measures also the ability to construct the exon




For example: A program can have

High accuracy at nucleotide level

High accuracy at exon level

And

Low accuracy at protein level

If there are a few frameshift errors


Evaluation example [Burset and Guigó, 1996]

[Burset and Guigó, 1996] M. Burset and R. Guigó, “Evaluation of gene structure prediction programs”, Genomics 34(3), 353–367, 1996.


Evaluation with 1% random mutation



Performance of CONTRAST



Difficult issues





Very long genes


Dystrophin gene: 79 exons, 2.3 Mb

Very long introns


Dystrophin gene: Introns of more than 100 kb

Very conserved introns or 3'-UTRs

Very short exons


Some of less than 3 bp



New challenges for recognizers





Overlapping genes or genes inside introns of other genes

Found small genes in introns of other genes

Found overlapping genes in both strands

Polycistronic gene arrangement

An mRNA that encodes several discrete gene products: Usually by alternative splicing

Frameshifts in stored sequences

Introns in non-coding regions (5'- and/or 3'-UTR)


New challenges for gene recognizers (ii)




Non-canonical splice sites and TIS

Pseudogenes

Cases of alternative biological processing




Alternate promoters

Alternative splicing

Alternative translation initiation sites



More challenges





Alternative Processing of Transcripts


Splice variants, Start/stop variants

Overlapping Genes


Mostly UTRs or intronic, but coding is possible

UTR predictions


Especially with introns

Small (mini) exons



Eukaryotic gene prediction tools and web servers








Genscan (ab initio), GenomeScan (hybrid)

– (http://genes.mit.edu/)

Twinscan (hybrid)

– (http://genes.cs.wustl.edu/)

FGENESH (ab initio)

– (http://www.softberry.com/berry.phtml?topic=gfind)

GeneMark.hmm (ab initio)

– (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi)

MZEF (ab initio)

– (http://rulai.cshl.org/tools/genefinder/)

GrailEXP (hybrid)

– (http://grail.lsd.ornl.gov/grailexp/)

GeneID (hybrid)

– (http://www1.imim.es/geneid.html)


Prokaryotic Gene Prediction





Glimmer

– http://www.tigr.org/~salzberg/glimmer.html

GeneMark

– http://opal.biology.gatech.edu/GeneMark/gmhmm2_prok.cgi

Critica

– http://www.ttaxus.com/index.php?pagename=Software

ORNL Annotation Pipeline

– http://compbio.ornl.gov/GP3/pro.shtml


Non-protein Coding Gene Tools and Information




tRNA

– tRNAscan-SE

• http://www.genetics.wustl.edu/eddy/tRNAscan-SE/

– FAStRNA

• http://bioweb.pasteur.fr/seqanal/interfaces/fastrna.html

snoRNA

– snoRNA database

• http://rna.wustl.edu/snoRNAdb/

microRNA

– Sfold

• http://www.bioinfo.rpi.edu/applications/sfold/index.pl

– SIRNA

• http://bioweb.pasteur.fr/seqanal/interfaces/sirna.html


Functional annotation: the final frontier


Function(s) of the protein


Post-translational modification(s)


Domains and sites


Secondary structure


Quaternary structure


Similarities to other proteins


Diseases associated with deficiencies in the protein


Sequence conflicts, variants, etc.


Functional annotation sources

Publications that report experimental data

Protein sequence analysis:

• Search for characteristic domains (patterns in protein sequences found in all proteins carrying the same function: DNA binding domain, kinase domain, transmembrane domain…)

Comparison with other, related sequenced organisms

• Homology to proteins of known function

Experimental data

• Expression studies

• Biochemical studies

• 3D structure determination


From sequence to function


Annotation pipeline

(Flowchart; box labels only, the connections between boxes are not recoverable)

NEW SEQUENCES FROM SEQUENCING PROJECT

BLAST/FASTA

SEARCH FOR PATTERNS & FUNCTION DBs

SIGNIFICANT HITS / NO SIGNIFICANT HITS

PSI-BLAST

Search SCOP

IF EQUIVALOG, INFER FUNCTION

NB look out for multidomain proteins, put into genome context

HIT TO 3D PROTEIN-STRUCTURE & FUNCTION

PHYSICAL PROPERTIES, LOCALISATION ETC

ASSIGN PROTEIN FAMILY OR DOMAIN, CF OTHER PROTEINS IN FAMILY, INFER FUNCTION

Supplement with manual curation and use evidence tags


Limits of automated functional annotation

Databases are biased in sequence and AA composition and search is dependent on size

If no homology is found, only a limited amount of information can be inferred

Incorrect functional annotation can be propagated very fast. If a functional annotation is wrong, then all the proteins with homology to that protein discovered afterwards will have a wrong functional annotation.

No answers to tissue-specificity, binding of ligands, relationship between genotype and phenotype


Final notes


DON’T COMPLETELY TRUST COMPUTER RESULTS


CHECK LITERATURE


CONFIRM WITH WETLAB WORK
