Computational Genomics - ExPASy Home page

Computational Genomics 

Course Faculty 

Thomas L. Casavant, Ph.D.; Professor 

Tom B. Bair, Ph.D.; Post-Doctoral Fellow 

Todd E. Scheetz, Ph.D.; Assistant Professor 

Terry A. Braun, Ph.D.; Assistant Professor 

1

Outline for Today/Next Week 

• Administrative matters 

• Some Definitions and Scope 

• BCB Sampler 

• Sequence Alignment/Analysis 

2

Fundamental Observation 

• The study of the Life Sciences is changing 

– The so-called post genome era is here. 

– Hundreds of databases of DNA sequences and mRNA transcripts exist. 

– Functional knowledge is expanding constantly. 

– Most biological researchers are not prepared for this. 

– People with computational training are not prepared to work in this application area. 

3

Hub & Spoke Model 

In the Hub 

• Applied Math and Statistics 

• ECE 

• CS/Math 

• Mgmt Inf. Sys. 

• Information Science 

• Allied Disciplines 

• Physics 

4

Hub and Spoke Model 

Hub (center), with spokes to: 

• Ophthalmology 

• Cancer Center 

• Center for Macular Degen. 

• Gene Therapy 

• Pediatrics 

• Genetics 

• Pulmonary Medicine 

• BioChemistry 

• Biology 

• Allied Disciplines 

• Immunology 

5

Benefits of Such a Model 

• In the Hub: Computational Scientists. 

– Interacting with each other to share: 

• Knowledge and methods, computing infrastructure, organizational support, collaborative connections, curricular efforts, human resources (staff, post-docs, etc.). 

• At the Ends of the Spokes: Disciplinary Scientists and Interdisciplinary Collaborators 

– Computational Scientists form the links with collaborators who provide: 

• Access to real problems, and cross-training (x-training) in allied disciplines 

6

Consequence of Model 

for this Course 

• Who are you? 

– A lot of computer background/some biology 

– A lot of biology background/some computing 

• We will bring both groups along. 

– Some material will inevitably be review for some 

– Some material may seem over your head. 

• As in the hub-and-spoke model, we will try to work through the material “together” 

• A “different” kind of course, but not too different… 

7

Course Approach 

• Three Levels 

– Lecture/Concepts 

– Demonstration 

– Hands-on experience 

• Areas of Focus: 

– Concepts and Algorithms 

– Review Basic Computer Knowledge (learn a bit about what there is to learn) 

– Review Basic Molecular Biology/Genetics 

– How bioinformatics tools are designed 

– How to use/design non web-based software 

– UNIX and Perl 

8

Main Foci of Course 

1. Overview of Bioinformatics and Computational Biology. Models, Algorithms, and Available Tools. (Dr. Casavant) 

2. Review of UNIX/script writing in Perl, installing and running free software, files, etc. (Drs. Scheetz and Bair) 

3. A quick review of Molecular Biology, Biochemistry, and Genetics (Drs. Scheetz and Bair) 

4. Brisk tour of contemporary problem areas in Bioinformatics and Computational Biology (Drs. Casavant, Scheetz, Braun, and Bair). 

9

Main BCB Subjects 

• Sequence Analysis 

• Genome Analysis 

• Expression (Gene and Protein) 

• Pathways/Systems Biology 

• Mapping 

• Computational Methods 

• Case Studies 

SEE COURSE WEB PAGE 

10

Intended Take Home Lessons 

• The computer is a tool, but 

• A fairly blunt tool. 

• Need to be able to flex this tool 

• Too many disciplines for a single person 

– Computation 

• Engineering 

• Science 

– Mathematics/Probability/Statistics 

– Biology (Cellular, Molecular, etc) 

– Genetics (Human, evolution, molecular, etc) 

– Physics (bio, molecular, etc) 

– Chemistry 

• Must learn to work in teams 

11

BioInformatics and 

Computational Biology 

• Definition: 

The modern study of biology, and its applications to medicine, agriculture, and other areas – centered on the use of substantial computing resources – hardware, software, and human. 

12

How do Bioinformatics and Computational Biology differ? 

• Bioinformatics: Driven by large datasets which must be gathered, curated, stored, organized, searched, and archived. 

BioInformatics ⇒ Data Driven 

• Computational Biology: Development of algorithms and computational procedures to model, simulate, and analyze biological systems. 

Computational Biology ⇒ Model Driven 

13

BCB Training 

Three Critical Components 

1. Continuously Learn Biological/Computational Science (X-training) 

2. Develop computer systems (hardware and software) to “process” raw data and store it in a form that can be used for later analysis (Data Pipelines) 

3. Work with other biological/genetic/medical researchers to answer “the questions” by developing algorithms and computational tools to analyze the archived, as well as new, data (Algorithms and Tools) 

14

BCB 

X-Training 

Learning Biological/Computational Science 

• Begin with a recognized strength in either 

Biological or Computational Science 

• Then work continuously to strengthen the other area, depending on your background: 

– Computational Background: 

• Must study biology, biochemistry, genetics 

• Must develop a working knowledge of laboratory methods 

– Biological Background: 

• Must study algorithms, computer systems, programming 

• Must either develop competence in using computers at the “UNIX” level, or 

• Must be able to span the gap between biologists and those who can work with computers at that level 

15

BCB 

Data Pipeline Construction 

• Develop computer systems (hardware and software) to “process” raw data and store it in a form that can be used for later analysis: a “pipeline” of efficient, accurate, and high-performance processing steps. 

• Examples 

– Data acquisition hardware/software 

– Format conversion 

– Quality screening 

– Correlation functions 

– Annotation 

– Database deposition 

– Distribution of data 

16

MGC FL cDNA Sequencing Pipe 

17

BCB 

Algorithms and Tools 

• Formulating and answering “the questions” in cooperation with other biologists. 

– Understanding the problem 

– Formulating the question as it relates to informatics 

– Prototyping the analysis process 

– Iterating on the form and content of the analysis 

– Applying the tools to more general cases 

– Dissemination of tools and documenting their use 

– Evaluation of effectiveness 

– Continuous evolution of tools and their functions 

18

Java Cluster Viewer 

19

Hierarchical Clustering 

20

Science ⇔ Informatics (BCB) Relationship 

• Interpreting the information (Science/Genetics) 

• Dealing with the data (Bioinformatics) 

• Automating analyses (Computational Biology) 

21

Introductory BCB Examples 

• Bioinformatics: 

– Sequence alignment and database search 

– Gene discovery pipeline 

– EST Clustering 

• Computational Biology 

– Gene Prediction 

– Analysis of Low Complexity 

22

Sequence Alignment and Database Search 

(BioInformatics) 

• Alignment-based 

– Smith/Waterman 

– Dynamic Programming 

• Markov-model based 

• Large Database issues 

23

Sequence Alignment 

• Nucleotide vs. amino acids 

• Global vs. Local 

• Pair-wise vs. multiple 

• Simplest case: 

– Global, Pair-wise 

– Must match at both ends 

24

Sequence Alignment Example 

• Example: 

S1: TTACTTGCC (9 bases) 

S2: ATGACGAC (8 bases) 

• Scoring (1 possibility): 

+2 match 

0 mismatch 

-1 gap in either sequence 

• One Possible alignment: 

S1:     T  T  -  A  C  T  T  G  C  C
S2:     A  T  G  A  C  -  -  G  A  C
Score:  0  2 -1  2  2 -1 -1  2  0  2

Score = 10 - 3 = 7 
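The column-by-column scoring above can be checked mechanically. The following is a minimal sketch, assuming the slide's scoring scheme (+2 match, 0 mismatch, -1 gap) and using `-` for a gap; the function name `score_alignment` is illustrative and not part of the course material.

```python
def score_alignment(row1: str, row2: str,
                    match: int = 2, mismatch: int = 0, gap: int = -1) -> int:
    """Score two equal-length alignment rows, where '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # gap in either sequence
        elif a == b:
            total += match      # identical bases
        else:
            total += mismatch   # different bases
    return total


# The alignment shown on the slide: five matches (+10) and three gaps (-3).
print(score_alignment("TT-ACTTGCC", "ATGAC--GAC"))  # 7
```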

25

Cue to a Data Structure 

[Figure: the three possible moves at each step: gap in S2, gap in S1, or alignment (match/mismatch)] 

26

How hard can this be? 

• Brute force approach: consider all possible alignments, and choose the one with the best score 

• 3 choices at each internal branch point 

• Assume an $n \times n$ comparison: about $3^n$ paths to compare 

– $n = 3 \Rightarrow 3^3 = 27$ paths 

– $n = 20 \Rightarrow 3^{20} \approx 3.5 \times 10^{9}$ paths 

– $n = 200 \Rightarrow 3^{200} \approx 2.7 \times 10^{95}$ paths 

• If 1 path takes 1 nanosecond ($10^{-9}$ secs), $n = 200$ takes 

– $8.4 \times 10^{78}$ years! 

• But, using data structures cleverly, this can be greatly sped up to $O(n^2)$ ⇒ .04 msec (n=200), but 

for a 400 base query, 3 gb database (HG) ⇒ 20 minutes, 

for a 400 base query, 38 gb database (GenBank) ⇒ 12.6 hrs! 
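One way to see where the $O(n^2)$ figure comes from is a dynamic-programming table with one cell per pair of prefixes. Each cell is filled from its three neighbours: gap in S2, gap in S1, or match/mismatch. The sketch below is a textbook global alignment under the slide's scoring (+2 / 0 / -1); the function name and the traceback tie-breaking order are assumptions, not the course's implementation.

```python
def global_align(s1, s2, match=2, mismatch=0, gap=-1):
    """Global pairwise alignment by dynamic programming.

    Returns (best_score, aligned_s1, aligned_s2).
    """
    n, m = len(s1), len(s2)
    # score[i][j] = best score for aligning s1[:i] with s2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for aligning base s1[i-1] against base s2[j-1].
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # Fill the table: (n+1)*(m+1) cells, three candidates per cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # match/mismatch
                              score[i - 1][j] + gap,            # gap in s2
                              score[i][j - 1] + gap)            # gap in s1

    # Trace back from the bottom-right corner to recover one best alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))


best, row1, row2 = global_align("TTACTTGCC", "ATGACGAC")
print(best)
print(row1)
print(row2)
```

The table has (n+1) x (m+1) cells and each cell costs constant work, which is the quadratic cost quoted above. Several alignments may share the best score; the traceback returns only one of them.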

27

EST Gene Discovery Pipeline 


28

EST Sequence Clustering 


• Goal: Group together expressed sequence tags (ESTs) and full-length cDNA data into gene-based indices 

– Sequences are considered linked if their similarity score exceeds some threshold 
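The linking rule above amounts to single-linkage clustering: any two sequences whose similarity exceeds the threshold end up in the same group, directly or through intermediates. The sketch below assumes a stand-in similarity (fraction of shared k-mers) purely for illustration; the slides do not say which score is used, and `kmer_set`, `similarity`, and `cluster` are hypothetical names.

```python
from itertools import combinations


def kmer_set(seq, k=8):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=8):
    """Stand-in similarity: shared k-mers as a fraction of the smaller k-mer set."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def cluster(seqs, threshold=0.5, k=8):
    """Group sequence indices; two sequences are linked if similarity > threshold."""
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j], k) > threshold:
            parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


ests = ["AAAACCCCGGGG", "CCCCGGGGTTTT", "TATATATATATA"]
print(cluster(ests, threshold=0.5, k=4))  # [[0, 1], [2]]
```

Comparing every pair is quadratic in the number of sequences, so a real EST set would need an index or a pre-filter before the pairwise step.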

29

Gene Prediction 

(Computational Biology) 

• Contexts: 

– Identifying full length transcripts 

– Finding genes in genomic sequence 

• Approaches 

• Deployment Issues 

30

Genome Architecture in a Nutshell 

[Figure: gene structure diagram with the following labels] 

• Promoter 

• Transcription Start 

• 5’ UTR 

• Start codon, Codons, Donor site: GCCGCCGCCATGCCCTTCTCCAACAGGTGAGTGAG 

• Acceptor site: CTCCCAGCCCTGCC 

• Exon 

• Intron 

• Stop Codon: ATCCCCATGCCTGAGGGCCCCT 

• Poly-A site: GCAGAAACAATAAAACCA 

• 3’ UTR 

31

Preview of an HMM Model for 

Gene Prediction 

32

The Crux of Gene Prediction 

[Figure: gene structure diagram. Labels: 5’ UTR; 1st Exon; Kozak consensus; ATG; stops in all 3 frames; no in-frame stops; GT; AG; Exon, Intron, Exon] 
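The "no in-frame stops" versus "stops in all 3 frames" distinction can be tested by scanning each reading frame for a stop codon. The sketch below uses the three standard stop codons (TAA, TAG, TGA) and looks only at the forward strand; the function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stop_positions(seq, frame):
    """0-based offsets of stop codons when reading seq in the given frame (0, 1, or 2)."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS]


def open_frames(seq):
    """Frames (0, 1, 2) that contain no stop codon at all."""
    return [frame for frame in range(3) if not stop_positions(seq, frame)]


seq = "ATGAAATAGCC"
print(stop_positions(seq, 0))  # [6]  (ATG AAA TAG)
print(stop_positions(seq, 1))  # [1]  (TGA ...)
print(open_frames(seq))        # [2]
```

A region with stops in every frame would return an empty list from `open_frames`, while a candidate coding exon should keep at least one frame open.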

33

Gene Prediction Approaches 

• Ab initio methods: 

– Profile Hidden Markov Models (GENSCAN, HMMgene) 

– Neural Networks (GRAIL, Genie) 

– Decision Trees (MORGAN) 

• Issues: 

– Seeding from training sets 

– Fully general approaches 

• Interesting question: 

– Can gene finding be done in a species-independent way? 

34

Simple Dicty Gene Finder 

(Intuition and an Example) 

• Basic Idea (G. Klein): based on GC/AT content of Introns vs. Exons 

• Idealized Example: Count G/Cs and A/Ts in a window of 10 bases. 

[Figure: idealized sequence annotated with windowed AT content and GC content] 

…….CGCGGGCGCCGTATTTATATATTATA…..AATATTTTATATAGCCCGGCGCGGCCG…... 

Window counts shown in the figure: 6 10 10 10 and 10 10 6 2 

Labels: Acceptor Site, Donor Site 

Point where GC.left and AT.right are both maximized 
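The windowed counts in the idealized example can be reproduced with a few lines. In this sketch "both maximized" is approximated by maximizing the sum GC.left + AT.right, which is an assumption; the function names and the 0-based position convention (the boundary sits just before index `pos`) are also illustrative.

```python
def window_counts(seq, pos, w=10):
    """GC count in the w bases left of pos and AT count in the w bases right of pos."""
    seq = seq.upper()
    left = seq[max(0, pos - w):pos]
    right = seq[pos:pos + w]
    gc_left = sum(base in "GC" for base in left)
    at_right = sum(base in "AT" for base in right)
    return gc_left, at_right


def best_transition(seq, w=10):
    """Position with the largest GC.left + AT.right, among positions with full windows.

    Raises ValueError if seq is shorter than 2 * w.
    """
    candidates = range(w, len(seq) - w + 1)
    return max(candidates, key=lambda p: sum(window_counts(seq, p, w)))


# Left half of the idealized sequence on the slide (GC-rich, then AT-rich).
seq = "CGCGGGCGCCGTATTTATATATTATA"
print(window_counts(seq, 11))  # (10, 10)
print(best_transition(seq))    # 11
```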

35

Dicty Gene Finding Tool Model 

• Model Parameters: 

– W -- window size 

– $\theta_{low}$ -- threshold below which GC or AT content does not match the hypothesis 

– $\theta_{high}$ -- threshold above which GC or AT content matches the hypothesis 

– m -- number of consecutive windows that will be examined 

– n -- number of windows out of m that must exceed $\theta$ to qualify for an intron/exon or exon/intron transition 

– tol -- maximum distance from the GC/AT content transition at which the GT or AG motif must be found 

36

Dicty Gene Finding Tool Model 

W = 8, m = 4, $\theta_{high}$ = 7, $\theta_{low}$ = 6 

. . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . . 

[Figure: sliding windows over the sequence, annotated with G/C=7, n = 3, and n = 4] 

37

Dicty Intron/Gene 

Prediction Algorithm 

1. Calculate AT (GC) content in size W windows right and left of each base position. 

2. Calculate $n_{\text{AT count} \ge \theta_{high}}$ and $n_{\text{AT count} \le \theta_{low}}$ over the m consecutive windows to the left and right of each base position. 

3. For each position: 

If $\text{ATleft}_{high} \ge n$ && $\text{ATright}_{low} \ge n$ ⇒ potential acceptor site 

If $\text{ATleft}_{low} \ge n$ && $\text{ATright}_{high} \ge n$ ⇒ potential donor site 

38

Dicty Intron/Gene 

Prediction Algorithm 

(continued) 

4. For each potential donor or acceptor site: 

If a GT (donor) or AG (acceptor) motif is found within tol bases, note this as an intron boundary. 

5. Sort boundaries into candidate introns. 
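Steps 1 through 5 can be sketched end to end as follows. Several details are assumptions made only for illustration, since the slides do not fix them: the m windows on each side are taken as W-base windows shifted by one base each, the motif search covers tol bases on either side of the position, positions are 0-based, and every function name here is hypothetical.

```python
def at_count(window):
    """Number of A or T bases in a window (step 1)."""
    return sum(base in "AT" for base in window)


def side_counts(seq, pos, w, m, theta_low, theta_high, side):
    """Step 2: among m shifted windows on one side of pos, count AT-rich and AT-poor ones."""
    high = low = 0
    for k in range(m):
        start = pos - k - w if side == "left" else pos + k
        if start < 0 or start + w > len(seq):
            continue  # window would run off the sequence
        c = at_count(seq[start:start + w])
        high += c >= theta_high
        low += c <= theta_low
    return high, low


def has_motif(seq, pos, motif, tol):
    """Step 4: is the motif found within tol bases of pos?"""
    return motif in seq[max(0, pos - tol):pos + tol + len(motif)]


def find_boundaries(seq, w, theta_low, theta_high, m, n, tol):
    """Steps 3-5: return sorted (position, kind) candidate intron boundaries."""
    seq = seq.upper()
    found = []
    for pos in range(len(seq) + 1):
        l_high, l_low = side_counts(seq, pos, w, m, theta_low, theta_high, "left")
        r_high, r_low = side_counts(seq, pos, w, m, theta_low, theta_high, "right")
        # AT-poor on the left, AT-rich on the right: exon -> intron (donor).
        if l_low >= n and r_high >= n and has_motif(seq, pos, "GT", tol):
            found.append((pos, "donor"))
        # AT-rich on the left, AT-poor on the right: intron -> exon (acceptor).
        if l_high >= n and r_low >= n and has_motif(seq, pos, "AG", tol):
            found.append((pos, "acceptor"))
    return sorted(found)


# Synthetic test: GC-rich exon, AT-rich intron (GT...AG), GC-rich exon.
toy = "GCGCGCGCGCGCGCGC" + "GTATATATATATATATATATAG" + "GCGCGCGCGCGCGCGC"
for pos, kind in find_boundaries(toy, w=4, theta_low=1, theta_high=3, m=3, n=2, tol=2):
    print(pos, kind)
```

On the toy sequence the true boundaries sit at positions 16 (donor) and 38 (acceptor); neighbouring positions may also be reported, which is why the slides go on to search for parameters that minimize false positives.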

39

Test Data 

>IIADP1D6358 Antiparallèle 811 bases 

AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT 

CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT 

AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC 

TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT 

GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA 

TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt 

atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT 

ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT 

gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat 

tatttgattaaaaatagaaggtttttttttttattttttttttttatttt 

tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat 

taattttaattttttttttttttttttttttttttttttttttttttttt 

ttcatttttaacatcatttgattcattaatttattttttttttcaacatc 

cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA 

TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG 

AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT 

CGACCGAAGGC 

• Probable Correct Introns: 

+267 -341 

+401 -687 
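The test sequence appears to mark the annotated introns in lowercase; that reading is an assumption, though the lowercase runs line up with the coordinates listed above. Under that assumption, the intron coordinates can be recovered directly from the sequence text. The function name is illustrative.

```python
import re


def lowercase_runs(seq):
    """1-based inclusive (start, end) coordinates of each run of lowercase letters."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", seq)]


# Small illustration: two lowercase stretches inside uppercase flanks.
print(lowercase_runs("ACGTacgtACGTttACG"))  # [(5, 8), (13, 14)]
```

To apply it to the FASTA record, the header line would be dropped and the sequence lines joined into one string first.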

40

Parameter Space to Search 

• Ranges 

– W -- 3 → 10 (8 values) 

– $\theta_{high}$ -- .7xW → W (≈4 values) 

– $\theta_{low}$ -- .5xW → .9xW (≈4 values) 

– m -- 3 → 11 (9 values) 

– n -- m/2 → m (≈4 values) 

– tol -- 3 → 7 (5 values) 

• 3584 x 5 ≈ 18,000 sets of parameters 

• Search for sets that find all expected sites with a minimum of false positives. 
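Enumerating the parameter space is a nested loop over the ranges above. The sketch below assumes integer thresholds, rounding the lower bounds up and the upper bounds down; that rounding is a guess, so the number of sets it yields need not match the count quoted on the slide. The name `parameter_grid` is illustrative.

```python
from itertools import product


def parameter_grid():
    """Yield (W, theta_low, theta_high, m, n, tol) tuples over the slide's ranges."""
    for w in range(3, 11):                              # W: 3 -> 10
        highs = range((7 * w + 9) // 10, w + 1)         # ceil(.7 W) -> W
        lows = range((w + 1) // 2, (9 * w) // 10 + 1)   # ceil(.5 W) -> floor(.9 W)
        for theta_high, theta_low in product(highs, lows):
            for m in range(3, 12):                      # m: 3 -> 11
                for n in range((m + 1) // 2, m + 1):    # n: ceil(m/2) -> m
                    for tol in range(3, 8):             # tol: 3 -> 7
                        yield w, theta_low, theta_high, m, n, tol


grid = list(parameter_grid())
print(len(grid), "parameter sets")
print(grid[0])
```

Each tuple would then be passed to the gene finder, keeping the sets that recover all expected sites with the fewest extra ones.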

41

Test Data 

idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687 

idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687 

idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687 

idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687 

idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687 

idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687 

idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687 

idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687 

idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687 

idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687 

idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687 

idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687 

idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687 

idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687 

idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687 

idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687 

idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687 

idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687 

idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687 

idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687 

idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687 

idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687 

idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687 

idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687 

idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687 

idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687 

. . . . . About 18,000 more lines like this . . . 

42

Test Data Raw Results 

len=811 W=3 n=1 m=3 thrL=1 thrH=2, Tol=4, Sites Found=18 

Intron: 1 + 91 + 213 - 213 

Intron: 2 + 236 - 241 

Intron: 3 + 267 - 267 

Intron: 4 + 385 + 399 - 399 - 467 

Intron: 5 + 471 + 759 - 759 - 797 

Intron: 6 + 799 - 799 


Intron: 1 + 91 + 213 - 213 

Intron: 2 + 219 - 223 

Intron: 3 + 236 - 241 

Intron: 4 + 267 - 267 

Intron: 5 + 305 - 312 - 335 

Intron: 6 + 341 - 341 

Intron: 7 + 385 + 399 - 399 

Intron: 8 + 429 - 433 

Intron: 9 + 441 - 467 

Intron: 10 + 471 - 753 

Intron: 11 + 759 - 759 - 797 

Intron: 12 + 799 - 799 


Intron: 1 + 91 + 213 - 213 - 241 

Intron: 2 + 267 + 399 - 399 - 467 

Intron: 3 + 471 + 759 - 759 - 786 

. . . About 18,000 sets of results like this. . . 

43

Test Data Filtered Results 

len=811 W=6 n=5 m=9 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND 










This provides an initial set of parameters that are likely to be optimal. 

44

Best Parameter Set on Known Gene 


1: AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT 

51: CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT 

101: AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC 

151: TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT 

201: GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA 

251: TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt 

301: atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT 

351: ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt 

401: gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT 

451: TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt 

501: tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat 

551: taattttaat tttttttttt tttttttttt tttttttttt tttttttttt 

601: ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc 

651: cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA 

701: TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG 

751: AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT 

801: CGACCGAAGG C 

Intron 1: + 213 - 241 overpredicted (45 bases) 

Intron 2: + 267 - 341 underpredicted (37 bases) 

Intron 3: + 385 + 399 - 433 correct + (325 bases) 

Intron 4: + 471 - 687 correct - (404 bases) 

45

Repetitive and 

Low Complexity Analysis 

• The problem: 

DNA is not random 

• 3 properties of interest: 

– Complexity: Compositional bias 

– Pattern: Interspersed k-grams 

– Periodicity: repetition of residues or k-grams 

• We will focus on complexity (low) 
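Compositional bias can be quantified in several ways. One common choice, used here only as an illustration since the slides do not name a measure, is the Shannon entropy of the base composition in a sliding window: 0 bits when a window contains a single base, 2 bits when A, C, G, and T are equally frequent. The function names and default thresholds are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy, in bits per base, of the base composition of seq."""
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, w=20, max_entropy=1.0):
    """Start offsets of w-base windows whose entropy is at or below max_entropy."""
    return [i for i in range(len(seq) - w + 1)
            if shannon_entropy(seq[i:i + w]) <= max_entropy]


print(shannon_entropy("AAAAAAAA"))   # single base: zero entropy
print(shannon_entropy("ACGTACGT"))   # uniform composition: 2.0 bits
print(low_complexity_windows("TTTTTTTTTTACGTACGTAC", w=10, max_entropy=0.5))
```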

46

Fin 

47
