Computational Genomics - ExPASy Home page
<strong>Computational</strong> <strong>Genomics</strong><br />
Course Faculty<br />
Thomas L. Casavant, Ph.D; Professor<br />
Tom B. Bair, Ph.D; Post-Doctoral Fellow<br />
Todd E. Scheetz, Ph.D; Assistant Professor<br />
Terry A. Braun, Ph.D; Assistant Professor<br />
1
Outline for Today/Next Week<br />
• Administrative matters<br />
• Some Definitions and Scope<br />
• BCB Sampler<br />
• Sequence Alignment/Analysis<br />
2
Fundamental Observation<br />
• The study of the Life Sciences is changing<br />
– The so-called post genome era is here.<br />
– Hundreds of databases of DNA sequence and mRNA<br />
transcripts exist.<br />
– Functional knowledge is expanding constantly.<br />
– Most biological researchers are not prepared for this.<br />
– People with computational training are not prepared to<br />
work in this application area.<br />
3
Hub & Spoke Model<br />
In the Hub<br />
Applied Math<br />
and Statistics<br />
ECE<br />
CS/Math<br />
Mgmt Inf. Sys.<br />
Information<br />
Science<br />
Allied Disciplines<br />
Physics<br />
4
Hub and Spoke Model<br />
Ophthalmology<br />
Cancer Center<br />
Center for<br />
Macular Degen.<br />
Gene Therapy<br />
Hub<br />
Pediatrics<br />
Genetics<br />
Pulmonary<br />
Medicine<br />
BioChemistry<br />
Biology<br />
Allied Disciplines<br />
Immunology<br />
5
Benefits of Such a Model<br />
• In the Hub: <strong>Computational</strong> Scientists.<br />
– Interacting with each other to share:<br />
• Knowledge and methods, computing infrastructure,<br />
organizational support, collaborative connections, curricular<br />
efforts, human resources (staff, post-docs, etc).<br />
• At the Ends of the Spokes: Disciplinary Scientists<br />
and Interdisciplinary Collaborators<br />
– <strong>Computational</strong> Scientists form the links with<br />
collaborators who provide:<br />
• Access to real problems, and x-training in allied disciplines<br />
6
Consequence of Model<br />
for this Course<br />
• Who are you?<br />
– A lot of computer background/some biology<br />
– A lot of biology background/some computing<br />
• We will bring both groups along.<br />
– Some material will inevitably be review for some<br />
– Some material may seem over your head.<br />
• Like the hub and spoke model, we will try to work<br />
on them “together”<br />
• “Different” kind of course, but not too different…<br />
7
Course Approach<br />
• Three Levels<br />
– Lecture/Concepts<br />
– Demonstration<br />
– Hands-on experience<br />
• Areas of Focus:<br />
– Concepts and Algorithms<br />
– Review Basic Computer Knowledge (learn a bit<br />
about what there is to learn)<br />
– Review Basic Molecular Biology/Genetics<br />
– How bioinformatics tools are designed<br />
– How to use/design non web-based software<br />
– UNIX and Perl<br />
8
Main Foci of Course<br />
1. Overview of Bioinformatics and <strong>Computational</strong><br />
Biology. Models, Algorithms, and Available Tools.<br />
(Dr. Casavant)<br />
2. Review UNIX/Script writing in Perl, installing and<br />
running free software, files, etc. (Drs. Scheetz and<br />
Bair)<br />
3. A quick review of Molecular Biology,<br />
Biochemistry, and Genetics (Drs. Scheetz and<br />
Bair)<br />
4. Brisk tour of contemporary problem areas in<br />
Bioinformatics and <strong>Computational</strong> Biology (Drs.<br />
Casavant, Scheetz, Braun, and Bair).<br />
9
Main BCB Subjects<br />
• Sequence Analysis<br />
• Genome Analysis<br />
• Expression (Gene and Protein)<br />
• Pathways/Systems Biology<br />
• Mapping<br />
• <strong>Computational</strong> Methods<br />
• Case Studies<br />
SEE COURSE WEB PAGE<br />
10
Intended Take-Home Lessons<br />
• The computer is a tool, but<br />
• A fairly blunt tool.<br />
• Need to be able to flex this tool<br />
• Too many disciplines for a single person<br />
– Computation<br />
• Engineering<br />
• Science<br />
– Mathematics/Probability/Statistics<br />
– Biology (Cellular, Molecular, etc)<br />
– Genetics (Human, evolution, molecular, etc)<br />
– Physics (bio, molecular, etc)<br />
– Chemistry<br />
• Must learn to work in teams<br />
11
BioInformatics and<br />
<strong>Computational</strong> Biology<br />
• Definition:<br />
The modern study of biology, and its<br />
applications to medicine, agriculture, and other<br />
areas – centered on the use of substantial<br />
computing resources – hardware, software, and<br />
human.<br />
12
How do Bioinformatics and<br />
<strong>Computational</strong> Biology differ?<br />
• Bioinformatics: Driven by large datasets which<br />
must be gathered, curated, stored, organized,<br />
searched, and archived.<br />
BioInformatics ⇒ Data Driven<br />
• <strong>Computational</strong> Biology: Development of<br />
algorithms and computational procedures to<br />
model, simulate, and analyze biological systems.<br />
<strong>Computational</strong> Biology ⇒ Model Driven<br />
13
BCB Training<br />
Three Critical Components<br />
1. Continuously Learn Biological/<strong>Computational</strong><br />
Science (X-training)<br />
2. Develop computer systems (hardware and software)<br />
to “process” raw data and store it in a form that can<br />
be used for later analysis (Data Pipelines)<br />
3. Work with other biological/genetic/medical<br />
researchers to answer “the questions” by developing<br />
algorithms and computational tools to analyze the<br />
archived, as well as new data. (Algorithms and<br />
Tools)<br />
14
BCB<br />
X-Training<br />
Learning Biological/<strong>Computational</strong> Science<br />
• Begin with a recognized strength in either<br />
Biological or <strong>Computational</strong> Science<br />
• Then work to continuously strengthen the other<br />
area --- IF ---<br />
– <strong>Computational</strong> Background:<br />
• Must study biology, biochemistry, genetics<br />
• Must develop a working knowledge of laboratory methods<br />
– Biological Background:<br />
• Must study algorithms, computer systems, programming<br />
• Must either develop competence in using computers at the<br />
“UNIX” level, or<br />
• Must be able to span the gap between biologists and those who<br />
can work with computers at that level<br />
15
BCB<br />
Data Pipeline Construction<br />
• Develop computer systems (hardware and software) to<br />
“process” raw data and store it in a form that can be used<br />
for later analysis - a “pipeline” of efficient, accurate, and<br />
high-performance processing steps.<br />
• Examples<br />
– Data acquisition hardware/software<br />
– Format conversion<br />
– Quality screening<br />
– Correlation functions<br />
– Annotation<br />
– Database deposition<br />
– Distribution of data<br />
16
MGC FL cDNA Sequencing Pipe<br />
17
BCB<br />
Algorithms and Tools<br />
• Formulating and answering “the questions” in<br />
cooperation with other biologists.<br />
– Understanding the problem<br />
– Formulating the question as it relates to informatics<br />
– Prototyping the analysis process<br />
– Iterating on the form and content of the analysis<br />
– Applying the tools to more general cases<br />
– Dissemination of tools and documenting their use<br />
– Evaluation of effectiveness<br />
– Continuous evolution of tools and their functions<br />
18
Java Cluster Viewer<br />
19
Hierarchical Clustering<br />
20
Science ⇔ Informatics (BCB)<br />
Relationship<br />
Interpreting the information<br />
(Science/Genetics)<br />
Dealing with the data<br />
(Bioinformatics)<br />
Automating analyses<br />
(<strong>Computational</strong> Biology)<br />
21
Introductory BCB Examples<br />
• Bioinformatics:<br />
– Sequence alignment and database search<br />
– Gene discovery pipeline<br />
– EST Clustering<br />
• <strong>Computational</strong> Biology<br />
– Gene Prediction<br />
– Analysis of Low Complexity<br />
22
Sequence Alignment and Database Search<br />
(BioInformatics)<br />
• Alignment-based<br />
– Smith/Waterman<br />
– Dynamic Programming<br />
• Markov-model based<br />
• Large Database issues<br />
23
Sequence Alignment<br />
• Nucleotide vs. amino acids<br />
• Global vs. Local<br />
• Pair-wise vs. multiple<br />
• Simplest case:<br />
– Global, Pair-wise<br />
– Must match at both ends<br />
24
Sequence Alignment Example<br />
• Example:<br />
S1: TTACTTGCC (9 bases)<br />
S2: ATGACGAC (8 bases)<br />
• Scoring (1 possibility):<br />
+2 match<br />
0 mismatch<br />
-1 gap in either sequence<br />
• One Possible alignment:<br />
T T - A C T T G C C<br />
A T G A C - - G A C<br />
0 +2 -1 +2 +2 -1 -1 +2 0 +2 ⇒ Score = 10 - 3 = 7<br />
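The column-by-column scoring above can be checked with a short sketch. The function name `score_alignment` and its keyword arguments are illustrative assumptions; the scoring values are the ones listed on the slide.

```python
def score_alignment(row1, row2, match=2, mismatch=0, gap=-1):
    """Score two equal-length aligned rows; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # gap in either sequence
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The alignment shown on the slide: five matches (+10) and three gaps (-3).
print(score_alignment("TT-ACTTGCC", "ATGAC--GAC"))  # 7
```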
25
Cue to a Data Structure<br />
Gap in S2<br />
Gap in S1<br />
Alignment<br />
(match/mismatch)<br />
26
How hard can this be?<br />
• Brute force approach: consider all possible alignments, and<br />
choose the one with best score<br />
• 3 choices at each internal branch point<br />
• Assume an n x n comparison: $3^n$ paths<br />
– n = 3 ⇒ $3^3 = 27$ paths<br />
– n = 20 ⇒ $3^{20} \approx 3.5 \times 10^{9}$ paths<br />
– n = 200 ⇒ $3^{200} \approx 2.7 \times 10^{95}$ paths<br />
• If 1 path takes 1 nanosecond ($10^{-9}$ secs), n = 200 takes<br />
– $8.4 \times 10^{78}$ years!<br />
• But, using data structures cleverly, this can be greatly sped up<br />
to $O(n^2)$ ⇒ 0.04 msec (n = 200), but<br />
for a 400 base query, 3 Gb database (HG) ⇒ 20 minutes,<br />
for a 400 base query, 38 Gb database (GenBank) ⇒ 12.6 hrs!<br />
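One way to get the quadratic behavior mentioned above is the standard dynamic-programming table for global alignment. This is a minimal sketch, not the course's own code: the function name `global_alignment_score` is an assumption, and the scoring defaults are taken from the earlier example slide. Each table cell considers the same three choices as a branch point in the brute-force search.

```python
def global_alignment_score(s1, s2, match=2, mismatch=0, gap=-1):
    """Best global alignment score, filled in O(len(s1) * len(s2)) steps."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # best[i][j] holds the best score for aligning s1[:i] with s2[:j].
    best = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = i * gap          # s1 prefix aligned against gaps only
    for j in range(1, cols):
        best[0][j] = j * gap          # s2 prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            best[i][j] = max(
                best[i - 1][j - 1] + pair,  # align the two residues
                best[i - 1][j] + gap,       # gap in s2
                best[i][j - 1] + gap,       # gap in s1
            )
    return best[-1][-1]


# The result is at least 7, because the alignment on the example slide scores 7.
print(global_alignment_score("TTACTTGCC", "ATGACGAC"))
```

Only the score is returned here; recovering the alignment itself would need a traceback through the table, which is omitted to keep the sketch short.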
27
EST Gene Discovery Pipeline<br />
(BioInformatics)<br />
28
EST Sequence Clustering<br />
(BioInformatics)<br />
• Goal: Group together expressed<br />
sequence tags (ESTs) and full length<br />
cDNA data into gene-based indices<br />
– Sequences considered linked if similarity<br />
score exceeds some threshold<br />
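The linking rule above amounts to single-linkage clustering. The sketch below is illustrative only: the slide does not say which similarity score is used, so `similarity` (a Jaccard index over shared k-mers), the value of `k`, and the threshold are all stand-in assumptions.

```python
def kmer_set(seq, k=8):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=8):
    """Hypothetical similarity score: Jaccard index of the two k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def cluster(seqs, threshold=0.3, k=8):
    """Group sequence indices; two sequences are linked if similarity > threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j], k) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


ests = ["ACGTACGTAGCTAGCTAGGA", "CGTACGTAGCTAGCTAGGAT", "TTTTTTTTTTGGGGGGGGGG"]
print(cluster(ests))  # [[0, 1], [2]]
```

The all-pairs loop is quadratic in the number of sequences, which is one reason large EST sets need indexing or other shortcuts in practice.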
29
Gene Prediction<br />
(<strong>Computational</strong> Biology)<br />
• Contexts:<br />
– Identifying full length transcripts<br />
– Finding genes in genomic sequence<br />
• Approaches<br />
• Deployment Issues<br />
30
Genome Architecture in a Nutshell<br />
[Figure: gene structure diagram. Labels: Promoter, Transcription Start, 5’ UTR, Exon, Intron, 3’ UTR]<br />
Start codon, codons, donor site: GCCGCCGCCATGCCCTTCTCCAACAGGTGAGTGAG<br />
Acceptor site: CTCCCAGCCCTGCC<br />
Stop codon: ATCCCCATGCCTGAGGGCCCCT<br />
Poly-A site: GCAGAAACAATAAAACCA<br />
31
Preview of an HMM Model for<br />
Gene Prediction<br />
32
The Crux of Gene Prediction<br />
[Figure: annotated gene model. Labels, in extraction order:]<br />
5’ UTR; 1st Exon; Kozak consensus; ATG; stops in all 3 frames; no in-frame stops; GT; AG; Exon; Intron; Exon<br />
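The "no in-frame stops" and "stops in all 3 frames" cues can be tested mechanically. The following is a minimal sketch; the helper names `first_in_frame_stop` and `frames_with_stops` are assumptions, and only the three standard stop codons are considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(seq, start=0):
    """Index of the first stop codon read in frame from `start`, or None."""
    seq = seq.upper()
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def frames_with_stops(seq):
    """The set of forward reading frames (0, 1, 2) that contain a stop codon."""
    return {frame for frame in range(3)
            if first_in_frame_stop(seq, frame) is not None}


# An open stretch starting at ATG has no stop in frame 0.
print(first_in_frame_stop("ATGCCCTTCTCCAACAG"))  # None
# ATG AAA TGA: a stop in frame 0 at index 6, and TGA also appears in frame 1.
print(frames_with_stops("ATGAAATGA"))
```

A region with stops in all three frames (a returned set of `{0, 1, 2}`) is unlikely to be coding on that strand, which is the intuition the figure labels point at.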
33
Gene Prediction Approaches<br />
• Ab initio methods:<br />
– Hidden Markov Models (GENSCAN, HMMgene)<br />
– Neural Networks (GRAIL, Genie)<br />
– Decision Trees (MORGAN)<br />
• Issues:<br />
– Seeding from training sets<br />
– Fully general approaches<br />
• Interesting question:<br />
– Can gene finding be done species-independently?<br />
34
Simple Dicty Gene Finder<br />
(Intuition and an Example)<br />
• Basic Idea (G. Klein) based on GC/AT content of Introns<br />
vs. Exons<br />
• Idealized Example: Count G/Cs and A/Ts in a window<br />
of 10 bases.<br />
[Figure: idealized sequence annotated with windowed AT content and GC content]<br />
…CGCGGGCGCCGTATTTATATATTATA…AATATTTTATATAGCCCGGCGCGGCCG…<br />
Window counts shown in the figure: 6 10 10 10 and 10 10 6 2<br />
Donor Site: point where GC.left and AT.right are both maximized<br />
Acceptor Site<br />
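The windowed counting in the idealized example can be sketched as follows. The function name `window_counts` is an assumption, and windows are taken at every start position, which the slide does not specify.

```python
def window_counts(seq, w=10):
    """Return a (gc, at) count pair for each length-w window of seq."""
    seq = seq.upper()
    counts = []
    for i in range(len(seq) - w + 1):
        window = seq[i:i + w]
        gc = sum(base in "GC" for base in window)
        at = sum(base in "AT" for base in window)
        counts.append((gc, at))
    return counts


# First 20 bases of the idealized example: GC-rich, then AT-rich.
counts = window_counts("CGCGGGCGCCGTATTTATAT", w=10)
print(counts[0], counts[-1])  # (10, 0) (1, 9)
```

The GC count falls and the AT count rises as the window slides across the boundary, which is the signal the gene finder looks for.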
35
Dicty Gene Finding Tool Model<br />
• Model Parameters:<br />
– W -- Window Size<br />
– θ_low -- threshold below which GC or AT<br />
content does not match hypothesis<br />
– θ_high -- threshold above which GC or AT<br />
content matches hypothesis<br />
– m -- number of consecutive windows<br />
that will be examined<br />
– n -- number of windows out of m<br />
that must exceed θ to qualify for an<br />
intron/exon or exon/intron transition<br />
– tol -- maximum distance from the GC/AT<br />
content transition at which the GT or<br />
AG motif must be found<br />
36
Dicty Gene Finding Tool Model<br />
W = 8, m = 4, θ_high = 7, θ_low = 6<br />
. . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . .<br />
[Figure: sliding windows over the sequence, labeled 6 down to 1, with annotations G/C=7, n = 3, and n = 4]<br />
37
Dicty Intron/Gene<br />
Prediction Algorithm<br />
1. Calculate AT (GC) content in size W windows<br />
right and left of each base position.<br />
2. Calculate n(AT count ≥ θ_high) and n(AT count ≤ θ_low),<br />
the number of qualifying windows among the m windows to the left and right of<br />
each base position.<br />
3. For each position: If<br />
ATleft_high ≥ n && ATright_low ≥ n<br />
⇒ potential acceptor site<br />
ATleft_low ≥ n && ATright_high ≥ n<br />
⇒ potential donor site<br />
38
Dicty Intron/Gene<br />
Prediction Algorithm<br />
(continued)<br />
4. For each potential donor or acceptor site:<br />
If the GT (donor) or AG (acceptor) motif is found<br />
within tol bases distance, note this as an intron<br />
boundary.<br />
5. Sort boundaries into candidate introns.<br />
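Steps 1 through 4 can be sketched as below. This is one reading of the slides, not the original tool: the names `at_count` and `candidate_sites` are hypothetical, and two details are assumptions because the slides leave them open. The m left windows are taken to end just before the position, each shifted by one base, and the m right windows to start at the position. Step 5 (pairing boundaries into introns) is not covered.

```python
def at_count(seq, start, w):
    """AT count of seq[start:start + w], or None if the window leaves the sequence."""
    if start < 0 or start + w > len(seq):
        return None
    return sum(base in "AT" for base in seq[start:start + w])


def candidate_sites(seq, w, theta_low, theta_high, m, n, tol):
    """Return (donors, acceptors): positions passing the content and motif tests."""
    seq = seq.upper()
    donors, acceptors = [], []
    for p in range(len(seq)):
        # Step 1: AT content in m windows left of p and m windows right of p.
        left = [at_count(seq, p - w - k, w) for k in range(m)]
        right = [at_count(seq, p + k, w) for k in range(m)]
        if None in left or None in right:
            continue  # too close to either end of the sequence
        # Step 2: count windows at or above theta_high, at or below theta_low.
        left_high = sum(c >= theta_high for c in left)
        left_low = sum(c <= theta_low for c in left)
        right_high = sum(c >= theta_high for c in right)
        right_low = sum(c <= theta_low for c in right)
        # Step 4: the motif must start within tol bases of p.
        nearby = seq[max(0, p - tol):p + tol + 2]
        # Step 3: AT-poor on the left, AT-rich on the right suggests a donor.
        if left_low >= n and right_high >= n and "GT" in nearby:
            donors.append(p)
        # AT-rich on the left, AT-poor on the right suggests an acceptor.
        if left_high >= n and right_low >= n and "AG" in nearby:
            acceptors.append(p)
    return donors, acceptors


# Toy input: a GC-only "exon", then GT, then an AT-only "intron".
toy = "GC" * 10 + "GT" + "AT" * 10
donors, acceptors = candidate_sites(toy, w=4, theta_low=1, theta_high=3,
                                    m=3, n=3, tol=2)
print(20 in donors, acceptors)  # True []
```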
39
Test Data<br />
>IIADP1D6358 Antiparallèle 811 bases<br />
AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT<br />
CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT<br />
AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC<br />
TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT<br />
GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA<br />
TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt<br />
atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT<br />
ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT<br />
gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat<br />
tatttgattaaaaatagaaggtttttttttttattttttttttttatttt<br />
tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat<br />
taattttaattttttttttttttttttttttttttttttttttttttttt<br />
ttcatttttaacatcatttgattcattaatttattttttttttcaacatc<br />
cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA<br />
TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG<br />
AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT<br />
CGACCGAAGGC<br />
• Probable Correct Introns:<br />
+267 -341<br />
+401 -687<br />
40
Parameter Space to Search<br />
• Ranges<br />
– W -- 3 → 10 (8 values)<br />
– θ_high -- 0.7 x W → W (≈4 values)<br />
– θ_low -- 0.5 x W → 0.9 x W (≈4 values)<br />
– m -- 3 → 11 (9 values)<br />
– n -- m/2 → m (≈4 values)<br />
– tol -- 3 → 7 (5 values)<br />
• 3584 x 5 ≈ 18,000 sets of parameters<br />
• Search for sets that find all expected sites<br />
with a minimum of false positives.<br />
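Enumerating such a parameter grid is straightforward. The sketch below is an assumption-laden illustration: the generator name `parameter_sets` and the dictionary keys are hypothetical, and the rounding of the fractional bounds (ceiling for lower bounds, floor for upper bounds) is a choice the slide does not state, so the total will not necessarily match the slide's estimate of about 18,000.

```python
from itertools import product


def parameter_sets():
    """Yield every parameter combination in the ranges listed on the slide."""
    for w in range(3, 11):                          # W: 3 to 10
        highs = range(-(-7 * w // 10), w + 1)       # ceil(0.7 * W) to W
        lows = range((w + 1) // 2, 9 * w // 10 + 1)  # ceil(0.5 * W) to floor(0.9 * W)
        for m in range(3, 12):                      # m: 3 to 11
            for n in range((m + 1) // 2, m + 1):    # n: ceil(m / 2) to m
                for th, tl, tol in product(highs, lows, range(3, 8)):
                    yield {"W": w, "theta_high": th, "theta_low": tl,
                           "m": m, "n": n, "tol": tol}


sets = list(parameter_sets())
print(len(sets), sets[0])
```

Each yielded set would then be run against the test sequence, keeping the sets that recover all known sites with the fewest extra predictions.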
41
Test Data<br />
idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687<br />
idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687<br />
. . . . . About 18,000 more lines like this . . .<br />
42
Test Data Raw Results<br />
len=811 W=3 n=1 m=3 thrL=1 thrH=2, Tol=4, Sites Found=18<br />
Intron: 1 + 91 + 213 - 213<br />
Intron: 2 + 236 - 241<br />
Intron: 3 + 267 - 267<br />
Intron: 4 + 385 + 399 - 399 - 467<br />
Intron: 5 + 471 + 759 - 759 - 797<br />
Intron: 6 + 799 - 799<br />
len=811 W=3 n=1 m=3 thrL=2 thrH=2, Tol=4, Sites Found=29<br />
Intron: 1 + 91 + 213 - 213<br />
Intron: 2 + 219 - 223<br />
Intron: 3 + 236 - 241<br />
Intron: 4 + 267 - 267<br />
Intron: 5 + 305 - 312 - 335<br />
Intron: 6 + 341 - 341<br />
Intron: 7 + 385 + 399 - 399<br />
Intron: 8 + 429 - 433<br />
Intron: 9 + 441 - 467<br />
Intron: 10 + 471 - 753<br />
Intron: 11 + 759 - 759 - 797<br />
Intron: 12 + 799 - 799<br />
len=811 W=3 n=1 m=3 thrL=1 thrH=3, Tol=4, Sites Found=13<br />
Intron: 1 + 91 + 213 - 213 - 241<br />
Intron: 2 + 267 + 399 - 399 - 467<br />
Intron: 3 + 471 + 759 - 759 - 786<br />
. . . About 18,000 sets of results like this. . .<br />
43
Test Data Filtered Results<br />
len=811 W=6 n=5 m=9 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND<br />
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND<br />
This provides an initial set of parameters that are likely to be optimal<br />
44
Best Parameter Set on Known Gene<br />
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11<br />
1: AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT<br />
51: CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT<br />
101: AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC<br />
151: TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT<br />
201: GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA<br />
251: TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt<br />
301: atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT<br />
351: ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt<br />
401: gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT<br />
451: TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt<br />
501: tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat<br />
551: taattttaat tttttttttt tttttttttt tttttttttt tttttttttt<br />
601: ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc<br />
651: cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA<br />
701: TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG<br />
751: AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT<br />
801: CGACCGAAGG C<br />
Intron 1: + 213 - 241 overpredicted (45 bases)<br />
Intron 2: + 267 - 341 underpredicted (37 bases)<br />
Intron 3: + 385 + 399 - 433 correct + (325 bases)<br />
Intron 4: + 471 - 687 correct - (404 bases)<br />
45
Repetitive and<br />
Low Complexity Analysis<br />
• The problem:<br />
DNA is not random<br />
• 3 properties of interest:<br />
– Complexity: Compositional bias<br />
– Pattern: Interspersed k-grams<br />
– Periodicity: repetition of residues or k-grams<br />
• We will focus on complexity (low)<br />
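Compositional bias can be quantified in several ways. The slides do not name a measure, so the sketch below uses Shannon entropy of base composition as one common stand-in; the function names and the default window size and cutoff are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Entropy of base composition in bits per base (at most 2.0 for A, C, G, T)."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, w=20, max_bits=1.0):
    """Start positions of length-w windows with entropy at or below max_bits."""
    return [i for i in range(len(seq) - w + 1)
            if shannon_entropy(seq[i:i + w]) <= max_bits]


print(shannon_entropy("ACGT"))                            # 2.0
print(low_complexity_windows("T" * 20 + "ACGT" * 5)[:3])  # [0, 1, 2]
```

A run of a single base scores 0 bits and an even mix of four bases scores 2 bits, so long poly-T stretches like those in the test sequence fall well under the cutoff.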
46
Fin<br />
47