11.01.2015 Views

Computational Genomics - ExPASy Home page

<strong>Computational</strong> <strong>Genomics</strong><br />

Course Faculty<br />

Thomas L. Casavant, Ph.D; Professor<br />

Tom B. Bair, Ph.D; Post-Doctoral Fellow<br />

Todd E. Scheetz, Ph.D; Assistant Professor<br />

Terry A. Braun, Ph.D; Assistant Professor<br />

1


Outline for Today/Next Week<br />

• Administrative matters<br />

• Some Definitions and Scope<br />

• BCB Sampler<br />

• Sequence Alignment/Analysis<br />

2


Fundamental Observation<br />

• The study of the Life Sciences is changing<br />

– The so-called post-genome era is here.<br />

– Hundreds of databases of DNA sequence and mRNA<br />

transcripts exist.<br />

– Functional knowledge is expanding constantly.<br />

– Most biological researchers are not prepared for this.<br />

– People with computational training have not been prepared to<br />

work in this application area.<br />

3


Hub & Spoke Model<br />

In the Hub<br />

Applied Math<br />

and Statistics<br />

ECE<br />

CS/Math<br />

Mgmt Inf. Sys.<br />

Information<br />

Science<br />

Allied Disciplines<br />

Physics<br />

4


Hub and Spoke Model<br />

Ophthalmology<br />

Cancer Center<br />

Center for<br />

Macular Degen.<br />

Gene Therapy<br />

Hub<br />

Pediatrics<br />

Genetics<br />

Pulmonary<br />

Medicine<br />

BioChemistry<br />

Biology<br />

Allied Disciplines<br />

Immunology<br />

5


Benefits of Such a Model<br />

• In the Hub: <strong>Computational</strong> Scientists.<br />

– Interacting with each other to share:<br />

• Knowledge and methods, computing infrastructure,<br />

organizational support, collaborative connections, curricular<br />

efforts, human resources (staff, post-docs, etc).<br />

• At the Ends of the Spokes: Disciplinary Scientists<br />

and Interdisciplinary Collaborators<br />

– <strong>Computational</strong> Scientists form the links with<br />

collaborators who provide:<br />

• Access to real problems, and x-training in allied disciplines<br />

6


Consequence of Model<br />

for this Course<br />

• Who are you?<br />

– A lot of computer background/some biology<br />

– A lot of biology background/some computing<br />

• We will bring both groups along.<br />

– Some material will inevitably be review for some<br />

– Some material may seem over your head.<br />

• As in the hub and spoke model, we will try to work<br />

through that material “together”<br />

• A “different” kind of course, but not too different…<br />

7


Course Approach<br />

• Three Levels<br />

– Lecture/Concepts<br />

– Demonstration<br />

– Hands-on experience<br />

• Areas of Focus:<br />

– Concepts and Algorithms<br />

– Review Basic Computer Knowledge (learn a bit<br />

about what there is to learn)<br />

– Review Basic Molecular Biology/Genetics<br />

– How bioinformatics tools are designed<br />

– How to use/design non web-based software<br />

– UNIX and Perl<br />

8


Main Foci of Course<br />

1. Overview of Bioinformatics and <strong>Computational</strong><br />

Biology. Models, Algorithms, and Available Tools.<br />

(Dr. Casavant)<br />

2. Review UNIX/Script writing in Perl, installing and<br />

running free software, files, etc. (Drs. Scheetz and<br />

Bair)<br />

3. A quick review of Molecular Biology,<br />

Biochemistry, and Genetics (Drs. Scheetz and<br />

Bair)<br />

4. Brisk tour of contemporary problem areas in<br />

Bioinformatics and <strong>Computational</strong> Biology (Drs.<br />

Casavant, Scheetz, Braun, and Bair).<br />

9


Main BCB Subjects<br />

• Sequence Analysis<br />

• Genome Analysis<br />

• Expression (Gene and Protein)<br />

• Pathways/Systems Biology<br />

• Mapping<br />

• <strong>Computational</strong> Methods<br />

• Case Studies<br />

SEE COURSE WEB PAGE<br />

10


Intended Take-Home Lessons<br />

• The computer is a tool, but<br />

• A fairly blunt tool.<br />

• Need to be able to flex this tool<br />

• Too many disciplines for a single person<br />

– Computation<br />

• Engineering<br />

• Science<br />

– Mathematics/Probability/Statistics<br />

– Biology (Cellular, Molecular, etc)<br />

– Genetics (Human, evolution, molecular, etc)<br />

– Physics (bio, molecular, etc)<br />

– Chemistry<br />

• Must learn to work in teams<br />

11


BioInformatics and<br />

<strong>Computational</strong> Biology<br />

• Definition:<br />

The modern study of biology, and its<br />

applications to medicine, agriculture, and other<br />

areas – centered on the use of substantial<br />

computing resources – hardware, software, and<br />

human.<br />

12


How do Bioinformatics and<br />

<strong>Computational</strong> Biology differ?<br />

• Bioinformatics: Driven by large datasets which<br />

must be gathered, curated, stored, organized,<br />

searched, and archived.<br />

BioInformatics ⇒ Data Driven<br />

• <strong>Computational</strong> Biology: Development of<br />

algorithms and computational procedures to<br />

model, simulate, and analyze biological systems.<br />

<strong>Computational</strong> Biology ⇒ Model Driven<br />

13


BCB Training<br />

Three Critical Components<br />

1. Continuously Learn Biological/<strong>Computational</strong><br />

Science (X-training)<br />

2. Develop computer systems (hardware and software)<br />

to “process” raw data and store it in a form that can<br />

be used for later analysis (Data Pipelines)<br />

3. Work with other biological/genetic/medical<br />

researchers to answer “the questions” by developing<br />

algorithms and computational tools to analyze the<br />

archived, as well as new data. (Algorithms and<br />

Tools)<br />

14


BCB<br />

X-Training<br />

Learning Biological/<strong>Computational</strong> Science<br />

• Begin with a recognized strength in either<br />

Biological or <strong>Computational</strong> Science<br />

• Then work to continuously strengthen the other<br />

area --- IF ---<br />

– <strong>Computational</strong> Background:<br />

• Must study biology, biochemistry, genetics<br />

• Must develop a working knowledge of laboratory methods<br />

– Biological Background:<br />

• Must study algorithms, computer systems, programming<br />

• Must either develop competence in using computers at the<br />

“UNIX” level, or<br />

• Must be able to span the gap between biologists and those who<br />

can work with computers at that level<br />

15


BCB<br />

Data Pipeline Construction<br />

• Develop computer systems (hardware and software) to<br />

“process” raw data and store it in a form that can be used<br />

for later analysis - a “pipeline” of efficient, accurate, and<br />

high-performance processing steps.<br />

• Examples<br />

– Data acquisition hardware/software<br />

– Format conversion<br />

– Quality screening<br />

– Correlation functions<br />

– Annotation<br />

– Database deposition<br />

– Distribution of data<br />

16


MGC FL cDNA Sequencing Pipe<br />

17


BCB<br />

Algorithms and Tools<br />

• Formulating and answering “the questions” in<br />

cooperation with other biologists.<br />

– Understanding the problem<br />

– Formulating the question as it relates to informatics<br />

– Prototyping the analysis process<br />

– Iterating on the form and content of the analysis<br />

– Applying the tools to more general cases<br />

– Dissemination of tools and documenting their use<br />

– Evaluation of effectiveness<br />

– Continuous evolution of tools and their functions<br />

18


Java Cluster Viewer<br />

19


Hierarchical Clustering<br />

20


Science ⇔ Informatics (BCB)<br />

Relationship<br />

Interpreting the information<br />

(Science/Genetics)<br />

Dealing with the data<br />

(Bioinformatics)<br />

Automating analyses<br />

(<strong>Computational</strong> Biology)<br />

21


Introductory BCB Examples<br />

• Bioinformatics:<br />

– Sequence alignment and database search<br />

– Gene discovery pipeline<br />

– EST Clustering<br />

• <strong>Computational</strong> Biology<br />

– Gene Prediction<br />

– Analysis of Low Complexity<br />

22


Sequence Alignment and Database Search<br />

(BioInformatics)<br />

• Alignment-based<br />

– Smith/Waterman<br />

– Dynamic Programming<br />

• Markov-model based<br />

• Large Database issues<br />
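As a rough illustration of the Smith/Waterman entry above, the sketch below computes only the best local alignment score. The scoring values (+2 match, -1 mismatch, -1 gap) and the function name are assumptions made for the example; they are not taken from the slides.

```python
def local_alignment_score(s1, s2, match=2, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman recurrence, linear gap cost)."""
    previous = [0] * (len(s2) + 1)   # DP row for the previous character of s1
    best = 0
    for a in s1:
        current = [0]                # first column is always 0 for local alignment
        for j, b in enumerate(s2, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(local_alignment_score("TTTACGTTT", "GGACGGG"))
```

Only two rows of the table are kept in memory, which is enough when the score alone is needed; recovering the aligned region itself would require keeping the full table or traceback pointers.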

23


Sequence Alignment<br />

• Nucleotide vs. amino acids<br />

• Global vs. Local<br />

• Pair-wise vs. multiple<br />

• Simplest case:<br />

– Global, Pair-wise<br />

– Must match at both ends<br />

24


Sequence Alignment Example<br />

• Example:<br />

S1: TTACTTGCC (9 bases)<br />

S2: ATGACGAC (8 bases)<br />

• Scoring (1 possibility):<br />

+2 match<br />

0 mismatch<br />

-1 gap in either sequence<br />

• One Possible alignment:<br />

 T  T  -  A  C  T  T  G  C  C<br />

 A  T  G  A  C  -  -  G  A  C<br />

 0  2 -1  2  2 -1 -1  2  0  2   Score = 10 – 3 = 7<br />
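The column-by-column scoring above can be checked with a few lines of Python. This is a minimal sketch; the function name and the use of '-' for a gap are choices made for the example.

```python
def score_alignment(row1, row2, match=2, mismatch=0, gap=-1):
    """Sum the column scores of two gapped rows of equal length ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # gap in either sequence
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# The alignment shown on the slide: five matches (+10) and three gaps (-3).
print(score_alignment("TT-ACTTGCC", "ATGAC--GAC"))  # 7
```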

25


Cue to a Data Structure<br />

Gap in S2<br />

Gap in S1<br />

Alignment<br />

(match/mismatch)<br />
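The three moves named above (gap in S2, gap in S1, match/mismatch) are the three ways to reach a cell in a dynamic-programming table. The sketch below fills such a table for a global, pair-wise alignment using the example scoring (+2 match, 0 mismatch, -1 gap). It is an illustrative implementation and not necessarily the one used in the course; the function name is an assumption.

```python
def global_alignment_score(s1, s2, match=2, mismatch=0, gap=-1):
    """Best global alignment score; table[i][j] covers s1[:i] against s2[:j]."""
    rows, cols = len(s1) + 1, len(s2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap        # s1 prefix aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap        # s2 prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            gap_in_s2 = table[i - 1][j] + gap   # s1[i-1] is paired with a gap
            gap_in_s1 = table[i][j - 1] + gap   # s2[j-1] is paired with a gap
            table[i][j] = max(diagonal, gap_in_s2, gap_in_s1)
    return table[-1][-1]


print(global_alignment_score("TTACTTGCC", "ATGACGAC"))
```

Each cell is computed once from three neighbours, so the work grows with the product of the two sequence lengths instead of with the number of possible paths.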

26


How hard can this be?<br />

• Brute force approach: consider all possible alignments, and<br />

choose the one with best score<br />

• 3 choices at each internal branch point<br />

• Assume an n x n comparison: roughly 3<sup>n</sup> paths to compare<br />

– n = 3 ⇒ 3<sup>3</sup> = 27 paths<br />

– n = 20 ⇒ 3<sup>20</sup> ≈ 3.5 x 10<sup>9</sup> paths<br />

– n = 200 ⇒ 3<sup>200</sup> ≈ 2.7 x 10<sup>95</sup> paths<br />

• If 1 path takes 1 nanosecond (10<sup>-9</sup> secs)<br />

– n = 200 takes about 8.4 x 10<sup>78</sup> years!<br />

• But, using data structures cleverly, this can be greatly sped up<br />

to O(n<sup>2</sup>) ⇒ 0.04 msec (n=200), but<br />

for a 400 base query, 3 Gb database (HG) ⇒ 20 minutes,<br />

for a 400 base query, 38 Gb database (GenBank) ⇒ 12.6 hrs!<br />
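The orders of magnitude on this slide can be reproduced with a short calculation. The sketch assumes one nanosecond per path (or per table cell) and a 365.25-day year; the function names are made up for the example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
NANOSECOND = 1e-9


def brute_force_paths(n):
    """Number of paths when each step offers 3 choices."""
    return 3 ** n


def brute_force_years(n):
    """Years needed to enumerate every path at 1 ns per path."""
    return brute_force_paths(n) * NANOSECOND / SECONDS_PER_YEAR


def table_seconds(query_len, target_len):
    """Seconds to fill a query_len x target_len DP table at 1 ns per cell."""
    return query_len * target_len * NANOSECOND


print(brute_force_paths(3))                      # 27
print(f"{brute_force_years(200):.1e} years")     # brute force, n = 200
print(table_seconds(200, 200) * 1e3, "msec")     # quadratic method, n = 200
print(table_seconds(400, 3e9) / 60, "minutes")   # 400-base query vs. a 3 Gb genome
```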

27


EST Gene Discovery Pipeline<br />

(BioInformatics)<br />

28


EST Sequence Clustering<br />

(BioInformatics)<br />

• Goal: Group together expressed<br />

sequence tags (ESTs) and full length<br />

cDNA data into gene-based indices<br />

– Sequences considered linked if similarity<br />

score exceeds some threshold<br />
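A minimal sketch of threshold-based linking follows. Two sequences are linked when their similarity exceeds a threshold, and linked sequences end up in the same cluster (single linkage, tracked with a union-find structure). The k-mer Jaccard similarity is only a stand-in assumption for a real alignment-based score, and all names and default values are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=4):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=4):
    """Jaccard similarity of k-mer sets (placeholder for an alignment score)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def cluster(seqs, threshold=0.5, k=4):
    """Group sequence indices; any pair scoring above threshold is linked."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps lookups short
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j], k) > threshold:
            parent[find(i)] = find(j)       # merge the two clusters

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


print(cluster(["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"]))
```

Comparing every pair is quadratic in the number of sequences, so this form only suits small inputs.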

29


Gene Prediction<br />

(<strong>Computational</strong> Biology)<br />

• Contexts:<br />

– Identifying full length transcripts<br />

– Finding genes in genomic sequence<br />

• Approaches<br />

• Deployment Issues<br />

30


Genome Architecture in a Nutshell<br />

[Figure: gene structure diagram. Labeled elements: Promoter, Transcription Start, 5’ UTR, Exon, Intron, 3’ UTR]<br />

Start codon, codons, donor site: GCCGCCGCCATGCCCTTCTCCAACAGGTGAGTGAG<br />

Acceptor site: CTCCCAGCCCTGCC<br />

Stop codon: ATCCCCATGCCTGAGGGCCCCT<br />

Poly-A site: GCAGAAACAATAAAACCA<br />

31


Preview of an HMM Model for<br />

Gene Prediction<br />

32


The Crux of Gene Prediction<br />

[Figure: gene model diagram. Labels: 5’ UTR; Kozak consensus; ATG; 1st Exon; stops in all 3 frames; no in-frame stops; GT; AG; Exon; Intron; Exon]<br />

33


Gene Prediction Approaches<br />

• Ab initio methods:<br />

– Hidden Markov Models (GENSCAN, HMMgene)<br />

– Neural Networks (GRAIL, Genie)<br />

– Decision Trees (MORGAN)<br />

• Issues:<br />

– Seeding from training sets<br />

– Fully general approaches<br />

• Interesting question:<br />

– Can gene finding be done in a species-independent way?<br />

34


Simple Dicty Gene Finder<br />

(Intuition and an Example)<br />

• Basic Idea (G. Klein) based on GC/AT content of Introns<br />

vs. Exons<br />

• Idealized Example: Count G/Cs and A/Ts in a window<br />

size of 10 bases.<br />

[Figure: AT content and GC content tracks along an idealized sequence, with the Donor Site and Acceptor Site marked]<br />

…….CGCGGGCGCCGTATTTATATATTATA…..AATATTTTATATAGCCCGGCGCGGCCG…...<br />

Window counts shown on the AT content and GC content tracks: 6 10 10 10; 10 10 6 2<br />

Point where GC.left and AT.right are both maximized<br />
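The window-counting idea can be sketched as follows. For a position, the code counts G/C bases in the window to its left and A/T bases in the window to its right, then looks for the position where the two counts are jointly largest. Using the sum of the two counts as the joint criterion is an assumption made for the example, as are the function names.

```python
def left_right_content(seq, pos, w=10):
    """GC count in the w bases left of pos and AT count in the w bases right of it."""
    left = seq[max(0, pos - w):pos].upper()
    right = seq[pos:pos + w].upper()
    gc_left = sum(base in "GC" for base in left)
    at_right = sum(base in "AT" for base in right)
    return gc_left, at_right


def best_boundary(seq, w=10):
    """Position with the largest GC.left + AT.right (needs len(seq) >= 2 * w)."""
    candidates = range(w, len(seq) - w + 1)   # only positions with two full windows
    return max(candidates, key=lambda pos: sum(left_right_content(seq, pos, w)))


fragment = "CGCGGGCGCCGTATTTATATATTATA"   # first fragment of the idealized example
boundary = best_boundary(fragment)
print(boundary, left_right_content(fragment, boundary))
```

On this fragment the best position falls between the GC-rich run and the AT-rich run, where both windows are pure.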

35


Dicty Gene Finding Tool Model<br />

• Model Parameters:<br />

– W -- window size<br />

– θ<sub>low</sub> -- threshold below which GC or AT content<br />

does not match hypothesis<br />

– θ<sub>high</sub> -- threshold above which GC or AT content<br />

matches hypothesis<br />

– m -- number of consecutive windows that will be examined<br />

– n -- number of windows out of m that must exceed θ to qualify<br />

for an intron/exon or exon/intron transition<br />

– tol -- maximum distance from the GC/AT content transition<br />

at which the GT or AG motif must be found<br />

36


Dicty Gene Finding Tool Model<br />

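W = 8, m = 4, θ<sub>high</sub> = 7, θ<sub>low</sub> = 6<br />

[Figure: windows numbered 1 to 6 laid over the sequence below; one window is annotated G/C=7, and two groupings of windows are annotated n = 3 and n = 4]<br />

. . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . .<br />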

37


Dicty Intron/Gene<br />

Prediction Algorithm<br />

1. Calculate AT (GC) content in windows of size W<br />

to the right and left of each base position.<br />

2. Among the m windows to the left and to the right of each base position,<br />

count the windows with AT count ≥ θ<sub>high</sub> and the windows with AT count ≤ θ<sub>low</sub>.<br />

This gives four tallies: ATleft<sub>high</sub>, ATleft<sub>low</sub>, ATright<sub>high</sub>, ATright<sub>low</sub>.<br />

3. For each position, if:<br />

ATleft<sub>high</sub> ≥ n && ATright<sub>low</sub> ≥ n<br />

⇒ potential acceptor site<br />

ATleft<sub>low</sub> ≥ n && ATright<sub>high</sub> ≥ n<br />

⇒ potential donor site<br />
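Steps 1 to 3 might be sketched as below. The slides do not say how the m windows are laid out, so this version assumes they slide one base at a time away from the position; that layout, the function names, and the toy thresholds are all assumptions.

```python
def at_count(window):
    """Number of A/T bases in a window (case-insensitive)."""
    return sum(base in "ATat" for base in window)


def window_tallies(seq, pos, w, m, thr_low, thr_high, side):
    """Among m windows of size w on one side of pos, count AT-rich and AT-poor ones."""
    high = low = 0
    for shift in range(m):
        start = pos - w - shift if side == "left" else pos + shift
        if start < 0 or start + w > len(seq):
            continue                      # window would run off the sequence
        at = at_count(seq[start:start + w])
        high += at >= thr_high
        low += at <= thr_low
    return high, low


def candidate_sites(seq, w, m, n, thr_low, thr_high):
    """Return (donors, acceptors): positions passing the step-3 tests."""
    donors, acceptors = [], []
    for pos in range(len(seq) + 1):
        left_high, left_low = window_tallies(seq, pos, w, m, thr_low, thr_high, "left")
        right_high, right_low = window_tallies(seq, pos, w, m, thr_low, thr_high, "right")
        if left_high >= n and right_low >= n:
            acceptors.append(pos)         # AT-rich on the left, AT-poor on the right
        if left_low >= n and right_high >= n:
            donors.append(pos)            # AT-poor on the left, AT-rich on the right
    return donors, acceptors


toy = "GCGCGCGC" + "ATATATAT" + "GCGCGCGC"   # exon-like, intron-like, exon-like
print(candidate_sites(toy, w=4, m=2, n=2, thr_low=1, thr_high=3))
```

Several neighbouring positions usually pass the test around each transition; the motif check in the next step narrows them down.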

38


Dicty Intron/Gene<br />

Prediction Algorithm<br />

(continued)<br />

4. For each potential donor or acceptor site:<br />

If the GT (donor) or AG (acceptor) motif is found<br />

within tol bases, note this as an intron<br />

boundary.<br />

5. Sort boundaries into candidate introns.<br />
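The motif check in step 4 can be illustrated with a small helper that keeps only candidate positions with the expected dinucleotide starting within tol bases. The function name and the choice to report the leftmost hit in the search range (not the nearest one) are assumptions for the example.

```python
def confirm_sites(seq, positions, motif, tol):
    """Keep candidates that have `motif` starting within tol bases of the position.

    Returns (candidate position, motif start) pairs; the motif start is the
    leftmost hit inside the search range.
    """
    upper = seq.upper()
    confirmed = []
    for pos in positions:
        lo = max(0, pos - tol)
        hi = min(len(upper), pos + tol + len(motif))
        hit = upper.find(motif, lo, hi)   # the whole motif must lie inside [lo, hi)
        if hit != -1:
            confirmed.append((pos, hit))
    return confirmed


print(confirm_sites("CCCCGTAAAA", [4, 9], "GT", tol=2))
```

Calling it once with "GT" on donor candidates and once with "AG" on acceptor candidates yields the boundaries that step 5 sorts into candidate introns.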

39


Test Data<br />

>IIADP1D6358 Antiparallèle 811 bases<br />

AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT<br />

CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT<br />

AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC<br />

TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT<br />

GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA<br />

TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt<br />

atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT<br />

ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT<br />

gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat<br />

tatttgattaaaaatagaaggtttttttttttattttttttttttatttt<br />

tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat<br />

taattttaattttttttttttttttttttttttttttttttttttttttt<br />

ttcatttttaacatcatttgattcattaatttattttttttttcaacatc<br />

cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA<br />

TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG<br />

AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT<br />

CGACCGAAGGC<br />

• Probable Correct Introns:<br />

+267 -341<br />

+401 -687<br />

40


Parameter Space to Search<br />

• Ranges<br />

– W -- 3 → 10 (8 values)<br />

– θ high -- 0.7 x W → W (≈4 values)<br />

– θ low -- 0.5 x W → 0.9 x W (≈4 values)<br />

– m -- 3 → 11 (9 values)<br />

– n -- m/2 → m (≈4 values)<br />

– tol -- 3 → 7 (5 values)<br />

• 3584 x 5 ≈ 18,000 sets of parameters<br />

• Search for sets that find all expected sites<br />

with a minimum of false positives.<br />
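A parameter sweep over these ranges could be generated as below. The slides do not say how fractional bounds such as 0.7 x W are rounded, so the rounding here (lower bounds rounded up, upper bounds rounded down) is an assumption, and the number of sets produced need not equal the figure quoted above.

```python
def parameter_sets():
    """Yield (W, thr_low, thr_high, m, n, tol) tuples over the listed ranges."""
    for w in range(3, 11):                              # W: 3 .. 10
        high_start = (7 * w + 9) // 10                  # ceil(0.7 * W), integer math
        low_start = (w + 1) // 2                        # ceil(0.5 * W)
        low_stop = (9 * w) // 10                        # floor(0.9 * W)
        for thr_high in range(high_start, w + 1):
            for thr_low in range(low_start, low_stop + 1):
                for m in range(3, 12):                  # m: 3 .. 11
                    for n in range((m + 1) // 2, m + 1):    # n: ceil(m / 2) .. m
                        for tol in range(3, 8):         # tol: 3 .. 7
                            yield (w, thr_low, thr_high, m, n, tol)


all_sets = list(parameter_sets())
print(len(all_sets), "parameter sets; first:", all_sets[0])
```

Each tuple would then be passed to the site finder, and the resulting sets ranked by how many known sites they recover and how many extra sites they report.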

41


Test Data<br />

idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687<br />

idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687<br />

. . . . . About 18,000 more lines like this . . .<br />

42


Test Data Raw Results<br />

len=811 W=3 n=1 m=3 thrL=1 thrH=2, Tol=4, Sites Found=18<br />

Intron: 1 + 91 + 213 - 213<br />

Intron: 2 + 236 - 241<br />

Intron: 3 + 267 - 267<br />

Intron: 4 + 385 + 399 - 399 - 467<br />

Intron: 5 + 471 + 759 - 759 - 797<br />

Intron: 6 + 799 - 799<br />

len=811 W=3 n=1 m=3 thrL=2 thrH=2, Tol=4, Sites Found=29<br />

Intron: 1 + 91 + 213 - 213<br />

Intron: 2 + 219 - 223<br />

Intron: 3 + 236 - 241<br />

Intron: 4 + 267 - 267<br />

Intron: 5 + 305 - 312 - 335<br />

Intron: 6 + 341 - 341<br />

Intron: 7 + 385 + 399 - 399<br />

Intron: 8 + 429 - 433<br />

Intron: 9 + 441 - 467<br />

Intron: 10 + 471 - 753<br />

Intron: 11 + 759 - 759 - 797<br />

Intron: 12 + 799 - 799<br />

len=811 W=3 n=1 m=3 thrL=1 thrH=3, Tol=4, Sites Found=13<br />

Intron: 1 + 91 + 213 - 213 - 241<br />

Intron: 2 + 267 + 399 - 399 - 467<br />

Intron: 3 + 471 + 759 - 759 - 786<br />

. . . About 18,000 sets of results like this. . .<br />

43


Test Data Filtered Results<br />

len=811 W=6 n=5 m=9 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND<br />

len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND<br />

This provides an initial set of parameters that are likely to be optimal

44


Best Parameter Set on Known Gene<br />

len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11<br />

1: AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT<br />

51: CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT<br />

101: AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC<br />

151: TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT<br />

201: GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA<br />

251: TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt<br />

301: atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT<br />

351: ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt<br />

401: gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT<br />

451: TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt<br />

501: tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat<br />

551: taattttaat tttttttttt tttttttttt tttttttttt tttttttttt<br />

601: ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc<br />

651: cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA<br />

701: TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG<br />

751: AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT<br />

801: CGACCGAAGG C<br />

Intron 1: + 213 - 241 overpredicted (45 bases)<br />

Intron 2: + 267 - 341 UNDERPREDICTED (37 BASES)<br />

Intron 3: + 385 + 399 - 433 correct + (325 bases)<br />

Intron 4: + 471 - 687 CORRECT - (404 BASES)<br />

45


Repetitive and<br />

Low Complexity Analysis<br />

• The problem:<br />

DNA is not random<br />

• 3 properties of interest:<br />

– Complexity: Compositional bias<br />

– Pattern: Interspersed k-grams<br />

– Periodicity: repetition of residues or k-grams<br />

• We will focus on complexity (low)<br />
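One common way to quantify compositional bias is the Shannon entropy of the base composition in a window: a window made of one repeated base scores 0 bits, while a window with all four bases equally represented scores 2 bits. The slides do not name a specific measure, so entropy, the window size, and the cutoff below are assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the base composition of a window (0 for an empty window)."""
    counts = Counter(window.upper())
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, w=12, max_bits=1.0):
    """Start positions of windows whose composition entropy is at most max_bits."""
    return [i for i in range(len(seq) - w + 1)
            if shannon_entropy(seq[i:i + w]) <= max_bits]


print(shannon_entropy("ACGT"), shannon_entropy("AATT"), shannon_entropy("AAAA"))
print(low_complexity_windows("ACGTACGTACGT" + "T" * 12, w=12))
```

This measure captures composition only; a window such as ATATATAT is highly periodic yet still scores 1 bit, so pattern and periodicity need separate treatment.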

46


Fin<br />

47
