Giri Narasimhan

CAP 5510: Introduction to Bioinformatics<strong>Giri</strong> <strong>Narasimhan</strong>ECS 254; Phone: x3748giri@cis.fiu.eduwww.cis.fiu.edu/~giri/teach/BioinfS08.html1/29/08 CAP5510 1

Genomic Databases Entrez Portal at National Center for Biotechnology Information (NCBI) givesaccess to:Nucleotide (GenBank, EMBL, DDBJ)Protein (PIR, SwissPROT, PRF, and Protein Data Bank or PDB)GenomeStructure3D DomainsConserved DomainsGene; UniGene; HomoloGene; SNPGEO Profiles & DatasetsCancer ChromosomesPubMed Central; Journals; BooksOMIMDatabase Neighbors and Interlinking1/29/08 CAP5510 2

Sequence Alignment – Why?>gi|12643549|sp|O18381|PAX6_DROME Paired box protein Pax-6 (Eyeless protein)MRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHRHSTSSYFATTYYHLTDDECHSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLRNLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPLNSSESGGASNSGEGSEQEAIYEKLRLLNTQHAAGPGPLEPARAAPLVGQSPNHLGTRSSHPQLVHGNHQALQQHQQQSWPPRHYSGSWYPTSLSEIPISSAPNIASVTAYASGPSLAHSLSPPNDIESLASIGHQRNCPVATEDIHLKKELDGHQSDETGSGEGENSNGGASNIGNTEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREEKLRNQRRTPNSTGASATSSSTSATASLTDSPNSLSACSSLLSGSAGGPSVSTINGLSSPSTLSTNVNAPTLGAGIDSSESPTPIPHIRPSCTSDNDNGRQSEDCRRVCSPCPLGVGGHQNTHHIQSNGHAQGHALVPAISPRLNFNSGSFGAMYSNMHHTALSMSDSYGAVTPIPSFNHSAVGPLAPPSPIPQQGDLTPSSLYPCHMTLRPPPMAPAHHHIVPGDGGRPAGVGLGSGQSANLGASCSGSGYEVLSAYALPPPPMASSSAADSSFSAASSASANVTPHHTIAQESCPSPCSSASHFGVAHSSGFSSDPISPAVSSYAHMSYNYASSANTMTPSSASGTSAHVAPGKQQFFASCFYSPWV>gi|6174889|PAX6_HUMAN Paired box protein (Oculorhombin) (Aniridia, type II protein)MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLRNLASEKQQMGADGMYDKLRMLNGQTGSWGTRPGWYPGTSVPGQPTQDGCQQQEGGGENTNSISSNGEDSDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKLRNQRRQASNTPSHIPISSSFSTSVYQPIPQPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPTSPSVNGRSYDTYTPPHMQTHMNSQPMGTSGTTSTGLISPGVSVPVQVPGSEPDMSQYWPRLQ1/29/08 CAP5510 3

Drosophila Eyeless vs. Human AniridiaQuery: 57 HSGVNQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETG 116HSGVNQLGGVFV GRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSbjct: 5 HSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETG 64Query: 117 SIRPRAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSIN 176SIRPRAIGGSKPRVAT EVVSKI+QYKRECPSIFAWEIRDRLL E VCTNDNIPSVSSINSbjct: 65 SIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSIN 124Query: 177 RVLRNLAAQKEQ 188RVLRNLA++K+QSbjct: 125 RVLRNLASEKQQ 136Query: 417 TEDDQARLILKRKLQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQV 476+++ Q RL LKRKLQRNRTSFT +QI++LEKEFERTHYPDVFARERLA KI LPEARIQVSbjct: 197 SDEAQMRLQLKRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQV 256Query: 477 WFSNRRAKWRREEKLRNQRR 496WFSNRRAKWRREEKLRNQRRSbjct: 257 WFSNRRAKWRREEKLRNQRR 276E-Value = 2e-311/29/08 CAP5510 4

Implications of Sequence AlignmentMutation in DNA is a natural evolutionary process.Thus sequence similarity may indicate commonancestry.In biomolecular sequences (DNA, RNA, protein),high sequence similarity implies significantstructural and/or functional similarity.1/29/08 CAP5510 5

Discovery based on alignments Early 1970s: Simian sarcoma virus causes cancer in some species ofmonkeys. 1970s: infection by certain viruses cause some cells in culture (in vitro)to grow without bounds.Hypothesis: Certain genes (oncogenes) in viruses encode cellular growthfactors, which are proteins needed to stimulate growth of a cell colony. Thusuncontrolled quantities of growth factors produced by the infected cellscause cancer-like behavior. 1983:The oncogene from SSV called v-sis was isolated and sequenced.The partial amino-acid sequence for platelet-derived growth factor (PDGF)was sequenced and published. It stimulates the proliferation of normal cells.R.F. Doolittle was maintaining one of the earliest home-grown databases ofpublished amino-acid sequences.Sequence Alignment of v-sis and PDGF showed something surprising.1/29/08 CAP5510 6

PDGF and v-sis One region of 31 amino acids had 26 exact matches Another region of 39 residues had 35 exact matches. Conclusion:The previously harmless virus incorporates the normal growth-related gene(proto-oncogene) of its host into its genome.The gene gets mutated in the virus, or moves closer to a strong enhancer, ormoves away from a repressor.This causes an uncontrolled amount of the product (the growth factor, forexample) when the virus infects a cell. Several other oncogenes known to be similar to growth-regulatingproteins in normal cells.1/29/08 CAP5510 7

V-sis Oncogene - HomologiesScore ESequences producing significant alignments: (bits) Valuegi|332623|gb|J02396.1|SEG_SSVPCS2 Simian sarcoma virus v-si... 4591 0.0gi|61774|emb|V01201.1|RESSV1 Simian sarcoma virus proviral ... 4504 0.0gi|332622|gb|J02395.1|SEG_SSVPCS1 Simian sarcoma virus LTR ... 1283 0.0gi|885929|gb|U20589.1|GLU20589 Gibbon leukemia virus envelo... 1140 0.0gi|4505680|ref|NM_002608.1| Homo sapiens platelet-derived g... 954 0.0gi|20987438|gb|BC029822.1| Homo sapiens, platelet-derived g... 954 0.0gi|338210|gb|M12783.1|HUMSISPDG Human c-sis/platelet-derive... 954 0.01/29/08 CAP5510 8

Sequence Alignment>gi|4505680|ref|NM_002608.1| Homo sapiens platelet-derived growthfactor beta polypeptide (simian sarcoma viral (v-sis) oncogene homolog)(PDGFB), transcript variant 1, mRNA Length = 3373 Score = 954 bits(481), Expect = 0.0 Identities = 634/681 (93%), Gaps = 3/681 (0%)Strand = Plus / PlusQuery: 1015 agggggaccccattcctgaggagctctataagatgctgagtggccactcgattcgctcct 1074|||||||||||||||| |||||||| ||| |||||||||||| ||||||||| |||||||Sbjct: 1084 agggggaccccattcccgaggagctttatgagatgctgagtgaccactcgatccgctcct 1143> 21 E G D P I P E E L Y E M L S D H S I R SQuery: 1075 tcgatgacctccagcgcctgctgcagggagactccggaaaagaagatggggctgagctgg 1134| ||||| ||||| ||||||||||| |||||| ||||| | ||||||||||| ||| |||Sbjct: 1144 ttgatgatctccaacgcctgctgcacggagaccccggagaggaagatggggccgagttgg 1203> 61 D L N M T R S H S G G E L E S L A R G R1/29/08 CAP5510 9

Sequence AlignmentSequence 1 gi 332624 Simian sarcoma virus v-sis transformingprotein p28 gene, complete cds; and 3' LTR long terminal repeat,complete sequence. Length 2984 (1 .. 2984)Sequence 2 gi 4505680 Homo sapiens platelet-derived growth factorbeta polypeptide (simian sarcoma viral (v-sis) oncogene homolog)(PDGFB), transcript variant 1, mRNA Length 3373 (1 .. 3373)1/29/08 CAP5510 10

Similarity vs. HomologyHomologous sequences share commonancestry.Similar sequences are “near” to eachother by some appropriately definedmeasurable criteria.1/29/08 CAP5510 11

Types of Sequence Alignments - 1GlobalHIV Strain 1HIV Strain 2Global Alignment: similarity over entire lengthLocalLocal Alignment: no overall similarity, but somesegment(s) is/are similar1/29/08 CAP5510 12

Types of Sequence Alignments - 2Semi-GlobalSemi-global Alignment: end segments maynot be similarStrain 1MultipleStrain 2Strain 3Multiple Alignment: similarity between sets ofsequencesStrain 41/29/08 CAP5510 13

Global:Local:Sequence AlignmentNeedleman-Wunsch-Sellers (1970).Smith-Waterman (1981)Useful when commonality is small and global alignment ismeaningless. Often unaligned portions “mask” shortstretches of aligned portions. Example: comparing longstretches of anonymous DNA; aligning proteins thatshare only some motifs or domains.Dynamic Programming (DP) based.1/29/08 CAP5510 14

Why gaps?Example: Finding the gene site for a given(eukaryotic) cDNA requires “gaps”.What is cDNA? cDNA = Copy DNADNATranscriptioncDNATranslationmRNAReverseTranscriptionProtein1/29/08 CAP5510 15

How to score mismatches?1/29/08 CAP5510 16

BLAST & FASTAFASTA[Lipman, Pearson ’85, ‘88]Basic Local Alignment Search Tool[Altschul, Gish, Miller, Myers, Lipman ’90]1/29/08 CAP5510 17

BLAST OverviewProgram(s) to search all sequence databasesTremendous Speed/Less SensitiveStatistical Significance reportedWWWBLAST, QBLAST (send now, retrieve resultslater), Standalone BLAST, BLASTcl3 (Clientversion, TCP/IP connection to NCBI server),BLAST URLAPI (to access QBLAST, no local client)1/29/08 CAP5510 18

BLAST Strategy & ImprovementsLipman et al.: speeded up finding “runs”of “hot spots”.Eugene Myers ’94: “Sublinear algorithmfor approximate keyword matching”.Karlin, Altschul, Dembo ’90, ’91:“Statistical Significance of Matches”1/29/08 CAP5510 19

Why Gaps?Example: Aligning HIV sequences.Strain 1Strain 2Strain 3Strain 41/29/08 CAP5510 20

BLAST Variants Nucleotide BLASTStandard blastnMEGABLAST (Compare large sets, Near-exact searches)Short Sequences (higher E-value threshold, smaller word size, no low-complexityfiltering) Protein BLASTStandard blastpPSI-BLAST (Position Specific Iterated BLAST)PHI-BLAST (Pattern Hit Initiated BLAST; reg expr. Or Motif search)Short Sequences (higher E-value threshold, smaller word size, no low-complexityfiltering, PAM-30) Translating BLASTBlastx: Search nucleotide sequence in protein database (6 reading frames)Tblastn: Search protein sequence in nucleotide dBTblastx: Search nucleotide seq (6 frames) in nucleotide DB (6 frames)1/29/08 CAP5510 21

BLAST Contd RPS BLASTCompare protein sequence against Conserved Domain DB; Helps in predictingrough structure and function Pairwise BLASTblastp (2 Proteins), blastn (2 nucleotides), tblastn (protein-nucleotide w/ 6frames), blastx (nucleotide-protein), tblastx (nucleotide w/6 framesnucleotidew/ 6 frames) Specialized BLASTHuman & Other finished/unfinished genomesP. falciparum: Search ESTs, STSs, GSSs, HTGsVecScreen: screen for contamination while sequencingIgBLAST: Immunoglobin sequence database1/29/08 CAP5510 22

BLAST Credits Stephen Altschul Jonathan Epstein David Lipman Tom Madden Scott McGinnis Jim Ostell Alex Schaffer Sergei Shavirin Heidi Sofia Jinghui Zhang1/29/08 CAP5510 23

ProteinDatabases used by BLASTnr (everything), swissprot, pdb, alu, individualgenomesNucleotideMiscnr, dbest, dbsts, htgs (unfinished genomicsequences), gss, pdb, vector, mito, alu, epd1/29/08 CAP5510 24

Rules of Thumb Most sequences with significant similarity over their entirelengths are homologous. Matches that are > 50% identical in a 20-40 aa region occurfrequently by chance. Distantly related homologs may lack significant similarity.Homologous sequences may have few absolutely conservedresidues. A homologous to B & B to C A homologous to C. Low complexity regions, transmembrane regions and coiled-coilregions frequently display significant similarity withouthomology. Greater evolutionary distance implies that length of a localalignment required to achieve a statistically significant scorealso increases.1/29/08 CAP5510 25

Rules of ThumbResults of searches using different scoring systems may be compared directly using normalizedscores.If S is the (raw) score for a local alignment, the normalized score S' (in bits) is given byln( )S'= ln(2)The parameters depend on the scoring system.Statistically significant normalized score, NS'> log Ewhere E-value = E, and N = size of search space.1/29/08 CAP5510 26

Types of Sequence AlignmentsGlobalHIV Strain 1HIV Strain 2LocalSemi-GlobalStrain 1MultipleStrain 2Strain 3Strain 41/29/08 CAP5510 27

Global Alignment:An exampleV: G A A T T C A G T T AW: G G A T C G AGiven[I, J] = Score of MatchingRecurrence Relationthe I th character of sequence V & S[I, J] = MAXIMUM {the J th character of sequence WS[I-1, J-1] + (V[I], W[J]),ComputeS[I-1, J] + (V[I], ),S[I, J] = Score of MatchingS[I, J-1] + ( , W[J]) }First I characters of sequence V &First J characters of sequence W1/29/08 CAP5510 28

Global Alignment:An exampleV: G A A T T C A G T T AW: G G A T C G AS[I, J] = MAXIMUM {S[I-1, J-1] + (V[I], W[J]),S[I-1, J] + (V[I], ),S[I, J-1] + ( , W[J]) }1/29/08 CAP5510 29

TracebackV: G A A T T C A G T T A| | | | | |W: G G A – T C – G - - A1/29/08 CAP5510 30

Alternative TracebackV: G - A A T T C A G T T A| | | | | |W: G G - A – T C – G - - AV: G A A T T C A G T T A| | | | | |W: G G A – T C – G - - APrevious1/29/08 CAP5510 31

Improved TracebackGAATTCAGTTA000000000000G011 1 1 1 1 11 1 1 1G011111112 2 2 2A0112 2 2 222223T01 2233 3 3 3333C0122334 4 4 4 4 4G012233445 5 5 5A0123334555561/29/08 CAP5510 32

Improved TracebackGAATTCAGTTA000000000000G011 1 1 1 1 11 1 1 1G011111112 2 2 2A0112 2 2 222223T01 2233 3 3 3333C0122334 4 4 4 4 4G012233445 5 5 5A0123334555561/29/08 CAP5510 33

Improved TracebackGAATTCAGTTA000000000000G011 1 1 1 1 11 1 1 1G011111112 2 2 2A0112 2 2 222223T01 2233 3 3 3333C0122334 4 4 4 4 4G012233445 5 5 5A0 1 2 3 3 3 4 5 5 5 5 6V: G A - A T T C A G T T A| | | | | |W: G - G A – T C – G - - A1/29/08 CAP5510 34

Generalizations of Similarity FunctionMismatch Penalty = Spaces (Insertions/Deletions, InDels) = Affine Gap Penalties:(Gap open, Gap extension) = (,)Weighted Mismatch = (a,b) Weighted Matches = (a)1/29/08 CAP5510 36

Alternative Scoring SchemesGAATTCAGTTA0-2-3-4-5-6-7-8-9-10-11-12G-2 1 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10G-3-1 -1 -3 -4 -5 -6 -7 -5 -7 -8 -9A-4-2 0 0 -2 -3 -4 -5 -6 -7 -8 -7T-5-3 -2-2 1 -1 -2 -3 -4 -5 -6 -7C-6-4 -3-3-1 -1 0 -2 -3 -4 -5 -6G-7-5-4-4-2-3-2 -2 -1 -3 -4 -5A-8-6-5-5-3-4-3 -1-3 -3 -5 -3Match +1Mismatch –2Gap (-2, -1)V: G A A T T C A G T T A| | | | | |W: G G A T - C – G - - A1/29/08 CAP5510 37

Local Sequence Alignment Example: comparing long stretches of anonymous DNA; aligning proteinsthat share only some motifs or domains. Smith-Waterman Algorithm1/29/08 CAP5510 38

Recurrence Relations(Global vs Local Alignments) S[I, J] = MAXIMUM {S[I-1, J-1] + (V[I], W[J]),S[I-1, J] + (V[I], ),S[I, J-1] + ( , W[J]) }---------------------------------------------------------------------- S[I, J] = MAXIMUM { 0,S[I-1, J-1] + (V[I], W[J]),S[I-1, J] + (V[I], ),S[I, J-1] + ( , W[J]) }GlobalAlignmentLocalAlignment1/29/08 CAP5510 39

Local Alignment: ExampleGAATTCAGTTA000000000000G0 10000000000G0 1 000000 1000A00 2 1000 1000 1T00 0 1 2 1000 1 10C0000 0 0 200000G00000000 1000A00 1 1000 1000 1Match +1Mismatch –1Gap (-1, -1)V: - G A A T T C A G T T A| | | |W: G G - A T - C – G - - A1/29/08 CAP5510 40

Properties of Smith-Waterman Algorithm How to find all regions of “high similarity”?Find all entries above a threshold score and traceback. What if: Matches = 1 & Mismatches/spaces = 0?Longest Common Subsequence Problem What if: Matches = 1 & Mismatches/spaces = -?Longest Common Substring Problem What if the average entry is positive?Global Alignment1/29/08 CAP5510 41

How to score mismatches?1/29/08 CAP5510 42

BLOSUM n Substitution MatricesFor each amino acid pair a, bFor each BLOCKAlign all proteins in the BLOCKEliminate proteins that are more than n% identicalCount F(a), F(b), F(a,b)Compute Log-odds RatiologF(a,b)F(a)F(b)1/29/08 CAP5510 43

Giri Narasimhan

Create successful ePaper yourself

Delete template?

Save as template?