11.07.2015 Views

Databases and Comparisons - CMB Education - Karolinska Institutet


Introduction to Bioinformatics
Databases and Comparisons
Fredrik Lysholm


Bioinformatics
Medicine, Biology, Computer science, Mathematics, Statistics
Investigate functions of proteins and genes
  – Sequence comparisons of DNA and protein
  – Sequence patterns
  – Structural calculations
  – Molecular modelling
  – Predictions
Data management
  – Databases of sequences and sequence-related information
Linköping University & Karolinska Institutet


Genome comparisons


Genome projects
15 February 2001


Completed genomes
[Plot: genome size in Mb (log scale, 0.001 to 10000) against year, 1975 to 2005]
Year  Organism                  Size (Mb)  Genes
1977  bacteriophage ΦX174       0.0054     9
1995  Haemophilus influenzae    1.83       1 678
1996  Saccharomyces cerevisiae  13         6 284
2001  Homo sapiens              3 200      40 000 (?)


Completed genomes
[Bar chart: complete genomes in public databases (GOLD) per year, 1995 to 2001]
First bacterial in 1995
First eukaryotic in 1996
Human in 2001
Over 100 genomes (June 2003)
[Chart: genomes in each kingdom: Eukaryota, Archaea, Eubacteria]


Genome comparisons
[Figure labels: H. sapiens, cluster, D. melanogaster, C. elegans, A. thaliana]


Molecular modelling


Protein folding
[Figure labels: DNA, Protein, ?]


Structure predictions
Secondary structure prediction
  – Presently, at most 75% correct assignments of α, β and coil
Ab initio prediction
  – limited to ~23 amino acid residues
Homology modelling
  – requires template
  – quite good accuracy
Substrate docking
  – getting better and better


Molecular modelling
Homology modelling
  – an amino acid sequence is adapted to the structure of a homologous protein with known three-dimensional structure
  – used to get an idea about the three-dimensional structure when no experimentally determined one is available
Substrate docking
  – a substrate is modelled at the active site of a protein with known three-dimensional structure
  – used to
    • investigate possible interactions
    • screen for potential substrates


Docking of bile acids into class I γγ ADH
Distances in Å at active site
              48(Og)--Ho3   Zn--O3   C4(NAD)--H3
γγ-ADH
  isoUDCA        2.22        2.30       2.19
  UDCA           3.20        2.88       3.53
ββ-ADH
  isoUDCA        3.35        2.24       1.94
  UDCA           2.31        2.51       2.94
ADH = alcohol dehydrogenase
UDCA = ursodeoxycholic acid


Predictions of structure and function


Predictions of important regions
Comparisons of related proteins
  – gives a number of conserved positions
Known three-dimensional structure
  – in which the locations of these positions are marked


Multiple sequence alignment


Conserved amino acid residues


Databases


Databases with sequence information
Amino acid sequences
  – Uniprot/Swissprot
  – Uniprot/TrEMBL
  – NCBI NR
Nucleotide sequences
  – EMBL, Genbank, NCBI NT
Three-dimensional structures
  – PDB/Brookhaven


Swissprot
High level of annotation
  – description of protein function
  – domain structure
  – post-translational modifications
  – heterogeneity/variants
Minimal level of redundancy
Integration with other databases.
One of the best protein sequence databases due to the quality of the annotation.


Example of Swissprot entry
ID   2BHD_STREX     STANDARD;      PRT;   255 AA.
AC   P19992;
DT   01-FEB-1991 (REL. 17, CREATED)
DT   01-FEB-1991 (REL. 17, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   20-BETA-HYDROXYSTEROID DEHYDROGENASE (EC 1.1.1.53).
OS   STREPTOMYCES EXFOLIATUS (STREPTOMYCES HYDROGENANS).
OC   PROKARYOTA; FIRMICUTES; ACTINOMYCETALES; STREPTOMYCETACEAE.
RN   [1]
RP   SEQUENCE.
RX   MEDLINE; 90306362.
RA   MAREKOV L., KROOK M., JOERNVALL H.;
RL   FEBS LETT. 266:51-54(1990).
RN   [2]
RP   X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS).
RX   MEDLINE; 92052211.
RA   GHOSH D., WEEKS C.M., GROCHULSKI P., DUAX W.L., ERMAN M.,
RA   RIMSAY R.L., ORR J.C.;
RL   PROC. NATL. ACAD. SCI. U.S.A. 88:10064-10068(1991).


Example of Swissprot entry, cont.
CC   -!- CATALYTIC ACTIVITY: ANDROSTAN-3-ALPHA,17-BETA-DIOL + NAD(+) =
CC       17-BETA-HYDROXYANDROSTAN-3-ONE + NADH.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SIMILARITY: BELONGS TO THE SHORT-CHAIN DEHYDROGENASES/REDUCTASES
CC       FAMILY (SDR).
DR   PIR; S10707; S10707.
DR   PDB; 2HSD; 31-AUG-94.
DR   PDB; 1HDC; 07-FEB-95.
DR   PROSITE; PS00061; ADH_SHORT; 1.
KW   OXIDOREDUCTASE; NAD; STEROID METABOLISM; 3D-STRUCTURE.
FT   NP_BIND      10     34       NAD (BY SIMILARITY).
FT   ACT_SITE    152    152
SQ   SEQUENCE   255 AA;  26484 MW;  49F64254 CRC32;
     MNDLSGKTVI ITGGARGLGA EAARQAVAAG ARVVLADVLD EEGAATAREL GDAARYQHLD
     VTIEEDWQRV VAYAREEFGS VDGLVNNAGI STGMFLETES VERFRKVVDI NLTGVFIGMK
     TVIPAMKDAG GGSIVNISSA AGLMGLALTS SYGASKWGVR GLSKLAAVEL GTDRIRVNSV
     HPGMTYTPMT AETGIRQGEG NYPNTPMGRV GNEPGEIAGA VVKLLSDTSS YVTGAELAVD
     GGWTTGPTVK YVMGQ
//
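The two-letter line codes in the entry above (ID, AC, DR, SQ, ...) make the flat file easy to read programmatically. The following is a minimal sketch, not the official parser: the function name `parse_flatfile` is hypothetical, the excerpt is shortened from the entry shown, and it assumes one entry terminated by `//` with sequence lines following the SQ line.

```python
from collections import defaultdict


def parse_flatfile(text):
    """Group the lines of one flat-file entry by their two-letter line code.

    Returns (fields, sequence): fields maps a line code to the list of
    values seen for it; sequence is the residue string after the SQ line.
    """
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):      # entry terminator
            break
        if in_sequence:                # residue lines follow the SQ line
            sequence.append(line.replace(" ", ""))
            continue
        code, value = line[:2], line[2:].strip()
        if code == "SQ":
            in_sequence = True
        fields[code].append(value)
    return dict(fields), "".join(sequence)


# Shortened excerpt of the entry shown on the slides.
ENTRY = """\
ID   2BHD_STREX     STANDARD;      PRT;   255 AA.
AC   P19992;
DR   PDB; 2HSD; 31-AUG-94.
DR   PDB; 1HDC; 07-FEB-95.
SQ   SEQUENCE   255 AA;  26484 MW;  49F64254 CRC32;
     MNDLSGKTVI ITGGARGLGA
//
"""

fields, seq = parse_flatfile(ENTRY)
print(fields["AC"])   # accession line
print(fields["DR"])   # cross-references, here to PDB
print(seq)            # residues with the block spacing removed
```

Grouping by line code keeps repeated lines (DT, DR, RA) together, which is usually what a cross-reference lookup needs.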


TrEMBL
A computer-annotated supplement to Swissprot, containing all translations of EMBL entries not found in Swissprot.


EMBL
Sequences submitted directly by scientists and genome sequencing groups.
Sequences taken from literature and patents.
There is comparatively little error checking and there is a fair amount of redundancy.
Daily synchronising with GenBank (USA) and DDBJ (Japan).


Example of EMBL entry
ID   HS3B5H5E   standard; RNA; HUM; 1528 BP.
XX
AC   X55997;
XX
NI   g23861
XX
DT   17-APR-1991 (Rel. 28, Created)
DT   17-FEB-1997 (Rel. 50, Last updated, Version 7)
XX
DE   Human gene for 3-beta-5-hydroxy-5-ene steroid dehydrogenase
XX
KW   3 beta-hydroxy-5-ene steroid dehydrogenase; dehydrogenase; isomerase.
XX
OS   Homo sapiens (human)
OC   Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX


Example of EMBL entry, cont.
RN   [1]
RA   Sutcliffe R.G.;
RT   ;
RL   Submitted (25-OCT-1990) to the EMBL/GenBank/DDBJ databases.
RL   Sutcliffe R.G., Dept. of Genetics, Church Sreet Glasgow University,
RL   Glasgow G11, 5JS.
XX
RN   [2]
...
XX
DR   GDB; 120056; HSD3B1.
DR   SWISS-PROT; P14060; 3BH1_HUMAN.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1528
FT                   /organism="Homo sapiens"
FT                   /tissue_type="placenta"
FT                   /clone_lib="lambda gt11"
FT                   /clone="1/6,B/6"


Example of EMBL entry, cont.
FT   CDS             1..1122
FT                   /db_xref="PID:g23862"
FT                   /db_xref="SWISS-PROT:P14060"
FT                   /EC_number="1.1.1.145"
FT                   /product="3-beta-hydroxy-5-ene steroid dehydrogenase"
FT                   /translation="MTGWSCLVTGAGGFLGQRIIRLLVKEKELKEIRVLDKAFGPELREEFSKLQNKTKLTVLEG
FT                   DILDEPFLKRACQDVSVIIHTACIIDVFGVTHRESIMNVNVKGTQLLLEACVQASVPVFIY
FT                   TSSIEVAGPNSYKEIIQNGHEEEPLENTWPAPYPHSKKLAEKAVLAANGWNLKNGGTLYTC
FT                   ALRPMYIYGEGSRFLSASINEALNNNGILSSVGKFSTVNPVYVGNVAWAHILALRALQDPK
FT                   KAPSIRGQFYYISDDTPHQSYDNLNYTLSKEFGLRLDSRWSFPLSLMYWIGFLLEIVSFLL
FT                   RPIYTYRPPFNRHIVTLSNSVFTFSYKKAQRDLAYKPLYSWEEAKQKTVEWVGSLVDRHKE
FT                   TLKSKTQ"
FT   mRNA            1..1528
FT                   /evidence=EXPERIMENTAL
FT                   /note="3-beta-hydroxy-5-ene steroid dehydrogenase"
FT   polyA_signal    1511..1516


Example of EMBL entry, cont.
FT   allele          1100
FT                   /note="aa:ASN, see ref.[3]"
FT                   /replace="a"
FT   allele          1012
FT                   /note="see ref.[2]"
FT                   /replace="t"
XX
SQ   Sequence 1528 BP; 410 A; 374 C; 364 G; 380 T; 0 other;
     ATGACGGGCT GGAGCTGCCT TGTGACAGGA GCAGGAGGGT TTCTGGGACA GAGGATCATC        60
     CGCCTCTTGG TGAAGGAGAA GGAGCTGAAG GAGATCAGGG TCTTGGACAA GGCCTTCGGA       120
     CCAGAATTGA GAGAGGAATT TTCTAAACTC CAGAACAAGA CCAAGCTGAC AGTGCTGGAA       180
     ...
     TGAACAATTT AGGGACTCTT TTAACTTGAG GGTCGTTTTG ACTACTAGAG CTCCATTTCT      1440
     ACTCTTAAAT GAGAAAGGAT TTCCTTTCTT TTTAATCTTC CATTCCTTCA CATAGTTTGA      1500
     TAAAAAGATC AATAAATGTT TGAATGTT                                         1528
//


Example of PDB entry
HEADER OXIDOREDUCTASE 28-MAR-94 2HSD                                  2HSD 2
COMPND 3 ALPHA, 20 BETA-HYDROXYSTEROID DEHYDROGENASE (HOLO FORM)      2HSD 3
COMPND 2 (E.C.1.1.1.53)                                               2HSD 4
SOURCE (STREPTOMYCES HYDROGENANS)                                     2HSD 5
AUTHOR D.GHOSH,W.L.DUAX                                               2HSD 6
REVDAT 1 31-AUG-94 2HSD 0                                             2HSD 7
SPRSDE 31-AUG-94 2HSD 1HSD                                            2HSD 8
JRNL AUTH D.GHOSH,Z.WAWRZAK,C.M.WEEKS,W.L.DUAX,M.ERMAN                2HSD 9
JRNL TITL THE REFINED THREE-DIMENSIONAL STRUCTURE OF                  2HSD 10
JRNL TITL 2 3ALPHA,20BETA-HYDROXYSTEROID DEHYDROGENASE AND            2HSD 11
JRNL TITL 3 POSSIBLE ROLES OF THE RESIDUES CONSERVED IN               2HSD 12
JRNL TITL 4 SHORT-CHAIN DEHYDROGENASES                                2HSD 13
JRNL REF STRUCTURE V. 2 629 1994                                      2HSD 14
JRNL REFN ASTM UK ISSN 0969-2126 2005                                 2HSD 15
REMARK 1                                                              2HSD 16
REMARK 1 REFERENCE 1                                                  2HSD 17
REMARK 1 AUTH D.GHOSH,C.M.WEEKS,P.GROCHULSKI,W.L.DUAX,M.ERMAN,        2HSD 18
REMARK 1 AUTH 2 R.L.RIMSAY,J.C.ORR                                    2HSD 19


Example of PDB entry, cont.
REMARK 1 TITL THREE-DIMENSIONAL STRUCTURE OF HOLO                     2HSD 20
REMARK 1 TITL 2 3ALPHA,20BETA-HYDROXYSTEROID DEHYDROGENASE:           2HSD 21
REMARK 1 TITL 3 A MEMBER OF A SHORT-CHAIN DEHYDROGENASE FAMILY        2HSD 22
REMARK 1 REF PROC.NAT.ACAD.SCI.USA V. 88 10064 1991                   2HSD 23
REMARK 1 REFN ASTM PNASA6 US ISSN 0027-8424 0040                      2HSD 24
REMARK 2                                                              2HSD 25
REMARK 2 RESOLUTION. 2.64 ANGSTROMS.                                  2HSD 26
REMARK 3                                                              2HSD 27
REMARK 3 REFINEMENT.                                                  2HSD 28
REMARK 3 PROGRAM 1 PROLSQ                                             2HSD 29
REMARK 3 AUTHORS 1 KONNERT,HENDRICKSON                                2HSD 30
REMARK 3 PROGRAM 2 X-PLOR                                             2HSD 31
REMARK 3 AUTHORS 2 BRUNGER                                            2HSD 32
REMARK 3 R VALUE 0.188                                                2HSD 33
REMARK 3 RMSD BOND DISTANCES 0.012 ANGSTROMS                          2HSD 34
REMARK 3 RMSD BOND ANGLE DISTANCES 0.035 ANGSTROMS                    2HSD 35
REMARK 3                                                              2HSD 36
REMARK 3 NUMBER OF REFLECTIONS 28327                                  2HSD 37
REMARK 3 RESOLUTION RANGE 8.0 - 2.64 ANGSTROMS                        2HSD 38


Example of PDB entry, cont.
SEQRES 1 A 253 ASN ASP LEU SER GLY LYS THR VAL ILE ILE THR GLY GLY    2HSD 86
SEQRES 2 A 253 ALA ARG GLY LEU GLY ALA GLU ALA ALA ARG GLN ALA VAL    2HSD 87
SEQRES 3 A 253 ALA ALA GLY ALA ARG VAL VAL LEU ALA ASP VAL LEU ASP    2HSD 88
SEQRES 4 A 253 GLU GLU GLY ALA ALA THR ALA ARG GLU LEU GLY ASP ALA    2HSD 89
...
HET NAD A 256 44 NICOTINAMIDE-ADENINE-DINUCLEOTIDE                    2HSD 166
HET NAD B 256 44 NICOTINAMIDE-ADENINE-DINUCLEOTIDE                    2HSD 167
HET NAD C 256 44 NICOTINAMIDE-ADENINE-DINUCLEOTIDE                    2HSD 168
HET NAD D 256 44 NICOTINAMIDE-ADENINE-DINUCLEOTIDE                    2HSD 169
FORMUL 5 NAD 4(C21 H28 N7 O14 P2)                                     2HSD 170
HELIX 1 BA ARG A 16 ALA A 29 1                                        2HSD 171
HELIX 2 CA ASP A 40 ALA A 54 1                                        2HSD 172
HELIX 3 DA TRP A 67 GLU A 77 1                                        2HSD 173
...
SHEET 1 S1A 7 GLN A 57 THR A 62 0                                     2HSD 195
SHEET 2 S1A 7 GLY A 30 VAL A 38 1 N LEU A 35 O LEU A 59               2HSD 196
SHEET 3 S1A 7 GLY A 6 THR A 12 1 N ILE A 10 O VAL A 34                2HSD 197
SHEET 4 S1A 7 ASP A 82 GLY A 93 1 N ASP A 82 O LYS A 7                2HSD 198


Example of PDB entry, cont.
SHEET 5 S1A 7 GLY A 130 SER A 139 1 N VAL A 135 O LEU A 84            2HSD 199
SHEET 6 S1A 7 ARG A 174 GLY A 183 1 O ARG A 174 N GLY A 132           2HSD 200
SHEET 7 S1A 7 GLY A 234 ALA A 238 1 N LEU A 237 O SER A 179           2HSD 201
...
TURN 5 EA LYS A 127 GLY A 130                                         2HSD 227
TURN 6 FA LEU A 170 ASP A 173                                         2HSD 228
TURN 7 GA VAL A 210 GLU A 213 PART OF THE LONGEST LOOP                2HSD 229
...
CRYST1 106.200 106.200 203.800 90.00 90.00 90.00 P 43 21 2 32         2HSD 251
...
ATOM 1 N   ASN A 2 100.288 -12.391 81.699 1.00 50.99                  2HSD 267
ATOM 2 CA  ASN A 2  99.826 -12.757 83.069 1.00 50.77                  2HSD 268
ATOM 3 C   ASN A 2  99.264 -11.499 83.704 1.00 51.56                  2HSD 269
ATOM 4 O   ASN A 2  99.981 -10.517 83.905 1.00 54.06                  2HSD 270
ATOM 5 CB  ASN A 2 100.992 -13.311 83.883 1.00 53.10                  2HSD 271
ATOM 6 CG  ASN A 2 101.673 -14.476 83.192 1.00 53.65                  2HSD 272
ATOM 7 OD1 ASN A 2 101.646 -14.567 81.964 1.00 53.85                  2HSD 273
ATOM 8 ND2 ASN A 2 102.269 -15.374 83.965 1.00 50.11                  2HSD 274
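The ATOM records carry Cartesian coordinates in Å, so inter-atomic distances follow directly from them. A minimal sketch using the first two ATOM records above: it splits on whitespace for brevity, which is an assumption that only holds for simple records like these (real PDB files are fixed-column), and the helper name `parse_atom` is hypothetical.

```python
import math

# First two ATOM records of the 2HSD excerpt, without the trailing line id.
ATOM_LINES = [
    "ATOM 1 N ASN A 2 100.288 -12.391 81.699 1.00 50.99",
    "ATOM 2 CA ASN A 2 99.826 -12.757 83.069 1.00 50.77",
]


def parse_atom(line):
    """Return (atom name, (x, y, z)) from a whitespace-separated ATOM record."""
    parts = line.split()
    name = parts[2]
    x, y, z = (float(value) for value in parts[6:9])
    return name, (x, y, z)


atoms = dict(parse_atom(line) for line in ATOM_LINES)

# Euclidean distance between the backbone N and CA atoms of Asn 2.
n_ca = math.dist(atoms["N"], atoms["CA"])
print(f"N-CA distance: {n_ca:.2f} Å")
```

The same distance calculation, applied to ligand and active-site atoms, gives figures of the kind shown in the docking table earlier in the deck.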


Pairwise sequence comparisons


Pairwise sequence comparisons
[Flow chart: query sequence; database; next sequence from database; found homologues]
Methods
Diagonal plots
FASTA
BLAST


Why comparisons?
Identification of protein
Search for homologies
  – giving functional/structural conclusions
Molecular architecture
Evolution


Diagonal plots
10 A |*           *     *
   D |                *
   E |              *
   G |    *     *
   G |    *     *
 5 S |
   C |      *
   G |    *     *
   K |  *
 1 A |*           *     *
     +--------------------
      A K G C T G A E D A ....
      1       5         10
corresponding alignment:
      A K G C T G A E D A ....
      A K G C S G G E D A ....
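A diagonal plot is simply a matrix with a mark wherever a residue of one sequence equals a residue of the other. A minimal sketch that recomputes the plot for the two ten-residue sequences on this slide; the function names are illustrative and not taken from any particular program.

```python
def dot_matrix(horizontal, vertical):
    """Return rows of booleans: True where vertical[i] == horizontal[j]."""
    return [[h == v for h in horizontal] for v in vertical]


def render(horizontal, vertical):
    """Draw the plot with the vertical sequence running bottom to top."""
    matrix = dot_matrix(horizontal, vertical)
    lines = []
    for residue, row in reversed(list(zip(vertical, matrix))):
        cells = " ".join("*" if hit else " " for hit in row)
        lines.append(f"{residue} |{cells}".rstrip())
    lines.append("  +" + "-" * (2 * len(horizontal)))
    lines.append("   " + " ".join(horizontal))
    return "\n".join(lines)


HORIZONTAL = "AKGCTGAEDA"   # sequence along the x axis
VERTICAL = "AKGCSGGEDA"     # sequence along the y axis

matrix = dot_matrix(HORIZONTAL, VERTICAL)
total_dots = sum(sum(row) for row in matrix)
diagonal_dots = sum(matrix[i][i] for i in range(len(VERTICAL)))
print(render(HORIZONTAL, VERTICAL))
print(total_dots, "dots in total,", diagonal_dots, "on the main diagonal")
```

Runs of marks along a diagonal correspond to stretches that align without gaps; the isolated off-diagonal marks are chance identities.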


Results from diagonal plot analysis


Improved comparisons 1
Not just considering identities but also similarities (based upon physicochemical properties or evolutionary likelihood)
This is done by assigning weights for different mismatches (substitution matrices)
  – A transition is more likely than a transversion
  – An Ile--Val exchange is more likely than an Ile--Arg change, since functional properties often are preserved.


Scoring matrix
To score matches and mismatches.
A number is assigned to each position in the sequence depending on the match at that position.
The scores for all positions in the alignment are added to calculate a total score.
This score is used to select the optimal alignment among alternative alignments.
The simplest way of scoring is to assign 1 for a match and 0 for a mismatch. Such a matrix is referred to as a unitary matrix.


Selection of scoring matrix
The scoring matrices contain information about the likelihood of a certain change.
Scoring matrices appear in all analyses involving sequence comparison.
The choice of matrix can strongly influence the outcome of the analysis.
Scoring matrices implicitly represent a particular theory of evolution.
Understanding the theory underlying a given scoring matrix can aid in making a proper choice.


Scoring matrix (BLOSUM62)
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
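To see what the matrix adds over the unitary scheme, the sketch below scores two ungapped three-residue alignments both ways. Only the six BLOSUM62 entries needed are copied from the table on this slide; the peptides IVR, VVR and RVR are made-up illustrations, and the function names are hypothetical.

```python
# The six BLOSUM62 entries used below, copied from the table on this slide.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("V", "V"): 4, ("R", "R"): 5,
    ("I", "V"): 3, ("I", "R"): -3, ("V", "R"): -3,
}


def blosum(a, b):
    """Look up a pair in the symmetric subset, whichever way round it is stored."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def unitary(a, b):
    """Unitary matrix: 1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0


def alignment_score(s, t, score):
    """Sum the per-position scores of an ungapped alignment."""
    if len(s) != len(t):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(score(a, b) for a, b in zip(s, t))


conservative = alignment_score("IVR", "VVR", blosum)          # Ile--Val exchange
radical = alignment_score("IVR", "RVR", blosum)               # Ile--Arg exchange
unitary_conservative = alignment_score("IVR", "VVR", unitary)
unitary_radical = alignment_score("IVR", "RVR", unitary)
print(conservative, radical, unitary_conservative, unitary_radical)
```

The unitary matrix gives both alignments the same score, while the substitution matrix ranks the Ile--Val exchange above the Ile--Arg one.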


Using scoring matrix
Unitary matrix (running sums marked along the diagonal: =2, =3, =4; total =4)
S  0  0  0  0  0
C  0  0  0  1  0
G  0  0  1  0  0
K  0  1  0  0  0
A  1  0  0  0  0
   A  K  G  C  T
Matrix with substitution weights (running sums marked along the diagonal: =4, =9, =10, =11, =12)
S  1  0  0 -1  1
C  0 -3 -3  1  0
G  0 -2  1 -3  0
K -1  5 -2 -3 -1
A  4 -1  0  0  0
   A  K  G  C  T


PAM
The most important improvement achieved over the unitary matrix was based on evolutionary distances.
Margaret O. Dayhoff pioneered this approach in the 1970s.
  – An extensive study of the frequencies with which amino acids substitute for each other during evolution.
  – The studies involved carefully aligning all of the proteins in several families of proteins and then constructing phylogenetic trees for each family.
  – This led to a table of relative frequencies with which amino acids replace each other over an evolutionary period. This table and the relative frequency of occurrence of amino acids in the proteins studied were combined in computing the PAM family of scoring matrices.


PAM, cont.
The PAM series is based on estimated mutation rates (Percent Accepted Mutations) from closely related proteins and will therefore be dominated by amino acid mutations caused by single base changes.
It is also called a log-odds matrix, since the numbers are proportional to the logarithm of the "odds" for the replacement not being a random change.
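The log-odds idea can be written out as a formula. This is the general form of a log-odds score, given as a sketch; the symbols are introduced here for illustration and are not taken from the slides: q_ab is the frequency with which residues a and b are observed aligned in related proteins, p_a and p_b are their background frequencies, and λ is a scaling constant.

```latex
s(a,b) \;=\; \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b}
```

A positive score means the pair is seen more often in related sequences than chance pairing would predict; a negative score means it is seen less often. Because the scores are logarithms, adding them along an alignment corresponds to multiplying the odds.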


PAM, cont.
PAM1 stands for 1 accepted mutation per 100 residues.
PAM matrices for less similar sequences are obtained by extrapolation.
The PAM100 matrix corresponds to 100 accepted mutations per 100 residues, but since the same residue might change more than once, two sequences with this level of mutations will have about 50% identities.
Similarly, the PAM250 matrix corresponds to a level of about 20% identical residues.
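The extrapolation can be stated compactly. As a sketch, with notation introduced here rather than taken from the slides: let M_1 be the PAM1 mutation probability matrix, with (M_1)_ab the probability that residue a is replaced by residue b over one PAM unit, and let f_a be the background frequency of residue a.

```latex
M_n = (M_1)^{\,n},
\qquad
\text{expected fraction of identical residues after } n \text{ PAM}
\;=\; \sum_{a} f_a\,(M_n)_{aa}
```

Repeated multiplication accounts for positions that mutate more than once or mutate back, which is why 100 accepted mutations per 100 residues still leaves a substantial fraction of positions identical.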


BLOSUM
One assumption inherent in the Dayhoff model is that the evolutionary rates are uniform over the whole of the protein sequence.
This is unlikely to be true, because evolutionary rates are lower in conserved regions and higher in non-conserved regions of proteins.
Henikoff and Henikoff used the BLOCKS database to search for differences among sequences, but only among the very conserved regions of a protein family.
Hence the term BLOSUM, from BLOcks SUbstitution Matrix.


BLOSUM, cont.
They first collected all of the sequences in the BLOCKS database.
Then, for each block, they counted the amino acids at each site to get a frequency table.
The odds for relatedness were then calculated in a similar way as for the Dayhoff matrix.
BLOSUM62 is derived from sequence blocks clustered at the 62% identity level.


Improved comparisons 2
Introduction of gaps
A gap and its length are distinct quantities.
  – Different weights should be applied to each. The reason is that once a gap has been introduced, it is much more likely to be prolonged than another gap is to be formed.


Introduction of gaps


Gap penalty
The scores from all pairs of aligned residues are combined with suitable penalties for introducing gaps to calculate a total score, which is used to select the optimal alignment.
The gap penalty is normally determined by two parameters:
  – one for opening of a gap and
  – one that gives a penalty proportional to the length of the gap.
Most programs allow the user to choose these parameters, which might have different optima for different systems. They also depend on the scoring matrix used.
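The two-parameter penalty can be sketched in a few lines. The convention used here (opening cost plus extension cost times gap length) is an assumption; programs differ in whether the first gapped residue is already covered by the opening cost. The default values 12 and 2 are borrowed from the "normal penalties (12, 2)" setting used later in these slides.

```python
def gap_penalty(length, gap_open=12, gap_extend=2):
    """Affine gap penalty: one opening cost plus a cost per gapped residue.

    Assumed convention: penalty = gap_open + gap_extend * length.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of three residues versus three separate one-residue gaps.
one_long_gap = gap_penalty(3)
three_short_gaps = 3 * gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

With these values a single three-residue gap costs 18 while three separate one-residue gaps cost 42, so the alignment procedure prefers to prolong an existing gap rather than open new ones.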


Improved comparisons 3
A large part of the evolutionary process works at the domain level, and not at the level of the complete sequence.
  – Therefore, unless two sequences are known to be homologous over their entire length, a local alignment is usually preferred to a global alignment.
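The difference between local and global alignment shows up clearly when a short domain is compared with a longer sequence that contains it. The following is a minimal dynamic-programming sketch, not the scoring scheme of any particular program: it assumes a simple match/mismatch score and a linear gap penalty (all three values are illustrative), and the longer sequence is a made-up example embedding the motif CGSHLV.

```python
def align_score(s, t, match=2, mismatch=-1, gap=-2, local=True):
    """Best alignment score by dynamic programming.

    local=True  -> Smith-Waterman style: cells never drop below 0 and the
                   best cell anywhere in the table is returned.
    local=False -> Needleman-Wunsch style: end gaps are penalised and the
                   bottom-right cell is returned.
    """
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            cell = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)
            table[i][j] = cell
            best = max(best, cell)
    return best if local else table[-1][-1]


DOMAIN = "CGSHLV"               # short conserved motif
PROTEIN = "MKTWWCGSHLVPPRD"     # made-up longer sequence containing the motif

local_score = align_score(DOMAIN, PROTEIN, local=True)
global_score = align_score(DOMAIN, PROTEIN, local=False)
print(local_score, global_score)
```

The local alignment scores the six matching residues and ignores the flanks; the global alignment must pay for nine gap positions and ends up negative.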


FASTA
Lipman & Pearson
Identities / Similarities
Scoring matrices
  – PAM250
  – BLOSUM62
Best diagonal only
  – One alignment


FASTA run from the command line
zierfandler: /home/bpn/test> fasta3                        <- Start program
FASTA searches a protein or DNA sequence data bank
version 3.0t77 April 29, 1997
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
test sequence file name: query_sequence.fasta              <- Enter query sequence name
Choose sequence library:
S: Swissprot complete
G: Genpept complete
P: PIR
Enter library filename (e.g. prot.lib), letter (e.g. P)
or a % followed by a list of letters (e.g. %PN): s         <- Enter sequence library
ktup? (1 to 2) [2] 1                                       <- Enter ktup value
query_sequence.fasta: 110 aa
>INS_HUMAN
vs Swissprot complete library
searching /zierfandler/data/swissprot/seq.dat 3 library
..... ..... ..... ..... ..... ..... ..... ..... ..... .....
..... ..... ..... ..... ..... ..... ..... ..... ..... .....
..... ..... ..... ..... ..... ..... ..... ..... .....
Done!


Results from FASTA run, histogram
 opt       E()
< 20  182    0:===
  22    1    0:=          one = represents 88 library sequences
  24    1    0:=
  26    8    1:*
  28   32   13:*
  30   84   81:*
  32  396  313:===*=
  34 1026  848:=========*==
  36 2019 1742:===================*===
  38 3168 2879:================================*===
  40 4160 4016:=============================================*==
  42 4809 4910:=======================================================*
  44 5244 5416:===========================================================*
  46 5035 5516:==========================================================*
  48 5090 5281:==========================================================*
  50 4607 4819:===================================================== *
  52 4171 4237:================================================*
  54 3481 3619:======================================== *
  56 3051 3023:==================================*
  58 2533 2482:============================*
  60 1995 2010:======================*
  62 1721 1612:==================*=
  64 1293 1282:==============*
  66 1026 1013:===========*
  68  840  797:=========*
  70  670  624:=======*
  72  527  488:=====*
  74  375  380:====*
  76  336  296:===*
  78  236  230:==*
  80  232  179:==*
  82  150  137:=*
  84   88  108:=*
  86   69   84:*
  88   70   65:*          inset = represents 3 library sequences
  90   53   50:*
  92   34   39:*          :============*
  94   23   30:*          :======== *
  96   17   23:*          :====== *
  98   17   18:*          :=====*
 100    5   14:*          :== *
 102    2   11:*          := *
 104    4    8:*          :==*
 106    5    6:*          :=*
 108    4    5:*          :=*
 110    3    4:*          :=*
 112    3    3:*          :*
 114    1    2:*          :*
 116    0    2:*          :*
 118    0    1:*          :*
>120  125    1:*=         :*=======================================


Results from FASTA run, hit list
21210615 residues in 59022 sequences
statistics extrapolated from 50000 to 58723 sequences
Expectation fit: rho(ln(x))= 4.8418+/-0.000523; mu= 9.7108+/- 0.029;
mean_var=54.3033+/-10.602
Kolmogorov-Smirnov statistic: 0.0170 (N=29) at 40
FASTA (3.06 Sept, 1996) function (optimized, BL50 matrix) ktup: 1
join: 42, opt: 30, gap-pen: -12/ -2, width: 32 reg.-scaled
Scan time: 54.450
Enter filename for results: test1.res
How many scores would you like to see? [20]
The best scores are:                initn init1   opt   z-sc E(58723)
INS_HUMAN INSULIN PRECURSOR. ( 110)   765   765   765 1044.1 0
INS_PANTR INSULIN PRECURSOR. ( 110)   756   756   756 1031.8 0
INS_MACFA INSULIN PRECURSOR. ( 110)   750   750   750 1023.7 0
INS_CERAE INSULIN PRECURSOR. ( 110)   745   745   745 1016.9 0
INS_CANFA INSULIN PRECURSOR. ( 110)   676   676   676  923.3 0
INS_AOTTR INSULIN PRECURSOR. ( 108)   489   489   658  899.0 0
..
INS_PIG   PROINSULIN.        (  84)   296   296   495  679.4 2.9e-31
INS_OCTDE INSULIN PRECURSOR. ( 109)   414   238   477  653.3 8.2e-30
INS_CHICK INSULIN PRECURSOR. ( 107)   446   307   444  608.6 2.5e-27
More scores? [0]


Results from FASTA run, alignment
>>INS_ONCKE INSULIN PRECURSOR. (105 aa)
initn: 295 init1: 182 opt: 328 Z-score: 451.3 expect() 1.5e-18
Smith-Waterman score: 328; 50.893% identity in 112 aa overlap

               10         20        30        40        50
INS_HU MALWMRLLPLLALLALW-GPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAE
       ::.:..   ::.::::  : : :::   :::::::::.::::::::.:::::::  :...
INS_ON MAFWLQAASLLVLLALSPGVDAAAA---QHLCGSHLVDALYLVCGEKGFFYTPK--RDVD
               10        20           30        40        50

      60        70        80        90       100       110
INS_HU DLQVGQVELGGGPGAGSLQPLALEGSLQ-KRGIVEQCCTSICSLYQLENYCN
        : .: .   ..   .   :.  .  .. ::::::::: . :....:.::::
INS_ON PL-IGFLSPKSAK-ENEEYPFKDQTEMMVKRGIVEQCCHKPCNIFDLQNYCN
           60         70        80        90       100
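The "% identity in 112 aa overlap" figure can be recomputed directly from the two aligned rows. A minimal sketch: the aligned strings are the INS_HU and INS_ON rows from this slide joined end to end, and the function name is illustrative.

```python
# Aligned rows from the slide, both alignment blocks joined ("-" marks a gap).
INS_HU = ("MALWMRLLPLLALLALW-GPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAE"
          "DLQVGQVELGGGPGAGSLQPLALEGSLQ-KRGIVEQCCTSICSLYQLENYCN")
INS_ON = ("MAFWLQAASLLVLLALSPGVDAAAA---QHLCGSHLVDALYLVCGEKGFFYTPK--RDVD"
          "PL-IGFLSPKSAK-ENEEYPFKDQTEMMVKRGIVEQCCHKPCNIFDLQNYCN")


def percent_identity(a, b):
    """Return (identical columns, overlap length, percent identity)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return identical, len(a), 100.0 * identical / len(a)


identical, overlap, percent = percent_identity(INS_HU, INS_ON)
print(f"{percent:.3f}% identity in {overlap} aa overlap ({identical} identical)")
```

The overlap length counts every alignment column, gaps included, which is why the denominator is 112 rather than the length of either sequence.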


Alignment score
The significance of an alignment score can be estimated by repeating the alignment with randomised sequences of the same amino acid composition.
A number of such alignments will give a mean score and an estimate of the expected variation of the score for random sequences similar to the aligned sequences.
From this, the obtained score for the alignment can be expressed as the number of standard deviations above the mean.
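The shuffling procedure described above can be sketched as follows. This is an illustration under stated assumptions, not how any particular program computes its statistics: the local alignment score uses a simple match/mismatch scheme with a linear gap penalty, the number of shuffles and the random seed are arbitrary, and the two fragments are taken from the insulin alignment shown earlier.

```python
import random
import statistics


def local_score(s, t, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman style local alignment score with a linear gap penalty."""
    previous = [0] * (len(t) + 1)
    best = 0
    for a in s:
        current = [0]
        for j, b in enumerate(t, start=1):
            cell = max(0,
                       previous[j - 1] + (match if a == b else mismatch),
                       previous[j] + gap,
                       current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def shuffled_z_score(query, target, n_shuffles=200, seed=0):
    """Score the real pair, then score the query against shuffled targets.

    Shuffling keeps the amino acid composition but destroys the order.
    Returns (observed, mean, standard deviation, z-score).
    """
    rng = random.Random(seed)
    observed = local_score(query, target)
    residues = list(target)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(local_score(query, "".join(residues)))
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return observed, mean, sd, (observed - mean) / sd


QUERY = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"   # fragment of the INS_HU row
TARGET = "QHLCGSHLVDALYLVCGEKGFFYTPK"      # fragment of the INS_ON row

observed, mean, sd, z = shuffled_z_score(QUERY, TARGET)
print(f"observed {observed}, random mean {mean:.1f}, sd {sd:.1f}, z {z:.1f}")
```

A score many standard deviations above the shuffled mean is unlikely to arise from composition alone, which is the sense in which the alignment is called significant.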


Caution with the interpretations
An alignment demonstrates similarity, not necessarily homology.
  – Homology is an evolutionary inference based on examination of the similarity and its biological meaning. Sequence similarity may result from homology but it may also result from chance or analogy.


The FASTA program
Local installation
  – program freely available via Internet
Part of the GCG package
WWW service at EBI
  – http://www.ebi.ac.uk/fasta3/


EBI home page


http://www.ebi.ac.uk/fasta


http://www.ebi.ac.uk/fasta33/


EBI FASTA submission form, cont.


BLAST
Fast algorithm
Local installation possible
  – program available via Internet
WWW services
  – http://www.ncbi.nlm.nih.gov/BLAST/


BLAST programs
Program   Query sequence                      Database
blastp    protein                             protein
blastn    nucleotide                          nucleotide
blastx    nucleotide, translated in 6 frames  protein
tblastn   protein                             nucleotide, translated in 6 frames
tblastx   nucleotide, translated in 6 frames  nucleotide, translated in 6 frames
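Six-frame translation means translating the nucleotide sequence in three reading frames on each strand. A minimal sketch using the standard genetic code; the function names are illustrative, this is not BLAST's own implementation, and the test sequence is the start of the EMBL entry shown earlier.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the translations for frames +1, +2, +3, -1, -2 and -3."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


# First 24 bases of the EMBL entry X55997 shown earlier in the slides.
frames = six_frames("ATGACGGGCTGGAGCTGCCTTGTG")
for name, protein in frames.items():
    print(name, protein)
```

Frame +1 reproduces the start of the /translation qualifier of that entry (MTGWSCLV); the other five frames are what a translated search would also have to compare.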


NCBI home page


NCBI BLAST page, www.ncbi.nlm.nih.gov/BLAST/


Basic BLAST


Basic BLAST, cont.


Basic BLAST, results


Basic BLAST, results, cont.


FASTA parameters
Scoring matrix
  – BLOSUM, PAM, ...
Ktup
  – 1 or 2 for proteins, 1--6 for nucleotide sequences
  – Low number => high accuracy, long CPU time
Gap penalties
  – Penalty to start a gap (points per gap)
  – Penalty to elongate a gap (points per residue)
E-value (Expect value)
  – number of hits with at least this score to be expected by chance
  – values < 10^-6 "safe"
  – dependent on search sequence, parameters and database
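The effect of ktup can be illustrated by counting shared words of length ktup on each diagonal. This is a simplified sketch of the word-lookup idea only, not FASTA's actual implementation; the function name is hypothetical and the two sequences are the ones from the diagonal plot slide.

```python
from collections import Counter, defaultdict


def ktup_diagonals(query, target, ktup=2):
    """Count shared words of length ktup per diagonal (query index - target index)."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)        # word -> positions in query
    hits = Counter()
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            hits[i - j] += 1
    return hits


QUERY = "AKGCTGAEDA"
TARGET = "AKGCSGGEDA"

hits_ktup2 = ktup_diagonals(QUERY, TARGET, ktup=2)
hits_ktup1 = ktup_diagonals(QUERY, TARGET, ktup=1)
print("ktup=2:", dict(hits_ktup2))
print("ktup=1:", sum(hits_ktup1.values()), "hits on", len(hits_ktup1), "diagonals")
```

With ktup = 2 only the main diagonal collects hits, while ktup = 1 reports every single-residue identity, so more diagonals have to be examined: more sensitive, but slower.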


Statistics, normal penalties (12, 2)


Statistics, high penalties (48, 8)


Statistics, low penalties (1.2, 0.2)


Effect on alignment by gap penalties
[Figure panels: normal penalties (12, 2); high penalties (48, 8)]


Multiple sequence alignments


Multiple sequence alignments
Characterisation of protein families
Important amino acid residues
Conserved sequence motifs


Alignment programs
ClustalW and ClustalX
  – freely available for DOS, Mac, Linux, Unix
  – WWW service: http://www.ebi.ac.uk/clustalw/
T-Coffee
Dialign
Kalign
Mafft
Muscle
….


Clustal
Clustering + Aligning
All-against-all pairwise comparisons
The most similar sequence pair is aligned first
Subsequently, the sequence next in similarity order is aligned to the first pair
This procedure is repeated for all sequences, creating a multiple sequence alignment
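The ordering step described above can be sketched with a simple similarity measure. This is a deliberately simplified illustration and not Clustal's algorithm: it uses ungapped percent identity instead of real pairwise alignments, picks the next sequence by its best identity to anything already included rather than building a guide tree, and the four toy sequences are made up.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions in an ungapped comparison."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def joining_order(seqs):
    """Order in which sequences would be added: most similar pair first,
    then repeatedly the sequence most similar to one already included."""
    pairs = sorted(
        combinations(seqs, 2),
        key=lambda pair: identity(seqs[pair[0]], seqs[pair[1]]),
        reverse=True,
    )
    order = list(pairs[0])
    remaining = set(seqs) - set(order)
    while remaining:
        nxt = max(
            sorted(remaining),
            key=lambda name: max(identity(seqs[name], seqs[o]) for o in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Made-up toy sequences of equal length.
SEQS = {
    "a": "AKGCTGAEDA",
    "b": "AKGCSGGEDA",
    "c": "AKGCTGAEDV",
    "d": "MKPCSWGLDA",
}

order = joining_order(SEQS)
print(order)
```

Here a and c (90% identical) are joined first, b follows, and the most divergent sequence d is added last, which mirrors why errors made in the early, most similar pairs propagate to everything aligned afterwards.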


Clustal, cautions
A gap, introduced at the beginning, will propagate through the alignment
Since Clustal makes a global alignment, manual intervention might be necessary when aligning sequences of large differences in length


ClustalW main menu
 **************************************************************
 ******** CLUSTAL W(1.5) Multiple Sequence Alignments ********
 **************************************************************
     1. Sequence Input From Disc
     2. Multiple Alignments
     3. Profile Alignments
     4. Phylogenetic trees
     S. Execute a system command
     H. HELP
     X. EXIT (leave program)
Your choice: 1
Sequences should all be in 1 file.
6 formats accepted:
NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF.
Enter the name of the sequence file: sequences.mf


ClustalW multiple alignment menu
 ****** MULTIPLE ALIGNMENT MENU ******
     1. Do complete multiple alignment now (Slow/Accurate)
     2. Produce guide tree file only
     3. Do alignment using old guide tree file
     4. Toggle Slow/Fast pairwise alignments = SLOW
     5. Pairwise alignment parameters
     6. Multiple alignment parameters
     7. Reset gaps between alignments? = ON
     8. Toggle screen display = ON
     9. Output format options
     S. Execute a system command
     H. HELP
or press [RETURN] to go back to main menu
Your choice:


Pairwise alignment parameters
 ********* PAIRWISE ALIGNMENT PARAMETERS *********
 Slow/Accurate alignments:
     1. Gap Open Penalty       :10.00
     2. Gap Extension Penalty  :0.10
     3. Protein weight matrix  :BLOSUM30
 Fast/Approximate alignments:
     4. Gap penalty            :3
     5. K-tuple (word) size    :1
     6. No. of top diagonals   :5
     7. Window size            :5
     8. Toggle Slow/Fast pairwise alignments = SLOW
     H. HELP
 Enter number (or [RETURN] to exit):
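
To see why the gap open penalty (10.00) and the gap extension penalty (0.10) are set so differently, it can help to compute the cost of a few gaps. The sketch below assumes the common affine form, cost = open + extension × length. That form is an assumption made for illustration: conventions differ between programs, and ClustalW additionally adjusts penalties by position. The function name `affine_gap_cost` is hypothetical.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.1):
    """Cost of one gap of the given length under an assumed affine model."""
    if length <= 0:
        return 0.0
    # One opening charge plus a small charge per gapped position.
    return gap_open + gap_extend * length


# One long gap is far cheaper than many short ones of the same total length.
one_long_gap = affine_gap_cost(10)
ten_short_gaps = 10 * affine_gap_cost(1)
print(one_long_gap, ten_short_gaps)
```

Under this model the alignment prefers a few long gaps to many scattered single-residue gaps, which is usually the biologically more plausible outcome.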


Multiple alignment parameters
 ********* MULTIPLE ALIGNMENT PARAMETERS *********
     1. Gap Opening Penalty        :10.00
     2. Gap Extension Penalty      :0.05
     3. Delay divergent sequences  :40 %
     4. Toggle Transitions (DNA)   :Weighted
     5. Protein weight matrix      :BLOSUM series
     6. Use negative matrix        :OFF
     7. Protein Gap Parameters
     H. HELP
 Enter number (or [RETURN] to exit):


ClustalW at EBI, Submission form


ClustalW, Submission form, cont.


waiting ...


ClustalW, Results


ClustalW, Scores Table


ClustalW, Alignment


ClustalW, Guide Tree & Cladogram


Alignment display using Jalview
In Jalview, a number of things can be changed, e.g. font, character size, colour. Just use the pull-down menus.


JalView
A new experimental option has been added to the results page, which involves using a Java applet called JalView.
This is a fully featured MSA (multiple sequence alignment) editor which allows you not only to edit the alignment but also to convert between alignment formats.
Please note that JalView is under development.
For documentation, please click on the JalView hyperlink.


ClustalW, output file


Sequence formats in ClustalW
Pearson (Fasta)
GCG/MSF
Clustal
NBRF/PIR
EMBL/SwissProt
GDE
RSF
Pearson (Fasta) example:
>title1
SEQUENCE
>title2
NEXTSEQUENCE
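
The Pearson (Fasta) layout shown on this slide is simple enough to read with a few lines of Python. The following is a minimal sketch, not a full parser: it assumes every record starts with a `>` title line and that sequence text may be wrapped over several lines. The function name `parse_fasta` and the `example` string are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each title to its concatenated sequence."""
    records = {}
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; everything after '>' is the title.
            title = line[1:].strip()
            records[title] = []
        elif title is not None:
            # Sequence lines are collected until the next title line.
            records[title].append(line)
    return {t: "".join(parts) for t, parts in records.items()}


example = ">title1\nSEQUENCE\n>title2\nNEXT\nSEQUENCE\n"
print(parse_fasta(example))
```

The second record is deliberately wrapped over two lines to show that the pieces are joined back into one sequence.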


Kalign at EBI, Submission form


Kalign, Alignment


JalView, Kalign alignment


PSI-BLAST
[Diagram labels: Query sequence, Database, Found homologues, Sequence profile, More homologues]
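
The "sequence profile" in this diagram summarises, position by position, which residues occur in the homologues found so far. A minimal sketch of that idea is given below. It only computes per-column residue frequencies from a toy alignment, whereas PSI-BLAST builds a weighted, log-odds position-specific scoring matrix, so this illustrates the concept rather than the program's method. The function name `column_frequencies` and the alignment `toy_alignment` are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """Per-column residue frequencies for equal-length aligned rows ('-' = gap)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # gaps are not counted
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Hypothetical toy alignment, invented for illustration only.
toy_alignment = ["MKVL", "MKIL", "MRVL", "M-VL"]
for position, freqs in enumerate(column_frequencies(toy_alignment), start=1):
    print(position, freqs)
```

Conserved columns end up dominated by one residue, while variable columns spread their weight, which is what lets a profile find relatives that a single query sequence would miss.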


PSI-BLAST
Purpose of a PSI-BLAST search:
– to identify distant relatives
– to gain insight into the function of the protein family.


PSI-BLAST, results


PSI-BLAST
To start the next iteration, press this button.


Expect values
The first of these (upper) sets the threshold for the initial BLAST search.
– The default value is 10, as in the standard BLAST program.
The second E value (lower) is the threshold value for inclusion in the position-specific matrix used for PSI-BLAST iterations.
– The default setting is 0.001.


Filter Low complexity
It is appropriate to filter most queries for low complexity sequences.
Some types of low complexity sequences may not be detected by the filtering option in BLAST.
For example, coiled-coil and transmembrane regions need to be detected using the appropriate programs outside of BLAST.
Since coiled-coil encoding sequences can lead to matches with other coiled-coil proteins and thus obscure more meaningful hits, the user might consider manually masking the region to optimise the sensitivity of the search.
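
Manual masking of a region, as suggested on this slide, amounts to overwriting a stretch of the query before submitting it. The sketch below assumes that the region boundaries are already known from another program, and that `X` is used as the masking character for protein sequences. The function name `mask_region` and the example sequence are hypothetical.

```python
def mask_region(seq, start, end, mask_char="X"):
    """Replace residues start..end (1-based, inclusive) with mask_char."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region must lie within the sequence")
    masked_length = end - start + 1
    return seq[: start - 1] + mask_char * masked_length + seq[end:]


# Hypothetical query with a stretch to hide at positions 3 to 6.
query = "MKLEELKKKLEEAG"
print(mask_region(query, 3, 6))
```

The masked sequence keeps its original length, so positions reported in the search results still refer to the unmasked query.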


Advanced Options
In the "Advanced Options" field it is possible to specify
– gap costs
– word size
– and other parameters not otherwise selectable on the query form.
Output formatting options may also be adjusted here.
– For example, the user might type: "-v150" to cause 150 descriptions (rather than 100 or 250 available through the pull-down menu) to be displayed.


Set the output formatting options
The default number of descriptions and alignments to be listed is 500.
These variables affect the search in two ways:
1. If the total number of hits for which E is less than the threshold exceeds the number (x) of descriptions requested, only the top x most significant would be listed; additional possibly significant alignments would not be shown, though these may embody important information.
2. The number of sequences used in generating the multiple alignment and the position-specific matrix is specified by the larger of the two (descriptions, alignments) variables. If at any point in the iterative PSI-BLAST process significant sequences are omitted from the profile, all subsequent output will be affected.


Set the output formatting options
By selecting a large number of descriptions (e.g. 250 to 500) it is possible to ensure that the E value and not the description limit will be the determining factor in generating the profile to be used for additional iterations.
Reducing the output can then be accomplished, if desired, by limiting the number of alignments to be reported.
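
The interaction between the list limits and the E value threshold can be made concrete with a toy model. The sketch below follows the rule as stated on these slides (the profile draws on at most the larger of the two limits, then applies the inclusion threshold); it is not NCBI code, and the function name `profile_members` is hypothetical. The first six E values are taken from the example result pages in this lecture, and the last one (0.5) is invented to show a hit that fails the threshold.

```python
def profile_members(e_values, descriptions, alignments, inclusion_e=0.001):
    """E values of hits that would contribute to the profile in this toy model."""
    # The larger of the two output limits caps how many hits are considered.
    limit = max(descriptions, alignments)
    listed = sorted(e_values)[:limit]
    # Only hits better than the inclusion threshold enter the profile.
    return [e for e in listed if e < inclusion_e]


e_values = [2e-85, 1e-72, 6e-23, 4e-18, 4e-16, 2e-06, 0.5]  # 0.5 is hypothetical

small = profile_members(e_values, descriptions=3, alignments=2)
large = profile_members(e_values, descriptions=500, alignments=500)
print(len(small), len(large))
```

With a limit of 3, three significant hits are lost from the profile; with a generous limit, the E value alone decides membership.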


Alignment view
A variety of different alignment formats are available.
– Pairwise alignment gives a good view of the quality of an individual hit.
– A flat query-anchored alignment (with identities) is a format in which identities shared by numerous sequences can be easily spotted.


Output from initial BLAST search
The PSI-BLAST output is identical to the output from a BLAST search.
In a PSI-BLAST analysis, the results of the initial BLAST can be examined directly as for BLAST, but they can also be used to generate a profile with which to perform additional searches of the same database.
The profile is generated automatically and is invisible at the web interface.
Although the profile is not shown, results of a profile-directed search of the database are available using the "Run PSI-BLAST iteration" button.


Examination of the PSI-BLAST results
In a PSI-BLAST search, the hits are divided into two categories.
– Those that are better than the E value threshold are listed first.
– Those with E values worse than the threshold, but nonetheless better than 1 (selected on the query page), are listed further down the page.
– Hits with E values better than the threshold are used in forming the profile that will be used in subsequent PSI-BLAST iterations.
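
The two categories described on this slide correspond to a simple partition of the hit list by E value. The sketch below uses the thresholds from this example (0.001 for inclusion, 1 for reporting). The function name `split_hits` and the hit names and E values are hypothetical, chosen only to give one hit per case.

```python
def split_hits(hits, inclusion_e=0.001, report_e=1.0):
    """Partition (name, e_value) hits into profile members and listed-only hits."""
    # Better than the inclusion threshold: listed first and used for the profile.
    included = [(name, e) for name, e in hits if e < inclusion_e]
    # Worse than the threshold but better than the reporting limit: listed only.
    listed_only = [(name, e) for name, e in hits if inclusion_e <= e < report_e]
    return included, listed_only


# Hypothetical hits, invented for illustration only.
hits = [("hit_a", 2e-85), ("hit_b", 5e-4), ("hit_c", 0.02), ("hit_d", 3.0)]
included, listed_only = split_hits(hits)
print(included)
print(listed_only)
```

Here `hit_d` appears in neither list, because its E value is worse than the reporting limit.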


Examination, cont.
Sequences producing significant alignments:                          Score (bits)  E Value
sp|Q57997|Y577_METJA PROTEIN MJ0577 >gi|2128018|pir||A64372 h...     314           2e-85
pdb|1MJH| Structure-Based Assignment Of The Biochemical Fun...       272           1e-72
The first and second hits are to the query sequence (MJ0577) itself. The second hit corresponds to the database entry associated with determination of the MJ0577 structure.
The score and E value for the structure entry are somewhat worse because a certain number of residues, which were disordered in the structure, were omitted from this database entry.


Examination, cont.
dbj|BAA29916| (AP000003) 170aa long hypothetical protein [Pyr...     107           6e-23
sp|Q57951|Y531_METJA HYPOTHETICAL PROTEIN MJ0531 >gi|2128015|...     91            4e-18
gi|2622094 (AE000872) conserved protein [Methanobacterium the...     85            4e-16
gi|2621993 (AE000865) conserved protein [Methanobacterium the...     81            4e-15
gi|2621194 (AE000803) conserved protein [Methanobacterium the...     80            7e-15
gi|2622163 (AE000877) conserved protein [Methanobacterium the...     79            2e-14
The next set of entries are to orthologous sequences in two other Archaeal species, Pyrococcus horikoshii and Methanobacterium thermoautotrophicum.
Interestingly, the top hits also include sets of paralogs in Methanococcus jannaschii and Methanobacterium thermoautotrophicum.


Examination, cont.
sp|P42297|YXIE_BACSU HYPOTHETICAL 15.9 KD PROTEIN IN BGLH-WAP...     76            1e-13
sp|Q50777|YB54_METTM HYPOTHETICAL 16.1 KD PROTEIN IN MTR REGI...     66            2e-10
gi|2648791 (AE000981) conserved hypothetical protein [Archaeo...     65            3e-10
gi|2648610 (AE000970) conserved hypothetical protein [Archaeo...     64            5e-10
gi|2983400 (AE000710) hypothetical protein [Aquifex aeolicus]        64            6e-10
sp|P73475|YC30_SYNY3 HYPOTHETICAL 31.2 KD PROTEIN SLR1230 >gi...     63            1e-09
gi|2983527 (AE000719) hypothetical protein [Aquifex aeolicus]        61            4e-09
sp|O27222|YB54_METTH HYPOTHETICAL PROTEIN MTH1154 >gi|2622260...     59            2e-08
dbj|BAA30650| (AP000006) 208aa long hypothetical protein [Pyr...     57            6e-08
dbj|BAA31039.1| (AP000007) 149aa long hypothetical protein [P...     56            1e-07
emb|CAB50594.1| (AJ248288) hypothetical protein [Pyrococcus a...     55            2e-07
sp|P39177|UP12_ECOLI UNKNOWN PROTEIN FROM 2D-PAGE (SPOTS PR25...     55            2e-07
sp|P74148|YD88_SYNY3 HYPOTHETICAL 17.3 KD PROTEIN SLL1388 >gi...     52            2e-06
These entries are to more distantly related Archaeal sequences for the most part. Two sequences are bacterial.
The scores and E values are respectable and most of the alignments extend the length of the query.


First conclusions from PSI-BLAST
The output from the first stage PSI-BLAST search is identical to the BLAST search results, with the exception that in this query the E value threshold was set at 0.001.
The results thus far have revealed that MJ0577 is one member of a moderate size family of proteins that are found in Archaea and in Eubacteria.
To gain insight into the function of the MJ0577 family of proteins, PSI-BLAST iterations may be launched.
The first iteration is initiated using the iteration button located at the end of the first result page.


Performing PSI-BLAST iterations
The power of PSI-BLAST is in its ability to generate a profile from the initial BLAST results and to use that profile to re-search the same database.
This procedure is initiated by the user upon clicking the "Run PSI-BLAST iteration" button.
The number of iterations that should be performed will depend on the goal of the experiment and the nature of the output.
The investigator may find that useful new information is limited to the first iteration.
When additional iterations yield no new hits, the search is said to have reached "convergence".
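
The iterate-until-convergence procedure can be sketched as a loop that stops as soon as an iteration adds no new hits. This is a toy model under stated assumptions, not the PSI-BLAST algorithm itself: the `search` argument is a hypothetical stand-in for a profile-directed database search, and the `neighbours` table replaces a real database with invented relationships.

```python
def iterate_until_converged(search, query, max_iterations=10):
    """Repeat the search until no new hits appear; return (hits, iteration)."""
    included = set(search({query}))  # initial BLAST-like search
    for iteration in range(1, max_iterations + 1):
        # The "profile" is modelled as the query plus all hits included so far.
        found = set(search(included | {query}))
        new_hits = found - included
        if not new_hits:
            return included, iteration  # converged: nothing new was found
        included |= new_hits
    return included, None  # iteration limit reached without convergence


# Hypothetical toy database: which sequences each sequence can detect.
neighbours = {"Q": {"A", "B"}, "A": {"C"}, "B": set(), "C": set()}


def toy_search(members):
    """Return everything detectable from any member of the current profile."""
    result = set()
    for member in members:
        result |= neighbours.get(member, set())
    return result


hits, converged_at = iterate_until_converged(toy_search, "Q")
print(sorted(hits), converged_at)
```

In this toy run, the distant relative C is only found once A has joined the profile, and the second iteration adds nothing new, so the search has converged.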


Converged!
The results of this iteration indicate that the search has converged.
Convergence is the term used to describe the iteration in which no new hits have been identified.
This is the logical end of a PSI-BLAST experiment, although iterations preceding the final one may cease to contain any new useful information.
The remainder of the page looks like pages for previous iterations, and includes a listing of significant hits, hits that fall below the significance threshold but are better than E=1, and alignments.


Assignment 1
Sequence comparisons
– FASTA and BLAST
– Protein level versus nucleotide level
– Translation
– Homology
– Protein identification
http://www.ifm.liu.se/bioinfo/HL1008/
http://bioinfo.limbo.ifm.liu.se/edu/HL1008/VT200
