Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010

Lecture 25: Phylogenetic Analysis II

Administrivia

• The homework Prof. Kuznetsov has given out is due Wednesday, December 1st, at 11:59 PM.
  – That’s tonight.
• It can be found at his web page for the class: http://lcg.rit.albany.edu/bio540

• No class next Monday, December 6th.
• Class will meet Wednesday, December 9th.
  – Prof. Kuznetsov’s lecture.
  – The last class meeting.

• The final exam:
  – Here, December 15th, 3:30 – 5:30 PM.
  – It will be comprehensive.
  – Closed book, closed notes.

• Course evaluations
  – You should all have received an email about course evaluations for this class.
  – Please find the time to complete the evaluation.

Today’s Content…

Readings
• Chapters Seven and Eight.

Guide to choosing an appropriate phylogenetic method
• Start: a set of related sequences → multiple alignment.
• Are these sequences highly similar? Yes → maximum parsimony methods.
• If not, are these sequences moderately similar? Yes → distance-based methods.
• If not → maximum likelihood methods.
• Result: phylogenetic tree.

Distance-based versus discrete methods
• Distance-based methods first convert aligned sequences into a distance matrix and then input the matrix into a tree-building algorithm.
• Discrete methods are character-based, i.e., they consider each nucleotide or amino acid character directly.
• In distance-based methods, once the distance matrix is built, the site-level biological information is lost.
  – In discrete methods, additional information, such as which site contributes to the length of each branch, is preserved.
• Distance-based methods are faster and easier to implement than discrete methods.

Parsimony Methods: Background
• The Eck and Dayhoff method counts all the amino acid substitutions in a phylogeny, but it treats high- and low-probability substitutions (according to the genetic code) equally.
  – Example: AAA (K) → CGC (R) vs. AAC (N) → AGC (S)
• The Fitch method counts the minimum number of nucleotide changes required to produce the observed variation, but it treats synonymous and non-synonymous changes equally.
  – Example: UUU (F) → CUU (L) → CUA (L) → CAA (Q)
• The Maximum Parsimony method takes a moderate approach based on the two methods above.
  – All amino acid changes are consistent with the genetic code.
  – Synonymous changes count for less than non-synonymous changes.
  – In the Fitch example above, the number of changes from F to Q is counted as two, not three.

Maximum Parsimony Method
• Also called the Minimum Evolution method.
• The idea is to predict the tree (or trees) that minimizes the number of steps required to generate the observed variation in the sequences.
• The trees are built from the aligned columns of the multiple alignment.
• Finally, the trees that require the smallest number of changes overall, summed over all sequence positions, are identified.
• Very time-consuming; not well suited to a large number of sequences or to sequences with a large amount of variation.
• Software packages:
  – For DNA: DNAPARS
  – For proteins: PROTPARS
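The column-by-column counting described above can be sketched in a few lines. This is a minimal illustration of Fitch-style counting on a fixed tree, not what DNAPARS does internally; the function names, the nested-tuple tree encoding, and the four-taxon alignment are all made up for the example.

```python
def fitch_site_cost(tree, states):
    """Minimum number of changes for one alignment column on a fixed binary tree.

    tree   -- nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states -- dict mapping each leaf name to its character in this column
    """
    def visit(node):
        if isinstance(node, str):            # leaf: state set is the observed character
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                           # children can agree: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, alignment):
    """Sum the per-column costs over an alignment given as {name: sequence}."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site_cost(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )


# Hypothetical four-taxon alignment (illustrative data only).
alignment = {"A": "AAGT", "B": "AAGC", "C": "ACGC", "D": "ACTC"}
candidate_trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(t, alignment) for name, t in candidate_trees.items()}
best = min(scores, key=scores.get)   # tree needing the fewest changes overall
```

With four taxa there are only three unrooted topologies, so scoring all of them is trivial. The exhaustive comparison is what becomes expensive as the number of sequences grows.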

Example:

DNAPARS Example

Distance-based Methods
• The distance between a pair of sequences is calculated from the fraction of non-identical amino acids between the two sequences.
• A number of distance models are used. For example:
  – Dayhoff PAM substitution matrices
  – Jones-Taylor-Thornton matrix
  – Kimura formula (based on the % of amino acids differing between two sequences)
  – Amino acid categories
• An n x n distance matrix is calculated over all pairwise combinations, where n is the number of taxa (species).
• Distance matrices can be used as input to different algorithms that calculate an optimal evolutionary tree.
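As a rough illustration of the first bullet, the sketch below builds an n x n matrix from the plain fraction of differing positions. It applies none of the substitution models listed above. Skipping positions where either sequence has a gap is an assumption made here, and the three short sequences are invented.

```python
def p_distance(seq1, seq2):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    # Assumption: columns with a gap ('-') in either sequence are ignored.
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(alignment):
    """Return (names, n x n matrix) for an alignment given as {name: sequence}."""
    names = list(alignment)
    matrix = [[p_distance(alignment[r], alignment[c]) for c in names] for r in names]
    return names, matrix


# Hypothetical aligned protein fragments (illustrative data only).
alignment = {
    "taxon1": "MKVLAAGIVG",
    "taxon2": "MKVLSAGIVG",
    "taxon3": "MRVLSAG-VA",
}
names, matrix = distance_matrix(alignment)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:.3f}" for value in row))
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle carries information. That is why the slides show distance tables as triangles.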

Calculating a distance matrix: Feng-Doolittle Method
• Calculate pairwise scores with the dynamic programming (DP) method, then convert the raw scores into pairwise distances using the following formulas:
  $D = -\ln S_{\mathrm{eff}}$
  $S_{\mathrm{eff}} = (S_{\mathrm{obs}} - S_{\mathrm{rand}}) / (S_{\max} - S_{\mathrm{rand}})$
• where
  – $S_{\mathrm{eff}}$ is the effective score
  – $S_{\mathrm{obs}}$ is the similarity score (DP score) between a pair
  – $S_{\max}$ is the maximum score, taken as the average of the scores from aligning each sequence to itself
  – $S_{\mathrm{rand}}$ is the background noise, obtained by aligning two random sequences of equal length and composition
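The conversion from scores to a distance is a two-line calculation. The sketch below implements only the two formulas on this slide; the function name and the example scores are hypothetical, and computing the three input scores (DP alignment, self-alignment, shuffled alignment) is left out.

```python
import math


def feng_doolittle_distance(s_obs, s_max, s_rand):
    """Convert alignment scores to a distance: D = -ln S_eff."""
    if s_max <= s_rand:
        raise ValueError("s_max must exceed s_rand")
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # Observed score is no better than background; the log is undefined.
        raise ValueError("s_obs must exceed s_rand")
    return -math.log(s_eff)


# Hypothetical scores: S_eff = (300 - 100) / (500 - 100) = 0.5, so D = ln 2.
d = feng_doolittle_distance(s_obs=300.0, s_max=500.0, s_rand=100.0)
print(f"D = {d:.3f}")
```

When two sequences are identical, S_obs equals S_max, so S_eff = 1 and D = 0. As S_obs falls toward the random-background score, D grows without bound.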

Distance Matrix generated by Protdist
Taxa: HUMAN, MOUSE, DROME, SOLTU, WHEAT, ARATH, NEUCR, YEAST

Distance method continued …
• Methods used:
  – FITCH: Based on the Fitch-Margoliash method; no molecular clock.
  – KITSCH: Same as FITCH, but uses the molecular clock hypothesis.
    Molecular clock hypothesis: the rate of evolution is constant over time.
  – NEIGHBOR: Based on the neighbor-joining or UPGMA methods.
    Neighbor-joining: no molecular clock; generates an unrooted tree.
    UPGMA: uses a molecular clock; generates a rooted tree.
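To make the UPGMA entry concrete, here is a minimal sketch of the clustering loop, run on the four-taxon primate distances from the Fitch-Margoliash slide. It is not the NEIGHBOR program's implementation. The function name and data layout are assumptions, and ties are broken by whichever pair is encountered first.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster taxa by UPGMA; return (newick_topology, [(cluster_label, node_height), ...])."""
    size = {name: 1 for name in names}               # number of taxa in each cluster
    dist = {frozenset((a, b)): matrix[i][j]
            for (i, a), (j, b) in combinations(enumerate(names), 2)}
    merges = []
    while len(size) > 1:
        # Join the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda pair: dist[frozenset(pair)])
        height = dist[frozenset((a, b))] / 2         # molecular clock: equal depth on both sides
        merged = f"({a},{b})"
        for other in size:
            if other not in (a, b):
                # Size-weighted average = mean over all original taxon pairs.
                dist[frozenset((merged, other))] = (
                    size[a] * dist[frozenset((a, other))]
                    + size[b] * dist[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
        merges.append((merged, height))
    return merges[-1][0] + ";", merges


names = ["Human", "Chimp", "Gorilla", "Orang"]
matrix = [
    [0, 88, 103, 160],
    [88, 0, 106, 170],
    [103, 106, 0, 166],
    [160, 170, 166, 0],
]
newick, merges = upgma(names, matrix)
print(newick)
for label, height in merges:
    print(f"{label}: node height {height:.2f}")
```

Because every join places the new node at half the cluster distance, the result is rooted and ultrametric. That is the molecular clock assumption in practice.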

Fitch-Margoliash method

               A Human   B Chimp   C Gorilla   D Orang
  A Human         0         88        103         160
  B Chimp                    0        106         170
  C Gorilla                             0         166
  D Orang                                           0

Tree building using the Fitch-Margoliash method (1967)
• For three sequences A, B, and C joined at a single node, the branch lengths are:
  $D_a = (D_{AB} + D_{AC} - D_{BC}) / 2$
  $D_b = (D_{AB} + D_{BC} - D_{AC}) / 2$
  $D_c = (D_{AC} + D_{BC} - D_{AB}) / 2$
  (Figure: three-branch tree with branches Da, Db, Dc leading to tips A, B, C.)
• Join the first 3 sequences:
  $D_a = (88 + 103 - 106) / 2 = 42.5$
  $D_b = (88 + 106 - 103) / 2 = 45.5$
  $D_c = (103 + 106 - 88) / 2 = 60.5$
  (Figure: tree with tips C, B, A; branch labels 9.0, 51.5, 42.5, 45.5.)
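The three-sequence step is plain arithmetic, and a short sketch makes it easy to check. The function name is hypothetical; the inputs are the Human, Chimp, and Gorilla distances from the matrix on this slide.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths from the central node to A, B, and C for a three-taxon tree."""
    d_a = (d_ab + d_ac - d_bc) / 2
    d_b = (d_ab + d_bc - d_ac) / 2
    d_c = (d_ac + d_bc - d_ab) / 2
    return d_a, d_b, d_c


# A = Human, B = Chimp, C = Gorilla.
da, db, dc = three_point_branches(d_ab=88, d_ac=103, d_bc=106)
print(da, db, dc)   # 42.5 45.5 60.5

# Sanity check: the branch lengths reproduce the pairwise distances.
assert da + db == 88 and da + dc == 103 and db + dc == 106
```

With exactly three taxa the fit is exact: three unknown branch lengths, three observed distances. Error only enters once a fourth sequence has to be attached.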

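Original distance matrix:

               A Human   B Chimp   C Gorilla   D Orang
  A Human         0         88        103         160
  B Chimp                    0        106         170
  C Gorilla                             0         166
  D Orang                                           0

Human and Chimp are merged into one composite taxon. Its distance to each remaining taxon is the average of the two original distances: (103 + 106) / 2 = 104.5 to Gorilla and (160 + 170) / 2 = 165 to Orang.

               A Hum/Chimp   B Gorilla   C Orang
  A Hum/Chimp       0          104.5       165
  B Gorilla                      0         166
  C Orang                                    0

• Join the 4th sequence to the current tree:
  $D_a = (104.5 + 165 - 166) / 2 = 51.75$
  $D_b = (104.5 + 166 - 165) / 2 = 52.75$
  $D_c = (165 + 166 - 104.5) / 2 = 113.25$
  (Figure: tree with tips C, B, A', A; branch labels 30.75, 82.5, 9.25, 52.75, 42.5, 45.5.)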
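The reduction step can be checked the same way. The sketch below collapses Human and Chimp into one composite by averaging and then reapplies the three-point formulas. The helper names are hypothetical, and only the arithmetic shown on the slide is reproduced, not the placement of the internal branches in the figure.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths from the central node to A, B, and C for a three-taxon tree."""
    return ((d_ab + d_ac - d_bc) / 2,
            (d_ab + d_bc - d_ac) / 2,
            (d_ac + d_bc - d_ab) / 2)


# Pairwise distances from the original four-taxon matrix.
d = {
    frozenset(("Human", "Chimp")): 88,
    frozenset(("Human", "Gorilla")): 103,
    frozenset(("Human", "Orang")): 160,
    frozenset(("Chimp", "Gorilla")): 106,
    frozenset(("Chimp", "Orang")): 170,
    frozenset(("Gorilla", "Orang")): 166,
}


def composite_distance(members, taxon):
    """Average distance from a composite of taxa to one outside taxon."""
    return sum(d[frozenset((m, taxon))] for m in members) / len(members)


pair = ("Human", "Chimp")
hc_gorilla = composite_distance(pair, "Gorilla")   # (103 + 106) / 2
hc_orang = composite_distance(pair, "Orang")       # (160 + 170) / 2

# A = Hum/Chimp composite, B = Gorilla, C = Orang.
da, db, dc = three_point_branches(hc_gorilla, hc_orang, d[frozenset(("Gorilla", "Orang"))])
print(hc_gorilla, hc_orang)   # 104.5 165.0
print(da, db, dc)             # 51.75 52.75 113.25
```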

Maximum-Likelihood Methods
• Maximum-likelihood (ML) methods are discrete methods similar to maximum parsimony (MP) methods; however, probability calculations are used to find the tree that best accounts for the variation in a set of sequences.
• Analysis is performed on all columns in the multiple alignment, and all possible trees are considered.
• Compared to MP methods, more divergent sequences can be analyzed.
• However, the main disadvantage is that these methods are computationally intensive.
  – Hence, different heuristics are used to reduce the computation time.
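One way to see why considering all possible trees is so costly is to count them. The slides do not give a formula; the sketch below uses the standard count of unrooted bifurcating trees for n labeled taxa, (2n - 5)!! = 3 x 5 x ... x (2n - 5), and the function name is made up for the example.

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating trees for n labeled taxa (n >= 3)."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):   # 3 * 5 * ... * (2n - 5)
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Four taxa give 3 trees and five give 15, but ten already give 2,027,025. Evaluating a likelihood for every topology therefore stops being practical at modest dataset sizes, which is what motivates the heuristic searches.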

Reliability of Phylogenetic Trees
• The phylogenetic information expressed by a tree resides exclusively in its internal branches.
  – Thus, the reliability of a tree can be tested by testing the reliability of each internal branch.
• The Bootstrap Method: the data are resampled (usually thousands of times) by randomly drawing vertical columns, with replacement, from the aligned sequences to produce a new alignment of the same length.
  – Each column of data may be used more than once, and some columns may not be used at all in the new alignment.
  – The new alignments are used to create ‘artificial trees.’
  – For each internal branch, compute the fraction of artificial trees containing this internal branch.
  – Internal branches supported by ≥90% of the replicates are considered statistically significant.
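The resampling and counting steps can be sketched as follows. The column resampling follows the description above. The `closest_pair` function is only a toy stand-in for a real tree-building program (it reports the least-different pair of sequences as a single "branch"), and all names and the example alignment are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for a tree builder: return the least-different pair as one branch."""
    def mismatches(pair):
        return sum(a != b for a, b in zip(alignment[pair[0]], alignment[pair[1]]))
    return {frozenset(min(combinations(sorted(alignment), 2), key=mismatches))}


def branch_support(alignment, build_branches, n_replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each branch appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(build_branches(bootstrap_alignment(alignment, rng)))
    return {branch: n / n_replicates for branch, n in counts.items()}


# Hypothetical alignment (illustrative data only).
alignment = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "TCGAACGGAT", "D": "TCGAACGGCT"}
support = branch_support(alignment, closest_pair, n_replicates=200)
for branch, fraction in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(branch), f"{fraction:.2f}")
```

In a real analysis, `build_branches` would run the chosen tree method on each replicate and return every internal branch of the resulting tree, so that each branch of the original tree gets its own support value.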