Lecture 3: A Brief Overview of Molecular Phylogeny - MCD Biology

Lecture 3: A Brief Overview of Molecular Phylogeny 

There are many books and approximately a zillion Websites out there on phylogeny. 

For an authoritative discussion of all things phylogenetic and if you want to do phylogenetics professionally, 

get P. Lemey et al., “The Phylogenetic Handbook: A Practical Approach to Phylogenetic Analysis and Hypothesis

Testing,” Cambridge, 723 pp., 2009.

For a step-by-step walk-through of sequence-based phylogenetic analysis, mostly using the popular program 

package PAUP, you may want: Hall, B.G., “Phylogenetic Trees Made Easy,” Sinauer, 2007, 3rd edition.

For a great intro to bioinformatics in general: Claverie, J.-M. and Notredame, C., “Bioinformatics for 

Dummies,” 2nd Ed., 2006.

Study the Vignette “The Protein Language” 

Some time before Midterm Exam I you should complete the Molecular Phylogeny Workshop, available on the 

Class Website. There are two parts to the Workshop, one to introduce you to the process of sequence alignment and

provide you with a rudimentary sequence editor for future use if you want to; and the second to guide you through a

set of phylogenetic analyses for class discussion and for your edification in that process. As discussed in the

Workshop write-up, you should save a hard copy of the trees you cast and answer the numbered questions, and

turn these in at the time of the Midterm Exam. They will be worth 20 points on the midterm exam. 

1. The steps in a molecular phylogenetic analysis: 

• Determine (or obtain) sequences [more later on this] 

• “Align” sequences - identify homologous nt (or aa) 

• Perform tree calculations 

• Test tree(s) 

2. The alignment process: the goal is to identify homologous nt in a collection of sequences: 

position n+: 1 2 3 4 5 6 7 8 9 10 

seq A ........ A A A C U U G U U U ......... 

seq B ........ A C A C U U G U G U ......... 

seq C ........ A G A U U U - U C U ......... 

A. Columns of nt constitute a specific hypothesis - those nt are homologs and changes reflect evolution.

B. Note that seqs from different organisms may not be the same length - variation may be at the ends or internal. 

Hence use of an “alignment gap,” or “indel” in seq. C above to bring that seq into continuity of alignment. 

1. That is to say, “homologous” molecules are not necessarily “homologous” over their entire lengths, indeed 

length-variation of homologs in different critters is common. A good example: the bacterial SSU (“16S”) rRNA 

is typically ~1500nt, whereas the animal SSU (“18S”) rRNA is ~2000nt: ~500nt of the animal SSU rRNA has no

counterpart in the bacterial (or archaeal or microbial eucaryote) rRNA.

2. There are automated protocols for sequence alignment (e.g. “clustal”), but they seldom do a perfect job 

unless seqs are highly similar. Generally, manual polishing is necessary, particularly if length variation is 

significant. Misalignment (or inclusion of non-homologous sequences) degrades the calculation – throws 

in random sequences. 

3. In practice, don’t start at end of sequence and work forward: identify regions of clear homology and work 

out into regions of less clarity. 

4. Note structural implications of the alignment process: homologous residues are expected to occur with the 

same spatial constraints - that is, be in the same place/structure in the corresponding molecules in different

organisms. You can predict structure from homology! (E.g. the practice of modeling tertiary structure by 

“threading” sequence onto a homolog with known structure.) 

3. In the case of structured RNAs (or proteins), complementarities can often be used to establish the register 

of the alignment. 

For instance, how would you align these homologous blocks of sequence? 

position n+: 1 2 3 4 5 6 7 8 9 10 11 12 13 

seq A ........ A A A C U U G U U U ................. 

seq B ........ A C A C U U G U G U ................. 

seq C ........ A G A U U U U C U ...................... 

seq D ........ G G C C U U C G G G A C C 

Not at all obvious, from sequence alone, but if you know something about the secondary structure of the 

RNA (paired regions), and look for paired elements:

alignment becomes “straightforward”:

position n+: 1 2 3 4 5 6 7 8 9 10 11 12 13 

seq A ........ - A A A C U U G U U - U - . 

seq B ........ - A C A C U U G U G - U - . 

seq C ........ - A G A U U U - U C - U - . 

seq D ........ G G C C U U C G G G A C C 

note convention: 

- = alignment gap; no nt 

• = nt present, but unknown

4. In the case of rRNAs, the high degree of conservation makes the secondary structures important alignment 

tools – (see secondary structures of rRNAs-available on Class Website). 

a. Paired regions in the SSU rRNAs are not conjecture based on the occurrence of complements, but are 

“proven” by covariations in the sequence set that maintain the complementarity. 

3. Calculation of trees: there are many ways to do this. As discussed in the Workshop, popular methods include 

trees based on: 

• Evolutionary distance 

• Maximum parsimony 

• Maximum likelihood 

A. In each case, a computer-search is used to find “the tree” most consistent with the data set. 

B. All the methods rely heavily on statistical analysis. One consideration in these calculations or any “discrete 

number” assessments is “Poisson statistics.” (Biologists need to know about Poisson statistics.)

Vignette: Poisson Statistics 

1. Poisson Statistics describes distributions of discrete and small numbers. It is often useful in biology and in 

thinking about the set-up and interpretation of experiments. You probably will see a problem similar to these on 

Midterm I. 

2. P = probability of event: Q = probability of non-event. 

P+Q = 1 

A. Remember also that probabilities of independent events multiply:

P(x,y) = P(x)•P(y) 

3. For small (discrete) numbers: 

$P(x) = \dfrac{m^{x}}{x!}\,e^{-m}$

Where: x = number of events 

m = mean number of events 

P(x) = probability of x events 
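
A minimal Python sketch of the expression above, handy for checking hand calculations. The function name `poisson_pmf` is made up for illustration, and m = 1 is just an example value.

```python
import math


def poisson_pmf(x: int, m: float) -> float:
    """Probability of exactly x events when the mean number of events is m."""
    if x < 0:
        raise ValueError("x must be a non-negative integer")
    return (m ** x) * math.exp(-m) / math.factorial(x)


# Example: mean of one event per target (m = 1).
m = 1.0
for x in range(4):
    print(f"P({x}) = {poisson_pmf(x, m):.3f}")

# P + Q = 1: the probabilities over all x add up to 1 (checked here up to x = 49).
print(f"sum = {sum(poisson_pmf(x, m) for x in range(50)):.6f}")
```

Because 0! = 1 and m^0 = 1, the x = 0 case reduces to e^(-m), which is the term most of the problems below turn on.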

4. This expression is useful for all sorts of experimental considerations and problems. e.g.: 

Problem 1: You are interested in determining the "burst size" of a bacteriophage growing on E. coli. In order to do 

that, you mix suspensions of phage and E. coli as follows:

1 ml E. coli at $5\times10^{8}$/ml.

20 µl phage at $2.5\times10^{10}$/ml

- What portion of the bacteria are infected with exactly one phage? 

Two phage? Three phage? 

- What portion of the bacteria are uninfected? (I.e., what is P(0)? Remember here that, by convention,

$m^{0} = 1$ and $0! = 1$, so

$P(0) = e^{-m}$.)

Continuing with the experiment, you dilute the infected culture so that when you put 100 µl into different wells of a

multiple-well plate you are putting, on average (m), one bacterial cell per well; you will incubate to lysis and then

assay the number of phage in the individual wells to extract burst-size. In the distribution into wells:

- What fraction of the wells gets exactly one cell? 

- What fraction of the wells gets one infected cell? 

Problem 2. What would the phage/bacterium ratio have to be so that exactly 95% of the bacteria are infected? 

Problem 3. You are engaged in a genome project, and are dancing between costs and maximum coverage. You 

choose a random approach. If you sequence single-pass (one sequence run on a one-strand genome equivalent) 

on your library, what would be the coverage (fraction nucleotides sequenced single-pass)? 

- How many passes for 95% double-strand coverage (a common goal)? 

Problem 4: Make-up and solve a problem using Poisson statistics.

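Poisson Problems Solved

1. One ml Eco at $5\times10^{8}$/ml = $5\times10^{8}$ cells

20 µl phage at $2.5\times10^{10}$/ml = $5\times10^{8}$ phage

m = 1 phage per cell

a. Cells with exactly one phage?

$P(1) = [1^{1}/1!]\,e^{-1} = e^{-1} = 0.37$ (remember that $e^{-1}$ = ~0.37)

b. Cells with exactly two phage?

$P(2) = [1^{2}/2!]\,e^{-1} = e^{-1}/2 = 0.18$

c. Cells that are uninfected?

$P(0) = [1^{0}/0!]\,e^{-1} = e^{-1} = 0.37$ (remember: $n^{0} = 1$, $0! = 1$)

d. Distribute into wells: m=1

P(one cell/well) = 0.37

P(one cell/well, and that cell carrying exactly one phage) = 0.37x0.37 = ~0.14

(If a cell counts as infected when it carries one or more phage, the infected fraction is 1 - P(0) = 0.63, and

P(one cell/well, and that cell infected) = 0.37x0.63 = ~0.23.)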

2. m=? for P(infection) = 0.95

Flip the question: $P(0) = 0.05 = e^{-m}$, so $m = -\ln 0.05$ = ~3

3. Single-pass means m=1

a. P(a given nt is sequenced exactly once) = $P(1) = m\,e^{-m}$ = 0.37; the fraction sequenced at least once is $1 - P(0) = 1 - e^{-1}$ = 0.63

b. Double vs. single-strand coverage hinges only on the number of nucleotides covered.

For 95% coverage: as above, $P(0) = 0.05 = e^{-m}$: m = ~3

For 95% double strand coverage: m = twice single pass = ~6
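
The “flip the question” step used in Problems 2 and 3 can be sketched in Python as follows. The function names are made up for illustration; the only assumption is the Poisson relation P(0) = e^(-m) from the Vignette.

```python
import math


def fraction_hit(m: float) -> float:
    """Fraction of targets (cells, nucleotides) that receive at least one event."""
    return 1.0 - math.exp(-m)


def mean_needed(target_fraction: float) -> float:
    """Mean events per target (m) needed so that target_fraction get at least one."""
    if not 0.0 <= target_fraction < 1.0:
        raise ValueError("target_fraction must be in [0, 1)")
    # Solve P(0) = 1 - target_fraction = e^(-m) for m.
    return -math.log(1.0 - target_fraction)


for target in (0.50, 0.95, 0.99):
    m = mean_needed(target)
    print(f"{target:.0%} hit needs m = {m:.2f}  (check: {fraction_hit(m):.2f})")
```

The logarithm is why the last few percent are expensive: each further factor-of-ten drop in the missed fraction costs another ~2.3 units of m.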

---------------------------------------- END OF POISSON VIGNETTE ------------------------------------------------------- 

4. Evolutionary distance: 

A. This method makes a “map” based on pairwise “evolutionary distance,” the number of sequence changes 

between all pairs of sequences (organisms) in the data sets. 

B. Recall, however, that the number of differences you count between seqs is less than the number of changes 

that occurred – because of the possibility of back mutations and multiple mutations.
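
A sketch of the counting step in Python, using seqs A, B and C from the alignment example in section 2. Skipping any column where either sequence has an alignment gap is an assumption made here for simplicity (programs differ in how they treat gaps), and the function name is made up for illustration.

```python
from itertools import combinations

GAP = "-"


def fractional_difference(seq1: str, seq2: str) -> float:
    """Fraction of compared alignment columns at which two aligned seqs differ."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1, seq2):
        if a == GAP or b == GAP:
            continue  # assumption: gap columns are not counted at all
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no columns to compare")
    return differences / compared


alignment = {
    "seq A": "AAACUUGUUU",
    "seq B": "ACACUUGUGU",
    "seq C": "AGAUUU-UCU",
}

for (name1, s1), (name2, s2) in combinations(alignment.items(), 2):
    print(f"{name1} vs {name2}: D = {fractional_difference(s1, s2):.3f}")
```

These are the counted differences only; they are the starting point for the distance estimate, not the distance itself.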

C. This can be estimated (from Poisson counting statistics) for any position as “Knuc,” the average extent of 

sequence change at any position in two homologous seqs: 

$K_{nuc} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right)$

where D=fractional difference in compared seqs. 

1. For instance, between human and E. coli SSU rRNAs you count 50% difference (= on average 0.5

changes/nt). The “real” extent is calculated from the expression as:

$K_{nuc} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D\right) = -\frac{3}{4}\ln\left(1 - \frac{2}{3}\right) = -\frac{3}{4}\ln 0.33 = 0.82$

So, the “corrected” distance is more than half-again the changes you count!

2. “Evolutionary distance” (Knuc) is the calculated number of changes, not the number you count. Knuc is 

non-linear with depth in the tree, so the deeper in the tree, the greater the uncertainty:

3. Note that this assumes that all positions in a molecule change at the same rate, which they do not. Some 

methods estimate the rate of change at each position (based on the data-set), and use that rate in the above 

calculation. Other methods, the most current, use the “information content” of individual positions to “weight” 

the value of those positions in the calculation. In essence, however, unseen past changes in sequences are 

estimable, but fundamentally unknowable. 

4. These “hidden” changes make treeing between the domains a chancy business.

5. Even between the bacterial phyla: 

a. Typical bacterial phylum-level differences (counts) are ca. 25% (75% identity), so: 

D = 0.25 

Knuc = 0.3: “only” about one-sixth (0.05/0.3 = ~17%) of your calculation-basis is inferred.

b. Since the Knuc calculation is non-linear (a log function), the deeper you go in a tree, the more shaky 

your branching orders. 
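
A short Python sketch of the Knuc correction, to make the non-linearity concrete. The function name `knuc` and the particular D values in the loop are illustrative choices; the formula is the one given in C above, which is only defined for D below 0.75.

```python
import math


def knuc(d: float) -> float:
    """Estimated changes per position from the counted fractional difference d."""
    if not 0.0 <= d < 0.75:
        raise ValueError("the correction is only defined for 0 <= D < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d)


for d in (0.05, 0.25, 0.50, 0.70):
    k = knuc(d)
    inferred = (k - d) / k  # share of the distance that was never observed
    print(f"D = {d:.2f}   Knuc = {k:.3f}   inferred share = {inferred:.0%}")
```

As D approaches 0.75 (the difference expected between two random sequences of equal base composition), Knuc climbs without limit, so small errors in the count become large errors in the distance.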

D. An important concept in any tree construction is the amount of sequence that you use - more residues is 

better. The standard deviation of the distance estimated from a sequence-difference count can be estimated:

$\text{Stnd. deviation} = \dfrac{\frac{3}{4}\sqrt{D\,I/L}}{I - \frac{1}{4}}$

D=fraction differences 

I=fraction identities(1-D) 

L=number of residues 

e.g. for 50% differences and 1000 nt:

$\text{stnd. deviation} = \dfrac{\frac{3}{4}\sqrt{(0.5)(0.5)/1000}}{0.5 - 0.25} = 0.047$ changes per position

2xS.D. = ca. 0.095 changes per position, i.e. ca. 95 changes per 1000 positions counted

(Statistically, 95% of instances will fall within 2xS.D.)

But if you use only 100nt:

stnd. dev. = 0.15: 2xS.D. = 0.30 changes per position is not very good - you can’t rely on your counts to

produce reliable trees!
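
A sketch of this calculation in Python, assuming the standard-deviation expression reads (3/4)·sqrt(D·I/L) / (I - 1/4), with D, I and L as defined above. The function name is made up for illustration.

```python
import math


def distance_std_dev(d: float, length: int) -> float:
    """Standard deviation of the estimated distance for fractional difference d
    counted over `length` aligned residues."""
    identity = 1.0 - d
    if identity <= 0.25:
        raise ValueError("I must exceed 1/4 (D below 0.75)")
    if length <= 0:
        raise ValueError("length must be positive")
    return 0.75 * math.sqrt(d * identity / length) / (identity - 0.25)


for length in (100, 1000, 1500):
    sd = distance_std_dev(0.5, length)
    print(f"L = {length:5d}   S.D. = {sd:.3f}   2xS.D. = {2 * sd:.3f}")
```

The L under the square root is the whole argument for using more residues: ten times fewer positions inflates the standard deviation by sqrt(10), about 3.2-fold.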

E. Computer programs use evolutionary distances to construct a tree most consistent with all the pairwise 

distances. Since biology is seldom regular, there is no single solution to all the pairwise distances. More later 

on what to do about that. 
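
To make “a tree most consistent with all the pairwise distances” concrete, here is a minimal Python sketch of one very simple distance method, UPGMA (average-linkage clustering). This is a swapped-in illustration, not necessarily what the Workshop programs use: UPGMA assumes all lineages change at the same rate, which real data often violate. The four taxa and their distances are hypothetical.

```python
from itertools import combinations


def upgma(names: list[str], dist: dict[frozenset, float]):
    """Join the two closest clusters repeatedly; return the tree as nested tuples.

    dist maps frozenset({a, b}) -> distance for every pair of names.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves in it
    d = dict(dist)
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Distance from the new cluster to each remaining cluster is the
        # leaf-weighted average of the distances from its two parts.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


# Hypothetical pairwise distances (e.g. Knuc values) for four taxa.
taxa = ["A", "B", "C", "D"]
distances = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.5,
    frozenset(("B", "C")): 0.5,
    frozenset(("A", "D")): 0.8,
    frozenset(("B", "D")): 0.8,
    frozenset(("C", "D")): 0.8,
}

print(upgma(taxa, distances))
```

The example distances were chosen to be perfectly regular, so there is a single clean answer. With real, irregular distances no tree fits every pair exactly, which is the problem raised in E.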

5. “Parsimony” methods of tree construction presume that the evolutionary path follows the fewest number of 

changes: the “correct tree” involves the fewest changes required to construct that particular topology. 

A. Often called “ancestral sequence” methods, since inferred ancestral seqs are used to count changes. 

B. E.g. there are 3 possible topologies of the following 4 seqs:

In this “heuristic” method (hunt best tree by testing alternatives, choosing the “optimal” path in a succession of 

steps), all possible trees (for a particular “optimal” path) are examined and the “best” chosen by the least 

number of changes. 
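
A sketch of the change-counting at the heart of parsimony, in Python. It uses Fitch's counting rule, named here as an assumption about how the count can be done, since the notes do not specify one. The four short sequences are hypothetical. With only 4 seqs all 3 topologies can be scored exhaustively, so no heuristic search is needed.

```python
def fitch_changes(tree, states: dict[str, str]) -> int:
    """Minimum number of changes needed at one alignment column on a given tree.

    tree is nested 2-tuples of leaf names; states maps leaf name -> nucleotide.
    """
    def walk(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_count = walk(node[0])
        right_set, right_count = walk(node[1])
        shared = left_set & right_set
        if shared:
            # The inferred ancestor can carry a shared state: no extra change.
            return shared, left_count + right_count
        # No shared state: one change is required on this part of the tree.
        return left_set | right_set, left_count + right_count + 1

    return walk(tree)[1]


def tree_length(tree, alignment: dict[str, str]) -> int:
    """Total changes over all alignment columns for one topology."""
    n_columns = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_columns)
    )


# Hypothetical aligned sequences for four taxa.
alignment = {"A": "AGCUA", "B": "AGCUG", "C": "CUCUA", "D": "CUAUG"}

topologies = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]

for topology in topologies:
    print(topology, tree_length(topology, alignment))
```

The topology with the smallest total is the parsimony choice. Note that the last column of the example favors a different grouping than the first two, which is typical of real data.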

6. “Maximum Likelihood” methods are not intuitively interpreted; they calculate, under a model of sequence change, the probability of

the particular data set given each proposed tree (the heuristic search), and keep the tree that makes the data most probable. Statisticians consider this method to 

be the most “robust” (least sensitive to idiosyncrasies injected by a particular sequence) of any of the treeing 

methods; certainly it is the most statistically valid, since it is statistics-based.

A. Several other probabilistic methods are around, e.g. “Bayesian methods” that calculate the probability of a tree 

topology based on the data (http://mrbayes.csit.fsu.edu/). 

B. Phylogeneticists often argue about the “best” method for phylogenetic analysis, but they all work about equally 

well given the constraints of the particular method and with appropriate “corrections” for variable rates, base 

compositions, etc. The best approach, as usual in science, is all of the above. I don’t believe a tree that results 

from only one method and not others. 

7. How to validate a particular tree “topology”? 

A. Construct tree with different taxa; since the tree is dependent on which seqs you include, use of a different 

suite of seqs. will test whether associations observed in one tree are consistent in the context of different taxa. 

B. Use all methods available to test specific associations of particular seqs of interest. 

C. “Bootstrap” analysis: resolve tree many times using random resamplings of the data set. For each tree calculation,

compile a data set by drawing alignment columns from the original alignment at random, “with replacement.” Typical

bootstraps draw the same number of columns as in the alignment, so some columns are drawn more than once and

others not at all.

Question: If you do a bootstrap analysis where you (the computer) draw sequence positions at random, with 

replacement, to equal the number of nt in the alignment, what fraction of the sequence positions do you NOT sample? 

(A Poisson calculation and a great exam question.) 
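
A small Python sketch of the resampling step, which also lets you check the Poisson answer by simulation: each column is drawn on average m = 1 time, so the expected unsampled fraction is P(0) = e^(-1) = ~0.37. The function names, the seed, and the 1500-column alignment length are illustrative assumptions.

```python
import random


def bootstrap_indices(n_columns: int, rng: random.Random) -> list[int]:
    """Draw n_columns column indices at random, with replacement."""
    return [rng.randrange(n_columns) for _ in range(n_columns)]


def resample_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Build one bootstrap data set; every sequence gets the same drawn columns."""
    n_columns = len(next(iter(alignment.values())))
    picks = bootstrap_indices(n_columns, rng)
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def fraction_unsampled(n_columns: int, rng: random.Random) -> float:
    """Fraction of the original columns that one bootstrap draw never picks."""
    drawn = set(bootstrap_indices(n_columns, rng))
    return 1.0 - len(drawn) / n_columns


rng = random.Random(0)  # fixed seed so the run is repeatable
print(f"unsampled fraction, 1500 columns: {fraction_unsampled(1500, rng):.3f}")
print(resample_alignment({"seq A": "AAACUUGUUU", "seq B": "ACACUUGUGU"}, rng))
```

Drawing whole columns, rather than shuffling each sequence separately, keeps the homology hypothesis of each column intact in every resampled data set.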

1. The bootstrap analysis tests anomalies in tree calculations (any particular tree calculation is a 

mathematical anecdote) and to some extent whether particular sequence blocks in an alignment cause 

weird behavior. 

2. E.g. of “Bootstrap support” for Big Tree nodes for one tree: support values for 

nodes are marked ML/parsimony; note the different results with different methods. Each value means that a particular node is 

observed in that percentage of bootstrap trees. 

(Note that many peripheral nodes have funky bootstrap values because too many “outgroups” were included - see

“Troubles,” below.)

3. A bootstrap score >70% in a likelihood tree is pretty good; hold out for 80% with distance or 

parsimony. 

4. Note “jackknife” analysis: just like bootstrap, but you randomly throw away half the alignment 

columns for each tree solution. 

D. Check tree analysis with “signature sequences” (below). 

8. Troubles you can get into with tree analysis (beyond bad sequence and bad alignment – big problems): 

A. Rate differences in different lineages: Rapidly evolving lineages behave spuriously in trees. 

e.g. with mitochondria:

[Figure: actual (rate-corrected) tree of Agrobacterium, Desulfovibrio, Escherichia, and mitos]

if you don’t correct for “fast clock” of mitos, they try to get away from otherwise close relatives, to jump 

deeper into tree. You will see this in the Mol Phy workshop. 

[Figure: tree of Agro., Esch., Desulfo., and mitos, not rate-corrected - mitos go deep]

1. Such rate-effects can be compensated to some extent by adding more (deeper than mitos) sequences to 

trees, and by using systematic rate-correction calculations, but variable rates are a big problem in 

resolution of tree topologies. 

2. This phenomenon has been called “long branch attraction,” but it is really “short branch rejection of long 

branches.”

3. Misalignment creates long branches, probably the most common mistake for neophytes and too many pros. 

So do inaccurate sequences. Note that machine alignments (e.g. BLAST, Clustal) commonly incorporate 

and MISalign non-homologous stretches – any final alignment process is best manually guided if the 

sequence representation is broad in diversity. With rRNA seqs the alignment is helped LOTS by structural 

elements (e.g. helices, conserved sequence blocks) that serve as landmarks. 

(For LOTS of info on alignment issues: Korf et al., “BLAST,” O’Reilly, 2003, 339 pp.)

B. Base-composition biases. 

1. Variation in genomic G+C composition is reflected in all genes, and can make seqs behave spuriously as a 

function of the organism compositions of trees. To test for base comp problems, do “transversion analysis:” 

Count only changes that are “transversions.” 

Transversion = R↔Y 

Transition = R↔R (A↔G) or Y↔Y (T↔C) 

R = purine 

Y = pyrimidine 

Hence, you don’t see G+C vs. A+T difference. (and, you lose a lot of data, about half)
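
A Python sketch of the transversion count, using the R/Y definitions above. The function names are made up for illustration; treating T and U as the same base and skipping gap columns are assumptions made here.

```python
PURINES = {"A", "G"}       # R
PYRIMIDINES = {"C", "U"}   # Y (T is converted to U below)
GAP = "-"


def change_type(a: str, b: str) -> str:
    """Classify two aligned nucleotides: 'none', 'transition' or 'transversion'."""
    a, b = (nt.upper().replace("T", "U") for nt in (a, b))
    for nt in (a, b):
        if nt not in PURINES | PYRIMIDINES:
            raise ValueError(f"unexpected character: {nt!r}")
    if a == b:
        return "none"
    if (a in PURINES) == (b in PURINES):
        return "transition"    # R<->R or Y<->Y
    return "transversion"      # R<->Y


def transversion_difference(seq1: str, seq2: str) -> float:
    """Fraction of compared columns that differ by a transversion."""
    compared = 0
    transversions = 0
    for a, b in zip(seq1, seq2):
        if a == GAP or b == GAP:
            continue  # assumption: gap columns are not counted
        compared += 1
        if change_type(a, b) == "transversion":
            transversions += 1
    return transversions / compared


print(transversion_difference("AAACUUGUUU", "ACACUUGUGU"))  # seq A vs seq B
print(transversion_difference("AAACUUGUUU", "AGAUUU-UCU"))  # seq A vs seq C
```

In the second comparison all three counted differences are transitions, so the transversion distance is zero: a small illustration of how much data the method discards.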

2. What causes variation in genome/rRNA base composition (genomes are more radically different in base 

comp than rRNA seqs)? Answer not known. 

C. Taxon selection: 

1. Taxa included in a tree can have BIG influence on topology of any particular association because of random 

similarities/ dissimilarities. To test any association or branching order, solve tree with different suites of 

taxa. 

2. The broadest possible representation of diversity in the in-group is necessary for the most accurate 

topologies of associations. 

3. But, in general, minimize the number of “outgroup” sequences used to “root” an in-group cluster of 

interest. 

9. “Signature Analysis” - use of simple features to test a tree or assign seqs to “clades” (relatedness groups)

A. This is just like use of morphological or other qualities to identify taxa: the properties are e.g.: 

1. Occurrence of specific nt or sequences at particular positions. 

2. Occurrence (or lack) of structural elements (e.g. helices) at particular positions. Note however that 

LACK of a property is not a specific property. It is only useful in comparison to presence of the 

property. 

3. Or any other diagnostic feature. 

B. Woese used oligonucleotide signatures to first define the clades Eucarya/Archaea/Bacteria. 

1. “Oligonucleotide” = a short seq, usually

position in   All       All        All          position in   Archaea   Bacteria   Eucarya
sequence*     Archaea   Bacteria   Eucarya      sequence

113           C         G          C            962           G         C          U
314           G         C          G            966           U         G          U
338           G         A          A            973           C         G          G
339           G         C          C            1016          G         A          A
358           G         U          G            1060          C         U          C
377           C         G          C            1087          C         G          U
386           G         C          G            1098          G         C          G
399           C         G          -            1110          G         A          G
403           A         C          -            1197          G         A          G
507           G         C          G            1211          G         U          U
585           C         G          U            1212          A         U          A
675           U         A          U            1229          G         A          G
716           C         A          C            1381          C         U          C
756           G         C          A            1393          C         U          U
923           G         A          A            1415          C         G          C
952           C         U          C            1485          G         U          G

*Eco numbering - position in Eco SSU rRNA sequence in the alignment; not necessarily the same in all rRNAs. 
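
A Python sketch of how such a signature table can be used to assign a sequence to a clade. Only four positions are copied from the table (chosen because all three domains differ there). The “unknown” sequence is hypothetical, and reading off its nt at Eco-numbered positions presumes it has already been aligned to the E. coli sequence.

```python
DOMAINS = ("Archaea", "Bacteria", "Eucarya")

# Position (Eco numbering) -> signature nt in (Archaea, Bacteria, Eucarya),
# copied from the table above.
SIGNATURES = {
    585: ("C", "G", "U"),
    756: ("G", "C", "A"),
    962: ("G", "C", "U"),
    1087: ("C", "G", "U"),
}


def signature_matches(observed: dict[int, str]) -> dict[str, int]:
    """Count, for each domain, how many signature positions the sequence matches."""
    counts = {domain: 0 for domain in DOMAINS}
    for position, expected in SIGNATURES.items():
        nt = observed.get(position)
        if nt is None:
            continue  # position not determined in this sequence
        for domain, expected_nt in zip(DOMAINS, expected):
            if nt == expected_nt:
                counts[domain] += 1
    return counts


# Hypothetical unknown: nt found at each signature position after alignment.
unknown = {585: "G", 756: "C", 962: "C", 1087: "G"}
print(signature_matches(unknown))
```

A sequence that matches one domain at every position is an easy call; mixed scores are a flag to recheck the alignment or the sequence before believing the tree.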

10. Discussion of Workshop results. 
