Refined Buneman Trees

The refined Buneman algorithm, which is the main focus of this work, belongs to this subclass of evolutionary tree methods. The hope is that the refined Buneman method will be less safe than its namesake and infer more splits, while still maintaining a high degree of confidence in the splits it infers.

4.2 Parsimony methods

The foundation for this class of methods is the philosophy of William of Ockham: Pluralitas non est ponenda sine necessitate, meaning something along the lines of "the best hypothesis is the one that requires the smallest number of assumptions" (or even shorter: keep it simple!). This philosophy is also known as Ockham's Razor or the parsimony principle. In our context we use it as an optimality criterion: if two proposed evolutionary processes have the same starting and ending points, we assume that the simpler or shorter process is the correct one. For example, we could imagine two substitutions of the same nucleotide that cancel each other out, A → G → A; in this case we would say that no substitutions took place at all.

The parsimony method works by considering one specific site at a time across a set of nucleotide sequences. For each site, we postulate all possible binary tree topologies linking the sequences at that site. For each topology we search for an assignment of nucleotides to the inner nodes such that the total number of substitutions is minimal, and we keep the topology or topologies with the smallest count for further study. Since the optimal topologies need not be the same for all sites, we sum the substitution counts over all sites for each topology and choose the topology with the smallest total.

Figure 4.1 shows an example of estimating the number of substitutions for a topology. Firstly, we have a tree spanning one specific site across 5 nucleotide sequences. Secondly, we fill in candidate nucleotides for the ancestral taxa by considering the minimum number of substitutions required, working bottom up from the sites we are given. In this case, looking at the subtree of C and T in the lower left corner, we know that their ancestor site must have been either C or T, requiring only one substitution, since the other candidates (A or G) would require two substitutions. Thirdly, we present one solution out of many for how the nucleotides might have been assigned to the ancestral sites. In this case we can make do with only two substitutions, but the solution is not unique: several assignments require only two substitutions.

The intuition for this method is quite easy to understand, and according to [NK00] and others, under favorable conditions the method is expected to produce the correct tree. However, under less favorable conditions the method is known to produce incorrect topologies, and in any case the method is hopelessly inefficient for large data sets, at least when using exhaustive search techniques. [NK00] describes how the search might be sped up by using e.g. branch and bound.
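The bottom-up counting described for the parsimony method can be sketched in a few lines of Python. This is a minimal sketch rather than code from the thesis: it uses the classical Fitch set rule (intersect the children's candidate sets, or take their union and count one substitution when they are disjoint), and the function names, the taxon names s1 to s5 and the tree shape are all assumptions made for illustration. The five leaf nucleotides C, T, A, T, T are chosen to match the leaves mentioned in the text, but the topology is not claimed to be the one drawn in Figure 4.1.

```python
def fitch_site(tree, column):
    """Return (candidate_states, substitutions) for one alignment column.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    column -- dict mapping each leaf name to its nucleotide at this site
    """
    if isinstance(tree, str):
        # A leaf: its state is given, and no substitutions are needed below it.
        return {column[tree]}, 0
    left_states, left_cost = fitch_site(tree[0], column)
    right_states, right_cost = fitch_site(tree[1], column)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state: no extra substitution here.
        return shared, left_cost + right_cost
    # The children cannot agree: keep all candidates, count one substitution.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the minimal substitution counts over all sites for one topology."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in sequences.items()})[1]
        for i in range(length)
    )


# Hypothetical topology over five sequences (an assumption, see above).
tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
column = {"s1": "C", "s2": "T", "s3": "A", "s4": "T", "s5": "T"}
states, cost = fitch_site(tree, column)
print(sorted(states), cost)
```

For this tree the subtree of C and T yields the candidate set {C, T} at a cost of one substitution, and the whole site needs two substitutions, in line with the count given in the text. Comparing topologies then amounts to calling `parsimony_score` once per candidate tree and keeping the smallest total, which is exactly the step that becomes infeasible under exhaustive search.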
Figure 4.1: An illustration of the parsimony tree method.

4.3 Maximum likelihood methods

Using maximum likelihood methods is quite simple in theory. We start out with e.g. a set of n nucleotide sequences of length m, aligned such that only substitutions occur, not insertions or deletions. We then postulate some evolutionary tree topology over the n sequences, giving us a rooted binary tree with n − 1 inner nodes. For each inner node we assume there is some ancestor sequence for the sequences in the subtree of that node, but we do not know which nucleotides are in this ancestor sequence.

We assume some substitution model, and there are many to choose from, ranging from simple to very complicated. Jukes-Cantor ([JC69]), Kimura ([Kim80]) and Hasegawa-Kishino-Yano ([HKY85]) are well-known substitution models, and there is a long hierarchy of increasingly complex models using more and more biologically founded assumptions.

Now, to find the likelihood of a single nucleotide site in the n leaf sequences, we multiply the probabilities of the substitutions along the branches of the tree for one assignment of nucleotides to the site in the n − 1 inner nodes, and sum these products over all possible assignments. To find the likelihood of the entire sequences we then combine the m site likelihoods: assuming the sites evolve independently, the site likelihoods are multiplied, or equivalently the log-likelihood is a sum over m terms. This expression would have to be evaluated for all possible tree topologies, and of course this results in a very time-consuming algorithm. But since the substitution model is actually based in biology, we have high confidence that the tree with maximum likelihood is the real tree.

One way of making this method useful in practice is to search the tree topologies in some clever way, so that the search can skip large parts of the search space, which of course limits the accuracy of the method. The ML method is also useful for evaluating trees found by other tree reconstruction methods, since we assume the method captures a lot of biological meaning, depending on the model.
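The likelihood of one fixed topology can be sketched as follows. This is an illustrative sketch under several assumptions that are not taken from the thesis: the Jukes-Cantor model with branch lengths measured in expected substitutions per site, a uniform distribution over the nucleotide at the root, independent sites, and a nested-tuple tree encoding in which an inner node is a pair of (child, branch length) entries. Instead of enumerating every assignment of nucleotides to the inner nodes, it uses Felsenstein's pruning recursion, which computes the same sum of products one subtree at a time. All names and the branch lengths in the example are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def jc69(a, b, t):
    """Jukes-Cantor probability of observing b after time t, starting from a."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial(tree, column):
    """Likelihood of the leaves below `tree`, for each state at its top node.

    tree   -- a leaf name (str) or a pair ((child, length), (child, length))
    column -- dict mapping each leaf name to its nucleotide at this site
    """
    if isinstance(tree, str):
        return {x: 1.0 if x == column[tree] else 0.0 for x in NUCLEOTIDES}
    result = {x: 1.0 for x in NUCLEOTIDES}
    for child, length in tree:
        below = partial(child, column)
        for x in NUCLEOTIDES:
            # Sum over the child's state, weighted by the substitution probability.
            result[x] *= sum(jc69(x, y, length) * below[y] for y in NUCLEOTIDES)
    return result


def site_likelihood(tree, column):
    """Likelihood of one site, with a uniform prior on the root nucleotide."""
    return sum(0.25 * p for p in partial(tree, column).values())


def log_likelihood(tree, sequences):
    """Log-likelihood of the alignment: a sum over the m sites."""
    length = len(next(iter(sequences.values())))
    return sum(
        math.log(site_likelihood(tree, {n: s[i] for n, s in sequences.items()}))
        for i in range(length)
    )


# Hypothetical three-taxon tree; the branch lengths are illustrative only.
inner = (("s1", 0.1), ("s2", 0.1))
tree = ((inner, 0.05), ("s3", 0.2))
sequences = {"s1": "ACGT", "s2": "ACGA", "s3": "ACTT"}
print(log_likelihood(tree, sequences))
```

A maximum likelihood search would evaluate `log_likelihood` for each candidate topology, and in practice also optimise the branch lengths, then keep the tree with the largest value. The same function can score a tree produced by another reconstruction method, as suggested at the end of the section.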