Refined Buneman Trees

Chapter 7 The Single Linkage Clustering Tree

Single linkage clustering trees play an important part in this work as a replacement for anchored Buneman trees. The refined Buneman tree algorithm described in this thesis is based on Lemma 5, where anchored Buneman trees are merged with refined Buneman trees to create new refined Buneman trees. However, the first part of the algorithm only requires us to build overapproximations of the refined Buneman tree, so we can use single linkage clustering trees instead of anchored Buneman trees, since single linkage clustering trees contain anchored Buneman trees.

7.1 Replacing the anchored Buneman tree

The anchored Buneman tree can be computed in time $O(n^2)$, according to Lemma 3, by first constructing a single linkage clustering tree that is a superset of $B_x(\delta)$, and then pruning that tree ([BB99], section 3). But since we are creating an overapproximation of the refined Buneman tree by incrementally considering splits from anchored Buneman trees, we can replace the anchored Buneman trees altogether with the unpruned single linkage clustering trees. The extra splits may or may not become part of our overapproximation, but they are weeded out later in the algorithm, and they do not asymptotically change the size of the overapproximation: the tree cannot grow beyond $O(n)$ splits. Conveniently, the single linkage clustering tree is also very simple to compute.

7.2 Calculating the single linkage clustering tree

Pseudo code for the algorithm, adapted from [BG91], section 3.2.7, is listed in
Algorithm 5. Offhand, the algorithm looks like it would run in time $O(n^3)$. We iterate through a loop, reducing the size of the distance matrix by one on each iteration (adding one entry and removing two), until the distance matrix is spent. Inside the loop, in line 3, we need to find the minimum entry in an $n \times n$ matrix, normally an $O(n^2)$ operation. However, if we overlay our distance matrix with a quad tree search structure, we can find the minimum entry in the matrix in constant time, while spending only linear time updating the search structure after we update the distance matrix in lines 5 and 6. The quad tree data structure is described in chapter 8.

Algorithm 5 The single linkage clustering tree algorithm
Require: $\delta$ is a distance measure on $X$
Ensure: $C$ is the single linkage clustering tree for $\delta$
1: $C \leftarrow$ a set of clusters, one for each element in $X$
2: while $|C| > 1$ do
3:   Choose the clusters $c_1, c_2 \in C$ that minimize the quantity $\delta(c_1, c_2)$.
4:   Create $c' = c_1 \cup c_2$.
5:   Calculate distances from $c'$ to all other clusters $c'' \in C$ by setting $\delta(c', c'') = \min\{\delta(c_1, c''), \delta(c_2, c'')\}$.
6:   Erase $c_1$ and $c_2$ from $C$, add $c'$.
7: end while

Also, instead of actually inserting and removing entire rows and columns in the distance matrix, we can reuse an existing row or column for the new cluster, and rather than deleting rows we can keep track of which clusters are still alive. Figure 7.1 illustrates this point.

7.3 Converting the distance matrix

One important note about using the single linkage clustering algorithm instead of the anchored Buneman tree algorithm concerns the distinction between a distance matrix and a similarity matrix. The anchored Buneman tree for a species $x \in X$ is computed based on a distance measure $\delta$ on $X$, while the single linkage clustering tree for $x \in X$ is computed based on an inverted similarity measure on $X$ with respect to $x$. In other words, we must distinguish between $SLCT(\delta, X)$ and $SLCT(F_x(\delta), X)$.
This means that we have to transform $\delta$ before using it to calculate the single linkage clustering tree. The transformation is described in [BB99], section 3, where it is termed a Farris transformation:

$$F_x(a, b) = \frac{1}{2}\left(\delta_{ax} + \delta_{bx} - \delta_{ab}\right), \qquad a, b \in X \setminus \{x\}.$$

The Farris transformation creates a similarity measure, but this can be converted into a distance measure by, for example, changing signs.
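To make the two steps of this chapter concrete, the following is a minimal Java sketch, not the thesis implementation. It applies the Farris transformation with respect to a species x, negates the result so that it behaves as a distance, and then runs a naive version of Algorithm 5. The class and method names (`SingleLinkageSketch`, `negatedFarris`, `cluster`) are hypothetical. The minimum search is a plain scan over the matrix, so this sketch runs in cubic time and does not use the quad tree from chapter 8. Merged clusters reuse an existing row and dead clusters are only flagged, along the lines described in section 7.2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class; names are assumptions, not from the thesis.
final class SingleLinkageSketch {

    // Farris transform F_x(a,b) = (d_ax + d_bx - d_ab) / 2, negated so that
    // the similarity acts as a distance. Row and column x are left at zero.
    static double[][] negatedFarris(double[][] delta, int x) {
        int n = delta.length;
        double[][] d = new double[n][n];
        for (int a = 0; a < n; a++) {
            for (int b = 0; b < n; b++) {
                if (a == x || b == x || a == b) continue;
                d[a][b] = -0.5 * (delta[a][x] + delta[b][x] - delta[a][b]);
            }
        }
        return d;
    }

    // Naive Algorithm 5. Returns the clusters created in line 4, in merge
    // order. The element with index skip is ignored (pass -1 to use all).
    static List<TreeSet<Integer>> cluster(double[][] dist, int skip) {
        int n = dist.length;
        double[][] d = new double[n][];
        for (int i = 0; i < n; i++) d[i] = dist[i].clone(); // keep input intact
        List<TreeSet<Integer>> members = new ArrayList<>();
        boolean[] alive = new boolean[n];
        int live = 0;
        for (int i = 0; i < n; i++) {
            TreeSet<Integer> single = new TreeSet<>();
            single.add(i);
            members.add(single);
            alive[i] = (i != skip);
            if (alive[i]) live++;
        }
        List<TreeSet<Integer>> merges = new ArrayList<>();
        while (live > 1) {
            // Line 3: scan all live pairs for the minimum distance.
            int bi = -1, bj = -1;
            for (int i = 0; i < n; i++) {
                if (!alive[i]) continue;
                for (int j = i + 1; j < n; j++) {
                    if (!alive[j]) continue;
                    if (bi < 0 || d[i][j] < d[bi][bj]) { bi = i; bj = j; }
                }
            }
            // Line 5: the merged cluster reuses row bi, taking the minimum.
            for (int k = 0; k < n; k++) {
                if (!alive[k] || k == bi || k == bj) continue;
                double m = Math.min(d[bi][k], d[bj][k]);
                d[bi][k] = m;
                d[k][bi] = m;
            }
            // Lines 4 and 6: form the union, flag bj as dead.
            members.get(bi).addAll(members.get(bj));
            alive[bj] = false;
            live--;
            merges.add(new TreeSet<>(members.get(bi)));
        }
        return merges;
    }

    public static void main(String[] args) {
        // Four species on a tree with split {0,1}|{2,3} and unit edge lengths.
        double[][] delta = {
            {0, 2, 3, 3},
            {2, 0, 3, 3},
            {3, 3, 0, 2},
            {3, 3, 2, 0}
        };
        System.out.println(cluster(negatedFarris(delta, 0), 0));
    }
}
```

With the example matrix in `main` and x = 0, the first cluster formed is {2, 3}, which matches the only non-trivial split {0, 1} | {2, 3} of the underlying tree. Swapping the linear scan for a structure with constant-time minimum lookup is what brings the running time down to the bound discussed in section 7.2.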
Page 1 and 2: Refined Buneman Trees Lasse Westh-N
