Refined Buneman Trees

$C_{k-1}$ is an overapproximation of $RB(\delta|_{X_{k-1}})$, and $C_k$ is an overapproximation of $B_{x_k}(\delta|_{X_k})$. Now we need to merge these two sets of splits. We test each extended split $\sigma$ against the splits in $C_k$: if $\sigma$ is already in $C_k$, we ignore it. If $\sigma$ is incompatible with some split $\sigma' \in C_k$ (line 6), we use the DISCARD-RIGHT algorithm on the two splits to decide whether $\sigma'$ has non-positive refined Buneman index (see chapter 9 for more details). If we decide that $\sigma'$ is not a refined Buneman split, we delete it from $C_k$ (line 8) and check again whether $\sigma$ is incompatible with some other split $\sigma' \in C_k$ (line 9). Finally, if we have decided that $\sigma$ qualifies, we insert it into our compatible set of splits (lines 11–12).

The time complexity analysis for this first part of the algorithm goes like this: bootstrapping a tree of size 4 in line 1 of the algorithm takes constant time. In line 2 we then iterate over at most $n$ species. In line 3 we build a single linkage clustering tree in time $O(n^2)$. In line 4 we iterate over all splits in an unrooted tree, of which there are $O(n)$. Each of the algorithms INCOMPATIBLE, INSERT, DELETE and DISCARD-RIGHT takes time $O(n)$. For the while loop in line 7 we can only find $O(n)$ splits in an unrooted tree that can possibly be incompatible with $\sigma$, so the first part of the algorithm runs in time $O(n^3)$.

The space usage consists of holding two compatible sets of splits at any one time, plus the space required to calculate the single linkage clustering trees. We can store any compatible set of splits in linear space using the tree data structure described in chapter 6. The single linkage clustering tree is much worse: the implementation in this paper uses a quad tree data structure, which allocates $\Theta(n^2)$ space. Of course, we already need space for the distance data, which is also $\Theta(n^2)$. So in total we use space $\Theta(n^2)$ for the first part of the algorithm.
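The incompatibility test at the heart of the merge step can be illustrated with a minimal Python sketch. It relies on the standard criterion that two splits $A|B$ and $C|D$ are compatible exactly when at least one of the four intersections $A \cap C$, $A \cap D$, $B \cap C$, $B \cap D$ is empty. The bitmask representation and the names `compatible` and `incompatible_with` are assumptions made for this illustration; this is a naive pairwise test, not the $O(n)$ INCOMPATIBLE algorithm on the tree data structure of chapter 6.

```python
def compatible(split_a, split_b, n):
    """Return True if two splits over n taxa are compatible.

    Each split is a bitmask (hypothetical representation): bit i is set
    when taxon i lies on one side of the split, unset for the other side.
    """
    full = (1 << n) - 1
    sides_a = (split_a, full & ~split_a)
    sides_b = (split_b, full & ~split_b)
    # Compatible iff at least one of the four intersections is empty.
    return any(x & y == 0 for x in sides_a for y in sides_b)


def incompatible_with(sigma, splits, n):
    """List the splits in `splits` that are incompatible with sigma."""
    return [s for s in splits if not compatible(sigma, s, n)]
```

For five taxa, the splits {0,1}|{2,3,4} and {0,1,2}|{3,4} are compatible, whereas {0,1}|{2,3,4} and {1,2}|{0,3,4} are not, since all four intersections are non-empty.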
5.2 Pruning

The second part of the algorithm extracts the refined Buneman splits from the set of compatible splits constructed in the first part of the algorithm. Denote this set $T$. In the context of (evolutionary) trees and bioinformatics, we might call this part pruning, even though we do not actually remove any edges from $T$: our tree data structure cannot handle a tree topology which is not a regular, leaf-labeled tree, so we only invalidate edges instead of changing the topology of the tree.

In this part of the algorithm, we look at all splits in $T$ by considering all edges in the tree, and calculate the refined Buneman indices for the splits they represent. Finally we report those splits which have positive refined Buneman index. We do this by first decorating all edges in $T$ with their refined Buneman index, or 0 as a special marker for invalid splits. Later we extract the refined Buneman splits from $T$ in quadratic time: a linear-time iteration over the edges and, for each edge, linear time for reporting $n$ bits in a bitvector. So our so-called pruning algorithm is in reality a searching-and-scoring algorithm, and the pseudo-code for the algorithm is given in Algorithm 4.

The algorithm is very long and complicated, but it can be broken up into four parts: lines 1–3 initialize linked lists representing “global matrix fronts” on matrices which contain diagonal quartets. Lines 4–17 populate the matrix fronts. In lines 18–28 we search the matrices, finding the minimum quartets that we need to calculate the refined Buneman indices for the edges represented by the matrices. And in lines 29–34 we report the refined Buneman splits. Before we explain the algorithm, we need to study the nature of diagonal quartets.

5.2.1 Searching for minimum diagonals

We know from previous sections that a quartet induces two diagonal quartets. And clearly, given a diagonal quartet we can identify its “parent” quartet. In the pruning part of our algorithm, we are interested in finding the $n-3$ quartets with minimum score induced by each edge in $T$. We will do this by searching for diagonal quartets instead, but we need to ensure that we never identify the same quartet twice (for example by identifying it from seeing both of its diagonal quartets). We use the convention that we only identify a quartet if we see its minimum diagonal. If we see a diagonal quartet which is not a minimum diagonal, we disregard it.

Another important property of diagonal quartets is this: if we fix $a$ and $c$ such that $a$ and $c$ lie on different sides of a fixed edge $e$, we can search for $b$ and $d$ independently to minimize the score of the diagonal quartet $ab||cd$ induced by $e$. By fixing $a$ and $c$ we can rewrite the score of a diagonal quartet as a sum of two functions $f_{a,c}$ and $g_{a,c}$, such that $f_{a,c}$ depends only on $b$ and $g_{a,c}$ depends only on $d$. Clearly, such a sum takes its minimum only when $f_{a,c}$ and $g_{a,c}$ are both minimal.

\[
\eta_{ab||cd} = \tfrac{1}{2}\left(\delta_{bc} - \delta_{ab} + \delta_{ad} - \delta_{dc}\right) = f_{a,c}(b) + g_{a,c}(d)
\]

where $f_{a,c}(b) = (\delta_{bc} - \delta_{ab})/2$ and $g_{a,c}(d) = (\delta_{ad} - \delta_{dc})/2$.
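The separability of the score can be checked with a small Python sketch: minimizing $f_{a,c}$ over the candidates for $b$ and $g_{a,c}$ over the candidates for $d$ independently gives the same minimum as a brute-force search over all pairs. The function names, the nested-list distance matrix `delta`, and the candidate lists `bs` and `ds` are assumptions made for this illustration; in particular, which taxa are admissible as $b$ and $d$ for a given edge is left to the caller.

```python
from itertools import product


def eta(delta, a, b, c, d):
    """Score of the diagonal quartet ab||cd, following the formula above."""
    return 0.5 * (delta[b][c] - delta[a][b] + delta[a][d] - delta[d][c])


def min_diagonal(delta, a, c, bs, ds):
    """Minimize eta over b in bs and d in ds by two independent searches.

    bs and ds are hypothetical candidate lists supplied by the caller.
    Runs in O(len(bs) + len(ds)) instead of O(len(bs) * len(ds)).
    """
    def f(b):
        return (delta[b][c] - delta[a][b]) / 2

    def g(d):
        return (delta[a][d] - delta[d][c]) / 2

    best_b = min(bs, key=f)
    best_d = min(ds, key=g)
    return best_b, best_d, f(best_b) + g(best_d)


def min_diagonal_brute_force(delta, a, c, bs, ds):
    """Reference: minimum score over all (b, d) pairs."""
    return min(eta(delta, a, b, c, d) for b, d in product(bs, ds))
```

The two functions should agree on the minimum score for any symmetric distance matrix, which is the property the argument above depends on.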
Not only can we find the diagonal quartet with minimum score in this way, we can also search for the “next minimum”, i.e. the diagonal quartet with minimum score when discounting the actual minimum. We can do this in a general way: say we have some diagonal quartet $ab_i||cd_j$ with score $\eta_{ab_i||cd_j}$. Imagine we have considered all diagonal quartets with scores less than $\eta_{ab_i||cd_j}$, and now we wish to consider $ab_i||cd_j$ and then find the next diagonal quartet with minimum score. The way to do this is to search for $b_{i+1}$ such that $\eta_{ab_{i+1}||cd_j} \ge \eta_{ab_i||cd_j}$ is the minimum among all choices of $b_{i+1}$, and similarly for $d_{j+1}$: $\eta_{ab_i||cd_{j+1}} \ge \eta_{ab_i||cd_j}$ must be the smallest among all choices of $d_{j+1}$. One of those will be the next minimum. Note that the indices refer to an ordering by increasing $f_{a,c}$ and $g_{a,c}$ respectively, not to an ordering as members of $X$.
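One way to picture the “next minimum” search is as a frontier moving over a matrix whose entry $(i, j)$ is $f_{a,c}(b_i) + g_{a,c}(d_j)$, with both sequences sorted in increasing order. The Python sketch below enumerates all pairs in non-decreasing score order by advancing either $i$ or $j$ from each visited entry. It uses a binary heap from the standard library as the frontier, which is a substitute chosen for brevity and not the linked-list matrix fronts of Algorithm 4; the function name and the input lists `fs` and `gs` are likewise assumptions.

```python
import heapq


def pairs_by_increasing_score(fs, gs):
    """Yield (score, i, j) in non-decreasing order of fs[i] + gs[j].

    fs and gs must already be sorted in increasing order; they stand for
    the values f(b_0) <= f(b_1) <= ... and g(d_0) <= g(d_1) <= ...
    """
    if not fs or not gs:
        return
    frontier = [(fs[0] + gs[0], 0, 0)]
    seen = {(0, 0)}
    while frontier:
        score, i, j = heapq.heappop(frontier)
        yield score, i, j
        # The next minimum after (i, j) is reached by advancing b or d.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(fs) and nj < len(gs) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (fs[ni] + gs[nj], ni, nj))
```

Because the generator is lazy, a caller that only needs the first few minima can stop early without visiting the remaining pairs.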
