Refined Buneman Trees
$C_{k-1}$ is an overapproximation of $RB(\delta|_{X_{k-1}})$, and $C_k$ is an overapproximation<br />
of $B_{x_k}(\delta|_{X_k})$.<br />
Now we need to merge these two sets of splits. We will test each extended<br />
split $\sigma$ against the splits in $C_k$: if $\sigma$ is already in $C_k$, we will ignore it. If $\sigma$<br />
is incompatible with some split $\sigma' \in C_k$ (line 6), we will use the DISCARD-RIGHT<br />
algorithm on the two splits to decide whether $\sigma'$ has non-positive refined<br />
<strong>Buneman</strong> index (see chapter 9 for more details). If we decide that $\sigma'$ is not a<br />
refined <strong>Buneman</strong> split, we will delete it from $C_k$ (line 8) and check again whether $\sigma$<br />
is incompatible with some other split $\sigma' \in C_k$ (line 9). Finally, if we have decided<br />
that $\sigma$ qualifies, we will insert it into our compatible set of splits (lines 11–12).<br />
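The merge step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation from the text: splits are stored as pairs of frozensets instead of the tree data structure, so the individual operations do not meet the $O(n)$ bounds discussed in the analysis, and `discard_right` is a hypothetical caller-supplied predicate standing in for the DISCARD-RIGHT algorithm of chapter 9. The names `make_split`, `compatible`, `find_incompatible` and `merge_split` are likewise made up for this sketch. The compatibility test uses the standard criterion that two splits of the same set are compatible exactly when some side of one is disjoint from some side of the other.

```python
from typing import Callable, FrozenSet, Optional, Set, Tuple

Split = Tuple[FrozenSet[str], FrozenSet[str]]  # the two sides of a split


def make_split(side: Set[str], taxa: Set[str]) -> Split:
    """Normalise a split so that A|B and B|A compare equal."""
    a, b = frozenset(side), frozenset(taxa - side)
    return (a, b) if sorted(a) <= sorted(b) else (b, a)


def compatible(s: Split, t: Split) -> bool:
    """Two splits are compatible iff some pair of their sides is disjoint."""
    return any(not (x & y) for x in s for y in t)


def find_incompatible(sigma: Split, splits: Set[Split]) -> Optional[Split]:
    """Return some split in `splits` that is incompatible with sigma, if any."""
    for other in splits:
        if not compatible(sigma, other):
            return other
    return None


def merge_split(
    sigma: Split,
    c_k: Set[Split],
    discard_right: Callable[[Split, Split], bool],  # hypothetical stand-in
) -> bool:
    """Try to merge sigma into c_k; return True if sigma was inserted."""
    if sigma in c_k:
        return False  # already present, nothing to do
    other = find_incompatible(sigma, c_k)
    # Delete conflicting splits for as long as the predicate rejects them.
    while other is not None and discard_right(sigma, other):
        c_k.remove(other)
        other = find_incompatible(sigma, c_k)
    if other is None:  # no conflicts remain, so sigma qualifies
        c_k.add(sigma)
        return True
    return False
```

For example, with taxa `{"a", "b", "c", "d"}`, the splits `ab|cd` and `ac|bd` are incompatible, so merging `ac|bd` into a set containing `ab|cd` either replaces `ab|cd` or leaves the set unchanged, depending on what the predicate decides.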
The time complexity analysis for this first part of the algorithm goes as<br />
follows. Bootstrapping a tree of size 4 in line 1 of the algorithm takes constant<br />
time. In line 2 we then iterate over at most $n$ species. In line 3 we build a<br />
single linkage clustering tree in time $O(n^2)$. In line 4 we iterate over all splits<br />
in an unrooted tree, of which there are $O(n)$. Each of the algorithms INCOMPATIBLE,<br />
INSERT, DELETE and DISCARD-RIGHT takes time $O(n)$. For<br />
the while loop in line 7 we can only find $O(n)$ splits in an unrooted tree that<br />
can possibly be incompatible with $\sigma$, so the first part of the algorithm runs in<br />
time $O(n^3)$.<br />
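To make the $O(n^2)$ bound for single linkage clustering concrete, here is a minimal sketch that computes the single linkage merge sequence from a dense distance matrix. It is not the implementation the text refers to: it swaps in Prim's minimum spanning tree algorithm, relying on the well-known fact that the single linkage merges are the minimum spanning tree edges in order of increasing weight. The function name `single_linkage_merges` and the output format are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def single_linkage_merges(
    dist: Sequence[Sequence[float]],
) -> List[Tuple[float, int, int]]:
    """Return (height, u, v) merges of single linkage clustering.

    Runs Prim's algorithm on the complete graph given by `dist`
    in O(n^2) time, then sorts the n - 1 tree edges by weight.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known link from the tree to each point
    link = [-1] * n            # tree endpoint realising best[j]
    in_tree[0] = True
    for j in range(1, n):
        best[j] = dist[0][j]
        link[j] = 0
    edges: List[Tuple[float, int, int]] = []
    for _ in range(n - 1):
        # Pick the point outside the tree that is closest to it: O(n).
        v = min((j for j in range(n) if not in_tree[j]), key=lambda j: best[j])
        in_tree[v] = True
        edges.append((best[v], link[v], v))
        # Relax the links of the remaining points through v: O(n).
        for j in range(n):
            if not in_tree[j] and dist[v][j] < best[j]:
                best[j] = dist[v][j]
                link[j] = v
    edges.sort()  # merges happen in order of increasing height
    return edges
```

Each of the $n - 1$ rounds does a linear scan and a linear update, which gives the quadratic bound; the final sort costs only $O(n \log n)$.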
The space usage consists of holding two compatible sets of splits at any one<br />
time and the space required to calculate the single linkage clustering trees. We<br />
can store any compatible set of splits in linear space using the tree data structure<br />
described in chapter 6. The single linkage clustering tree is much more expensive:<br />
the implementation in this paper uses a quad tree data structure, which<br />
allocates $\Theta(n^2)$ space. Of course, we already need space for the distance data,<br />
which is also $\Theta(n^2)$. So in total the first part of the algorithm uses<br />
$\Theta(n^2)$ space.<br />
5.2 Pruning<br />
The second part of the algorithm deals with the extraction of refined <strong>Buneman</strong><br />
splits from the set of compatible splits constructed in the first part of the<br />
algorithm. Denote this set $T$. In the context of (evolutionary) trees and bioinformatics,<br />
we might call this part pruning, even though we shall not actually<br />
remove any edges from T — our tree data structure cannot handle a tree topology<br />
which is not a regular, leaf-labeled tree, so we will only invalidate edges<br />
instead of changing the topology of the tree.<br />
In this part of the algorithm, we shall look at all splits in T by considering<br />
all edges in the tree, and calculate refined <strong>Buneman</strong> indices for the splits they<br />
represent. Finally we shall report those splits which have positive refined <strong>Buneman</strong><br />
index. We will do this by first decorating all edges in T with their refined<br />
<strong>Buneman</strong> index, or 0 as a special marker for invalid splits. Later we extract the<br />
refined <strong>Buneman</strong> splits from $T$ in quadratic time: a linear-time iteration over the<br />
edges and, for each edge, linear time to report $n$ bits in a bitvector. So our<br />
so-called pruning algorithm is in reality a searching-and-scoring algorithm, and<br />