22.01.2015 Views

Refined Buneman Trees

$C_{k-1}$ is an overapproximation of $RB(\delta|_{X_{k-1}})$, and $C_k$ is an overapproximation of $B_{x_k}(\delta|_{X_k})$.

Now we need to merge these two sets of splits. We test each extended split $\sigma$ against the splits in $C_k$. If $\sigma$ is already in $C_k$, we ignore it. If $\sigma$ is incompatible with some split $\sigma' \in C_k$ (line 6), we use the DISCARD-RIGHT algorithm on the two splits to decide whether $\sigma'$ has non-positive refined Buneman index (see chapter 9 for more details). If we decide that $\sigma'$ is not a refined Buneman split, we delete it from $C_k$ (line 8) and check again whether $\sigma$ is incompatible with some other split $\sigma' \in C_k$ (line 9). Finally, if we have decided that $\sigma$ qualifies, we insert it into our compatible set of splits (lines 11–12).
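
The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the implementation from this paper: splits are stored as plain Python frozensets (one side of the split, normalized so that equal splits compare equal) rather than in the tree data structure of chapter 6, `discard_right` is a hypothetical callback standing in for the DISCARD-RIGHT algorithm of chapter 9, and the sketch assumes that $\sigma$ is dropped whenever DISCARD-RIGHT does not rule out the conflicting split. Because it scans a set, it does not achieve the $O(n)$ bounds quoted for INCOMPATIBLE, INSERT and DELETE.

```python
from typing import Callable, FrozenSet, Iterable, Set

Split = FrozenSet[str]  # one side of the split; the other side is taxa - side


def incompatible(a: Split, b: Split, taxa: FrozenSet[str]) -> bool:
    """Two splits A|A' and B|B' are compatible iff at least one of the four
    intersections A&B, A&B', A'&B, A'&B' is empty."""
    a2, b2 = taxa - a, taxa - b
    return bool(a & b) and bool(a & b2) and bool(a2 & b) and bool(a2 & b2)


def merge(
    extended: Iterable[Split],
    c_k: Set[Split],
    taxa: FrozenSet[str],
    discard_right: Callable[[Split, Split], bool],  # hypothetical stand-in
) -> Set[Split]:
    """Merge the extended splits into the compatible set c_k (modified in place)."""
    for sigma in extended:
        if sigma in c_k:
            continue  # already present: ignore it
        qualifies = True
        while qualifies:
            # Look for any split currently in c_k that conflicts with sigma.
            clash = next((s for s in c_k if incompatible(sigma, s, taxa)), None)
            if clash is None:
                break  # sigma is compatible with everything left in c_k
            if discard_right(sigma, clash):
                c_k.remove(clash)  # clash was ruled out: delete it and re-check
            else:
                qualifies = False  # assumption: sigma is dropped in this case
        if qualifies:
            c_k.add(sigma)
    return c_k
```

With taxa `a` to `e`, merging the split `{b, c}` into `{{a, b}, {c, d}}` removes both existing splits if the callback always rules them out, and leaves the set unchanged if it never does.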

The time complexity analysis for the first part of the algorithm goes as follows. Bootstrapping a tree of size 4 in line 1 of the algorithm takes constant time. In line 2 we then iterate over at most $n$ species. In line 3 we build a single linkage clustering tree in time $O(n^2)$. In line 4 we iterate over all splits in an unrooted tree, of which there are $O(n)$. Each of the algorithms INCOMPATIBLE, INSERT, DELETE and DISCARD-RIGHT takes time $O(n)$. For the while loop in line 7, an unrooted tree contains only $O(n)$ splits that can possibly be incompatible with $\sigma$, so the first part of the algorithm runs in time $O(n^3)$.
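
To illustrate the $O(n^2)$ bound for single linkage clustering, here is a minimal sketch of one well-known way to reach it: Prim's algorithm on a dense distance matrix, whose minimum spanning tree edges, taken in ascending order of weight, give the single linkage merge order. This is a swapped-in technique for illustration only; it is not the quad tree based implementation used in this paper, and the function name `single_linkage_merges` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def single_linkage_merges(
    dist: Sequence[Sequence[float]],
) -> List[Tuple[float, int, int]]:
    """Return the n-1 minimum spanning tree edges (weight, i, j) of a dense
    symmetric distance matrix, sorted by ascending weight.

    Prim's algorithm on a dense matrix performs n rounds of O(n) work each.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each node
    parent = [-1] * n       # tree node realizing that distance
    best[0] = 0.0
    edges: List[Tuple[float, int, int]] = []
    for _ in range(n):
        # Pick the closest node not yet in the tree: O(n).
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        # Relax distances through u: O(n).
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    edges.sort()  # O(n log n), dominated by the O(n^2) loop above
    return edges
```

Processing the returned edges in order with a union-find structure would yield the clustering tree itself; that step is omitted here.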

The space usage consists of holding two compatible sets of splits at any one time, plus the space required to calculate the single linkage clustering trees. We can store any compatible set of splits in linear space using the tree data structure described in chapter 6. The single linkage clustering tree is much more expensive: the implementation in this paper uses a quad tree data structure, which allocates $\Theta(n^2)$ space. Of course, we already need space for the distance data, which is also $\Theta(n^2)$. So in total the first part of the algorithm uses $\Theta(n^2)$ space.

5.2 Pruning<br />

The second part of the algorithm deals with the extraction of refined Buneman splits from the set of compatible splits constructed in the first part of the algorithm. Denote this set $T$. In the context of (evolutionary) trees and bioinformatics, we might call this part pruning, even though we shall not actually remove any edges from $T$: our tree data structure cannot handle a tree topology which is not a regular, leaf-labeled tree, so we will only invalidate edges instead of changing the topology of the tree.

In this part of the algorithm, we shall look at all splits in $T$ by considering all edges in the tree, and calculate refined Buneman indices for the splits they represent. Finally we shall report those splits which have positive refined Buneman index. We will do this by first decorating all edges in $T$ with their refined Buneman index, or 0 as a special marker for invalid splits. Later we extract the refined Buneman splits from $T$ in quadratic time: linear time iteration over edges and, for each edge, linear time for reporting $n$ bits in a bitvector. So our so-called pruning algorithm is in reality a searching-and-scoring algorithm, and
