
Refined Buneman Trees

Lasse Westh-Nielsen

May 2004


Abstract

The field of bioinformatics needs efficient methods for analysing data. New and improved technologies increase the efficiency of e.g. DNA sequencing, and the amount of data coming out of biological research laboratories grows at an increasing rate. It is therefore important to be able to handle these huge amounts of data with limited computational power.

Evolutionary tree reconstruction is a fundamental research problem in biology, with a wide range of applications. The perfect phylogeny problem is NP-hard, and large datasets need to be analysed quickly and accurately. Current methods are very one-sided in their trade-off when tackling this problem, being either slow but precise, or fast but inaccurate.

The refined Buneman method is a relatively new and largely untested tree reconstruction method. Earlier algorithms computing the refined Buneman tree would run in time O(n^5) and space O(n^4), making the method infeasible for large datasets, but with the advent of the algorithm described in [BFÖ+03], the method has complexity bounds of O(n^3) and O(n^2) for time and space, respectively. Suddenly the method becomes useful in practice and is directly comparable to the popular Neighbor-Joining method, which has precisely the same complexity bounds.

This thesis describes the first implementation of the cubic-time algorithm computing the refined Buneman tree. We verify the complexity bounds and perform an initial comparative study of the refined Buneman algorithm and the Neighbor-Joining method. We describe how the implementation can easily be integrated into the widely used JSplits software package, thus making the method instantly and easily available. Finally, we consider the biological accuracy of the refined Buneman method.


This thesis is dedicated to my family
Estrid, René and Morten
and to my dear
Khay Ling


Acknowledgements

I would like to thank the following people for their help during the course of writing this thesis:

Poh Khay Ling, for your love, support and infinite patience with me as I have been dragging this process out for far too long.

Christian Nørgaard Storm Pedersen, for inspiration, guidance, advice and supervision during the course of writing this thesis.

Thomas Mailund, for invaluable feedback and corrections regarding both the formalities and the content of this thesis.

Wouter Boomsma, René Thomsen & Jakob Vesterstrøm, for help on Emacs, LaTeX and Linux, and for good office companionship.

Mikkel Heide Schierup, for giving me a job and an office at BiRC, without which this thesis would probably never have existed...


Contents

1 Introduction 7
  1.1 Document structure 8

I Overview of Evolutionary Trees 9

2 Definitions 10
  2.1 Species 10
  2.2 Evolutionary tree 10
  2.3 Distance measure 13
  2.4 Quartets 14
  2.5 Diagonal quartets 15
  2.6 Splits 16
  2.7 Buneman tree 18
  2.8 Anchored Buneman tree 18
  2.9 Refined Buneman tree 19

3 Evolution and Bioinformatics 21
  3.1 The Tree of Life and the language of DNA 21
  3.2 Bioinformatics 23

4 A Catalogue of Tree Reconstruction Methods 25
  4.1 Distance methods 26
    4.1.1 UPGMA 26
    4.1.2 Neighbor-Joining 27
    4.1.3 Quartet based methods 28
  4.2 Parsimony methods 29
  4.3 Maximum likelihood methods 30
  4.4 Hybrid methods 31
  4.5 Accuracy of inferred trees 31

II Implementing Refined Buneman Trees 32

5 Implementation Structure 33
  5.1 Overapproximating 33
  5.2 Pruning 35
    5.2.1 Searching for minimum diagonals 36
    5.2.2 Lazy matrix construction 38
    5.2.3 Pruning explained 40

6 The Tree Data Structure 42
  6.1 Design of the tree data structure 42
    6.1.1 The tree data structure explained 43
  6.2 Operations on the tree data structure 46
    6.2.1 Insert 46
    6.2.2 Delete 48
    6.2.3 Incompatible 49
  6.3 Testing the tree data structure 50

7 The Single Linkage Clustering Tree 51
  7.1 Replacing the anchored Buneman tree 51
  7.2 Calculating the single linkage clustering tree 51
  7.3 Converting the distance matrix 52

8 The Quad Tree Data Structure 55
  8.1 Complexity analysis 56
  8.2 Performance of the quad tree data structure 57

9 The Discard-Right Algorithm 61

10 The Selection Algorithm 64
  10.1 Selection with side effects 64
  10.2 Performance of the selection algorithm 66

11 JSplits 68
  11.1 What is JSplits 68
  11.2 Using JSplits 69
  11.3 Extending JSplits 69
  11.4 The Distances2Splits interface 70

12 Source Code 72

III Tests and Experiments 73

13 The Reference Implementation 74
  13.1 A simple refined Buneman tree algorithm 74
  13.2 Implementation highlights 75
  13.3 Correctness of the reference implementation 75
  13.4 Performance of the reference implementation 76

14 Correctness 78
  14.1 Test strategy 78
  14.2 Test setup 79
  14.3 Test results 80

15 Complexity 81
  15.1 Running time 81
  15.2 Space requirements 83
    15.2.1 The Linux ps command 84
    15.2.2 JVM garbage collector log 85
    15.2.3 Test results 86

16 Comparing Evolutionary Tree Methods 88
  16.1 Test setup 89
  16.2 Buneman and refined Buneman 89
  16.3 Refined Buneman and Neighbor-Joining 89
  16.4 Summary of experimental results 94

17 Conclusion 97
  17.1 Future work 97

IV Appendices 99

A Correctness of the Reference Implementation 100

B Garbage Collector Log 104


Chapter 1

Introduction

Although much remains obscure, and will long remain obscure, I can entertain no doubt, after the most deliberate study and dispassionate judgement of which I am capable, that the view which most naturalists entertain, and which I formerly entertained — namely, that each species has been independently created — is erroneous. I am fully convinced that species are not immutable; but that those belonging to what are called the same genera are lineal descendants of some other and generally extinct species, in the same manner as the acknowledged varieties of any one species are the descendants of that species.

Charles Darwin: "The Origin of Species"

Charles Darwin published his theory of evolution in 1859, roughly 150 years ago at the time of writing. He was opposed by the majority of his age: most naturalists believed that species were immutable productions that had been separately created. 150 years later, the theory is widely accepted and has given rise to much scientific research and new philosophy. Controversy still lurks, particularly in some school systems where creationism and evolutionism compete for the role of the "theory of life". Luckily, the battle heavily favours the latter, and religious sensibilities are being forced to subside.

All in all, the theory of evolution has gone from victory to victory as many scientific achievements have turned out to support it. The discovery of the structure of DNA by Crick & Watson in 1953 solved a big part of the puzzle, showing how genetic information is stored, copied and propagated, thus creating evolutionary history through inheritance. The appearance of modern-day computers also contributes significantly to the success of evolutionary biology. As huge amounts of data are being sequenced in labs all over the world, computer programs implementing mathematical models of evolution are giving us new information on a daily basis, information that gives us an increasingly greater understanding of nature and life itself.


The theory of evolution has also become a cornerstone of philosophy and modern thinking, to a point where it is almost religious. The phrase survival of the fittest is a guideline for businesses in the capitalist economy, and playing God through genetic experiments is one of the most popular scenarios in modern science fiction.

In this work we shall look at a specific method for inferring evolutionary history from a set of species: the refined Buneman tree algorithm.

1.1 Document structure

This work is organized into three major parts.

Part 1: Overview of Evolutionary Trees
In the first part we look at evolutionary trees, their applications and some of the most important tree reconstruction methods and method classes. We introduce evolutionary principles and mechanisms, and establish how these may be modelled mathematically.

Part 2: Implementing Refined Buneman Trees
In this part we concern ourselves with the implementation of the refined Buneman tree algorithm described in [BFÖ+03]. This algorithm is the first that runs in time O(n^3) and space O(n^2); the previous best algorithm would use time O(n^5) and space O(n^4). This implementation is the first of its kind and the first that is practical to run on larger scale data sets. We will also see how the implementation is integrated into the widely used JSplits tree visualizing tool.

Part 3: Tests and Experiments
In this part we test the implementation of the refined Buneman tree algorithm to see if it performs as specified, and we perform experiments that demonstrate the applicability of the method, relating it to the well-known Neighbor-Joining method.


Part I

Overview of Evolutionary Trees


Chapter 2

Definitions

This chapter is concerned with defining the terms used throughout the text, and their relations. The definitions are largely copied from [BFÖ+03], but with some additional comments.

2.1 Species

Species — a set of animals or plants in which the members have similar characteristics to each other and can breed with each other.

From the Cambridge Advanced Learner's Dictionary

In this text we shall use the term species more loosely than the definition above suggests. The term is taken from a bioinformatics context, but it should be understood to mean more than just plants or animals. In his original paper [Bun71], Peter Buneman was concerned with both biology and the filiation of manuscripts. But we are basically interested in anything which is related by some evolutionary relationship or phylogeny. We can indeed compare apples and oranges; however, it makes no sense to compare species of fish to editions of Hamlet. Thus, for our purposes we shall define a set of species to mean something along the lines of a set of objects in which the members have a measurable phylogeny.

In the following we will need to refer to the first k species from a set of species X = {x_1, ..., x_n}, assuming some arbitrary ordering, so we define, for integers k ∈ {1, ..., n}, X_k = {x_1, ..., x_k}.

2.2 Evolutionary tree

Definition 1 of an evolutionary tree is taken from [SS03]; it is a bit more elaborate than the equivalent definition in [BFÖ+03]. The idea of the evolutionary tree is, in the words of Semple & Steel ([SS03]), to provide a standard graphical representation of evolutionary relationships — but of course, we would like to do more than just look at the trees.

Definition 1 (Evolutionary tree). An evolutionary tree T is an ordered pair (T; φ), where T is a tree with vertex set V and φ : X → V with the property that, for each v ∈ V of degree at most two, v ∈ φ(X). An evolutionary tree is also called a semi-labeled tree (on X).

The main thing to notice about Definition 1 is the rule that all leaves must correspond to species, but not all species have to correspond to leaves. While working on this thesis, the author experienced some confusion until Definition 2 filled in a gap in the author's understanding of evolutionary and phylogenetic trees and their relation to graph-theoretical trees. One might be tempted to interpret evolutionary trees as unrooted trees where leaves represent species, but the small detail that evolutionary trees do not need to be fully resolved is quite important.

Definition 2 (Phylogenetic tree). A phylogenetic tree T is an evolutionary tree (T; φ) with the property that φ is a bijection from X into the set of leaves of T.

In graph-theoretical terms, a phylogenetic tree is a leaf-labeled tree, while an evolutionary tree is a semi-labeled tree. An illustration of this distinction is given in Figure 2.1. Here we have an evolutionary tree with three "abnormal" regions A, B and C, where the tree is not fully resolved. If we look at regions A and B, we see that a species appears to be the ancestor of other species. It is of course possible that we have such a dataset, but a more likely explanation would be that the underlying evolutionary data simply does not tell us how to resolve these species — which is precisely the situation in region C. We might argue that since we cannot distinguish between these species, they must be the same species. But it is also very likely that the underlying evolutionary data is inaccurate.

The important thing to stress is that our tree reconstruction method might output a tree that is not fully resolved for whatever reason, and additional analysis might be needed to find the answers we are looking for. An easy way of obtaining a leaf-labeled tree is to simply add extra edges to those labeled nodes which have degree two or more. This is illustrated in Figure 2.2. Of course, by doing so we are losing information about the original tree, but this can be remedied by marking the extra edges. The reason for performing this transformation is that leaf-labeled trees are easier to work with than semi-labeled trees, for the average computer scientist.

Two graph-theoretical results are handy when reasoning about trees and tree search spaces: a semi-labeled unrooted tree with at most n leaves has at most n − 2 inner nodes, for a maximum of 2n − 2 nodes in the whole tree. And there are at most 2n − 3 edges.

Figure 2.1: An evolutionary tree for 14 species. (Unresolved regions A, B and C are marked.)

Figure 2.2: A phylogenetic tree resembling the evolutionary tree in Figure 2.1, with extra edges marked.


Regarding the size of the search space for evolutionary trees on n species, it is a bit unclear. The following result from [SS03] gives the number of binary phylogenetic trees with n leaves:

$$\frac{(2n-4)!}{(n-2)!\,2^{n-2}}$$

However, there must be more evolutionary trees with n labels, since these are required neither to be binary nor to have n leaves. Another way of looking at the size of the search space is to span it using set partitions. If we look at a set X of n species, we might say that a partition of this set corresponds to an edge in an evolutionary tree for those species. There are O(2^n) such partitions (or splits, as they are also called). Since a tree is basically a set of partitions, the number of trees that can be constructed from O(2^n) partitions must be upper bounded by the size of P(2^n), which is O(2^(2^n)). However, this upper bound is unrealistically high, since we know that any tree with at most n leaves has at most 2n − 3 edges, corresponding to 2n − 3 partitions of the set of species. And out of these, only a fraction will correspond to a tree, namely those where the set partitions are not contradictory.

The author is a bit unsure which estimate is the best, but in any case the search space is gigantic indeed! And clearly, when we want to find an evolutionary tree for a set of species, it is completely infeasible to search through this space of tree topologies from end to end.
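As a sanity check on the formula above, a short routine can tabulate the count for small n. It uses the equivalent double-factorial product 1 · 3 · 5 · · · (2n − 5): each new leaf can be attached to any of the 2k − 5 edges of a binary tree on k − 1 leaves. The class is an illustrative sketch in Java (the implementation language used in this thesis), not part of the thesis implementation.

```java
// Counts binary phylogenetic trees on n leaves: (2n-4)! / ((n-2)! * 2^(n-2)).
// Illustrative sketch, not part of the thesis implementation.
public class BinaryTreeCount {
    public static long count(int n) {
        if (n < 3) return 1;
        long trees = 1;
        // A binary tree on k-1 leaves has 2(k-1) - 3 = 2k - 5 edges, and leaf
        // number k can be attached to any of them, giving the product form.
        for (int k = 3; k <= n; k++) {
            trees *= 2 * k - 5;
        }
        return trees;
    }

    public static void main(String[] args) {
        for (int n = 4; n <= 8; n++) {
            System.out.println(n + " leaves: " + count(n) + " binary trees");
        }
    }
}
```

For n = 4, 5, 6 this yields 3, 15 and 105 trees, matching the closed formula; already at n = 20 the count exceeds 10^20, illustrating why exhaustive search is infeasible.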

2.3 Distance measure

A distance measure on a set of species X is a function δ : X² → R⁺ where δ(x, x) = 0 and δ(x, y) = δ(y, x) for all x, y ∈ X. In this text we will assume that the function δ is given by a distance matrix. Let n = |X|. Then a distance measure would look like the matrix in Equation 2.1.

$$\delta = \begin{pmatrix}
0 & \delta_{21} & \delta_{31} & \dots & \delta_{n1} \\
\delta_{21} & 0 & \delta_{32} & \dots & \delta_{n2} \\
\delta_{31} & \delta_{32} & 0 & & \vdots \\
\vdots & \vdots & & \ddots & \\
\delta_{n1} & \delta_{n2} & \dots & & 0
\end{pmatrix} \qquad (2.1)$$

Notice that the distance data does not have to have any metric or other sensible properties; e.g. there is no provision that the triangle inequality must hold. The only requirements are that the matrix must be symmetric around the diagonal, and the diagonal must be all zeros.
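The two requirements can be checked mechanically. The following Java sketch (class and method names are illustrative, not taken from the thesis implementation) validates a candidate distance matrix:

```java
// Validates the two requirements on a distance measure delta given as a
// matrix: zero diagonal and symmetry. No metric properties (e.g. the
// triangle inequality) are required. Illustrative sketch only.
public class DistanceMatrix {
    public static boolean isValid(double[][] delta) {
        int n = delta.length;
        for (int i = 0; i < n; i++) {
            if (delta[i].length != n) return false;       // must be square
            if (delta[i][i] != 0.0) return false;         // delta(x, x) = 0
            for (int j = i + 1; j < n; j++) {
                if (delta[i][j] < 0.0) return false;      // maps into R+
                if (delta[i][j] != delta[j][i]) return false; // symmetry
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] ok  = {{0, 2, 3}, {2, 0, 1}, {3, 1, 0}};
        double[][] bad = {{0, 2, 3}, {2, 0, 1}, {9, 1, 0}}; // not symmetric
        System.out.println(isValid(ok) + " " + isValid(bad)); // true false
    }
}
```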

2.4 Quartets

To every set of four species a, b, c, d ∈ X there are four ways to associate a leaf-labeled tree, as shown in Figure 2.3. The three possible binary tree resolutions, quartets, are denoted by ab|cd, ac|bd and ad|bc, indicating how the central edge of the binary tree bipartitions the four species. We say that an edge e in an evolutionary tree induces a quartet ab|cd if e bipartitions the four species in the same way as the central edge of the quartet — see Figure 2.4. A single edge in an evolutionary tree on X induces O(|X|^4) quartets: imagine the edge splits the species in two halves; then there are roughly |X|/2 choices for each of a, b, c and d, for a total of |X|^4/16 possible quartets. Quartets are symmetric in the sense that we may swap a ↔ b or c ↔ d or even ab ↔ cd and still obtain the same quartet.

Figure 2.3: The possible topologies of four species. Only three of these are quartets.

The Buneman score of a quartet q = ab|cd, where a, b, c, d ∈ X, is defined as

$$\beta_q = \tfrac{1}{2}\left(\min\{\delta_{ac} + \delta_{bd},\ \delta_{ad} + \delta_{bc}\} - (\delta_{ab} + \delta_{cd})\right)$$

Two distinct quartets q_1 and q_2 for the same four species satisfy $\beta_{q_1} + \beta_{q_2} \le 0$ ([BFÖ+03]).

Figure 2.4: The edge e induces the quartet ab|cd, as well as many others.
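The Buneman score is directly computable from the distance matrix. The following Java sketch (illustrative only, not the thesis implementation) computes β_q and demonstrates on additive distances that the true quartet scores positively while a competing quartet does not:

```java
// Buneman score of the quartet ab|cd from a distance matrix:
// beta = 1/2 * (min{d(a,c)+d(b,d), d(a,d)+d(b,c)} - (d(a,b)+d(c,d))).
// Illustrative sketch, not the thesis implementation.
public class BunemanScore {
    public static double beta(double[][] delta, int a, int b, int c, int d) {
        double cross = Math.min(delta[a][c] + delta[b][d],
                                delta[a][d] + delta[b][c]);
        return 0.5 * (cross - (delta[a][b] + delta[c][d]));
    }

    public static void main(String[] args) {
        // Additive distances from the quartet tree ab|cd with unit branches.
        double[][] delta = {
            {0, 2, 3, 3},   // a
            {2, 0, 3, 3},   // b
            {3, 3, 0, 2},   // c
            {3, 3, 2, 0},   // d
        };
        System.out.println(beta(delta, 0, 1, 2, 3)); // ab|cd: 1.0 (positive)
        System.out.println(beta(delta, 0, 2, 1, 3)); // ac|bd: -1.0
    }
}
```

Note that the two scores sum to 0, consistent with the bound β_{q1} + β_{q2} ≤ 0 quoted above.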

2.5 Diagonal quartets

Looking at the definition of a quartet and its associated Buneman score, we might observe that the quartet q = ab|cd can be viewed as two diagonal quartets, denoted ab||cd and ab||dc. This intuition is not simple, but we can make the following rewrite to illustrate the point:

$$\beta_q = \tfrac{1}{2}\left(\min\{\delta_{ac} + \delta_{bd},\ \delta_{ad} + \delta_{bc}\} - (\delta_{ab} + \delta_{cd})\right) \qquad (2.2)$$
$$= \min\left\{\tfrac{1}{2}(-\delta_{ab} + \delta_{bc} - \delta_{cd} + \delta_{da}),\ \tfrac{1}{2}(-\delta_{ab} + \delta_{bd} - \delta_{dc} + \delta_{ca})\right\} \qquad (2.3)$$
$$= \min\{\eta_{ab||cd},\ \eta_{ab||dc}\} \qquad (2.4)$$

where $\eta_{ab||cd} = \tfrac{1}{2}(\delta_{bc} - \delta_{ab} + \delta_{ad} - \delta_{cd})$ is the score of the diagonal quartet ab||cd.

Looking at Figure 2.5 we can see that the terms η_{ab||cd} and η_{ab||dc} correspond to a "tour" in one of the two diagonal quartets. In words, we go through either a → b → c → d → a or a → b → d → c → a, adding or subtracting chords as appropriate. We might also notice that diagonal quartets are symmetric on either side of their central edge, i.e. starting out with the diagonal quartet ab||cd we can swap a ↔ b and c ↔ d, or ab ↔ cd, to obtain ba||dc or cd||ab, where η_{ab||cd} = η_{ba||dc} = η_{cd||ab} and all three diagonal quartets still identify the same quartet ab|cd.

Figure 2.5: The symmetric quartet and its associated asymmetric diagonal quartets.
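The rewrite above can be verified numerically: for any symmetric distance matrix, the minimum of the two diagonal scores equals the Buneman score. A small Java sketch (illustrative only, not thesis code):

```java
// Checks that the Buneman score of ab|cd equals the minimum of the two
// diagonal quartet scores eta(ab||cd) and eta(ab||dc). Illustrative sketch.
public class DiagonalQuartets {
    // eta(ab||cd) = 1/2 * (d(b,c) - d(a,b) + d(a,d) - d(c,d)): the "tour"
    // a -> b -> c -> d -> a with alternating signs on the chords.
    public static double eta(double[][] delta, int a, int b, int c, int d) {
        return 0.5 * (delta[b][c] - delta[a][b] + delta[a][d] - delta[c][d]);
    }

    public static double beta(double[][] delta, int a, int b, int c, int d) {
        double cross = Math.min(delta[a][c] + delta[b][d],
                                delta[a][d] + delta[b][c]);
        return 0.5 * (cross - (delta[a][b] + delta[c][d]));
    }

    public static void main(String[] args) {
        // Random symmetric matrix with zero diagonal; equality holds for any
        // such matrix (up to floating-point rounding).
        java.util.Random rng = new java.util.Random(42);
        double[][] delta = new double[4][4];
        for (int i = 0; i < 4; i++)
            for (int j = i + 1; j < 4; j++)
                delta[i][j] = delta[j][i] = rng.nextDouble();
        double viaDiagonals = Math.min(eta(delta, 0, 1, 2, 3),
                                       eta(delta, 0, 1, 3, 2));
        System.out.println(Math.abs(viaDiagonals - beta(delta, 0, 1, 2, 3)) < 1e-12);
    }
}
```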

To keep an ordering on diagonal quartets we shall define the minimum diagonal of ab|cd: let a be the smallest (by index in X) of the species a, b, c, d ∈ X. ab||cd is the minimum diagonal if $\eta_{ab||cd} \le \eta_{ab||dc}$, and ab||dc otherwise.


2.6 Splits

The partition of a finite set S into two non-empty parts U and V is denoted a split σ = U|V. If |U| = 1 or |V| = 1 the split is called trivial. It is reasonable to represent a split as a bitvector or binary number, and by convention we shall say that for a bitvector A representing σ, x_i ∈ U if and only if A[i] = 0. Splits are symmetric, so if w represents the split σ, then ¬w also represents σ. The set of splits on a set X is denoted σ(X). The size of σ(X) is the number of unique splits on X, so

$$|\sigma(X)| = \frac{|\mathcal{P}(X)| - 2}{2} = \frac{2^n - 2}{2} = 2^{n-1} - 1.$$

We exclude symmetric splits and deduct the two splits where U = ∅ or V = ∅.
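The bitvector representation and the count 2^(n−1) − 1 can be illustrated in a few lines of Java (an illustrative sketch; fixing x_1 on the U side picks exactly one representative per split, since w and ¬w denote the same split):

```java
// Splits of X = {x_1, ..., x_n} as bitvectors: bit i-1 is 0 iff x_i is in U.
// Enumerates one canonical representative per split and checks the count
// 2^(n-1) - 1. Illustrative sketch, not the thesis implementation.
public class SplitCount {
    public static int numSplits(int n) {
        // |sigma(X)| = (|P(X)| - 2) / 2 = (2^n - 2) / 2 = 2^(n-1) - 1
        return (1 << (n - 1)) - 1;
    }

    public static void main(String[] args) {
        int n = 5;
        int count = 0;
        // Canonical representative: bit 0 clear (x_1 on the U side) and at
        // least one bit set (so V is non-empty); w and ~w are the same split.
        for (int w = 0; w < (1 << n); w += 2) {
            if (w != 0) count++;
        }
        System.out.println(count == numSplits(n)); // true
    }
}
```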

The set of quartets associated with a split σ = U|V on a set X is defined by q(σ) = {uu′|vv′ : u, u′ ∈ U ∧ v, v′ ∈ V}. Here u, u′ (and similarly v, v′) need not be distinct. The size of q(U|V) is in the order of O(|X|^4) — recall that an edge in a tree induces O(|X|^4) quartets, and splits are equivalent to edges in this case.

Definition 3 (Compatibility). Two splits A|B and C|D are said to be compatible if and only if one of A ∩ C, A ∩ D, B ∩ C or B ∩ D is empty.

Compatible sets of splits are the foundation for the algorithm presented in this thesis, and they are a perfect tool when dealing with evolutionary trees. A set of splits is compatible if and only if all splits in the set are pairwise compatible. And of course, any subset of a compatible set of splits is again compatible.
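With splits stored as bitvectors, Definition 3 amounts to four mask operations. A Java sketch (illustrative only, not the thesis implementation):

```java
// Split compatibility (Definition 3): A|B and C|D are compatible iff one of
// the four intersections A∩C, A∩D, B∩C, B∩D is empty. With splits stored as
// bitvectors over n species this is four mask tests. Illustrative sketch.
public class Compatibility {
    public static boolean compatible(int s, int t, int n) {
        int all = (1 << n) - 1;          // the full species set X
        int a = s, b = ~s & all;         // the two sides of the first split
        int c = t, d = ~t & all;         // the two sides of the second split
        return (a & c) == 0 || (a & d) == 0 || (b & c) == 0 || (b & d) == 0;
    }

    public static void main(String[] args) {
        int n = 5;
        int s = 0b00011;  // {x1, x2} | {x3, x4, x5}
        int t = 0b00111;  // {x1, x2, x3} | {x4, x5}: nested, hence compatible
        int u = 0b00110;  // {x2, x3} | {x1, x4, x5}: crosses s, incompatible
        System.out.println(compatible(s, t, n) + " " + compatible(s, u, n)); // true false
    }
}
```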

There is a close connection between compatible sets of splits and evolutionary trees. Any edge e in an unrooted tree T splits the set of leaves of T into two non-empty parts. Let Σ(T) denote the set of splits associated with the edges of a tree T. Then Theorem 1 (from [SS03]) gives the relation between compatible sets of splits and evolutionary trees.

Theorem 1 (Splits-Equivalence Theorem). Let Σ be a collection of splits on X. Then there is an evolutionary tree T such that Σ = Σ(T) if and only if Σ is a compatible set of splits. If T exists, it is unique up to isomorphism.

From now on we shall use the terms compatible set of splits / evolutionary tree and split / edge interchangeably. They are one and the same: Table 2.1 shows a compatible set of (weighted) splits, and Figure 2.6 shows the equivalent evolutionary tree. Recall the discussion of evolutionary trees versus phylogenetic trees; when working with a method such as the refined Buneman tree algorithm, which outputs compatible sets of splits that might or might not correspond to a fully resolved tree, it is important to be able to describe such a tree in a precise manner. When dealing with e.g. Neighbor-Joining, we can rely on the more regular phylogenetic trees, since the NJ method always resolves trees completely.

Lemma 1 is due to Dan Gusfield ([Gus91], section 1.2) and gives an important upper bound for the time required to go from compatible sets of splits to phylogenetic trees.

Lemma 1. An unrooted tree with n leaves can be constructed from its set of non-trivial splits in time O(kn), where k is the number of non-trivial splits.

Split (ABCDEFG)   Score
0011111           0.6
0011000           0.4
0000001           0.2
0000111           0.5
0100000           0.3
0111111           0.2
0000010           0.3

Table 2.1: A compatible set of weighted splits.

Figure 2.6: A tree that represents the set of splits in Table 2.1.

Of course, in Lemma 1 it is implicitly assumed that all trivial splits are present, so it does not apply directly to our situation. However, the tree data structure we use for storing a compatible set of splits (see Chapter 6) can be used to turn a set of compatible splits into a tree in time O(n^2) — inserting a maximum of n trivial and n − 2 non-trivial splits at a cost of linear time each.

A trivial split is compatible with every split of X. In a phylogenetic tree, the edges that are connected to leaves are exactly the trivial splits of X. Recall the definitions of evolutionary trees and phylogenetic trees, and let us define Σ_trivial(X) to be the trivial splits of X. It is now apparent that for every evolutionary tree T we have an associated phylogenetic tree T′ defined by the set of splits Σ(T′) = Σ(T) ∪ Σ_trivial(X).

Recall Figure 2.1. Here we see examples of unresolved parts in an evolutionary tree. This is quite feasible, and shows the usefulness of working with splits rather than normal trees. Our normal leaf-labeled tree data structures cannot easily capture this class of trees. A normal tree would correspond to a phylogenetic tree, and one would have to mark the excess branches in some way to capture the evolutionary tree underlying that representation. It is important to remember that species are not leaves. It is convenient if all species are leaves, but tree reconstruction methods (for example the refined Buneman tree method described in this thesis) do not guarantee such high resolution.

Still, trees have some very useful computational properties which we will use extensively throughout this work. Also, the graphical representation of an evolutionary tree gives an invaluable understanding of the evolutionary data it represents. One obvious scheme when using an ordinary leaf-labeled tree to represent a set of splits is to say that all trivial splits are in the tree, but that they carry a special marker if they are not members of the set of splits that the tree is supposed to represent. In order to be able to mark special edges, and also keep track of the lengths of branches in the evolutionary tree, we will define the weight of an edge such that a weight of zero is the special marker that invalidates the edge as a split, and positive weights represent branch lengths. This is very reasonable, since the sets of splits we are going to represent later on will only have positive weights — edges with non-positive weights are simply not members of the tree. The graphical representation would closely resemble the actual evolutionary tree corresponding to the set of splits, when trivial branches have no extent.

2.7 Buneman tree

Given a set of size n, the Buneman index of a split σ = U|V of X is defined as:

$$\mu_\sigma(\delta) = \min_{u,u' \in U,\ v,v' \in V} \beta_{uu'|vv'} \qquad (2.5)$$

The set of splits B(δ) = {σ : μ_σ(δ) > 0} is a compatible set of splits [Bun71]. The Buneman tree corresponding to a given dissimilarity measure δ is defined to be the weighted unrooted tree whose edges represent the splits σ ∈ B(δ) and are weighted according to μ_σ(δ).

This definition of the Buneman tree relates very well to the discussion in the previous section: a compatible set of splits is represented by a weighted, unrooted tree.
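Equation 2.5 can be evaluated by brute force over the quartets induced by the split; u, u′ (and v, v′) need not be distinct, as in the definition of q(σ). The Java sketch below is illustrative only; the actual implementation avoids this O(n^4)-per-split enumeration.

```java
// Buneman index of a split U|V (Equation 2.5): the minimum Buneman score
// over all quartets uu'|vv' with u,u' in U and v,v' in V (not necessarily
// distinct). Brute-force illustrative sketch, not the thesis implementation.
public class BunemanIndex {
    static double beta(double[][] delta, int a, int b, int c, int d) {
        double cross = Math.min(delta[a][c] + delta[b][d],
                                delta[a][d] + delta[b][c]);
        return 0.5 * (cross - (delta[a][b] + delta[c][d]));
    }

    public static double mu(double[][] delta, int[] u, int[] v) {
        double min = Double.POSITIVE_INFINITY;
        for (int a : u) for (int b : u)        // u, u' need not be distinct
            for (int c : v) for (int d : v)    // v, v' need not be distinct
                min = Math.min(min, beta(delta, a, b, c, d));
        return min;
    }

    public static void main(String[] args) {
        // Additive distances on the quartet tree ab|cd with unit branches.
        double[][] delta = {{0,2,3,3}, {2,0,3,3}, {3,3,0,2}, {3,3,2,0}};
        System.out.println(mu(delta, new int[]{0,1}, new int[]{2,3})); // 1.0: in B(delta)
        System.out.println(mu(delta, new int[]{0,2}, new int[]{1,3})); // -1.0: rejected
    }
}
```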

2.8 Anchored <strong>Buneman</strong> tree<br />

The anchored <strong>Buneman</strong> tree is a relaxation of the <strong>Buneman</strong> tree. The anchored<br />

<strong>Buneman</strong> tree fixes some species x ∈ X and only considers splits U|V where<br />

x ∈ U. One might say the species x plays the role of outgroup with respect to<br />

the rest of the species.<br />

Let σ = U|V be a split with x ∈ U, then the anchored <strong>Buneman</strong> index is<br />

defined as follows:<br />

μ^x_σ(δ) = min_{u∈U, v,v′∈V} β_{xu|vv′}<br />

The anchored <strong>Buneman</strong> tree is the tree that represents the compatible set of<br />

splits defined by B^x(δ) = {σ : μ^x_σ(δ) > 0}.<br />

A couple of lemmas are worth noting when dealing with anchored <strong>Buneman</strong><br />

trees. Firstly, David Bryant and Vincent Moulton ([BM99]) have shown the<br />

connection between anchored <strong>Buneman</strong> trees and the regular <strong>Buneman</strong> tree,<br />

given here in Lemma 2. Secondly, Lemma 3 which is due to Vincent Berry<br />



and David Bryant ([BB99], section 3) states the complexity of computing the<br />

anchored <strong>Buneman</strong> tree.<br />

Lemma 2. Let δ be a distance measure on X. Then B(δ) = ⋂_{x∈X} B^x(δ).<br />

Lemma 3. B^x(δ) can be computed in time and space O(n²).<br />

2.9 <strong>Refined</strong> <strong>Buneman</strong> tree<br />

Given a split σ for a set of size n, let m = |q(σ)| and let q_1, …, q_m be an ordering<br />

of the elements in q(σ) in non-decreasing order of their <strong>Buneman</strong> scores. Then<br />

the refined <strong>Buneman</strong> index of a split σ is defined as:<br />

μ_σ(δ) = (1/(n−3)) ∑_{i=1}^{n−3} β_{q_i}    (2.6)<br />

In other words, the refined <strong>Buneman</strong> index of a split is the average over<br />

the n − 3 lowest-scoring quartets. The choice of n − 3 is attributed to divine<br />

intervention — apparently, that was the choice that would make the proof in<br />

[MS99] work. The set of splits RB(δ) = {σ : μ_σ(δ) > 0} is a compatible set of<br />

splits ([MS99]). And thus the <strong>Refined</strong> <strong>Buneman</strong> Tree corresponding to a given<br />

dissimilarity measure δ is defined to be the weighted unrooted tree whose edges<br />

represent the splits σ ∈ RB(δ) and are weighted according to μ σ (δ).<br />
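Again, the definition translates directly into a brute-force sketch, illustrative only: taking q(σ) to be the quartets induced by the split, and assuming the standard quartet score β_{uu′|vv′} = ½(min(δ(u,v) + δ(u′,v′), δ(u,v′) + δ(u′,v)) − δ(u,u′) − δ(v,v′)), the refined index is the average of the n − 3 smallest induced quartet scores.<br />

```python
from itertools import combinations

def refined_buneman_index(d, U, V):
    # Average of the n - 3 smallest quartet scores, per Equation (2.6).
    # The quartet score beta is assumed in the standard form (lead-in above).
    scores = sorted(
        0.5 * (min(d[u][v] + d[u2][v2], d[u][v2] + d[u2][v])
               - d[u][u2] - d[v][v2])
        for u, u2 in combinations(U, 2)
        for v, v2 in combinations(V, 2))
    n = len(U) + len(V)
    return sum(scores[:n - 3]) / (n - 3)
```

For n = 4 there is a single quartet per split, and the refined index coincides with the plain <strong>Buneman</strong> index.<br />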

When constructing the refined <strong>Buneman</strong> tree as described later in this work,<br />

we shall rely heavily on Lemma 4. The lemma is due to [MS99], and it is used<br />

to maintain compatibility in a set of splits overapproximating the set of refined<br />

<strong>Buneman</strong> splits.<br />

Lemma 4. Given two incompatible splits σ and σ′,<br />

μ_σ(δ) ≤ 0 ∨ μ_σ′(δ) ≤ 0,<br />

and a split with non-positive index can be identified in time O(n).<br />

In the algorithm that computes the refined <strong>Buneman</strong> tree, we are going to<br />

construct a compatible set of splits which is an overapproximation of the set<br />

of refined <strong>Buneman</strong> splits. We shall do this for subsets of X of increasing size.<br />

The way to do this is, after bootstrapping some compatible set of splits, we<br />

shall introduce a set of candidates to go into the overapproximation. However,<br />

to maintain compatibility in the set we shall test the candidate splits against<br />

the existing splits, to find pairs that are incompatible.<br />

Once we find a pair of incompatible splits, we shall use Lemma 4 to determine<br />

which one does not belong in the set. We are not concerned with testing if<br />

one of the splits in the pair actually belongs in the set of refined <strong>Buneman</strong><br />

splits. We are only interested in throwing away candidates that are clearly<br />

incompatible. We can allow ourselves this luxury since we are only creating<br />

an overapproximation of RB(δ|X_k), not the set itself, and this saves valuable<br />




time in the algorithm. In [BFÖ+ 03], an algorithm is given which solves the<br />

problem in Lemma 4 in linear time. In this text, this algorithm is called the<br />

DISCARD-RIGHT algorithm.<br />
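For illustration, the compatibility test itself is simple: two splits A|B and C|D of the same set are compatible exactly when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. The sketch below is the naive set formulation only; the implementation in this work instead tests a candidate split against the tree data structure, as described in chapter 6.<br />

```python
def compatible(split1, split2):
    # Splits A|B and C|D of the same taxa set are compatible iff at least
    # one of the four pairwise intersections is empty.
    A, B = split1
    C, D = split2
    return any(not (S & T) for S in (A, B) for T in (C, D))
```
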

The second important lemma regarding refined <strong>Buneman</strong> trees is the foundation<br />

for the incremental algorithm presented in the article by Brodal et al.<br />

([BFÖ+ 03]), and which is presented later in this work. It is due to Bryant and<br />

Moulton ([BM99], proposition 3). It says that a split σ ∈ RB(δ|X_k) is either<br />

a member of B^{x_k}(δ|X_k) or RB(δ|X_{k−1}). If we turn it around, we can say that<br />

given the refined <strong>Buneman</strong> tree for X_k, we can calculate the refined <strong>Buneman</strong><br />

tree for X_{k+1} by looking only at splits in B^{x_{k+1}}(δ|X_{k+1}) and RB(δ|X_k) (with<br />

the discussion from the previous paragraph in mind, this would be “bootstrap<br />

set” and “candidate set”, respectively).<br />

Lemma 5. Suppose |X| > 4, and fix x ∈ X. If σ = U|V is a split in RB(δ) with<br />

x ∈ U and |U| > 2, then either U|V ∈ B^x(δ) or U−{x}|V ∈ RB(δ|X−{x}),<br />

or both.<br />



Chapter 3<br />

Evolution and<br />

Bioinformatics<br />

The theory of evolution is attributed to Charles Darwin and his work On the<br />

Origin of Species by Means of Natural Selection, or the Preservation of Favoured<br />

Races in the Struggle for Life from 1859. At the time, the theory challenged<br />

many established beliefs, particularly religious beliefs. Before Darwin the origins<br />

of life were credited to so-called “Creation Science” and other superstitions,<br />

and even today small pockets of resistance to Darwin's theories exist, notably in<br />

Alabama, USA.<br />

Evolution was hinted at before Darwin, for example by Jean-Baptiste de<br />

Lamarck, who suggested that life was governed by two principles: the principle<br />

of use and disuse (individuals lose characteristics they do not require and<br />

develop those which are useful) and the inheritance of acquired traits (individuals<br />

inherit the acquired traits of their ancestors). Even in ancient Greece, Anaximander<br />

fostered ideas similar to evolution. But not until Darwin were these<br />

theories supported by any real scientific evidence.<br />

3.1 The Tree of Life and the language of DNA<br />

After Darwin introduced his theory, biologists started working on reconstructing<br />

the evolutionary history of all organisms on earth, and expressing it in the form<br />

of a phylogenetic tree, the Tree of Life, illustrated in Figure 3.1. This work<br />

was carried out on fossils and living species, using comparative morphology and<br />

comparative physiology. These methods are however rather imprecise, and the<br />

trees thus constructed have been somewhat controversial.<br />

All of this changed when Watson and Crick discovered the ability of deoxyribonucleic<br />

acid (DNA) to encode and replicate hereditary information. Suddenly,<br />

scientists were able to read the recipe for a species in its DNA, which is basically<br />

a string over the alphabet Σ = {A, C, G, T }. It became possible to readily compare<br />

two organisms just by comparing their DNA in a precise and systematic<br />



Figure 3.1: A tree of life.<br />



fashion. Even organisms that do not exhibit common traits can be compared<br />

using this method.<br />

By reducing the complexity of DNA to a simple string over Σ = {A, C, G, T }<br />

and since the process of evolution can be simplified as operations on a string, e.g.<br />

insertion, deletion or substitution of a character, we are able to formulate the<br />

process as a mathematical model. This model can then be used to e.g. measure<br />

the evolutionary distance between two species. A very simple model would be<br />

to count the sites at which the DNA of the two species differs (after the<br />

sequences have been aligned), and to use that count as the distance. But there<br />

are of course more ingenious schemes than this. And<br />

thus once we have all pairwise distances between species in a set, we are ready<br />

to build their evolutionary tree.<br />
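The simple site-counting model just mentioned is the Hamming distance between the aligned sequences; a minimal sketch:<br />

```python
def hamming_distance(seq1, seq2):
    # Number of sites at which two pre-aligned, equal-length sequences differ.
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq1, seq2))
```

In practice this raw count is usually corrected for multiple substitutions at the same site, e.g. by one of the substitution models mentioned in chapter 4.<br />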

This is of course an extremely oversimplified explanation of molecular biology.<br />

We shall not worry about expressed regions (genes) and non-expressed regions of<br />

the DNA, the effects of selection, errors in sequenced strings, selecting comparable<br />

regions of DNA or details of evolutionary models. For our purposes it is<br />

enough to know that we can find a distance between species and use it to infer<br />

evolutionary history.<br />

Some questions arise even after we accept the theory of evolution. What<br />

was the origin of life? Given that we have so many, very different organisms, is it<br />

likely that they all descended from one common ancestor? Or is it more likely<br />

that there was more than one origin of life? Did life come from Mars, riding on a<br />

comet, as some propose? Or did life just erupt from molecules with autocatalytic<br />

properties? And does life exist elsewhere in the universe? Certainly, the theory<br />

of evolution answers a lot of questions, but it also poses new ones.<br />

3.2 Bioinformatics<br />

The field of bioinformatics is relatively new. The field is a joint venture between<br />

the sciences of mathematics, biology, statistics, computer science and other<br />

related sciences. Faced with huge and growing amounts of data from e.g. genome<br />

sequencing projects, the mission is to make sense of data and perhaps apply<br />

this new found knowledge to cure diseases, invent new technology and generally<br />

increase understanding of life and the universe.<br />

A few things need to be highlighted about bioinformatics and the problem<br />

of dealing with huge amounts of data, from a computer scientist's point of view.<br />

We have already established the gigantic size of the search space for evolutionary<br />

trees. But even smaller problems in bioinformatics, such as the alignment<br />

problem (for which the simplest instance, pairwise alignment, can be solved in<br />

time O(n²)), can cause difficulties when the size of the human genome is in the<br />

order of 3.2 billion base pairs distributed over approximately 30,000 genes with<br />

an average size of 27,000 nucleotide pairs (source: [AJL + 02]).<br />

Again stressing the informatics part of bioinformatics, we might illustrate the<br />

usefulness of a program such as BLAST. Imagine a scientist in a lab stumbling on<br />

a new protein or virus. He is able to sequence the DNA of the organism without<br />



knowing its origins, but how does he place it in the Tree of Life? He BLASTs<br />

his new sequence against sequence databases across the globe, connected via the<br />

internet, by entering his sequence on a webpage, and waits a while for hits, i.e.<br />

sequences that are similar to his own. Information about BLAST can be found<br />

here:<br />

http://www.ncbi.nlm.nih.gov/BLAST/<br />

He is now able to measure evolutionary distances of these sequences, and<br />

then use his favourite evolutionary tree reconstruction program to construct a<br />

phylogeny for the sequences. Shortly after, he is looking at the evolutionary<br />

history of his new organism - or perhaps he realizes someone found it already,<br />

he just didn't know about it.<br />



Chapter 4<br />

A Catalogue of Tree<br />

Reconstruction Methods<br />

Evolutionary tree reconstruction methods find application in e.g. protein<br />

structure/function prediction. When scientists find a new protein, they are anxious<br />

to find out which properties it has. And if it is possible to place the protein in<br />

a known protein family, it might also be possible to deduce properties derived<br />

from other family members.<br />

These methods also facilitate tracing biological material. If we are faced<br />

with two strains of HIV virus, we might wish to know if they are closely related,<br />

or if they are more likely from different families. The story in the website below<br />

shows how evolutionary history can be used as evidence in a murder trial:<br />

http://www.aegis.com/news/upi/2002/UP021005.html<br />

Generally, different evolutionary tree methods have different characteristics,<br />

including input/ output data formats. Some methods look at distance matrices,<br />

others take nucleotide sequences directly. Output might be rooted or unrooted<br />

trees. So the methods are perhaps not directly compatible, but with a bit of<br />

work, output from different methods ought to be comparable.<br />

Another factor to be considered is the origin and type of data we are working<br />

on. Some methods are sensitive to data with certain properties, particularly the<br />

methods that do not have a biological model to support them; and some data will<br />

give us unexpected results if we are not careful, e.g. an evolutionary tree based<br />

on sequences from homologous genes might not be equal to the evolutionary<br />

tree of the species from which the genes were taken, since individual genes can<br />

have evolutionary histories of their own. This is known as<br />

the problem of gene trees vs. species trees.<br />

We have established that the search space for evolutionary trees is enormous.<br />

The task of searching for an optimal tree in the search space of all semi-labeled<br />

trees is NP-hard, of course depending on the precise formulation of the optimization<br />

problem — one example is given in [SM97]. The following is a catalogue<br />

of methods for tree reconstruction. They all suffer from some disadvantage,<br />



ranging from huge time complexity to low biological precision, i.e. the kinds of<br />

tradeoffs one would expect for algorithms or heuristics attacking a problem in<br />

the class of NP-hard problems.<br />

Tree reconstruction methods can be classified into three major groups: distance<br />

methods, parsimony methods and maximum likelihood methods. The<br />

methods in the first group attack the problem of finding good evolutionary<br />

trees by sacrificing accuracy for speed, primarily by relying on biological information<br />

already extracted from biological data through e.g. sequence alignment.<br />

The other two groups encompass methods which are modelled more or less directly<br />

from evolutionary mechanisms and are thus expected to produce accurate<br />

results — but they generally suffer from poor performance. The next sections<br />

describe these groups of methods in greater detail.<br />

4.1 Distance methods<br />

Distance based methods use precomputed evolutionary distances to construct<br />

evolutionary trees. Given n taxa, all pairs of taxa are compared and<br />

an evolutionary distance is computed, corresponding to an entry in a distance<br />

matrix of size n×n. The evolutionary distance between two nucleotide sequences<br />

may be found by aligning the sequences and using the alignment score as a<br />

distance measure.<br />

The methods in this group rely on the distance data for biological meaning,<br />

and in themselves they do not make any biological assumptions — rather,<br />

they are general data mining/clustering methods. The assumption is that the<br />

distance data captures enough biological meaning that these fast methods may<br />

still produce reliable evolutionary trees. Clearly, the methods in themselves do<br />

not inspire confidence that they will capture biological meaning,<br />

but they are much faster than other methods, and studies have shown they do<br />

produce somewhat reliable results, albeit with some bias, depending on data.<br />

4.1.1 UPGMA<br />

The simplest clustering method is the unweighted pair-group method using arithmetic<br />

averages (UPGMA), introduced by Sokal &amp; Sneath ([SS73]). The<br />

method is extremely simple and can be found in Algorithm 1. Basically the algorithm<br />

joins two nodes in each turn, replacing them by their new parent in the<br />

tree that is being built. The algorithm is guaranteed to terminate after n − 1<br />

iterations since we effectively remove one node from the set of active nodes in<br />

each turn.<br />

It is not clear from Algorithm 1 that the UPGMA method will yield any<br />

biologically meaningful results. We have to assume that biological meaning<br />

has been extracted when we prepared the distance matrix, since the UPGMA<br />

method merely performs data mining. Surprisingly, under certain conditions<br />

this extremely simple method is sometimes useful, according to e.g. [NTT83]<br />

and [TN96]. But of course, the method's simplicity makes it probably the least<br />



Algorithm 1 The UPGMA algorithm<br />

Require: δ is a distance matrix on n species X<br />

Ensure: T is a rooted binary tree with leaves from X<br />

1: Assign a cluster c_i ∈ C and leaf n_i ∈ T with height 0 to each species x_i ∈ X<br />

2: while |C| > 2 do<br />

3: find clusters c_i, c_j such that d(i, j) is minimal<br />

4: define a new cluster c_k = c_i ∪ c_j, removing c_i, c_j from C<br />

5: assign distances from c_k to the remaining clusters c_l ∈ C such that<br />

d(k, l) = (d(i, l)·|c_i| + d(j, l)·|c_j|) / (|c_i| + |c_j|)<br />

6: add c_k to C<br />

7: add a new node n_k to T such that |e_ik| = |e_jk| = d(i, j)/2<br />

8: end while<br />

9: Create n_root ∈ T, and connect the last two active nodes n_i, n_j to n_root such<br />

that |e_i,root| = |e_j,root| = d(i, j)/2<br />

accurate tree reconstruction method that still captures some biological meaning.<br />

The main advantage of this method is its very low running time complexity of<br />

O(n²) on an input of n species / n² entries in a distance matrix.<br />
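For illustration, Algorithm 1 can be transcribed almost line for line into Python. The sketch below takes the distance matrix as a dict of dicts keyed by taxon name, returns only the topology as nested tuples (branch heights are omitted), and all names are our own:<br />

```python
def upgma(d, taxa):
    # A sketch of Algorithm 1: repeatedly merge the two closest clusters.
    # Assumes string taxon names; merged clusters are named by concatenation.
    clusters = {t: (t, 1) for t in taxa}          # name -> (subtree, size)
    d = {a: dict(d[a]) for a in taxa}             # work on a copy
    while len(clusters) > 1:
        # Step 3: find the closest pair of active clusters.
        i, j = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda p: d[p[0]][p[1]])
        (ti, si), (tj, sj) = clusters[i], clusters[j]
        k = i + j
        # Step 5: size-weighted average distance to every remaining cluster.
        d[k] = {m: (d[i][m] * si + d[j][m] * sj) / (si + sj)
                for m in clusters if m not in (i, j)}
        for m in d[k]:
            d[m][k] = d[k][m]
        del clusters[i], clusters[j]
        clusters[k] = ((ti, tj), si + sj)
    (tree, _), = clusters.values()
    return tree
```

Note how the update in step 5 weights each old distance by cluster size, so the new distance is the arithmetic average over all pairs of original taxa in the two clusters.<br />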

4.1.2 Neighbor-Joining<br />

The Neighbor-Joining method was introduced by Saitou & Nei in 1987 ([SN87])<br />

and has become one of the most widely used tree reconstruction methods. It<br />

is fast, with a running time of O(n³), even approaching quadratic time in the<br />

QuickJoin algorithm developed at BiRC:<br />

http://www.birc.dk/Software/QuickJoin/<br />

The Neighbor-Joining method has been tested extensively, and there is a consensus<br />

that the method produces trees that are reasonably accurate ([JWMV03])<br />

— particularly, it is more accurate than other known methods with similar running<br />

time complexity. The speed and relatively good accuracy of the method<br />

makes it an affordable way of creating a guide tree for other, more expensive<br />

tree reconstruction methods such as e.g. maximum likelihood methods. The<br />

algorithm is given in Algorithm 2.<br />

The Neighbor-Joining method is very similar to the UPGMA, following the<br />

same abstract template, but it does consider a more complex concept of neighbors<br />

when it selects which nodes to join, and thus it takes a little more time<br />

to run. Generally, there are many variants on UPGMA/ Neighbor-Joining with<br />

different scoring strategies, or adaptations that can input nucleotide sequences<br />

directly, for example.<br />



Algorithm 2 The Neighbor-Joining algorithm<br />

Require: δ is a distance matrix on n species X<br />

Ensure: T is an unrooted binary tree with leaves from X<br />

1: initialise T such that for each species x_i ∈ X there is a leaf node n_i ∈ T<br />

2: define a set of active nodes L containing the leaves of T<br />

3: while |L| > 2 do<br />

4: find nodes n_i, n_j ∈ L such that D_ij = d(i, j) − (r_i + r_j) is minimal,<br />

where r_i = (∑_{k∈L} d(i, k)) / (|L| − 2)<br />

5: remove nodes n_i, n_j from L<br />

6: create a new node n_k and set distances from n_k to remaining nodes n_m ∈ L<br />

such that d(k, m) = (d(i, m) + d(j, m) − d(i, j))/2<br />

7: add n_k to L, T and connect n_k to n_i, n_j such that |e_ik| = ½(d(i, j) + r_i − r_j)<br />

and |e_jk| = ½(d(i, j) − r_i + r_j)<br />

8: end while<br />

9: join the last two nodes n_i, n_j ∈ L such that |e_ij| = d(i, j)<br />
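A corresponding sketch of Algorithm 2, again with the distance matrix as a dict of dicts and illustrative names only. It returns the tree as an edge list, a convenient form for an unrooted tree:<br />

```python
def neighbor_joining(d, taxa):
    # A sketch of Algorithm 2. Returns the unrooted tree as a list of
    # edges (node, node, length); inner nodes are fresh ('node', i) tags.
    d = {a: dict(row) for a, row in d.items()}    # work on a copy
    L = list(taxa)
    edges = []
    next_id = 0
    while len(L) > 2:
        # Step 4: net divergence r_i, then the pair minimising D_ij.
        r = {i: sum(d[i][m] for m in L if m != i) / (len(L) - 2) for i in L}
        i, j = min(((a, b) for ai, a in enumerate(L) for b in L[ai + 1:]),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Step 7: connect both picked nodes to a fresh inner node k.
        k = ('node', next_id)
        next_id += 1
        edges.append((i, k, 0.5 * (d[i][j] + r[i] - r[j])))
        edges.append((j, k, 0.5 * (d[i][j] - r[i] + r[j])))
        # Step 6: distances from k to all remaining active nodes.
        d[k] = {m: 0.5 * (d[i][m] + d[j][m] - d[i][j])
                for m in L if m not in (i, j)}
        for m in d[k]:
            d[m][k] = d[k][m]
        L.remove(i)
        L.remove(j)
        L.append(k)
    i, j = L
    edges.append((i, j, d[i][j]))
    return edges
```

On distances that exactly fit an unrooted tree (an additive metric), Neighbor-Joining recovers that tree together with its edge lengths.<br />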

4.1.3 Quartet based methods<br />

<strong>Buneman</strong> trees ([Bun71]) are a new form 1 of distance method that relies on<br />

quartets and splits rather than just picking closest pairs. An algorithm with a<br />

running time complexity of O(n³) is available, but not given here (see [BB99],<br />

section 3). Another method which relies on quartets is the Q∗ method proposed<br />

by Berry &amp; Gascuel ([BG00]), which runs in time O(n⁴).<br />

The latter illustrates the problem for quartet based methods: given a set<br />

of quartets, these quartets do not necessarily support a tree, e.g. they might<br />

contain contradictory constraints such as quartets wanting to split species in<br />

opposite ways. In the <strong>Buneman</strong> case, a set of splits is generated from quartets<br />

which is guaranteed to be tree-consistent — in the case of Q∗, a tree-consistent<br />

set of quartets is found by weeding out quartets from a larger set of favorable ones. The<br />

intuition is that by considering quartets, we are looking at the species in a global<br />

sense, determining which species should be separated from other species and<br />

trying to construct a tree under a large number of such (possibly conflicting)<br />

constraints. This is in many ways the opposite of the intuition behind the<br />

clustering methods mentioned earlier.<br />

Another issue for both of these algorithms is that they produce only partially<br />

resolved trees. Compared to the UPGMA and NJ methods we might say that<br />

these quartet based methods only resolve “safe” branches where the clustering<br />

methods resolve trees fully, regardless of data. However, we also have to say that<br />

these methods are perhaps too safe since they might only infer a small fraction<br />

of edges in the evolutionary tree, depending on data ([BG00], [BB99]). The<br />

1 “new form” compared to the clustering methods UPGMA and Neighbor-Joining<br />



refined <strong>Buneman</strong> algorithm, which is the main focus of this work, belongs in this<br />

subclass of evolutionary tree methods. The hope is that the refined <strong>Buneman</strong><br />

method will be less safe than its namesake and infer more splits, while still<br />

maintaining a high degree of confidence in the splits it infers.<br />

4.2 Parsimony methods<br />

The foundation for this class of methods is the philosophy of William of Ockham:<br />

Pluralitas non est ponenda sine necessitate — meaning something along<br />

the lines of “the best hypothesis is the one that requires the smallest number of<br />

assumptions” 2 .<br />

This philosophy is also known as Ockham’s Razor or the parsimony principle.<br />

In our context we shall use it to create a condition of optimality, saying that if<br />

two proposed evolutionary processes have the same starting and ending points,<br />

we shall assume that the simplest or shortest process is the correct one. For<br />

example, we could imagine two substitutions of the same nucleotide working in<br />

reverse: A → G → A — in this case we would say that no substitutions took<br />

place at all.<br />

The parsimony method works by considering one specific site at a time across<br />

a set of nucleotide sequences. For each site, we postulate all possible binary tree<br />

topologies linking these sites. We now search for a combination of assignments<br />

of nucleotides to inner nodes such that the total number of substitutions is minimal.<br />

Since we might not identify the same optimal topologies for all sites, we have to<br />

sum the minimal substitution counts over all sites for each topology, and select<br />

for further study the topology (or topologies) with the smallest total.<br />

Figure 4.1 shows an example of estimating the number of substitutions for<br />

a topology. Firstly, we have a tree spanning specific sites across 5 nucleotide<br />

sequences. Secondly, we can fill in the sites for the ancestral taxa by considering<br />

the minimum number of substitutions required, bottom up from the sites we<br />

are already given. In this case, looking at a subtree of C and T in the lower<br />

left corner, we know that their ancestor site must have been either C or T ,<br />

requiring only one substitution, since all other combinations (A or G) would<br />

require two substitutions. Thirdly, we present one solution of of many as to<br />

how the assignment of nucleotides in ancestral sites might have been. In this<br />

case we could make due with only two substitutions, but this solution is not<br />

unique, there are several combinations which require only two substitutions.<br />

The intuition for this method is quite easy to understand, and according to<br />

[NK00] and others, under favorable conditions the method is expected to<br />

produce the correct tree. However, under less favorable conditions the method is<br />

known to produce incorrect topologies, and in any case the method is hopelessly<br />

inefficient for large data sets, at least when using exhaustive search techniques.<br />

[NK00] describes how the search might be sped up by using e.g. branch and<br />

bound.<br />

2 or even shorter: keep it simple!<br />



Figure 4.1: An illustration of the parsimony tree method.<br />

4.3 Maximum likelihood methods<br />

Using maximum likelihood methods is quite simple in theory. We start out with<br />

e.g. a set of n nucleotide sequences of length m — aligned such that only substitutions<br />

occur, not insertions or deletions. We then postulate some evolutionary<br />

tree topology over the n sequences, giving us a rooted binary tree with n − 1<br />

inner nodes. For each inner node we assume there is some ancestor sequence for<br />

the sequences in the subtree of that node, but we do not know which nucleotides<br />

are in this ancestor sequence.<br />

We assume some substitution model, and there are many to choose from,<br />

ranging from simple to very complicated. Jukes-Cantor ([JC69]), Kimura ([Kim80])<br />

and Hasegawa-Kishino-Yano ([HKY85]) are names of well known substitution<br />

models, and there is a long hierarchy of increasingly complex models using more<br />

and more biologically founded assumptions. Now, to find the likelihood of a single<br />

nucleotide site in the n leaf sequences, we multiply probabilities of substitutions<br />

along the branches of the tree and sum these products over all possible<br />

assignments of nucleotides to the n − 1 inner nodes of the tree. The likelihood<br />

of the entire sequences is then the product of the likelihoods of all sites, i.e. a<br />

product over m terms.<br />
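The per-site computation need not enumerate all inner-node assignments explicitly; Felsenstein's pruning recursion computes the same sum bottom-up. The sketch below uses the standard Jukes-Cantor transition probabilities (P(x→x) = 1/4 + 3/4·e^(−4t/3) and P(x→y) = 1/4 − 1/4·e^(−4t/3) for x ≠ y, with t in expected substitutions per site) and a uniform distribution at the root; the tree encoding is our own.<br />

```python
import math

NUCS = "ACGT"

def jc_prob(x, y, t):
    # Jukes-Cantor transition probability over a branch of length t
    # (t measured in expected substitutions per site).
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(tree, states):
    # Likelihood of a single site by Felsenstein's pruning recursion.
    # Leaves are strings; inner nodes are (left, right, t_left, t_right).
    def partial(node):
        # Maps each nucleotide x to P(observed leaves below | x at node).
        if isinstance(node, str):
            return {x: float(x == states[node]) for x in NUCS}
        left, right, tl, tr = node
        pl, pr = partial(left), partial(right)
        return {x: sum(jc_prob(x, y, tl) * pl[y] for y in NUCS)
                  * sum(jc_prob(x, y, tr) * pr[y] for y in NUCS)
                for x in NUCS}
    root = partial(tree)
    return sum(0.25 * root[x] for x in NUCS)   # uniform root distribution
```

Summing log site likelihoods over all m sites, and maximising over topologies and branch lengths, gives the maximum likelihood tree.<br />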

Now, this expression would have to be evaluated for all possible tree topologies,<br />

and of course this results in a very time consuming algorithm. But, since<br />

the substitution model is actually based in biology, we have a high confidence<br />

that the resulting tree with maximum likelihood is the real tree.<br />

One way of making this method useful in practice would be to search for<br />

tree topologies in some clever way, so that the search would be able to skip<br />

large parts of the search space, which would of course limit the accuracy of the<br />

method. The ML method is also useful for evaluating trees found by other tree<br />

reconstruction methods, since we assume the method captures a lot of biological<br />

meaning, depending on the model.<br />



4.4 Hybrid methods<br />

All the methods we have described until now are very one-sided in their bias,<br />

exploring only one side of the trade-off between accuracy and speed. Hybrid<br />

methods do exist, where a combination of a fast method and an accurate one<br />

might produce a practical method with sound biological meaning.<br />

One such method, called Disc Covering, is described in [HNP + 98], using a<br />

divide-and-conquer type of approach. From the abstract:<br />

(The Disc-Covering Method) DCM obtains a decomposition of the<br />

input dataset into small overlapping sets of closely related taxa,<br />

reconstructs trees on these subsets (using a “base” phylogenetic<br />

method of choice), and then combines the subtrees into one tree on<br />

the entire set of taxa. Because the subproblems analyzed by DCM<br />

are smaller, computationally expensive methods such as maximum<br />

likelihood estimation can be used without incurring too much cost.<br />

4.5 Accuracy of inferred trees<br />

It is possible to evaluate the quality or confidence of branches in the evolutionary<br />

trees we find using our tree reconstruction methods. These statistical methods<br />

are known as bootstrap tests, and they are described in [NK00], with references to<br />

other articles.<br />

The basic idea is to find some tree T using some method M, for some data<br />

set D. Let's say D consists of n aligned nucleotide sequences of length m. Now<br />

we may select m sites (columns) from D with replacement, to form a new sample<br />

dataset D′ — notice that the same site might occur several times in the new set,<br />

while some sites might not occur at all. Now we use the new sample to infer a<br />

new tree T′ by the same method M, and by comparing T and T′ we can assign a<br />

count of 1 to the branches in T which also occur in T′, and 0 to the rest. Repeating<br />

this process many times (e.g. a thousand times) yields a statistic of how often<br />

each branch occurs for different samples, and thus a reflection of how confident<br />

we can be in this particular branch.<br />
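The resampling loop can be sketched as follows. Here build_tree and splits_of are hypothetical hooks standing in for the chosen reconstruction method M and for extracting the branches (splits) of its output tree:<br />

```python
import random

def bootstrap_support(alignment, build_tree, splits_of, replicates=100, rng=None):
    # Bootstrap counts for the branches of the tree built from the full
    # alignment. `build_tree` and `splits_of` are hypothetical hooks for
    # the chosen method M and for reading off its branches as splits.
    rng = rng or random.Random()
    m = len(next(iter(alignment.values())))
    counts = {s: 0 for s in splits_of(build_tree(alignment))}
    for _ in range(replicates):
        # Resample the m alignment columns with replacement.
        cols = [rng.randrange(m) for _ in range(m)]
        sample = {t: "".join(seq[c] for c in cols)
                  for t, seq in alignment.items()}
        for s in splits_of(build_tree(sample)):
            if s in counts:
                counts[s] += 1
    return counts
```

Dividing each count by the number of replicates gives the usual bootstrap support value for a branch.<br />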



Part II<br />

Implementing <strong>Refined</strong><br />

<strong>Buneman</strong> <strong>Trees</strong><br />



Chapter 5<br />

Implementation Structure<br />

The refined <strong>Buneman</strong> tree algorithm described in [BFÖ+ 03] consists of two<br />

main parts. The first part creates an overapproximation of the refined <strong>Buneman</strong><br />

tree, i.e. a compatible set of splits Σ : RB(δ) ⊆ Σ ⊂ σ(X). The second part is<br />

concerned with searching through Σ and finding those splits which have positive<br />

refined <strong>Buneman</strong> index.<br />

5.1 Overapproximating<br />

The first part of the algorithm deals with constructing and maintaining a compatible<br />

set of splits, which is a superset of the set RB(δ). This part of the<br />

algorithm is given in pseudocode in Algorithm 3.<br />

The pseudocode deserves a few notes. First of all, we will use a tree data<br />

structure to represent compatible sets of splits, in order to achieve time and<br />

space complexity goals (see chapter 6 for more details). Secondly, compared to<br />

the pseudocode algorithm given in [BFÖ+ 03] we have for the sake of simplicity<br />

skipped anchored <strong>Buneman</strong> trees and used single linkage clustering trees instead.<br />

Berry and Bryant have shown in [BB99] that the set of splits represented by the<br />

anchored <strong>Buneman</strong> tree B xk (δ| Xk ) is a subset of the set of splits represented by<br />

the single linkage clustering tree for x k with respect to δ restricted to X k (see<br />

chapter 7 for more details).<br />

In line 1 we initialize a compatible set of splits for the first four species,<br />

namely the set of splits contained in the single linkage clustering tree for x 4 .<br />

We represent a compatible set of splits as an unrooted tree, and basically the<br />

transformation from the rooted clustering tree to our unrooted tree consists of<br />

just cutting off the root and splicing the two subtrees together. The cartoon in<br />

Figure 5.1 shows this process.<br />

In line 2–3 we iterate over the remaining species. For each new species x k we<br />

build the single linkage clustering tree for that species, C k . In line 4 we iterate<br />

over all splits σ i ∈ C k−1 and in line 5 we add the new species x k to either “side”<br />

of σ i . These three lines together form the construction described in Lemma 5,<br />



Algorithm 3 Overapproximating the refined <strong>Buneman</strong> tree<br />

Require: δ is a distance matrix of size n × n<br />

Ensure: C n ⊇ RB(δ)<br />

1: C 4 = SLCT x4 (δ 4 )<br />

2: for k = 5 to n do<br />

3: C k = SLCT xk (δ k )<br />

4: for U|V ∈ C k−1 do<br />

5: for σ ∈{U ∪{x k }|V , U|V ∪{x k }} do<br />

6: σ ′ = INCOMPATIBLE(C k ,σ)<br />

7: while σ ′ ≠ null and DISCARD-RIGHT(σ, σ ′ ) do<br />

8: DELETE(C k ,σ ′ )<br />

9: σ ′ = INCOMPATIBLE(C k ,σ)<br />

10: end while<br />

11: if σ ′ = null then<br />

12: INSERT(C k ,σ)<br />

13: end if<br />

14: end for<br />

15: end for<br />

16: end for<br />


Figure 5.1: Going from the single linkage tree for x 4 to an unrooted tree, representing<br />

the same set of splits.<br />



C k−1 is an overapproximation of RB(δ| Xk−1 ), and C k is an overapproximation<br />

of B xk (δ| Xk ).<br />
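The extension step in lines 4–5 of Algorithm 3 can be sketched in code. The sketch below assumes splits are encoded as bitvectors in which set bits mark the U-side of the split — an implementation choice for illustration only; the class and method names are not taken from [BFÖ+ 03].

```java
import java.util.BitSet;

// Sketch of the split-extension step: a split U|V over k−1 species, stored
// as a BitSet whose set bits mark the U-side, gives rise to two candidate
// splits over k species — one with the new species x_k added to the U-side,
// and one with it added to the V-side. Names and encoding are illustrative.
public class ExtendSplit {
    public static BitSet[] extend(BitSet uSide, int k) {
        BitSet withXk = (BitSet) uSide.clone();
        withXk.set(k - 1);                          // x_k joins the U-side
        BitSet withoutXk = (BitSet) uSide.clone();  // x_k joins the V-side
        return new BitSet[] {withXk, withoutXk};
    }

    public static void main(String[] args) {
        BitSet u = new BitSet();
        u.set(0); u.set(1);                  // U = {x_1, x_2} among 4 species
        BitSet[] candidates = extend(u, 5);  // extend with x_5
        System.out.println(candidates[0].get(4)); // x_5 on the U-side: true
        System.out.println(candidates[1].get(4)); // x_5 on the V-side: false
    }
}
```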

Now we need to merge these two sets of splits. We will test each extended<br />

split σ against the splits in C k : if σ is already in C k , we will ignore it. If σ<br />

is incompatible with some split σ ′ ∈ C k (line 6), we will use the DISCARD-<br />

RIGHT algorithm on the two splits to decide if σ ′ has non-positive refined<br />

<strong>Buneman</strong> index (see chapter 9 for more details). If we decide σ ′ is not a<br />

refined <strong>Buneman</strong> split, we delete it from C k (line 8) and again test whether σ<br />

is incompatible with some new split σ ′ ∈ C k (line 9). Finally, if we have decided<br />

that σ qualifies, we insert it into our compatible set of splits (lines 11–12).<br />

The time complexity analysis for this first part of the algorithm goes like<br />

this: bootstrapping a tree of size 4 in line 1 of the algorithm takes constant<br />

time. We then in line 2 iterate over at most n species. In line 3 we build a<br />

single linkage clustering tree in time O(n 2 ). In line 4 we iterate over all splits<br />

in an unrooted tree, there are O(n) of those. Each of the algorithms INCOM-<br />

PATIBLE, INSERT, DELETE and DISCARD-RIGHT take time O(n). For<br />

the while loop in line 7 we can only find O(n) splits in an unrooted tree that<br />

can possibly be incompatible with σ, so the first part of the algorithm runs in<br />

time O(n 3 ).<br />

The space usage consists of holding two compatible sets of splits at any one<br />

time and the space required to calculate the single linkage clustering trees. We<br />

can store any compatible set of splits in linear space using the tree data structure<br />

described in chapter 6. The single linkage clustering tree is worse: the<br />

implementation in this thesis uses a quad tree data structure, which will<br />

allocate Θ(n 2 ) space. Of course, we already need space for the distance data,<br />

which is also Θ(n 2 ). So in total we need to use space Θ(n 2 ) for the first part of<br />

the algorithm.<br />

5.2 Pruning<br />

The second part of the algorithm deals with the extraction of refined <strong>Buneman</strong><br />

splits from the set of compatible splits constructed in the first part of the<br />

algorithm. Denote this set T . In the context of (evolutionary) trees and bioinformatics,<br />

we might call this part pruning, even though we shall not actually<br />

remove any edges from T — our tree data structure cannot handle a tree topology<br />

which is not a regular, leaf-labeled tree, so we will only invalidate edges<br />

instead of changing the topology of the tree.<br />

In this part of the algorithm, we shall look at all splits in T by considering<br />

all edges in the tree, and calculate refined <strong>Buneman</strong> indices for the splits they<br />

represent. Finally we shall report those splits which have positive refined <strong>Buneman</strong><br />

index. We will do this by first decorating all edges in T with their refined<br />

<strong>Buneman</strong> index, or 0 as a special marker for invalid splits. Later we extract the<br />

refined <strong>Buneman</strong> splits from T in quadratic time — linear time iteration over<br />

edges, and for each edge, linear time for reporting n bits in a bitvector. So our<br />

so-called pruning algorithm is in reality a searching-and-scoring algorithm, and<br />



the pseudo-code for the algorithm is given in Algorithm 4.<br />

The algorithm is very long and complicated, but it can be broken up into<br />

four parts: lines 1–3 deal with initializing linked lists representing “global<br />

matrix fronts” on matrices which contain diagonal quartets. Lines 4–17 populate<br />

the matrix fronts. In lines 18–28 we search the matrices, finding the minimum<br />

quartets that we need to calculate the refined <strong>Buneman</strong> indices for the edges<br />

represented by the matrices. And in lines 29–34 we report refined <strong>Buneman</strong><br />

splits. Before we explain the algorithm, we need to study the nature of diagonal<br />

quartets.<br />

5.2.1 Searching for minimum diagonals<br />

We know from previous sections that a quartet induces two diagonal quartets.<br />

And clearly, given a diagonal quartet we can identify its “parent” quartet. In<br />

the pruning part of our algorithm, we are interested in finding the n − 3 quartets<br />

with minimum score induced by each edge in T . We will do this by searching<br />

for diagonal quartets instead, but we need to ensure that we never identify the<br />

same quartet twice (for example by identifying it from seeing both of its diagonal<br />

quartets).<br />

We will use a convention that says we shall only identify a quartet if we<br />

see its minimum diagonal. In case we see a diagonal quartet which is not a<br />

minimum diagonal, we shall disregard it.<br />

Another important property of diagonal quartets is this: if we fix a and c,<br />

such that a and c lie on different sides of a fixed edge e, we can search for b and<br />

d independently to minimize the score of the diagonal quartet ab||cd induced by<br />

e. By fixing a and c we can rewrite the score of a diagonal quartet into a sum<br />

of two functions f a,c and g a,c , such that f a,c only depends on b and g a,c only<br />

depends on d. Clearly, such a function takes its minimum only when f a,c and<br />

g a,c are minimal.<br />

η ab||cd = (δ bc − δ ab + δ ad − δ dc )/2 = f a,c (b) + g a,c (d)<br />

where f a,c (b) = (δ bc − δ ab )/2 and g a,c (d) = (δ ad − δ dc )/2.<br />
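The decomposition can be checked numerically. In the sketch below, the distance matrix values, the class name and the method names are illustrative; only the two formulas above are taken from the text.

```java
// Sketch of the score decomposition for diagonal quartets, assuming a
// symmetric distance matrix delta indexed by species number. The names
// (diagonalScore, f, g) and the matrix values are illustrative.
public class DiagonalScore {
    static double[][] delta = {
        {0, 3, 5, 6},
        {3, 0, 6, 5},
        {5, 6, 0, 4},
        {6, 5, 4, 0}
    };

    // eta_{ab||cd} = (delta_bc - delta_ab + delta_ad - delta_dc) / 2
    static double diagonalScore(int a, int b, int c, int d) {
        return (delta[b][c] - delta[a][b] + delta[a][d] - delta[d][c]) / 2.0;
    }

    // The two independent halves: f depends only on b, g only on d.
    static double f(int a, int c, int b) { return (delta[b][c] - delta[a][b]) / 2.0; }
    static double g(int a, int c, int d) { return (delta[a][d] - delta[d][c]) / 2.0; }

    public static void main(String[] args) {
        int a = 0, b = 1, c = 2, d = 3;
        // The decomposition: eta = f_{a,c}(b) + g_{a,c}(d)
        System.out.println(diagonalScore(a, b, c, d) == f(a, c, b) + g(a, c, d));
    }
}
```

Because the two halves are independent, minimizing b over one side of the edge and d over the other side separately minimizes the whole score.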

Not only can we find the diagonal quartet with minimum score in this way;<br />

we can also search for the “next minimum”, i.e. the diagonal quartet with<br />

minimum score when discounting the actual minimum. We can do this in a<br />

general way: say we have some diagonal quartet ab i ||cd j with score η abi ||cd j .<br />

Imagine we have considered all diagonal quartets with scores less than η abi ||cd j ,<br />

and now we wish to consider ab i ||cd j and then find the next diagonal quartet<br />

with minimum score.<br />

The way to do this is to search for b i+1 such that η abi+1 ||cd j ≥ η abi ||cd j is the<br />

minimum among all choices of b i+1 , and similarly for d j+1 : η abi ||cd j+1 ≥ η abi ||cd j<br />

must be the smallest among choices of d j+1 . One of those will be the next<br />

minimum. Note that the indices refer to an ordering of increasing f a,c and g a,c<br />

respectively, not ordering as members of X.<br />



Algorithm 4 Pruning the overapproximated refined <strong>Buneman</strong> tree<br />

Require: δ is a distance matrix of size n × n<br />

Require: T =(V,E) is an overapproximation of RB(δ)<br />

Ensure: S is a set of splits representing RB(δ)<br />

1: for e ∈ E do<br />

2: Q e = ∅<br />

3: end for<br />

4: for (a, c) ∈ X 2 ∧ a < c do<br />

5: for each edge e on the path from a to c do<br />

6: b e = the b on the same side of e as a minimizing f a,c (b)<br />

7: end for<br />

8: for each edge e on the path from c to a do<br />

9: d e = the d on the same side of e as c minimizing g a,c (d)<br />

10: end for<br />

11: for each edge e on the path from a to c do<br />

12: Q e = Q e ∪{ab e ||cd e }<br />

13: if |Q e |≥3(n − 3) then<br />

14: remove the n − 3 quartets with largest score from Q e<br />

15: end if<br />

16: end for<br />

17: end for<br />

18: for e ∈ E do<br />

19: S e = ∅<br />

20: while |S e | < n − 3 do<br />

21: ab i ||cd j = DELETEMIN(Q e )<br />

22: if ab i ||cd j is a minimum diagonal then<br />

23: S e = S e ∪{ab i ||cd j }<br />

24: end if<br />

25: Q e = Q e ∪{ab i ||cd j+1 }<br />

26: if j = 1 then Q e = Q e ∪{ab i+1 ||cd j }<br />

27: end while<br />

28: end for<br />

29: S = ∅<br />

30: for e ∈ E do<br />

31: if the refined <strong>Buneman</strong> index of σ e is > 0 then<br />

32: S = S ∪ σ e<br />

33: end if<br />

34: end for<br />




Figure 5.2: A tree, an edge, and the set of matrices induced by that edge, one<br />

for every choice of a pair a, c where a lies on one side of the edge and c on the other.<br />


Figure 5.3: A front of readied entries moving across a matrix. Green entries<br />

are ready entries, i.e. the entries that are potential next minimums. Red entries<br />

have been considered and are not available. After considering an entry, we add<br />

its neighbor to the south — and in case we are in the top row, we also add its<br />

neighbor to the east.<br />

the previous section. We know that for a given edge, the n − 3 quartets with<br />

minimum score can be found in these matrices — or rather, we can find the<br />

n − 3 minimum diagonals identifying the n − 3 minimum scoring quartets in<br />

the matrices. The naive approach would be to build a matrix for every pair<br />

(a, c) where a lies on one side of e and c lies on the other. However, since we<br />

know that each matrix is sorted so that the rows and columns are monotonic<br />

non-decreasing, we do not need to construct the matrices completely.<br />

Instead, we can imagine storing a “matrix-front” for each matrix, which<br />

will contain the “unspent” minimums. The matrix front would be initialised<br />

with only one element, namely the entry (1, 1). When that entry is “spent”,<br />

we delete it from the matrix front and add its neighbours to the matrix front.<br />

And of course this generalises: when we “spend” entry (i, j) we have to add its<br />

neighbours (i + 1, j) and (i, j + 1).<br />

Actually, this scheme would mean we would encounter entries twice, since<br />

an entry (i, j) is a neighbour of both (i − 1,j)and(i, j − 1). To amend this we<br />

will say that when spending (i, j), we will only look to its neighbour (i, j +1)<br />

— unless j = 1 in which case we will also look at (i +1,j). This way we<br />

avoid encountering the same entry twice; graphically, it is equivalent to painting<br />

columns top to bottom such that the leftmost columns are always at least as<br />

tall as the ones to the right. Figure 5.3 illustrates this.<br />
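The lazy traversal can be sketched as follows, assuming an implicit matrix whose entry (i, j) is f(b i ) + g(d j ) with f and g sorted in nondecreasing order. For brevity the sketch keeps the front in a PriorityQueue, whereas the thesis text stores it in a linked list and extracts minima by linear scan — the order in which entries are spent is the same. Class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Lazy "matrix front" traversal over an implicit matrix whose entry (i, j)
// is f[i] + g[j], with f and g sorted nondecreasing so that rows and
// columns are monotonic. After spending an entry we ready its neighbour to
// the south, plus the eastern neighbour when we are in the top row, so
// every entry is readied exactly once. A sketch, not the thesis code.
public class MatrixFront {
    public static double[] smallest(double[] f, double[] g, int k) {
        PriorityQueue<int[]> front = new PriorityQueue<>(
            Comparator.comparingDouble((int[] e) -> f[e[0]] + g[e[1]]));
        front.add(new int[] {0, 0});          // ready the (1,1) entry
        double[] result = new double[k];
        for (int n = 0; n < k; n++) {
            int[] e = front.poll();           // spend the minimum ready entry
            result[n] = f[e[0]] + g[e[1]];
            if (e[0] + 1 < f.length)          // neighbour to the south
                front.add(new int[] {e[0] + 1, e[1]});
            if (e[0] == 0 && e[1] + 1 < g.length)  // east, only from top row
                front.add(new int[] {e[0], e[1] + 1});
        }
        return result;
    }

    public static void main(String[] args) {
        double[] f = {1, 2, 4};
        double[] g = {0, 3, 5};
        // The three smallest sums f[i] + g[j], in nondecreasing order.
        System.out.println(java.util.Arrays.toString(smallest(f, g, 3)));
    }
}
```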

Now we have established that we can search through the matrices using lazy<br />



construction, but we still have to search through a set of matrices (one for each<br />

pair (a, c)) concurrently. This can however be simplified by saying we have one<br />

big combined “matrix front”, which contains all entries in all “matrix fronts” at<br />

the same time. Finding the next diagonal quartet across all the matrices now<br />

only involves finding the minimum in the global “matrix front”. Such a global<br />

matrix front has the potential of becoming very large. The number of matrices<br />

for an edge e is in O(n 2 ), the number of possible pairs (a, c). Fortunately, it<br />

turns out we do not have to store so many entries. We are looking for n − 3<br />

quartets with minimum score corresponding to n−3 minimum diagonals. Recall<br />

that a minimum diagonal ab||cd can have the same score as its “cousin” ab||dc,<br />

so when searching for diagonal quartets indexed by score we would in the worst<br />

case have to look through 2(n − 3) diagonal quartets with minimum scores to<br />

find n − 3 minimum diagonals.<br />

In the worst case with respect to number of matrices, the n − 3 minimum<br />

diagonals and their cousins would stem from different pairs (a, c). Thus to ensure<br />

we had captured all of them, we would need to initialize 2(n − 3) matrices,<br />

such that the 2(n − 3) diagonal quartets with minimum score could appear as<br />

entries (1, 1) in the matrices for their (a, c)-pairs.<br />

After initialisation, we search the matrices for diagonal quartets with minimum<br />

score. In the worst case where all the minimum diagonals have the same<br />

score as their cousins, we could be unlucky enough to always encounter the<br />

cousin first and the minimum diagonal second. But even in that case, we would<br />

never need to look at more than 2(n − 3) diagonal quartets with minimum score<br />

before we found (n − 3) minimum diagonals. So, the search space is bounded<br />

both in time and space by O(n).<br />

5.2.3 Pruning explained<br />

Now we are ready to look at the individual steps in the algorithm. In line 1–3,<br />

we for each edge initialize a “global matrix front”, implemented as a linked list<br />

— this ensures we can insert elements in constant time at one end, and find the<br />

minimum element in linear time (line 21). We store the Q e ’s in a hashtable<br />

for easy lookup. There are at most O(n) edges, so this task takes linear time.<br />

In line 4 we iterate over all possible pairs (a, c), ensuring a < c. For each<br />

pair, lines 5–10 find, for every edge e on the path between a and c, the elements<br />

b e and d e minimising f a,c and g a,c , and lines 11–12 add the resulting diagonal<br />

quartet ab e ||cd e to Q e for each such edge. We<br />


know that we need at most 2(n − 3) matrices initialized with diagonal quartets<br />

of minimum score. Therefore in lines 13–15 we measure the size of Q e — if<br />

|Q e |≥3(n − 3) we can safely remove the n − 3 diagonal quartets with largest<br />

score, and thus the matrices to which they belong, since we are guaranteed<br />

to find (n − 3) minimum diagonals in the remaining 2(n − 3) matrices. The<br />

procedure of selecting the n − 3 largest members of a set of size 3(n − 3) is<br />

described in chapter 10, and this can be done in linear time. Thus in lines 4–17<br />

we spend only O(n 3 ) time.<br />

After initialising the matrices it is time to search through them. We iterate<br />

through the edges in line 18, and for each edge we initialize a list of diagonal<br />

quartets S e to the empty set (line 19). We need the n − 3 smallest minimum<br />

diagonals in order to identify the n − 3 minimum quartets. So we iterate in line<br />

20 until we have enough minimum diagonals, in linear time since we would at<br />

most need to look at 2(n − 3) minimum diagonal quartets. We find minimum<br />

diagonals by successively looking at the diagonal quartets with minimum score<br />

in Q e . The size of Q e is bounded by O(n), since |Q e | is at most 3(n − 3)<br />

after initialization, and when we traverse our matrices we remove one entry and<br />

add two at most 2(n − 3) times — by that time we are sure to have found<br />

the n − 3 minimum diagonals. So |Q e | has at most 5(n − 3) elements. Thus,<br />

we can find and remove the minimum element from Q e in linear time in line<br />

21. If ab i ||cd j is a minimum diagonal, we add it to S e (lines 22–24). After<br />

considering ab i ||cd j , we use our matrix traversal/ entry readying scheme to add<br />

its appropriate neighbors in lines 25 and 26. Since we do not have the imaginary<br />

matrices, we instead have to search the two subtrees of T induced by e, for the<br />

choices of b and d that yield the matrix neighbours. Such a search can be done<br />

in linear time. All in all we iterate a linear number of times, performing linear<br />

tasks, so this part of the algorithm takes only O(n 2 ) time.<br />

In line 29 we initialise a list of splits to the empty set. Going through the<br />

edges of T and summing up minimum diagonals takes time O(n 2 ) in lines 30–34.<br />

The pruning part of the refined <strong>Buneman</strong> tree algorithm is extremely complicated,<br />

and it is very hard to both capture the intuition behind the algorithm,<br />

and still keep track of all details. In the description of the algorithm, the author<br />

has tried to capture modules that could be split off and described separately,<br />

in order to reduce the complexity. This was possible for the first part, but the<br />

author was not so successful in the second part. The point of the description was<br />

to give intuition rather than convey all details, and therefore it is possible that<br />

some small errors and oversights have crept in — particularly in the pruning<br />

part. For a full and concise account of the algorithm, turn to [BFÖ+ 03].<br />



Chapter 6<br />

The Tree Data Structure<br />

This chapter discusses a tree datastructure designed to maintain a compatible<br />

set of splits. In part 1 of the algorithm described in [BFÖ+ 03], we need to<br />

insert and remove splits from the tree and search for incompatible splits. These<br />

operations all need to perform in time O(|X|). Part 2 of the algorithm requires<br />

traversing paths in the tree and searching subtrees, also in linear time.<br />

6.1 Design of the tree data structure<br />

As we saw in a previous section, a compatible set of splits is a tree — not<br />

necessarily a regular, leaf-labeled tree, but at least a semi-labeled tree. We<br />

also saw there was a close connection between the evolutionary trees we want<br />

to represent, and phylogenetic trees which can be represented by leaf-labeled<br />

trees. Thus, the design of the tree data structure used in this work will be a<br />

leaf-labeled tree that uses special markers to indicate edges that are not “real”<br />

edges, when needed. Since the refined <strong>Buneman</strong> tree algorithm deals with overapproximations<br />

of evolutionary trees, we shall simply disregard the controversy<br />

altogether; all complexity bounds will hold even if we do. The operations we<br />

need to support with the tree data structure are not affected by the presence of<br />

extra trivial splits; the search for incompatible splits will of course never return<br />

a trivial split, since they are compatible with any split, and in the insert and<br />

delete operations we can just ignore or work around the cases where trivial splits<br />

are involved.<br />

Even though we say that our tree is unrooted, we still need some sort of root<br />

as a starting point for our tree traversals. One way of creating such an artificial<br />

root would be to select an inner node and use this as a starting point for every<br />

tree operation. However, as the topology of the tree changes as we insert and<br />

remove splits from the compatible set of splits that the tree represents, a fixed<br />

root node might be removed at any time. Instead, we shall choose a random<br />

node in the tree as starting point each time we start a new operation, to ensure<br />

we have a valid starting point. This of course means we have to have a design<br />



that supports traversal in any direction, and this is the solution the author has<br />

chosen. One alternative to this approach would be to select a leaf node as root,<br />

but this would mean we would have to introduce a special case for that node —<br />

at the same time, we could make do with a one-way directed tree. In any case,<br />

all complexity bounds would be upheld, so the choice is open.<br />

One more requirement for the tree data structure is that it would have to<br />

support multiple children. Since the set of splits that is being represented does<br />

not necessarily correspond to a fully resolved tree, any inner node can have any<br />

degree.<br />

6.1.1 The tree data structure explained<br />

The author has chosen a tree datastructure which is inspired by the doubly<br />

connected edge list (DCEL) datastructure described in [dBSvKO00]. The DCEL<br />

is used to describe both geometric and topological information, and is introduced<br />

in a context of planar subdivisions, but for our purposes we need only a simplified<br />

version.<br />

This simplified tree datastructure used for implementing refined <strong>Buneman</strong><br />

trees consists of nodes and directed edges, and the class signatures of these can<br />

be seen in Figure 6.1. The illustrations in Figure 6.2 and Figure 6.3 provide<br />

a “local navigational map” for a Node and an Edge, respectively. Leaves and<br />

inner nodes would subclass the Node class.<br />

class Node<br />

{<br />

Edge incidentEdge;<br />

}<br />

class Edge<br />

{<br />

Node origin, destination;<br />

Edge next, previous, twin;<br />

}<br />

Figure 6.1: Signatures for the Node and Edge classes<br />

Clearly, the storage required for a Node or an Edge is constant, 1 and 5<br />

pointers respectively. In an unrooted tree with n leaves there are at most n − 2<br />

internal nodes, so in total there are at most 2n − 2 nodes. Also, in an unrooted<br />

tree with n leaves there are at most 2n − 3 edges. The storage space needed for<br />

our tree data structure is therefore (2n − 2) + 5(2n − 3) or O(n).<br />

The set of edges with origin in some node w form a cyclic list. This allows<br />

for constant time insertion and removal, given we have the particular edge in<br />

hand. The issue of finding the edge to be inserted or removed is described later.<br />

The cyclic list of outgoing edges is illustrated in Figure 6.4.<br />
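A minimal sketch of constant-time insertion into such a cyclic list is given below. The Node and Edge fields follow Figure 6.1; the linking convention (next/previous chaining the edges that share an origin) and the method names are assumptions made for illustration.

```java
// Sketch of the DCEL-inspired structure: each directed Edge knows its twin
// and its neighbours in the cyclic list of edges sharing an origin. The
// addOutgoing method splices a new edge into that cycle in constant time.
// Field names follow Figure 6.1; everything else is illustrative.
class Node {
    Edge incidentEdge;
}

class Edge {
    Node origin, destination;
    Edge next, previous, twin;
}

public class CyclicEdgeList {
    // Insert e into the cyclic list of edges with origin w, in O(1).
    static void addOutgoing(Node w, Edge e) {
        e.origin = w;
        if (w.incidentEdge == null) {       // first outgoing edge: a 1-cycle
            e.next = e;
            e.previous = e;
            w.incidentEdge = e;
        } else {                            // splice in after the incident edge
            Edge first = w.incidentEdge;
            e.next = first.next;
            e.previous = first;
            first.next.previous = e;
            first.next = e;
        }
    }

    static int degree(Node w) {             // walk the cycle exactly once
        if (w.incidentEdge == null) return 0;
        int count = 0;
        Edge e = w.incidentEdge;
        do { count++; e = e.next; } while (e != w.incidentEdge);
        return count;
    }

    public static void main(String[] args) {
        Node w = new Node();
        for (int i = 0; i < 3; i++) addOutgoing(w, new Edge());
        System.out.println(degree(w)); // three outgoing edges
    }
}
```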




Figure 6.2: The world seen from the viewpoint of a Node<br />


Figure 6.3: The world seen from the viewpoint of an Edge<br />




Figure 6.4: The set of edges going out from a node forms a cyclic list. The big<br />

arrows represent Edges, and the thin arrows are pointers. The Edge that has<br />

been marked is the Edge that incidentEdge points to for the Node. The blurred Edges<br />

and pointers are supposed to indicate that the number of neighbors of a Node<br />

is variable.<br />

Access to the cyclic list of edges going out from a node is granted only<br />

through incident edge of the node, which can be any of the edges with origin in<br />

that node. In other words, the node has no knowledge of its local surroundings<br />

and particularly, it has no notion of direction in the tree. When traversing the<br />

tree we will therefore have to impose the direction externally.<br />

Traversal through outgoing edges from a node is made easy by the use of an<br />

EdgeIterator class. The EdgeIterator is inspired by the Iterator design pattern<br />

from the famous GoF-book [GHJV94]. The EdgeIterator provides a simple<br />

interface for the unordered traversal of outgoing edges from a node, and it is<br />

used extensively in the tree data structure operations.<br />

The interface for the Edge iterator class can be found in Figure 6.5. It closely<br />

mimics the interface for the Iterator class in the Java standard API. The author<br />

has chosen not to implement the Java Iterator interface to avoid having to cast<br />

Object to Edge all the time.<br />

The code snippet in Figure 6.6 is the archetypal example of recursive traversal<br />

through T . Notice how we are providing a direction for the traversal by<br />

supplying the parent edge of the node as a parameter. This allows us to skip<br />

over that edge when we encounter it on our way through the cyclic list. The<br />

EdgeIterator keeps track of our progress so we only perform one cycle though<br />

the cyclic list of edges.<br />

The EdgeIterator wraps the functionality of navigating through the pointers<br />

between edges. And the template described in Figure 6.6 wraps the functionality<br />

of recursively descending down through the tree datastructure.<br />



interface EdgeIterator<br />

{<br />

boolean hasNext();<br />

Edge next();<br />

}<br />

Figure 6.5: The interface for the EdgeIterator class<br />
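A possible implementation of the iterator, under the assumed linking convention that Edge.next connects edges sharing an origin, could look as follows. The class name CyclicEdgeIterator and the termination flag are illustrative; the thesis only fixes the interface.

```java
// Sketch of an iterator over the cyclic list of outgoing edges, assuming
// Edge.next links the edges that share an origin. It yields each outgoing
// edge exactly once, stopping after a full cycle. Names are illustrative.
class Node { Edge incidentEdge; }
class Edge { Node origin, destination; Edge next, previous, twin; }

class CyclicEdgeIterator {
    private final Edge start;
    private Edge current;
    private boolean done;

    CyclicEdgeIterator(Edge incidentEdge) {
        this.start = incidentEdge;
        this.current = incidentEdge;
        this.done = (incidentEdge == null);
    }

    boolean hasNext() { return !done; }

    Edge next() {
        Edge e = current;
        current = current.next;
        if (current == start) done = true;  // completed one full cycle
        return e;
    }
}

public class IteratorDemo {
    public static void main(String[] args) {
        // Build a node with two outgoing edges linked in a 2-cycle.
        Node w = new Node();
        Edge e1 = new Edge(); Edge e2 = new Edge();
        e1.next = e2; e2.next = e1;
        w.incidentEdge = e1;
        CyclicEdgeIterator it = new CyclicEdgeIterator(w.incidentEdge);
        int count = 0;
        while (it.hasNext()) { it.next(); count++; }
        System.out.println(count); // visits each outgoing edge exactly once
    }
}
```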

6.2 Operations on the tree data structure<br />

The tree data structure maintains a compatible set of splits. The operations<br />

we are interested in supporting are the following:<br />

• Insert(σ)<br />

Of course, we need to be able to insert splits into our set of splits<br />

• Delete(σ)<br />

And we need to be able to remove splits from the set<br />

• Incompatible(σ)<br />

And finally we need to be able to determine if a split is compatible with<br />

any split in our compatible set of splits, and if there is one, to extract it.<br />

The following sections describe in detail how these operations have been<br />

implemented. In order to support the operations on the tree data structure<br />

needed in the refined <strong>Buneman</strong> algorithm, we must be able to search through<br />

the tree, and find both nodes and edges.<br />

For the insert operation we will need to locate a node with the property that<br />

for a split σ = U|V the node separates U and V in the sense that we can group<br />

its subtrees in two, where one group only contains elements from U, and the<br />

other group only contains elements from V . This node can then be expanded<br />

into two connected nodes, one for each subtree group.<br />

For the delete operation we need to be able to find the edge e that separates U<br />

and V , meaning that the node on one end of the edge is the root of a subtree containing<br />

only elements from U, and the node on the other end is the root of a subtree<br />

containing only elements from V .<br />

For the incompatible operation we need to be able to find an edge e which<br />

is incompatible with another edge e ′ . We will use Definition 3 to decide the<br />

question of incompatibility.<br />

6.2.1 Insert<br />

Assume we have an unrooted tree T , and that we have selected some random<br />

node w root as starting point for traversal. The task is to insert a split σ = U|V<br />

which is compatible with all other splits represented by T . This amounts to<br />



class Node<br />

{<br />

Edge incidentEdge;<br />

public void traverse(Edge parent)<br />

{<br />

EdgeIterator ei = new EdgeIterator(this.incidentEdge);<br />

while (ei.hasNext())<br />

{<br />

Edge e = ei.next();<br />

if (e.twin == parent) continue;<br />

Node n = e.destination;<br />

n.traverse(e);<br />

}<br />

}<br />

}<br />

Figure 6.6: An example of code for traversing T<br />

finding the node w target (it is guaranteed to exist since σ is compatible with all<br />

other splits in T , see [BFÖ+ 03]), such that if we root T at w target , all subtrees<br />

of w target contain only leaves from either U or V , but not from both — this is<br />

illustrated in Figure 6.7.<br />

Clearly, if we have a node w where all the subtrees of w have the property<br />

that they contain only leaves from either U or V , we may take all the U-<br />

subtrees and hang them on a new node w U , and we may take all the V -subtrees<br />

and hang them on a new node w V . We now have two trees rooted at w U and<br />

w V , respectively, containing all the U’s and all the V ’s, respectively, and we<br />

may now join these two nodes with an edge e that now separates U and V , and<br />

thereby represents the split σ. This is illustrated in Figure 6.8.<br />

The question is now, how do we find w in the first place? The node is<br />

guaranteed to exist, since σ is a compatible split ([BFÖ+ 03]). So we just need<br />

to specify its location. Looking at Figure 6.9, let’s assume we are traversing T<br />

and we have reached some node w, where w has a parent edge and a bunch of<br />

edges to its children, given the orientation indicated in the figure. If we proceed<br />

to count the number of elements from U and V in the subtrees of w, let’s call<br />

the counts w u and w v , we end up with one of the following results:<br />

• w u < |U|∧w v < |V | — in this case, we can ascertain that the subtree of<br />

w given by its parent edge must contain elements from both U and V, so<br />

w cannot be the node we are looking for.<br />



Figure 6.7: An example of a node which is a valid insertion node (green), and one<br />

which is not a valid insertion node (red). A node is a valid insertion node if all<br />

subtrees of that node contain only nodes from U (blue nodes) or from V (black<br />

nodes), but not from both.<br />

• w u = |U| ∧ w v < |V | — in this case, we know that the subtree of w given<br />

by its parent edge only contains elements from V . If we assume that we<br />

found w in a bottom–up traversal, and that w is the first node with the<br />

property w u = |U|, w mustbethenodewearelookingfor.Itcannotbea<br />

node below w, since that node would have a subtree with mixed u’s and<br />

v’s, i.e. the subtree given by its parent edge.<br />

• w u < |U| ∧ w v = |V | — similar argument as in the case w u = |U| ∧ w v < |V |.<br />

• w u = |U| ∧ w v = |V | — in this case, the node we chose at random to be<br />

the intermediate root, must be the insertion node we are looking for.<br />

Counting can be done in linear time since we only need to visit each node<br />

once, and there are at most 2n − 2 nodes in an unrooted tree with n leaves.<br />

Searching can afterwards be done in time O(n), visiting each node exactly once<br />

and testing each node in constant time until we find the one we are looking for.<br />
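The four cases reduce to a simple constant-time test once the subtree counts are available. The sketch below classifies a node from the (u, v) counts of its child subtrees together with the totals |U| and |V |; the names and the array-based encoding are illustrative, not the thesis implementation.

```java
// Sketch of the insertion-node test: a node is a valid insertion node for
// the split U|V if every one of its subtrees — including the subtree seen
// through its parent edge — contains leaves from U only or from V only.
// Child counts come from the bottom-up counting pass; the parent-side
// counts follow from the totals. Names are illustrative.
public class InsertionNode {
    static boolean isValid(int[][] childCounts, int totalU, int totalV) {
        int sumU = 0, sumV = 0;
        for (int[] c : childCounts) {
            if (c[0] > 0 && c[1] > 0) return false; // mixed child subtree
            sumU += c[0];
            sumV += c[1];
        }
        // The subtree "above" the node, seen through its parent edge:
        int upU = totalU - sumU, upV = totalV - sumV;
        return !(upU > 0 && upV > 0);
    }

    public static void main(String[] args) {
        // Two pure child subtrees {2 U's} and {1 V}; above: 0 U's, 2 V's.
        System.out.println(isValid(new int[][] {{2, 0}, {0, 1}}, 2, 3)); // true
        // A mixed child subtree {1 U, 1 V} disqualifies the node.
        System.out.println(isValid(new int[][] {{1, 1}, {1, 0}}, 2, 2)); // false
    }
}
```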

6.2.2 Delete<br />

The delete operation is somewhat similar to the insert operation. Again we<br />

start out by counting U’s and V ’s in a bottom-up fashion. The edge e we are<br />

looking for has the property that its destination node is a node w, whereall<br />

subtrees of w contain only elements from U or from V , but not both. In other<br />

words, w is the node where u w = |U| and v w = 0 or vice versa. And, once the<br />

node w is found, the edge e has also been found, since we can keep track of both<br />




Figure 6.8: Inserting an edge e that splits U and V<br />


Figure 6.9: Local orientation around a node, given we are in the middle of a<br />

tree traversal.<br />

w and e while searching for w. Again we use linear time for counting through<br />

T , and afterwards we can search through T in time O(n).<br />

To remove e we need to move all subtrees of w onto e’s origin node, let’s call<br />

it w ′ . But this is quite easy: we just iterate through the nodes directly “under”<br />

w and sever their connection to w. Afterwards we iterate through the nodes<br />

again and attach them to w ′ . This can all be done in linear time 1 .<br />

6.2.3 Incompatible<br />

Given a split σ = U|V , the search for an incompatible split σ ′ = U ′ |V ′ ∈ T<br />

is a bit more complicated than for the insert and delete cases. A split σ ′ is<br />

incompatible with σ if all four intersections of {U ′ ,V ′ } and {U, V } are not<br />

empty. In other words, the values |U ′ ∩ U|, |U ′ ∩ V |, |V ′ ∩ U| and |V ′ ∩ V | are<br />

all non-zero.<br />

1 Of course it can be done faster if we did not cache the nodes in between detaching<br />

and attaching them, but the design of our tree data structure warrants this procedure, and<br />

asymptotically there is no difference.<br />



So how do we find σ ′ ? We start by counting T bottom-up with respect to<br />

U and V . We have that for every node w ∈ T which hangs from some edge e,<br />

u w is equal to the value |U ′ ∩ U|. U ′ is the set of leaves which lie in the<br />

subtree of w, so u w is both the number of elements from U in w’s subtree and<br />

the number of elements that U and U ′ have in common. Similarly, we also have<br />

that v w = |U ′ ∩ V | by the same argument.<br />

Now we apply basic set theory to find the values |V ′ ∩ U| = |U| − |U ′ ∩ U|<br />
and |V ′ ∩ V | = |V | − |U ′ ∩ V |. We have all four quantities, and we can check if<br />

any one of them is zero, and answer the question whether w is the node we are<br />

looking for (consequently whether e is the edge we are looking for).<br />

The time spent searching for an incompatible node is linear. We use linear<br />

time decorating T with U/V counts, and we use linear time checking each one for<br />

incompatibility. Given u w , v w , |U| and |V | we can in constant time determine<br />

incompatibility by calculating the quantities described above.<br />

We are not guaranteed to find an incompatible split, but if we do, we can report<br />

it in linear time. To do this, we allocate a bitvector b of length n and search<br />

through one of the subtrees induced by e. We mark an entry in b corresponding<br />
to the index of every leaf we find.<br />
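As an illustration, the four-intersections test can also be phrased directly over bitvector-encoded splits. This is a sketch using 64-bit masks (so it assumes n ≤ 64 taxa); the implementation proper works with length-n bitvectors and the per-node counts u w and v w described above, but the test is the same:

```java
// Hedged sketch: each split U|V over n taxa is encoded by the bitmask of its
// U side (bit i set iff taxon i lies in U). Suitable only for n <= 64.
final class SplitCheck {
    static boolean incompatible(long u1, long u2, int n) {
        long all = (n == 64) ? -1L : (1L << n) - 1; // mask of all n taxa
        long a = u1 & u2;          // U  ∩ U'
        long b = u1 & ~u2 & all;   // U  ∩ V'
        long c = ~u1 & u2 & all;   // V  ∩ U'
        long d = ~u1 & ~u2 & all;  // V  ∩ V'
        // the splits are incompatible iff all four intersections are non-empty
        return a != 0 && b != 0 && c != 0 && d != 0;
    }
}
```

For example, over four taxa the splits {0,1}|{2,3} and {0,2}|{1,3} are incompatible, while {0,1}|{2,3} and {0}|{1,2,3} are compatible.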

6.3 Testing the tree data structure<br />

The tree data structure has not been tested formally. Of course, all tree operations<br />

have been unit-tested, and have been found to be correct. Also, performance<br />

tests have been run to some extent, but this is not easy: if we want to<br />

test the performance of the tree operations, we would need to be able to create<br />

highly resolved trees by inserting a large number of compatible splits created<br />

at random, which could then be a basis for inserting, deleting or searching for<br />

splits. Doing the operations on an unresolved tree would not give a realistic<br />

picture of the performance. It is quite easy to generate a caterpillar tree by<br />
starting out with a bitvector of the form 11000..., and then generating splits of<br />
the form 111000..., 1111000..., 11111000... and so on. This was used for<br />
informal performance testing, but clearly this kind of tree is biased, and the<br />
author has chosen not to present a formal test on this basis. Still, the author<br />

claims that the tree data structure does indeed run as specified, referring to<br />

the performance test for the whole algorithm — if the tree data structure did<br />

not support linear time insertion, deletion and searching, the refined <strong>Buneman</strong><br />

algorithm would not be able to perform as specified.<br />



Chapter 7<br />

The Single Linkage<br />

Clustering Tree<br />

Single linkage clustering trees play an important part in this work as a replacement<br />

for anchored <strong>Buneman</strong> trees. The refined <strong>Buneman</strong> tree algorithm<br />

described in this thesis is based on Lemma 5, where anchored <strong>Buneman</strong> trees<br />

need to be merged with refined <strong>Buneman</strong> trees to create new refined <strong>Buneman</strong><br />

trees. However, since the first part of the algorithm only requires us to build<br />
overapproximations of the refined Buneman tree, we can use single linkage<br />
clustering trees instead of anchored Buneman trees, because every single linkage<br />
clustering tree contains the corresponding anchored Buneman tree.<br />

7.1 Replacing the anchored <strong>Buneman</strong> tree<br />

The anchored <strong>Buneman</strong> tree can be computed in time O(n 2 ), according to<br />

Lemma 3, by first constructing a single linkage clustering tree that is a superset<br />

of B x (δ), and then pruning that tree ([BB99], section 3). But, since we are<br />

creating an overapproximation of the refined <strong>Buneman</strong> tree by incrementally<br />

considering splits from anchored <strong>Buneman</strong> trees, we can replace the anchored<br />

<strong>Buneman</strong> trees altogether with the unpruned single linkage clustering trees. The<br />

extra splits might or might not become part of our overapproximation, but they<br />

will be weeded out later on in the algorithm, and they do not asymptotically<br />

change the size of the overapproximation. The tree cannot grow beyond O(n)<br />

splits. And luckily, the single linkage clustering tree is extremely simple to<br />

compute.<br />

7.2 Calculating the single linkage clustering tree<br />

As mentioned, the single linkage clustering tree is very simple to calculate.<br />

Pseudo code for the algorithm, adapted from [BG91], section 3.2.7, is listed in<br />



Algorithm 5. Offhand, the algorithm looks like it would perform in time O(n 3 ).<br />

We iterate through a loop, reducing the size of the distance matrix by one<br />
(adding one entry and removing two) on each iteration, until the distance matrix<br />
is exhausted. Inside the loop, in line 3 we need to find the minimum entry in an<br />
n × n matrix, normally an O(n^2) operation. However, if we overlay our distance<br />
matrix with a quad tree search structure, we can get away with spending constant<br />
time finding the minimum entry in the matrix, while spending only linear time<br />
updating the search structure after we update the distance matrix in line 5.<br />
More about the quad tree data structure can be found in chapter 8.<br />

Algorithm 5 The single linkage clustering tree algorithm<br />

Require: δ is a distance measure on X<br />

Ensure: C is the single linkage clustering tree for δ<br />

1: C a set of clusters, one for each element in X<br />

2: while |C| > 1 do<br />

3: Choose the clusters c 1 ,c 2 ∈ C that minimize the quantity δ(c 1 ,c 2 ).<br />

4: Create c ′ = c 1 ∪ c 2 .<br />

5: Calculate distances from c ′ to all other clusters c ′′ ∈ C by setting<br />

δ(c ′ ,c ′′ )=min{δ(c 1 ,c ′′ ),δ(c 2 ,c ′′ )}.<br />

6: Erase c 1 and c 2 from C, add c ′ .<br />

7: end while<br />

Also, instead of actually inserting and removing entire rows and columns in<br />
the distance matrix, we can reuse a row and column, and rather than deleting<br />
rows we can keep track of which clusters are alive. Figure 7.1 illustrates<br />
this point.<br />
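The whole loop, including the row-reuse trick, can be sketched as follows. The class and method names are illustrative; the minimum is found here by a plain scan, which makes the sketch O(n^3) overall, whereas the thesis replaces that scan with the quad tree of chapter 8 to reach O(n^2):

```java
final class SingleLinkage {
    // Sketch of Algorithm 5 with the row-reuse trick from Figure 7.1: the
    // merged cluster c' reuses c1's row/column, and c2's row is merely marked
    // dead. Returns the n-1 merges as pairs of cluster representatives.
    static int[][] cluster(double[][] d) {
        int n = d.length;
        double[][] dist = new double[n][n];
        boolean[] alive = new boolean[n];
        for (int i = 0; i < n; i++) {
            alive[i] = true;
            for (int j = 0; j < n; j++) dist[i][j] = d[i][j];
        }
        int[][] merges = new int[n - 1][];
        for (int step = 0; step < n - 1; step++) {
            int c1 = -1, c2 = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < n; i++)                 // choose the closest pair
                for (int j = i + 1; j < n; j++)
                    if (alive[i] && alive[j] && dist[i][j] < best) {
                        best = dist[i][j]; c1 = i; c2 = j;
                    }
            merges[step] = new int[]{ c1, c2 };
            for (int k = 0; k < n; k++)                 // single-linkage update
                if (alive[k] && k != c1 && k != c2) {
                    double m = Math.min(dist[c1][k], dist[c2][k]);
                    dist[c1][k] = m; dist[k][c1] = m;
                }
            alive[c2] = false;                          // retire c2's row
        }
        return merges;
    }
}
```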

7.3 Converting the distance matrix<br />

One important note about the use of the single linkage clustering algorithm<br />

instead of the anchored <strong>Buneman</strong> tree algorithm is the distinction between distance<br />

matrix and similarity matrix. The anchored <strong>Buneman</strong> tree for a species<br />

x ∈ X is computed based on a distance measure δ on X, while the single<br />

linkage clustering tree for x ∈ X is computed based on an inverted similarity<br />

measure on X with respect to x. In other words we must distinguish between<br />

SLCT(δ, X) and SLCT(F x (δ), X).<br />

Basically, this means that we have to transform δ before using it to calculate<br />

the single linkage clustering tree. The transformation is described in [BB99],<br />
section 3, where it is termed a Farris transformation:<br />

F x (a, b) = ½ (δ ax + δ bx − δ ab )<br />

where a, b ∈ X \{x}. The Farris transformation creates a similarity measure,<br />

but this can be converted into a distance measure by e.g. changing signs.<br />
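A minimal sketch of the transformation, assuming a full distance matrix (the class and method names are illustrative; entries involving the anchor x itself are not meaningful for a, b ∈ X \ {x} and should simply be ignored):

```java
// Hedged sketch of the Farris transformation ([BB99], section 3):
//   F_x(a, b) = 1/2 (d(a,x) + d(b,x) - d(a,b))
// The result is a similarity measure; negate it to obtain a distance measure.
final class Farris {
    static double[][] transform(double[][] d, int x) {
        int n = d.length;
        double[][] f = new double[n][n];
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                f[a][b] = 0.5 * (d[a][x] + d[b][x] - d[a][b]);
        return f;
    }
}
```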




Figure 7.1: Manipulating the matrix in the single linkage clustering tree. A)<br />
We find the two rows and two columns with minimum entries (recall the matrix<br />
is symmetric). B) We reuse one row/column pair, and discard the other. C)<br />
The new matrix can no longer index the discarded row and column, and the<br />
reused row and column have been re-indexed.<br />



Algorithm 6 The algorithm that calculates the single linkage clustering tree<br />

for x<br />

Require: δ is a distance measure on X<br />

Ensure: T is root in the single linkage clustering tree for x ∈ X<br />

1: δ ′ = FARRISTRANSFORM(δ, x)<br />
2: Q is a quad tree overlaid on δ ′<br />

3: C is a list of clusters corresponding to δ ′<br />

4: while |C| > 1 do<br />

5: (c 1 ,c 2 )=MINIMUM(Q)<br />

6: c ′ = c 1 ∪ c 2<br />

7: for c ′′ ∈ C, c ′′ ≠ c 1 ,c 2 do<br />

8: δ ′ (c ′ , c ′′ ) = min{δ ′ (c 1 , c ′′ ), δ ′ (c 2 , c ′′ )}<br />

9: end for<br />

10: C = C ∪{c ′ }\{c 1 ,c 2 }<br />

11: UPDATE(Q)<br />

12: end while<br />

13: T = C<br />

Thus, the algorithm for computing the single linkage clustering tree for x<br />

becomes the algorithm found in Algorithm 6.<br />

Analysing Algorithm 6 is straightforward: initialising δ ′ and Q takes time<br />

O(n 2 ). We iterate n times in the while loop, performing constant time search<br />

for the minimum entry in the distance matrix, a linear time update of the new<br />
row/column for c ′ with regard to the remaining clusters in the matrix, linear<br />
time removal of c 1 and c 2 , and a linear time update of the quad tree data<br />
structure. All in all we have an O(n^2) implementation of the single linkage clustering<br />
tree algorithm. Thus, we can safely replace anchored Buneman trees<br />
with single linkage clustering trees in the main algorithm, since it has the same<br />

running time complexity. Also, we have argued that the anchored <strong>Buneman</strong><br />

tree is contained in the single linkage clustering tree. And since we are building<br />

overapproximations of the refined <strong>Buneman</strong> tree, the extra splits from the single<br />

linkage clustering tree will not affect the algorithm adversely.<br />



Chapter 8<br />

The Quad Tree Data<br />

Structure<br />

The quad tree is a generalisation of the binary tree; where the binary tree deals<br />

with linearly ordered data, the quad tree encapsulates data in two dimensions.<br />

The analogy is shown in Figure 8.1. We will use the quadtree as a search<br />

structure atop a matrix, so that we can find the minimum in that matrix in<br />

constant time while paying only linear time for updates. This makes it possible<br />

to adhere to certain complexity bounds in the single linkage clustering tree<br />

algorithm described in chapter 7.<br />


Figure 8.1: The analogy of binary trees and quad trees.<br />




Figure 8.2: Updating the quad tree by propagating changes upward.<br />

8.1 Complexity analysis<br />

For every node in our quad tree we wish to maintain information about the minimum<br />

matrix entry in the subtree of that node. That way we can, at any time,<br />

find the minimum entry of the matrix in constant time by looking it up in<br />

the root node. We can keep the tree updated by making leaf nodes propagate<br />

changes upward in the tree, at a price proportional to the height of the tree; for a<br />
balanced quad tree spanning an n × n matrix the height is log_4(n^2) = 2 log_4(n),<br />
so the time spent propagating changes upward through the tree is O(log_4(n)).<br />

Unfortunately, when we are using the quadtree to support construction of<br />

single linkage clustering trees, we need to update two rows and two columns in<br />

the matrix for each iteration, which would then take time O(n log 4 (n)) if we<br />

needed to propagate values upward through the tree every time, and we can<br />

afford to spend no more than linear time. Nor can we afford to update<br />
(re-initialise) the whole tree, since that would take time O(n^2).<br />

Luckily, it is possible to optimize the update procedure under these specific<br />

circumstances. It turns out that it is possible to update the tree in linear<br />

time when changing precisely one row or column. This is due to the<br />
geometric dependency of the nodes, as illustrated in Figure 8.2. It clearly shows<br />
that the upward propagation of changes introduces redundancy; for example,<br />
changes to entries a and b share almost identical paths toward the root.<br />
By cutting down on this redundancy we can shave off a bit of the running time.<br />
We can ask ourselves how many nodes in the quad tree we need to update, in<br />
total. The answer is that after updating a row of n entries in the matrix we<br />
need to update n/2 entries in the first layer of the quad tree, n/4 entries in the<br />
second layer, and so on. This gives us<br />
<br />
n + n/2 + n/4 + ... + 1 ≤ 2n<br />


Figure 8.3: Layers in the quad tree are invalidated after updating a matrix row.<br />

entries to be updated in total. The notion of layers is illustrated in Figure 8.3.<br />

The redundancy cutdown can be implemented using a graph colouring scheme.<br />

When entries are changed, the change is propagated upwards toward the<br />
root, marking nodes as invalid. If we encounter a node that has already<br />
been invalidated, there is no need to go any further upward, since all nodes<br />

above it will also be invalid. This way we avoid visiting the same node more<br />

than once. Updating a node of course takes constant time.<br />

Once the matrix row has been updated, and nodes have been invalidated<br />
throughout the tree, we can recompute the invalidated nodes bottom-up (via<br />
a traversal starting from the root), and thus maintain valid information about<br />
the minimum entry in the root node.<br />
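To make the scheme concrete, here is a sketch of such a search structure. The class is hypothetical, and instead of invalidation marks it simply recomputes the one affected row and one affected column of cells at each level; that is the same 2n-bounded amount of work per row/column update as derived above:

```java
// Sketch of a quad-tree min structure over a symmetric matrix, padded with
// +infinity up to the next power of two. Each internal cell stores the minimum
// of its four children plus the leaf index attaining it.
final class QuadTreeMin {
    final int size;           // padded side length, a power of two
    final double[][][] lvl;   // lvl[0] = the matrix; each level halves the side
    final int[][][] arg;      // arg[k][r][c] = leaf index (row*size+col) of the min

    QuadTreeMin(double[][] m) {
        int n = m.length, s = 1;
        while (s < n) s *= 2;
        size = s;
        int levels = Integer.numberOfTrailingZeros(s) + 1;
        lvl = new double[levels][][];
        arg = new int[levels][][];
        lvl[0] = new double[s][s];
        arg[0] = new int[s][s];
        for (int r = 0; r < s; r++)
            for (int c = 0; c < s; c++) {
                lvl[0][r][c] = (r < n && c < n) ? m[r][c] : Double.POSITIVE_INFINITY;
                arg[0][r][c] = r * s + c;
            }
        for (int k = 1; k < levels; k++) {
            int w = s >> k;
            lvl[k] = new double[w][w];
            arg[k] = new int[w][w];
            for (int r = 0; r < w; r++)
                for (int c = 0; c < w; c++) pull(k, r, c);
        }
    }

    // Recompute one internal cell from its four children in constant time.
    private void pull(int k, int r, int c) {
        double best = lvl[k - 1][2 * r][2 * c];
        int bi = arg[k - 1][2 * r][2 * c];
        for (int dr = 0; dr < 2; dr++)
            for (int dc = 0; dc < 2; dc++)
                if (lvl[k - 1][2 * r + dr][2 * c + dc] < best) {
                    best = lvl[k - 1][2 * r + dr][2 * c + dc];
                    bi = arg[k - 1][2 * r + dr][2 * c + dc];
                }
        lvl[k][r][c] = best;
        arg[k][r][c] = bi;
    }

    double minValue() { return lvl[lvl.length - 1][0][0]; }           // O(1)

    int[] minCell() {                                                  // O(1)
        int i = arg[lvl.length - 1][0][0];
        return new int[]{ i / size, i % size };
    }

    // Overwrite row r and, by symmetry, column r, then repair the tree.
    // Per level only one row and one column of cells change, so the total
    // work is bounded by 2 (s + s/2 + ... + 1) <= 4s, i.e. linear.
    void setRow(int r, double[] vals) {
        for (int c = 0; c < size; c++) {
            double v = c < vals.length ? vals[c] : Double.POSITIVE_INFINITY;
            lvl[0][r][c] = v;
            lvl[0][c][r] = v;
        }
        for (int k = 1; k < lvl.length; k++) {
            int w = size >> k, rr = r >> k;
            for (int c = 0; c < w; c++) { pull(k, rr, c); pull(k, c, rr); }
        }
    }
}
```

In the single linkage clustering loop, the diagonal (and any retired row) would be set to +infinity so that it never wins the minimum query.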

8.2 Performance of the quad tree data structure<br />

One important practical issue regarding quad trees is their dimension. Obviously,<br />

for an n×n matrix where n is a power of 2, a quad tree will fit perfectly. In<br />

case n is not a power of two, special considerations must be taken. The author<br />
has chosen to implement the quad tree such that, given an input matrix of size<br />
n × n, the base quad tree layer has size 2^m × 2^m, where 2^(m−1) < n ≤ 2^m.<br />


[Plot: running time (ms.) vs. input size for 1000 row/column updates, with best fit ax + b.]<br />

Figure 8.4: Quad tree performance 1: updating 1000 rows and columns for a<br />

given input size. The artifact between sizes 700-800 is probably due to some<br />

system processes disturbing the experiment.<br />

The plots show running time against input size; the time unit is milliseconds.<br />
Input sizes range between 50<br />

and 873. Originally, the intention was to test input sizes up to 1100, i.e. past<br />

the next power-of-two threshold at 1024. However, the Java program threw<br />
an OutOfMemoryError at input size 873. Also, the plots clearly show a<br />

degeneration of performance in the large input size range, i.e. sizes greater than<br />

700, due to the large memory consumption.<br />

The initialisation artifact due to the sizing of the quad tree to a power of<br />
2 shows up in the performance characteristic for the refined Buneman tree<br />
algorithm. Perhaps there is a more elegant way of handling the sizing problem?<br />
Or perhaps we should choose a different algorithm, so we would not need to<br />
use a quad tree? The single linkage clustering tree was chosen for its apparent<br />

simplicity, but there are also minimum spanning tree methods that solve the<br />

problem (see for example [Epp98]). In any case, the initialisation still stays<br />

within specified complexity boundaries, and it does not adversely affect the<br />

asymptotic performance of the algorithm.<br />



[Plot: running time (ms.) vs. input size for quad tree initialisation.]<br />

Figure 8.5: Quad tree performance 2: initialising the quad tree. Notice the 2^m<br />
artifact and the performance degradation in the 700+ size range. This is due<br />
to the JVM running out of memory (of course, the JVM can be parameterized<br />
to use more memory).<br />



[Plot: running time (ms.) vs. input size for initialisation and update combined.]<br />

Figure 8.6: Quad tree performance 3: initialisation and update, in total.<br />



Chapter 9<br />

The Discard-Right<br />

Algorithm<br />

The DISCARD-RIGHT algorithm is used in Algorithm 3 to answer the following<br />

question: given two incompatible splits σ (left) and σ ′ (right), is μ σ ′ ≤ 0? The<br />

algorithm is based on the construction given in the article by Brodal et al.<br />

([BFÖ+ 03]), in the proof for Lemma 5. This construction relies on both σ and<br />

σ ′ to compute an upper bound for μ σ ′ in linear time.<br />

In the context where DISCARD-RIGHT is used, we have on the one hand a set of<br />
compatible splits S representing an overapproximation of RB(δ| X k ) for some k.<br />
On the other hand, we have a new split σ which we want to evaluate. We have<br />

determined that σ is incompatible with some σ ′ ∈ S. This means that either σ<br />

or σ ′ (or both) do not belong in RB(δ| Xk ).<br />

Normally it would take time in the order of n^4 to calculate μ σ ′ , since we<br />
would have to consider all possible quartets induced by σ ′ to find the n − 3 least<br />

scoring quartets that are used to find the refined <strong>Buneman</strong> index. However, since<br />

we have a pair of incompatible splits we can use the linear time algorithm given<br />

in Algorithm 7 to find an upper bound for μ σ ′ which can be used to determine<br />

whether σ ′ should be discarded or not. Note that this does not necessarily<br />

validate or invalidate σ as a split in RB(δ). This is apparent from the truth<br />
table in Table 9.1: notice how the fourth option, where both Buneman indices<br />
are positive, is not possible when we assume that σ and σ ′ are incompatible.<br />

Algorithm 7 deserves a few comments; firstly, in line 1 we can determine the<br />

     μ σ     μ σ ′    DISCARD-RIGHT<br />
1    ≤ 0     ≤ 0      true<br />
2    > 0     ≤ 0      true<br />
3    ≤ 0     > 0      false<br />

Table 9.1: The truth table for the DISCARD-RIGHT algorithm.<br />



sets A, B, C and D by scanning the two splits concurrently, in linear time. In<br />

lines 3–6 we know that |A|×|B|×|C|×|D| ≥ n − 3, so we will generate enough<br />

possible quartets for later use.<br />

In line 7 we have elements a ∈ A, b ∈ B, c ∈ C and d ∈ D, and we can<br />

combine these into 3 different quartets, q x , q y and q z . We know that two distinct<br />

quartets q 1 and q 2 for the same four species satisfy β q1 + β q2 ≤ 0 from chapter 2<br />

or from [BFÖ+ 03]. So now all we need to do is to determine which pair of<br />

quartets fit with our splits. We need to select q 1 and q 2 from {q x ,q y ,q z } such<br />

that q 1 ∈ q(σ) and q 2 ∈ q(σ ′ ), and there are 6 possible combinations to choose<br />

from: q x q y , q x q z , q y q x , q y q z , q z q x and q z q y . In line 8 we store q 2 for later. And<br />

finally in line 9, if we have found enough quartets we skip to the second part of<br />

the algorithm.<br />
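To make the quartet scores concrete, here is a sketch of the β computation, assuming the quartet score definition from chapter 2 / [BFÖ+ 03], β(ab|cd) = ½ (min{δ(a,c)+δ(b,d), δ(a,d)+δ(b,c)} − δ(a,b) − δ(c,d)); the class and method names are illustrative only:

```java
// Hedged sketch of the Buneman score of the quartet topology ab|cd.
// For the same four species, at most one of the three possible topologies
// gets a positive score, so any two distinct topologies sum to at most zero.
final class QuartetScore {
    static double beta(double[][] d, int a, int b, int c, int e) {
        double cross = Math.min(d[a][c] + d[b][e], d[a][e] + d[b][c]);
        return 0.5 * (cross - d[a][b] - d[c][e]);
    }
}
```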

In line 17 we exploit the fact that for all the pairs (q 1 , q 2 ) we found in the<br />
for-loops, β q1 + β q2 ≤ 0, because of our choices of a–d. From this we get that<br />
<br />
∑_{i=1}^{n−3} (β_{q1,i} + β_{q2,i}) ≤ 0<br />
<br />
for some enumeration of our candidate quartets. We can split the sum in two;<br />
then we have that<br />
<br />
∑_{i=1}^{n−3} β_{q1,i} ≤ 0 and/or ∑_{i=1}^{n−3} β_{q2,i} ≤ 0<br />
<br />
since they cannot both be positive. Now we know that the sum of the n − 3<br />
least scoring quartets is the refined Buneman index, so we have<br />
<br />
μ_{σ′} ≤ ∑_{i=1}^{n−3} β_{q2,i}<br />
<br />
And thus we can answer the question of whether σ ′ can be discarded or not.<br />



Algorithm 7 The DISCARD-RIGHT algorithm<br />

Require: σ = U 1 |V 1 and σ ′ = U 2 |V 2 are incompatible splits.<br />

Ensure: return true if μ σ ′ ≤ 0, false otherwise.<br />

1: A = U1 ∩ U2, B = U1 ∩ V 2, C = V 1 ∩ U2, D = V 1 ∩ V 2<br />

2: Q = ∅<br />

3: for a =1to|A| do<br />

4: for b =1to|B| do<br />

5: for c =1to|C| do<br />

6: for d =1to|D| do<br />

7: determine permutations q 1 and q 2 of A[a], B[b], C[c] and D[d] such<br />
that q 1 ∈ q(σ) and q 2 ∈ q(σ ′ )<br />

8: Q = Q ∪ q 2<br />

9: if |Q| = n − 3 then<br />

10: goto sum<br />

11: end if<br />

12: end for<br />

13: end for<br />

14: end for<br />

15: end for<br />

16: sum:<br />

17: if ∑_{i=1}^{n−3} β Q[i] ≤ 0 then<br />

18: return TRUE<br />

19: end if<br />

20: return FALSE<br />



Chapter 10<br />

The Selection Algorithm<br />

In Algorithm 4 we are faced with the problem of removing the (n − 3) largest<br />

elements from a set of size 3(n−3) or more. In general, the problem is removing<br />

the i largest elements from a set A of size n, where i < n. This is done with a<br />
randomized selection algorithm in the style of [CLR90], which relies on the<br />
partitioning procedures listed in Algorithms 9 and 10.<br />


Algorithm 9 The RANDOMIZED-PARTITION algorithm<br />

Require: A is an array of length n.<br />

Require: p, r are indices in A<br />

Ensure: returns q such that elements in A[p..q] are all less than or equal to all<br />

elements in A[q +1..r].<br />

1: i = RANDOM(p, r)<br />

2: SWAP(A[p],A[i])<br />

3: return PARTITION(A, p, r)<br />

Algorithm 10 The PARTITION algorithm<br />

Require: A is an array of length n.<br />

Require: p, r are indices in A<br />

Ensure: returns j such that elements in A[p..j − 1] are all less than or equal<br />

to all elements in A[j..r].<br />

1: x = A[p]<br />

2: i = p − 1<br />

3: j = r +1<br />

4: while true do<br />

5: repeat<br />

6: j = j − 1<br />

7: until A[j] ≤ x<br />

8: repeat<br />

9: i = i +1<br />

10: until A[i] ≥ x<br />

11: if i < j then<br />
12: SWAP(A[i], A[j])<br />
13: else<br />
14: return j<br />
15: end if<br />
16: end while<br />


The worst case running time of the selection algorithm is O(n^2). The algorithm uses a divide-and-conquer strategy, where the idea<br />

is to cut away search space to quickly find the element we are looking for. If<br />

we were to divide in half every time we would expect to spend linear time<br />

dividing and searching recursively (n + n/2 +n/4 +···+1). However, since<br />

we are partitioning around some element in an array, we are not guaranteed to<br />

partition in the middle each time. The randomized part of the algorithm tries<br />

to remedy this, and the expectation is that selecting an element at random will<br />

on average partition the search space in half. Of course, if we are very unlucky<br />

we will partition such that we only cut away one element each time, for a worst<br />

case performance of O(n 2 ).<br />
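The two procedures combine into the usual randomized selection routine; the following is an iterative Java sketch (the names and the 1-based rank convention are illustrative, not the thesis's exact code). To remove the i largest elements of A, one selects the (|A| − i)-th smallest and keeps the low side of the final partition.

```java
import java.util.Random;

// Hedged sketch of randomized selection with Hoare-style partitioning.
final class RandomizedSelect {
    private static final Random RNG = new Random();

    // Returns the i-th smallest element of a (1-based i), expected O(n).
    static double select(double[] a, int i) {
        int p = 0, r = a.length - 1;
        while (true) {
            if (p == r) return a[p];
            int q = randomizedPartition(a, p, r);
            int k = q - p + 1;               // size of the low side a[p..q]
            if (i <= k) { r = q; }           // the answer lies in the low side
            else { i -= k; p = q + 1; }      // continue in the high side
        }
    }

    // Swap a random element to the front, then partition around it.
    private static int randomizedPartition(double[] a, int p, int r) {
        int x = p + RNG.nextInt(r - p + 1);
        double t = a[p]; a[p] = a[x]; a[x] = t;
        return partition(a, p, r);
    }

    // Hoare partition: afterwards every element of a[p..j] is <= every
    // element of a[j+1..r], and p <= j < r, guaranteeing progress.
    private static int partition(double[] a, int p, int r) {
        double x = a[p];
        int i = p - 1, j = r + 1;
        while (true) {
            do { j--; } while (a[j] > x);
            do { i++; } while (a[i] < x);
            if (i < j) { double t = a[i]; a[i] = a[j]; a[j] = t; }
            else return j;
        }
    }
}
```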

10.2 Performance of the selection algorithm<br />

The selection algorithm runs in expected linear time, but worst case quadratic<br />

time. The following test shows that the performance is indeed linear, as expected.<br />

The test was done by running the randomized selection algorithm 50<br />

times on input sizes starting at 10000 and increasing the size by 1000 for each<br />

new run. Each run consists of 100 repetitions. A plot is made of input size in<br />

number of elements, against running time in milliseconds. This plot is shown in<br />

Figure 10.1. The plot clearly shows that the selection algorithm does indeed run<br />

in linear time, as expected. If one is not satisfied with this algorithm, [CLR90]<br />

has a more complicated selection algorithm which runs in guaranteed worst case<br />

linear time.<br />



[Plot: running time (ms.) vs. input size for Randomized-Select, with best fit ax + b.]<br />

Figure 10.1: Performance of the randomized selection algorithm.<br />



Chapter 11<br />

JSplits<br />

Figure 11.1: A screenshot of the JSplits user interface.<br />

11.1 What is JSplits?<br />

JSplits or SplitsTree 4 is the latest version of a program initially developed by<br />

Dr. Daniel Huson in cooperation with Rainer Wetzel at the University of Bielefeld,<br />

and later development has been continued by Dr. Huson at the University<br />



of Tübingen. In Dr. Huson’s words, SplitsTree is a program for analyzing and<br />

visualizing evolutionary data.<br />

The first version of SplitsTree was presented in an article from 1996 ([DHM96]),<br />

and a later version followed in 1998 ([Hus98]). Up to version 3.2, SplitsTree was<br />

written in C++, but starting from SplitsTree 4 the program will be developed in<br />

Java. Successive versions have incorporated more and more phylogenetic methods,<br />

and this latest version features a plug-in facility which enables others to extend<br />

the functionality of the program by adding new tree reconstruction methods and<br />

analysis tools. The author has gratefully taken advantage of this in his work.<br />

The JSplits program is available from this website:<br />

http://www-ab.informatik.uni-tuebingen.de/software/jsplits/<br />

In this work the author has used JSplits/ SplitsTree 4 version 3 beta,<br />

but at the time of writing the JSplits development team has reached version 4 beta,<br />

and the author is unsure whether this has introduced any inconsistencies.<br />

11.2 Using JSplits<br />

JSplits runs on any computer, Linux or Windows, with a JRE v1.4.2 (Java<br />

Runtime Environment, available from http://www.java.com), which is very<br />

handy indeed. It has a very simple user interface, as depicted in Figure 11.1.<br />

To use JSplits to visualize a data set, one just opens the data set from a file, and<br />

selects from a plethora of different tree methods using the drop down menus.<br />

The result of applying a method to a dataset in JSplits is some graphical<br />

representation depending on the method. For refined <strong>Buneman</strong> trees one gets<br />

a tree graph with labeled leaves. It is possible to drag vertices to new positions<br />

while keeping branch lengths constant, so one can sculpt the tree as one wishes.<br />

It is also possible to zoom in on interesting regions, which is very useful for large<br />

trees. The JSplits homepage might reveal more features, even some that the<br />

author is not aware of, since JSplits development has raced ahead of this work.<br />

One minor drawback to JSplits is that it only supports datasets in the Nexus<br />

format. The test data used in this work, a set of protein families from the PFAM<br />

database, is given in Phylip format, so it is not immediately applicable. However,<br />

the translation from Phylip to Nexus is straightforward, and the author<br />

has written such a module (see chapter 12). The Nexus format is described in<br />

[MSM97], but a simple google search will also provide several resources, including<br />

format translation programs:<br />

http://www.google.com/search?q=nexus+format<br />

11.3 Extending JSplits<br />

The implementation of the refined <strong>Buneman</strong> tree algorithm described in this<br />

work is aimed at integration into JSplits. Java is perhaps not the first choice for<br />



implementing an algorithm with a high time and space complexity. However,<br />

by implementing the algorithm in Java and integrating it into JSplits as a plugin,<br />

the author has been handed a free user interface that fits perfectly with<br />

this algorithm. Also, this method can hopefully become an integrated part of<br />
future releases of JSplits, which is easily and freely available, thereby contributing<br />
something valuable to the field of tree reconstruction.<br />

JSplits plug-ins fall into categories based on input and output. The refined<br />

<strong>Buneman</strong> tree algorithm will take a distance matrix as input and return a compatible<br />

set of splits, which can then be visualised in JSplits. Other methods<br />

might take a set of aligned nucleotide sequences and return tree structures. A<br />

full list of the extensive set of implemented methods and possible types of new<br />

methods can be found in the JSplits source code documentation, which is probably<br />

available upon request from Dr. Huson. At the time of writing it is not<br />

publicly available.<br />

The general way of creating a JSplits plug-in consists of two steps:<br />

• Implementing an interface from the catalogue of possible interfaces available<br />

in JSplits<br />

• Placing the implementing file correctly in the JSplits directory structure,<br />

which is grouped by input type<br />

The JSplits application will automatically detect the presence of implemented<br />

methods and create drop-down menu options for them. So the plug-in<br />

facility is very easy to use indeed, and the author experienced no problems with<br />

it at all.<br />

The author did experience a few problems with the classes used in JSplits.<br />

For some odd reason, the taxa list and input distance data are given in arrays<br />
with offset 1 instead of the traditional offset 0 used in most other computer<br />
science contexts, particularly in the Java programming language. But in spite<br />

of this minor quirk, the author has been very satisfied working with JSplits.<br />

11.4 The Distances2Splits interface<br />

In JSplits terminology, the refined <strong>Buneman</strong> tree algorithm falls in the category<br />

distances-to-splits algorithm, and Dr. Huson pointed out that the author should<br />

implement his work as a Java class implementing the Distances2Splits interface<br />

found in JSplits. That interface looks like this:<br />

package splits.algorithms.distances;<br />
<br />
interface Distances2Splits extends DistancesTransform<br />
{<br />
    boolean isApplicable(Taxa taxa, Distances d);<br />
<br />
    Splits apply(Taxa taxa, Distances d) throws Exception;<br />
}<br />



The interface has two methods that need to be implemented. Firstly, we<br />

would have to implement the method isApplicable which takes two parameters,<br />

a list of taxa (names of species) and a set of distance data (in the form<br />

of a matrix). The point is not to test if our algorithm can accept this type of<br />

input; we know that already, since we are implementing the Distances2Splits<br />

interface. Instead, the method is designed to test the quality of the input. Our<br />

refined <strong>Buneman</strong> tree algorithm requires a minimum input size of four, for example.<br />

So in case the input data does not meet such requirements, we will say that we cannot handle the<br />

data, and JSplits can then use that information to ensure our algorithm will<br />

never have to be computed with this input. JSplits handles this by blanking<br />

out the drop down menu option for our algorithm. Secondly, we have to implement<br />

the apply method. We are given the same parameters as before, and<br />

we are required to return a set of splits. JSplits can use these splits to create a<br />

graphical representation of the refined <strong>Buneman</strong> tree.<br />

The author has placed all source code in a separate package so as to not<br />

pollute the JSplits directory structure with irrelevant files. The apply method<br />

basically only performs translation from the (oddly indexed) JSplits data to<br />

a simple distance matrix of type double[][]. This simple matrix is then<br />

passed to the real implementation in the separate package, which is an implementation<br />

of the following interface. We simply ignore the taxa list for now.<br />

package dk.birc.rbt;<br />

interface Distances2SplitsAlgorithmInterface<br />

{<br />

Split[] compute(double[][] distanceMatrix);<br />

}<br />
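The re-indexing performed in the apply method can be sketched as follows. This is an illustrative, self-contained fragment, not actual JSplits or dk.birc.rbt code; the class and method names are hypothetical, and the JSplits accessor API itself is not shown.<br />

```java
// Illustrative sketch only: convert a 1-indexed (n+1) x (n+1) distance
// matrix, in the style JSplits provides, into the 0-indexed n x n matrix
// that the dk.birc.rbt package expects. Row and column 0 are simply unused.
class IndexTranslation {
    static double[][] toZeroIndexed(double[][] oneIndexed) {
        int n = oneIndexed.length - 1;
        double[][] zeroIndexed = new double[n][n];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                zeroIndexed[i - 1][j - 1] = oneIndexed[i][j];
        return zeroIndexed;
    }
}
```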

The Split class thus returned has the following type:<br />

package dk.birc.rbt;<br />

class Split<br />

{<br />

boolean[] split;<br />

double weight;<br />

}<br />

Finally the array of splits is translated into a set of splits in the JSplits<br />

sense, which basically involves re-indexing and adding taxa names, and that set<br />

is returned.<br />



Chapter 12<br />

Source Code<br />

The source code for the implementation described in the previous chapters is<br />

available from the following webpage. As the implementation has been done in<br />

Java, it should be able to run on any machine of any architecture<br />

using any operating system, provided a Java Runtime Environment (JRE)<br />

is available.<br />

http://www.daimi.au.dk/~lasse/thesis/<br />

The Java source code files are available for browsing online, and for downloading<br />

in a Zip file. The code is packaged¹ in two directory structures:<br />

dk.birc.rbt contains source code for the implementation of the refined <strong>Buneman</strong><br />

tree algorithm. The code is modularized into one .java file for each<br />

Java class for a total of 31 source files, containing roughly 2600 lines of<br />

code.<br />

splits.algorithms.distances contains the plugin code that adapts the refined<br />

<strong>Buneman</strong> method to the JSplits software package. Basically this consists<br />

of one file which performs simple transformations between the formats in<br />

JSplits and in the refined <strong>Buneman</strong> implementation.<br />

All Java classes and methods have been decorated using the Javadoc format<br />

([Sun03]), and thus Javadoc documentation of the source code tree is available<br />

for online browsing, using the webpage mentioned above.<br />

The implementation has furthermore been documented in a series of UML<br />

diagrams, also available from the webpage, in PostScript format. The UML<br />

diagrams are generally too large to fit on A4 paper, therefore they have not<br />

been included in this thesis. These diagrams should convey an overview of the<br />

implementation.<br />

1 packaged in the Java sense: using a reverse domain name scheme should generate a unique<br />

namespace, and the directory structure follows the package name with ’.’ replaced by ’/’<br />



Part III<br />

Tests and Experiments<br />



Chapter 13<br />

The Reference<br />

Implementation<br />

For testing purposes the author has implemented a package reference, which<br />

in a simple and transparent (and hopelessly inefficient) manner calculates the<br />

refined <strong>Buneman</strong> tree.<br />

The reference implementation is intended to provide a basis for comparison<br />

against the implementation of the [BFÖ+ 03] algorithm. To justify that the<br />

latter has been implemented correctly, we argue first that our reference implementation<br />

is correct, and thereafter we demonstrate that on the same input,<br />

the two implementations agree completely. These two arguments together form<br />

a basis for concluding that the [BFÖ+ 03] algorithm has been implemented correctly.<br />

13.1 A simple refined <strong>Buneman</strong> tree algorithm<br />

Algorithm 11 is a very simple algorithm that calculates the refined <strong>Buneman</strong><br />

tree. It has been adapted directly from the definition of the refined <strong>Buneman</strong><br />

tree.<br />

Line 1 initialises a set S to be the empty set. In line 2, we iterate over all<br />

possible splits in σ(X), avoiding duplicates by interpreting splits as bitvectors,<br />

and counting (using bit-flipping) through exactly half the possible splits. For<br />

each split σ, we initialise in line 3 a set of quartets Q to the empty set. In line 4<br />

we iterate over all possible quartets in q(σ) and add them to Q in line 5. In line<br />

7 we sort Q in increasing order so that in line 8 we can sum over the n − 3 least<br />

scoring quartets to find the refined <strong>Buneman</strong> index. In lines 9–10 we report the<br />

splits with positive refined <strong>Buneman</strong> index.<br />

The complexity of this algorithm is quite horrible: first we iterate over O(2^n)<br />

splits, and for each split we iterate over O(n^4) quartets. The algorithm therefore<br />

runs in time O(2^n · n^4).<br />



Algorithm 11 A naive algorithm that computes the refined <strong>Buneman</strong> tree<br />

Require: δ is a distance matrix of size n × n<br />

Ensure: S = RB(δ)<br />

1: S = ∅<br />

2: for σ ∈ σ(X) do<br />

3: Q = ∅<br />

4: for q ∈ q(σ) do<br />

5: Q = Q ∪ q<br />

6: end for<br />

7: SORT(Q)<br />

8: w = (1/(n − 3)) ∑_{i=1}^{n−3} β_{Q[i]}(δ)<br />

9: if w > 0 then<br />

10: S = S ∪ σ<br />

11: end if<br />

12: end for<br />
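The computation in lines 7–8 can be sketched in Java as follows. This is a hedged illustration only: the quartet scores β_q(δ) are assumed to be precomputed in a plain array, which is not necessarily how the actual implementation stores them.<br />

```java
import java.util.Arrays;

class NaiveIndex {
    // Refined Buneman index of a split: the mean of the n - 3 smallest
    // Buneman scores beta_q(delta) over the quartets of the split.
    // The scores are assumed precomputed; n is the number of taxa.
    static double refinedBunemanIndex(double[] quartetScores, int n) {
        double[] sorted = quartetScores.clone();   // line 7: SORT(Q)
        Arrays.sort(sorted);
        double sum = 0;                            // line 8: sum the n - 3
        for (int i = 0; i < n - 3; i++)            // smallest scores...
            sum += sorted[i];
        return sum / (n - 3);                      // ...and average them
    }
}
```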

13.2 Implementation highlights<br />

There are two important issues in this algorithm: we need to avoid considering<br />

duplicate splits, and we need to avoid considering duplicate quartets. Both have<br />

built-in symmetries.<br />

Firstly, let us consider splits as bitvectors. We can start out with the bitvector<br />

containing all zeros, and count through splits by bit-flipping. We flip from<br />

low-order end to high-order end every bit that is ’1’. When we reach a bit that is<br />

’0’, we flip it to ’1’ and stop. Then we have the next split. To avoid duplicates,<br />

we can just restrict ourselves to counting through the first n − 1 bits, leaving<br />

the nth bit as ’0’.<br />
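The counting scheme described above might be sketched as follows; this is an illustrative fragment, not the thesis code itself.<br />

```java
import java.util.ArrayList;
import java.util.List;

class SplitEnumerator {
    // Enumerate all 2^(n-1) distinct splits of n taxa as bitvectors.
    // A split and its complement denote the same split, so fixing the
    // nth bit at '0' ensures each split is generated exactly once.
    static List<boolean[]> enumerate(int n) {
        List<boolean[]> splits = new ArrayList<>();
        boolean[] bits = new boolean[n];
        while (true) {
            splits.add(bits.clone());
            int i = 0;
            // flip every '1' from the low-order end to '0'...
            while (i < n - 1 && bits[i]) { bits[i] = false; i++; }
            if (i == n - 1) break;  // would disturb the fixed nth bit: done
            bits[i] = true;         // ...then flip the first '0' to '1'
        }
        return splits;
    }
}
```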

Regarding quartets, we need to recognize that quartets are symmetric on<br />

either side of their central edge, i.e. the quartet ab|cd is the same quartet as<br />

ba|cd, ab|dc and ba|dc. For the split σ = U|V we say that a, b ∈ U and c, d ∈ V ,<br />

and to avoid duplicates we just have to make sure that a ≤ b and c ≤ d.<br />
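This symmetry-breaking can be sketched like so (illustrative only; the split sides U and V are given as arrays of taxon indices, and strict inequality is used since the four taxa of a quartet are distinct):<br />

```java
import java.util.ArrayList;
import java.util.List;

class QuartetEnumerator {
    // Generate every quartet ab|cd of a split U|V exactly once by
    // requiring a < b (both from U) and c < d (both from V), which
    // breaks the symmetry ab|cd = ba|cd = ab|dc = ba|dc.
    static List<int[]> quartets(int[] U, int[] V) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < U.length; i++)
            for (int j = i + 1; j < U.length; j++)
                for (int k = 0; k < V.length; k++)
                    for (int l = k + 1; l < V.length; l++)
                        result.add(new int[] { U[i], U[j], V[k], V[l] });
        return result;
    }
}
```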

13.3 Correctness of the reference implementation<br />

We are going to use this reference implementation of the refined <strong>Buneman</strong> tree<br />

algorithm to demonstrate the correctness of our implementation of the refined<br />

<strong>Buneman</strong> tree algorithm described in [BFÖ+ 03]. But first we must convince<br />

ourselves that the reference implementation is correct.<br />

To this end, the author has written a test program for the reference implementation.<br />

The test program generates a random distance matrix, and during<br />

the computation of the reference implementation the program will output:<br />

• the distance matrix.<br />



• the splits that are generated.<br />

• the quartets that are generated for each split.<br />

• the <strong>Buneman</strong> scores for the quartets.<br />

• the <strong>Buneman</strong> Index for each split.<br />

• the weighted splits reported by the algorithm.<br />

After running the test program it is then possible to study the output from<br />

the program and ensure:<br />

• all unique splits of size n are generated, i.e. all possible splits of the form<br />

0xxx.<br />

• all quartets ab|cd are generated such that a, b ∈ U, c, d ∈ V , a ≤ b and<br />

c ≤ d.<br />

• for each quartet q generated, β q (δ) is calculated correctly.<br />

• for each split σ generated, μ σ (δ) is calculated correctly.<br />

• the weighted splits reported by the algorithm all have positive refined<br />

<strong>Buneman</strong> index, and all splits generated that have positive refined <strong>Buneman</strong><br />

index are reported.<br />

Of course, this process is quite infeasible for larger δ, but the author has<br />

read through outputs from a few runs, and has found no errors. The author<br />

therefore concludes that the reference implementation is correct. An example<br />

of such an output is given in appendix A.<br />

13.4 Performance of the reference implementation<br />

The performance characteristic for the algorithm can be seen in Figure 13.1.<br />

The plot shows running time for input sizes 4–20. Clearly, it is infeasible to<br />

run this implementation on large examples, but for our testing purposes it is<br />

adequate.<br />



Figure 13.1: Performance of the reference implementation (running time in ms. against input size, with best fit).<br />



Chapter 14<br />

Correctness<br />

This chapter deals with testing the correctness of the refined <strong>Buneman</strong> tree<br />

algorithm. The algorithm produces a set of splits S, and to determine if this<br />

set of splits is the set RB(δ) we could ask a number of questions:<br />

• Is S a compatible set of splits?<br />

• Is it the case that μ s (δ) is positive for all s ∈ S?<br />

• Is it the case that μ s′ (δ) is not positive for all s′ ∈ ¯S?<br />

The first two questions can be answered easily by inspecting S, but they do<br />

not form a basis for any conclusions regarding the correctness of the implementation.<br />

Question 3 is not easily answered, it would require that we tested all<br />

possible splits not in S to see if they had positive refined <strong>Buneman</strong> index, and<br />

we have seen earlier that this search space is quite huge. Here lies the key to<br />

answering the question of correctness: even if we have a compatible set of splits,<br />

all with positive score, we only know that we have a subset of RB(δ), but we<br />

do not know if we have actually produced the whole set.<br />

In fact, questions two and three together provide necessary and sufficient<br />

conditions for determining correctness, since Moulton and Steel in [MS99] have<br />

shown that the set {σ : μ σ > 0} is a compatible set of splits. But since question<br />

three is too hard to answer we shall have to adopt a different test strategy.<br />

14.1 Test strategy<br />

Instead of tackling the questions above, we shall use a completely different<br />

strategy. We shall build a chain of arguments that will lead us to conclude that<br />

our implementation of the [BFÖ+ 03] algorithm is correct.<br />

The basis of our argumentation relies on the reference package described<br />

in the previous chapter. We have to assume that our reference package can<br />

correctly compute the refined <strong>Buneman</strong> tree, and the previous chapter argues<br />

that the package meets this condition.<br />



Assuming we have a correct implementation that calculates the refined <strong>Buneman</strong><br />

tree, we can proceed to compare output from this reference implementation<br />

against output from our implementation of the [BFÖ+ 03] algorithm. Given the<br />

same input in the form of a distance matrix δ, the two implementations should<br />

produce the same output in the form of a set of weighted splits.<br />

It is of course impossible to test all possible input in the two methods, so we<br />

will have to restrict ourselves to testing some number of different inputs N, such<br />

that N is large enough to persuade anyone that the methods always produce<br />

the same output on identical input. Assuming the reference implementation is<br />

correct, and that the two methods always compute the same algorithm we can<br />

trust that our implementation of the [BFÖ+ 03] algorithm is indeed correct.<br />

Strictly speaking, looking at N sets of output and observing only identical sets will<br />

not allow us to conclude anything, merely that our implementation has<br />

not yet been proven to be incorrect. But in practice the author believes that<br />

for some huge number N any reader should be persuaded to believe that the<br />

implementation is correct. One weakness in this approach would be the running<br />

time of the reference implementation which will quickly put a practical limit on<br />

N.<br />

14.2 Test setup<br />

The goals of the test are clear; we wish to provide a large number of inputs, and<br />

test if the implementation of the [BFÖ+ 03] algorithm outputs the same result as<br />

the reference implementation does. Each implementation adheres to the same<br />

interface, outputting an array of weighted splits, but the output is unsorted.<br />

Therefore our test program must run the two algorithms and sort the outputs<br />

before we can compare the results.<br />
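The sort-and-compare step might look roughly like this. This is a sketch with a minimal stand-in for the Split class, not the actual test program; in practice one might also prefer comparing weights with a small tolerance rather than exact equality.<br />

```java
import java.util.Arrays;

class OutputComparison {
    // Minimal stand-in for the dk.birc.rbt Split class.
    static class Split {
        boolean[] split;
        double weight;
        Split(boolean[] s, double w) { split = s; weight = w; }
        // Canonical key: the bitvector as 0/1 characters, then the weight.
        String key() {
            StringBuilder sb = new StringBuilder();
            for (boolean b : split) sb.append(b ? '1' : '0');
            return sb.append('@').append(weight).toString();
        }
    }

    // The outputs are unsorted, so sort canonical keys before comparing.
    static boolean sameOutput(Split[] a, Split[] b) {
        if (a.length != b.length) return false;
        String[] ka = new String[a.length];
        String[] kb = new String[b.length];
        for (int i = 0; i < a.length; i++) {
            ka[i] = a[i].key();
            kb[i] = b[i].key();
        }
        Arrays.sort(ka);
        Arrays.sort(kb);
        return Arrays.equals(ka, kb);
    }
}
```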

The source code for the test program can be found in the package test,<br />

as described in chapter 12, and it is quite straightforward. A random distance<br />

matrix is created and both implementations are run with that input. Output is<br />

sorted and compared, and the program will report nothing¹ in the case where<br />

there are no discrepancies on output, and in case the outputs differ it will report<br />

a CorrectnessException with the following class signature:<br />

package test;<br />

class CorrectnessException extends Exception<br />

{<br />

CorrectnessException()<br />

{<br />

super("Output not identical!");<br />

}<br />

}<br />

1 it will actually output something, namely a line saying “Test completed successfully.”<br />



The best possible way of testing would be to test both vertically (many repetitions<br />

with the same input size) and horizontally (a wide range of input sizes)<br />

simultaneously. Unfortunately the performance of the reference implementation<br />

very much limits our options, and we are forced to test the two directions<br />

separately.<br />

14.3 Test results<br />

From inspecting the output of the vertical and horizontal tests, it is clear that<br />

on all input, the two implementations output exactly the same set of weighted<br />

splits. Test output consists of a single line of text saying the test completed<br />

successfully. Of course, in case the test program found that something was<br />

NOT correct, it would report an exception. Output might look like this:<br />

$ java test.CorrectnessTest 1000 4 17<br />

Test completed successfully.<br />

The vertical test was able to complete 1000 repetitions of runs of size 4-17,<br />

for a total of 13,000 runs. No errors were encountered. Note that running the<br />

two implementations 1000 times each on input size 17 took around 15 hours to<br />

complete, and the author thought it pointless to continue the test any further.<br />

The horizontal test was done using only 1 repetition of runs starting from<br />

input size 4. The test was able to complete sizes up to 22 before it was interrupted<br />

due to impatience, after running overnight for two nights. No errors were<br />

encountered. The awful time complexity of the algorithm makes it infeasible to<br />

continue the test much further.<br />

In themselves, these 13,000+ test runs do not prove anything, but they do<br />

give a very high degree of confidence in the correctness of the implementation.<br />

It is, of course, impossible to test all possible inputs, so we will have to satisfy<br />

ourselves with some finite number of runs. And this author believes 13,000 is a<br />

satisfactory number of test runs.<br />

A further argument to support the conclusion is that when the two implementations<br />

completely agree on all inputs, it is very likely that they are both<br />

correct. It is of course possible that they are both incorrect, and that the test<br />

outputs merely show that the two implementations agree on being wrong. However,<br />

the author estimates that the probability of creating two independent and<br />

very different implementations that would produce exactly the same errors on<br />

identical input is very low indeed.<br />

The author believes he has argued thoroughly for the correctness of the<br />

implementation of the simple algorithm, and therefore provided a sound basis<br />

for comparison. Also, the author believes that the number of test runs should<br />

persuade anyone that the two implementations always produce the same output<br />

on identical input. Combining this with the transparent source code and<br />

correctness tests of the reference implementation, the author concludes that the<br />

implementation of the refined <strong>Buneman</strong> tree algorithm from [BFÖ+ 03] is indeed<br />

correct.<br />



Chapter 15<br />

Complexity<br />

This chapter deals with the time and space complexity of the RBT-algorithm.<br />

Before the advent of [BFÖ+ 03], a non-trivial algorithm for computing the RBT<br />

did exist ([BB99]), but with a running time of O(n^5) and space consumption<br />

O(n^4). The goals of [BFÖ+ 03] were to reduce these factors and make RBTs<br />

computationally competitive to methods based on neighbor joining and on plain<br />

<strong>Buneman</strong> trees, and in turn one of the goals of this thesis is to implement this<br />

algorithm and demonstrate how RBTs can be used in practice. Therefore we<br />

need to demonstrate that the implementation does indeed run in time O(n^3)<br />

and space O(n^2).<br />

15.1 Running time<br />

To analyse the running time of the algorithm the author has written the test<br />

program which is parameterized by starting input size, ending input size and<br />

number of repetitions per input size. The test program measures the running<br />

time of the refined <strong>Buneman</strong> tree algorithm by measuring the system<br />

time before and after each computation of the algorithm, excluding the initialization<br />

of a random input matrix. Input size and running time is reported for<br />

each repetition.<br />

The timing is not optimal; the running time does not reflect the exact running<br />

time of the algorithm, since the algorithm does not run exclusively on the<br />

test PC. However, the author made sure that the test PC was largely unused<br />

during testing, meaning that no users and only a few system processes were<br />

running during the time trials. Therefore the timing data might show a slightly<br />

skewed picture of the running time, but very importantly, the skew is evenly<br />

distributed and thus does not affect the test goal, which is to determine the<br />

asymptotic running time behaviour of the algorithm.<br />

The test was run with the following parameters:<br />

• Input start size 4<br />



Figure 15.1: The running time of the refined <strong>Buneman</strong> algorithm (running time in ms. against input size, with best fit). Notice the artifact at power of two intervals. This stems from the quad tree data structure described in chapter 8.<br />

• Input stop size 174¹<br />

• Number of repetitions 100<br />

To analyse the asymptotic running time performance of the algorithm, the<br />

test data was analysed using the nonlinear least-squares (NLLS) Marquardt-<br />

Levenberg algorithm via the gnuplot ([gnu99]) fit command.<br />

The expectation was that the performance of the algorithm would be O(n^3),<br />

and therefore the fit function was chosen to be f(x) = ax^b. The fit of the<br />

function f with the whole dataset (input sizes 4–174) returned the following<br />

values: a = 0.0130805 and b = 3.02047. A plot of the data along with the<br />

function f is shown in Figure 15.1. The figure shows a plot of the running time<br />

in milliseconds against the input size in number of species.<br />
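The exponent estimate can also be reproduced without gnuplot by ordinary least squares on log-transformed data, since t ≈ a·n^b implies log t = log a + b·log n. The following self-contained sketch is not part of the thesis tooling; note also that gnuplot's fit minimises squared error on the untransformed data, so the two estimates will differ slightly in general.<br />

```java
class PowerLawFit {
    // Fit t = a * n^b by linear regression of log(t) on log(n).
    // Returns { a, b }.
    static double[] fit(double[] n, double[] t) {
        int m = n.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(n[i]);
            double y = Math.log(t[i]);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        double b = (m * sxy - sx * sy) / (m * sxx - sx * sx);  // slope
        double logA = (sy - b * sx) / m;                       // intercept
        return new double[] { Math.exp(logA), b };
    }
}
```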

Figure 15.1 shows an artifact from the implementation, namely a jump in<br />

running time performance consistent with intervals matching powers of 2. This<br />

artifact stems from the dimensioning of the quad tree data structure described in<br />

1 Actually the stop size was 200, but after 3 consecutive days of running the test it was<br />

halted due to impatience.<br />



Figure 15.2: The running time of the refined <strong>Buneman</strong> algorithm determined from two different input size intervals (best fits for sizes up to 64 and up to 65).<br />

chapter 8. Due to this artifact, estimating asymptotic running time performance<br />

yields different results depending on the input size range that forms the basis<br />

for the analysis. For example, estimating the running time performance on a<br />

dataset limited to size 64 returns a = 0.0489006 and b = 2.58288. On the other<br />

hand, a dataset up to size 65 yields a = 0.00330764 and b = 3.27513. This<br />

difference is illustrated in Figure 15.2. In both cases, however, the estimate of b<br />

is close to 3, so the estimate of b based on the whole dataset is very reliable.<br />

15.2 Space requirements<br />

Measuring space consumption of a (Java) program is no easy task. As we are<br />

interested in verifying that the space complexity is identical to the result from<br />

the analysis in chapter 5, it would be handy to have a tool that measures the<br />

maximum heap size used by a program during its execution time. The author<br />

was not able to find such a tool.<br />

Instead, the author has looked at several tools which might provide at least<br />

a partial solution. Such tools would take a snapshot of the heap size used by the<br />

program, and we would only need to run it often enough to get a reasonable estimate<br />



of the size of the heap during the execution time. Hopefully the data extracted<br />

in this way would present a clear image of the heap size trend.<br />

15.2.1 The Linux ps command<br />

The first option considered was the Linux command ps (report process status),<br />

which reports various information about a process or group of processes, including<br />

allocated memory. However, since we are measuring the size of a Java<br />

program running on a Java Virtual Machine, any information obtained from<br />

beyond the JVM would only reflect the characteristics of the JVM, and would<br />

therefore be very unreliable. We only know that the program we wish to profile<br />

is contained inside the space of the JVM, but we have no clue how big the<br />

program actually is. We would only have an upper limit.<br />

An example of the uselessness of the ps command when dealing with the<br />

JVM is the fact that we might set initial heap sizes differently, for example a<br />

high value and a low value, observe that the memory consumption is different in<br />

the two cases, but also that the memory consumption reported by ps grows in<br />

both cases! Clearly, if we set the initial heap size low to begin with, and measure<br />

the maximum heap size reported by ps, and then set the initial heap size in a<br />

new experiment higher than the previously reported maximum, we would not<br />

expect that the reported heap size would ever exceed the new minimum. But it<br />

does!<br />

The experiment is a bit tricky; we need to fork a process with the JVM<br />

running on some sample data. Then quickly, before the process terminates, we<br />

must run ps -vg a few times to get snapshots of the process running. But it is<br />

possible to do, and here are the results: firstly, runs with small heap size, where<br />

the reported start and end process size are around 10 and 14 MB. (Note: the<br />

output has been edited to provide clarity)<br />

$ java -Xms1000000 -Xmx1000000000 RBT test.phylip &<br />

[1] 29376<br />

$ ps -v<br />

PID TIME RSS %MEM<br />

29376 0:03 10868 2.1<br />

29376 0:12 11452 2.2<br />

29376 0:22 13472 2.6<br />

Secondly, a large initial heap size, reporting a memory usage between 16 and<br />

26 MB. (Note: the output has been edited to provide clarity)<br />

$ java -Xms100000000 -Xmx1000000000 RBT test.phylip &<br />

[1] 29420<br />

$ ps -v<br />

PID TIME RSS %MEM<br />

29420 0:02 16044 3.1<br />

29420 0:08 22264 4.3<br />

29420 0:17 25840 5.0<br />



Clearly, this is useless for profiling memory complexity of the algorithm. And<br />

with good reason. The JVM is an advanced optimizing interpreter, which might<br />

or might not trade memory for clock cycles in its optimizations or compile bits<br />

of a program to native code to speed up critical sections. So this crude profiling<br />

method will not yield any useful information.<br />

15.2.2 JVM garbage collector log<br />

The second option considered for profiling the memory consumption of our<br />

refined <strong>Buneman</strong> tree algorithm implementation is the JVM garbage collector<br />

log. The garbage collection log is provided as a non-standard option for the<br />

JVM and can be accessed by starting the JVM with a command of the form:<br />

java -Xloggc:logfile command line parameters...<br />

The garbage collector log file will contain entries documenting all runs of<br />

the JVM garbage collector - or rather, collectors. The JVM uses generational<br />

garbage collection, with one minor collector running often, and a major collector<br />

running rarely. The minor collector will only take “bites off the top” of the<br />

part of the heap that is reclaimable, while the major collector will clean up<br />

completely. The latter is of course an expensive operation. A sample of a<br />

garbage collector log can be found in appendix B. Notice the low frequency of<br />

major garbage collections (1, marked Full GC) compared to minor ones (127,<br />

marked GC). The log entry format contains the following information:<br />

• Timestamp<br />

• Heap size before GC<br />

• Heap size after GC<br />

• Total heap size (in parenthesis)<br />

• Time to complete GC<br />
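Extracting the heap figures from a major-collection entry can be sketched with a regular expression. The entry format shown in the comment is an assumption based on JDK 1.4-era logs; it varies between JVM versions, so the pattern would need adjusting in practice.<br />

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcLogParser {
    // Assumed entry format (varies by JVM version):
    //   "4.231: [Full GC 10010K->4589K(130176K), 0.0436870 secs]"
    static final Pattern FULL_GC =
        Pattern.compile("\\[Full GC (\\d+)K->(\\d+)K\\((\\d+)K\\)");

    // Returns { heapBeforeK, heapAfterK, totalHeapK }, or null if the
    // line is not a major-collection entry.
    static long[] parse(String line) {
        Matcher m = FULL_GC.matcher(line);
        if (!m.find()) return null;
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3))
        };
    }
}
```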

So how do we use this information to profile the performance of our program<br />

Clearly, minor GC reports are useless, as they are only some upper limit, which<br />

tends to be much too large (looking at the example in appendix B we see that<br />

a round of major GC is able to reclaim 75% of the heap that the minor GC has<br />

not touched). So we have to rely on the infrequent major GC counts to estimate<br />

the size of the part of the JVM heap that is actually used. Further information<br />

about the JVM and its garbage collectors can be found e.g. here:<br />

http://java.sun.com/docs/hotspot/gc/<br />



Figure 15.3: The space usage of the refined <strong>Buneman</strong> algorithm (space used in kilobytes against input size, with best fit x^a).<br />

15.2.3 Test results<br />

The experiment was completed on a large part of the PFAM test data, resulting<br />

in 83 useful² out of 130 possible garbage collection logs. The smallest matrices<br />

in the PFAM set do not require a full garbage collection, and the author was<br />

running out of time, so some of the largest data sets in the 500–700 size range<br />

were not tested.<br />

For each garbage collector log, the author ran the command grep Full to extract results from the major garbage collector. After looking over<br />

a few logs it was determined that the last major garbage collection would always<br />

report the largest space consumption for that computation, so the author<br />

ran the command tail -n 1 to find these maximums.<br />

Afterwards, using cut-and-paste and the Linux wc command, a dataset consisting<br />

of number of species and maximum space consumption was compiled. The<br />

data was analysed using gnuplot, fitting the data to a function f(x) = x^a, and<br />

gnuplot returned the result a = 1.75287. A plot of the data and the function<br />

f is available in Figure 15.3.<br />

Regarding the accuracy of this experiment, the author would like to mention<br />

that the data mining was rather sporadic; we rely on the results from the<br />

2 Log files where there are invocations of the major garbage collector<br />



major garbage collector, but this corresponds to a very selective or random sampling<br />

— in between these major garbage collections there might be space usage<br />

peaks that we are unaware of. Also, Figure 15.3 shows that the data is very<br />

scattered, especially in the 500–700 size range, and the author would wish to<br />

repeat the experiment using more repetitions, and to complete the experiment<br />

for all data in the 10–700 size range, at least. Another thing that introduces uncertainty<br />

is the JVM and its ability to do time/space trade-offs for optimization<br />

purposes. The author does not have much insight into this matter. A different<br />

approach to this experiment would be to use a profiling tool. Google reveals a<br />

lot of commercial and free products, but the author has not had the time to try<br />

out this approach.<br />

In conclusion, it appears that the implementation of the refined <strong>Buneman</strong><br />

tree algorithm does conform to its theoretical space complexity bounds. However,<br />

there are a number of uncertainties about the experiment described above,<br />

and if the experiment had shown a less favorable result the author would have<br />

been rather sceptic with respect to those results. It is therefore only fair to<br />

mention that the experiment is not ideal and that its results should be viewed<br />

with some scepticism.<br />



Chapter 16<br />

Comparing Evolutionary<br />

Tree Methods<br />

How does the refined <strong>Buneman</strong> tree algorithm fare compared to other known<br />

tree reconstruction methods? Firstly, we note that there is a trivial relationship<br />

between <strong>Buneman</strong> trees and refined <strong>Buneman</strong> trees, since B(δ) ⊆ RB(δ) for any<br />

δ, by definition. But it would be interesting to see just how much less restrictive<br />

the refined <strong>Buneman</strong> tree method is compared to its unrefined counterpart.<br />

Secondly we shall look at the well-known Neighbor-Joining tree reconstruction<br />

method, introduced by Saitou and Nei in [SN87]. What is the relation, if any,<br />

between refined <strong>Buneman</strong> trees and Neighbor-Joining trees? We know that the<br />

NJ-method always produces fully resolved binary trees, and that refined <strong>Buneman</strong><br />

trees might not be fully resolved. But how different are the resolutions?<br />

Is it the case that RB(δ) ⊆ NJ(δ)? Is the intersection of splits in the refined<br />

<strong>Buneman</strong> tree and in the Neighbor-Joining tree a set of particularly good splits,<br />

making refined <strong>Buneman</strong> splits an extra reliable core set of splits? It would be<br />

nice if we were able to say that the set of refined <strong>Buneman</strong> splits is a set we<br />

have great confidence in, compared to the set of Neighbor-Joining splits which<br />

might be more resolved than the data would warrant.<br />

Quoting [JWMV03]: “From the perspective of experimental performance studies<br />

and algorithm design, (global) NJ should be regarded as a universal lowest<br />

common denominator in phylogeny reconstruction algorithms. Its speed makes<br />

it easy to use under all circumstances; its topological accuracy makes it an acceptable<br />

starting point for tree reconstruction in biological practice. We suggest<br />

that a proposed method should be compared with NJ and abandoned if it does not<br />

offer a demonstrable advantage over NJ for substantial subproblem families.”<br />

There is clearly no competition between our implementation of the refined<br />

<strong>Buneman</strong> tree algorithm, and e.g. the optimized Neighbor-Joining algorithm<br />

described in [BFM+03], when it comes to performance. The current implementation<br />

of the refined <strong>Buneman</strong> algorithm runs much slower than e.g. the<br />

QuickJoin algorithm ([BFM+03]), even though they have the same theoretical<br />



worst case time and space complexity. However, with respect to performance<br />

in terms of reconstructing accurate phylogenetic trees, the refined <strong>Buneman</strong><br />

method might demonstrate unknown strengths.<br />

16.1 Test setup<br />

In the following section we will test the refined <strong>Buneman</strong> tree algorithm against<br />

two other known algorithms, the <strong>Buneman</strong> method ([Bun71]) and the Neighbor-<br />

Joining ([SN87]) method. All experiments have been run on test data from the<br />

PFAM database of protein sequence families. The data consists of distance<br />

matrices with sizes ranging from 10–50 (30 matrices), 100–200 (50 matrices) and<br />

500–700 (50 matrices), for a total of 130 test matrices. Larger matrices are available, but<br />

they will be disregarded due to time constraints. Distance data is given in the<br />

common Phylip format.<br />
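A minimal sketch of reading such a Phylip-style matrix is shown below; the class name is illustrative, and whitespace-separated fields are assumed (real Phylip files fix the taxon-name field at ten characters).<br />

```java
import java.util.Scanner;

// Illustrative reader for a whitespace-separated Phylip-style distance
// matrix: the first token is the number of taxa n, followed by n rows,
// each a taxon name and n distances. Not the thesis implementation.
public class PhylipReader {
    public static double[][] read(String text) {
        Scanner sc = new Scanner(text);
        int n = Integer.parseInt(sc.next());
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            sc.next(); // taxon name, ignored here
            for (int j = 0; j < n; j++) {
                // parse explicitly to avoid locale-dependent Scanner parsing
                d[i][j] = Double.parseDouble(sc.next());
            }
        }
        return d;
    }
}
```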

16.2 <strong>Buneman</strong> and refined <strong>Buneman</strong><br />

The implementation of the refined <strong>Buneman</strong> tree algorithm can easily be adapted<br />

to mark those splits which are both in B(δ) and in RB(δ). Since we have the<br />

n − 3 least scoring quartets used to calculate the refined <strong>Buneman</strong> index for a<br />

split, it is easy to find the least scoring quartet among them. If that quartet<br />

has positive <strong>Buneman</strong> score, we can mark the split as belonging in B(δ).<br />
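This computation can be sketched as follows. The names are illustrative; the quartet score is the standard <strong>Buneman</strong> score ½(min(δ(i,k)+δ(j,l), δ(i,l)+δ(j,k)) − δ(i,j) − δ(k,l)), and the index of a split is taken, as above, to be the mean of its n − 3 lowest quartet scores, which reproduces the index values printed in Appendix A.<br />

```java
import java.util.Arrays;

// Illustrative sketch of the quartet score and the membership tests
// described above; class and method names are not from the actual
// implementation.
public class BunemanIndex {
    // Buneman score of the quartet ij|kl under the distance matrix d.
    public static double quartetScore(double[][] d, int i, int j, int k, int l) {
        double cross = Math.min(d[i][k] + d[j][l], d[i][l] + d[j][k]);
        return 0.5 * (cross - d[i][j] - d[k][l]);
    }

    // Refined Buneman index: mean of the n − 3 smallest quartet scores.
    public static double refinedIndex(double[] quartetScores, int n) {
        double[] s = quartetScores.clone();
        Arrays.sort(s);
        double sum = 0.0;
        for (int i = 0; i < n - 3; i++) sum += s[i];
        return sum / (n - 3);
    }

    // A split belongs to B(δ) iff even its least-scoring quartet is positive.
    public static boolean inBuneman(double minQuartetScore) {
        return minQuartetScore > 0.0;
    }
}
```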

This first experiment consists of running the (modified) refined <strong>Buneman</strong><br />

tree algorithm on examples from the set of PFAM distance matrices, counting<br />

for each one the number of splits in B(δ) and in RB(δ). The results from the<br />

experiment are given in Figure 16.1, sorted according to increasing distance<br />

matrix size and plotted as percentages (the size of RB(δ) is 100%).<br />

Figure 16.1 shows that the size of B(δ) fluctuates quite a bit, especially<br />

when the size of the distance matrix increases, where more and more datasets<br />

do not infer any <strong>Buneman</strong> splits at all. The refined <strong>Buneman</strong> method is clearly<br />

much less restrictive than the <strong>Buneman</strong> method, as expected. Regarding the<br />

quality of the splits that are in RB(δ) but not in B(δ), further studies would<br />

need to be undertaken — one could either study the specific datasets for which<br />

the <strong>Buneman</strong> method produces few splits, or use simulated data which would<br />

provide a key to which splits are well-supported and which are unsupported.<br />

16.3 <strong>Refined</strong> <strong>Buneman</strong> and Neighbor-Joining<br />

To test the refined <strong>Buneman</strong> tree method against the Neighbor-Joining method,<br />

the author has run the implementation of the refined <strong>Buneman</strong> tree algorithm<br />

described in this thesis, against the Quick-Join algorithm described in [BFM+03].<br />

The Quick-Join software is available from this website:<br />

http://www.birc.dk/Software/QuickJoin/<br />



Figure 16.1: The size of the set B(δ) as a percentage of the set RB(δ) on PFAM<br />

data. The two artifacts are due to two datasets where neither method produces<br />

any splits.<br />



The refined <strong>Buneman</strong> tree algorithm has been adapted to reading Phylip<br />

formats, and produces a list of splits in the form of strings from the alphabet<br />

{0, 1}, ordered according to the input distance matrix.<br />

The Quick-Join program inherently reads Phylip matrices, and produces a<br />

tree in Newick format, with names taken from the input distance matrix. Thus,<br />

the author has chosen to rename species so that their names reflect their index<br />

in the distance matrix, to ensure the output Newick format can be translated<br />

into a set of splits which is directly comparable to the splits output by the<br />

refined <strong>Buneman</strong> program.<br />
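Comparing the two outputs then reduces to set operations on canonicalized strings; the sketch below is hypothetical and not taken from either tool. A split and its complement denote the same bipartition, so each string is first flipped, if necessary, so that the side containing taxon 0 is labelled '0'.<br />

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for comparing split sets encoded as 0/1 strings
// ordered by distance-matrix index.
public class SplitSets {
    // Flip the string if taxon 0 is on the '1' side, so that a split and
    // its complement map to the same canonical string.
    public static String canonical(String split) {
        if (split.charAt(0) == '0') return split;
        StringBuilder sb = new StringBuilder(split.length());
        for (int i = 0; i < split.length(); i++) {
            sb.append(split.charAt(i) == '0' ? '1' : '0');
        }
        return sb.toString();
    }

    // Splits present in both collections, e.g. RB(δ) ∩ NJ(δ).
    public static Set<String> common(Collection<String> a, Collection<String> b) {
        Set<String> canonA = new HashSet<>();
        for (String s : a) canonA.add(canonical(s));
        Set<String> result = new HashSet<>();
        for (String s : b) {
            String c = canonical(s);
            if (canonA.contains(c)) result.add(c);
        }
        return result;
    }
}
```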

That translation is done using the Split-Dist package provided by Dr. Mailund.<br />

This piece of software can read two or more trees in Newick format, and output<br />

the set of splits that they have in common. We shall feed the Split-Dist program<br />

two copies of the output from Quick-Join to obtain one copy of that output in<br />

the form of a set of splits. The Split-Dist software is available from this website:<br />

http://www.daimi.au.dk/~mailund/split-dist.html<br />

This experiment was performed on the 130 PFAM matrices, and the results<br />

are available, sorted according to increasing distance matrix size, in Figure 16.2.<br />

Figure 16.3 shows the same dataset plotted as percentages (the size of NJ(δ) is<br />

100%).<br />

The first conclusion we can draw from this experiment is that RB ⊈ NJ.<br />

There is exactly one distance matrix — 4HBT.phylip, size 559 — out of the<br />

130 test matrices that has a single split which is in RB(δ) but not in NJ(δ).<br />

However, for all practical purposes we can say that RB ⊆ NJ, and the author<br />

recommends that the matter be investigated further.<br />

Since we have established that all but one split in RB(δ) is also in NJ(δ),<br />

we can look at the relative sizes of B(δ), RB(δ) and NJ(δ) to try and judge the<br />

quality of the refined <strong>Buneman</strong> method. If we assume that the Neighbor-Joining<br />

method has a good topological accuracy, our hope is that the refined <strong>Buneman</strong><br />

method will capture many of the same splits as the Neighbor-Joining method<br />

finds. Figure 16.2 shows how the number of splits identified by the Neighbor-Joining<br />

method increases steadily (of course, the number of edges in a fully<br />

resolved tree is always 2n−3), while the two <strong>Buneman</strong> methods fluctuate greatly.<br />

Generally, the refined <strong>Buneman</strong> method performs better than the <strong>Buneman</strong><br />

method and often identifies a significant number of NJ splits that the <strong>Buneman</strong><br />

method does not. If we look at Figure 16.3, we see that in some cases, the refined<br />

<strong>Buneman</strong> method identifies between 30% and 70% of the splits identified by the<br />

Neighbor-Joining method, however, in many cases the refined <strong>Buneman</strong> method<br />

identifies very few or even zero splits. If we assume the Neighbor-Joining tree<br />

represents the true evolutionary history, this would mean the refined <strong>Buneman</strong><br />

method is very bad indeed.<br />

However, we have no knowledge of the quality of the splits inferred by the<br />

Neighbor-Joining method on this particular dataset. We have a general result<br />

saying the method is a universal lowest common denominator in phylogeny<br />

reconstruction algorithms, mainly because of its speed and not its biological accuracy.<br />

Since the NJ method will always resolve a full tree, it is quite possible<br />



Figure 16.2: The total number of splits in B(δ), RB(δ) and NJ(δ).<br />



Figure 16.3: The sizes of B(δ) and RB(δ) as percentages of |NJ(δ)|.<br />



that it over-induces splits, and that the result from the refined <strong>Buneman</strong> method<br />

is actually closer to the truth: whether the method induces 70%, 30% or even<br />

zero Neighbor-Joining splits, it might be the case that the particular dataset we<br />

are looking at simply does not warrant any further resolution of the evolutionary<br />

tree. Clearly, further investigations are required. On the positive side we note<br />

that at least the splits produced by the refined <strong>Buneman</strong> method can already<br />

be considered somewhat reliable, since the Neighbor-Joining method produces<br />

the same splits.<br />

The above comparisons are based on all splits, both trivial and non-trivial. In<br />

the following we will restrict ourselves to comparing only non-trivial splits from<br />

RB(δ) and NJ(δ). Recall the discussion of evolutionary trees and phylogenetic<br />

trees; perhaps the following comparison will show that the core of the refined<br />

<strong>Buneman</strong> tree is somehow a reliable core of the Neighbor-Joining tree.<br />

In Figure 16.4 we see the number of non-trivial splits in RB(δ) compared<br />

to the number of non-trivial splits in NJ(δ), plotted as percentages with the NJ<br />

splits as 100% — the one case where a split from RB(δ) is not in NJ(δ) will<br />

be ignored for now. The graph clearly shows a decreasing trend in the number<br />

of splits in RB(δ). Apart from the cases where the refined <strong>Buneman</strong> method<br />

produces no splits at all or only very few splits, the number of non-trivial splits<br />

identified by the refined <strong>Buneman</strong> method clearly goes down as matrix sizes go<br />

up.<br />

16.4 Summary of experimental results<br />

The total number of splits in the Neighbor-Joining tree increases steadily as the<br />

matrix size goes up, since NJ always produces fully resolved trees. Meanwhile,<br />

the number of refined <strong>Buneman</strong> splits fluctuates greatly, in many cases to size<br />

zero. The NJ method always resolves binary trees regardless of the nature and<br />

properties of the distance data, but it seems the refined <strong>Buneman</strong> method is<br />

very sensitive to data. The criterion that makes a set of data induce many or<br />

few refined <strong>Buneman</strong> splits is unknown, i.e. we cannot look at a distance matrix<br />

and say whether it will induce many or few splits, before we actually apply<br />

the method to the data — but the author suggests this be investigated further.<br />

One reason could be that the distance data does not necessarily uphold metric<br />

properties such as the triangle inequality, which might create contradictions or<br />

“noise” in the data, which in turn might lead the refined <strong>Buneman</strong> method to<br />

reject splits. However, this is only speculation.<br />

In spite of the fluctuation, there seems to be a trend of refined <strong>Buneman</strong> sets<br />

getting relatively smaller, as the size of the distance matrices goes up — at least<br />

the non-trivial parts. This is most evident in Figure 16.4. The refined <strong>Buneman</strong><br />

method looks at n − 3 minimum quartets for each edge, but as the number of<br />

species goes up, the number of quartets does not go up linearly — there are<br />

O(n⁴) quartets induced by an edge. Thus the set of n − 3 least scoring quartets<br />

becomes a smaller and smaller part of the total number of quartets induced by<br />

an edge, a smaller and smaller part of the least scoring splits even, and the<br />



Figure 16.4: The set of non-trivial splits from RB(δ) as percentage of the nontrivial<br />

splits in NJ(δ).<br />



refined <strong>Buneman</strong> method therefore suffers a great penalty as the number of species<br />

goes up. Mathematically, when the number of species approaches infinity,<br />

(n − 3)/n⁴ → 0.<br />

This would indicate that the method is not very useful for large data sets,<br />

since it suffers a great disadvantage. It would be interesting to see the performance<br />

of a quartet-based method that considers a fixed percentage of quartets<br />

per split, e.g. 5–10%, which would not suffer from the same problems with scalability<br />

as the refined <strong>Buneman</strong> method. The problem becomes finding the 5–10%<br />

of least scoring quartets for every split in an efficient manner, while ensuring the<br />

set of splits produced is still a compatible set of splits — or tree. The Q* method<br />

[BG00] tackles the problem of going from favorable quartets to a compatible set<br />

of splits, at the cost of performance compared to the refined <strong>Buneman</strong> method.<br />
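The limit above is easy to illustrate numerically with a hypothetical helper:<br />

```java
// Numeric illustration of the scaling argument above: the n − 3 quartets
// inspected per edge, as a fraction of the O(n⁴) quartets an edge induces
// (crudely taken as n⁴ here), vanish as n grows.
public class QuartetFraction {
    public static double inspectedFraction(int n) {
        return (n - 3.0) / ((double) n * n * n * n);
    }
}
```

For n = 10 the inspected fraction is 7 · 10⁻⁴, while for n = 700 it has already dropped below 3 · 10⁻⁹.<br />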

One experiment which would be interesting to perform, but which the author<br />

has not undertaken, is to measure the quality of splits from the Neighbor-Joining<br />

method that are either accepted or rejected by the refined <strong>Buneman</strong> method. The<br />

author's expectation is that the splits accepted by the refined <strong>Buneman</strong> method<br />

would get a high confidence, as that method relies on much more evidence<br />

than the Neighbor-Joining method does. Such an investigation could be done<br />

by inspecting the PFAM data and interpreting their biological meaning, by<br />

using bootstrap tests, or by using simulated data where one would know the<br />

evolutionary history. Is it the case that the splits in RB(δ) are more trustworthy<br />

than splits in NJ(δ) \ RB(δ)? In the cases where the NJ method infers a fully<br />

resolved binary tree, and the RB method infers no splits at all, which one<br />

tells the truth? Some datasets might be completely random, and an NJ tree<br />

based on such a set is completely useless — on the other hand, tests show that<br />

the refined <strong>Buneman</strong> method produces very few splits or none at all on input<br />

distance matrices with random entries. So when the RB method tells us that<br />

the dataset does not warrant any splits, is that because the RB method has<br />

good biological properties? This question is beyond the scope of this thesis, but<br />

certainly deserves further investigation.<br />



Chapter 17<br />

Conclusion<br />

In this thesis the author has described an implementation of the refined <strong>Buneman</strong><br />

tree reconstruction method. The author has verified that the implementation<br />

is correct and that it runs in expected time Θ(n³) and space O(n²).<br />

The author has argued that the implementation could be improved to run in<br />

worst case cubic time by changing a single module, i.e. the selection algorithm<br />

described in chapter 10.<br />

The author has introduced the field of bioinformatics in general and evolutionary<br />

tree reconstruction in particular, and the problems posed by increasingly<br />

large amounts of data becoming available. A classification is made of various<br />

types of tree reconstruction methods, highlighting their particular advantages<br />

and disadvantages with respect to a performance trade-off between speed and<br />

biological accuracy.<br />

The author has compared the refined <strong>Buneman</strong> tree method<br />

to two other tree reconstruction methods from its class, namely the original<br />

<strong>Buneman</strong> tree method and the well known Neighbor-Joining method. The<br />

conclusion from these experiments is that the refined <strong>Buneman</strong> tree method<br />

produces trees that are less restrictive than the <strong>Buneman</strong> method but more restrictive<br />

than the Neighbor-Joining method, and with running times that make the<br />

method useful in practice. The author encourages more work in this area, to<br />

better estimate the biological accuracy of the method.<br />

Lastly, the author has prepared the implementation of the refined <strong>Buneman</strong><br />

method for integration into the widely used JSplits software package, thus<br />

making the method publicly available and easy to use.<br />

17.1 Future work<br />

Clearly, the implementation of the refined <strong>Buneman</strong> tree algorithm can be optimized<br />

to run faster in practice. Even if its asymptotic running time performance<br />

equals that of the widely-used Neighbor-Joining method, the practical<br />

difference in running time, compared to e.g. Quick-Join, is still substantial.<br />



Speedups might be achieved using optimizations, simplifications and maybe a<br />

different programming language.<br />

More experiments are needed to fully understand the biological performance<br />

of the refined <strong>Buneman</strong> method. We have seen that the method in practice<br />

produces subsets of Neighbor-Joining trees, which are considered somewhat accurate.<br />

We have seen that in many cases the degree of resolution in refined<br />

<strong>Buneman</strong> trees is much lower than in Neighbor-Joining trees — and that in<br />

many cases the refined <strong>Buneman</strong> method will produce a tree with zero edges where<br />

the Neighbor-Joining tree will produce 2n − 3 edges for input size n. But we<br />

have not been able to determine which method is more correct, and here lies an<br />

important task for the future.<br />

Finally, we have seen that the refined <strong>Buneman</strong> method does not scale well.<br />

The method uses the set of n − 3 minimum scoring quartets associated with an<br />

edge, to determine the biological quality of the edge. But the number of quartets<br />

of an edge grows as O(n⁴) when the input size goes up, and thus the refined<br />

<strong>Buneman</strong> method will be using a relatively smaller amount of the least scoring<br />

quartets to decide the quality of an edge. The author therefore suggests that<br />

the method be amended or that a new method is developed, where the edge<br />

measuring basis is relatively constant, to provide better scalability.<br />



Part IV<br />

Appendices<br />



Appendix A<br />

Correctness of the<br />

Reference Implementation<br />

Distance matrix:<br />

0.000 0.134 0.559 0.093 0.158<br />

0.134 0.000 0.921 0.889 0.545<br />

0.559 0.921 0.000 0.843 0.610<br />

0.093 0.889 0.843 0.000 0.751<br />

0.158 0.545 0.610 0.751 0.000<br />

Split: 00000<br />

Not a split!<br />

Split: 00001<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 1 | 4 4 0.2845541310739736<br />

Quartet: 0 2 | 4 4 0.10507832094101921<br />

Quartet: 0 3 | 4 4 0.4080878060278718<br />

Quartet: 1 1 | 4 4 0.5448085612764465<br />

Quartet: 1 2 | 4 4 0.11734369883965029<br />

Quartet: 1 3 | 4 4 0.20357979187534186<br />

Quartet: 2 2 | 4 4 0.6104717294506344<br />

Quartet: 2 3 | 4 4 0.25932504761092445<br />

Quartet: 3 3 | 4 4 0.7513406854234302<br />

<strong>Buneman</strong> Index: 0.11121100989033475<br />

Split: 00010<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 1 | 3 3 0.42422721859419016<br />

Quartet: 0 2 | 3 3 0.1890061527256532<br />

Quartet: 0 4 | 3 3 0.34325287939555843<br />

Quartet: 1 1 | 3 3 0.888989662949193<br />

Quartet: 1 2 | 3 3 0.40577954477681416<br />

Quartet: 1 4 | 3 3 0.5477608935480884<br />

Quartet: 2 2 | 3 3 0.8431623196522158<br />

Quartet: 2 4 | 3 3 0.49201563781250585<br />

Quartet: 4 4 | 3 3 0.7513406854234302<br />

<strong>Buneman</strong> Index: 0.1411876204138139<br />

Split: 00011<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 0 | 3 4 -0.2498837912935838<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 1 | 3 3 0.42422721859419016<br />

Quartet: 0 1 | 3 4 -0.12353367495389822<br />

Quartet: 0 1 | 4 4 0.2845541310739736<br />

Quartet: 0 2 | 3 3 0.1890061527256532<br />

Quartet: 0 2 | 3 4 -0.3030094850868526<br />

Quartet: 0 2 | 4 4 0.10507832094101921<br />

Quartet: 1 1 | 3 3 0.888989662949193<br />

Quartet: 1 1 | 3 4 0.34122876940110464<br />

Quartet: 1 1 | 4 4 0.5448085612764465<br />

Quartet: 1 2 | 3 3 0.40577954477681416<br />

Quartet: 1 2 | 3 4 -0.14198134877127422<br />

Quartet: 1 2 | 4 4 0.11734369883965029<br />

Quartet: 2 2 | 3 3 0.8431623196522158<br />

Quartet: 2 2 | 3 4 0.35114668183971004<br />

Quartet: 2 2 | 4 4 0.6104717294506344<br />

<strong>Buneman</strong> Index: -0.2764466381902182<br />



Split: 00100<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 1 | 2 2 0.6726038407439385<br />

Quartet: 0 3 | 2 2 0.6541561669265625<br />

Quartet: 0 4 | 2 2 0.5053934085096152<br />

Quartet: 1 1 | 2 2 0.9205928930477804<br />

Quartet: 1 3 | 2 2 0.4373827748754015<br />

Quartet: 1 4 | 2 2 0.49312803061098415<br />

Quartet: 3 3 | 2 2 0.8431623196522158<br />

Quartet: 3 4 | 2 2 0.35114668183971004<br />

Quartet: 4 4 | 2 2 0.6104717294506344<br />

<strong>Buneman</strong> Index: 0.39426472835755577<br />

Split: 00101<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 4 0.05312569379326881<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 1 | 2 2 0.6726038407439385<br />

Quartet: 0 1 | 2 4 0.16721043223432325<br />

Quartet: 0 1 | 4 4 0.2845541310739736<br />

Quartet: 0 3 | 2 2 0.6541561669265625<br />

Quartet: 0 3 | 2 4 0.14876275841694736<br />

Quartet: 0 3 | 4 4 0.4080878060278718<br />

Quartet: 1 1 | 2 2 0.9205928930477804<br />

Quartet: 1 1 | 2 4 0.4274648624367962<br />

Quartet: 1 1 | 4 4 0.5448085612764465<br />

Quartet: 1 3 | 2 2 0.4373827748754015<br />

Quartet: 1 3 | 2 4 -0.055745255735582644<br />

Quartet: 1 3 | 4 4 0.20357979187534186<br />

Quartet: 3 3 | 2 2 0.8431623196522158<br />

Quartet: 3 3 | 2 4 0.49201563781250585<br />

Quartet: 3 3 | 4 4 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.0013097809711569153<br />

Split: 00110<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 3 -0.09563706462367855<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 1 | 2 2 0.6726038407439385<br />

Quartet: 0 1 | 2 3 0.018447673817375887<br />

Quartet: 0 1 | 3 3 0.42422721859419016<br />

Quartet: 0 4 | 2 2 0.5053934085096152<br />

Quartet: 0 4 | 2 3 -0.14876275841694736<br />

Quartet: 0 4 | 3 3 0.34325287939555843<br />

Quartet: 1 1 | 2 2 0.9205928930477804<br />

Quartet: 1 1 | 2 3 0.4832101181723788<br />

Quartet: 1 1 | 3 3 0.888989662949193<br />

Quartet: 1 4 | 2 2 0.49312803061098415<br />

Quartet: 1 4 | 2 3 0.055745255735582644<br />

Quartet: 1 4 | 3 3 0.5477608935480884<br />

Quartet: 4 4 | 2 2 0.6104717294506344<br />

Quartet: 4 4 | 2 3 0.25932504761092445<br />

Quartet: 4 4 | 3 3 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.12219991152031295<br />

Split: 00111<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 3 -0.09563706462367855<br />

Quartet: 0 0 | 2 4 0.05312569379326881<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 0 | 3 4 -0.2498837912935838<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 1 | 2 2 0.6726038407439385<br />

Quartet: 0 1 | 2 3 0.018447673817375887<br />

Quartet: 0 1 | 2 4 0.16721043223432325<br />

Quartet: 0 1 | 3 3 0.42422721859419016<br />

Quartet: 0 1 | 3 4 -0.12353367495389822<br />

Quartet: 0 1 | 4 4 0.2845541310739736<br />

Quartet: 1 1 | 2 2 0.9205928930477804<br />

Quartet: 1 1 | 2 3 0.4832101181723788<br />

Quartet: 1 1 | 2 4 0.4274648624367962<br />

Quartet: 1 1 | 3 3 0.888989662949193<br />

Quartet: 1 1 | 3 4 0.34122876940110464<br />

Quartet: 1 1 | 4 4 0.5448085612764465<br />

<strong>Buneman</strong> Index: -0.186708733123741<br />

Split: 01000<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 2 | 1 1 0.24798905230384188<br />

Quartet: 0 3 | 1 1 0.4647624443550029<br />

Quartet: 0 4 | 1 1 0.2602544302024729<br />

Quartet: 2 2 | 1 1 0.9205928930477804<br />

Quartet: 2 3 | 1 1 0.4832101181723788<br />

Quartet: 2 4 | 1 1 0.4274648624367962<br />

Quartet: 3 3 | 1 1 0.888989662949193<br />

Quartet: 3 4 | 1 1 0.34122876940110464<br />

Quartet: 4 4 | 1 1 0.5448085612764465<br />

<strong>Buneman</strong> Index: 0.1909466830833146<br />

Split: 01001<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />



Quartet: 0 0 | 1 4 -0.12635011633968557<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 2 | 1 1 0.24798905230384188<br />

Quartet: 0 2 | 1 4 -0.17947581013295438<br />

Quartet: 0 2 | 4 4 0.10507832094101921<br />

Quartet: 0 3 | 1 1 0.4647624443550029<br />

Quartet: 0 3 | 1 4 0.12353367495389822<br />

Quartet: 0 3 | 4 4 0.4080878060278718<br />

Quartet: 2 2 | 1 1 0.9205928930477804<br />

Quartet: 2 2 | 1 4 0.49312803061098415<br />

Quartet: 2 2 | 4 4 0.6104717294506344<br />

Quartet: 2 3 | 1 1 0.4832101181723788<br />

Quartet: 2 3 | 1 4 0.055745255735582644<br />

Quartet: 2 3 | 4 4 0.25932504761092445<br />

Quartet: 3 3 | 1 1 0.888989662949193<br />

Quartet: 3 3 | 1 4 0.5477608935480884<br />

Quartet: 3 3 | 4 4 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.15291296323631998<br />

Split: 01010<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 3 -0.3308581304922155<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 2 | 1 1 0.24798905230384188<br />

Quartet: 0 2 | 1 3 -0.23522106586853697<br />

Quartet: 0 2 | 3 3 0.1890061527256532<br />

Quartet: 0 4 | 1 1 0.2602544302024729<br />

Quartet: 0 4 | 1 3 -0.2045080141525299<br />

Quartet: 0 4 | 3 3 0.34325287939555843<br />

Quartet: 2 2 | 1 1 0.9205928930477804<br />

Quartet: 2 2 | 1 3 0.4373827748754015<br />

Quartet: 2 2 | 3 3 0.8431623196522158<br />

Quartet: 2 4 | 1 1 0.4274648624367962<br />

Quartet: 2 4 | 1 3 -0.055745255735582644<br />

Quartet: 2 4 | 3 3 0.49201563781250585<br />

Quartet: 4 4 | 1 1 0.5448085612764465<br />

Quartet: 4 4 | 1 3 0.20357979187534186<br />

Quartet: 4 4 | 3 3 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.2830395981803763<br />

Split: 01011<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 3 -0.3308581304922155<br />

Quartet: 0 0 | 1 4 -0.12635011633968557<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 0 | 3 4 -0.2498837912935838<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 2 | 1 1 0.24798905230384188<br />

Quartet: 0 2 | 1 3 -0.23522106586853697<br />

Quartet: 0 2 | 1 4 -0.17947581013295438<br />

Quartet: 0 2 | 3 3 0.1890061527256532<br />

Quartet: 0 2 | 3 4 -0.3030094850868526<br />

Quartet: 0 2 | 4 4 0.10507832094101921<br />

Quartet: 2 2 | 1 1 0.9205928930477804<br />

Quartet: 2 2 | 1 3 0.4373827748754015<br />

Quartet: 2 2 | 1 4 0.49312803061098415<br />

Quartet: 2 2 | 3 3 0.8431623196522158<br />

Quartet: 2 2 | 3 4 0.35114668183971004<br />

Quartet: 2 2 | 4 4 0.6104717294506344<br />

<strong>Buneman</strong> Index: -0.31693380778953406<br />

Split: 01100<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 2 -0.11408473844105449<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 3 | 1 1 0.4647624443550029<br />

Quartet: 0 3 | 1 2 -0.018447673817375887<br />

Quartet: 0 3 | 2 2 0.6541561669265625<br />

Quartet: 0 4 | 1 1 0.2602544302024729<br />

Quartet: 0 4 | 1 2 -0.16721043223432325<br />

Quartet: 0 4 | 2 2 0.5053934085096152<br />

Quartet: 3 3 | 1 1 0.888989662949193<br />

Quartet: 3 3 | 1 2 0.40577954477681416<br />

Quartet: 3 3 | 2 2 0.8431623196522158<br />

Quartet: 3 4 | 1 1 0.34122876940110464<br />

Quartet: 3 4 | 1 2 -0.14198134877127422<br />

Quartet: 3 4 | 2 2 0.35114668183971004<br />

Quartet: 4 4 | 1 1 0.5448085612764465<br />

Quartet: 4 4 | 1 2 0.11734369883965029<br />

Quartet: 4 4 | 2 2 0.6104717294506344<br />

<strong>Buneman</strong> Index: -0.15459589050279873<br />

Split: 01101<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 2 -0.11408473844105449<br />

Quartet: 0 0 | 1 4 -0.12635011633968557<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 4 0.05312569379326881<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 3 | 1 1 0.4647624443550029<br />

Quartet: 0 3 | 1 2 -0.018447673817375887<br />



Quartet: 0 3 | 1 4 0.12353367495389822<br />

Quartet: 0 3 | 2 2 0.6541561669265625<br />

Quartet: 0 3 | 2 4 0.14876275841694736<br />

Quartet: 0 3 | 4 4 0.4080878060278718<br />

Quartet: 3 3 | 1 1 0.888989662949193<br />

Quartet: 3 3 | 1 2 0.40577954477681416<br />

Quartet: 3 3 | 1 4 0.5477608935480884<br />

Quartet: 3 3 | 2 2 0.8431623196522158<br />

Quartet: 3 3 | 2 4 0.49201563781250585<br />

Quartet: 3 3 | 4 4 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.12021742739037003<br />

Split: 01110<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 2 -0.11408473844105449<br />

Quartet: 0 0 | 1 3 -0.3308581304922155<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 3 -0.09563706462367855<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 4 | 1 1 0.2602544302024729<br />

Quartet: 0 4 | 1 2 -0.16721043223432325<br />

Quartet: 0 4 | 1 3 -0.2045080141525299<br />

Quartet: 0 4 | 2 2 0.5053934085096152<br />

Quartet: 0 4 | 2 3 -0.14876275841694736<br />

Quartet: 0 4 | 3 3 0.34325287939555843<br />

Quartet: 4 4 | 1 1 0.5448085612764465<br />

Quartet: 4 4 | 1 2 0.11734369883965029<br />

Quartet: 4 4 | 1 3 0.20357979187534186<br />

Quartet: 4 4 | 2 2 0.6104717294506344<br />

Quartet: 4 4 | 2 3 0.25932504761092445<br />

Quartet: 4 4 | 3 3 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.2676830723223727<br />

Split: 01111<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 2 -0.11408473844105449<br />

Quartet: 0 0 | 1 3 -0.3308581304922155<br />

Quartet: 0 0 | 1 4 -0.12635011633968557<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 3 -0.09563706462367855<br />

Quartet: 0 0 | 2 4 0.05312569379326881<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 0 | 3 4 -0.2498837912935838<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

<strong>Buneman</strong> Index: -0.29037096089289965<br />

<strong>Refined</strong> <strong>Buneman</strong> splits:<br />

01000 0.1909466830833146<br />

00100 0.39426472835755577<br />

00010 0.1411876204138139<br />

00001 0.11121100989033475<br />



Appendix B<br />

Garbage Collector Log<br />

0.000: [GC 511K->156K(1984K), 0.0090600 secs]<br />

0.178: [GC 668K->154K(1984K), 0.0041940 secs]<br />

0.241: [GC 666K->155K(1984K), 0.0020350 secs]<br />

0.264: [GC 667K->196K(1984K), 0.0014720 secs]<br />

0.277: [GC 708K->184K(1984K), 0.0011120 secs]<br />

0.299: [GC 696K->158K(1984K), 0.0007900 secs]<br />

0.310: [GC 670K->178K(1984K), 0.0006680 secs]<br />

0.327: [GC 690K->164K(1984K), 0.0005610 secs]<br />

0.340: [GC 676K->165K(1984K), 0.0009310 secs]<br />

0.353: [GC 677K->166K(1984K), 0.0006800 secs]<br />

0.363: [GC 678K->166K(1984K), 0.0005850 secs]<br />

0.373: [GC 678K->167K(1984K), 0.0006100 secs]<br />

0.384: [GC 679K->168K(1984K), 0.0006100 secs]<br />

0.397: [GC 680K->212K(1984K), 0.0010710 secs]<br />

0.408: [GC 724K->168K(1984K), 0.0007290 secs]<br />

0.419: [GC 680K->175K(1984K), 0.0005970 secs]<br />

0.429: [GC 687K->183K(1984K), 0.0006870 secs]<br />

0.438: [GC 695K->196K(1984K), 0.0006830 secs]<br />

0.453: [GC 708K->183K(1984K), 0.0007270 secs]<br />

0.461: [GC 695K->246K(1984K), 0.0014170 secs]<br />

0.475: [GC 758K->304K(1984K), 0.0027140 secs]<br />

0.485: [GC 816K->363K(1984K), 0.0026530 secs]<br />

0.501: [GC 875K->312K(1984K), 0.0006540 secs]<br />

0.511: [GC 824K->311K(1984K), 0.0004770 secs]<br />

0.523: [GC 823K->320K(1984K), 0.0006030 secs]<br />

0.532: [GC 832K->319K(1984K), 0.0006790 secs]<br />

0.544: [GC 831K->329K(1984K), 0.0008010 secs]<br />

0.553: [GC 841K->328K(1984K), 0.0007290 secs]<br />

0.565: [GC 840K->331K(1984K), 0.0009070 secs]<br />

0.576: [GC 843K->329K(1984K), 0.0007710 secs]<br />

0.584: [GC 841K->456K(1984K), 0.0028120 secs]<br />

0.599: [GC 968K->464K(1984K), 0.0020600 secs]<br />

0.609: [GC 976K->464K(1984K), 0.0018370 secs]<br />

0.623: [GC 976K->479K(1984K), 0.0005670 secs]<br />

0.631: [GC 991K->477K(1984K), 0.0005740 secs]<br />

0.640: [GC 989K->614K(1984K), 0.0028960 secs]<br />

0.655: [GC 1126K->627K(1984K), 0.0020690 secs]<br />

0.666: [GC 1139K->627K(1984K), 0.0018740 secs]<br />

0.676: [GC 1139K->813K(1984K), 0.0037820 secs]<br />

0.695: [GC 1325K->786K(1984K), 0.0016900 secs]<br />

0.705: [GC 1298K->787K(1984K), 0.0010740 secs]<br />

0.718: [GC 1299K->801K(1984K), 0.0006760 secs]<br />

0.727: [GC 1313K->801K(1984K), 0.0006270 secs]<br />

0.736: [GC 1313K->801K(1984K), 0.0006380 secs]<br />

0.749: [GC 1313K->811K(1984K), 0.0007980 secs]<br />

0.757: [GC 1323K->811K(1984K), 0.0008100 secs]<br />

0.766: [GC 1323K->812K(1984K), 0.0007680 secs]<br />

0.780: [GC 1324K->813K(1984K), 0.0008510 secs]<br />

0.788: [GC 1325K->812K(1984K), 0.0008660 secs]<br />

0.799: [GC 1324K->812K(1984K), 0.0007680 secs]<br />

0.808: [GC 1324K->1006K(1984K), 0.0047390 secs]<br />

0.830: [GC 1518K->971K(1984K), 0.0012210 secs]<br />

0.839: [GC 1483K->971K(1984K), 0.0007970 secs]<br />

0.847: [GC 1483K->1068K(1984K), 0.0021650 secs]<br />

0.864: [GC 1580K->1142K(1984K), 0.0033550 secs]<br />

0.875: [GC 1654K->1144K(1984K), 0.0024180 secs]<br />

0.886: [GC 1656K->1144K(1984K), 0.0005500 secs]<br />

0.900: [GC 1656K->1162K(1984K), 0.0007140 secs]<br />

0.908: [GC 1674K->1162K(1984K), 0.0007450 secs]<br />

0.917: [GC 1674K->1161K(1984K), 0.0007530 secs]<br />

0.936: [GC 1673K->1163K(1984K), 0.0018910 secs]<br />



0.972: [GC 1675K->1182K(1984K), 0.0011610 secs]<br />

0.989: [GC 1694K->1202K(1984K), 0.0014850 secs]<br />

1.004: [GC 1714K->1222K(1984K), 0.0015970 secs]<br />

1.028: [GC 1734K->1243K(1984K), 0.0016290 secs]<br />

1.038: [GC 1755K->1264K(1984K), 0.0015250 secs]<br />

1.048: [GC 1776K->1284K(1984K), 0.0015470 secs]<br />

1.058: [GC 1796K->1306K(1984K), 0.0015560 secs]<br />

1.070: [GC 1818K->1328K(1984K), 0.0019730 secs]<br />

1.092: [GC 1840K->1355K(1984K), 0.0022360 secs]<br />

1.104: [GC 1867K->1385K(1984K), 0.0025080 secs]<br />

1.115: [GC 1897K->1406K(1984K), 0.0023980 secs]<br />

1.126: [GC 1918K->1417K(1984K), 0.0017430 secs]<br />

1.136: [GC 1929K->1441K(1984K), 0.0022290 secs]<br />

1.147: [GC 1953K->1455K(1984K), 0.0018660 secs]<br />

1.149: [Full GC 1455K->356K(1984K), 0.0411630 secs]<br />

1.198: [GC 868K->384K(1984K), 0.0014270 secs]<br />

1.208: [GC 896K->409K(1984K), 0.0015440 secs]<br />

1.218: [GC 921K->428K(1984K), 0.0019500 secs]<br />

1.228: [GC 940K->447K(1984K), 0.0017300 secs]<br />

1.238: [GC 959K->465K(1984K), 0.0021660 secs]<br />

1.249: [GC 977K->483K(1984K), 0.0019060 secs]<br />

1.259: [GC 995K->495K(1984K), 0.0020870 secs]<br />

1.270: [GC 1007K->516K(1984K), 0.0017420 secs]<br />

1.280: [GC 1028K->528K(1984K), 0.0016770 secs]<br />

1.290: [GC 1040K->541K(1984K), 0.0016130 secs]<br />

1.300: [GC 1053K->554K(1984K), 0.0016020 secs]<br />

1.310: [GC 1066K->566K(1984K), 0.0014070 secs]<br />

1.320: [GC 1078K->582K(1984K), 0.0017910 secs]<br />

1.330: [GC 1094K->597K(1984K), 0.0015940 secs]<br />

1.340: [GC 1109K->613K(1984K), 0.0015960 secs]<br />

1.350: [GC 1125K->638K(1984K), 0.0024570 secs]<br />

1.361: [GC 1150K->652K(1984K), 0.0019230 secs]<br />

1.371: [GC 1164K->672K(1984K), 0.0020700 secs]<br />

1.381: [GC 1184K->689K(1984K), 0.0018670 secs]<br />

1.391: [GC 1201K->713K(1984K), 0.0021350 secs]<br />

1.401: [GC 1225K->734K(1984K), 0.0020890 secs]<br />

1.412: [GC 1246K->758K(1984K), 0.0024130 secs]<br />

1.422: [GC 1270K->772K(1984K), 0.0019690 secs]<br />

1.433: [GC 1284K->793K(1984K), 0.0020540 secs]<br />

1.466: [GC 1305K->800K(1984K), 0.0018280 secs]<br />

1.481: [GC 1312K->809K(1984K), 0.0016140 secs]<br />

1.495: [GC 1321K->815K(1984K), 0.0008950 secs]<br />

1.512: [GC 1327K->823K(1984K), 0.0010560 secs]<br />

1.529: [GC 1335K->830K(1984K), 0.0010790 secs]<br />

1.542: [GC 1342K->837K(1984K), 0.0012930 secs]<br />

1.556: [GC 1349K->846K(1984K), 0.0015160 secs]<br />

1.568: [GC 1358K->856K(1984K), 0.0012840 secs]<br />

1.581: [GC 1368K->864K(1984K), 0.0013920 secs]<br />

1.593: [GC 1376K->872K(1984K), 0.0014740 secs]<br />

1.606: [GC 1384K->881K(1984K), 0.0013130 secs]<br />

1.619: [GC 1393K->889K(1984K), 0.0013220 secs]<br />

1.632: [GC 1401K->897K(1984K), 0.0012040 secs]<br />

1.647: [GC 1409K->904K(1984K), 0.0012350 secs]<br />

1.659: [GC 1416K->912K(1984K), 0.0012510 secs]<br />

1.673: [GC 1424K->922K(1984K), 0.0013530 secs]<br />

1.685: [GC 1434K->931K(1984K), 0.0014230 secs]<br />

1.698: [GC 1443K->941K(1984K), 0.0016090 secs]<br />

1.712: [GC 1453K->950K(1984K), 0.0013770 secs]<br />

1.724: [GC 1462K->956K(1984K), 0.0013470 secs]<br />

1.737: [GC 1468K->965K(1984K), 0.0014090 secs]<br />

1.750: [GC 1477K->974K(1984K), 0.0013430 secs]<br />

1.763: [GC 1486K->983K(1984K), 0.0013740 secs]<br />

1.775: [GC 1495K->991K(1984K), 0.0012020 secs]<br />

1.787: [GC 1503K->996K(1984K), 0.0012050 secs]<br />

1.800: [GC 1508K->1005K(1984K), 0.0012510 secs]<br />

1.813: [GC 1517K->1013K(1984K), 0.0013920 secs]<br />

1.825: [GC 1525K->1020K(1984K), 0.0013730 secs]<br />

1.837: [GC 1532K->1029K(1984K), 0.0013140 secs]<br />

1.850: [GC 1541K->1036K(1984K), 0.0013250 secs]<br />

1.863: [GC 1548K->1044K(1984K), 0.0012460 secs]<br />

1.875: [GC 1556K->1052K(1984K), 0.0013450 secs]<br />

1.887: [GC 1564K->1059K(1984K), 0.0012150 secs]<br />

1.899: [GC 1571K->1065K(1984K), 0.0010660 secs]<br />

1.911: [GC 1577K->1074K(1984K), 0.0013640 secs]<br />

1.923: [GC 1586K->1083K(1984K), 0.0012460 secs]<br />
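Each line above follows the HotSpot `-verbose:gc` format `timestamp: [GC before->after(total), pause secs]`, with heap sizes in kilobytes. A minimal sketch of a parser for such lines (a hypothetical helper, not part of the thesis implementation):<br />

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses one line of HotSpot -verbose:gc output, e.g.
 *  "0.178: [GC 668K->154K(1984K), 0.0041940 secs]". */
public class GcLogLine {
    private static final Pattern LINE = Pattern.compile(
        "([\\d.]+): \\[(Full )?GC (\\d+)K->(\\d+)K\\((\\d+)K\\), ([\\d.]+) secs\\]");

    public final double timestamp;  // seconds since JVM start
    public final boolean full;      // true for a Full GC entry
    public final int beforeK;       // heap occupancy before the collection (KB)
    public final int afterK;        // heap occupancy after the collection (KB)
    public final int totalK;        // committed heap size (KB)
    public final double pauseSecs;  // collection pause in seconds

    private GcLogLine(double t, boolean f, int b, int a, int tot, double p) {
        timestamp = t; full = f; beforeK = b; afterK = a; totalK = tot; pauseSecs = p;
    }

    public static GcLogLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) throw new IllegalArgumentException("unrecognised line: " + line);
        return new GcLogLine(Double.parseDouble(m.group(1)), m.group(2) != null,
            Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)),
            Integer.parseInt(m.group(5)), Double.parseDouble(m.group(6)));
    }

    /** Kilobytes reclaimed by this collection. */
    public int reclaimedK() { return beforeK - afterK; }
}
```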



Bibliography<br />

[AJL+02] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Molecular Biology of the Cell, Fourth Edition. Garland Science, Taylor & Francis Group, 29 West 35th Street, New York, 2002.<br />

[BB99] Vincent Berry and David Bryant. Faster reliable phylogenetic<br />

analysis. In Proceedings of the third annual international conference<br />

on Computational molecular biology, pages 59–68. ACM<br />

Press, 1999.<br />

[BFM+03] Gerth Stølting Brodal, Rolf Fagerberg, Thomas Mailund, Christian N. S. Pedersen, and Derek Phillips. Speeding up neighbour-joining tree construction. Technical report, ALCOM-FT, 2003.<br />

[BFÖ+03] Gerth Stølting Brodal, Rolf Fagerberg, Anna Östlin, Christian N. S. Pedersen, and S. Srinivasa Rao. Computing <strong>Refined</strong> <strong>Buneman</strong> <strong>Trees</strong> in Cubic Time. In Proceedings of the 3rd Workshop on Algorithms in BioInformatics (WABI 2003), volume 2812 of Lecture Notes in Computer Science, pages 259–270. Springer Verlag, September 2003.<br />

[BG91] Jean-Pierre Barthélemy and Alain Guénoche. <strong>Trees</strong> and Proximity Representations. John Wiley & Sons, 1991.<br />

[BG00] Vincent Berry and Olivier Gascuel. Inferring evolutionary trees with strong combinatorial evidence. Theoretical Computer Science, 240(2):271–298, 2000.<br />

[BM99] David Bryant and Vincent Moulton. A polynomial time algorithm for constructing the refined <strong>Buneman</strong> tree. Applied Mathematics Letters, 12:51–56, 1999.<br />

[Bun71] Peter <strong>Buneman</strong>. The recovery of trees from measures of dissimilarity. In Mathematics in the Archaeological and Historical Sciences, pages 387–395. Edinburgh University Press, 1971.<br />

[cal04] Cambridge Advanced Learner's Dictionary. Webpage, 2004. http://dictionary.cambridge.org/.<br />



[CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, Massachusetts, 1990.<br />

[Dar59] Charles Darwin. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. John Murray, 1859.<br />

[dBSvKO00] Mark de Berg, Otfried Schwarzkopf, Marc van Kreveld, and Mark<br />

Overmars. Computational Geometry: Algorithms and Applications.<br />

Springer, 2000.<br />

[DHM96] A. Dress, D. Huson, and V. Moulton. Analyzing and visualizing sequence and distance data using SplitsTree. Discrete Applied Mathematics, 71:95–109, 1996.<br />

[Epp98] David Eppstein. Fast hierarchical clustering and other applications of dynamic closest pairs. In Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, pages 619–628. Society for Industrial and Applied Mathematics, 1998.<br />

[GHJV94] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns — Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.<br />

[gnu99] gnuplot — a command-driven interactive function plotting program. Webpage, 1999. http://www.rz.uni-karlsruhe.de/ig25/gnuplot-faq/.<br />

[Gus91] Dan Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19–28, 1991.<br />

[HKY85] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22:160–174, 1985.<br />

[HNP+98] D. H. Huson, S. Nettles, L. Parida, T. Warnow, and S. Yooseph. The disk-covering method for tree reconstruction. Proceedings of Algorithms and Experiments, pages 62–75, 1998.<br />

[Hus98] D. Huson. SplitsTree: Analyzing and visualizing evolutionary data. Bioinformatics, 14:68–73, 1998.<br />

[JC69] T. H. Jukes and C. Cantor. Evolution of protein molecules. Academic Press, New York, 1969.<br />

[JWMV03] Katherine St. John, Tandy Warnow, Bernard M. E. Moret, and Lisa Vawter. Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. Journal of Algorithms, 48:173–193, 2003.<br />



[Kim80] M. Kimura. A simple model for estimating evolutionary rates<br />

of base substitutions through comparative studies of nucleotide<br />

sequences. Journal of Molecular Evolution, 16:111–120, 1980.<br />

[MS99] Vincent Moulton and Mike Steel. Retractions of finite distance<br />

functions onto tree metrics. Discrete Applied Mathematics,<br />

91:215–233, 1999.<br />

[MSM97] D. R. Maddison, D. L. Swofford, and Wayne P. Maddison. NEXUS: an extensible file format for systematic information. Systematic Biology, 46:590–621, 1997.<br />

[NK00] Masatoshi Nei and Sudhir Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, 198 Madison Avenue, New York, 2000.<br />

[NTT83] M. Nei, F. Tajima, and Y. Tateno. Accuracy of estimated phylogenetic trees from molecular data. II. Gene frequency data. Journal of Molecular Evolution, 19:153–170, 1983.<br />

[SM97] João Setubal and João Meidanis. Introduction to Computational Molecular Biology. Brooks/Cole Publishing Company, 511 Forest Lodge Road, Pacific Grove, California, 1997.<br />

[SN87] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4:406–425, 1987.<br />

[SS73] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. W. H. Freeman & Co, 41 Madison Avenue, New York, 1973.<br />

[SS03] Charles Semple and Mike Steel. Phylogenetics. Oxford University Press, 2003.<br />

[Sun03] Sun. Javadoc 1.4.2 Tool. Webpage, 2003.<br />

http://java.sun.com/j2se/1.4.2/docs/tooldocs/javadoc/index.html.<br />

[TN96] N. Takezaki and M. Nei. Genetic distances and reconstruction of phylogenetic trees from microsatellite DNA. Genetics, 144:389–399, 1996.<br />

