
Refined Buneman Trees

Lasse Westh-Nielsen

May 2004


Abstract

The field of bioinformatics needs efficient methods for analysing data. New and improved technologies increase the efficiency of e.g. DNA sequencing, and the amount of data coming out of biological research laboratories grows at an increasing rate. It is therefore important to be able to handle these huge amounts of data with limited computational power.

Evolutionary tree reconstruction is a fundamental research problem in biology, with a wide range of applications. The perfect phylogeny problem is NP-hard, and large datasets need to be analysed quickly and accurately. Current methods are very one-sided in their trade-off when tackling this problem, being either slow but precise, or fast but inaccurate.

The refined Buneman method is a relatively new and largely untested tree reconstruction method. Earlier algorithms computing the refined Buneman tree would run in time O(n^5) and space O(n^4), making the method infeasible for large datasets, but with the advent of the algorithm described in [BFÖ+03], the method has complexity bounds of O(n^3) and O(n^2) for time and space, respectively. Suddenly the method becomes useful in practice and is directly comparable to the popular Neighbor-Joining method, which has precisely the same complexity bounds.

This thesis describes the first implementation of the cubic-time algorithm computing the refined Buneman tree. We verify the complexity bounds and perform an initial comparative study of the refined Buneman algorithm and the Neighbor-Joining method. We describe how the implementation can easily be integrated into the widely used JSplits software package, thus making the method instantly and easily available. Finally, we consider the biological accuracy of the refined Buneman method.


This thesis is dedicated to my family
Estrid, René and Morten
and to my dear
Khay Ling


Acknowledgements

I would like to thank the following people for their help during the course of writing this thesis:

Poh Khay Ling, for your love, support and infinite patience with me as I have been dragging this process out for far too long.

Christian Nørgaard Storm Pedersen, for inspiration, guidance, advice and supervision during the course of writing this thesis.

Thomas Mailund, for invaluable feedback and corrections regarding both the formalities and the content of this thesis.

Wouter Boomsma, René Thomsen & Jakob Vesterstrøm, for help on Emacs, LaTeX and Linux, and for good office companionship.

Mikkel Heide Schierup, for giving me a job and an office at BiRC, without which this thesis would probably never have existed...


Contents

1 Introduction 7
  1.1 Document structure 8

I Overview of Evolutionary Trees 9

2 Definitions 10
  2.1 Species 10
  2.2 Evolutionary tree 10
  2.3 Distance measure 13
  2.4 Quartets 14
  2.5 Diagonal quartets 15
  2.6 Splits 16
  2.7 Buneman tree 18
  2.8 Anchored Buneman tree 18
  2.9 Refined Buneman tree 19

3 Evolution and Bioinformatics 21
  3.1 The Tree of Life and the language of DNA 21
  3.2 Bioinformatics 23

4 A Catalogue of Tree Reconstruction Methods 25
  4.1 Distance methods 26
    4.1.1 UPGMA 26
    4.1.2 Neighbor-Joining 27
    4.1.3 Quartet based methods 28
  4.2 Parsimony methods 29
  4.3 Maximum likelihood methods 30
  4.4 Hybrid methods 31
  4.5 Accuracy of inferred trees 31

II Implementing Refined Buneman Trees 32

5 Implementation Structure 33
  5.1 Overapproximating 33
  5.2 Pruning 35
    5.2.1 Searching for minimum diagonals 36
    5.2.2 Lazy matrix construction 38
    5.2.3 Pruning explained 40

6 The Tree Data Structure 42
  6.1 Design of the tree data structure 42
    6.1.1 The tree data structure explained 43
  6.2 Operations on the tree data structure 46
    6.2.1 Insert 46
    6.2.2 Delete 48
    6.2.3 Incompatible 49
  6.3 Testing the tree data structure 50

7 The Single Linkage Clustering Tree 51
  7.1 Replacing the anchored Buneman tree 51
  7.2 Calculating the single linkage clustering tree 51
  7.3 Converting the distance matrix 52

8 The Quad Tree Data Structure 55
  8.1 Complexity analysis 56
  8.2 Performance of the quad tree data structure 57

9 The Discard-Right Algorithm 61

10 The Selection Algorithm 64
  10.1 Selection with side effects 64
  10.2 Performance of the selection algorithm 66

11 JSplits 68
  11.1 What is JSplits 68
  11.2 Using JSplits 69
  11.3 Extending JSplits 69
  11.4 The Distances2Splits interface 70

12 Source Code 72

III Tests and Experiments 73

13 The Reference Implementation 74
  13.1 A simple refined Buneman tree algorithm 74
  13.2 Implementation highlights 75
  13.3 Correctness of the reference implementation 75
  13.4 Performance of the reference implementation 76

14 Correctness 78
  14.1 Test strategy 78
  14.2 Test setup 79
  14.3 Test results 80

15 Complexity 81
  15.1 Running time 81
  15.2 Space requirements 83
    15.2.1 The Linux ps command 84
    15.2.2 JVM garbage collector log 85
    15.2.3 Test results 86

16 Comparing Evolutionary Tree Methods 88
  16.1 Test setup 89
  16.2 Buneman and refined Buneman 89
  16.3 Refined Buneman and Neighbor-Joining 89
  16.4 Summary of experimental results 94

17 Conclusion 97
  17.1 Future work 97

IV Appendices 99

A Correctness of the Reference Implementation 100

B Garbage Collector Log 104


Chapter 1

Introduction

Although much remains obscure, and will long remain obscure, I can entertain no doubt, after the most deliberate study and dispassionate judgement of which I am capable, that the view which most naturalists entertain, and which I formerly entertained — namely, that each species has been independently created — is erroneous. I am fully convinced that species are not immutable; but that those belonging to what are called the same genera are lineal descendants of some other and generally extinct species, in the same manner as the acknowledged varieties of any one species are the descendants of that species.

Charles Darwin: "The Origin of Species"

Charles Darwin published his theory of evolution in 1859, roughly 150 years ago at the time of writing. He was opposed by the majority of his age: most naturalists believed that species were immutable productions that had been separately created. 150 years later, the theory is widely accepted and has given rise to much scientific research and new philosophy. Controversy still lurks, particularly in some school systems where creationism and evolutionism compete for the role of the "theory of life". Luckily, the battle heavily favours the latter, and religious sensibilities are being forced to subside.

All in all, the theory of evolution has gone from victory to victory as many scientific achievements have turned out to support it. The discovery of the structure of DNA by Crick & Watson in 1953 solved a big part of the puzzle, showing how genetic information is stored, copied and propagated, thus creating evolutionary history through inheritance. The appearance of modern-day computers also contributes significantly to the success of evolutionary biology. As huge amounts of data are being sequenced in labs all over the world, computer programs implementing mathematical models of evolution are giving us new information on a daily basis, information that gives us an increasingly greater understanding of nature and life itself.


The theory of evolution has also become a cornerstone of philosophy and modern thinking, to a point where it is almost religious. The phrase survival of the fittest is a guideline for businesses in the capitalist economy, and playing God through genetic experiments is one of the most popular scenarios in modern science fiction.

In this work we shall look at a specific method for inferring evolutionary history from a set of species: the refined Buneman tree algorithm.

1.1 Document structure

This work is organized into three major parts.

Part 1: Overview of Evolutionary Trees
In the first part we look at evolutionary trees, their applications and some of the most important tree reconstruction methods and method classes. We introduce evolutionary principles and mechanisms, and establish how these may be modelled mathematically.

Part 2: Implementing Refined Buneman Trees
In this part we concern ourselves with the implementation of the refined Buneman tree algorithm described in [BFÖ+03]. This algorithm is the first that runs in time O(n^3) and space O(n^2); the previous best algorithm would use time O(n^5) and space O(n^4). This implementation is the first of its kind and the first that is practical to run on larger scale data sets. We will also see how the implementation is integrated into the widely used JSplits tree visualizing tool.

Part 3: Tests and Experiments
In this part we test the implementation of the refined Buneman tree algorithm to see if it performs as specified, and we perform experiments that demonstrate the applicability of the method, relating it to the well-known Neighbor-Joining method.


Part I

Overview of Evolutionary Trees


Chapter 2

Definitions

This chapter is concerned with defining the terms used throughout the text, and their relations. The definitions are largely copied from [BFÖ+03], but with some additional comments.

2.1 Species

Species — a set of animals or plants in which the members have similar characteristics to each other and can breed with each other.

From the Cambridge Advanced Learner's Dictionary

In this text we shall use the term species more loosely than the definition above suggests. The term is taken from a bioinformatics context, but it should be understood to mean more than just plants or animals. In his original paper [Bun71], Peter Buneman was concerned with both biology and the filiation of manuscripts. But we are basically interested in anything which is related by some evolutionary relationship or phylogeny. We can indeed compare apples and oranges; however, it makes no sense to compare species of fish to editions of Hamlet. Thus, for our purposes we shall define a set of species to mean something along the lines of a set of objects in which the members have a measurable phylogeny.

In the following we will need to refer to the first k species from a set of species X = {x_1, ..., x_n}, assuming some arbitrary ordering, so we define, for integers k ∈ {1, ..., n}, X_k = {x_1, ..., x_k}.

2.2 Evolutionary tree

Definition 1 of an evolutionary tree is taken from [SS03]; it is a bit more elaborate than the equivalent definition in [BFÖ+03]. The idea of the evolutionary tree is, in the words of Semple & Steel ([SS03]), to provide a standard graphical representation of evolutionary relationships — but of course, we would like to do more than just look at the trees.

Definition 1 (Evolutionary tree). An evolutionary tree T is an ordered pair (T; φ), where T is a tree with vertex set V and φ : X → V with the property that, for each v ∈ V of degree at most two, v ∈ φ(X). An evolutionary tree is also called a semi-labeled tree (on X).

The main thing to notice about Definition 1 is the rule that all leaves must correspond to species, but not all species have to correspond to leaves. While working on this thesis, the author experienced some confusion until Definition 2 filled in a gap in the author's understanding of evolutionary and phylogenetic trees and their relation to graph-theoretical trees. One might be tempted to interpret evolutionary trees as unrooted trees where leaves represent species, but the small detail that evolutionary trees do not need to be fully resolved is quite important.

Definition 2 (Phylogenetic tree). A phylogenetic tree T is an evolutionary tree (T; φ) with the property that φ is a bijection from X into the set of leaves of T.

In graph-theoretical terms, a phylogenetic tree is a leaf-labeled tree, while an evolutionary tree is a semi-labeled tree. An illustration of this distinction is given in Figure 2.1. Here we have an evolutionary tree with three "abnormal" regions A, B and C, where the tree is not fully resolved. If we look at regions A and B, we see that a species appears to be the ancestor of other species. It is of course possible that we have such a dataset, but a more likely explanation would be that the underlying evolutionary data simply does not tell us how to resolve these species — which is precisely the situation in region C. We might argue that since we cannot distinguish between these species, they must be the same species. But it is also very likely that the underlying evolutionary data is inaccurate.

The important thing to stress is that our tree reconstruction method might output a tree that is not fully resolved for whatever reason, and additional analysis might be needed to find the answers we are looking for. An easy way of obtaining a leaf-labeled tree is to simply add extra edges to those labeled nodes which have degree two or more. This is illustrated in Figure 2.2. Of course, by doing so we are losing information about the original tree, but this can be remedied by marking the extra edges. The reason for performing this transformation is that leaf-labeled trees are easier to work with than semi-labeled trees, for the average computer scientist.

Two graph-theoretical results are handy when reasoning about trees and tree search spaces: a semi-labeled unrooted tree with at most n leaves has at most n − 2 inner nodes, for a maximum of 2n − 2 nodes in the whole tree. And there are at most 2n − 3 edges.

Figure 2.1: An evolutionary tree for 14 species. (Unresolved regions A, B and C are marked.)

Figure 2.2: A phylogenetic tree resembling the evolutionary tree in Figure 2.1, with extra edges marked.


Regarding the size of the search space for evolutionary trees on n species, it is a bit unclear. The following result from [SS03] gives the number of binary phylogenetic trees with n leaves:

$$\frac{(2n-4)!}{(n-2)!\,2^{n-2}}$$

However, there must be more evolutionary trees with n labels, since these are required neither to be binary nor to have n leaves. Another way of looking at the size of the search space is to span it using set partitions. If we look at a set X of n species, we might say that a partition of this set corresponds to an edge in an evolutionary tree for those species. There are O(2^n) such partitions (or splits, as they are also called). Since a tree is basically a set of partitions, the number of trees that can be constructed from O(2^n) partitions must be upper bounded by the size of P(2^n), which is O(2^(2^n)). However, this upper bound is unrealistically high, since we know that any tree with at most n leaves has at most 2n − 3 edges, corresponding to 2n − 3 partitions of the set of species. And out of these, only a fraction will correspond to a tree, namely those where the set partitions are not contradictory.

The author is a bit unsure which estimate is the best, but in any case the search space is gigantic indeed! And clearly, when we want to find an evolutionary tree for a set of species, it is completely infeasible to search through this space of tree topologies from end to end.
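As a sanity check on the formula above, a short routine can tabulate the count for small n. It uses the equivalent double-factorial product 1 · 3 · 5 · · · (2n − 5): each new leaf can be attached to any of the 2k − 5 edges of a binary tree on k − 1 leaves. The class is an illustrative sketch in Java (the implementation language used in this thesis), not part of the thesis implementation.

```java
// Counts binary phylogenetic trees on n leaves: (2n-4)! / ((n-2)! * 2^(n-2)).
// Illustrative sketch, not part of the thesis implementation.
public class BinaryTreeCount {
    public static long count(int n) {
        if (n < 3) return 1;
        long trees = 1;
        // A binary tree on k-1 leaves has 2(k-1) - 3 = 2k - 5 edges, and leaf
        // number k can be attached to any of them, giving the product form.
        for (int k = 3; k <= n; k++) {
            trees *= 2 * k - 5;
        }
        return trees;
    }

    public static void main(String[] args) {
        for (int n = 4; n <= 8; n++) {
            System.out.println(n + " leaves: " + count(n) + " binary trees");
        }
    }
}
```

For n = 4, 5, 6 this yields 3, 15 and 105 trees, matching the closed formula; already at n = 20 the count exceeds 10^20, illustrating why exhaustive search is infeasible.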

2.3 Distance measure

A distance measure on a set of species X is a function δ : X² → R⁺ where δ(x, x) = 0 and δ(x, y) = δ(y, x) for all x, y ∈ X. In this text we will assume that the function δ is given by a distance matrix. Let n = |X|. Then a distance measure would look like the matrix in Equation 2.1.

$$\delta = \begin{pmatrix}
0 & \delta_{21} & \delta_{31} & \dots & \delta_{n1} \\
\delta_{21} & 0 & \delta_{32} & \dots & \delta_{n2} \\
\delta_{31} & \delta_{32} & 0 & & \vdots \\
\vdots & \vdots & & \ddots & \\
\delta_{n1} & \delta_{n2} & \dots & & 0
\end{pmatrix} \qquad (2.1)$$

Notice that the distance data does not have to have any metric or other sensible properties; e.g. there is no provision that the triangle inequality must hold. The only requirements are that the matrix must be symmetric around the diagonal, and the diagonal must be all zeros.
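The two requirements can be checked mechanically. The following Java sketch (class and method names are illustrative, not taken from the thesis implementation) validates a candidate distance matrix:

```java
// Validates the two requirements on a distance measure delta given as a
// matrix: zero diagonal and symmetry. No metric properties (e.g. the
// triangle inequality) are required. Illustrative sketch only.
public class DistanceMatrix {
    public static boolean isValid(double[][] delta) {
        int n = delta.length;
        for (int i = 0; i < n; i++) {
            if (delta[i].length != n) return false;       // must be square
            if (delta[i][i] != 0.0) return false;         // delta(x, x) = 0
            for (int j = i + 1; j < n; j++) {
                if (delta[i][j] < 0.0) return false;      // maps into R+
                if (delta[i][j] != delta[j][i]) return false; // symmetry
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] ok  = {{0, 2, 3}, {2, 0, 1}, {3, 1, 0}};
        double[][] bad = {{0, 2, 3}, {2, 0, 1}, {9, 1, 0}}; // not symmetric
        System.out.println(isValid(ok) + " " + isValid(bad)); // true false
    }
}
```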

2.4 Quartets

To every set of four species a, b, c, d ∈ X there are four ways to associate a leaf-labeled tree, as shown in Figure 2.3. The three possible binary tree resolutions, quartets, are denoted by ab|cd, ac|bd and ad|bc, indicating how the central edge of the binary tree bipartitions the four species. We say that an edge e in an evolutionary tree induces a quartet ab|cd if e bipartitions the four species in the same way as the central edge of the quartet — see Figure 2.4. A single edge in an evolutionary tree on X induces O(|X|^4) quartets: imagine the edge splits the species in two halves; then there are roughly |X|/2 choices for each of a, b, c and d, for a total of |X|^4/16 possible quartets. Quartets are symmetric in the sense that we may swap a ↔ b or c ↔ d or even ab ↔ cd and still obtain the same quartet.

Figure 2.3: The possible topologies of four species. Only three of these are quartets.

The Buneman score of a quartet q = ab|cd, where a, b, c, d ∈ X, is defined as

$$\beta_q = \tfrac{1}{2}\left(\min\{\delta_{ac} + \delta_{bd},\ \delta_{ad} + \delta_{bc}\} - (\delta_{ab} + \delta_{cd})\right)$$

Two distinct quartets q_1 and q_2 for the same four species satisfy $\beta_{q_1} + \beta_{q_2} \le 0$ ([BFÖ+03]).

Figure 2.4: The edge e induces the quartet ab|cd, as well as many others.
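The Buneman score is directly computable from the distance matrix. The following Java sketch (illustrative only, not the thesis implementation) computes β_q and demonstrates on additive distances that the true quartet scores positively while a competing quartet does not:

```java
// Buneman score of the quartet ab|cd from a distance matrix:
// beta = 1/2 * (min{d(a,c)+d(b,d), d(a,d)+d(b,c)} - (d(a,b)+d(c,d))).
// Illustrative sketch, not the thesis implementation.
public class BunemanScore {
    public static double beta(double[][] delta, int a, int b, int c, int d) {
        double cross = Math.min(delta[a][c] + delta[b][d],
                                delta[a][d] + delta[b][c]);
        return 0.5 * (cross - (delta[a][b] + delta[c][d]));
    }

    public static void main(String[] args) {
        // Additive distances from the quartet tree ab|cd with unit branches.
        double[][] delta = {
            {0, 2, 3, 3},   // a
            {2, 0, 3, 3},   // b
            {3, 3, 0, 2},   // c
            {3, 3, 2, 0},   // d
        };
        System.out.println(beta(delta, 0, 1, 2, 3)); // ab|cd: 1.0 (positive)
        System.out.println(beta(delta, 0, 2, 1, 3)); // ac|bd: -1.0
    }
}
```

Note that the two scores sum to 0, consistent with the bound β_{q1} + β_{q2} ≤ 0 quoted above.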

2.5 Diagonal quartets

Looking at the definition of a quartet and its associated Buneman score, we might observe that the quartet q = ab|cd can be viewed as two diagonal quartets, denoted ab||cd and ab||dc. This intuition is not simple, but we can make the following rewrite to illustrate the point:

$$\beta_q = \tfrac{1}{2}\left(\min\{\delta_{ac} + \delta_{bd},\ \delta_{ad} + \delta_{bc}\} - (\delta_{ab} + \delta_{cd})\right) \qquad (2.2)$$
$$= \min\left\{\tfrac{1}{2}(-\delta_{ab} + \delta_{bc} - \delta_{cd} + \delta_{da}),\ \tfrac{1}{2}(-\delta_{ab} + \delta_{bd} - \delta_{dc} + \delta_{ca})\right\} \qquad (2.3)$$
$$= \min\{\eta_{ab||cd},\ \eta_{ab||dc}\} \qquad (2.4)$$

where $\eta_{ab||cd} = \tfrac{1}{2}(\delta_{bc} - \delta_{ab} + \delta_{ad} - \delta_{cd})$ is the score of the diagonal quartet ab||cd.

Looking at Figure 2.5 we can see that the terms η_{ab||cd} and η_{ab||dc} correspond to a "tour" in one of the two diagonal quartets. In words, we go through either a → b → c → d → a or a → b → d → c → a, adding or subtracting chords as appropriate. We might also notice that diagonal quartets are symmetric on either side of their central edge, i.e. starting out with the diagonal quartet ab||cd we can swap a ↔ b and c ↔ d, or ab ↔ cd, to obtain ba||dc or cd||ab, where η_{ab||cd} = η_{ba||dc} = η_{cd||ab} and all three diagonal quartets still identify the same quartet ab|cd.

Figure 2.5: The symmetric quartet and its associated asymmetric diagonal quartets.
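The rewrite above can be verified numerically: for any symmetric distance matrix, the minimum of the two diagonal scores equals the Buneman score. A small Java sketch (illustrative only, not thesis code):

```java
// Checks that the Buneman score of ab|cd equals the minimum of the two
// diagonal quartet scores eta(ab||cd) and eta(ab||dc). Illustrative sketch.
public class DiagonalQuartets {
    // eta(ab||cd) = 1/2 * (d(b,c) - d(a,b) + d(a,d) - d(c,d)): the "tour"
    // a -> b -> c -> d -> a with alternating signs on the chords.
    public static double eta(double[][] delta, int a, int b, int c, int d) {
        return 0.5 * (delta[b][c] - delta[a][b] + delta[a][d] - delta[c][d]);
    }

    public static double beta(double[][] delta, int a, int b, int c, int d) {
        double cross = Math.min(delta[a][c] + delta[b][d],
                                delta[a][d] + delta[b][c]);
        return 0.5 * (cross - (delta[a][b] + delta[c][d]));
    }

    public static void main(String[] args) {
        // Random symmetric matrix with zero diagonal; equality holds for any
        // such matrix (up to floating-point rounding).
        java.util.Random rng = new java.util.Random(42);
        double[][] delta = new double[4][4];
        for (int i = 0; i < 4; i++)
            for (int j = i + 1; j < 4; j++)
                delta[i][j] = delta[j][i] = rng.nextDouble();
        double viaDiagonals = Math.min(eta(delta, 0, 1, 2, 3),
                                       eta(delta, 0, 1, 3, 2));
        System.out.println(Math.abs(viaDiagonals - beta(delta, 0, 1, 2, 3)) < 1e-12);
    }
}
```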

To keep an ordering on diagonal quartets we shall define the minimum diagonal of ab|cd: let a be the smallest (by index in X) of the species a, b, c, d ∈ X. ab||cd is the minimum diagonal if $\eta_{ab||cd} \le \eta_{ab||dc}$, and ab||dc otherwise.


2.6 Splits

The partition of a finite set S into two non-empty parts U and V is denoted a split σ = U|V. If |U| = 1 or |V| = 1 the split is called trivial. It is reasonable to represent a split as a bitvector or binary number, and by convention we shall say that for a bitvector A representing σ, x_i ∈ U if and only if A[i] = 0. Splits are symmetric, so if w represents the split σ, then ¬w also represents σ. The set of splits on a set X is denoted σ(X). The size of σ(X) is the number of unique splits on X, so

$$|\sigma(X)| = \frac{|\mathcal{P}(X)| - 2}{2} = \frac{2^n - 2}{2} = 2^{n-1} - 1.$$

We exclude symmetric splits and deduct the two splits where U = ∅ or V = ∅.
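The bitvector representation and the count 2^(n−1) − 1 can be illustrated in a few lines of Java (an illustrative sketch; fixing x_1 on the U side picks exactly one representative per split, since w and ¬w denote the same split):

```java
// Splits of X = {x_1, ..., x_n} as bitvectors: bit i-1 is 0 iff x_i is in U.
// Enumerates one canonical representative per split and checks the count
// 2^(n-1) - 1. Illustrative sketch, not the thesis implementation.
public class SplitCount {
    public static int numSplits(int n) {
        // |sigma(X)| = (|P(X)| - 2) / 2 = (2^n - 2) / 2 = 2^(n-1) - 1
        return (1 << (n - 1)) - 1;
    }

    public static void main(String[] args) {
        int n = 5;
        int count = 0;
        // Canonical representative: bit 0 clear (x_1 on the U side) and at
        // least one bit set (so V is non-empty); w and ~w are the same split.
        for (int w = 0; w < (1 << n); w += 2) {
            if (w != 0) count++;
        }
        System.out.println(count == numSplits(n)); // true
    }
}
```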

The set of quartets associated with a split σ = U|V on a set X is defined by q(σ) = {uu′|vv′ : u, u′ ∈ U ∧ v, v′ ∈ V}. Here u, u′ (and similarly v, v′) need not be distinct. The size of q(U|V) is in the order of O(|X|^4) — recall that an edge in a tree induces O(|X|^4) quartets, and splits are equivalent to edges in this case.

Definition 3 (Compatibility). Two splits A|B and C|D are said to be compatible if and only if one of A ∩ C, A ∩ D, B ∩ C or B ∩ D is empty.

Compatible sets of splits are the foundation for the algorithm presented in this thesis, and they are a perfect tool when dealing with evolutionary trees. A set of splits is compatible if and only if all splits in the set are pairwise compatible. And of course, any subset of a compatible set of splits is again compatible.
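With splits stored as bitvectors, Definition 3 amounts to four mask operations. A Java sketch (illustrative only, not the thesis implementation):

```java
// Split compatibility (Definition 3): A|B and C|D are compatible iff one of
// the four intersections A∩C, A∩D, B∩C, B∩D is empty. With splits stored as
// bitvectors over n species this is four mask tests. Illustrative sketch.
public class Compatibility {
    public static boolean compatible(int s, int t, int n) {
        int all = (1 << n) - 1;          // the full species set X
        int a = s, b = ~s & all;         // the two sides of the first split
        int c = t, d = ~t & all;         // the two sides of the second split
        return (a & c) == 0 || (a & d) == 0 || (b & c) == 0 || (b & d) == 0;
    }

    public static void main(String[] args) {
        int n = 5;
        int s = 0b00011;  // {x1, x2} | {x3, x4, x5}
        int t = 0b00111;  // {x1, x2, x3} | {x4, x5}: nested, hence compatible
        int u = 0b00110;  // {x2, x3} | {x1, x4, x5}: crosses s, incompatible
        System.out.println(compatible(s, t, n) + " " + compatible(s, u, n)); // true false
    }
}
```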

There is a close connection between compatible sets of splits and evolutionary trees. Any edge e in an unrooted tree T splits the set of leaves of T into two non-empty parts. Let Σ(T) denote the set of splits associated with the edges of a tree T. Then Theorem 1 (from [SS03]) gives the relation between compatible sets of splits and evolutionary trees.

Theorem 1 (Splits-Equivalence Theorem). Let Σ be a collection of splits on X. Then there is an evolutionary tree T such that Σ = Σ(T) if and only if Σ is a compatible set of splits. If T exists, it is unique up to isomorphism.

From now on we shall use the terms compatible set of splits / evolutionary tree and split / edge interchangeably. They are one and the same: Table 2.1 shows a compatible set of (weighted) splits, and Figure 2.6 shows the equivalent evolutionary tree. Recall the discussion of evolutionary trees versus phylogenetic trees; when working with a method such as the refined Buneman tree algorithm, which outputs compatible sets of splits that might or might not correspond to a fully resolved tree, it is important to be able to describe such a tree in a precise manner. When dealing with e.g. Neighbor-Joining, we can rely on the more regular phylogenetic trees, since the NJ method always resolves trees completely.

Lemma 1 is due to Dan Gusfield ([Gus91], section 1.2) and gives an important upper bound for the time required to go from compatible sets of splits to phylogenetic trees.

Lemma 1. An unrooted tree with n leaves can be constructed from its set of non-trivial splits in time O(kn), where k is the number of non-trivial splits.

Split (ABCDEFG)   Score
0011111           0.6
0011000           0.4
0000001           0.2
0000111           0.5
0100000           0.3
0111111           0.2
0000010           0.3

Table 2.1: A compatible set of weighted splits.

Figure 2.6: A tree that represents the set of splits in Table 2.1.

Of course, in Lemma 1 it is implicitly assumed that all trivial splits are present, so it does not apply directly to our situation. However, the tree data structure we use for storing a compatible set of splits (see Chapter 6) can be used to turn a set of compatible splits into a tree in time O(n^2) — inserting a maximum of n trivial and n − 2 non-trivial splits at a cost of linear time each.

A trivial split is compatible with every split of X. In a phylogenetic tree, the edges that are connected to leaves are exactly the trivial splits of X. Recall the definitions of evolutionary trees and phylogenetic trees, and let us define Σ_trivial(X) to be the trivial splits of X. It is now apparent that for every evolutionary tree T we have an associated phylogenetic tree T′ defined by the set of splits Σ(T′) = Σ(T) ∪ Σ_trivial(X).

Recall Figure 2.1. Here we see examples of unresolved parts in an evolutionary tree. This is quite feasible, and shows the usefulness of working with splits rather than normal trees. Our normal leaf-labeled tree data structures cannot easily capture this class of trees. A normal tree would correspond to a phylogenetic tree, and one would have to mark the excess branches in some way to capture the evolutionary tree underlying that representation. It is important to remember that species are not leaves. It is convenient if all species are leaves, but tree reconstruction methods (for example the refined Buneman tree method described in this thesis) do not guarantee such high resolution.

Still, trees have some very useful computational properties which we will use extensively throughout this work. Also, the graphical representation of an evolutionary tree gives an invaluable understanding of the evolutionary data it represents. One obvious scheme when using an ordinary leaf-labeled tree to represent a set of splits is to say that all trivial splits are in the tree, but that they carry a special marker if they are not members of the set of splits that the tree is supposed to represent. In order to be able to mark special edges, and also keep track of the lengths of branches in the evolutionary tree, we will define the weight of an edge such that a weight of zero is the special marker that invalidates the edge as a split, and positive weights represent branch lengths. This is very reasonable, since the sets of splits we are going to represent later on will only have positive weights — edges with non-positive weights are simply not members of the tree. The graphical representation would closely resemble the actual evolutionary tree corresponding to the set of splits, when trivial branches have no extent.

2.7 Buneman tree

Given a set of size n, the Buneman index of a split σ = U|V of X is defined as:

$$\mu_\sigma(\delta) = \min_{u,u' \in U,\ v,v' \in V} \beta_{uu'|vv'} \qquad (2.5)$$

The set of splits B(δ) = {σ : μ_σ(δ) > 0} is a compatible set of splits [Bun71]. The Buneman tree corresponding to a given dissimilarity measure δ is defined to be the weighted unrooted tree whose edges represent the splits σ ∈ B(δ) and are weighted according to μ_σ(δ).

This definition of the Buneman tree relates very well to the discussion in the previous section: a compatible set of splits is represented by a weighted, unrooted tree.
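Equation 2.5 can be evaluated by brute force over the quartets induced by the split; u, u′ (and v, v′) need not be distinct, as in the definition of q(σ). The Java sketch below is illustrative only; the actual implementation avoids this O(n^4)-per-split enumeration.

```java
// Buneman index of a split U|V (Equation 2.5): the minimum Buneman score
// over all quartets uu'|vv' with u,u' in U and v,v' in V (not necessarily
// distinct). Brute-force illustrative sketch, not the thesis implementation.
public class BunemanIndex {
    static double beta(double[][] delta, int a, int b, int c, int d) {
        double cross = Math.min(delta[a][c] + delta[b][d],
                                delta[a][d] + delta[b][c]);
        return 0.5 * (cross - (delta[a][b] + delta[c][d]));
    }

    public static double mu(double[][] delta, int[] u, int[] v) {
        double min = Double.POSITIVE_INFINITY;
        for (int a : u) for (int b : u)        // u, u' need not be distinct
            for (int c : v) for (int d : v)    // v, v' need not be distinct
                min = Math.min(min, beta(delta, a, b, c, d));
        return min;
    }

    public static void main(String[] args) {
        // Additive distances on the quartet tree ab|cd with unit branches.
        double[][] delta = {{0,2,3,3}, {2,0,3,3}, {3,3,0,2}, {3,3,2,0}};
        System.out.println(mu(delta, new int[]{0,1}, new int[]{2,3})); // 1.0: in B(delta)
        System.out.println(mu(delta, new int[]{0,2}, new int[]{1,3})); // -1.0: rejected
    }
}
```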

2.8 Anchored <strong>Buneman</strong> tree<br />

The anchored <strong>Buneman</strong> tree is a relaxation of the <strong>Buneman</strong> tree. The anchored<br />

<strong>Buneman</strong> tree fixes some species x ∈ X and only considers splits U|V where<br />

x ∈ U. One might say the species x plays the role of outgroup with respect to<br />

the rest of the species.<br />

Let σ = U|V be a split with x ∈ U, then the anchored <strong>Buneman</strong> index is<br />

defined as follows:<br />

μ^x_σ(δ) = min_{u∈U, v,v′∈V} β_{xu|vv′}<br />

The anchored <strong>Buneman</strong> tree is the tree that represents the compatible set of<br />

splits defined by B^x(δ) = {σ : μ^x_σ(δ) > 0}.<br />

A couple of lemmas are worth noting when dealing with anchored <strong>Buneman</strong><br />

trees. Firstly, David Bryant and Vincent Moulton ([BM99]) have shown the<br />

connection between anchored <strong>Buneman</strong> trees and the regular <strong>Buneman</strong> tree,<br />

given here in Lemma 2. Secondly, Lemma 3 which is due to Vincent Berry<br />



and David Bryant ([BB99], section 3) states the complexity of computing the<br />

anchored <strong>Buneman</strong> tree.<br />

Lemma 2. Let δ be a distance measure on X. Then B(δ) = ⋂_{x∈X} B^x(δ).<br />

Lemma 3. B^x(δ) can be computed in time and space O(n²).<br />

2.9 <strong>Refined</strong> <strong>Buneman</strong> tree<br />

Given a split σ for a set of size n, let m = |q(σ)| and let q_1, …, q_m be an ordering<br />

of the elements in q(σ) in non-decreasing order of their <strong>Buneman</strong> scores. Then<br />

the refined <strong>Buneman</strong> index of a split σ is defined as:<br />

μ_σ(δ) = (1/(n−3)) ∑_{i=1}^{n−3} β_{q_i}    (2.6)<br />

In other words, the refined <strong>Buneman</strong> index of a split is the average over<br />

the n − 3 lowest-scoring quartets. The choice of n − 3 is attributed to divine<br />

intervention — apparently, that was the choice that would make the proof in<br />

[MS99] work. The set of splits RB(δ) = {σ : μ_σ(δ) > 0} is a compatible set of<br />

splits ([MS99]). And thus the <strong>Refined</strong> <strong>Buneman</strong> Tree corresponding to a given<br />

dissimilarity measure δ is defined to be the weighted unrooted tree whose edges<br />

represent the splits σ ∈ RB(δ) and are weighted according to μ σ (δ).<br />
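Again, the definition translates directly into a brute-force sketch, illustrative only: taking q(σ) to be the quartets induced by the split, and assuming the standard quartet score β_{uu′|vv′} = ½(min(δ(u,v) + δ(u′,v′), δ(u,v′) + δ(u′,v)) − δ(u,u′) − δ(v,v′)), the refined index is the average of the n − 3 smallest induced quartet scores.<br />

```python
from itertools import combinations

def refined_buneman_index(d, U, V):
    # Average of the n - 3 smallest quartet scores, per Equation (2.6).
    # The quartet score beta is assumed in the standard form (lead-in above).
    scores = sorted(
        0.5 * (min(d[u][v] + d[u2][v2], d[u][v2] + d[u2][v])
               - d[u][u2] - d[v][v2])
        for u, u2 in combinations(U, 2)
        for v, v2 in combinations(V, 2))
    n = len(U) + len(V)
    return sum(scores[:n - 3]) / (n - 3)
```

For n = 4 there is a single quartet per split, and the refined index coincides with the plain <strong>Buneman</strong> index.<br />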

When constructing the refined <strong>Buneman</strong> tree as described later in this work,<br />

we shall rely heavily on Lemma 4. The lemma is due to [MS99], and it is used<br />

to maintain compatibility in a set of splits overapproximating the set of refined<br />

<strong>Buneman</strong> splits.<br />

Lemma 4. Given two incompatible splits σ and σ′,<br />

μ_σ(δ) ≤ 0 ∨ μ_σ′(δ) ≤ 0,<br />

and a split with non-positive index can be identified in time O(n).<br />

In the algorithm that computes the refined <strong>Buneman</strong> tree, we are going to<br />

construct a compatible set of splits which is an overapproximation of the set<br />

of refined <strong>Buneman</strong> splits. We shall do this for subsets of X of increasing size.<br />

The way to do this is, after bootstrapping some compatible set of splits, we<br />

shall introduce a set of candidates to go into the overapproximation. However,<br />

to maintain compatibility in the set we shall test the candidate splits against<br />

the existing splits, to find pairs that are incompatible.<br />

Once we find a pair of incompatible splits, we shall use Lemma 4 to determine<br />

which one does not belong in the set. We are not concerned with testing if<br />

one of the splits in the pair actually belongs in the set of refined <strong>Buneman</strong><br />

splits. We are only interested in throwing away candidates that are clearly<br />

incompatible. We can allow ourselves this luxury since we are only creating<br />

an overapproximation of RB(δ|X_k), not the set itself, and this saves valuable<br />




time in the algorithm. In [BFÖ+ 03], an algorithm is given which solves the<br />

problem in Lemma 4 in linear time. In this text, this algorithm is called the<br />

DISCARD-RIGHT algorithm.<br />
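For illustration, the compatibility test itself is simple: two splits A|B and C|D of the same set are compatible exactly when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. The sketch below is the naive set formulation only; the implementation in this work instead tests a candidate split against the tree data structure, as described in chapter 6.<br />

```python
def compatible(split1, split2):
    # Splits A|B and C|D of the same taxa set are compatible iff at least
    # one of the four pairwise intersections is empty.
    A, B = split1
    C, D = split2
    return any(not (S & T) for S in (A, B) for T in (C, D))
```
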

The second important lemma regarding refined <strong>Buneman</strong> trees is the foundation<br />

for the incremental algorithm presented in the article by Brodal et al.<br />

([BFÖ+ 03]), and which is presented later in this work. It is due to Bryant and<br />

Moulton ([BM99], proposition 3). It says that a split σ ∈ RB(δ|X_k) is either<br />

a member of B^{x_k}(δ|X_k) or RB(δ|X_{k−1}). If we turn it around, we can say that<br />

given the refined <strong>Buneman</strong> tree for X_k, we can calculate the refined <strong>Buneman</strong><br />

tree for X_{k+1} by looking only at splits in B^{x_{k+1}}(δ|X_{k+1}) and RB(δ|X_k) (with<br />

the discussion from the previous paragraph in mind, this would be “bootstrap<br />

set” and “candidate set”, respectively).<br />

Lemma 5. Suppose |X| > 4, and fix x ∈ X. If σ = U|V is a split in RB(δ) with<br />

x ∈ U and |U| > 2, then either U|V ∈ B^x(δ) or U−{x}|V ∈ RB(δ|X−{x}),<br />

or both.<br />



Chapter 3<br />

Evolution and<br />

Bioinformatics<br />

The theory of evolution is attributed to Charles Darwin and his work On the<br />

Origin of Species by Means of Natural Selection, or the Preservation of Favoured<br />

Races in the Struggle for Life from 1859. At the time, the theory challenged<br />

many established beliefs, particularly religious beliefs. Before Darwin the origins<br />

of life were credited to so-called “Creation Science” and other superstitions,<br />

and even today small pockets of resistance to Darwin's theories exist, notably in<br />

Alabama, USA.<br />

Evolution was hinted at before Darwin, for example by Jean-Baptiste de<br />

Lamarck, who suggested that life was governed by two principles: the principle<br />

of use and disuse (individuals lose characteristics they do not require and<br />

develop those which are useful) and the inheritance of acquired traits (individuals<br />

inherit the acquired traits of their ancestors). Even in ancient Greece, Anaximander<br />

fostered ideas similar to evolution. But not until Darwin were these<br />

theories supported by any real scientific evidence.<br />

3.1 The Tree of Life and the language of DNA<br />

After Darwin introduced his theory, biologists started working on reconstructing<br />

the evolutionary history of all organisms on earth, and expressing it in the form<br />

of a phylogenetic tree, the Tree of Life, illustrated in Figure 3.1. This work<br />

was carried out on fossils and living species, using comparative morphology and<br />

comparative physiology. These methods are however rather imprecise, and the<br />

trees thus constructed have been somewhat controversial.<br />

All of this changed when Watson and Crick discovered the ability of deoxyribonucleic<br />

acid (DNA) to encode and replicate hereditary information. Suddenly,<br />

scientists were able to read the recipe for a species in its DNA, which is basically<br />

a string over the alphabet Σ = {A, C, G, T }. It became possible to readily compare<br />

two organisms just by comparing their DNA in a precise and systematic<br />



Figure 3.1: A tree of life.<br />



fashion. Even organisms that do not exhibit common traits can be compared<br />

using this method.<br />

By reducing the complexity of DNA to a simple string over Σ = {A, C, G, T }<br />

and since the process of evolution can be simplified as operations on a string, e.g.<br />

insertion, deletion or substitution of a character, we are able to formulate the<br />

process as a mathematical model. This model can then be used to e.g. measure<br />

the evolutionary distance between two species. A very simple model would be<br />

to count the sites at which the DNA of the two species differs (after the<br />

sequences have been aligned), and to use that count as the distance. But there<br />

are of course more ingenious schemes than this. And<br />

thus once we have all pairwise distances between species in a set, we are ready<br />

to build their evolutionary tree.<br />
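The simple site-counting model just mentioned is the Hamming distance between the aligned sequences; a minimal sketch:<br />

```python
def hamming_distance(seq1, seq2):
    # Number of sites at which two pre-aligned, equal-length sequences differ.
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq1, seq2))
```

In practice this raw count is usually corrected for multiple substitutions at the same site, e.g. by one of the substitution models mentioned in chapter 4.<br />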

This is of course an extremely oversimplified explanation of molecular biology.<br />

We shall not worry about expressed regions (genes) and non-expressed regions of<br />

the DNA, the effects of selection, errors in sequenced strings, selecting comparable<br />

regions of DNA or details of evolutionary models. For our purposes it is<br />

enough to know that we can find a distance between species and use it to infer<br />

evolutionary history.<br />

Some questions arise even after we accept the theory of evolution. What<br />

was the origin of life? Given that we have so many, very different organisms, is it<br />

likely that they all descended from one common ancestor? Or is it more likely<br />

that there was more than one origin of life? Did life come from Mars, riding on a<br />

comet, as some propose? Or did life just erupt from molecules with autocatalytic<br />

properties? And does life exist elsewhere in the universe? Certainly, the theory<br />

of evolution answers a lot of questions, but it also poses new ones.<br />

3.2 Bioinformatics<br />

The field of bioinformatics is relatively new. The field is a joint venture between<br />

the sciences of mathematics, biology, statistics, computer science and other<br />

related sciences. Faced with huge and growing amounts of data from e.g. genome<br />

sequencing projects, the mission is to make sense of data and perhaps apply<br />

this new found knowledge to cure diseases, invent new technology and generally<br />

increase understanding of life and the universe.<br />

A few things need to be highlighted about bioinformatics and the problem<br />

of dealing with huge amounts of data, from a computer scientist's point of view.<br />

We have already established the gigantic size of the search space for evolutionary<br />

trees. But even smaller problems in bioinformatics, such as the alignment<br />

problem (for which the simplest instance, pairwise alignment, can be solved in<br />

time O(n²)), can cause difficulties when the size of the human genome is in the<br />

order of 3.2 billion base pairs distributed over approximately 30,000 genes with<br />

an average size of 27,000 nucleotide pairs (source: [AJL + 02]).<br />

Again stressing the informatics part of bioinformatics, we might illustrate the<br />

usefulness of a program such as BLAST. Imagine a scientist in a lab stumbling on<br />

a new protein or virus. He is able to sequence the DNA of the organism without<br />



knowing its origins, but how does he place it in the Tree of Life? He BLASTs<br />

his new sequence against sequence databases across the globe, connected via the<br />

internet, by entering his sequence on a webpage, and waits a while for hits, i.e.<br />

sequences that are similar to his own. Information about BLAST can be found<br />

here:<br />

http://www.ncbi.nlm.nih.gov/BLAST/<br />

He is now able to measure evolutionary distances of these sequences, and<br />

then use his favourite evolutionary tree reconstruction program to construct a<br />

phylogeny for the sequences. Shortly after, he is looking at the evolutionary<br />

history of his new organism - or perhaps he realizes someone found it already,<br />

he just didn't know about it.<br />



Chapter 4<br />

A Catalogue of Tree<br />

Reconstruction Methods<br />

Evolutionary tree reconstruction methods find application in e.g. protein<br />

structure/function prediction. When scientists find a new protein, they are anxious<br />

to find out which properties it has. And if it is possible to place the protein in<br />

a known protein family, it might also be possible to deduce properties derived<br />

from other family members.<br />

These methods also facilitate tracing biological material. If we are faced<br />

with two strains of HIV virus, we might wish to know if they are closely related,<br />

or if they are more likely from different families. The story in the website below<br />

shows how evolutionary history can be used as evidence in a murder trial:<br />

http://www.aegis.com/news/upi/2002/UP021005.html<br />

Generally, different evolutionary tree methods have different characteristics,<br />

including input/ output data formats. Some methods look at distance matrices,<br />

others take nucleotide sequences directly. Output might be rooted or unrooted<br />

trees. So the methods are perhaps not directly compatible, but with a bit of<br />

work, output from different methods ought to be comparable.<br />

Another factor to be considered is the origin and type of data we are working<br />

on. Some methods are sensitive to data with certain properties, particularly the<br />

methods that do not have a biological model to support them; and some data will<br />

give us unexpected results if we are not careful, e.g. an evolutionary tree based<br />

on sequences from homologous genes might not be equal to the evolutionary<br />

tree of the species from which the genes were taken, since individual genes can<br />

have evolutionary histories of their own. This is known as<br />

the problem of gene trees vs. species trees.<br />

We have established that the search space for evolutionary trees is enormous.<br />

The task of searching for an optimal tree in the search space of all semi-labeled<br />

trees is NP-hard, of course depending on the precise formulation of the optimization<br />

problem — one example is given in [SM97]. The following is a catalogue<br />

of methods for tree reconstruction. They all suffer from some disadvantage,<br />



ranging from huge time complexity to low biological precision, i.e. the kinds of<br />

tradeoffs one would expect for algorithms or heuristics attacking a problem in<br />

the class of NP-hard problems.<br />

Tree reconstruction methods can be classified into three major groups: distance<br />

methods, parsimony methods and maximum likelihood methods. The<br />

methods in the first group attack the problem of finding good evolutionary<br />

trees by sacrificing accuracy for speed, primarily by relying on biological information<br />

already extracted from biological data through e.g. sequence alignment.<br />

The other two groups encompass methods which are modelled more or less directly<br />

from evolutionary mechanisms and are thus expected to produce accurate<br />

results — but they generally suffer from poor performance. The next sections<br />

describe these groups of methods in greater detail.<br />

4.1 Distance methods<br />

Distance based methods use precomputed evolutionary distances to construct<br />

evolutionary trees. Given n taxa, all pairs of taxa are compared and<br />

an evolutionary distance is computed, corresponding to an entry in a distance<br />

matrix of size n×n. The evolutionary distance between two nucleotide sequences<br />

may be found by aligning the sequences and using the alignment score as a<br />

distance measure.<br />

The methods in this group rely on the distance data for biological meaning,<br />

and in themselves they do not make any biological assumptions — rather,<br />

they are general data mining/clustering methods. The assumption is that the<br />

distance data captures enough biological meaning that these fast methods may<br />

still produce reliable evolutionary trees. Clearly, the methods in themselves do<br />

not inspire confidence that they will capture biological meaning,<br />

but they are much faster than other methods, and studies have shown they do<br />

produce somewhat reliable results, albeit with some bias, depending on data.<br />

4.1.1 UPGMA<br />

The simplest clustering method is the unweighted pair-group method using arithmetic<br />

averages (UPGMA), introduced by Sokal &amp; Sneath ([SS73]). The<br />

method is extremely simple and can be found in Algorithm 1. Basically the algorithm<br />

joins two nodes in each turn, replacing them by their new parent in the<br />

tree that is being built. The algorithm is guaranteed to terminate after n − 1<br />

iterations since we effectively remove one node from the set of active nodes in<br />

each turn.<br />

It is not clear from Algorithm 1 that the UPGMA method will yield any<br />

biologically meaningful results. We have to assume that biological meaning<br />

has been extracted when we prepared the distance matrix, since the UPGMA<br />

method merely performs data mining. Surprisingly, under certain conditions<br />

this extremely simple method is sometimes useful, according to e.g. [NTT83]<br />

and [TN96]. But of course, the method's simplicity makes it probably the least<br />



Algorithm 1 The UPGMA algorithm<br />

Require: δ is a distance matrix on n species X<br />

Ensure: T is a rooted binary tree with leaves from X<br />

1: Assign a cluster c_i ∈ C and leaf n_i ∈ T with height 0 to each species x_i ∈ X<br />

2: while |C| > 2 do<br />

3: find clusters c_i, c_j such that d(i, j) is minimal<br />

4: define a new cluster c_k = c_i ∪ c_j, removing c_i, c_j from C<br />

5: assign distances from c_k to the remaining clusters c_l ∈ C such that<br />

d(k, l) = (d(i, l)·|c_i| + d(j, l)·|c_j|) / (|c_i| + |c_j|)<br />

6: add c_k to C<br />

7: add a new node n_k to T such that |e_ik| = |e_jk| = d(i, j)/2<br />

8: end while<br />

9: Create n_root ∈ T, and connect the last two active nodes n_i, n_j to n_root such<br />

that |e_i,root| = |e_j,root| = d(i, j)/2<br />

accurate tree reconstruction method that still captures some biological meaning.<br />

The main advantage of this method is its very low running time complexity of<br />

O(n²) on an input of n species / n² entries in a distance matrix.<br />
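For illustration, Algorithm 1 can be transcribed almost line for line into Python. The sketch below takes the distance matrix as a dict of dicts keyed by taxon name, returns only the topology as nested tuples (branch heights are omitted), and all names are our own:<br />

```python
def upgma(d, taxa):
    # A sketch of Algorithm 1: repeatedly merge the two closest clusters.
    # Assumes string taxon names; merged clusters are named by concatenation.
    clusters = {t: (t, 1) for t in taxa}          # name -> (subtree, size)
    d = {a: dict(d[a]) for a in taxa}             # work on a copy
    while len(clusters) > 1:
        # Step 3: find the closest pair of active clusters.
        i, j = min(((a, b) for a in clusters for b in clusters if a < b),
                   key=lambda p: d[p[0]][p[1]])
        (ti, si), (tj, sj) = clusters[i], clusters[j]
        k = i + j
        # Step 5: size-weighted average distance to every remaining cluster.
        d[k] = {m: (d[i][m] * si + d[j][m] * sj) / (si + sj)
                for m in clusters if m not in (i, j)}
        for m in d[k]:
            d[m][k] = d[k][m]
        del clusters[i], clusters[j]
        clusters[k] = ((ti, tj), si + sj)
    (tree, _), = clusters.values()
    return tree
```

Note how the update in step 5 weights each old distance by cluster size, so the new distance is the arithmetic average over all pairs of original taxa in the two clusters.<br />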

4.1.2 Neighbor-Joining<br />

The Neighbor-Joining method was introduced by Saitou & Nei in 1987 ([SN87])<br />

and has become one of the most widely used tree reconstruction methods. It<br />

is fast, with a running time of O(n³), even approaching quadratic time in the<br />

QuickJoin algorithm developed at BiRC:<br />

http://www.birc.dk/Software/QuickJoin/<br />

The Neighbor-Joining method has been tested extensively, and there is a consensus<br />

that the method produces trees that are reasonably accurate ([JWMV03])<br />

— particularly, it is more accurate than other known methods with similar running<br />

time complexity. The speed and relatively good accuracy of the method<br />

makes it an affordable way of creating a guide tree for other, more expensive<br />

tree reconstruction methods such as e.g. maximum likelihood methods. The<br />

algorithm is given in Algorithm 2.<br />

The Neighbor-Joining method is very similar to the UPGMA, following the<br />

same abstract template, but it does consider a more complex concept of neighbors<br />

when it selects which nodes to join, and thus it takes a little more time<br />

to run. Generally, there are many variants on UPGMA/ Neighbor-Joining with<br />

different scoring strategies, or adaptations that can input nucleotide sequences<br />

directly, for example.<br />



Algorithm 2 The Neighbor-Joining algorithm<br />

Require: δ is a distance matrix on n species X<br />

Ensure: T is an unrooted binary tree with leaves from X<br />

1: initialise T such that for each species x_i ∈ X there is a leaf node n_i ∈ T<br />

2: define a set of active nodes L containing the leaves of T<br />

3: while |L| > 2 do<br />

4: find nodes n_i, n_j ∈ L such that D_ij = d(i, j) − (r_i + r_j) is minimal,<br />

where r_i = (∑_{k∈L} d(i, k)) / (|L| − 2)<br />

5: remove nodes n_i, n_j from L<br />

6: create a new node n_k and set distances from n_k to remaining nodes n_m ∈ L<br />

such that d(k, m) = (d(i, m) + d(j, m) − d(i, j))/2<br />

7: add n_k to L, T and connect n_k to n_i, n_j such that |e_ik| = ½(d(i, j) + r_i − r_j)<br />

and |e_jk| = ½(d(i, j) − r_i + r_j)<br />

8: end while<br />

9: join the last two nodes n_i, n_j ∈ L such that |e_ij| = d(i, j)<br />
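A corresponding sketch of Algorithm 2, again with the distance matrix as a dict of dicts and illustrative names only. It returns the tree as an edge list, a convenient form for an unrooted tree:<br />

```python
def neighbor_joining(d, taxa):
    # A sketch of Algorithm 2. Returns the unrooted tree as a list of
    # edges (node, node, length); inner nodes are fresh ('node', i) tags.
    d = {a: dict(row) for a, row in d.items()}    # work on a copy
    L = list(taxa)
    edges = []
    next_id = 0
    while len(L) > 2:
        # Step 4: net divergence r_i, then the pair minimising D_ij.
        r = {i: sum(d[i][m] for m in L if m != i) / (len(L) - 2) for i in L}
        i, j = min(((a, b) for ai, a in enumerate(L) for b in L[ai + 1:]),
                   key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Step 7: connect both picked nodes to a fresh inner node k.
        k = ('node', next_id)
        next_id += 1
        edges.append((i, k, 0.5 * (d[i][j] + r[i] - r[j])))
        edges.append((j, k, 0.5 * (d[i][j] - r[i] + r[j])))
        # Step 6: distances from k to all remaining active nodes.
        d[k] = {m: 0.5 * (d[i][m] + d[j][m] - d[i][j])
                for m in L if m not in (i, j)}
        for m in d[k]:
            d[m][k] = d[k][m]
        L.remove(i)
        L.remove(j)
        L.append(k)
    i, j = L
    edges.append((i, j, d[i][j]))
    return edges
```

On distances that exactly fit an unrooted tree (an additive metric), Neighbor-Joining recovers that tree together with its edge lengths.<br />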

4.1.3 Quartet based methods<br />

<strong>Buneman</strong> trees ([Bun71]) are a new form 1 of distance method that relies on<br />

quartets and splits rather than just picking closest pairs. An algorithm with a<br />

running time complexity of O(n³) is available, but not given here (see [BB99],<br />

section 3). Another method which relies on quartets is the Q∗ method proposed<br />

by Berry &amp; Gascuel ([BG00]), which runs in time O(n⁴).<br />

The latter illustrates the problem for quartet based methods: given a set<br />

of quartets, these quartets do not necessarily support a tree, e.g. they might<br />

contain contradictory constraints such as quartets wanting to split species in<br />

opposite ways. In the <strong>Buneman</strong> case, a set of splits is generated from quartets<br />

which is guaranteed to be tree-consistent — in the case of Q∗, a tree-consistent<br />

set of quartets is found by weeding out quartets from a larger set of favorable ones. The<br />

intuition is that by considering quartets, we are looking at the species in a global<br />

sense, determining which species should be separated from other species and<br />

trying to construct a tree under a large number of such (possibly conflicting)<br />

constraints. This is in many ways the opposite of the intuition behind the<br />

clustering methods mentioned earlier.<br />

Another issue for both of these algorithms is that they produce only partially<br />

resolved trees. Compared to the UPGMA and NJ methods we might say that<br />

these quartet based methods only resolve “safe” branches where the clustering<br />

methods resolve trees fully, regardless of data. However, we also have to say that<br />

these methods are perhaps too safe since they might only infer a small fraction<br />

of edges in the evolutionary tree, depending on data ([BG00], [BB99]). The<br />

1 “new form” compared to the clustering methods UPGMA and Neighbor-Joining<br />



refined <strong>Buneman</strong> algorithm, which is the main focus of this work, belongs in this<br />

subclass of evolutionary tree methods. The hope is that the refined <strong>Buneman</strong><br />

method will be less safe than its namesake and infer more splits, while still<br />

maintaining a high degree of confidence in the splits it infers.<br />

4.2 Parsimony methods<br />

The foundation for this class of methods is the philosophy of William of Ockham:<br />

Pluralitas non est ponenda sine necessitate — meaning something along<br />

the lines of “the best hypothesis is the one that requires the smallest number of<br />

assumptions” 2 .<br />

This philosophy is also known as Ockham’s Razor or the parsimony principle.<br />

In our context we shall use it to create a condition of optimality, saying that if<br />

two proposed evolutionary processes have the same starting and ending points,<br />

we shall assume that the simplest or shortest process is the correct one. For<br />

example, we could imagine two substitutions of the same nucleotide working in<br />

reverse: A → G → A — in this case we would say that no substitutions took<br />

place at all.<br />

The parsimony method works by considering one specific site at a time across<br />

a set of nucleotide sequences. For each site, we postulate all possible binary tree<br />

topologies linking these sites. We now search for a combination of assignments<br />

of nucleotides to inner nodes such that the total number of substitutions is minimal.<br />

Since we might not identify the same optimal topologies for all sites, we have to<br />

sum the minimal substitution counts over all sites for each topology, and select<br />

for further study the topology (or topologies) with the smallest total.<br />

Figure 4.1 shows an example of estimating the number of substitutions for<br />

a topology. Firstly, we have a tree spanning specific sites across 5 nucleotide<br />

sequences. Secondly, we can fill in the sites for the ancestral taxa by considering<br />

the minimum number of substitutions required, bottom up from the sites we<br />

are already given. In this case, looking at a subtree of C and T in the lower<br />

left corner, we know that their ancestor site must have been either C or T ,<br />

requiring only one substitution, since all other combinations (A or G) would<br />

require two substitutions. Thirdly, we present one solution of of many as to<br />

how the assignment of nucleotides in ancestral sites might have been. In this<br />

case we could make due with only two substitutions, but this solution is not<br />

unique, there are several combinations which require only two substitutions.<br />

The intuition for this method is quite easy to understand, and according to<br />

[NK00] and others, under favorable conditions the method is expected to<br />

produce the correct tree. However, under less favorable conditions the method is<br />

known to produce incorrect topologies, and in any case the method is hopelessly<br />

inefficient for large data sets, at least when using exhaustive search techniques.<br />

[NK00] describes how the search might be sped up by using e.g. branch and<br />

bound.<br />

2 or even shorter: keep it simple!<br />



Figure 4.1: An illustration of the parsimony tree method.<br />

4.3 Maximum likelihood methods<br />

Using maximum likelihood methods is quite simple in theory. We start out with<br />

e.g. a set of n nucleotide sequences of length m — aligned such that only substitutions<br />

occur, not insertions or deletions. We then postulate some evolutionary<br />

tree topology over the n sequences, giving us a rooted binary tree with n − 1<br />

inner nodes. For each inner node we assume there is some ancestor sequence for<br />

the sequences in the subtree of that node, but we do not know which nucleotides<br />

are in this ancestor sequence.<br />

We assume some substitution model, and there are many to choose from,<br />

ranging from simple to very complicated. Jukes-Cantor ([JC69]), Kimura ([Kim80])<br />

and Hasegawa-Kishino-Yano ([HKY85]) are names of well known substitution<br />

models, and there is a long hierarchy of increasingly complex models using more<br />

and more biologically founded assumptions. Now, to find the likelihood of a single<br />

nucleotide site in the n leaf sequences, we multiply probabilities of substitutions<br />

along the branches of the tree and sum these products over all possible<br />

assignments of nucleotides to the n − 1 inner nodes of the tree. The likelihood<br />

of the entire sequences is then the product of the likelihoods of all sites, i.e. a<br />

product over m terms.<br />
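The per-site computation need not enumerate all inner-node assignments explicitly; Felsenstein's pruning recursion computes the same sum bottom-up. The sketch below uses the standard Jukes-Cantor transition probabilities (P(x→x) = 1/4 + 3/4·e^(−4t/3) and P(x→y) = 1/4 − 1/4·e^(−4t/3) for x ≠ y, with t in expected substitutions per site) and a uniform distribution at the root; the tree encoding is our own.<br />

```python
import math

NUCS = "ACGT"

def jc_prob(x, y, t):
    # Jukes-Cantor transition probability over a branch of length t
    # (t measured in expected substitutions per site).
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(tree, states):
    # Likelihood of a single site by Felsenstein's pruning recursion.
    # Leaves are strings; inner nodes are (left, right, t_left, t_right).
    def partial(node):
        # Maps each nucleotide x to P(observed leaves below | x at node).
        if isinstance(node, str):
            return {x: float(x == states[node]) for x in NUCS}
        left, right, tl, tr = node
        pl, pr = partial(left), partial(right)
        return {x: sum(jc_prob(x, y, tl) * pl[y] for y in NUCS)
                  * sum(jc_prob(x, y, tr) * pr[y] for y in NUCS)
                for x in NUCS}
    root = partial(tree)
    return sum(0.25 * root[x] for x in NUCS)   # uniform root distribution
```

Summing log site likelihoods over all m sites, and maximising over topologies and branch lengths, gives the maximum likelihood tree.<br />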

Now, this expression would have to be evaluated for all possible tree topologies,<br />

and of course this results in a very time consuming algorithm. But, since<br />

the substitution model is actually based in biology, we have a high confidence<br />

that the resulting tree with maximum likelihood is the real tree.<br />

One way of making this method useful in practice would be to search for<br />

tree topologies in some clever way, so that the search would be able to skip<br />

large parts of the search space, which would of course limit the accuracy of the<br />

method. The ML method is also useful for evaluating trees found by other tree<br />

reconstruction methods, since we assume the method captures a lot of biological<br />

meaning, depending on the model.<br />



4.4 Hybrid methods<br />

All the methods we have described until now are very one-sided in their bias,<br />

exploring only one side of the trade-off between accuracy and speed. Hybrid<br />

methods do exist, where a combination of a fast method and an accurate one<br />

might produce a practical method with sound biological meaning.<br />

One such method, called Disc Covering, is described in [HNP + 98], using a<br />

divide-and-conquer type of approach. From the abstract:<br />

(The Disc-Covering Method) DCM obtains a decomposition of the<br />

input dataset into small overlapping sets of closely related taxa,<br />

reconstructs trees on these subsets (using a “base” phylogenetic<br />

method of choice), and then combines the subtrees into one tree on<br />

the entire set of taxa. Because the subproblems analyzed by DCM<br />

are smaller, computationally expensive methods such as maximum<br />

likelihood estimation can be used without incurring too much cost.<br />

4.5 Accuracy of inferred trees<br />

It is possible to evaluate the quality or confidence of branches in the evolutionary<br />

trees we find using our tree reconstruction methods. These statistical methods<br />

are known as bootstrap tests, and they are described in [NK00], with references to<br />

other articles.<br />

The basic idea is to find some tree T using some method M, for some data<br />

set D. Let's say D consists of n aligned nucleotide sequences of length m. Now<br />

we may select m sites (columns) from D with replacement, to form a new sample<br />

dataset D′ — notice that the same site might occur several times in the new set,<br />

while some sites might not occur at all. Now we use the new sample to infer a<br />

new tree T′ by the same method M, and by comparing T and T′ we can assign a<br />

count of 1 to the branches in T which also occur in T′, and 0 to the rest. Repeating<br />

this process many times (e.g. a thousand times) yields a statistic of how often<br />

each branch occurs for different samples, and thus a reflection of how confident<br />

we can be in this particular branch.<br />
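The resampling loop can be sketched as follows. Here build_tree and splits_of are hypothetical hooks standing in for the chosen reconstruction method M and for extracting the branches (splits) of its output tree:<br />

```python
import random

def bootstrap_support(alignment, build_tree, splits_of, replicates=100, rng=None):
    # Bootstrap counts for the branches of the tree built from the full
    # alignment. `build_tree` and `splits_of` are hypothetical hooks for
    # the chosen method M and for reading off its branches as splits.
    rng = rng or random.Random()
    m = len(next(iter(alignment.values())))
    counts = {s: 0 for s in splits_of(build_tree(alignment))}
    for _ in range(replicates):
        # Resample the m alignment columns with replacement.
        cols = [rng.randrange(m) for _ in range(m)]
        sample = {t: "".join(seq[c] for c in cols)
                  for t, seq in alignment.items()}
        for s in splits_of(build_tree(sample)):
            if s in counts:
                counts[s] += 1
    return counts
```

Dividing each count by the number of replicates gives the usual bootstrap support value for a branch.<br />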



Part II<br />

Implementing <strong>Refined</strong><br />

<strong>Buneman</strong> <strong>Trees</strong><br />



Chapter 5<br />

Implementation Structure<br />

The refined <strong>Buneman</strong> tree algorithm described in [BFÖ+ 03] consists of two<br />

main parts. The first part creates an overapproximation of the refined <strong>Buneman</strong><br />

tree, i.e. a compatible set of splits Σ : RB(δ) ⊆ Σ ⊂ σ(X). The second part is<br />

concerned with searching through Σ and finding those splits which have positive<br />

refined <strong>Buneman</strong> index.<br />

5.1 Overapproximating<br />

The first part of the algorithm deals with constructing and maintaining a compatible<br />

set of splits, which is a superset of the set RB(δ). This part of the<br />

algorithm is given in pseudocode in Algorithm 3.<br />

The pseudocode deserves a few notes. First of all, we will use a tree data<br />

structure to represent compatible sets of splits, in order to achieve time and<br />

space complexity goals (see chapter 6 for more details). Secondly, compared to<br />

the pseudocode algorithm given in [BFÖ+ 03] we have for the sake of simplicity<br />

skipped anchored <strong>Buneman</strong> trees and used single linkage clustering trees instead.<br />

Berry and Bryant have shown in [BB99] that the set of splits represented by the<br />

anchored <strong>Buneman</strong> tree B xk (δ| Xk ) is a subset of the set of splits represented by<br />

the single linkage clustering tree for x k with respect to δ restricted to X k (see<br />

chapter 7 for more details).<br />

In line 1 we initialize a compatible set of splits for the first four species,<br />

namely the set of splits contained in the single linkage clustering tree for x 4 .<br />

We represent a compatible set of splits as an unrooted tree, and basically the<br />

transformation from the rooted clustering tree to our unrooted tree consists of<br />

just cutting off the root and splicing the two subtrees together. The cartoon in<br />

Figure 5.1 shows this process.<br />

In line 2–3 we iterate over the remaining species. For each new species x k we<br />

build the single linkage clustering tree for that species, C k . In line 4 we iterate<br />

over all splits σ i ∈ C k−1 and in line 5 we add the new species x k to either “side”<br />

of σ i . These three lines together form the construction described in Lemma 5,<br />



Algorithm 3 Overapproximating the refined <strong>Buneman</strong> tree<br />

Require: δ is a distance matrix of size n × n<br />

Ensure: C n ⊇ RB(δ)<br />

1: C 4 = SLCT x4 (δ 4 )<br />

2: for k = 5 to n do<br />

3: C k = SLCT xk (δ k )<br />

4: for U|V ∈ C k−1 do<br />

5: for σ ∈{U ∪{x k }|V , U|V ∪{x k }} do<br />

6: σ ′ = INCOMPATIBLE(C k ,σ)<br />

7: while σ ′ ≠ null and DISCARD-RIGHT(σ, σ ′ ) do<br />

8: DELETE(C k ,σ ′ )<br />

9: σ ′ = INCOMPATIBLE(C k ,σ)<br />

10: end while<br />

11: if σ ′ = null then<br />

12: INSERT(C k ,σ)<br />

13: end if<br />

14: end for<br />

15: end for<br />

16: end for<br />


Figure 5.1: Going from the single linkage tree for x 4 to an unrooted tree, representing<br />

the same set of splits.<br />



C k−1 is an overapproximation of RB(δ| Xk−1 ), and C k is an overapproximation<br />

of B xk (δ| Xk ).<br />
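The extension step in lines 4–5 of Algorithm 3 can be sketched in code. The sketch below assumes splits are encoded as bitvectors in which set bits mark the U-side of the split — an implementation choice for illustration only; the class and method names are not taken from [BFÖ+ 03].

```java
import java.util.BitSet;

// Sketch of the split-extension step: a split U|V over k−1 species, stored
// as a BitSet whose set bits mark the U-side, gives rise to two candidate
// splits over k species — one with the new species x_k added to the U-side,
// and one with it added to the V-side. Names and encoding are illustrative.
public class ExtendSplit {
    public static BitSet[] extend(BitSet uSide, int k) {
        BitSet withXk = (BitSet) uSide.clone();
        withXk.set(k - 1);                          // x_k joins the U-side
        BitSet withoutXk = (BitSet) uSide.clone();  // x_k joins the V-side
        return new BitSet[] {withXk, withoutXk};
    }

    public static void main(String[] args) {
        BitSet u = new BitSet();
        u.set(0); u.set(1);                  // U = {x_1, x_2} among 4 species
        BitSet[] candidates = extend(u, 5);  // extend with x_5
        System.out.println(candidates[0].get(4)); // x_5 on the U-side: true
        System.out.println(candidates[1].get(4)); // x_5 on the V-side: false
    }
}
```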

Now we need to merge these two sets of splits. We will test each extended<br />

split σ against the splits in C k : if σ is already in C k , we will ignore it. If σ<br />

is incompatible with some split σ ′ ∈ C k (line 6), we will use the DISCARD-<br />

RIGHT algorithm on the two splits to decide if σ ′ has non-positive refined<br />

<strong>Buneman</strong> index (see chapter 9 for more details). If we decide σ ′ is not a<br />

refined <strong>Buneman</strong> split, we delete it from C k (line 8) and again test whether σ<br />

is incompatible with some new split σ ′ ∈ C k (line 9). Finally, if we have decided<br />

that σ qualifies, we insert it into our compatible set of splits (lines 11–12).<br />

The time complexity analysis for this first part of the algorithm goes like<br />

this: bootstrapping a tree of size 4 in line 1 of the algorithm takes constant<br />

time. We then in line 2 iterate over at most n species. In line 3 we build a<br />

single linkage clustering tree in time O(n 2 ). In line 4 we iterate over all splits<br />

in an unrooted tree, there are O(n) of those. Each of the algorithms INCOM-<br />

PATIBLE, INSERT, DELETE and DISCARD-RIGHT take time O(n). For<br />

the while loop in line 7 we can only find O(n) splits in an unrooted tree that<br />

can possibly be incompatible with σ, so the first part of the algorithm runs in<br />

time O(n 3 ).<br />

The space usage consists of holding two compatible sets of splits at any one<br />

time and the space required to calculate the single linkage clustering trees. We<br />

can store any compatible set of splits in linear space using the tree data structure<br />

described in chapter 6. The single linkage clustering tree is worse: the<br />

implementation in this thesis uses a quad tree data structure, which will<br />

allocate Θ(n 2 ) space. Of course, we already need space for the distance data,<br />

which is also Θ(n 2 ). So in total we need to use space Θ(n 2 ) for the first part of<br />

the algorithm.<br />

5.2 Pruning<br />

The second part of the algorithm deals with the extraction of refined <strong>Buneman</strong><br />

splits from the set of compatible splits constructed in the first part of the<br />

algorithm. Denote this set T . In the context of (evolutionary) trees and bioinformatics,<br />

we might call this part pruning, even though we shall not actually<br />

remove any edges from T — our tree data structure cannot handle a tree topology<br />

which is not a regular, leaf-labeled tree, so we will only invalidate edges<br />

instead of changing the topology of the tree.<br />

In this part of the algorithm, we shall look at all splits in T by considering<br />

all edges in the tree, and calculate refined <strong>Buneman</strong> indices for the splits they<br />

represent. Finally we shall report those splits which have positive refined <strong>Buneman</strong><br />

index. We will do this by first decorating all edges in T with their refined<br />

<strong>Buneman</strong> index, or 0 as a special marker for invalid splits. Later we extract the<br />

refined <strong>Buneman</strong> splits from T in quadratic time — linear time iteration over<br />

edges, and for each edge, linear time for reporting n bits in a bitvector. So our<br />

so-called pruning algorithm is in reality a searching-and-scoring algorithm, and<br />



the pseudo-code for the algorithm is given in Algorithm 4.<br />

The algorithm is very long and complicated, but it can be broken up into<br />

four parts: lines 1–3 deal with initializing linked lists representing “global<br />

matrix fronts” on matrices which contain diagonal quartets. Lines 4–17 populate<br />

the matrix fronts. In lines 18–28 we search the matrices, finding the minimum<br />

quartets that we need to calculate the refined <strong>Buneman</strong> indices for the edges<br />

represented by the matrices. And in lines 29–34 we report refined <strong>Buneman</strong><br />

splits. Before we explain the algorithm, we need to study the nature of diagonal<br />

quartets.<br />

5.2.1 Searching for minimum diagonals<br />

We know from previous sections that a quartet induces two diagonal quartets.<br />

And clearly, given a diagonal quartet we can identify its “parent” quartet. In<br />

the pruning part of our algorithm, we are interested in finding the n − 3 quartets<br />

with minimum score induced by each edge in T . We will do this by searching<br />

for diagonal quartets instead, but we need to ensure that we never identify the<br />

same quartet twice (for example by identifying it from seeing both of its diagonal<br />

quartets).<br />

We will use a convention that says we shall only identify a quartet if we<br />

see its minimum diagonal. In case we see a diagonal quartet which is not a<br />

minimum diagonal, we shall disregard it.<br />

Another important property of diagonal quartets is this: if we fix a and c,<br />

such that a and c lie on different sides of a fixed edge e, we can search for b and<br />

d independently to minimize the score of the diagonal quartet ab||cd induced by<br />

e. By fixing a and c we can rewrite the score of a diagonal quartet into a sum<br />

of two functions f a,c and g a,c , such that f a,c only depends on b and g a,c only<br />

depends on d. Clearly, such a function takes its minimum only when f a,c and<br />

g a,c are minimal.<br />

η ab||cd = (δ bc − δ ab + δ ad − δ dc )/2 = f a,c (b) + g a,c (d)<br />

where f a,c (b) = (δ bc − δ ab )/2 and g a,c (d) = (δ ad − δ dc )/2.<br />
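The decomposition can be checked numerically. In the sketch below, the distance matrix values, the class name and the method names are illustrative; only the two formulas above are taken from the text.

```java
// Sketch of the score decomposition for diagonal quartets, assuming a
// symmetric distance matrix delta indexed by species number. The names
// (diagonalScore, f, g) and the matrix values are illustrative.
public class DiagonalScore {
    static double[][] delta = {
        {0, 3, 5, 6},
        {3, 0, 6, 5},
        {5, 6, 0, 4},
        {6, 5, 4, 0}
    };

    // eta_{ab||cd} = (delta_bc - delta_ab + delta_ad - delta_dc) / 2
    static double diagonalScore(int a, int b, int c, int d) {
        return (delta[b][c] - delta[a][b] + delta[a][d] - delta[d][c]) / 2.0;
    }

    // The two independent halves: f depends only on b, g only on d.
    static double f(int a, int c, int b) { return (delta[b][c] - delta[a][b]) / 2.0; }
    static double g(int a, int c, int d) { return (delta[a][d] - delta[d][c]) / 2.0; }

    public static void main(String[] args) {
        int a = 0, b = 1, c = 2, d = 3;
        // The decomposition: eta = f_{a,c}(b) + g_{a,c}(d)
        System.out.println(diagonalScore(a, b, c, d) == f(a, c, b) + g(a, c, d));
    }
}
```

Because the two halves are independent, minimizing b over one side of the edge and d over the other side separately minimizes the whole score.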

Not only can we find the diagonal quartet with minimum score in this way;<br />

we can also search for the “next minimum”, i.e. the diagonal quartet with<br />

minimum score when discounting the actual minimum. We can do this in a<br />

general way: say we have some diagonal quartet ab i ||cd j with score η abi ||cd j .<br />

Imagine we have considered all diagonal quartets with scores less than η abi ||cd j ,<br />

and now we wish to consider ab i ||cd j and then find the next diagonal quartet<br />

with minimum score.<br />

The way to do this is to search for b i+1 such that η abi+1 ||cd j ≥ η abi ||cd j is the<br />

minimum among all choices of b i+1 , and similarly for d j+1 : η abi ||cd j+1 ≥ η abi ||cd j<br />

must be the smallest among choices of d j+1 . One of those will be the next<br />

minimum. Note that the indices refer to an ordering of increasing f a,c and g a,c<br />

respectively, not ordering as members of X.<br />



Algorithm 4 Pruning the overapproximated refined <strong>Buneman</strong> tree<br />

Require: δ is a distance matrix of size n × n<br />

Require: T =(V,E) is an overapproximation of RB(δ)<br />

Ensure: S is a set of splits representing RB(δ)<br />

1: for e ∈ E do<br />

2: Q e = ∅<br />

3: end for<br />

4: for (a, c) ∈ X 2 ∧ a < c do<br />

5: for each edge e on the path from a to c do<br />

6: b e = the b on the same side of e as a minimizing f a,c (b)<br />

7: end for<br />

8: for each edge e on the path from c to a do<br />

9: d e = the d on the same side of e as c minimizing g a,c (d)<br />

10: end for<br />

11: for each edge e on the path from a to c do<br />

12: Q e = Q e ∪{ab e ||cd e }<br />

13: if |Q e |≥3(n − 3) then<br />

14: remove the n − 3 quartets with largest score from Q e<br />

15: end if<br />

16: end for<br />

17: end for<br />

18: for e ∈ E do<br />

19: S e = ∅<br />

20: while |S e | < n − 3 do<br />

21: ab i ||cd j = DELETEMIN(Q e )<br />

22: if ab i ||cd j is a minimum diagonal then<br />

23: S e = S e ∪{ab i ||cd j }<br />

24: end if<br />

25: Q e = Q e ∪{ab i ||cd j+1 }<br />

26: if j = 1 then Q e = Q e ∪{ab i+1 ||cd j }<br />

27: end while<br />

28: end for<br />

29: S = ∅<br />

30: for e ∈ E do<br />

31: if the refined <strong>Buneman</strong> index of σ e is > 0 then<br />

32: S = S ∪ σ e<br />

33: end if<br />

34: end for<br />




Figure 5.2: A tree, an edge, and the set of matrices induced by that edge, one<br />

for every choice of a pair a, c where a lies on one side of the edge and c on the other.<br />


Figure 5.3: A front of readied entries moving across a matrix. Green entries<br />

are ready entries, i.e. the entries that are potential next minimums. Red entries<br />

have been considered and are not available. After considering an entry, we add<br />

its neighbor to the south — and in case we are in the top row, we also add its<br />

neighbor to the east.<br />

the previous section. We know that for a given edge, the n − 3 quartets with<br />

minimum score can be found in these matrices — or rather, we can find the<br />

n − 3 minimum diagonals identifying the n − 3 minimum scoring quartets in<br />

the matrices. The naive approach would be to build a matrix for every pair<br />

(a, c) where a lies on one side of e and c lies on the other. However, since we<br />

know that each matrix is sorted so that the rows and columns are monotonic<br />

non-decreasing, we do not need to construct the matrices completely.<br />

Instead, we can imagine storing a “matrix-front” for each matrix, which<br />

will contain the “unspent” minimums. The matrix front would be initialised<br />

with only one element, namely the entry (1, 1). When that entry is “spent”,<br />

we delete it from the matrix front and add its neighbours to the matrix front.<br />

And of course this generalises: when we “spend” entry (i, j) we have to add its<br />

neighbours (i + 1, j) and (i, j + 1).<br />

Actually, this scheme would mean we would encounter entries twice, since<br />

an entry (i, j) is a neighbour of both (i − 1,j)and(i, j − 1). To amend this we<br />

will say that when spending (i, j), we will only look to its neighbour (i, j +1)<br />

— unless j = 1 in which case we will also look at (i +1,j). This way we<br />

avoid encountering the same entry twice; graphically, it is equivalent to painting<br />

columns top to bottom such that the leftmost columns are always at least as<br />

tall as the ones to the right. Figure 5.3 illustrates this.<br />
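The lazy traversal can be sketched as follows, assuming an implicit matrix whose entry (i, j) is f(b i ) + g(d j ) with f and g sorted in nondecreasing order. For brevity the sketch keeps the front in a PriorityQueue, whereas the thesis text stores it in a linked list and extracts minima by linear scan — the order in which entries are spent is the same. Class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Lazy "matrix front" traversal over an implicit matrix whose entry (i, j)
// is f[i] + g[j], with f and g sorted nondecreasing so that rows and
// columns are monotonic. After spending an entry we ready its neighbour to
// the south, plus the eastern neighbour when we are in the top row, so
// every entry is readied exactly once. A sketch, not the thesis code.
public class MatrixFront {
    public static double[] smallest(double[] f, double[] g, int k) {
        PriorityQueue<int[]> front = new PriorityQueue<>(
            Comparator.comparingDouble((int[] e) -> f[e[0]] + g[e[1]]));
        front.add(new int[] {0, 0});          // ready the (1,1) entry
        double[] result = new double[k];
        for (int n = 0; n < k; n++) {
            int[] e = front.poll();           // spend the minimum ready entry
            result[n] = f[e[0]] + g[e[1]];
            if (e[0] + 1 < f.length)          // neighbour to the south
                front.add(new int[] {e[0] + 1, e[1]});
            if (e[0] == 0 && e[1] + 1 < g.length)  // east, only from top row
                front.add(new int[] {e[0], e[1] + 1});
        }
        return result;
    }

    public static void main(String[] args) {
        double[] f = {1, 2, 4};
        double[] g = {0, 3, 5};
        // The three smallest sums f[i] + g[j], in nondecreasing order.
        System.out.println(java.util.Arrays.toString(smallest(f, g, 3)));
    }
}
```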

Now we have established that we can search through the matrices using lazy<br />



construction, but we still have to search through a set of matrices (one for each<br />

pair (a, c)) concurrently. This can however be simplified by saying we have one<br />

big combined “matrix front”, which contains all entries in all “matrix fronts” at<br />

the same time. Finding the next diagonal quartet across all the matrices now<br />

only involves finding the minimum in the global “matrix front”. Such a global<br />

matrix front has the potential of becoming very large. The number of matrices<br />

for an edge e is in O(n 2 ), the number of possible pairs (a, c). Fortunately, it<br />

turns out we do not have to store so many entries. We are looking for n − 3<br />

quartets with minimum score corresponding to n−3 minimum diagonals. Recall<br />

that a minimum diagonal ab||cd can have the same score as its “cousin” ab||dc,<br />

so when searching for diagonal quartets indexed by score we would in the worst<br />

case have to look through 2(n − 3) diagonal quartets with minimum scores to<br />

find n − 3 minimum diagonals.<br />

In the worst case with respect to number of matrices, the n − 3 minimum<br />

diagonals and their cousins would stem from different pairs (a, c). Thus to ensure<br />

we had captured all of them, we would need to initialize 2(n − 3) matrices,<br />

such that the 2(n − 3) diagonal quartets with minimum score could appear as<br />

entries (1, 1) in the matrices for their (a, c)-pairs.<br />

After initialisation, we search the matrices for diagonal quartets with minimum<br />

score. In the worst case where all the minimum diagonals have the same<br />

score as their cousins, we could be unlucky enough to always encounter the<br />

cousin first and the minimum diagonal second. But even in that case, we would<br />

never need to look at more than 2(n − 3) diagonal quartets with minimum score<br />

before we found (n − 3) minimum diagonals. So, the search space is bounded<br />

both in time and space by O(n).<br />

5.2.3 Pruning explained<br />

Now we are ready to look at the individual steps in the algorithm. In line 1–3,<br />

we for each edge initialize a “global matrix front”, implemented as a linked list<br />

— this ensures we can insert elements in constant time at one end, and find the<br />

minimum element in linear time (line 21). We store the Q e ’s in a hashtable<br />

for easy lookup. There are at most O(n) edges, so this task takes linear time.<br />

In line 4 we iterate over all possible pairs (a, c), ensuring a < c. For each<br />

pair, lines 5–10 find, for every edge e on the path between a and c, the elements<br />

b e and d e minimising f a,c and g a,c , and lines 11–12 add the resulting diagonal<br />

quartet ab e ||cd e to Q e for each such edge. We<br />


know that we need at most 2(n − 3) matrices initialized with diagonal quartets<br />

of minimum score. Therefore in lines 13–15 we measure the size of Q e — if<br />

|Q e |≥3(n − 3) we can safely remove the n − 3 diagonal quartets with largest<br />

score, and thus the matrices to which they belong, since we are guaranteed<br />

to find (n − 3) minimum diagonals in the remaining 2(n − 3) matrices. The<br />

procedure of selecting the n − 3 largest members of a set of size 3(n − 3) is<br />

described in chapter 10, and this can be done in linear time. Thus in lines 4–17<br />

we spend only O(n 3 ) time.<br />

After initialising the matrices it is time to search through them. We iterate<br />

through the edges in line 18, and for each edge we initialize a list of diagonal<br />

quartets S e to the empty set (line 19). We need the n − 3 smallest minimum<br />

diagonals in order to identify the n − 3 minimum quartets. So we iterate in line<br />

20 until we have enough minimum diagonals, in linear time since we would at<br />

most need to look at 2(n − 3) minimum diagonal quartets. We find minimum<br />

diagonals by successively looking at the diagonal quartets with minimum score<br />

in Q e . The size of Q e is bounded by O(n), since |Q e | is at most 3(n − 3)<br />

after initialization, and when we traverse our matrices we remove one entry and<br />

add two at most 2(n − 3) times — by that time we are sure to have found<br />

the n − 3 minimum diagonals. So |Q e | has at most 5(n − 3) elements. Thus,<br />

we can find and remove the minimum element from Q e in linear time in line<br />

21. If ab i ||cd j is a minimum diagonal, we add it to S e (lines 22–24). After<br />

considering ab i ||cd j , we use our matrix traversal/ entry readying scheme to add<br />

its appropriate neighbors in lines 25 and 26. Since we do not have the imaginary<br />

matrices, we instead have to search the two subtrees of T induced by e, for the<br />

choices of b and d that yield the matrix neighbours. Such a search can be done<br />

in linear time. All in all we iterate a linear number of times, performing linear<br />

tasks, so this part of the algorithm takes only O(n 2 ) time.<br />

In line 29 we initialise a list of splits to the empty set. Going through the<br />

edges of T and summing up minimum diagonals takes time O(n 2 ) in lines 30–34.<br />

The pruning part of the refined <strong>Buneman</strong> tree algorithm is extremely complicated,<br />

and it is very hard to both capture the intuition behind the algorithm,<br />

and still keep track of all details. In the description of the algorithm, the author<br />

has tried to capture modules that could be split off and described separately,<br />

in order to reduce the complexity. This was possible for the first part, but the<br />

author was not so successful in the second part. The point of the description was<br />

to give intuition rather than convey all details, and therefore it is possible that<br />

some small errors and oversights have crept in — particularly in the pruning<br />

part. For a full and concise account of the algorithm, turn to [BFÖ+ 03].<br />



Chapter 6<br />

The Tree Data Structure<br />

This chapter discusses a tree datastructure designed to maintain a compatible<br />

set of splits. In part 1 of the algorithm described in [BFÖ+ 03], we need to<br />

insert and remove splits from the tree and search for incompatible splits. These<br />

operations all need to perform in time O(|X|). Part 2 of the algorithm requires<br />

traversing paths in the tree and searching subtrees, also in linear time.<br />

6.1 Design of the tree data structure<br />

As we saw in a previous section, a compatible set of splits is a tree — not<br />

necessarily a regular, leaf-labeled tree, but at least a semi-labeled tree. We<br />

also saw there was a close connection between the evolutionary trees we want<br />

to represent, and phylogenetic trees which can be represented by leaf-labeled<br />

trees. Thus, the design of the tree data structure used in this work will be a<br />

leaf-labeled tree that uses special markers to indicate edges that are not “real”<br />

edges, when needed. Since the refined <strong>Buneman</strong> tree algorithm deals with overapproximations<br />

of evolutionary trees, we shall simply disregard the controversy<br />

altogether; all complexity bounds will hold even if we do. The operations we<br />

need to support with the tree data structure are not affected by the presence of<br />

extra trivial splits; the search for incompatible splits will of course never return<br />

a trivial split, since they are compatible with any split, and in the insert and<br />

delete operations we can just ignore or work around the cases where trivial splits<br />

are involved.<br />

Even though we say that our tree is unrooted, we still need some sort of root<br />

as a starting point for our tree traversals. One way of creating such an artificial<br />

root would be to select an inner node and use this as a starting point for every<br />

tree operation. However, as the topology of the tree changes as we insert and<br />

remove splits from the compatible set of splits that the tree represents, a fixed<br />

root node might be removed at any time. Instead, we shall choose a random<br />

node in the tree as starting point each time we start a new operation, to ensure<br />

we have a valid starting point. This of course means we have to have a design<br />



that supports traversal in any direction, and this is the solution the author has<br />

chosen. One alternative to this approach would be to select a leaf node as root,<br />

but this would mean we would have to introduce a special case for that node —<br />

at the same time, we could make do with a one-way directed tree. In any case,<br />

all complexity bounds would be upheld, so the choice is open.<br />

One more requirement for the tree data structure is that it would have to<br />

support multiple children. Since the set of splits that is being represented does<br />

not necessarily correspond to a fully resolved tree, any inner node can have any<br />

degree.<br />

6.1.1 The tree data structure explained<br />

The author has chosen a tree datastructure which is inspired by the doubly<br />

connected edge list (DCEL) datastructure described in [dBSvKO00]. The DCEL<br />

is used to describe both geometric and topological information, and is introduced<br />

in a context of planar subdivisions, but for our purposes we need only a simplified<br />

version.<br />

This simplified tree datastructure used for implementing refined <strong>Buneman</strong><br />

trees consists of nodes and directed edges, and the class signatures of these can<br />

be seen in Figure 6.1. The illustrations in Figure 6.2 and Figure 6.3 provide<br />

a “local navigational map” for a Node and an Edge, respectively. Leaves and<br />

inner nodes would subclass the Node class.<br />

class Node<br />

{<br />

Edge incidentEdge;<br />

}<br />

class Edge<br />

{<br />

Node origin, destination;<br />

Edge next, previous, twin;<br />

}<br />

Figure 6.1: Signatures for the Node and Edge classes<br />

Clearly, the storage required for a Node or an Edge is constant, 1 and 5<br />

pointers respectively. In an unrooted tree with n leaves there are at most n − 2<br />

internal nodes, so in total there are at most 2n − 2 nodes. Also, in an unrooted<br />

tree with n leaves there are at most 2n − 3 edges. The storage space needed for<br />

our tree data structure is therefore (2n − 2) + 5(2n − 3) or O(n).<br />

The set of edges with origin in some node w form a cyclic list. This allows<br />

for constant time insertion and removal, given we have the particular edge in<br />

hand. The issue of finding the edge to be inserted or removed is described later.<br />

The cyclic list of outgoing edges is illustrated in Figure 6.4.<br />
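A minimal sketch of constant-time insertion into such a cyclic list is given below. The Node and Edge fields follow Figure 6.1; the linking convention (next/previous chaining the edges that share an origin) and the method names are assumptions made for illustration.

```java
// Sketch of the DCEL-inspired structure: each directed Edge knows its twin
// and its neighbours in the cyclic list of edges sharing an origin. The
// addOutgoing method splices a new edge into that cycle in constant time.
// Field names follow Figure 6.1; everything else is illustrative.
class Node {
    Edge incidentEdge;
}

class Edge {
    Node origin, destination;
    Edge next, previous, twin;
}

public class CyclicEdgeList {
    // Insert e into the cyclic list of edges with origin w, in O(1).
    static void addOutgoing(Node w, Edge e) {
        e.origin = w;
        if (w.incidentEdge == null) {       // first outgoing edge: a 1-cycle
            e.next = e;
            e.previous = e;
            w.incidentEdge = e;
        } else {                            // splice in after the incident edge
            Edge first = w.incidentEdge;
            e.next = first.next;
            e.previous = first;
            first.next.previous = e;
            first.next = e;
        }
    }

    static int degree(Node w) {             // walk the cycle exactly once
        if (w.incidentEdge == null) return 0;
        int count = 0;
        Edge e = w.incidentEdge;
        do { count++; e = e.next; } while (e != w.incidentEdge);
        return count;
    }

    public static void main(String[] args) {
        Node w = new Node();
        for (int i = 0; i < 3; i++) addOutgoing(w, new Edge());
        System.out.println(degree(w)); // three outgoing edges
    }
}
```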




Figure 6.2: The world seen from the viewpoint of a Node<br />


Figure 6.3: The world seen from the viewpoint of an Edge<br />




Figure 6.4: The set of edges going out from a node forms a cyclic list. The big<br />

arrows represent Edges, and the thin arrows are pointers. The Edge that has<br />

been marked is the Edge that incidentEdge points to for the Node. The blurred Edges<br />

and pointers are supposed to indicate that the number of neighbors of a Node<br />

is variable.<br />

Access to the cyclic list of edges going out from a node is granted only<br />

through incident edge of the node, which can be any of the edges with origin in<br />

that node. In other words, the node has no knowledge of its local surroundings<br />

and particularly, it has no notion of direction in the tree. When traversing the<br />

tree we will therefore have to impose the direction externally.<br />

Traversal through outgoing edges from a node is made easy by the use of an<br />

EdgeIterator class. The EdgeIterator is inspired by the Iterator design pattern<br />

from the famous GoF-book [GHJV94]. The EdgeIterator provides a simple<br />

interface for the unordered traversal of outgoing edges from a node, and it is<br />

used extensively in the tree data structure operations.<br />

The interface for the Edge iterator class can be found in Figure 6.5. It closely<br />

mimics the interface for the Iterator class in the Java standard API. The author<br />

has chosen not to implement the Java Iterator interface to avoid having to cast<br />

Object to Edge all the time.<br />

The code snippet in Figure 6.6 is the archetypal example of recursive traversal<br />

through T . Notice how we are providing a direction for the traversal by<br />

supplying the parent edge of the node as a parameter. This allows us to skip<br />

over that edge when we encounter it on our way through the cyclic list. The<br />

EdgeIterator keeps track of our progress so we only perform one cycle though<br />

the cyclic list of edges.<br />

The EdgeIterator wraps the functionality of navigating through the pointers<br />

between edges. And the template described in Figure 6.6 wraps the functionality<br />

of recursively descending down through the tree datastructure.<br />



interface EdgeIterator<br />

{<br />

boolean hasNext();<br />

Edge next();<br />

}<br />

Figure 6.5: The interface for the EdgeIterator class<br />
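A possible implementation of the iterator, under the assumed linking convention that Edge.next connects edges sharing an origin, could look as follows. The class name CyclicEdgeIterator and the termination flag are illustrative; the thesis only fixes the interface.

```java
// Sketch of an iterator over the cyclic list of outgoing edges, assuming
// Edge.next links the edges that share an origin. It yields each outgoing
// edge exactly once, stopping after a full cycle. Names are illustrative.
class Node { Edge incidentEdge; }
class Edge { Node origin, destination; Edge next, previous, twin; }

class CyclicEdgeIterator {
    private final Edge start;
    private Edge current;
    private boolean done;

    CyclicEdgeIterator(Edge incidentEdge) {
        this.start = incidentEdge;
        this.current = incidentEdge;
        this.done = (incidentEdge == null);
    }

    boolean hasNext() { return !done; }

    Edge next() {
        Edge e = current;
        current = current.next;
        if (current == start) done = true;  // completed one full cycle
        return e;
    }
}

public class IteratorDemo {
    public static void main(String[] args) {
        // Build a node with two outgoing edges linked in a 2-cycle.
        Node w = new Node();
        Edge e1 = new Edge(); Edge e2 = new Edge();
        e1.next = e2; e2.next = e1;
        w.incidentEdge = e1;
        CyclicEdgeIterator it = new CyclicEdgeIterator(w.incidentEdge);
        int count = 0;
        while (it.hasNext()) { it.next(); count++; }
        System.out.println(count); // visits each outgoing edge exactly once
    }
}
```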

6.2 Operations on the tree data structure<br />

The tree data structure maintains a compatible set of splits. The operations<br />

we are interested in supporting are the following:<br />

• Insert(σ)<br />

Of course, we need to be able to insert splits into our set of splits<br />

• Delete(σ)<br />

And we need to be able to remove splits from the set<br />

• Incompatible(σ)<br />

And finally we need to be able to determine if a split is compatible with<br />

any split in our compatible set of splits, and if there is one, to extract it.<br />

The following sections describe in detail how these operations have been<br />

implemented. In order to support the operations on the tree data structure<br />

needed in the refined <strong>Buneman</strong> algorithm, we must be able to search through<br />

the tree, and find both nodes and edges.<br />

For the insert operation we will need to locate a node with the property that<br />

for a split σ = U|V the node separates U and V in the sense that we can group<br />

its subtrees in two, where one group only contains elements from U, and the<br />

other group only contains elements from V . This node can then be expanded<br />

into two connected nodes, one for each subtree group.<br />

For the delete operation we need to be able to find the edge e that separates U<br />

and V , meaning that the node on one end of the edge is the root of a subtree containing<br />

only elements from U, and the node on the other end is the root of a subtree<br />

containing only elements from V .<br />

For the incompatible operation we need to be able to find an edge e which<br />

is incompatible with another edge e ′ . We will use Definition 3 to decide the<br />

question of incompatibility.<br />

6.2.1 Insert<br />

Assume we have an unrooted tree T , and that we have selected some random<br />

node w root as starting point for traversal. The task is to insert a split σ = U|V<br />

which is compatible with all other splits represented by T . This amounts to<br />



class Node<br />

{<br />

Edge incidentEdge;<br />

public void traverse(Edge parent)<br />

{<br />

EdgeIterator ei = new EdgeIterator(this.incidentEdge);<br />

while (ei.hasNext())<br />

{<br />

Edge e = ei.next();<br />

if (e.twin == parent) continue;<br />

Node n = e.destination;<br />

n.traverse(e);<br />

}<br />

}<br />

}<br />

Figure 6.6: An example of code for traversing T<br />

finding the node w target (it is guaranteed to exist since σ is compatible with all<br />

other splits in T , see [BFÖ+ 03]), such that if we root T at w target , all subtrees<br />

of w target contain only leaves from either U or V , but not from both — this is<br />

illustrated in Figure 6.7.<br />

Clearly, if we have a node w where all the subtrees of w have the property<br />

that they contain only leaves from either U or V , we may take all the U-<br />

subtrees and hang them on a new node w U , and we may take all the V -subtrees<br />

and hang them on a new node w V . We now have two trees rooted at w U and<br />

w V , respectively, containing all the U’s and all the V ’s, respectively, and we<br />

may now join these two nodes with an edge e that now separates U and V , and<br />

thereby represents the split σ. This is illustrated in Figure 6.8.<br />

The question is now, how do we find w in the first place? The node is<br />

guaranteed to exist, since σ is a compatible split ([BFÖ+ 03]). So we just need<br />

to specify its location. Looking at Figure 6.9, let’s assume we are traversing T<br />

and we have reached some node w, where w has a parent edge and a bunch of<br />

edges to its children, given the orientation indicated in the figure. If we proceed<br />

to count the number of elements from U and V in the subtrees of w, let’s call<br />

the counts w u and w v , we end up with one of the following results:<br />

• w u < |U|∧w v < |V | — in this case, we can ascertain that the subtree of<br />

w given by its parent edge must contain elements from both U and V, so<br />

w cannot be the node we are looking for.<br />



Figure 6.7: An example of a node which is a valid insertion node (green), and one<br />

which is not a valid insertion node (red). A node is a valid insertion node if all<br />

subtrees of that node contain only nodes from U (blue nodes) or from V (black<br />

nodes), but not from both.<br />

• w u = |U| ∧ w v < |V | — in this case, we know that the subtree of w given<br />

by its parent edge only contains elements from V . If we assume that we<br />

found w in a bottom–up traversal, and that w is the first node with the<br />

property w u = |U|, w mustbethenodewearelookingfor.Itcannotbea<br />

node below w, since that node would have a subtree with mixed u’s and<br />

v’s, i.e. the subtree given by its parent edge.<br />

• w u < |U| ∧ w v = |V | — similar argument as in the case w u = |U| ∧ w v < |V |.<br />

• w u = |U| ∧ w v = |V | — in this case, the node we chose at random to be<br />

the intermediate root, must be the insertion node we are looking for.<br />

Counting can be done in linear time since we only need to visit each node<br />

once, and there are at most 2n − 2 nodes in an unrooted tree with n leaves.<br />

Searching can afterwards be done in time O(n), visiting each node exactly once<br />

and testing each node in constant time until we find the one we are looking for.<br />
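The four cases reduce to a simple constant-time test once the subtree counts are available. The sketch below classifies a node from the (u, v) counts of its child subtrees together with the totals |U| and |V |; the names and the array-based encoding are illustrative, not the thesis implementation.

```java
// Sketch of the insertion-node test: a node is a valid insertion node for
// the split U|V if every one of its subtrees — including the subtree seen
// through its parent edge — contains leaves from U only or from V only.
// Child counts come from the bottom-up counting pass; the parent-side
// counts follow from the totals. Names are illustrative.
public class InsertionNode {
    static boolean isValid(int[][] childCounts, int totalU, int totalV) {
        int sumU = 0, sumV = 0;
        for (int[] c : childCounts) {
            if (c[0] > 0 && c[1] > 0) return false; // mixed child subtree
            sumU += c[0];
            sumV += c[1];
        }
        // The subtree "above" the node, seen through its parent edge:
        int upU = totalU - sumU, upV = totalV - sumV;
        return !(upU > 0 && upV > 0);
    }

    public static void main(String[] args) {
        // Two pure child subtrees {2 U's} and {1 V}; above: 0 U's, 2 V's.
        System.out.println(isValid(new int[][] {{2, 0}, {0, 1}}, 2, 3)); // true
        // A mixed child subtree {1 U, 1 V} disqualifies the node.
        System.out.println(isValid(new int[][] {{1, 1}, {1, 0}}, 2, 2)); // false
    }
}
```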

6.2.2 Delete<br />

The delete operation is somewhat similar to the insert operation. Again we<br />

start out by counting U’s and V ’s in a bottom-up fashion. The edge e we are<br />

looking for has the property that its destination node is a node w, whereall<br />

subtrees of w contain only elements from U or from V , but not both. In other<br />

words, w is the node where u w = |U| and v w = 0 or vice versa. And, once the<br />

node w is found, the edge e has also been found, since we can keep track of both<br />




Figure 6.8: Inserting an edge e that splits U and V<br />


Figure 6.9: Local orientation around a node, given we are in the middle of a<br />

tree traversal.<br />

w and e while searching for w. Again we use linear time for counting through<br />

T , and afterwards we can search through T in time O(n).<br />

To remove e we need to move all subtrees of w onto e’s origin node, let’s call<br />

it w ′ . But this is quite easy: we just iterate through the nodes directly “under”<br />

w and sever their connection to w. Afterwards we iterate through the nodes<br />

again and attach them to w ′ . This can all be done in linear time 1 .<br />

6.2.3 Incompatible<br />

Given a split σ = U|V , the search for an incompatible split σ ′ = U ′ |V ′ ∈ T<br />

is a bit more complicated than for the insert and delete cases. A split σ ′ is<br />

incompatible with σ if all four intersections of {U ′ ,V ′ } and {U, V } are not<br />

empty. In other words, the values |U ′ ∩ U|, |U ′ ∩ V |, |V ′ ∩ U| and |V ′ ∩ V | are<br />

all non-zero.<br />

1 Of course it can be done faster if we did not cache the nodes in between detaching<br />

and attaching them, but the design of our tree data structure warrants this procedure, and<br />

asymptotically there is no difference.<br />



So how do we find σ ′ ? We start by counting T bottom-up with respect to<br />

U and V . We have that for every node w ∈ T which hangs from some edge e,<br />

u w is equal to the value |U ′ ∩ U|. U ′ is the set of leaves which lie in the<br />

subtree of w, so u w is both the number of elements from U in w’s subtree and<br />

the number of elements that U and U ′ have in common. Similarly, we also have<br />

that v w = |U ′ ∩ V | by the same argument.<br />

Now we apply basic set theory to find the values |V ′ ∩ U| = |U| − |U ′ ∩ U|<br />
and |V ′ ∩ V | = |V | − |U ′ ∩ V |. We have all four quantities, and we can check if<br />

any one of them is zero, and answer the question whether w is the node we are<br />

looking for (consequently whether e is the edge we are looking for).<br />

The time spent searching for an incompatible node is linear. We use linear<br />

time decorating T with U/V counts, and we use linear time checking each one for<br />

incompatibility. Given u w , v w , |U| and |V | we can in constant time determine<br />

incompatibility by calculating the quantities described above.<br />

We are not guaranteed to find an incompatible split, but if we do, we can report<br />

it in linear time. To do this, we allocate a bitvector b of length n and search<br />

through one of the subtrees induced by e. We mark an entry in b corresponding<br />
to the index of every leaf we find.<br />
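As an illustration, the four-intersections test can also be phrased directly over bitvector-encoded splits. This is a sketch using 64-bit masks (so it assumes n ≤ 64 taxa); the implementation proper works with length-n bitvectors and the per-node counts u w and v w described above, but the test is the same:

```java
// Hedged sketch: each split U|V over n taxa is encoded by the bitmask of its
// U side (bit i set iff taxon i lies in U). Suitable only for n <= 64.
final class SplitCheck {
    static boolean incompatible(long u1, long u2, int n) {
        long all = (n == 64) ? -1L : (1L << n) - 1; // mask of all n taxa
        long a = u1 & u2;          // U  ∩ U'
        long b = u1 & ~u2 & all;   // U  ∩ V'
        long c = ~u1 & u2 & all;   // V  ∩ U'
        long d = ~u1 & ~u2 & all;  // V  ∩ V'
        // the splits are incompatible iff all four intersections are non-empty
        return a != 0 && b != 0 && c != 0 && d != 0;
    }
}
```

For example, over four taxa the splits {0,1}|{2,3} and {0,2}|{1,3} are incompatible, while {0,1}|{2,3} and {0}|{1,2,3} are compatible.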

6.3 Testing the tree data structure<br />

The tree data structure has not been tested formally. Of course, all tree operations<br />

have been unit-tested, and have been found to be correct. Also, performance<br />

tests have been run to some extent, but this is not easy: if we want to<br />

test the performance of the tree operations, we would need to be able to create<br />

highly resolved trees by inserting a large number of compatible splits created<br />

at random, which could then be a basis for inserting, deleting or searching for<br />

splits. Doing the operations on an unresolved tree would not give a realistic<br />

picture of the performance. It is quite easy to generate a caterpillar tree by<br />
starting out with a bitvector of the form 11000..., and then generating splits of<br />
the form 111000..., 1111000..., 11111000... and so on. This was used for<br />
informal performance testing, but clearly this kind of tree is biased, and the<br />
author has chosen not to present a formal test on this basis. Still, the author<br />

claims that the tree data structure does indeed run as specified, referring to<br />

the performance test for the whole algorithm — if the tree data structure did<br />

not support linear time insertion, deletion and searching, the refined <strong>Buneman</strong><br />

algorithm would not be able to perform as specified.<br />



Chapter 7<br />

The Single Linkage<br />

Clustering Tree<br />

Single linkage clustering trees play an important part in this work as a replacement<br />

for anchored <strong>Buneman</strong> trees. The refined <strong>Buneman</strong> tree algorithm<br />

described in this thesis is based on Lemma 5, where anchored <strong>Buneman</strong> trees<br />

need to be merged with refined <strong>Buneman</strong> trees to create new refined <strong>Buneman</strong><br />

trees. However, since the first part of the algorithm only requires us to build<br />
overapproximations of the refined Buneman tree, we can use single linkage<br />
clustering trees instead of anchored Buneman trees, because every single linkage<br />
clustering tree contains the corresponding anchored Buneman tree.<br />

7.1 Replacing the anchored <strong>Buneman</strong> tree<br />

The anchored <strong>Buneman</strong> tree can be computed in time O(n 2 ), according to<br />

Lemma 3, by first constructing a single linkage clustering tree that is a superset<br />

of B x (δ), and then pruning that tree ([BB99], section 3). But, since we are<br />

creating an overapproximation of the refined <strong>Buneman</strong> tree by incrementally<br />

considering splits from anchored <strong>Buneman</strong> trees, we can replace the anchored<br />

<strong>Buneman</strong> trees altogether with the unpruned single linkage clustering trees. The<br />

extra splits might or might not become part of our overapproximation, but they<br />

will be weeded out later on in the algorithm, and they do not asymptotically<br />

change the size of the overapproximation. The tree cannot grow beyond O(n)<br />

splits. And luckily, the single linkage clustering tree is extremely simple to<br />

compute.<br />

7.2 Calculating the single linkage clustering tree<br />

As mentioned, the single linkage clustering tree is very simple to calculate.<br />

Pseudo code for the algorithm, adapted from [BG91], section 3.2.7, is listed in<br />



Algorithm 5. Offhand, the algorithm looks like it would perform in time O(n 3 ).<br />

We iterate through a loop, reducing the size of the distance matrix by one<br />
(adding one entry and removing two) on each iteration, until the distance matrix<br />
is exhausted. Inside the loop, in line 3 we need to find the minimum entry in an<br />
n × n matrix, normally an O(n^2) operation. However, if we overlay our distance<br />
matrix with a quad tree search structure, we can get away with spending constant<br />
time finding the minimum entry in the matrix, while spending only linear time<br />
updating the search structure after we update the distance matrix in line 5.<br />
More about the quad tree data structure can be found in chapter 8.<br />

Algorithm 5 The single linkage clustering tree algorithm<br />

Require: δ is a distance measure on X<br />

Ensure: C is the single linkage clustering tree for δ<br />

1: C a set of clusters, one for each element in X<br />

2: while |C| > 1 do<br />

3: Choose the clusters c 1 ,c 2 ∈ C that minimize the quantity δ(c 1 ,c 2 ).<br />

4: Create c ′ = c 1 ∪ c 2 .<br />

5: Calculate distances from c ′ to all other clusters c ′′ ∈ C by setting<br />

δ(c ′ ,c ′′ )=min{δ(c 1 ,c ′′ ),δ(c 2 ,c ′′ )}.<br />

6: Erase c 1 and c 2 from C, add c ′ .<br />

7: end while<br />

Also, instead of actually inserting and removing entire rows and columns in<br />
the distance matrix, we can reuse a row and column, and rather than deleting<br />
rows we can keep track of which clusters are alive. Figure 7.1 illustrates<br />
this point.<br />
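The whole loop, including the row-reuse trick, can be sketched as follows. The class and method names are illustrative; the minimum is found here by a plain scan, which makes the sketch O(n^3) overall, whereas the thesis replaces that scan with the quad tree of chapter 8 to reach O(n^2):

```java
final class SingleLinkage {
    // Sketch of Algorithm 5 with the row-reuse trick from Figure 7.1: the
    // merged cluster c' reuses c1's row/column, and c2's row is merely marked
    // dead. Returns the n-1 merges as pairs of cluster representatives.
    static int[][] cluster(double[][] d) {
        int n = d.length;
        double[][] dist = new double[n][n];
        boolean[] alive = new boolean[n];
        for (int i = 0; i < n; i++) {
            alive[i] = true;
            for (int j = 0; j < n; j++) dist[i][j] = d[i][j];
        }
        int[][] merges = new int[n - 1][];
        for (int step = 0; step < n - 1; step++) {
            int c1 = -1, c2 = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int i = 0; i < n; i++)                 // choose the closest pair
                for (int j = i + 1; j < n; j++)
                    if (alive[i] && alive[j] && dist[i][j] < best) {
                        best = dist[i][j]; c1 = i; c2 = j;
                    }
            merges[step] = new int[]{ c1, c2 };
            for (int k = 0; k < n; k++)                 // single-linkage update
                if (alive[k] && k != c1 && k != c2) {
                    double m = Math.min(dist[c1][k], dist[c2][k]);
                    dist[c1][k] = m; dist[k][c1] = m;
                }
            alive[c2] = false;                          // retire c2's row
        }
        return merges;
    }
}
```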

7.3 Converting the distance matrix<br />

One important note about the use of the single linkage clustering algorithm<br />

instead of the anchored <strong>Buneman</strong> tree algorithm is the distinction between distance<br />

matrix and similarity matrix. The anchored <strong>Buneman</strong> tree for a species<br />

x ∈ X is computed based on a distance measure δ on X, while the single<br />

linkage clustering tree for x ∈ X is computed based on an inverted similarity<br />

measure on X with respect to x. In other words we must distinguish between<br />

SLCT(δ, X) and SLCT(F x (δ), X).<br />

Basically, this means that we have to transform δ before using it to calculate<br />

the single linkage clustering tree. The transformation is described in [BB99],<br />
section 3, where it is termed a Farris transformation:<br />

F x (a, b) = ½ (δ ax + δ bx − δ ab )<br />

where a, b ∈ X \{x}. The Farris transformation creates a similarity measure,<br />

but this can be converted into a distance measure by e.g. changing signs.<br />
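A minimal sketch of the transformation, assuming a full distance matrix (the class and method names are illustrative; entries involving the anchor x itself are not meaningful for a, b ∈ X \ {x} and should simply be ignored):

```java
// Hedged sketch of the Farris transformation ([BB99], section 3):
//   F_x(a, b) = 1/2 (d(a,x) + d(b,x) - d(a,b))
// The result is a similarity measure; negate it to obtain a distance measure.
final class Farris {
    static double[][] transform(double[][] d, int x) {
        int n = d.length;
        double[][] f = new double[n][n];
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                f[a][b] = 0.5 * (d[a][x] + d[b][x] - d[a][b]);
        return f;
    }
}
```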




Figure 7.1: Manipulating the matrix in the single linkage clustering tree. A)<br />
We find the two rows and two columns with minimum entries (recall the matrix<br />
is symmetric). B) We reuse one row/column pair, and discard the other. C)<br />
The new matrix can no longer index the discarded row and column, and the<br />
reused row and column have been re-indexed.<br />



Algorithm 6 The algorithm that calculates the single linkage clustering tree<br />

for x<br />

Require: δ is a distance measure on X<br />

Ensure: T is root in the single linkage clustering tree for x ∈ X<br />

1: δ ′ = FARRISTRANSFORM(δ, x)<br />
2: Q is a quad tree overlaid on δ ′<br />

3: C is a list of clusters corresponding to δ ′<br />

4: while |C| > 1 do<br />

5: (c 1 ,c 2 )=MINIMUM(Q)<br />

6: c ′ = c 1 ∪ c 2<br />

7: for c ′′ ∈ C, c ′′ ≠ c 1 ,c 2 do<br />

8: δ ′ (c ′ , c ′′ ) = min{δ ′ (c 1 , c ′′ ), δ ′ (c 2 , c ′′ )}<br />

9: end for<br />

10: C = C ∪{c ′ }\{c 1 ,c 2 }<br />

11: UPDATE(Q)<br />

12: end while<br />

13: T = C<br />

Thus, the algorithm for computing the single linkage clustering tree for x<br />

becomes the algorithm found in Algorithm 6.<br />

Analysing Algorithm 6 is straightforward: initialising δ ′ and Q takes time<br />

O(n 2 ). We iterate n times in the while loop, performing constant time search<br />

for the minimum entry in the distance matrix, a linear time update of the new<br />
row/column for c ′ with regard to the remaining clusters in the matrix, linear<br />
time removal of c 1 and c 2 , and a linear time update of the quad tree data<br />
structure. All in all we have an O(n^2) implementation of the single linkage clustering<br />
tree algorithm. Thus, we can safely replace anchored Buneman trees<br />
with single linkage clustering trees in the main algorithm, since it has the same<br />

running time complexity. Also, we have argued that the anchored <strong>Buneman</strong><br />

tree is contained in the single linkage clustering tree. And since we are building<br />

overapproximations of the refined <strong>Buneman</strong> tree, the extra splits from the single<br />

linkage clustering tree will not affect the algorithm adversely.<br />



Chapter 8<br />

The Quad Tree Data<br />

Structure<br />

The quad tree is a generalisation of the binary tree; where the binary tree deals<br />

with linearly ordered data, the quad tree encapsulates data in two dimensions.<br />

The analogy is shown in Figure 8.1. We will use the quadtree as a search<br />

structure atop a matrix, so that we can find the minimum in that matrix in<br />

constant time while paying only linear time for updates. This makes it possible<br />

to adhere to certain complexity bounds in the single linkage clustering tree<br />

algorithm described in chapter 7.<br />


Figure 8.1: The analogy of binary trees and quad trees.<br />




Figure 8.2: Updating the quad tree by propagating changes upward.<br />

8.1 Complexity analysis<br />

For every node in our quad tree we wish to maintain information about the minimum<br />

matrix entry in the subtree of that node. That way we can, at any time,<br />

find the minimum entry of the matrix in constant time by looking it up in<br />

the root node. We can keep the tree updated by making leaf nodes propagate<br />

changes upward in the tree, at a price proportional to the height of the tree; for a<br />
balanced quad tree spanning an n × n matrix the height is log_4(n^2) = 2 log_4(n),<br />
so the time spent propagating changes upward through the tree is O(log_4(n)).<br />

Unfortunately, when we are using the quadtree to support construction of<br />

single linkage clustering trees, we need to update two rows and two columns in<br />

the matrix for each iteration, which would then take time O(n log 4 (n)) if we<br />

needed to propagate values upward through the tree every time, and we can<br />

afford to spend no more than linear time. Nor can we afford to update<br />
(re-initialise) the whole tree, since that would take time O(n^2).<br />

Luckily, it is possible to optimize the update procedure under these specific<br />

circumstances. It turns out that it is possible to update the tree in linear<br />

time when changing precisely one row or column. This is due to the<br />
geometric dependency of the nodes, as illustrated in Figure 8.2. It clearly shows<br />
that the upward propagation of changes introduces redundancy; for example,<br />
changes to entries a and b share almost identical paths toward the root.<br />
By cutting down on this redundancy we can shave off a bit of the running time.<br />
We can ask ourselves how many nodes in the quad tree we need to update, in<br />
total. The answer is that after updating a row of n entries in the matrix we<br />
need to update n/2 entries in the first layer of the quad tree, n/4 entries in the<br />
second layer, and so on. This gives us<br />
<br />
n + n/2 + n/4 + ... + 1 ≤ 2n<br />


Figure 8.3: Layers in the quad tree are invalidated after updating a matrix row.<br />

entries to be updated in total. The notion of layers is illustrated in Figure 8.3.<br />

The redundancy cutdown can be implemented using a graph colouring scheme.<br />

When entries are changed, the change is propagated upwards toward the<br />
root, marking nodes as invalid. If we encounter a node that has already<br />
been invalidated, there is no need to go any further upward, since all nodes<br />

above it will also be invalid. This way we avoid visiting the same node more<br />

than once. Updating a node of course takes constant time.<br />

Once the matrix row has been updated, and nodes have been invalidated<br />
throughout the tree, we can recompute the invalidated nodes bottom-up (via<br />
a traversal starting from the root), and thus maintain valid information about<br />
the minimum entry in the root node.<br />
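To make the scheme concrete, here is a sketch of such a search structure. The class is hypothetical, and instead of invalidation marks it simply recomputes the one affected row and one affected column of cells at each level; that is the same 2n-bounded amount of work per row/column update as derived above:

```java
// Sketch of a quad-tree min structure over a symmetric matrix, padded with
// +infinity up to the next power of two. Each internal cell stores the minimum
// of its four children plus the leaf index attaining it.
final class QuadTreeMin {
    final int size;           // padded side length, a power of two
    final double[][][] lvl;   // lvl[0] = the matrix; each level halves the side
    final int[][][] arg;      // arg[k][r][c] = leaf index (row*size+col) of the min

    QuadTreeMin(double[][] m) {
        int n = m.length, s = 1;
        while (s < n) s *= 2;
        size = s;
        int levels = Integer.numberOfTrailingZeros(s) + 1;
        lvl = new double[levels][][];
        arg = new int[levels][][];
        lvl[0] = new double[s][s];
        arg[0] = new int[s][s];
        for (int r = 0; r < s; r++)
            for (int c = 0; c < s; c++) {
                lvl[0][r][c] = (r < n && c < n) ? m[r][c] : Double.POSITIVE_INFINITY;
                arg[0][r][c] = r * s + c;
            }
        for (int k = 1; k < levels; k++) {
            int w = s >> k;
            lvl[k] = new double[w][w];
            arg[k] = new int[w][w];
            for (int r = 0; r < w; r++)
                for (int c = 0; c < w; c++) pull(k, r, c);
        }
    }

    // Recompute one internal cell from its four children in constant time.
    private void pull(int k, int r, int c) {
        double best = lvl[k - 1][2 * r][2 * c];
        int bi = arg[k - 1][2 * r][2 * c];
        for (int dr = 0; dr < 2; dr++)
            for (int dc = 0; dc < 2; dc++)
                if (lvl[k - 1][2 * r + dr][2 * c + dc] < best) {
                    best = lvl[k - 1][2 * r + dr][2 * c + dc];
                    bi = arg[k - 1][2 * r + dr][2 * c + dc];
                }
        lvl[k][r][c] = best;
        arg[k][r][c] = bi;
    }

    double minValue() { return lvl[lvl.length - 1][0][0]; }           // O(1)

    int[] minCell() {                                                  // O(1)
        int i = arg[lvl.length - 1][0][0];
        return new int[]{ i / size, i % size };
    }

    // Overwrite row r and, by symmetry, column r, then repair the tree.
    // Per level only one row and one column of cells change, so the total
    // work is bounded by 2 (s + s/2 + ... + 1) <= 4s, i.e. linear.
    void setRow(int r, double[] vals) {
        for (int c = 0; c < size; c++) {
            double v = c < vals.length ? vals[c] : Double.POSITIVE_INFINITY;
            lvl[0][r][c] = v;
            lvl[0][c][r] = v;
        }
        for (int k = 1; k < lvl.length; k++) {
            int w = size >> k, rr = r >> k;
            for (int c = 0; c < w; c++) { pull(k, rr, c); pull(k, c, rr); }
        }
    }
}
```

In the single linkage clustering loop, the diagonal (and any retired row) would be set to +infinity so that it never wins the minimum query.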

8.2 Performance of the quad tree data structure<br />

One important practical issue regarding quad trees is their dimension. Obviously,<br />

for an n×n matrix where n is a power of 2, a quad tree will fit perfectly. In<br />

case n is not a power of two, special considerations must be taken. The author<br />
has chosen to implement the quad tree such that, given an input matrix of size<br />
n × n, the base quad tree layer has size 2^m × 2^m, where 2^(m−1) < n ≤ 2^m.<br />


[Plot: running time (ms.) vs. input size for 1000 row/column updates, with best fit ax + b.]<br />

Figure 8.4: Quad tree performance 1: updating 1000 rows and columns for a<br />

given input size. The artifact between sizes 700-800 is probably due to some<br />

system processes disturbing the experiment.<br />

The plots show running time against input size; the time unit is milliseconds.<br />
Input sizes range between 50<br />

and 873. Originally, the intention was to test input sizes up to 1100, i.e. past<br />

the next power-of-two threshold at 1024. However, the Java program threw<br />
an OutOfMemoryError at input size 873. Also, the plots clearly show a<br />

degeneration of performance in the large input size range, i.e. sizes greater than<br />

700, due to the large memory consumption.<br />

The initialisation artifact due to the sizing of the quad tree to a power of<br />
2 shows up in the performance characteristic for the refined Buneman tree<br />
algorithm. Perhaps there is a more elegant way of handling the sizing problem?<br />
Or perhaps we should choose a different algorithm, so we would not need to<br />
use a quad tree? The single linkage clustering tree was chosen for its apparent<br />

simplicity, but there are also minimum spanning tree methods that solve the<br />

problem (see for example [Epp98]). In any case, the initialisation still stays<br />

within specified complexity boundaries, and it does not adversely affect the<br />

asymptotic performance of the algorithm.<br />



[Plot: running time (ms.) vs. input size for quad tree initialisation.]<br />

Figure 8.5: Quad tree performance 2: initialising the quad tree. Notice the 2^m<br />
artifact and the performance degradation in the 700+ size range. This is due<br />
to the JVM running out of memory (of course, the JVM can be parameterized<br />
to use more memory).<br />



[Plot: running time (ms.) vs. input size for initialisation and update combined.]<br />

Figure 8.6: Quad tree performance 3: initialisation and update, in total.<br />



Chapter 9<br />

The Discard-Right<br />

Algorithm<br />

The DISCARD-RIGHT algorithm is used in Algorithm 3 to answer the following<br />

question: given two incompatible splits σ (left) and σ ′ (right), is μ σ ′ ≤ 0? The<br />

algorithm is based on the construction given in the article by Brodal et al.<br />

([BFÖ+ 03]), in the proof for Lemma 5. This construction relies on both σ and<br />

σ ′ to compute an upper bound for μ σ ′ in linear time.<br />

In the context where DISCARD-RIGHT is used, we have on the one hand a set of<br />
compatible splits S representing an overapproximation of RB(δ| X k ) for some k.<br />
On the other hand, we have a new split σ which we want to evaluate. We have<br />

determined that σ is incompatible with some σ ′ ∈ S. This means that either σ<br />

or σ ′ (or both) do not belong in RB(δ| Xk ).<br />

Normally it would take time in the order of n^4 to calculate μ σ ′ , since we<br />
would have to consider all possible quartets induced by σ ′ to find the n − 3 least<br />

scoring quartets that are used to find the refined <strong>Buneman</strong> index. However, since<br />

we have a pair of incompatible splits we can use the linear time algorithm given<br />

in Algorithm 7 to find an upper bound for μ σ ′ which can be used to determine<br />

whether σ ′ should be discarded or not. Note that this does not necessarily<br />

validate or invalidate σ as a split in RB(δ). This is apparent from the truth<br />
table in Table 9.1: notice how the fourth option, where both Buneman indices<br />
are positive, is not possible when we assume that σ and σ ′ are incompatible.<br />

Algorithm 7 deserves a few comments; firstly, in line 1 we can determine the<br />

     μ σ     μ σ ′    DISCARD-RIGHT<br />
1    ≤ 0     ≤ 0      true<br />
2    > 0     ≤ 0      true<br />
3    ≤ 0     > 0      false<br />

Table 9.1: The truth table for the DISCARD-RIGHT algorithm.<br />



sets A, B, C and D by scanning the two splits concurrently, in linear time. In<br />

lines 3–6 we know that |A|×|B|×|C|×|D| ≥ n − 3, so we will generate enough<br />

possible quartets for later use.<br />

In line 7 we have elements a ∈ A, b ∈ B, c ∈ C and d ∈ D, and we can<br />

combine these into 3 different quartets, q x , q y and q z . We know that two distinct<br />

quartets q 1 and q 2 for the same four species satisfy β q1 + β q2 ≤ 0 from chapter 2<br />

or from [BFÖ+ 03]. So now all we need to do is to determine which pair of<br />

quartets fit with our splits. We need to select q 1 and q 2 from {q x ,q y ,q z } such<br />

that q 1 ∈ q(σ) and q 2 ∈ q(σ ′ ), and there are 6 possible combinations to choose<br />

from: q x q y , q x q z , q y q x , q y q z , q z q x and q z q y . In line 8 we store q 2 for later. And<br />

finally in line 9, if we have found enough quartets we skip to the second part of<br />

the algorithm.<br />
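To make the quartet scores concrete, here is a sketch of the β computation, assuming the quartet score definition from chapter 2 / [BFÖ+ 03], β(ab|cd) = ½ (min{δ(a,c)+δ(b,d), δ(a,d)+δ(b,c)} − δ(a,b) − δ(c,d)); the class and method names are illustrative only:

```java
// Hedged sketch of the Buneman score of the quartet topology ab|cd.
// For the same four species, at most one of the three possible topologies
// gets a positive score, so any two distinct topologies sum to at most zero.
final class QuartetScore {
    static double beta(double[][] d, int a, int b, int c, int e) {
        double cross = Math.min(d[a][c] + d[b][e], d[a][e] + d[b][c]);
        return 0.5 * (cross - d[a][b] - d[c][e]);
    }
}
```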

In line 17 we exploit the fact that for all the pairs (q 1 , q 2 ) we found in the<br />
for-loops, β q1 + β q2 ≤ 0, because of our choices of a–d. From this we get that<br />
<br />
∑_{i=1}^{n−3} (β_{q1,i} + β_{q2,i}) ≤ 0<br />
<br />
for some enumeration of our candidate quartets. We can split the sum in two;<br />
then we have that<br />
<br />
∑_{i=1}^{n−3} β_{q1,i} ≤ 0 and/or ∑_{i=1}^{n−3} β_{q2,i} ≤ 0<br />
<br />
since they cannot both be positive. Now we know that the sum of the n − 3<br />
least scoring quartets is the refined Buneman index, so we have<br />
<br />
μ_{σ′} ≤ ∑_{i=1}^{n−3} β_{q2,i}<br />
<br />
And thus we can answer the question of whether σ ′ can be discarded or not.<br />



Algorithm 7 The DISCARD-RIGHT algorithm<br />

Require: σ = U 1 |V 1 and σ ′ = U 2 |V 2 are incompatible splits.<br />

Ensure: return true if μ σ ′ ≤ 0, false otherwise.<br />

1: A = U1 ∩ U2, B = U1 ∩ V 2, C = V 1 ∩ U2, D = V 1 ∩ V 2<br />

2: Q = ∅<br />

3: for a =1to|A| do<br />

4: for b =1to|B| do<br />

5: for c =1to|C| do<br />

6: for d =1to|D| do<br />

7: determine permutations q 1 and q 2 of A[a], B[b], C[c] and D[d] such<br />
that q 1 ∈ q(σ) and q 2 ∈ q(σ ′ )<br />

8: Q = Q ∪ q 2<br />

9: if |Q| = n − 3 then<br />

10: goto sum<br />

11: end if<br />

12: end for<br />

13: end for<br />

14: end for<br />

15: end for<br />

16: sum:<br />

17: if ∑_{i=1}^{n−3} β Q[i] ≤ 0 then<br />

18: return TRUE<br />

19: end if<br />

20: return FALSE<br />



Chapter 10<br />

The Selection Algorithm<br />

In Algorithm 4 we are faced with the problem of removing the (n − 3) largest<br />

elements from a set of size 3(n−3) or more. In general, the problem is removing<br />

the i largest elements from a set A of size n, where i < n. This is done with a<br />
randomized selection algorithm in the style of [CLR90], which relies on the<br />
partitioning procedures listed in Algorithms 9 and 10.<br />


Algorithm 9 The RANDOMIZED-PARTITION algorithm<br />

Require: A is an array of length n.<br />

Require: p, r are indices in A<br />

Ensure: returns q such that elements in A[p..q] are all less than or equal to all<br />

elements in A[q +1..r].<br />

1: i = RANDOM(p, r)<br />

2: SWAP(A[p],A[i])<br />

3: return PARTITION(A, p, r)<br />

Algorithm 10 The PARTITION algorithm<br />

Require: A is an array of length n.<br />

Require: p, r are indices in A<br />

Ensure: returns j such that elements in A[p..j − 1] are all less than or equal<br />

to all elements in A[j..r].<br />

1: x = A[p]<br />

2: i = p − 1<br />

3: j = r +1<br />

4: while true do<br />

5: repeat<br />

6: j = j − 1<br />

7: until A[j] ≤ x<br />

8: repeat<br />

9: i = i +1<br />

10: until A[i] ≥ x<br />

11: if i < j then<br />
12: SWAP(A[i], A[j])<br />
13: else<br />
14: return j<br />
15: end if<br />
16: end while<br />


The worst case running time of the selection algorithm is O(n^2). The algorithm uses a divide-and-conquer strategy, where the idea<br />

is to cut away search space to quickly find the element we are looking for. If<br />

we were to divide in half every time we would expect to spend linear time<br />

dividing and searching recursively (n + n/2 +n/4 +···+1). However, since<br />

we are partitioning around some element in an array, we are not guaranteed to<br />

partition in the middle each time. The randomized part of the algorithm tries<br />

to remedy this, and the expectation is that selecting an element at random will<br />

on average partition the search space in half. Of course, if we are very unlucky<br />

we will partition such that we only cut away one element each time, for a worst<br />

case performance of O(n 2 ).<br />
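The two procedures combine into the usual randomized selection routine; the following is an iterative Java sketch (the names and the 1-based rank convention are illustrative, not the thesis's exact code). To remove the i largest elements of A, one selects the (|A| − i)-th smallest and keeps the low side of the final partition.

```java
import java.util.Random;

// Hedged sketch of randomized selection with Hoare-style partitioning.
final class RandomizedSelect {
    private static final Random RNG = new Random();

    // Returns the i-th smallest element of a (1-based i), expected O(n).
    static double select(double[] a, int i) {
        int p = 0, r = a.length - 1;
        while (true) {
            if (p == r) return a[p];
            int q = randomizedPartition(a, p, r);
            int k = q - p + 1;               // size of the low side a[p..q]
            if (i <= k) { r = q; }           // the answer lies in the low side
            else { i -= k; p = q + 1; }      // continue in the high side
        }
    }

    // Swap a random element to the front, then partition around it.
    private static int randomizedPartition(double[] a, int p, int r) {
        int x = p + RNG.nextInt(r - p + 1);
        double t = a[p]; a[p] = a[x]; a[x] = t;
        return partition(a, p, r);
    }

    // Hoare partition: afterwards every element of a[p..j] is <= every
    // element of a[j+1..r], and p <= j < r, guaranteeing progress.
    private static int partition(double[] a, int p, int r) {
        double x = a[p];
        int i = p - 1, j = r + 1;
        while (true) {
            do { j--; } while (a[j] > x);
            do { i++; } while (a[i] < x);
            if (i < j) { double t = a[i]; a[i] = a[j]; a[j] = t; }
            else return j;
        }
    }
}
```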

10.2 Performance of the selection algorithm<br />

The selection algorithm runs in expected linear time, but worst case quadratic<br />

time. The following test shows that the performance is indeed linear, as expected.<br />

The test was done by running the randomized selection algorithm 50<br />

times on input sizes starting at 10000 and increasing the size by 1000 for each<br />

new run. Each run consists of 100 repetitions. A plot is made of input size in<br />

number of elements, against running time in milliseconds. This plot is shown in<br />

Figure 10.1. The plot clearly shows that the selection algorithm does indeed run<br />

in linear time, as expected. If one is not satisfied with this algorithm, [CLR90]<br />

has a more complicated selection algorithm which runs in guaranteed worst case<br />

linear time.<br />



[Plot: running time (ms.) vs. input size for Randomized-Select, with best fit ax + b.]<br />

Figure 10.1: Performance of the randomized selection algorithm.<br />



Chapter 11<br />

JSplits<br />

Figure 11.1: A screenshot of the JSplits user interface.<br />

11.1 What is JSplits?<br />

JSplits or SplitsTree 4 is the latest version of a program initially developed by<br />

Dr. Daniel Huson in cooperation with Rainer Wetzel at the University of Bielefeld,<br />

and later development has been continued by Dr. Huson at the University<br />



of Tübingen. In Dr. Huson’s words, SplitsTree is a program for analyzing and<br />

visualizing evolutionary data.<br />

The first version of SplitsTree was presented in an article from 1996 ([DHM96]),<br />

and a later version followed in 1998 ([Hus98]). Up to version 3.2, SplitsTree was<br />

written in C++, but starting from SplitsTree 4 the program will be developed in<br />

Java. Successive versions have incorporated more and more phylogenetic methods,<br />

and this latest version features a plug-in facility which enables others to extend<br />

the functionality of the program by adding new tree reconstruction methods and<br />

analysis tools. The author has gratefully taken advantage of this in his work.<br />

The JSplits program is available from this website:<br />

http://www-ab.informatik.uni-tuebingen.de/software/jsplits/<br />

In this work the author has used JSplits/ SplitsTree 4 version 3 beta,<br />

but at the time of writing the JSplits development team has reached version 4 beta,<br />

and the author is unsure whether this has introduced any inconsistencies.<br />

11.2 Using JSplits<br />

JSplits runs on any computer, Linux or Windows, with a JRE v1.4.2 (Java<br />

Runtime Environment, available from http://www.java.com), which is very<br />

handy indeed. It has a very simple user interface, as depicted in Figure 11.1.<br />

To use JSplits to visualize a data set, one just opens the data set from a file, and<br />

selects from a plethora of different tree methods using the drop down menus.<br />

The result of applying a method to a dataset in JSplits is some graphical<br />

representation depending on the method. For refined <strong>Buneman</strong> trees one gets<br />

a tree graph with labeled leaves. It is possible to drag vertices to new positions<br />

while keeping branch lengths constant, so one can sculpt the tree as one wishes.<br />

It is also possible to zoom in on interesting regions, which is very useful for large<br />

trees. The JSplits homepage might reveal more features, even some that the<br />

author is not aware of, since JSplits development has raced ahead of this work.<br />

One minor drawback to JSplits is that it only supports datasets in the Nexus<br />

format. The test data used in this work, a set of protein families from the PFAM<br />

database, is given in Phylip format, so it is not immediately applicable. However,<br />

the translation from Phylip to Nexus is straightforward, and the author<br />

has written such a module (see chapter 12). The Nexus format is described in<br />

[MSM97], but a simple google search will also provide several resources, including<br />

format translation programs:<br />

http://www.google.com/search?q=nexus+format<br />

11.3 Extending JSplits<br />

The implementation of the refined <strong>Buneman</strong> tree algorithm described in this<br />

work is aimed at integration into JSplits. Java is perhaps not the first choice for<br />



implementing an algorithm with a high time and space complexity. However,<br />

by implementing the algorithm in Java and integrating it into JSplits as a plugin,<br />

the author has been handed a free user interface that fits perfectly with<br />

this algorithm. Also, this method can hopefully become an integrated part of<br />
future releases of JSplits, which is easily and freely available, thereby contributing<br />
something valuable to the field of tree reconstruction.<br />

JSplits plug-ins fall into categories based on input and output. The refined<br />

<strong>Buneman</strong> tree algorithm will take a distance matrix as input and return a compatible<br />

set of splits, which can then be visualised in JSplits. Other methods<br />

might take a set of aligned nucleotide sequences and return tree structures. A<br />

full list of the extensive set of implemented methods and possible types of new<br />

methods can be found in the JSplits source code documentation, which is probably<br />

available upon request from Dr. Huson. At the time of writing it is not<br />

publicly available.<br />

The general way of creating a JSplits plug-in consists of two steps:<br />

• Implementing an interface from the catalogue of possible interfaces available<br />

in JSplits<br />

• Placing the implementing file correctly in the JSplits directory structure,<br />

which is grouped by input type<br />

The JSplits application will automatically detect the presence of implemented<br />

methods and create drop-down menu options for them. So the plug-in<br />

facility is very easy to use indeed, and the author experienced no problems with<br />

it at all.<br />

The author did experience a few problems with the classes used in JSplits.<br />

For some odd reason, the taxa list and input distance data are given in arrays<br />
with offset 1 instead of the traditional offset 0 used in most other computer<br />
science contexts, particularly in the Java programming language. But in spite<br />

of this minor quirk, the author has been very satisfied working with JSplits.<br />

11.4 The Distances2Splits interface<br />

In JSplits terminology, the refined <strong>Buneman</strong> tree algorithm falls in the category<br />

distances-to-splits algorithm, and Dr. Huson pointed out that the author should<br />

implement his work as a Java class implementing the Distances2Splits interface<br />

found in JSplits. That interface looks like this:<br />

package splits.algorithms.distances;<br />
<br />
interface Distances2Splits extends DistancesTransform<br />
{<br />
    boolean isApplicable(Taxa taxa, Distances d);<br />
<br />
    Splits apply(Taxa taxa, Distances d) throws Exception;<br />
}<br />



The interface has two methods that need to be implemented. Firstly, we<br />

would have to implement the method isApplicable which takes two parameters,<br />

a list of taxa (names of species) and a set of distance data (in the form<br />

of a matrix). The point is not to test if our algorithm can accept this type of<br />

input; we know that already, since we are implementing the Distances2Splits<br />

interface. Instead, the method is designed to test the quality of the input. Our<br />

refined <strong>Buneman</strong> tree algorithm requires a minimum input size of four, for example.<br />

So in case the input data does not meet such requirements, we will say that we cannot handle the<br />

data, and JSplits can then use that information to ensure our algorithm will<br />

never have to be computed with this input. JSplits handles this by blanking<br />

out the drop down menu option for our algorithm. Secondly, we have to implement<br />

the apply method. We are given the same parameters as before, and<br />

we are required to return a set of splits. JSplits can use these splits to create a<br />

graphical representation of the refined <strong>Buneman</strong> tree.<br />

The author has placed all source code in a separate package so as to not<br />

pollute the JSplits directory structure with irrelevant files. The apply method<br />

basically only performs translation from the (oddly indexed) JSplits data to<br />

a simple distance matrix of type double[][]. This simple matrix is then<br />

passed to the real implementation in the separate package, which is an implementation<br />

of the following interface. We simply ignore the taxa list for now.<br />

package dk.birc.rbt;<br />

interface Distances2SplitsAlgorithmInterface<br />

{<br />

Split[] compute(double[][] distanceMatrix);<br />

}<br />
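The re-indexing performed in the apply method can be sketched as follows. This is an illustrative, self-contained fragment, not actual JSplits or dk.birc.rbt code; the class and method names are hypothetical, and the JSplits accessor API itself is not shown.<br />

```java
// Illustrative sketch only: convert a 1-indexed (n+1) x (n+1) distance
// matrix, in the style JSplits provides, into the 0-indexed n x n matrix
// that the dk.birc.rbt package expects. Row and column 0 are simply unused.
class IndexTranslation {
    static double[][] toZeroIndexed(double[][] oneIndexed) {
        int n = oneIndexed.length - 1;
        double[][] zeroIndexed = new double[n][n];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                zeroIndexed[i - 1][j - 1] = oneIndexed[i][j];
        return zeroIndexed;
    }
}
```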

The Split class thus returned has the following type:<br />

package dk.birc.rbt;<br />

class Split<br />

{<br />

boolean[] split;<br />

double weight;<br />

}<br />

Finally the array of splits is translated into a set of splits in the JSplits<br />

sense, which basically involves re-indexing and adding taxa names, and that set<br />

is returned.<br />



Chapter 12<br />

Source Code<br />

The source code for the implementation described in the previous chapters is<br />

available from the following webpage. As the implementation has been done in<br />

Java, it should be able to run on any machine of any architecture<br />

using any operating system, provided a Java Runtime Environment (JRE)<br />

is available.<br />

http://www.daimi.au.dk/~lasse/thesis/<br />

The Java source code files are available for browsing online, and for downloading<br />

in a Zip file. The code is packaged¹ in two directory structures:<br />

dk.birc.rbt contains source code for the implementation of the refined <strong>Buneman</strong><br />

tree algorithm. The code is modularized into one .java file for each<br />

Java class for a total of 31 source files, containing roughly 2600 lines of<br />

code.<br />

splits.algorithms.distances contains the plugin code that adapts the refined<br />

<strong>Buneman</strong> method to the JSplits software package. Basically this consists<br />

of one file which performs simple transformations between the formats in<br />

JSplits and in the refined <strong>Buneman</strong> implementation.<br />

All Java classes and methods have been decorated using the Javadoc format<br />

([Sun03]), and thus Javadoc documentation of the source code tree is available<br />

for online browsing, using the webpage mentioned above.<br />

The implementation has furthermore been documented in a series of UML<br />

diagrams, also available from the webpage, in PostScript format. The UML<br />

diagrams are generally too large to fit on A4 paper, therefore they have not<br />

been included in this thesis. These diagrams should convey an overview of the<br />

implementation.<br />

1 packaged in the Java sense: using a reverse domain name scheme should generate a unique<br />

namespace, and the directory structure follows the package name with ’.’ replaced by ’/’<br />



Part III<br />

Tests and Experiments<br />



Chapter 13<br />

The Reference<br />

Implementation<br />

For testing purposes the author has implemented a package reference, which<br />

in a simple and transparent (and hopelessly inefficient) manner calculates the<br />

refined <strong>Buneman</strong> tree.<br />

The reference implementation is intended to provide a basis for comparison<br />

against the implementation of the [BFÖ+ 03] algorithm. To justify that the<br />

latter has been implemented correctly, we argue first that our reference implementation<br />

is correct, and thereafter we demonstrate that on the same input,<br />

the two implementations agree completely. These two arguments together form<br />

a basis for concluding that the [BFÖ+ 03] algorithm has been implemented correctly.<br />

13.1 A simple refined <strong>Buneman</strong> tree algorithm<br />

Algorithm 11 is a very simple algorithm that calculates the refined <strong>Buneman</strong><br />

tree. It has been adapted directly from the definition of the refined <strong>Buneman</strong><br />

tree.<br />

Line 1 initialises a set S to be the empty set. In line 2, we iterate over all<br />

possible splits in σ(X), avoiding duplicates by interpreting splits as bitvectors,<br />

and counting (using bit-flipping) through exactly half the possible splits. For<br />

each split σ, we initialise in line 3 a set of quartets Q to the empty set. In line 4<br />

we iterate over all possible quartets in q(σ) and add them to Q in line 5. In line<br />

7 we sort Q in increasing order so that in line 8 we can sum over the n − 3 least<br />

scoring quartets to find the refined <strong>Buneman</strong> index. In lines 9–10 we report the<br />

splits with positive refined <strong>Buneman</strong> index.<br />

The complexity of this algorithm is quite horrible: first we iterate over O(2^n)<br />

splits, and for each split we iterate over O(n^4) quartets. The algorithm therefore<br />

runs in time O(2^n · n^4).<br />



Algorithm 11 A naive algorithm that computes the refined <strong>Buneman</strong> tree<br />

Require: δ is a distance matrix of size n × n<br />

Ensure: S = RB(δ)<br />

1: S = ∅<br />

2: for σ ∈ σ(X) do<br />

3: Q = ∅<br />

4: for q ∈ q(σ) do<br />

5: Q = Q ∪ q<br />

6: end for<br />

7: SORT(Q)<br />

8: w = (1/(n − 3)) ∑_{i=1}^{n−3} β_{Q[i]}(δ)<br />

9: if w > 0 then<br />

10: S = S ∪ σ<br />

11: end if<br />

12: end for<br />
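The computation in lines 7–8 can be sketched in Java as follows. This is a hedged illustration only: the quartet scores β_q(δ) are assumed to be precomputed in a plain array, which is not necessarily how the actual implementation stores them.<br />

```java
import java.util.Arrays;

class NaiveIndex {
    // Refined Buneman index of a split: the mean of the n - 3 smallest
    // Buneman scores beta_q(delta) over the quartets of the split.
    // The scores are assumed precomputed; n is the number of taxa.
    static double refinedBunemanIndex(double[] quartetScores, int n) {
        double[] sorted = quartetScores.clone();   // line 7: SORT(Q)
        Arrays.sort(sorted);
        double sum = 0;                            // line 8: sum the n - 3
        for (int i = 0; i < n - 3; i++)            // smallest scores...
            sum += sorted[i];
        return sum / (n - 3);                      // ...and average them
    }
}
```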

13.2 Implementation highlights<br />

There are two important issues in this algorithm: we need to avoid considering<br />

duplicate splits, and we need to avoid considering duplicate quartets. Both have<br />

built-in symmetries.<br />

Firstly, let us consider splits as bitvectors. We can start out with the bitvector<br />

containing all zeros, and count through splits by bit-flipping. We flip from<br />

low-order end to high-order end every bit that is ’1’. When we reach a bit that is<br />

’0’, we flip it to ’1’ and stop. Then we have the next split. To avoid duplicates,<br />

we can just restrict ourselves to counting through the first n − 1 bits, leaving<br />

the nth bit as ’0’.<br />
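The counting scheme described above might be sketched as follows; this is an illustrative fragment, not the thesis code itself.<br />

```java
import java.util.ArrayList;
import java.util.List;

class SplitEnumerator {
    // Enumerate all 2^(n-1) distinct splits of n taxa as bitvectors.
    // A split and its complement denote the same split, so fixing the
    // nth bit at '0' ensures each split is generated exactly once.
    static List<boolean[]> enumerate(int n) {
        List<boolean[]> splits = new ArrayList<>();
        boolean[] bits = new boolean[n];
        while (true) {
            splits.add(bits.clone());
            int i = 0;
            // flip every '1' from the low-order end to '0'...
            while (i < n - 1 && bits[i]) { bits[i] = false; i++; }
            if (i == n - 1) break;  // would disturb the fixed nth bit: done
            bits[i] = true;         // ...then flip the first '0' to '1'
        }
        return splits;
    }
}
```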

Regarding quartets, we need to recognize that quartets are symmetric on<br />

either side of their central edge, i.e. the quartet ab|cd is the same quartet as<br />

ba|cd, ab|dc and ba|dc. For the split σ = U|V we say that a, b ∈ U and c, d ∈ V ,<br />

and to avoid duplicates we just have to make sure that a ≤ b and c ≤ d.<br />
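This symmetry-breaking can be sketched like so (illustrative only; the split sides U and V are given as arrays of taxon indices, and strict inequality is used since the four taxa of a quartet are distinct):<br />

```java
import java.util.ArrayList;
import java.util.List;

class QuartetEnumerator {
    // Generate every quartet ab|cd of a split U|V exactly once by
    // requiring a < b (both from U) and c < d (both from V), which
    // breaks the symmetry ab|cd = ba|cd = ab|dc = ba|dc.
    static List<int[]> quartets(int[] U, int[] V) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < U.length; i++)
            for (int j = i + 1; j < U.length; j++)
                for (int k = 0; k < V.length; k++)
                    for (int l = k + 1; l < V.length; l++)
                        result.add(new int[] { U[i], U[j], V[k], V[l] });
        return result;
    }
}
```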

13.3 Correctness of the reference implementation<br />

We are going to use this reference implementation of the refined <strong>Buneman</strong> tree<br />

algorithm to demonstrate the correctness of our implementation of the refined<br />

<strong>Buneman</strong> tree algorithm described in [BFÖ+ 03]. But first we must convince<br />

ourselves that the reference implementation is correct.<br />

To this end, the author has written a test program for the reference implementation.<br />

The test program generates a random distance matrix, and during<br />

the computation of the reference implementation the program will output:<br />

• the distance matrix.<br />



• the splits that are generated.<br />

• the quartets that are generated for each split.<br />

• the <strong>Buneman</strong> scores for the quartets.<br />

• the <strong>Buneman</strong> Index for each split.<br />

• the weighted splits reported by the algorithm.<br />

After running the test program it is then possible to study the output from<br />

the program and ensure:<br />

• all unique splits of size n are generated, i.e. all possible splits of the form<br />

0xxx.<br />

• all quartets ab|cd are generated such that a, b ∈ U, c, d ∈ V , a ≤ b and<br />

c ≤ d.<br />

• for each quartet q generated, β q (δ) is calculated correctly.<br />

• for each split σ generated, μ σ (δ) is calculated correctly.<br />

• the weighted splits reported by the algorithm all have positive refined<br />

<strong>Buneman</strong> index, and all splits generated that have positive refined <strong>Buneman</strong><br />

index are reported.<br />

Of course, this process is quite infeasible for larger δ, but the author has<br />

read through outputs from a few runs, and has found no errors. The author<br />

therefore concludes that the reference implementation is correct. An example<br />

of such an output is given in appendix A.<br />

13.4 Performance of the reference implementation<br />

The performance characteristic for the algorithm can be seen in Figure 13.1.<br />

The plot shows running time for input sizes 4–20. Clearly, it is infeasible to<br />

run this implementation on large examples, but for our testing purposes it is<br />

adequate.<br />



Figure 13.1: Performance of the reference implementation (running time in ms. against input size, with best fit).<br />



Chapter 14<br />

Correctness<br />

This chapter deals with testing the correctness of the refined <strong>Buneman</strong> tree<br />

algorithm. The algorithm produces a set of splits S, and to determine if this<br />

set of splits is the set RB(δ) we could ask a number of questions:<br />

• Is S a compatible set of splits?<br />

• Is it the case that μ s (δ) is positive for all s ∈ S?<br />

• Is it the case that μ s′ (δ) is not positive for all s′ ∈ ¯S?<br />

The first two questions can be answered easily by inspecting S, but they do<br />

not form a basis for any conclusions regarding the correctness of the implementation.<br />

Question 3 is not easily answered, it would require that we tested all<br />

possible splits not in S to see if they had positive refined <strong>Buneman</strong> index, and<br />

we have seen earlier that this search space is quite huge. Here lies the key to<br />

answering the question of correctness: even if we have a compatible set of splits,<br />

all with positive score, we only know that we have a subset of RB(δ), but we<br />

do not know if we have actually produced the whole set.<br />

In fact, questions two and three together provide necessary and sufficient<br />

conditions for determining correctness, since Moulton and Steel in [MS99] have<br />

shown that the set {σ : μ σ > 0} is a compatible set of splits. But since question<br />

three is too hard to answer we shall have to adopt a different test strategy.<br />

14.1 Test strategy<br />

Instead of tackling the questions above, we shall use a completely different<br />

strategy. We shall build a chain of arguments that will lead us to conclude that<br />

our implementation of the [BFÖ+ 03] algorithm is correct.<br />

The basis of our argumentation relies on the reference package described<br />

in the previous chapter. We have to assume that our reference package can<br />

correctly compute the refined <strong>Buneman</strong> tree, and the previous chapter argues<br />

that the package meets this condition.<br />



Assuming we have a correct implementation that calculates the refined <strong>Buneman</strong><br />

tree, we can proceed to compare output from this reference implementation<br />

against output from our implementation of the [BFÖ+ 03] algorithm. Given the<br />

same input in the form of a distance matrix δ, the two implementations should<br />

produce the same output in the form of a set of weighted splits.<br />

It is of course impossible to test all possible input in the two methods, so we<br />

will have to restrict ourselves to testing some number of different inputs N, such<br />

that N is large enough to persuade anyone that the methods always produce<br />

the same output on identical input. Assuming the reference implementation is<br />

correct, and that the two methods always compute the same algorithm we can<br />

trust that our implementation of the [BFÖ+ 03] algorithm is indeed correct.<br />

Strictly speaking, looking at N sets of output and observing only identical sets will<br />

not allow us to conclude anything, merely that our implementation has<br />

not yet been proven to be incorrect. But in practice the author believes that<br />

for some huge number N any reader should be persuaded to believe that the<br />

implementation is correct. One weakness in this approach would be the running<br />

time of the reference implementation which will quickly put a practical limit on<br />

N.<br />

14.2 Test setup<br />

The goals of the test are clear; we wish to provide a large number of inputs, and<br />

test if the implementation of the [BFÖ+ 03] algorithm outputs the same result as<br />

the reference implementation does. Each implementation adheres to the same<br />

interface, outputting an array of weighted splits, but the output is unsorted.<br />

Therefore our test program must run the two algorithms and sort the outputs<br />

before we can compare the results.<br />
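The sort-and-compare step might look roughly like this. This is a sketch with a minimal stand-in for the Split class, not the actual test program; in practice one might also prefer comparing weights with a small tolerance rather than exact equality.<br />

```java
import java.util.Arrays;

class OutputComparison {
    // Minimal stand-in for the dk.birc.rbt Split class.
    static class Split {
        boolean[] split;
        double weight;
        Split(boolean[] s, double w) { split = s; weight = w; }
        // Canonical key: the bitvector as 0/1 characters, then the weight.
        String key() {
            StringBuilder sb = new StringBuilder();
            for (boolean b : split) sb.append(b ? '1' : '0');
            return sb.append('@').append(weight).toString();
        }
    }

    // The outputs are unsorted, so sort canonical keys before comparing.
    static boolean sameOutput(Split[] a, Split[] b) {
        if (a.length != b.length) return false;
        String[] ka = new String[a.length];
        String[] kb = new String[b.length];
        for (int i = 0; i < a.length; i++) {
            ka[i] = a[i].key();
            kb[i] = b[i].key();
        }
        Arrays.sort(ka);
        Arrays.sort(kb);
        return Arrays.equals(ka, kb);
    }
}
```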

The source code for the test program can be found in the package test,<br />

as described in chapter 12, and it is quite straightforward. A random distance<br />

matrix is created and both implementations are run with that input. Output is<br />

sorted and compared, and the program will report nothing¹ in the case where<br />

there are no discrepancies on output, and in case the outputs differ it will report<br />

a CorrectnessException with the following class signature:<br />

package test;<br />

class CorrectnessException extends Exception<br />

{<br />

CorrectnessException()<br />

{<br />

super("Output not identical!");<br />

}<br />

}<br />

1 it will actually output something, namely a line saying “Test completed successfully.”<br />



The best possible way of testing would be to test both vertically (many repetitions<br />

with the same input size) and horizontally (a wide range of input sizes)<br />

simultaneously. Unfortunately the performance of the reference implementation<br />

very much limits our options, and we are forced to test the two directions<br />

separately.<br />

14.3 Test results<br />

From inspecting the output of the vertical and horizontal tests, it is clear that<br />

on all input, the two implementations output exactly the same set of weighted<br />

splits. Test output consists of a single line of text saying the test completed<br />

successfully. Of course, in case the test program found that something was<br />

NOT correct, it would report an exception. Output might look like this:<br />

$ java test.CorrectnessTest 1000 4 17<br />

Test completed successfully.<br />

The vertical test was able to complete 1000 repetitions of runs of size 4-17,<br />

for a total of 13,000 runs. No errors were encountered. Note that running the<br />

two implementations 1000 times each on input size 17 took around 15 hours to<br />

complete, and the author thought it pointless to continue the test any further.<br />

The horizontal test was done using only 1 repetition of runs starting from<br />

input size 4. The test was able to complete sizes up to 22 before it was interrupted<br />

due to impatience, after running overnight for two nights. No errors were<br />

encountered. The awful time complexity of the algorithm makes it infeasible to<br />

continue the test much further.<br />

In themselves, these 13,000+ test runs do not prove anything, but they do<br />

give a very high degree of confidence in the correctness of the implementation.<br />

It is, of course, impossible to test all possible inputs, so we will have to satisfy<br />

ourselves with some finite number of runs. And this author believes 13,000 is a<br />

satisfactory number of test runs.<br />

A further argument to support the conclusion is that when the two implementations<br />

completely agree on all inputs, it is very likely that they are both<br />

correct. It is of course possible that they are both incorrect, and that the test<br />

outputs merely show that the two implementations agree on being wrong. However,<br />

the author estimates that the probability of creating two independent and<br />

very different implementations that would produce exactly the same errors on<br />

identical input is very low indeed.<br />

The author believes he has argued thoroughly for the correctness of the<br />

implementation of the simple algorithm, and therefore provided a sound basis<br />

for comparison. Also, the author believes that the number of test runs should<br />

persuade anyone that the two implementations always produce the same output<br />

on identical input. Combining this with the transparent source code and<br />

correctness tests of the reference implementation, the author concludes that the<br />

implementation of the refined <strong>Buneman</strong> tree algorithm from [BFÖ+ 03] is indeed<br />

correct.<br />



Chapter 15<br />

Complexity<br />

This chapter deals with the time and space complexity of the RBT-algorithm.<br />

Before the advent of [BFÖ+ 03], a non-trivial algorithm for computing the RBT<br />

did exist ([BB99]), but with a running time of O(n^5) and space consumption<br />

O(n^4). The goals of [BFÖ+ 03] were to reduce these factors and make RBTs<br />

computationally competitive to methods based on neighbor joining and on plain<br />

<strong>Buneman</strong> trees, and in turn one of the goals of this thesis is to implement this<br />

algorithm and demonstrate how RBTs can be used in practice. Therefore we<br />

need to demonstrate that the implementation does indeed run in time O(n^3)<br />

and space O(n^2).<br />

15.1 Running time<br />

To analyse the running time of the algorithm the author has written the test<br />

program which is parameterized by starting input size, ending input size and<br />

number of repetitions per input size. The test program measures the running<br />

time of the refined <strong>Buneman</strong> tree algorithm by measuring the system<br />

time before and after each computation of the algorithm, excluding the initialization<br />

of a random input matrix. Input size and running time is reported for<br />

each repetition.<br />

The timing is not optimal; the running time does not reflect the exact running<br />

time of the algorithm, since the algorithm does not run exclusively on the<br />

test PC. However, the author made sure that the test PC was largely unused<br />

during testing, meaning that no users and only a few system processes were<br />

running during the time trials. Therefore the timing data might show a slightly<br />

skewed picture of the running time, but very importantly, the skew is evenly<br />

distributed and thus does not affect the test goal, which is to determine the<br />

asymptotic running time behaviour of the algorithm.<br />

The test was run with the following parameters:<br />

• Input start size 4<br />



Figure 15.1: The running time of the refined <strong>Buneman</strong> algorithm (running time in ms. against input size, with best fit). Notice the artifact at power of two intervals. This stems from the quad tree data structure described in chapter 8.<br />

• Input stop size 174¹<br />

• Number of repetitions 100<br />

To analyse the asymptotic running time performance of the algorithm, the<br />

test data was analysed using the nonlinear least-squares (NLLS) Marquardt-<br />

Levenberg algorithm via the gnuplot ([gnu99]) fit command.<br />

The expectation was that the performance of the algorithm would be O(n^3),<br />

and therefore the fit function was chosen to be f(x) = ax^b. The fit of the<br />

function f with the whole dataset (input sizes 4–174) returned the following<br />

values: a = 0.0130805 and b = 3.02047. A plot of the data along with the<br />

function f is shown in Figure 15.1. The figure shows a plot of the running time<br />

in milliseconds against the input size in number of species.<br />
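The exponent estimate can also be reproduced without gnuplot by ordinary least squares on log-transformed data, since t ≈ a·n^b implies log t = log a + b·log n. The following self-contained sketch is not part of the thesis tooling; note also that gnuplot's fit minimises squared error on the untransformed data, so the two estimates will differ slightly in general.<br />

```java
class PowerLawFit {
    // Fit t = a * n^b by linear regression of log(t) on log(n).
    // Returns { a, b }.
    static double[] fit(double[] n, double[] t) {
        int m = n.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(n[i]);
            double y = Math.log(t[i]);
            sx += x; sy += y; sxx += x * x; sxy += x * y;
        }
        double b = (m * sxy - sx * sy) / (m * sxx - sx * sx);  // slope
        double logA = (sy - b * sx) / m;                       // intercept
        return new double[] { Math.exp(logA), b };
    }
}
```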

Figure 15.1 shows an artifact from the implementation, namely a jump in<br />

running time performance consistent with intervals matching powers of 2. This<br />

artifact stems from the dimensioning of the quad tree data structure described in<br />

1 Actually the stop size was 200, but after 3 consecutive days of running the test it was<br />

halted due to impatience.<br />



Figure 15.2: The running time of the refined <strong>Buneman</strong> algorithm determined from two different input size intervals (best fits for sizes up to 64 and up to 65).<br />

chapter 8. Due to this artifact, estimating asymptotic running time performance<br />

yields different results depending on the input size range that forms the basis<br />

for the analysis. For example, estimating the running time performance on a<br />

dataset limited to size 64 returns a = 0.0489006 and b = 2.58288. On the other<br />

hand, a dataset up to size 65 yields a = 0.00330764 and b = 3.27513. This<br />

difference is illustrated in Figure 15.2. In both cases, however, the estimate of b<br />

is close to 3, so the estimate of b based on the whole dataset is very reliable.<br />

15.2 Space requirements<br />

Measuring space consumption of a (Java) program is no easy task. As we are<br />

interested in verifying that the space complexity is identical to the result from<br />

the analysis in chapter 5, it would be handy to have a tool that measures the<br />

maximum heap size used by a program during its execution time. The author<br />

was not able to find such a tool.<br />

Instead, the author has looked at several tools which might provide at least<br />

a partial solution. Such tools would take a snapshot of the heap size used by the<br />

program, and we would only need to run it often enough to get a reasonable estimate<br />



of the size of the heap during the execution time. Hopefully the data extracted<br />

in this way would present a clear image of the heap size trend.<br />

15.2.1 The Linux ps command<br />

The first option considered was the Linux command ps (report process status),<br />

which reports various information about a process or group of processes, including<br />

allocated memory. However, since we are measuring the size of a Java<br />

program running on a Java Virtual Machine, any information obtained from<br />

beyond the JVM would only reflect the characteristics of the JVM, and would<br />

therefore be very unreliable. We only know that the program we wish to profile<br />

is contained inside the space of the JVM, but we have no clue how big the<br />

program actually is. We would only have an upper limit.<br />

An example of the uselessness of the ps command when dealing with the<br />

JVM is the fact that we might set initial heap sizes differently, for example a<br />

high value and a low value, observe that the memory consumption is different in<br />

the two cases, but also that the memory consumption reported by ps grows in<br />

both cases! Clearly, if we set the initial heap size low to begin with, and measure<br />

the maximum heap size reported by ps, and then set the initial heap size in a<br />

new experiment higher than the previously reported maximum, we would not<br />

expect that the reported heap size would ever exceed the new minimum. But it<br />

does!<br />

The experiment is a bit tricky; we need to fork a process with the JVM<br />

running on some sample data. Then quickly, before the process terminates, we<br />

must run ps -vg a few times to get snapshots of the process running. But it is<br />

possible to do, and here are the results: firstly, runs with small heap size, where<br />

the reported start and end process size are around 10 and 14 MB. (Note: the<br />

output has been edited to provide clarity)<br />

$ java -Xms1000000 -Xmx1000000000 RBT test.phylip &<br />

[1] 29376<br />

$ ps -v<br />

PID TIME RSS %MEM<br />

29376 0:03 10868 2.1<br />

29376 0:12 11452 2.2<br />

29376 0:22 13472 2.6<br />

Secondly, a large initial heap size, reporting a memory usage between 16 and<br />

26 MB. (Note: the output has been edited to provide clarity)<br />

$ java -Xms100000000 -Xmx1000000000 RBT test.phylip &<br />

[1] 29420<br />

$ ps -v<br />

PID TIME RSS %MEM<br />

29420 0:02 16044 3.1<br />

29420 0:08 22264 4.3<br />

29420 0:17 25840 5.0<br />



Clearly, this is useless for profiling memory complexity of the algorithm. And<br />

with good reason. The JVM is an advanced optimizing interpreter, which might<br />

or might not trade memory for clock cycles in its optimizations or compile bits<br />

of a program to native code to speed up critical sections. So this crude profiling<br />

method will not yield any useful information.<br />

15.2.2 JVM garbage collector log<br />

The second option considered for profiling the memory consumption of our<br />

refined <strong>Buneman</strong> tree algorithm implementation is the JVM garbage collector<br />

log. The garbage collection log is provided as a non-standard option for the<br />

JVM and can be accessed by starting the JVM with a command of the form:<br />

java -Xloggc:logfile command line parameters...<br />

The garbage collector log file will contain entries documenting all runs of<br />

the JVM garbage collector - or rather, collectors. The JVM uses generational<br />

garbage collection, with one minor collector running often, and a major collector<br />

running rarely. The minor collector will only take “bites off the top” of the<br />

part of the heap that is reclaimable, while the major collector will clean up<br />

completely. The latter is of course an expensive operation. A sample of a<br />

garbage collector log can be found in appendix B. Notice the low frequency of<br />

major garbage collections (1, marked Full GC) compared to minor ones (127,<br />

marked GC). The log entry format contains the following information:<br />

• Timestamp<br />

• Heap size before GC<br />

• Heap size after GC<br />

• Total heap size (in parenthesis)<br />

• Time to complete GC<br />
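Extracting the heap figures from a major-collection entry can be sketched with a regular expression. The entry format shown in the comment is an assumption based on JDK 1.4-era logs; it varies between JVM versions, so the pattern would need adjusting in practice.<br />

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcLogParser {
    // Assumed entry format (varies by JVM version):
    //   "4.231: [Full GC 10010K->4589K(130176K), 0.0436870 secs]"
    static final Pattern FULL_GC =
        Pattern.compile("\\[Full GC (\\d+)K->(\\d+)K\\((\\d+)K\\)");

    // Returns { heapBeforeK, heapAfterK, totalHeapK }, or null if the
    // line is not a major-collection entry.
    static long[] parse(String line) {
        Matcher m = FULL_GC.matcher(line);
        if (!m.find()) return null;
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3))
        };
    }
}
```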

So how do we use this information to profile the performance of our program<br />

Clearly, minor GC reports are useless, as they are only some upper limit, which<br />

tends to be much too large (looking at the example in appendix B we see that<br />

a round of major GC is able to reclaim 75% of the heap that the minor GC has<br />

not touched). So we have to rely on the infrequent major GC counts to estimate<br />

the size of the part of the JVM heap that is actually used. Further information<br />

about the JVM and its garbage collectors can be found e.g. here:<br />

http://java.sun.com/docs/hotspot/gc/<br />



Figure 15.3: The space usage of the refined <strong>Buneman</strong> algorithm (space used in kilobytes against input size, with best fit x^a).<br />

15.2.3 Test results<br />

The experiment was completed on a large part of the PFAM test data, resulting<br />

in 83 useful² out of 130 possible garbage collection logs. The smallest matrices<br />

in the PFAM set do not require a full garbage collection, and the author was<br />

running out of time, so some of the largest data sets in the 500–700 size range<br />

were not tested.<br />

For each garbage collector log, the author ran the command grep Full to extract results from the major garbage collector. After looking over<br />

a few logs it was determined that the last major garbage collection would always<br />

report the largest space consumption for that computation, so the author<br />

ran the command tail -n 1 to find these maximums.<br />

Afterwards, using cut-and-paste and the Linux wc command, a dataset consisting<br />

of number of species and maximum space consumption was compiled. The<br />

data was analysed using gnuplot, fitting the data to a function f(x) = x^a, and<br />

gnuplot returned the result a = 1.75287. A plot of the data and the function<br />

f is available in Figure 15.3.<br />

Regarding the accuracy of this experiment, the author would like to mention<br />

that the data mining was rather sporadic; we rely on the results from the<br />

2 Log files where there are invocations of the major garbage collector<br />



major garbage collector, but this corresponds to a very selective or random sampling<br />

— in between these major garbage collections there might be space usage<br />

peaks that we are unaware of. Also, Figure 15.3 shows that the data is very<br />

scattered, especially in the 500–700 size range, and the author would wish to<br />

repeat the experiment using more repetitions, and to complete the experiment<br />

for all data in the 10–700 size range, at least. Another thing that introduces uncertainty<br />

is the JVM and its ability to do time/space trade-offs for optimization<br />

purposes. The author does not have much insight into this matter. A different<br />

approach to this experiment would be to use a profiling tool. Google reveals a<br />

lot of commercial and free products, but the author has not had the time to try<br />

out this approach.<br />

In conclusion, it appears that the implementation of the refined <strong>Buneman</strong><br />

tree algorithm does conform to its theoretical space complexity bounds. However,<br />

there are a number of uncertainties about the experiment described above,<br />

and if the experiment had shown a less favorable result the author would have<br />

been rather sceptic with respect to those results. It is therefore only fair to<br />

mention that the experiment is not ideal and that its results should be viewed<br />

with some scepticism.<br />



Chapter 16<br />

Comparing Evolutionary<br />

Tree Methods<br />

How does the refined <strong>Buneman</strong> tree algorithm fare compared to other known<br />

tree reconstruction methods? Firstly, we note that there is a trivial relationship<br />

between <strong>Buneman</strong> trees and refined <strong>Buneman</strong> trees, since B(δ) ⊆ RB(δ) for any<br />

δ, by definition. But it would be interesting to see just how much less restrictive<br />

the refined <strong>Buneman</strong> tree method is compared to its unrefined counterpart.<br />

Secondly we shall look at the well-known Neighbor-Joining tree reconstruction<br />

method, introduced by Saitou and Nei in [SN87]. What is the relation, if any,<br />

between refined <strong>Buneman</strong> trees and Neighbor-Joining trees? We know that the<br />

NJ-method always produces fully resolved binary trees, and that refined <strong>Buneman</strong><br />

trees might not be fully resolved. But how different are the resolutions?<br />

Is it the case that RB(δ) ⊆ NJ(δ)? Is the intersection of splits in the refined<br />

<strong>Buneman</strong> tree and in the Neighbor-Joining tree a set of particularly good splits,<br />

making refined <strong>Buneman</strong> splits an extra reliable core set of splits? It would be<br />

nice if we were able to say that the set of refined <strong>Buneman</strong> splits is a set we<br />

have great confidence in, compared to the set of Neighbor-Joining splits which<br />

might be more resolved than the data would warrant.<br />

Quoting [JWMV03]: “From the perspective of experimental performance studies<br />

and algorithm design, (global) NJ should be regarded as a universal lowest<br />

common denominator in phylogeny reconstruction algorithms. Its speed makes<br />

it easy to use under all circumstances; its topological accuracy makes it an acceptable<br />

starting point for tree reconstruction in biological practice. We suggest<br />

that a proposed method should be compared with NJ and abandoned if it does not<br />

offer a demonstrable advantage over NJ for substantial subproblem families.”<br />

There is clearly no competition between our implementation of the refined<br />

<strong>Buneman</strong> tree algorithm, and e.g. the optimized Neighbor-Joining algorithm<br />

described in [BFM+03], when it comes to performance. The current implementation<br />

of the refined <strong>Buneman</strong> algorithm runs much slower than e.g. the<br />

QuickJoin algorithm ([BFM+03]), even though they have the same theoretical<br />



worst case time and space complexity. However, with respect to performance<br />

in terms of reconstructing accurate phylogenetic trees, the refined <strong>Buneman</strong><br />

method might demonstrate unknown strengths.<br />

16.1 Test setup<br />

In the following section we will test the refined <strong>Buneman</strong> tree algorithm against<br />

two other known algorithms, the <strong>Buneman</strong> method ([Bun71]) and the Neighbor-<br />

Joining ([SN87]) method. All experiments have been run on test data from the<br />

PFAM database of protein sequence families. The data consists of distance<br />

matrices with sizes ranging from 10–50 (30 matrices), 100–200 (50 matrices) and<br />

500–700 (50 matrices), for a total of 130 test matrices. Larger matrices are available, but<br />

they will be disregarded due to time constraints. Distance data is given in the<br />

common Phylip format.<br />
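A minimal sketch of reading such a Phylip-style matrix is shown below; the class name is illustrative, and whitespace-separated fields are assumed (real Phylip files fix the taxon-name field at ten characters).<br />

```java
import java.util.Scanner;

// Illustrative reader for a whitespace-separated Phylip-style distance
// matrix: the first token is the number of taxa n, followed by n rows,
// each a taxon name and n distances. Not the thesis implementation.
public class PhylipReader {
    public static double[][] read(String text) {
        Scanner sc = new Scanner(text);
        int n = Integer.parseInt(sc.next());
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            sc.next(); // taxon name, ignored here
            for (int j = 0; j < n; j++) {
                // parse explicitly to avoid locale-dependent Scanner parsing
                d[i][j] = Double.parseDouble(sc.next());
            }
        }
        return d;
    }
}
```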

16.2 <strong>Buneman</strong> and refined <strong>Buneman</strong><br />

The implementation of the refined <strong>Buneman</strong> tree algorithm can easily be adapted<br />

to mark those splits which are both in B(δ) and in RB(δ). Since we have the<br />

n − 3 least scoring quartets used to calculate the refined <strong>Buneman</strong> index for a<br />

split, it is easy to find the least scoring quartet among them. If that quartet<br />

has positive <strong>Buneman</strong> score, we can mark the split as belonging in B(δ).<br />
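This computation can be sketched as follows. The names are illustrative; the quartet score is the standard <strong>Buneman</strong> score ½(min(δ(i,k)+δ(j,l), δ(i,l)+δ(j,k)) − δ(i,j) − δ(k,l)), and the index of a split is taken, as above, to be the mean of its n − 3 lowest quartet scores, which reproduces the index values printed in Appendix A.<br />

```java
import java.util.Arrays;

// Illustrative sketch of the quartet score and the membership tests
// described above; class and method names are not from the actual
// implementation.
public class BunemanIndex {
    // Buneman score of the quartet ij|kl under the distance matrix d.
    public static double quartetScore(double[][] d, int i, int j, int k, int l) {
        double cross = Math.min(d[i][k] + d[j][l], d[i][l] + d[j][k]);
        return 0.5 * (cross - d[i][j] - d[k][l]);
    }

    // Refined Buneman index: mean of the n − 3 smallest quartet scores.
    public static double refinedIndex(double[] quartetScores, int n) {
        double[] s = quartetScores.clone();
        Arrays.sort(s);
        double sum = 0.0;
        for (int i = 0; i < n - 3; i++) sum += s[i];
        return sum / (n - 3);
    }

    // A split belongs to B(δ) iff even its least-scoring quartet is positive.
    public static boolean inBuneman(double minQuartetScore) {
        return minQuartetScore > 0.0;
    }
}
```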

This first experiment consists of running the (modified) refined <strong>Buneman</strong><br />

tree algorithm on examples from the set of PFAM distance matrices, counting<br />

for each one the number of splits in B(δ) and in RB(δ). The results from the<br />

experiment are given in Figure 16.1, sorted according to increasing distance<br />

matrix size and plotted as percentages (the size of RB(δ) is 100%).<br />

Figure 16.1 shows that the size of B(δ) fluctuates quite a bit, especially<br />

when the size of the distance matrix increases, where more and more datasets<br />

do not infer any <strong>Buneman</strong> splits at all. The refined <strong>Buneman</strong> method is clearly<br />

much less restrictive than the <strong>Buneman</strong> method, as expected. Regarding the<br />

quality of the splits that are in RB(δ) but not in B(δ), further studies would<br />

need to be undertaken — one could either study the specific datasets for which<br />

the <strong>Buneman</strong> method produces few splits, or use simulated data which would<br />

provide a key to which splits are well-supported and which are unsupported.<br />

16.3 <strong>Refined</strong> <strong>Buneman</strong> and Neighbor-Joining<br />

To test the refined <strong>Buneman</strong> tree method against the Neighbor-Joining method,<br />

the author has run the implementation of the refined <strong>Buneman</strong> tree algorithm<br />

described in this thesis, against the Quick-Join algorithm described in [BFM+03].<br />

The Quick-Join software is available from this website:<br />

http://www.birc.dk/Software/QuickJoin/<br />



Figure 16.1: The size of the set B(δ) as a percentage of the set RB(δ) on PFAM<br />

data. The two artifacts are due to two datasets where neither method produces<br />

any splits.<br />



The refined <strong>Buneman</strong> tree algorithm has been adapted to reading Phylip<br />

formats, and produces a list of splits in the form of strings from the alphabet<br />

{0, 1}, ordered according to the input distance matrix.<br />

The Quick-Join program inherently reads Phylip matrices, and produces a<br />

tree in Newick format, with names taken from the input distance matrix. Thus,<br />

the author has chosen to rename species so that their names reflect their index<br />

in the distance matrix, to ensure the output Newick format can be translated<br />

into a set of splits which is directly comparable to the splits output by the<br />

refined <strong>Buneman</strong> program.<br />
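Comparing the two outputs then reduces to set operations on canonicalized strings; the sketch below is hypothetical and not taken from either tool. A split and its complement denote the same bipartition, so each string is first flipped, if necessary, so that the side containing taxon 0 is labelled '0'.<br />

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for comparing split sets encoded as 0/1 strings
// ordered by distance-matrix index.
public class SplitSets {
    // Flip the string if taxon 0 is on the '1' side, so that a split and
    // its complement map to the same canonical string.
    public static String canonical(String split) {
        if (split.charAt(0) == '0') return split;
        StringBuilder sb = new StringBuilder(split.length());
        for (int i = 0; i < split.length(); i++) {
            sb.append(split.charAt(i) == '0' ? '1' : '0');
        }
        return sb.toString();
    }

    // Splits present in both collections, e.g. RB(δ) ∩ NJ(δ).
    public static Set<String> common(Collection<String> a, Collection<String> b) {
        Set<String> canonA = new HashSet<>();
        for (String s : a) canonA.add(canonical(s));
        Set<String> result = new HashSet<>();
        for (String s : b) {
            String c = canonical(s);
            if (canonA.contains(c)) result.add(c);
        }
        return result;
    }
}
```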

That translation is done using the Split-Dist package provided by Dr. Mailund.<br />

This piece of software can read two or more trees in Newick format, and output<br />

the set of splits that they have in common. We shall feed the Split-Dist program<br />

two copies of the output from Quick-Join to obtain one copy of that output in<br />

the form of a set of splits. The Split-Dist software is available from this website:<br />

http://www.daimi.au.dk/~mailund/split-dist.html<br />

This experiment was performed on the 130 PFAM matrices, and the results<br />

are available, sorted according to increasing distance matrix size, in Figure 16.2.<br />

Figure 16.3 shows the same dataset plotted as percentages (the size of NJ(δ) is<br />

100%).<br />

The first conclusion we can draw from this experiment is that RB ⊈ NJ.<br />

There is exactly one distance matrix — 4HBT.phylip, size 559 — out of the<br />

130 test matrices that has a single split which is in RB(δ) but not in NJ(δ).<br />

However, for all practical purposes we can say that RB ⊆ NJ, and the author<br />

recommends that the matter be investigated further.<br />

Since we have established that all but one split in RB(δ) is also in NJ(δ),<br />

we can look at the relative sizes of B(δ), RB(δ) and NJ(δ) to try and judge the<br />

quality of the refined <strong>Buneman</strong> method. If we assume that the Neighbor-Joining<br />

method has a good topological accuracy, our hope is that the refined <strong>Buneman</strong><br />

method will capture many of the same splits as the Neighbor-Joining method<br />

finds. Figure 16.2 shows how the number of splits identified by the Neighbor-Joining<br />

method increases steadily (of course, the number of edges in a fully<br />

resolved tree is always 2n−3), while the two <strong>Buneman</strong> methods fluctuate greatly.<br />

Generally, the refined <strong>Buneman</strong> method performs better than the <strong>Buneman</strong><br />

method and often identifies a significant number of NJ splits that the <strong>Buneman</strong><br />

method does not. If we look at Figure 16.3, we see that in some cases, the refined<br />

<strong>Buneman</strong> method identifies between 30% and 70% of the splits identified by the<br />

Neighbor-Joining method, however, in many cases the refined <strong>Buneman</strong> method<br />

identifies very few or even zero splits. If we assume the Neighbor-Joining tree<br />

represents the true evolutionary history, this would mean the refined <strong>Buneman</strong><br />

method is very bad indeed.<br />

However, we have no knowledge of the quality of the splits inferred by the<br />

Neighbor-Joining method on this particular dataset. We have a general result<br />

saying the method is a universal lowest common denominator in phylogeny<br />

reconstruction algorithms, mainly because of its speed and not its biological accuracy.<br />

Since the NJ method will always resolve a full tree, it is quite possible<br />



Figure 16.2: The total number of splits in B(δ), RB(δ) and NJ(δ).<br />



Figure 16.3: The sizes of B(δ) and RB(δ) as percentages of |NJ(δ)|.<br />



that it over-induces splits, and that the result from the refined <strong>Buneman</strong> method<br />

is actually closer to the truth: whether the method induces 70%, 30% or even<br />

zero Neighbor-Joining splits, it might be the case that the particular dataset we<br />

are looking at simply does not warrant any further resolution of the evolutionary<br />

tree. Clearly, further investigations are required. On the positive side we note<br />

that at least the splits produced by the refined <strong>Buneman</strong> method can already<br />

be considered somewhat reliable, since the Neighbor-Joining method produces<br />

the same splits.<br />

The above comparisons are based on all splits, both trivial and non-trivial. In<br />

the following we will restrict ourselves to comparing only non-trivial splits from<br />

RB(δ) and NJ(δ). Recall the discussion of evolutionary trees and phylogenetic<br />

trees; perhaps the following comparison will show that the core of the refined<br />

<strong>Buneman</strong> tree is somehow a reliable core of the Neighbor-Joining tree.<br />

In Figure 16.4 we see the number of non-trivial splits in RB(δ) compared<br />

to the number of non-trivial splits in NJ(δ), plotted as percentages with the NJ<br />

splits as 100% — the one case where a split from RB(δ) is not in NJ(δ) will<br />

be ignored for now. The graph clearly shows a decreasing trend in the number<br />

of splits in RB(δ). Apart from the cases where the refined <strong>Buneman</strong> method<br />

produces no splits at all or only very few splits, the number of non-trivial splits<br />

identified by the refined <strong>Buneman</strong> method clearly goes down as matrix sizes go<br />

up.<br />

16.4 Summary of experimental results<br />

The total number of splits in the Neighbor-Joining tree increases steadily as the<br />

matrix size goes up, since NJ always produces fully resolved trees. Meanwhile,<br />

the number of refined <strong>Buneman</strong> splits fluctuates greatly, in many cases to size<br />

zero. The NJ method always resolves binary trees regardless of the nature and<br />

properties of the distance data, but it seems the refined <strong>Buneman</strong> method is<br />

very sensitive to data. The criterion that makes a set of data induce many or<br />

few refined <strong>Buneman</strong> splits is unknown, i.e. we cannot look at a distance matrix<br />

and say whether it will induce many or few splits, before we actually apply<br />

the method to the data — but the author suggests this be investigated further.<br />

One reason could be that the distance data does not necessarily uphold metric<br />

properties such as the triangle inequality, which might create contradictions or<br />

“noise” in the data, which in turn might lead the refined <strong>Buneman</strong> method to<br />

reject splits. However, this is only speculation.<br />

In spite of the fluctuation, there seems to be a trend of refined <strong>Buneman</strong> sets<br />

getting relatively smaller, as the size of the distance matrices goes up — at least<br />

the non-trivial parts. This is most evident in Figure 16.4. The refined <strong>Buneman</strong><br />

method looks at n − 3 minimum quartets for each edge, but as the number of<br />

species goes up, the number of quartets does not go up linearly — there are<br />

O(n⁴) quartets induced by an edge. Thus the set of n − 3 least scoring quartets<br />

becomes a smaller and smaller part of the total number of quartets induced by<br />

an edge, a smaller and smaller part of the least scoring splits even, and the<br />



Figure 16.4: The set of non-trivial splits from RB(δ) as percentage of the nontrivial<br />

splits in NJ(δ).<br />



refined <strong>Buneman</strong> method therefore suffers a great penalty as the number of species<br />

goes up. Mathematically, when the number of species approaches infinity,<br />

(n − 3)/n⁴ → 0.<br />

This would indicate that the method is not very useful for large data sets,<br />

since it suffers a great disadvantage. It would be interesting to see the performance<br />

of a quartet-based method that considers a fixed percentage of quartets<br />

per split, e.g. 5–10%, which would not suffer from the same problems with scalability<br />

as the refined <strong>Buneman</strong> method. The problem becomes finding the 5–10%<br />

of least scoring quartets for every split in an efficient manner, while ensuring the<br />

set of splits produced is still a compatible set of splits — or tree. The Q* method<br />

[BG00] tackles the problem of going from favorable quartets to a compatible set<br />

of splits, at the cost of performance compared to the refined <strong>Buneman</strong> method.<br />
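The limit above is easy to illustrate numerically with a hypothetical helper:<br />

```java
// Numeric illustration of the scaling argument above: the n − 3 quartets
// inspected per edge, as a fraction of the O(n⁴) quartets an edge induces
// (crudely taken as n⁴ here), vanish as n grows.
public class QuartetFraction {
    public static double inspectedFraction(int n) {
        return (n - 3.0) / ((double) n * n * n * n);
    }
}
```

For n = 10 the inspected fraction is 7 · 10⁻⁴, while for n = 700 it has already dropped below 3 · 10⁻⁹.<br />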

One experiment which would be interesting to perform, but which the author<br />

has not undertaken, is to measure the quality of splits from the Neighbor-Joining<br />

method that are either accepted or rejected by the refined <strong>Buneman</strong> method. The<br />

author's expectation is that the splits accepted by the refined <strong>Buneman</strong> method<br />

would get a high confidence, as that method relies on much more evidence<br />

than the Neighbor-Joining method does. Such an investigation could be done<br />

by inspecting the PFAM data and interpreting their biological meaning, by<br />

using bootstrap tests, or by using simulated data where one would know the<br />

evolutionary history. Is it the case that the splits in RB(δ) are more trustworthy<br />

than splits in NJ(δ) \ RB(δ)? In the cases where the NJ method infers a fully<br />

resolved binary tree, and the RB method infers no splits at all, which one<br />

tells the truth? Some datasets might be completely random, and an NJ tree<br />

based on such a set is completely useless — on the other hand, tests show that<br />

the refined <strong>Buneman</strong> method produces very few splits or none at all on input<br />

distance matrices with random entries. So when the RB method tells us that<br />

the dataset does not warrant any splits, is that because the RB method has<br />

good biological properties? This question is beyond the scope of this thesis, but<br />

certainly deserves further investigation.<br />



Chapter 17<br />

Conclusion<br />

In this thesis the author has described an implementation of the refined <strong>Buneman</strong><br />

tree reconstruction method. The author has verified that the implementation<br />

is correct and that it runs in expected time Θ(n³) and space O(n²).<br />

The author has argued that the implementation could be improved to run in<br />

worst case cubic time by changing a single module, i.e. the selection algorithm<br />

described in chapter 10.<br />

The author has introduced the field of bioinformatics in general and evolutionary<br />

tree reconstruction in particular, and the problems posed by increasingly<br />

large amounts of data becoming available. A classification is made of various<br />

types of tree reconstruction methods, highlighting their particular advantages<br />

and disadvantages with respect to a performance trade-off between speed and<br />

biological accuracy.<br />

The author has compared the refined <strong>Buneman</strong> tree method<br />

to two other tree reconstruction methods from its class, namely the original<br />

<strong>Buneman</strong> tree method and the well known Neighbor-Joining method. The<br />

conclusion from these experiments is that the refined <strong>Buneman</strong> tree method<br />

produces trees that are less restrictive than the <strong>Buneman</strong> method but more restrictive<br />

than the Neighbor-Joining method, and with running times that make the<br />

method useful in practice. The author encourages more work in this area, to<br />

better estimate the biological accuracy of the method.<br />

Lastly, the author has prepared the implementation of the refined <strong>Buneman</strong><br />

method for integration into the widely used JSplits software package, thus<br />

making the method publicly available and easy to use.<br />

17.1 Future work<br />

Clearly, the implementation of the refined <strong>Buneman</strong> tree algorithm can be optimized<br />

to run faster in practice. Even if its asymptotic running time performance<br />

equals that of the widely-used Neighbor-Joining method, the practical<br />

difference in running time, compared to e.g. Quick-Join, is still substantial.<br />



Speedups might be achieved using optimizations, simplifications and maybe a<br />

different programming language.<br />

More experiments are needed to fully understand the biological performance<br />

of the refined <strong>Buneman</strong> method. We have seen that the method in practice<br />

produces subsets of Neighbor-Joining trees, which are considered somewhat accurate.<br />

We have seen that in many cases the degree of resolution in refined<br />

<strong>Buneman</strong> trees is much lower than in Neighbor-Joining trees — and that in<br />

many cases the refined <strong>Buneman</strong> method will produce a tree with zero edges where<br />

the Neighbor-Joining tree will produce 2n − 3 edges for input size n. But we<br />

have not been able to determine which method is more correct, and here lies an<br />

important task for the future.<br />

Finally, we have seen that the refined <strong>Buneman</strong> method does not scale well.<br />

The method uses the set of n − 3 minimum scoring quartets associated with an<br />

edge, to determine the biological quality of the edge. But the number of quartets<br />

of an edge grows as O(n⁴) when the input size goes up, and thus the refined<br />

<strong>Buneman</strong> method will be using a relatively smaller amount of the least scoring<br />

quartets to decide the quality of an edge. The author therefore suggests that<br />

the method be amended or that a new method is developed, where the edge<br />

measuring basis is relatively constant, to provide better scalability.<br />



Part IV<br />

Appendices<br />



Appendix A<br />

Correctness of the<br />

Reference Implementation<br />

Distance matrix:<br />

0.000 0.134 0.559 0.093 0.158<br />

0.134 0.000 0.921 0.889 0.545<br />

0.559 0.921 0.000 0.843 0.610<br />

0.093 0.889 0.843 0.000 0.751<br />

0.158 0.545 0.610 0.751 0.000<br />

Split: 00000<br />

Not a split!<br />

Split: 00001<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 1 | 4 4 0.2845541310739736<br />

Quartet: 0 2 | 4 4 0.10507832094101921<br />

Quartet: 0 3 | 4 4 0.4080878060278718<br />

Quartet: 1 1 | 4 4 0.5448085612764465<br />

Quartet: 1 2 | 4 4 0.11734369883965029<br />

Quartet: 1 3 | 4 4 0.20357979187534186<br />

Quartet: 2 2 | 4 4 0.6104717294506344<br />

Quartet: 2 3 | 4 4 0.25932504761092445<br />

Quartet: 3 3 | 4 4 0.7513406854234302<br />

<strong>Buneman</strong> Index: 0.11121100989033475<br />

Split: 00010<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 1 | 3 3 0.42422721859419016<br />

Quartet: 0 2 | 3 3 0.1890061527256532<br />

Quartet: 0 4 | 3 3 0.34325287939555843<br />

Quartet: 1 1 | 3 3 0.888989662949193<br />

Quartet: 1 2 | 3 3 0.40577954477681416<br />

Quartet: 1 4 | 3 3 0.5477608935480884<br />

Quartet: 2 2 | 3 3 0.8431623196522158<br />

Quartet: 2 4 | 3 3 0.49201563781250585<br />

Quartet: 4 4 | 3 3 0.7513406854234302<br />

<strong>Buneman</strong> Index: 0.1411876204138139<br />

Split: 00011<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 0 | 3 4 -0.2498837912935838<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 1 | 3 3 0.42422721859419016<br />

Quartet: 0 1 | 3 4 -0.12353367495389822<br />

Quartet: 0 1 | 4 4 0.2845541310739736<br />

Quartet: 0 2 | 3 3 0.1890061527256532<br />

Quartet: 0 2 | 3 4 -0.3030094850868526<br />

Quartet: 0 2 | 4 4 0.10507832094101921<br />

Quartet: 1 1 | 3 3 0.888989662949193<br />

Quartet: 1 1 | 3 4 0.34122876940110464<br />

Quartet: 1 1 | 4 4 0.5448085612764465<br />

Quartet: 1 2 | 3 3 0.40577954477681416<br />

Quartet: 1 2 | 3 4 -0.14198134877127422<br />

Quartet: 1 2 | 4 4 0.11734369883965029<br />

Quartet: 2 2 | 3 3 0.8431623196522158<br />

Quartet: 2 2 | 3 4 0.35114668183971004<br />

Quartet: 2 2 | 4 4 0.6104717294506344<br />

<strong>Buneman</strong> Index: -0.2764466381902182<br />



Split: 00100<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 1 | 2 2 0.6726038407439385<br />

Quartet: 0 3 | 2 2 0.6541561669265625<br />

Quartet: 0 4 | 2 2 0.5053934085096152<br />

Quartet: 1 1 | 2 2 0.9205928930477804<br />

Quartet: 1 3 | 2 2 0.4373827748754015<br />

Quartet: 1 4 | 2 2 0.49312803061098415<br />

Quartet: 3 3 | 2 2 0.8431623196522158<br />

Quartet: 3 4 | 2 2 0.35114668183971004<br />

Quartet: 4 4 | 2 2 0.6104717294506344<br />

<strong>Buneman</strong> Index: 0.39426472835755577<br />

Split: 00101<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 4 0.05312569379326881<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 1 | 2 2 0.6726038407439385<br />

Quartet: 0 1 | 2 4 0.16721043223432325<br />

Quartet: 0 1 | 4 4 0.2845541310739736<br />

Quartet: 0 3 | 2 2 0.6541561669265625<br />

Quartet: 0 3 | 2 4 0.14876275841694736<br />

Quartet: 0 3 | 4 4 0.4080878060278718<br />

Quartet: 1 1 | 2 2 0.9205928930477804<br />

Quartet: 1 1 | 2 4 0.4274648624367962<br />

Quartet: 1 1 | 4 4 0.5448085612764465<br />

Quartet: 1 3 | 2 2 0.4373827748754015<br />

Quartet: 1 3 | 2 4 -0.055745255735582644<br />

Quartet: 1 3 | 4 4 0.20357979187534186<br />

Quartet: 3 3 | 2 2 0.8431623196522158<br />

Quartet: 3 3 | 2 4 0.49201563781250585<br />

Quartet: 3 3 | 4 4 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.0013097809711569153<br />

Split: 00110<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 3 -0.09563706462367855<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 1 | 2 2 0.6726038407439385<br />

Quartet: 0 1 | 2 3 0.018447673817375887<br />

Quartet: 0 1 | 3 3 0.42422721859419016<br />

Quartet: 0 4 | 2 2 0.5053934085096152<br />

Quartet: 0 4 | 2 3 -0.14876275841694736<br />

Quartet: 0 4 | 3 3 0.34325287939555843<br />

Quartet: 1 1 | 2 2 0.9205928930477804<br />

Quartet: 1 1 | 2 3 0.4832101181723788<br />

Quartet: 1 1 | 3 3 0.888989662949193<br />

Quartet: 1 4 | 2 2 0.49312803061098415<br />

Quartet: 1 4 | 2 3 0.055745255735582644<br />

Quartet: 1 4 | 3 3 0.5477608935480884<br />

Quartet: 4 4 | 2 2 0.6104717294506344<br />

Quartet: 4 4 | 2 3 0.25932504761092445<br />

Quartet: 4 4 | 3 3 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.12219991152031295<br />

Split: 00111<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 3 -0.09563706462367855<br />

Quartet: 0 0 | 2 4 0.05312569379326881<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 0 | 3 4 -0.2498837912935838<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 1 | 2 2 0.6726038407439385<br />

Quartet: 0 1 | 2 3 0.018447673817375887<br />

Quartet: 0 1 | 2 4 0.16721043223432325<br />

Quartet: 0 1 | 3 3 0.42422721859419016<br />

Quartet: 0 1 | 3 4 -0.12353367495389822<br />

Quartet: 0 1 | 4 4 0.2845541310739736<br />

Quartet: 1 1 | 2 2 0.9205928930477804<br />

Quartet: 1 1 | 2 3 0.4832101181723788<br />

Quartet: 1 1 | 2 4 0.4274648624367962<br />

Quartet: 1 1 | 3 3 0.888989662949193<br />

Quartet: 1 1 | 3 4 0.34122876940110464<br />

Quartet: 1 1 | 4 4 0.5448085612764465<br />

<strong>Buneman</strong> Index: -0.186708733123741<br />

Split: 01000<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 2 | 1 1 0.24798905230384188<br />

Quartet: 0 3 | 1 1 0.4647624443550029<br />

Quartet: 0 4 | 1 1 0.2602544302024729<br />

Quartet: 2 2 | 1 1 0.9205928930477804<br />

Quartet: 2 3 | 1 1 0.4832101181723788<br />

Quartet: 2 4 | 1 1 0.4274648624367962<br />

Quartet: 3 3 | 1 1 0.888989662949193<br />

Quartet: 3 4 | 1 1 0.34122876940110464<br />

Quartet: 4 4 | 1 1 0.5448085612764465<br />

<strong>Buneman</strong> Index: 0.1909466830833146<br />

Split: 01001<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />



Quartet: 0 0 | 1 4 -0.12635011633968557<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 2 | 1 1 0.24798905230384188<br />

Quartet: 0 2 | 1 4 -0.17947581013295438<br />

Quartet: 0 2 | 4 4 0.10507832094101921<br />

Quartet: 0 3 | 1 1 0.4647624443550029<br />

Quartet: 0 3 | 1 4 0.12353367495389822<br />

Quartet: 0 3 | 4 4 0.4080878060278718<br />

Quartet: 2 2 | 1 1 0.9205928930477804<br />

Quartet: 2 2 | 1 4 0.49312803061098415<br />

Quartet: 2 2 | 4 4 0.6104717294506344<br />

Quartet: 2 3 | 1 1 0.4832101181723788<br />

Quartet: 2 3 | 1 4 0.055745255735582644<br />

Quartet: 2 3 | 4 4 0.25932504761092445<br />

Quartet: 3 3 | 1 1 0.888989662949193<br />

Quartet: 3 3 | 1 4 0.5477608935480884<br />

Quartet: 3 3 | 4 4 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.15291296323631998<br />

Split: 01010<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 3 -0.3308581304922155<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 2 | 1 1 0.24798905230384188<br />

Quartet: 0 2 | 1 3 -0.23522106586853697<br />

Quartet: 0 2 | 3 3 0.1890061527256532<br />

Quartet: 0 4 | 1 1 0.2602544302024729<br />

Quartet: 0 4 | 1 3 -0.2045080141525299<br />

Quartet: 0 4 | 3 3 0.34325287939555843<br />

Quartet: 2 2 | 1 1 0.9205928930477804<br />

Quartet: 2 2 | 1 3 0.4373827748754015<br />

Quartet: 2 2 | 3 3 0.8431623196522158<br />

Quartet: 2 4 | 1 1 0.4274648624367962<br />

Quartet: 2 4 | 1 3 -0.055745255735582644<br />

Quartet: 2 4 | 3 3 0.49201563781250585<br />

Quartet: 4 4 | 1 1 0.5448085612764465<br />

Quartet: 4 4 | 1 3 0.20357979187534186<br />

Quartet: 4 4 | 3 3 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.2830395981803763<br />

Split: 01011<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 3 -0.3308581304922155<br />

Quartet: 0 0 | 1 4 -0.12635011633968557<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 0 | 3 4 -0.2498837912935838<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 2 | 1 1 0.24798905230384188<br />

Quartet: 0 2 | 1 3 -0.23522106586853697<br />

Quartet: 0 2 | 1 4 -0.17947581013295438<br />

Quartet: 0 2 | 3 3 0.1890061527256532<br />

Quartet: 0 2 | 3 4 -0.3030094850868526<br />

Quartet: 0 2 | 4 4 0.10507832094101921<br />

Quartet: 2 2 | 1 1 0.9205928930477804<br />

Quartet: 2 2 | 1 3 0.4373827748754015<br />

Quartet: 2 2 | 1 4 0.49312803061098415<br />

Quartet: 2 2 | 3 3 0.8431623196522158<br />

Quartet: 2 2 | 3 4 0.35114668183971004<br />

Quartet: 2 2 | 4 4 0.6104717294506344<br />

<strong>Buneman</strong> Index: -0.31693380778953406<br />

Split: 01100<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 2 -0.11408473844105449<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 3 | 1 1 0.4647624443550029<br />

Quartet: 0 3 | 1 2 -0.018447673817375887<br />

Quartet: 0 3 | 2 2 0.6541561669265625<br />

Quartet: 0 4 | 1 1 0.2602544302024729<br />

Quartet: 0 4 | 1 2 -0.16721043223432325<br />

Quartet: 0 4 | 2 2 0.5053934085096152<br />

Quartet: 3 3 | 1 1 0.888989662949193<br />

Quartet: 3 3 | 1 2 0.40577954477681416<br />

Quartet: 3 3 | 2 2 0.8431623196522158<br />

Quartet: 3 4 | 1 1 0.34122876940110464<br />

Quartet: 3 4 | 1 2 -0.14198134877127422<br />

Quartet: 3 4 | 2 2 0.35114668183971004<br />

Quartet: 4 4 | 1 1 0.5448085612764465<br />

Quartet: 4 4 | 1 2 0.11734369883965029<br />

Quartet: 4 4 | 2 2 0.6104717294506344<br />

<strong>Buneman</strong> Index: -0.15459589050279873<br />

Split: 01101<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 2 -0.11408473844105449<br />

Quartet: 0 0 | 1 4 -0.12635011633968557<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 4 0.05312569379326881<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

Quartet: 0 3 | 1 1 0.4647624443550029<br />

Quartet: 0 3 | 1 2 -0.018447673817375887<br />



Quartet: 0 3 | 1 4 0.12353367495389822<br />

Quartet: 0 3 | 2 2 0.6541561669265625<br />

Quartet: 0 3 | 2 4 0.14876275841694736<br />

Quartet: 0 3 | 4 4 0.4080878060278718<br />

Quartet: 3 3 | 1 1 0.888989662949193<br />

Quartet: 3 3 | 1 2 0.40577954477681416<br />

Quartet: 3 3 | 1 4 0.5477608935480884<br />

Quartet: 3 3 | 2 2 0.8431623196522158<br />

Quartet: 3 3 | 2 4 0.49201563781250585<br />

Quartet: 3 3 | 4 4 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.12021742739037003<br />

Split: 01110<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 2 -0.11408473844105449<br />

Quartet: 0 0 | 1 3 -0.3308581304922155<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 3 -0.09563706462367855<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 4 | 1 1 0.2602544302024729<br />

Quartet: 0 4 | 1 2 -0.16721043223432325<br />

Quartet: 0 4 | 1 3 -0.2045080141525299<br />

Quartet: 0 4 | 2 2 0.5053934085096152<br />

Quartet: 0 4 | 2 3 -0.14876275841694736<br />

Quartet: 0 4 | 3 3 0.34325287939555843<br />

Quartet: 4 4 | 1 1 0.5448085612764465<br />

Quartet: 4 4 | 1 2 0.11734369883965029<br />

Quartet: 4 4 | 1 3 0.20357979187534186<br />

Quartet: 4 4 | 2 2 0.6104717294506344<br />

Quartet: 4 4 | 2 3 0.25932504761092445<br />

Quartet: 4 4 | 3 3 0.7513406854234302<br />

<strong>Buneman</strong> Index: -0.2676830723223727<br />

Split: 01111<br />

Quartet: 0 0 | 1 1 0.13390431386278734<br />

Quartet: 0 0 | 1 2 -0.11408473844105449<br />

Quartet: 0 0 | 1 3 -0.3308581304922155<br />

Quartet: 0 0 | 1 4 -0.12635011633968557<br />

Quartet: 0 0 | 2 2 0.558519102302884<br />

Quartet: 0 0 | 2 3 -0.09563706462367855<br />

Quartet: 0 0 | 2 4 0.05312569379326881<br />

Quartet: 0 0 | 3 3 0.09336908810197464<br />

Quartet: 0 0 | 3 4 -0.2498837912935838<br />

Quartet: 0 0 | 4 4 0.15820401473428802<br />

<strong>Buneman</strong> Index: -0.29037096089289965<br />

<strong>Refined</strong> <strong>Buneman</strong> splits:<br />

01000 0.1909466830833146<br />

00100 0.39426472835755577<br />

00010 0.1411876204138139<br />

00001 0.11121100989033475<br />



Appendix B<br />

Garbage Collector Log<br />

0.000: [GC 511K->156K(1984K), 0.0090600 secs]<br />

0.178: [GC 668K->154K(1984K), 0.0041940 secs]<br />

0.241: [GC 666K->155K(1984K), 0.0020350 secs]<br />

0.264: [GC 667K->196K(1984K), 0.0014720 secs]<br />

0.277: [GC 708K->184K(1984K), 0.0011120 secs]<br />

0.299: [GC 696K->158K(1984K), 0.0007900 secs]<br />

0.310: [GC 670K->178K(1984K), 0.0006680 secs]<br />

0.327: [GC 690K->164K(1984K), 0.0005610 secs]<br />

0.340: [GC 676K->165K(1984K), 0.0009310 secs]<br />

0.353: [GC 677K->166K(1984K), 0.0006800 secs]<br />

0.363: [GC 678K->166K(1984K), 0.0005850 secs]<br />

0.373: [GC 678K->167K(1984K), 0.0006100 secs]<br />

0.384: [GC 679K->168K(1984K), 0.0006100 secs]<br />

0.397: [GC 680K->212K(1984K), 0.0010710 secs]<br />

0.408: [GC 724K->168K(1984K), 0.0007290 secs]<br />

0.419: [GC 680K->175K(1984K), 0.0005970 secs]<br />

0.429: [GC 687K->183K(1984K), 0.0006870 secs]<br />

0.438: [GC 695K->196K(1984K), 0.0006830 secs]<br />

0.453: [GC 708K->183K(1984K), 0.0007270 secs]<br />

0.461: [GC 695K->246K(1984K), 0.0014170 secs]<br />

0.475: [GC 758K->304K(1984K), 0.0027140 secs]<br />

0.485: [GC 816K->363K(1984K), 0.0026530 secs]<br />

0.501: [GC 875K->312K(1984K), 0.0006540 secs]<br />

0.511: [GC 824K->311K(1984K), 0.0004770 secs]<br />

0.523: [GC 823K->320K(1984K), 0.0006030 secs]<br />

0.532: [GC 832K->319K(1984K), 0.0006790 secs]<br />

0.544: [GC 831K->329K(1984K), 0.0008010 secs]<br />

0.553: [GC 841K->328K(1984K), 0.0007290 secs]<br />

0.565: [GC 840K->331K(1984K), 0.0009070 secs]<br />

0.576: [GC 843K->329K(1984K), 0.0007710 secs]<br />

0.584: [GC 841K->456K(1984K), 0.0028120 secs]<br />

0.599: [GC 968K->464K(1984K), 0.0020600 secs]<br />

0.609: [GC 976K->464K(1984K), 0.0018370 secs]<br />

0.623: [GC 976K->479K(1984K), 0.0005670 secs]<br />

0.631: [GC 991K->477K(1984K), 0.0005740 secs]<br />

0.640: [GC 989K->614K(1984K), 0.0028960 secs]<br />

0.655: [GC 1126K->627K(1984K), 0.0020690 secs]<br />

0.666: [GC 1139K->627K(1984K), 0.0018740 secs]<br />

0.676: [GC 1139K->813K(1984K), 0.0037820 secs]<br />

0.695: [GC 1325K->786K(1984K), 0.0016900 secs]<br />

0.705: [GC 1298K->787K(1984K), 0.0010740 secs]<br />

0.718: [GC 1299K->801K(1984K), 0.0006760 secs]<br />

0.727: [GC 1313K->801K(1984K), 0.0006270 secs]<br />

0.736: [GC 1313K->801K(1984K), 0.0006380 secs]<br />

0.749: [GC 1313K->811K(1984K), 0.0007980 secs]<br />

0.757: [GC 1323K->811K(1984K), 0.0008100 secs]<br />

0.766: [GC 1323K->812K(1984K), 0.0007680 secs]<br />

0.780: [GC 1324K->813K(1984K), 0.0008510 secs]<br />

0.788: [GC 1325K->812K(1984K), 0.0008660 secs]<br />

0.799: [GC 1324K->812K(1984K), 0.0007680 secs]<br />

0.808: [GC 1324K->1006K(1984K), 0.0047390 secs]<br />

0.830: [GC 1518K->971K(1984K), 0.0012210 secs]<br />

0.839: [GC 1483K->971K(1984K), 0.0007970 secs]<br />

0.847: [GC 1483K->1068K(1984K), 0.0021650 secs]<br />

0.864: [GC 1580K->1142K(1984K), 0.0033550 secs]<br />

0.875: [GC 1654K->1144K(1984K), 0.0024180 secs]<br />

0.886: [GC 1656K->1144K(1984K), 0.0005500 secs]<br />

0.900: [GC 1656K->1162K(1984K), 0.0007140 secs]<br />

0.908: [GC 1674K->1162K(1984K), 0.0007450 secs]<br />

0.917: [GC 1674K->1161K(1984K), 0.0007530 secs]<br />

0.936: [GC 1673K->1163K(1984K), 0.0018910 secs]<br />



0.972: [GC 1675K->1182K(1984K), 0.0011610 secs]<br />

0.989: [GC 1694K->1202K(1984K), 0.0014850 secs]<br />

1.004: [GC 1714K->1222K(1984K), 0.0015970 secs]<br />

1.028: [GC 1734K->1243K(1984K), 0.0016290 secs]<br />

1.038: [GC 1755K->1264K(1984K), 0.0015250 secs]<br />

1.048: [GC 1776K->1284K(1984K), 0.0015470 secs]<br />

1.058: [GC 1796K->1306K(1984K), 0.0015560 secs]<br />

1.070: [GC 1818K->1328K(1984K), 0.0019730 secs]<br />

1.092: [GC 1840K->1355K(1984K), 0.0022360 secs]<br />

1.104: [GC 1867K->1385K(1984K), 0.0025080 secs]<br />

1.115: [GC 1897K->1406K(1984K), 0.0023980 secs]<br />

1.126: [GC 1918K->1417K(1984K), 0.0017430 secs]<br />

1.136: [GC 1929K->1441K(1984K), 0.0022290 secs]<br />

1.147: [GC 1953K->1455K(1984K), 0.0018660 secs]<br />

1.149: [Full GC 1455K->356K(1984K), 0.0411630 secs]<br />

1.198: [GC 868K->384K(1984K), 0.0014270 secs]<br />

1.208: [GC 896K->409K(1984K), 0.0015440 secs]<br />

1.218: [GC 921K->428K(1984K), 0.0019500 secs]<br />

1.228: [GC 940K->447K(1984K), 0.0017300 secs]<br />

1.238: [GC 959K->465K(1984K), 0.0021660 secs]<br />

1.249: [GC 977K->483K(1984K), 0.0019060 secs]<br />

1.259: [GC 995K->495K(1984K), 0.0020870 secs]<br />

1.270: [GC 1007K->516K(1984K), 0.0017420 secs]<br />

1.280: [GC 1028K->528K(1984K), 0.0016770 secs]<br />

1.290: [GC 1040K->541K(1984K), 0.0016130 secs]<br />

1.300: [GC 1053K->554K(1984K), 0.0016020 secs]<br />

1.310: [GC 1066K->566K(1984K), 0.0014070 secs]<br />

1.320: [GC 1078K->582K(1984K), 0.0017910 secs]<br />

1.330: [GC 1094K->597K(1984K), 0.0015940 secs]<br />

1.340: [GC 1109K->613K(1984K), 0.0015960 secs]<br />

1.350: [GC 1125K->638K(1984K), 0.0024570 secs]<br />

1.361: [GC 1150K->652K(1984K), 0.0019230 secs]<br />

1.371: [GC 1164K->672K(1984K), 0.0020700 secs]<br />

1.381: [GC 1184K->689K(1984K), 0.0018670 secs]<br />

1.391: [GC 1201K->713K(1984K), 0.0021350 secs]<br />

1.401: [GC 1225K->734K(1984K), 0.0020890 secs]<br />

1.412: [GC 1246K->758K(1984K), 0.0024130 secs]<br />

1.422: [GC 1270K->772K(1984K), 0.0019690 secs]<br />

1.433: [GC 1284K->793K(1984K), 0.0020540 secs]<br />

1.466: [GC 1305K->800K(1984K), 0.0018280 secs]<br />

1.481: [GC 1312K->809K(1984K), 0.0016140 secs]<br />

1.495: [GC 1321K->815K(1984K), 0.0008950 secs]<br />

1.512: [GC 1327K->823K(1984K), 0.0010560 secs]<br />

1.529: [GC 1335K->830K(1984K), 0.0010790 secs]<br />

1.542: [GC 1342K->837K(1984K), 0.0012930 secs]<br />

1.556: [GC 1349K->846K(1984K), 0.0015160 secs]<br />

1.568: [GC 1358K->856K(1984K), 0.0012840 secs]<br />

1.581: [GC 1368K->864K(1984K), 0.0013920 secs]<br />

1.593: [GC 1376K->872K(1984K), 0.0014740 secs]<br />

1.606: [GC 1384K->881K(1984K), 0.0013130 secs]<br />

1.619: [GC 1393K->889K(1984K), 0.0013220 secs]<br />

1.632: [GC 1401K->897K(1984K), 0.0012040 secs]<br />

1.647: [GC 1409K->904K(1984K), 0.0012350 secs]<br />

1.659: [GC 1416K->912K(1984K), 0.0012510 secs]<br />

1.673: [GC 1424K->922K(1984K), 0.0013530 secs]<br />

1.685: [GC 1434K->931K(1984K), 0.0014230 secs]<br />

1.698: [GC 1443K->941K(1984K), 0.0016090 secs]<br />

1.712: [GC 1453K->950K(1984K), 0.0013770 secs]<br />

1.724: [GC 1462K->956K(1984K), 0.0013470 secs]<br />

1.737: [GC 1468K->965K(1984K), 0.0014090 secs]<br />

1.750: [GC 1477K->974K(1984K), 0.0013430 secs]<br />

1.763: [GC 1486K->983K(1984K), 0.0013740 secs]<br />

1.775: [GC 1495K->991K(1984K), 0.0012020 secs]<br />

1.787: [GC 1503K->996K(1984K), 0.0012050 secs]<br />

1.800: [GC 1508K->1005K(1984K), 0.0012510 secs]<br />

1.813: [GC 1517K->1013K(1984K), 0.0013920 secs]<br />

1.825: [GC 1525K->1020K(1984K), 0.0013730 secs]<br />

1.837: [GC 1532K->1029K(1984K), 0.0013140 secs]<br />

1.850: [GC 1541K->1036K(1984K), 0.0013250 secs]<br />

1.863: [GC 1548K->1044K(1984K), 0.0012460 secs]<br />

1.875: [GC 1556K->1052K(1984K), 0.0013450 secs]<br />

1.887: [GC 1564K->1059K(1984K), 0.0012150 secs]<br />

1.899: [GC 1571K->1065K(1984K), 0.0010660 secs]<br />

1.911: [GC 1577K->1074K(1984K), 0.0013640 secs]<br />

1.923: [GC 1586K->1083K(1984K), 0.0012460 secs]<br />
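Each line above follows the HotSpot `-verbose:gc` format `timestamp: [GC before->after(total), pause secs]`, with heap sizes in kilobytes. A minimal sketch of a parser for such lines (a hypothetical helper, not part of the thesis implementation):<br />

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses one line of HotSpot -verbose:gc output, e.g.
 *  "0.178: [GC 668K->154K(1984K), 0.0041940 secs]". */
public class GcLogLine {
    private static final Pattern LINE = Pattern.compile(
        "([\\d.]+): \\[(Full )?GC (\\d+)K->(\\d+)K\\((\\d+)K\\), ([\\d.]+) secs\\]");

    public final double timestamp;  // seconds since JVM start
    public final boolean full;      // true for a Full GC entry
    public final int beforeK;       // heap occupancy before the collection (KB)
    public final int afterK;        // heap occupancy after the collection (KB)
    public final int totalK;        // committed heap size (KB)
    public final double pauseSecs;  // collection pause in seconds

    private GcLogLine(double t, boolean f, int b, int a, int tot, double p) {
        timestamp = t; full = f; beforeK = b; afterK = a; totalK = tot; pauseSecs = p;
    }

    public static GcLogLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) throw new IllegalArgumentException("unrecognised line: " + line);
        return new GcLogLine(Double.parseDouble(m.group(1)), m.group(2) != null,
            Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)),
            Integer.parseInt(m.group(5)), Double.parseDouble(m.group(6)));
    }

    /** Kilobytes reclaimed by this collection. */
    public int reclaimedK() { return beforeK - afterK; }
}
```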



Bibliography<br />

[AJL+02] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Molecular Biology of the Cell, Fourth Edition. Garland Science, Taylor & Francis Group, 29 West 35th Street, New York, 2002.<br />

[BB99] Vincent Berry and David Bryant. Faster reliable phylogenetic<br />

analysis. In Proceedings of the third annual international conference<br />

on Computational molecular biology, pages 59–68. ACM<br />

Press, 1999.<br />

[BFM+03] Gerth Stølting Brodal, Rolf Fagerberg, Thomas Mailund, Christian N. S. Pedersen, and Derek Phillips. Speeding up neighbour-joining tree construction. Technical report, ALCOM-FT, 2003.<br />

[BFÖ+03] Gerth Stølting Brodal, Rolf Fagerberg, Anna Östlin, Christian N. S. Pedersen, and S. Srinivasa Rao. Computing <strong>Refined</strong> <strong>Buneman</strong> <strong>Trees</strong> in Cubic Time. In Proceedings of the 3rd Workshop on Algorithms in BioInformatics (WABI 2003), volume 2812 of Lecture Notes in Computer Science, pages 259–270. Springer Verlag, September 2003.<br />

[BG91] Jean-Pierre Barthélemy and Alain Guénoche. <strong>Trees</strong> and Proximity Representations. John Wiley & Sons, 1991.<br />

[BG00] Vincent Berry and Olivier Gascuel. Inferring evolutionary trees with strong combinatorial evidence. Theoretical Computer Science, 240(2):271–298, 2000.<br />

[BM99] David Bryant and Vincent Moulton. A polynomial time algorithm for constructing the refined <strong>Buneman</strong> tree. Applied Mathematics Letters, 12:51–56, 1999.<br />

[Bun71] Peter <strong>Buneman</strong>. The recovery of trees from measures of dissimilarity. In Mathematics in the Archaeological and Historical Sciences, pages 387–395. Edinburgh University Press, 1971.<br />

[cal04] Cambridge Advanced Learner's Dictionary. Webpage, 2004. http://dictionary.cambridge.org/.<br />



[CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, Massachusetts, 1990.<br />

[Dar59] Charles Darwin. On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. John Murray, 1859.<br />

[dBSvKO00] Mark de Berg, Otfried Schwarzkopf, Marc van Kreveld, and Mark<br />

Overmars. Computational Geometry: Algorithms and Applications.<br />

Springer, 2000.<br />

[DHM96] A. Dress, D. Huson, and V. Moulton. Analyzing and visualizing sequence and distance data using SplitsTree. Discrete Applied Mathematics, 71:95–109, 1996.<br />

[Epp98] David Eppstein. Fast hierarchical clustering and other applications of dynamic closest pairs. In Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, pages 619–628. Society for Industrial and Applied Mathematics, 1998.<br />

[GHJV94] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns — Elements of Reusable Object-Oriented Software. Addison-Wesley, 1994.<br />

[gnu99] gnuplot — a command-driven interactive function plotting program. Webpage, 1999. http://www.rz.uni-karlsruhe.de/ig25/gnuplot-faq/.<br />

[Gus91] Dan Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19–28, 1991.<br />

[HKY85] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22:160–174, 1985.<br />

[HNP+98] D. H. Huson, S. Nettles, L. Parida, T. Warnow, and S. Yooseph. The disk-covering method for tree reconstruction. Proceedings of Algorithms and Experiments, pages 62–75, 1998.<br />

[Hus98] D. Huson. SplitsTree: Analyzing and visualizing evolutionary data. Bioinformatics, 14:68–73, 1998.<br />

[JC69] T. H. Jukes and C. Cantor. Evolution of protein molecules. Academic Press, New York, 1969.<br />

[JWMV03] Katherine St. John, Tandy Warnow, Bernard M. E. Moret, and Lisa Vawter. Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. Journal of Algorithms, 48:173–193, 2003.<br />



[Kim80] M. Kimura. A simple model for estimating evolutionary rates<br />

of base substitutions through comparative studies of nucleotide<br />

sequences. Journal of Molecular Evolution, 16:111–120, 1980.<br />

[MS99] Vincent Moulton and Mike Steel. Retractions of finite distance<br />

functions onto tree metrics. Discrete Applied Mathematics,<br />

91:215–233, 1999.<br />

[MSM97] D. R. Maddison, D. L. Swofford, and Wayne P. Maddison. NEXUS: an extensible file format for systematic information. Systematic Biology, 46:590–621, 1997.<br />

[NK00] Masatoshi Nei and Sudhir Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, 198 Madison Avenue, New York, 2000.<br />

[NTT83] M. Nei, F. Tajima, and Y. Tateno. Accuracy of estimated phylogenetic trees from molecular data. II. Gene frequency data. Journal of Molecular Evolution, 19:153–170, 1983.<br />

[SM97] João Setubal and João Meidanis. Introduction to Computational Molecular Biology. Brooks/Cole Publishing Company, 511 Forest Lodge Road, Pacific Grove, California, 1997.<br />

[SN87] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4:406–425, 1987.<br />

[SS73] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. W. H. Freeman & Co, 41 Madison Avenue, New York, 1973.<br />

[SS03] Charles Semple and Mike Steel. Phylogenetics. Oxford University Press, 2003.<br />

[Sun03] Sun. Javadoc 1.4.2 Tool. Webpage, 2003.<br />

http://java.sun.com/j2se/1.4.2/docs/tooldocs/javadoc/index.html.<br />

[TN96] N. Takezaki and M. Nei. Genetic distances and reconstruction of phylogenetic trees from microsatellite DNA. Genetics, 144:389–399, 1996.<br />

