
Master's thesis

COMPUTING THE QUARTET DISTANCE BETWEEN GENERAL TREES:
AN EXPERIMENTAL STUDY OF A SUB-CUBIC ALGORITHM

by Anders Kabell Kristensen
student id: 20041248

November, 2010

with supervisor Christian Nørgaard Storm Pedersen
Department of Computer Science
Aarhus University
Denmark


Abstract

This thesis provides strong evidence that the quartet distance between two general trees is practically computable in sub-cubic time. The thesis comprises a detailed study of three algorithms for quartet distance computation between trees of arbitrary degree: a quartic, a cubic and a sub-cubic time algorithm. A property common to these three algorithms is that their time complexity depends only on the number of leaves in the two trees compared, not on the degree of internal nodes. Focus is on the sub-cubic algorithm suggested by Mailund et al. [14], which is currently the theoretically best algorithm in this category and has a time complexity of O(n^(2+α)), where α depends directly on the complexity of matrix multiplication. This dependence might prove to be a problematic obstacle in practice, since the need for sub-cubic matrix multiplication is inescapable. The goal is to reveal the practical behavior of the algorithm. Naturally, this is done through experimental verification using a wide range of input trees with different properties. The result is clear: the performance of the algorithm is close to quadratic for most input; however, some inputs reveal that the running time is dominated by that of matrix multiplication for very large inputs.

Among the contributions of the thesis are: a verification of the correctness and running time of the two reference algorithms, including a minor but essential correction of the cubic algorithm; verification of the correctness of the sub-cubic algorithm, leading to a few algorithmic observations that are crucial to obtain a correct result; practical verification of the theoretical bounds on the time complexity of the sub-cubic algorithm, including a discussion of a possibly tighter upper bound; and detailed documentation of the experimental approach, with a description of the steps necessary to implement the algorithms and reproduce the experiments. Finally, the result is a piece of software that is efficient in practice and has already shown its worth at the time of writing.


Acknowledgements

First of all I would like to thank my supervisor Christian Nørgaard Storm Pedersen for great support and feedback, for always being positive and encouraging, and simply for showing interest. Thanks to Jesper Nielsen, PhD student at BiRC¹, for some constructive discussions. Thanks go to Thomas Wessel and Anders Viskum for proofreading and good times in the crane. Thanks to my bro Kasper and to my darling Astrid for supreme proofreading.

Anders Kabell Kristensen,
Aarhus, October 31, 2010.

¹ Bioinformatics Research Center, Aarhus University: http://birc.au.dk/


Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Phylogenetic trees
  1.2 Measuring difference or similarity
    1.2.1 The quartet distance
  1.3 Overview of algorithms for quartet distance computation
    1.3.1 Final result
  1.4 Focus of the thesis
  1.5 Thesis outline
2 Prerequisites
  2.1 Terminology
  2.2 Choice of language and test environments
3 Experimental approach
  3.1 Trees
  3.2 Tree construction
    3.2.1 Newick format
    3.2.2 Parsing data files and building a tree data structure
4 Quartic time algorithm
  4.1 Implementation
    4.1.1 Result
5 Calculating leaf set sizes
  5.1 Subtree leaf set sizes
  5.2 Shared leaf set sizes
  5.3 Implementing leaf set algorithms
    5.3.1 Subtree leaf set sizes
    5.3.2 Shared leaf set sizes
6 Cubic time algorithm
  6.1 Implementation
    6.1.1 Result
7 Sub-cubic time algorithm
  7.1 The algorithm
    7.1.1 Basic preprocessing
    7.1.2 Counting shared butterflies
    7.1.3 Counting different butterflies
    7.1.4 How to count butterflies in constant time
      7.1.4.1 The calculation of I'''
  7.2 Implementation
    7.2.1 Prototype
    7.2.2 Introducing a library for matrix multiplication
    7.2.3 The final version
8 Results and discussion
  8.1 Large numbers
9 Conclusion
  9.1 Future work
Bibliography
A Preprocessing for the sub-cubic algorithm
B Obtaining and running the programs
C Real-life application of the algorithm


Chapter 1

Introduction

The topic of quartet distance computation is studied by computer scientists, but it has its origin within evolutionary biology and requires some amount of motivation. The following introduction should be sufficient to understand why the topic is relevant and has applications within areas other than computer science. It will, however, by no means be comprehensive and should not be seen as a general introduction to phylogenetics.

1.1 Phylogenetic trees

An evolutionary or phylogenetic tree is a hierarchical structure, commonly used to express the interrelationship, with regard to inheritance, among a set of evolutionary units (EUs), e.g. different species, and it has found its use within the field of bioinformatics. The tree consists of a set of nodes connected by edges, and the configuration of these decides the structure or topology of the tree. By definition, there is exactly one path between any two nodes in a tree. The outermost nodes of the tree – the ones that are incident to only a single edge – are called leaves and represent EUs, while branching points in the tree represent speciation events. A speciation event is a splitting of lineage. Some evolutionary trees are rooted in one of the branching points, which adds a direction to the evolution, and the root is then said to be the most recent common ancestor of all species in the tree. In this case we picture the tree with the root on top and all edges as directed, pointing downwards away from the root. Each branching point or internal node is the root of a subtree and is viewed as a common ancestor of the EUs in this subtree. Two EUs down one branch are more closely related than either of them is to any EU found in another branch of the same branching point. When each branching point in a rooted tree has exactly two immediate descendants, the tree is said to be binary. In a binary unrooted tree, each node has exactly three connections to other nodes. The number of connections a node


has is called the degree of the node. An unrooted tree does not contain internal nodes of degree two. A tree where internal nodes are allowed to be polytomies, that is, where they can have any degree equal to or greater than three, is called a general tree. General trees are often used to represent partly resolved relationships, where the complete topology is not known and each species therefore cannot be represented by a distinct node. Sometimes branches are assigned a length, which adds the notion of time to the evolution.

The true evolutionary relation among a set of EUs is rarely known. Multiple methods for determining the exact relationship from biological data are available. They do not necessarily agree and might induce different trees. Some methods for inferring relationships will result in a large range of plausible tree reconstructions. Furthermore, multiple data sets, e.g. DNA sequences, describing a single species are often at hand. Thus, one method may yield a different solution for each data set used. Figure 1.1 is an illustration of two alternative relationships inferred for the Panthera (big cats).

[Figure 1.1: Two alternative trees over the Panthera (big cats): Clouded Leopard, Jaguar, Leopard, Lion, Snow Leopard and Tiger, grouped differently in the two reconstructions. Note that one is a binary tree whereas the other includes a polytomy and is thus a general tree. Example from Davis et al. [9].]

This disagreement between trees introduces the need for some means of assessing trees. One approach is to make a pairwise comparison of trees in an attempt to quantify the differences or similarities.

1.2 Measuring difference or similarity

Various methods for tree comparison have been defined, and each measure has certain properties and takes certain aspects of the trees into consideration. Some can only handle fully resolved trees, while others are able to take branch lengths into account. Some metrics consider topological properties only. An example of the latter is the nearest-neighbor interchange metric, proposed by Waterman and Smith [20], defined as the fewest number of nearest-neighbor interchanges required to convert one tree into another. The metric only works for binary trees, and the problem of computing it has been shown to be NP-complete (see DasGupta et al. [8]). In this thesis, focus will be on general trees. Here, an example is the Robinson–Foulds distance metric, proposed by Robinson and Foulds [15] and also known as the symmetric difference metric. It is defined as the


number of partitions of the species, produced by deleting an edge in a tree, that differ between the two trees. The Robinson–Foulds metric has been particularly popular because it can be computed in linear time using Day's algorithm (see Day [10]).

This thesis will concentrate on the quartet distance, proposed in 1985 by Estabrook et al. [12].

1.2.1 The quartet distance

The quartet distance is a measure of difference between two trees and is based upon the observation, made by Estabrook et al. [12], that a group of four EUs is the smallest group for which there is more than one possible unrooted tree topology. Call such a group a quartet. In an evolutionary tree, each quartet of species inherits one of the four topologies shown in Fig. 1.2.

[Figure 1.2: The four possible topologies of a quartet consisting of the four leaves a, b, c, d. (a)-(c) are so-called butterfly quartets, while (d) is a star quartet.]

That is, considering only the part of a tree that remains after removal of every edge that is not part of a path between two of the four leaves in the quartet, that part of the tree will have one of the four topologies. Three of those are so-called butterfly topologies, sometimes called resolved topologies, where the leaves are more closely related in pairs, and one is called the star topology, or unresolved topology, where all leaves are equally closely related. Note that, besides the topological structures illustrated in the figures, the leaves are unordered, and thus leaves a and b might as well switch places in Fig. 1.2(a). The quartet distance is the number of quartets that inherit different topologies in the two trees considered. Figure 1.3 illustrates the inference of the topology of different quartets.

[Figure 1.3: (a) shows a tree with leaves a, b, c, d, e, f. (b) shows, highlighted, the butterfly topology inherited by the quartet (a, b, d, f) and (c) the star topology inherited by (b, c, e, f).]
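The induced topology can be read off from leaf-to-leaf paths: a quartet is the butterfly xy|zw exactly when the path from x to y and the path from z to w share no node, and it is a star when no such pairing exists. A minimal sketch of this test (the adjacency-dict representation and helper names are my own, not code from the thesis):

```python
from collections import deque

def path_nodes(adj, s, t):
    """Nodes on the unique s-t path in a tree given as an adjacency dict."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, v = set(), t
    while v is not None:
        path.add(v)
        v = parent[v]
    return path

def quartet_topology(adj, a, b, c, d):
    """Return the butterfly pairing, e.g. {{a,b},{c,d}} for ab|cd,
    or None if the quartet is a star."""
    for (x, y), (z, w) in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        if not path_nodes(adj, x, y) & path_nodes(adj, z, w):
            return frozenset({frozenset({x, y}), frozenset({z, w})})
    return None

# Tiny example: the ab|cd butterfly of Fig. 1.2(a), and a star quartet.
butterfly = {'a': ['u'], 'b': ['u'], 'c': ['v'], 'd': ['v'],
             'u': ['a', 'b', 'v'], 'v': ['c', 'd', 'u']}
star = {'a': ['x'], 'b': ['x'], 'c': ['x'], 'd': ['x'],
        'x': ['a', 'b', 'c', 'd']}
print(quartet_topology(butterfly, 'a', 'b', 'c', 'd'))  # the {a,b}/{c,d} pairing
print(quartet_topology(star, 'a', 'b', 'c', 'd'))       # None
```

Repeating this test for every quartet in both trees is exactly the naive approach discussed in the next section.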


It is evident that within binary trees a quartet can only inherit one of the three butterfly topologies; however, with the inclusion of the star topology, the method works just as well on general trees. This means that the quartet distance can be used with trees that include partly resolved relationships, i.e. polytomies, but it is required that the two trees specify the exact same set of leaves.

The quartet distance does not consider branch lengths but focuses solely on topological properties. It works equally well with rooted and unrooted trees, since a rooted tree can be interpreted as unrooted and a quartet will inherit the same topology in both. However, for rooted trees there is in fact more than one topology for a group of only three leaves, meaning that a triplet distance also has its justification; see Dobson [11].

1.3 Overview of algorithms for quartet distance computation

In computer science, a widely used data structure is the tree data structure. It comes in numerous variants and has an endless number of applications, and therefore algorithms working on trees have been studied in very great detail. Hence, algorithms for calculation of the quartet distance between evolutionary trees are just another application, and while algorithms for this problem benefit from previous research, they might end up being of use within completely different areas than phylogenetics. Likewise, the focus of this thesis will be purely algorithmic and not on applications in bioinformatics.

There are (n choose 4) ∈ O(n^4) unique quartets in a tree with n leaves, which makes quartet distance calculation a computationally heavy problem. Computing the quartet distance naively, by explicitly inspecting the topologies of the O(n^4) quartets in the two trees, takes O(n^5) time.

Several algorithms have been designed over the years, resulting in dramatic improvements in time usage. Focus has been on calculation of the quartet distance between binary trees, which do not include star quartets and seem to be less complex to handle. Steel and Penny [16] showed how to calculate the quartet distance in time O(n^3). Bryant et al. [4] improved this to O(n^2) and introduced some concepts important to this thesis. The work of Brodal et al. [3] has also been important to this thesis and resulted in the fastest known algorithm for binary trees, with a time bound of O(n log n).

For general trees, Bansal et al. [1] describe an O(n^2) time 2-approximation algorithm, but this thesis will only deal with exact quartet distance calculation. Christiansen et al. [6] present three algorithms with running times of O(n^4), O(n^3) and O(n^2 d^2) respectively, where d is the maximum degree of any node in the two trees. Stissing et al. [17] present an O(d^9 n log n) time algorithm.
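The (n choose 4) growth is easy to appreciate numerically; a quick check of the quartet counts involved, using Python's standard library:

```python
from math import comb

# Number of distinct quartets, (n choose 4), for growing n:
for n in (10, 100, 1000):
    print(n, comb(n, 4))
# 10 leaves already give 210 quartets, and 100 leaves give 3,921,225,
# so any algorithm that touches every quartet quickly becomes expensive.
```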
Note that some of the algorithms are bounded by the degree of the internal nodes, whereas others are independent of this factor and


only bounded by the number of leaves in the tree. However, d ≤ n, and consequently a complexity of O(d^2 n^2) is in any case better than one of O(n^4).

In this thesis, the main subject of study will be an algorithm proposed by Mailund et al. [14] that yields a running time of O(n^(2+α)), where α = (ω−1)/2 and O(n^ω) is the complexity of matrix multiplication. At the time of writing, it is the only algorithm that provides a sub-cubic time complexity while being independent of the degree. However, it is dependent on the utilization of a fast method for matrix multiplication, providing a sub-cubic time usage on square matrices. Naive matrix multiplication, which features ω = 3, yields α = 1 and thus a running time of O(n^3). The authors of the article point out that utilizing an advanced method for matrix multiplication with a good theoretical result will yield a sub-cubic time complexity. They mention the Coppersmith–Winograd algorithm [7], which provides ω = 2.376, resulting in a time complexity of O(n^2.688). Now, despite being the fastest method for matrix multiplication to date, the Coppersmith–Winograd algorithm is not practically applicable.

Such theoretical results will often leave the programmer reluctant, since an implementation seems most unrealistic.
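The exponent arithmetic above can be made concrete with a small helper of my own (not code from the thesis): α = (ω − 1)/2 maps naive multiplication to the cubic bound and Coppersmith–Winograd to the O(n^2.688) bound.

```python
def running_time_exponent(omega):
    """Exponent 2 + alpha of the Mailund et al. bound, with alpha = (omega - 1) / 2."""
    return 2 + (omega - 1) / 2

print(running_time_exponent(3.0))    # naive matrix multiplication -> 3.0, i.e. cubic
print(running_time_exponent(2.376))  # Coppersmith-Winograd -> approx. 2.688
```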
However, from my discussions with the authors I know that the early theoretical work on the analysis of the quartet distance algorithm left a general first impression that it was more efficient than eventually shown by the final asymptotic worst-case analysis. Theoretical upper bounds on the time usage of an algorithm might be far worse than the actual time usage, as a result of the algorithm being difficult to analyse, and I therefore find this information promising and hope that these early predictions were true. This may sound ambiguous, but the point is that while it might be difficult, because of practical limitations, to obtain a worst-case running time as good as the one of the analysis, an implementation might at the same time perform significantly better on most input. Hence, this thesis will study the practical behavior of the algorithm with the goal of clarifying the relation between the theoretical and the practical time complexities. For reference, the two other algorithms that have a running time independent of the degree [6] are studied in detail as well. This will allow comparison of the performance and verification of correctness.

1.3.1 Final result

The final result of the study of the practical behaviour of the algorithms is illustrated in Fig. 1.4. The three algorithms are compared for two different implementations. Without going too much into detail at this point (see Chap. 8 for details), note that the plots show that the O(n^4) and O(n^3) algorithms behave as expected; more interestingly, we can observe that the sub-cubic algorithm is not only faster in practice, it also behaves


sub-cubic, and actually close to quadratic. Now the final result has briefly been revealed as an appetizer, and I hope this will encourage the reader to seek the answers to why and how in the subsequent chapters.

[Figure 1.4: The final result of the experimental work of this thesis: log-log plots of running time t(n) in seconds against the number of leaves n (50 to 800), comparing the sub-cubic, cubic and quartic algorithms against the reference curves n^2, n^3 and n^4, for both the Python and the C++ implementations. See Chap. 8 for more details.]

1.4 Focus of the thesis

The aim of this thesis is to explore in detail the practical behaviour of the theoretical sub-cubic algorithm. Furthermore, I want to give readers who have no knowledge of quartet distance calculation on general trees a gentle introduction to the subject. Therefore I will make an effort to explain everything thoroughly on the fly and not leave anything unanswered. By continuously explaining my reasoning and immediately following up on the theory with implementation details and experimental results, I will encourage the reader to go on and read everything chronologically. This will especially apply to the middle part of the thesis, where I will deal with the algorithms one at a time: first theory, then implementation details and finally the results. Where appropriate, the thesis will follow my workflow, so as to reflect the iterative process of my work and to reveal every interesting experience I had during the process.

As mentioned, the thesis will focus on the algorithmics and not the bioinformatics. The purpose is to explore the behaviour of three algorithms. This means that I will accept the theory as presented, try to verify the results through experiments and, in case of any surprises, take a closer look at the algorithms to verify and possibly adjust the theory. Along the way, I will strive to provide enough information for the reader to reproduce the experiments.

1.5 Thesis outline

The thesis is structured as follows. Chapter 2, Prerequisites, will go through the terminology necessary to prepare for the algorithmic chapters, and motivate the choice of implementation languages and experimental platforms. Chapter 3, Experimental approach, is a general explanation of how experiments are used to verify the quality of the implementations and of what kind of data is used for this purpose.

Chapters 4 through 7 make up the algorithmic core of the thesis. Chapter 4 deals with everything concerning the O(n^4) algorithm, which will also be referred to as the quartic algorithm. Chapter 5 gives the details about an algorithmic concept called leaf set sizes that is part of the cubic and sub-cubic algorithms, which are the subjects of Chap. 6 and Chap. 7 respectively, the latter of course being the more exhaustive.

Chapter 8, Results and discussion, will put the results into perspective by comparing the three algorithms and give an outline of my contributions and of what has been successful and what has not. Finally, Chap. 9 will be a short overview and conclusion, summarizing the results and contributions, with a comment on possible future work.

The appendices will contain: Appendix A, algorithmic details that do not fit in the main text; App. B, instructions on how to obtain and run the programs; and finally App. C, a report on a real-life application of the final sub-cubic algorithm.


Chapter 2

Prerequisites

First, this chapter gives an introduction to the terminology that is used extensively throughout the thesis. New notation will be used at times but will be appropriately introduced. Second, I will motivate my choice of implementation languages and give the details about the platforms used to run the experiments.

2.1 Terminology

In Sec. 1.1, I introduced the phylogenetic tree. From a purely algorithmic perspective, a tree, denoted by T, consists of three kinds of objects: a set I of internal nodes, a set L of leaf nodes and a set E of edges connecting the nodes. The set of all nodes, leaves and internal, is denoted by V. When two trees with the same set of leaves, usually denoted by T and T′, are to be compared, the measure of the size of the problem will be the total number of leaves in each tree. This value will alternatively be denoted by n, and thus n = |L|. The number of edges connected to an internal node v is referred to as the degree of the node and will be denoted by d_v. For any internal node, d_v ≥ 3. The degree of a leaf is one, and no node will have degree two (except, e.g., the root of a rooted binary tree). Allowing nodes of degree two would give rise to trees with arbitrarily many nodes and edges. This would allow the number of nodes and edges to exceed O(n), clearly invalidating any analysis based on n = |L| as the problem size.
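The node and edge bounds for trees without degree-two nodes (at most 2n−2 nodes and 2n−3 edges, attained by binary trees) can be checked empirically. A small sketch of my own that grows a random unrooted binary tree by repeatedly subdividing an edge and hanging a new leaf from the subdivision point (representation and names are assumptions, not the thesis's code):

```python
import random

def random_unrooted_binary_tree(n):
    """Adjacency dict for a random unrooted binary tree with n >= 3 leaves
    named 'L0'..'L{n-1}'; internal nodes are named 'I0', 'I1', ..."""
    adj = {'I0': ['L0', 'L1', 'L2'], 'L0': ['I0'], 'L1': ['I0'], 'L2': ['I0']}
    edges = [('I0', 'L0'), ('I0', 'L1'), ('I0', 'L2')]
    for i in range(3, n):
        # Subdivide a random edge (u, v) with a new internal node and
        # attach the new leaf to it: +2 nodes and +2 edges per leaf added.
        u, v = edges.pop(random.randrange(len(edges)))
        mid, leaf = 'I%d' % (i - 2), 'L%d' % i
        adj[u].remove(v); adj[v].remove(u)
        adj[u].append(mid); adj[v].append(mid)
        adj[mid] = [u, v, leaf]
        adj[leaf] = [mid]
        edges += [(u, mid), (mid, v), (mid, leaf)]
    return adj

tree = random_unrooted_binary_tree(50)
num_nodes = len(tree)
num_edges = sum(len(neighbors) for neighbors in tree.values()) // 2
print(num_nodes, num_edges)  # -> 98 97, i.e. 2n-2 and 2n-3 for n = 50
```

Whatever random shape the tree takes, the counts are fixed: each added leaf contributes exactly two nodes and two net edges.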
It is sufficient to bound the number of nodes of degree two, since one can prove that a tree with no degree-two nodes has at most 2n−2 nodes and at most 2n−3 edges, with the maximal values being attained by binary trees. The maximum degree of any node in the tree is denoted by d.

Sometimes an (undirected) edge will be regarded as two distinct, oppositely directed edges. A directed edge e is interesting because it identifies the subtree, call it F, consisting of every part of the tree T in front of the edge. Likewise, the opposite edge, denoted by ē,


10 CHAPTER 2. PREREQUISITESidentifies <strong>the</strong> subtree ¯F , which includes every part of <strong>the</strong> tree behind e. The quantity |F |is called <strong>the</strong> leaf set size and will be used to denote <strong>the</strong> number of leaves in <strong>the</strong> subtree F.Therefore, |F | + | ¯F | = n. This is a slight abuse of <strong>the</strong> ma<strong>the</strong>matical notation for sets andone could argue that F should merely denote <strong>the</strong> set of leaves in <strong>the</strong> subtree, however,I will stick to this notation as it is used in <strong>the</strong> background literature. Most often, <strong>the</strong>algorithms will deal with two sub<strong>trees</strong>, one from each tree, where F is a subtree of T andG is a subtree of T ′ . Since T and T ′ contain <strong>the</strong> same set of leaves, F and G might alsohave some leaves in common. This is written |F ∩G| and will be referred to as <strong>the</strong> sharedleaf set size.The concept of <strong>quartet</strong>s of four leaves a, b, c and d has been described in Sec. 1.2.1.The three butterfly topologies, illustrated in Fig. 1.2 (a)–(c), are written ab|cd, ac|bd andad|bc respectively, while <strong>the</strong> star <strong>quartet</strong> in Fig. 1.2(d) is written a b ×c d .2.2 Choice of language and test environmentsMy first choice of implementation language was <strong>the</strong> Python programming language.Python is not among <strong>the</strong> most efficient languages, and usually not used in critical algorithmicor ma<strong>the</strong>matical applications. However it is a popular language for fast prototyping,because it supports clarity and simplicity and a very short workcycle. This was agood match for me, during <strong>the</strong> early parts of <strong>the</strong> implementation process, where I gainedcomplete understanding through <strong>the</strong> experimental work. 
In addition, I did not know the actual time needed for quartet distance calculation, and therefore productivity was more important than performance.

Later on, I decided to implement the algorithms in C++ as well. We have seen, in Sec. 1.3.1, that the results of the Python implementation were indeed informative; however, the practical running times were rather slow. The time-consuming calculations were carried out on remote servers, and although this thesis is not about optimizations, I found it interesting to see if I could bring the running time down to a level where the experiments could be carried out on my own laptop. Furthermore, another language with other properties might give the whole study a new perspective. C++ does indeed have other properties: it is compiled and, with support for a set of low-level language features, considerably closer to the machine level, and it is often used for algorithms and other time-critical calculations.

These two tracks of implementation will be described in parallel, and in some cases not distinguished between. However, they should not be compared in a one-to-one relation; the resulting programs will merely be used as two different opportunities to experiment with the theoretic results.


Details about the two programming languages, libraries and physical environments used are listed below. For a guide on how to obtain, compile and run the code, see App. B.

Python programming environment
  platform:         PC
  processor:        Intel Xeon 3.00 GHz
  memory:           1 GB
  operating system: Red Hat Linux (kernel: 2.6.18)
  language:         Python (v. 2.4.3)
  libraries:        NumPy (v. 1.2.1)
                    SciPy (v. 0.6.0)
                    BLAS (preinstalled, v. 3.1.1)

C++ programming environment
  platform:         MacBook Pro 13"
  model:            MacBookPro5,5
  processor:        Intel Core 2 Duo 2.26 GHz
  memory:           4 GB
  operating system: Mac OS X 10.6.4 (10F569)
  language:         C++ (g++ from gcc version 4.2.1 (Apple Inc. build 5664))
  libraries:        Boost.uBlas (Boost libraries v. 1.42.0)
                    Boost.NumericBindings (v. v1)
                    BLAS (Apple Accelerate framework: v. vecLib-268.0)


Chapter 3

Experimental approach

Algorithms for computing the quartet distance should all produce the same simple result, namely a single number. Once confidence in one algorithm has been established, it is rather simple to verify another such algorithm and build the same level of trust in the correctness of its results. That is one reason for implementing several different algorithms alongside each other, and it is also a strong motivation for the inclusion of the quartic algorithm of Chap. 4: it is simple and therefore easy to verify by hand. Thorough verification has indeed been practiced, with various small examples and unit tests.

Another way to gain trust is, of course, to rely on some independent implementation. This has also been done in the effort to cement the correctness of my work. A piece of software called QDist, described in Mailund and Pedersen [13], has been utilized. It is an implementation of the O(n log^2 n) algorithm presented in Brodal et al. [2], which works only on binary trees. However, since there was no working software for general trees at hand, all results involving general trees had to be verified separately.

After implementing three algorithms using completely different approaches, in two different programming languages, my confidence in the correctness is very high.

Needless to say, correctness of the result is essential, but yet another all-important property is the running time.
Especially so in this thesis, where running time is the quality measure by which the algorithms are compared. Experiments are used to determine the running time of each algorithm, in order to verify whether the theoretically promised time bounds are correct, to assess the practical behaviour of the algorithms, which might be different than anticipated, and, last, to be able to compare the algorithms against one another. For these experiments, a wide range of test data has been used. This is explained in detail in the following Sec. 3.1. Other details about each experiment, e.g. which input trees have been used and how many times an experiment has been executed, are found in the respective and appropriate sections.

The evaluation of the outcome of those experiments has mainly been done by plotting, with the use of two libraries for Python, matplotlib [1] and SciPy [2], for plotting and scientific computation respectively. Each data set is displayed in a log-log plot, which, due to the nonlinear scaling of the axes, displays a power function, like f(x) = ax^b, as a straight line, where b can be read off as the slope of the line. This makes it easy to verify which growth rate the running time of an algorithm follows. For convenience, each plot will include lines indicating linear, quadratic, cubic and quartic behaviour where appropriate.

In addition to the visual evaluation of results, the exponent of the function describing each plot has been estimated through the least squares method, from the scipy.optimize module. These approximated exponents are displayed along with the name of each data series in a plot.

3.1 Trees

Five groups of input trees have been utilized throughout the array of experiments performed as part of this thesis. They are all artificial trees and fit the conditions necessary to work with the three algorithms considered, as described in Sec. 1.3. The intention is to cover as wide a range of trees as possible, which involves varying the relationship between the number of leaves and the number of internal nodes, which in turn influences the degree of the internal nodes.
If a tree has a relatively large number of internal nodes, their degree will be small.

It is important to consider all these different kinds of trees to give a reliable evaluation of the algorithms subject to the experiments. This will clarify whether an algorithm has any weakness related to a certain property of trees. In particular, as mentioned in Chapter 1, this thesis only considers algorithms with a theoretic time complexity that is independent of the degree of the internal nodes in the tree. Hopefully, these different trees will help investigate whether such assumptions hold true.

The five groups of trees are described below, along with an outline of their respective properties. All trees are considered unrooted.

Binary trees (bin) In this thesis, a binary tree is a tree where ∀v ∈ I : d_v = 3, that is, the degree of each internal node is three. Consequently there are n − 2 internal nodes and 2n − 3 edges, and therefore the number of internal nodes and the number of edges is O(n). Essentially, a binary tree has many internal nodes of low degree. Figure 3.1 shows an example of a binary tree. It is evident that a binary tree cannot contain star quartets.

Figure 3.1: A bin tree

Random trees (ran) A random tree can have any topology where the degree of internal nodes is at least three, that is, ∀v ∈ I : d_v ≥ 3. At this point we cannot predict the behavior of an algorithm given random trees as input; however, they are included in case something interesting comes out of the experiments.

Square root trees (sqrt) A square root tree consists of a single central node, connected to √n internal nodes, each of which in turn has edges to √n leaves. See Figure 3.2. Consequently, such a tree has one node of degree √n and √n inner nodes of degree √n + 1. The number of edges is |E| = n + √n. Essentially, the sqrt tree has a large number of nodes of high degree.

Figure 3.2: A sqrt tree

Star trees (star) A star tree consists of a single internal node connected to every leaf. See Figure 3.3. That is, one inner node of degree n, meaning few nodes of high degree. Also, the number of edges is |E| = n. Naturally, this tree only contains star quartets.

Figure 3.3: A star tree

[1] matplotlib: http://matplotlib.sourceforge.net/
[2] SciPy: http://www.scipy.org/


Worst-case trees (wc) In a worst-case tree there is one internal node of degree n/2 and n/2 internal nodes of degree three; see Figure 3.4. The tree has its name from Christiansen and Randers [5], where it was invented with the purpose of having a tree with O(n) internal nodes and O(n) internal edges connected to an internal node of degree O(n). More interesting to this thesis is the total number of edges, which is |E| = (3/2)n. The name should not be taken literally, since we do not yet know how the algorithms in this thesis will perform.

Figure 3.4: A wc tree

3.2 Tree construction

Since this thesis mainly encompasses an experimental approach to evaluating algorithms, I find it natural to describe my choice of data format and data construction.

3.2.1 Newick format

The programs read tree data from files in the Newick tree format [3], which is a commonly used format for representing phylogenetic trees. It is a simple text format based on parentheses and commas. The grammar in Fig. 3.5 provides a formal description for parsing the Newick format.

Tree      → Subtree ";" | Branch ";"
Subtree   → Leaf | Internal
Leaf      → Name
Internal  → "(" BranchSet ")" Name
BranchSet → Branch | BranchSet "," Branch
Branch    → Subtree Length
Name      → empty | string
Length    → empty | ":" number

Figure 3.5: A context-free grammar for the Newick format

[3] Wikipedia article on the Newick format: http://en.wikipedia.org/wiki/Newick_format (accessed Oct. 3, 2010)


Formal descriptions are good; however, this one has some small restrictions, e.g. parentheses inside Name are prohibited. Nevertheless, the grammar more than satisfies the needs of this thesis. Note that a tree either descends from nowhere, or it is rooted in an internal node. Thus, the Newick format is incapable of describing unrooted trees. This is not a problem, since the tree can merely be rooted in an arbitrary internal node which, when parsing, will be treated as any other internal node.

The two trees from the example in Fig. 1.1 would be written as follows in Newick tree format. Note that there is only one named Internal node in each tree, namely the root (Panthera):

(((Tiger, Snow Leopard), ((Lion, Leopard), Jaguar)), Clouded Leopard) Panthera;
(((Lion, Leopard), Snow Leopard, Jaguar), (Tiger, Clouded Leopard)) Panthera;

Data construction Data files have been generated in Newick format using a range of simple Python scripts, one for each class of artificial tree described in Sec. 3.1. They are found along with the code and all the data files used; see Appendix B.

3.2.2 Parsing data files and building a tree data structure

One way of parsing data described by a grammar, like the one in Fig. 3.5, is by using a recursive descent parser.
As the name indicates, a recursive descent parser works top-down, by repeatedly identifying a construct in the data that corresponds to a production rule in the grammar and then processing this piece of data according to the rule. A number of mutually recursive procedures will typically represent the rules defined by the grammar.

For the Python implementation I have made use of a parser based on the Toy Parser Generator framework [4] (TPG), written by Thomas Mailund [5], with a few modifications. For the C++ implementation I wrote my own parser from scratch, with procedures Parse(), ParseSubtree(), ParseInternalNode(), ParseLeafNode(), ParseBranchSet() and ParseBranch(), corresponding to the first six rules of the grammar of the Newick format. Both parsers are capable of parsing all the formats suggested by the grammar in Sec. 3.2.1 and they meet the needs of this thesis.

While the algorithms in general deal with unrooted trees, some parts view the tree as rooted, in which case it turns out more convenient to have a rooted, top-down representation. Furthermore, the Newick format dictates that some arbitrary node be chosen as root, so that is what one can expect when parsing such data. This is no problem; in any case, some entry point is needed to access the tree.

When initiating the process of implementing the first algorithm in Python, I did not have deep and substantial knowledge about the requirements each algorithm would put on the data structure used to store and represent the tree. Consequently, I found myself using the simple top-down data structure provided by the Python Newick parser described above. This worked out all right in my first attempt at implementing the quartic algorithm, see Sec. 4.1. However, when attacking the other algorithms, which sometimes take an edge-based approach (remember that subtrees are identified by directed edges), it turned out to be insufficient, inexpressive and confusing.

Therefore, it seemed obvious to enrich the data structure with constructs that correspond to the ideas used in the algorithms. As long as the modifications could be done in linear time by simple traversal of the tree, the overhead would easily fit into the time bounds of the algorithms considered in this thesis.

One improvement was to decorate the tree with more directed edges. A recursive descent parser built using the Toy Parser Generator framework is tail recursive and does not leave the opportunity to send back information through the recursion. The result is that trees parsed with a TPG parser do not have back-edges, i.e. edges pointing towards the root.

[4] TPG framework: http://christophe.delord.free.fr/tpg/index.html
[5] Mailund's parser: www.mailund.dk/index.php/2009/01/19/yet-another-newick-parser/
Subsequently adding back-edges made it easy to traverse the tree from any choice of start node and in any direction. Also, after realising that the algorithms most often deal with directed edges, I let each edge know its opposite, so that two directed edges together correspond to one actual undirected edge. The algorithms naturally require unique identification of leaves and use the assumption that internal nodes and edges can be identified uniquely as well. Therefore, ids were also added in a subsequent traversal of the tree. Because I implemented the C++ parser and data structure with this in mind, I made the C++ parser a bit more advanced, letting it add the necessary edges while parsing, keep track of the ids of nodes and edges, and collect nodes and edges in lists for easy access. Of course this required that procedures return information about the part of the tree just parsed (the subtree below).

The few dissimilarities in the two data structures used do not play a significant role in the implementations. There is a slight difference in the way a tree is traversed, since an internal node in the Python data structure has a parent and a number of subtrees, whereas the C++ data structure simply has a number of related subtrees. Furthermore, the fact that the two implementations do the job in different orders results in the edges having different ids in the two languages, which changes the order in which edges are processed. This should not have any impact on the results of the algorithms, however.
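To make the recursive-descent approach concrete, here is a minimal, self-contained Python sketch of a parser for the grammar in Fig. 3.5. It is an illustration, not the thesis code (neither the TPG-based parser nor the C++ one): it returns a leaf as its name string and an internal node as a (name, children) pair, parses but discards branch lengths, and does not handle every corner of the format (e.g. quoted names).

```python
import re

def parse_newick(text):
    """Minimal recursive-descent parser for the Newick grammar of Fig. 3.5.
    A leaf becomes its name string; an internal node becomes a
    (name, children) pair. Branch lengths are parsed but discarded."""
    # Delimiters are single tokens; everything between them is a name/number.
    tokens = re.findall(r'[();,:]|[^();,:\s][^();,:]*', text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = tokens[pos]
        assert expected is None or tok == expected, f"expected {expected!r}, got {tok!r}"
        pos += 1
        return tok

    def subtree():
        if peek() == '(':                  # Internal -> "(" BranchSet ")" Name
            take('(')
            children = [branch()]
            while peek() == ',':           # BranchSet -> Branch ("," Branch)*
                take(',')
                children.append(branch())
            take(')')
            name = take().strip() if peek() not in (';', ',', ')', ':', None) else ''
            return (name, children)
        return take().strip()              # Leaf -> Name

    def branch():                          # Branch -> Subtree Length
        node = subtree()
        if peek() == ':':                  # Length -> ":" number (discarded)
            take(':')
            take()
        return node

    tree = branch()                        # Tree -> Branch ";"
    take(';')
    return tree
```

Parsing `"((a:1,b:2):0.5,(c,d)) r;"`, for instance, yields the nested pair `('r', [('', ['a', 'b']), ('', ['c', 'd'])])`, mirroring how the unnamed internal nodes hang below the named root.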


Chapter 4

Quartic time algorithm

Here I will describe an approach to calculating the quartet distance between two trees T and T′ that leads to a quartic time complexity, that is, a running time of O(n^4). The algorithm was suggested by Christiansen et al. [6]. Given n leaves there are O(n^4) quartets to consider. Consequently, being able to determine the topology of a quartet in each of the two trees and compare them in constant time makes it possible to make the entire algorithm perform within O(n^4).

The main principle of the algorithm is to process a triplet, that is, three leaves, at a time instead of a quartet. Then, for every triplet, in linear time, process each of the remaining n − 3 leaves in turn. There are O(n^3) triplets, meaning that we achieve a total running time of O(n^4).

The triplets are processed as follows. For every triplet (a,b,c) there is a unique node that lies on all three paths between a and b, between a and c, and between b and c. Call such a node a center and denote it C. A node in the tree is associated with a number of subtrees, each corresponding to an outgoing edge from the node. Each subtree contains a number of leaves. For some center node C, let T_a, T_b and T_c denote the subtrees that contain the leaves a, b and c respectively. Furthermore, let T_rest denote the collection of all subtrees of C except T_a, T_b and T_c. See Fig. 4.1.
Every leaf x other than a, b and c will be the fourth leaf in some quartet. The topology of this quartet can easily be determined based on the position of x relative to C. If x is positioned in the same subtree as one of the leaves a, b or c, the resulting quartet is a butterfly; otherwise, it is a star. If x ∈ T_a, the topology is ax|bc. If x ∈ T_b, the topology is bx|ac. If x ∈ T_c, the topology is cx|ab. Finally, if x ∈ T_rest, the topology is the star quartet on a, b, c and x.

Finding the center of three leaves and collecting the leaf sets T_a, T_b, T_c and T_rest can be done in linear time. The topologies of the n − 3 quartets containing a, b and c can then be determined in linear time by going through the collections of leaves.

Figure 4.1: A center node C of the leaves a, b, c. The white subtrees make up the set of subtrees denoted T_rest.

To compare the topologies found in the two trees, an array is computed, holding, at position i, the topology of the quartet containing a, b, c and the i'th leaf. Given two such arrays, one for T and one for T′, the topologies can be compared in linear time and the number of different topologies associated with the triplet (a,b,c) counted. This gives an overall running time of O(n^4).

A number of quartet topologies is thus computed by processing all triplets (a,b,c). However, each quartet (a,b,c,d) is actually considered four times, once for each possible triplet composed from the four leaves; in other words, each of the four leaves will eventually act as the leaf x described above. Consequently, the total number of different quartets counted has to be divided by four to get the quartet distance. See Alg. 4.1 for an outline of the algorithm.

Regarding space consumption, this approach only needs memory for the tree data structure and the two arrays, which are all linear in the number of leaves. Thus, the algorithm uses O(n) space.

4.1 Implementation

The quartic algorithm is simple and easy to understand, and this goes for its implementation as well.
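To make the triplet-based counting above concrete, the following compact Python sketch implements the idea directly on a plain adjacency-dict representation of unrooted trees. It is an illustration of the technique, not the thesis implementation; the representation and names are my own, and no attempt is made at efficiency.

```python
from itertools import combinations

def find_path(adj, src, dst):
    """Unique path from src to dst in a tree, as a list of nodes."""
    parent, stack = {src: None}, [src]
    while stack:
        u = stack.pop()
        if u == dst:
            break
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                stack.append(w)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = parent[u]
    return path[::-1]                           # src ... dst

def center(adj, a, b, c):
    """The unique node lying on all three pairwise paths between a, b, c."""
    common = (set(find_path(adj, a, b))
              & set(find_path(adj, a, c))
              & set(find_path(adj, b, c)))
    return common.pop()                         # a tree has exactly one such node

def leaves_behind(adj, frm, to, leaves):
    """Leaves in the subtree entered by the directed edge frm -> to."""
    found, stack, seen = [], [to], {frm, to}
    while stack:
        u = stack.pop()
        if u in leaves:
            found.append(u)
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return found

def topology_array(adj, leaves, a, b, c):
    """Topology code per leaf x: 0 is ax|bc, 1 is bx|ac, 2 is cx|ab, 3 is star."""
    C = center(adj, a, b, c)
    topo = {}
    for code, rep in enumerate((a, b, c)):
        toward = find_path(adj, C, rep)[1]      # neighbour of C toward rep
        for x in leaves_behind(adj, C, toward, leaves):
            topo[x] = code
    for x in leaves:
        topo.setdefault(x, 3)                   # x hangs off T_rest: star quartet
    return topo

def quartet_distance(adj1, adj2, leaves):
    """O(n^4) quartet distance; each quartet is counted once per triplet,
    i.e. four times, hence the final division by four."""
    diff = 0
    for a, b, c in combinations(sorted(leaves), 3):
        t1 = topology_array(adj1, leaves, a, b, c)
        t2 = topology_array(adj2, leaves, a, b, c)
        diff += sum(1 for x in leaves
                    if x not in (a, b, c) and t1[x] != t2[x])
    return diff // 4
```

On four leaves there is a single quartet, so the butterfly ab|cd against ac|bd gives distance 1, as does a butterfly against the star tree.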
The algorithm is a bit clumsy by nature, making a lot of traversals of the tree, but since it is the starting point for my investigation of quartet distance calculation, my focus is on correctness rather than efficiency, and thus I will make no attempt to optimize the algorithm in any way. Having a solid foundation and reference point for comparison with the other implementations is important to gain confidence in the final results.

The main loop, line 2 of Algorithm 4.1, goes through all distinct triplets, which can be generated in various ways. The centers in a tree are found by first making three traversals to find the path between each pair of nodes (a,b), (a,c) and (b,c), and then examining these paths to find the single common internal node that is the center.

Algorithm 4.1 Quartic algorithm for quartet distance
 1: diffQ = 0
 2: for all triplets (a,b,c) do                                  ⊲ O(n^3)
 3:   Find center C in T and C′ in T′                            ⊲ O(n)
 4:   Collect leaf sets T_a, T_b, T_c and T_rest                 ⊲ O(n)
 5:   and leaf sets T′_a, T′_b, T′_c and T′_rest                 ⊲ O(n)
 6:   Create arrays A and A′
 7:   for all leaves x ∉ {a,b,c} do                              ⊲ O(n)
 8:     Topology can be decided in constant time based on leaf set membership
 9:     A[x] ← topology(a,b,c,x) in T                            ⊲ O(1)
10:     A′[x] ← topology(a,b,c,x) in T′
11:   end for
12:   for all topologies t ∈ A and t′ ∈ A′ do
13:     if t ≠ t′ then
14:       diffQ = diffQ + 1
15:     end if
16:   end for
17: end for
18: return diffQ / 4

The leaf sets T_a, T_b and T_c are easily found by traversing the tree with origin in the center node, and T_rest is not needed, as this set is just every leaf that is not in one of the other sets. Now, to process each iteration of the loop in line 7 in constant time, it is necessary to go through the leaf sets and deal with each leaf contained in them, rather than to go through all leaves, which would require repeated look-ups in each leaf set, breaking the time bound. The four topologies are represented by the integers 0, 1, 2, 3, and comparison of the two arrays is done by traversing them in parallel and comparing elements.

4.1.1 Result

The algorithm has been implemented in Python and C++ and tested on the five groups of artificial test trees described in Sec. 3.1. The results are displayed in Fig. 4.2 and Fig. 4.3. Because the algorithm is slow, the size of the trees used is limited to around 140 leaves. Larger trees would induce unacceptable running times, and 140 leaves are enough to observe any tendencies in the development of the performance. The experiments have been repeated five times.

Naturally, the expectation is to see an O(n^4) development on all trees, and this is indeed the case: all plots are parallel to or less steep than the line indicating quartic development. Furthermore, the estimated exponents of the lines plotted are below four.


This evidence makes it clear that it is in fact a worst-case quartic algorithm. In addition, the figures also reveal the interrelationship between the five groups of data. Observe that there is a rather large difference in running time, with the star tree being the fastest and the binary tree the slowest to deal with. This seems natural, since the internal structures of the trees are very different. The traversal of a tree with a complex internal structure, i.e. a large number of internal nodes, is more time consuming, and since the algorithm is based on a large number of traversals, this penalty is observable in the plots. And of course, the time bound is related to the number of leaves and not to the internal structure of the trees.

This algorithm has shown, not surprisingly, to be very slow and not competitive with a sub-cubic algorithm. It has merely worked as a reference of correctness when implementing the other algorithms. Nevertheless, in Chapter 8, where all the results of this thesis are compared and discussed, its practical performance is compared to those of the other algorithms.
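The exponent estimation behind the fitted values in the figures works because the fit is linear in log-log space: log t = log a + b log n. The thesis uses scipy.optimize for this; the following equivalent closed-form sketch in pure Python illustrates the idea, with synthetic timing data for the example (the constant 3e-6 and exponent 3.9 are made up for illustration).

```python
import math

def fit_power_law(ns, ts):
    """Least-squares fit of t(n) = a * n**b, computed as a straight-line fit
    of log t = log a + b * log n, so b is the slope seen in a log-log plot.
    Returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic, exactly power-shaped "timings"; real measurements are noisy.
ns = [10, 20, 40, 80, 160]
ts = [3e-6 * n ** 3.9 for n in ns]
a, b = fit_power_law(ns, ts)
```

On noise-free data the fit recovers the exponent exactly; on real timings the estimate is only as good as the measurements, which is why the plots also carry the n^3, n^4 and n^5 reference lines for visual comparison.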


[Figure 4.2: Performance of the quartic algorithm implemented in Python. Log-log plot of running time t(n) in seconds against the number of leaves n (10 to 200), with reference lines for n^3, n^4 and n^5. Fitted exponents: bin 3.93, ran 3.84, sqrt 3.69, star 3.65, wc 3.69.]

[Figure 4.3: Performance of the quartic algorithm implemented in C++. Log-log plot of running time t(n) in seconds against the number of leaves n (10 to 200), with reference lines for n^3, n^4 and n^5. Fitted exponents: bin 3.81, ran 3.71, sqrt 3.37, star 3.27, wc 3.44.]


Chapter 5

Calculating leaf set sizes

Calculating the size of every subtree leaf set in a given tree, that is, the number of leaves contained in the subtree, is an essential part of two of the algorithms studied in this thesis, namely the cubic time algorithm of Chap. 6 and, most importantly, the sub-cubic algorithm of Chap. 7, which is the central topic of the thesis. So is the calculation of shared leaf set sizes between two trees. A shared leaf set is the intersection of a pair of subtree leaf sets. It turns out that the former is needed for fast calculation of the latter. Christiansen and Randers [5] study these calculations for different types of trees and have been my primary inspiration. Here I will focus solely on outlining the approach and the aspects important to this thesis.

5.1 Subtree leaf set sizes

The calculation of all subtree leaf set sizes can be done in time O(n). Consider some general unrooted tree T, like the one in Fig. 5.1(a). If T is rooted in one of its internal nodes, e.g. r, one can consider directed edges as either pointing away from r or towards r, like e and e_2 respectively; see Fig. 5.1(b). Each directed edge represents a subtree. The subtree F, represented by e, does not contain r, and the subtree F̄, represented by the opposite edge ē, does contain r.

The leaf set size of every subtree that, like F, does not contain r can be calculated recursively using the following recipe. Make a depth-first traversal of all nodes v ∈ T, starting at r.
For each node v we look at the subtree defined by the edge entering v and pointing away from r. If v is a leaf node, the leaf set size is 1. If v is an internal node, the leaf set size is the sum of the leaf set sizes of the subtrees pointing downwards from v. Half of the leaf set sizes have now been calculated, and this calculation takes O(n) time, since a tree contains a linear number of nodes.
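The recursive recipe, combined with the complement rule |F̄| = n − |F| from Chap. 2 for the opposite edges, can be sketched as follows in Python, assuming an adjacency-dict tree representation (an illustration, not the thesis code):

```python
def subtree_leaf_sizes(adj, root, leaves):
    """Leaf set size |F| for every directed edge (u, v), i.e. the number
    of leaves in the subtree entered by following u -> v."""
    n = len(leaves)
    size = {}

    def down(u, v):                      # edge u -> v, pointing away from root
        if v in leaves:
            s = 1                        # a leaf contributes exactly itself
        else:
            s = sum(down(v, w) for w in adj[v] if w != u)
        size[(u, v)] = s
        return s

    for w in adj[root]:                  # first half: subtrees not containing root
        down(root, w)
    for (u, v), s in list(size.items()): # second half by complement: n - |F|
        size[(v, u)] = n - s
    return size
```

Both passes touch each edge a constant number of times, matching the O(n) bound stated above.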


Another approach is needed when calculating the other half of the leaf set sizes, because a repetition of the procedure for every possible root would take too long. Clearly, F ∪ F̄ contains every leaf and thus |F̄| = |L| − |F| = n − |F|. With this in mind and using the values previously computed, all leaf set sizes of subtrees containing r can be counted in linear time, giving a total time complexity of O(n).

Figure 5.1: (a) An unrooted tree and (b) the same tree rooted in the node r. Half of the directed edges, like e, point away from r and identify subtrees, like F, that do not contain r. The other half point towards r, like e2, and identify subtrees that do contain r.

5.2 Shared leaf set sizes

The shared leaf set size for two subtrees F ∈ T and G ∈ T′, written |F ∩ G|, is the number of leaves contained in both subtrees. Filling out a table holding the shared leaf set size of all pairs of subtrees between two trees can be done in time O(n²). As mentioned, this time bound is crucial to the cubic and sub-cubic quartet distance algorithms.

As the basis for my implementation I have used Christiansen and Randers [5] and I will therefore give an explanation of the idea here. It is an adaptation of the original idea for computing the shared leaf set sizes between two binary trees, presented in Bryant et al. [4] and elaborated in Tsang [19].

Again, consider the two subtrees from above.
F and G are rooted in two nodes, say v and v′ respectively, and each is composed of a number of smaller subtrees, given by the children of these nodes. If v has subtrees F_1, ..., F_{d_v−1} and v′ has subtrees G_1, ..., G_{d_v′−1}, the shared leaf set size of F and G can be expressed in two different ways by the following equation:

    |F ∩ G| = Σ_{i=1}^{d_v−1} |F_i ∩ G| = Σ_{j=1}^{d_v′−1} |F ∩ G_j|.    (5.1)

Doing this for all pairs of subtrees like F, G is cumbersome; however, using the same idea as for subtree leaf set sizes, the running time can be kept within the critical bound of O(n²). That is, root T and T′ in some internal nodes r and r′, see Fig. 5.1, and calculate


only the shared leaf set sizes for the subtrees pointing away from these roots. In a tree, there is exactly one of these subtrees for each node. Comparison of two leaves is done in constant time because leaves are numbered. This leads to the following expression:

    Σ_{v∈I} Σ_{v′∈I′} min(d_v − 1, d_v′ − 1) + Σ_{v∈L} Σ_{v′∈L′} 1 = O(n²)    (5.2)

Now, the processing of all pairs of subtrees where one or both of them contain the root still remains. Making use of the subtree leaf set sizes of T and T′, each remaining pair can be processed in a similar manner as used for subtree leaf set sizes in Sec. 5.1. There are three kinds of pairs of subtrees still to process, and using the filled parts of the table, that is, all entries associated with a subtree pair |F ∩ G|, their shared leaf set sizes can be computed in constant time, using the following recipe:

    |F̄ ∩ G| = |G| − |F ∩ G|,
    |F ∩ Ḡ| = |F| − |F ∩ G|,
    |F̄ ∩ Ḡ| = n − (|F| + |G| − |F ∩ G|).    (5.3)

This gives a complete time bound of O(n²) for filling the entire table.

In the following section I will give details about my implementation of the algorithm and the verification of its running time.

5.3 Implementing leaf set algorithms

5.3.1 Subtree leaf set sizes

The calculation of subtree leaf set sizes is easily implemented using the recipe described in Sec.
5.1, and hence there are no notable differences between the Python and C++ implementations.

Since a directed edge uniquely identifies a subtree, an array of length equal to the number of directed edges is used to store the result, and edge ids are used to index the array. First, a recursive function counts leaves in a depth-first manner for all subtrees not containing the root of the tree. Next, all remaining entries of the array are filled by a simple calculation. Every directed edge knows its opposite, and the value associated with an up-edge can therefore be calculated from the value associated with the corresponding down-edge.

5.3.2 Shared leaf set sizes

The description in Sec. 5.2 is easily translated into Python or C++ code as well. A two-dimensional table is used to store the result and again ids of directed edges are used


as indices. All entries associated with two edges pointing away from the root are filled by a recursive function. The function makes a combined depth-first traversal of the two trees, compares the leaves and sums up the shared leaf set sizes on its way up through the two trees. Equation (5.1) shows that every time two subtrees are encountered there is a decision to make: for which subtree should we sum over the children? This can influence the total number of additions the function has to make. Equation (5.2) includes this choice as the min-expression; however, analysing the expression reveals that the choice is not significant, since the running time will stay within the asymptotic time bound whichever way is chosen.

After processing all pairs of subtrees not containing the root, the results are used to fill the remaining entries of the table, cf. Eq. (5.3).

I have of course tested all parts of my code thoroughly for correctness and efficiency. However, as mentioned earlier, the running time of this procedure is critical to the overall running time of the cubic and sub-cubic algorithms, and I have therefore decided to document the experiments verifying the time usage of the shared leaf set size calculation. The algorithm has been tested against each of the five types of test trees described in Sec.
3.1.

Expectations  Before commenting on the results of my experiments I will discuss which expectations one might have of the actual performance of the algorithm. Clearly we would expect it to perform within the theoretical time bound of O(n²), but what about the actual time usage? Will it, for example, perform equally well on every type of test tree, or does the internal structure of a tree influence the running time?

It is difficult to guess at the actual time spent by the algorithm. One thing to expect, however, is that the C++ implementation will perform better than the Python implementation. With regard to the performance on different types of trees, one should take a look at the actual mechanisms of the algorithm. Where is the work really done? When looking at Eq. (5.2) it is evident that the number of internal nodes is significant, meaning that a high number of inner nodes will make the two sums very large. What about the degree of the inner nodes then? Even though there is a correspondence between the number of inner nodes, |I|, and their degrees, d_v, as pointed out in Sec. 3.1, I find it hard to see the exact influence on the algorithm here. From my point of view, it is easier to look at the implementation details.
Here I deal with directed edges pointing away from the root, and the number of these is clearly the same as the number of (undirected) edges in the tree altogether. From this point of view I would expect the number of edges to be critical to the running time, meaning of course that fewer edges lead to better performance. Therefore I expect binary trees, having 2n − 3 edges, to demand a high processing time,


whereas the star tree, having n edges, would be faster to deal with. In between, sqrt-trees should be faster than wc-trees, again due to the number of edges.

This algorithm has a quadratic space consumption, due to the table being used for storage, which should not cause any problems with regard to memory. As an example, think of two binary trees of 800 leaves. The table has an entry for each pair of subtrees, which is equal to each pair of directed edges. Since a binary tree has |E| = 2n − 3, the calculation is roughly:

    2|E| × 2|E| × size_of_int ≈ (2 × 1600) × (2 × 1600) × 4 bytes ≈ 41 MB.

This will be no problem for the test environment described in Sec. 2.2, which will have plenty of memory to spare.

Result  Figures 5.2 and 5.3 display the results of the two implementations, Python and C++ respectively, being applied to the five types of trees. The first thing to observe is the correctness of the time bound. Every plot is parallel to the line indicating quadratic growth. Furthermore, the estimated exponents of the expressions describing the plots are very close to 2. Next, the expectations about the order among the trees hold true. It seems to be correct that more edges lead to higher processing time. There is in fact as large a difference as a factor of ten between the best and worst results.
What is not as clear, because of the log-log plot, is that the difference becomes more pronounced the larger the trees become. This is the case because the distance between the plots remains the same, even though one step on an axis becomes increasingly significant when moving along the axis. This is of course a consequence of the quadratic growth.

Last and less important, we can compare the two figures and see that the C++ implementation is clearly faster than the Python implementation.
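The "best exp" values in the plot legends are the fitted exponents of the running-time curves. The thesis does not spell out the fitting procedure; a standard way to obtain such an exponent, sketched here under that assumption (the function name is mine), is an ordinary least-squares fit of log t against log n, whose slope is the exponent k in t(n) = c·n^k.

```python
import math

def best_fit_exponent(ns, ts):
    """Least-squares slope of log t versus log n.

    For measurements following t(n) = c * n^k, the slope of the
    regression line on a log-log scale is exactly k.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a perfectly quadratic data set this returns 2.0; noise in real timings explains legend values such as 1.89 or 2.05.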


Figure 5.2: Experiments showing the running time of the shared leaf set sizes calculation algorithm in Python for the different types of test trees (fitted exponents: bin 2.02, ran 2.04, sqrt 1.89, star 2.00, wc 2.00).

Figure 5.3: Experiments showing the running time of the shared leaf set sizes calculation algorithm in C++ for the different types of test trees (fitted exponents: bin 2.03, ran 2.05, sqrt 1.86, star 1.97, wc 1.99).


Chapter 6

Cubic time algorithm

Here I will describe another algorithm for quartet distance computation, by Christiansen et al. [6], that improves the solution to a cubic running time at the expense of an increased space consumption. The algorithm is based on the concept of shared leaf set sizes, introduced in Section 5.2, and further extends the idea of using centers, introduced along with the quartic algorithm described in Chapter 4.

We are interested in the number of quartets for which the topology differs between the two trees. Therefore, we might as well count or calculate the number of quartets that share the same topology and then subtract this number from the overall number of quartets, C(n,4).

Having calculated the shared leaf set sizes, the number of leaves common to two subtrees, |T_x ∩ T′_x|, can be found with a constant time look-up. This will be used extensively to calculate the number of shared quartets containing some triplet of leaves, (a,b,c), in constant time. However, the main idea of the algorithm is to process pairs of leaves, (a,b), of which there are O(n²), and then, in linear time, to calculate the number of shared quartets that include a given pair. This is done by considering all internal nodes on the path from a to b as centers. These centers can clearly be found in linear time.
Every leaf c, different from a and b, can be reached from exactly one of the centers C, by following an outgoing edge from C that is not part of the path between a and b. See Fig. 6.1.

In linear time, a path between a and b is found and an array computed, storing in entry i the center of the triplet (a,b,i). This is done for both trees. The arrays are linear in size, and processing each pair of centers of the leaves (a,b,i) in constant time gives an overall running time of the algorithm of O(n³). The following and remaining part of the algorithm is a recipe for constant time computation of the number of shared quartets containing a triplet (a,b,c), given two centers, C in T and C′ in T′, of the triplet.
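The linear-time center computation just described can be sketched as follows. The adjacency-list representation and all names are my own assumptions; the thesis stores centers in an array indexed by leaf id, whereas this sketch returns a dictionary for readability.

```python
def centers_for_pair(adj, a, b):
    """For leaves a and b, map every other leaf c to the internal node
    on the a-b path that is the center of the triplet (a, b, c).

    `adj` maps each node to its neighbour list; leaves have degree 1.
    """
    # find the path from a to b with an iterative depth-first search
    parent = {a: None}
    stack = [a]
    while stack:
        v = stack.pop()
        if v == b:
            break
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                stack.append(w)
    path = []
    v = b
    while v is not None:                 # walk parent pointers back to a
        path.append(v)
        v = parent[v]
    on_path = set(path)

    centers = {}

    def collect(center, frm, v):
        # register every leaf in the subtree hanging off the path at `center`
        if len(adj[v]) == 1:
            centers[v] = center
            return
        for w in adj[v]:
            if w != frm:
                collect(center, v, w)

    for C in path:
        if len(adj[C]) == 1:             # skip the endpoint leaves a and b
            continue
        for w in adj[C]:
            if w not in on_path:         # outgoing edge leaving the path
                collect(C, C, w)
    return centers
```

Each node and edge is visited a constant number of times, so one pair (a,b) costs O(n), in line with the analysis above.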


Figure 6.1: Each node on the path between a and b makes up a center C. For every leaf c contained in T_Cx, C_x will be the center of the triplet (a,b,c).

Recall that the center C, of leaves a, b and c, defines three subtrees T_a, T_b, T_c and also the set T_rest of all remaining subtrees of the center node. The same applies to C′. Identically to the quartic algorithm of Chap. 4, every leaf other than a, b, and c will be the fourth leaf in some quartet. Thus, if x appears in T_a, the resulting topology in T will be ax|bc. Likewise, if x ∈ T′_a, the quartet will have the same topology in T′. It follows that every leaf different from a, appearing in both subtrees T_a and T′_a, will result in a shared quartet topology. It is easy to see that all this boils down to a single look-up in the table of shared leaf set sizes, namely |T_a ∩ T′_a|. One should be subtracted, because the table also includes the leaf a itself, which is obviously present in both subtrees. Thus, we can count all quartets of this type, associated with a certain triplet, in constant time. The same applies to the leaves b and c, giving a constant time computation of all shared butterfly topologies with the following expression:

    |T_a ∩ T′_a| + |T_b ∩ T′_b| + |T_c ∩ T′_c| − 3    (6.1)

Counting the number of shared star topologies is not as straightforward. Recall that if x ∈ T_rest, the topology of the quartet (a,b,c,x) is a star topology.
Because T_rest is actually not a single subtree but a set of subtrees, the expression |T_rest ∩ T′_rest| cannot be found with a single look-up. Instead, we can make use of the following observation: every leaf x ∈ T is in exactly one of T_a, T_b, T_c or T_rest, since together they contain all leaves and are pairwise disjoint, that is, |T_a ∪ T_b ∪ T_c ∪ T_rest| = n and T_a ∩ T_b ∩ T_c ∩ T_rest = ∅. Similarly for leaves in T′. This gives the following expression:

    |T_rest ∩ T′_rest| = |T′_rest| − (|T′_rest ∩ T_a| + |T′_rest ∩ T_b| + |T′_rest ∩ T_c|)    (6.2)

This is only part of the solution, however, since none of the terms are stored in the shared leaf sets table. Using the same principle again a couple of times, we can eliminate all occurrences of T_rest and T′_rest and rewrite every term to be expressed only by terms


that we can look up in constant time:

    |T′_rest ∩ T_a| = |T_a| − (|T_a ∩ T′_a| + |T_a ∩ T′_b| + |T_a ∩ T′_c|)
    |T′_rest ∩ T_b| = |T_b| − (|T_b ∩ T′_a| + |T_b ∩ T′_b| + |T_b ∩ T′_c|)
    |T′_rest ∩ T_c| = |T_c| − (|T_c ∩ T′_a| + |T_c ∩ T′_b| + |T_c ∩ T′_c|)    (6.3)

The last expression is derived directly from the number of leaves in T′:

    |T′_rest| = n − (|T′_a| + |T′_b| + |T′_c|)    (6.4)

We are now able to compute the number of shared star quartets, and thus the overall number of shared quartets of some triplet, in constant time. Combining this with the approach of finding all centers associated with some pair of leaves in linear time yields a running time of the entire algorithm of O(n³). Because of the table for shared leaf set sizes, the algorithm requires a space consumption of O(n²).

Just like with the quartic algorithm, the shared quartets are counted too many times. Here, however, each quartet is counted twelve times. This is due to the fact that we deal with pairs and that each quartet is considered once for each of the six possible pairs that can be composed from the four leaves. In addition, each of those pairs will be used for the construction of two triplets, one for each of the two remaining leaves. Therefore, the number of shared quartet topologies counted is divided by twelve.

See Alg. 6.1 for an outline of the algorithm.

Contribution 6.1  Despite the description in Section 2.2 of Christiansen et al.
[6], which states that each quartet is counted four times, the cubic algorithm is actually counting each quartet twelve times, as explained in the text. This is a minor change, but of course necessary to keep in mind in order to get the correct result when implementing the algorithm. I made the discovery while working on my implementation, seeing that the result was consistently triple what was expected.

6.1 Implementation

The cubic algorithm has been implemented first in Python and later in C++. Since no unusual tricks or language features are required, the two implementations are very similar. The first step is the implementation of the algorithms for calculating the leaf sets, lines 1 and 2 of Alg. 6.1. These have been described and tested in Sec. 5.3.
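The per-triplet constant-time count of Eqs. (6.1)-(6.4) can be sketched roughly as follows, assuming the shared leaf set look-ups are already available. The function name and the dictionary interface are mine, chosen for readability; the thesis implementation works directly on edge-id-indexed tables.

```python
def shared_quartets_for_triplet(n, size, size_p, I):
    """Shared quartet topologies for one triplet (a, b, c), given its
    centers C in T and C' in T'.

    size[x]   = |T_x|   for x in 'abc' (subtrees of C)
    size_p[x] = |T'_x|  for x in 'abc' (subtrees of C')
    I[x][y]   = |T_x ∩ T'_y|, looked up in the shared leaf set table.
    """
    # shared butterflies, Eq. (6.1): three diagonal intersections,
    # minus 3 because each of a, b, c is itself counted once
    butterflies = I['a']['a'] + I['b']['b'] + I['c']['c'] - 3

    # shared stars: eliminate T_rest and T'_rest via Eqs. (6.2)-(6.4)
    rest_p = n - (size_p['a'] + size_p['b'] + size_p['c'])        # |T'_rest|, Eq. (6.4)
    rest_p_inter = {x: size[x] - sum(I[x][y] for y in 'abc')      # |T'_rest ∩ T_x|, Eq. (6.3)
                    for x in 'abc'}
    stars = rest_p - sum(rest_p_inter[x] for x in 'abc')          # Eq. (6.2)
    return butterflies + stars
```

For identical trees, every fourth leaf contributes either one shared butterfly or one shared star, so the result for a triplet is always n − 3 in that case, which is a handy sanity check.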


Algorithm 6.1 Cubic algorithm for quartet distance
 1: Calculate subtree leaf set sizes                                ⊲ O(n)
 2: Calculate shared leaf set sizes                                 ⊲ O(n²)
 3: sharedQ = 0
 4: for all pairs (a,b) do                                          ⊲ O(n²)
 5:     Find paths p in T and p′ in T′ between a and b              ⊲ O(n)
 6:     Find all centers C on the path p and C′ on p′               ⊲ O(n)
 7:     for all leaves c ∉ {a,b} do                                 ⊲ O(n)
 8:         tmpQ = count quartets (a,b,c,x) with the same topology
                    in the two trees                                ⊲ O(1)
 9:         – this is done in constant time using C_abc and shared leaf set sizes
10:         sharedQ = sharedQ + tmpQ
11:     end for
12: end for
13: diffQ = C(n,4) − sharedQ/12
14: return diffQ

Going through each pair of leaves (a,b) is easy. The path between two leaves is found by making one traversal of the tree. Every internal node on the path has a number of subtrees (besides the two containing a and b), and each of these is traversed in turn, registering which leaves are contained in that subtree. A center simply stores the ids of the edges that identify the subtrees containing a, b and a third leaf. An array, call it A, is created to store each of the n − 2 centers. Centers are stored in the array A at the index corresponding to the id of the third leaf, known as c.

Now only the counting remains. Looking at the pair of centers associated with every choice of third leaf, the shared butterfly quartets and shared star quartets associated with exactly those centers are counted, cf. Eq. (6.1), Eq.
(6.2) and the subsequent expressions.

6.1.1 Result

The implementations have been exposed to the usual range of trees. The results are displayed in Fig. 6.2 and Fig. 6.3.

As expected, both implementations seem to have a cubic behaviour on all trees, since the plots are parallel to the purely cubic line. This is further supported by the estimated exponents, which are all very close to three.

The experiments have been continued until the running times exceeded what is acceptable. Because the Python implementation is significantly slower, and since cubic growth in running time is rather dramatic, the Python experiments have only been continued up to 400 leaves, the C++ experiments to 800 leaves. The experiments have in general been repeated five times; however, the larger Python experiments take more than an hour of running time each and have therefore only been


carried out once.


Figure 6.2: Performance of the cubic algorithm implemented in Python (fitted exponents: bin 2.97, ran 2.97, sqrt 2.92, star 3.02, wc 2.99).

Figure 6.3: Performance of the cubic algorithm implemented in C++ (fitted exponents: bin 2.85, ran 2.86, sqrt 2.72, star 3.03, wc 2.96).


Chapter 7

Sub-cubic time algorithm

In this chapter I will describe the sub-cubic algorithm of Mailund et al. [14], which is the main topic of this thesis. I will establish the theoretical foundation necessary to understand the experimental investigation that follows. The first part, Sec. 7.1, is an intuitive outline of the approach used to calculate the quartet distance. Then follows a detailed explanation of the most critical point in the algorithm, including a discussion of the influence it has on the asymptotic worst-case analysis. The algorithm has a sub-cubic running time, but the exact complexity is tightly related to the complexity of matrix multiplication, and this is significant in the analysis. Furthermore, this complexity is used as a parameter in the algorithm and thus has to be known. This is a subject of study in Sec. 7.2, which will give details about my work on the implementation of the algorithm and the experimental results.

7.1 The algorithm

In this algorithm, the overall approach to the quartet distance problem is changed compared to the two reference algorithms of Chapters 4 and 6. Instead of counting shared or different star quartets separately, they can be expressed solely in terms of the butterfly topologies. The article by Christiansen et al. [6] introduces the idea.
The quartet distance is the total number of quartets that induce different topologies in the two trees. That is, the number of quartets that have one butterfly topology in one tree and another butterfly topology in the other tree, plus the number of quartets that have a star topology in one of the trees and a butterfly topology in the other. In short,

    qdist(T,T′) = diff_B(T,T′) + diff_S(T,T′).

It turns out that diff_S(T,T′) can be expressed in terms of the number of butterflies in each tree, B in T and B′ in T′, the number of shared butterflies, shared_B(T,T′), and the


number of different butterflies, diff_B(T,T′), in the following way:

    diff_S(T,T′) = B + B′ − 2(shared_B(T,T′) + diff_B(T,T′))

This gives a complete expression for the quartet distance between two trees:

    qdist(T,T′) = B + B′ − 2·shared_B(T,T′) − diff_B(T,T′)    (7.1)

Given the fact that B = shared_B(T,T) and likewise B′ = shared_B(T′,T′), it is clear that only two procedures are needed to calculate Eq. (7.1): one for shared_B and one for diff_B. They are described separately in the following Sec. 7.1.2 and Sec. 7.1.3.

Like the cubic algorithm, the sub-cubic algorithm makes heavy use of the concept of shared leaf set sizes, introduced in Section 5.2, but makes even less use of tree examination; one can say that it has a more computational nature. Nevertheless, it is easy to illustrate the intuition behind the algorithm. Two new concepts are introduced.

A directed quartet, written ab → cd, first encountered in the article by Brodal et al. [3], is a butterfly quartet, ab|cd, with a direction on the path between the two "wings" of the butterfly, see Fig. 7.1. Of course the number of shared butterfly quartet topologies between two trees is equal to half the number of shared directed quartets. Likewise for different butterflies.
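The combination in Eq. (7.1) is simple bookkeeping once the four counts are available. A minimal sketch (the function name is mine), with the sanity check that identical trees yield distance zero, since then B = B′ = shared_B and diff_B = 0:

```python
def quartet_distance(B, B_p, shared_B, diff_B):
    """Eq. (7.1): qdist = B + B' - 2*shared_B - diff_B.

    Substituting diff_S = B + B' - 2*(shared_B + diff_B) into
    qdist = diff_B + diff_S gives this closed form.
    """
    return B + B_p - 2 * shared_B - diff_B
```

Note how the diff_B terms partially cancel: diff_B + (B + B′ − 2·shared_B − 2·diff_B) collapses to the single −diff_B in the formula.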
We observe that each directed quartet is identified by exactly one directed edge e, in such a way that a and b are positioned behind e, and c and d are positioned in front of e, in two different subtrees of the end node of e.

Figure 7.1: The two directed quartets induced from a butterfly quartet.

A claim, written A −e→ (C,D), is the term used to denote such an edge e "claiming" all directed quartets ab → cd where a and b are contained in the subtree A behind e, while c and d are contained in two different subtrees C and D in front of e. It is convenient to add this to our vocabulary now, as the algorithm will eventually deal with edges. Naturally, a claim is associated with subtrees, and therefore one edge typically claims multiple butterfly quartets. Also, an edge can be involved in different claims with different subtrees, but as mentioned, every directed quartet is claimed by exactly one edge. See Fig. 7.2 for an illustration. We see how the edge e "divides" the tree into several parts: one subtree behind the edge and several different subtrees in front of the edge.

The fundamentals have been established. The algorithm will process all possible pairs of directed edges, (e,e′) ∈ T × T′, and for all the directed quartets claimed by both


edges, count the quartet as a shared butterfly if the two claims give rise to the same topology, and otherwise count the quartet as a different butterfly. Making heavy use of preprocessing, one such pair of edges can be handled in constant time, see Sec. 7.1.4. Since |E| = O(n) and there are O(n²) pairs of edges, the process of counting the butterflies can be done in O(n²) time. Thus, the preprocessing step is crucial to the running time.

Figure 7.2: Example of a claim. The directed edge is a unique identifier of the directed quartet ab → cd.

7.1.1 Basic preprocessing

Here I will describe only the fundamental part of the preprocessing that is necessary to understand the intuition behind the algorithm and how to count shared and different butterflies. More preprocessing is needed to make it possible to process a pair of directed edges in constant time, as mentioned. Unfortunately, that part of the preprocessing poses a threat to the sub-cubic complexity of the entire algorithm and has a direct influence on the running time, so it must be handled with care. For now it would merely be confusing, and I will postpone its introduction until the appropriate section.

As we shall see, it comes in handy that the notion of claims and the concept of shared leaf set sizes both deal with subtrees. The first preprocessing step is to calculate the shared leaf set sizes, as explained in Sec.
5.2, which has quadratic time and space consumption.

The next step is to calculate, for each pair of internal nodes v ∈ T and v′ ∈ T′, with subtrees F_1, ..., F_{d_v} and G_1, ..., G_{d_{v′}}, a matrix I where I[i,j] = |F_i ∩ G_j|. When processing pairs of edges as mentioned above, we will need this matrix, I, associated with the two nodes that the edges point to.

This is enough information about the preprocessing step to complete the intuitive explanation of counting butterflies in Sec. 7.1.2 and 7.1.3.
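As a concrete illustration, the matrix I for one node pair can be sketched as follows. This is a hypothetical toy representation in which each subtree is a Python set of leaf labels; the thesis instead derives the entries from the precomputed shared leaf set sizes.

```python
def intersection_matrix(F, G):
    """Build I[i][j] = |F_i ∩ G_j| for the subtrees F_1..F_dv of v in T
    and G_1..G_dv' of v' in T' (subtrees given as sets of leaf labels)."""
    return [[len(Fi & Gj) for Gj in G] for Fi in F]

# Toy example: v has three subtrees, v' has two.
F = [{"a", "b"}, {"c"}, {"d", "e"}]
G = [{"a", "c", "d"}, {"b", "e"}]
I = intersection_matrix(F, G)  # [[1, 1], [1, 0], [1, 1]]
```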


CHAPTER 7. SUB-CUBIC TIME ALGORITHM

7.1.2 Counting shared butterflies

The number of shared butterflies, the quantity shared_B(T, T′), can be counted using the following method. The edges e and e′ represent a pair of claims; name them F_i →_e (F_k, F_m) and G_j →_{e′} (G_l, G_n). The intention of the algorithm is to count the number of directed butterflies ab → cd where a, b ∈ F_i ∩ G_j, c ∈ F_k ∩ G_l and d ∈ F_m ∩ G_n. This is illustrated in Fig. 7.3.

Figure 7.3: Two edges both claiming the butterfly ab → cd.

Now, the total number of shared directed butterflies can be calculated using the expression

\[
\frac{1}{2}\binom{|F_i \cap G_j|}{2} \sum_{k \neq i} \sum_{l \neq j} |F_k \cap G_l| \sum_{m \neq i,k} \sum_{n \neq j,l} |F_m \cap G_n|, \tag{7.2}
\]

which, using the preprocessing matrix I associated with the two nodes that e and e′ point to, is equal to the expression

\[
\frac{1}{2}\binom{I[i,j]}{2} \sum_{k \neq i} \sum_{l \neq j} I[k,l] \sum_{m \neq i,k} \sum_{n \neq j,l} I[m,n], \tag{7.3}
\]

where the division by 2 is necessary to avoid counting the same butterfly twice because of symmetry, namely the symmetry between the two entries I[k,l] and I[m,n]. Figure 7.4 a) illustrates Eq. (7.3) with a choice of indices, and it is clear that a symmetry occurs between the two entries mentioned.

Contribution 7.1 The article originally claimed that the count should be divided by four and not two, due to symmetry between the indices k and m and between the indices l and n. It turns out that this is actually not the case. I became suspicious after implementing the algorithm and seeing the result consistently being only half of what it should be. When looking at Fig.
7.3 it is easy to imagine how the subtrees F_k and F_m could interchange, and likewise G_l and G_n. This gives rise to four different constellations. Figure 7.4 a),


reflecting Eq. (7.3), however, makes it clear that the symmetry only occurs between two entries in the matrix. Therefore the butterflies are only counted twice. This is because Eq. (7.3) is incomplete and only captures half of the constellations of the indices. To produce the total number of shared directed butterflies, accounting for the symmetry, the two equations (7.2) and (7.3) should have read

\[
\frac{1}{4}\binom{|F_i \cap G_j|}{2} \sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i,k} \sum_{n \neq j,l} \bigl( |F_k \cap G_l|\,|F_m \cap G_n| + |F_k \cap G_n|\,|F_m \cap G_l| \bigr) \tag{7.4}
\]

and

\[
\frac{1}{4}\binom{I[i,j]}{2} \sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i,k} \sum_{n \neq j,l} \bigl( I[k,l]\,I[m,n] + I[k,n]\,I[m,l] \bigr). \tag{7.5}
\]

Figure 7.4 b) illustrates a choice of indices in Eq. (7.5). It is clear that for some fixed choice of the (i, j) entry, there are four ways to choose the other entries; as soon as one entry is selected, say (k, l), there is only one way to select the remaining three entries (m, n), (k, n) and (m, l) if precisely those four entries are to be selected.

As the result in the article is simply wrong by a factor of 2, the succeeding parts of the article are still valid if we divide by 2 instead of 4. One can show that the two subexpressions of Eq. (7.5) actually express the same quantity, so this is a correct workaround. This is most fortunate, since what remains from that point in the article is to transform Eq. (7.3) into an expression that is constant time computable. That work is therefore still valid.

Figure 7.4: Illustrates (a) the choice of entries in Eq. (7.3) and (b) the choice of entries in Eq.
(7.5).

A recipe for counting the shared butterflies associated with a pair of edges has now been described. This is done for all pairs of directed edges. As mentioned in the description of directed butterflies, one butterfly is identified by two different edges in one tree. The consequence is that each shared butterfly is associated with two different pairs of claims.
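For reference, the corrected count of Eq. (7.5) can be transcribed directly into a naive sketch that spends O(d^4) time per claim pair. This is a hypothetical helper for checking small cases, not the constant time version the algorithm actually uses; the product is always divisible by 4 since each butterfly appears in exactly four constellations.

```python
from math import comb

def shared_directed_butterflies(I, i, j):
    """Naive transcription of Eq. (7.5) for the claim pair whose edges
    point into subtrees F_i and G_j; I is the intersection matrix."""
    total = 0
    for k in range(len(I)):
        if k == i:
            continue
        for l in range(len(I[0])):
            if l == j:
                continue
            for m in range(len(I)):
                if m in (i, k):
                    continue
                for n in range(len(I[0])):
                    if n in (j, l):
                        continue
                    total += I[k][l] * I[m][n] + I[k][n] * I[m][l]
    return comb(I[i][j], 2) * total // 4

# Two identical trees, one inner node with subtrees {a,b}, {c}, {d} gives
# I = [[2,0,0],[0,1,0],[0,0,1]]; the single shared directed butterfly
# ab -> cd is found at (i, j) = (0, 0).
```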


If e and e′ are the edges that claim the directed quartet ab → cd in the two trees T and T′ respectively, the directed quartet ab ← cd will also be claimed by some edge in each tree; these edges will obviously point in the opposite directions. Let us denote these edges ē and ē′. Each butterfly will be counted twice: once as associated with the pair (e, e′) and once as associated with the pair (ē, ē′). Obviously, a directed quartet cannot be claimed by both e and ē, and therefore shared quartets are only counted twice. This might be obvious, but I stress the point because the same is not true when it comes to counting different butterflies.

7.1.3 Counting different butterflies

Different butterflies, the quantity diff_B(T, T′), can be counted using the following method. Once again, the edges e and e′ represent a pair of claims; name them F_i →_e (F_k, F_m) and G_j →_{e′} (G_l, G_n). Of interest are the quartets that are claimed by both edges where those two claims result in different butterfly topologies. As an example, the two directed butterflies ab →_e cd and ac →_{e′} bd would be the result of the following positioning of the leaves: a ∈ F_i ∩ G_j, b ∈ F_i ∩ G_l, c ∈ F_k ∩ G_j and d ∈ F_m ∩ G_n. This is illustrated in Fig. 7.5.

Figure 7.5: Two edges claim the butterflies ab → cd and ac → bd respectively.

This situation is expressed through the following equation:

\[
|F_i \cap G_j| \sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i,k} \sum_{n \neq j,l} |F_i \cap G_l|\,|F_k \cap G_j|\,|F_m \cap G_n|. \tag{7.6}
\]

And using matrix notation:

\[
I[i,j] \sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i,k} \sum_{n \neq j,l} I[i,l]\,I[k,j]\,I[m,n]. \tag{7.7}
\]

Figure 7.6 illustrates the entries that are multiplied for some choice of indices. Observing this or Fig. 7.5, it should be fairly clear that there is no symmetry issue when counting different butterflies. No two entries can be switched without forcing the other entries to move as well, making it impossible to produce the same situation with another


choice of indices.

Figure 7.6: Illustrates the choice of entries in Eq. (7.7).

Once again, each possible pair of directed edges needs processing. And once again, since a regular undirected butterfly quartet is the subject of two claims, each quartet is encountered both when dealing with e and when dealing with ē. Again, ē is the oppositely directed edge that claims the same quartet as e.

The total result of counting different butterflies should be divided by four, as opposed to the shared butterflies. While similarity is identified by observing the same two leaves behind the edge in both T and T′, difference is identified by requiring that the pairs of leaves behind the edges be different. The consequence is that if the edge pair (e, e′) identifies a different butterfly quartet, then so do the pairs (e, ē′), (ē, e′) and (ē, ē′). Thus, one single quartet is counted as different four times, and consequently the result should be divided by four.

Contribution 7.2 It is evident that the use of directed edge claiming introduces the need for dividing the total count by some factor. However, the article states that different butterflies are counted twice, just like shared butterflies, and that the result should therefore be divided by two.

After implementing the diff_B(T, T′) algorithm I observed that the result was, "for some reason", consistently double of what I expected.
After thoroughly verifying that my implementation was correct, I investigated the correctness of the article further. I realized that the reasoning should be different and that the result should be divided by four, as explained above.
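Eq. (7.7) likewise admits a direct, naive transcription, again O(d^4) per claim pair. This hypothetical reference implementation is only useful for checking small cases; the actual algorithm evaluates the constant time form of Sec. 7.1.4.

```python
def different_directed_butterflies(I, i, j):
    """Naive transcription of Eq. (7.7): the contribution of one claim
    pair to the count of different butterflies (before dividing by 4)."""
    total = 0
    for k in range(len(I)):
        if k == i:
            continue
        for l in range(len(I[0])):
            if l == j:
                continue
            for m in range(len(I)):
                if m in (i, k):
                    continue
                for n in range(len(I[0])):
                    if n in (j, l):
                        continue
                    total += I[i][l] * I[k][j] * I[m][n]
    return I[i][j] * total

# For two identical trees the matrix I is diagonal-heavy and every
# contribution is 0, as it should be: identical trees induce no
# different butterflies.
```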


7.1.4 How to count butterflies in constant time

Sec. 7.1.1, 7.1.2 and 7.1.3 gave an intuitive explanation of the algorithm. This section will clarify that the number of butterflies, shared and different, associated with a pair of directed edges can in fact be calculated in constant time, given the right preprocessing information. All preprocessing arrays and tables are listed in Appendix A. They will be referenced here, and the most important table, whose calculation poses a threat to the overall complexity of the algorithm, will be explained in detail below.

The article explains how the expression in Eq. (7.3), used to calculate the number of shared directed butterflies associated with a pair of internal nodes, can be translated into the following constant time computable expression:

\[
\frac{1}{2}\binom{I[i,j]}{2}\Bigl( M' - R'[i] - C'[j] + I'[i,j] + (I[i,j] - R[i] - C[j])(M - R[i] - C[j] + I[i,j]) + R''[i] - I[i,j](C[j] - I[i,j]) + C''[j] - I[i,j](R[i] - I[i,j]) \Bigr) \tag{7.8}
\]

And furthermore how the expression in Eq.
(7.7), used to calculate the number of different directed butterflies associated with a pair of internal nodes, is translated into one of the two succeeding expressions:

\[
I[i,j]\Bigl( (M - R[i] - C[j] + I[i,j])(R[i] - I[i,j])(C[j] - I[i,j]) + (R[i] - I[i,j])\bigl(I[i,j](R[i] - I[i,j]) - C''[j]\bigr) + (C[j] - I[i,j])\bigl(I[i,j](C[j] - I[i,j]) - R''[i]\bigr) + I'''_1[i,j] - I[i,j]\,I''_1[i,i] - I[i,j]\bigl(C'''[j] - I[i,j]^2\bigr) \Bigr) \tag{7.9}
\]

\[
I[i,j]\Bigl( (M - R[i] - C[j] + I[i,j])(R[i] - I[i,j])(C[j] - I[i,j]) + (R[i] - I[i,j])\bigl(I[i,j](R[i] - I[i,j]) - C''[j]\bigr) + (C[j] - I[i,j])\bigl(I[i,j](C[j] - I[i,j]) - R''[i]\bigr) + I'''_2[i,j] - I[i,j]\,I''_2[j,j] - I[i,j]\bigl(R'''[i] - I[i,j]^2\bigr) \Bigr) \tag{7.10}
\]

This requires some explanation. The respective translations into these constant time computable expressions are very cumbersome and not essential to this thesis; I will merely refer to the appendix of the original article by Mailund et al. [14] for the details of this process.


More interesting are the steps needed to prepare each of the tables used for look-up, because the actual calculation of these is part of the algorithm, and a thorough understanding is therefore essential. All tables are prepared during the preprocessing of each pair of internal nodes, along with the table I presented in Sec. 7.1.1. They are all results of further processing of I and are necessary for the constant time calculation. With the exception of the tables I'''_1 and I'''_2, none of the tables are time-consuming to deal with; each costs no worse than O(d_v d_{v′}) which, for all pairs of inner nodes, leads to a total time of

\[
\sum_{v\in T} \sum_{v'\in T'} d_v d_{v'} = \Bigl(\sum_{v\in T} d_v\Bigr)\Bigl(\sum_{v'\in T'} d_{v'}\Bigr) \le (2|E|)(2|E|) = O(n^2).
\]

These two exceptions, I'''_1 and I'''_2, are actually identical and only differ in the way they are calculated. Under one name, they shall simply be known as I''', and the calculation of this table is explained thoroughly in the following Sec. 7.1.4.1.

7.1.4.1 The calculation of I'''

The table named I''' plays a part in the calculation of the number of different butterflies, and in order to keep the entire running time of the algorithm sub-cubic, special care is needed in its calculation. It is defined as follows:

\[
I'''[i,j] = \sum_{\substack{k=1 \\ k\neq i}}^{d_v} \sum_{\substack{l=1 \\ l\neq j}}^{d_{v'}} I[i,l]\,I[k,j]\,I[k,l] \tag{7.11}
\]

Filling the table naively, in accordance with the formula, takes O(n^4) time, and the promised sub-cubic time bound will be broken.
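Dropping the k ≠ i and l ≠ j restrictions from Eq. (7.11) turns the entry into the (i, j) entry of the triple matrix product I I^T I, which by associativity can be bracketed either way; the correction terms in the constant time expressions then account for the excluded indices. A small pure-Python sketch, where the example matrix and the naive matmul helper are illustrative assumptions (in practice the product would be an optimized library call):

```python
def matmul(A, B):
    """Naive matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# A hypothetical 3x2 intersection matrix I for a node pair with
# degrees d_v = 3 and d_v' = 2:
I = [[1, 2], [3, 4], [5, 6]]

I1 = matmul(matmul(I, transpose(I)), I)  # (I I^T) I, ~ d_v^2 * d_v' work
I2 = matmul(I, matmul(transpose(I), I))  # I (I^T I), ~ d_v * d_v'^2 work
assert I1 == I2  # associativity: both give the unrestricted triple sum

# Entry (0, 0) equals the sum over all k, l of I[0][l] * I[k][0] * I[k][l]:
assert I1[0][0] == sum(I[0][l] * I[k][0] * I[k][l]
                       for k in range(3) for l in range(2))
```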
Hence, the calculation constitutes a serious barrier and demands separate attention. The solution is instead to calculate either I'''_1 = (I I^T) I or I'''_2 = I (I^T I), both of which are described in more detail in the appendix. At first sight this does not seem to solve the problem, since the solution relies on matrix multiplication, and the complexity of matrix multiplication, if done naively, is O(n^3). As explained in the article [14], choosing the first or the second solution results in an explicit running time of O(d_v^2 d_{v′}) or O(d_v d_{v′}^2), respectively, for processing a pair of internal nodes (v, v′) with degrees d_v and d_{v′}. However, other methods for calculating the matrix product may be utilized, and this is essential to the algorithm. Applying a matrix multiplication method with a time complexity of O(n^ω) on square matrices, one can make the calculation in time O(max(d_v, d_{v′})^ω). This value might be smaller for some matrices that are nearly square, but the approach requires that the matrices be padded with zeroes to become square, i.e. extended to fit the requirements of the matrix multiplication method used.

It is difficult to predict the impact of the matrix multiplication on the entire algorithm; since it is applied per pair of internal nodes, the complete running time is not identical to that of the multiplication method. In the article [14], Section 4 gives a thorough case


analysis of the problem, clarifying that the worst-case running time of the algorithm is O(n^{α+2}) where α = (ω − 1)/2. A theoretical asymptotic lower bound on matrix multiplication is Ω(n^2), because all 2n^2 input entries have to be processed, and thus one can assume that 2 ≤ ω ≤ 3. The consequence is that 1/2 ≤ α ≤ 1, so the algorithm might be sub-cubic, but not below Ω(n^{2.5}).

The article mentions the Coppersmith-Winograd algorithm [7] as an example of a fast method for matrix multiplication, featuring ω = 2.376, which, according to the analysis, results in a running time of O(n^{2.688}).

Nevertheless, these theoretical considerations regarding the exponent are not the focus of this thesis. First, such an analysis merely establishes a worst-case bound of the algorithm, which might be loose and not even close to the real value; an implementation might therefore yield better results. Second, ω is in fact theoretical, meaning that it would have to be estimated through practical experiments to be used as a parameter of the algorithm. Regardless, I am interested in studying the algorithm as a tool for calculating the quartet distance, and therefore the best choice of matrix multiplication method will be some robust, practically fast implementation.
For example, a quick investigation of Coppersmith-Winograd reveals that the algorithm is not applicable to real data because the constant factors are far too large; the benefit would not be seen on data of a size manageable on present-day computers.

The worst-case scenario is that the analysis given in the article is tight, meaning that it is impossible to do better in practice. Suppose, at the same time, that no practically useful sub-cubic method for matrix multiplication exists. This would lead to a strict performance of Θ(n^3) in practice. Hopefully, the analysis is "loose" and leaves room for improvement, meaning that an implementation does not necessarily perform close to the predicted asymptotic worst-case bound. Furthermore, it may be easy to find an efficient matrix multiplication implementation, be it naive or sub-cubic.

In short, the algorithm leaves us with a choice to make, namely to decide

\[
\min\bigl( \max(d_v, d_{v'})^{\omega},\; d_v^2\, d_{v'},\; d_v\, d_{v'}^2 \bigr), \tag{7.12}
\]

which depends on the exponent ω. In practice, however, this is a bit more cumbersome than simply calculating that expression, since the two methods for matrix multiplication, naive and fast, hide different constant coefficients. One will need to experimentally estimate ω and/or draw an empirical conclusion on when to choose one way over the others. This is further investigated in Sec. 7.2.2, where I choose a fast and robust implementation and use it as a black box.
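The decision of Eq. (7.12) can be sketched as a simple cost comparison. This is illustrative only: the exponent, and in reality the hidden constant factors, would have to be estimated experimentally, and the strategy names are my own labels.

```python
def choose_strategy(dv, dvp, omega=2.376):
    """Pick the cheapest way to obtain I''' for a node pair with degrees
    dv and dvp, per Eq. (7.12). Constant factors are ignored here."""
    costs = {
        "fast, padded to square": max(dv, dvp) ** omega,
        "naive (I I^T) I": dv * dv * dvp,
        "naive I (I^T I)": dv * dvp * dvp,
    }
    return min(costs, key=costs.get)

# choose_strategy(100, 100) -> "fast, padded to square"
# choose_strategy(100, 2)   -> "naive I (I^T I)"
```

With ω = 2.376 the fast method wins for large, nearly square node pairs, while very skewed degree pairs are cheaper to handle with the appropriately bracketed naive product.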


7.2 Implementation

This section describes my work on implementing the sub-cubic algorithm of Sec. 7.1. As mentioned, it comes down to implementing two algorithms, namely shared_B(T, T′) and diff_B(T, T′), used for counting shared and different butterflies in the two trees T and T′.

To calculate the total number of butterflies in a single tree, the algorithm for counting shared butterflies is used. This means that the basic preprocessing, namely calculating the shared leaf set sizes, has to be done three times: once for counting shared and different butterflies in the two trees, once for counting butterflies in the tree T, and once for counting butterflies in the tree T′. The implementation of the algorithm for shared leaf set size calculation has been described in Sec. 5.3 and comprises the most space consuming part of the sub-cubic algorithm, resulting in O(n^2) memory usage.

The paper [14] presents the algorithms as follows: do the preprocessing for each pair of inner nodes, then do the counting for each pair of directed edges. Instead, I have chosen a strategy of processing each pair of inner nodes in turn, doing both the preprocessing and the counting for each pair of directed edges adjacent to the nodes. Consequently, the only real preprocessing done prior to processing the nodes is calculating the shared leaf set sizes, the implementation of which is described in Sec. 5.3.

The full amount of preprocessing information needed to deal with a pair of nodes is listed in App. A.
Some of it is needed in both shared_B(T, T′) and diff_B(T, T′), and some is specific to the latter. Therefore, I implemented the two together in a single function capable of returning both counts. If both values are needed, the preprocessing specific to counting different butterflies is just an extension of the preprocessing done for shared butterflies.

Implementing shared_B(T, T′) is straightforward, following the recipe given in Sec. 7.1.2, 7.1.4 and App. A. I loop through every pair of internal nodes and fill out all the preprocessing tables needed for shared quartets only. Then I go through the pairs of edges adjacent to the two nodes, apply Eq. (7.8), and add the result to the total sum of shared butterflies. I make use of the Boost¹ library for binomial coefficient calculation. Before the total number of shared butterflies is returned, the result is divided by 4: by 2 because of symmetry (see the clarification in Contribution 7.1) and by 2 because there are twice as many directed edges as there are undirected edges. One thing to note is that it only makes sense to look at edges pointing to internal nodes, since there have to be at least two leaves behind the edge (e.g. in the subtree F_i in Fig. 7.3) to form a butterfly.

Implementing diff_B(T, T′) requires more attention. If the analysis given in the paper

¹ Boost: http://www.boost.org/


is optimal, this is a key point for success or failure, since the matrix multiplication poses the worst threat to the promised sub-cubic time bound. Solving the task involves deciding which library to use for matrix multiplication and studying its behaviour, so as to find out how to make the crucial comparison of the three quantities max(d_v, d_{v′})^ω, d_v^2 d_{v′} and d_v d_{v′}^2.

After implementing those two algorithms, only one thing remains, namely to apply them to the two trees and put the counts together according to the expression shown in Eq. (7.1).

7.2.1 Prototype

Starting out softly, the first goal is to make a prototype implementation focusing only on producing the correct result. Consequently, I make a simple implementation of the choice and base it solely on which of d_v^2 d_{v′} and d_v d_{v′}^2 is smaller. That is easy to determine, and along with the choice comes the calculation of either C''' and I'''_1 = (I I^T) I, or R''' and I'''_2 = I (I^T I), respectively. I use a basic, naive implementation of matrix multiplication. For the Python implementation I utilize the NumPy library for scientific computing² and for the C++ implementation the Boost.uBLAS library³. They both provide matrix data structures and routines for matrix multiplication. Then it is all about looping through pairs of edges, applying Eq. (7.9) or Eq. (7.10), and summing up the results.
Before returning, the result is divided by four, because the directed edges give rise to four different situations (see Contribution 7.2).

Expectations  So, what would one expect from the prototype implementations when subjected to the usual range of trees? Using naive matrix multiplication, and hence the value α = 1, makes the whole algorithm O(n^3). Thus, we can expect to see cubic worst-case behavior. However, since the analysis is simply worst-case, we can still, as mentioned earlier, hope that it is not a tight upper bound and that the algorithm is efficient in practice.

With regard to the difference in performance between the trees used, I will, once again, base my expectations on the actual code written. From my point of view there is one reasonable way to look at this. The algorithm processes pairs of internal nodes, and for each of these pairs there is some preprocessing and some calculation to do, both of which depend on the degrees of the two internal nodes. The preprocessing relies on matrix multiplication, meaning that larger degrees will be harder to deal with. Depending on the overhead due to preprocessing and calculations, the relationship between the number of inner nodes and their degree might be more or less important. The same relation made the case analysis difficult.

² SciPy/NumPy: http://numpy.scipy.org/
³ Boost.uBLAS: http://www.boost.org/doc/libs/1_42_0/libs/numeric/ublas/doc/index.htm

Based on these observations, my assumption is that trees with a large number of high-degree internal nodes will pose the worst challenge; here the implementation will suffer from the increasing cost of the matrix multiplication, while the overhead of the other calculations will influence the time consumption as well. As described in Sec. 3.1, sqrt trees are intended to have this property. For bin trees, however, with a large number of small-degree inner nodes, the cost of matrix multiplication per node pair never grows. Therefore the result will probably be that this cost can be disregarded and that the algorithm will behave as a quadratic function, even though there might be a large overhead due to the many node pairs being processed. The star trees have one node of much higher degree, which could result in a very small overhead but a much faster growth.

Result  The performance of the prototypes applied to the five types of test trees is shown in Fig. 7.7 and 7.8. The result is quite surprising, and the first thing that catches the eye is the relatively slow growth of the time usage. For the Python implementation the running times seem to follow a more or less quadratic development, with the star tree being the worst challenge, resulting in a slightly steeper slope, however, far from cubic.
The plot of the C++ implementation displays the same tendency, the development being close to quadratic. It is, however, even clearer that the implementation suffers from matrix multiplication costs when dealing with star trees and, to some extent, also the wc trees.

The conclusion is, as assumed, that the algorithm responds differently to trees depending on their internal structure. For smaller input trees, the algorithm easily deals with the ones that have a small number of internal nodes, whereas a complex internal structure with a large number of internal nodes results in a large overhead. When the input size is increased, the overhead is neutralized and the more dominant characteristic seems to be the maximum degree of the internal nodes. Large degrees result in large matrix multiplications, and the consequence is a faster growth in time usage.

These experiments have been repeated five times each, once again with the exception of some of the larger Python experiments. Especially the large experiments on bin, ran and wc trees are slow and have not been repeated.

Figure 7.7: Performance of the naive prototype of the sub-cubic algorithm in Python (best-fit exponents: bin 2.03, ran 2.06, sqrt 1.81, star 2.02, wc 1.99).

Figure 7.8: Performance of the naive prototype of the sub-cubic algorithm in C++ (best-fit exponents: bin 2.05, ran 2.06, sqrt 1.90, star 2.96, wc 2.18).

7.2.2 Introducing a library for matrix multiplication

Having gained the first experience with the algorithm, I will now introduce a better method for matrix multiplication and attempt to improve the algorithm in accordance with the theoretic approach.

Mailund et al. [14] suggest the use of advanced matrix multiplication methods and substantiate their algorithmic results with the best theoretic result within the field of matrix multiplication, namely the Coppersmith-Winograd algorithm [7]. As mentioned in Chap. 1, this kind of theoretic result will be met with immediate scepticism by most programmers, since one might suspect that the demands of the theory can never be met by an implementation; in this case because the size of the input is too small for the improvement to be observed.

Here the ideal situation would be to find an algorithm that is sub-cubic in theory and still efficient when implemented in practice. Searching the literature and the web for a reasonable method for matrix multiplication, I came across the Strassen algorithm [18], which is approximately O(n^2.8) and is described as being efficient in practice for large matrices. However, I was not successful in finding an optimized implementation of the algorithm or a linear algebra library containing one. Since matrix multiplication is not the focus of this thesis, but rather a smaller piece in the algorithmic puzzle, time did not allow me to attempt implementing it on my own. In addition, it seems unreasonable to expect that I would be competitive with highly optimized libraries.

Instead I decided to go with a highly optimized, reliable and robust library.
This is a meaningful decision, since the goal of the thesis is to test the practical usefulness of the algorithm. What is the use of a theoretic result if it is not practically applicable? The final choice was to use the Basic Linear Algebra Subprograms (BLAS) API^4, which is a de facto standard for various linear algebra packages. Since my first implementation was in Python, BLAS caught my attention after discovering that it integrates with SciPy/NumPy, which is automatically compiled against the BLAS installation if present on the machine. Furthermore, the Boost.NumericBindings library^5 provides a generic layer between the Boost.uBlas data types and the BLAS linear algebra routines in C++.

The exact routines utilized are level 3 BLAS calls for matrix-matrix multiplication. In NumPy and Boost.NumericBindings these have general interfaces used through the calls C = dot(A,B) and void gemm(A,B,C) respectively. Both solutions require a bit of setup using the right type of matrix container structure, but with the libraries this is not difficult. I will not provide further details, but merely refer to the code, which is available (see App. B).

To get an idea of how the performance improved, a comparison has been made of the four implementations used, namely the two used in the prototypes and the two BLAS integrations. The result of multiplying square matrices of increasing sizes is illustrated in the plot of Fig. 7.9.

^4 The BLAS specification: http://www.netlib.org/blas/
^5 Boost.NumericBindings: http://svn.boost.org/svn/boost/sandbox/numeric_bindings-v1/libs/numeric/bindings/doc/index.html

[Figure 7.9: Comparison of the matrix multiplication methods used. Log-log plot of running time t(n) in seconds against matrix size n (50 to 800) for python-scipy, python-blas, cpp-boost and cpp-blas, with an n^3 reference line.]

As expected, the BLAS calls are significantly faster than the basic implementations, by nearly as much as a factor of one thousand. The plot is based on ten repetitions of each matrix multiplication, but nevertheless the points are a bit scattered and do not form straight lines. I attribute this to the methods used, which may suffer from structural influences; this could also be indicated by the similar behaviour of the two BLAS routines. Especially for the larger matrices the optimized calls are superior. It is, however, surprising to see that both Python calls perform better than their respective C++ counterparts.

The only method that seems to be strictly O(n^3) is the C++ Boost.uBlas call. In retrospect one might see this as the reason why the star tree was such a challenge to the C++ prototype implementation.

7.2.3 The final version

Now, had I been successful in finding an implementation of a theoretically good algorithm, the next step would have been to estimate the exponent ω, and thus be able to choose the smallest of max(d_v, d_v')^ω, d_v^2 d_v' and d_v d_v'^2. However, as discussed in Sec. 7.1.1, the matrix multiplication routines hide away different coefficients and, in addition, time is spent padding the matrices in some cases, all of which has to be taken into account as


well. My implementation will differ from the theory by making no distinction between the naive and some advanced method for matrix multiplication: I might as well use BLAS in every case, since it is not specifically designed for square matrices. Even though I have no knowledge of the internals of the BLAS implementation, I can treat it as a black box and test it as a piece in the algorithmic puzzle. With all of this in mind, I will make an array of experiments with the purpose of clarifying exactly when, if ever, it is feasible to pad the matrices, making them square, before applying the BLAS routine.

I have been executing an equivalent of the prototype algorithm, but substituting the matrix multiplication method with BLAS. In each case where the matrix multiplication had to be carried out, I made both calculations, with and without padding of the matrices, comparing the running time of that particular part of the execution. This has been done for each combination of the different types of test trees, for larger trees of mainly 800 leaves. The result is shown in Table 7.1. For each execution of the algorithm it is shown how many times the multiplication was carried out and for how many of those padding was advantageous; this is also shown percentage-wise. For every case where padding was in fact advantageous, the table shows the largest degree of an internal node and the largest difference between the degrees of two of the internal nodes being processed, which is identical to the difference between the number of rows and columns in the matrix.

Trees          Number of         Times padding      Largest   Largest difference   %
               multiplications   was an advantage   degree    in degree
bin vs bin     636804            0                  -         -                    0.000
ran vs ran     154449            30                 9         4                    0.019
sqrt vs sqrt   841               0                  -         -                    0.000
star vs star   1                 0                  -         -                    0.000
wc vs wc       160801            0                  -         -                    0.000
bin vs ran     313614            43                 7         4                    0.014
bin vs sqrt    8358              0                  -         -                    0.000
bin vs star    798               0                  -         -                    0.000
bin vs wc      319998            0                  -         -                    0.000
ran vs sqrt    4221              0                  -         -                    0.000
ran vs star    393               0                  -         -                    0.000
ran vs wc      157593            130                7         4                    0.082
sqrt vs star   21                0                  -         -                    0.000
sqrt vs wc     4221              0                  -         -                    0.000
star vs wc     401               0                  -         -                    0.000

Table 7.1: Investigating when it is advantageous to multiply square matrices.
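The padding variant compared above can be reproduced in miniature. This sketch is my own illustration, not the thesis code: it zero-pads a rectangular matrix pair to square before multiplying and confirms that the relevant block of the padded product equals the unpadded one, which is why the two variants are interchangeable up to running time.

```python
import numpy as np

def pad_to_square(A, n):
    """Embed A in the top-left corner of an n x n zero matrix."""
    P = np.zeros((n, n))
    P[:A.shape[0], :A.shape[1]] = A
    return P

rng = np.random.default_rng(1)
r, c = 6, 9                       # a rectangular case: unequal degrees
A = rng.standard_normal((r, c))
B = rng.standard_normal((c, r))
n = max(r, c)

direct = A @ B                    # multiply the rectangular matrices as-is
padded = pad_to_square(A, n) @ pad_to_square(B, n)

# The top-left r x r block of the padded product is the direct product;
# the extra rows and columns contribute only zeros.
assert np.allclose(padded[:r, :r], direct)
```

Timing the two calls on the matrix shapes actually encountered is then exactly the comparison reported in Table 7.1.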


From the table it is evident that padding the matrices does not speed up the multiplication. Only in three experiments did it seem faster in some cases, and only in a very small percentage of the multiplications. Furthermore, from the sizes of the internal nodes involved we observe that it is only when dealing with very small matrices (#rows/#columns < 10) that we "benefit" from padding. More likely, it is a coincidence rather than an actual tendency. My conclusion is that it is not beneficial to pad the matrices and consequently, there is no reason to separately take care of the case where max(d_v, d_v')^ω is smallest; I will continue to distinguish only between d_v^2 d_v' and d_v d_v'^2, as done in the prototype.

Expectations  The implementation still does not meet the formal description of the article, which means that we cannot count on the analysis; thus, the implementation might break the sub-cubic time bound and behave as a cubic algorithm. It seems reasonable to keep the assumptions from the prototype. Nevertheless, I hope to see a general improvement, which of course is dependent on the proportion of time actually spent on matrix multiplication.
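The distinction between d_v^2 d_v' and d_v d_v'^2 mentioned above corresponds to the two ways of associating a product of the form I I^T I for a d_v x d_v' matrix I (cf. App. A). The sketch below is my own illustration of that choice, not the thesis code: it picks the association with the smaller naive multiplication cost and checks that both associations give the same result.

```python
import numpy as np

def triple_product(I):
    """Compute I @ I.T @ I, associating to minimize naive cost.

    For a d_v x d_v' matrix I, (I I^T) I costs about d_v^2 d_v'
    scalar multiplications while I (I^T I) costs about d_v d_v'^2,
    so the cheaper association depends on which dimension is larger.
    """
    dv, dvp = I.shape
    if dv * dv * dvp <= dv * dvp * dvp:   # d_v^2 d_v' vs d_v d_v'^2
        return (I @ I.T) @ I
    return I @ (I.T @ I)

rng = np.random.default_rng(2)
I_wide = rng.standard_normal((4, 12))    # d_v < d_v': left association wins
I_tall = rng.standard_normal((12, 4))    # d_v > d_v': right association wins
for I in (I_wide, I_tall):
    # Associativity guarantees both orders agree; only the cost differs.
    assert np.allclose(triple_product(I), I @ I.T @ I)
```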
This should be most significant for the experiments involving the largest multiplications, namely the star trees and the worst case trees, which seemed to be suffering because of the larger matrices encountered.

Result  The result of applying the two final implementations to the usual array of test data is shown in Fig. 7.10 and Fig. 7.11. We observe that there is no significant improvement in running time. The general picture is much the same as for the prototype; the final implementations perform in the same range of time consumption as the prototype implementations, with a few exceptions. The reason is very likely the relatively low amount of time spent doing matrix multiplication and the relatively high amount of time spent processing the pairs of internal nodes.

There are a couple of cases, however, where a change in the slope of the plot is clearly observable. That is, not surprisingly, the trees with few inner nodes, resulting in large matrices: star trees and wc trees. And only for the C++ implementation. Especially the star tree yields a dramatic improvement.
The plot of the C++ prototype, which was almost parallel to the O(n^3) line, has now improved to being almost parallel to the O(n^2) line. However, the plot does not form an exact straight line, which I suspect is because the overhead is gradually equalized by the complexity of the matrix multiplication as the size of the matrices grows. Eventually this will result in the plot showing the complexity of the matrix multiplication only. The processing time of an 800-leaf star tree is lowered from over 100 seconds to below 10 seconds, meaning that the change of matrix library was indeed a great improvement.


[Figure 7.10: Performance of the sub-cubic algorithm in Python. Log-log plot of running time t(n) in seconds against the number of leaves n (50 to 800), with n, n^2 and n^3 reference lines. Best-fit exponents: bin 2.04, ran 2.07, sqrt 1.79, star 2.00, wc 2.00.]

[Figure 7.11: Performance of the sub-cubic algorithm in C++. Same axes and reference lines. Best-fit exponents: bin 2.05, ran 2.07, sqrt 1.81, star 2.04, wc 2.02.]


Further verifications of the quality of the algorithm  In order to establish complete confidence in the quality of the algorithm, two more experiments have been carried out. The first one is an extension of the experiments already done, but with the purpose of examining more challenging combinations of trees to compare. This is done to further ensure that the algorithm really is behaving well in all imaginable scenarios, which might indicate that it is in fact sub-cubic, even though we do not have the proof of the analysis to back up such a claim.

The trees having a single internal node of very high degree, namely the star tree and the wc tree, were the ones showing the steepest growth, because of the multiplication of larger matrices. I will combine these with the trees having a larger number of internal nodes and yielding a worse actual running time, to see if the combination of the two properties will result in a longer running time than the two do separately.

The result is clear. These new combinations of input are not a greater challenge to the algorithm.
Figure 7.12 shows that the resulting performance is no worse than what we have previously witnessed; the growth in running time seems close to quadratic and the actual running time is better than the worst we have seen so far.

[Figure 7.12: Combining the trees with high-degree inner nodes with other trees. C++ implementation, log-log plot of running time t(n) in seconds against the number of leaves n (50 to 800), with n, n^2 and n^3 reference lines. Best-fit exponents: star-bin 2.03, star-ran 2.05, star-wc 2.01, wc-bin 2.04, wc-ran 2.04.]

The purpose of the second experiment is to ensure that the arrangement, or order, of the input does not influence the running time. Some algorithms perform significantly better on certain inputs. For example, some sorting algorithms, like quicksort and bubble sort, will, in spite of a quadratic worst-case running time, perform in linear time on input that is nearly sorted, indeed giving a wrong impression of the algorithm. My implementation makes a depth-first numbering of the leaves in T and then numbers the leaves in T' accordingly. This numbering determines in which order we process different parts of the tree. If we shuffle the leaves in both trees and redo the experiment, the topologies of the trees remain the same, but the algorithm will process each tree in a different order.

Repeating the experiment 25 times with different permutations of the leaves results in the plot displayed in Fig. 7.13. The difference in performance is difficult to observe and it is evident that the order of the leaves is not a decisive factor; the running time is only dependent on the structure of the tree. These results will hopefully heighten the level of confidence.

[Figure 7.13: Shuffling the leaves does not yield a different running time. C++ implementation with leaves shuffled before execution, same axes and reference lines as above. Best-fit exponents: bin 2.06, ran 2.07, sqrt 1.79, star 2.04, wc 2.02.]


Chapter 8

Results and discussion

The experiments with each of the three algorithms have been commented on separately in Chap. 4, 6 and 7. The general impression has been positive: the quartic and cubic algorithms were indeed performing within, and also rather close to, the expected time bounds; however, the quartic algorithm was more sensitive to the variations in input data. The implementation of the sub-cubic algorithm has not precisely followed the theoretic guidelines and as a consequence some compromises have been introduced. More precisely, experimental verification made it clear that padding the matrices to square matrices did not give an advantage over the normal multiplication. Nevertheless, the final implementation has been thoroughly tested with different input, in an attempt to find the worst possible input, and the result is an algorithm that behaves sub-cubically in every situation. However, while the algorithm shows a sub-cubic performance for all types of trees presented, both the actual running time and the development in time usage clearly vary between trees.
Hence, the star tree has an interesting role. For the size of trees used in this thesis (at most 800 leaves), the star tree is both the fastest to deal with and the tree that makes the algorithm show the worst development in time usage. As a consequence, I find it natural to assume that the running time of the sub-cubic algorithm on the star tree, and to some extent also the wc tree, will continue growing faster than on the other inputs and that eventually, on much larger trees, the star trees will be the most time consuming inputs to deal with.

The final result of the experiments was briefly revealed and commented on rather cursorily in Chap. 1, but is shown again in Fig. 8.1 and Fig. 8.2. The two figures compare the experimental results of the three algorithms implemented in Python and C++ respectively. These plots are clear evidence that the sub-cubic algorithm not only meets the theoretic expectations but is faster in practice as well. The C++ implementation is faster for trees with 80 leaves and more, while the Python implementation is not clearly


faster for trees with less than 200 leaves.

A general observation based on the experiments with the three algorithms is that even though their respective time complexities are only defined in terms of the input size n, this does not mean that the implementations are not sensitive to the structure of the input, e.g. the degree of the internal nodes.

8.1 Large numbers

One thing not mentioned, but important to every quartet distance algorithm, is the need to handle large numbers. Quartet distance results can often be very large, simply because there is such a large number of quartets to compare. As an example, 600 leaves result in C(600, 4) = 5,346,164,850 distinct quartets to compare, which is a number exceeding the capacity of a normal 32 bit integer. Thus, if many of those quartets inherit a different topology in the two trees, there is a risk that the result will overflow. Therefore, this should be taken into consideration when implementing a quartet distance algorithm: which size of trees are to be compared and which architecture will be used.

Python provides transparent handling of large numbers, which will be enough for algorithms like the cubic and quartic algorithms.
However, when utilizing external routines, as was the case with BLAS matrix multiplication in the sub-cubic algorithm, special care must be taken and the right data structures used.

This means that pure C++ implementations, without special libraries for large numbers, are limited by the architecture, but pure Python implementations are not. In any case, external procedures may impose further restrictions.
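The overflow concern can be made concrete. Python integers are arbitrary precision, so the quartet count from the text never wraps, whereas a signed 32-bit integer would; the wraparound arithmetic below is only an illustration of what a fixed-width counter would do.

```python
import math

n = 600
quartets = math.comb(n, 4)          # number of distinct quartets on n leaves
assert quartets == 5346164850       # the count computed in the text
assert quartets > 2**31 - 1         # exceeds a signed 32-bit integer

# Illustrative signed 32-bit wraparound: the same count stored in a
# 32-bit integer would come out as a different (wrong) value.
wrapped = ((quartets + 2**31) % 2**32) - 2**31
assert wrapped != quartets
```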


[Figure 8.1: Comparison of the performances of the three algorithms implemented in Python. Log-log plot of running time t(n) in seconds against the number of leaves n (50 to 800) for the sub-cubic, cubic and quartic algorithms, with n^2, n^3 and n^4 reference lines.]

[Figure 8.2: Comparison of the performances of the three algorithms implemented in C++. Same axes and reference lines.]


Chapter 9

Conclusion

The focus of this thesis has been a thorough experimental study of the algorithm for quartet distance computation between general trees described by Mailund et al. [14], with the aim of eventually being able to either accept or reject the postulated sub-cubic running time as a practically usable result.

As a foundation I have given a gentle introduction to the subject of quartet distance calculation, followed by a detailed description of two reference algorithms, a quartic and a cubic one, including implementations and experimental verification of both.

Since my focus has been on experiments, I have given a careful review of a wide range of test data, arguing that the whole spectrum of challenging input has been covered.

I have implemented the supposed sub-cubic time algorithm and my conclusion is positive: I am certainly convinced that the algorithm performs in sub-cubic time in practice. However, my implementation does not strictly follow the theoretical proposal and consequently I do not have the support of the analytic result to rely on. Nevertheless I can provide robust evidence to support my conclusion: exposing the algorithm to the entire set of test trees showed good practical behavior in every situation. As a matter of fact, the algorithm shows quadratic behavior on most input, and especially in the least artificial cases.
The most challenging input introduced has been the star tree, a single internal node with n leaves attached, which forces the algorithm to make a multiplication of two n × n matrices. However, the situation is very artificial, and the two matrices multiplied contain zeroes in every entry except n entries that contain ones. It is not within the scope of this thesis to gain knowledge about the internals of the library used for matrix multiplication and therefore I do not know if such matrices are treated differently. The experiments on star trees show good results, but the need for sub-cubic matrix multiplication is inevitable.

Another view of this conclusion is that the analysis of the algorithm is not tight


enough to reflect the actual running time. The algorithm performs better than predicted on most input, but given two star trees it should perform with exactly the same time complexity as the matrix multiplication method in use, namely in O(n^ω) time.

The resulting implementation of the sub-cubic algorithm is not only behaving well with regard to time development; it also performs better than the two reference algorithms and thus, the algorithm does not suffer from being the product of a seemingly complex idea.

Finally, I have verified the correctness of each algorithm and left a few contributions in this regard. These are highlighted as Contributions 6.1, 7.1 and 7.2.

In summary I have verified the correctness and time complexity of three algorithms, with focus on the sub-cubic algorithm. Hopefully I have left the reader with enough insight and the necessary tools to continue the work in a direction towards an even faster algorithm.

9.1 Future work

I find it natural that this work should be extended algorithmically in the attempt to find a purely quadratic algorithm. I think the ideas used by Mailund et al. [14] are interesting and promising. In any case, the sub-cubic algorithm may be usable in practice, but it is not superior to the other existing algorithms for general trees and hence, I think algorithmic improvements would be far more valuable than practical optimizations.

I think focus should be on finding a different way to calculate diff_B(T, T') so as to get rid of the matrix multiplication.
Alternatively, it might be possible to calculate the total number of quartets where both trees inherit a butterfly topology, call this quantity total_B(T, T'), and then calculate diff_B(T, T') = total_B(T, T') − shared_B(T, T').


Bibliography

[1] Mukul S. Bansal, Jianrong Dong, and David Fernández-Baca. Comparing and aggregating partially resolved trees. In LATIN'08: Proceedings of the 8th Latin American conference on Theoretical informatics, pages 72–83, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3-540-78772-0, 978-3-540-78772-3.

[2] Gerth Stølting Brodal, Rolf Fagerberg, and Christian N. S. Pedersen. Computing the quartet distance between evolutionary trees in time O(n log^2 n). Proceedings of the 12th International Symposium for Algorithms and Computation (ISAAC), 2223:731–742, 2001.

[3] Gerth Stølting Brodal, Rolf Fagerberg, and Christian N. S. Pedersen. Computing the quartet distance between evolutionary trees in time O(n log n). Algorithmica, 38:377–395, 2003.

[4] David Bryant, John Tsang, Paul E. Kearney, and Ming Li. Computing the quartet distance between evolutionary trees. In SODA, pages 285–286, 2000.

[5] Chris Christiansen and Martin Randers. Computing the quartet distance between trees of arbitrary degrees. Master's thesis, University of Aarhus, January 2006.

[6] Chris Christiansen, Thomas Mailund, Christian N. S. Pedersen, and Martin Randers. Computing the quartet distance between trees of arbitrary degree. In WABI, pages 77–88, 2005.

[7] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3):251–280, 1990. ISSN 0747-7171. Computational algebraic complexity editorial.

[8] Bhaskar DasGupta, Xin He, Tao Jiang, Ming Li, John Tromp, and Louxin Zhang. On computing the nearest neighbor interchange distance. 1997.


[9] Brian W. Davis, Gang Li, and William J. Murphy. Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae). Molecular Phylogenetics and Evolution, 56(1):64–76, 2010. ISSN 1055-7903. doi: 10.1016/j.ympev.2010.01.036.

[10] William H. E. Day. Optimal algorithms for comparing trees with labeled leaves. Journal of Classification, 2:7–28, 1985. ISSN 0176-4268. 10.1007/BF01908061.

[11] Annette Dobson. Comparing the shapes of trees. In Anne Street and Walter Wallis, editors, Combinatorial Mathematics III, volume 452 of Lecture Notes in Mathematics, pages 95–100. Springer Berlin / Heidelberg, 1975. 10.1007/BFb0069548.

[12] George F. Estabrook, F. R. McMorris, and Christopher A. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34(2):193–200, 1985. ISSN 00397989.

[13] Thomas Mailund and Christian N. S. Pedersen. QDist - quartet distance between evolutionary trees. Bioinformatics, 20(10):1636–1637, 2004.

[14] Thomas Mailund, Jesper Nielsen, and Christian N. S. Pedersen. A sub-cubic time algorithm for computing the quartet distance between two general trees. In International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS), pages 565–568, 2009.

[15] D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131–147, 1981. ISSN 0025-5564.

[16] Mike A. Steel and David Penny. Distributions of tree comparison metrics: some new results. Systematic Biology, 42(2):126–141, 1993. ISSN 10635157.

[17] Martin Stig Stissing, Christian N. S. Pedersen, Thomas Mailund, Gerth Stølting Brodal, and Rolf Fagerberg. Computing the quartet distance between evolutionary trees of bounded degree. In Series on Advances in Bioinformatics and Computational Biology, volume 5, pages 101–110, 2007.

[18] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969. ISSN 0029-599X.

[19] John Tsang. An approximation algorithm for character compatibility and fast quartet-based phylogenetic tree comparison. Master's thesis, University of Waterloo, October 2000.


[20] M. S. Waterman and T. F. Smith. On the similarity of dendrograms. Journal of Theoretical Biology, 73(4):789–800, 1978. ISSN 0022-5193.


Appendix A

Preprocessing for the sub-cubic algorithm

In this appendix I will list the details of the preprocessing steps omitted in the description of the sub-cubic algorithm in Chap. 7. Though crucial to the running time of the complete algorithm, they do not support the reader in obtaining an overview, and thus, I have chosen to hide them away here.

Recall from Sec. 7.1.1 the matrix I, which stores a subset of the table of shared leaf set sizes between two trees. All remaining preprocessing arrays and tables are results of further processing the information in I. Since the information makes little sense by itself, the naming conventions have been kept simple and are not very explanatory.

Row and column sums and the sum of all entries of I:

    R[i] = \sum_{j=1}^{d_{v'}} I[i,j]                                  (A.1)
    C[j] = \sum_{i=1}^{d_v} I[i,j]                                     (A.2)
    M = \sum_{i=1}^{d_v} \sum_{j=1}^{d_{v'}} I[i,j]                    (A.3)

Another matrix I', its row sums, column sums and its total sum of entries:

    I'[i,j] = I[i,j] (M - R[i] - C[j] + I[i,j])                        (A.4)
    R'[i] = \sum_{j=1}^{d_{v'}} I'[i,j]                                (A.5)


    C'[j] = \sum_{i=1}^{d_v} I'[i, j]                                         (A.6)

    M' = \sum_{i=1}^{d_v} \sum_{j=1}^{d_{v'}} I'[i, j]                        (A.7)

And two more arrays:

    R''[i] = \sum_{j=1}^{d_{v'}} I[i, j] (C[j] - I[i, j])                     (A.8)

    C''[j] = \sum_{i=1}^{d_v} I[i, j] (R[i] - I[i, j])                        (A.9)

Only one of the two following arrays needs calculation, depending on which of the two expressions for counting different butterflies is used, Eq. (7.9) or Eq. (7.10):

    R'''[i] = \sum_{j=1}^{d_{v'}} I[i, j]^2                                   (A.10)

    C'''[j] = \sum_{i=1}^{d_v} I[i, j]^2                                      (A.11)

With the approach taken in the article [14], the following table needs calculation:

    I'''[i, j] = \sum_{k=1, k \neq i}^{d_v} \sum_{l=1, l \neq j}^{d_{v'}} I[i, l] I[k, j] I[k, l]

However, if calculated naively, the promised sub-cubic time bound will be broken. The solution is instead to calculate either I''_1 = I I^T and then I'''_1 = I''_1 I, which is basically

    I''_1[i, k] = \sum_{j=1}^{d_{v'}} I[i, j] I[k, j]                         (A.12)

    I'''_1[i, j] = \sum_{k=1}^{d_v} I[k, j] I''_1[i, k],                      (A.13)

or otherwise calculate I''_2 = I^T I and then I'''_2 = I I''_2, which is basically

    I''_2[j, l] = \sum_{i=1}^{d_v} I[i, j] I[i, l]                            (A.14)

    I'''_2[i, j] = \sum_{l=1}^{d_{v'}} I[i, l] I''_2[l, j].                   (A.15)

The choice is essential to the algorithm and is explained in further detail in Sec. 7.1.4.1.
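The step above is where matrix multiplication enters the running time. As a sanity check (on a made-up random matrix, not thesis data), the following sketch evaluates Eqs. (A.12)–(A.13) both entry by entry with explicit loops and as the two matrix products I I^T and (I I^T) I, and confirms they agree.

```python
import numpy as np

# Hypothetical toy instance of the shared-leaf-set matrix I
# (d_v = 4 rows, d_v' = 3 columns).
rng = np.random.default_rng(0)
I = rng.integers(0, 5, size=(4, 3)).astype(float)
d_v, d_vp = I.shape

# Naive loop evaluation of Eq. (A.12): I''_1[i, k] = sum_j I[i,j] I[k,j].
I2_loop = np.zeros((d_v, d_v))
for i in range(d_v):
    for k in range(d_v):
        I2_loop[i, k] = sum(I[i, j] * I[k, j] for j in range(d_vp))

# Naive loop evaluation of Eq. (A.13): I'''_1[i, j] = sum_k I[k,j] I''_1[i,k].
I3_loop = np.zeros((d_v, d_vp))
for i in range(d_v):
    for j in range(d_vp):
        I3_loop[i, j] = sum(I[k, j] * I2_loop[i, k] for k in range(d_v))

# The same quantities as two matrix products; replacing these with a
# sub-cubic multiplication routine is what yields the O(n^{2+alpha}) bound.
I2_mat = I @ I.T        # I''_1  = I I^T
I3_mat = I2_mat @ I     # I'''_1 = I''_1 I

assert np.allclose(I2_loop, I2_mat)
assert np.allclose(I3_loop, I3_mat)
```

Note that the matrix products sum over all k and l; the k ≠ i, l ≠ j restriction in the definition of I''' is what the surrounding machinery of Chap. 7 accounts for.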


Appendix B

Obtaining and running the programs

A small website has been dedicated to this thesis. It gives information on how to download, compile and run the programs:

link: http://www.cs.au.dk/~dalko/thesis/

Otherwise, request the code at:

e-mail: anders.kabell.kristensen@gmail.com


Appendix C

Real-life application of the algorithm

Some of the programs produced in this thesis have already proved usable in practice. This was not a goal of the thesis, but I find it satisfying to know, and in retrospect I think it adds another level of justification to the thesis.

Peter Foster, Natural History Museum of London, inquired about software for quartet distance computation on general trees at the Bioinformatics Research Center at Aarhus University. Foster writes:

    I am working with my colleague Mark Wilkinson here at the Natural History Museum in London on developing a new method of constructing supertrees, called Quartet Joining. We hope that this will become an alternative to the widely-used MRP (matrix representation with parsimony) method of supertree construction. We test it by simulating a random master tree and then taking subset trees from that master tree, and then reconstructing supertrees from those subset trees, and then comparing the topology of the resulting supertree to the master. We have been using Robinson-Foulds distances, but have thought that using quartet distances would offer an additional perspective on the topology distance. However, simplistic quartet comparisons proved to be far too slow for sizable trees. Often our supertrees are not fully resolved, and so we needed a fast algorithm that could handle polytomies.

Peter let me know that the size of the trees compared is up to a few hundred leaves and that the experiments would be repeated hundreds or thousands of times. They usually work with Python, and therefore I sent my Python implementation of the sub-cubic algorithm. It compares two trees in less than a minute, and that worked alright for a start.


However, I later provided my C++ implementation, which does the job in a few seconds, and after wrapping it as a module for Python it worked painlessly with their software:

    The new algorithm is exactly what we needed, and it will be quite useful.

Foster provided an example of a master tree and a QJ tree made from some subsets of the master tree. They are displayed in Fig. C.1, and we can observe that they do indeed include polytomies. The quartet distance between the two trees is 317524.


[Two tree drawings over taxa t0–t59 omitted.]

Figure C.1: An example of a master tree and a reconstruction of the master tree by quartet joining.
