
Master's thesis

COMPUTING THE QUARTET DISTANCE BETWEEN GENERAL TREES:
AN EXPERIMENTAL STUDY OF A SUB-CUBIC ALGORITHM

by Anders Kabell Kristensen
student id: 20041248

November, 2010

with supervisor Christian Nørgaard Storm Pedersen
Department of Computer Science
Aarhus University
Denmark


Abstract

This thesis provides strong evidence that the quartet distance between two general trees is practically computable in sub-cubic time. The thesis comprises a detailed study of three algorithms for quartet distance computation between trees of arbitrary degree: a quartic, a cubic and a sub-cubic time algorithm. A property common to these three algorithms is that their time complexity depends only on the number of leaves in the two trees compared, not on the degree of internal nodes. Focus is on the sub-cubic algorithm suggested by Mailund et al. [14], which is currently the theoretically best algorithm in this category and has a time complexity of O(n^(2+α)), where α depends directly on the complexity of matrix multiplication. This dependence might prove to be a problematic obstacle in practice, since the need for sub-cubic matrix multiplication is inescapable. The goal is to reveal the practical behavior of the algorithm. Naturally, this is done through experimental verification using a wide range of input trees with different properties. The result is clear: the performance of the algorithm is close to quadratic for most input; however, some inputs reveal that the running time is dominated by that of matrix multiplication for very large inputs.

Among the contributions of the thesis are: a verification of the correctness and running time of the two reference algorithms, including a minor but essential correction of the cubic algorithm; verification of the correctness of the sub-cubic algorithm, leading to a few algorithmic observations that are crucial to obtain a correct result; practical verification of the theoretical bounds on the time complexity of the sub-cubic algorithm, including a discussion of a possibly tighter upper bound; and detailed documentation of the experimental approach, with a description of the steps necessary to implement the algorithms and reproduce the experiments. Finally, the result is a piece of software that is efficient in practice and has already shown its worth at the time of writing.


Acknowledgements

First of all I would like to thank my supervisor Christian Nørgaard Storm Pedersen for great support and feedback, for always being positive and encouraging, and simply for showing interest. Thanks to Jesper Nielsen, PhD student at BiRC¹, for some constructive discussions. Thanks go to Thomas Wessel and Anders Viskum for proofreading and good times in the crane. Thanks to my bro Kasper and to my darling Astrid for supreme proofreading.

Anders Kabell Kristensen,
Aarhus, October 31, 2010.

¹ Bioinformatics Research Center, Aarhus University: http://birc.au.dk/


Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Phylogenetic trees
  1.2 Measuring difference or similarity
    1.2.1 The quartet distance
  1.3 Overview of algorithms for quartet distance computation
    1.3.1 Final result
  1.4 Focus of the thesis
  1.5 Thesis outline
2 Prerequisites
  2.1 Terminology
  2.2 Choice of language and test environments
3 Experimental approach
  3.1 Trees
  3.2 Tree construction
    3.2.1 Newick format
    3.2.2 Parsing data files and building a tree data structure
4 Quartic time algorithm
  4.1 Implementation
    4.1.1 Result
5 Calculating leaf set sizes
  5.1 Subtree leaf set sizes
  5.2 Shared leaf set sizes
  5.3 Implementing leaf set algorithms
    5.3.1 Subtree leaf set sizes
    5.3.2 Shared leaf set sizes
6 Cubic time algorithm
  6.1 Implementation
    6.1.1 Result
7 Sub-cubic time algorithm
  7.1 The algorithm
    7.1.1 Basic preprocessing
    7.1.2 Counting shared butterflies
    7.1.3 Counting different butterflies
    7.1.4 How to count butterflies in constant time
      7.1.4.1 The calculation of I'''
  7.2 Implementation
    7.2.1 Prototype
    7.2.2 Introducing a library for matrix multiplication
    7.2.3 The final version
8 Results and discussion
  8.1 Large numbers
9 Conclusion
  9.1 Future work
Bibliography
A Preprocessing for the sub-cubic algorithm
B Obtaining and running the programs
C Real-life application of the algorithm


Chapter 1

Introduction

The topic of quartet distance computation is studied by computer scientists, but it has its origin within evolutionary biology and requires some amount of motivation. The following introduction should be sufficient to understand why the topic is relevant and has applications within areas other than computer science. It will, however, by no means be comprehensive and should not be seen as a general introduction to phylogenetics.

1.1 Phylogenetic trees

An evolutionary or phylogenetic tree is a hierarchical structure, commonly used to express the interrelationship, with regard to inheritance, among a set of evolutionary units (EUs), e.g. different species, and it has found its use within the field of bioinformatics. The tree consists of a set of nodes connected by edges, and the configuration of these decides the structure or topology of the tree. By definition, there is exactly one path between any two nodes in a tree. The outermost nodes of the tree – the ones that are incident to only a single edge – are called leaves and represent EUs, while branching points in the tree represent speciation events. A speciation event is a splitting of lineage. Some evolutionary trees are rooted in one of the branching points, which adds a direction to the evolution, and the root is then said to be the most recent common ancestor of all species in the tree. In this case we picture the tree with the root on top and all edges as directed, pointing downwards away from the root. Each branching point or internal node is the root of a subtree and is viewed as a common ancestor of the EUs in this subtree. Two EUs down one branch are more closely related than either of them is to any EU found in another branch of the same branching point. When each branching point in a rooted tree has exactly two immediate descendants, the tree is said to be binary. In a binary unrooted tree, each node has exactly three connections to other nodes. The number of connections a node


has is called the degree of the node. An unrooted tree does not contain internal nodes of degree two. A tree where internal nodes are allowed to be polytomies, that is, where they can have any degree equal to or greater than three, is called a general tree. General trees are often used to represent partly resolved relationships, where the complete topology is not known and each species therefore cannot be represented by a distinct node. Sometimes branches are assigned a length, which adds the notion of time to the evolution.

The true evolutionary relation among a set of EUs is rarely known. Multiple methods for determining the exact relationship from biological data are available. They do not necessarily agree and might induce different trees. Some methods for inferring relationships will result in a large range of plausible tree reconstructions. Furthermore, multiple data sets, e.g. DNA sequences, describing a single species are often at hand. Thus, one method may yield a different solution for each data set used. Figure 1.1 is an illustration of two alternative relationships inferred for the Panthera (big cats).

[Figure 1.1: Two alternative trees over the Panthera (big cats): Clouded Leopard, Jaguar, Leopard, Lion, Snow Leopard and Tiger, grouped differently in the two reconstructions. Note that one is a binary tree whereas the other includes a polytomy and is thus a general tree. Example from Davis et al. [9].]

This disagreement between trees introduces the need for some means of assessing trees. One approach is to make a pairwise comparison of trees in an attempt to quantify the differences or similarities.

1.2 Measuring difference or similarity

Various methods for tree comparison have been defined, and each measure has certain properties and takes certain aspects of the trees into consideration. Some can only handle fully resolved trees, while others are able to take branch lengths into account. Some metrics consider topological properties only. An example of the latter is the nearest-neighbor interchange metric, proposed by Waterman and Smith [20], defined as the fewest number of nearest-neighbor interchanges required to convert one tree into another. The metric only works for binary trees, and the problem of computing it has been shown to be NP-complete (see DasGupta et al. [8]). In this thesis, focus will be on general trees. Here, an example is the Robinson–Foulds distance metric, proposed by Robinson and Foulds [15] and also known as the symmetric difference metric. It is defined as the


number of partitions of the species, produced by deleting an edge in a tree, that differ between the two trees. The Robinson–Foulds metric has been particularly popular because it can be computed in linear time using Day's algorithm (see Day [10]).

This thesis will concentrate on the quartet distance, proposed in 1985 by Estabrook et al. [12].

1.2.1 The quartet distance

The quartet distance is a measure of difference between two trees and is based upon the observation, made by Estabrook et al. [12], that a group of four EUs is the smallest group for which there is more than one possible unrooted tree topology. Call such a group a quartet. In an evolutionary tree, each quartet of species inherits one of the four topologies shown in Fig. 1.2.

[Figure 1.2: The four possible topologies of a quartet consisting of the four leaves a, b, c, d. (a)-(c) are so-called butterfly quartets, while (d) is a star quartet.]

That is, considering only the part of a tree that remains after removal of every edge that is not part of a path between two of the four leaves in the quartet, that part of the tree will have one of the four topologies. Three of those are so-called butterfly topologies, sometimes called resolved topologies, where the leaves are more closely related in pairs, and one is called the star topology, or unresolved topology, where all leaves are equally closely related. Note that, besides the topological structures illustrated in the figures, the leaves are unordered, and thus leaves a and b might as well switch places in Fig. 1.2(a). The quartet distance is the number of quartets that inherit different topologies in the two trees considered. Figure 1.3 illustrates the inference of the topology of different quartets.

[Figure 1.3: (a) shows a tree with leaves a, b, c, d, e, f. (b) shows, highlighted, the butterfly topology inherited by the quartet (a, b, d, f) and (c) the star topology inherited by (b, c, e, f).]
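The induced topology can be read off from leaf-to-leaf paths: a quartet is the butterfly xy|zw exactly when the path from x to y and the path from z to w share no node, and it is a star when no such pairing exists. A minimal sketch of this test (the adjacency-dict representation and helper names are my own, not code from the thesis):

```python
from collections import deque

def path_nodes(adj, s, t):
    """Nodes on the unique s-t path in a tree given as an adjacency dict."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, v = set(), t
    while v is not None:
        path.add(v)
        v = parent[v]
    return path

def quartet_topology(adj, a, b, c, d):
    """Return the butterfly pairing, e.g. {{a,b},{c,d}} for ab|cd,
    or None if the quartet is a star."""
    for (x, y), (z, w) in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        if not path_nodes(adj, x, y) & path_nodes(adj, z, w):
            return frozenset({frozenset({x, y}), frozenset({z, w})})
    return None

# Tiny example: the ab|cd butterfly of Fig. 1.2(a), and a star quartet.
butterfly = {'a': ['u'], 'b': ['u'], 'c': ['v'], 'd': ['v'],
             'u': ['a', 'b', 'v'], 'v': ['c', 'd', 'u']}
star = {'a': ['x'], 'b': ['x'], 'c': ['x'], 'd': ['x'],
        'x': ['a', 'b', 'c', 'd']}
print(quartet_topology(butterfly, 'a', 'b', 'c', 'd'))  # the {a,b}/{c,d} pairing
print(quartet_topology(star, 'a', 'b', 'c', 'd'))       # None
```

Repeating this test for every quartet in both trees is exactly the naive approach discussed in the next section.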


It is evident that within binary trees a quartet can only inherit one of the three butterfly topologies; however, with the inclusion of the star topology, the method works just as well on general trees. This means that the quartet distance can be used with trees that include partly resolved relationships, i.e. polytomies, but it is required that the two trees specify the exact same set of leaves.

The quartet distance does not consider branch lengths but focuses solely on topological properties. It works equally well with rooted and unrooted trees, since a rooted tree can be interpreted as unrooted and a quartet will inherit the same topology in both. However, for rooted trees there is in fact more than one topology for a group of only three leaves, meaning that a triplet distance also has its justification; see Dobson [11].

1.3 Overview of algorithms for quartet distance computation

In computer science, a widely used data structure is the tree data structure. It comes in numerous variants and has an endless number of applications, and therefore algorithms working on trees have been studied in very great detail. Hence, algorithms for calculation of the quartet distance between evolutionary trees are just another application, and while algorithms for this problem benefit from previous research, they might end up being of use within completely different areas than phylogenetics. Likewise, the focus of this thesis will be purely algorithmic and not on applications in bioinformatics.

There are (n choose 4) ∈ O(n^4) unique quartets in a tree with n leaves, which makes quartet distance calculation a computationally heavy problem. Computing the quartet distance naively, by explicitly inspecting the topologies of the O(n^4) quartets in the two trees, takes O(n^5) time.

Several algorithms have been designed over the years, resulting in dramatic improvements in time usage. Focus has been on calculation of the quartet distance between binary trees, which do not include star quartets and seem to be less complex to handle. Steel and Penny [16] showed how to calculate the quartet distance in time O(n^3). Bryant et al. [4] improved this to O(n^2) and introduced some concepts important to this thesis. The work of Brodal et al. [3] has also been important to this thesis and resulted in the fastest known algorithm for binary trees, with a time bound of O(n log n).

For general trees, Bansal et al. [1] describe an O(n^2) time 2-approximation algorithm, but this thesis will only deal with exact quartet distance calculation. Christiansen et al. [6] present three algorithms with running times of O(n^4), O(n^3) and O(n^2 d^2) respectively, where d is the maximum degree of any node in the two trees. Stissing et al. [17] present an O(d^9 n log n) time algorithm.
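The (n choose 4) growth is easy to appreciate numerically; a quick check of the quartet counts involved, using Python's standard library:

```python
from math import comb

# Number of distinct quartets, (n choose 4), for growing n:
for n in (10, 100, 1000):
    print(n, comb(n, 4))
# 10 leaves already give 210 quartets, and 100 leaves give 3,921,225,
# so any algorithm that touches every quartet quickly becomes expensive.
```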
Note that some of the algorithms are bounded by the degree of the internal nodes, whereas others are independent of this factor and


only bounded by the number of leaves in the tree. However, d ≤ n, and consequently a complexity of O(d^2 n^2) is in any case better than one of O(n^4).

In this thesis, the main subject of study will be an algorithm proposed by Mailund et al. [14] that yields a running time of O(n^(2+α)), where α = (ω−1)/2 and O(n^ω) is the complexity of matrix multiplication. At the time of writing, it is the only algorithm that provides a sub-cubic time complexity while being independent of the degree. However, it is dependent on the utilization of a fast method for matrix multiplication, providing a sub-cubic time usage on square matrices. Naive matrix multiplication, which features ω = 3, yields α = 1 and thus a running time of O(n^3). The authors of the article point out that utilizing an advanced method for matrix multiplication with a good theoretical result will yield a sub-cubic time complexity. They mention the Coppersmith–Winograd algorithm [7], which provides ω = 2.376, resulting in a time complexity of O(n^2.688). Now, despite being the fastest method for matrix multiplication to date, the Coppersmith–Winograd algorithm is not practically applicable.

Such theoretical results will often leave the programmer reluctant, since an implementation seems most unrealistic.
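The exponent arithmetic above can be made concrete with a small helper of my own (not code from the thesis): α = (ω − 1)/2 maps naive multiplication to the cubic bound and Coppersmith–Winograd to the O(n^2.688) bound.

```python
def running_time_exponent(omega):
    """Exponent 2 + alpha of the Mailund et al. bound, with alpha = (omega - 1) / 2."""
    return 2 + (omega - 1) / 2

print(running_time_exponent(3.0))    # naive matrix multiplication -> 3.0, i.e. cubic
print(running_time_exponent(2.376))  # Coppersmith-Winograd -> approx. 2.688
```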
However, from my discussions with the authors I know that the early theoretical work on the analysis of the quartet distance algorithm left a general first impression that it was more efficient than eventually shown by the final asymptotic worst-case analysis. Theoretical upper bounds on the time usage of an algorithm might be far worse than the actual time usage, as a result of the algorithm being difficult to analyse, and I therefore find this information promising and hope that these early predictions were true. This may sound ambiguous, but the point is that while it might be difficult, because of practical limitations, to obtain a worst-case running time as good as the one of the analysis, an implementation might at the same time perform significantly better on most input. Hence, this thesis will study the practical behavior of the algorithm with the goal of clarifying the relation between the theoretical and the practical time complexities. For reference, the two other algorithms that have a running time independent of the degree [6] are studied in detail as well. This will allow comparison of the performance and verification of correctness.

1.3.1 Final result

The final result of the study of the practical behaviour of the algorithms is illustrated in Fig. 1.4. The three algorithms are compared for two different implementations. Without going too much into detail at this point (see Chap. 8 for details), note that the plots show that the O(n^4) and O(n^3) algorithms behave as expected; more interestingly, we can observe that the sub-cubic algorithm is not only faster in practice, it also behaves


sub-cubic, and actually close to quadratic. Now the final result has briefly been revealed as an appetizer, and I hope this will encourage the reader to seek the answers to why and how in the subsequent chapters.

[Figure 1.4: The final result of the experimental work of this thesis: log-log plots of running time t(n) in seconds against the number of leaves n (50 to 800), comparing the sub-cubic, cubic and quartic algorithms against the reference curves n^2, n^3 and n^4, for both the Python and the C++ implementations. See Chap. 8 for more details.]

1.4 Focus of the thesis

The aim of this thesis is to explore in detail the practical behaviour of the theoretical sub-cubic algorithm. Furthermore, I want to give readers who have no knowledge of quartet distance calculation on general trees a gentle introduction to the subject. Therefore I will make an effort to explain everything thoroughly on the fly and not leave anything unanswered. By continuously explaining my reasoning and immediately following up on the theory with implementation details and experimental results, I will encourage the reader to go on and read everything chronologically. This will especially apply to the middle part of the thesis, where I will deal with the algorithms one at a time: first theory, then implementation details and finally the results. Where appropriate, the thesis will follow my workflow, so as to reflect the iterative process of my work and to reveal every interesting experience I had during the process.

As mentioned, the thesis will focus on the algorithmics and not the bioinformatics. The purpose is to explore the behaviour of three algorithms. This means that I will accept the theory as presented, try to verify the results through experiments and, in case of any surprises, take a closer look at the algorithms to verify and possibly adjust the theory. Along the way, I will strive to provide enough information for the reader to reproduce the experiments.

1.5 Thesis outline

The thesis is structured as follows. Chapter 2, Prerequisites, will go through the terminology necessary to prepare for the algorithmic chapters, and motivate the choice of implementation languages and experimental platforms. Chapter 3, Experimental approach, is a general explanation of how experiments are used to verify the quality of the implementations and of what kind of data is used for this purpose.

Chapters 4 through 7 make up the algorithmic core of the thesis. Chapter 4 deals with everything concerning the O(n^4) algorithm, which will also be referred to as the quartic algorithm. Chapter 5 gives the details about an algorithmic concept called leaf set sizes that is part of the cubic and sub-cubic algorithms, which are the subjects of Chap. 6 and Chap. 7 respectively, the latter of course being the more exhaustive.

Chapter 8, Results and discussion, will put the results into perspective by comparing the three algorithms and give an outline of my contributions and of what has been successful and what has not. Finally, Chap. 9 will be a short overview and conclusion, summarizing the results and contributions, with a comment on possible future work.

The appendices will contain: Appendix A, algorithmic details that do not fit in the main text; App. B, instructions on how to obtain and run the programs; and finally App. C, a report on a real-life application of the final sub-cubic algorithm.


Chapter 2

Prerequisites

First, this chapter gives an introduction to the terminology that is used extensively throughout the thesis. New notation will be used at times but will be appropriately introduced. Second, I will motivate my choice of implementation languages and give the details about the platforms used to run the experiments.

2.1 Terminology

In Sec. 1.1, I introduced the phylogenetic tree. From a purely algorithmic perspective, a tree, denoted by T, consists of three kinds of objects: a set I of internal nodes, a set L of leaf nodes and a set E of edges connecting the nodes. The set of all nodes, leaves and internal, is denoted by V. When two trees with the same set of leaves, usually denoted by T and T′, are to be compared, the measure of the size of the problem will be the total number of leaves in each tree. This value will alternatively be denoted by n, and thus n = |L|. The number of edges connected to an internal node v is referred to as the degree of the node and will be denoted by d_v. For any internal node, d_v ≥ 3. The degree of a leaf is one, and no node will have degree two (except, e.g., the root of a rooted binary tree). Allowing nodes of degree two would give rise to trees with arbitrarily many nodes and edges. This would allow the number of nodes and edges to exceed O(n), clearly invalidating any analysis based on n = |L| as the problem size.
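The node and edge bounds for trees without degree-two nodes (at most 2n−2 nodes and 2n−3 edges, attained by binary trees) can be checked empirically. A small sketch of my own that grows a random unrooted binary tree by repeatedly subdividing an edge and hanging a new leaf from the subdivision point (representation and names are assumptions, not the thesis's code):

```python
import random

def random_unrooted_binary_tree(n):
    """Adjacency dict for a random unrooted binary tree with n >= 3 leaves
    named 'L0'..'L{n-1}'; internal nodes are named 'I0', 'I1', ..."""
    adj = {'I0': ['L0', 'L1', 'L2'], 'L0': ['I0'], 'L1': ['I0'], 'L2': ['I0']}
    edges = [('I0', 'L0'), ('I0', 'L1'), ('I0', 'L2')]
    for i in range(3, n):
        # Subdivide a random edge (u, v) with a new internal node and
        # attach the new leaf to it: +2 nodes and +2 edges per leaf added.
        u, v = edges.pop(random.randrange(len(edges)))
        mid, leaf = 'I%d' % (i - 2), 'L%d' % i
        adj[u].remove(v); adj[v].remove(u)
        adj[u].append(mid); adj[v].append(mid)
        adj[mid] = [u, v, leaf]
        adj[leaf] = [mid]
        edges += [(u, mid), (mid, v), (mid, leaf)]
    return adj

tree = random_unrooted_binary_tree(50)
num_nodes = len(tree)
num_edges = sum(len(neighbors) for neighbors in tree.values()) // 2
print(num_nodes, num_edges)  # -> 98 97, i.e. 2n-2 and 2n-3 for n = 50
```

Whatever random shape the tree takes, the counts are fixed: each added leaf contributes exactly two nodes and two net edges.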
It is sufficient to bound the number of nodes of degree two, since one can prove that a tree with no degree-two nodes has at most 2n−2 nodes and at most 2n−3 edges, with the maximal values being attained by binary trees. The maximum degree of any node in the tree is denoted by d.

Sometimes an (undirected) edge will be regarded as two distinct, oppositely directed edges. A directed edge e is interesting because it identifies the subtree, call it F, consisting of every part of the tree T in front of the edge. Likewise, the opposite edge, denoted by ē,


10 CHAPTER 2. PREREQUISITESidentifies <strong>the</strong> subtree ¯F , which includes every part of <strong>the</strong> tree behind e. The quantity |F |is called <strong>the</strong> leaf set size and will be used to denote <strong>the</strong> number of leaves in <strong>the</strong> subtree F.Therefore, |F | + | ¯F | = n. This is a slight abuse of <strong>the</strong> ma<strong>the</strong>matical notation for sets andone could argue that F should merely denote <strong>the</strong> set of leaves in <strong>the</strong> subtree, however,I will stick to this notation as it is used in <strong>the</strong> background literature. Most often, <strong>the</strong>algorithms will deal with two sub<strong>trees</strong>, one from each tree, where F is a subtree of T andG is a subtree of T ′ . Since T and T ′ contain <strong>the</strong> same set of leaves, F and G might alsohave some leaves in common. This is written |F ∩G| and will be referred to as <strong>the</strong> sharedleaf set size.The concept of <strong>quartet</strong>s of four leaves a, b, c and d has been described in Sec. 1.2.1.The three butterfly topologies, illustrated in Fig. 1.2 (a)–(c), are written ab|cd, ac|bd andad|bc respectively, while <strong>the</strong> star <strong>quartet</strong> in Fig. 1.2(d) is written a b ×c d .2.2 Choice of language and test environmentsMy first choice of implementation language was <strong>the</strong> Python programming language.Python is not among <strong>the</strong> most efficient languages, and usually not used in critical algorithmicor ma<strong>the</strong>matical applications. However it is a popular language for fast prototyping,because it supports clarity and simplicity and a very short workcycle. This was agood match for me, during <strong>the</strong> early parts of <strong>the</strong> implementation process, where I gainedcomplete understanding through <strong>the</strong> experimental work. 
In addition, I did not know the actual time needed for quartet distance calculation, and therefore productivity was more important than performance.

Later on, I decided to implement the algorithms in C++ as well. We have seen, in Sec. 1.3.1, that the results of the Python implementation were indeed informative; however, the practical running times were rather slow. The time-consuming calculations were carried out on remote servers, and although this thesis is not about optimizations, I found it interesting to see if I could bring the running time down to a level where the experiments could be carried out on my own laptop. Furthermore, another language with other properties might give the whole study a new perspective. C++ does indeed have other properties: it is compiled and, with support for a set of low-level language features, considerably closer to the machine level, and it is often used for algorithms and other time-critical calculations.

These two tracks of implementation will be described in parallel, and in some cases not distinguished between. However, they should not be compared in a one-to-one relation; the resulting programs will merely be used as two different opportunities to experiment with the theoretic results.


Details about the two programming languages, libraries and physical environments used are listed below. For a guide on how to obtain, compile and run the code, see App. B.

Python programming environment
  platform:         PC
  processor:        Intel Xeon 3.00 GHz
  memory:           1 GB
  operating system: Red Hat Linux (kernel: 2.6.18)
  language:         Python (v. 2.4.3)
  libraries:        NumPy (v. 1.2.1)
                    SciPy (v. 0.6.0)
                    BLAS (preinstalled, v. 3.1.1)

C++ programming environment
  platform:         MacBook Pro 13"
  model:            MacBookPro5,5
  processor:        Intel Core 2 Duo 2.26 GHz
  memory:           4 GB
  operating system: Mac OS X 10.6.4 (10F569)
  language:         C++ (g++ from gcc version 4.2.1 (Apple Inc. build 5664))
  libraries:        Boost.uBlas (Boost libraries v. 1.42.0)
                    Boost.NumericBindings (v. v1)
                    BLAS (Apple Accelerate framework: v. vecLib-268.0)


Chapter 3

Experimental approach

Algorithms for computing the quartet distance should all produce the same simple result, namely a single number. Once confidence in one algorithm has been established, it is rather simple to verify another such algorithm and build the same level of trust in the correctness of its results. That is one reason for implementing several different algorithms alongside each other, and it is also a strong motivation for the inclusion of the quartic algorithm of Chap. 4: it is simple and therefore easy to verify by hand. Thorough verification has indeed been practiced, with various small examples and unit tests.

Another way to gain trust is, of course, to rely on some independent implementation. This has also been done in the effort to cement the correctness of my work. A piece of software called QDist, described in Mailund and Pedersen [13], has been utilized. It is an implementation of the O(n log^2 n) algorithm presented in Brodal et al. [2], which works only on binary trees. However, since there was no working software for general trees at hand, all results involving general trees had to be verified separately.

After implementing three algorithms using completely different approaches, in two different programming languages, my confidence in the correctness is very high.

Needless to say, correctness of the result is essential, but yet another all-important property is the running time.
Especially so in this thesis, where running time is the quality measure by which the algorithms are compared. Experiments are used to determine the running time of each algorithm, in order to verify whether the theoretically promised time bounds are correct, to assess the practical behaviour of the algorithms, which might be different than anticipated, and, last, to be able to compare the algorithms against one another. For these experiments, a wide range of test data has been used. This is explained in detail in the following Sec. 3.1. Other details about each experiment, e.g. which input trees have been used and how many times an experiment has been executed, are found in the respective and appropriate sections.

The evaluation of the outcome of those experiments has mainly been done by plotting, with the use of two libraries for Python, matplotlib [1] and SciPy [2], for plotting and scientific computation respectively. Each data set is displayed in a log-log plot, which, due to the nonlinear scaling of the axes, displays a power function, like f(x) = ax^b, as a straight line, where b can be read off as the slope of the line. This makes it easy to verify which growth rate the running time of an algorithm follows. For convenience, each plot will include lines indicating linear, quadratic, cubic and quartic behaviour where appropriate.

In addition to the visual evaluation of results, the exponent of the function describing each plot has been estimated through the least squares method, from the scipy.optimize module. These approximated exponents are displayed along with the name of each data series in a plot.

3.1 Trees

Five groups of input trees have been utilized throughout the array of experiments performed as part of this thesis. They are all artificial trees and fit the conditions necessary to work with the three algorithms considered, as described in Sec. 1.3. The intention is to cover as wide a range of trees as possible, which involves varying the relationship between the number of leaves and the number of internal nodes, which in turn influences the degree of the internal nodes.
If a tree has a relatively large number of internal nodes, their degree will be small.

It is important to consider all these different kinds of trees to give a reliable evaluation of the algorithms subject to the experiments. This will clarify whether an algorithm has any weakness related to a certain property of trees. In particular, as mentioned in Chapter 1, this thesis only considers algorithms with a theoretic time complexity that is independent of the degree of the internal nodes in the tree. Hopefully, these different trees will help investigate whether such assumptions hold true.

The five groups of trees are described below, along with an outline of their respective properties. All trees are considered unrooted.

Binary trees (bin) In this thesis, a binary tree is a tree where ∀v ∈ I : d_v = 3, that is, the degree of each internal node is three. Consequently there are n − 2 internal nodes and 2n − 3 edges, and therefore the number of internal nodes and the number of edges is O(n). Essentially, a binary tree has many internal nodes of low degree. Figure 3.1 shows an example of a binary tree. It is evident that a binary tree cannot contain star quartets.

Figure 3.1: A bin tree

Random trees (ran) A random tree can have any topology where the degree of internal nodes is at least three, that is, ∀v ∈ I : d_v ≥ 3. At this point we cannot predict the behavior of an algorithm given random trees as input; however, they are included in case something interesting comes out of the experiments.

Square root trees (sqrt) A square root tree consists of a single central node, connected to √n internal nodes, each of which in turn has edges to √n leaves. See Figure 3.2. Consequently, such a tree has one node of degree √n and √n inner nodes of degree √n + 1. The number of edges is |E| = n + √n. Essentially, the sqrt tree has a large number of nodes of high degree.

Figure 3.2: A sqrt tree

Star trees (star) A star tree consists of a single internal node connected to every leaf. See Figure 3.3. That is, one inner node of degree n, meaning few nodes of high degree. Also, the number of edges is |E| = n. Naturally, this tree only contains star quartets.

Figure 3.3: A star tree

[1] matplotlib: http://matplotlib.sourceforge.net/
[2] SciPy: http://www.scipy.org/


Worst-case trees (wc) In a worst-case tree there is one internal node of degree n/2 and n/2 internal nodes of degree three; see Figure 3.4. The tree has its name from Christiansen and Randers [5], where it was invented with the purpose of having a tree with O(n) internal nodes and O(n) internal edges connected to an internal node of degree O(n). More interesting to this thesis is the total number of edges, which is |E| = (3/2)n. The name should not be taken literally, since we do not yet know how the algorithms in this thesis will perform.

Figure 3.4: A wc tree

3.2 Tree construction

Since this thesis mainly encompasses an experimental approach to evaluating algorithms, I find it natural to describe my choice of data format and data construction.

3.2.1 Newick format

The programs read tree data from files in the Newick tree format [3], which is a commonly used format for representing phylogenetic trees. It is a simple text format based on parentheses and commas. The grammar in Fig. 3.5 provides a formal description for parsing the Newick format.

Tree      → Subtree ";" | Branch ";"
Subtree   → Leaf | Internal
Leaf      → Name
Internal  → "(" BranchSet ")" Name
BranchSet → Branch | BranchSet "," Branch
Branch    → Subtree Length
Name      → empty | string
Length    → empty | ":" number

Figure 3.5: A context-free grammar for the Newick format

[3] Wikipedia article on the Newick format: http://en.wikipedia.org/wiki/Newick_format (accessed Oct. 3, 2010)


Formal descriptions are good; however, this one has some small restrictions, e.g. parentheses inside Name are prohibited. Nevertheless, the grammar more than satisfies the needs of this thesis. Note that a tree either descends from nowhere, or it is rooted in an internal node. Thus, the Newick format is incapable of describing unrooted trees. This is not a problem, since the tree can merely be rooted in an arbitrary internal node which, when parsing, will be treated as any other internal node.

The two trees from the example in Fig. 1.1 would be written as follows in Newick tree format. Note that there is only one named Internal node in each tree, namely the root (Panthera):

(((Tiger, Snow Leopard), ((Lion, Leopard), Jaguar)), Clouded Leopard) Panthera;
(((Lion, Leopard), Snow Leopard, Jaguar), (Tiger, Clouded Leopard)) Panthera;

Data construction Data files have been generated in Newick format using a range of simple Python scripts, one for each class of artificial tree described in Sec. 3.1. They are found along with the code and all the data files used; see Appendix B.

3.2.2 Parsing data files and building a tree data structure

One way of parsing data described by a grammar, like the one in Fig. 3.5, is by using a recursive descent parser.
As the name indicates, a recursive descent parser works top-down, by repeatedly identifying a construct in the data that corresponds to a production rule in the grammar and then processing this piece of data according to the rule. A number of mutually recursive procedures will typically represent the rules defined by the grammar.

For the Python implementation I have made use of a parser based on the Toy Parser Generator framework [4] (TPG), written by Thomas Mailund [5], with a few modifications. For the C++ implementation I wrote my own parser from scratch, with procedures Parse(), ParseSubtree(), ParseInternalNode(), ParseLeafNode(), ParseBranchSet() and ParseBranch(), corresponding to the first six rules of the grammar of the Newick format. Both parsers are capable of parsing all the formats suggested by the grammar in Sec. 3.2.1 and they meet the needs of this thesis.

While the algorithms in general deal with unrooted trees, some parts view the tree as rooted, in which case it turns out more convenient to have a rooted, top-down representation. Furthermore, the Newick format dictates that some arbitrary node be chosen as root, so that is what one can expect when parsing such data. This is no problem; in any case, some entry point is needed to access the tree.

When initiating the process of implementing the first algorithm in Python, I did not have deep and substantial knowledge about the requirements each algorithm would put on the data structure used to store and represent the tree. Consequently, I found myself using the simple top-down data structure provided by the Python Newick parser described above. This worked out all right in my first attempt at implementing the quartic algorithm, see Sec. 4.1. However, when attacking the other algorithms, which sometimes take an edge-based approach (remember that subtrees are identified by directed edges), it turned out to be insufficient, inexpressive and confusing.

Therefore, it seemed obvious to enrich the data structure with constructs that correspond to the ideas used in the algorithms. As long as the modifications could be done in linear time by simple traversal of the tree, the overhead would easily fit into the time bounds of the algorithms considered in this thesis.

One improvement was to decorate the tree with more directed edges. A recursive descent parser built using the Toy Parser Generator framework is tail recursive and does not leave the opportunity to send back information through the recursion. The result is that trees parsed with a TPG parser do not have back-edges, i.e. edges pointing towards the root.

[4] TPG framework: http://christophe.delord.free.fr/tpg/index.html
[5] Mailund's parser: www.mailund.dk/index.php/2009/01/19/yet-another-newick-parser/
Subsequently adding back-edges made it easy to traverse the tree from any choice of start node and in any direction. Also, after realising that the algorithms most often deal with directed edges, I let each edge know its opposite, so that two directed edges together correspond to one actual undirected edge. The algorithms naturally require unique identification of leaves and use the assumption that internal nodes and edges can be identified uniquely as well. Therefore, ids were also added in a subsequent traversal of the tree. Because I implemented the C++ parser and data structure with this in mind, I made the C++ parser a bit more advanced, letting it add the necessary edges while parsing, keep track of the ids of nodes and edges, and collect nodes and edges in lists for easy access. Of course this required that procedures return information about the part of the tree just parsed (the subtree below).

The few dissimilarities in the two data structures used do not play a significant role in the implementations. There is a slight difference in the way a tree is traversed, since an internal node in the Python data structure has a parent and a number of subtrees, whereas the C++ data structure simply has a number of related subtrees. Furthermore, the fact that the two implementations do the job in different orders results in the edges having different ids in the two languages, which changes the order in which edges are processed. This should not have any impact on the results of the algorithms, however.
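To make the recursive-descent approach concrete, here is a minimal, self-contained Python sketch of a parser for the grammar in Fig. 3.5. It is an illustration, not the thesis code (neither the TPG-based parser nor the C++ one): it returns a leaf as its name string and an internal node as a (name, children) pair, parses but discards branch lengths, and does not handle every corner of the format (e.g. quoted names).

```python
import re

def parse_newick(text):
    """Minimal recursive-descent parser for the Newick grammar of Fig. 3.5.
    A leaf becomes its name string; an internal node becomes a
    (name, children) pair. Branch lengths are parsed but discarded."""
    # Delimiters are single tokens; everything between them is a name/number.
    tokens = re.findall(r'[();,:]|[^();,:\s][^();,:]*', text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = tokens[pos]
        assert expected is None or tok == expected, f"expected {expected!r}, got {tok!r}"
        pos += 1
        return tok

    def subtree():
        if peek() == '(':                  # Internal -> "(" BranchSet ")" Name
            take('(')
            children = [branch()]
            while peek() == ',':           # BranchSet -> Branch ("," Branch)*
                take(',')
                children.append(branch())
            take(')')
            name = take().strip() if peek() not in (';', ',', ')', ':', None) else ''
            return (name, children)
        return take().strip()              # Leaf -> Name

    def branch():                          # Branch -> Subtree Length
        node = subtree()
        if peek() == ':':                  # Length -> ":" number (discarded)
            take(':')
            take()
        return node

    tree = branch()                        # Tree -> Branch ";"
    take(';')
    return tree
```

Parsing `"((a:1,b:2):0.5,(c,d)) r;"`, for instance, yields the nested pair `('r', [('', ['a', 'b']), ('', ['c', 'd'])])`, mirroring how the unnamed internal nodes hang below the named root.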


Chapter 4

Quartic time algorithm

Here I will describe an approach to calculating the quartet distance between two trees T and T′ that leads to a quartic time complexity, that is, a running time of O(n^4). The algorithm was suggested by Christiansen et al. [6]. Given n leaves there are O(n^4) quartets to consider. Consequently, being able to determine the topology of a quartet in each of the two trees and compare them in constant time makes it possible to make the entire algorithm perform within O(n^4).

The main principle of the algorithm is to process a triplet, that is, three leaves, at a time instead of a quartet. Then, for every triplet, in linear time, process each of the remaining n − 3 leaves in turn. There are O(n^3) triplets, meaning that we achieve a total running time of O(n^4).

The triplets are processed as follows. For every triplet (a,b,c) there is a unique node that lies on all three paths between a and b, between a and c, and between b and c. Call such a node a center and denote it C. A node in the tree is associated with a number of subtrees, each corresponding to an outgoing edge from the node. Each subtree contains a number of leaves. For some center node C, let T_a, T_b and T_c denote the subtrees that contain the leaves a, b and c respectively. Furthermore, let T_rest denote the collection of all subtrees of C except T_a, T_b and T_c. See Fig. 4.1.
Every leaf x other than a, b and c will be the fourth leaf in some quartet. The topology of this quartet can easily be determined based on the position of x relative to C. If x is positioned in the same subtree as one of the leaves a, b or c, the resulting quartet is a butterfly; otherwise, it is a star. If x ∈ T_a, the topology is ax|bc. If x ∈ T_b, the topology is bx|ac. If x ∈ T_c, the topology is cx|ab. Finally, if x ∈ T_rest, the topology is the star quartet on a, b, c and x.

Finding the center of three leaves and collecting the leaf sets T_a, T_b, T_c and T_rest can be done in linear time. The topologies of the n − 3 quartets containing a, b and c can then be determined in linear time by going through the collections of leaves.

Figure 4.1: A center node C of the leaves a, b, c. The white subtrees make up the set of subtrees denoted T_rest.

To compare the topologies found in the two trees, an array is computed, holding, at position i, the topology of the quartet containing a, b, c and the i'th leaf. Given two such arrays, one for T and one for T′, the topologies can be compared in linear time and the number of different topologies associated with the triplet (a,b,c) counted. This gives an overall running time of O(n^4).

A number of quartet topologies is thus computed by processing all triplets (a,b,c). However, each quartet (a,b,c,d) is actually considered four times, once for each possible triplet composed from the four leaves; in other words, each of the four leaves will eventually act as the leaf x described above. Consequently, the total number of different quartets counted has to be divided by four to get the quartet distance. See Alg. 4.1 for an outline of the algorithm.

Regarding space consumption, this approach only needs memory for the tree data structure and the two arrays, which are all linear in the number of leaves. Thus, the algorithm uses O(n) space.

4.1 Implementation

The quartic algorithm is simple and easy to understand, and this goes for its implementation as well.
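To make the triplet-based counting above concrete, the following compact Python sketch implements the idea directly on a plain adjacency-dict representation of unrooted trees. It is an illustration of the technique, not the thesis implementation; the representation and names are my own, and no attempt is made at efficiency.

```python
from itertools import combinations

def find_path(adj, src, dst):
    """Unique path from src to dst in a tree, as a list of nodes."""
    parent, stack = {src: None}, [src]
    while stack:
        u = stack.pop()
        if u == dst:
            break
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                stack.append(w)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = parent[u]
    return path[::-1]                           # src ... dst

def center(adj, a, b, c):
    """The unique node lying on all three pairwise paths between a, b, c."""
    common = (set(find_path(adj, a, b))
              & set(find_path(adj, a, c))
              & set(find_path(adj, b, c)))
    return common.pop()                         # a tree has exactly one such node

def leaves_behind(adj, frm, to, leaves):
    """Leaves in the subtree entered by the directed edge frm -> to."""
    found, stack, seen = [], [to], {frm, to}
    while stack:
        u = stack.pop()
        if u in leaves:
            found.append(u)
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return found

def topology_array(adj, leaves, a, b, c):
    """Topology code per leaf x: 0 is ax|bc, 1 is bx|ac, 2 is cx|ab, 3 is star."""
    C = center(adj, a, b, c)
    topo = {}
    for code, rep in enumerate((a, b, c)):
        toward = find_path(adj, C, rep)[1]      # neighbour of C toward rep
        for x in leaves_behind(adj, C, toward, leaves):
            topo[x] = code
    for x in leaves:
        topo.setdefault(x, 3)                   # x hangs off T_rest: star quartet
    return topo

def quartet_distance(adj1, adj2, leaves):
    """O(n^4) quartet distance; each quartet is counted once per triplet,
    i.e. four times, hence the final division by four."""
    diff = 0
    for a, b, c in combinations(sorted(leaves), 3):
        t1 = topology_array(adj1, leaves, a, b, c)
        t2 = topology_array(adj2, leaves, a, b, c)
        diff += sum(1 for x in leaves
                    if x not in (a, b, c) and t1[x] != t2[x])
    return diff // 4
```

On four leaves there is a single quartet, so the butterfly ab|cd against ac|bd gives distance 1, as does a butterfly against the star tree.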
The algorithm is a bit clumsy by nature, making a lot of traversals of the tree, but since it is the starting point for my investigation of quartet distance calculation, my focus is on correctness rather than efficiency, and thus I will make no attempt to optimize the algorithm in any way. Having a solid foundation and reference point for comparison with the other implementations is important to gain confidence in the final results.

The main loop, line 2 of Algorithm 4.1, goes through all distinct triplets, which can be generated in various ways. The centers in a tree are found by first making three traversals to find the path between each pair of nodes (a,b), (a,c) and (b,c), and then examining these paths to find the single common internal node that is the center.

Algorithm 4.1 Quartic algorithm for quartet distance
 1: diffQ = 0
 2: for all triplets (a,b,c) do                                  ⊲ O(n^3)
 3:   Find center C in T and C′ in T′                            ⊲ O(n)
 4:   Collect leaf sets T_a, T_b, T_c and T_rest                 ⊲ O(n)
 5:   and leaf sets T′_a, T′_b, T′_c and T′_rest                 ⊲ O(n)
 6:   Create arrays A and A′
 7:   for all leaves x ∉ {a,b,c} do                              ⊲ O(n)
 8:     Topology can be decided in constant time based on leaf set membership
 9:     A[x] ← topology(a,b,c,x) in T                            ⊲ O(1)
10:     A′[x] ← topology(a,b,c,x) in T′
11:   end for
12:   for all topologies t ∈ A and t′ ∈ A′ do
13:     if t ≠ t′ then
14:       diffQ = diffQ + 1
15:     end if
16:   end for
17: end for
18: return diffQ / 4

The leaf sets T_a, T_b and T_c are easily found by traversing the tree with origin in the center node, and T_rest is not needed, as this set is just every leaf that is not in one of the other sets. Now, to process each iteration of the loop in line 7 in constant time, it is necessary to go through the leaf sets and deal with each leaf contained in them, rather than to go through all leaves, which would require repeated look-ups in each leaf set, breaking the time bound. The four topologies are represented by the integers 0, 1, 2, 3, and comparison of the two arrays is done by traversing them in parallel and comparing elements.

4.1.1 Result

The algorithm has been implemented in Python and C++ and tested on the five groups of artificial test trees described in Sec. 3.1. The results are displayed in Fig. 4.2 and Fig. 4.3. Because the algorithm is slow, the size of the trees used is limited to around 140 leaves. Larger trees would induce unacceptable running times, and 140 leaves are enough to observe any tendencies in the development of the performance. The experiments have been repeated five times.

Naturally, the expectation is to see an O(n^4) development on all trees, and this is indeed the case: all plots are parallel to or less steep than the line indicating quartic development. Furthermore, the estimated exponents of the lines plotted are below four.


This evidence makes it clear that it is in fact a worst-case quartic algorithm. In addition, the figures also reveal the interrelationship between the five groups of data. Observe that there is a rather large difference in running time, with the star tree being the fastest and the binary tree the slowest to deal with. This seems natural, since the internal structures of the trees are very different. The traversal of a tree with a complex internal structure, i.e. a large number of internal nodes, is more time consuming, and since the algorithm is based on a large number of traversals, this penalty is observable in the plots. And of course, the time bound is related to the number of leaves and not to the internal structure of the trees.

This algorithm has shown, not surprisingly, to be very slow and not competitive with a sub-cubic algorithm. It has merely worked as a reference of correctness when implementing the other algorithms. Nevertheless, in Chapter 8, where all the results of this thesis are compared and discussed, its practical performance is compared to those of the other algorithms.
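The exponent estimation behind the fitted values in the figures works because the fit is linear in log-log space: log t = log a + b log n. The thesis uses scipy.optimize for this; the following equivalent closed-form sketch in pure Python illustrates the idea, with synthetic timing data for the example (the constant 3e-6 and exponent 3.9 are made up for illustration).

```python
import math

def fit_power_law(ns, ts):
    """Least-squares fit of t(n) = a * n**b, computed as a straight-line fit
    of log t = log a + b * log n, so b is the slope seen in a log-log plot.
    Returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic, exactly power-shaped "timings"; real measurements are noisy.
ns = [10, 20, 40, 80, 160]
ts = [3e-6 * n ** 3.9 for n in ns]
a, b = fit_power_law(ns, ts)
```

On noise-free data the fit recovers the exponent exactly; on real timings the estimate is only as good as the measurements, which is why the plots also carry the n^3, n^4 and n^5 reference lines for visual comparison.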


[Figure 4.2: Performance of the quartic algorithm implemented in Python. Log-log plot of running time t(n) in seconds against the number of leaves n (10 to 200), with reference lines for n^3, n^4 and n^5. Fitted exponents: bin 3.93, ran 3.84, sqrt 3.69, star 3.65, wc 3.69.]

[Figure 4.3: Performance of the quartic algorithm implemented in C++. Log-log plot of running time t(n) in seconds against the number of leaves n (10 to 200), with reference lines for n^3, n^4 and n^5. Fitted exponents: bin 3.81, ran 3.71, sqrt 3.37, star 3.27, wc 3.44.]


Chapter 5

Calculating leaf set sizes

Calculating the size of every subtree leaf set in a given tree, that is, the number of leaves contained in the subtree, is an essential part of two of the algorithms studied in this thesis, namely the cubic time algorithm of Chap. 6 and, most importantly, the sub-cubic algorithm of Chap. 7, which is the central topic of the thesis. So is the calculation of shared leaf set sizes between two trees. A shared leaf set is the intersection of a pair of subtree leaf sets. It turns out that the former is needed for fast calculation of the latter. Christiansen and Randers [5] study these calculations for different types of trees and have been my primary inspiration. Here I will focus solely on outlining the approach and the aspects important to this thesis.

5.1 Subtree leaf set sizes

The calculation of all subtree leaf set sizes can be done in time O(n). Consider some general unrooted tree T, like the one in Fig. 5.1(a). If T is rooted in one of its internal nodes, e.g. r, one can consider directed edges as either pointing away from r or towards r, like e and e_2 respectively; see Fig. 5.1(b). Each directed edge represents a subtree. The subtree F, represented by e, does not contain r, and the subtree F̄, represented by the opposite edge ē, does contain r.

The leaf set size of every subtree that, like F, does not contain r can be calculated recursively using the following recipe. Make a depth-first traversal of all nodes v ∈ T, starting at r.
For each node v we look at the subtree defined by the edge entering v and pointing away from r. If v is a leaf node, the leaf set size is 1. If v is an internal node, the leaf set size is the sum of the leaf set sizes of the subtrees pointing downwards from v. Half of the leaf set sizes have now been calculated, and this calculation takes O(n) time, since a tree contains a linear number of nodes.
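The recursive recipe, combined with the complement rule |F̄| = n − |F| from Chap. 2 for the opposite edges, can be sketched as follows in Python, assuming an adjacency-dict tree representation (an illustration, not the thesis code):

```python
def subtree_leaf_sizes(adj, root, leaves):
    """Leaf set size |F| for every directed edge (u, v), i.e. the number
    of leaves in the subtree entered by following u -> v."""
    n = len(leaves)
    size = {}

    def down(u, v):                      # edge u -> v, pointing away from root
        if v in leaves:
            s = 1                        # a leaf contributes exactly itself
        else:
            s = sum(down(v, w) for w in adj[v] if w != u)
        size[(u, v)] = s
        return s

    for w in adj[root]:                  # first half: subtrees not containing root
        down(root, w)
    for (u, v), s in list(size.items()): # second half by complement: n - |F|
        size[(v, u)] = n - s
    return size
```

Both passes touch each edge a constant number of times, matching the O(n) bound stated above.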


Another approach is needed when calculating the other half of the leaf set sizes, because a repetition of the procedure for every possible root would take too long. Clearly, F ∪ F̄ contains every leaf and thus |F̄| = |L| − |F| = n − |F|. With this in mind and using the values previously computed, all leaf set sizes of subtrees containing r can be counted in linear time, giving a total time complexity of O(n).

Figure 5.1: (a) An unrooted tree and (b) the same tree rooted in the node r. Half of the directed edges, like e, point away from r and identify subtrees, like F, that do not contain r. The other half point towards r, like e2, and identify subtrees that do contain r.

5.2 Shared leaf set sizes

The shared leaf set size for two subtrees F ∈ T and G ∈ T′, written |F ∩ G|, is the number of leaves contained in both subtrees. Filling out a table holding the shared leaf set size of all pairs of subtrees between two trees can be done in time O(n²). As mentioned, this time bound is crucial to the cubic and sub-cubic quartet distance algorithms.

As the basis for my implementation I have used Christiansen and Randers [5] and I will therefore give an explanation of the idea here. It is an adaptation of the original idea for computing the shared leaf set sizes between two binary trees, presented in Bryant et al. [4] and elaborated in Tsang [19].

Again, consider the two subtrees from above.
F and G are rooted in two nodes, say v and v′ respectively, and each is composed of a number of smaller subtrees, given by the children of these nodes. If v has subtrees F_1, ..., F_{d_v−1} and v′ has subtrees G_1, ..., G_{d_v′−1}, the shared leaf set size of F and G can be expressed in two different ways by the following equation:

    |F ∩ G| = Σ_{i=1}^{d_v−1} |F_i ∩ G| = Σ_{j=1}^{d_v′−1} |F ∩ G_j|.    (5.1)

Doing this for all pairs of subtrees like F, G is cumbersome; however, using the same idea as for subtree leaf set sizes, the running time can be kept within the critical bound of O(n²). That is, root T and T′ in some internal nodes r and r′, see Fig. 5.1, and calculate


only the shared leaf set sizes for the subtrees pointing away from these roots. In a tree, there is exactly one of these subtrees for each node. Comparison of two leaves is done in constant time because leaves are numbered. This leads to the following expression:

    Σ_{v∈I} Σ_{v′∈I′} min(d_v − 1, d_v′ − 1) + Σ_{v∈L} Σ_{v′∈L′} 1 = O(n²)    (5.2)

Now, the processing of all pairs of subtrees where one or both of them contain the root still remains. Making use of the subtree leaf set sizes of T and T′, each remaining pair can be processed in a similar manner as used for subtree leaf set sizes in Sec. 5.1. There are three kinds of pairs of subtrees still to process, and using the filled parts of the table, that is, all entries associated with a subtree pair |F ∩ G|, their shared leaf set sizes can be computed in constant time, using the following recipe:

    |F̄ ∩ G| = |G| − |F ∩ G|,
    |F ∩ Ḡ| = |F| − |F ∩ G|,
    |F̄ ∩ Ḡ| = n − (|F| + |G| − |F ∩ G|).    (5.3)

This gives a complete time bound of O(n²) for filling the entire table.

In the following section I will give details about my implementation of the algorithm and the verification of its running time.

5.3 Implementing leaf set algorithms

5.3.1 Subtree leaf set sizes

The calculation of subtree leaf set sizes is easily implemented using the recipe described in Sec.
5.1, and hence there are no notable differences between the Python and C++ implementations.

Since a directed edge uniquely identifies a subtree, an array of length equal to the number of directed edges is used to store the result, and edge ids are used to index the array. First, a recursive function counts leaves in a depth-first manner for all subtrees not containing the root of the tree. Next, all remaining entries of the array are filled by a simple calculation. Every directed edge knows its opposite, and the value associated with an up-edge can therefore be calculated from the value associated with the corresponding down-edge.

5.3.2 Shared leaf set sizes

The description in Sec. 5.2 is easily translated into Python or C++ code as well. A two-dimensional table is used to store the result and again ids of directed edges are used


as indices. All entries associated with two edges pointing away from the root are filled by a recursive function. The function makes a combined depth-first traversal of the two trees, compares the leaves and sums up the shared leaf set sizes on its way up through the two trees. Equation (5.1) shows that every time two subtrees are encountered there is a decision to make: for which subtree should we sum over the children? This can influence the total number of additions the function has to make. Equation (5.2) includes this choice as the min-expression; however, analysing the expression reveals that the choice is not significant, since the running time will stay within the asymptotic time bound whichever way is chosen.

After processing all pairs of subtrees not containing the root, the results are used to fill the remaining entries of the table, cf. Eq. (5.3).

I have of course tested all parts of my code thoroughly for correctness and efficiency. However, as mentioned earlier, the running time of this procedure is critical to the overall running time of the cubic and sub-cubic algorithms, and I have therefore decided to document the experiments verifying the time usage of the shared leaf set size calculation. The algorithm has been tested against each of the five types of test trees described in Sec.
3.1.

Expectations  Before commenting on the results of my experiments I will discuss which expectations one might have of the actual performance of the algorithm. Clearly we would expect it to perform within the theoretical time bound of O(n²), but what about the actual time usage? Will it, for example, perform equally well on every type of test tree, or does the internal structure of a tree influence the running time?

It is difficult to guess at the actual time spent by the algorithm. One thing to expect, however, is that the C++ implementation will perform better than the Python implementation. With regard to the performance on different types of trees, one should take a look at the actual mechanisms of the algorithm. Where is the work really done? When looking at Eq. (5.2) it is evident that the number of internal nodes is significant, meaning that a high number of inner nodes will make the two sums very large. What about the degree of the inner nodes then? Even though there is a correspondence between the number of inner nodes, |I|, and their degrees, d_v, as pointed out in Sec. 3.1, I find it hard to see the exact influence on the algorithm here. From my point of view, it is easier to look at the implementation details.
Here I deal with directed edges pointing away from the root, and the number of these is clearly the same as the number of (undirected) edges in the tree altogether. From this point of view I would expect the number of edges to be critical to the running time, meaning of course that fewer edges lead to better performance. Therefore I expect binary trees, having 2n − 3 edges, to demand a high processing time,


whereas the star tree, having n edges, would be faster to deal with. In between, sqrt-trees should be faster than wc-trees, again due to the number of edges.

This algorithm has a quadratic space consumption, due to the table being used for storage, which should not cause any problems with regard to memory. As an example, think of two binary trees of 800 leaves. The table has an entry for each pair of subtrees, which is equal to each pair of directed edges. Since a binary tree has |E| = 2n − 3, the calculation is roughly:

    2|E| × 2|E| × size_of_int ≈ (2 × 1600) × (2 × 1600) × 4 bytes ≈ 41 MB.

This will be no problem for the test environment described in Sec. 2.2, which will have plenty of memory to spare.

Result  Figures 5.2 and 5.3 display the results of the two implementations, Python and C++ respectively, being applied to the five types of trees. The first thing to observe is the correctness of the time bound. Every plot is parallel to the line indicating quadratic growth. Furthermore, the estimated exponents of the expressions describing the plots are very close to 2. Next, the expectations about the order among the trees hold true. It seems to be correct that more edges lead to higher processing time. There is in fact as large a difference as a factor of ten between the best and worst results.
What is not as clear, because of the log-log plot, is that the difference becomes more pronounced the larger the trees become. This is the case because the distance between the plots remains the same, even though one step on an axis becomes increasingly significant when moving along the axis. This is of course a consequence of the quadratic growth.

Last and less important, we can compare the two figures and see that the C++ implementation is clearly faster than the Python implementation.
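The "best exp" values in the plot legends are the fitted exponents of the running-time curves. The thesis does not spell out the fitting procedure; a standard way to obtain such an exponent, sketched here under that assumption (the function name is mine), is an ordinary least-squares fit of log t against log n, whose slope is the exponent k in t(n) = c·n^k.

```python
import math

def best_fit_exponent(ns, ts):
    """Least-squares slope of log t versus log n.

    For measurements following t(n) = c * n^k, the slope of the
    regression line on a log-log scale is exactly k.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a perfectly quadratic data set this returns 2.0; noise in real timings explains legend values such as 1.89 or 2.05.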


Figure 5.2: Experiments showing the running time of the shared leaf set sizes calculation algorithm in Python for the different types of test trees (fitted exponents: bin 2.02, ran 2.04, sqrt 1.89, star 2.00, wc 2.00).

Figure 5.3: Experiments showing the running time of the shared leaf set sizes calculation algorithm in C++ for the different types of test trees (fitted exponents: bin 2.03, ran 2.05, sqrt 1.86, star 1.97, wc 1.99).


Chapter 6

Cubic time algorithm

Here I will describe another algorithm for quartet distance computation, by Christiansen et al. [6], that improves the solution to a cubic running time at the expense of an increased space consumption. The algorithm is based on the concept of shared leaf set sizes, introduced in Section 5.2, and further extends the idea of using centers, introduced along with the quartic algorithm described in Chapter 4.

We are interested in the number of quartets for which the topology differs between the two trees. Therefore, we might as well count or calculate the number of quartets that share the same topology and then subtract this number from the overall number of quartets, C(n,4).

Having calculated the shared leaf set sizes, the number of leaves common to two subtrees, |T_x ∩ T′_x|, can be found with a constant time look-up. This will be used extensively to calculate the number of shared quartets containing some triplet of leaves, (a,b,c), in constant time. However, the main idea of the algorithm is to process pairs of leaves, (a,b), of which there are O(n²), and then, in linear time, to calculate the number of shared quartets that include a given pair. This is done by considering all internal nodes on the path from a to b as centers. These centers can clearly be found in linear time.
Every leaf c, different from a and b, can be reached from exactly one of the centers C, by following an outgoing edge from C that is not part of the path between a and b. See Fig. 6.1.

In linear time, a path between a and b is found and an array computed, storing in entry i the center of the triplet (a,b,i). This is done for both trees. The arrays are linear in size, and processing each pair of centers of the leaves (a,b,i) in constant time gives an overall running time of the algorithm of O(n³). The following and remaining part of the algorithm is a recipe for constant time computation of the number of shared quartets containing a triplet (a,b,c), given two centers, C in T and C′ in T′, of the triplet.
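The linear-time center computation just described can be sketched as follows. The adjacency-list representation and all names are my own assumptions; the thesis stores centers in an array indexed by leaf id, whereas this sketch returns a dictionary for readability.

```python
def centers_for_pair(adj, a, b):
    """For leaves a and b, map every other leaf c to the internal node
    on the a-b path that is the center of the triplet (a, b, c).

    `adj` maps each node to its neighbour list; leaves have degree 1.
    """
    # find the path from a to b with an iterative depth-first search
    parent = {a: None}
    stack = [a]
    while stack:
        v = stack.pop()
        if v == b:
            break
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                stack.append(w)
    path = []
    v = b
    while v is not None:                 # walk parent pointers back to a
        path.append(v)
        v = parent[v]
    on_path = set(path)

    centers = {}

    def collect(center, frm, v):
        # register every leaf in the subtree hanging off the path at `center`
        if len(adj[v]) == 1:
            centers[v] = center
            return
        for w in adj[v]:
            if w != frm:
                collect(center, v, w)

    for C in path:
        if len(adj[C]) == 1:             # skip the endpoint leaves a and b
            continue
        for w in adj[C]:
            if w not in on_path:         # outgoing edge leaving the path
                collect(C, C, w)
    return centers
```

Each node and edge is visited a constant number of times, so one pair (a,b) costs O(n), in line with the analysis above.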


Figure 6.1: Each node on the path between a and b makes up a center C. For every leaf c contained in T_Cx, C_x will be the center of the triplet (a,b,c).

Recall that the center C, of leaves a, b and c, defines three subtrees T_a, T_b, T_c and also the set T_rest of all remaining subtrees of the center node. The same applies to C′. Identically to the quartic algorithm of Chap. 4, every leaf other than a, b, and c will be the fourth leaf in some quartet. Thus, if x appears in T_a, the resulting topology in T will be ax|bc. Likewise, if x ∈ T′_a, the quartet will have the same topology in T′. It follows that every leaf different from a, appearing in both subtrees T_a and T′_a, will result in a shared quartet topology. It is easy to see that all this boils down to a single look-up in the table of shared leaf set sizes, namely |T_a ∩ T′_a|. One should be subtracted, because the table also includes the leaf a itself, which is obviously present in both subtrees. Thus, we can count all quartets of this type, associated with a certain triplet, in constant time. The same applies to the leaves b and c, giving a constant time computation of all shared butterfly topologies with the following expression:

    |T_a ∩ T′_a| + |T_b ∩ T′_b| + |T_c ∩ T′_c| − 3    (6.1)

Counting the number of shared star topologies is not as straightforward. Recall that if x ∈ T_rest, the topology of the quartet (a,b,c,x) is a star topology.
Because T_rest is actually not a single subtree but a set of subtrees, the expression |T_rest ∩ T′_rest| cannot be found with a single look-up. Instead, we can make use of the following observation: every leaf x ∈ T is in exactly one of T_a, T_b, T_c or T_rest, since together they contain all leaves and are pairwise disjoint, that is, |T_a ∪ T_b ∪ T_c ∪ T_rest| = n and T_a ∩ T_b ∩ T_c ∩ T_rest = ∅. Similarly for leaves in T′. This gives the following expression:

    |T_rest ∩ T′_rest| = |T′_rest| − (|T′_rest ∩ T_a| + |T′_rest ∩ T_b| + |T′_rest ∩ T_c|)    (6.2)

This is only part of the solution, however, since none of the terms are stored in the shared leaf sets table. Using the same principle again a couple of times, we can eliminate all occurrences of T_rest and T′_rest and rewrite every term to be expressed only by terms


that we can look up in constant time:

    |T′_rest ∩ T_a| = |T_a| − (|T_a ∩ T′_a| + |T_a ∩ T′_b| + |T_a ∩ T′_c|)
    |T′_rest ∩ T_b| = |T_b| − (|T_b ∩ T′_a| + |T_b ∩ T′_b| + |T_b ∩ T′_c|)
    |T′_rest ∩ T_c| = |T_c| − (|T_c ∩ T′_a| + |T_c ∩ T′_b| + |T_c ∩ T′_c|)    (6.3)

The last expression is derived directly from the number of leaves in T′:

    |T′_rest| = n − (|T′_a| + |T′_b| + |T′_c|)    (6.4)

We are now able to compute the number of shared star quartets, and thus the overall number of shared quartets of some triplet, in constant time. Combining this with the approach of finding all centers associated with some pair of leaves in linear time yields a running time of the entire algorithm of O(n³). Because of the table for shared leaf set sizes, the algorithm requires a space consumption of O(n²).

Just like with the quartic algorithm, the shared quartets are counted too many times. Here, however, each quartet is counted twelve times. This is due to the fact that we deal with pairs and that each quartet is considered once for each of the six possible pairs that can be composed from the four leaves. In addition, each of those pairs will be used for the construction of two triplets, one for each of the two remaining leaves. Therefore, the number of shared quartet topologies counted is divided by twelve.

See Alg. 6.1 for an outline of the algorithm.

Contribution 6.1  Despite the description in Section 2.2 of Christiansen et al.
[6], which states that each quartet is counted four times, the cubic algorithm is actually counting each quartet twelve times, as explained in the text. This is a minor change, but of course necessary to keep in mind in order to get the correct result when implementing the algorithm. I made the discovery while working on my implementation, seeing that the result was consistently triple what was expected.

6.1 Implementation

The cubic algorithm has been implemented first in Python and later in C++. Since no unusual tricks or language features are required, the two implementations are very similar. The first step is the implementation of the algorithms for calculating the leaf sets, lines 1 and 2 of Alg. 6.1. These have been described and tested in Sec. 5.3.
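The per-triplet constant-time count of Eqs. (6.1)-(6.4) can be sketched roughly as follows, assuming the shared leaf set look-ups are already available. The function name and the dictionary interface are mine, chosen for readability; the thesis implementation works directly on edge-id-indexed tables.

```python
def shared_quartets_for_triplet(n, size, size_p, I):
    """Shared quartet topologies for one triplet (a, b, c), given its
    centers C in T and C' in T'.

    size[x]   = |T_x|   for x in 'abc' (subtrees of C)
    size_p[x] = |T'_x|  for x in 'abc' (subtrees of C')
    I[x][y]   = |T_x ∩ T'_y|, looked up in the shared leaf set table.
    """
    # shared butterflies, Eq. (6.1): three diagonal intersections,
    # minus 3 because each of a, b, c is itself counted once
    butterflies = I['a']['a'] + I['b']['b'] + I['c']['c'] - 3

    # shared stars: eliminate T_rest and T'_rest via Eqs. (6.2)-(6.4)
    rest_p = n - (size_p['a'] + size_p['b'] + size_p['c'])        # |T'_rest|, Eq. (6.4)
    rest_p_inter = {x: size[x] - sum(I[x][y] for y in 'abc')      # |T'_rest ∩ T_x|, Eq. (6.3)
                    for x in 'abc'}
    stars = rest_p - sum(rest_p_inter[x] for x in 'abc')          # Eq. (6.2)
    return butterflies + stars
```

For identical trees, every fourth leaf contributes either one shared butterfly or one shared star, so the result for a triplet is always n − 3 in that case, which is a handy sanity check.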


Algorithm 6.1 Cubic algorithm for quartet distance
 1: Calculate subtree leaf set sizes                                ⊲ O(n)
 2: Calculate shared leaf set sizes                                 ⊲ O(n²)
 3: sharedQ = 0
 4: for all pairs (a,b) do                                          ⊲ O(n²)
 5:     Find paths p in T and p′ in T′ between a and b              ⊲ O(n)
 6:     Find all centers C on the path p and C′ on p′               ⊲ O(n)
 7:     for all leaves c ∉ {a,b} do                                 ⊲ O(n)
 8:         tmpQ = count quartets (a,b,c,x) with the same topology
                    in the two trees                                ⊲ O(1)
 9:         – this is done in constant time using C_abc and shared leaf set sizes
10:         sharedQ = sharedQ + tmpQ
11:     end for
12: end for
13: diffQ = C(n,4) − sharedQ/12
14: return diffQ

Going through each pair of leaves (a,b) is easy. The path between two leaves is found by making one traversal of the tree. Every internal node on the path has a number of subtrees (besides the two containing a and b), and each of these is traversed in turn, registering which leaves are contained in that subtree. A center simply stores the ids of the edges that identify the subtrees containing a, b and a third leaf. An array, call it A, is created to store each of the n − 2 centers. Centers are stored in the array A at the index corresponding to the id of the third leaf, known as c.

Now only the counting remains. Looking at the pair of centers associated with every choice of third leaf, the shared butterfly quartets and shared star quartets associated with exactly those centers are counted, cf. Eq. (6.1), Eq.
(6.2) and the subsequent expressions.

6.1.1 Result

The implementations have been exposed to the usual range of trees. The results are displayed in Fig. 6.2 and Fig. 6.3.

As expected, both implementations seem to have a cubic behaviour on all trees, since the plots are parallel to the purely cubic line. This is further supported by the estimated exponents, which are all very close to three.

The experiments have been continued until the running times exceeded what is acceptable. Because the Python implementation is significantly slower, and since cubic growth in running time is rather dramatic, the Python experiments have only been continued up to 400 leaves, the C++ experiments to 800 leaves. The experiments have in general been repeated five times; however, the larger Python experiments take more than an hour of running time each and have therefore only been


carried out once.


Figure 6.2: Performance of the cubic algorithm implemented in Python (fitted exponents: bin 2.97, ran 2.97, sqrt 2.92, star 3.02, wc 2.99).

Figure 6.3: Performance of the cubic algorithm implemented in C++ (fitted exponents: bin 2.85, ran 2.86, sqrt 2.72, star 3.03, wc 2.96).


Chapter 7

Sub-cubic time algorithm

In this chapter I will describe the sub-cubic algorithm of Mailund et al. [14], which is the main topic of this thesis. I will establish the theoretical foundation necessary to understand the experimental investigation that follows. The first part, Sec. 7.1, is an intuitive outline of the approach used to calculate the quartet distance. Then follows a detailed explanation of the most critical point in the algorithm, including a discussion of the influence it has on the asymptotic worst-case analysis. The algorithm has a sub-cubic running time, but the exact complexity is tightly related to the complexity of matrix multiplication, and this is significant in the analysis. Furthermore, this complexity is used as a parameter in the algorithm and thus has to be known. This is a subject of study in Sec. 7.2, which will give details about my work on the implementation of the algorithm and the experimental results.

7.1 The algorithm

In this algorithm, the overall approach to the quartet distance problem is changed compared to the two reference algorithms of Chapters 4 and 6. Instead of counting shared or different star quartets separately, they can be expressed solely in terms of the butterfly topologies. The article by Christiansen et al. [6] introduces the idea.
The quartet distance is the total number of quartets that induce different topologies in the two trees. That is, the number of quartets that have one butterfly topology in one tree and another butterfly topology in the other tree, plus the number of quartets that have a star topology in one of the trees and a butterfly topology in the other. In short,

    qdist(T,T′) = diff_B(T,T′) + diff_S(T,T′).

It turns out that diff_S(T,T′) can be expressed in terms of the number of butterflies in each tree, B in T and B′ in T′, the number of shared butterflies, shared_B(T,T′), and the


number of different butterflies, diff_B(T,T′), in the following way:

    diff_S(T,T′) = B + B′ − 2(shared_B(T,T′) + diff_B(T,T′))

This gives a complete expression for the quartet distance between two trees:

    qdist(T,T′) = B + B′ − 2·shared_B(T,T′) − diff_B(T,T′)    (7.1)

Given the fact that B = shared_B(T,T) and likewise B′ = shared_B(T′,T′), it is clear that only two procedures are needed to calculate Eq. (7.1): one for shared_B and one for diff_B. They are described separately in the following Sec. 7.1.2 and Sec. 7.1.3.

Like the cubic algorithm, the sub-cubic algorithm makes heavy use of the concept of shared leaf set sizes, introduced in Section 5.2, but makes even less use of tree examination; one can say that it has a more computational nature. Nevertheless, it is easy to illustrate the intuition behind the algorithm. Two new concepts are introduced.

A directed quartet, written ab → cd, first encountered in the article by Brodal et al. [3], is a butterfly quartet, ab|cd, with a direction on the path between the two "wings" of the butterfly, see Fig. 7.1. Of course the number of shared butterfly quartet topologies between two trees is equal to half the number of shared directed quartets. Likewise for different butterflies.
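The combination in Eq. (7.1) is simple bookkeeping once the four counts are available. A minimal sketch (the function name is mine), with the sanity check that identical trees yield distance zero, since then B = B′ = shared_B and diff_B = 0:

```python
def quartet_distance(B, B_p, shared_B, diff_B):
    """Eq. (7.1): qdist = B + B' - 2*shared_B - diff_B.

    Substituting diff_S = B + B' - 2*(shared_B + diff_B) into
    qdist = diff_B + diff_S gives this closed form.
    """
    return B + B_p - 2 * shared_B - diff_B
```

Note how the diff_B terms partially cancel: diff_B + (B + B′ − 2·shared_B − 2·diff_B) collapses to the single −diff_B in the formula.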
We observe that each directed quartet is identified by exactly one directed edge e, in such a way that a and b are positioned behind e, and c and d are positioned in front of e, in two different subtrees of the end node of e.

Figure 7.1: The two directed quartets induced from a butterfly quartet.

A claim, written A −e→ (C,D), is the term used to denote such an edge e "claiming" all directed quartets ab → cd where a and b are contained in the subtree A behind e, while c and d are contained in two different subtrees C and D in front of e. It is convenient to add this to our vocabulary now, as the algorithm will eventually deal with edges. Naturally, a claim is associated with subtrees, and therefore one edge typically claims multiple butterfly quartets. Also, an edge can be involved in different claims with different subtrees, but as mentioned, every directed quartet is claimed by exactly one edge. See Fig. 7.2 for an illustration. We see how the edge e "divides" the tree into several parts: one subtree behind the edge and several different subtrees in front of the edge.

The fundamentals have been established. The algorithm will process all possible pairs of directed edges, (e,e′) ∈ T × T′, and for all the directed quartets claimed by both


edges, count the quartet as a shared butterfly if the two claims give rise to the same topology, and otherwise count the quartet as a different butterfly. Making heavy use of preprocessing, one such pair of edges can be handled in constant time, see Sec. 7.1.4. Since |E| = O(n) and there are O(n²) pairs of edges, the process of counting the butterflies can be done in O(n²) time. Thus, the preprocessing step is crucial to the running time.

Figure 7.2: Example of a claim. The directed edge is a unique identifier of the directed quartet ab → cd.

7.1.1 Basic preprocessing

Here I will describe only the fundamental part of the preprocessing that is necessary to understand the intuition behind the algorithm and how to count shared and different butterflies. More preprocessing is needed to make it possible to process a pair of directed edges in constant time, as mentioned. Unfortunately, that part of the preprocessing poses a threat to the sub-cubic complexity of the entire algorithm and has a direct influence on the running time, so it must be handled with care. For now it would merely be confusing, and I will postpone its introduction until the appropriate section.

As we shall see, it comes in handy that the notion of claims and the concept of shared leaf set sizes both deal with subtrees. The first preprocessing step is to calculate the shared leaf set sizes, as explained in Sec.
5.2, which has quadratic time and space consumption.

The next step is to calculate, for each pair of internal nodes v ∈ T and v′ ∈ T′, with subtrees F_1, ..., F_{d_v} and G_1, ..., G_{d_{v′}}, a matrix I where I[i,j] = |F_i ∩ G_j|. When processing pairs of edges as mentioned above, we will need this matrix, I, associated with the two nodes that the edges point to.

This is enough information about the preprocessing step to complete the intuitive explanation of counting butterflies in Sec. 7.1.2 and 7.1.3.
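As a concrete illustration, the matrix I for one node pair can be sketched as follows. This is a hypothetical toy representation in which each subtree is a Python set of leaf labels; the thesis instead derives the entries from the precomputed shared leaf set sizes.

```python
def intersection_matrix(F, G):
    """Build I[i][j] = |F_i ∩ G_j| for the subtrees F_1..F_dv of v in T
    and G_1..G_dv' of v' in T' (subtrees given as sets of leaf labels)."""
    return [[len(Fi & Gj) for Gj in G] for Fi in F]

# Toy example: v has three subtrees, v' has two.
F = [{"a", "b"}, {"c"}, {"d", "e"}]
G = [{"a", "c", "d"}, {"b", "e"}]
I = intersection_matrix(F, G)  # [[1, 1], [1, 0], [1, 1]]
```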


CHAPTER 7. SUB-CUBIC TIME ALGORITHM

7.1.2 Counting shared butterflies

The number of shared butterflies, the quantity shared_B(T, T′), can be counted using the following method. The edges e and e′ represent a pair of claims; name them F_i →_e (F_k, F_m) and G_j →_{e′} (G_l, G_n). The intention of the algorithm is to count the number of directed butterflies ab → cd where a, b ∈ F_i ∩ G_j, c ∈ F_k ∩ G_l and d ∈ F_m ∩ G_n. This is illustrated in Fig. 7.3.

Figure 7.3: Two edges both claiming the butterfly ab → cd.

Now, the total number of shared directed butterflies can be calculated using the expression

\[
\frac{1}{2}\binom{|F_i \cap G_j|}{2} \sum_{k \neq i} \sum_{l \neq j} |F_k \cap G_l| \sum_{m \neq i,k} \sum_{n \neq j,l} |F_m \cap G_n|, \tag{7.2}
\]

which, using the preprocessing matrix I associated with the two nodes that e and e′ point to, is equal to the expression

\[
\frac{1}{2}\binom{I[i,j]}{2} \sum_{k \neq i} \sum_{l \neq j} I[k,l] \sum_{m \neq i,k} \sum_{n \neq j,l} I[m,n], \tag{7.3}
\]

where the division by 2 is necessary to avoid counting the same butterfly twice because of symmetry, namely the symmetry between the two entries I[k,l] and I[m,n]. Figure 7.4 a) illustrates Eq. (7.3) with a choice of indices, and it is clear that a symmetry occurs between the two entries mentioned.

Contribution 7.1 The article originally claimed that the count should be divided by four and not two, due to symmetry between the indices k and m and between the indices l and n. It turns out that this is actually not the case. I became suspicious after implementing the algorithm and seeing the result consistently being only half of what it should be. When looking at Fig.
7.3 it is easy to imagine how the subtrees F_k and F_m could interchange, and likewise G_l and G_n. This gives rise to four different constellations. Figure 7.4 a),


reflecting Eq. (7.3), however, makes it clear that the symmetry only occurs between two entries in the matrix. Therefore the butterflies are only counted twice. This is because Eq. (7.3) is incomplete and only captures half of the constellations of the indices. To produce the total number of shared directed butterflies, accounting for the symmetry, the two equations (7.2) and (7.3) should have read

\[
\frac{1}{4}\binom{|F_i \cap G_j|}{2} \sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i,k} \sum_{n \neq j,l} \bigl( |F_k \cap G_l|\,|F_m \cap G_n| + |F_k \cap G_n|\,|F_m \cap G_l| \bigr) \tag{7.4}
\]

and

\[
\frac{1}{4}\binom{I[i,j]}{2} \sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i,k} \sum_{n \neq j,l} \bigl( I[k,l]\,I[m,n] + I[k,n]\,I[m,l] \bigr). \tag{7.5}
\]

Figure 7.4 b) illustrates a choice of indices in Eq. (7.5). It is clear that for some fixed choice of the (i, j) entry, there are four ways to choose the other entries; as soon as one entry is selected, say (k, l), there is only one way to select the remaining three entries (m, n), (k, n) and (m, l) if precisely those four entries are to be selected.

As the result in the article is simply wrong by a factor of 2, the succeeding parts of the article are still valid if we divide by 2 instead of 4. One can show that the two subexpressions of Eq. (7.5) actually express the same quantity, so this is a correct workaround. This is most fortunate, since what remains from that point in the article is to transform Eq. (7.3) into an expression that is constant time computable. That work is therefore still valid.

Figure 7.4: Illustrates (a) the choice of entries in Eq. (7.3) and (b) the choice of entries in Eq.
(7.5).

A recipe for counting the shared butterflies associated with a pair of edges has now been described. This is done for all pairs of directed edges. As mentioned in the description of directed butterflies, one butterfly is identified by two different edges in one tree. The consequence is that each shared butterfly is associated with two different pairs of claims.
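For reference, the corrected count of Eq. (7.5) can be transcribed directly into a naive sketch that spends O(d^4) time per claim pair. This is a hypothetical helper for checking small cases, not the constant time version the algorithm actually uses; the product is always divisible by 4 since each butterfly appears in exactly four constellations.

```python
from math import comb

def shared_directed_butterflies(I, i, j):
    """Naive transcription of Eq. (7.5) for the claim pair whose edges
    point into subtrees F_i and G_j; I is the intersection matrix."""
    total = 0
    for k in range(len(I)):
        if k == i:
            continue
        for l in range(len(I[0])):
            if l == j:
                continue
            for m in range(len(I)):
                if m in (i, k):
                    continue
                for n in range(len(I[0])):
                    if n in (j, l):
                        continue
                    total += I[k][l] * I[m][n] + I[k][n] * I[m][l]
    return comb(I[i][j], 2) * total // 4

# Two identical trees, one inner node with subtrees {a,b}, {c}, {d} gives
# I = [[2,0,0],[0,1,0],[0,0,1]]; the single shared directed butterfly
# ab -> cd is found at (i, j) = (0, 0).
```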


If e and e′ are the edges that claim the directed quartet ab → cd in the two trees T and T′ respectively, the directed quartet ab ← cd will also be claimed by some edge in each tree; these edges will obviously point in the opposite directions. Let us denote these edges ē and ē′. Each butterfly will be counted twice: once as associated with the pair (e, e′) and once as associated with the pair (ē, ē′). Obviously, a directed quartet cannot be claimed by both e and ē, and therefore shared quartets are only counted twice. This might be obvious, but I stress the point because the same is not true when it comes to counting different butterflies.

7.1.3 Counting different butterflies

Different butterflies, the quantity diff_B(T, T′), can be counted using the following method. Once again, the edges e and e′ represent a pair of claims; name them F_i →_e (F_k, F_m) and G_j →_{e′} (G_l, G_n). Of interest are the quartets that are claimed by both edges where those two claims result in different butterfly topologies. As an example, the two directed butterflies ab →_e cd and ac →_{e′} bd would be the result of the following positioning of the leaves: a ∈ F_i ∩ G_j, b ∈ F_i ∩ G_l, c ∈ F_k ∩ G_j and d ∈ F_m ∩ G_n. This is illustrated in Fig. 7.5.

Figure 7.5: Two edges claim the butterflies ab → cd and ac → bd respectively.

This situation is expressed through the following equation:

\[
|F_i \cap G_j| \sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i,k} \sum_{n \neq j,l} |F_i \cap G_l|\,|F_k \cap G_j|\,|F_m \cap G_n|. \tag{7.6}
\]

And using matrix notation:

\[
I[i,j] \sum_{k \neq i} \sum_{l \neq j} \sum_{m \neq i,k} \sum_{n \neq j,l} I[i,l]\,I[k,j]\,I[m,n]. \tag{7.7}
\]

Figure 7.6 illustrates the entries that are multiplied for some choice of indices. Observing this or Fig. 7.5, it should be fairly clear that there is no symmetry issue when counting different butterflies. No two entries can be switched without forcing the other entries to move as well, making it impossible to produce the same situation with another


choice of indices.

Figure 7.6: Illustrates the choice of entries in Eq. (7.7).

Once again, each possible pair of directed edges needs processing. And once again, since a regular undirected butterfly quartet is the subject of two claims, each quartet is encountered both when dealing with e and when dealing with ē. Again, ē is the oppositely directed edge that claims the same quartet as e.

The total result of counting different butterflies should be divided by four, as opposed to the shared butterflies. While similarity is identified by observing the same two leaves behind the edge in both T and T′, difference is identified by requiring that the pairs of leaves behind the edges be different. The consequence is that if the edge pair (e, e′) identifies a different butterfly quartet, then so do the pairs (e, ē′), (ē, e′) and (ē, ē′). Thus, one single quartet is counted as different four times, and consequently the result should be divided by four.

Contribution 7.2 It is evident that the use of directed edge claiming introduces the need for dividing the total count by some factor. However, the article states that different butterflies are counted twice, just like shared butterflies, and that the result should therefore be divided by two.

After implementing the diff_B(T, T′) algorithm I observed that the result was, "for some reason", consistently double of what I expected.
After thoroughly verifying that my implementation was correct, I investigated the correctness of the article further. I realized that the reasoning should be different and that the result should be divided by four, as explained above.
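Eq. (7.7) likewise admits a direct, naive transcription, again O(d^4) per claim pair. This hypothetical reference implementation is only useful for checking small cases; the actual algorithm evaluates the constant time form of Sec. 7.1.4.

```python
def different_directed_butterflies(I, i, j):
    """Naive transcription of Eq. (7.7): the contribution of one claim
    pair to the count of different butterflies (before dividing by 4)."""
    total = 0
    for k in range(len(I)):
        if k == i:
            continue
        for l in range(len(I[0])):
            if l == j:
                continue
            for m in range(len(I)):
                if m in (i, k):
                    continue
                for n in range(len(I[0])):
                    if n in (j, l):
                        continue
                    total += I[i][l] * I[k][j] * I[m][n]
    return I[i][j] * total

# For two identical trees the matrix I is diagonal-heavy and every
# contribution is 0, as it should be: identical trees induce no
# different butterflies.
```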


7.1.4 How to count butterflies in constant time

Sec. 7.1.1, 7.1.2 and 7.1.3 gave an intuitive explanation of the algorithm. This section will clarify that the number of butterflies, shared and different, associated with a pair of directed edges can in fact be calculated in constant time, given the right preprocessing information. All preprocessing arrays and tables are listed in Appendix A. They will be referenced here, and the most important table, whose calculation poses a threat to the overall complexity of the algorithm, will be explained in detail below.

The article explains how the expression in Eq. (7.3), used to calculate the number of shared directed butterflies associated with a pair of internal nodes, can be translated into the following constant time computable expression:

\[
\frac{1}{2}\binom{I[i,j]}{2}\Bigl( M' - R'[i] - C'[j] + I'[i,j] + (I[i,j] - R[i] - C[j])(M - R[i] - C[j] + I[i,j]) + R''[i] - I[i,j](C[j] - I[i,j]) + C''[j] - I[i,j](R[i] - I[i,j]) \Bigr) \tag{7.8}
\]

And furthermore how the expression in Eq.
(7.7), used to calculate the number of different directed butterflies associated with a pair of internal nodes, is translated into one of the two succeeding expressions:

\[
I[i,j]\Bigl( (M - R[i] - C[j] + I[i,j])(R[i] - I[i,j])(C[j] - I[i,j]) + (R[i] - I[i,j])\bigl(I[i,j](R[i] - I[i,j]) - C''[j]\bigr) + (C[j] - I[i,j])\bigl(I[i,j](C[j] - I[i,j]) - R''[i]\bigr) + I'''_1[i,j] - I[i,j]\,I''_1[i,i] - I[i,j]\bigl(C'''[j] - I[i,j]^2\bigr) \Bigr) \tag{7.9}
\]

\[
I[i,j]\Bigl( (M - R[i] - C[j] + I[i,j])(R[i] - I[i,j])(C[j] - I[i,j]) + (R[i] - I[i,j])\bigl(I[i,j](R[i] - I[i,j]) - C''[j]\bigr) + (C[j] - I[i,j])\bigl(I[i,j](C[j] - I[i,j]) - R''[i]\bigr) + I'''_2[i,j] - I[i,j]\,I''_2[j,j] - I[i,j]\bigl(R'''[i] - I[i,j]^2\bigr) \Bigr) \tag{7.10}
\]

This requires some explanation. The respective translations into these constant time computable expressions are very cumbersome and not essential to this thesis; I will merely refer to the appendix of the original article by Mailund et al. [14] for the details of this process.


More interesting are the steps needed to prepare each of the tables used for look-up, because the actual calculation of these is part of the algorithm, and a thorough understanding is therefore essential. All tables are prepared during the preprocessing of each pair of internal nodes, along with the table I presented in Sec. 7.1.1. They are all results of further processing of I and are necessary for the constant time calculation. With the exception of the tables I'''_1 and I'''_2, none of the tables are time-consuming to deal with; each costs no worse than O(d_v d_{v′}) which, for all pairs of inner nodes, leads to a total time of

\[
\sum_{v\in T} \sum_{v'\in T'} d_v d_{v'} = \Bigl(\sum_{v\in T} d_v\Bigr)\Bigl(\sum_{v'\in T'} d_{v'}\Bigr) \le (2|E|)(2|E|) = O(n^2).
\]

These two exceptions, I'''_1 and I'''_2, are actually identical and only differ in the way they are calculated. Under one name, they shall simply be known as I''', and the calculation of this table is explained thoroughly in the following Sec. 7.1.4.1.

7.1.4.1 The calculation of I'''

The table named I''' plays a part in the calculation of the number of different butterflies, and in order to keep the entire running time of the algorithm sub-cubic, special care is needed in its calculation. It is defined as follows:

\[
I'''[i,j] = \sum_{\substack{k=1 \\ k\neq i}}^{d_v} \sum_{\substack{l=1 \\ l\neq j}}^{d_{v'}} I[i,l]\,I[k,j]\,I[k,l] \tag{7.11}
\]

Filling the table naively, in accordance with the formula, takes O(n^4) time, and the promised sub-cubic time bound will be broken.
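Dropping the k ≠ i and l ≠ j restrictions from Eq. (7.11) turns the entry into the (i, j) entry of the triple matrix product I I^T I, which by associativity can be bracketed either way; the correction terms in the constant time expressions then account for the excluded indices. A small pure-Python sketch, where the example matrix and the naive matmul helper are illustrative assumptions (in practice the product would be an optimized library call):

```python
def matmul(A, B):
    """Naive matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# A hypothetical 3x2 intersection matrix I for a node pair with
# degrees d_v = 3 and d_v' = 2:
I = [[1, 2], [3, 4], [5, 6]]

I1 = matmul(matmul(I, transpose(I)), I)  # (I I^T) I, ~ d_v^2 * d_v' work
I2 = matmul(I, matmul(transpose(I), I))  # I (I^T I), ~ d_v * d_v'^2 work
assert I1 == I2  # associativity: both give the unrestricted triple sum

# Entry (0, 0) equals the sum over all k, l of I[0][l] * I[k][0] * I[k][l]:
assert I1[0][0] == sum(I[0][l] * I[k][0] * I[k][l]
                       for k in range(3) for l in range(2))
```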
Hence, the calculation constitutes a serious barrier and demands separate attention. The solution is instead to calculate either I'''_1 = (I I^T) I or I'''_2 = I (I^T I), both of which are described in more detail in the appendix. At first sight this does not seem to solve the problem, since the solution relies on matrix multiplication, and the complexity of matrix multiplication, if done naively, is O(n^3). As explained in the article [14], choosing the first or the second solution results in an explicit running time of O(d_v^2 d_{v′}) or O(d_v d_{v′}^2), respectively, for processing a pair of internal nodes (v, v′) with degrees d_v and d_{v′}. However, other methods for calculating the matrix product may be utilized, and this is essential to the algorithm. Applying a matrix multiplication method with a time complexity of O(n^ω) on square matrices, one can make the calculation in time O(max(d_v, d_{v′})^ω). This value might be smaller for some matrices that are nearly square, but the approach requires that the matrices be padded with zeroes to become square, i.e. extended to fit the requirements of the matrix multiplication method used.

It is difficult to predict the impact of the matrix multiplication on the entire algorithm; since it is applied per pair of internal nodes, the complete running time is not identical to that of the multiplication method. In the article [14], Section 4 gives a thorough case


analysis of the problem, clarifying that the worst-case running time of the algorithm is O(n^{α+2}) where α = (ω − 1)/2. A theoretical asymptotic lower bound on matrix multiplication is Ω(n^2), because all 2n^2 input entries have to be processed, and thus one can assume that 2 ≤ ω ≤ 3. The consequence is that 1/2 ≤ α ≤ 1, so the algorithm might be sub-cubic, but not below Ω(n^{2.5}).

The article mentions the Coppersmith-Winograd algorithm [7] as an example of a fast method for matrix multiplication, featuring ω = 2.376, which, according to the analysis, results in a running time of O(n^{2.688}).

Nevertheless, these theoretical considerations regarding the exponent are not the focus of this thesis. First, such an analysis merely establishes a worst-case bound of the algorithm, which might be loose and not even close to the real value; an implementation might therefore yield better results. Second, ω is in fact theoretical, meaning that it would have to be estimated through practical experiments to be used as a parameter of the algorithm. Regardless, I am interested in studying the algorithm as a tool for calculating the quartet distance, and therefore the best choice of matrix multiplication method will be some robust, practically fast implementation.
For example, a quick investigation of Coppersmith-Winograd reveals that the algorithm is not applicable to real data because the constant factors are far too large; the benefit would not be seen on data of a size manageable on present-day computers.

The worst-case scenario is that the analysis given in the article is tight, meaning that it is impossible to do better in practice. Suppose, at the same time, that no practically useful sub-cubic method for matrix multiplication exists. This would lead to a strict performance of Θ(n^3) in practice. Hopefully, the analysis is "loose" and leaves room for improvement, meaning that an implementation does not necessarily perform close to the predicted asymptotic worst-case bound. Furthermore, it may be easy to find an efficient matrix multiplication implementation, be it naive or sub-cubic.

In short, the algorithm leaves us with a choice to make, namely to decide

\[
\min\bigl( \max(d_v, d_{v'})^{\omega},\; d_v^2\, d_{v'},\; d_v\, d_{v'}^2 \bigr), \tag{7.12}
\]

which depends on the exponent ω. In practice, however, this is a bit more cumbersome than simply calculating that expression, since the two methods for matrix multiplication, naive and fast, hide different constant coefficients. One will need to experimentally estimate ω and/or draw an empirical conclusion on when to choose one way over the others. This is further investigated in Sec. 7.2.2, where I choose a fast and robust implementation and use it as a black box.
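The decision of Eq. (7.12) can be sketched as a simple cost comparison. This is illustrative only: the exponent, and in reality the hidden constant factors, would have to be estimated experimentally, and the strategy names are my own labels.

```python
def choose_strategy(dv, dvp, omega=2.376):
    """Pick the cheapest way to obtain I''' for a node pair with degrees
    dv and dvp, per Eq. (7.12). Constant factors are ignored here."""
    costs = {
        "fast, padded to square": max(dv, dvp) ** omega,
        "naive (I I^T) I": dv * dv * dvp,
        "naive I (I^T I)": dv * dvp * dvp,
    }
    return min(costs, key=costs.get)

# choose_strategy(100, 100) -> "fast, padded to square"
# choose_strategy(100, 2)   -> "naive I (I^T I)"
```

With ω = 2.376 the fast method wins for large, nearly square node pairs, while very skewed degree pairs are cheaper to handle with the appropriately bracketed naive product.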


7.2 Implementation

This section describes my work on implementing the sub-cubic algorithm of Sec. 7.1. As mentioned, it comes down to implementing two algorithms, namely shared_B(T, T′) and diff_B(T, T′), used for counting shared and different butterflies in the two trees T and T′.

To calculate the total number of butterflies in a single tree, the algorithm for counting shared butterflies is used. This means that the basic preprocessing, namely calculating the shared leaf set sizes, has to be done three times: once for counting shared and different butterflies in the two trees, once for counting butterflies in the tree T, and once for counting butterflies in the tree T′. The implementation of the algorithm for shared leaf set size calculation has been described in Sec. 5.3 and comprises the most space consuming part of the sub-cubic algorithm, resulting in O(n^2) memory usage.

The paper [14] presents the algorithms as follows: do the preprocessing for each pair of inner nodes, then do the counting for each pair of directed edges. Instead, I have chosen a strategy of processing each pair of inner nodes in turn, doing both the preprocessing and the counting for each pair of directed edges adjacent to the nodes. Consequently, the only real preprocessing done prior to processing the nodes is calculating the shared leaf set sizes, the implementation of which is described in Sec. 5.3.

The full amount of preprocessing information needed to deal with a pair of nodes is listed in App. A.
Some of it is needed in both shared_B(T, T′) and diff_B(T, T′), and some is specific to the latter. Therefore, I implemented the two together in a single function capable of returning both counts. If both values are needed, the preprocessing specific to counting different butterflies is just an extension of the preprocessing done for shared butterflies.

Implementing shared_B(T, T′) is straightforward, following the recipe given in Sec. 7.1.2, 7.1.4 and App. A. I loop through every pair of internal nodes and fill out all the preprocessing tables needed for shared quartets only. Then I go through the pairs of edges adjacent to the two nodes, apply Eq. (7.8), and add the result to the total sum of shared butterflies. I make use of the Boost¹ library for binomial coefficient calculation. Before the total number of shared butterflies is returned, the result is divided by 4: by 2 because of symmetry (see the clarification in Contribution 7.1) and by 2 because there are twice as many directed edges as there are undirected edges. One thing to note is that it only makes sense to look at edges pointing to internal nodes, since there have to be at least two leaves behind the edge (e.g. in the subtree F_i in Fig. 7.3) to form a butterfly.

Implementing diff_B(T, T′) requires more attention. If the analysis given in the paper

¹ Boost: http://www.boost.org/


is optimal, this is a key point for success or failure, since the matrix multiplication poses the worst threat to the promised sub-cubic time bound. Solving the task involves deciding which library to use for matrix multiplication and studying its behaviour, so as to find out how to make the crucial comparison of the three quantities max(d_v, d_{v′})^ω, d_v^2 d_{v′} and d_v d_{v′}^2.

After implementing those two algorithms, only one thing remains, namely to apply them to the two trees and put the counts together according to the expression shown in Eq. (7.1).

7.2.1 Prototype

Starting out softly, the first goal is to make a prototype implementation focusing only on producing the correct result. Consequently, I make a simple implementation of the choice and base it solely on which of d_v^2 d_{v′} and d_v d_{v′}^2 is smaller. That is easy to determine, and along with the choice comes the calculation of either C''' and I'''_1 = (I I^T) I, or R''' and I'''_2 = I (I^T I), respectively. I use a basic, naive implementation of matrix multiplication. For the Python implementation I utilize the NumPy library for scientific computing² and for the C++ implementation the Boost.uBLAS library³. They both provide matrix data structures and routines for matrix multiplication. Then it is all about looping through pairs of edges, applying Eq. (7.9) or Eq. (7.10), and summing up the results.
Before returning, the result is divided by four, because the directed edges give rise to four different situations (see Contribution 7.2).

Expectations  So, what would one expect from the prototype implementations when subjected to the usual range of trees? Using naive matrix multiplication, and hence the value α = 1, makes the whole algorithm O(n^3). Thus, we can expect to see cubic worst-case behavior. However, since the analysis is simply worst-case, we can still, as mentioned earlier, hope that it is not a tight upper bound and that the algorithm is efficient in practice.

With regard to the difference in performance between the trees used, I will, once again, base my expectations on the actual code written. From my point of view there is one reasonable way to look at this. The algorithm processes pairs of internal nodes, and for each of these pairs there is some preprocessing and some calculation to do, both of which depend on the degrees of the two internal nodes. The preprocessing relies on matrix multiplication, meaning that larger degrees will be harder to deal with. Depending on the overhead due to preprocessing and calculations, the relationship between the number of inner nodes and their degree might be more or less important. The same relation made the case analysis difficult.

² SciPy/NumPy: http://numpy.scipy.org/
³ Boost.uBLAS: http://www.boost.org/doc/libs/1_42_0/libs/numeric/ublas/doc/index.htm

Based on these observations, my assumption is that trees with a large number of high-degree internal nodes will pose the worst challenge; here the implementation will suffer from the increasing cost of the matrix multiplication, while the overhead of the other calculations will influence the time consumption as well. As described in Sec. 3.1, sqrt trees are intended to have this property. For bin trees, however, with a large number of small-degree inner nodes, the cost of matrix multiplication per node pair never grows. Therefore the result will probably be that this cost can be disregarded and that the algorithm will behave as a quadratic function, even though there might be a large overhead due to the many node pairs being processed. The star trees have one node of much higher degree, which could result in a very small overhead but a much faster growth.

Result  The performance of the prototypes applied to the five types of test trees is shown in Fig. 7.7 and 7.8. The result is quite surprising, and the first thing that catches the eye is the relatively slow growth of the time usage. For the Python implementation the running times seem to follow a more or less quadratic development, with the star tree being the worst challenge, resulting in a slightly steeper slope, however, far from cubic.
The plot of the C++ implementation displays the same tendency, the development being close to quadratic. It is, however, even clearer that the implementation suffers from matrix multiplication costs when dealing with star trees and, to some extent, also the wc trees.

The conclusion is, as assumed, that the algorithm responds differently to trees depending on their internal structure. For smaller input trees, the algorithm easily deals with the ones that have a small number of internal nodes, whereas a complex internal structure with a large number of internal nodes results in a large overhead. When the input size is increased, the overhead is neutralized and the more dominant characteristic seems to be the maximum degree of the internal nodes. Large degrees result in large matrix multiplications, and the consequence is a faster growth in time usage.

These experiments have been repeated five times each, once again with the exception of some of the larger Python experiments. Especially the large experiments on bin, ran and wc trees are slow and have not been repeated.

Figure 7.7: Performance of the naive prototype of the sub-cubic algorithm in Python (best-fit exponents: bin 2.03, ran 2.06, sqrt 1.81, star 2.02, wc 1.99).

Figure 7.8: Performance of the naive prototype of the sub-cubic algorithm in C++ (best-fit exponents: bin 2.05, ran 2.06, sqrt 1.90, star 2.96, wc 2.18).

7.2.2 Introducing a library for matrix multiplication

Having gained the first experience with the algorithm, I will now introduce a better method for matrix multiplication and attempt to improve the algorithm in accordance with the theoretic approach.

Mailund et al. [14] suggest the use of advanced matrix multiplication methods and substantiate their algorithmic results with the best theoretic result within the field of matrix multiplication, namely the Coppersmith-Winograd algorithm [7]. As mentioned in Chap. 1, this kind of theoretic result will be met with immediate scepticism by most programmers, since one might suspect that the demands of the theory can never be met by an implementation; in this case because the size of the input is too small for the improvement to be observed.

Here the ideal situation would be to find an algorithm that is sub-cubic in theory and still efficient when implemented in practice. Searching the literature and the web for a reasonable method for matrix multiplication, I came across the Strassen algorithm [18], which is approximately O(n^2.8) and is described as being efficient in practice for large matrices. However, I was not successful in finding an optimized implementation of the algorithm or a linear algebra library containing one. Since matrix multiplication is not the focus of this thesis, but rather a smaller piece in the algorithmic puzzle, time did not allow me to attempt implementing it on my own. In addition, it seems unreasonable to expect that I would be competitive with highly optimized libraries.

Instead I decided to go with a highly optimized, reliable and robust library.
This is a meaningful decision, since the goal of the thesis is to test the practical usefulness of the algorithm. What is the use of a theoretic result if it is not practically applicable? The final choice was to use the Basic Linear Algebra Subprograms (BLAS) API^4, which is a de facto standard for various linear algebra packages. Since my first implementation was in Python, BLAS caught my attention after discovering that it integrates with SciPy/NumPy, which is automatically compiled against the BLAS installation if present on the machine. Furthermore, the Boost.NumericBindings library^5 provides a generic layer between the Boost.uBlas data types and the BLAS linear algebra routines in C++.

The exact routines utilized are level 3 BLAS calls for matrix-matrix multiplication. In NumPy and Boost.NumericBindings these have general interfaces used through the calls C = dot(A,B) and void gemm(A,B,C) respectively. Both solutions require a bit of setup using the right type of matrix container structure, but with the libraries this is not difficult. I will not provide further details, but merely refer to the code, which is available (see App. B).

To get an idea of how the performance improved, a comparison has been made of the four implementations used, namely the two used in the prototypes and the two BLAS integrations. The result of multiplying square matrices of increasing sizes is illustrated in the plot of Fig. 7.9.

^4 The BLAS specification: http://www.netlib.org/blas/
^5 Boost.NumericBindings: http://svn.boost.org/svn/boost/sandbox/numeric_bindings-v1/libs/numeric/bindings/doc/index.html

[Figure 7.9: Comparison of the matrix multiplication methods used. Log-log plot of running time t(n) in seconds against matrix size n (50 to 800) for python-scipy, python-blas, cpp-boost and cpp-blas, with an n^3 reference line.]

As expected, the BLAS calls are significantly faster than the basic implementations, by nearly as much as a factor of one thousand. The plot is based on ten repetitions of each matrix multiplication, but nevertheless the points are a bit scattered and do not form straight lines. I attribute this to the methods used, which may suffer from structural influences; this could also be indicated by the similar behaviour of the two BLAS routines. Especially for the larger matrices the optimized calls are superior. It is, however, surprising to see that both Python calls perform better than their respective C++ counterparts.

The only method that seems to be strictly O(n^3) is the C++ Boost.uBlas call. In retrospect one might see this as the reason why the star tree was such a challenge to the C++ prototype implementation.

7.2.3 The final version

Now, had I been successful in finding an implementation of a theoretically good algorithm, the next step would have been to estimate the exponent ω, and thus be able to choose the smallest of max(d_v, d_v')^ω, d_v^2 d_v' and d_v d_v'^2. However, as discussed in Sec. 7.1.1, the matrix multiplication routines hide away different coefficients and, in addition, time is spent padding the matrices in some cases, all of which has to be taken into account as


well. My implementation will differ from the theory by making no distinction between the naive and some advanced method for matrix multiplication: I might as well use BLAS in every case, since it is not specifically designed for square matrices. Even though I have no knowledge of the internals of the BLAS implementation, I can treat it as a black box and test it as a piece in the algorithmic puzzle. With all of this in mind, I will make an array of experiments with the purpose of clarifying exactly when, if ever, it is feasible to pad the matrices, making them square, before applying the BLAS routine.

I have been executing an equivalent of the prototype algorithm, but substituting the matrix multiplication method with BLAS. In each case where the matrix multiplication had to be carried out, I made both calculations, with and without padding of the matrices, comparing the running time of that particular part of the execution. This has been done for each combination of the different types of test trees, for larger trees of mainly 800 leaves. The result is shown in Table 7.1. For each execution of the algorithm it is shown how many times the multiplication was carried out and for how many of those padding was advantageous; this is also shown percentage-wise. For every case where padding was in fact advantageous, the table shows the largest degree of an internal node and the largest difference between the degrees of two of the internal nodes being processed, which is identical to the difference between the number of rows and columns in the matrix.

Trees          Number of         Times padding      Largest   Largest difference   %
               multiplications   was an advantage   degree    in degree
bin vs bin     636804            0                  -         -                    0.000
ran vs ran     154449            30                 9         4                    0.019
sqrt vs sqrt   841               0                  -         -                    0.000
star vs star   1                 0                  -         -                    0.000
wc vs wc       160801            0                  -         -                    0.000
bin vs ran     313614            43                 7         4                    0.014
bin vs sqrt    8358              0                  -         -                    0.000
bin vs star    798               0                  -         -                    0.000
bin vs wc      319998            0                  -         -                    0.000
ran vs sqrt    4221              0                  -         -                    0.000
ran vs star    393               0                  -         -                    0.000
ran vs wc      157593            130                7         4                    0.082
sqrt vs star   21                0                  -         -                    0.000
sqrt vs wc     4221              0                  -         -                    0.000
star vs wc     401               0                  -         -                    0.000

Table 7.1: Investigating when it is advantageous to multiply square matrices.
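The padding variant compared above can be reproduced in miniature. This sketch is my own illustration, not the thesis code: it zero-pads a rectangular matrix pair to square before multiplying and confirms that the relevant block of the padded product equals the unpadded one, which is why the two variants are interchangeable up to running time.

```python
import numpy as np

def pad_to_square(A, n):
    """Embed A in the top-left corner of an n x n zero matrix."""
    P = np.zeros((n, n))
    P[:A.shape[0], :A.shape[1]] = A
    return P

rng = np.random.default_rng(1)
r, c = 6, 9                       # a rectangular case: unequal degrees
A = rng.standard_normal((r, c))
B = rng.standard_normal((c, r))
n = max(r, c)

direct = A @ B                    # multiply the rectangular matrices as-is
padded = pad_to_square(A, n) @ pad_to_square(B, n)

# The top-left r x r block of the padded product is the direct product;
# the extra rows and columns contribute only zeros.
assert np.allclose(padded[:r, :r], direct)
```

Timing the two calls on the matrix shapes actually encountered is then exactly the comparison reported in Table 7.1.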


From the table it is evident that padding the matrices does not speed up the multiplication. Only in three experiments did it seem faster in some cases, and only in a very small percentage of the multiplications. Furthermore, from the sizes of the internal nodes involved we observe that it is only when dealing with very small matrices (#rows/#columns < 10) that we "benefit" from padding. More likely, it is a coincidence rather than an actual tendency. My conclusion is that it is not beneficial to pad the matrices and consequently, there is no reason to separately take care of the case where max(d_v, d_v')^ω is smallest; I will continue to distinguish only between d_v^2 d_v' and d_v d_v'^2, as done in the prototype.

Expectations  The implementation still does not meet the formal description of the article, which means that we cannot count on the analysis; thus, the implementation might break the sub-cubic time bound and behave as a cubic algorithm. It seems reasonable to keep the assumptions from the prototype. Nevertheless, I hope to see a general improvement, which of course is dependent on the proportion of time actually spent on matrix multiplication.
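The distinction between d_v^2 d_v' and d_v d_v'^2 mentioned above corresponds to the two ways of associating a product of the form I I^T I for a d_v x d_v' matrix I (cf. App. A). The sketch below is my own illustration of that choice, not the thesis code: it picks the association with the smaller naive multiplication cost and checks that both associations give the same result.

```python
import numpy as np

def triple_product(I):
    """Compute I @ I.T @ I, associating to minimize naive cost.

    For a d_v x d_v' matrix I, (I I^T) I costs about d_v^2 d_v'
    scalar multiplications while I (I^T I) costs about d_v d_v'^2,
    so the cheaper association depends on which dimension is larger.
    """
    dv, dvp = I.shape
    if dv * dv * dvp <= dv * dvp * dvp:   # d_v^2 d_v' vs d_v d_v'^2
        return (I @ I.T) @ I
    return I @ (I.T @ I)

rng = np.random.default_rng(2)
I_wide = rng.standard_normal((4, 12))    # d_v < d_v': left association wins
I_tall = rng.standard_normal((12, 4))    # d_v > d_v': right association wins
for I in (I_wide, I_tall):
    # Associativity guarantees both orders agree; only the cost differs.
    assert np.allclose(triple_product(I), I @ I.T @ I)
```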
This should be most significant for the experiments involving the largest multiplications, namely the star trees and the worst case trees, which seemed to be suffering because of the larger matrices encountered.

Result  The result of applying the two final implementations to the usual array of test data is shown in Fig. 7.10 and Fig. 7.11. We observe that there is no significant improvement in running time. The general picture is much the same as for the prototype; the final implementations perform in the same range of time consumption as the prototype implementations, with a few exceptions. The reason is very likely the relatively low amount of time spent doing matrix multiplication and the relatively high amount of time spent processing the pairs of internal nodes.

There are a couple of cases, however, where a change in the slope of the plot is clearly observable. That is, not surprisingly, the trees with few inner nodes, resulting in large matrices: star trees and wc trees. And only for the C++ implementation. Especially the star tree yields a dramatic improvement.
The plot of the C++ prototype, which was almost parallel to the O(n^3) line, has now improved to being almost parallel to the O(n^2) line. However, the plot does not form an exact straight line, which I suspect is because the overhead is gradually equalized by the complexity of the matrix multiplication as the size of the matrices grows. Eventually this will result in the plot showing the complexity of the matrix multiplication only. The processing time of an 800-leaf star tree is lowered from over 100 seconds to below 10 seconds, meaning that the change of matrix library was indeed a great improvement.


[Figure 7.10: Performance of the sub-cubic algorithm in Python. Log-log plot of running time t(n) in seconds against the number of leaves n (50 to 800), with n, n^2 and n^3 reference lines. Best-fit exponents: bin 2.04, ran 2.07, sqrt 1.79, star 2.00, wc 2.00.]

[Figure 7.11: Performance of the sub-cubic algorithm in C++. Same axes and reference lines. Best-fit exponents: bin 2.05, ran 2.07, sqrt 1.81, star 2.04, wc 2.02.]


Further verifications of the quality of the algorithm  In order to establish complete confidence in the quality of the algorithm, two more experiments have been carried out. The first one is an extension of the experiments already done, but with the purpose of examining more challenging combinations of trees to compare. This is done to further ensure that the algorithm really is behaving well in all imaginable scenarios, which might indicate that it is in fact sub-cubic, even though we do not have the proof of the analysis to back up such a claim.

The trees having a single internal node of very high degree, namely the star tree and the wc tree, were the ones showing the steepest growth, because of the multiplication of larger matrices. I will combine these with the trees having a larger number of internal nodes and yielding a worse actual running time, to see if the combination of the two properties will result in a longer running time than the two do separately.

The result is clear. These new combinations of input are not a greater challenge to the algorithm.
Figure 7.12 shows that the resulting performance is no worse than what we have previously witnessed; the growth in running time seems close to quadratic and the actual running time is better than the worst we have seen so far.

[Figure 7.12: Combining the trees with high-degree inner nodes with other trees. C++ implementation, log-log plot of running time t(n) in seconds against the number of leaves n (50 to 800), with n, n^2 and n^3 reference lines. Best-fit exponents: star-bin 2.03, star-ran 2.05, star-wc 2.01, wc-bin 2.04, wc-ran 2.04.]

The purpose of the second experiment is to ensure that the arrangement, or order, of the input does not influence the running time. Some algorithms perform significantly better on certain inputs. For example, some sorting algorithms, like quicksort and bubble sort, will, in spite of a quadratic worst-case running time, perform in linear time on input that is nearly sorted, indeed giving a wrong impression of the algorithm. My implementation makes a depth-first numbering of the leaves in T and then numbers the leaves in T' accordingly. This numbering determines in which order we process different parts of the tree. If we shuffle the leaves in both trees and redo the experiment, the topologies of the trees remain the same, but the algorithm will process each tree in a different order.

Repeating the experiment 25 times with different permutations of the leaves results in the plot displayed in Fig. 7.13. The difference in performance is difficult to observe and it is evident that the order of the leaves is not a decisive factor; the running time is only dependent on the structure of the tree. These results will hopefully heighten the level of confidence.

[Figure 7.13: Shuffling the leaves does not yield a different running time. C++ implementation with leaves shuffled before execution, same axes and reference lines as above. Best-fit exponents: bin 2.06, ran 2.07, sqrt 1.79, star 2.04, wc 2.02.]


Chapter 8

Results and discussion

The experiments with each of the three algorithms have been commented on separately in Chap. 4, 6 and 7. The general impression has been positive: the quartic and cubic algorithms were indeed performing within, and also rather close to, the expected time bounds; however, the quartic algorithm was more sensitive to the variations in input data. The implementation of the sub-cubic algorithm has not precisely followed the theoretic guidelines and as a consequence some compromises have been introduced. More precisely, experimental verification made it clear that padding the matrices to square matrices did not give an advantage over the normal multiplication. Nevertheless, the final implementation has been thoroughly tested with different input, in an attempt to find the worst possible input, and the result is an algorithm that behaves sub-cubically in every situation. However, while the algorithm shows a sub-cubic performance for all types of trees presented, both the actual running time and the development in time usage clearly vary between trees.
Hence, the star tree has an interesting role. For the size of trees used in this thesis (at most 800 leaves), the star tree is both the fastest to deal with and the tree that makes the algorithm show the worst development in time usage. As a consequence, I find it natural to assume that the running time of the sub-cubic algorithm on the star tree, and to some extent also the wc tree, will continue growing faster than on the other inputs and that eventually, on much larger trees, the star trees will be the most time consuming inputs to deal with.

The final result of the experiments was briefly revealed and commented on rather cursorily in Chap. 1, but is shown again in Fig. 8.1 and Fig. 8.2. The two figures compare the experimental results of the three algorithms implemented in Python and C++ respectively. These plots are clear evidence that the sub-cubic algorithm not only meets the theoretic expectations but is faster in practice as well. The C++ implementation is faster for trees with 80 leaves and more, while the Python implementation is not clearly


faster for trees with less than 200 leaves.

A general observation based on the experiments with the three algorithms is that even though their respective time complexities are only defined in terms of the input size n, this does not mean that the implementations are not sensitive to the structure of the input, e.g. the degree of the internal nodes.

8.1 Large numbers

One thing not mentioned, but important to every quartet distance algorithm, is the need to handle large numbers. Quartet distance results can often be very large, simply because there is such a large number of quartets to compare. As an example, 600 leaves result in C(600, 4) = 5,346,164,850 distinct quartets to compare, which is a number exceeding the capacity of a normal 32 bit integer. Thus, if many of those quartets inherit a different topology in the two trees, there is a risk that the result will overflow. Therefore, this should be taken into consideration when implementing a quartet distance algorithm: which size of trees are to be compared and which architecture will be used.

Python provides transparent handling of large numbers, which will be enough for algorithms like the cubic and quartic algorithms.
However, when utilizing external routines, as was the case with BLAS matrix multiplication in the sub-cubic algorithm, special care must be taken and the right data structures used.

This means that pure C++ implementations, without special libraries for large numbers, are limited by the architecture, but pure Python implementations are not. In any case, external procedures may impose further restrictions.
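The overflow concern can be made concrete. Python integers are arbitrary precision, so the quartet count from the text never wraps, whereas a signed 32-bit integer would; the wraparound arithmetic below is only an illustration of what a fixed-width counter would do.

```python
import math

n = 600
quartets = math.comb(n, 4)          # number of distinct quartets on n leaves
assert quartets == 5346164850       # the count computed in the text
assert quartets > 2**31 - 1         # exceeds a signed 32-bit integer

# Illustrative signed 32-bit wraparound: the same count stored in a
# 32-bit integer would come out as a different (wrong) value.
wrapped = ((quartets + 2**31) % 2**32) - 2**31
assert wrapped != quartets
```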


[Figure 8.1: Comparison of the performances of the three algorithms implemented in Python. Log-log plot of running time t(n) in seconds against the number of leaves n (50 to 800) for the sub-cubic, cubic and quartic algorithms, with n^2, n^3 and n^4 reference lines.]

[Figure 8.2: Comparison of the performances of the three algorithms implemented in C++. Same axes and reference lines.]


Chapter 9

Conclusion

The focus of this thesis has been a thorough experimental study of the algorithm for quartet distance computation between general trees described by Mailund et al. [14], with the aim of eventually being able to either accept or reject the postulated sub-cubic running time as a practically usable result.

As a foundation I have given a gentle introduction to the subject of quartet distance calculation, followed by a detailed description of two reference algorithms, a quartic and a cubic one, including implementations and experimental verification of both.

Since my focus has been on experiments, I have given a careful review of a wide range of test data, arguing that the whole spectrum of challenging input has been covered.

I have implemented the supposed sub-cubic time algorithm and my conclusion is positive: I am certainly convinced that the algorithm performs in sub-cubic time in practice. However, my implementation does not strictly follow the theoretical proposal and consequently I do not have the support of the analytic result to rely on. Nevertheless I can provide robust evidence to support my conclusion: exposing the algorithm to the entire set of test trees showed good practical behavior in every situation. As a matter of fact, the algorithm shows quadratic behavior on most input, and especially in the least artificial cases.
The most challenging input introduced has been the star tree, a single internal node with n leaves attached, which forces the algorithm to make a multiplication of two n × n matrices. However, the situation is very artificial, and the two matrices multiplied contain zeroes in every entry except n entries that contain ones. It is not within the scope of this thesis to gain knowledge about the internals of the library used for matrix multiplication and therefore I do not know if such matrices are treated differently. The experiments on star trees show good results, but the need for sub-cubic matrix multiplication is inevitable.

Another view of this conclusion is that the analysis of the algorithm is not tight


enough to reflect the actual running time. The algorithm performs better than predicted on most input, but given two star trees it should perform with exactly the same time complexity as the matrix multiplication method in use, namely in O(n^ω) time.

The resulting implementation of the sub-cubic algorithm is not only behaving well with regard to time development; it also performs better than the two reference algorithms and thus, the algorithm does not suffer from being the product of a seemingly complex idea.

Finally, I have verified the correctness of each algorithm and left a few contributions in this regard. These are highlighted as Contributions 6.1, 7.1 and 7.2.

In summary I have verified the correctness and time complexity of three algorithms, with focus on the sub-cubic algorithm. Hopefully I have left the reader with enough insight and the necessary tools to continue the work in a direction towards an even faster algorithm.

9.1 Future work

I find it natural that this work should be extended algorithmically in the attempt to find a purely quadratic algorithm. I think the ideas used by Mailund et al. [14] are interesting and promising. In any case, the sub-cubic algorithm may be usable in practice, but it is not superior to the other existing algorithms for general trees and hence, I think algorithmic improvements would be far more valuable than practical optimizations.

I think focus should be on finding a different way to calculate diff_B(T, T') so as to get rid of the matrix multiplication.
Alternatively, it might be possible to calculate the total number of quartets where both trees inherit a butterfly topology, call this quantity total_B(T, T'), and then calculate diff_B(T, T') = total_B(T, T') − shared_B(T, T').


Bibliography

[1] Mukul S. Bansal, Jianrong Dong, and David Fernández-Baca. Comparing and aggregating partially resolved trees. In LATIN'08: Proceedings of the 8th Latin American conference on Theoretical informatics, pages 72–83, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3-540-78772-0, 978-3-540-78772-3.

[2] Gerth Stølting Brodal, Rolf Fagerberg, and Christian N. S. Pedersen. Computing the quartet distance between evolutionary trees in time O(n log^2 n). Proceedings of the 12th International Symposium for Algorithms and Computation (ISAAC), 2223:731–742, 2001.

[3] Gerth Stølting Brodal, Rolf Fagerberg, and Christian N. S. Pedersen. Computing the quartet distance between evolutionary trees in time O(n log n). Algorithmica, 38:377–395, 2003.

[4] David Bryant, John Tsang, Paul E. Kearney, and Ming Li. Computing the quartet distance between evolutionary trees. In SODA, pages 285–286, 2000.

[5] Chris Christiansen and Martin Randers. Computing the quartet distance between trees of arbitrary degrees. Master's thesis, University of Aarhus, January 2006.

[6] Chris Christiansen, Thomas Mailund, Christian N. S. Pedersen, and Martin Randers. Computing the quartet distance between trees of arbitrary degree. In WABI, pages 77–88, 2005.

[7] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3):251–280, 1990. ISSN 0747-7171. Computational algebraic complexity editorial.

[8] Bhaskar DasGupta, Xin He, Tao Jiang, Ming Li, John Tromp, and Louxin Zhang. On computing the nearest neighbor interchange distance. 1997.


[9] Brian W. Davis, Gang Li, and William J. Murphy. Supermatrix and species tree methods resolve phylogenetic relationships within the big cats, Panthera (Carnivora: Felidae). Molecular Phylogenetics and Evolution, 56(1):64–76, 2010. ISSN 1055-7903. doi: 10.1016/j.ympev.2010.01.036.

[10] William H. E. Day. Optimal algorithms for comparing trees with labeled leaves. Journal of Classification, 2:7–28, 1985. ISSN 0176-4268. 10.1007/BF01908061.

[11] Annette Dobson. Comparing the shapes of trees. In Anne Street and Walter Wallis, editors, Combinatorial Mathematics III, volume 452 of Lecture Notes in Mathematics, pages 95–100. Springer Berlin / Heidelberg, 1975. 10.1007/BFb0069548.

[12] George F. Estabrook, F. R. McMorris, and Christopher A. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology, 34(2):193–200, 1985. ISSN 00397989.

[13] Thomas Mailund and Christian N. S. Pedersen. QDist - quartet distance between evolutionary trees. Bioinformatics, 20(10):1636–1637, 2004.

[14] Thomas Mailund, Jesper Nielsen, and Christian N. S. Pedersen. A sub-cubic time algorithm for computing the quartet distance between two general trees. In International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing (IJCBS), pages 565–568, 2009.

[15] D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131–147, 1981. ISSN 0025-5564.

[16] Mike A. Steel and David Penny. Distributions of tree comparison metrics: some new results. Systematic Biology, 42(2):126–141, 1993. ISSN 10635157.

[17] Martin Stig Stissing, Christian N. S. Pedersen, Thomas Mailund, Gerth Stølting Brodal, and Rolf Fagerberg. Computing the quartet distance between evolutionary trees of bounded degree. In Series on Advances in Bioinformatics and Computational Biology, volume 5, pages 101–110, 2007.

[18] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969. ISSN 0029-599X.

[19] John Tsang. An approximation algorithm for character compatibility and fast quartet-based phylogenetic tree comparison. Master's thesis, University of Waterloo, October 2000.


[20] M. S. Waterman and T. F. Smith. On the similarity of dendrograms. Journal of Theoretical Biology, 73(4):789–800, 1978. ISSN 0022-5193.


Appendix A

Preprocessing for the sub-cubic algorithm

In this appendix I will list the details of the preprocessing steps omitted in the description of the sub-cubic algorithm in Chap. 7. Though crucial to the running time of the complete algorithm, they do not support the reader in obtaining an overview, and thus, I have chosen to hide them away here.

Recall from Sec. 7.1.1 the matrix I, which stores a subset of the table of shared leaf set sizes between two trees. All remaining preprocessing arrays and tables are results of further processing the information in I. Since the information makes little sense by itself, the naming conventions have been kept simple and are not very explanatory.

Row and column sums and the sum of all entries of I:

    R[i] = \sum_{j=1}^{d_{v'}} I[i,j]                                  (A.1)
    C[j] = \sum_{i=1}^{d_v} I[i,j]                                     (A.2)
    M = \sum_{i=1}^{d_v} \sum_{j=1}^{d_{v'}} I[i,j]                    (A.3)

Another matrix I', its row sums, column sums and its total sum of entries:

    I'[i,j] = I[i,j] (M - R[i] - C[j] + I[i,j])                        (A.4)
    R'[i] = \sum_{j=1}^{d_{v'}} I'[i,j]                                (A.5)


    C'[j] = \sum_{i=1}^{d_v} I'[i, j]                                         (A.6)

    M' = \sum_{i=1}^{d_v} \sum_{j=1}^{d_{v'}} I'[i, j]                        (A.7)

And two more arrays:

    R''[i] = \sum_{j=1}^{d_{v'}} I[i, j] (C[j] - I[i, j])                     (A.8)

    C''[j] = \sum_{i=1}^{d_v} I[i, j] (R[i] - I[i, j])                        (A.9)

Only one of the two following arrays needs calculation, depending on which of the two expressions for counting different butterflies is used, Eq. (7.9) or Eq. (7.10):

    R'''[i] = \sum_{j=1}^{d_{v'}} I[i, j]^2                                   (A.10)

    C'''[j] = \sum_{i=1}^{d_v} I[i, j]^2                                      (A.11)

With the approach taken in the article [14], the following table needs calculation:

    I'''[i, j] = \sum_{k=1, k \neq i}^{d_v} \sum_{l=1, l \neq j}^{d_{v'}} I[i, l] I[k, j] I[k, l]

However, if calculated naively, the promised sub-cubic time bound will be broken. The solution is instead to calculate either I''_1 = I I^T and then I'''_1 = I''_1 I, which is basically

    I''_1[i, k] = \sum_{j=1}^{d_{v'}} I[i, j] I[k, j]                         (A.12)

    I'''_1[i, j] = \sum_{k=1}^{d_v} I[k, j] I''_1[i, k],                      (A.13)

or otherwise calculate I''_2 = I^T I and then I'''_2 = I I''_2, which is basically

    I''_2[j, l] = \sum_{i=1}^{d_v} I[i, j] I[i, l]                            (A.14)

    I'''_2[i, j] = \sum_{l=1}^{d_{v'}} I[i, l] I''_2[l, j].                   (A.15)

The choice is essential to the algorithm and is explained in further detail in Sec. 7.1.4.1.
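The step above is where matrix multiplication enters the running time. As a sanity check (on a made-up random matrix, not thesis data), the following sketch evaluates Eqs. (A.12)–(A.13) both entry by entry with explicit loops and as the two matrix products I I^T and (I I^T) I, and confirms they agree.

```python
import numpy as np

# Hypothetical toy instance of the shared-leaf-set matrix I
# (d_v = 4 rows, d_v' = 3 columns).
rng = np.random.default_rng(0)
I = rng.integers(0, 5, size=(4, 3)).astype(float)
d_v, d_vp = I.shape

# Naive loop evaluation of Eq. (A.12): I''_1[i, k] = sum_j I[i,j] I[k,j].
I2_loop = np.zeros((d_v, d_v))
for i in range(d_v):
    for k in range(d_v):
        I2_loop[i, k] = sum(I[i, j] * I[k, j] for j in range(d_vp))

# Naive loop evaluation of Eq. (A.13): I'''_1[i, j] = sum_k I[k,j] I''_1[i,k].
I3_loop = np.zeros((d_v, d_vp))
for i in range(d_v):
    for j in range(d_vp):
        I3_loop[i, j] = sum(I[k, j] * I2_loop[i, k] for k in range(d_v))

# The same quantities as two matrix products; replacing these with a
# sub-cubic multiplication routine is what yields the O(n^{2+alpha}) bound.
I2_mat = I @ I.T        # I''_1  = I I^T
I3_mat = I2_mat @ I     # I'''_1 = I''_1 I

assert np.allclose(I2_loop, I2_mat)
assert np.allclose(I3_loop, I3_mat)
```

Note that the matrix products sum over all k and l; the k ≠ i, l ≠ j restriction in the definition of I''' is what the surrounding machinery of Chap. 7 accounts for.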


Appendix B

Obtaining and running the programs

A small website has been dedicated to this thesis. It gives information on how to download, compile and run the programs:

link: http://www.cs.au.dk/~dalko/thesis/

Otherwise, request the code at:

e-mail: anders.kabell.kristensen@gmail.com


Appendix C

Real-life application of the algorithm

Some of the programs produced in this thesis have already proved usable in practice. This was not a goal of the thesis, but I find it satisfying to know, and in retrospect I think it adds another level of justification to the thesis.

Peter Foster, Natural History Museum of London, inquired about software for quartet distance computation on general trees at the Bioinformatics Research Center at Aarhus University. Foster writes:

    I am working with my colleague Mark Wilkinson here at the Natural History Museum in London on developing a new method of constructing supertrees, called Quartet Joining. We hope that this will become an alternative to the widely-used MRP (matrix representation with parsimony) method of supertree construction. We test it by simulating a random master tree and then taking subset trees from that master tree, and then reconstructing supertrees from those subset trees, and then comparing the topology of the resulting supertree to the master. We have been using Robinson-Foulds distances, but have thought that using quartet distances would offer an additional perspective on the topology distance. However, simplistic quartet comparisons proved to be far too slow for sizable trees. Often our supertrees are not fully resolved, and so we needed a fast algorithm that could handle polytomies.

Peter let me know that the size of the trees compared is up to a few hundred leaves and that the experiments would be repeated hundreds or thousands of times. They usually work with Python, and therefore I sent my Python implementation of the sub-cubic algorithm. It compares two trees in less than a minute, and that worked alright for a start.


However, I later provided my C++ implementation, which does the job in a few seconds, and after wrapping it as a module for Python it worked painlessly with their software:

    The new algorithm is exactly what we needed, and it will be quite useful.

Foster provided an example of a master tree and a QJ tree made from some subsets of the master tree. They are displayed in Fig. C.1, and we can observe that they do indeed include polytomies. The quartet distance between the two trees is 317524.


[Two tree drawings over taxa t0–t59 omitted.]

Figure C.1: An example of a master tree and a reconstruction of the master tree by quartet joining.
