Computing the quartet distance between general trees

In a binary tree, a quartet can only take one of the three butterfly topologies; with the inclusion of the star topology, however, the method works just as well on general trees. This means that the quartet distance can be used with trees that include partly resolved relationships, i.e. polytomies, but it is required that the two trees specify exactly the same set of leaves.

The quartet distance does not consider branch length but focuses solely on topological properties. It works equally well with rooted and unrooted trees, since a rooted tree can be interpreted as unrooted and a quartet will inherit the same topology in both. However, for rooted trees there is in fact more than one topology for a group of only three leaves, meaning that a triplet distance is also justified, see Dobson [11].

1.3 Overview of algorithms for quartet distance computation

In computer science, the tree is a widely used data structure. It comes in numerous variants and has countless applications, and algorithms working on trees have therefore been studied in great detail. Hence, algorithms for calculating the quartet distance between evolutionary trees are just another application, and while algorithms for this problem benefit from previous research, they might end up being of use in completely different areas than phylogenetics. Likewise, the focus of this thesis will be purely algorithmic and not on applications in bioinformatics.

There are $\binom{n}{4} \in O(n^4)$ unique quartets in a tree with $n$ leaves, which makes quartet distance calculation a computationally heavy problem. Computing the quartet distance naively by explicitly inspecting the topologies of the $O(n^4)$ quartets in the two trees takes $O(n^5)$ time, which amounts to $O(n)$ time per quartet.

Several algorithms have been designed over the years, resulting in dramatic improvements in time usage.
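To make the brute-force idea concrete, here is a minimal Python sketch that enumerates all $\binom{n}{4}$ quartets and counts those whose topology differs between two trees. It is not one of the cited algorithms. The tree representation (an adjacency dictionary of an unrooted tree), the helper names `tree_from_edges`, `leaf_distances`, `quartet_topology` and `quartet_distance`, and the use of hop-count distances with the four-point condition to classify a quartet are all assumptions made for illustration. Because it precomputes leaf-to-node distances, the sketch spends constant time per quartet, so it does not reproduce the $O(n^5)$ bound quoted above.

```python
from collections import deque
from itertools import combinations


def tree_from_edges(edges):
    """Build an adjacency dict for an unrooted tree from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def leaf_distances(adj):
    """Return {leaf: {node: hop count}} for every leaf (degree-1 node)."""
    dist = {}
    for leaf in (v for v, nbrs in adj.items() if len(nbrs) == 1):
        seen = {leaf: 0}
        queue = deque([leaf])
        while queue:  # breadth-first search from this leaf
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen[w] = seen[u] + 1
                    queue.append(w)
        dist[leaf] = seen
    return dist


def quartet_topology(dist, a, b, c, d):
    """Return the butterfly split as a frozenset of two pairs, or "star".

    Four-point condition: in a tree, the pairing with the strictly
    smallest distance sum is the split; three equal sums mean a star.
    """
    sums = {
        frozenset([frozenset([a, b]), frozenset([c, d])]): dist[a][b] + dist[c][d],
        frozenset([frozenset([a, c]), frozenset([b, d])]): dist[a][c] + dist[b][d],
        frozenset([frozenset([a, d]), frozenset([b, c])]): dist[a][d] + dist[b][c],
    }
    if len(set(sums.values())) == 1:
        return "star"
    return min(sums, key=sums.get)


def quartet_distance(adj1, adj2):
    """Count the quartets whose topology differs between the two trees."""
    d1, d2 = leaf_distances(adj1), leaf_distances(adj2)
    if set(d1) != set(d2):
        raise ValueError("both trees must have exactly the same leaf set")
    return sum(
        quartet_topology(d1, *q) != quartet_topology(d2, *q)
        for q in combinations(sorted(d1), 4)
    )


# Example: a fully resolved tree, a rearranged one, and a star tree.
t1 = tree_from_edges([("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
                      ("y", "z"), ("d", "z"), ("e", "z")])
t2 = tree_from_edges([("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"),
                      ("y", "z"), ("d", "z"), ("e", "z")])
star = tree_from_edges([(leaf, "s") for leaf in "abcde"])

print(quartet_distance(t1, t2))    # 2
print(quartet_distance(t1, star))  # 5
```

With five leaves there are only five quartets. Swapping `b` and `c` changes the two quartets that contain `a`, `b` and `c`, while the star tree resolves none of them and therefore disagrees with the binary tree on all five.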
Focus has been on calculating the quartet distance between binary trees, which do not include star quartets and seem to be less complex to handle. Steel and Penny [16] showed how to calculate the quartet distance in time $O(n^3)$. Bryant et al. [4] improved this to $O(n^2)$ and introduced some concepts important to this thesis. The work of Brodal et al. [3] has also been important to this thesis and resulted in the fastest known algorithm for binary trees, with a time bound of $O(n \log n)$.

For general trees, Bansal et al. [1] describe an $O(n^2)$ time 2-approximation algorithm, but this thesis will only deal with exact quartet distance calculation. Christiansen et al. [6] present three algorithms with running times of $O(n^4)$, $O(n^3)$ and $O(n^2 d^2)$ respectively, where $d$ is the maximum degree of any node in the two trees. Stissing et al. [17] present an $O(d^9 n \log n)$ time algorithm. Note that some of the algorithms are bounded by the degree of the internal nodes, whereas others are independent of this factor and bounded only by the number of leaves in the tree. However, $d \le n$, and consequently a complexity of $O(d^2 n^2)$ is never worse than one of $O(n^4)$.

In this thesis, the main subject of study will be an algorithm proposed by Mailund et al. [14] that yields a running time of $O(n^{2+\alpha})$, where $\alpha = \frac{\omega - 1}{2}$ and $O(n^{\omega})$ is the complexity of matrix multiplication. At the time of writing, it is the only algorithm that provides a sub-cubic time complexity while being independent of the degree. However, it depends on a fast method for matrix multiplication that provides sub-cubic time usage on square matrices. Naive matrix multiplication, with $\omega = 3$, yields $\alpha = 1$ and thus a running time of $O(n^3)$. The authors of the article point out that using an advanced matrix multiplication method with a good theoretical bound will yield a sub-cubic time complexity. They mention the Coppersmith-Winograd algorithm [7], which provides $\omega = 2.376$, resulting in a time complexity of $O(n^{2.688})$. Now, despite being the fastest method for matrix multiplication to date, the Coppersmith-Winograd algorithm is not practically applicable.

Such theoretical results will often leave the programmer reluctant, since an implementation seems most unrealistic. However, from my discussions with the authors I know that the early theoretical work on the analysis of the quartet distance algorithm left a general first impression that it was more efficient than eventually shown by the final asymptotic worst-case analysis. Theoretical upper bounds on the time usage of an algorithm might be far worse than the actual time usage, as a result of the algorithm being difficult to analyse, and I therefore find this information promising and hope that these early predictions were true.
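As a quick check of the exponent arithmetic in this section, the short Python sketch below evaluates $2 + \alpha$ with $\alpha = (\omega - 1)/2$ for the two values of $\omega$ mentioned in the text. The function name `quartet_exponent` is hypothetical and chosen only for this illustration.

```python
def quartet_exponent(omega):
    """Exponent 2 + alpha of the O(n^(2 + alpha)) bound, with alpha = (omega - 1) / 2."""
    alpha = (omega - 1) / 2
    return 2 + alpha


# The two matrix multiplication exponents mentioned in the text.
for label, omega in [("naive", 3.0), ("Coppersmith-Winograd", 2.376)]:
    print(f"{label}: omega = {omega}, exponent = {quartet_exponent(omega):.3f}")
# naive: omega = 3.0, exponent = 3.000
# Coppersmith-Winograd: omega = 2.376, exponent = 2.688
```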
This hope may sound contradictory, but the point is that while it might be difficult, because of practical limitations, to obtain a worst-case running time as good as that of the analysis, an implementation might at the same time perform significantly better on most input. Hence, this thesis will study the practical behaviour of the algorithm with the goal of clarifying the relation between the theoretical and the practical time complexities. For reference, the two other algorithms whose running time is independent of the degree [6] are studied in detail as well. This will allow comparison of the performance and verification of correctness.

1.3.1 Final result

The final result of the study of the practical behaviour of the algorithms is illustrated in Fig. 1.4. The three algorithms are compared for two different implementations. Without going too much into detail at this point (see Chap. 8 for details), note that the plots show that the $O(n^4)$ and $O(n^3)$ algorithms behave as expected, but more interestingly we can observe that the sub-cubic algorithm is not only faster in practice, it also behaves
