computing the quartet distance between general trees

CHAPTER 7. SUB-CUBIC TIME ALGORITHM

is optimal, this is a key point to success or failure, since the matrix multiplication poses the worst threat to the promised sub-cubic time bound. Solving the task involves deciding which library to use for matrix multiplication and studying its behaviour, in order to find out how to make the crucial comparison of the three quantities $\max(d_v, d_{v'})^{\omega}$, $d_v^2 d_{v'}$ and $d_v d_{v'}^2$.

After implementing those two algorithms only one thing remains, namely to apply the algorithms to the two trees and put the counts together according to the expression shown in Eq. (7.1).

7.2.1 Prototype

Starting out softly, the first goal is to make a prototype implementation that focuses only on the correct result. Consequently, I will make a simple implementation of the choice and base it solely on which of $d_v^2 d_{v'}$ and $d_v d_{v'}^2$ is smaller. That is easy to determine, and along with the choice comes the calculation of either $C'''$ and $I'''_1 = (I I^T) I$, or $R'''$ and $I'''_2 = I (I^T I)$, respectively. I will use a basic, naive implementation of matrix multiplication.

For the Python implementation I will utilize the NumPy library for scientific computing² and for the C++ implementation I will utilize the Boost.uBLAS library³. They both provide matrix data structures and routines for matrix multiplication. Then it is all about looping through pairs of edges, applying Eq. (7.9) or Eq. (7.10), and summing up the results. Before returning, the result is divided by four because directed edges give rise to four different situations (see Contribution 7.2).

Expectations  So, what would one expect from the prototype implementations when subjected to the usual range of trees? Using naive matrix multiplication, and hence the value of $\alpha = 1$, will make the whole algorithm $O(n^3)$. Thus, we can expect to see cubic worst-case behavior.
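The choice between the two parenthesizations can be illustrated with a minimal sketch. It assumes that $I$ is a $d_v \times d_{v'}$ matrix stored as a list of rows; the helper names `matmul`, `transpose` and `triple_product` are hypothetical, the code uses only the Python standard library rather than NumPy or Boost.uBLAS, and the calculation of $C'''$ and $R'''$ is left out.

```python
def matmul(A, B):
    """Naive matrix product of two matrices given as lists of rows."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("inner dimensions do not match")
    cols = len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(A))
    ]


def transpose(A):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def triple_product(I):
    """Compute I * I^T * I with the cheaper of the two parenthesizations.

    Assumption: the decision is based only on comparing d_v^2 * d_w with
    d_v * d_w^2, where I has d_v rows and d_w columns (d_w plays the role
    of d_{v'}). Returns the product and which side was multiplied first.
    """
    d_v, d_w = len(I), len(I[0])
    It = transpose(I)
    if d_v * d_v * d_w <= d_v * d_w * d_w:
        # (I I^T) I: both products take about d_v^2 * d_w multiplications.
        return matmul(matmul(I, It), I), "left"
    # I (I^T I): both products take about d_v * d_w^2 multiplications.
    return matmul(I, matmul(It, I)), "right"


if __name__ == "__main__":
    product, order = triple_product([[1, 2, 0], [0, 1, 3]])
    print(order, product)
```

Both branches return the same matrix, since matrix multiplication is associative; only the number of scalar multiplications differs.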
However, since the analysis is simply worst-case, we can still, as mentioned earlier, hope that it is not a tight upper bound and that the algorithm is efficient in practice.

With regard to the difference in performance between the trees used, I will, once again, base my expectations on the actual code written. From my point of view there is one reasonable way to look at this. The algorithm processes pairs of internal nodes, and for each of these pairs there is some preprocessing and some calculations to do, both of which depend on the degrees of the two internal nodes. The preprocessing relies on matrix multiplication, meaning that larger degrees will be harder to deal with. Depending on the overhead due to preprocessing and calculations, the relationship between the number of inner nodes and their degree might be more or less important. The same relation made the case analysis difficult.

² SciPy/NumPy: http://numpy.scipy.org/
³ Boost.uBLAS: http://www.boost.org/doc/libs/1_42_0/libs/numeric/ublas/doc/index.htm

Based on these observations my assumption is that trees with a large number of high-degree internal nodes will pose the worst challenge; here the implementation will suffer from the increasing cost of the matrix multiplication, while the overhead of the other calculations will influence the time consumption as well. As described in Sec. 3.1, sqrt trees are intended to have this property. For bin trees, however, with a large number of small-degree inner nodes, the cost of matrix multiplication per node pair will never grow. Therefore the result will probably be that this cost can be disregarded and that the algorithm will behave as a quadratic function, even though there might be a large overhead due to the many node pairs being processed. The star trees have one node of much higher degree, which could result in a very small overhead but a much faster growth.

Result  The performance of the prototypes applied to the five types of test trees is shown in Fig. 7.7 and 7.8. The result is quite surprising, and the first thing that catches the eye is the relatively slow growth of the time usage. For the Python implementation the running times seem to follow a more or less quadratic development, with the star tree being the worst challenge, resulting in a slightly steeper slope that is, however, far from cubic. The plot of the C++ implementation displays the same tendency, the development being close to quadratic. It is, however, even clearer that the implementation suffers from matrix multiplication costs when dealing with star trees and to some extent also the wc trees.

The conclusion is, as assumed, that the algorithm responds differently to trees, depending on their internal structure.
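The degree-based reasoning in the expectations can be made concrete with a rough cost model. The sketch below is an illustration only, not the measurement code behind the figures. It assumes that each pair of internal nodes costs $\min(d_v^2 d_{v'}, d_v d_{v'}^2)$ for the matrix multiplication, that both input trees have the same shape, that a star tree has a single internal node adjacent to all $n$ leaves, and that a bin tree is an unrooted binary tree with $n - 2$ internal nodes of degree 3. All function names are hypothetical, and the per-pair overhead is ignored.

```python
def pair_cost(d_v, d_w):
    """Modelled matrix multiplication cost for one pair of internal nodes."""
    return min(d_v * d_v * d_w, d_v * d_w * d_w)


def model_cost(degrees_a, degrees_b):
    """Sum the modelled cost over all pairs of internal nodes of two trees."""
    return sum(pair_cost(a, b) for a in degrees_a for b in degrees_b)


def star_degrees(n):
    """Assumed star tree: one internal node adjacent to all n leaves."""
    return [n]


def binary_degrees(n):
    """Assumed unrooted binary tree: n - 2 internal nodes of degree 3."""
    return [3] * (n - 2)


if __name__ == "__main__":
    for n in (10, 20, 40):
        star = model_cost(star_degrees(n), star_degrees(n))
        binary = model_cost(binary_degrees(n), binary_degrees(n))
        print(n, star, binary)
```

Under these assumptions the star tree cost is $n^3$ while the bin tree cost is $27(n-2)^2$, so the star tree is cheaper for small $n$ (1000 against 1728 at $n = 10$) but grows faster (64000 against 38988 at $n = 40$). The model says nothing about the constant factors and overheads seen in the actual experiments.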
For smaller input trees, the algorithm easily deals with the ones that have a small number of internal nodes, whereas a complex internal structure with a large number of internal nodes results in a large overhead. When the input size is increased the overhead is neutralized, and the more dominant characteristic seems to be the maximum degree of the internal nodes. Large degrees result in large matrix multiplications, and the consequence is a faster growth in time usage.

These experiments have been repeated five times each, once again with the exception of some of the larger Python experiments. In particular, the large experiments on bin, ran and wc trees are slow and have not been repeated.

7.2.2 Introducing a library for matrix multiplication

Having gained the first experience with the algorithm, I will now introduce a better method for matrix multiplication and attempt to improve the algorithm in accordance with the the-
