22.01.2015 Views

Refined Buneman Trees


worst case time and space complexity. However, with respect to performance in terms of reconstructing accurate phylogenetic trees, the refined Buneman method might demonstrate unknown strengths.

16.1 Test setup

In the following we test the refined Buneman tree algorithm against two other known algorithms, the Buneman method ([Bun71]) and the Neighbor-Joining method ([SN87]). All experiments were run on test data from the PFAM database of protein sequence families. The data consist of distance matrices in three size ranges: 10–50 (30 matrices), 100–200 (50 matrices) and 500–700 (50 matrices), for a total of 130 test matrices. Larger matrices are available, but they were disregarded due to time constraints. The distance data is given in the common Phylip format.
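As an illustration of the input handling, the following is a minimal sketch of reading a distance matrix in Phylip format. It assumes the square layout (a first line with the number of taxa n, followed by n rows consisting of a name and n distances) and taxon names without embedded whitespace; the function name `parse_phylip_distances` is hypothetical and is not taken from the implementation described in this paper.

```python
from typing import List, Tuple


def parse_phylip_distances(text: str) -> Tuple[List[str], List[List[float]]]:
    """Parse a square Phylip distance matrix into (names, matrix).

    Assumption: names contain no whitespace, so the input can be read as a
    flat token stream. This also tolerates rows wrapped over several lines.
    """
    tokens = text.split()
    if not tokens:
        raise ValueError("empty input")
    n = int(tokens[0])
    # One name plus n distances per taxon, after the leading count.
    if len(tokens) != 1 + n * (n + 1):
        raise ValueError("token count does not match a square n x n matrix")

    names: List[str] = []
    matrix: List[List[float]] = []
    pos = 1
    for _ in range(n):
        names.append(tokens[pos])
        matrix.append([float(x) for x in tokens[pos + 1 : pos + 1 + n]])
        pos += 1 + n
    return names, matrix
```

Reading the file as a token stream rather than line by line is a design choice made here for brevity; a stricter reader would honor the fixed-width name field of the original Phylip specification.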

16.2 Buneman and refined Buneman

The implementation of the refined Buneman tree algorithm can easily be adapted to mark those splits which are in both B(δ) and RB(δ). Since we have the n − 3 least scoring quartets used to calculate the refined Buneman index for a split, it is easy to find the least scoring quartet among them. If that quartet has a positive Buneman score, then so does every other quartet of the split, and we can mark the split as belonging to B(δ).
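The marking step can be sketched as follows. The sketch assumes the standard definitions from the literature, which this section does not restate: the Buneman score of a quartet ab|cd is half of min(δ(a,c) + δ(b,d), δ(a,d) + δ(b,c)) − (δ(a,b) + δ(c,d)), and the refined Buneman index of a split is the average of its n − 3 least quartet scores. The function names are hypothetical, and the code is not the implementation described in this paper.

```python
from typing import List, Sequence, Tuple


def buneman_score(d: Sequence[Sequence[float]], a: int, b: int, c: int, e: int) -> float:
    """Buneman score of the quartet ab|ce under the distance matrix d."""
    cross = min(d[a][c] + d[b][e], d[a][e] + d[b][c])
    return 0.5 * (cross - (d[a][b] + d[c][e]))


def classify_split(least_scores: Sequence[float]) -> Tuple[float, bool]:
    """Return (refined Buneman index, membership in B(delta)) for one split.

    Assumption: least_scores holds the n - 3 smallest quartet scores of the
    split, as already collected by the refined Buneman computation.
    """
    if not least_scores:
        raise ValueError("need the n - 3 least quartet scores (n >= 4)")
    refined_index = sum(least_scores) / len(least_scores)
    # The smallest score overall is among the n - 3 least ones, so a positive
    # minimum means every quartet of the split has a positive score.
    in_buneman = min(least_scores) > 0
    return refined_index, in_buneman


# A four-taxon tree metric: leaf edges of length 1, internal edge of length 2.
d: List[List[float]] = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
index, in_b = classify_split([buneman_score(d, 0, 1, 2, 3)])
```

A split whose least scores are, say, −1 and 3 has a positive refined index but a negative minimum, which is exactly the case of a split in RB(δ) that is not in B(δ).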

This first experiment consists of running the (modified) refined Buneman tree algorithm on examples from the set of PFAM distance matrices, counting for each one the number of splits in B(δ) and in RB(δ). The results of the experiment are given in Figure 16.1, sorted by increasing distance matrix size and plotted as percentages (the size of RB(δ) is 100%).

Figure 16.1 shows that the size of B(δ) fluctuates quite a bit, especially as the size of the distance matrix increases, where more and more datasets yield no Buneman splits at all. The refined Buneman method is clearly much less restrictive than the Buneman method, as expected. Regarding the quality of the splits that are in RB(δ) but not in B(δ), further studies would need to be undertaken: one could either study the specific datasets for which the Buneman method produces few splits, or use simulated data, which would provide a key to which splits are well-supported and which are unsupported.

16.3 Refined Buneman and Neighbor-Joining

To test the refined Buneman tree method against the Neighbor-Joining method, the author has run the implementation of the refined Buneman tree algorithm described in this paper against the Quick-Join algorithm described in [BFM+03]. The Quick-Join software is available from this website:

http://www.birc.dk/Software/QuickJoin/
