CHAPTER 3. EXPERIMENTAL APPROACH

Worst-case trees (wc). In a worst-case tree there is one internal node of degree n/2 and n/2 internal nodes of degree three, see Figure 3.4. The tree takes its name from Christiansen and Randers [5], who introduced it in order to have a tree with O(n) internal nodes and O(n) internal edges connected to an internal node of degree O(n). More interesting to this thesis is the total number of edges, which is |E| = 3n/2. The name should not be taken literally, since we do not yet know how the algorithms in this thesis will perform on such trees.

Figure 3.4: A wc tree

3.2 Tree construction

Since this thesis mainly takes an experimental approach to evaluating algorithms, I find it natural to describe my choice of data format and data construction.

3.2.1 Newick format

The programs read tree data from files written in the Newick tree format³, a commonly used format for representing phylogenetic trees. It is a simple text format based on parentheses and commas. The grammar in Fig. 3.5 provides a formal description for parsing the Newick format.

Tree → Subtree ";" | Branch ";"
Subtree → Leaf | Internal
Leaf → Name
Internal → "(" BranchSet ")" Name
BranchSet → Branch | BranchSet "," Branch
Branch → Subtree Length
Name → empty | string
Length → empty | ":" number

Figure 3.5: A context-free grammar for the Newick format

³ Wikipedia article on the Newick format: http://en.wikipedia.org/wiki/Newick_format (accessed Oct. 3, 2010)
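As a small illustration of the format, the following Python sketch writes a wc tree as a Newick string and checks the edge count |E| = 3n/2. It is not one of the scripts from Appendix B. The function names wc_tree_newick and count_edges and the leaf names t0, t1, ... are hypothetical, and the sketch assumes that n counts the leaves, that n is even and at least 6, and that each degree-three node carries two leaves and one edge to the central node.

```python
import re


def wc_tree_newick(n):
    """Return a Newick string for a wc tree on n leaves (hypothetical helper)."""
    # Assumption of this sketch: n is even and the central node has degree >= 3.
    if n < 6 or n % 2 != 0:
        raise ValueError("this sketch assumes an even n >= 6")
    # Each degree-three internal node: two leaf children plus one edge to the center.
    pairs = ["(t{0},t{1})".format(i, i + 1) for i in range(0, n, 2)]
    # Rooting the tree in the central node gives that node n/2 children.
    return "(" + ",".join(pairs) + ");"


def count_edges(newick):
    """Count edges in a Newick string whose only names are leaf names t<number>."""
    leaves = len(re.findall(r"t\d+", newick))
    internal = newick.count("(")  # one opening parenthesis per internal node
    # Every node except the root has exactly one edge to its parent.
    return leaves + internal - 1


tree = wc_tree_newick(8)
print(tree)               # ((t0,t1),(t2,t3),(t4,t5),(t6,t7));
print(count_edges(tree))  # 12, i.e. 3n/2 for n = 8
```

Rooting the string in the central node is only a choice of presentation; any other internal node could serve as the root of the same tree.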
Formal descriptions are useful; however, the grammar in Fig. 3.5 has some small restrictions, e.g. parentheses inside Name are prohibited. Nevertheless, it more than satisfies the needs of this thesis. Note that a tree either descends from nowhere, or it is rooted in an internal node. Thus, the Newick format is incapable of describing unrooted trees. This is not a problem, since the tree can simply be rooted in an arbitrary internal node which, when parsing, is treated like any other internal node.

The two trees from the example in Fig. 1.1 would be written as follows in Newick tree format. Note that there is only one named Internal node in each tree, namely the root (Panthera):

(((Tiger, Snow Leopard), ((Lion, Leopard), Jaguar)), Clouded Leopard) Panthera;
(((Lion, Leopard), Snow Leopard, Jaguar), (Tiger, Clouded Leopard)) Panthera;

Data construction. Data files have been generated in Newick format using a range of simple Python scripts, covering each class of artificial tree described in Section 3.1. They can be found along with the code and all the data files used, see Appendix B.

3.2.2 Parsing data files and building a tree data structure

One way of parsing data described by a grammar, like the one in Fig. 3.5, is to use a recursive descent parser.
As the name indicates, a recursive descent parser works top-down, repeatedly identifying a construct in the data that corresponds to a production rule in the grammar and then processing that piece of data according to the rule. A number of mutually recursive procedures typically represent the rules defined by the grammar.

For the Python implementation I have used a parser based on the Toy Parser Generator framework⁴ (TPG), written by Thomas Mailund⁵, with a few modifications. For the C++ implementation I wrote my own parser from scratch. I wrote the procedures Parse(), ParseSubtree(), ParseInternalNode(), ParseLeafNode(), ParseBranchSet() and ParseBranch(), corresponding to the first six rules of the grammar of the Newick format. Both parsers are capable of parsing all the forms permitted by the grammar in Sec. 3.2.1, and they meet the needs of this thesis.

⁴ TPG framework: http://christophe.delord.free.fr/tpg/index.html
⁵ Mailund's parser: www.mailund.dk/index.php/2009/01/19/yet-another-newick-parser/

While the algorithms in general deal with unrooted trees, some parts view the tree as rooted, in which case it turns out to be more convenient to have a rooted, top-down representation. Furthermore, the Newick format dictates that some random node should