number of different butterflies, $\mathrm{diff}_B(T,T')$, in the following way:

$$\mathrm{diff}_S(T,T') = B + B' - 2\,\bigl(\mathrm{shared}_B(T,T') + \mathrm{diff}_B(T,T')\bigr)$$

This gives a complete expression for the quartet distance between two trees:

$$\mathrm{qdist}(T,T') = B + B' - 2\,\mathrm{shared}_B(T,T') - \mathrm{diff}_B(T,T') \qquad (7.1)$$

Since $B = \mathrm{shared}_B(T,T)$ and likewise $B' = \mathrm{shared}_B(T',T')$, only two procedures are needed to calculate Eq. (7.1): one for $\mathrm{shared}_B$ and one for $\mathrm{diff}_B$. They are described separately in Sec. 7.1.2 and Sec. 7.1.3.

Like the cubic algorithm, the sub-cubic algorithm makes heavy use of the concept of shared leaf set sizes, introduced in Section 5.2, but makes even less use of tree examination; one can say that it has a more computational nature. Nevertheless, it is easy to illustrate the intuition behind the algorithm. Two new concepts are introduced.

A directed quartet, written $ab \to cd$, first encountered in Brodal et al. [3], is a butterfly quartet, $ab|cd$, with a direction on the path between the two "wings" of the butterfly; see Fig. 7.1. Because each butterfly quartet induces exactly two directed quartets, the number of shared butterfly quartet topologies between two trees is equal to half the number of shared directed quartets. Likewise for different butterflies.
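
As a small illustration of how Eq. (7.1) combines with the halving of directed-quartet counts, the arithmetic can be sketched as follows. The function name `quartet_distance`, its parameters, and the numbers in the usage lines are assumptions made for this example only; they are not taken from the thesis.

```python
def quartet_distance(b: int, b_prime: int,
                     shared_directed: int, diff_directed: int) -> int:
    """Evaluate Eq. (7.1) from butterfly counts and directed-quartet counts.

    b, b_prime      -- number of butterflies in T and T' (B and B')
    shared_directed -- directed quartets with the same topology in both trees
    diff_directed   -- directed quartets that are butterflies in both trees
                       but with different topologies
    """
    # Each butterfly induces exactly two directed quartets,
    # so the directed counts must be even and are halved.
    if shared_directed % 2 or diff_directed % 2:
        raise ValueError("directed-quartet counts must be even")
    shared_b = shared_directed // 2
    diff_b = diff_directed // 2
    # Eq. (7.1): qdist = B + B' - 2 * shared_B - diff_B
    return b + b_prime - 2 * shared_b - diff_b


# Hypothetical counts for two fully resolved trees on five leaves
# (five quartets each, no stars), so qdist equals diff_B.
print(quartet_distance(5, 5, shared_directed=4, diff_directed=6))
```

With no star quartets involved, $\mathrm{diff}_S$ is zero and the result reduces to $\mathrm{diff}_B$, which is a quick sanity check on the formula.
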
We observe that each directed quartet is identified by exactly one directed edge $e$, in such a way that $a$ and $b$ are positioned behind $e$ and $c$ and $d$ are positioned in front of $e$, in two different subtrees of the end node of $e$.

Figure 7.1: The two directed quartets induced from a butterfly quartet.

A claim, written $A \xrightarrow{e} (C, D)$, is the term used to denote such an edge, $e$, "claiming" all directed quartets $ab \to cd$ where $a$ and $b$ are contained in the subtree $A$ behind $e$, while $c$ and $d$ are contained in two different subtrees $C$ and $D$ in front of $e$. It is convenient to add this to our vocabulary now, as the algorithm will eventually deal with edges. Naturally, a claim is associated with subtrees and therefore one edge typically claims multiple butterfly quartets. Also, an edge can be involved in different claims with different subtrees, but as mentioned, every directed quartet is claimed by exactly one edge. See Fig. 7.2 for an illustration. We see how the edge $e$ "divides" the tree into several parts: one subtree behind the edge and several different subtrees in front of the edge.

Figure 7.2: Example of a claim. The directed edge is a unique identifier of the directed quartet $ab \to cd$.

The fundamentals have been established. The algorithm will process all possible pairs of directed edges, $(e,e') \in T \times T'$, and for all the directed quartets claimed by both edges, count the quartet as a shared butterfly if the two claims give rise to the same topology, and otherwise count the quartet as a different butterfly. Making heavy use of preprocessing, one such pair of edges can be handled in constant time; see Sec. 7.1.4. Since $|E| = O(n)$ and there are $O(n^2)$ pairs of edges, the process of counting the butterflies can be done in $O(n^2)$ time. Thus, the preprocessing step is crucial to the running time.

7.1.1 Basic preprocessing

Here I will describe only a fundamental part of the preprocessing that is necessary to understand the intuition behind the algorithm and how to count shared and different butterflies. More preprocessing is needed to make it possible to process a pair of directed edges in constant time, as mentioned. Unfortunately, that part of the preprocessing poses a threat to the sub-cubic complexity of the entire algorithm; it has direct influence on the running time and must be handled with care. For now it is merely confusing, and I will postpone its introduction until the appropriate section.

As we shall see, it comes in handy that the notion of claims and the concept of shared leaf set sizes both deal with subtrees. The first preprocessing step is to calculate the shared leaf set sizes, as explained in Sec.
5.2, which has quadratic time and space consumption. The next step is to calculate, for each pair of internal nodes $v \in T$ and $v' \in T'$, with subtrees $F_1,\dots,F_{d_v}$ and $G_1,\dots,G_{d_{v'}}$, a matrix $I$ where $I[i,j] = |F_i \cap G_j|$. When processing pairs of edges as mentioned above, we will need the matrix $I$ associated with the two nodes that the edges point to.

This is enough information about the preprocessing step to complete the intuitive explanation for counting butterflies in Sec. 7.1.2 and 7.1.3.
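
To make the definition of the matrix $I$ concrete, here is a minimal sketch that builds it for one pair of internal nodes. It is deliberately naive: it collects leaf sets by traversal and intersects them directly, rather than reusing the quadratic shared leaf set size tables of Sec. 5.2. The adjacency-list representation, the names `leaves_behind` and `intersection_matrix`, and the two example trees are assumptions made for illustration.

```python
from typing import Dict, List, Set

# Assumed representation: an unrooted tree as adjacency lists;
# leaves are the nodes with exactly one neighbour.
Tree = Dict[str, List[str]]


def leaves_behind(tree: Tree, v: str, u: str) -> Set[str]:
    """Leaves of the subtree of v that is entered through its neighbour u."""
    leaves: Set[str] = set()
    seen = {v}          # never walk back through v
    stack = [u]
    while stack:
        node = stack.pop()
        seen.add(node)
        if len(tree[node]) == 1:
            leaves.add(node)
        stack.extend(w for w in tree[node] if w not in seen)
    return leaves


def intersection_matrix(t1: Tree, v: str, t2: Tree, v2: str) -> List[List[int]]:
    """I[i][j] = |F_i ∩ G_j| for the subtrees F_i of v and G_j of v2."""
    f_sets = [leaves_behind(t1, v, u) for u in t1[v]]
    g_sets = [leaves_behind(t2, v2, u) for u in t2[v2]]
    return [[len(f & g) for g in g_sets] for f in f_sets]


# Hypothetical example: T has the butterfly ab|cd, T' has ac|bd.
t1 = {"a": ["x"], "b": ["x"], "x": ["a", "b", "y"],
      "y": ["x", "c", "d"], "c": ["y"], "d": ["y"]}
t2 = {"a": ["p"], "c": ["p"], "p": ["a", "c", "q"],
      "q": ["p", "b", "d"], "b": ["q"], "d": ["q"]}

for row in intersection_matrix(t1, "x", t2, "p"):
    print(row)
```

Because the subtrees around a node partition the leaf set, the entries of $I$ always sum to the number of leaves, which gives a cheap consistency check.
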