Disputation Nils Grimsmo - Department of Computer and Information ...
Nils Grimsmo

Bottom Up and Top Down —
Twig Pattern Matching on Indexed Trees

Thesis for the degree of philosophiae doctor

Trondheim, 2010-09-02

Norwegian University of Science and Technology.
Faculty of Information Technology, Mathematics and Electrical Engineering.
Department of Computer and Information Science.
NTNU
Norwegian University of Science and Technology
Thesis for the degree of philosophiae doctor
Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science

© Nils Grimsmo

ISBN 978-82-471-2723-0 (printed version)
ISBN 978-82-471-2724-7 (electronic version)
ISSN 1503-8181

Doctoral theses at NTNU, 2011:96

Printed by NTNU-trykk
Preface

This thesis is submitted to the Norwegian University of Science and Technology (NTNU) in partial fulfillment of the requirements for the degree of philosophiae doctor. The doctoral work has been performed at the Department of Computer and Information Science, NTNU, Trondheim, with Bjørn Olstad as main supervisor, and Øystein Torbjørnsen and Magnus Lie Hetland as co-supervisors.

The candidate was supported by the Research Council of Norway under grant NFR 162349, and by the iAD project, also funded by the Research Council of Norway.
Summary

This PhD thesis is a collection of papers presented with a general introduction to the topic, which is twig pattern matching (TPM) on indexed tree data. TPM is a pattern matching problem where occurrences of a query tree are found in a usually much larger data tree. This has applications in XML search, where the data is tree shaped and the queries specify tree patterns. The included papers present contributions on how to construct and use structure indexes, which can speed up pattern matching, and on how to efficiently join together results for the different parts of the query with so-called twig joins.

• Paper 1 [18] shows how to match root-to-leaf query paths more efficiently in so-called path indexes, by using new opportunistic algorithms on existing data structures.

• Paper 2 [19] proves a tight bound on the worst-case space usage of a data structure used to implement path indexes.

• Paper 3 [24] presents an XML indexing system which combines existing techniques in a novel way, and performs orders of magnitude better than existing commercial and open-source systems.

• Paper 4 [20] reviews and creates a taxonomy for the many advances in the field of TPM on indexed data, and proposes opportunities for further research.

• Paper 5 [21] bridges the gap between worst-case optimality and practical performance in current twig join algorithms.

• Paper 6 [22] improves the construction cost of so-called forward and backward path indexes for tree data from log-linear to linear.
Acknowledgments

The day-to-day supervision of the PhD work during the first years was mostly done by the external supervisor Dr. Øystein Torbjørnsen from Fast Search and Transfer, who has been a good source of ideas and clever technical solutions. Dr. Magnus Lie Hetland from my department has supervised during the last year, and has given substantial help both scientifically and in the writing process of some papers. The visits of my formal supervisor Dr. Bjørn Olstad have been inspirational. The discussions with Dr. Felix Weigel during his internship at FAST resulted in many ideas. I would like to thank fellow PhD student Truls Amundsen Bjørklund for good times, fruitful discussions and honest feedback during our work together.

Thank you Nina, for your patience, beauty and delicious cooking.
Contents

Preface
Summary
Acknowledgments
Contents
1 Introduction
  1.1 Indexing/search in semi-structured data
  1.2 Use-case: XML
    1.2.1 XPath and XQuery
  1.3 Abstract problem: Twig Pattern Matching
    1.3.1 Research scope: TPM on indexed data
  1.4 Research questions
2 Background
  2.1 Twig joins
    2.1.1 Twig join work-flow
    2.1.2 Result enumeration
      2.1.2.1 Single output query node
    2.1.3 Simple intermediate result architecture
    2.1.4 Tree position encoding
    2.1.5 Partial match filtering
    2.1.6 Intermediate result construction
    2.1.7 Merging input streams
    2.1.8 Data locality and updatability
    2.1.9 Twig join conclusion
  2.2 Partitioning data
    2.2.1 Motivation for fragmentation
    2.2.2 Path partitioning
    2.2.3 Backward and forward path partitioning
    2.2.4 Balancing fragmentation
  2.3 Reading data
    2.3.1 Skipping
      2.3.1.1 Skipping child matches
      2.3.1.2 Skipping parent matches
      2.3.1.3 Holistic skipping
    2.3.2 Virtual streams
      2.3.2.1 Virtual matches for non-branching internal query nodes
      2.3.2.2 Tree position encoding allowing ancestor reconstruction
      2.3.2.3 Virtual matches for branching query nodes
  2.4 Related problems and solutions
3 Research Summary
  3.1 Formalities
  3.2 Publications and research process
    3.2.1 Paper 1
    3.2.2 Paper 2
    3.2.3 Paper 3
    3.2.4 Paper 4
    3.2.5 Paper 5
    3.2.6 Paper 6
  3.3 Research methodology
  3.4 Evaluation of contributions
    3.4.1 Research questions revisited
    3.4.2 Opportunities revisited
  3.5 Future work
    3.5.1 Strong structure summaries for independent documents
    3.5.2 A simpler fast optimal twig join
    3.5.3 Simpler and faster evaluation with non-output nodes
    3.5.4 Ultimate data access shoot-out
  3.6 Conclusions
Bibliography
4 Included papers
  Paper 1: Faster Path Indexes for Search in XML Data
  Paper 2: On the Size of Generalised Suffix Trees Extended with String ID Lists
  Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
  Paper 4: Towards Unifying Advances in Twig Join Algorithms
  Paper 5: Fast Optimal Twig Joins
  Paper 6: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees
A Other Papers
  Paper 7: On performance and cache effects in substring indexes
  Paper 8: Inverted Indexes vs. Bitmap Indexes in Decision Support Systems
  Paper 9: Search Your Friends And Not Your Enemies
Chapter 1

Introduction

“Research is formalized curiosity.
It is poking and prying with a purpose.”
– Zora Neale Hurston

The thesis is submitted as a paper collection bound together by a general introduction. This chapter presents the context of the research, which is indexing and querying semi-structured data, and the abstract problem investigated, which is twig pattern matching (TPM). Chapter 2 gives a high-level introduction to techniques used in state-of-the-art TPM on indexed data. Chapter 3 lists the included published papers with short qualitative assessments, evaluates the total contribution of this thesis, and proposes future work.
1.1 Indexing/search in semi-structured data

So-called semi-structured data gives both flexibility and expressive power, and is commonly used for storing and exchanging data in heterogeneous information systems. In the semi-structured data model, documents have a structure that specifies how the different parts of the content relate to each other. This means information is contained both in the structure and the content. Documents are usually structurally self-contained, meaning that the structure can be understood from the document alone, without additional meta-data.

The focus of this thesis is algorithms and data structures for indexing and querying semi-structured data, where queries specify both structure and content. The use of semi-structured data can functionally cover both traditional structure-oriented and content-oriented data management, and the thesis therefore touches the fields of both databases and information retrieval.
1.2 Use-case: XML

XML is a simple yet flexible markup language [46], and has become the de facto standard for storing semi-structured data. An example XML document is shown in Figure 1.1. Standard XML has a tree model, with mainly three types of nodes in a document tree: element, attribute and text. All internal nodes in the document tree are of type element, and are given by start and end tags, such as the node with name book in the example. Text and attribute nodes are always leaf nodes. Text nodes have simple string values, while attributes have both a name and a value, such as the ISBN node in the example.
<library>
  <book ISBN="...">
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
  ...
</library>

Figure 1.1: Example XML document
1.2.1 XPath and XQuery

XPath [45] and XQuery [47] have become standard languages for querying XML. Comparing the two, XPath is a simpler declarative language, while XQuery is a more complex language that uses XPath expressions as building blocks. The XPath expression in Figure 1.2a asks for the title of all books coauthored by Kant and Gödel.

In XPath, single and double forward slashes specify child and descendant relationships between nodes, respectively. Square brackets contain predicates, and the rightmost node not part of a predicate is the output node, also called the return node. XPath queries are trees, and the tree representation of the example is shown in Figure 1.2b. In XPath there are 11 so-called axes in addition to descendant and child: parent, ancestor, following-sibling, preceding-sibling, following, preceding, attribute, namespace, self, descendant-or-self and ancestor-or-self [45]. There can also be more complex value predicates than simple tests on string equality, using functions such as count(), contains(), sum(), etc.
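The child/descendant distinction and simple value predicates can be tried out with Python's standard-library ElementTree, which implements a small XPath subset. This is only an illustrative sketch; the document is the one from Figure 1.1, with a placeholder ISBN value.

```python
import xml.etree.ElementTree as ET

# The example document from Figure 1.1 (the ISBN value is a placeholder).
doc = ET.fromstring("""
<library>
  <book ISBN="...">
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
</library>""")

# ElementTree's XPath subset: .// descendant steps and [tag='text'] predicates.
titles = [t.text for t in
          doc.findall(".//book[author='Kant'][author='Gödel']/title")]
print(titles)  # ['Kritik der Unvollständigkeit']
```

Full XPath, with all axes and value functions, needs a complete engine; ElementTree's subset is enough for this structural example.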
XQuery is a powerful language where small programs are built with path expressions as building blocks, in so-called FLWOR expressions (for, let, where, order, return). Figure 1.3 shows an XQuery program similar to the XPath expression in Figure 1.2a, which in addition orders books by title and retrieves both title and ISBN.
//book[author/text()="Kant"][author/text()="Gödel"]/title

(a) Expression.

(b) Tree representation.

Figure 1.2: XPath example finding books coauthored by Kant and Gödel.
for $b in doc("lib.xml")/library/book
let $t := $b/title
where $b/author = "Kant" and $b/author = "Gödel"
order by $t
return ($t, $b/@ISBN)

Figure 1.3: Example XQuery.
1.3 Abstract problem: Twig Pattern Matching

In XPath a large number of functions can be used in value predicates, and thirteen different axes dictate the relationships between nodes. The many details in the language make it hard to reason about the complexity of evaluation algorithms and hard to implement prototypes. TPM is a more abstract tree matching problem that covers a subset of XPath. It is of academic interest because a TPM solution covers the majority of the workload in most XML search systems [15].

In TPM both query and data are node-labeled trees, as shown in the example in Figure 1.4. Node predicates are on label equality, and all nodes have the same type. There are two types of query edges that dictate the relationship between data nodes in a match, ancestor–descendant (A–D) and parent–child (P–C), denoted in figures by double and single edges, respectively. The result of a TPM query is the set of mappings of query nodes to data nodes that both respect the node labels and satisfy the A–D and P–C relationships specified by the query edges.

In settings with XML document collections, the data is a forest of trees, but this can easily be transformed into a single tree by adding a virtual super-root node.
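The TPM semantics just defined can be pinned down with a tiny brute-force matcher. This is an illustrative sketch with my own helper names, not code from the thesis; indexed evaluation, the topic of this thesis, exists precisely to avoid this kind of exhaustive search.

```python
def node(name, label, children=()):
    """A tree node; the same dict shape serves for query and data trees."""
    return {"name": name, "label": label, "children": list(children)}

def descendants(n):
    for c in n["children"]:
        yield c
        yield from descendants(c)

def twig_matches(q_root, axes, d_root):
    """Brute-force TPM: yield every mapping {query name: data name} that
    respects labels and the per-edge axes ("PC" or "AD").  Exponential in
    general -- for illustration only."""
    def match_at(qn, dn):
        if qn["label"] != dn["label"]:
            return
        mappings = [{qn["name"]: dn["name"]}]
        for qc in qn["children"]:
            cands = (dn["children"] if axes[qc["name"]] == "PC"
                     else list(descendants(dn)))
            child_maps = [m for dc in cands for m in match_at(qc, dc)]
            # Subtrees under different query children are independent,
            # so child mappings can be combined freely.
            mappings = [{**m, **cm} for m in mappings for cm in child_maps]
            if not mappings:
                return
        yield from mappings
    for dn in [d_root, *descendants(d_root)]:  # query root may match anywhere
        yield from match_at(q_root, dn)

# Query a//b (one A-D edge) against a small data tree:
data = node("c1", "c", [node("a1", "a", [node("b1", "b"),
                                         node("c2", "c", [node("b2", "b")])])])
query = node("qa", "a", [node("qb", "b")])
matches = list(twig_matches(query, {"qb": "AD"}, data))
# two matches: qb -> b1 (a child) and qb -> b2 (a deeper descendant)
```

Note that the mapping need not be injective: two query nodes with the same label may map to the same data node.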
Figure 1.4: TPM example with a query tree on the left and a data tree on the right. One of the matches for the query in the data is shown with arrows from query nodes to data nodes. In the following, query nodes are drawn with circles and data nodes with rounded rectangles. Node labels are written with typewriter font, and the superscripts in query nodes and subscripts in data nodes are used to identify the nodes (together with the labels).
1.3.1 Research scope: TPM on indexed data

The scope of this thesis is twig pattern matching on indexed data, and we assume that the processes of preparing the index and evaluating queries are separate. For this strategy to be viable, the cost of index construction must be justified by the performance gain in query evaluation compared to evaluation without an index.
We use the following abstract view of an index: it is a mechanism which provides a function from some feature of a node to the nodes in the data tree that have this feature. The simplest non-trivial such feature is the node label, as used in the index shown in Figure 1.5a. In a typical implementation, entries in a so-called dictionary on label point to so-called occurrence lists containing the nodes with matching label.

When indexing on label, a query can be evaluated by reading the label-matching data nodes for each query node, and joining these into full query matches. The number of full query matches may be small compared to the total number of query node matches read, but if the labels on the query nodes are selective, far fewer data nodes will be processed than when evaluating the query on the data tree without an index.
Indexing on node label can be extended to indexing on path labels, the string of labels from the root to a node, as illustrated in Figure 1.5b. This can again be extended to classify nodes not only on the labels of the ancestor nodes on the path above, but also on the labels of the children in the subtree below. These indexing strategies, called structure indexing, will be discussed in the next chapter, together with so-called twig join algorithms.
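Both dictionaries of Figure 1.5 can be built in a single preorder pass. The following is only a sketch under my own naming; real systems store compressed occurrence lists with position encodings rather than plain name lists.

```python
from collections import defaultdict

def node(name, label, children=()):
    return {"name": name, "label": label, "children": list(children)}

def build_indexes(root):
    """One preorder pass building both a label index and a path-label
    index: dictionary entries point to occurrence lists of node names."""
    by_label, by_path = defaultdict(list), defaultdict(list)
    def visit(n, path):
        path = path + (n["label"],)           # root-to-node label string
        by_label[n["label"]].append(n["name"])
        by_path[path].append(n["name"])
        for c in n["children"]:
            visit(c, path)
    visit(root, ())
    return by_label, by_path

data = node("c1", "c", [node("a1", "a", [node("a2", "a", [node("b1", "b")])]),
                        node("d1", "d")])
by_label, by_path = build_indexes(data)
# by_label["a"] lists a1 and a2; by_path[("c", "a", "a")] lists only a2
```

The path index partitions each label's occurrence list by context, which is what lets a path query read fewer nodes than the plain label index.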
(a) Indexing on label:

a → a1, a2, a3, a4
b → b1, b2, b3, b4, b5, b6
c → c1, c2, c3, c4, c5, c6
d → d1, d2
e → e1
f → f1

(b) Indexing on path:

c → c1
c a → a1
c a a → a2, a4
c a a a → a3
c a a b → b2, b3, b5
c d → d2

Figure 1.5: Indexing the data tree from Figure 1.4.
1.4 Research questions

The following are the main research questions I have investigated during the work with this thesis:

• RQ1: How can matches for tree queries be joined more efficiently?

• RQ2: How can pattern matching in the dictionary be done more efficiently?

• RQ3: How can structure indexes be constructed faster and using less space?

These questions will be revisited in Section 3.4.1, where I will evaluate to what extent they have been answered by my research. Note that more efficient query evaluation can mean either that all or most queries are evaluated using less time, or that queries from some important group are evaluated using less time. Preferably, faster evaluation for one group of queries should not cause slower evaluation for other groups.
Chapter 2

Background

“Research is what I’m doing
when I don’t know what I’m doing.”
– Werner von Braun

This chapter presents some underlying concepts for state-of-the-art approaches to TPM on indexed data, which will hopefully ease the understanding of the contributions in the research papers included in this thesis. A high-level conceptual overview is given instead of an in-depth description of details in state-of-the-art solutions, because the latter is better covered by the included papers, where the specific techniques are discussed.

The following discussion divides the problem of TPM on indexed data into three somewhat orthogonal issues: how to construct full query matches from individual query node matches in so-called twig joins, how to partition the underlying data nodes such that as few as possible are read to evaluate a query, and how to efficiently read streams of data nodes during a join.
Notation. The following notation is used in the discussion: A graph G has node set V_G and edge set E_G ⊆ V_G × V_G. All graphs are directed. A graph is a tree if all nodes have one incoming edge except the root, which has zero incoming edges. Nodes with zero outgoing edges are called leaves. A graph is called a forest if it consists of many unconnected trees, i.e., if all nodes have zero or one incoming edges. If a relation R relates x to y, this may be denoted both xRy, ⟨x, y⟩ ∈ R and x ↦ y ∈ R. We primarily use angle brackets for graph edges, as in ⟨u, v⟩ ∈ E_G, and the “maps to” arrow for mappings of query nodes to data nodes, as in q ↦ d ∈ M. The transitive closure of a relation R is denoted by R*. In the problems discussed there will mostly be a query tree Q and a data tree D, where each node v ∈ V_Q ∪ V_D has a Label(v) ∈ A. Assume |A| ∈ O(|D|) for simplicity. Each query edge ⟨u, v⟩ ∈ E_Q has an EdgeType(u, v) ∈ {“A–D”, “P–C”}, specifying an ancestor–descendant or a parent–child relationship.
Remember from Section 1.3 that in TPM we have a single node type, and only differentiate nodes by label, while in XML there are different node types. We can generalize TPM to cover this by using different label codings for different node types, such as for example starting element node labels with “<”.
2.1 Twig joins
to enumerate and output the set of full query matches O. The first phase has two components, where the first merges the streams S_v, materializing I_v for each v ∈ V_Q, into a single stream S, materializing the total input set I.
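The stream merger of Component 1 can be sketched with the standard library's k-way heap merge, assuming each stream is sorted by a preorder position; the stream contents and names below are mine, for illustration.

```python
import heapq

# Each stream yields (preorder position, query node, data node) triples,
# sorted by position; merging preserves one global document order.
def merge_streams(*streams):
    yield from heapq.merge(*streams)

s_a = [(1, "q_a", "a1"), (5, "q_a", "a2")]   # stream for query node q_a
s_b = [(2, "q_b", "b1"), (7, "q_b", "b2")]   # stream for query node q_b
merged = [d for _, _, d in merge_streams(s_a, s_b)]
# merged == ["a1", "b1", "a2", "b2"]
```

heapq.merge keeps only one head element per stream in memory, which matches the no-lookahead property needed by the next component.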
Phase 1, Component 1: Input stream merger → Phase 1, Component 2: Intermediate result construction → Phase 2: Result enumeration

Figure 2.1: Work-flow of twig join algorithms.
Figure 2.2 illustrates why the two phases are temporally separate: in the worst case, all the data must be read before it is known whether or not the nodes in the input are useful. On the other hand, use of the two components in Phase 1 can be temporally overlapping, because Component 2 reads data and query node pairs from Component 1 in some order that can be implemented without lookahead in the individual streams. Note that for some combinations of query and data, the construction of intermediate results is not necessary for linear evaluation (as we exploit in Paper 3, included in Chapter 4).
Figure 2.2: Example showing why Phase 1 and Phase 2 are temporally separate. When the input streams are sorted in tree preorder, it cannot be known whether b_1, ..., b_n are part of a query match before c_{n+1} is seen, or whether c_1, ..., c_n are part of a query match before b_{n+1} is seen. Note that there is no stream ordering such that all twig queries can be evaluated without storing intermediate results [10].
To understand the design choices in the approach depicted in Figure 2.1, it is easiest to start with the last step, result enumeration, and work backwards. Section 2.1.2 sketches a generic algorithm for enumerating results, and Section 2.1.3 sketches the layout of a generic data structure that enables evaluating that algorithm in linear time. With this as a starting point, I go through various techniques and strategies for implementing the generic approach. Section 2.1.4 briefly presents a common tree position encoding that makes it possible to decide A–D and P–C relationships between data nodes in the various streams in constant time. Section 2.1.5 describes two common data node filtering strategies, and Section 2.1.6 shows how one of these can be used to realize the conceptual data structure from Section 2.1.3 in linear time. Section 2.1.7 describes the input stream merge component, where filtering strategies can be used for practical speedups.
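The tree position encoding referred to for Section 2.1.4 is commonly a (start, end, level) region encoding, as used throughout the twig join literature. A sketch of the constant-time axis tests follows; the helper names are mine.

```python
from collections import namedtuple

Pos = namedtuple("Pos", "start end level")

def is_ancestor(u, v):
    """A-D test in O(1): u's interval strictly contains v's."""
    return u.start < v.start and v.end < u.end

def is_parent(u, v):
    """P-C test in O(1): containment plus adjacent levels."""
    return is_ancestor(u, v) and u.level + 1 == v.level

def encode(root):
    """Assign a Pos to every node of a {'children': [...]} tree
    in one depth-first pass."""
    pos, counter = {}, 0
    def visit(n, level):
        nonlocal counter
        counter += 1
        start = counter
        for c in n["children"]:
            visit(c, level + 1)
        counter += 1
        pos[id(n)] = Pos(start, counter, level)
    visit(root, 0)
    return pos

# a(b(c), d): b is c's parent; a is c's ancestor but not its parent.
c = {"children": []}; d = {"children": []}
b = {"children": [c]}; a = {"children": [b, d]}
pos = encode(a)
```

Because the tests only compare integers, a stream element can carry its Pos and be related to any other element without touching the tree itself.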
2.1.2 Result enumeration

Algorithm 1 gives a high-level description of how to output all unique query matches that can be constructed from the input. The approach is a generalization of what is used in state-of-the-art twig joins [7, 27, 39, 33]. The algorithm recursively constructs full query matches from partial matches that are known to be part of full query matches, denoted here as partial full matches. Formally, a partial full match is an M′ such that M′ ⊆ M for some full query match M ∈ ℳ. The set of all partial full matches is ℳ′ = {M′ | ∃M ∈ ℳ : M′ ⊆ M}.
Algorithm 1 Result enumeration

Denote the set of partial full matches by ℳ′.
Start with M′ = {}, an empty partial full twig match.
Assume any fixed ordering of the nodes in Q, and let v ∈ Q be the first node in this ordering.
For all v′ such that {v ↦ v′} ∈ ℳ′:
    Call Recurse(v ↦ v′).

The function Recurse(u ↦ u′):
    Insert u ↦ u′ into M′.
    If |M′| = |Q|:
        Output M′.
    Otherwise:
        Let v be the node following u in Q.
        For all v ↦ v′ such that M′ ∪ {v ↦ v′} ∈ ℳ′:
            Recurse(v ↦ v′).
    Remove u ↦ u′ from M′.
Example 1. We evaluate the query and data in Figure 1.4 using Algorithm 1, and order query nodes in tree preorder. A candidate match for query node a¹ that is part of a full match is data node a₁, and hence one of the top-level calls to Recurse will be with the parameter u ↦ u′ set to a¹ ↦ a₁. After this pair has been inserted into M′, we consider the query node b¹, which follows a¹ in tree preorder. Since M′ = {a¹ ↦ a₁}, and M′ ∪ {b¹ ↦ b₁} is a partial full match, b¹ ↦ b₁ is one of the pairs we recurse with. In that recursive call we have M′ = {a¹ ↦ a₁, b¹ ↦ b₁}, and consider matches for the final query node c¹. As {a¹ ↦ a₁, b¹ ↦ b₁} ∪ {c¹ ↦ c₁} is a partial full match, we again recurse with c¹ ↦ c₁, and output the new M′, since it is a complete full match.
Assume that the set of partial full matches ℳ′ does not have to be materialized, and that given a partial full match M′ ∈ ℳ′, where all nodes u preceding v have a mapping u ↦ u′ ∈ M′, all v ↦ v′ such that M′ ∪ {v ↦ v′} ∈ ℳ′ can be traversed in time linear in their number. Under these assumptions the algorithm can be evaluated in O(|O| · |Q|) time, linear in the total number of data nodes in the output. The intuition is that each recursive call constructs in constant time a partial full match not seen before, and that each unique partial full match yields at least one unique full query match.
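Algorithm 1 transcribes almost line for line into Python. In this sketch the partial-full-match test is supplied as an oracle built from a known set of full matches, purely for illustration; real twig joins answer the same test from the intermediate result structure instead.

```python
def enumerate_results(order, streams, is_partial_full):
    """Algorithm 1: order is the fixed query-node ordering, streams[q]
    the candidate data nodes for q, and is_partial_full the membership
    test for the set of partial full matches."""
    out, m = [], {}
    def recurse(i):
        q = order[i]
        for d in streams[q]:
            m[q] = d                          # Insert u -> u' into M'
            if is_partial_full(m):
                if i + 1 == len(order):
                    out.append(dict(m))       # |M'| = |Q|: output
                else:
                    recurse(i + 1)
            del m[q]                          # Remove u -> u' from M'
    recurse(0)
    return out

# Oracle built from two known full matches (illustration only).
full = [{"a": "a1", "b": "b1", "c": "c1"},
        {"a": "a1", "b": "b2", "c": "c5"}]
is_pf = lambda m: any(all(M[q] == d for q, d in m.items()) for M in full)
streams = {"a": ["a1"], "b": ["b1", "b2"], "c": ["c1", "c5"]}
results = enumerate_results(["a", "b", "c"], streams, is_pf)
# results == full: only extensions inside some full match are explored
```

The pruning is visible in the trace: the combination b1 with c5 is rejected by the oracle before any deeper recursion, which is exactly what gives the output-linear bound.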
2.1.2.1 Single output query node

In TPM the answers in the result set are all legal ways of matching the query nodes to the data nodes, but in many information retrieval settings other semantics may be more useful. In the XPath language [45] queries have a single output node, and the result set contains all matches for this query node that are part of some full query match. In the XQuery language [47], which is used for more complex information retrieval and processing, there can be any number of output and non-output nodes in the query.

Only minor changes are needed in Algorithm 1 for this generalized case with both output and non-output query nodes. A simple solution is to put the output query nodes first in the fixed ordering, and stop the recursion before non-output nodes are considered. Note that practical data structures that enable linear enumeration for any combination of output and non-output nodes [7] are not as simple as the data structures described in the following sections.
2.1.3 Simple intermediate result architecture

Figure 2.2 illustrated why it is not possible to output query matches directly by just inspecting the heads of the streams for each query node. In the example all the nodes labeled c must be read before it can be known whether or not any of the nodes labeled b are useful, and vice versa.

The purpose of storing intermediate results is to organize the data nodes in such a way that an implementation of the approach in Algorithm 1 can be evaluated efficiently. If the query nodes are ordered in tree preorder, it is natural to maintain, for each u ↦ u′ that is part of a full query match and each child v of u, the list of pairs v ↦ v′ used together with u ↦ u′ in some full query match. Figure 2.3 illustrates this strategy. In addition to the lists of pointers to useful child query node matches for each pair, there must be a list of pointers to the data nodes that match the query root in full query matches.
Figure 2.3: Generic intermediate results for the data tree in Figure 1.4.
CHAPTER 2. BACKGROUND
This data structure takes O(|I| + |O| · |Q|) space, linear in the size of the input and output, because the lists of data nodes take O(|I|) space, and each root pointer or child match pointer is used at least once in Algorithm 1, which has time complexity O(|O| · |Q|).

The following intuition shows how this data structure can be used to efficiently implement Algorithm 1 when query nodes are ordered in tree preorder: (i) The pairs v ↦ v′ in the initial calls in the outer for-loop are trivially found by traversing the list of pointers to full match roots. (ii) In a recursive call, after u ↦ u′ has been added to M′, the current M′ is a partial full match by assumption. Let v be the node following u in preorder, and let p be the parent of v (possibly p = u). All query nodes preceding v have a mapping in M′; assume M′(p) = p′. Let Q_p and Q_v be the subgraphs resulting from removing the edge 〈p, v〉 from Q. These subqueries can be matched independently when the mappings of both p and v are fixed in a way such that EdgeType(p, v) is satisfied. If v ↦ v′ is used in some full query match together with p ↦ p′, we know that 〈p′, v′〉 satisfies EdgeType(p, v). Then, if M′ is a partial full match, M′ ∪ {v ↦ v′} must also be a partial full match.
Example 2. This example illustrates how to implement the data access for Example 1 using the data structure in Figure 2.3. The first match for the query root a1 that is part of a full match is the data node a1, and hence the first non-empty partial full match in Algorithm 1 is M′ = {a1 ↦ a1}. When considering the next query node in preorder, b1, we see from the pointers in the data structure that b2 is the first data node usable together with a1. Hence the next partial full match is M′ = {a1 ↦ a1, b1 ↦ b2}. Then, when considering the next query node c1, we see that the data node c5 is the only data node usable with a1, the current match for the parent of query node c1. We insert c1 ↦ c5 to get the full match M′ = {a1 ↦ a1, b1 ↦ b2, c1 ↦ c5}.
2.1.4 Tree position encoding

To construct the intermediate results efficiently, it must be decidable from position information stored with the data nodes whether or not they satisfy A–D and P–C relationships. A common solution is the interval-based BEL encoding [56], where each node is given integer numbers begin, end and level, as shown in Figure 2.4.
Figure 2.4: The BEL encoding for a tree, with begin, end and level numbers.
This encoding is similar to preorder and postorder traversal numbers, and can be computed in a depth-first traversal of the tree. The reason the encoding is often preferred is probably that the begin and end numbers correspond to the document positions of opening and closing tags in XML.
With the BEL encoding, a node a is an ancestor of a node b iff a.begin < b.begin and b.begin < a.end, and it is a parent if also a.level + 1 = b.level. Sorting on begin or end numbers respectively gives the same sorting orders as preorder and postorder traversal numbers.
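As a concrete sketch, the following Python code (all names are ours, for illustration only) computes the BEL numbers in a single depth-first traversal, using the numbering convention visible in Figure 2.4 where a leaf gets end = begin, and implements the two relationship tests just described.

```python
def bel_encode(children, root):
    """Compute (begin, end, level) for each node by depth-first traversal.

    children: adjacency dict mapping a node to its list of children.
    Leaves get end == begin, matching the numbering in Figure 2.4.
    """
    enc = {}
    counter = [1]
    def visit(node, level):
        begin = counter[0]
        counter[0] += 1
        kids = children.get(node, [])
        for c in kids:
            visit(c, level + 1)
        end = counter[0] if kids else begin
        if kids:
            counter[0] += 1            # the closing tag consumes a position
        enc[node] = (begin, end, level)
    visit(root, 1)
    return enc

def is_ancestor(enc, a, b):
    # a is an ancestor of b iff a.begin < b.begin < a.end
    return enc[a][0] < enc[b][0] < enc[a][1]

def is_parent(enc, a, b):
    # parent additionally requires a.level + 1 == b.level
    return is_ancestor(enc, a, b) and enc[a][2] + 1 == enc[b][2]
```

For the seven-node tree of Figure 2.4 this reproduces exactly the numbers shown there, e.g. (1, 10, 1) for the root and (3, 3, 3) for the first leaf.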
There exists a large number of tree position encodings with different properties [50]. Some allow deciding more types of node relationships, and some allow reconstructing related nodes. They differ in the computational cost of evaluating relationships, their space usage, and how well they handle updates to the data tree.
2.1.5 Partial match filtering

When constructing intermediate results it is often possible to filter out some query and data node pairs that will never be part of a full query match. In current twig join algorithms filtering is used both for practical speedup [5, 27, 33] and as a necessity for worst-case efficient result enumeration [7].

A filtering strategy does not have to be perfect, but it must never remove pairs that are part of full query matches. In other words, it may have false positives, but not false negatives. Most filtering strategies are based on the observation that if there is some subquery (a subgraph of the query) such that the pair v ↦ v′ is not part of any match for the subquery, then v ↦ v′ is not part of any match for the entire query, and can safely be thrown away [21].
The two most common filtering strategies are illustrated in Figure 2.5. The first is based on checking if query prefix paths are matched [5, 27, 33], and the second on checking if query subtrees are matched [7, 39, 33]. The prefix path of a query node is the subquery containing the nodes on the path from the root down to the node.
Figure 2.5: Matching query parts. (a) Query. (b) Matching prefix paths. (c) Matching subtrees.
We call a pair v ↦ v′ that is part of a prefix path match for v a prefix path matcher. Filtering query and data node pairs on whether or not they are prefix path matchers is easy to implement with an inductive strategy: assuming that v ∈ Q has parent u, the pair v ↦ v′ is a prefix path matcher for v if and only if there exists a pair u ↦ u′ that is a prefix path matcher for u such that 〈u′, v′〉 satisfies the A–D or P–C relationship specified by EdgeType(u, v) [5]. Prefix path filtering is easiest to implement when data nodes are seen in tree preorder, where ancestors are seen before descendants.
Example 3. Figure 2.5b illustrates prefix path match checking. The pair a1 ↦ a1 is trivially a prefix path matcher, and b1 ↦ b3 must then be a prefix path matcher because EdgeType(a1, b1) = “A–D” and 〈a1, b3〉 ∈ E_D*. This again implies that f1 ↦ f1 must be a prefix path matcher because EdgeType(b1, f1) = “P–C” and 〈b3, f1〉 ∈ E_D.
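The inductive strategy can be sketched as follows in Python. The helper satisfies is a hypothetical predicate testing the A–D or P–C relationship between two data node positions (e.g. with the BEL conditions from Section 2.1.4); the scan over all previously found matchers of the parent is kept naive for clarity, whereas practical algorithms maintain stacks of current ancestors instead.

```python
def prefix_path_matchers(pairs_in_preorder, parent, edge_type, satisfies):
    """Keep only pairs that are prefix path matchers.

    pairs_in_preorder: (query_node, data_position) pairs in data preorder,
    so ancestors are always seen before descendants.
    parent[v]: the parent query node of v, or None for the query root.
    satisfies(edge, u_pos, v_pos): hypothetical A-D / P-C relationship test.
    """
    matchers = {}                  # query node -> list of matcher positions
    for v, v_pos in pairs_in_preorder:
        u = parent.get(v)
        if u is None:              # the query root trivially matches
            matchers.setdefault(v, []).append(v_pos)
        elif any(satisfies(edge_type[(u, v)], u_pos, v_pos)
                 for u_pos in matchers.get(u, [])):
            matchers.setdefault(v, []).append(v_pos)
    return matchers
```

Because the input is in data preorder, all potential matchers for the parent query node have already been recorded when a pair is inspected, which is exactly the induction used in the text.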
Filtering pairs on whether or not they are subtree matchers can be implemented with a similar strategy: the pair v ↦ v′ is a subtree matcher if and only if, for each child w of v, there exists a subtree matcher w ↦ w′ such that 〈v′, w′〉 satisfies the A–D or P–C relationship specified by EdgeType(v, w) [7]. Subtree match filtering is easiest to implement when data nodes are seen in tree postorder.
Example 4. Figure 2.5c illustrates subtree match checking. The pairs f1 ↦ f1, b2 ↦ b4 and c1 ↦ c5 are trivially subtree matchers because the query nodes are leaves. The pair b1 ↦ b3 is a subtree matcher because f1 ↦ f1 is a subtree matcher and 〈b3, f1〉 ∈ E_D satisfies EdgeType(b1, f1) = “P–C”, and because b2 ↦ b4 is a subtree matcher and 〈b3, b4〉 ∈ E_D* satisfies EdgeType(b1, b2) = “A–D”. The pair a1 ↦ a1 is a subtree matcher because b1 ↦ b3 is a subtree matcher and 〈a1, b3〉 ∈ E_D* satisfies EdgeType(a1, b1) = “A–D”, and because c1 ↦ c5 is a subtree matcher and 〈a1, c5〉 ∈ E_D satisfies EdgeType(a1, c1) = “P–C”.
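Mirroring the prefix path case, inductive subtree match filtering can be sketched by processing pairs in data postorder, so that matchers for all children of a query node are already known when a potential parent is inspected. Again, satisfies is a hypothetical relationship predicate and the scan is kept naive for clarity.

```python
def subtree_matchers(pairs_in_postorder, children, edge_type, satisfies):
    """Keep only pairs that are subtree matchers.

    pairs_in_postorder: (query_node, data_position) pairs in data postorder.
    children[v]: list of child query nodes of v (leaves pass trivially).
    satisfies(edge, v_pos, w_pos): hypothetical A-D / P-C relationship test.
    """
    matchers = {}                  # query node -> list of matcher positions
    for v, v_pos in pairs_in_postorder:
        ok = all(any(satisfies(edge_type[(v, w)], v_pos, w_pos)
                     for w_pos in matchers.get(w, []))
                 for w in children.get(v, []))
        if ok:
            matchers.setdefault(v, []).append(v_pos)
    return matchers
```

For query leaves the inner check is vacuously true, matching the "trivially subtree matchers" case in Example 4.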
2.1.6 Intermediate result construction

The filtering on matched subtrees described in the previous section is strongly related to a strategy that can be used to efficiently build a data structure realizing the conceptual structure depicted in Figure 2.3. What is described in the following is a slight simplification of what is used in the Twig²Stack [7] algorithm, which was the first twig join algorithm with cost linear in the size of the input data and the output result set.

The reason preorder processing of data nodes and filtering on matched prefix paths is not a suitable starting point for a worst-case efficient algorithm is that even though paths in the data do match paths in the query, it is hard to figure out on the fly, during preorder processing, whether or not other paths in the query can use the same branching nodes. With postorder processing, on the other hand, matches for the query can be constructed bottom up by combining subtree matches into bigger subtree matches. The storage order of data nodes in the index does not have to be changed for postorder processing, as a preorder stream of match pairs can be translated to a postorder stream with a stack: when a pair v ↦ v′ is read in preorder, all pairs u ↦ u′ on the stack such that u′ is not an ancestor of v′ are popped off and processed one by one, before v ↦ v′ is pushed onto the stack.
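The stack-based translation can be sketched as a small Python generator, assuming each pair carries the data node's (begin, end) interval so that ancestry is testable with the BEL condition from Section 2.1.4.

```python
def to_postorder(pairs_in_preorder):
    """Translate a preorder stream of pairs into postorder with a stack.

    Each pair is (query_node, (begin, end)); a node on the stack is an
    ancestor of the new node iff its interval contains the new begin.
    """
    stack = []
    for pair in pairs_in_preorder:
        begin = pair[1][0]
        # pop every pair whose data node is not an ancestor of the new one
        while stack and not (stack[-1][1][0] < begin < stack[-1][1][1]):
            yield stack.pop()
        stack.append(pair)
    while stack:                   # flush the remaining open ancestors
        yield stack.pop()
```

The stack holds exactly the open ancestors of the current node, so each pair is pushed and popped once and the translation is linear in the stream length.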
When following the strategy from Sections 2.1.2 and 2.1.3, the key to efficient enumeration of results is the ability to efficiently find usable subtree matches. Given a candidate v ↦ v′, we need to find, for all children w of v, the list of matchers w ↦ w′ such that 〈v′, w′〉 satisfies EdgeType(v, w). Subtree matches for the query root are trivially full query matches.
The overall strategy for the proposed data structure is to maintain for each query node v a list of disjoint trees T_v consisting of node matches from the stream S_v, as shown in Figure 2.6. Some additional dummy nodes are used to bind the trees together. For each data node in the trees for a query node, there is a list of pointers to usable child query node matches. P–C matches are pointed to directly, while A–D matches are found in the entire subtrees pointed to.
Figure 2.6: Postorder construction of intermediate results for the data and query in Figure 1.4.
Algorithm 2 shows how this data structure can be constructed, specifying the processing of a single pair v ↦ v′ in postorder. For each query node v, there is a list T_v of disjoint trees consisting of subtree matchers v ↦ v′ where v′ ∈ S_v. When processing a pair v ↦ v′, the trees whose root data nodes are descendants of v′ are joined into single trees, both in the lists T_w for the children w of v, and in the list T_v for v itself. For P–C edges, pointers from v ↦ v′ to w ↦ w′ denote single direct child matches, while for A–D edges, pointers denote that entire subtrees contain matches. A pair v ↦ v′ is only added if there is at least one pointer for each child w of v, and this effectively implements subtree match filtering as described in Section 2.1.5.
Example 5. Figure 2.7 shows the step processing a1 ↦ a1 when constructing intermediate results for the data and query from Figure 1.4 with Algorithm 2. The trees at the end of T_b1, whose roots are b2 and a dummy node, are joined into a single tree. So are the trees at the end of T_c1, whose roots are c3, c4 and c5. Pointers are added from a1 ↦ a1 to the tree of descendants in T_b1, and to the child match c1 ↦ c5 in T_c1. Since a1 ↦ a1 has pointers to matches both for b1 and c1, it is a subtree match, and is added to T_a1.
When evaluating the input I with Algorithm 2, the total number of calls to the procedure Process() would be ∑_{v∈V_Q} |S_v| = |I|, and the total number of rounds in the for-loop would be ∑_{v∈V_Q} |I_v| · b_v ∈ O(|I| · b_Q), where b_v is the number of children of v and b_Q is the maximal number of children for any node in Q. Apart from constant time
Algorithm 2 Postorder intermediate result construction
Function Process(v ↦ v′):
    For each child w of v:
        Let T′_w be the trees at the end of T_w whose root nodes are descendants of v′.
        If EdgeType(v, w) = “P–C”:
            Add pointers from v ↦ v′ to all w ↦ w′ in T′_w where depth(w′) = depth(v′) + 1.
        If |T′_w| > 1:
            Replace T′_w by a dummy node with the trees from T′_w as children.
        If EdgeType(v, w) = “A–D” and |T′_w| > 0:
            Add a descendant pointer from v ↦ v′ to the single node in T′_w.
    If v ↦ v′ does not have at least one pointer per child w of v:
        Discard v ↦ v′ and return failure.
    Remove from the end of T_v all roots whose data nodes are descendants of v′,
    add them as children of v ↦ v′, and append v ↦ v′ to T_v.
Figure 2.7: A step in postorder construction of intermediate results for the data and query in Figure 1.4, (a) before and (b) after adding a1 ↦ a1. Dotted boxes give the current list of trees T_v for each v ∈ V_Q.
operations for each input v ↦ v′ and each child w of v, there is some non-trivial cost associated with merging trees and adding pointers to P–C and A–D child matches. A merge attempt either inspects only one tree root and does not change T_v, or inspects k > 1 roots, removes k − 1 roots from T_v and adds a new one. This means that the cost of merge operations is bounded by the number of attempts and the sizes of the trees, i.e., ∑_{v∈V_Q} O(|I_v| + |I_v| · b_v). Now consider the cost of adding pointers from matches for a query node v to matches for a child query node w. If EdgeType(v, w) = “A–D”, then only a single edge is added from each v ↦ v′. If EdgeType(v, w) = “P–C”, then only a single edge is added to each w ↦ w′, as a node can have only one parent. In conclusion, the total cost of using Algorithm 2 is ∑_{v∈V_Q} O(|I_v| + |I_v| · b_v) ⊆ O(|I| + |I| · b_Q).
What is presented here is a slight simplification of the Twig²Stack algorithm [7]. The main difference between the above depiction and Twig²Stack is that in the latter, the data structure for each query node is a list of trees of stacks of nodes, instead of simply a list of trees of nodes. Many alternative twig join algorithms have been presented [27, 39, 33] in the years following the publication of the Twig²Stack algorithm. What is common to these algorithms is that they have improved practical performance, but higher worst-case complexity in the result enumeration phase. An example is the TwigList algorithm, which stores intermediate nodes in simple vectors instead of trees, and implements a weaker form of subtree filtering, where all query edges are considered to have type A–D.
2.1.7 Merging input streams

The final component missing to implement the strategy in Figure 2.1 is the input stream merger. The input to the merge is one preorder-sorted stream representing I_v for each v ∈ V_Q, and the desired output is a sorted stream representing I. The sort order required for the approach from Section 2.1.6 is that the pairs v ↦ v′ ∈ I are sorted primarily on the preorder of the data nodes, and secondarily on the postorder of the query nodes. This means that after translating the stream into data node postorder with a stack, the new stream is sorted secondarily on query node preorder. This is required by Algorithm 2 for cases where a single data node matches multiple query nodes, as a data node could hide useful children of itself if the sorting were not secondarily on query node preorder.
The simplest merge approach is to traverse the query in postorder, and find some minimal v ↦ v′ by taking a preorder-minimal v′ that is the head of a stream I_v for a postorder-minimal v. This takes Θ(|Q|) time per extraction, and gives a total cost of Θ(|I| · |Q|) for the merge. An asymptotically better approach is to organize the individual streams in a priority queue implemented with a binary heap, sorted primarily on the heads of the streams and secondarily on the query nodes. Extractions then take O(log |Q|) time, and the total cost is O(|I| log |Q|) [11].
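The heap-based merger can be sketched with Python's heapq, where each stream is an iterator of (begin, end) positions sorted in preorder, keyed by the query node's postorder rank so that ties on data position are broken on query node postorder; all names are illustrative.

```python
import heapq

def merge_streams(streams):
    """Merge per-query-node streams into one stream sorted primarily on
    data node preorder (begin) and secondarily on query node postorder.

    streams: dict mapping a query node's postorder rank to an iterator of
    (begin, end) data positions, each iterator sorted in preorder.
    """
    heap = []
    for rank, it in streams.items():
        head = next(it, None)
        if head is not None:
            heapq.heappush(heap, (head[0], rank, head, it))
    while heap:
        _, rank, head, it = heapq.heappop(heap)
        yield rank, head
        nxt = next(it, None)              # refill from the same stream
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], rank, nxt, it))
```

Each of the |I| extractions costs O(log |Q|) heap work, giving the O(|I| log |Q|) total mentioned above, with only one buffered head per stream.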
Since the preorder and postorder tree traversal numbers we are sorting on are bounded by the size of the input, the sorting complexity is not loglinear, but linear under the unit cost assumption: the entire set I can be put in a single array and sorted using radix sort in Θ(|I|) time [11]. As the intermediate result construction is already O(|I| · b_Q), the radix sort approach gives no advantage over the heap-based approach when log |Q| ∼ b_Q. Since the latter uses much less memory in practice, Θ(|Q|) instead of Θ(|I|), it is preferable in most real-world scenarios.
Some of the newer twig join algorithms storing intermediate results in preorder [27, 33] use an O(|I| · |Q|) input stream merge component that implements a weak form of subtree match filtering, where all query edges are considered to have type A–D [5]. The merger uses only O(|Q|) memory and is very fast in practice because queries are typically small. It returns data nodes in a relaxed preorder, where the ordering is only guaranteed between matches for query nodes related by ancestry. This stream is not easily translated into postorder, and hence the merger is not used by postorder processing algorithms [21].
2.1.8 Data locality and updatability

This chapter generally does not distinguish between data stored in main memory and on disk, but in practical implementations it is important to consider the costs of different access patterns on different media. While main memory on modern computers does not really have uniform memory access cost, due to the use of caches, we can design usable systems that use random memory reads and writes. On the other hand, if the data is so large that it must reside on disk, a system that uses a lot of random access will not be efficient in practice.
Consider now the different phases and components in our twig join strategy. The input stream merger is assumed to only inspect stream heads and store a minimal amount of state. Hence it should work well on an architecture where the candidate matches for each query node are streamed from disk. The intermediate result construction, as shown in Algorithm 2, inspects in each call a number of tree roots stored contiguously at the end of the current list of trees for each query node. This in itself is simple to implement with good spatial locality, but it should also be considered how the layout of data affects the result enumeration phase. Luckily, if intermediate nodes are streamed to disk and inserted into blocks in postorder, most nodes that are close in the data tree will be stored close together on disk. This strategy gives fairly good spatial locality during result enumeration [7].
The problem of intermediate results exceeding the size of main memory can be avoided in many practical cases, by observing that when the uppermost candidate match for the root query node is closed, none of the data nodes seen so far in the tree preorder will be used in any match involving data nodes later in the tree preorder [7]. This means that when the uppermost query root match candidate is closed, the current intermediate data can be used to enumerate the current set of query matches, before this data is discarded.
Example 6. Consider the data in Figure 1.4, and an algorithm that pushes nodes onto a stack in preorder and pops them off in postorder. When the data node b6 is processed, it causes the popping of a1, and there are no more a-nodes on the stack. As a match for the query node a1 must be above the match for any other query node in a full query match, no nodes preceding b6 in the data will be involved in a match together with nodes following and including b6. Hence we can enumerate results, and delete the current intermediate data structures.
In many practical cases with large amounts of data, the underlying information is stored in a large number of independent documents of moderate size, and in these cases the above trick is always applicable. Data updates are also easy to handle in such a setting. A way of encoding global data node positions is to combine document identifiers with local node position encodings, such as BEL, and this simplifies updates: updating a document can be viewed as deleting it and then re-adding it with a new document identifier, as is common in search systems for unstructured data [51]. Note that when the data is a single large tree that cannot easily be partitioned into independent documents, we need a node position encoding that has affordable cost for tree updates. There exist a number of such encodings with different properties [50].
2.1.9 Twig join conclusion

We have now discussed all the components in a state-of-the-art twig join algorithm, and the costs of the different components are:
• input stream merge: O(|I| log |Q|) for the heap-based approach,
• intermediate result construction: O(|I| · b_Q),
• result enumeration: O(|O| · |Q|).
This gives a total combined data, query and result complexity of O(|I| log |Q| + |I| · b_Q + |O| · |Q|). Commonly the size of the query is viewed as a constant, and twig join algorithms are called linear and optimal if the combined data and result complexity is O(|I| + |O|).
2.2 Partitioning data

In the previous discussion it was assumed that the data nodes were partitioned on label in the index. This section considers the advantages and challenges that arise from more advanced indexing strategies.
2.2.1 Motivation for fragmentation

Let us first recap the introduction to the general strategy for TPM on indexed data from Section 1.3.1: the index is a mechanism which provides a function from some feature of a node to the set of nodes in the data that have this feature.

The main motivation for using an index is of course reading and processing less data during query processing. If node labels are selective then simple label partitioning is an efficient approach, but this is not always the case. Figure 2.8 shows a case with many label matches for the individual query nodes in the data, but only a few full matches for the query.
The above example may be unrealistic, but reconsider the data in Figure 1.1 and the query in Figure 1.2 on page 14. If the given library has billions of books, then the cost of reading the data nodes labeled book will be huge compared to the size of the output result set. This motivates the use of a more fragmented partitioning of the data to improve the selectivity of query nodes. Note that another way of improving performance in these cases is to use skipping, discussed later in Section 2.3.1.
Figure 2.8: Partitioning on label. (a) Example query and data, showing the first of four matches. (b) Example query and streams read; marked stream nodes are useful.
2.2.2 Path partitioning

A natural extension of label partitioning is to partition data nodes on the paths by which they are reachable [37, 13, 36, 8]. Section 2.1.5 described how useless data nodes could be filtered out during intermediate result construction if they did not match prefix paths in the query. When indexing data nodes on prefix path, the same filtering is performed in advance, and we only process data nodes from classes where the prefix paths match the prefix paths in the query.

To identify useful partitions when evaluating a query, we need some form of dictionary. In Figure 1.5b on page 17 a simple dictionary of path strings was used in the index, but this approach does not have attractive worst-case properties. There may be many unique paths in the data, and the size of this naive dictionary can be O(|D|²) if the tree is deep. A more robust approach is to use a dictionary tree called a path summary, where shared prefixes of paths are only encoded once.
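A path summary can be sketched as a dictionary trie built in one traversal of the data tree: each distinct prefix path gets exactly one summary node, so shared prefixes are stored once. The concrete representation below (adjacency dicts, integer summary node ids) is our own illustration, not the thesis's data structure.

```python
def build_path_summary(children, label, root):
    """Build a path summary and the partition it induces.

    children: adjacency dict of the data tree; label[n]: label of node n.
    Returns (summary_children, node_to_class), where summary_children maps
    a summary node id to {label: child summary id}, and node_to_class maps
    each data node to the summary node (block) for its prefix path.
    """
    summary_children = {}
    node_to_class = {}
    next_id = [0]
    def summary_child(s, lab):
        kids = summary_children.setdefault(s, {})
        if lab not in kids:            # first time this prefix path is seen
            next_id[0] += 1
            kids[lab] = next_id[0]
        return kids[lab]
    def visit(n, s):
        c = summary_child(s, label[n])
        node_to_class[n] = c           # n's block is its prefix path's node
        for m in children.get(n, []):
            visit(m, c)
    visit(root, 0)                     # 0 is a virtual summary root
    return summary_children, node_to_class
```

The summary has one node per distinct prefix path, so its size is bounded by the number of such paths rather than their total length, which is what makes it more robust than a dictionary of path strings.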
Figure 2.9a shows the path partitioning for the data tree in Figure 2.8a. A path<br />
summary can be constructed from this partitioning by creating one node for each block<br />
in the partition, <strong>and</strong> creating edges between summary nodes whenever there are edges<br />
between data nodes in the related blocks, as shown on the left in Figure 2.9b.<br />
Prefix path matches for each query node can be found individually by using a matching<br />
algorithm on the summary tree, but this may give many individual matches that never<br />
take part in full query matches. A robust <strong>and</strong> efficient way to find useful prefix path<br />
matches is to index the summary itself on label, and use a twig join algorithm to evaluate
queries directly on the summary to find relevant nodes [2].<br />
(a) Path partition.<br />
(b) Summary, query <strong>and</strong> streams read.<br />
Figure 2.9: Partitioning on prefix path.<br />
Figure 2.9b shows how a query is evaluated on the data from Figure 2.8a. The legal<br />
mappings <strong>of</strong> query nodes to summary nodes are found, <strong>and</strong> streams <strong>of</strong> related data nodes<br />
are read. Note that in this particular example there is only one match in the summary<br />
for each query node. If one query node matches multiple summary nodes, the streams<br />
<strong>of</strong> data nodes related to each <strong>of</strong> these summary nodes must somehow be merged into a<br />
single stream [8, 3].<br />
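Such a merge can be done lazily with a heap over the stream heads. A minimal sketch, assuming streams of (begin, end) pairs sorted by begin value, as in a BEL-style encoding:

```python
import heapq

def merged_stream(*extents):
    # Merge several document-ordered streams (each sorted by begin value)
    # into a single stream, as needed when one query node matches several
    # summary nodes. heapq.merge is lazy: it keeps only one head per
    # input stream in memory at a time.
    return heapq.merge(*extents, key=lambda node: node[0])

s1 = [(2, 3), (9, 12)]   # extent of one matching summary node
s2 = [(4, 5), (10, 11)]  # extent of another matching summary node
merged = list(merged_stream(s1, s2))
```

The merged sequence is again sorted by begin value, so the join algorithm can consume it exactly like a single-block stream.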
An extra bonus with path indexing applies for non-branching queries where the leaf<br />
node is the only output node. For such queries, the data nodes related to the summary<br />
nodes matched by the output query node can be read directly from the index, without<br />
using any join. It is known that these data nodes have the necessary ancestor nodes<br />
for matching the path query, <strong>and</strong> hence the ancestors do not have to be read. As an<br />
example, if the query node a 2 in Figure 2.9b is removed, <strong>and</strong> a 4 is the only output node,<br />
all matching data nodes can be read from the extents <strong>of</strong> the matching summary nodes,<br />
which in this case is only a p5 .<br />
2.2.3 Backward <strong>and</strong> forward path partitioning<br />
With path indexing, data nodes are placed in the same block in the partition iff they<br />
have the same incoming paths. An alternative recursive formalization is that two nodes<br />
are in the same block iff they have the same label <strong>and</strong> both have parents from the same<br />
block [36]. This can be extended by also requiring that the nodes have children from the same blocks, yielding a partition on the sets of incoming and outgoing paths [28].
With this backward <strong>and</strong> forward path partitioning, branching queries can be evaluated<br />
more efficiently than with simple backward path partitioning [28].<br />
Figure 2.10 shows backward <strong>and</strong> forward path partitioning, the resulting structure<br />
summary, <strong>and</strong> the evaluation <strong>of</strong> a query. Note that in this example, each query node<br />
matches only one summary node, but this is not always the case.<br />
Recall that for simple path indexing, non-branching queries with a single leaf output<br />
node could be evaluated without joins, by just reading the matches for the output leaf.<br />
The reason was that the existence <strong>of</strong> useful ancestor nodes was implied from the summary<br />
matching. A bonus with backward <strong>and</strong> forward path indexing is that the existence <strong>of</strong><br />
(a) F&B partition.<br />
(b) Summary, query <strong>and</strong> streams read.<br />
Figure 2.10: Forward <strong>and</strong> backward path partitioning.<br />
useful matches both for ancestor <strong>and</strong> descendant query nodes is implied from the summary<br />
matching, even for branching queries. Assuming there is a match for the query in the<br />
backward <strong>and</strong> forward path summary, as shown in Figure 2.10b, it is known that all data<br />
nodes related to matched summary nodes are guaranteed to be part <strong>of</strong> at least one full<br />
query match [28]. In the example, it is certain that all data nodes classified by a s2 have children classified by a s3 and b s4, and that data nodes classified by b s4 have children classified by a s5 and b s6.
A second bonus is that no joins are necessary for any query with a single output<br />
node. All matching can be performed in the summary, <strong>and</strong> relevant data nodes matching<br />
the output query node can be read directly from the blocks related to the matched<br />
summary nodes.<br />
The formal definitions of the recursive partitioning strategies sketched above are based on binary relations called bisimulations [36, 28], which can be computed in O(|E| log |V|) time for any graph [38].
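A naive fixpoint computation of the backward variant (same label, parents in the same block) can be sketched as below; the forward variant refines on the children's blocks symmetrically, and the Paige-Tarjan algorithm [38] achieves the O(|E| log |V|) bound. This is an illustrative sketch with assumed accessor functions, not an implementation from the thesis.

```python
def backward_bisimulation_blocks(nodes, label, parent):
    # Start from the label partition and repeatedly split blocks until two
    # nodes share a block iff they have the same label and their parents
    # share a block. Quadratic worst case, unlike Paige-Tarjan.
    block = {u: (label(u),) for u in nodes}
    while True:
        refined = {u: (label(u), block[parent(u)] if parent(u) else None)
                   for u in nodes}
        if len(set(refined.values())) == len(set(block.values())):
            return refined  # no block was split: fixpoint reached
        block = refined

# Tiny example: x and y have same-block parents, z does not.
labels = {"r": "a", "x": "b", "y": "b", "z": "b"}
parents = {"r": None, "x": "r", "y": "r", "z": "x"}
blocks = backward_bisimulation_blocks(list(labels), labels.get, parents.get)
```

Here x and y end up in the same block (both labeled b with an a-labeled root parent), while z is split off because its parent lies in a different block.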
2.2.4 Balancing fragmentation<br />
The main advantage <strong>of</strong> the finer partitioning schemes is that the increased fragmentation<br />
usually leads to less data read from the index, as smaller blocks are read for each query<br />
node. This does not come without a cost. A larger number <strong>of</strong> blocks in the partition<br />
gives a larger summary structure, which leads to more expensive initial matching in<br />
the summary. Over-refined partitioning can even give increased query evaluation cost<br />
without considering the expenses <strong>of</strong> summary matching. If the properties <strong>of</strong> nodes used<br />
to classify data nodes in the index are more detailed than what a query describes, then<br />
many partition blocks would have to be read and merged for each query node. To sum up, we need to balance, on the one hand, the number of false positives read for each query node and, on the other, the cost of summary matching and merging blocks.
A way of reducing the cost of summary matching for strong structural indexing with high fragmentation is to use a multi-level strategy. For example, the data can be indexed with backward and forward path partitioning, the resulting backward and forward path summary indexed with simple backward path partitioning [52], and the resulting backward path summary indexed with label partitioning [2]. Figure 2.11 shows conceptually how a query can be evaluated with such an index.
Figure 2.11: Multi-level partitioning. Showing (a) the query (b) backward path summary<br />
(c) backward <strong>and</strong> forward path summary, <strong>and</strong> (d) the underlying data.<br />
The path-based partitioning strategies presented here give over-refinement in some<br />
practical cases, <strong>and</strong> various weaker indexing schemes have been developed. There exist<br />
indexing schemes with fragmentation between simple path indexing <strong>and</strong> backward <strong>and</strong><br />
forward path indexing [28], schemes considering only shorter local paths [30], <strong>and</strong> schemes<br />
that adapt to query workload [6].<br />
A way of reducing the impact of the fragmentation problem is to use an adaptive strategy with different partitioning schemes for different classes of data nodes. For example, label partitioning could be used for nodes with infrequent labels, and path indexing for nodes with frequent labels. A static strategy used for XML data is to use path indexing for XML element nodes (tags), and value indexing for XML text values and attributes [29]. This works well in practice, because the alphabet of element node names is typically small, while the alphabet of text values is typically very large. Also, it lets the same index be used for both semi-structured and unstructured query processing, as simple text values can be looked up directly in the index.
2.3 Reading data<br />
Section 2.1 was concerned with how to join together full query matches from individual<br />
node matches, while Section 2.2 was concerned with how to partition the data in such a<br />
way that as few matches as possible had to be read for each individual query node. In<br />
Section 2.1.7 it was mentioned that some input stream mergers implement an inexpensive
form <strong>of</strong> weak filtering to relieve the more expensive filtering in the intermediate result<br />
construction component. This section considers how to store the data in a way such that<br />
this filtering can be performed more efficiently, even without reading all individual query<br />
node matches.<br />
2.3.1 Skipping<br />
Consider evaluating the query “Britney <strong>Grimsmo</strong>” on a regular web search engine. The<br />
simplest way to find the co-occurrences <strong>of</strong> the two terms is to go through the two lists<br />
<strong>of</strong> occurrences in parallel. But if there are many more occurrences <strong>of</strong> “Britney” than<br />
“<strong>Grimsmo</strong>”, it is more efficient to read through the hits for the second term <strong>and</strong> somehow<br />
search for related matches in the list <strong>of</strong> hits for the first term, skipping those that are<br />
irrelevant. This can be performed either with some sort <strong>of</strong> binary search if all the data is<br />
stored uncompressed in main memory, or with a specialized data structure, such as skip lists or B-trees.
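The skipping idea for the two-term example can be sketched with plain binary search over sorted in-memory document-id lists; in a real engine, skip lists or B-trees would replace the in-memory search. The list contents are made up for illustration.

```python
from bisect import bisect_left

def intersect_with_skipping(short, long):
    # Scan the shorter postings list and binary-search the longer one,
    # skipping past irrelevant entries: O(|short| log |long|) comparisons
    # instead of O(|short| + |long|) for a plain parallel scan.
    hits, lo = [], 0
    for doc in short:
        lo = bisect_left(long, doc, lo)  # resume search from last position
        if lo < len(long) and long[lo] == doc:
            hits.append(doc)
    return hits

grimsmo = [3, 17]                          # few occurrences
britney = [1, 2, 3, 5, 8, 13, 17, 21, 40]  # many occurrences
cooccurrences = intersect_with_skipping(grimsmo, britney)
```

Note that the search in the longer list resumes from the previous hit position, so the whole pass never moves backwards.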
Skipping in joins of streams of different length is also relevant for queries on semi-structured data, as shown by the example library query in Figure 1.2a, where there are probably far fewer matches for “Gödel” than for .
is not always trivial to implement for tree queries. In the following it will be discussed<br />
how to efficiently skip through matches when aligning streams for parent <strong>and</strong> child query<br />
nodes, <strong>and</strong> how to make sure the skipping is efficient for the query as a whole. Figure 2.12<br />
illustrates different issues in connection with skipping twig joins.<br />
Figure 2.12: Cases where different skipping techniques are needed for efficient processing.<br />
(a) Query. (b) Descendants easily skipped with B-tree or similar. (c) XR-tree needed<br />
to skip ascendants. (d) Holistic skipping preferred. (e) Holistic skipping with XR-tree<br />
needed.<br />
2.3.1.1 Skipping child matches<br />
For a given parent <strong>and</strong> child query node pair, we want to efficiently forward the related<br />
streams <strong>of</strong> data nodes to the first pair <strong>of</strong> stream positions where a match for the parent<br />
query node is an ancestor (or parent) <strong>of</strong> the match for the child query node. We will first<br />
consider the simpler problem <strong>of</strong> forwarding a stream <strong>of</strong> matches for the child query node<br />
to catch up with the parent’s stream.<br />
Figure 2.12b shows a case where there is a current match b 1 for the query node b 1 , <strong>and</strong><br />
we want to find the first match for the child query node c 1 which is a descendant <strong>of</strong> b 1 . In<br />
other words, we desire the first match for c 1 that follows b 1 in preorder. If this node
also precedes b 1 in postorder, it must be a descendant. For example, let b 1 .begin = k,<br />
using the BEL encoding described in Section 2.1.4. We can binary search for the first<br />
match for c 1 with begin value greater than k, which is c q . Since also c q .end < b 1 .end, it follows that c q must be a descendant of b 1 .
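This descendant-skipping step can be sketched as follows, with stream nodes as (begin, end) pairs from the BEL encoding. The function name and stream contents are illustrative assumptions, not the thesis implementation.

```python
from bisect import bisect_right

def skip_to_descendant(stream, b):
    # Forward a child-query-node stream (sorted by begin value) past all
    # nodes that cannot be descendants of the parent match b. Binary
    # search finds the first node following b in preorder (begin > b's
    # begin); the postorder check (end < b's end) confirms descendance.
    begins = [node[0] for node in stream]
    i = bisect_right(begins, b[0])
    if i < len(stream) and stream[i][1] < b[1]:
        return stream[i]
    # If the first following node is disjoint from b, so is every later
    # one, since begin values only grow: no descendant of b remains.
    return None

stream_c = [(1, 20), (3, 4), (6, 7)]  # matches for the child query node
```

For a parent match b = (2, 10), the search skips the preceding node (1, 20) and lands on (3, 4), which is indeed nested inside b.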
2.3.1.2 Skipping parent matches<br />
Skipping through matches for a parent query node to find the first data node that is an<br />
ancestor <strong>of</strong> the current match for a child query node is not as simple, as illustrated in<br />
Figure 2.12c. Given a current match c 1 ↦ c 1 , the first ancestor of c 1 in the stream for
b 1 could be any node before c 1 in the preorder. Preceding ancestors are mixed with<br />
preceding non-ancestors in the preorder sorting. They are spread throughout the stream,<br />
<strong>and</strong> hence it is hard to predict where they are. Note that sorting in postorder does not<br />
solve this problem, as an ancestor <strong>of</strong> a node is any node later in the postorder.<br />
Implementing parent match skipping requires specialized data structures. The XR-tree<br />
is a B-tree variant that can be used to retrieve all r ancestors <strong>of</strong> a node from n c<strong>and</strong>idates<br />
in O(log n + r) time [25, 26]. The data structure maintains one tree for each node label<br />
α, <strong>and</strong> conceptually stores for each node u with Label(u) = α, the list <strong>of</strong> ancestors <strong>of</strong> u<br />
that have label α. Linear space usage is achieved by removing the redundancy <strong>of</strong> storing<br />
common ancestors multiple times.<br />
2.3.1.3 Holistic skipping<br />
A common method for forwarding the streams associated with the nodes in a query is to<br />
pick an edge, <strong>and</strong> repeatedly forward the streams for the parent <strong>and</strong> child node until they<br />
are aligned, then pick another edge, <strong>and</strong> so on, until the entire query is aligned. Unfortunately,<br />
this procedure can lead to suboptimal behavior, as illustrated by Figure 2.12d. As<br />
the query edge 〈a 1 , b 1 〉 is satisfied initially, the streams for b 1 <strong>and</strong> c 1 will be repeatedly<br />
forwarded until the edge 〈b 1 , c 1 〉 is satisfied. This means all data nodes b 2 . . . b p <strong>and</strong><br />
c 2 . . . c p will be inspected.<br />
To avoid this pitfall, the query should be considered holistically when forwarding<br />
streams. A robust approach is to repeatedly forward streams top-down <strong>and</strong> bottom-up in<br />
the query [12]. In the top-down pass, a stream is forwarded until the current head follows<br />
the head for the stream <strong>of</strong> the parent query node in preorder. In the bottom-up pass, a<br />
stream is forwarded until the current head follows the heads <strong>of</strong> the streams <strong>of</strong> all child<br />
query nodes in postorder. If this approach were followed in Figure 2.12d, nearly all the
useless nodes labeled b <strong>and</strong> c would be skipped past.<br />
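The two passes can be sketched as below, over streams of (begin, end) pairs sorted by begin value. This is a simplified illustration of the idea in [12], not the published algorithm; names and stream contents are assumptions.

```python
def holistic_forward(children, order, streams):
    # Repeatedly forward stream heads until, for every query edge, the
    # child head follows the parent head in preorder (greater begin) and
    # the parent head follows the child head in postorder (greater end).
    # `order` lists the query nodes top-down.
    pos = {q: 0 for q in order}
    moved = True
    while moved:
        moved = False
        for q in order:                       # top-down pass
            if pos[q] >= len(streams[q]):
                return pos                    # a stream ran dry: no match left
            for c in children[q]:
                while (pos[c] < len(streams[c])
                       and streams[c][pos[c]][0] < streams[q][pos[q]][0]):
                    pos[c] += 1
                    moved = True
        for q in reversed(order):             # bottom-up pass
            for c in children[q]:
                while (pos[q] < len(streams[q]) and pos[c] < len(streams[c])
                       and streams[q][pos[q]][1] < streams[c][pos[c]][1]):
                    pos[q] += 1
                    moved = True
    return pos

# Chain query a -> b -> c; only the second a-interval nests useful b and c.
streams = {"a": [(1, 10), (20, 30)],
           "b": [(2, 3), (21, 25)],
           "c": [(22, 23)]}
pos = holistic_forward({"a": ["b"], "b": ["c"], "c": []},
                       ["a", "b", "c"], streams)
```

In the example, the bottom-up pass pulls the streams for b and then a forward past their useless first entries, so the cursors align on the nested triple (20, 30) ⊇ (21, 25) ⊇ (22, 23).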
Figure 2.12e shows a case where both holistic skipping <strong>and</strong> specialized data structures<br />
for ancestor skipping are needed for an optimal skipping strategy.<br />
2.3.2 Virtual streams<br />
In the previous section we considered how data nodes could be efficiently skipped past<br />
if they were not part <strong>of</strong> query matches. This section describes a different approach:<br />
Reconstruction <strong>of</strong> data nodes needed for a complete query match.<br />
2.3.2.1 Virtual matches for non-branching internal query nodes<br />
Reconstructing a data node based on the existence <strong>of</strong> other data nodes requires some<br />
information about the structure the known nodes are part <strong>of</strong>. A good starting point is<br />
using structural summaries, such as those described in Section 2.2, and storing with each
data node the classification <strong>of</strong> the structure it is part <strong>of</strong>.<br />
Example 7. For the query <strong>and</strong> data in Figure 2.13 the data nodes are classified on<br />
path, with the path summary shown on the right in the figure. Given a current matching<br />
a 1 ↦ a 2 and a 4 ↦ a 4 , we know that the current data nodes have paths specified by a p2 and
a p5 , <strong>and</strong> we can deduce by using the path summary that they must have a node on the<br />
path between them specified by b p4 .<br />
Figure 2.13: Virtual match for non-branching internal query node.<br />
As illustrated above, the existence <strong>of</strong> matches for non-branching internal query nodes<br />
can be implied by the matches for above <strong>and</strong> below query nodes <strong>and</strong> information from<br />
pattern matching in the path summary. This means no data nodes have to be read for<br />
non-branching internal query nodes, as long as they are not output nodes.<br />
2.3.2.2 Tree position encoding allowing ancestor reconstruction<br />
When virtualizing output nodes, matches cannot just be implied, but must be reconstructed,<br />
at least to an extent such that they can be uniquely identified. The most<br />
common approach to implementing this is to use a node encoding which allows ancestor<br />
reconstruction, such as Dewey [44], which encodes positions with strings <strong>of</strong> integers. With<br />
the Dewey encoding the tree position <strong>of</strong> a node is encoded as the position <strong>of</strong> the parent<br />
concatenated with the child number <strong>of</strong> the node, as shown in Figure 2.14. This means<br />
that the string length <strong>of</strong> a position encoding is equal to the depth <strong>of</strong> the corresponding<br />
data node.<br />
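Dewey code assignment can be sketched as a preorder walk. This is an illustrative sketch; the tree accessor is an assumption.

```python
def dewey_codes(root, children):
    # The root gets code [1]; every other node gets its parent's code with
    # its 1-based child number appended, so code length equals node depth.
    codes = {}
    def visit(u, code):
        codes[u] = code
        for i, c in enumerate(children(u), start=1):
            visit(c, code + [i])
    visit(root, [1])
    return codes

# The shape of the tree in Figure 2.14: the root has three children,
# and the second child has three children of its own.
kids = {"r": ["u", "v", "w"], "v": ["x", "y", "z"]}
codes = dewey_codes("r", lambda u: kids.get(u, []))
```

The second child of the root gets code 1.2, and its second child gets code 1.2.2, matching the figure.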
[Tree with root position 1, its children 1.1, 1.2 and 1.3, and the children of node 1.2 at positions 1.2.1, 1.2.2 and 1.2.3.]
Figure 2.14: Dewey encoding <strong>of</strong> node positions.<br />
2.3.2.3 Virtual matches for branching query nodes<br />
Using virtual matches for branching query nodes is considerably harder than for non-branching
query nodes, <strong>and</strong> requires an encoding allowing ancestor reconstruction even<br />
when the query nodes to be virtualized are not output nodes. An example is shown in<br />
Figure 2.15. Even though it is known that there is an above data node classified by a p2<br />
in the path summary, both for a 2 ↦ a 7 and a 4 ↦ a 4 , it cannot be determined from the
summary matching alone that it is the same above data node.<br />
Figure 2.15: Virtual match for branching internal query node.<br />
Luckily, with the Dewey encoding, the lowest common ancestor <strong>of</strong> two data nodes can<br />
be determined by finding the longest common prefix <strong>of</strong> the Dewey strings. This means<br />
that virtual matches for branching query nodes can be generated by combining node<br />
position encoding with information from the path summary matching [53, 34].<br />
Example 8. In the example in Figure 2.15, data nodes a 7 <strong>and</strong> a 4 have Dewey position<br />
encodings 1.1.2 <strong>and</strong> 1.1.1.1. The longest common prefix <strong>of</strong> these strings is 1.1, which<br />
has length 2. Since both a p3 <strong>and</strong> a p5 have the ancestor a p2 at depth 2, it is determined<br />
that a 7 <strong>and</strong> a 4 have a common ancestor with structure belonging to the group specified<br />
by a p2 , <strong>and</strong> the query is matched.<br />
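The check from Example 8 can be sketched as follows: the longest common prefix of the two Dewey codes gives the depth of the lowest common ancestor, which is compared against the depth at which the path summary places the virtualised branching node. The function names are illustrative assumptions.

```python
def common_prefix_len(code1, code2):
    # Depth of the lowest common ancestor of two nodes, obtained from the
    # longest common prefix of their Dewey codes.
    n = 0
    for a, b in zip(code1, code2):
        if a != b:
            break
        n += 1
    return n

def virtual_branch_match(code1, code2, summary_anc_depth):
    # The summary places the required common ancestor at a known depth;
    # the two matches combine iff their real common ancestor is at least
    # that deep, i.e. they agree on the first summary_anc_depth steps.
    return common_prefix_len(code1, code2) >= summary_anc_depth

# Example 8: a7 has code 1.1.2 and a4 has code 1.1.1.1; the summary puts
# the shared ancestor (specified by a_p2) at depth 2.
ok = virtual_branch_match([1, 1, 2], [1, 1, 1, 1], 2)
```

With the common prefix 1.1 of length 2, the two matches are confirmed to share a data-node ancestor classified by the required summary node, so the query matches.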
An advantage of using virtual matches for internal query nodes is that one avoids the issues with skipping parent matches discussed in Section 2.3.1.2. A downside is that the Dewey encoding requires O(d) space per data node, linear in the maximal depth of the data.
2.4 Related problems <strong>and</strong> solutions<br />
This background chapter featured a high-level description <strong>of</strong> some <strong>of</strong> the concepts that<br />
are used in the research papers included in this thesis. A number <strong>of</strong> related issues that I<br />
have not covered are briefly listed below.<br />
The strategy <strong>of</strong> using indexing <strong>and</strong> joins is considered to be the most efficient out<br />
<strong>of</strong> three common strategies [15], where the other two are navigation <strong>and</strong> subsequence<br />
matching. In navigation-based approaches, the data tree is indexed similarly to join-based approaches, but instead of joining together matches for different parts of the query,
only the matches for one part are read from the index, <strong>and</strong> for each such partial match,<br />
the rest <strong>of</strong> the query is matched by navigating in the data tree. Subsequence indexing is<br />
a radically different indexing approach, where both data <strong>and</strong> query trees are converted to<br />
sequences with special properties, such that a subsequence match for the query sequence<br />
in the data sequence indicates a tree match [48, 40, 49].<br />
When the underlying data is not a tree, but a general graph, many aspects <strong>of</strong> pattern<br />
matching become more complex. In Section 2.1.4 it was described how a simple tree<br />
position encoding using constant space per data node could be used to decide P–C <strong>and</strong><br />
A–D relationships in constant time. Unfortunately, this is not as simple in general graphs,<br />
where encodings giving constant time decision are expensive to compute <strong>and</strong> store, while<br />
encodings with small space requirements that can be computed efficiently give expensive<br />
decision <strong>of</strong> node relationships [55].<br />
Note that for general graph data, partitioning nodes on their sets <strong>of</strong> incoming <strong>and</strong>/or<br />
outgoing paths is actually PSPACE-complete [43]. Luckily there exist refinements of this
ideal partition that are tractable. Recursively partitioning nodes on their label <strong>and</strong> the<br />
blocks <strong>of</strong> their parent <strong>and</strong>/or child nodes gives a refinement that is usually very close to<br />
the ideal in practice [36, 29].<br />
As discussed in Section 2.1.2.1, a difference between XPath evaluation <strong>and</strong> TPM is<br />
that in the former, the result is all matches for the output node in the query, while in<br />
the latter, the result is all legal combinations <strong>of</strong> matches for the query nodes. There is<br />
a third output type called filtering [42] where the output is the set <strong>of</strong> documents where<br />
there exists at least one match for the query. This is useful in many information retrieval<br />
settings.<br />
The full XPath language is complex, and the first algorithms for evaluating XPath queries in polynomial time were presented only years after the language standard was formalized [14]. A method for extending a TPM solution to cover the different axes in XPath is to use an algorithm to rewrite queries to use a smaller set of axes. What is
missing from TPM is any notion <strong>of</strong> order between matches for query nodes not related by<br />
ancestry, <strong>and</strong> it has been shown that all XPath queries can be rewritten to queries using<br />
only the self, child, descendant <strong>and</strong> following axes [57]. Adding checks <strong>of</strong> the relative<br />
ordering <strong>of</strong> data nodes in a match can be done with post-processing, checks during result<br />
enumeration, or by modifying the intermediate result construction. The latter is required<br />
for optimal evaluation.<br />
There are many pattern matching problems on trees that are related to TPM, such as<br />
ordered <strong>and</strong> unordered tree inclusion, path inclusion, region inclusion, child inclusion <strong>and</strong><br />
the subtree problem [31]. Of these, the most similar to twig pattern matching is unordered<br />
tree inclusion (UTI). The differences between UTI <strong>and</strong> TPM are that in the former all<br />
query edges are <strong>of</strong> type A–D, <strong>and</strong> that for any match function M, M(u) = M(v) if <strong>and</strong><br />
only if u = v. The latter requirement makes UTI NP-complete [31].<br />
Chapter 3<br />
Research Summary<br />
“There is nothing like looking, if you want to find something.<br />
You certainly usually find something, if you look,<br />
but it is not always quite the something you were after.”<br />
– J.R.R. Tolkien<br />
This chapter gives a brief overview <strong>of</strong> the research contained in this thesis. Section 3.2<br />
lists the papers included, describes the research process <strong>and</strong> the roles <strong>of</strong> my co-authors,<br />
<strong>and</strong> gives a short evaluation <strong>of</strong> each paper in retrospect. Section 3.3 discusses the methodology<br />
used in my research, <strong>and</strong> Section 3.4 tries to evaluate the contributions in the papers<br />
in the light <strong>of</strong> the research questions from Section 1.4. Section 3.5 gives a short list <strong>of</strong><br />
future work I find interesting, and Section 3.6 concludes the evaluation of the research.
3.1 Formalities<br />
The thesis is a paper collection submitted for partial fulfillment <strong>of</strong> the requirements for<br />
the degree <strong>of</strong> philosophiae doctor. I have been enrolled in a four year PhD program<br />
at the <strong>Department</strong> <strong>of</strong> <strong>Computer</strong> <strong>and</strong> <strong>Information</strong> Science at the Norwegian University<br />
<strong>of</strong> Science <strong>and</strong> Technology. Three years worth <strong>of</strong> financing was given by the Research<br />
Council <strong>of</strong> Norway under the grant NFR 162349, <strong>and</strong> one year <strong>of</strong> financing was given by<br />
the department in exchange for 25% duty work during my stay.<br />
I started in the PhD program in the summer of 2005, and it has taken an additional year to
complete it. From August 2008 I was on a full <strong>and</strong> later partial sick leave due to tendinitis<br />
in both h<strong>and</strong>s, which I had acquired from rock climbing. During the partial sick leave<br />
I started setting up voice recognition s<strong>of</strong>tware for programming, <strong>and</strong> most <strong>of</strong> the C++<br />
implementation used in Paper 3 was actually written using this setup. Six extra months of financing were given by the iAD project, sponsored by the Research Council of Norway, because I had been without a supervisor for an extended period.
In the PhD program it is m<strong>and</strong>atory to take five courses, <strong>and</strong> I have taken the following:<br />
• DT8101 Highly Concurrent Algorithms<br />
• DT8102 Database Management Systems<br />
• DT8108 Topics in <strong>Information</strong> Technology<br />
• TDT4215 Knowledge in Document Collections<br />
• TT8001 Pattern Recognition<br />
In my duty work for the department I have worked with the following:<br />
• The Nordic Collegiate Programming Contest<br />
• TDT4120 Algorithms <strong>and</strong> Data Structures<br />
• TDT4215 Algorithm Construction, Advanced Course<br />
• TDT4287 Algorithms for Bioinformatics<br />
3.2 Publications <strong>and</strong> research process<br />
After finishing my master's thesis on substring indexing [16] I wanted to continue this
research, but this venture was ab<strong>and</strong>oned for various reasons described in Appendix A<br />
(Paper 7), <strong>and</strong> resulted in a technical report [17]. After that I started the research on<br />
XML indexing, which resulted in the papers listed below. In addition I have co-authored<br />
some other papers on indexing <strong>and</strong> search in other types <strong>of</strong> data together with fellow PhD<br />
student Truls Amundsen Bjørklund <strong>and</strong> others. These are listed in Appendix A (Papers 8<br />
<strong>and</strong> 9).<br />
3.2.1 Paper 1<br />
Authors: <strong>Nils</strong> <strong>Grimsmo</strong>.<br />
Title: Faster Path Indexes for Search in XML Data [18].<br />
Publication: Proceedings of the Nineteenth Australasian Database Conference (ADC 2008).
Abstract: This article describes how to implement efficient memory resident path indexes<br />
for semi-structured data. Two techniques are introduced, <strong>and</strong> they are shown to<br />
be significantly faster than previous methods when facing path queries using the descendant<br />
axis <strong>and</strong> wild-cards. 1 The first is conceptually simple <strong>and</strong> combines inverted lists,<br />
selectivity estimation, hit expansion <strong>and</strong> brute force search. The second uses suffix trees<br />
with additional statistics <strong>and</strong> multiple entry points into the query. The entry points are<br />
partially evaluated in an order based on estimated cost until one <strong>of</strong> them is complete.<br />
Many path index implementations are tested, using paths generated both from statistical<br />
models <strong>and</strong> DTDs.<br />
Research process<br />
When I started working with XML search my supervisor Dr. Torbjørnsen had a plan<br />
for the architecture <strong>of</strong> a commercial XML indexing system. This system was to use<br />
1 Note that wild-cards are not mentioned in Chapter 2, but can be supported by fixing the level<br />
difference between matches for above <strong>and</strong> below query nodes when necessary, or simply by reading all<br />
data nodes for the wild-card query node.<br />
path indexing for XML structure elements <strong>and</strong> inverted files for textual content. My<br />
first assignment was to design an efficient path index and, as a first step, to look only at single
paths. I used my experience from string indexing, <strong>and</strong> came up with one solution based<br />
on joins, <strong>and</strong> one based on suffix trees. Both used selectivity estimates <strong>and</strong> opportunistic<br />
optimizations.<br />
Retrospective view
The solutions I came up with for solving the problem at hand were rather advanced,
especially the opportunistic algorithm using suffix trees. The focus of the work was on
practical performance, and given the interplay between the path index and the value
index that was planned in the underlying design, I feel that the solutions I found were
good. On the other hand, from a more theoretical viewpoint the worst-case behavior of
the solutions is not attractive, as individually matching paths can give large intermediate
results. An asymptotically better approach is to use a twig join algorithm on the path
summary, as described in Section 2.2.2.

As the size of the path index is small compared to the size of the data for many
document collections, it is a fair question whether path index lookups are
important for total query evaluation time in a complete search system.
3.2.2 Paper 2
Authors: Nils Grimsmo and Truls Amundsen Bjørklund.
Title: On the Size of Generalised Suffix Trees Extended with String ID Lists [19].
Publication: Technical Report IDI-TR-2007-01, Norwegian University of Science and
Technology, Trondheim, Norway, 2007.
Abstract: The document listing problem can be solved with linear preprocessing and
optimal search time by using a generalised suffix tree, additional data structures and
constant-time range minimum queries. A simpler solution is to use a generalised suffix
tree in which internal nodes are extended with a list of all string IDs seen in the subtree
below the respective node. This report makes some remarks on the size of such a structure.
For the case of a set of equal-length strings, a bound of Θ(n√n) for the worst-case space
usage of such lists is given, where n is the total length of the strings.
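The extended structure can be illustrated with a naive generalised suffix *trie* (quadratic construction, unlike the linear-size suffix tree the report concerns), where every node keeps the set of string IDs occurring below it. This sketch is mine, meant only to show the ID lists whose total size the report bounds:

```python
def build_trie_with_id_lists(strings):
    """Naive generalised suffix trie: insert every suffix of every string,
    and record at each node the set of IDs of strings whose suffixes pass
    through it (the 'string ID list' of the report)."""
    root = {"ids": set(), "children": {}}
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            node["ids"].add(sid)
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"ids": set(), "children": {}})
                node["ids"].add(sid)
    return root

def document_listing(root, pattern):
    """Return the IDs of all strings containing pattern: walk down the
    trie and report the ID list at the node reached."""
    node = root
    for ch in pattern:
        if ch not in node["children"]:
            return set()
        node = node["children"][ch]
    return node["ids"]
```

Search time is proportional to the pattern length plus the answer size; the price is the space consumed by the ID lists, which is what the Θ(n√n) bound addresses.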
Research process
During the implementation of the system for Paper 1, I found some recent work which
used an extension of generalized suffix trees [58]. Suffix trees have attractive asymptotic
complexity, but as the authors did not comment on the complexity of their extension, I
felt that I should investigate it before I based my work on their solution. As there was
no space for these results in Paper 1, they were published as a technical report.

Roles of the authors
Bjørklund helped write and simplify the proofs, and helped write the paper itself.
CHAPTER 3. RESEARCH SUMMARY
Retrospective view
This report only investigated worst-case space usage as a function of the total length of
the strings indexed. It could also have been of interest to find the best-case and average-case
space usage, and to express the complexity as a function of the number of strings
and their average length.
3.2.3 Paper 3
Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Øystein Torbjørnsen.
Title: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes [24].
Publication: Proceedings of the Second International Conference on Advances in
Databases, Knowledge, and Data Applications (DBKDA 2010).
Abstract: XML indexing and search has become an important topic, and twig joins
are key building blocks in XML search systems. This paper describes a novel approach
using a nested loop twig join algorithm, which combines several existing techniques to
speed up evaluation of XML queries. We combine structural summaries, path indexing
and prefix path partitioning to reduce the amount of data read by the join. This effect
is amplified by only reading data for leaf query nodes, and inferring data for internal
nodes from the structural summary. Skipping is used to speed up merges where query
leaves have differing selectivity. Multiple access methods are implemented as materialized
views instead of succinct secondary indexes for better locality. This redundancy is made
affordable in terms of space by using compression in a back-end with columnar storage.
We have implemented an experimental prototype, which shows a speedup of two orders
of magnitude on XPath queries with value predicates, when compared to existing open
source and commercial systems using a subset of the techniques. Space usage is also
improved.
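The virtualization of internal query nodes exploits the fact that a Dewey ID encodes the whole root-to-node path, so an ancestor match is an implicit prefix of a leaf's Dewey and never needs to be materialized. A minimal sketch of that idea (my simplification: real joins also handle axes and level constraints):

```python
def ancestor_dewey(leaf_dewey, depth):
    """The match for an internal query node at the given depth is just a
    prefix of the leaf's Dewey ID -- no identifier is generated for it."""
    return leaf_dewey[:depth]

def leaves_join_at(dewey_a, dewey_b, branch_depth):
    """Two leaf matches can share a branching-node match iff their Deweys
    agree on the prefix down to the branching node's depth."""
    return dewey_a[:branch_depth] == dewey_b[:branch_depth]
```

Because prefixes order and compare like the full IDs, the join never has to read or store separate streams for internal query nodes.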
Research process
This was an attempt to build an academic prototype for a high performance XML search
system, based on design ideas by my supervisor Dr. Torbjørnsen and myself. The system
turned out rather complex and had many features, and in my view at the time, substantial
academic contributions. The main contribution was the virtualization of internal query
nodes, which I thought was a novelty. A paper was submitted to XSym 2009, but was
rejected, mainly because of missing references to previous work, but also because of a
weak description of the join algorithms used. Virtual streams for branching internal
query nodes had been presented in independent papers from 2004 [53] and 2005 [34]. We
then rewrote the presentation of the paper, and the final form was more of an experimental
systems paper that featured some minor novelties.
Roles of the authors
Both Bjørklund and Torbjørnsen were part of all phases of this work, except the implementation
of the system. Bjørklund contributed mainly with knowledge about columnar
storage and compression schemes. Torbjørnsen contributed with general knowledge of
database systems and search engines, and also had many of the initial design ideas for
the system.
Retrospective view
In this research I mostly compared my full system with other full systems, and tried to
explain performance differences in the experiments based on which features each of the
systems used. The experiments would probably have been more academically interesting
if, in addition, I had extended my system such that all the different features could be
turned on and off to isolate their effects.

I learned two important lessons from this project. The first is that I should have
spent considerably more time reading previous research, and less time focusing on my
own ideas in the beginning. The second lesson is that it is not very tactical for a PhD
student working mostly on his own to spend a year implementing a large system, unless
it is certain that it will bear fruit in terms of publications. Large systems should be left
to large research groups with long-term projects.
3.2.4 Paper 4
Authors: Nils Grimsmo and Truls Amundsen Bjørklund.
Title: Towards Unifying Advances in Twig Join Algorithms [20].
Publication: Proceedings of the 21st Australasian Database Conference (ADC 2010).
Abstract: Twig joins are key building blocks in current XML indexing systems, and numerous
algorithms and useful data structures have been introduced. We give a structured,
qualitative analysis of recent advances, which leads to the identification of a number of
opportunities for further improvements. Cases where combining competing or orthogonal
techniques would be advantageous are highlighted, such as algorithms avoiding redundant
computations and schemes for cheaper intermediate result management. We propose some
direct improvements over existing solutions, such as reduced memory usage and stronger
filters for bottom-up algorithms. In addition we identify cases where previous work has
been overlooked or not used to its full potential, such as for virtual streams, or the benefits
of previous techniques have been underestimated, such as for skipping joins. Using
the identified opportunities as a guide for future work, we are hopefully one step closer
to unification of many advances in twig join algorithms.
Research process
After the disappointment of having reinvented the wheel when working with Paper 3,
I started doing a more thorough literature review to get a deeper understanding of the
different aspects of XML indexing, and of twig joins in particular. My notes from this study
gradually grew into a mixture of a survey paper and a long list of research ideas. As most
conferences do not accept survey papers, and because I did not feel I had time for the
process of journal publication, the focus of the paper was turned to the list
of research opportunities.
Roles of the authors
Bjørklund helped structure the paper, was part of discussions about the different research
opportunities listed, and helped write the final version.
Retrospective view
This paper features a long list of ideas. Some of these may be of interest to other researchers,
while others probably are not. It would of course have been a much greater
contribution to the research community if this work had been presented at a more visible
venue, such as a journal, but this would have required much more time, and probably the help
of co-authors experienced in the field of twig joins.

In retrospect, it may be backwards to write such a literature review at the end of a
PhD, and I think my academic development would have benefited from doing it at an
earlier stage. On the other hand, with no experience I probably would not have been able
to analyze previous work with the same understanding.
3.2.5 Paper 5
Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland.
Title: Fast Optimal Twig Joins [21].
Publication: Proceedings of the 36th International Conference on Very Large Data Bases
(VLDB 2010).
Abstract: In XML search systems twig queries specify predicates on node values and
on the structural relationships between nodes, and a key operation is to join individual
query node matches into full twig matches. Linear time twig join algorithms exist, but
many non-optimal algorithms with better average-case performance have been introduced
recently. These use somewhat simpler data structures that are faster in practice, but have
exponential worst-case time complexity. In this paper we explore and extend the solution
space spanned by previous approaches. We introduce new data structures and improved
strategies for filtering out useless data nodes, yielding combinations that are both worst-case
optimal and faster in practice. An experimental study shows that our best algorithm
outperforms previous approaches by an average factor of three on common benchmarks.
On queries with at least one unselective leaf node, our algorithm can be an order of
magnitude faster, and it is never more than 20% slower on any tested benchmark query.
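The filtering condition behind such algorithms can be stated compactly: a data node is worth keeping for a query node only if its label matches and every child of the query node is matched somewhere in the data node's subtree. A naive sketch of this subtree-match check (my illustration of the condition itself, not one of the paper's optimal algorithms):

```python
def descendants(node):
    """Yield all proper descendants of a data node."""
    for child in node["children"]:
        yield child
        yield from descendants(child)

def roots_subtree_match(node, query):
    """True iff `node` roots a match of the twig `query`, where a query is
    a tuple (label, child_query, ...) and all edges are descendant edges.
    Nodes failing this test can be filtered out before the join."""
    label, *subqueries = query
    if node["label"] != label:
        return False
    return all(any(roots_subtree_match(d, q) for d in descendants(node))
               for q in subqueries)
```

The optimal algorithms in the paper enforce (strengthenings of) this condition without the repeated subtree scans the naive version performs.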
Research process
This work started off as an idea for space saving in twig joins, listed as Opportunity 3 in
Paper 4, but then I noticed that the same idea could be used to close the gap between
worst-case behavior and practical performance in current twig join algorithms. Also, the
creation of the framework for classifying strategies we presented in this paper was inspired
by Opportunity 4 in Paper 4, which suggested extending the range of different filtering
strategies used in current twig joins.

As many of the recently presented algorithms used very similar underlying data structures,
I implemented a minimalistic system where different filtering strategies and storage
techniques could be switched on and off. The result was a system that spanned a
large space of solutions, where the interplay between the effects of the different features could
be analyzed.
Roles of the authors
Bjørklund was part of the idea phase that led up to this paper and of the discussions about the
development of the system, and gave much constructive feedback during the writing phase.
Hetland helped simplify our formalizations, was of great help writing the theoretical part
of the paper, and had the idea for how to implement the post-processing for our preorder
storage algorithm TJStrictPre.
Retrospective view
I feel that the most important contribution in this paper is the framework we developed
for classifying the different filtering strategies used in twig joins, and the analysis of how
the strategies affect practical and worst-case performance.

On the day before the submission of the camera-ready copy of the paper, I started
thinking about how our novel input stream merger called getPart could be modified to
return data nodes strictly in order. It would indeed be interesting to know whether such a
modification would increase the cost of the merge, because with strict ordering there are
actually more possibilities during intermediate result construction, as you can use a global
stack [21, Appendix H]. My current impression is that with a modified getPart merger,
it should be possible to create an algorithm that is much simpler and more elegant than
the TJStrictPre algorithm presented in this paper, but still as fast.
3.2.6 Paper 6
Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland.
Title: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation
for Node-Labeled Trees [22].
Publication: Proceedings of the 7th International XML Database Symposium on
Database and XML Technologies (XSym 2010).
Abstract: The F&B-index is used to speed up pattern matching in tree and graph data,
and is based on the maximum F&B-bisimulation, which can be computed in loglinear
time for graphs. It has been shown that the maximum F-bisimulation can be computed
in linear time for DAGs. We build on this result, and introduce a linear algorithm for
computing the maximum F&B-bisimulation for tree data. It first computes the maximum
F-bisimulation, and then refines this to a maximal B-bisimulation. We prove that
the result equals the maximum F&B-bisimulation.
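For trees, one direction of bisimulation can be computed in a single bottom-up pass by hashing the pair (label, set of child blocks). Which direction this corresponds to (F or B) depends on the edge-orientation convention, so treat this purely as my sketch of partition computation on trees, not as the paper's algorithm:

```python
def bisimulation_blocks(root):
    """Partition the nodes of a labeled tree: two nodes share a block iff
    they have the same label and the same *set* of child blocks. One
    bottom-up pass, linear up to dictionary operations."""
    block_of_sig = {}   # (label, frozenset of child block ids) -> block id
    block_of = {}       # id(node) -> block id

    def visit(node):
        sig = (node["label"],
               frozenset(visit(child) for child in node["children"]))
        if sig not in block_of_sig:
            block_of_sig[sig] = len(block_of_sig)
        block_of[id(node)] = block_of_sig[sig]
        return block_of_sig[sig]

    visit(root)
    return block_of
```

The difficulty the paper addresses is the *simultaneous* F&B case, where the two directions must be refined against each other; the one-directional pass above is only the easy half.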
Research process
When I started working on this project, the idea was to write a paper presenting three
contributions: linear F&B-index construction for trees, subtree addition with cost dependent
only on the size of the addition, and a publish/subscribe solution that indexed
queries on what document structures they matched. Note that the second part is Opportunity
9 in Paper 4. After I had found a solution to the first part and implemented it, I
found a published but incorrect algorithm for solving this problem, and felt that perhaps this
deserved a publication of its own.
Our initial contribution was two-fold: linear construction of single-directional bisimilarity
for DAGs, and linear construction of two-directional bisimilarity for trees. One
week before the submission deadline for XSym I found a paper which described a solution
to the first problem, and we had to rewrite the paper slightly. We had to describe how
our solution was different from theirs, and justify why these differences were necessary
for solving the second problem, two-directional bisimilarity for trees.
Roles of the authors
Bjørklund and Hetland both participated in discussions about how to solve the problems,
and in the writing of the paper.
Retrospective view
Some of the reviewers at XSym commented that the paper would benefit from more
examples illustrating the algorithms, and also that the experiments were superfluous
because of the asymptotic improvement. The experiments were therefore removed from
the published version to give room for more examples [22], but can be found in the
extended technical report version [23]. The experiments are on standard XML benchmark
data, where it turns out the old loglinear algorithm performs rather well. It would perhaps
be more interesting to see experiments on random data showing the loglinear worst case,
but this could turn out to be a difficult exercise in constructing data.
3.3 Research methodology
Most of the work presented here can be considered constructive research: given a computational
problem and the set of existing solutions, find a computationally cheaper solution.
Cheaper can mean that the asymptotic running time is less, such as in Paper 6 listed
above, or it can mean that an implementation runs faster for a given set of instances of
the problem on a given computer, such as in Papers 1 and 3, or both, such as in Paper 5.

Empirical methods are needed to draw conclusions about the practical performance
of solutions in general. Unfortunately, this is a weak part of many subfields of computer
science, such as XML indexing, and it is also a weak part in most of my research. Most
papers on twig joins and XPath evaluation use around three standard data sources, run
maybe five to ten standard queries on each data set, and then draw conclusions about
which solution is faster. These queries were typically written by some paper author to
highlight a specific difference between two solutions. This can almost be called qualitative
research. Using a few standard data sets may be the only feasible solution, but it should
be possible to create a large population of standard queries from which samples could be
drawn. Another solution is to create a statistical model from which the data and queries
are generated [17]. This allows drawing conclusions about which solution is better given
properties of the data and queries, such as label alphabet distribution, tree size and shape,
etc.
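Such model-based generation can be sketched as a small random tree generator. The parameters below (weighted label alphabet, per-slot branching probability, depth cap) are illustrative choices of mine, not those of the generator in [17]:

```python
import random

def random_tree(labels, weights, max_depth, branch_p, rng):
    """Draw a labeled tree from a simple statistical model: the label
    comes from a weighted alphabet, and children are appended with
    probability branch_p each until a failure, up to the depth cap."""
    node = {"label": rng.choices(labels, weights)[0], "children": []}
    if max_depth > 0:
        while rng.random() < branch_p:
            node["children"].append(
                random_tree(labels, weights, max_depth - 1, branch_p, rng))
    return node
```

Varying the model parameters then lets one study how solutions behave as a function of data properties, rather than of a handful of fixed benchmark files.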
Some of the work presented here can also be considered descriptive research, such as
the taxonomy of techniques in Paper 4, and the classification of filtering strategies in
previous approaches in Paper 5.
3.4 Evaluation of contributions
This section revisits the research questions from Section 1.4, and tries to evaluate to what
extent they have been answered by the research papers that constitute this thesis.

3.4.1 Research questions revisited
Below follows a brief list of my contributions, grouped by research question.
1. RQ1: How can matches for tree queries be joined more efficiently?
• Paper 3: The XLeaf system uses multiple materialized views and selectivity
estimates to reduce the amount of data read during XPath evaluation. Query
leaves are looked up on either value, path or tag, and then possibly filtered on
path and/or value.
• Paper 3: Information from the summary matching is used to determine whether
or not a linear join can be employed. This is stronger than previous approaches,
which determined this only from properties of the query [42].
• Paper 3: Both the linear and the nested loop join in the XLeaf system store
intermediate results of negligible size, and achieve this by never materializing
virtual matches for internal query nodes, using only implicit prefixes of descendant
leaf Deweys. Previous approaches for virtual matches explicitly generate
node identifiers [53, 34].
• Paper 3: The compressed Dewey encoding used in the XLeaf system allows
faster joins, because the Deweys can be compared without decompression.
• Paper 5: The TJStrictPost and TJStrictPre algorithms combine worst-case
optimality [7] with good practical performance [33], bridging a gap between
current twig join algorithms.
• Paper 5: The getPart input stream merger gives inexpensive weak full match
filtering, and a considerable speedup during intermediate result construction
compared to the previous getNext input stream merger [5], which gave weak
subtree match filtering.
2. RQ2: How can pattern matching in the dictionary be done more efficiently?
• Paper 1: EsIe3 was a novel path indexing strategy combining tuple indexing,
selectivity estimates, and nested-loop lookups. It was shown experimentally
to be much faster than previous approaches based on merge joins [32].
• Paper 1: Smfe was a new opportunistic approach to using suffix trees for
matching path patterns, which was considerably faster than previous similar
methods [58].
• Paper 3: To join branch matches when using path indexing, it must be
known how different data paths match the query paths, to make sure different
branches in the query use the same data nodes for branching points. To
avoid storing the potentially exponential number of ways the paths can match
the data, the XLeaf system stores, for each path matched, the set of usable
matches for the nearest ancestor branching nodes.
• Paper 3: Using the meta-information from the path summary matching, for
each leaf stream alignment, the joins in the XLeaf system use a simple bottom-up
then top-down traversal of the query tree, which determines which branching
nodes can be used, and whether or not the leaf Deweys match down to
these branching points.
3. RQ3: How can structure indexes be constructed faster and using less space?
• Paper 2: We determined the worst-case space complexity of a previous path
indexing strategy building on an extension of suffix trees [58]. This gives a
more balanced view of the usefulness of this data structure.
• Paper 3: We showed in the XLeaf system how multiple materialized views
of a structure index could be made affordable with compression and shared
dictionaries in a column store back-end. The system used less space for three
materialized views than a state-of-the-art system without such compression [4]
uses for one view.
• Paper 6: The cost of construction of the forward and backward path index for
tree data was reduced from loglinear to linear, improving the usability of these
strong structure indexes [28].
3.4.2 Opportunities revisited
In Paper 4 we listed a number of different research opportunities, and below I list the
developments I have found for each of them, in my own research and others'. It may be
advantageous to skip this section and revisit it after reading the paper. As Paper 4 was
written after the XLeaf system in Paper 3 was developed, I already had partial answers
to some of the questions I posed, but at the time of writing Paper 4, I did not know
whether or not Paper 3 would be published.
1. Removing redundant computation in top-down one-phase joins - This is a point we
did not pursue in the twig joins in Paper 5, something which could have resulted in
even faster input stream merging.
2. Top-down memory usage - The TJStrict algorithms in Paper 5 use strict matching
equivalent to what was proposed here to reduce the number of nodes added to
intermediate results.
3. Bottom-up memory usage - It is possible to add some more inexpensive space savings
to the TJStrict algorithms as proposed in Paper 4, and to use early enumeration
when the topmost query root match is closed [7]. More aggressive space savings,
requiring more dynamic storage of intermediate results, have been presented recently
[35].
4. Stronger filters - This topic was treated extensively in Paper 5.
5. Unification or assimilation - Paper 5 may have brought us a small step closer to
unifying top-down and bottom-up approaches, but as stated in the conclusion, it
would be nice to see a solution that is simpler and more elegant than TJStrictPre,
but still as efficient.
6. Holistic effective ancestor skipping - As this is just the combination of two previous
techniques (see Section 2.3.1), it probably holds little academic interest.
7. Simpler and faster skipping data structures - I have seen no advances on this point,
and I am not sure whether or not it is of academic interest.
8. Updates in stronger summaries - During the background research for Paper 6 we
found papers covering updates in indexes based on single-directional bisimulations
for graph data [28, 54, 41], but none for the bidirectional case. This is probably of
less practical interest than the following opportunity.
9. Hashing F&B summaries - My original plan for Paper 6 featured this as the main
contribution, but I have not had time to explore it. I still believe investigating
this could yield interesting results.
10. Exploring summary structures and how to search them - As noted in Paper 4, the
recently proposed ideas of multi-level structure indexes [52] and using twig join
algorithms to do path summary matching [2] may be part of the ultimate structure
indexing package, but a thorough experimental investigation of what benefits the
different techniques give in different cases is still missing.
11. Access methods for multiple matching partitions - A recent paper [3] compared the
advanced access methods in iTwigJoin [8] with simply merge joining the different
streams of matching nodes for a query node, and found that the latter scaled better.
The method we used in our XLeaf system in the case of multiple matching paths
was to read a stream of label-matching nodes, and filter on path using a bit vector.
This is probably more efficient than merging path matches when accessing on path
is not much more selective than accessing on tag. An optimizer could choose which
access method to use based on cost estimates.
12. Improved virtual streams - Paper 3 made some progress on how to store summary
matching meta-information and how to avoid materializing virtual node positions,
but I believe that a better solution is waiting to be designed, and that it would use
features both from Virtual Cursors [53], TJFast [34] and XLeaf [24].
13. Holistic skipping among leaf streams - As noted in Paper 4, Virtual Cursors [53] comes very close to implementing holistic skipping and so-called optimal data access. The only piece missing is to always generate virtual internal query node matches from the leaf below with the greatest Dewey code, as is done in the XLeaf system when the simple linear join is used.
14. Identifying and using difficulty classes - The XLeaf system decides whether or not a linear join algorithm can be used based on whether the tree depth of a node is fixed after the path summary matching. This is a stronger criterion than in previous approaches [42].
3.5 Future work
Below I list a few directions I would pursue as a continuation of my research.
3.5.1 Strong structure summaries for independent documents
This was proposed as Opportunity 9 in Paper 4. The conjecture is that if the underlying data is a set of unconnected graphs, then the structure summary for the forward and backward path partition will also be a set of unconnected graphs. This is shown for a collection of independent tree-shaped documents in Figure 3.1. Note that for trees, the document roots are partitioned into blocks represented by roots in the summary. Also, the only change caused by adding a virtual root for the documents is the addition of a virtual root in the summary.
[Figure 3.1 shows documents D_1 ... D_n above and summary trees S_1 ... S_m below.]
Figure 3.1: Shape of the forward and backward path summary (below) for a set of independent documents with a virtual root (above).
The idea is that when adding a new document, its nodes will either be mapped to summary nodes in a single tree in the summary, or a new tree will be created in the summary. When a new document is seen, the structure summary tree for this document can be constructed independently of the summaries for the previous documents. If this summary tree is label-isomorphic to an existing top-level subtree in the global summary, we translate the classification of the document nodes to the equivalent existing classes; otherwise, the document summary is added as a new top-level subtree in the summary. This means we need a lookup structure on the shapes of the summary trees, for example hashing of sorted trees, or some form of tree automaton.
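The lookup on summary tree shapes could, for instance, hash a canonically sorted form of each tree. A minimal sketch, assuming sibling order is irrelevant for summary identity and representing a tree as (label, [children]); the names are illustrative, not the thesis' data structures:

```python
def canonical_form(tree):
    """Bottom-up canonical form of a labeled tree given as (label, [children]).

    Sibling forms are sorted, so two trees get the same (hashable) form
    exactly when they are label-isomorphic up to sibling order.
    """
    label, children = tree
    return (label, tuple(sorted(canonical_form(c) for c in children)))

def classify(document_summary, known_summaries):
    """Return the existing top-level summary tree equal in shape to this
    document's summary, or register the new one. `known_summaries` maps
    canonical forms to previously seen summary trees."""
    key = canonical_form(document_summary)
    return known_summaries.setdefault(key, document_summary)
```

If `classify` returns an existing tree, the document's node classes can be translated to the equivalent existing classes; otherwise the new summary becomes a new top-level subtree, as described above.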
3.5.2 A simpler fast optimal twig join
The getPart input stream merger presented in Paper 5 has one major flaw: like the original getNext input stream merger [5], it outputs data nodes in so-called local preorder [21]. As discussed in our paper, a global strict preorder is needed when using a global stack to open nodes in preorder and close them in postorder, which in turn is needed for on-the-fly postorder subtree match filtering. In TJStrictPre, a postorder processing pass over the data is therefore used to perform the subtree match filtering, which is required for optimality.

As full weak match filtering removes many nodes, a combination of a getPart variant with global preorder output and a simple tree-based intermediate result construction as described in Section 2.1.6 may be competitive with TJStrictPre. Such an algorithm would be significantly more elegant and simpler to implement.
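The global stack discipline referred to above can be sketched as follows, assuming nodes arrive in global strict preorder with (start, end) region encodings. This is an illustration of the open/close pattern, not TJStrictPre itself:

```python
def open_close_events(nodes):
    """Turn a stream of (start, end) regions, sorted in global preorder,
    into 'open' events in preorder and 'close' events in postorder,
    using a single global stack. The close event is the point where
    on-the-fly postorder subtree-match filtering could hook in."""
    stack = []
    for node in nodes:
        # Anything on the stack that ends before this node starts is
        # fully processed and can be closed (in postorder).
        while stack and stack[-1][1] < node[0]:
            yield ("close", stack.pop())
        stack.append(node)
        yield ("open", node)
    while stack:
        yield ("close", stack.pop())
```

With only local preorder, a node may arrive after a non-ancestor that is still open, which is why a global strict preorder is needed for this single-stack scheme to be correct.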
3.5.3 Simpler and faster evaluation with non-output nodes
As described in Section 2.1.2.1, XPath queries have a single output node. Using a regular twig join algorithm and removing duplicates during post-processing can be very inefficient, but the Full Match Theorem from Paper 5 makes it simple to modify a twig join algorithm to get complexity linear in the input and output also with a single output node. As long as nodes are filtered first on being subtree matchers in postorder, and then on being prefix path matchers in preorder, all remaining nodes are part of a full match. We can then simply output the remaining matches for the output query node. Note that the second filtering pass is easier to implement when using tree-based intermediate data structures as suggested in Section 3.5.2.
The two-pass full match filtering approach may also be useful when dealing with multiple output nodes in so-called generalized tree pattern matching [9, 7]. In cases where the output nodes form a connected tree, linear enumeration is easily obtained: Assume we use intermediate data structures similar to those constructed by Algorithm 2 (page 28), and that the enumeration in Algorithm 1 (page 22) traverses the query in preorder. We can then start the recursive enumeration at the root of the output subtree, recurse only on output query nodes, and output the current match M′ when its size equals the number of output query nodes.
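The enumeration restricted to a connected subtree of output query nodes can be sketched as below. The intermediate-result layout (a dict from (query node, data match) to compatible child matches) is an assumption for illustration, not Algorithm 2 verbatim:

```python
from itertools import product

def enumerate_output_matches(q, m, out_children, result_tree):
    """Yield every output-node combination below data match m of query node q.

    out_children[q] lists q's output query children (the output nodes form
    a connected subtree); result_tree[(q, m)][c] lists the data matches of
    output child c that are compatible with m. Layout is illustrative.
    """
    kids = out_children.get(q, [])
    if not kids:
        yield {q: m}
        return
    # One independent choice per output child: enumerate each child's
    # partial matches, then take the cartesian product across children.
    per_child = [
        [sub for cm in result_tree[(q, m)][c]
             for sub in enumerate_output_matches(c, cm, out_children, result_tree)]
        for c in kids
    ]
    for combo in product(*per_child):
        match = {q: m}
        for part in combo:
            match.update(part)
        yield match
```

Each emitted dict covers exactly the output query nodes reachable from the root of the output subtree, so emission happens precisely when the partial match has full size.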
Based on this approach, it may also be possible to formulate a simpler and more elegant enumeration algorithm for the general case with possibly unconnected output nodes [7].
3.5.4 Ultimate data access shoot-out
It would be interesting to know which would perform better in practice: a system using virtual streams and skipping, improved as suggested in Opportunities 12 and 13 in Paper 4, or a system with explicit streams and skipping, improved as suggested in Opportunity 6. With virtual streams the total amount of data read is probably lower (also when skipping), and the advanced data structures for ancestor skipping are avoided. On the other hand, node encodings enabling the ancestor reconstruction needed for virtual streams have O(d) space usage per encoded position, where d is the maximal depth of the data.

Deciding which is the better approach in practice is no simple task. It would require experienced programmers with good knowledge of modern hardware, man-hours to optimize both solutions sufficiently, and a thorough empirical evaluation.
3.6 Conclusions
This thesis presents a number of contributions on the construction and use of structure indexes, and on twig join algorithms. A strength of the thesis is that it combines work on the practical efficiency of implementations with work on theoretical aspects, that is, asymptotic complexity. A weakness is that it does not consider the broader picture: What is search in XML data used for in practice? And are the advances in twig joins presented here useful for real-world systems? I believe that the answer to the latter question is yes. In Paper 3 we built a system that combined new and previous techniques in a new way, such that a large group of queries were evaluated orders of magnitude faster than in current state-of-the-art open source and commercial systems. This should definitely be of interest to system implementors. Also, I believe that the improved filtering strategies and intermediate data structures introduced in Paper 5 are applicable independently of the data partitioning and data access methods used. Whether the faster construction of F&B-indexes for trees presented in Paper 6 is useful depends on whether F&B-indexing will be considered favorable over, for example, simple path indexing in the future. The recently introduced multi-level indexing [52] may make F&B-indexes more attractive in use.
A lesson learned during my research is a general strategy for how to process trees, which inspired the name of the thesis. When implementing virtual matches in Paper 3, candidate sets for branching nodes are calculated bottom-up in the query tree, before matches are chosen top-down. The Full Match Theorem from Paper 5 states that only nodes part of a full match remain after filtering nodes bottom-up on matched subtrees, and then top-down on matched prefix paths. In Paper 6 it is shown how the maximum forward and backward bisimulation can be computed for tree data by first computing the maximum forward bisimulation bottom-up in the tree, and then refining it to the maximum backward bisimulation top-down.
The work presented in this thesis builds heavily on the research presented by the community during the last ten years. Hopefully some of my contributions will also be built on, and result in further advances in the field.
Bibliography

[1] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002. 2.1.1

[2] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008. 2.2.2, 2.2.4, 10

[3] Radim Bača and Michal Krátký. On the efficiency of a prefix path holistic algorithm. In Proc. XSym, 2009. 2.2.2, 11

[4] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In Proc. SIGMOD, 2006. 3

[5] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002. 2.1.1, 2.1.5, 2.1.5, 2.1.7, 1, 3.5.2

[6] Qun Chen, Andrew Lim, and Kian Win Ong. D(k)-index: an adaptive structural summary for graph-structured data. In Proc. SIGMOD, 2003. 2.2.4

[7] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig2Stack: bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006. 2.1.1, 2.1.2, 2.1.2.1, 2.1.5, 2.1.5, 2.1.6, 2.1.6, 2.1.8, 1, 3, 3.5.3

[8] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005. 2.2.2, 2.2.2, 11

[9] Zhimin Chen, H. V. Jagadish, Laks V. S. Lakshmanan, and Stelios Paparizos. From tree patterns to generalized tree patterns: on efficient evaluation of XQuery. In Proc. VLDB, 2003. 3.5.3

[10] Byron Choi, Malika Mahoui, and Derick Wood. On the optimality of holistic algorithms for twig queries. In Proc. DEXA, 2003. 2.2

[11] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. McGraw-Hill Higher Education, 2001. 2.1.7

[12] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing cursor movement in holistic twig joins. In Proc. CIKM, 2005. 2.3.1.3

[13] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. VLDB, 1997. 2.2.2

[14] Georg Gottlob, Christoph Koch, and Reinhard Pichler. Efficient algorithms for processing XPath queries. In Proc. VLDB, 2002. 2.4

[15] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. IEEE Trans. Knowl. and Data Eng., 2007. 1.3, 2.4

[16] Nils Grimsmo. Dynamic indexes vs. static hierarchies for substring search. Master's thesis, Norwegian University of Science and Technology, 2005. 3.2, 7

[17] Nils Grimsmo. On performance and cache effects in substring indexes. Technical Report IDI-TR-2007-04, NTNU, Trondheim, Norway, 2007. 3.2, 3.3, 7

[18] Nils Grimsmo. Faster path indexes for search in XML data. In Proc. ADC, 2008. (document), 3.2.1

[19] Nils Grimsmo and Truls Amundsen Bjørklund. On the size of generalised suffix trees extended with string ID lists. Technical Report IDI-TR-2007-01, NTNU, Trondheim, Norway, 2007. (document), 3.2.2

[20] Nils Grimsmo and Truls Amundsen Bjørklund. Towards unifying advances in twig join algorithms. In Proc. ADC, 2010. (document), 3.2.4

[21] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Fast optimal twig joins. In Proc. VLDB, 2010. (document), 2.1.5, 2.1.7, 3.2.5, 3.2.5, 3.5.2

[22] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees. In Proc. XSym, 2010. (document), 3.2.6, 3.2.6

[23] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees (extended version). Technical Report IDI-TR-2010-10, NTNU, Trondheim, Norway, 2010. 3.2.6

[24] Nils Grimsmo, Truls Amundsen Bjørklund, and Øystein Torbjørnsen. XLeaf: Twig evaluation with skipping loop joins and virtual nodes. In Proc. DBKDA, 2010. (document), 3.2.3, 12

[25] Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. XR-tree: Indexing XML data for efficient structural joins. In Proc. ICDE, 2003. 2.3.1.2

[26] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003. 2.3.1.2

[27] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007. 2.1.1, 2.1.2, 2.1.5, 2.1.6, 2.1.7

[28] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002. 2.2.3, 2.2.3, 2.2.4, 3, 8

[29] Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, and Raghu Ramakrishnan. On the integration of structure indexes and inverted lists. In Proc. SIGMOD, 2004. 2.2.4, 2.4

[30] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. ICDE, 2002. 2.2.4

[31] Pekka Kilpeläinen. Tree matching problems with applications to structured text databases. Technical Report A-1992-6, Department of Computer Science, University of Helsinki, 1992. 2.4

[32] Krishna P. Leela and Jayant R. Haritsa. Schema-conscious XML indexing. Information Systems, 32:344–364, 2007. 2

[33] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008. 2.1.1, 2.1.2, 2.1.5, 2.1.6, 2.1.7, 1

[34] Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, and Ting Chen. From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005. 2.3.2.3, 3.2.3, 1, 12

[35] Federica Mandreoli, Riccardo Martoglia, and Pavel Zezula. Principles of holism for sequential twig pattern matching. The VLDB Journal, 18(6), 2009. 3

[36] Tova Milo and Dan Suciu. Index structures for path expressions. In Proc. ICDT, 1999. 2.2.2, 2.2.3, 2.2.3, 2.4

[37] Svetlozar Nestorov, Jeffrey D. Ullman, Janet L. Wiener, and Sudarshan S. Chawathe. Representative objects: Concise representations of semistructured, hierarchial data. In Proc. ICDE, 1997. 2.2.2

[38] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 1987. 2.2.3

[39] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007. 2.1.1, 2.1.2, 2.1.5, 2.1.6

[40] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer sequences. In Proc. ICDE, 2004. 2.4

[41] Diptikalyan Saha. An incremental bisimulation algorithm. In Proc. FSTTCS, 2007. 8

[42] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008. 2.4, 1, 14

[43] L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential time (preliminary report). In Proc. STOC, 1973. 2.4

[44] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proc. SIGMOD, 2002. 2.3.2.2

[45] W3C. XPath 1.0, 1999. http://w3.org/TR/xpath. 1.2.1, 1.2.1, 2.1.2.1

[46] W3C. Extensible markup language (XML) 1.0 (fourth edition), 2006. http://www.w3.org/TR/2006/REC-xml-20060816/. 1.2

[47] W3C. XQuery 1.0, 2007. http://w3.org/TR/xquery. 1.2.1, 2.1.2.1

[48] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In Proc. SIGMOD, pages 110–121, 2003. 2.4

[49] Haixun Wang and Xiaofeng Meng. On the sequencing of tree structures for XML indexing. In Proc. ICDE, pages 372–383, 2005. 2.4

[50] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval. PhD thesis, Ludwig-Maximilians-Universität München, 2006. 2.1.4, 2.1.8

[51] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing gigabytes (2nd ed.): compressing and indexing documents and images. 1999. 2.1.8

[52] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data & Knowl. Eng., 2008. 2.2.4, 10, 3.6

[53] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004. 2.3.2.3, 3.2.3, 1, 12, 13

[54] Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. Incremental maintenance of XML structural indexes. In Proc. SIGMOD, 2004. 8

[55] Jeffrey Xu Yu and Jiefeng Cheng. Graph reachability queries: A survey. In Managing and Mining Graph Data, Advances in Database Systems. Springer US, 2010. 2.4

[56] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001. 2.1.1, 2.1.4

[57] Ning Zhang, Varun Kacholia, and M. Tamer Özsu. A succinct physical storage scheme for efficient evaluation of path queries in XML. In Proc. ICDE, 2004. 2.4

[58] Liang Zuopeng, Hu Kongfa, Ye Ning, and Dong Yisheng. An efficient index structure for XML based on generalized suffix tree. Information Systems, 32:283–294, 2007. 3.2.2, 2, 3
Chapter 4
Included papers

"As for me, all I know is that I know nothing."
– Socrates

This chapter contains the research papers that constitute the main body of this thesis. They have been reformatted to fit the format used here, and figures and tables have been moved, resized and rotated to keep them readable. Some other papers and reports I have authored and co-authored can be found in Appendix A.
Paper 1
Nils Grimsmo
Faster Path Indexes for Search in XML Data
Proceedings of the Nineteenth Australasian Database Conference (ADC 2008)

Abstract This article describes how to implement efficient memory-resident path indexes for semi-structured data. Two techniques are introduced, and they are shown to be significantly faster than previous methods for path queries using the descendant axis and wild-cards. The first is conceptually simple and combines inverted lists, selectivity estimation, hit expansion and brute force search. The second uses suffix trees with additional statistics and multiple entry points into the query. The entry points are partially evaluated in an order based on estimated cost until one of them is complete. Many path index implementations are tested, using paths generated both from statistical models and DTDs.
1 Introduction
With the advent of XML and query languages such as XPath and XQuery came the need for efficient ways to query the structure of XML documents. This article focuses on settings where a document collection can be indexed in advance, as opposed to querying on the fly. For efficient solutions to the latter problem, see for example Gottlob et al. (2005). An important component in many systems indexing XML is a path index (Bertino and Kim 1989) summarising and indexing all unique paths seen in the document collection. (Other names for similar structures are representative objects (Nestorov et al. 1997), DataGuides (Goldman and Widom 1997), and access support relations (Kemper and Moerkotte 1992).) For many document collections following schemas, this set of paths will be small compared to the total size of the data. The path index is in some way connected to a value index (or content index), which allows search for words or values.

XPath is a query language allowing search for regular path expressions in XML documents. It is a simple declarative language, but techniques used for XPath queries can also be components in more advanced procedural query languages such as XQuery. Many FLWOR expressions can also be rewritten to simpler XPath expressions (Michiels et al. 2007).
Assume the XML document shown in Figure 1, and the XPath query /a//c/"foo"¹. There are two matches for the path part of the query, and four matches for the value predicate "foo", but only one match for the entire query, which is the third occurrence of "foo". Note that each unique path is only indexed once in a path index. There are two occurrences of the path /a/b in the example document, but there will only be one entry for it in a path index.
An efficient path index is important when indexing large heterogeneous document collections for structural querying. If the data has a very homogeneous structure, the number of unique paths seen is small, and the implementation of the path index itself is not significant. An example where document structure could be very heterogeneous is an enterprise search engine, indexing all information generated by a business. This could be composed of the content from multiple databases, repositories of reports, etc. Another case is search engines for the semantic web, where structural search is a key feature. It is hard to imagine a future for the semantic web without search engines that scale as well as current web search engines.
1.1 Previous approaches
Many approaches for indexing XML and supporting path queries have been proposed in the last ten years. They can mainly be divided into node indexing, path indexing and sequence indexing. In node indexing, one makes an inverted index for all XML tags, and one for all data values and terms. Given a simple path query, look up all the tags and values, and merge the results. To be able to merge hits for XML elements, an encoding which tells whether a node is a child or descendant of another node is needed. Common solutions are the range based (docid, start:end, level) encoding (Zhang et al. 2001), the
¹ Let path/"term" be short for path[normalize-space(text())="term"].
[Figure 1 body not reproduced: the element markup was lost in extraction, leaving only the text values ("foo", "bar") and the Dewey codes 1 through 1.5(.1).]
Figure 1: Example XML document. Dewey order encoding of elements shown on the right.
(post, pre) encoding (Grust 2002), and the prefix-based Dewey order encoding (Tatarinov et al. 2002).
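The ancestor/descendant tests these encodings support can be sketched as follows; the field names and the child-test variant are illustrative assumptions, not any particular system's format:

```python
from collections import namedtuple

# Range-based encoding: (docid, start:end, level), illustrative field names.
Region = namedtuple("Region", "docid start end level")

def is_ancestor_range(a, d):
    """a is an ancestor of d iff d's region nests strictly inside a's,
    within the same document."""
    return a.docid == d.docid and a.start < d.start and d.end < a.end

def is_child_range(a, d):
    """The level field turns the descendant test into a child test."""
    return is_ancestor_range(a, d) and d.level == a.level + 1

def is_ancestor_dewey(a, d):
    """Dewey test: a, e.g. (1, 3), is an ancestor of d, e.g. (1, 3, 2),
    iff a's code is a proper prefix of d's."""
    return len(a) < len(d) and d[:len(a)] == a
```

Both tests are constant time per pair (up to the depth-dependent prefix comparison for Dewey codes), which is what makes merge-based structural joins over sorted occurrence lists feasible.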
A "brute force" merge between element hits may be extremely inefficient. A lot of research has gone into finding better merge algorithms in terms of time and IO-complexity: MPMGJN (Zhang et al. 2001), EE/EA-join (Li and Moon 2001), tree-merge and stack-tree (Al-Khalifa et al. 2002), Anc Des (Chien et al. 2002), PathStack and TreeStack (Bruno et al. 2002), XR-tree and XR-stack (Wang and Ooi 2003), TSGeneric (Jiang et al. 2003), and TJFast (Lu et al. 2005). Even though these algorithms are very efficient in terms of their input and output, many systems which use them perform a lot of unnecessary work. Assume for example the query /a/b/c/"foo", and that a is the first element of the path for half of the data values stored in the database, while c is seen only a few times. To do the merge, either the entire occurrence list for a must be read from disk, or every occurrence of c must be looked up in some data structure for a on disk. It is obvious that methods which utilise the varying selectivity of the query elements or the full paths will be faster.
Various methods which use index structures other than inverted lists have been proposed. The index fabric (Cooper et al. 2001) maintains a layered Patricia trie of all paths seen in the data. It is organised in multiple levels, so that a path query results in a number of block lookups logarithmic in the total size of the data. A problem with this index structure is that only queries starting at the document root element are efficiently supported.
When the structure of XML documents is highly regular, the number of unique paths seen in a collection will be very small compared to the total size of the data. This set of unique paths can often fit in main memory, and be searched very efficiently. The first use of path indexing known to the author is by Bertino and Kim (1989), while perhaps the best known is the DataGuide (Goldman and Widom 1997) used in the Lore DBMS for semistructured data.
Let a path summary denote a summary of all paths which is not indexed for fast matching. Perhaps the simplest use of path summaries is by Buneman et al. (2005), where all paths are extracted from the data and maintained as an in-memory "skeleton". For each path seen, there is an on-disk vector containing all terms and values seen below instances of this path. Weaknesses of this approach are that a full search for matching paths in the skeleton is required when paths do not start with the root element, and that a brute force (or binary) search through the vector is necessary for queries with value predicates. The ToXin system (Rizzolo and Mendelzon 2001) improves the latter by maintaining for each path an index over the values seen below it. The strength of ToXin is efficient matching of twig queries, achieved by storing navigational information for the data in the nodes of the index. A further improvement is ToXop (Barta et al. 2005), in which query plans are made based on the selectivity of the path query elements, and clever combinations of merges and searches are used. A potential weakness is that if a query does not start with a root element, a brute force search through the path summary is required to match the path expression.
An enhancement over a brute force search through the summary is to make an inverted index over the paths on their tags. Given a path query, the individual tags are looked up in the path index and the results merged. This is used in SphinX (Poola and Haritsa 2007), where there is a value index for each path (as in ToXin). When the path index is of considerable size, the merging can be costly. The systems APEX (Chung et al. 2002) and XIST (Runapongsa et al. 2004) address this by maintaining index entries for sub-paths of lengths greater than one on demand.
A simple and elegant system for XML indexing using path indexing in an RDBMS is XRel (Yoshikawa et al. 2001). One of the four tables used is a mapping from paths to integer identifiers. All text and values indexed have a reference to the path under which they reside, and path matching is done using simple LIKE queries with wild-cards in the path table. Similar solutions are used in many systems based on RDBMSs.
A problem with keeping a separate value index for each path arises when many paths match the query. The worst case scenario is when the query consists of only a value predicate. This results in many disk accesses, unless the indexes are stored in some interleaved fashion, grouped on the value key. An alternative is to have a single value index, where the occurrences of a value are stored with their parent path ID. After the entry for a value has been found, the occurrences are filtered on matching path IDs. If the occurrence list is large, it can be stored sorted on path ID, and pointers into the list can be used to avoid having to read all of it (known as skip lists).
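As a sketch of this single-index alternative, the filtering step might look as follows, with the occurrence list kept sorted on path ID and a binary search used to skip to each matching ID. The data layout and names are illustrative assumptions, not the implementation evaluated later in this article.

```python
from bisect import bisect_left

# Occurrence list for one value, kept sorted on path ID:
# (path ID, Dewey order of the path instance) pairs.
occurrences = [(1, "1.1"), (2, "1.2.1"), (2, "1.4.1"), (3, "1.3.1.1")]

def filter_on_paths(occurrences, matching_path_ids):
    """Keep only occurrences under a matching path, skipping directly to
    each path ID with a binary search instead of scanning the whole list."""
    hits = []
    for pid in sorted(matching_path_ids):
        i = bisect_left(occurrences, (pid, ""))  # first entry for pid, if any
        while i < len(occurrences) and occurrences[i][0] == pid:
            hits.append(occurrences[i])
            i += 1
    return hits
```

With matching path IDs {3, 7}, the filter skips past the entries for paths 1 and 2 and returns only the single occurrence under path 3.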
When the path summary fits in main memory, the choice of index structures suitable for implementing it is greater than if it had to reside on disk. One structure which is only efficient in main memory is the suffix tree (McCreight 1976). PIGST (Zuopeng et al. 2007) is a system maintaining a generalised suffix tree as the path index.² See Section 2.4 for a description of this solution. A more common use of (often pruned) suffix trees is selectivity estimation for optimising query plans (Aboulnaga et al. 2001, Chen et al. 2001).
A method very different from node and path indexing is sequence indexing, where all documents are converted to a sequence representation, and searching is done by subsequence matching. ViST³ (Wang et al. 2003) is a system using this approach. An advantage is that twig queries can be answered without merging partial results. A problem with ViST is that the index has quadratic size in the worst case, if the indexed trees are very deep. PRIX (Rao and Moon 2004) solves this by taking a different approach to the sequencing, using Prüfer sequences. Wang and Meng (2005) use a representation similar to that in ViST, but with a cleverer sequencing they optimise for smaller indexes and faster queries. The querying process also becomes much simpler than in ViST and PRIX.
1.2 Contributions
This article describes how to do efficient path matching. It is assumed that there is an overlying system similar to what is common when using path indexing (see Section 2.1).
• It is shown that combining an inverted index for the path summary with brute force search is in practice much faster than merging path element hits. The methods introduced exploit the varying selectivity of the query path elements.
• It is shown how the use of a generalised suffix tree can be enhanced by adding statistics to the tree nodes and changing the way searches are performed. Multiple entry points into the query are partially evaluated in parallel, depending on the evaluation cost.
• Many path index implementations are compared, using paths generated from statistical models and from DTDs.
2 Path index implementation
Below follow descriptions of various solutions for implementing path summaries.
2.1 Assumptions
This article only addresses the implementation of the path index, and assumes that an overlying system with the following design is using it: All values and terms seen in the document collection are indexed by ordinary inverted lists. Stored in each entry is information encoding the document ID, the position in the document, the value's parent path identifier, and a local specifier for the path instance (range based or prefix based). Figure 2 shows the value index for the example XML document in Figure 1, with the path ID and the Dewey order encoding of the path instance shown. Document ID and position are omitted for brevity. The enumeration of the paths is shown in Figure 3.

² The authors make some extensions which make the suffix trees super-linear in size, seemingly without considering this.
³ Stands for Virtual Suffix Tree, but only due to a misconception by the authors.
"foo": 1,1.1  2,1.2.1  3,1.3.1.1  2,1.4.1
"bar": 4,1.3.2.1  5,1.3.2.2.1  7,1.5.1

Figure 2: Value index for the example XML from Figure 1, storing path ID and path instance Dewey order. Document ID and position omitted.
1. /a
2. /a/b
3. /a/b/c
4. /a/b/b
5. /a/b/b/a
6. /a/b/b/a/b
7. /a/c

Figure 3: Enumeration of the unique paths seen in the XML in Figure 1. This is the set of paths which would be indexed by a path index.
Given a non-branching XPath query with a value predicate, all matching paths are found with the path index. The value is then looked up in the value index, and the hits are filtered with the set of matching paths. For the query /a//c/"foo", the paths matching /a//c in the example XML are numbers 3 and 7. The only occurrence in the list for "foo" under a matching path is (3, 1.3.1.1). For XPath queries without value predicates, an index over occurrences of XML tags should be maintained in addition. The value index may also have the occurrence lists split/sorted on path ID for faster filtering, as in ToXin (Rizzolo and Mendelzon 2001) and SphinX (Poola and Haritsa 2007).
It is assumed that representatives for the unique paths seen in a document collection fit in main memory, and further that any index structure linear in their size does as well. Only “simple” path expressions are considered, not twigs. An example of a twig XPath query is /a/b[c/"foo" and b/"bar"]. It is assumed that a system using the path index here would perform two queries (one for each branch in the twig), and merge the results. Here the merge would check for a common prefix of length three in the Dewey encoding of the paths. Note that the problem is much more involved in general; unordered tree inclusion is even NP-complete (Kilpeläinen 1992). The reason twigs are not treated here is that in most cases the leaves of the twigs will be value predicates (as in the given query), which will have to be looked up in the value index in any case, given the overall system design.
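Such a branch merge can be sketched as follows, under the assumption that branch hits are given as Dewey order strings and that both branches must descend from the same node, i.e. agree on a Dewey prefix whose length corresponds to the depth of the branching node. The function names are illustrative, not from the implementation.

```python
def dewey_prefix(dewey, k):
    """First k components of a Dewey order string,
    e.g. ("1", "3") for "1.3.1.1" with k = 2."""
    return tuple(dewey.split(".")[:k])

def merge_branches(hits_a, hits_b, k):
    """Keep hits from branch a that share a length-k Dewey prefix with
    some hit from branch b, i.e. both descend from the same tree node."""
    prefixes_b = {dewey_prefix(d, k) for d in hits_b}
    return [d for d in hits_a if dewey_prefix(d, k) in prefixes_b]
```

On the example data, the "foo" branch hit 1.3.1.1 and the "bar" branch hit 1.3.2.1 agree on the prefix 1.3 of the shared /a/b node, so the twig matches there.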
Below follow descriptions of the various path index data structures and matching approaches.
2.2 Brute force search
The simplest way to implement a path index is to store the paths seen in a list, and perform brute force searches for path expressions through this list. Given regular path expressions, a deterministic finite automaton (DFA) for the query can be built (Aho et al. 1986). A DFA can be exponential in the size of the query, but in most cases queries can be considered to be of constant length. Using a DFA gives a linear time scan through the data. For document collections with much data but a small schema, a brute force search may be a sufficient solution, as the scan through memory is relatively cheap compared to the disk accesses needed for the lookup into the value index.
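As an illustration of this approach, a path expression can be compiled into a regular expression (a regex engine's compiled pattern standing in for the DFA) and matched against every stored path. The sketch below assumes paths are stored as '/'-separated tag strings and that queries must match to the end of a path, as required in the experiments later in this article; the names are illustrative.

```python
import re

def compile_query(query):
    """Compile a path expression into a regex over '/'-separated tag
    strings.  '/' is the child axis, '//' the descendant axis, and '*'
    a one-tag wild-card.  Matches are anchored at the end of the path."""
    rx, i = "^", 0
    while i < len(query):
        if query.startswith("//", i):
            # zero or more whole tags may sit in between
            rx += r"(?:[^/]+/)*" if rx == "^" else r"/(?:[^/]+/)*"
            i += 2
        elif query[i] == "/":
            if rx != "^":
                rx += "/"
            i += 1
        else:
            j = query.find("/", i)
            j = len(query) if j == -1 else j
            tag = query[i:j]
            rx += "[^/]+" if tag == "*" else re.escape(tag)
            i = j
    return re.compile(rx + "$")

def brute_force(paths, query):
    """Return the 1-based IDs of all paths matching the expression."""
    rx = compile_query(query)
    return [pid for pid, path in enumerate(paths, 1) if rx.match(path)]

# The unique paths of Figure 3.
paths = ["a", "a/b", "a/b/c", "a/b/b", "a/b/b/a", "a/b/b/a/b", "a/c"]
```

For /a//c this yields paths 3 and 7, as in the running example.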
2.3 Inverted list solutions
When inverting paths, each tag in a path is treated as a symbol. For each symbol, the index contains a list of the positions in which it occurs, given as path ID and position within the path. An index for the paths in Figure 3 is shown in Figure 4. Whether the index should store pairs of path ID and position, or store the path ID and a list of occurrence positions within the path, depends on the expected lengths of these lists. In an implementation not using compression, an additional integer would be needed for storing the length of each list. This means the latter approach is more space efficient when the expected list length is greater than two, which could happen with recursive document schemas. The approach using simple pairs was chosen for simplicity in the implementations used here.
a: 1,1 2,1 3,1 4,1 5,1 5,4 6,1 6,4 7,1
b: 2,2 3,2 4,2 4,3 5,2 5,3 6,2 6,3 6,5
c: 3,3 7,2

Figure 4: Inverted lists for the paths seen in the example XML, storing path ID and position within the path.
Given a path query using only the child axis, each element is looked up, and the results are merged where the elements are adjacent in paths. For the query //a/b/c, merging left to right, first merge the hits for a and b, and keep all hits with adjacent elements. The reason all hits are needed, even though the final output is only path IDs, is that it is not yet known which hits for a/b have adjacent hits for c. After merging with c in the example, a match in path 3 is left, from position 1 to 3.
For the descendant axis, hits need not be adjacent, only in the correct order. Given that the element hits are merged left to right, only the hit with the leftmost right border needs to be passed on to the next step in the merge. For the query //a//b//c, first merge the hits for a and b, and keep at most a single match in each path, one with b as far to the left as possible. Then merge this hit set with the hits for c.
For queries using both the child and descendant axis, the hits for elements in a parent–child relationship should be merged first, then elements in an ancestor–descendant relationship. This is because the former needs all intermediate hits, while the latter does not.
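The two merge steps can be sketched as follows, with hits represented as (path ID, start, end) spans, where a single-element hit at position p is (pid, p, p). This is an illustrative sketch, not the implementation used in the tests.

```python
def merge_child(hits_a, hits_b):
    """Child axis: join spans from a with spans from b that start
    immediately after them, in the same path.  All combined spans are
    kept, since later child-axis steps need the exact positions."""
    by_start = {(pid, s): e for pid, s, e in hits_b}
    out = []
    for pid, s, e in hits_a:
        if (pid, e + 1) in by_start:
            out.append((pid, s, by_start[(pid, e + 1)]))
    return out

def merge_desc(hits_a, hits_b):
    """Descendant axis: per path, keep at most one combined span, the
    one whose right border is leftmost; order matters, adjacency does not."""
    best = {}
    for pid, s, e in hits_a:
        for pid2, s2, e2 in hits_b:
            if pid2 == pid and s2 > e and (pid not in best or e2 < best[pid][2]):
                best[pid] = (pid, s, e2)
    return sorted(best.values())

# The single-element hits of Figure 4, as (path ID, start, end) spans.
a = [(1,1,1), (2,1,1), (3,1,1), (4,1,1), (5,1,1), (5,4,4), (6,1,1), (6,4,4), (7,1,1)]
b = [(2,2,2), (3,2,2), (4,2,2), (4,3,3), (5,2,2), (5,3,3), (6,2,2), (6,3,3), (6,5,5)]
c = [(3,3,3), (7,2,2)]
```

For //a/b/c, merging left to right leaves the single match in path 3 from position 1 to 3, as described in the text.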
2.3.1 Indexing tuples
The performance of path queries using the child axis can be greatly improved by indexing pairs, triples, or even longer substrings in the inverted lists for the paths. What is indexed can be decided dynamically, as in APEX (Chung et al. 2002) or XIST (Runapongsa et al. 2004), or statically, as is done here for simplicity. If the size of the data (in this case the paths) is large compared to the alphabet, the space overhead associated with starting a list in the index is small compared to the size of the list contents. In this case, an index for all pairs will not require much more space than an index for all single elements.
It is expected that using pairs or triples will greatly reduce query cost in practice, as these probably will have much better selectivity than single elements.
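Static indexing of singles and pairs can be sketched as follows, with entries in the (path ID, position) format of Figure 4; the names are illustrative.

```python
from collections import defaultdict

def build_index(paths, max_tuple=2):
    """Inverted lists over all tag tuples up to length max_tuple.
    Each entry is (path ID, 1-based start position within the path)."""
    index = defaultdict(list)
    for pid, path in enumerate(paths, 1):
        for i in range(len(path)):
            for length in range(1, max_tuple + 1):
                if i + length <= len(path):
                    index[tuple(path[i:i + length])].append((pid, i + 1))
    return index

# The paths of Figure 3, as tag sequences.
paths = [("a",), ("a", "b"), ("a", "b", "c"), ("a", "b", "b"),
         ("a", "b", "b", "a"), ("a", "b", "b", "a", "b"), ("a", "c")]
index = build_index(paths)
```

Even on this tiny example the list for the pair (b, b) has three entries where the list for b alone has nine, which is the selectivity gain the tuples are meant to provide.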
2.3.2 Extending possible hits
For queries using the child axis, a simple trick can be used to improve performance. Assume only single elements are indexed, and the query is /a/b/c. If a is the root element of every second path, and b exists in around half the paths, but c is seldom seen, the size of the result will be very small compared to the total cost of the merges. The cost of merging large and small hit sets can be reduced by performing binary searches in the large set. The merges can also be done out of order to reduce the cost.
A simpler and more efficient solution is possible. Since all paths reside in main memory, checking single elements in the paths is very cheap. Take the hits for the most selective element, and for each one, check whether it is preceded and succeeded by the needed elements. This avoids merges with larger hit sets from elements with poorer selectivity. The method can be combined with indexing pairs and triples.
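A sketch of this check, assuming a child-axis-only query given as a list of tags, hits for the chosen element from the single-element index, and the paths available in memory. The names are illustrative, and a root-anchored query would additionally require the match to start at position 1.

```python
def extend_hits(paths, query, sel, hits):
    """query: tags joined by the child axis; sel: index in the query of
    the most selective tag; hits: (path ID, 1-based position) entries
    for that tag.  Each hit is checked directly against the in-memory
    path, avoiding merges with the less selective elements."""
    out = []
    for pid, pos in hits:
        path = paths[pid - 1]
        start = pos - 1 - sel              # 0-based start of the full match
        if start < 0 or start + len(query) > len(path):
            continue
        if all(path[start + k] == tag for k, tag in enumerate(query)):
            out.append((pid, start + 1))   # report 1-based start
    return out

# The paths of Figure 3, as tag sequences.
paths = [("a",), ("a", "b"), ("a", "b", "c"), ("a", "b", "b"),
         ("a", "b", "b", "a"), ("a", "b", "b", "a", "b"), ("a", "c")]
```

For the query /a/b/c with c as the selective element, only the two hits for c are inspected, instead of merging the long lists for a and b.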
2.3.3 Estimate, choose, brute
Expensive merges on the descendant axis can also be avoided when a part of the query has good selectivity. Assume the XPath query /a//c, where c has good selectivity, but a does not. As the paths are relatively short strings, a brute force search through the set of paths containing c should be more efficient than a merge. The memory management overhead of handling intermediate hit sets is also avoided.
2.4 Suffix tree solutions
A generalised suffix tree for a set of strings is a compacted trie for all suffixes of the strings (McCreight 1976). An example tree is shown in Figure 5. This structure can be built in time and space linear in the total length of the strings for constant and integer alphabets (Farach 1997). For general alphabets the complexity is O(n log |Σ|). The implementation used in this article combines the child arrays from Grimsmo (2005, 2007) and hashing. The index can decide whether a given string exists as a substring in the set in expected time linear in the length of the string, and all hits can then be extracted in time linear in their number. The set of paths in a path summary can be seen as a set of strings, where XML tags are string symbols, and indexed with the suffix tree. This requires that the suffix tree implementation can handle the possibly large alphabet of XML tags. If this is not the case, the paths can be spelled out separated by delimiters, and these longer strings can be indexed. In the implementation used here, tags were mapped to an integer alphabet used as string symbols.
Figure 5: Generalised suffix tree for the strings abba and abbb.
If only the child axis is used, all matches are contiguous sub-paths, and the node whose subtree represents all hits can be found in optimal time. But not all occurrences of the sub-path are needed, only the set of paths containing them. As an example, a/b occurs twice in the path /a/b/b/a/b, but given the query //a/b//"foo", only the fact that it occurs is of interest. In PIGST (Zuopeng et al. 2007) this is solved by storing in each internal node of the tree the set of path IDs seen in the subtree below. For random paths from a uniform distribution, the average ratio between the size of the subtree and the size of the list of path IDs will be inversely proportional to the average path length. Another argument for their solution is that the nodes in a suffix tree built with any known linear construction algorithm have bad spatial locality, while the path IDs stored in the lists in PIGST have perfect locality. This is of importance also in main memory, because of the caches of modern computers. No bounds on space usage or construction time for their extended suffix tree are given (Zuopeng et al. 2007), but both are super-linear, even in the average case (Grimsmo and Bjørklund 2007). Finding the set of strings which contain a given substring is known as the document listing problem, and can actually be solved optimally with linear preprocessing (Muthukrishnan 2002).
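The effect of storing path ID lists in internal nodes can be illustrated with a naive, uncompacted suffix trie. Its construction is quadratic, unlike a proper compacted suffix tree, so this is a sketch of the lookup behaviour only, not of the structures compared in this article.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.path_ids = set()   # IDs of all paths with a suffix through here

def build_suffix_trie(paths):
    """Insert every suffix of every path; each node records which paths
    pass through it, so document-listing answers are read off directly."""
    root = Node()
    for pid, path in enumerate(paths, 1):
        for i in range(len(path)):
            node = root
            for tag in path[i:]:
                node = node.children.setdefault(tag, Node())
                node.path_ids.add(pid)
    return root

def paths_containing(root, query):
    """IDs of the paths containing the child-axis-only query as a sub-path."""
    node = root
    for tag in query:
        if tag not in node.children:
            return set()
        node = node.children[tag]
    return node.path_ids

# The paths of Figure 3, as tag sequences.
paths = [("a",), ("a", "b"), ("a", "b", "c"), ("a", "b", "b"),
         ("a", "b", "b", "a"), ("a", "b", "b", "a", "b"), ("a", "c")]
trie = build_suffix_trie(paths)
```

Note that a/b occurs twice in path 6, but the node for it stores the path ID only once, which is exactly the set-of-paths answer the query needs.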
2.4.1 Intersect, brute
The straightforward way to search for a path expression using the descendant axis is to first search for the node representing the first part of the query (using only the child axis), and then do a full recursive search of the subtree below. To avoid this, PIGST does a separate search for each part of the query, intersects the resulting sets of path IDs, and performs a brute force search through the set of possible paths. Note that this merging on the descendant axis is different from what is described earlier in Section 2.3, as sets of path IDs are intersected, not sets of hits in paths, where in-path order is of importance.
A variant of this is to take only the smallest set of path IDs, and perform a brute force search through the respective paths. This is similar to what was described for inverted files in Section 2.3.3. One difference is that if a part of a query using only the child axis has been matched in the tree, the size of the hit set does not need to be estimated, as it is known. Another is that the set of matches is extracted without overhead, no matter how long the query part is. For inverted lists, merging or hit expansion was necessary if the part was longer than the tuples indexed.
Skipping the intersection and just doing the brute force search through the smallest set should pay off when the total length of the paths in the smallest set of possible paths is less than the number of path IDs in the largest set. This could happen often if the paths were generated by a source with a skewed distribution.
2.4.2 Selective suffix tree traversal
Below follows the description of a novel algorithm using multiple entry points into the query. Pseudo-code is given in Algorithm 1. The nodes of an ordinary generalised suffix tree are each extended with a number giving the size of the subtree below the node. This allows for more intelligent traversal of trees and queries. Two variants are used, where the second uses two suffix trees.
For the first variant, the number of entry points into a query is equal to the number of parts separated by the descendant axis. Given the query /a/b//c//d/e, there are entries starting at a, c and d.
All entry points are kept in a priority queue, ordered on the expected cost of evaluating them one step further. As soon as one is completely evaluated, the matches are extracted. Each entry point may during evaluation be at multiple positions in the suffix tree, if wild-cards or the descendant axis have been used. The cost of moving an entry point forward in the query is the sum of the costs for moving downward at each position held in the tree, with a reduction for having advanced further into the query. Evaluating a step of the child axis costs 1, except when the next symbol is a wild-card, where the cost is equal to the number of children of the current node in the suffix tree. For the descendant axis, the cost of moving one step forward in the query is equal to the size of the subtree below the current node. For an entry point which started inside the query, and has been expanded all the way to the end, the cost of evaluating it “one step further” is an estimate of the cost of a brute force search through the paths with the partial match, which must be done to get a full match.
Input: path expression P, suffix tree ST
Output: set of matching paths
Q ← PriQueue(getEntryPoints(P));
while not complete(front(Q)) do
    ep ← pop(Q);
    next ← {};
    foreach p ∈ ep.positions do
        next ← next ∪ advance(p);
    end
    c ← 0;
    foreach p ∈ next do
        c ← c + nextAdvanceCost(p);
    end
    ep.positions ← next;
    ep.advanceCost ← c;
    push(Q, ep);
end
ep ← front(Q);
return ep.matches;

Algorithm 1: Selective suffix tree traversal
The other variant of this method also uses a suffix tree for the reverse representation of the paths. It has additional entry points moving backwards from the last element in each part of the query, doing the matching in the second suffix tree. For the example query, there would also be entry points moving backwards from b, c and e. A variant using only the suffix tree for the reversed paths is also included in the tests.
2.4.3 Skipping leading wild-cards
A simple variant of the basic use of a suffix tree can be used when the query starts with a wild-card. The leading wild-cards can be omitted from the query, and after the hits have been retrieved, they can be filtered on their starting positions in the paths.
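A sketch of this filter, assuming the suffix tree lookup returns (path ID, start position) hits for the query with its leading wild-cards stripped; the names are illustrative.

```python
def filter_leading_wildcards(hits, n_wild, root_anchored):
    """hits: (path ID, 1-based start) for the query without its leading
    wild-cards.  A hit is valid if enough tags precede it to stand in
    for the wild-cards: exactly n_wild tags if the query was anchored
    at the root, at least n_wild otherwise."""
    if root_anchored:
        return [(pid, s) for pid, s in hits if s == n_wild + 1]
    return [(pid, s) for pid, s in hits if s >= n_wild + 1]

# Hits for the sub-path a/b in the paths of Figure 3.
hits_ab = [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (6, 4)]
```

For a query with one leading wild-card and no root anchoring, only the occurrence starting at position 4 in path 6 has a preceding tag to consume the wild-card.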
3 Experimental results
The tests were run on an AMD64 3500+ running Linux 2.6.16 compiled for AMD64. All tested implementations were written in Java, and run with 64-bit Sun Java 1.5.0_06. As the Java virtual machine often shows a radical speedup from “warming up” (optimising byte-code), and as many of the solutions shared code, some measures had to be taken to give them fair treatment. Each test was run repeatedly with all implementations, until none of them showed a deviation of more than 2% in total running time for the test from the previous attempt. This ensured that all implementations got a proper and fair warmup.
Some care also had to be taken when measuring the memory usage, as Java relies on garbage collection. The garbage collector was called multiple times before the indexing process started, and the base memory usage was measured.⁴ It was called again multiple times after the indexing finished, and the difference from the base memory usage was then recorded. If the garbage collector was run only a single time, the space usage measured differed greatly between runs.
3.1 Data generation
The paths used in the tests were generated both from statistical models and from DTDs. A uniform distribution and Zipf distributions were used, in addition to first order Markov models with underlying Zipf distributions.
The DTDs used for path generation were taken from the benchmarks DBLP (Ley 2007), GedML (Kay 1999), Michigan (Runapongsa et al. 2003), XBench (Yao et al. 2003), XMach-1 (Böhme and Rahm 2001), XMark (Schmidt et al. 2001) and XOO7 (Bressan et al. 2003). Paths were generated by a breadth first search through the space of possible paths defined by the DTDs. Some of the DTDs were not used alone, as they were small and/or non-recursive. In tests using multiple DTDs, a breadth first search through their complete space was done, such that all paths of length k were generated before any path of length k + 1. For the data from statistical distributions, path lengths were drawn randomly from 1 to 10.
Queries for paths from DTDs were generated as follows: The length of the query was drawn from its distribution (default: uniform from 1 to 5). Then, for each position in the query, the use of the child or descendant axis was chosen (default: p(desc) = 0.3). If the descendant axis was not used at all, a random path of the specified length was drawn and used as the query. If not, a random path of at least the specified length was drawn. Then the descendant axis operators were inserted into the query at random locations, and random path elements next to them were removed until the query had the specified length. This procedure ensured that all queries related to the generated data. The probability that a query required matches to start at a root element was set to 0.5. All queries were required to match the end of a path. Finally, the probability that a path element was substituted with a wild-card was by default 0.1.
3.2 Methods tested
The following methods were used in the tests:
Br Brute force search through all strings. (See Section 2.2)
MgInv Inverted lists and merging. (2.3)
MgIe1 Inverted lists, selective entry point in contiguous part, expanding on child axis, merging on descendant axis. (2.3.2)
MgIe2 Indexing singles and pairs. (2.3.1+2.3.2)

⁴ Runtime.totalMemory() - Runtime.freeMemory()
MgIe3 Indexing singles, pairs and triples. (2.3.1+2.3.2)
EsIe[1,2,3] Estimating the contiguous part with fewest hits, extracting possible paths, filtering with brute force. (2.3.2+2.3.3)
St Straightforward use of a suffix tree. (2.4)
Ss Suffix tree, skipping leading wild-cards, and filtering on start position in the string. (2.4.3)
InSe Suffix tree with path ID lists in internal nodes. Intersection of path ID sets on the descendant axis and brute force through the result, as described in (Zuopeng et al. 2007). (2.4.1)
EsSe Suffix tree with path ID lists in internal nodes. Finding the contiguous part of the query (no descendant axis) with fewest matches, and brute force search through the corresponding set of paths. (2.3.3+2.4.1)
Sm[f,r,2] Suffix tree(s) with multiple entry points. Testing a single tree with forward strings, a single tree with reversed strings, and two trees, one forward and one reversed. (2.4.2)
Smfe Suffix tree enhanced with path ID lists, using multiple entry points, using only the forward tree. (2.4.1+2.4.2)
3.3 Tests using various data sources
Table 1 shows query performance for the tested methods. 10000 paths were indexed, drawn from the different data sources. 5000 queries of length 1 to 5 were run, with a 0.3 probability of using the descendant axis and a 0.1 probability of wild-cards. See later tests for variations over this. Table 2 shows more measures for the test using all the DTDs.
% Br MgInv MgIe1 MgIe2 MgIe3 EsIe1 EsIe2 EsIe3 St Ss InSe EsSe Smf Smr Sm2 Smfe
U 10 1.8 1068 5637 1426 755 715 807 206 164 350 365 241 150 148 224 148 99
U 100 0.2 1012 937 164 82 81 96 25 24 81 51 68 57 22 29 26 19
Z 100,0.7 0.4 1021 1719 306 186 182 117 41 37 126 106 92 55 35 44 41 28
Z 100,1.0 0.9 1038 3658 691 448 437 223 89 77 249 241 156 76 73 90 77 55
Z 100,1.3 2.3 1094 8067 1541 1097 1052 483 219 181 543 549 320 157 165 275 166 117
MZ 100,0.7 0.2 1013 957 177 94 94 96 28 26 92 73 64 48 23 33 26 19
MZ 100,1.0 0.3 1020 1132 230 139 136 108 35 33 144 128 76 46 27 38 30 21
MZ 100,1.3 0.5 1038 1394 307 205 204 137 48 46 215 205 92 50 36 53 40 25
DBLP.dtd 2.4 1241 9455 2283 1724 1673 804 326 271 985 1012 559 262 239 533 286 170
GedML.dtd 1.2 1181 7504 1583 1205 1205 460 119 117 731 737 394 92 75 151 98 56
XMark.dtd 3.2 1347 9754 2382 1731 1689 942 401 359 1097 1137 590 339 336 1074 307 263
*.dtd 0.6 1097 3222 715 600 599 161 78 73 365 369 201 62 61 115 85 50

Table 1: Microseconds per query, average. Testing a uniform distribution (parameter |Σ|), Zipf distributions (|Σ|, s), first order Markov models with underlying Zipf (|Σ|, s), and various DTDs. The second column shows average query selectivity in per cent.
Paper 1: Faster Path Indexes for Search in XML Data

            Br MgInv MgIe1 MgIe2 MgIe3 EsIe1 EsIe2 EsIe3   St   Ss InSe EsSe  Smf  Smr  Sm2 Smfe
µs/q dev   336  3214  1107  1063  1074   304   220   231  680  691  355  181  170  365  245  132
µs/path      0     2     2     3     5     2     3     6   14   14   21   21   14   17   33   22
b/elem       4    18    18    33    50    18    33    50   48   48   76   76   48   69  114   76

Table 2: Indexing paths from all DTDs. Showing the standard deviation of microseconds per query, microseconds per path when indexing, and bytes per path element in the complete index.

The brute force solution (Br) serves as a base case for comparison. It is faster than some methods on many of the tests, as these methods have to merge very large sets. Performance is also related to query selectivity: on the test with the poorest average selectivity, the fastest method is only five times faster than the brute force search, while on the test where the selectivity is best, the fastest method is more than 50 times faster. The simplest merging method is MgInv, which looks up every single (non-wild-card) element of the path query and merges the results, first on the child axis, then on the descendant axis. In the case of uniform data (|Σ| = 100) it is comparable with the brute force solution, but it is much slower on many other tests.
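The element-at-a-time lookup and child-axis merge performed by MgInv can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: names and the toy index are invented, hits are (path ID, position) pairs, and the descendant-axis merge would compare positions with an inequality instead of exact adjacency.

```python
def merge_child_axis(hits_a, hits_b):
    """Join two postings lists of (path_id, position) hits on the child
    axis: keep the b-hits that directly follow an a-hit in the same path."""
    a_set = set(hits_a)
    return [(pid, pos) for (pid, pos) in hits_b if (pid, pos - 1) in a_set]

# Hypothetical inverted index over two label paths:
#   path 0: a/b/c     path 1: b/a/b
index = {
    "a": [(0, 0), (1, 1)],
    "b": [(0, 1), (1, 0), (1, 2)],
    "c": [(0, 2)],
}
# Evaluate the child-axis query a/b by merging the two postings lists:
print(merge_child_axis(index["a"], index["b"]))  # [(0, 1), (1, 2)]
```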
The methods named MgIe* improve this by finding entry points into the contiguous parts of the queries, keeping the hits that expand with a match, and then merging only on the ascendant axis. These methods are only faster than Br on the artificial tests with rather uniform data. As the probability of using the descendant axis is 0.3, there will frequently be parts of the query that still have low selectivity after expanding over the child axis. Notice that indexing pairs (MgIe2) also gives a significant speedup, while the speedup from indexing triples (MgIe3) is less dramatic. Table 2 shows that MgIe2 uses almost twice as much memory as MgIe1, and MgIe3 almost three times as much, as expected. The total memory used for indexing 10000 paths with MgIe3 was measured at 2.6 MB.

The methods which combine inversion and brute force (EsIe*) gain a greater speedup from indexing pairs and triples. They also have a significant speedup over merging in general. On the test using multiple DTDs, EsIe3 is more than eight times faster than MgIe3. Indexing triples does not help the latter much if the shortest contiguous parts of the query are single elements with poor selectivity.
The straightforward use of a suffix tree (St) has better performance than any of the merging variants (Mg*), but is slower than the combinations of indexing, selectivity estimation and brute force (Es*). When the suffix tree encounters a use of the descendant axis, it must traverse an entire subtree, which is a costly operation if the first part of the query was not very selective. The space usage for the suffix tree is similar to that of indexing triples (*Ie3). The improvement of Ss over St on some of the tests comes from queries starting with wild-cards. St must branch to every child of the root node in the tree, while Ss skips this part of the query and filters hits on their starting positions in the paths afterwards. The reason Ss is sometimes slower is probably the overhead of the filtering. St can be fast when a query starts with a wild-card if the next elements are very selective, so that the branching is effectively cut off.
CHAPTER 4. INCLUDED PAPERS

The method InSe, based on PIGST (Zuopeng et al. 2007), is faster than St on all of the tests, and faster than Ss on most of them. It uses a suffix tree enhanced with path ID lists, path set intersection and brute force search, as described in Section 2.4.1. As predicted, the related method EsSe is considerably faster on the tests with non-uniform data. It skips the intersection and performs the brute force search through the smallest set, exploiting the varying selectivity of the parts of the query. It should be noted that EsIe3 is faster than InSe on all the tests, while EsSe has similar performance. EsIe3 also uses less space and has faster index construction (see Table 2).
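The selectivity-exploiting idea behind EsSe, skipping the intersection and brute-force verifying only the smallest candidate set, can be sketched roughly as below. The function and parameter names are hypothetical, not the paper's implementation; `matches_full_query` stands in for the brute force check of a candidate path.

```python
def esse_style_search(inverted, parts, matches_full_query):
    """Look up each contiguous query part in an inverted index, pick the
    smallest candidate set, and verify only those paths by brute force,
    instead of intersecting all candidate sets."""
    candidate_sets = [inverted.get(p, set()) for p in parts]
    smallest = min(candidate_sets, key=len)
    return {pid for pid in smallest if matches_full_query(pid)}

# Toy index mapping query parts to candidate path IDs:
index = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3}}
print(esse_style_search(index, ["a", "b", "c"], lambda pid: pid == 3))  # {3}
```

Only the most selective part ("c", one candidate) is verified, so the cost of the check is bounded by the best selectivity among the query parts.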
The suffix trees using multiple entry points into the query (Sm[f,r,2]) have very good performance. Smf is more than three times faster than InSe on the test using multiple DTDs, and also faster than EsIe3. Using the forward representation of the paths (Smf) is more efficient than using the backward representation (Smr). There are two reasons for this. The first is the probabilities of requiring a match at the beginning and at the end of a path, which are 0.5 and 1.0 respectively. When these were swapped, Smr was correspondingly faster on the tests using uniform and Zipf data. For paths generated from DTDs and Markov chains, Smf was still faster. Notice that Smr uses more memory. Not all of it comes from storing the reverse strings: the number of internal nodes in a suffix tree is upper bounded by the size of the input, but depends on its characteristics, and a larger number of internal nodes gives more expensive tree traversals. It seems the reverse paths from DTDs have Markov properties that give a larger tree.
Building two suffix trees and using both forward and reverse entry points (Sm2) does not increase performance. More entry points are partially evaluated, seemingly without reducing the total cost. It is possible that a more intelligent and well tuned implementation would give better results. Sm2 used the most memory of all implementations, totalling 5.8 MB on the test with all DTDs.

Smfe combines features from Smf and InSe, and turns out to beat all methods on query performance. A drawback compared with Smf is the increased construction time and memory usage, which comes from using the data structures from InSe.

In the subsequent tests, only the fastest representatives from each group of implementations are shown.
3.4 Increasing number of paths

Figure 6 shows how the number of hits returned per second changes as the number of paths indexed increases. The paths were generated from all the DTDs, and otherwise the default parameters were used. Hits are counted instead of queries to get a better perspective of what happens when the size of the data is large. A "higher is better" measure is used to improve the visualisation of the difference between the best methods. Smfe, Smf, EsSe and EsIe3 out-scale the other methods by a good margin, with the first showing the best performance. The reason for the various drops and rises in the graph may be that the distribution of the paths generated from the DTDs changes as the maximal path length increases. A similar test on paths from a Zipf distribution did not show the same drops.
[Figure: average hits per second against the number of paths indexed (100 to 10000), for MgIe3, EsIe3, Ss, EsSe, Smf, Smfe and Br.]

Figure 6: Increasing number of paths. Hits per second.

3.5 Descendant axis

Figure 7 shows hits returned per second as the probability of using the descendant axis increases. The paths were generated from the DTDs, and the default parameters were used, except that the probability of wild-cards was set to zero. This was to isolate the effect of using the descendant axis. 10000 paths were indexed. Here EsSe is fastest when the probability of using the descendant axis is low, while Smfe is fastest when it is high. EsSe depends on finding contiguous parts of the query with good selectivity, which is harder when all parts are very short.
[Figure: average hits per second against the probability of the descendant axis (0 to 1), for MgIe3, EsIe3, Ss, EsSe, Smf, Smfe and Br.]

Figure 7: Increasing probability of descendant axis. Hits per second.
Note that the simple use of a suffix tree (Ss) has the best performance when there is no use of the descendant axis, but poor performance otherwise. The other methods using suffix trees could switch to the simple search technique when this is expected to be profitable.
3.6 Wild-cards

Increasing the probability of wild-cards gives different results than increasing the use of the descendant axis, as shown in Figure 8. The descendant axis was not used at all in this test. Here the suffix trees are much faster than the other methods. For the trees, branching on a wild-card is much less critical than branching on the descendant axis, as the former is cut off as soon as a mismatch is seen, while the latter introduces a full search of the subtree.
[Figure: average hits per second against the probability of a wild-card (0 to 0.5), for MgIe3, EsIe3, Ss, EsSe, Smf, Smfe and Br.]

Figure 8: Increasing probability of wild-card. Hits per second.
It is interesting to note that the simple use of a suffix tree, only skipping leading wild-cards (Ss), is here much more efficient than the implementation supporting multiple entry points into the query (Smf), even though only a single entry point is created. The realisation of the search automaton in Ss is just a recursive program, while Smf maintains a set of state objects, which gives a lot of overhead.
3.7 Index construction

Figure 9 shows the indexing performance of the various indexes as the number of paths increases. Single representatives for methods using the same data structure were tested: MgIe1 represents MgInv and EsIe1, MgIe2 represents EsIe2, MgIe3 represents EsIe3, St represents Ss and Smf, and finally InSe represents EsSe and Smfe. The methods tested are all asymptotically linear in the worst case, except InSe. The performance drop observed is probably due to the overhead of memory management and cache effects. The construction of the suffix trees is slower than the construction of inverted lists, but as the time cost of adding a path is less than 30 µs, this would constitute a negligible part of the cost of indexing XML in a complete system, assuming that the path index can reside in main memory while the value index must reside on disk.
[Figure: average paths indexed per second against the number of paths (100 to 10000), for MgIe1, MgIe2, MgIe3, St, Sm2 and InSe.]

Figure 9: Paths indexed per second.
4 Conclusion and future work

The most advanced method using suffix trees (Smfe) is the winner on query performance in these tests. Whether it should be chosen as the path index component in a larger system depends on the amount of main memory available and on whether or not the performance of the path index significantly impacts the performance of the complete system. Another issue is the complexity of the implementation. The suffix tree itself is a rather complex structure, and its use as described here may be hard to grasp.

The method combining inverted lists, extension over the child axis and brute force (EsIe3) would be the authors' choice when implementing a larger system. It is conceptually simple, easy to implement, and has good performance. If memory usage is an issue, EsIe2 or EsIe1 could be used. These methods could also be modified to work on disk when the path index does not fit in main memory. The paths themselves could be stored in the entries in their inverted lists, allowing the extension technique to work, at the cost of using more disk space and IO.

In future work, the authors would like to add the fast path summaries introduced here to an existing system for indexing XML, and compare this with various implementations, both with and without path summaries, such as Lore (Goldman and Widom 1997), ToXin (Rizzolo and Mendelzon 2001), XISS (Li and Moon 2001), XIST (Runapongsa et al. 2004), Ctree (Zou et al. 2004), SphinX (Poola and Haritsa 2007) and MXI (Yan and Liang 2005).
Acknowledgements

The author would like to thank Øystein Torbjørnsen, Svein-Olaf Hvasshovd and Truls Amundsen Bjørklund for helpful feedback on this article.
References

Aboulnaga, A., Alameldeen, A. R. & Naughton, J. F. (2001), Estimating the selectivity of XML path expressions for internet scale applications, in 'Proc. VLDB', pp. 591–600.

Aho, A., Sethi, R. & Ullman, J. (1986), Compilers: Principles, Techniques and Tools, Addison Wesley, Reading.

Al-Khalifa, S., Jagadish, H., Koudas, N., Patel, J. & Srivastava, D. (2002), Structural joins: A primitive for efficient XML query pattern matching, in 'Proc. ICDE', pp. 141–152.

Barta, A., Consens, M. P. & Mendelzon, A. O. (2005), Benefits of path summaries in an XML query optimizer supporting multiple access methods, in 'Proc. VLDB', pp. 133–144.

Bertino, E. & Kim, W. (1989), 'Indexing techniques for queries on nested objects', IEEE Transactions on Knowledge and Data Engineering 1(2), 196–214.

Böhme, T. & Rahm, E. (2001), 'XMach-1: A benchmark for XML data management', Proc. German database conference BTW, pp. 264–273.

Bressan, S., Lee, M., Li, Y., Lacroix, Z. & Nambiar, U. (2003), 'The XOO7 benchmark', Proc. EEXTT'02, pp. 146–147.

Bruno, N., Koudas, N. & Srivastava, D. (2002), Holistic twig joins: Optimal XML pattern matching, in 'Proc. SIGMOD', pp. 310–321.

Buneman, P., Choi, B., Fan, W., Hutchison, R., Mann, R. & Viglas, S. D. (2005), Vectorizing and querying large XML repositories, in 'Proc. ICDE', pp. 261–272.

Chen, Z., Jagadish, H., Korn, F., Koudas, N., Muthukrishnan, S., Ng, R. & Srivastava, D. (2001), 'Counting twig matches in a tree', Proc. ICDE.

Chien, S., Vagena, Z., Zhang, D., Tsotras, V. & Zaniolo, C. (2002), 'Efficient structural joins on indexed XML documents', Proc. VLDB 2, 263–274.

Chung, C.-W., Min, J.-K. & Shim, K. (2002), APEX: An adaptive path index for XML data, in 'Proc. SIGMOD', pp. 121–132.

Cooper, B., Sample, N., Franklin, M. J., Hjaltason, G. R. & Shadmon, M. (2001), A fast index for semistructured data, in 'Proc. VLDB', pp. 341–350.

Farach, M. (1997), Optimal suffix tree construction with large alphabets, in 'Proc. FOCS', pp. 137–143.

Goldman, R. & Widom, J. (1997), DataGuides: Enabling query formulation and optimization in semistructured databases, in 'Proc. VLDB', pp. 436–445.
Gottlob, G., Koch, C. & Pichler, R. (2005), 'Efficient algorithms for processing XPath queries', ACM Trans. Database Syst. 30(2), 444–491.

Grimsmo, N. (2005), Dynamic indexes vs. static hierarchies for substring search, Master's thesis, Norwegian University of Science and Technology.

Grimsmo, N. (2007), On performance and cache effects in substring indexes, Technical Report IDI-TR-2007-04, NTNU, Trondheim, Norway.

Grimsmo, N. & Bjørklund, T. A. (2007), On the size of generalised suffix trees extended with string ID lists, Technical Report IDI-TR-2007-01, NTNU, Trondheim, Norway.

Grust, T. (2002), Accelerating XPath location steps, in 'Proc. SIGMOD', pp. 109–120.

Jiang, H., Wang, W., Lu, H. & Yu, J. (2003), Holistic twig joins on indexed XML documents, in 'Proc. VLDB', pp. 273–284.

Kay, M. H. (1999), 'GedML'. http://users.breathe.com/mhkay/gedml/.

Kemper, A. & Moerkotte, G. (1992), 'Access support relations: an indexing method for object bases', Inf. Syst. 17(2), 117–145.

Kilpeläinen, P. (1992), Tree matching problems with applications to structured text databases, Technical Report A-1992-6, Department of Computer Science, University of Helsinki.

Ley, M. (2007), 'DBLP XML Records'. http://www.informatik.uni-trier.de/~ley/db/.

Li, Q. & Moon, B. (2001), Indexing and querying XML data for regular path expressions, in 'Proc. VLDB', pp. 361–370.

Lu, J., Ling, T., Chan, C. & Chen, T. (2005), From region encoding to extended Dewey: On efficient processing of XML twig pattern matching, in 'Proc. VLDB', pp. 193–204.

McCreight, E. M. (1976), 'A space-economical suffix tree construction algorithm', J. ACM 23(2), 262–272.

Michiels, P., Mihaila, G. A. & Simeon, J. (2007), Put a tree pattern in your algebra, in 'Proc. ICDE', pp. 246–255.

Muthukrishnan, S. (2002), Efficient algorithms for document retrieval problems, in 'Proc. SODA', pp. 657–666.

Nestorov, S., Ullman, J. D., Wiener, J. L. & Chawathe, S. S. (1997), Representative objects: Concise representations of semistructured, hierarchical data, in 'Proc. ICDE', pp. 79–90.

Poola, L. & Haritsa, J. (2007), 'Schema-conscious XML indexing', Information Systems 32, 344–364.
Rao, P. & Moon, B. (2004), PRIX: Indexing and querying XML using Prüfer sequences, in 'ICDE '04: Proceedings of the 20th International Conference on Data Engineering', IEEE Computer Society, Washington, DC, USA, p. 288.

Rizzolo, F. & Mendelzon, A. (2001), Indexing XML data with ToXin, in 'Proc. WebDB', Vol. 2001.

Runapongsa, K., Patel, J., Bordawekar, R. & Padmanabhan, S. (2004), XIST: An XML index selection tool, in 'Proc. XSym'.

Runapongsa, K., Patel, J., Jagadish, H. & Al-Khalifa, S. (2003), The Michigan Benchmark: A microbenchmark for XML query processing systems, in 'Proc. EEXTT'02', Springer.

Schmidt, A. R., Waas, F., Kersten, M. L., Florescu, D., Manolescu, I., Carey, M. J. & Busse, R. (2001), The XML Benchmark Project, Technical Report INS-R0103, CWI, Amsterdam, The Netherlands.

Tatarinov, I., Viglas, S. D., Beyer, K., Shanmugasundaram, J., Shekita, E. & Zhang, C. (2002), Storing and querying ordered XML using a relational database system, in 'Proc. SIGMOD', pp. 204–215.

Wang, H. & Meng, X. (2005), On the sequencing of tree structures for XML indexing, in 'Proc. ICDE', pp. 372–383.

Wang, H. & Ooi, B. (2003), XR-tree: Indexing XML data for efficient structural joins, in 'Proc. ICDE', pp. 253–264.

Wang, H., Park, S., Fan, W. & Yu, P. (2003), 'ViST: a dynamic index method for querying XML data by tree structures', Proc. SIGMOD, pp. 110–121.

Yan, L. & Liang, Z. (2005), 'Multiple schema based XML indexing', Lecture Notes in Computer Science 3619, 891–900.

Yao, B. B., Özsu, M. T. & Keenleyside, J. (2003), XBench - a family of benchmarks for XML DBMSs, in 'Proc. EEXTT'02', pp. 162–164.

Yoshikawa, M., Amagasa, T., Shimura, T. & Uemura, S. (2001), 'XRel: a path-based approach to storage and retrieval of XML documents using relational databases', ACM Transactions on Internet Technology 1(1), 110–141.

Zhang, C., Naughton, J., DeWitt, D., Luo, Q. & Lohman, G. (2001), 'On supporting containment queries in relational database management systems', SIGMOD Rec. 30(2), 425–436.

Zou, Q., Liu, S. & Chu, W. W. (2004), Ctree: a compact tree for indexing XML data, in 'Proc. WIDM', pp. 39–46.

Zuopeng, L., Kongfa, H., Ning, Y. & Yisheng, D. (2007), 'An efficient index structure for XML based on generalized suffix tree', Information Systems 32, 283–294.
Paper 2

Nils Grimsmo and Truls Amundsen Bjørklund
On the Size of Generalised Suffix Trees Extended with String ID Lists
Technical Report IDI-TR-2007-01, Norwegian University of Science and Technology, Trondheim, Norway, 2007.

Abstract: The document listing problem can be solved with linear preprocessing and optimal search time by using a generalised suffix tree, additional data structures and constant time range minimum queries. A simpler solution is to use a generalised suffix tree in which internal nodes are extended with a list of all string IDs seen in the subtree below the respective node. This report makes some remarks on the size of such a structure. For the case of a set of equal-length strings, a bound of Θ(n√n) for the worst case space usage of such lists is given, where n is the total length of the strings.
1 Document listing problem

The occurrence listing problem is: given a string T ∈ Σ* of length n, which can be preprocessed, and a pattern P of length m, find all z occurrences of P in T. This classical problem can be solved with Θ(n) preprocessing and optimal O(m + z) time queries with a suffix tree [4] if |Σ| is constant. The document listing problem is: given a set of strings T = {T_1, . . . , T_t} of total length n, which can be preprocessed, find all y strings in T which contain the pattern P. This can also be solved with Θ(n) preprocessing and O(m + y) time queries, by augmenting a generalised suffix tree with additional data and using constant time range minimum queries [5].

This report considers the complexity of a simpler solution, which does not have linear time preprocessing or linear space usage, but which has been used to solve a problem in information retrieval [8] without its complexity being considered.
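To fix the problem statement, here is a deliberately naive formulation of document listing as code. This is only a definition by example, scanning every string, and not one of the suffix-tree solutions discussed in this report.

```python
def document_listing(strings, pattern):
    """Naive document listing: return the IDs of all strings that
    contain `pattern`. Time is proportional to the total text length
    per query; the structures in this report avoid the full scan."""
    return [i for i, s in enumerate(strings) if pattern in s]

# The rotations of (1, ..., 4) used later in the report, written as text:
docs = ["1234", "2341", "3412", "4123"]
print(document_listing(docs, "34"))  # [0, 1, 2]
```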
2 Suffix tree definition

A suffix tree [4] for a string T of length n from the alphabet Σ is a compacted trie containing the first n suffixes of T$, where $ ∉ Σ is a unique terminator. Compaction here means that all internal non-branching nodes in the trie have been removed, joining adjacent edges. As $ is not seen in T, no suffix is a prefix of another, and each of the n suffixes is represented by a unique leaf node. Since all internal nodes are branching, their number is strictly upper bounded by n. If the edges in the tree are represented as pointers into T, the suffix tree can be represented in Θ(n) space. It can be constructed in Θ(n) time if the alphabet is constant [7, 4, 6] or integer [1].

Given a pattern P of length m, it can be decided whether it occurs in T by trying to follow its path downwards from the root of the suffix tree. The time complexity is O(m) if child lookup is Θ(1), which is trivially achieved if |Σ| is constant. The parent-child relationship is typically implemented with linked sibling lists. An expected lookup time of O(m) can be achieved with hashing [4]. Perfect hashing [2] can be used to get O(m) worst case lookup, at the cost of a longer construction time. After the position representing P in the tree has been located, all z hits can be extracted in Θ(z) time, as all leaf nodes in the subtree correspond to unique hits and all internal nodes in the subtree are branching.
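The descent-then-collect query described above can be illustrated with a toy structure. Note the hedge: this is an uncompacted suffix trie, quadratic to build, not the Θ(n) compacted suffix tree of [4]; it only shows the O(m) descent and the Θ(z) leaf collection.

```python
def build_suffix_trie(text):
    """Uncompacted suffix trie for text + '$'. Each node is a dict from
    symbol to child; a leaf stores the start position of its suffix
    under the multi-character key "leaf" (which cannot clash with a symbol)."""
    root = {}
    t = text + "$"
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i

    return root

def occurrences(root, pattern):
    """Descend on the pattern, then collect every leaf in the subtree;
    each leaf is a distinct occurrence, as in the text above."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "leaf":
                out.append(child)
            else:
                stack.append(child)
    return sorted(out)

print(occurrences(build_suffix_trie("abab"), "ab"))  # [0, 2]
```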
3 Generalised suffix tree

A generalised suffix tree for a set of strings T = {T_1, . . . , T_s} of total length n = n_1 + · · · + n_s is a compacted trie containing, for each T_i, all the first n_i suffixes of T_i$_i, where $_i is a unique terminator. The tree will have n leaf nodes, and can be constructed in Θ(n) time and space [3].
4 String ID list extension

A generalised suffix tree is an asymptotically optimal substring index for a set of strings from a constant alphabet: the cost of a query is linear in the size of the input and output. In many cases, however, one does not want to extract the set of hit positions, but the set of strings in which the hits occur. In cases where substrings are repeated in the indexed strings, a string ID can be seen many times in a subtree, and the set of unique string IDs will be smaller than the set of hit positions. To avoid this overhead, each node can store a list of the string IDs seen in its subtree, which can speed up the search for strings with matches. A drawback is a non-linear space usage in the worst case, which is shown below.
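The extension is easy to picture in code. The sketch below uses an uncompacted generalised suffix trie for brevity (a real implementation would use a compacted suffix tree, and would store lists rather than sets); every node records the set of string IDs occurring in its subtree, so document listing reduces to one descent plus reading off the list.

```python
from collections import defaultdict

def build_gst_with_id_lists(strings):
    """Uncompacted generalised suffix trie where each node records the
    string IDs seen in its subtree (the extension described above)."""
    children = defaultdict(dict)   # node id -> {symbol: child node id}
    ids = defaultdict(set)         # node id -> string IDs below the node
    next_node = [1]                # node 0 is the root
    for sid, s in enumerate(strings):
        padded = list(s) + [("$", sid)]     # unique terminator per string
        for i in range(len(s)):
            node = 0
            ids[0].add(sid)
            for sym in padded[i:]:
                nxt = children[node].get(sym)
                if nxt is None:
                    nxt = next_node[0]
                    next_node[0] += 1
                    children[node][sym] = nxt
                node = nxt
                ids[node].add(sid)
    return children, ids

def strings_containing(children, ids, pattern):
    """Document listing: descend on the pattern, then read the ID list."""
    node = 0
    for sym in pattern:
        if sym not in children[node]:
            return set()
        node = children[node][sym]
    return ids[node]

children, ids = build_gst_with_id_lists(["abab", "abba"])
print(sorted(strings_containing(children, ids, "ab")))  # [0, 1]
print(sorted(strings_containing(children, ids, "bb")))  # [1]
```

The total size of `ids` over all nodes is exactly the quantity |L| analysed in the following sections.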
5 A lower bound for worst case space usage

Here follows a proof of a lower bound of Ω(n√n) for the worst case space usage of suffix trees extended with string ID lists. This is shown through the Θ(n√n) space usage for a family of sets of strings.

Assume we have the set of strings T = {T_1, . . . , T_t}, where the strings are the rotations of (1, . . . , t), such that

T_i = ((0 + i) mod (t + 1), . . . , (t − 1 + i) mod (t + 1))    (1)

The total length of the strings is n = t². Figure 1 shows a generalised suffix tree for this set of strings for t = 4.

Figure 1: Generalised suffix tree for the padded strings (1, 2, 3, 4, $_1), (2, 3, 4, 1, $_2), (3, 4, 1, 2, $_3) and (4, 1, 2, 3, $_4). The number of unique string IDs in each subtree is shown in the internal nodes.
Let U = (1 mod (t + 1), . . . , (2t − 1) mod (t + 1)). All strings in T are substrings of U; Figure 2 shows this for t = 4. Consider the substring U[p . . . q], where 1 ≤ p ≤ t and p ≤ q < p + t, which is equal to U[p + t . . . q + t] if q < t. It is seen in T_i where i ≤ p or q < i, which gives a total of p + (t − q) locations.

If p + (t − q) > 1, there will be an internal node representing the substring U[p . . . q], as it is followed by U[q + 1] in some string, and by an end of string symbol in the padded string T_{q−t+1}$_{q−t+1}. Hence the total number of string IDs stored in the lists will be
|L| = ∑_{p=1}^{t} ∑_{q=p}^{p+t−2} (p + (t − q)) = ∑_{p=1}^{t} ∑_{q=0}^{t−2} (t − q) = t ∑_{k=2}^{t} k = (t³ + t² − 2t) / 2    (2)

Inserting t = √n, we get

|L| = (n√n + n − 2√n) / 2 ∈ Θ(n√n)    (3)
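The closed form in Equation 2 can be sanity-checked numerically by evaluating the double sum directly (a spot check of the algebra, not part of the proof):

```python
def id_list_total(t):
    """Evaluate the double sum of Equation 2 directly:
    sum over 1 <= p <= t, p <= q <= p + t - 2 of p + (t - q)."""
    return sum(p + (t - q)
               for p in range(1, t + 1)
               for q in range(p, p + t - 1))

def closed_form(t):
    # (t^3 + t^2 - 2t) / 2, always an integer
    return (t**3 + t**2 - 2 * t) // 2

for t in range(2, 20):
    assert id_list_total(t) == closed_form(t)
print(id_list_total(4))  # 36, matching (64 + 16 - 8) / 2
```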
    U:    1 2 3 4 1 2 3
    T_1:  1 2 3 4 $_1
    T_2:    2 3 4 1 $_2
    T_3:      3 4 1 2 $_3
    T_4:        4 1 2 3 $_4

Figure 2: Showing how the strings in T are substrings of U for t = 4. For each T_i ∈ T, each substring of length less than t is seen in two places in U.
As the space usage for a normal generalised suffix tree is Θ(n), we get a total space usage of Θ(n√n) for a tree extended with string ID lists for this family of sets of strings, and a bound of Ω(n√n) for the worst case space usage of this data structure in general.
6 An upper bound for strings of equal length

Assume we have T = {T_1, . . . , T_t}, with equal string lengths |T_i| = r. The total length of the strings is n = tr. Since there are fewer than n internal nodes in the tree, and each list contains at most t IDs, the total length of the lists is bounded as

|L| < tn = t(tr) = t²r ∈ O(t²r)    (4)
Seen differently, the jth suffix of each string has length r − j + 1, so there can be at most r − j internal nodes above the leaf node representing the suffix, to each of which it can contribute its string ID. This gives the bound

|L| ≤ t ∑_{j=1}^{r} (r − j) = t ∑_{k=0}^{r−1} k = t (r − 1)r / 2 ∈ O(tr²)    (5)
Together, these two bounds give

|L| ∈ O(min(t²r, tr²))    (6)

which is maximal for r = t. This gives n = t² = r², and the bound

|L| ∈ O(n√n)    (7)

Combining with Equation 3, we get a bound of Θ(n√n) for the worst case space usage for equal-length strings.
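Equation 6 can also be inspected numerically: for a fixed total length n = tr, the bound min(t²r, tr²) equals n · min(t, r), which is maximised when t = r = √n, matching Equation 7. The snippet below is a spot check over divisor pairs of one n, not part of the proof.

```python
def bound(t, r):
    # Equation 6: min(t^2 r, t r^2), which equals (t * r) * min(t, r)
    return min(t * t * r, t * r * r)

n = 36 * 36  # n = 1296, so sqrt(n) = 36
pairs = [(t, n // t) for t in range(1, n + 1) if n % t == 0]
best = max(pairs, key=lambda p: bound(*p))
print(best, bound(*best))  # (36, 36) 46656, i.e. n * sqrt(n)
```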
References
[1] Martin Farach. Optimal suffix tree construction with large alphabets. In Proc. FOCS, 1997.
[2] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3), 1984.
[3] Daniel M. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[4] Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2), 1976.
[5] S. Muthu Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. SODA, 2002.
[6] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(5), 1995.
[7] Peter Weiner. Linear pattern matching algorithms. In Proc. SWAT, 1973.
[8] Liang Zuopeng, Hu Kongfa, Ye Ning, and Dong Yisheng. An efficient index structure for XML based on generalized suffix tree. Information Systems, 32:283–294, 2007.
Paper 3
Nils Grimsmo, Truls Amundsen Bjørklund and Øystein Torbjørnsen
XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
Proceedings of the Second International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2010)
Abstract XML indexing and search has become an important topic, and twig joins are key building blocks in XML search systems. This paper describes a novel approach using a nested loop twig join algorithm, which combines several existing techniques to speed up evaluation of XML queries. We combine structural summaries, path indexing and prefix path partitioning to reduce the amount of data read by the join. This effect is amplified by only reading data for leaf query nodes, and inferring data for internal nodes from the structural summary. Skipping is used to speed up merges where query leaves have differing selectivity. Multiple access methods are implemented as materialized views instead of succinct secondary indexes, for better locality. This redundancy is made affordable in terms of space by using compression in a back-end with columnar storage. We have implemented an experimental prototype, which shows a speedup of two orders of magnitude on XPath queries with value predicates, when compared to existing open source and commercial systems using a subset of the techniques. Space usage is also improved.
Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
1 Introduction<br />
XML has become the dominant data format for exchange of structured and semi-structured data over the Internet. It may also be used to integrate data from disparate sources. When XML is generated from structured databases it is regular and has a known schema, while in other scenarios the data is ad hoc and without any schema.
XPath [31] and XQuery [32] are languages used to query XML data. XPath is a simple declarative language that supports matching of structure and content. XQuery is a more expressive iterative language, but uses XPath as a fundamental building block. This paper focuses on performance on a subset of XPath called twig pattern matching, which covers a majority of queries used in practice [11].
An XPath query is a tree, where the relations between the query nodes specify the relations between the data nodes in a possible match, and value predicates specify the contents of attributes and text nodes in the XML. The example query /lib/book[author="Kant" and author="Gödel"] asks for all books coauthored by Kant and Gödel. A parent-child relation is denoted by "/", an ancestor-descendant relation by "//", and "[]" encloses a predicate. Figure 1 shows a document where this query would have a match.
<lib>
  <book>
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
  ...
</lib>
Figure 1: Example XML document.
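The example query can be run directly with Python's standard-library ElementTree, whose limited XPath dialect supports the [tag='text'] predicate; the document literal below is our rendering of the example.

```python
# Evaluating /lib/book[author="Kant" and author="Gödel"] over the example
# document, using the stdlib's restricted XPath support.
import xml.etree.ElementTree as ET

doc = """<lib>
  <book>
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
</lib>"""

root = ET.fromstring(doc)  # root is the lib element
# The conjunction of predicates is expressed by chaining them:
hits = root.findall("./book[author='Kant'][author='Gödel']")
assert [b.findtext("title") for b in hits] == ["Kritik der Unvollständigkeit"]
```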
A typical system supporting XPath parses the XML and stores some information about each data node, usually its name, type and value, and its relation to other nodes. This can be stored in a table, and is usually sorted on document and node order for faster structural joins on the nodes in the query tree. When evaluating queries, it may be more advantageous to access nodes either by name, value, natural order, or other fields. Multiple secondary indexes can be added for direct lookup on all the fields stored for a node.
To find all matches for the example query, many current systems would read six lists of nodes from indexes, looking up the four XML element nodes on name (lib, book, author and author), and the two text nodes on value ("Kant" and "Gödel"). These would then be joined to give full matches, using the structural information stored about each node. In a library with millions of books, there would be just as many book nodes, and even more author nodes to join.
The main contributions of this paper are: 1) A twig join algorithm combining previous techniques in a novel way; 2) Implementation of an experimental prototype; 3) Experiments showing two orders of magnitude speedup over other systems on queries with value predicates, and reduced space usage.
2 Background
Indexing and querying XML has been a major research area in both the database and information retrieval communities over the last ten years, resulting in many different approaches.
2.1 Schema Agnostic XML Indexing<br />
An early approach to XML indexing, usable with simple schemas, is to translate the XML schema to a relational schema and put shredded XML data into an RDBMS [28]. However, the XML schema can be complex, subject to updates, unknown, or non-existent; this flexibility may be considered a strength of XML.
In schema agnostic XML indexing, all element, attribute and text nodes are extracted, and stored with information about their type, name, value and position in the tree. During query evaluation a list of matching data nodes is read for each query node, and these are joined into complete matches for the query tree. To allow joining matches for individual query nodes into tree pattern matches, the positional information must make it possible to decide node relationships, such as ancestor-descendant and parent-child. The most well-known tree position encoding is the regional BEL encoding [37], which gives a node's begin and end position in the document, and its level (depth) in the tree.
Figure 2 shows an example using the BEL encoding on the data extracted for the XML document in Figure 1. The data is usually indexed on name for XML element nodes, and on value for text nodes. When querying, one list of nodes is typically read for each node in the query. Retrieving element nodes on name (tag) is called tag partitioning (or tag streaming [6]). To evaluate the XPath query //lib/book, the nodes u and v (for lib and book) satisfying u.begin < v.begin ∧ v.end < u.end ∧ u.level + 1 = v.level, in addition to the type and name requirements, must be found.
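The parent-child condition above can be checked directly on BEL tuples. The sketch below uses the rows of Figure 2; the tuple type is ours, and the end value for lib (elided as ". . ." in the figure) is an assumed placeholder.

```python
# Direct check of the BEL parent-child condition from the text.
from collections import namedtuple

Node = namedtuple("Node", "begin end level name")

def is_child(u: Node, v: Node) -> bool:
    """True if v is a child of u under the regional BEL encoding."""
    return u.begin < v.begin and v.end < u.end and u.level + 1 == v.level

lib   = Node(1, 14, 1, "lib")     # end value assumed; elided in Figure 2
book  = Node(2, 12, 2, "book")
title = Node(3, 5, 3, "title")
assert is_child(lib, book)
assert is_child(book, title)
assert not is_child(lib, title)   # descendant but not child: levels differ by 2
```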
2.2 Tree-aware Joins<br />
If the node lists are sorted on doc and begin values, a linear merge can be used when the schema and query are simple. But in other cases, matches may be formed from the lists out of order. Consider the document in Figure 3 and the query //a[c and d]. A linear merge of the lists would miss one of the two pairs of c and d nodes that are usable together.
Using a full cross join on the lists of data nodes matching each query node, and checking the output for legal matches, is not feasible for large data, as it can give intermediate results exponential in the size of the query, even when the final answer is small.
The first specialized tree joins focused on splitting the query tree into binary relationships and stitching the results into complete matches. Specialized loop joins gave optimal O(I + O) complexity for ancestor-descendant relationships [37], where I is the size of the input lists and O is the size of the output. When joining an ancestor and a descendant
doc begin end level type name value<br />
1 1 . . . 1 Elem. lib<br />
1 2 12 2 Elem. book<br />
1 3 5 3 Elem. title<br />
1 4 4 4 Text “Kritik. . . ”<br />
1 6 8 3 Elem. author<br />
1 7 7 4 Text “Kant”<br />
1 9 11 3 Elem. author<br />
1 10 10 4 Text “Gödel”<br />
1 13 . . . 2 Elem. article<br />
. . . . . . . . . . . . . . . . . . . . .<br />
Figure 2: Data extracted for tag partitioning.<br />
Figure 3: Breaking linear merge for //a[c and d].
list, a "marker" is left at matching positions in the descendant list, and the descendant list is rewound to this position when the ancestor list is forwarded.
Later, stack-based joins gave O(I + O) complexity also for parent-child relationships [2]. These maintained a stack of nested ancestor candidates in a single traversal of the two lists. Any current descendant candidate would be a descendant of all the nodes on the stack, and possibly the child of the top node.
As evaluating the query node relationships separately still gave useless intermediate results of exponential size, multi-way path and twig join algorithms were introduced [4]. By maintaining multiple stacks, these achieved optimal O(I + O) complexity using only O(d) memory, where d is the document depth, for path queries and for twig queries using only the descendant axis. Later algorithms, which break the O(d) memory bound, achieve optimal complexity for all combinations of ancestor-descendant and parent-child edges [5].
2.3 Skipping Tree Joins<br />
When the lists of matches for the different query nodes have similar size, regular tag partitioning is a fairly efficient solution, but that is often not the case. Take for example the query //book[author="Kant"], evaluated on a library with millions of books. The leaf predicate probably has good selectivity, while the other nodes do not.
Skipping can be used to improve efficiency in such cases. Consider the query //a//d, and the problem of forwarding in the lists for the two query nodes to the first position where a match for a is an ancestor of a match for d. Skipping forward in the list for d to find the first d_j which can be a descendant of the current node a_i for a is trivially done by finding the first d_j with d_j.begin > a_i.begin. If d_k is the last d node with d_k.end < a_i.end, then all d descendants of a_i (if any) are stored contiguously from d_j to d_k.
However, this technique cannot be used to skip in the ancestor list, as any node with a lower begin value is a candidate, and the actual ancestors can be spread evenly among a large number of such candidates. Specialized data structures can be used to skip efficiently in both ancestor and descendant lists [33].
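The descendant-list skip described above maps directly onto binary search. A sketch, assuming the d nodes are not nested among themselves (so their end values are sorted too); real systems skip via index structures rather than in-memory arrays.

```python
# Find the contiguous block of d nodes nested inside the current a candidate.
import bisect

def descendant_range(d_begins, d_ends, a_begin, a_end):
    """Return (j, k) so that d nodes j..k-1 are nested inside (a_begin, a_end)."""
    j = bisect.bisect_right(d_begins, a_begin)  # first d with begin > a.begin
    k = bisect.bisect_left(d_ends, a_end)       # first d with end >= a.end
    return j, k

d_begins = [3, 8, 20]
d_ends   = [4, 9, 21]
j, k = descendant_range(d_begins, d_ends, 1, 10)
assert (j, k) == (0, 2)  # the first two d nodes lie inside a = (1, 10)
```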
2.4 Adding Structural Summaries<br />
A different way of avoiding many useless nodes in the query node match lists is to do some partial matching in a preprocessing step. In the approach called path indexing, some of the structure of the data is extracted and put in a path summary [10], which is a tree containing all label-unique root-to-leaf paths seen in the data. This is used in a query preprocessing step to partially identify structural matches for the query.
Figure 4 shows the structural data extracted from the example in Figure 1. Note that each path seen in the data (e.g. /lib/book/author) is added only once to the summary. The data stored for each document node is then linked to nodes in the summary in some way. One approach is shown in Figure 5. Summary tree nodes are given unique integer path IDs, which are referenced by the indexed data nodes. Here the Dewey encoding [30] is used to enumerate data node positions.
Figure 4: Path summary tree for the example document, rooted at path ID 0.
doc Dewey pathID value<br />
1 1 1<br />
1 1.1 3<br />
1 1.1.2 6<br />
1 1.1.2.1 7 “Kritik. . . ”<br />
1 1.1.3 4<br />
1 1.1.3.1 5 “Kant”<br />
1 1.1.4 4<br />
1 1.1.4.1 5 “Gödel”<br />
1 1.2 2<br />
. . . . . . . . . . . .<br />
Figure 5: Data extracted for prefix path partitioning.
The query is first matched against the summary, which yields the matching path IDs for the branching nodes and leaves in the query. Lists of data nodes matching the related summary nodes can then be read. This is called prefix path partitioning (or prefix path streaming [6], as opposed to tag streaming). It usually results in reading less data, as the set of matching paths is often much more selective than the node name.
2.5 Virtual Nodes<br />
The Dewey encoding, which is used in Figure 5, has an advantage over the BEL encoding when combined with structural summaries, because it allows ancestor reconstruction. Only data node lists for branching and leaf query nodes need to be read, because matching of non-branching internal nodes is implied by the structural matching of the prefix paths of the checked nodes.
This approach is taken one step further by generating virtual nodes for the internal query nodes from the lists of leaf node matches [36]. Given the Dewey and pathID of a data node matching a leaf query node, the Deweys and pathIDs of all data nodes above it can be inferred. The Dewey of an ancestor at depth d is the length-d prefix of the Dewey of any of its descendants, and its pathID can be found by going up to depth d in the summary tree.
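Ancestor reconstruction can be sketched in a few lines. The parent_path table below is our encoding of the summary implied by Figure 5 (pathID → parent pathID, with 0 as the root); the function itself follows the prefix rule just described.

```python
# Summary of Figure 5: lib=1, article=2, book=3, author=4, author-text=5,
# title=6, title-text=7; root has path ID 0.
parent_path = {1: 0, 2: 1, 3: 1, 4: 3, 5: 4, 6: 3, 7: 6}

def ancestor(dewey, path_id, depth):
    """Virtual node (dewey, pathID) for the ancestor at the given depth."""
    p = path_id
    for _ in range(len(dewey) - depth):  # walk up the summary tree
        p = parent_path[p]
    return dewey[:depth], p              # Dewey is just the prefix

# Leaf "Kant": Dewey 1.1.3.1, pathID 5 (Figure 5). Its depth-2 ancestor is
# the book node: Dewey 1.1, pathID 3.
assert ancestor((1, 1, 3, 1), 5, 2) == ((1, 1), 3)
```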
Using Dewey to enumerate nodes requires non-linear space and shows poor performance in some degenerate cases, but it is commonly used in practice, for example in Microsoft SQL Server [22]. Space usage can be improved by using various compression schemes [13], and updatability can be attained [21].
2.6 Column Storage<br />
A trend in database systems research over the last decade has been investigating how performance can be improved for read-intensive workloads. An important contribution in this field is column store databases [29], where each column in a table is stored separately. MonetDB/XQuery [20] is a well-known XPath/XQuery system using columnar storage.
Advantages of column stores include reading less data when not all columns in a table are needed for a query, better cache behavior when block processing is used within a column, and better compression, as techniques such as run-length encoding (RLE) are easy to apply [14, 1]. And as tuples can be materialized late, there can be less computational work, especially if block processing is used [1].
Using compression can give benefits beyond the obvious reduction in space usage. Compression of inverted lists is commonly done in regular search engines to reduce disk I/O [35], and using compression can reduce the memory bus bottleneck in database systems [23].
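Run-length encoding, mentioned above as a natural fit for sorted columns, is small enough to show in full; this is a minimal sketch, not the paper's implementation.

```python
# RLE over a column: encode runs of equal values as (value, run_length) pairs.
from itertools import groupby

def rle_encode(column):
    """Encode a column as (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(column)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the column."""
    return [v for v, n in pairs for _ in range(n)]

col = [5, 5, 5, 5, 7, 7, 9]  # e.g. a sorted pathID column
enc = rle_encode(col)
assert enc == [(5, 4), (7, 2), (9, 1)]
assert rle_decode(enc) == col
```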
3 The XLeaf System<br />
XLeaf is a novel combination of many previous techniques for evaluating twig queries. It combines structural summaries, prefix path partitioning, virtual node lists for internal query nodes, skipping joins, multiple access methods, compression and column storage. Like most other academic XML search prototypes, our system supports the descendant and child axes, and simple value predicates.
The main difference between our approach and the majority of work on twig joins is that we use a nested loop join, where the size of the state is linear in the size of the query. Most approaches read the input lists once (streaming), and maintain larger intermediate results.
3.1 Query Evaluation<br />
Algorithm 1 gives an overview of how a query is evaluated in our system. First the query is matched against the summary, as described in Section 3.2. Then an access method for candidate matches is chosen for each leaf query node, as described in Section 3.3.
For queries with a single return node, as in XPath, it is also often possible to avoid looping and use a simple linear join. Such a join is correct when the depth of each branching node is fixed, as the query will then not have out-of-order matches. The simple linear join is described in Section 3.4, and the general looping join in Section 3.5.
Note that in the following, only output, branching and leaf query nodes need to be considered. For internal query nodes with one child, matching is implied by the matching of the root-to-node paths of the nearest branching and leaf node descendants in the query tree. For a parent query node with multiple children, on the other hand, it is essential that all of them can use the same data node as the parent to construct a legal match.
3.2 Summary matching<br />
Our summary is indexed using an inverted index over the root-to-node paths in the summary tree, similar to what is described in [12]. Paths in the document tree are viewed as strings of node names, where the names are dictionary coded and stored as integers.
Algorithm 1 Overall query evaluation
1: procedure Evaluate(Q)
2:   SummaryMatch(Q)
3:   for l_q ∈ Q.leaves
4:     ChooseView(l_q)
5:   if no matches possible
6:     return
7:   if ∀ b_q ∈ Q.branching : b_q.minDepth = b_q.maxDepth
8:     LinearJoin(Q)
9:   else
10:    LoopingJoin(Q)
The first step of the summary matching is finding the matches for the individual root-to-leaf paths in the query. The most selective element name in each path is looked up in the inverted index, and a list of candidate paths is retrieved. This list is then matched against the pattern.
For each query node, the matching path IDs are saved, along with their respective candidates for the parent branching node. This is more robust than storing all legal combinations of matches for the query nodes above, as their total number can be exponential in the size of the query. This typically happens with deeply recursive schemas.
After finding the individual root-to-leaf path matches, the query tree is traversed bottom-up and then top-down to prune away node matches which can never be part of a complete match.
3.3 Data access methods<br />
It is common in XML indexing systems to index element nodes on name and text nodes on value. For prefix path partitioning, nodes must also be accessible by path. In our system we use multiple access methods for all node types, and attempt to choose the most efficient one for each query leaf node during twig evaluation. Lists for internal nodes are virtual (see Sections 3.4 and 3.5).
One approach to implementing multiple access methods is storing a table with all nodes in the data, and having multiple indexes point into this storage. But if reading matches from such an index means following pointers into a node table, this gives bad spatial locality. It also makes it inefficient to use compression schemes with expensive random lookups. To avoid this, we use multiple materialized views of the data table, with different sort orders. The views we create are shown in Figure 6, where underlined fields give the sort order.
The base table is built while scanning the data, and the other tables are derived from it using a stable ternary quicksort. As the sort is stable, and the source is sorted on the first two columns, a target table is in order after sorting on the first target column only.
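Why one stable sort on the first target column suffices can be shown concretely: the base table is in (doc, dewey) order, so a stable sort on, say, pathID leaves rows with equal pathID in (doc, dewey) order, which is exactly the order t_path needs. Python's stable sorted() stands in for the paper's stable ternary quicksort; the rows are our rendering of Figure 5.

```python
# (doc, dewey, pathID) rows in document order, i.e. sorted on (doc, dewey).
base = [
    (1, (1,), 1),        # lib
    (1, (1, 1), 3),      # book
    (1, (1, 1, 3), 4),   # author
    (1, (1, 1, 4), 4),   # author
    (1, (1, 2), 2),      # article
]

# Stable sort on pathID alone...
t_path = sorted(base, key=lambda row: row[2])
# ...gives the same order as a full sort on (pathID, doc, dewey).
assert t_path == sorted(base, key=lambda row: (row[2], row[0], row[1]))
```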
t_base (doc, dewey, nameID, pathID, value)
t_value (value, doc, dewey, nameID, pathID)
t_path (pathID, doc, dewey, value)
t_name (nameID, doc, dewey, pathID, value)
Figure 6: Materialized views used.
The mappings for nameID and pathID are stored in tables, which are read into separate in-memory data structures for fast lookup. Note that the field nameID also indicates the type of the node (element, attribute, text, etc.), and that for text nodes, the name of the parent element node is stored. Where a query leaf has multiple matching pathIDs, the hits have to be merged when read from the table t_path. In many cases it is cheaper to scan the t_name matches and filter out non-matching nodes on pathID.
3.4 Simple Linear Join<br />
Assume that, as in XPath, the query has one output node, and that there is one list of data node matches for each leaf node q in the query, which can be read with Read(q) and moved one data node forward with a call to Advance(q).
The linear join shown in Algorithm 2 is used when, for each branching query node, there is a fixed tree depth for the matching summary nodes. The correctness of using a linear join in this case follows directly from the proofs for TwigStack [4]. There, matches for root-to-leaf paths are output from the first step of the join algorithm sorted root-to-leaf on the document order of the matched nodes, and fed to a linear merge, which is the second and last step. Since the depths of all branching nodes are fixed, the lists of leaf node matches will already be in the correct order, from the fact that the data is a tree.
The reason no pathID ever needs to be read in Algorithm 2 is that the incoming leaf lists only contain structural matches for the root-to-leaf paths, with the branching node matches above them fixed in depth. A simple comparison of Deweys determines a common branching node in the data.
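The Dewey comparison used here is a prefix test: two leaf matches share the (fixed-depth) branching node exactly when their Deweys agree on the first m components. A one-function sketch:

```python
def same_branching_node(dewey1, dewey2, m):
    """True if both leaves descend from one data node at depth m."""
    return dewey1[:m] == dewey2[:m]

# The two author text nodes of Figure 5 share the book node at depth 2.
assert same_branching_node((1, 1, 3, 1), (1, 1, 4, 1), 2)
assert not same_branching_node((1, 1, 3, 1), (1, 2, 1), 2)
```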
3.5 Loop Join<br />
For cases where a simple linear join cannot guarantee a correct result, a nested loop join is used. There are two main modifications from the simple linear join. Markers are left in the lists, to which the list cursors are rewound when necessary. And for a given candidate alignment of the leaf lists, it must be checked whether it is possible to choose candidates for the branching query nodes which are usable for all the leaf matches.
In previous loop join approaches [4], where there are explicit lists for internal query nodes, child list markers are updated when the parent's list cursor is forwarded, but this method cannot be used directly in our approach. First we advance the list cursors to alignment based on the Deweys matched down to the minimal possible depth of the above
Algorithm 2 Simple linear join
1: procedure LinearJoin(Q)
2:   q_s := Q.selectingNode    ⊲ Assume a leaf
3:   while SimpleNextMatch(Q.root)
4:     Output(q_s)
5:     Advance(q_s)
⊲ Note: non-branching, non-leaf, non-selecting nodes are ignored.
6: procedure SimpleNextMatch(q)
7:   if IsLeaf(q)
8:     q.d := Read(q).dewey
9:     return q.d
10:  m := q.minDepth    ⊲ Equals q.maxDepth
11:  x := max_{q_c ∈ q.children} q_c.d
12:  while true
13:    for q_c ∈ q.children
14:      Align(q_c, x, m)
15:      x_c := SimpleNextMatch(q_c)
16:      if x_c = ∅ return ∅
17:      x := max(x_c, x)
18:    if no change since last step
19:      q.d := (x_1, . . . , x_m)
20:      return q.d
21: procedure Align(q, x, m)
22:   while Read(q).dewey < (x_1, . . . , x_m)    ⊲ Lexicographic
23:     if not Advance(q) return
branching nodes. This depth is the maximum over the leaf matches of the minimal depth of a candidate for the branching node (max-of-the-min) from the summary matching phase. Then list positions are marked, before we iterate through the possible alignments. When list number i in the order is forwarded, list i + 1 is rewound to the mark, before it is aligned based on Dewey, and the new mark is saved if it differs from the previous one. To speed up the iteration, the lists for all leaf query nodes are ordered on the expected number of hits.
The procedure depicted in Algorithm 3 checks whether a given alignment is a match. First, the query is traversed bottom-up, and common candidates for the branching nodes are chosen. Then it is traversed top-down, choosing the uppermost common candidate for each branching node. Finally, it is checked that the chosen branching nodes are the same physical nodes, by comparing the Deweys. Note that the top-down pass and the final bottom-up pass can be completed in one top-down traversal, as the Dewey for an internal node need not be materialized, but is given by the Dewey of any leaf descendant and
the chosen depth.<br />
Even though bit-vectors are used to implement the sets of candidates, using Algorithm 3 is expensive. For cases where many alignments are mismatches, it is cheaper to first check the Deweys down to the max-of-the-min depth. This gives false positives, so matches must still be verified with Algorithm 3. The max-of-the-min used here can be calculated from the branching nodes usable for the current leaf node matches, instead of from all branching node candidates from the summary matching.
In cases where the output node is not a leaf value node, but matches XML element nodes, care must be taken to iterate through all candidates and not to choose duplicates (line 20 in Algorithm 3). Also, to output entire data subtrees as matches, the table t_base (see Section 3.3) must be scanned or searched during query processing to retrieve all nodes which have the given Dewey as a prefix.
3.6 Skipping<br />
Skipping is crucial for efficient merges, and is included in Algorithm 2 by modifying the procedure Align to use underlying data structures to skip forward to the next item matching the parameter x. Skipping is implemented in our system by combining a "one level skip list" [35] with search. Every k-th value is extracted from the columns on which merges will be performed (doc and dewey), and put in a separate column. k is chosen to be a power of two (usually 32 in our system) to allow arithmetic using shifts. Note that this column can also double as check-points in compression (see Section 3.7). When forwarding a list to a value, the first few values in the smaller column are checked, and if no match is found, a binary search is performed there. Then a segment of the full column is scanned linearly.
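The forwarding scheme can be sketched as follows, under the assumptions stated in the text (every k-th key copied to a skip column, k a power of two so the segment index is a shift); the binary search over the skip column stands in for the check-then-bisect step.

```python
# One-level skip list forwarding: bisect the small skip column, then scan
# one k-sized segment of the full column.
import bisect

K_SHIFT = 5          # k = 32, as in the system described
K = 1 << K_SHIFT

def forward_to(column, skip, target):
    """Index of the first element >= target; skip = column[::K]."""
    seg = bisect.bisect_right(skip, target) - 1  # segment whose first key <= target
    i = max(seg, 0) << K_SHIFT                   # segment start, via a shift
    while i < len(column) and column[i] < target:
        i += 1                                   # linear scan inside the segment
    return i

column = list(range(0, 2000, 2))                 # sorted keys 0, 2, ..., 1998
skip = column[::K]                               # every 32nd key
assert forward_to(column, skip, 1001) == column.index(1002)
```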
3.7 Storage Back-end<br />
The first iteration of this prototype used MonetDB/SQL [20] as a back-end for storing the data. This was replaced by our own column store back-end because the overhead of using the communication interface to the server back-end was significant, and because the open source version of MonetDB lacks compression.
Our minimalistic implementation uses memory-mapped files to write to and read from disk, and uses compression to reduce space usage. A column adapter chooses which compression method to use based on statistics of the data. Different compression schemes are favorable depending on whether the columns are self-ordered, and whether they have few or many distinct values.
Supported storage methods for integer types are raw, bit packing, run length encoding
(RLE), delta encoding and dictionary encoding. The last two are more expensive, and
are used only in the maximum compression variant in the experiments (see Section 4).
Typically RLE or delta coding is chosen for (partially) sorted columns, and bit packing or
dictionary encoding for unsorted columns. The delta encoding is done using VByte [27],
with checkpoints for faster random access. The dictionary encoded column sorts values
on frequency, and uses VByte to store the coded values. Many column types are stored
Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
Algorithm 3 Checking leaf alignment
    ⊲ Note: non-branching, non-leaf, non-selecting nodes are ignored.
 1: procedure CheckLeafAlignment(Q)
 2:   q_t := Q.topBranchingNode
 3:   Candidates(q_t)                ⊲ Bottom-up
 4:   if q_t.C = ∅
 5:     return false
 6:   ChooseUppermost(q_t)           ⊲ Top-down
 7:   if not CheckMatch(q_t)         ⊲ Bottom-up
 8:     return false
 9: procedure Candidates(q)
10:   if IsLeaf(q)
11:     q.p := Read(q).pathID
12:     return q.C := {q.p}
13:   else
14:     return q.C := ⋂_{q_i ∈ Children(q)} ParentCand(q_i)
15: procedure ParentCand(q)
16:   return ⋂_{c_i ∈ Candidates(q)} q.parMatchCand[c_i]
17: procedure ChooseUppermost(q)
18:   if IsLeaf(q)
19:     return
20:   q.p := arg min_{c_i ∈ q.C} c_i.depth
21:   for q_i ∈ q.children
22:     q_i.C := {c_i | c_i ∈ q_i.C ∧ q.p ∈ q_i.parMatchCand[c_i]}
23:     ChooseUppermost(q_i)
24: procedure CheckMatch(q)
25:   if IsLeaf(q)
26:     q.d := Read(q).dewey
27:     return true
28:   for q_i ∈ q.children
29:     if not CheckMatch(q_i)
30:       return false
31:   x := q.children[1].d
32:   m := q.p.depth
33:   q.d := (x_1, . . . , x_m)
34:   for q_i ∈ q.children
35:     if LCP(q.d, q_i.d) < m
36:       return false
37:   return true
CHAPTER 4. INCLUDED PAPERS
as multiple physical columns. For example, an RLE column is stored as separate “values”
and “cumulative count” columns.
Strings are stored raw, or using a dictionary. For the raw strings, a column of pointers
is used for random access. Dictionaries are shared between the different materialized
views. Logical columns consisting of several physical columns are compressed
recursively if this is considered favorable.
In the current prototype, no explicit indexing is done, and a simple binary search is
used. In the following experiments, all queries are run warm to simplify the experimental
setup. The poor spatial locality of the binary search is then less of an issue, but
for later experiments, running large batches of queries over more data, indexing should
be considered.
However, some of the column types have specialized search implementations. For
RLE, the “values” column is searched, and the “cumulative count” column is used to
translate to row numbers. For the delta encoding, a search in the checkpoints is done
first, before decoding the final block of coded values. For the dictionary encoding, the
dictionary is searched first, before searching for the coded value in the data column.
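To make the RLE case concrete, here is a minimal sketch (class and method names are ours) of an RLE column stored as the two physical columns described above, where the cumulative counts translate between run indices and row numbers:

```python
from bisect import bisect_left, bisect_right

class RleColumn:
    """Run-length-encoded column stored as two physical columns:
    run "values" and "cumulative count" (total rows up to and
    including each run)."""

    def __init__(self, rows):
        self.values, self.cum = [], []
        total = 0
        for v in rows:
            total += 1
            if self.values and self.values[-1] == v:
                self.cum[-1] = total        # extend the current run
            else:
                self.values.append(v)       # start a new run
                self.cum.append(total)

    def __getitem__(self, row):
        # The first run whose cumulative count exceeds `row` covers it.
        return self.values[bisect_right(self.cum, row)]

    def find(self, v):
        """Row number of the first occurrence of v, or -1.
        Assumes the column is self-ordered (run values sorted),
        so the short values column can be binary-searched."""
        run = bisect_left(self.values, v)
        if run == len(self.values) or self.values[run] != v:
            return -1
        return 0 if run == 0 else self.cum[run - 1]
```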
A note should be made on the encoding of Deweys. A variable number of bytes is
used per element in the Dewey, and the first bits of an element number give the number of
bytes in the element. This allows comparing two Deweys with a simple string comparison
to find which is smaller, which is useful when merging lists of nodes. To check whether two
Deweys match to a given depth in the document tree, the codes must be parsed when
compared.
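The exact bit layout is not specified in the text; the following sketch uses an invented layout (leading bits 0/10/110 selecting one, two or three bytes) that has the stated property: byte-wise string comparison of whole encoded Deweys agrees with their lexicographic (pre-order document) order:

```python
def encode_elem(n):
    """Order-preserving, length-prefixed code for one Dewey element.
    The leading bits give the number of bytes, and all shorter codes
    sort (byte-wise) before all longer ones."""
    if n < 1 << 7:
        return bytes([n])                        # 0xxxxxxx
    if n < (1 << 7) + (1 << 14):
        m = n - (1 << 7)
        return bytes([0x80 | (m >> 8), m & 0xFF])  # 10xxxxxx + 1 byte
    m = n - (1 << 7) - (1 << 14)
    assert m < 1 << 21
    return bytes([0xC0 | (m >> 16), (m >> 8) & 0xFF, m & 0xFF])

def encode_dewey(path):
    # Concatenating prefix-free, order-preserving element codes keeps
    # byte-wise comparison consistent with tuple comparison.
    return b"".join(encode_elem(e) for e in path)

a, b = encode_dewey((1, 200, 3)), encode_dewey((1, 200, 10))
assert (a < b) == ((1, 200, 3) < (1, 200, 10))
assert encode_dewey((1, 200)) < a   # an ancestor sorts before its subtree
```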
4 Experimental Results
This section presents results for query performance evaluation of various XPath implementations.
Experiments were performed on an AMD Athlon 64 Processor 3500+ at 2.2
GHz, with 4 GB main memory, running Linux 2.6.28.
The open source MonetDB/XQuery (MoXQ) [3] and an anonymous commercial system
(SysA) were tested. They were included because they feature some, but not all, of
the techniques from Section 3.1. SysA uses path indexing and multiple access methods,
but reads lists for internal query nodes. MoXQ uses tag partitioning with no skipping,
and has a column store back-end. The release from August 2008 was used. We also
tested the systems Berkeley DB XML, X-Hive and eXist, but the results are omitted, as
these systems had poorer performance than MoXQ and SysA, and the results did not
yield further insights.
For our own prototype we have included performance results for the compression
schemes none (XLeaf_none), lightweight (XLeaf_light) and maximum (XLeaf_max). The latter
includes delta and dictionary encoding for integer types.
It should be noted that comparing a minimalistic prototype with full-fledged systems
offering features like transaction management is not fair. Ideally, algorithms should be
compared within the same framework, but this was not done in these preliminary
experiments, due to time constraints.
Some queries have been rewritten to counting queries to avoid measuring the overhead
of printing answers, and because outputting hundreds of thousands of results is not a
probable user scenario. All queries were given 2 warm-up runs, and were executed 10
times.
4.1 Indexing Performance and Space Usage
Figure 7 shows the indexing performance of the tested methods on the DBLP corpus [17]
and the artificial benchmark XMark [26]. The table shows megabytes indexed per second,
and the size of the index divided by the size of the original unparsed data.
                  MoXQ   SysA   XLeaf_none   XLeaf_light   XLeaf_max
  DBLP (440 MB)
    MB/s          1.92   0.27   0.43         0.45          0.30
    Space         2.6    14.1   5.0          1.84          1.59
  XMark (1118 MB)
    MB/s          8.8    0.45   1.39         1.41          1.06
    Space         1.78   10.8   4.2          1.32          1.20

Figure 7: Indexing performance.
MoXQ has the fastest indexing, and is fairly space efficient. The increased space
requirements from adding the additional access methods in SysA may make it unusable
in some scenarios. The space usage of our uncompressed variant (XLeaf_none) may also
be too high, but the two compressed variants are very space efficient, and use less space
than MoXQ, even though they feature multiple access methods like SysA, and use the
Dewey encoding, which has more redundancy than the BEL encoding used in MoXQ.
Note that building the lightweight compressed index XLeaf_light is faster than building
the uncompressed index, while the heavily compressed variant is much slower, but only
reduces storage requirements by a further 10-15%.
4.2 DBLP Queries
Figure 8 shows some queries on DBLP taken from [25], and Figure 9 gives the running
times.
  #   Query                                                                      Hits
  P1  //inproceedings[author="Jim Gray"][year="1990"]/@key                       6
  P2  //www[editor]/url/text()                                                   5
  P3  //book/author[text()="C. J. Date"]/text()                                  13
  P4  //inproceedings[title/text()="Semantic Analysis Patterns."]/author/text()  2

Figure 8: DBLP queries.
               P1     P2     P3     P4
  MoXQ         1815   131    257    1151
  SysA         972    2.5    0.53   7.2
  XLeaf_none   0.126  0.064  0.039  0.058
  XLeaf_light  0.130  0.064  0.044  0.058
  XLeaf_max    0.42   0.123  0.044  0.067

Figure 9: Query performance. Run-time in milliseconds.
The results for MoXQ are worse than one might expect, especially on P1 and P4. This
is because the data has a very flat and repetitive structure, with many node matches for
//inproceedings. SysA has better performance, due to its use of path indexing, but it
is still slow on P1, probably because merging the three branches is expensive. In P2, the
total number of matches for //www is low.
Our implementations perform very well in this experiment, because only data for leaf
nodes is read, and skipping is applied in the join. The differences between XLeaf_none
and XLeaf_light are almost negligible. We expected the lightweight compression to give
better performance due to lower data bandwidth requirements. One reason this is not
the case may be that queries are run warm, which reduces the bandwidth bottleneck due to
caching effects, in combination with slightly more expensive computation for XLeaf_light.
XLeaf_max has worse performance, due to even more computation, and perhaps worse
memory access patterns, because of the dictionary compression for integer types.
  #   Query                                                                               Hits
  A1  count(/site/closed_auctions/closed_auction/annotation/description/text/keyword)     40726
  A2  count(//closed_auction//keyword)                                                    124843
  A3  count(/site/closed_auctions/closed_auction//keyword)                                124843
  A4  count(/site/closed_auctions/closed_auction[annotation/description/text/keyword]/date)  26570
  A5  count(/site/closed_auctions/closed_auction[descendant::keyword]/date)               53342
  A6  count(/site/people/person[profile/gender and profile/age]/name)                     32242
  V1  ... keyword[text()=" preventions "] ...                                             55
  V2  ... keyword[text()=" preventions "] ...                                             145
  V3  ... keyword[text()=" preventions "] ...                                             145
  V4  ... keyword[text()=" preventions "] ... date[text()="06/27/1998"] ...               11
  V5  ... keyword[text()=" tempests "] ... date[text()="04/18/1999"] ...                  12
  V6  ... gender[text()="male"] ... age[text()="18"] ... name[text()="Mehrdad Takano"] ...  19

Figure 10: XMark queries from XPathMark. Queries V1-V6 are equal to queries A1-A6
with added value predicates.
4.3 XPathMark A Queries
Queries A1-A6 on the XMark data are from XPathMark [9, 8], an XPath equivalent of the
XQuery benchmark XMark. XMark is artificially generated, and has properties rather
different from DBLP, which has a flat and repetitive structure; XMark is a deeper tree,
following a recursive schema.
               A1    A2   A3   A4   A5    A6   V1     V2    V3    V4    V5     V6
  MoXQ         214   85   110  263  156   348  272    249   262   323   294    456
  SysA         3.9   352  348  178  2128  446  11.8   398   371   30    46     97
  XLeaf_none   9.1   72   73   56   153   181  0.160  0.27  0.29  0.32  0.144  4.3
  XLeaf_light  8.5   94   94   57   180   196  0.24   0.38  0.28  0.58  0.20   4.6
  XLeaf_max    9.6   93   94   166  190   708  0.21   0.32  0.34  3.6   0.70   24

Figure 11: Query performance. Run-time in milliseconds.
MoXQ and SysA seem to have about equal performance, with the former in the lead.
In these tests, the former performs relatively much better than on DBLP, because value
predicates are not used, and the data tree is differently shaped. The number of matches
for nodes near the root of the query is usually low, and the query leaf matches make up
the majority. This gives a much cheaper join relative to the total number of matches.
Note the performance of MoXQ and XLeaf_light (or SysA) on queries A1 and A2, which
highlights a key difference between the two systems. MoXQ is twice as fast on A2 as on A1,
even though A2 has three times as many hits. On A1 it must merge matches for seven
internal nodes, but on A2 only two of these are involved. XLeaf_light, on the other hand,
looks up the first query on pathID, and all nodes are matches. The second query is looked
up on nameID and filtered on pathID, as there are multiple matches for the latter. Note
that the performance on A3 is the same as on A2, as the executions are identical in our
system. The fact that MoXQ is almost as fast as XLeaf_none on A2, even though it reads
more data and does more work, indicates a better implementation.
XLeaf_light is faster on A4 than A3, even though the former is branching. The reason
is that the leaves can be looked up directly on pathID, and hits are found efficiently using
skipping. A5 is three times as slow, even though there are only two times as many hits.
This is because the predicate leaf on keyword has more matches. A6 is also much slower
than A4, even though the number of hits is similar, again because the individual query
leaves have more matches.
Comparing SysA and XLeaf_light on queries A2 and A3 may show the benefit of using
a column store back-end. SysA also looks up nodes on path in these queries, but is more
than three times slower. Note that for the branching queries A4-A6, its disadvantage of
reading matches for branching nodes does not show as much as on the DBLP queries, as
the data is deeper and more “tree shaped”, with more matches for the query leaves than
for the branching nodes.
4.4 XPathMark A Queries, with Value Predicates
Queries V1-V6 are the same as A1-A6, but with added value predicates. These were
chosen to give as many hits as possible.
MoXQ is slightly slower with value predicates, probably because of the extra text
nodes to be read and merged. SysA is also slower on the first three queries, but faster
on the last three, because looking up the leaves on value is much more selective. There
are, however, still many matches for the branching node, giving an unnecessarily expensive
join. A comparison of SysA and XLeaf_light, both using path indexing, shows the benefits
of our system's features. Our implementation avoids handling the large number of matches
for the internal query nodes, and is 20-200 times faster.
Query V6 is more expensive than the others for XLeaf_light, because the result for one
of the leaves is large, and the merge is more expensive, even when skipping is used and
the simple linear merge is allowed.
  #   Query                                                                Hits
  S1  /dblp/inproceedings[@key="conf/3dica/RohalyH00"]/booktitle/text()    1
  S2  //inproceedings[@key="conf/3dica/RohalyH00"]/booktitle/text()        1
  S3  //*[@key="conf/3dica/RohalyH00"]/booktitle/text()                    1
  S4  //*[@*="conf/3dica/RohalyH00"]/booktitle/text()                      1
  B0  /dblp//author[text()="Michael Stonebraker"]/text()                   215
  B1  /dblp/*/author[text()="Michael Stonebraker"]/text()                  215
  B2  /dblp/*[author/text()="Michael Stonebraker"]/title/text()            215
  B3  /dblp/*[author/text()="Michael Stonebraker"]
          [author/text()="Hector Garcia-Molina"]/title/text()              4
  B4  /dblp/*[author="Michael Stonebraker"][author="Hector Garcia-Molina"]
          [@key="journals/corr/cs-DB-0310006"]/title/text()                1
  B5  /dblp/*[author="Michael Stonebraker"][author="Hector Garcia-Molina"]
          [@key="journals/corr/cs-DB-0310006"][year > 1950]/title/text()   1

Figure 12: Queries with decreasing path specification, and queries with increasing branching
complexity, on DBLP.
4.5 Queries with Decreasing Path Specification
Queries S1-S4 in Figure 12 are queries on DBLP with decreasingly specified paths,
using an increasing number of wild-cards.
               S1     S2     S3     S4     B0     B1     B2      B3      B4      B5
  MoXQ         1549   1387   4553   4694   1464   2609   2704    2940    2948    3029
  SysA         160    289    21911  21932  1450   1247   146273  146353  149293  19127
  XLeaf_none   0.054  0.054  0.59   0.93   0.084  0.085  0.63    0.25    0.159   0.20
  XLeaf_light  0.055  0.054  0.50   0.96   0.086  0.088  1.00    0.26    0.156   0.192
  XLeaf_max    0.060  0.062  0.51   0.98   0.114  0.104  2.2     0.78    0.21    0.27

Figure 13: Query performance, with run-time in milliseconds.
This test shows that our systems are more robust for partially specified paths. With
MoXQ and SysA, the query cost increases greatly from query S2 to S3, because a larger
number of nodes become candidates for the branching node. Query S4 is a degenerate
query which would probably never be seen in a real system, but it can be argued that S3
is not unrealistic. Our implementation is very fast on all queries. One leaf is consistently
looked up on value, while the other is looked up on pathID for S1 and S2, and on nameID
for S3 and S4, which is the reason the last two queries are slower.
4.6 DBLP Queries with Increasing Branching Complexity
Queries B0-B5 in Figure 13 show performance results for queries with increasing branching
complexity.
Comparing B0 and B1, the main cost for MoXQ seems to be merging with the list
for /dblp/*, which is large. Additional branches add no significant extra cost. Note that
this node is critical for the semantics of queries B2-B5. The poor results for SysA on
B2-B5 are due to a “performance bug”: the system sometimes chooses to use nested
loop lookups, even though other approaches would be more efficient.
Our implementation is also affected by the added branches in the queries, with some
interesting effects. B3 is faster than B2, because the join algorithm takes advantage of
the two author lists being more selective. The common result is smaller, and more can
be skipped when joining with the title. A similar argument holds for B4 vs B3, but for
B5, the added predicate on year has low selectivity, and only adds to the cost.
5 Related Work
Loop joins have been used previously in MPMGJN [37] for pairs of nodes, and in
PathMPMJ [4] for path queries, while in our system loop joins are used for full twigs,
and with no physical lists of matches for internal nodes.
Most twig join algorithms are non-looping, and read the input lists once. The original
TwigStack [4] maintains one stack of nested matches for each query node, and outputs
individual root-to-leaf path matches, which are merged in a second phase. More complex
intermediate result management is used by the later single-phase join algorithms, such
as Twig²Stack [5], HolisticTwigStack [16], TwigList [24] and TwigFast [18].
Structural summaries [10] are a prerequisite for prefix path partitioning, which was
used previously in iTwigJoin [6]. A difference between iTwigJoin and our approach is
that iTwigJoin uses materialized lists for all query nodes, and uses a specialized join to
combine pairs of lists for directly related query nodes.
Generating matches for internal query nodes on the fly is done in the Virtual Cursors
[36] algorithm and in TJFast [19], which use tag partitioning. A disadvantage of TJFast
is that instead of using a structural summary, the name path and Dewey are stored
compressed in the leaf lists. This makes reads unnecessarily expensive and increases space
usage. A disadvantage of Virtual Cursors is that non-branching internal nodes are not
ignored during evaluation, because generated internal nodes are not guaranteed to have
matching prefix paths. This also causes useless node candidates for branching internal
nodes. A shortcoming of both previous approaches, and also of ours, is that much work is
repeated during construction of internal nodes.
Skipping joins have been used previously with tag partitioning in for example
TSGeneric+ [15] and TwigOptimal [7]. When physical lists are used for the internal query
nodes, skipping in a list to catch up with a descendant is not trivial, and specialized data
structures like XR-trees [33] must be used. An advantage of virtual node lists is that
skipping for internal nodes can be implicit. This was used previously in Virtual Cursors [36],
where B-trees were used to skip through leaf node matches.
Multiple access methods for query node matches, implemented through materialized
views, are used for XML search in Microsoft SQL Server [22]. Their system uses prefix
path partitioning similar to ours, but reads data node lists for internal query nodes, and
uses row oriented storage. MonetDB/XQuery [3] has columnar storage, but it uses tag
partitioning, and does not feature compression.
6 Conclusion and Future Work
This paper has investigated how to combine various techniques for twig matching. A prototype
was developed to investigate possible benefits. When compared with two existing
systems, which use only some of these techniques, a speedup of two orders of magnitude
was shown for queries with value predicates.
Future work includes modifying the experimental prototype to allow switching features
on and off individually, to investigate both their individual and combined benefits
thoroughly. This may give more insight than comparing with other full systems. More
features, like optimal twig join algorithms, should also be added to our system.
Acknowledgment
Thanks to Felix Weigel for fruitful discussions. The first author was supported by the
Research Council of Norway under grant NFR 162349.
References
[1] Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-stores vs. row-stores:
how different are they really? In Proc. SIGMOD, 2008.
[2] S. Al-Khalifa, H.V. Jagadish, N. Koudas, J.M. Patel, and D. Srivastava. Structural
joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002.
[3] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger,
and Jens Teubner. MonetDB/XQuery: a fast XQuery processor powered by a relational
engine. In Proc. SIGMOD, 2006.
[4] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal
XML pattern matching. In Proc. SIGMOD, 2002.
[5] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant
Agrawal, and K. Selçuk Candan. Twig²Stack: bottom-up processing of generalized-tree-pattern
queries over XML documents. In Proc. VLDB, 2006.
[6] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig
pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.
[7] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing
cursor movement in holistic twig joins. In Proc. CIKM, 2005.
[8] M. Franceschet. XPathMark. http://sole.dimi.uniud.it/~massimo.franceschet/xpathmark/
(Accessed: 2009-03-06).
[9] M. Franceschet. XPathMark: an XPath benchmark for XMark generated data. In
Proc. XSYM, 2005.
[10] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and
optimization in semistructured databases. In Proc. VLDB, 1997.
[11] Gang Gou and R. Chirkova. Efficiently querying large XML data repositories: A
survey. Knowl. and Data Eng., 2007.
[12] Nils Grimsmo. Faster path indexes for search in XML data. In Proc. ADC, 2008.
[13] T. Härder, M. Haustein, C. Mathis, and M. Wagner. Node labeling schemes for
dynamic XML documents reconsidered. Data & Knowl. Engineering, 2006.
[14] Allison L. Holloway and David J. DeWitt. Read-optimized databases, in depth. Proc.
VLDB Endow., 2008.
[15] H. Jiang, W. Wang, H. Lu, and J.X. Yu. Holistic twig joins on indexed XML
documents. In Proc. VLDB, 2003.
[16] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient
processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA,
2007.
[17] Michael Ley. DBLP XML Records, Jan. 2007. http://www.informatik.uni-trier.de/~ley/db/
(Accessed: 2008-10-22).
[18] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.
[19] J. Lu, T.W. Ling, C.Y. Chan, and T. Chen. From region encoding to extended Dewey:
On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005.
[20] Monet. MonetDB web page. http://monetdb.cwi.nl/ (Accessed: 2009-02-14).
[21] Patrick O'Neil, Elizabeth O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, and
Nigel Westbury. ORDPATHs: insert-friendly XML node labels. In Proc. SIGMOD,
2004.
[22] Shankar Pal, Istvan Cseri, Oliver Seeliger, Gideon Schaller, Leo Giakoumakis, and
Vasili Zolotov. Indexing XML data stored in a relational database. In Proc. VLDB,
2004.
[23] Peter Boncz, Marcin Zukowski, and Niels Nes. MonetDB/X100: Hyper-pipelining
query execution. In Proc. CIDR, 2005.
[24] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast.
In Proc. DASFAA, 2007.
[25] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer
sequences. In Proc. ICDE, 2004.
[26] Albrecht Schmidt, Florian Waas, Martin Kersten, Michael J. Carey, Ioana
Manolescu, and Ralph Busse. XMark: a benchmark for XML data management.
In Proc. VLDB, 2002.
[27] Falk Scholer, Hugh E. Williams, John Yiannis, and Justin Zobel. Compression of
inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.
[28] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt,
and Jeffrey F. Naughton. Relational databases for querying XML documents:
Limitations and opportunities. In Proc. VLDB, 1999.
[29] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack,
Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat
O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-Store: a column-oriented DBMS.
In Proc. VLDB, 2005.
[30] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene
Shekita, and Chun Zhang. Storing and querying ordered XML using a relational
database system. In Proc. SIGMOD, 2002.
[31] W3C. XPath 1.0, 1999. http://www.w3.org/TR/xpath (Accessed: 2009-04-29).
[32] W3C. XQuery 1.0, 2007. http://www.w3.org/TR/xquery/ (Accessed: 2009-04-29).
[33] H. Jiang, H. Lu, W. Wang, and B.C. Ooi. XR-tree: Indexing XML data for efficient
structural joins. In Proc. ICDE, 2003.
[34] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval.
PhD thesis, Ludwig-Maximilians-Universität München, 2006.
[35] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes (2nd ed.):
Compressing and Indexing Documents and Images. 1999.
[36] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin
Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.
[37] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On
supporting containment queries in relational database management systems. SIGMOD
Rec., 2001.
Paper 4
Nils Grimsmo and Truls Amundsen Bjørklund
Towards Unifying Advances in Twig Join Algorithms
Proceedings of the 21st Australasian Database Conference (ADC 2010)
Abstract  Twig joins are key building blocks in current XML indexing systems, and numerous
algorithms and useful data structures have been introduced. We give a structured,
qualitative analysis of recent advances, which leads to the identification of a number of
opportunities for further improvement. Cases where combining competing or orthogonal
techniques would be advantageous are highlighted, such as algorithms avoiding redundant
computations and schemes for cheaper intermediate result management. We propose some
direct improvements over existing solutions, such as reduced memory usage and stronger
filters for bottom-up algorithms. In addition we identify cases where previous work has
been overlooked or not used to its full potential, such as for virtual streams, or where the
benefits of previous techniques have been underestimated, such as for skipping joins. Using
the identified opportunities as a guide for future work, we are hopefully one step closer
to a unification of many advances in twig join algorithms.
115
Paper 4: Towards Unifying Advances in Twig Join Algorithms
1 Introduction

Twig matching is the most heavily used building block for systems offering search in XML with languages like XPath and XQuery [12]. XML has become the de-facto standard for storage of semi-structured data, and the standard for data exchange between disjoint information systems. XPath is a declarative language, and XQuery is an iterative language which uses XPath as a building block. XPath queries can be evaluated in polynomial time [11].
Most academic work related to indexing and querying XML focuses on the twig matching problem, which is equivalent to a subset of XPath: Given a labeled data tree and a labeled query tree, find all matchings of the query nodes to the data nodes, where the data nodes satisfy the ancestor-descendant (a-d) and parent-child (p-c) relationships specified by the query tree edges.
The example in Figure 1 shows the relation between twig matching and XML search. The tree in part (a) is an abstraction of the XML document in (c). Real XML separates element (tag), attribute and text nodes, but in the abstract model there is only one type of node. The XPath and XQuery examples in (d) both specify the same structure as the abstract twig query in (b), where double edges symbolize a-d relationships.
[Figure 1 (a)-(c): the abstract data tree, the twig query and the XML document are not reproducible in this extraction.]
(d) XPath and XQuery:
//a[.//b][c]
for $na in //a
let $nb := $na//b
where $na/c
order by $na
return ($na, $nb)
Figure 1: XML and twig matching relation. (a) Abstract data tree. (b) Twig query. (c) XML Data. (d) XPath (above) and XQuery (below).
This work focuses on twig matching in indexed data trees. In a typical setting, all data nodes with the same label are stored together, using some encoding which specifies tree positions. To evaluate a query, one stream of data nodes with matching label is read for each query node, and the streams are joined to form twig matches.
This paper gives a structured analysis of recent advances in twig join algorithms, which leads to the identification of a number of opportunities for further improvements. Some direct improvements are identified, such as reduced memory usage in bottom-up algorithms and stronger top-down filters. We highlight cases where new combinations of competing and orthogonal techniques would have clear advantages, but also cases where important previous work has, in our view, been compared to unfairly. We note
1 See errata in Appendix A.
CHAPTER 4. INCLUDED PAPERS
some open challenges, such as updatability in strong structural summaries, and more efficient detection of cases where simpler and faster algorithms can be used (Section 3.7).

The analysis explores techniques for avoiding redundant computations (Section 3.1), schemes for intermediate result management (Section 3.2), top-down filters for bottom-up algorithms (Section 3.3), skipping joins (Section 3.4), refined access methods (Section 3.5) and virtual streams (Section 3.6).
2 Background: Concepts and Techniques

This section goes through some fundamental concepts and techniques which are useful for the understanding of later algorithms. First we formally define the problem.

Definition 1 (Twig matching). Given a rooted unordered labeled query tree Q and a rooted ordered labeled data tree D, find all complete matchings of the nodes in Q, such that the matched nodes in D follow the structural requirements given by the a-d and p-c edges in Q.
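A brute-force reference for Definition 1 can make the semantics concrete. The sketch below is our own illustration, not from the paper: the data tree of Figure 2a is represented with labels and parent pointers, and every combination of candidate nodes is checked against the query edges.

```python
from itertools import product

# Data tree of Figure 2a: node 0 is the root a; parent[i] gives i's parent.
labels = ["a", "b", "a", "c", "b", "c"]
parent = [None, 0, 0, 2, 2, 0]

def is_anc(a, d):
    # walk parent pointers from d upwards
    while parent[d] is not None:
        d = parent[d]
        if d == a:
            return True
    return False

def matches(qnodes, qedges):
    """qnodes: query node -> label; qedges: (parent, child, 'ad'|'pc').
    Yields complete matchings, query node -> data node id."""
    cand = {q: [i for i, l in enumerate(labels) if l == lab]
            for q, lab in qnodes.items()}
    order = list(qnodes)
    for combo in product(*(cand[q] for q in order)):
        m = dict(zip(order, combo))
        if all((parent[m[c]] == m[p]) if ax == "pc" else is_anc(m[p], m[c])
               for p, c, ax in qedges):
            yield m

# //a[.//b][c]: an a-d edge to b and a p-c edge to c.
res = list(matches({"qa": "a", "qb": "b", "qc": "c"},
                   [("qa", "qb", "ad"), ("qa", "qc", "pc")]))
```

The holistic algorithms discussed below exist precisely because this enumeration is exponential in the query size; the sketch is only a correctness reference.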
Note that there is a slight difference between the semantics of twig matching and XPath. A twig query returns all legal combinations of node matches, while in XPath there is a single query return node. The generality of returning all legal combinations of matches in twig matching may have become the academic focus because it is useful for the flexibility in XQuery. XPath can also specify more than a-d and p-c relationships, but a majority of XPath queries in practice use only the a-d and p-c axes [12].
Many early approaches to search in semi-structured data used combinations of indexing and tree navigation, but the main focus in the last decade has been on indexing with inverted lists and structural joins of streams of query node matches [1]. This paper only considers twig join algorithms.
Indexing and node encoding is critical for the efficiency of twig joins. Usually data nodes are indexed (partitioned) on node labels, using for example inverted lists. Two aspects of how data is stored inside partitions are important: How the position of a node is encoded, and how the nodes in the partition are ordered. For most algorithms nodes are stored in depth-first traversal pre-order, such that ancestors are seen before descendants. The positional information which follows nodes must allow deciding a-d and p-c relationships. The most common is the regional begin,end,level (BEL) encoding [29], which is used in the data extraction example in Figure 2. It reflects the positions of opening and closing tags in XML (see Figure 1c). The begin and end numbers are not the same as pre- and post-order traversal numbers, but give the same sorting orders.
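The structural predicates over the BEL encoding can be sketched as follows (our own minimal illustration; dicts stand in for stored index entries):

```python
def is_ancestor(a, d):
    # a-d relationship: a's region strictly contains d's region
    return a["B"] < d["B"] and d["E"] < a["E"]

def is_parent(p, c):
    # p-c relationship: containment plus adjacent levels
    return is_ancestor(p, c) and c["L"] == p["L"] + 1

# Entries from the extracted table in Figure 2b.
a18 = {"B": 1, "E": 8, "L": 1}
a36 = {"B": 3, "E": 6, "L": 2}
b55 = {"B": 5, "E": 5, "L": 3}
```

For example, a 1,8 is an ancestor but not the parent of b 5,5 , while a 3,6 is its parent.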
In the following, let T q denote the stream of matches for query node q, and C q denote the current data node in this stream. For simplicity, polynomial factors in the size of the query are ignored in asymptotic notation.
The early history of twig joins is shown in Figure 3. An early approach for schema-agnostic XML indexing was to store nodes with BEL encoding in an RDBMS, and specify query node relations as a number of inequalities. But these theta-joins are expensive. Specialized loop structural joins which leveraged the knowledge that the data encoded is a
(a) Data: tree with BEL-encoded nodes a 1,8 (root), its children b 2,2 , a 3,6 and c 7,7 , and the children c 4,4 and b 5,5 of a 3,6 .
(b) Extracted:
tag B E L
a   1 8 1
a   3 6 2
b   2 2 2
b   5 5 3
c   4 4 3
c   7 7 2
(c) Query: a with an a-d edge to b and a p-c edge to c.
(d) Matches:
(a 1,8 b 2,2 c 7,7 )
(a 1,8 b 5,5 c 7,7 )
(a 3,6 b 5,5 c 4,4 )
Figure 2: Tree indexing and querying example.
tree were introduced [29, 1]. These have O(I + O) cost for evaluating an a-d relationship, where I and O are the sizes of the input and output streams, but quadratic worst-case cost for p-c relationships. Stack joins were introduced to get optimal evaluation for all binary structural joins.
Leveraging RDBMS.
Specialized loop joins (a-d optimal): MPMGJN [29], Tree-Merge-Desc/-Anc [1].
Stack joins (a-d & p-c optimal): Stack-Tree-Desc/-Anc [1].
Path m-way loop: PathMPMJ [4].
Path m-way stack (a-d & p-c optimal): PathStack [4].
Twig m-way loop: not explored.
Twig m-way stack (a-d optimal): TwigStack [4].
Skipping: Anc des B+ [7], XR-Stack [14].
Figure 3: The early history of twig joins. Continued in Figure 8.
A problem with combining the evaluation of a number of binary relationships to answer a query is that the intermediate results may be of size exponential in the query, even if the output is small. This led to the introduction of multi-way join algorithms.
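A toy illustration of this blowup (our own construction): a single a node containing n b nodes but no c node makes a binary a-b join produce n intermediate pairs, although the twig query a[.//b][.//c] has no match at all.

```python
n = 1000
a_nodes = [(1, 2 * n + 2)]                        # one a containing everything
b_nodes = [(2 * i, 2 * i + 1) for i in range(1, n + 1)]
c_nodes = []                                      # no c matches at all

# Binary a-d join a-b, then join the result with the (empty) c stream.
ab = [(x, y) for x in a_nodes for y in b_nodes
      if x[0] < y[0] and y[1] < x[1]]
full = [(x, y, z) for (x, y) in ab for z in c_nodes
        if x[0] < z[0] and z[1] < x[1]]
# ab holds n intermediate pairs; full is empty.
```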
Stacks are key data structures in most modern twig join algorithms. Their use here is motivated by their use in depth-first tree traversals. To join streams of ancestors and descendants, a stack of currently nested ancestor nodes is maintained. Nodes are popped off the ancestor stack when a non-contained (disjoint) node is seen in either stream. In a path or twig multi-way algorithm, there must be one stack S qi for each internal query node q i . The matches for different query nodes must be processed in total pre-order, to ensure that ancestor nodes are added before descendants need them.
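This cleaning discipline is the core of the binary stack join; the following compact sketch is in the spirit of Stack-Tree-Desc [1] (our own transliteration, with nodes as (begin, end) tuples):

```python
def stack_join_desc(ancestors, descendants):
    """Both inputs sorted by begin (pre-order); returns (anc, desc) pairs
    sorted on the descendant."""
    stack, out, ai = [], [], 0
    for d in descendants:
        # push candidate ancestors that start before d ...
        while ai < len(ancestors) and ancestors[ai][0] < d[0]:
            a = ancestors[ai]
            ai += 1
            while stack and stack[-1][1] < a[0]:   # ... popping disjoint nodes
                stack.pop()
            stack.append(a)
        while stack and stack[-1][1] < d[0]:       # clean against d itself
            stack.pop()
        out.extend((a, d) for a in stack)          # every stacked node nests d
    return out

# The a and b streams of Figure 2.
pairs = stack_join_desc([(1, 8), (3, 6)], [(2, 2), (5, 5)])
```

Since the stack only holds nested nodes, its size is bounded by the depth of the data tree.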
In each step of the PathStack algorithm [4], the current data node is used to clean all stacks by popping non-containing nodes, before it is pushed on stack. Figure 4b shows the stacks for a query when evaluated on the data in Figure 4a, right after the node c 1 has been pushed. When the current query node is a leaf, all related matches are output. To enable linear time enumeration of the matches encoded in the stacks, each data node pushed onto a stack has a pointer to the closest containing data node in the parent stack,
which would be the top of the parent stack as the data node was pushed. Nodes above it on the parent stack cannot be ancestors, as the data nodes are read in pre-order. In the example, b 2 and a 2 are not usable together. Because a stack only contains nested nodes, the space needed is O(d), where d is the maximal depth of the data tree.
(a) Data tree with nodes a 1 , b 1 , b 2 , a 2 , b 3 , c 1 (tree layout lost in extraction).
(b) Query path a-b-c with stacks after pushing c 1 : S a = [a 1 , a 2 ], S b = [b 1 , b 2 , b 3 ], S c = [c 1 ] (bottom to top).
Figure 4: Data structures for PathStack.
A technique for getting path matches sorted on higher query node matches first is critical for the efficiency of TwigStack and other twig multi-way algorithms. Delaying out-of-order output is achieved by maintaining so-called self- and inherit-lists for each stacked node [1]. The lists for the data and query in Figure 4 are shown in Figure 5. As a node is popped off stack, the contents of its lists are appended to the inherit-lists of the node below it on the same stack, if there is one. This is to maintain correct output order. See for example the lists for b 2 and b 1 in the example. But if the popped node can use some ancestor node in the parent stack which the node below in its own stack cannot, the contents of the lists must be appended to the self-lists there. This is decided from the inter-stack pointers. In the example, popping node b 3 results in adding (a 2 b 3 c 1 ) to the self-list of a 2 . PathStack has O(I + O) complexity both with and without delaying output, where I is now the total input size.
node  self-list                                        inherit-list
a 2   (a 2 b 3 c 1 )                                   ∅
a 1   (a 1 b 1 c 1 )(a 1 b 2 c 1 )(a 1 b 3 c 1 )       (a 2 b 3 c 1 )
b 3   (b 3 c 1 )                                       ∅
b 2   (b 2 c 1 )                                       (b 3 c 1 )
b 1   (b 1 c 1 )                                       (b 2 c 1 )(b 3 c 1 )
Figure 5: Stack nodes with final self- and inherit-lists for the data and query in Figure 4. Darker nodes popped first.
TwigStack [4] was the first holistic twig join algorithm. Using PathStack on each root-to-leaf path in a twig query and merging the matches may lead to many useless intermediate results, because path matches need not be part of complete matches. TwigStack improved on this, and achieved O(I + O) complexity for queries with a-d edges only. It is a two-phase algorithm, where the first phase outputs matches for each root-to-leaf path, and the second phase merge joins the path matches. The first phase does two things which are critical for the performance of the algorithm: It only outputs path matches which can possibly be part of some complete query match, and it outputs paths sorted on higher query nodes first, using the technique from [1]. This allows a linear merge in phase two.
TwigStack does additional checking before pushing nodes on stack compared to PathStack. The data node at the head of the stream for a query node q is not pushed on stack before it has a so-called "solution extension", which means that the heads of the streams of all child query nodes are contained by C q , and that the child nodes recursively satisfy this property. Also, a node is not pushed on stack unless there is a usable ancestor data node on the stack for the parent query node.
Pseudo-code for TwigStack is shown in Algorithm 1 (adapted from [4]). It is included here to ease the depiction of the improvements discussed in the following sections. Each query node q has an associated stream T q with current element C q , and a stack S q . The algorithm revolves around a recursive function getNext(q), which returns a (locally) uppermost query node in the subtree of q which has a solution extension. If the parent of the returned q has a usable ancestor data node on stack, this means C q is part of a full solution extension identified earlier, and C q is pushed on S q . A path match is found when a leaf node is pushed on stack, but output is delayed to make sure paths are ordered on the query nodes top down (called "blocking" in [4]). Note that actually pushing a leaf node on stack is unnecessary, as it will be popped right off.
Algorithm 1 TwigStack
1:  function TwigStack(Q)
2:    while not atEnd(Q)
3:      q := getNext(Q.root)
4:      if not isRoot(q)
5:        cleanStack(S parent(q) , C q )
6:      if isRoot(q) or not empty(S parent(q) )
7:        cleanStack(S q , C q )
8:        push(S q , C q , top(S parent(q) ))
9:        if isLeaf (q)
10:         outputPathsDelayed(C q )
11:         pop(S q )
12:     advance(T q )
13:   mergePathSolutions()
14: function getNext(q)
15:   if isLeaf (q)
16:     return q
17:   for q i ∈ children(q)
18:     q j := getNext(q i )
19:     if q j ≠ q i
20:       return q j
21:   q min := arg min q i ∈ children(q) {C q i .begin}
22:   q max := arg max q i ∈ children(q) {C q i .begin}
23:   while C q .end < C q max .begin
24:     advance(C q )
25:   if C q .begin < C q min .begin
26:     return q
27:   else
28:     return q min
The getNext() traversal is bottom up, and is short-cut if some node does not have a solution extension (see line 20). Leaves trivially have solution extensions. The traversal has the side effect of advancing the treated query node at least until it contains all its
children (line 23). If it does not contain all children at this point, the child currently with the first pre-order data node (lowest begin value) is returned to be forwarded in line 12.
Figure 6 shows the state of the algorithm when evaluating the query in Figure 2c, right after node b 5,5 has been processed. After the first call to getNext(a), when all the streams were at their start position, a itself was returned as it had a solution extension, and C a = a 1,8 was pushed on stack. For the second call to getNext(a), this was not the case, and b was returned, with head C b = b 2,2 . Since b 2,2 had a usable ancestor a 1,8 on the parent stack S a , a 1,8 must have had a solution extension, in which the subtree rooted at b 2,2 was usable. So C b = b 2,2 was pushed on its own stack S b , and since it was a leaf, the path match (a 1,8 b 2,2 ) was output. After all paths have been found they are merge joined.
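The two getNext(a) calls of this walk-through can be reproduced with a direct transliteration of lines 14-28 of Algorithm 1 (the Stream class and the query representation are our own):

```python
INF = (float("inf"), float("inf"))   # sentinel for an exhausted stream

class Stream:
    def __init__(self, nodes):       # nodes: (begin, end), sorted by begin
        self.nodes, self.i = nodes, 0
    def cur(self):
        return self.nodes[self.i] if self.i < len(self.nodes) else INF
    def advance(self):
        self.i += 1

def get_next(q, children, T):
    if not children[q]:                       # leaves trivially qualify
        return q
    for qi in children[q]:
        qj = get_next(qi, children, T)
        if qj != qi:                          # shortcut: no solution ext. below
            return qj
    q_min = min(children[q], key=lambda c: T[c].cur()[0])
    q_max = max(children[q], key=lambda c: T[c].cur()[0])
    while T[q].cur()[1] < T[q_max].cur()[0]:  # advance q until it contains
        T[q].advance()                        # all children's stream heads
    if T[q].cur()[0] < T[q_min].cur()[0]:
        return q                              # q has a solution extension
    return q_min

children = {"a": ["b", "c"], "b": [], "c": []}
T = {"a": Stream([(1, 8), (3, 6)]),
     "b": Stream([(2, 2), (5, 5)]),
     "c": Stream([(4, 4), (7, 7)])}
first = get_next("a", children, T)    # "a": a 1,8 has a solution extension
T["a"].advance()                      # a 1,8 is pushed and its stream advanced
second = get_next("a", children, T)   # now "b" is returned, with head b 2,2
```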
(a) Streams: T a = a 1,8 , a 3,6 , ⊥; T b = b 2,2 , b 5,5 , ⊥; T c = c 4,4 , c 7,7 , ⊥.
(b) Stacks: S a = [a 1,8 , a 3,6 ], S b = [b 2,2 , b 5,5 ], S c = [c 4,4 ] (bottom to top).
(c) Path matches:
(a 1,8 b 2,2 )
(a 1,8 b 5,5 )
(a 1,8 c 4,4 )
(a 3,6 b 5,5 )
(a 3,6 c 7,7 )
Figure 6: TwigStack state when evaluating query in Figure 2c, after processing node b 5,5 .
TwigStack suboptimality for mixed a-d and p-c queries comes from having to output path matches without knowing whether the data nodes used can satisfy all their p-c relationships. The algorithm cannot always decide this from the nodes on the stacks and the heads of the streams. For the example in Figure 7 it cannot be decided if the path matches (a 1 , b 1 ), . . . , (a 1 , b n ) are part of a full match before the node c n+1 is seen.
(a) Query: a with children b and c.
(b) Data: root a 1 with children b 1 , . . . , b n , a 2 and c n+1 ; a 2 with children b n+1 and c 1 , . . . , c n (tree layout partially inferred from the text).
Figure 7: Bad case for TwigStack.
For a-d-only queries, queries where a p-c edge never follows an a-d edge [24], or on data with a non-recursive schema [9], twig joins can be solved with cost linear in the size of the input and the output, using O(d) memory. Sadly, recursive schemas, where nodes with a given label may nest nodes with the same label, are common in XML in practice [8], and so are mixed queries of the type mentioned above. No algorithm can solve the general problem given tag streaming, linear index size, and a query evaluation memory
requirement of O(d) [24]. One alternative is storing multiple sort orders on disk, instead of only tree pre-order. This would require Ω(m^min{m,d} · D) disk space in the worst case, where m is the number of structurally recursive labels and D is the size of the document [9]. Another alternative is to do multiple scans of the streams, but this would require Ω(d^t) passes in the worst case, where t is a linear metric on the complexity of the query [9]. So the only viable alternatives left seem to be relaxing the O(d) space requirement, or using something different from tag partitioning. The following section investigates this, but also many practical speedups to TwigStack.
3 Advances

A multitude of different improvements have been presented after the introduction of TwigStack. Figure 8 gives an overview of these, with a separation between improved join algorithms and changes to how data is indexed and accessed. The rest of this paper is devoted to a structured review of these advances. Our goal is to identify further improvements, and to shed light on whether it is likely that combining these advances is possible and beneficial.
3.1 Avoiding Redundant Computation

TwigStack may perform many redundant checks in the calls to getNext(). Each time a node is returned, the full subtree below it has been inspected. The TJEssential [19] algorithm improved three specific deficiencies, exemplified in Figure 9.

The first deficiency is from self-nested matching nodes. For query node a and data nodes a 1 to a p in the example, it is unnecessary to recursively check the full subtrees below b and d in each round while pushing the nodes onto S a . The usefulness of a 2 , . . . , a p can be seen from the fact that a 1 had a new solution extension, and that a 2 , . . . , a p contain b 1 and d 1 , the heads of the streams of a's children.
The second observation concerns the order in which child nodes are inspected. If the child b is inspected before d in line 18 of Algorithm 1, getNext(a) will call getNext(b) before getNext(d) shortcuts the search. There will be m − 1 redundant calls to getNext(b) while forwarding the leaf node e.

The third observation is that many useless calls could be made after a stream has reached its end. Assuming that b 2 was the last b-node in Figure 9, no a-node later in the tree order would ever be pushed onto stack, and T a could be forwarded to its end. Also, if S b was empty, any descendant of b in the query could have its stream forwarded to the end, as the remaining nodes could not be part of a solution.
TJEssential is a total rewrite of TwigStack, and is more complex than the original algorithm. TwigStack + [30] is a less involved modification, which only changes the getNext() procedure, such that it does not return before a solution is found. TwigStack + does not catch any of the three cases above, but reduces computation for scattered node matches in practice.
Changing algorithms:
– Nested loop twig m-way (fast on simple queries?).
– 2-phase holistic top-down, a-d optimal: TwigStack [4].
– Avoiding redundant computation: TJEssential [19], TwigStack + [30], TJEssential* [19], TwigStack + B [30]. Single phase essential?
– 1-phase bottom-up, a-d & p-c optimal: Twig 2 Stack [5].
– 1-phase top-down, a-d & p-c optimal: HolisticTwigStack [16].
– Simplified intermediate result management, bottom-up: TwigList [23].
– Simplified intermediate result management, top-down: TwigFast [20]. Unification?
– Bottom-up + filtering: Twig 2 Stack+PStack [5], Twig 2 Stack+TStack [3], TwigMix [20].
Changing data and data access:
– Refined access methods (tag, tag+level, path): iTwigJoin [6]; (twig): TwigVersion [27]. Optimal data access?
– Virtual streams: Virtual Cursors [28], TJFast [21], (TwigOptimal [10]).
– Skipping in leaf streams: Virtual Cursors [28]. Holistic skipping in matched leaf streams?
– Skipping joins: TwigStackXB [4]; efficient ancestor skipping: TSGeneric+ [15]; holistic skipping: TwigOptimal [10]. Holistic ancestor skipping?
Combination possible? Optimal algorithm?
Figure 8: Advances and opportunities in twig joins.
(a) Data: self-nested nodes a 1 , . . . , a p above b 1 and d 1 , with a q , e 1 , . . . , e m , b 2 and c 1 , . . . , c n further below (tree layout lost in extraction).
(b) Query: a with children b and d; c below b, e below d.
Figure 9: Giving redundant checks in TwigStack.
Opportunity 1 (Removing redundant computation in top-down one-phase joins). The improvement of TwigStack + can trivially be ported to recent algorithms such as HolisticTwigStack and TwigFast, which improve other aspects of TwigStack (see Section 3.2). A challenge is to do the same for all three improvements of TJEssential. Also, case three above could be extended to more efficient aligning for multi-document XML collections.
3.2 Top-down vs. Bottom-up

There are two main lines of algorithmic improvements over TwigStack which give optimal evaluation of mixed a-d and p-c queries by relaxing the O(d) memory requirement: bottom-up algorithms which read nodes in post-order, and later algorithms which go back to top-down and pre-order. Differences between these are illustrated in Figure 10.
Twig 2 Stack [5] generates a single combined stream with post-order sorting for all query node matches with the help of a single stack. With post-order processing it can be decided whether an entire subtree has a match at the time the top node is seen.

Figure 10c shows the hierarchies of stacks built while processing a query. For each query node, a list of trees of stacks is maintained. A data node strictly nests all nodes below it in the stack, and all nodes in child stacks in the tree. The lists of trees are stored sorted in post-order, and are linked together by a common root if an ancestor node is processed. Due to the post-order, the nodes to be linked will always be found at the end of the list, and the new root will always be put at the end. The order naturally maintains itself, and good locality is achieved.
Instead of each node on stack having a pointer to an ancestor node on a parent stack as in TwigStack, each stacked data node has, for each related child query node, a list of pointers to top stack nodes matching the query axis relationship. Nodes are only added if a-d and p-c relationships can be satisfied, and p-c pointers are only added when levels are correct, as seen for the a and c nodes in the example.
[Figure 10 panels (a)-(f): the data tree, the query and the four data structure illustrations are not reproducible in this extraction.]
Figure 10: (a) Data. (b) Query. (c) Hierarchies of stacks for Twig 2 Stack. (d) Intervals for TwigList. Curved arrows are sibling pointers. (e) Lists of stacks for HolisticTwigStack right before c 5 is processed. Previously popped nodes shown in gray. (f) Intervals for TwigFast after c 5 has been processed. Curved arrows are ancestor pointers.
TwigList [23] is a simplification of Twig 2 Stack using simple lists and intervals given by pointers, which improves performance in practice. For each query node, there is a post-order list of the data nodes used so far. Each node in a list has, for each child query node, a single recorded interval of contained nodes, as shown in Figure 10d. Interval start and end positions are recorded as nodes are pushed onto and popped off the global stack. All descendant data nodes are processed in between. Compared with the lists of pointers in Twig 2 Stack, enumeration of matches is not as efficient for p-c edges, but sibling pointers can remedy this.
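The interval bookkeeping can be sketched as follows (our own illustration, run on the data and query of Figure 2; TwigList's filtering of useless nodes and its handling of p-c levels are omitted):

```python
def twiglist_intervals(nodes, qchildren):
    """nodes: (label, begin, end) in pre-order; qchildren maps a query
    label to its child query labels. Returns per-query-node lists and,
    for each listed node, one (start, end) interval into each child list."""
    lists = {q: [] for q in qchildren}     # per-query-node match lists
    ivals = {q: [] for q in qchildren}     # parallel child intervals
    stack = []                             # global stack: (end, qlabel, index)

    def close(entry):                      # interval end = child list size now
        end, q, i = entry
        for c in qchildren[q]:
            s, _ = ivals[q][i][c]
            ivals[q][i][c] = (s, len(lists[c]))

    for lab, b, e in nodes:
        while stack and stack[-1][0] < b:  # pop nodes disjoint from (b, e)
            close(stack.pop())
        if lab in lists:
            i = len(lists[lab])
            lists[lab].append((b, e))
            # interval start = size of each child list at push time
            ivals[lab].append({c: (len(lists[c]), None)
                               for c in qchildren[lab]})
            stack.append((e, lab, i))
    while stack:
        close(stack.pop())
    return lists, ivals

lists, ivals = twiglist_intervals(
    [("a", 1, 8), ("b", 2, 2), ("a", 3, 6),
     ("c", 4, 4), ("b", 5, 5), ("c", 7, 7)],
    {"a": ["b", "c"], "b": [], "c": []})
```

For example, the second a node, a 3,6 , ends up with the b-interval [1, 2) and the c-interval [0, 1), i.e. exactly b 5,5 and c 4,4 .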
HolisticTwigStack [16] is a modification of TwigStack which uses pre-order processing, but maintains complex stack structures like Twig 2 Stack. The argument against Twig 2 Stack was its high memory usage, caused by the fact that all query leaf matches are kept in memory until the tree is completely processed, as they could be part of a match. HolisticTwigStack differentiates between the top-most branching node and its ancestors, for which a regular stack is used, and lower query nodes, which have multiple linked lists of stacks, as shown in Figure 10e. Each query node match has one pointer to the first descendant in pre-order for each child query node. For "lower" query nodes, new data nodes are pushed onto the current stack if contained; otherwise a new stack is created and appended to the list. As a match for an "upper" query node is popped, the node below it on stack must inherit the pointers. Node a 1 would inherit the pointers from both a 2 and a 4 in the example in Figure 10e, and the related lists of child matches would be linked.
TwigFast [20] is a simplification <strong>of</strong> HolisticTwigStack similar to TwigList. There is<br />
one list containing matches for each query node, naturally sorted in pre-order, <strong>and</strong> data<br />
nodes in the lists have pointers giving the interval <strong>of</strong> contained matches for child query<br />
nodes, as shown in Figure 10f. Each data node put into the list has a pointer to its closest<br />
ancestor in the same list, <strong>and</strong> there is a “tail pointer”, which gives the last position where<br />
a node can be the ancestor <strong>of</strong> following nodes in the streams. These pointers are used for<br />
the construction <strong>of</strong> the intervals.<br />
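As a concrete illustration of how such interval pointers support result enumeration, the following sketch expands one match entry into full result tuples. The list and dictionary layout is an assumption for illustration only, not the actual TwigFast structures:

```python
from itertools import product

# Hypothetical sketch of TwigFast/TwigList-style intermediate lists: one
# append-only list of matches per query node, where each entry stores, for
# every child query node, a half-open interval into the child's list.

def expand(lists, children, qnode, index):
    """Return all result tuples rooted at entry `index` of `qnode`'s list."""
    entry = lists[qnode][index]
    kids = children.get(qnode, [])
    if not kids:
        return [(entry["node"],)]
    per_child = []
    for child in kids:
        lo, hi = entry["intervals"][child]   # interval into the child's list
        tuples = []
        for i in range(lo, hi):
            tuples.extend(expand(lists, children, child, i))
        per_child.append(tuples)
    # One match from each child branch (cross product), flattened.
    return [(entry["node"],) + sum(combo, ()) for combo in product(*per_child)]

# Query a//b, with data node a1 containing b1 and b2:
children = {"a": ["b"], "b": []}
lists = {
    "a": [{"node": "a1", "intervals": {"b": (0, 2)}}],
    "b": [{"node": "b1"}, {"node": "b2"}],
}
print(expand(lists, children, "a", 0))  # [('a1', 'b1'), ('a1', 'b2')]
```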
Different advantages <strong>of</strong> top-down <strong>and</strong> bottom-up algorithms can be seen in Figure<br />
10. A top-down algorithm can avoid storing b 1 <strong>and</strong> c 2 , while a bottom-up algorithm<br />
is unable to decide that these nodes cannot be part <strong>of</strong> a solution. On the other h<strong>and</strong>, a<br />
bottom-up algorithm can decide that a 2 is not usable, because it cannot satisfy the p-c<br />
relationship between a <strong>and</strong> c. Both approaches can decide that a 3 is not useful because<br />
it does not have a b descendant.<br />
The worst case space complexity of twig pattern matching is an open problem, and
the known bounds are Ω(max(d, u)) and O(I), where u is the number of nodes which are
part of a solution [24]. However, practical space savings are possible.
Opportunity 2 (Top-down memory usage). TwigStack treats queries as a-d only in the<br />
stack construction part <strong>of</strong> phase one. A node returned from getNext() is pushed on stack if<br />
it has a usable ancestor on the parent stack, even if the query specifies a p-c relationship.<br />
For example, c 3 does not have to be pushed on stack in Figure 10e, because it does not
have a usable parent. Strictly checking p-c relationships before adding intermediate results
would reduce memory usage in practice. This optimization was identified for TJFast [21]<br />
CHAPTER 4. INCLUDED PAPERS
(see Section 3.6), but the later HolisticTwigStack <strong>and</strong> TwigFast do not take advantage <strong>of</strong><br />
this opportunity.<br />
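A minimal sketch of the strict check, assuming (begin, end, level) region encodings and a simple stack of ancestor candidates; the names and layout are illustrative, not from any of the cited implementations:

```python
# Opportunity 2 sketch: before pushing a candidate, verify a p-c edge
# strictly instead of treating it as a-d, as TwigStack's phase one does.

def can_push(node, parent_stack, edge_is_pc):
    """node and stack entries are (begin, end, level) region encodings."""
    if not parent_stack:
        return False
    top = parent_stack[-1]
    contained = top[0] < node[0] and node[1] < top[1]
    if not contained:
        return False
    if edge_is_pc:
        return node[2] == top[2] + 1   # strict parent, not just an ancestor
    return True

# A c node at level 3 under a usable ancestor at level 1, as for c3 in
# Figure 10e: rejected for a p-c edge, accepted for an a-d edge.
print(can_push((5, 6, 3), [(1, 10, 1)], edge_is_pc=True))   # False
print(can_push((5, 6, 3), [(1, 10, 1)], edge_is_pc=False))  # True
```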
Opportunity 3 (Bottom-up memory usage). Assume a query node q with a p-c relationship<br />
to the parent query node. If a c<strong>and</strong>idate match for q is pushed onto a stack in<br />
Twig 2 Stack, <strong>and</strong> the data node below on the stack does not have an incoming pointer,<br />
this means the node below will never get a matching parent, <strong>and</strong> can be popped <strong>of</strong>f stack.<br />
For example, the node c 6 could be dropped in Figure 10c. Also, stack trees for q are
merged when some ancestor data node a i is seen. Then all the stack trees which neither
have nor receive an incoming pointer can be dropped, as all later candidates for the parent query
node will come after in post-order. In the example, the stack trees containing the single
nodes c 2 <strong>and</strong> c 3 could be dropped when a 1 is seen.<br />
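A highly simplified single-stack sketch of this pruning rule, with an explicit incoming-pointer flag standing in for Twig 2 Stack's pointer structure (the whole layout is an assumption for illustration):

```python
# Opportunity 3 sketch: when a new candidate for a p-c query node is
# pushed, an entry directly below that has no incoming pointer can never
# gain a matching parent, so it is dropped.

def push_with_pruning(stack, node):
    if stack and not stack[-1]["incoming"]:
        stack.pop()                      # e.g. c6 in Figure 10c is dropped
    stack.append({"node": node, "incoming": False})

stack = []
push_with_pruning(stack, "c6")           # c6 never receives a pointer
push_with_pruning(stack, "c7")           # pushing c7 prunes c6
print([e["node"] for e in stack])        # ['c7']
```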
Note that improvements such as this are hard to transfer directly to TwigList unless the
lists are implemented as linked lists. But linked lists are by far inferior to using arrays and array
doubling on modern hardware, as done in TwigList [22]. Another solution is to keep one<br />
list for each level for query nodes which have a p-c relationship to the parent query node.<br />
Siblings would then be stored contiguously, <strong>and</strong> interval pointers would implicitly be to<br />
a list on a given level. When the first ancestor <strong>of</strong> a segment <strong>of</strong> nodes in need <strong>of</strong> a parent<br />
is seen, the useless nodes can be over-written. This modification would also make sibling<br />
pointers unnecessary <strong>and</strong> improve efficiency <strong>of</strong> result enumeration. Figure 11 shows the<br />
proposed approach for the data <strong>and</strong> query in Figure 10, where gray list items can be<br />
overwritten.<br />
Figure 11: Proposal for multi-level lists for TwigList.<br />
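The proposed per-level layout can be sketched as follows; the (level, position) addressing is an assumption for illustration:

```python
# Sketch of multi-level lists for TwigList: matches for a query node with
# a p-c edge to its parent go into one array per data-tree level, so
# siblings end up contiguous and an interval pointer is (level, lo, hi).
from collections import defaultdict

levels = defaultdict(list)               # level -> contiguous sibling array

def add_match(node_id, level):
    levels[level].append(node_id)
    return len(levels[level]) - 1        # position used by interval pointers

add_match("c1", 1)
add_match("c5", 3)
add_match("c4", 4)
add_match("c6", 4)                       # c4 and c6 stored contiguously
print(levels[4])  # ['c4', 'c6']
```

An interval pointer from a parent match then names a level and a contiguous range, so sibling pointers become unnecessary during enumeration.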
3.3 Filtering<br />
Low memory top-down approaches have been used as filters to bottom-up algorithms to<br />
reduce space usage by avoiding useless nodes. Note that this does not result in a perfect<br />
solution. Assume that node a 1 in Figure 10a had a different label. An O(d) space top-down
pre-order approach could not decide that b 2 in the example was not part of a match, and
a bottom-up algorithm would have to keep it in memory until the entire tree was read.
Figures 12c-e show the effects of different previously proposed filters.
PathStack Pop Filter. In the original Twig 2 Stack paper [5], PathStack was proposed<br />
as a pre-filter to allow early result enumeration. PathStack is run as usual, but<br />
without its result enumeration. As disjoint nodes are popped <strong>of</strong>f their stacks, they are<br />
Paper 4: Towards Unifying Advances in Twig Join Algorithms

Figure 12: Filtering approaches for bottom-up processing. Filtered nodes shown in gray.
Panels: (a) Query. (b) No filter. (c) PathStack pop filter. (d) Solution extension filter.
(e) TwigStack pop filter. (f) Opportunity: PathStack useful. (g) Opportunity: TwigStack useful.
passed to Twig 2 Stack. When the bottom node is popped from the stack <strong>of</strong> the root query<br />
node, all results can be output, <strong>and</strong> the hierarchical stacks destroyed. A side effect <strong>of</strong> this<br />
procedure is that only nodes that are part <strong>of</strong> some prefix path match are used (these are<br />
not necessarily part <strong>of</strong> a full root-to-leaf path match). In Figure 12c, node c 1 is avoided.<br />
Note that one data node may result in the popping <strong>of</strong> multiple nodes on multiple stacks,<br />
<strong>and</strong> that Twig 2 Stack must receive descendants before ascendants.<br />
Solution Extension Filter. TwigMix [20] is an algorithm which combines the simplified<br />
data structures in TwigList with the getNext() function from TwigStack as a filter.<br />
This combination gives efficient evaluation for queries involving p-c edges, <strong>and</strong> reduced<br />
memory usage in practice. An advantage <strong>of</strong> this approach over Twig 2 Stack+PathStack is<br />
that there is no overhead <strong>of</strong> maintaining an extra set <strong>of</strong> stacks, <strong>and</strong> that internal nodes<br />
are filtered holistically. The downside is that nodes may be added without even having a
possible parent or ancestor. Figure 12d shows that node a 2 is filtered, because it never
has a solution extension (it misses a b node below), while nodes c 1 and b 1 are not filtered.
TwigStack Pop Filter. TwigStack can also be used as a filter for Twig 2 Stack [3]. A<br />
node is never added to the hierarchical stacks if it is not popped from a top-down stack in<br />
TwigStack. As a node is never pushed on stack if it does not have a usable ancestor, which<br />
again has a solution extension, this gives additional filtering, at the cost <strong>of</strong> maintaining the<br />
top-down stacks. Figure 12e shows the improvements both over PathStack <strong>and</strong> solution<br />
extension as filters. An issue is that Twig 2 Stack expects the stream <strong>of</strong> nodes to be in<br />
post-order, <strong>and</strong> that TwigStack may pop nodes <strong>of</strong>f stacks out <strong>of</strong> this order. When a node<br />
is returned from getNext(), only the related stack <strong>and</strong> the parent stack are inspected.<br />
Also, TwigStack does not keep leaf matches on stack, but nested leaf matches may arrive<br />
later. In [3] this is solved by keeping an extra queue <strong>of</strong> data nodes into which popped<br />
nodes are placed if the algorithm decides later popped nodes may precede them.<br />
A different solution could be to allow nested nodes on query leaf stacks, <strong>and</strong> to inspect<br />
all stacks when popping disjoint nodes to ensure post-order, as with the PathStack filter.<br />
Also, Twig 2 Stack does not actually need to see nodes in strict post-order, but only to
see descendants before ascendants. Hence, not all stacks in the query would have to be
inspected, only the ascendant and descendant stacks of the current node.
Opportunity 4 (Stronger filters). There are further possibilities for filters with O(d)<br />
space usage. Instead of using all nodes popped off stacks in PathStack, one could use
only the nodes which would take part in a full path match. As leaf nodes are pushed on stack, a
simplified enumeration algorithm could be run, tagging nodes which take part in solutions.<br />
As can be seen in Figure 12f, this is an improvement over the previous PathStack filter, but<br />
only partially over the solution extension filter, which to a greater extent filters matches<br />
for higher query nodes. Leaves trivially have solution extensions. The “PathStack<br />
useful” filter works well on lower query nodes. Note that as the bottom-up algorithms<br />
to a greater extent h<strong>and</strong>le upper nodes themselves, a filter is <strong>of</strong> most use if it removes<br />
lower query node c<strong>and</strong>idates effectively. An even stronger filter would be to only use<br />
nodes which would have been output as parts <strong>of</strong> path matches in TwigStack, as shown in<br />
Figure 12g. None of [5, 20, 3] compare with using any other type of filter. A thorough
comparison should measure the practical space reductions the filters give, their absolute
costs, and how their use affects the total computational cost.
Opportunity 5 (Unification or assimilation). When comparing absolute performance<br />
gains presented in the respective papers, TwigFast is the winner on performance for pure<br />
tag streaming. As this is a very important result, it should be verified independently.<br />
Before TwigFast is picked as the method <strong>of</strong> choice, at least the following should be answered:<br />
(i) Can the improvements discussed in Section 3.1 be applied? (ii) Is it superior<br />
to improved top-down <strong>and</strong> bottom-up combinations? (iii) Does the picture change when<br />
r<strong>and</strong>om access gets more expensive compared to computation? [20] does not comment on<br />
the spatial locality <strong>of</strong> memory access patterns in the intermediate result data structures<br />
in TwigFast, while they are very good for TwigList [23].<br />
3.4 Skipping Joins<br />
Skipping is a useful technique when the streams to be joined have very different sizes.<br />
Skipping is used to jump forward in a stream to avoid reading <strong>and</strong> processing parts <strong>of</strong><br />
the streams which cannot contain useful nodes. Figure 13 shows cases where different<br />
skipping techniques <strong>and</strong> data structures can be used.<br />
Simple B-tree skipping can be used to skip in descendant streams, and to some
extent in ancestor streams. It is trivial to skip in the descendant stream to find the first
possible contained node, which is the first node with a larger begin value. In Figure 13b,
T c is forwarded from c 1 to c q to find the first possible descendant of b 1 .
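A minimal sketch of this skip, with bisection over a sorted array of begin values standing in for the B-tree:

```python
# Descendant-stream skipping: jump to the first stream node whose begin
# value exceeds the ancestor's, i.e. the first node that can still be
# contained by it. A sorted array plus bisect stands in for the B-tree.
from bisect import bisect_right

def skip_to_first_possible_descendant(begins, ancestor_begin):
    return bisect_right(begins, ancestor_begin)

c_begins = [2, 4, 6, 30]   # begin values of nodes in stream T_c
print(skip_to_first_possible_descendant(c_begins, 10))  # 3: skips c1..c3
```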
Figure 13: Benefits <strong>of</strong> skipping techniques. (a) Query. (b) Descendants easily skipped<br />
with B-tree. (c) Skip past discarded ascendant. (d) XR-tree needed to skip ascendants.<br />
(e) Holistic skipping preferred. (f) Holistic skipping with XR-tree needed.<br />
But skipping to find the next ascendant <strong>of</strong> a node using the same approach is not<br />
effective, as any node with a lower begin value may be a match. A trick for ancestor<br />
skipping was introduced in [7]. If a node b i is popped <strong>of</strong>f stack S b due to disjointness with<br />
the current data node in some query node, T b is forwarded to the first node not contained<br />
by the popped node, a b j such that b i .end < b j .begin. An example <strong>of</strong> this can be seen<br />
in Figure 13c. If c 2 pops b 1 <strong>of</strong>f stack, T b can be forwarded beyond b 1 to b q , because no<br />
descendant <strong>of</strong> b 1 could be useful.<br />
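The trick can be sketched with a sorted array standing in for the index structure; once b i is popped, the stream jumps past its entire subtree:

```python
# Pop-triggered ancestor skip from [7]: every node contained in the popped
# node b_i is useless, so T_b jumps to the first b_j with b_i.end < b_j.begin.
from bisect import bisect_right

def forward_past_popped(begins, popped_end):
    return bisect_right(begins, popped_end)

b_begins = [3, 5, 7, 9, 50]   # b1 = (3, 40) contains b2..bp; b_q begins at 50
print(forward_past_popped(b_begins, 40))  # 4: jump straight to b_q
```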
XR-trees enable ancestor skipping in the general case [14]. Figure 13d shows an
example where the above trick cannot be used. The XR-tree is a B-tree variant which can
retrieve all R ancestors or descendants of a node from N candidates in O(log N + R)
time. Typically one tree is built for each tag. To find all a ascendants of a node d k ,
find the node a i with the nearest preceding begin value, and then retrieve all a ascendants of
a i from the XR-tree for a. Conceptually the XR-tree contains, for each node, the list of
its ascendants, which gives quadratic space usage when implemented naively. Linear space
usage is achieved by not storing information redundantly internally in XR-tree nodes, and
by storing common information in internal XR-tree nodes.
TSGeneric+ (also called XRTwig) [15] extends the use of the XR-tree to TwigStack,
and makes two major modifications to the algorithm. The first is to skip forward to
containment <strong>of</strong> the first child in the getNext() procedure (see line 23 in Algorithm 1).<br />
The second change is more involved. Before calling getNext() on all children in line 18, a<br />
“broken” edge in the query sub-tree is repeatedly picked, <strong>and</strong> the two related nodes are<br />
“zig-zag” forwarded until they match. This is only done if the query node does not have<br />
data nodes on the stack. The edge to fix is chosen either top-down, bottom-up, or by
statistics.
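The zig-zag forwarding of one broken a-d edge can be sketched over streams of (begin, end) region encodings; this is a hedged illustration of the idea, not TSGeneric+'s actual cursor interface:

```python
# Zig-zag fix of a broken a-d edge: advance whichever stream head rules
# out containment until the two heads nest.

def zigzag_fix(anc, dsc):
    i = j = 0
    while i < len(anc) and j < len(dsc):
        a, d = anc[i], dsc[j]
        if a[1] < d[0]:                          # a ends before d begins
            i += 1
        elif not (a[0] < d[0] and d[1] < a[1]):  # d not inside a
            j += 1
        else:
            return i, j                          # heads now satisfy the edge
    return None                                  # no match on this edge

anc = [(1, 4), (20, 40)]
dsc = [(6, 7), (9, 10), (22, 23)]
print(zigzag_fix(anc, dsc))  # (1, 2): both streams are skipped forward
```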
Holistic skipping was introduced in the TwigOptimal [10] algorithm, which uses<br />
B-trees. Figure 13e shows a case where the approach from TSGeneric+ would be very<br />
expensive, reading all nodes b 2 -b p <strong>and</strong> c 2 -c p to fix the edge between b <strong>and</strong> c. TwigOptimal<br />
processes the query bottom-up then top-down. In the bottom-up phase, nodes are<br />
forwarded to contain their descendants, <strong>and</strong> in the top-down phase, nodes are forwarded<br />
until they are contained by their parent. To avoid as many data structure reads as possible,<br />
nodes are forwarded to “virtual positions”, which have only begin values. When a<br />
full traversal does not forward any node, the node with the minimal current begin value is
forwarded to a real data node.<br />
The name of the TwigOptimal algorithm may be slightly misleading, as the optimality
holds only given skip structures on begin values. Only TSGeneric+ using simple B-trees
is compared with. The effects of the two contributions, holistic skipping and the virtual
positions, are not tested separately. TwigOptimal would be efficient on neither the
example in Figure 13c nor the one in 13d. The approach is best when there are more matches for
lower query nodes. A common exception to this is queries with leaf value predicates
in XML. [10] mentions skipping to the closest ancestor and then backtracking to the first
ancestor as a possible practical speed-up.
Opportunity 6 (Holistic effective ancestor skipping). Figure 13f shows a case where<br />
both TSGeneric+ with XR-trees <strong>and</strong> TwigOptimal would fail to be efficient. The former<br />
would zig-zag join a 2 -a p <strong>and</strong> b 2 -b p , <strong>and</strong> the latter would be unable to forward T a to a q<br />
without checking at least all <strong>of</strong> a 2 -a p for ancestry <strong>of</strong> c q . Combining holistic skipping <strong>and</strong><br />
data structures for efficient ancestor skipping is required in a robust solution.<br />
Opportunity 7 (Simpler <strong>and</strong> faster skipping data structures). The XR-tree is a dynamic<br />
data structure which supports insertions <strong>and</strong> deletions [14]. In regular keyword search<br />
engines, simpler data structures are usually preferred to the heavier B-trees when the<br />
data is static or semi-static. Similar simpler data structures should also be created for<br />
efficient ancestor skipping. If their use is still expensive, techniques similar to the trick<br />
used to skip past discarded ascendants should be applied when possible.<br />
3.5 Refined Access Methods<br />
There are alternatives to indexing <strong>and</strong> accessing data by node labels, such as using label<br />
<strong>and</strong> level, or the root-to-node path strings <strong>of</strong> labels (called tag+level <strong>and</strong> prefix path<br />
streaming [6]). With refined partitioning some method must be used to identify the useful<br />
partitions for each query node. For prefix path streaming this would be the partitions<br />
with data paths matching the root-to-node downward paths in the query.<br />
Structural summaries are directory structures used to classify nodes based on their<br />
surrounding structure. They were first used in combination with tree traversals, but have
later been integrated with pure partitioning schemes [18]. The most common is a path<br />
summary, which is a minimal tree containing all unique root-to-node label paths seen in<br />
the data. The data nodes associated with a summary node is called the extent <strong>of</strong> the<br />
node. Figure 14a shows the path summary for the data in Figure 14b, <strong>and</strong> the extents<br />
are shown in Figure 14d.<br />
Figure 14: Structural partitioning example. (a) is path summary for (b), with extents<br />
shown in (d). (b) is F&B summary for (c), with extents shown in (e).<br />
Many alternative summary structures have been devised for general graphs. A structure<br />
which is also directly useful for trees is the stronger F&B-index [17], where two nodes<br />
are in the same equivalence class if they have the same prefix path, <strong>and</strong> have children <strong>of</strong><br />
the same equivalence classes. In the example, (b) is the F&B summary <strong>of</strong> the tree in (c).<br />
For graphs, the F&B index can be found in O(m log n) time, where m and n are the number
of edges and nodes in the graph. It is not known whether the F&B index can be found
more efficiently for trees.<br />
Opportunity 8 (Updates in stronger summaries). Simple path summaries are usually<br />
small, <strong>and</strong> are easily updateable. When traversing a data tree for indexing, the path<br />
summary is used as a deterministic automaton, where new nodes are added on the fly<br />
when needed. Data nodes can be put in the correct extents immediately. If a data tree is<br />
updated, only the data nodes whose extent changes are affected. An interesting question<br />
is the updatability <strong>of</strong> stronger structural summaries. In the worst case for the F&B<br />
index, the structure <strong>of</strong> an entire containing subtree below the global root could change<br />
if a data node is added or removed, by causing or removing equivalence with another<br />
subtree. What are the implications <strong>of</strong> strategies lessening the restrictions on the F&B<br />
index? Would this give critical “fragmentation effects” in practice? And are updates<br />
cheaper in coarser variants, such as the F+B-index [17]?<br />
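The deterministic-automaton construction of a simple path summary mentioned above can be sketched as follows; the nested-dict tree encoding is an assumption for illustration:

```python
# Build a path summary while traversing the data tree: the summary is a
# deterministic automaton over labels, new summary nodes are added on the
# fly, and each data node lands immediately in the correct extent.

def build_path_summary(data_root):
    next_id = [0]
    trans = {}     # (summary node, label) -> summary node
    extents = {}   # summary node -> list of data node ids

    def visit(node, summary_state):
        key = (summary_state, node["label"])
        if key not in trans:
            next_id[0] += 1
            trans[key] = next_id[0]      # add a summary node on the fly
        s = trans[key]
        extents.setdefault(s, []).append(node["id"])
        for child in node.get("children", []):
            visit(child, s)

    visit(data_root, 0)
    return trans, extents

tree = {"label": "a", "id": 1, "children": [
    {"label": "b", "id": 2}, {"label": "b", "id": 3},
]}
trans, extents = build_path_summary(tree)
print(extents)  # {1: [1], 2: [2, 3]}
```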
Opportunity 9 (Hashing F&B summaries). In some search scenarios, there are many<br />
small documents, which are parts <strong>of</strong> a virtual tree. Document updates can be implemented<br />
as document deletes <strong>and</strong> inserts. With simple path summaries, documents can be<br />
added with cost linear in the document size, by traversing the summary deterministically.<br />
However, more refined summaries are not deterministic. Are stronger summaries like the<br />
F&B index suitable in this model? A challenge is that matching a new document in the<br />
F&B index has cost linear in the size <strong>of</strong> the summary in the worst case, not the document.<br />
Assume now that a 2 , a 10 <strong>and</strong> a 15 in Figure 14c are document roots. The structure <strong>of</strong> each<br />
document is classified by a node on depth two in the F&B summary in Figure 14b. If a<br />
new document is added below b 1 , it will either have the structure defined by a [2] or a [7] ,<br />
or a new subtree will be added below b [1] . One possibility is to index the F&B summary<br />
by hashing each level 2 subtree, as these represent full document structures. When a new<br />
document is indexed, a summary <strong>of</strong> the document structure can be built <strong>and</strong> hashed, to<br />
identify a match in the global F&B index.<br />
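One possible, entirely hypothetical realization of this hashing idea canonicalizes each document's structure before hashing; note that sorting and de-duplicating child forms only approximates F&B (bisimulation) equivalence:

```python
# Hash a document's structure so that structurally equivalent documents
# hash alike; the hash then indexes the level-2 subtrees of the global
# F&B summary. The canonical form below is an assumption.
import hashlib

def canonical(node):
    """Label plus the sorted set of child canonical forms."""
    kids = sorted(set(canonical(c) for c in node.get("children", [])))
    return node["label"] + "(" + ",".join(kids) + ")"

def structure_hash(doc_root):
    return hashlib.sha256(canonical(doc_root).encode()).hexdigest()

d1 = {"label": "a", "children": [{"label": "b"}, {"label": "c"}]}
d2 = {"label": "a", "children": [{"label": "c"}, {"label": "b"}]}
print(structure_hash(d1) == structure_hash(d2))  # True: same structure class
```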
TwigVersion [27] is a twig matching approach which introduces a novel two-layer<br />
indexing scheme, with an F&B summary <strong>of</strong> the data, <strong>and</strong> a path summary <strong>of</strong> the F&B<br />
summary. This reduces the expense <strong>of</strong> matching in the F&B index. But as they only<br />
compare to twig join algorithms which do not use structural summaries, <strong>and</strong> also introduce<br />
many other ideas, it is hard to assess the usefulness of the two-layer approach itself. They
compare their two-layer approach with a pure F&B index, but do not state how they
search in it.
A common way to use path summaries is to match each individual root-to-leaf path,<br />
<strong>and</strong> prune away matches which cannot be part <strong>of</strong> a full match [6, 2]. Another solution,<br />
which is more robust for large path summaries, is to label partition the summary <strong>and</strong> run<br />
a full twig join algorithm on it. In [3] a novel combination <strong>of</strong> Twig 2 Stack <strong>and</strong> TwigStack<br />
is used for matching in large path summaries (see Section 3.3).<br />
Opportunity 10 (Exploring summary structures <strong>and</strong> how to search them). Many twig<br />
join algorithms have leveraged the benefits <strong>of</strong> path summaries. Stronger summaries like<br />
the F&B index are not as commonly used, maybe because <strong>of</strong> worst case size <strong>and</strong> implementational<br />
complexity. Using different algorithms to search various types <strong>and</strong> combinations<br />
<strong>of</strong> summaries has not been thoroughly explored. An evaluation should address the total<br />
benefits <strong>of</strong> different single- <strong>and</strong> multi-level strategies, but also detail the local cost <strong>of</strong><br />
specific matching methods in specific summary types <strong>of</strong> different sizes.<br />
Multi-stream access may be required for a single query node when partitioning on<br />
path, as there may be many path matches. One solution is to merge the streams. Another<br />
is to have a tag partitioned store, <strong>and</strong> filter the data nodes on matching path ID [28]. A<br />
speedup to this approach is to chain together nodes with the same path [18]. This is also<br />
useful when indexing text nodes on value <strong>and</strong> integrating structure information.<br />
S 3 [13] is a twig matching system which takes all possible combinations <strong>of</strong> individual<br />
streams matching query nodes based on prefix path, <strong>and</strong> solves each combination separately,<br />
merging the results <strong>of</strong> each evaluation. This approach does not give polynomial<br />
worst-case bounds.<br />
Blocking is the reason for the sub-optimality of TwigStack. When partial matches
must be output to evaluate the data and query in Figure 15a, it is because b 1 and c 1
block each other's access to c 2 and b 2 , respectively.
Figure 15: Cases <strong>of</strong> data <strong>and</strong> query blocking with (a) tag streaming, (b) tag+level streaming,<br />
(c) prefix path streaming. Adapted from [6].<br />
Using more fine grained partitioning <strong>and</strong> streaming solves some blocking issues, because<br />
there are multiple heads <strong>of</strong> streams. A partitioning which refines another always<br />
inherits the reduced blocking [6]. In tag+level streaming there is no blocking when only p-c
edges are used below the query root [6]. But in mixed queries, such as in Figure 15b,
blocking can still occur. There the stream for tag d level 3 is [d 1 , d 2 ] <strong>and</strong> the stream for<br />
c level 4 is [c 1 , c 2 ]. There are only two matches for the query, <strong>and</strong> data nodes c 1 <strong>and</strong> d 1<br />
block each other.<br />
Prefix path streaming results in no blocking when there is a single branching node in<br />
the query. It solves the case in Figure 15b optimally, but not the one in 15c. There the<br />
stream for the path abac is [c 1 , c 2 ], and the stream for the path ababe is [e 1 , e 2 ]. Even
though c 2 is also usable in the match with root a 1 , it cannot be known whether or not c 1
is usable, because e 1 blocks e 2 .
iTwigJoin [6] uses a specialized approach for accessing multiple useful streams, which<br />
supports any partitioning scheme. In its variant <strong>of</strong> getNext(q), it considers for each<br />
matching stream, the streams which are usable together with this stream, for each child<br />
<strong>of</strong> q. This reduces the amount <strong>of</strong> blocking when more fine grained partitioning is used.<br />
The space usage for iTwigJoin is O(d), <strong>and</strong> the running time is O(t(I + O)) when no<br />
blocking occurs, where t is the number <strong>of</strong> useful streams.<br />
Opportunity 11 (Access methods for multiple matching partitions). Strategies for accessing<br />
multiple useful streams include merging, filtering <strong>and</strong> chaining <strong>of</strong> input, informed<br />
merging access like in iTwigJoin, <strong>and</strong> merging the output <strong>of</strong> multiple joins. Many authors<br />
do not argue for the rationale <strong>of</strong> their choice <strong>of</strong> how to access multiple useful partitions<br />
for a node. A new access paradigm is presented in [6], but only tag streaming is compared<br />
with. The benefit <strong>of</strong> the method for accessing multiple matching streams is not<br />
separated from the benefit <strong>of</strong> reduced total input size. The technique reduces the number<br />
<strong>of</strong> intermediate paths output by phase one in TwigStack, <strong>and</strong> would undoubtedly reduce<br />
the amount of memory needed for intermediate results in time-optimal top-down algorithms
like HolisticTwigStack and TwigFast, but it is not certain whether it is a win on both
memory and speed in practice.
3.6 Virtual Streams<br />
Another approach that can lead to reading less input is using “virtual streams” for internal<br />
nodes, by inferring the existence <strong>of</strong> nodes from their descendants. This requires a position<br />
encoding which allows ancestor position reconstruction [26], such as Dewey [25]. Ancestor
label paths must also be inferable, and a path summary is an excellent tool for this.
Consider the example in Figure 16, where streams <strong>of</strong> nodes matching leaf label paths are<br />
shown. For the node 1.2.1.2 with path (4) a.a.b.a, it can be inferred that one c<strong>and</strong>idate<br />
for the query root is 1.2 with path (2) a.a.<br />
Figure 16: Virtual stream example. (a) Data. (b) Path summary. (c) Extents <strong>of</strong> summary<br />
nodes. (d) Query. (e) Leaf streams.<br />
Virtual Cursors is an implementation of virtual streams using Deweys and path summaries [28]. Generating a “next” match for an internal query node is done by going through the prefixes of a leaf node's Dewey and path, and using those where the ending tag is correct. The search stops when the new Dewey is lexicographically greater than the previous, meaning later in the pre-order. [28] does not give details on how ancestor candidates are generated, but this can be done in time linear in the depth of the leaf match used.

Forwarding the entire query is done by repeatedly picking a leaf with a “broken” path, forwarding it to containment by the maximal ancestor, and then forwarding all ancestors virtually to contain the leaf. In the system described by [28], tag streaming with path ID filtering was used, and B-trees were used for skipping during leaf forwarding.
Other virtual stream approaches have later been introduced. TJFast [21] is an independently developed algorithm which does not use a structural summary, but stores with each data node the root-to-node label path and the Dewey encoding together in a compressed format. Label paths are matched for each node processed. An improvement over Virtual Cursors as described is that path matching is also done when generating internal nodes, giving fewer useless candidates. Also, non-branching internal nodes can be ignored during query evaluation, because they are implicit from the path matchings of the nodes below and above. TJFast does not produce streams for internal nodes, but maintains sets of currently possible candidates.
TwigVersion [27] and S³ [13] (see Section 3.5) are non-holistic approaches which combine structural summaries and inference of internal node matches. TwigVersion computes sets of matches bottom-up. Each child query node generates a set of candidates for its parent query node based on its own matches, and these sets are then intersected. S³ uses the potentially exponential number of ways a query matches the summary, and evaluates each such match, merging the results. For one summary matching, it looks at the query leaf nodes pairwise, and merge-joins sets based on the lowest common ancestor query nodes. This could give large useless intermediate results. The holistic skipping algorithm TwigOptimal [10] does partially implement virtual streams through its “virtual positions” (see Section 3.4).
Opportunity 12 (Improved virtual streams). To reduce the number of matches and make it possible to ignore non-branching internal nodes, only structurally matching internal nodes should be generated. A structural summary can be used to avoid repeated matching of paths. However, how to store path matching information is not obvious. Given a matching path for a leaf query node, there may be an exponential number of combinations for matching the query nodes above. Should the matches be calculated on the fly as in TJFast, kept in independent sets for each node above a leaf match, or encoded in stacks? Or is it enough to store candidates for the lowest branching node above each leaf match, if the query nodes on a path are processed bottom-up?
It is common to store Deweys in a succinct format to reduce space usage, but in addition, some scheme should also be devised to reduce the redundancy of storing related Deweys in ancestor and descendant nodes. It is also preferable if node encodings do not have to be fully decompressed to be compared during query evaluation, but that the compressed storage format allows for cheap direct comparisons.
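One way to get such cheap direct comparisons is an order-preserving variable-length byte encoding of Dewey components, so that plain byte-wise comparison of two encoded Deweys agrees with their document order. This is a sketch under assumed bounds, not a scheme taken from the literature surveyed here:

```python
def encode_component(c):
    """Order-preserving variable-length encoding of one Dewey component:
    one byte for c < 128, two bytes for 128 <= c < 16512 (an assumed
    bound for this sketch).  Because every two-byte first byte (>= 0x80)
    sorts after every one-byte encoding, byte-wise comparison of the
    concatenated encodings agrees with component-wise Dewey order."""
    if c < 128:
        return bytes([c])
    if c < 16512:
        c -= 128
        return bytes([0x80 | (c >> 8), c & 0xFF])
    raise ValueError("component too large for this sketch")

def encode_dewey(dewey):
    """Concatenate the component encodings; the result is comparable
    without any decompression."""
    return b"".join(encode_component(c) for c in dewey)
```

For instance, `encode_dewey([1, 2, 1, 2]) < encode_dewey([1, 3])` holds, matching the pre-order of the corresponding nodes in Figure 16.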
Opportunity 13 (Holistic skipping among leaf streams). In some sense, virtual streams are skipping by not generating unrelated matches for internal query nodes. The Virtual Cursors algorithm does perform skipping which is “path-holistic”, in the way broken root-to-leaf paths are fixed. The order in which leaves are picked is not specified [28], but query node pre-order could have been used. If the lexicographically largest broken leaf was picked, the skipping would become truly holistic.

The work in [28], in combination with some intermediate result handling method from Section 3.2, may be a suitable starting point in the hunt for the “ultimate” twig matching approach, but this work is seldom compared with, or even referenced.
3.7 Query Difficulty Classes
As mentioned in Section 2, the “difficulty” of twig joins comes from the mixture of a-d edges followed by p-c edges in queries, in combination with structurally recursive labels in the data. [24] shows that when an a-d edge is never followed by a p-c edge downwards in a query, it can be evaluated in linear time and O(d) space. When there in addition is a single return node (as in XPath), it can also be evaluated in O(1) space. If, after combining all the advances listed in this paper, faster evaluation methods still exist for some classes of queries, practical implementations should take advantage of this.
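The class of [24] (no a-d edge followed downwards by a p-c edge) can be recognized with a simple traversal of the query tree. The representation below is a hypothetical sketch, not taken from the cited paper:

```python
def in_easy_class(children, edge_kind, root):
    """Check whether no a-d edge is followed by a p-c edge on any
    root-to-leaf path of the query.  `children` maps each query node to
    its list of children, and `edge_kind[(p, q)]` is "ad" or "pc".
    Both names and the representation are assumptions for this sketch."""
    def walk(node, seen_ad):
        for child in children.get(node, []):
            kind = edge_kind[(node, child)]
            if kind == "pc" and seen_ad:
                return False                 # a-d edge above a p-c edge
            if not walk(child, seen_ad or kind == "ad"):
                return False
        return True
    return walk(root, False)
```

A system could run this check once per query and dispatch to a simpler linear-time, O(d)-space evaluator when it succeeds.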
Opportunity 14 (Identifying and using difficulty classes). Can the correctness of using a simpler matching algorithm be decided not only from the query, but also from the query and the data? Structural summaries give possibilities for this. What happens if there are only single path candidates for some query nodes? What happens when the tree level for matches of a query node is fixed? What happens if data node matches with given paths for some query nodes fix path matches for other query nodes?
In [2] additional information is collected in a path summary, noting whether a node always has a given child, and whether there is at most one child. This information is used there to simplify query evaluation when there are non-return nodes in the query, such as in XPath. Could such statistics also allow detection of more cases where query evaluation can be simplified for general twig matching?
4 Conclusion

We have given a structured analysis of recent advances, and identified a number of opportunities for further research, focusing both on join algorithms and index organization strategies. Hopefully this has given an overview which has led us one step further towards unification of the numerous advances in this field.

One conclusion is that, given its sheer volume, it seems nearly impossible to consider all related work when presenting new twig join techniques. The field would benefit greatly from an open source repository of algorithms and data structures.
References

[1] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002.

[2] Andrei Arion, Angela Bonifati, Ioana Manolescu, and Andrea Pugliese. Path summaries and path partitioning in modern XML databases. In Proc. WWW, 2006.

[3] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[4] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002.
Paper 4: Towards Unifying Advances in Twig Join Algorithms
[5] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig²Stack: Bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006.

[6] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.

[7] S. Y. Chien, Z. Vagena, D. Zhang, V. J. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. In Proc. VLDB, 2002.

[8] Byron Choi. What are real DTDs like. Technical Report MS-CIS-02-05, University of Pennsylvania, 2002.

[9] Byron Choi, Malika Mahoui, and Derick Wood. On the optimality of holistic algorithms for twig queries. In Proc. DEXA, 2003.

[10] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing cursor movement in holistic twig joins. In Proc. CIKM, 2005.

[11] Georg Gottlob, Christoph Koch, and Reinhard Pichler. Efficient algorithms for processing XPath queries. In Proc. VLDB, 2002.

[12] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. Knowl. and Data Eng., 2007.

[13] Sayyed Kamyar Izadi, Theo Härder, and Mostafa S. Haghjoo. S³: Evaluation of tree-pattern XML queries supported by structural summaries. Data Knowl. Eng., 2009.

[14] Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. XR-tree: Indexing XML data for efficient structural joins. In Proc. ICDE, 2003.

[15] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003.

[16] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007.

[17] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[18] Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, and Raghu Ramakrishnan. On the integration of structure indexes and inverted lists. In Proc. SIGMOD, 2004.

[19] Guoliang Li, Jianhua Feng, Yong Zhang, and Lizhu Zhou. Efficient holistic twig joins in leaf-to-root combining with root-to-leaf way. In Proc. Advances in Databases: Concepts, Systems and Applications, 2007.
CHAPTER 4. INCLUDED PAPERS
[20] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.

[21] Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, and Ting Chen. From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005.

[22] Lu Qin. Personal correspondence, 2009.

[23] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007.

[24] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008.

[25] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proc. SIGMOD, 2002.

[26] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval. PhD thesis, Ludwig-Maximilians-Universität München, 2006.

[27] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data & Knowl. Eng., 2008.

[28] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[29] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001.

[30] Junfeng Zhou, Min Xie, and Xiaofeng Meng. TwigStack⁺: Holistic twig join pruning using extended solution extension. Wuhan University Journal of Natural Sciences, 2007.
A Errata

1. Algorithms TwigList [23], HolisticTwigStack [16] and TwigFast [20] are incorrectly referred to as optimal in Figure 8, in Section 3.2, and in Section 3.5. Only algorithm Twig²Stack [5] is optimal as described.
Paper 5

Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland

Fast Optimal Twig Joins

Proceedings of the 36th International Conference on Very Large Data Bases (VLDB 2010).
Abstract. In XML search systems twig queries specify predicates on node values and on the structural relationships between nodes, and a key operation is to join individual query node matches into full twig matches. Linear-time twig join algorithms exist, but many non-optimal algorithms with better average-case performance have been introduced recently. These use somewhat simpler data structures that are faster in practice, but have exponential worst-case time complexity. In this paper we explore and extend the solution space spanned by previous approaches. We introduce new data structures and improved strategies for filtering out useless data nodes, yielding combinations that are both worst-case optimal and faster in practice. An experimental study shows that our best algorithm outperforms previous approaches by an average factor of three on common benchmarks. On queries with at least one unselective leaf node, our algorithm can be an order of magnitude faster, and it is never more than 20% slower on any tested benchmark query.
1 Introduction
XML has become the de facto standard for storing and transferring semistructured data due to its simplicity and flexibility [6], with XPath and XQuery as the standard query languages. XML documents have tree structure, where elements (tags) are internal tree nodes, and attributes and text values are leaf nodes. Information may be encoded both in structure and content, and query languages need the expressive power to specify both.

Twig pattern matching (TPM) is an abstract matching problem on trees, which covers a subset of XPath, which in turn is a subset of XQuery. TPM is important because it represents the majority of the workload in XML search systems [6]. Both data and queries (twigs) in TPM are node-labeled trees, with no distinction between node types. Figure 1 shows a twig query and data with a match. A match is a mapping of query nodes to data nodes that respects labels and the ancestor-descendant (A–D) and parent-child (P–C) relationships specified by the query edges, respectively represented by double and single lines in figures here.
[Figure 1 renders as graphics: a twig query and a data tree over the labeled node instances a₁...a₄, b₁...b₆, c₁...c₆, x₁, x₂, y₁, z₁, z₂.]

Figure 1: Twig query and data with matches.
Twig joins are algorithms for evaluating TPM queries on indexed data, where the index typically has one list of data nodes for each label. A query is evaluated by reading the label-matching data nodes for each query node, and combining these into full query matches. There exist algorithms that perform twig joins in worst-case optimal time [3], but current non-optimal algorithms that use simpler data structures are faster in practice [10, 11].

In this paper we present twig join algorithms that achieve worst-case optimality without sacrificing practical performance. Our main contributions are (i) a classification of filtering methods as weak or strict, and a discussion of how filtering influences practical and worst-case performance; (ii) level split vectors, a data structure yielding linear-time result enumeration with almost no practical overhead; (iii) getPart, a method for merging input streams that gives additional inexpensive filtering and practical speedup; (iv) TJStrictPost and TJStrictPre, worst-case optimal algorithms that unify and extend previous filtering strategies; and (v) a thorough experimental comparison of the effects of combining different techniques. Compared to the fastest previous solution, our best algorithm is on average three times as fast, and never more than 20% slower.
The scope of this paper is twig joins reading simple streams from label-partitioned data. See Section 6 for orthogonal related work that introduces other assumptions on how to partition and access the underlying data.
2 Background
A schema-agnostic system for indexing labeled trees usually maintains one list of data nodes per label. Each node is stored with position information that enables checking A–D and P–C relationships in constant time. A common approach is to assign intervals to nodes, such that containment reflects ancestry. Tree depth can then be used to determine parenthood [15].
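Such an interval scheme can be sketched as follows; the field names and the exact labeling are assumptions for illustration, not the encoding of any particular cited system:

```python
from dataclasses import dataclass

@dataclass
class Pos:
    """Interval position label: a node's interval strictly contains the
    intervals of exactly its descendants, and depth distinguishes a
    parent from other ancestors."""
    start: int
    end: int
    depth: int

def is_ancestor(a, d):
    """A-D check in constant time: containment of intervals."""
    return a.start < d.start and d.end < a.end

def is_parent(a, d):
    """P-C check: ancestry plus a depth difference of exactly one."""
    return is_ancestor(a, d) and a.depth == d.depth - 1
```

Both predicates are O(1), which is what makes the structural joins discussed below efficient.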
An early approach to twig joins was to evaluate query tree edges separately using binary joins, but when A–D edges are involved, this can give huge intermediate results even when the final set of matches is small [2]. This deficiency led to the introduction of multi-way twig join algorithms. TwigStack [2] can evaluate twig queries without P–C edges in linear time. It only uses memory linear in the maximum depth of the data tree. However, when P–C and A–D edges are mixed, more memory is needed to evaluate queries in linear time [13]. The example in Figure 2 hints at why.
More recent algorithms, which are used as a starting point for our methods, relax the memory requirement to be linear in the size of the input to the join. They follow a general scheme illustrated in Figure 3. The scheme has two phases, where the first phase has two components. The first component merges the stream of data node matches for each query node into a single stream of query and data node pairs. The second component organizes these matches into an intermediate data structure where matched A–D and P–C relationships are registered. This structure is used to enumerate results in the second phase.
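The first component, merging position-sorted per-query-node streams into one stream of pairs, can be sketched with a heap. Names and the stream representation are assumptions; the tie-breaking rules of Definitions 3 and 5 below are omitted:

```python
import heapq

def merge_streams(streams):
    """Merge per-query-node streams of data-node matches into a single
    stream of (query node, data node) pairs, ordered by data node
    position.  `streams` maps a query node to an iterable of
    (position, data node) pairs, each already position-sorted."""
    entries = list(streams.items())            # [(query node, stream)]
    iters = [iter(s) for _, s in entries]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], i, first[1]))
    while heap:
        pos, i, node = heapq.heappop(heap)
        yield entries[i][0], node
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], i, nxt[1]))
```

The stream index `i` in the heap tuples breaks position ties deterministically; a full implementation would break ties by query node order instead, as required by the match orderings defined in Section 3.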
The algorithms broadly fall into two categories. So-called top-down and bottom-up algorithms process and store the data nodes in preorder and postorder, respectively, and filter data nodes on matched prefix paths and subtrees before they are added to the intermediate results. Many algorithms use both types of filtering, which means the processing is a hybrid of top-down and bottom-up.

[Figure 2 renders as graphics: a small query and a data tree over nodes a₁, a₂, b₁...bₙ₊₁, c₁...cₙ₊₁.]

Figure 2: Hard case with restricted memory. It cannot be known whether b₁, ..., bₙ are useful before cₙ₊₁ is seen, or whether c₁, ..., cₙ are useful before bₙ₊₁ is seen.

[Figure 3 renders as graphics: per-label input streams a₁a₂..., b₁b₂..., c₁c₂... feed Component 1, which emits a merged stream of pairs (c₁↦c₁, b₁↦b₁, ...) into Component 2; the intermediate results it builds are read by Phase 2, which enumerates matches such as (a₁, b₂, c₅) and (a₁, b₃, c₅).]

Figure 3: Work-flow of twig join algorithms, with input stream merge component, intermediate result construction component, and result enumeration phase.
Twig²Stack [3] was the first linear twig join algorithm. It reorders the input into a single postorder stream to build intermediate results bottom-up. The data nodes matching a query node are stored in a composite data structure (postorder-sorted lists of trees of stacks), as shown in Figure 4a. Matches are added to the intermediate results only if relations to child query nodes are satisfied, and each match has a list of pointers to usable child query node matches.
[Figure 4 renders as graphics: (a) Twig²Stack's postorder-sorted lists of trees of stacks, and (b) TwigList's plain vectors with descendant intervals, both populated with the a, b and c matches from Figure 1.]

Figure 4: Intermediate result data structures when evaluating the query in Figure 1. (a) Twig²Stack. (b) TwigList.
HolisticTwigStack [8] uses similar data structures, but builds intermediate results top-down in preorder, and filters matches on whether there is a usable match for the parent query node. It uses the getNext function from the TwigStack algorithm [2] as its input stream merge component, which implements an inexpensive weaker form of bottom-up filtering. The combined filtering does not give worst-case optimality, but results in faster average-case evaluation of queries than for Twig²Stack [8].
One approach for improving practical performance is using simpler data structures for intermediate results. TwigList [11] evaluates data nodes in postorder like Twig²Stack, but stores intermediate nodes in simple vectors, and does not differentiate between A–D and P–C relationships in the construction phase. Given a parent and child query node, the descendants of a match for the parent are found in an interval in the child's vector, as shown in Figure 4b. Interval start indexes are set as nodes are pushed onto a global stack in preorder, and end indexes are set as nodes are popped off the global stack in postorder. A node is added to the intermediate results if all descendant intervals are non-empty. Compared to Twig²Stack, this gives weaker filtering and non-linear worst-case performance, but it is more efficient in practice [11], according to the authors because of less computational overhead and better spatial locality.
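The interval construction just described can be sketched as follows. The event-based input representation and all names are assumptions for this sketch, and A–D and P–C edges are deliberately not distinguished, matching the construction phase described above:

```python
def twiglist_intervals(events, query_children):
    """TwigList-style intermediate results.  `events` lists the matched
    data nodes in document order as (kind, query node, data node)
    tuples with kind "open" or "close", properly nested.  Each accepted
    match is appended to a plain vector per query node, together with
    the [start, end) interval of each child query node's vector that
    holds its descendants."""
    vectors = {q: [] for q in query_children}
    stack = []                                   # global stack, preorder
    for kind, q, node in events:
        if kind == "open":
            # record where each child vector currently ends
            starts = {c: len(vectors[c]) for c in query_children[q]}
            stack.append((q, node, starts))
        else:                                    # "close": postorder
            q, node, starts = stack.pop()
            intervals = {c: (s, len(vectors[c])) for c, s in starts.items()}
            # keep the node only if every descendant interval is non-empty
            if all(s < e for s, e in intervals.values()):
                vectors[q].append((node, intervals))
    return vectors
```

This mirrors why the filtering is weak: a node is kept as soon as each child vector grew, without checking P–C edges or which descendants are actually usable.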
TwigFast [10] is an algorithm that uses data structures similar to those of TwigList, but stores data nodes in preorder. It uses the same preorder filtering as HolisticTwigStack, and inherits postorder filtering from the getNext input stream merge component. There are several algorithms that utilize both types of filtering, but among these TwigFast has the best practical performance [1, 3, 10]. Figure 5 gives an overview of twig join algorithms, and various properties that are introduced in Section 3.
Algorithm          Ref.     Filtering order  Path check  Subtree check  Interm. results  Optimal
getNext            [2]      GN               none        weak           N/A              N/A
TwigStack          [2]      GN+pre           strict      weak           complex          no
Twig²Stack         [3]      post             none        strict         complex          yes
½T²S               [3]      pre+post         weak        strict         complex          yes
½T²S               [1]      GN+pre+post      weak        strict         complex          yes
HolisticTwigStack  [8]      GN+pre           weak        weak           complex          no
TwigList           [11]     post             none        weak           vectors          no
TwigMix            [10]     GN+pre+post      weak        weak           vectors          incorrect
TwigFast           [10]     GN+pre           weak        weak           vectors          no
TJStrictPost       Sect. 4  pre+post         strict      strict         vectors          yes
TJStrictPre        Sect. 4  GN+pre(+post)    strict      strict         vectors          yes

Figure 5: Previous combinations of prefix path and subtree filtering. Intermediate result storage order is given by the last item in “Filtering order”. GN is the node order returned by the getNext input stream merger.
3 Premises for Performance

To make algorithms that are both fast in practice and worst-case optimal, we need an understanding of how filtering strategies and data structures impact performance.
For any graph G, let V(G) be its node set and E(G) be its edge set. Let a matching problem M be a triple 〈Q, D, I〉, where Q is a query tree, D is a data tree, and I ⊆ V(Q) × V(D) is a relation such that for q ↦ q′ ∈ I the node label of q equals the node label of q′. Each edge 〈p, q〉 ∈ E(Q) has a label L(〈p, q〉) ∈ {“A–D”, “P–C”}, specifying an ancestor-descendant or parent-child relationship. Let a node map for M be any function M ⊆ I. Assume a given M = 〈Q, D, I〉 when not otherwise specified.
Definition 1 (Weak/strict edge satisfaction). The node map M weakly satisfies a downward edge e = 〈p, q〉 ∈ E(Q) iff M(p) is an ancestor of M(q), and strictly satisfies e iff M(p) and M(q) are related as specified by L(e).
Definition 2 (Match). Given subgraphs Q″ ⊆ Q′ ⊆ Q, the node map M : V(Q′) → V(D) is a weak (strict) match for Q″ iff all edges in Q″ are weakly (strictly) satisfied by M. If Q″ = Q we call M a weak (strict) full match.

Where no confusion arises, the term weak (strict) match may also be used for M(Q). We denote the set of unique strict full matches by O. As is common, we view the size of the query as a constant, and call a twig join algorithm linear and optimal if the combined data and result complexity is O(I + O) [3].¹
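Definitions 1 and 2 translate directly into a check over a node map. The helpers below are assumed predicates over position labels (the ASCII labels "A-D"/"P-C" stand for the paper's edge labels), not code from the paper:

```python
def satisfies(M, query_edges, is_ancestor, is_parent, strict):
    """Check whether the node map M (query node -> position label)
    weakly satisfies every downward query edge, or, if `strict` is set,
    satisfies each edge exactly as its "A-D"/"P-C" label demands.
    `query_edges` maps an edge (p, q) to its label."""
    for (p, q), label in query_edges.items():
        if strict and label == "P-C":
            if not is_parent(M[p], M[q]):
                return False
        elif not is_ancestor(M[p], M[q]):
            return False                 # weak mode only needs ancestry
    return True
```

With interval-plus-depth position labels, a P–C edge mapped to a grandparent-grandchild pair passes the weak check but fails the strict one, which is precisely the weak/strict distinction of Definition 1.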
The results presented in the following all apply to both weak and strict matching, unless otherwise specified. The following lemma implies that we can use filtering strategies that only consider parts of the query.

Lemma 1 (Filtering). If there exists a Q′ ⊆ Q containing q where no match M′ for Q′ contains q ↦ q′, then there exists no match M for Q containing q ↦ q′.

Proof. By contraposition. Given a match M ∋ q ↦ q′ for Q, for any Q′ ⊆ Q containing q, the match M \ {p ↦ p′ | p ∉ Q′} matches Q′ and contains q ↦ q′.
3.1 Preorder Filtering on Matched Paths
Many current algorithms use the getNext input stream merge component [2], which returns data nodes in a relaxed preorder, which only dictates the order of matches for query nodes related by ancestry. This is not detailed in the original description [2] and is easy to miss.² The TwigMix algorithm incorrectly assumes strict preorder [10] (see Appendix E).
Definition 3 (Match preorder). The sequence of pairs q₁ ↦ q₁′, ..., qₖ ↦ qₖ′ ∈ V(Q) × V(D) is in global match preorder iff for any i < j either (1) qᵢ′ precedes qⱼ′ in tree preorder, or (2) qᵢ′ = qⱼ′ and qᵢ precedes qⱼ in tree postorder. The sequence is in local match preorder if (1) and (2) hold for any i < j where qᵢ = qⱼ or 〈qᵢ, qⱼ〉 ∈ E(Q) or 〈qⱼ, qᵢ〉 ∈ E(Q).
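The global ordering of Definition 3 can be realized as a sort key; the preorder and postorder rank tables are assumed inputs for this sketch:

```python
def global_match_preorder_key(pair, pre, post):
    """Sort key for Definition 3: order match pairs primarily by the
    data node's preorder rank, secondarily by the query node's
    postorder rank.  `pre` ranks data nodes, `post` ranks query nodes."""
    q, d = pair
    return (pre[d], post[q])
```

Sorting the merged pair stream by this key yields global match preorder, including the secondary query-node ordering whose purpose is explained below.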
The following definition formalizes a filtering criterion commonly used when processing data nodes in preorder, local or global.

Definition 4 (Prefix path match). M is a prefix path match for qₖ ∈ Q iff it is a match for the (simple) path q₁ ... qₖ, where q₁ is the root of Q.
To implement prefix path match filtering, preorder algorithms maintain the set of open nodes, i.e., the ancestors, at the current position in the tree. Most algorithms have one stack of open data nodes for each query node, and given a current pair q ↦ q′ pop non-ancestors of q′ from the stacks of q and its parent [2, 8, 10]. Weak filtering can then be implemented by checking if (i) q is the root, or (ii) the stack for the parent of q is non-empty. If q′ is not filtered out, it is pushed onto the stack for q, and added to the intermediate results. This can be extended to strict checking of P–C edges by inspecting the top node on the parent's stack. Strict prefix path matching is rarely used in practice, as can be seen from the fourth column in Figure 5.

¹ For Twig²Stack the combined data, query and result complexity is O(I log Q + I·b_Q + O·Q), where b_Q is the maximum branching factor in the query [3]. The TJStrict algorithms we present in Section 4 have the same complexity when using a heap-based input stream merger, and O(I·Q + O·Q) complexity when using a getNext-based input stream merger. Note that the total numbers of data node references in the input and output are |I| and |O|·|Q|, respectively.

² To be precise, getNext also returns matches for sibling query nodes in order.
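The weak and strict stack checks described above can be sketched as follows, assuming the stacks have already been popped down to ancestors of the current data node; all names are hypothetical:

```python
def prefix_path_check(q, d, stacks, query_parent, depth, edge_label):
    """Prefix path filtering for a current pair q -> d.  `stacks[p]`
    holds the open data-node matches for query node p (already popped
    down to ancestors of d), `query_parent` maps a query node to its
    parent, and `depth[x]` is x's tree depth.  Weak filtering only
    requires some open match for the parent query node; the strict
    check for a "P-C" edge also inspects the top of the parent's stack."""
    p = query_parent.get(q)
    if p is None:                        # (i) q is the query root
        return True
    if not stacks[p]:                    # (ii) parent stack non-empty
        return False
    if edge_label == "P-C":
        return depth[stacks[p][-1]] == depth[d] - 1
    return True
```

Only nodes passing this check would be pushed onto the stack for q and added to the intermediate results.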
The implementation of prefix path checks is the reason for the secondary ordering on query node postorder for match pairs in Definition 3. Without the secondary ordering, problems arise when multiple query nodes have the same label: a data node could be misinterpreted as a usable ancestor of itself when checking for non-empty stacks, or hide a proper parent of itself when checking top stack elements.

Algorithms storing intermediate results in postorder use a global stack for the query [3, 11], and inspection of the top stack node cannot be used to implement prefix path matching when the query contains A–D edges, as ancestors may be hidden deep in the stack. Extending these algorithms to implement prefix path filtering requires maintaining additional data structures.
The choice <strong>of</strong> preorder filtering does not influence optimality, as illustrated by the<br />
following example.<br />
Example 1. Assume non-branching data generated by /(α_1/)^n . . . (α_m/)^n β/γ, and the
query ⫽α_1⫽ . . . ⫽α_m/γ, where α_1, . . . , α_m, β, γ are all distinct labels. Unless it is signaled
bottom-up that the pattern α_m/γ is not matched below, the result enumeration phase will
take Ω(n^m) time, because all combinations of matches for α_1, . . . , α_m will be tested.
3.2 Postorder Filtering on Matched Subtrees

The ordering of match pairs required by most bottom-up algorithms is symmetric with
the global preorder case:
Definition 5 (Match postorder). A sequence of pairs q_1↦→q′_1, . . . , q_k↦→q′_k is in match
postorder iff for any i < j either (1) q′_i precedes q′_j in tree postorder, or (2) q′_i = q′_j and
q_i precedes q_j in tree preorder.
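Definition 5 can be read as a two-key comparator. A minimal sketch, assuming each pair carries the data node's tree postorder rank and the query node's tree preorder rank (the field names are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative match pair: query node by tree preorder rank,
// data node by tree postorder rank.
struct MatchPair { int qPre; int dPost; };

// Match postorder (Definition 5): primary key is the data node's tree
// postorder; ties on the same data node are broken by the query node's
// tree preorder.
bool matchPostorderLess(const MatchPair& a, const MatchPair& b) {
    if (a.dPost != b.dPost) return a.dPost < b.dPost;
    return a.qPre < b.qPre;
}
```

Sorting any pair sequence with this comparator, e.g. via `std::sort`, yields match postorder.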
The second property is required for the correctness of both Twig²Stack [3], where a
data node could hide proper children of itself, and TwigList [11], where a node could be
added as a descendant of itself.
Definition 6 (Subtree match). M is a subtree match for q ∈ Q iff it is a match for the
subtree rooted at q.
Example 1 also illustrates why strict subtree match checking is required for optimality,
because no node labeled γ is a direct child of a node labeled α_m in the data. As
described in Section 2, Twig²Stack and TwigList respectively implement strict and weak
subtree match filtering, and for these algorithms strict filtering is required for optimality.
TwigList could be extended to strict filtering by traversing descendant intervals to
look for direct children, but this would have quadratic cost if implemented naively, as
descendant intervals may overlap.
In algorithms storing data in preorder, nodes are added to the intermediate results
after passing preorder checks. If nodes are stored in arrays, later removing a node failing
a postorder check from the middle of an array would incur significant cost. Note that
many of the algorithms listed in Figure 5 inherit weak subtree match filtering from the
input stream merger used, as described later in Section 4.4.
3.3 Result Enumeration

Even if the strict subtree match checking sketched for TwigList above could be implemented
efficiently, results would still not be enumerated in linear time, as usable child
nodes may be scattered throughout descendant intervals, as shown by the following example:
Example 2. Assume a tree constructed from the nodes {a_1, . . . , a_n, b_1, . . . , b_2n}, labeled
a and b. Let each node a_i have a left child b_i, a right child b_{n+i}, and a middle child a_{i+1}
if i < n. Given the query ⫽a/b, each node a_i is part of two matches, one with b_i and one
with b_{n+i}, but to find these two matches, 2n − 2i useless b-nodes must be traversed in the
descendant interval. This gives a total enumeration cost of Ω(n²).
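The Ω(n²) bound in Example 2 can be verified by simple counting: each a_i scans 2(n − i) useless b-nodes, which sums to n(n − 1) over the whole tree. A small sketch of this arithmetic:

```cpp
#include <cassert>

// Total useless b-nodes scanned in Example 2: the descendant interval
// of a_i holds 2(n - i) b-nodes that are not its children, so the
// enumeration touches sum_{i=1..n} 2(n - i) = n(n - 1) useless entries.
long uselessTraversals(long n) {
    long total = 0;
    for (long i = 1; i <= n; ++i) total += 2 * (n - i);
    return total;
}
```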
The following theorem formalizes properties of the intermediate result storage in
Twig²Stack that are key to its optimality.

Theorem 1 (Linear time result enumeration [3]). The result set O can be enumerated in
Θ(O) time if (i) the data nodes d such that root(Q)↦→d is part of a strict full match can
be found in time linear in their number, and (ii) given a pair q↦→q′, for each child c of
q, the pairs c↦→c′ that are part of a strict subtree match for q together with q↦→q′ can be
enumerated in time linear in their number.
3.4 Full Matches

When different types of filtering strategies are combined, it may be interesting to know
when additional filtering passes will not remove more nodes.

Theorem 2 (Full Match). (i) A pair q↦→q′ is part of a full match iff (ii) it is part of a
prefix path match that only uses pairs that are part of subtree matches.
Proof sketch. (i ⇒ ii) Follows from Lemma 1. (i ⇐ ii) Let M = 〈Q, D, I〉 be the initial
matching problem, and M′ = 〈Q, D, I′〉 be the matching problem where I′ is the set of
pairs that are part of subtree matches in M. The theorem is true for pairs with the query
root, as for the query root a subtree match is a full match. Assume that there is a prefix
path match M↓q ∋ q↦→q′ for q in M′, and that p is the parent of q. By construction,
M↓q ∋ p↦→p′ is also a prefix path match for p. We use induction on the query node
depth, and prove that if p↦→p′ is part of a full match M_p for p, then q↦→q′ must be
part of a full match for q. Let Q_q be the subtree rooted at q, and Q_p = Q \ Q_q. Let
M′_p = M_p \ {r↦→r′ | r ∈ Q_q}. By the assumption q↦→q′ ∈ I′, there exists a subtree match
M↑q ∋ q↦→q′ for q. Then the node map M_q = M′_p ∪ M↑q is a full match for q, because
(1) p↦→p′ ∈ M′_p and q↦→q′ ∈ M↑q must satisfy the edge 〈p, q〉 as they are used together
in M↓q, and (2) Q_p and Q_q can be matched independently when the mapping of p and q
is fixed.
In other words, if nodes are filtered first in postorder on strictly matched subtrees, and
then in preorder on strictly matched prefix paths, the intermediate result set contains only
data nodes that are part of the final result. The opposite is not true: in the example in
Figure 1, the pair c↦→c_3 would not be removed if strict prefix path filtering was followed
by strict subtree match filtering.

Note that strict full match filtering is not necessary for optimal enumeration by
Theorem 1, and that the optimal algorithms we present in the following do not use it. They
use prefix path match filtering followed by subtree match filtering, where the former is
only used to speed up practical performance. On the other hand, the input stream merge
component we introduce in Section 4.4 gives inexpensive weak full match filtering.
4 Fast Optimal Twig Joins

In this section we create an algorithmic framework that permits any combination of
preorder and postorder filtering. First we introduce a new data structure that enables
strict subtree match checking and linear result enumeration.
4.1 Level Split Vectors

A key to the practical performance of TwigList and TwigFast is the storage of intermediate
nodes in simple vectors [11], but this scheme makes it hard to improve worst-case behavior.
To enable direct access to usable child matches below both A–D and P–C edges, we
split the intermediate result vectors for query nodes below P–C edges, such that there is
one vector for each data tree level observed, as shown in Figure 6. Given a node in the
intermediate results, matches for a child query node below an A–D edge can be found in
a regular descendant interval, while the matches for a child query node below a P–C edge
can be found in a child interval in a child vector. This vector can be identified by the
depth of the parent data node plus one.

In the following we assume that level split vectors can be accessed by level in amortized
constant time. This is true if level split vectors are stored in dynamic arrays, and the
depth of the deepest data node is asymptotically bounded by the size of the input, that
is, d ∈ O(I). If this bound does not hold, which can be the case when |I| ≪ |D|, expected
linear performance can be achieved by storing level split vectors in hash maps.
When nodes are added to the intermediate results in postorder, checking for non-empty
descendant and child intervals inductively implies strict subtree match filtering.
This is illustrated for our example in Figure 6. As each check takes constant time,
the intermediate results can be constructed in Θ(I) time, as for TwigList [11]. Result
enumeration with this data structure is a trivial recursive iteration through nodes and
their child and descendant intervals, which is almost identical to the enumeration in
TwigList. The difference is that no extra check is necessary for P–C relationships, and
that the result enumeration is Θ(O) by Theorem 1 when strict subtree match filtering is
applied.
[Figure 6 content: vectors for b (b_1, b_2, b_4, b_3, b_5, b_6) and a (a_4, a_1), with the matches for c split into one vector per data tree level: 1: c_1; 2: c_2; 3: c_5; 4: c_4, c_6; 5: c_3.]

Figure 6: Intermediate data structures using level split vectors and strict subtree match
filtering. As opposed to in Figure 4b, a_2 does not satisfy.
4.2 The TJStrictPost Algorithm

Algorithm 1 shows the general framework we use for postorder construction of intermediate
results, extending algorithms like TwigList [11]. It allows using any combination of
the preorder and postorder checks described in Section 3, from none to weak and strict,
and allows using either simple vectors or level split vectors. A global stack is used to
maintain the set of open data nodes, and if prefix path matching is implemented, a local
stack for each query node is maintained in parallel. The input stream merge component
used is a priority queue implemented with a binary heap. The postorder storage approach
used here requires global ordering, and cannot read local preorder input (see Appendix E).
Algorithm 1 Postorder construction.
While ¬Eof:
    Read next q↦→d.
    While non-ancestors of d on global stack:
        Pop q′↦→d′ from global and local stack.
        If q′↦→d′ satisfies postorder checks:
            Set interval end index for d′ in the vector of each child of q′.
            Add d′ to intermediate results for q′.
    If d satisfies preorder checks:
        For each child of q, set interval start index for d.
        Push q↦→d on global stack.
        Push d on local stack for q.
Clean remaining nodes from the stacks.
Enumerate results.
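The ancestor test behind the "pop non-ancestors of d" step can be sketched with region encoding (our own minimal layout; the paper's nodes also carry levels and interval indexes):

```cpp
#include <cassert>
#include <vector>

// Region encoding: a is an ancestor of d iff a's range encloses d's.
struct Region { int start, end; };

bool isAncestor(const Region& a, const Region& d) {
    return a.start < d.start && d.end < a.end;
}

// Stack cleaning as in Algorithm 1: input arrives in preorder, so any
// stack entry that does not enclose the next node d can never be an
// ancestor of any later node either, and is safe to pop.
void popNonAncestors(std::vector<Region>& stack, const Region& d) {
    while (!stack.empty() && !isAncestor(stack.back(), d)) stack.pop_back();
}
```

Because each node is pushed and popped at most once, the cleaning loop costs amortized constant time per input pair.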
The correctness of Algorithm 1 follows from the correctness of the filtering strategies
described in Sections 3.1 and 3.2, and the correctness of TwigList [11], with the
enumeration algorithm trivially extended to use child intervals when level split vectors are
used.
We now define the TJStrictPost algorithm, which builds on this framework. Detailed
pseudocode can be found in Appendix A. The algorithm uses level split vectors, and, as
opposed to the previous twig join algorithms listed in Figure 5, it includes strict checking
of both matched prefix paths and subtrees. The former is implemented by checking the top
data node on the local stack of the parent query node, while the latter is implemented by
checking for non-empty child and descendant intervals. A Θ(I + O) running time follows
from the discussion in Section 4.1.
4.2.1 A Note on TwigList

As noted in the original description, chaining nodes with the same tree level into linked
lists inside descendant intervals can improve practical performance in TwigList [11]. However,
as the first child match with the correct level must still be searched for, further
changes are needed to achieve linear worst-case evaluation. This can be implemented by
maintaining a vector for each query node holding the previous match on each tree level at
any time. A node must then be given pointers to such previous matches as it is pushed
on the stack in TwigList. When the node is popped off the stack, it can then be checked
whether any children have been found, and intermediate results can be enumerated in
linear time, assuming that d ∈ O(I), as for our solution.
4.3 The TJStrictPre Algorithm

Algorithm 2 shows the general framework we use to construct intermediate results in preorder,
extending algorithms like TwigFast [10]. It supports any combination of preorder
and postorder filtering, simple or level split vectors, and input in global or local preorder.
In contrast to postorder storage, nodes are inserted directly into the intermediate result
vectors after they have passed a prefix path check. Local stacks store references to open
nodes in the intermediate results. If strict subtree match filtering is required, or weak
subtree match filtering is not implied by the input stream merger, intermediate results
are filtered bottom-up in a post-processing pass.
TwigFast is reported to have faster average-case query evaluation than previous twig
joins [10], and we hope to match this performance in a worst-case linear algorithm. The
TJStrictPre algorithm is similar to TJStrictPost: it uses strict checking of prefix paths
and subtrees, and stores intermediate results in level split vectors. See detailed pseudocode
in Appendix B. TJStrictPre uses the getPart input stream merger, which is an
improvement of getNext described in Section 4.4. If the post-processing pass can be performed
in linear time, then the algorithm can evaluate twig queries in Θ(I + O) time, by
the same argument as for TJStrictPost.
The filtering pass is implemented by cleaning intermediate result vectors bottom-up
in the query, in-place overwriting data nodes not satisfying subtree matches, as described
in detail in Appendix D. The indexes of kept nodes are stored in a separate vector for
each query node, and are used to translate old start and end indexes into new positions.
Algorithm 2 Preorder construction.
While ¬Eof:
    Read next q↦→d.
    For the stack of both q's parent and q itself:
        Pop non-ancestors of d, and set their end indexes.
    If d satisfies preorder checks:
        For each child of q, set interval start index for d.
        Add d to intermediate results for q.
        Push reference to d on stack for q.
Clean stacks.
Clean intermediate results with postorder checks.
Enumerate results.
To achieve linear traversal, the intermediate result vector of a node is traversed in parallel
with the index vectors of the children after they have been cleaned. For level split vectors,
there is one separate index vector per used level, and a separate vector iterator is used
per level when the parent is cleaned. Also, there is an array giving the level of each stored
data node in preorder, such that split and non-split child vectors can then be traversed
in parallel. Start values are updated as nodes are pushed onto a stack in preorder, while
end values are updated as nodes are popped off in postorder.
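The in-place cleaning step might be sketched as follows: entries failing the subtree check are overwritten, and a vector of new positions lets stored start and end indexes be rebased afterwards. Function and parameter names are ours, not the paper's.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Compact v in place, keeping entries for which `keep` holds. Returns
// newIndex so an old interval [s, e) over v can be translated to new
// positions; dropped entries get newIndex[i] == -1.
std::vector<int> compactInPlace(std::vector<int>& v,
                                const std::function<bool(int)>& keep) {
    std::vector<int> newIndex(v.size(), -1);
    size_t write = 0;
    for (size_t read = 0; read < v.size(); ++read) {
        if (keep(v[read])) {
            newIndex[read] = (int)write;
            v[write++] = v[read];  // overwrite failed entries in place
        }
    }
    v.resize(write);
    return newIndex;
}
```

One such pass per query node, bottom-up, keeps the whole post-processing step linear in the size of the intermediate results.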
4.4 The getPart Input Stream Merger

The getNext input stream merge component implements weak subtree match filtering
in Θ(I) time, and is used to improve practical performance in many current algorithms
using preorder storage [8, 10]. Assume in the following discussion that there is one preorder
stream of label-matching data nodes associated with each query node. The input stream
merger repeatedly returns pairs containing a query node and the data node at the head
of its stream, implementing Comp. 1 in Figure 3.
The getNext function processes the query bottom-up to find a query node that satisfies
the following three properties: (1) when its stream is forwarded at least until its head
follows the heads of the streams of the children in postorder, it still precedes them in
preorder, (2) all children satisfy properties 1 and 2, and (3) if there is a parent, it does
not satisfy 1 and 2. Property 2 implies that weak subtree filtering is achieved, and
Property 3 implies that local preorder by Definition 3 is achieved.
The procedure is efficient if leaf query nodes have relatively few matches, which can
be the case in practice in XML search when all query leaf nodes are selective text value
predicates. However, if the internal query nodes are more selective than the leaf nodes,
or if not all leaves are selective, the overhead of using the getNext function may outweigh
the benefits.
To improve practical performance we introduce the getPart function, which requires
the following property in addition to the above three: (4) if there is a parent, then the
current head of stream is a descendant of a data node that was the head of stream for
the parent in some previous subtree match for the parent. This inductively implies that
nodes returned are also weak prefix path matches, and from the ordering of the filtering
steps, the result is weak full match filtering by Theorem 2. To allow forwarding streams
to find such nodes, the algorithm can no longer be stateless, as shown by the following
example:
Example 3. Assume that the heads of the streams for query nodes a_1 and b_1 in Figure 1
are a_3 and b_2, respectively. Then it cannot be known by only inspecting heads of streams
whether or not any usable ancestors of b_2 were seen before a_3, and b_2 must be returned
regardless.
Property 4 is implemented in getPart by maintaining, for each query node, the data
node latest in the tree postorder that has been part of a weak full match. This value is
updated when a query node is found to satisfy all four properties. To ensure progress,
streams are forwarded top-down in the query to match the stored value or the current
head for the parent node. Note that multiple top-down and bottom-up passes may be
needed to find a satisfying node, but each such pass forwards at least one stream past
useless matches. See detailed pseudocode in Appendix C.
5 Experiments

The following experiments explore the effects of weak and strict matching of prefix paths
and subtrees, different input stream merge functions, and level split vectors.
We have used the DBLP, XMark and Treebank benchmark data, and run the commonly
used DBLP queries from the PRIX paper [12], the XMark queries from the XPathMark
suite part A [5] (except queries 7 and 8, which are not twigs), and the Treebank
queries from the TwigList paper [11]. In addition, we have created some artificial data and
queries. Details can be found in Appendix F. The experiments were run on a computer
with an AMD Athlon 64 3500+ processor and 4 GB of RAM. All queries were warmed
up by 3 runs and then run 100 times, or until at least 10 seconds had passed, measuring
average running time.
All algorithms are implemented in C++, and features are turned on or off at compile
time to make sure the overhead of complex methods does not affect simpler methods.
Feature combinations are coded with 5-letter tags. We use Heap, getNext and getPart
for merging the input streams, and store intermediate results in prEorder or pOstorder.
We use no (-), Weak or Strict prefix path match filtering, and no (-), Weak or Strict
subtree match filtering. Intermediate results are stored in simple vectors (-) or Level
split vectors. The previous algorithms TwigList and TwigFast are denoted by HO-W- and
NEWW-, respectively, while TJStrictPost and TJStrictPre are denoted by HOSSL and
PESSL. Note that filtering checks are not performed in intermediate result construction
if the given filtering level is already achieved by the input stream merger. Strict subtree
match filtering is implemented by descendant interval traversal when not using level split
vectors. With preorder storage an extra filtering pass is used to implement subtree match
filtering. Worst-case optimal algorithms match the pattern ***SL.
We present no performance comparisons with the Twig²Stack algorithm because it
does not fit into our general framework. Getting an accurate and fair comparison would
be an exercise in independent program optimization. TwigList is previously reported to
be 2–8 times faster than Twig²Stack for queries similar to ours [11].
5.1 Checked Paths and Subtrees

Figure 7 shows results for running the XMark query ⫽annotation/keyword, with cost
divided into different components and phases. Filtering lowers the cost of building intermediate
data structures because their size is reduced, and the cost of enumerating
results because redundant traversal is avoided. Note that this query was chosen because
it shows the potential effects of prefix path and subtree match filtering, and it may not
be representative.
[Figure 7 content: horizontal bar chart over the methods HO---, HO-W-, HO-S-, HO-SL, HOW--, HOWW-, HOWS-, HOWSL, HOS--, HOSW-, HOSS- and HOSSL, with each bar (0 to about 0.1 seconds) split into merge, construct and enumerate phases.]

Figure 7: Query //annotation/keyword on XMark data. Cost divided into merging
input, building intermediate results, and result enumeration.
Figure 8 shows the effects of prefix path vs. subtree match filtering averaged over all
queries on DBLP, XMark and Treebank. Heap input stream merging was used because it
allows all filtering levels, and postorder storage was used to avoid extra filtering passes.
Each timing has been normalized to 1 for the fastest method for each query. Raw timings
are listed in Appendix G. As opposed to in Figure 7, there is on average little difference
between the methods using at least some filtering both in preorder and postorder. The
benefits of asymptotically constant-time checks when using level split vectors seem to be
outweighed by the cost of maintaining and accessing them, but only by a small margin.
5.2 Reading Input

Figure 9 shows the effect of using different input stream mergers. The labels in the
artificial data tree used are Zipf distributed (s = 1), with a, b, y and z being the labels on
30%, 13%, 1.0% and 1.0% of the nodes, respectively. The data and queries were chosen
to shed light on both the benefits and the possible overhead of using the advanced input
methods.
                        Prefix path
              -            W            S            (all)
 Subtree      avg  max     avg  max     avg  max     avg  max
  --          1.84 3.5     1.29 1.61    1.23 1.62    1.45 3.5
  W-          1.24 1.45    1.03 1.13    1.01 1.06    1.09 1.45
  S-          1.24 1.46    1.03 1.10    1.02 1.08    1.10 1.46
  SL          1.32 1.55    1.09 1.19    1.06 1.20    1.16 1.55
  (all)       1.41 3.5     1.11 1.61    1.08 1.62    1.20 3.5

Figure 8: The effect of parent match filtering vs. child match filtering, running the DBLP,
XPathMark and Treebank queries. Query times are normalized to 1 for the fastest method
for each query; arithmetic mean and maximum of the normalized times are shown.

[Figure 9 content: three bar charts comparing HO---, HE---, NE-W-, PEWW-, HOSSL, HESSL, NESSL and PESSL, with a different time scale per panel.]

Figure 9: Running queries on Zipf data. (a) Selective leaves. (b) Selective internal nodes.
(c) No selective nodes.

For the first query, ⫽a/b[y][z], the leaves are very selective, and both getNext and
getPart very efficiently filter away most of the nodes. The input stream merging is slightly
more efficient for the simpler getNext. In the second query, ⫽y/z[a][b], the internal nodes
are selective, while the leaves are not. Here getPart efficiently filters away many nodes,
while getNext does not, making it even slower than the simple heap, due to the additional
complexity. The third query shows a case where getPart performs worse than both the
other methods. In this query, ⫽a[a[a][a]][a[a][a]], all query nodes have very low selectivity,
and are equally probable. The filtering has almost no effect, and only causes overhead.
Note the cost difference between HOSSL and HESSL, which is due to the additional filtering
pass over the large intermediate results.
5.3 Combined Benefits

Figure 10 shows the effects of combining different input stream mergers and additional
filtering strategies. The same queries as in Figure 8 are evaluated, and the first column
shows the same tendencies: there is not much difference between the strategies as long
as at least weak match filtering is done on both prefix path and subtree.
                              Input stream merger
                HO          HE          NE          PE            (all)
 Filtering      avg max     avg max     avg max     avg  max      avg max
  ---           7.0 33      6.6 30                                6.8 33
  -W-           4.5 23      7.5 34      3.7 19                    5.2 34
  -S-           4.5 23      7.5 34      3.7 18                    5.2 34
  -SL           4.8 25      8.3 38      3.8 19                    5.6 38
  W--           4.9 24      4.9 23                                4.9 24
  WW-           3.7 19      5.4 25      3.2 15      1.02 1.11     3.3 25
  WS-           3.7 19      5.4 26      3.2 15      1.04 1.15     3.3 26
  WSL           3.9 20      5.7 27      3.2 15      1.08 1.22     3.5 27
  S--           4.8 24      4.8 24                                4.8 24
  SW-           3.7 19      5.2 26      3.2 15      1.03 1.12     3.3 26
  SS-           3.7 19      5.1 26      3.2 15      1.05 1.17     3.3 26
  SSL           3.9 20      5.5 27      3.2 15      1.05 1.20     3.4 27
  (all)         4.4 33      6.0 38      3.4 19      1.04 1.22     4.1 38

Figure 10: Input mergers vs. filtering strategies.
In the second column all methods using any subtree match filtering are more expensive,
because with preorder storage, subtree match filtering is performed in a second pass over
the intermediate results. A second pass is also used for subtree match filtering in
the third and fourth columns, but in practice the getNext and getPart components have
already filtered away more nodes, the intermediate results are smaller, and the second
pass is less expensive.

Note the difference between using getNext and getPart. The new method is more than
three times as fast on average, and is more than one order of magnitude faster for queries
where only some of the leaf nodes are selective. The getPart function also fast-forwards
through useless matches for the leaves that are not selective, while getNext passes all
leaf matches on to the intermediate result construction component. Also note that the
maximum overhead of using PESSL, the fastest worst-case optimal method, is at most
20% in any benchmark query tested.
6 Related Work

This work is based on the assumption that label partitioning and simple streams are used.
Orthogonal previous work investigates how the underlying data can be accessed. If the
streams support skipping, both unnecessary I/O and computation can be avoided [7].
Our getPart algorithm, which is detailed in Appendix C, can be modified to use any
underlying skipping technology by changing the implementation of FwdToAncOf() and
FwdToDescOf(). Refined partitioning schemes with structure indexing can be used to
reduce the number of data nodes read for each query node [4, 9]. Our twig join algorithms
are independent of the partitioning scheme used, assuming multiple partition
blocks matching a single query node are merged when read. Another technique is to use a
node encoding that allows reconstruction of data node ancestors, and use virtual streams
for the internal query nodes [14]. Our getPart algorithm could be changed to generate
virtual internal query node matches from leaf query node matches, as complete query
subtrees are always traversed. For a broader view on XML indexing see the survey by
Gou and Chirkova [6]. XPath queries can be rewritten to use only the axes self, child,
descendant, and following [16]. To add support for the following axis, we would
have to add additional logic for how to forward streams, and modify the data structures
to store start indexes for the new relationship.
7 Conclusions and Future Work
In this paper we have shown how worst-case optimality and fast evaluation in practice can be combined in twig joins. We have performed experiments that span out and extend the space of the fastest previous solutions. For common benchmark queries our new and worst-case optimal algorithms are on average three times as fast as earlier approaches. Sometimes they are more than an order of magnitude faster, and they are never more than 20% slower.

In future work we would like to combine the new techniques with previous orthogonal techniques such as skipping, refined partitioning and virtual streams. Also, it would be interesting to see an elegant worst-case linear algorithm reading local preorder input and producing preorder sorted results, that does not perform a post-processing pass over the data, and does not need the assumption d ∈ O(I).

Acknowledgments This material is based upon work supported by the iAd Project funded by the Research Council of Norway, and the Norwegian University of Science and Technology. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of the funding agencies.
References

[1] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[2] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002.

[3] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig²Stack: Bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006.
[4] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.

[5] Massimo Franceschet. XPathMark web page. http://sole.dimi.uniud.it/~massimo.franceschet/xpathmark/.

[6] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. Knowl. and Data Eng., 2007.

[7] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003.

[8] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007.

[9] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[10] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.

[11] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007.

[12] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer sequences. In Proc. ICDE, 2004.

[13] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008.

[14] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[15] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001.

[16] Ning Zhang, Varun Kacholia, and M. Tamer Özsu. A succinct physical storage scheme for efficient evaluation of path queries in XML. In Proc. ICDE, 2004.
A TJStrictPost Pseudocode

Algorithm 3 shows more detailed pseudocode for the TJStrictPost algorithm described in Section 4.2. Tree positions are assumed to be encoded using BEL (begin, end, level) [15]. The function GetVector(q, level) returns the regular intermediate result vector if q is below an A–D edge, or the split vector given by level if q is below a P–C edge.
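The containment tests implied by the BEL encoding, used throughout the algorithms in these appendices, can be sketched in Python (the class and function names are ours, for illustration only):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BEL:
    """Tree position as a (begin, end, level) interval encoding [15]."""
    begin: int
    end: int
    level: int

def is_ancestor(a: BEL, d: BEL) -> bool:
    # a is a proper ancestor of d iff d's interval nests strictly inside a's.
    return a.begin < d.begin and d.end < a.end

def is_parent(a: BEL, d: BEL) -> bool:
    # Parent-child (P-C) additionally requires adjacent levels.
    return is_ancestor(a, d) and d.level == a.level + 1
```

Checks such as CheckParentMatch in Algorithm 3 reduce to these interval and level comparisons, which is why only ends and levels need to be kept on the stacks.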
Algorithm 3 TJStrictPost
1: function EvaluateGlobal():
2:   (c, q, d) ← MergedStreamHead()
3:   while c ≠ Eof:
4:     ProcessGlobalDisjoint(d)
5:     if Open(q, d):
6:       Push(S_global, (d.end, q))
7:     (c, q, d) ← MergedStreamHead()
8:   ProcessGlobalDisjoint(∞)
9: function ProcessGlobalDisjoint(d):
10:   while S_global ≠ ∅ ∧ Top(S_global).end < d.end:
11:     (·, q) ← Pop(S_global)
12:     Close(q)
13: function Open(q, d):
14:   if CheckParentMatch(q, d):
15:     u ← new Intermediate(d)
16:     MarkStart(q, u)
17:     Push(S_local[q], u)
18:     return true
19:   else:
20:     return false
21: function Close(q):
22:   u ← Pop(S_local[q])
23:   MarkEnd(q, u)
24:   if CheckChildMatch(q, u):
25:     Append(GetVector(q, u.d.level), u)
26: function MarkStart(q, u):
27:   for r ∈ q.children:
28:     u.start[r] ← GetVector(r, u.d.level + 1).size + 1
29: function MarkEnd(q, u):
30:   for r ∈ q.children:
31:     u.end[r] ← GetVector(r, u.d.level + 1).size
32: function CheckParentMatch(q, d):
33:   if Axis(q) = "⫽":
34:     return IsRoot(q) ∨ S_local[Parent(q)] ≠ ∅
35:   else:
36:     if IsRoot(q): return d.level = 1
37:     else: return S_local[Parent(q)] ≠ ∅ ∧ d.level = Top(S_local[Parent(q)]).level + 1
38: function CheckChildMatch(q, u):
39:   for r ∈ q.children:
40:     if u.end[r] < u.start[r]: return false
41:   return true
B TJStrictPre Pseudocode

Algorithm 4 shows more detailed pseudocode for the TJStrictPre algorithm described in Section 4.3.
Algorithm 4 TJStrictPre
1: function EvaluateLocalTopDown():
2:   (c, q, d) ← MergedStreamHead()
3:   while c ≠ Eof:
4:     if ¬IsRoot(q):
5:       ProcessLocalDisjoint(Parent(q), d)
6:     ProcessLocalDisjoint(q, d)
7:     Open(q, d)
8:     (c, q, d) ← MergedStreamHead()
9:   for q ∈ Q:
10:     ProcessLocalDisjoint(q, ∞)
11:   FilterPass(Q.root)
12: function ProcessLocalDisjoint(q, d):
13:   while Top(S_local[q]).d.end < d.end:
14:     Close(q)
15: function Open(q, d):
16:   if CheckParentMatch(q, d):
17:     u ← new Intermediate(d)
18:     V ← GetVector(q, d.level)
19:     if ¬IsLeaf(q):
20:       MarkStart(q, u)
21:       Push(S_local[q], (V, V.size))
22:     Append(V, u)
23:     return ¬IsLeaf(q)
24:   else:
25:     return false
26: function Close(q):
27:   if ¬IsLeaf(q):
28:     (V, i) ← Pop(S_local[q])
29:     MarkEnd(q, V[i])
C GetPart Function

Pseudocode for the getPart function is shown in Algorithm 5, where what is conceptually different from the previous getNext function is colored dark blue.

GetPart forwards nodes both to catch up with the parent and child streams, whereas getNext only does the latter. The getNext algorithm is completely stateless, and only inspects stream heads. When a match for a query subtree is found, the stream for the subtree root node is read and forwarded. It is then not possible to know, in the next call on this point in the query, whether the child subtrees were once part of a match or not. In the getPart function we save one extra value per query node, stored in the M array, namely the latest match in the tree postorder which was part of a weak match for the entire query. When considering a query subtree, the currently interesting data nodes are those that are either part of a match using a previous head in the parent stream, or part of a new match using the current head in the parent stream (see Lines 9–13).
Algorithm 5 GetPart
1: function MergedStreamHead():
2:   while true:
3:     (c, d, q) ← GetPart(Q.root)
4:     if c ≠ MisMatch:
5:       if c ≠ Eof:
6:         Fwd(q)
7:       return (c, d, q)
8: function GetPart(q):
9:   if ¬IsRoot(q):
10:     p ← Parent(q)
11:     FwdToDescOf(q, M[p])
12:     if ¬Eof(q) ∧ ¬Eof(p) ∧ M[p].end < H(q).end:
13:       FwdToDescOf(q, H(p))
14:   if IsLeaf(q):
15:     if Eof(q): return (Eof, ⊥, q)
16:     else: return (Match, H(q), q)
17:   (d_min, q_min) ← (∞, ⊥); (d_max, q_max) ← (0, ⊥)
18:   for r ∈ q.children:
19:     (c_r, d_r, q_r) ← GetPart(r)
20:     if c_r ≠ Eof:
21:       if c_r = MisMatch: flag MisMatch
22:       elif q_r ≠ r: return (Match, d_r, q_r)
23:       if d_r.begin < d_min.begin: (d_min, q_min) ← (d_r, q_r)
24:       if d_r.begin > d_max.begin: (d_max, q_max) ← (d_r, q_r)
25:     else:
26:       FwdToEof(q)
27:   if q_min = ⊥: return (Eof, ⊥, q)
28:   FwdToAncOf(q, d_max)
29:   if flagged MisMatch:
30:     return (MisMatch, ⊥, q_r)
31:
32:   if ¬Eof(q) ∧ H(q).begin < d_min.begin:
33:     if IsRoot(q) ∨ H(q).end < M[p].end:
34:       if M[q].end < H(q).end: M[q] ← H(q)
35:       return (Match, H(q), q)
36:     else:
37:       if d_min.begin < M[q].end:
38:         return (Match, d_min, q_min)
39:   else:
40:     if Eof(q): return (Eof, ⊥, q)
41:     else: return (MisMatch, ⊥, q)
42: function FwdToEof(q):
43:   while ¬Eof(q): Fwd(q)
44: function FwdToDescOf(q, d):
45:   while ¬Eof(q) ∧ H(q).begin ≤ d.begin: Fwd(q)
46: function FwdToAncOf(q, d):
47:   while ¬Eof(q) ∧ H(q).end ≤ d.begin: Fwd(q)
The forwarding of streams based on child stream heads is very similar to that in getNext (Lines 17–28). Unless the search is short-circuited (Line 22), the stream is forwarded at least until the head is an ancestor of all the child heads (Line 28). The query node itself is returned if an ancestor of the child heads was found, and unless the previous M value is an ancestor of the current head, it is updated. When a child query node is returned, it is known whether or not it is part of a match.³
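The two interval-based forwarding helpers at the bottom of Algorithm 5 can be sketched in Python over plain (begin, end) pairs. The `Stream` class and its in-memory backing are assumptions of this sketch, not the paper's interface; a real implementation would read from an index and could use any skipping technology, as noted in Section 6:

```python
class Stream:
    """A label-partitioned input stream of (begin, end) intervals in preorder.
    Backed by an in-memory list here purely for illustration."""
    def __init__(self, positions):
        self._pos = sorted(positions)  # preorder = sorted by begin
        self._i = 0

    def eof(self):
        return self._i >= len(self._pos)

    def head(self):
        return self._pos[self._i]

    def fwd(self):
        self._i += 1

def fwd_to_desc_of(stream, d):
    # Skip heads that begin at or before d.begin: in preorder they
    # can never be descendants of d.
    while not stream.eof() and stream.head()[0] <= d[0]:
        stream.fwd()

def fwd_to_anc_of(stream, d):
    # Skip heads whose interval ends before d begins: their subtrees
    # cannot contain d, so they can never be ancestors of d.
    while not stream.eof() and stream.head()[1] <= d[0]:
        stream.fwd()
```

Both helpers only ever move the stream forward, which is what makes the overall merging linear in the input.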
D Extra Filtering Pass

Algorithm 6 gives pseudocode for the extra filtering pass used to obtain strict subtree match filtering when using preorder storage in TJStrictPre.
Algorithm 6 FilterPass
1: function CleanUp(q):
2:   if Axis(q) = "⫽":
3:     CleanUpVector(GetVector(q, ·), C[q])
4:   else:
5:     for h ∈ used levels:
6:       CleanUpVector(GetVector(q, h), C_h[q])
7: function CleanUpVector(V, C):
8:   i ← j ← 0
9:   while i < V.size:
10:     if CheckChildMatch(q, V[i]):
11:       V[j] ← V[i]
12:       Append(C, i)
13:       j ← j + 1
14:     i ← i + 1
15:   Resize(V, j)
16: function FilterPass(q):
17:   if IsLeaf(q): return
18:   for r ∈ q.children:
19:     FilterPass(r)
20:   if NonLeafChildren(q) ≠ ∅:
21:     for u ∈ AllNodes(q):
22:       FilterPassPost(q, u.d)
23:       for r ∈ NonLeafChildren(q):
24:         u.start[r] ← FwdIter(r, u.start[r], u.d)
25:       Push(S_local[q], u)
26:     FilterPassPost(q, ∞)
27:   CleanUp(q)
28: function FilterPassPost(q, d):
29:   while S_local[q] ≠ ∅ ∧ Top(S_local[q]).end < d.end:
30:     u ← Pop(S_local[q])
31:     for r ∈ NonLeafChildren(q):
32:       u.end[r] ← FwdIter(r, u.end[r], u.d)
33: function FwdIter(q, pos, d):
34:   if Axis(q) = "⫽":
35:     while I[q] < C[q].size ∧ C[q][I[q]] < pos:
36:       I[q] ← I[q] + 1
37:     return I[q]
38:   else:
39:     h ← d.level + 1
40:     while I_h[q] < C_h[q].size ∧ C_h[q][I_h[q]] < pos:
41:       I_h[q] ← I_h[q] + 1
42:     return I_h[q]
³ Thanks to Radim Bača for pointing out an error in the original published pseudocode, where Lines 30–31 in Algorithm 5 were different.
During the clean-up (Lines 1–15), nodes failing checks are overwritten, and the C vectors record which values were not dropped. The query nodes are visited bottom-up by the FilterPass function, updating the vectors of one query node at a time, based on the clean-up in non-leaf child query nodes. The FilterPass and FilterPassPost functions go through all data nodes in preorder and postorder respectively, updating interval start and end indexes.

The AllNodes call returns a special iterator over all intermediate data nodes for a query node, sorted in total preorder. For query nodes with an incoming P–C edge and level split vectors, the order in which nodes were inserted on different levels was recorded during construction in TJStrictPre. Details are omitted in Algorithm 4, where an extra statement must be added after Line 22, storing a reference to the used vector.

The FwdIter function contains the logic for updating the start and end indexes. Each query node has an iterator for each vector, which is utilized when traversing the matches for the parent query node. In essence, the segments of child and descendant intervals which contain references to nodes which were not saved during a clean-up pass are discarded.
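The stable in-place compaction performed by CleanUpVector, together with the survivor-index recording in the C vector, can be sketched in Python (names are ours; `keep` stands in for the CheckChildMatch test):

```python
def clean_up_vector(vector, keep):
    """Stable in-place compaction, as in CleanUpVector of Algorithm 6.

    Survivors are shifted left over the dropped entries, and the
    original index of each survivor is recorded (the C vector), so that
    interval endpoints referring to old positions can later be
    translated by FwdIter-style scans."""
    kept_indices = []
    j = 0
    for i, u in enumerate(vector):
        if keep(u):
            vector[j] = u            # overwrite a dropped slot
            kept_indices.append(i)   # old index of survivor number j
            j += 1
    del vector[j:]                   # Resize(V, j)
    return kept_indices
```

Because the compaction is stable, the surviving entries remain sorted, which is what lets the parent's interval indexes be updated with forward-only iterators.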
E GetNext and Postorder

Many algorithms use the getNext function [2] for merging the input streams instead of a heap or linear scan [8, 10], because it cheaply filters away many useless nodes by implementing weak subtree match filtering. In this appendix we show why using getNext with postorder intermediate result construction gives problems regardless of whether local or global stacks are used.

The getNext function does not return the data nodes in strict preorder, as assumed in the correctness proof for the TwigMix algorithm [10], but in local preorder (see Definition 3). As explained in Section 3.1, the top-down algorithms using getNext maintain one stack or equivalent structure for each internal query node. When a new match for a given query node is seen, the typical strategy [2] is popping non-ancestor nodes off the parent query node's stack, and the query node's own stack.
If a global stack is combined with using getNext input, errors may occur, as for the example query and data in Figure 11. With local preorder the ordering between the nodes not related through ancestry is not fixed. Assume that the ordering is

⟨a1↦a1, a1↦a2, b1↦b1, c1↦c1, d1↦d1, c1↦c2, e1↦e1⟩.

Then c1↦c2 will pop a1↦a2 off the stack before e1↦e1 is observed, and e1↦e1 will never be added as a descendant of a1↦a2.
But using local stacks and a bottom-up approach also gives errors, because data nodes are added to the intermediate structures as they are popped off the stack when using postorder storage, which is too late when using getNext and local stacks. If the typical approach is used, c1↦c2 will only pop b1 off the stack of b1 and c1 off the stack of c1. Then d1↦d1 will never be added to the child structures of b1↦b1, because it is popped too late.

It may be possible to modify the local stack approach to work with postorder storage and getNext input, but this would require carefully popping nodes on ancestor and descendant stacks in the right order.
(a) Query. (b) Data. [Tree diagrams not recoverable from the extraction.]
Figure 11: Problematic case with local preorder input and postorder storage.
F Benchmark Data and Queries

Figure 12 gives some details on the benchmark data used in our experiments, and Figure 13 lists the queries we have used.
Name      Size      Nodes       Source
DBLP      676 MB    35 900 666  http://dblp.uni-trier.de/xml
XMark     1 118 MB  32 298 988  http://www.xml-benchmark.org
Treebank  83 MB     3 829 512   http://www.cs.washington.edu/research/xmldatasets

Figure 12: Benchmark datasets used in experiments.
#   Data      Hits     Source  XPath
D1  DBLP      6        [12]    //inproceedings[author/text()="Jim Gray"][year/text()="1990"]/@key
D2  DBLP      21       [12]    //www[editor]/url
D3  DBLP      13       [12]    //book/author[text()="C. J. Date"]
D4  DBLP      2        [12]    //inproceedings[title/text()="Semantic Analysis Patterns."]/author
X1  XMark     40 726   [5]     /site/closed_auctions/closed_auction/annotation/description/text/keyword
X2  XMark     124 843  [5]     //closed_auction//keyword
X3  XMark     124 843  [5]     /site/closed_auctions/closed_auction//keyword
X4  XMark     40 726   [5]     /site/closed_auctions/closed_auction[annotation/description/text/keyword]/date
X5  XMark     124 843  [5]     /site/closed_auctions/closed_auction[.//keyword]/date
X6  XMark     32 242   [5]     /site/people/person[profile/gender][profile/age]/name
T1  Treebank  1 183    [11]    //S/VP//PP[.//NP/VBN]/IN
T2  Treebank  152      [11]    //S/VP/PP[IN]/NP/VBN
T3  Treebank  381      [11]    //S/VP//PP[.//NN][.//NP[.//CD]/VBN]/IN
T4  Treebank  1 185    [11]    //S[.//VP][.//NP]/VP/PP[IN]/NP/VBN
T5  Treebank  94 535   [11]    //EMPTY[.//VP/PP//NNP][.//S[.//PP//JJ]//VBN]//PP/NP//_NONE_

Figure 13: Benchmark queries used in experiments.
G Extended Results

Figure 14 shows the timings that the aggregates in Section 5 are based on.
Method  D1   D2   D3   D4   X1   X2   X3   X4   X5   X6   T1   T2   T3   T4   T5
HO---  2.53 0.43 1.07 1.59 0.57 0.11 0.11 0.75 0.25 0.25 0.31 0.30 0.40 0.47 0.50
HO-W-  1.42 0.25 0.44 1.12 0.43 0.10 0.11 0.60 0.24 0.21 0.18 0.17 0.24 0.32 0.32
HO-S-  1.42 0.26 0.44 1.11 0.43 0.10 0.11 0.61 0.24 0.21 0.17 0.18 0.24 0.32 0.33
HO-SL  1.70 0.28 0.48 1.22 0.46 0.10 0.11 0.65 0.26 0.23 0.18 0.19 0.24 0.34 0.33
HOW--  1.96 0.31 0.32 1.18 0.34 0.08 0.09 0.49 0.19 0.25 0.22 0.22 0.32 0.42 0.42
HOWW-  1.26 0.19 0.33 0.92 0.32 0.08 0.08 0.46 0.19 0.21 0.16 0.17 0.22 0.31 0.30
HOWS-  1.34 0.19 0.32 0.91 0.31 0.08 0.08 0.45 0.19 0.21 0.16 0.16 0.21 0.30 0.32
HOWSL  1.31 0.20 0.31 0.96 0.31 0.09 0.09 0.46 0.20 0.23 0.17 0.17 0.23 0.33 0.32
HOS--  2.03 0.31 0.32 1.17 0.32 0.08 0.08 0.47 0.19 0.25 0.21 0.19 0.27 0.35 0.40
HOSW-  1.27 0.20 0.31 0.91 0.30 0.08 0.08 0.44 0.19 0.21 0.16 0.15 0.21 0.29 0.29
HOSS-  1.27 0.21 0.32 0.92 0.30 0.08 0.08 0.44 0.19 0.21 0.17 0.15 0.21 0.30 0.30
HOSSL  1.32 0.20 0.31 0.97 0.30 0.08 0.08 0.46 0.19 0.23 0.17 0.18 0.23 0.34 0.31
HE---  2.46 0.40 1.07 1.48 0.55 0.09 0.09 0.70 0.20 0.23 0.30 0.29 0.36 0.45 0.48
HE-W-  2.73 0.40 1.25 1.67 0.72 0.09 0.10 0.89 0.21 0.27 0.35 0.34 0.43 0.50 0.54
HE-S-  2.74 0.40 1.25 1.66 0.72 0.09 0.10 0.89 0.21 0.27 0.34 0.34 0.43 0.49 0.54
HE-SL  3.14 0.42 1.47 1.83 0.80 0.09 0.10 0.97 0.23 0.32 0.37 0.40 0.45 0.54 0.58
HEW--  1.97 0.31 0.34 1.13 0.34 0.08 0.08 0.48 0.19 0.24 0.23 0.22 0.28 0.42 0.42
HEWW-  2.13 0.32 0.34 1.24 0.38 0.08 0.09 0.53 0.19 0.27 0.29 0.28 0.35 0.45 0.49
HEWS-  2.13 0.32 0.34 1.24 0.39 0.08 0.09 0.52 0.20 0.28 0.29 0.28 0.35 0.44 0.49
HEWSL  2.33 0.33 0.35 1.31 0.42 0.08 0.09 0.57 0.20 0.32 0.28 0.30 0.37 0.48 0.51
HES--  1.99 0.31 0.34 1.16 0.33 0.08 0.08 0.47 0.19 0.24 0.22 0.19 0.28 0.33 0.40
HESW-  2.15 0.32 0.34 1.26 0.36 0.08 0.09 0.51 0.20 0.28 0.25 0.21 0.31 0.35 0.45
HESS-  2.14 0.33 0.34 1.24 0.37 0.08 0.09 0.51 0.20 0.29 0.24 0.21 0.31 0.35 0.45
HESSL  2.35 0.33 0.36 1.33 0.40 0.08 0.09 0.55 0.21 0.34 0.26 0.24 0.36 0.38 0.48
NOWW-  0.96 0.22 0.09 0.71 0.49 0.11 0.15 0.85 0.34 0.30 0.08 0.08 0.16 0.34 0.22
NE-W-  1.03 0.25 0.08 0.93 0.59 0.12 0.15 0.99 0.40 0.33 0.08 0.09 0.17 0.34 0.27
NE-S-  1.03 0.25 0.08 0.89 0.68 0.12 0.16 1.10 0.39 0.35 0.08 0.09 0.17 0.34 0.29
NE-SL  1.06 0.26 0.08 0.92 0.77 0.12 0.16 1.20 0.42 0.36 0.09 0.09 0.17 0.34 0.29
NEWW-  0.94 0.23 0.08 0.72 0.51 0.11 0.14 0.87 0.35 0.31 0.07 0.07 0.15 0.31 0.23
NEWS-  0.95 0.23 0.08 0.72 0.54 0.11 0.15 0.90 0.36 0.32 0.08 0.08 0.15 0.32 0.24
NEWSL  0.95 0.23 0.08 0.73 0.55 0.11 0.15 0.93 0.37 0.33 0.08 0.08 0.15 0.32 0.24
NESW-  0.95 0.23 0.08 0.74 0.49 0.11 0.14 0.87 0.36 0.31 0.07 0.07 0.15 0.30 0.23
NESS-  0.93 0.23 0.08 0.72 0.51 0.11 0.15 0.89 0.36 0.32 0.08 0.07 0.15 0.31 0.23
NESSL  0.94 0.23 0.08 0.73 0.54 0.11 0.15 0.92 0.37 0.33 0.08 0.08 0.15 0.31 0.24
PEWW-  0.22 0.04 0.08 0.05 0.33 0.06 0.09 0.43 0.16 0.20 0.06 0.06 0.04 0.13 0.10
PEWS-  0.22 0.04 0.08 0.05 0.34 0.06 0.09 0.44 0.16 0.21 0.06 0.06 0.04 0.13 0.10
PEWSL  0.22 0.04 0.08 0.05 0.36 0.06 0.09 0.49 0.18 0.22 0.06 0.06 0.05 0.13 0.10
PESW-  0.22 0.04 0.08 0.05 0.31 0.06 0.09 0.45 0.18 0.20 0.06 0.06 0.04 0.12 0.10
PESS-  0.22 0.04 0.08 0.05 0.33 0.07 0.09 0.43 0.16 0.21 0.06 0.06 0.04 0.13 0.10
PESSL  0.22 0.04 0.08 0.05 0.35 0.06 0.10 0.44 0.17 0.22 0.06 0.06 0.05 0.13 0.10
Figure 14: Time for all tested methods on all benchmark queries. All queries were warmed up by 3 runs and then run 100 times, or until at least 10 seconds had passed, measuring average running-time.
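The measurement protocol above can be sketched as a small Python harness. The caption is ambiguous about how the two bounds interact; this sketch adopts one reading and stops at whichever of the run count and the time budget is reached first:

```python
import time

def benchmark(run_query, warmups=3, runs=100, min_seconds=10.0):
    """Average running time: warm up, then repeat timed runs until
    either the run count or the accumulated time budget is reached."""
    for _ in range(warmups):
        run_query()                  # warm-up runs are not timed
    total, count = 0.0, 0
    while count < runs and total < min_seconds:
        t0 = time.perf_counter()
        run_query()
        total += time.perf_counter() - t0
        count += 1
    return total / count
```

`time.perf_counter` is used because it is monotonic and has the highest available resolution for interval timing.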
H Bad Behavior

In this appendix we list some experiments showing the super-linear behavior of previous twig join algorithms.
Figure 15a shows the exponential behavior of TwigList (HO-W-) and TwigFast (NEWW-) with the data and query from Example 1. The data is /(α1/)^n … (αm/)^n β/γ with m = 10 and n = 100, and the query is ⫽α1⫽ … ⫽αk/γ, with k varying from 1 to 7.
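The pathological input of Example 1 can be generated with a short sketch (labels a1 … am, b, c stand in for α1 … αm, β, γ; the nested-element serialization is our assumption):

```python
def make_path_doc(m, n):
    """Build the single-path document (a1/)^n ... (am/)^n b/c as a
    nested XML-like string: n copies of each of the m labels, then
    the two trailing elements."""
    tags = [f"a{i}" for i in range(1, m + 1) for _ in range(n)] + ["b", "c"]
    opening = "".join(f"<{t}>" for t in tags)
    closing = "".join(f"</{t}>" for t in reversed(tags))
    return opening + closing
```

Each descendant step in the query can match at n different positions along the path, which is the source of the exponential blow-up without strict matching.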
(a) Varying query parameter k, 1–7 (log-scale time; methods HO-W-, NEWW-, PESSL). (b) Varying total nodes, 10 000–40 000 (methods PESS-, PESSL).
Figure 15: (a) Exponential behavior without strict matching. (b) Quadratic behavior without optimal enumeration. Query time in seconds.
Figure 15b shows the results of an experiment based on Example 2 with varying n = 10 000. The simple query ⫽a/b has quadratic cost even when strict prefix path and subtree match filtering is used, if P–C child matches are not directly accessible. Many of the a nodes in the data are nested, and have a small number of b children, but a large number of b descendants. For approaches using simple vectors, overlapping descendant intervals are scanned for direct children, and this results in quadratic running time.
Paper 6

Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland

Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees

Proceedings of the 7th International XML Database Symposium on Database and XML Technologies (XSym 2010).
Abstract The F&B-index is used to speed up pattern matching in tree and graph data, and is based on the maximum F&B-bisimulation, which can be computed in loglinear time for graphs. It has been shown that the maximum F-bisimulation can be computed in linear time for DAGs. We build on this result, and introduce a linear algorithm for computing the maximum F&B-bisimulation for tree data. It first computes the maximum F-bisimulation, and then refines this to a maximal B-bisimulation. We prove that the result equals the maximum F&B-bisimulation.
1 Introduction
Structure indexes are used to reduce the cost of pattern matching in labeled trees and graphs [5, 15, 10], by capturing structural properties of the data in a structure summary, where some or all of the matching can be performed. Efficient construction of such indexes is important for their practical usefulness [15], and in this paper we reduce the construction cost of the F&B-index [10] for tree data.

In a structure index, data nodes are typically partitioned into blocks based on properties of the surrounding nodes. A structure summary typically has one node per block in the partition, and edges between summary nodes where there are edges between data nodes in the related blocks. Matching in structure summaries is usually more efficient than partitioning the data nodes on label and using structural joins to find full query matches [15, 10].

In a path index, data nodes are classified by the labels on the paths by which they are reachable [5, 15]. For tree data this equals partitioning nodes on their label and the block of the parent node. Figures 1c and 1b show path partitioning and the related summary for the example data in Figure 1a. With path indexes, non-branching queries can be evaluated without processing joins [15, 19].
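For tree data, this partitioning rule can be sketched in Python, identifying each block by its root-to-node label path (the representation, a preorder-ordered dict of labels plus a parent map, is an assumption of this sketch):

```python
def path_partition(labels, parent):
    """Partition tree nodes into path-index blocks.

    For trees, a node's block is fully determined by its own label and
    its parent's block, so one top-down pass suffices. Assumes `labels`
    lists nodes in preorder (parents before children)."""
    block = {}           # node -> block id (the label path from the root)
    blocks = {}          # block id -> set of nodes in that block
    for v in labels:
        p = parent.get(v)
        key = (block[p] if p is not None else ()) + (labels[v],)
        block[v] = key
        blocks.setdefault(key, set()).add(v)
    return blocks
```

The path summary then has one node per block, with an edge from each block to the blocks of its members' children.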
(a) Example query and data. Single/double query edges specify parent–child/ancestor–descendant relationships. (b) Path summary. (c) Path partition. (d) F&B partition. [Tree diagrams not recoverable from the extraction.]
Figure 1: Partitioning strategies. Superscripts and subscripts give node identifiers.
A natural extension of a path index is the F&B-index, where nodes are partitioned on their label, the partitions of their parents, and the partitions of their children, as shown in Figure 1d. This gives an index where more of the pattern matching can be performed on the summary, and also branching queries can be evaluated without processing joins [10].
The focus of this paper is efficient computation of the maximum simultaneous forward and backward bisimulation (F&B-bisimulation), which is the underlying concept used to partition nodes in the F&B-index. It can be computed in time loglinear in the number of edges in the graph [10]. A linear construction algorithm for directed acyclic graphs (DAGs) has been presented recently [13], but we show that it is incorrect. On the other hand, there exists a correct algorithm which can compute either the maximum forward bisimulation (F-bisimulation) or the maximum backward bisimulation (B-bisimulation) in linear time for DAGs [3]. We extend this algorithm to compute the maximum F&B-bisimulation in linear time for tree data. This has relevance for applications where the underlying data is known to be tree shaped, such as in many uses of XML [6].
2 Background

In this section we present different types of bisimulations, and show how they can be computed by first partitioning on label, and then stabilizing the graph.
We use the following notation: A directed graph G = ⟨V, E⟩ has node set V and edge set E ⊆ V × V. Let n = |V| and m = |E|. For X ⊆ V, E(X) = {w | ∃v ∈ X : vEw} and E⁻¹(X) = {u | ∃v ∈ X : uEv}. Each node v ∈ V has label L(v). A partition P of V is a set of blocks, such that each node v ∈ V is contained in exactly one block. For an equivalence relation ∼ ⊆ V × V, the equivalence class containing v ∈ V is denoted by [v]_∼. The equivalence relation arising from the partition P is denoted =_P. A relation R₂ is a refinement of R₁ iff R₂ ⊆ R₁. A partition P₂ is a refinement of the coarser P₁ iff =_P₂ ⊆ =_P₁. Let the contraction graph of a partition P be a graph with one node for each equivalence class of =_P, and an edge ⟨[u]_=P, [v]_=P⟩ whenever ⟨u, v⟩ ∈ E.
The structure summary built for a structure index is typically isomorphic with the contraction graph for the data partition. For a partitioning to be useful, it must yield a summary that somehow simulates the data, such that pattern matching in the summary gives the same results as pattern matching in the data, or at least no false negatives. If nodes are partitioned into blocks where nodes in some way simulate each other, then the contraction graph also simulates the data in the same way.
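The contraction graph is straightforward to build from a partition. As an illustration (our own sketch, not from the paper; the node numbering and the `contraction_graph` helper are our invention), using the six-node tree discussed later in Figure 2:

```python
def contraction_graph(partition, edges):
    """Build the contraction graph of a partition: one node per block,
    and an edge <[u], [v]> whenever some edge <u, v> exists in the data."""
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    return {(block_of[u], block_of[v]) for (u, v) in edges}

# The six-node tree of Figure 2, partitioned on B-bisimilarity:
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
partition = [[1], [2, 4], [3, 5], [6]]          # blocks 0..3
print(contraction_graph(partition, edges))       # five data edges contract to three
```

Five data edges collapse to three summary edges, which is exactly the size reduction a structure index exploits.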
2.1 Bisimulation and Bisimilarity

Broadly speaking, bisimulations relate nodes that have the same label and related neighbors. We use the following properties of a binary relation R ⊆ V × V to formally define the different types of bisimulation:
Paper 6: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees
vRv′ ⇒ L(v) = L(v′) (1)

vRv′ ⇒ (uEv ⇒ ∃u′ : u′Ev′ ∧ uRu′) ∧ (u′Ev′ ⇒ ∃u : uEv ∧ uRu′) (2)

vRv′ ⇒ (vEw ⇒ ∃w′ : v′Ew′ ∧ wRw′) ∧ (v′Ew′ ⇒ ∃w : vEw ∧ wRw′) (3)
Definition 1 (Bisimulations [14, 10]). A relation R is a B-bisimulation iff it satisfies (1) and (2) above, an F-bisimulation iff it satisfies (1) and (3), and an F&B-bisimulation iff it satisfies (1), (2) and (3).

For each type, there exists a unique maximum bisimulation, of which all other bisimulations are refinements [14, 10]. We say that two nodes are bisimilar if there exists a bisimulation that relates them, i.e., they are related by the maximum bisimulation. Since bisimilarity is an equivalence relation, it can be used to partition the nodes [14, 10]. When nodes u and v are backward, forward, and forward and backward bisimilar, we write u ∼_B v, u ∼_F v and u ∼_F&B v, respectively. Figure 2 illustrates the different types of bisimilarity.
Figure 2: Partitioning on different types of bisimilarity. Panels: (a) the data tree; (b) ∼_B; (c) ∼_F; (d) ∼_F&B.
Two graphs are said to be bisimilar if there exists a total surjective bisimulation from the nodes in one graph to the nodes in the other. For a given graph, the smallest bisimilar graph is unique, and is exactly the contraction for bisimilarity [3]. The F&B-bisimilarity contraction is the basis for the F&B-index [10].
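The maximum bisimulations can be computed naively by fixpoint refinement: start from the label partition and split blocks until the set of neighbor classes is uniform within every block. A minimal sketch (our own illustration, not one of the paper's algorithms; each iteration costs roughly O(nm), far from the bounds discussed below), run on the six-node tree of Figure 2 as reconstructed from the text's discussion:

```python
from collections import defaultdict

def bisimilarity_partition(nodes, labels, edges, forward=True, backward=True):
    """Fixpoint computation of the maximum F-, B- or F&B-bisimulation:
    keep splitting blocks until the set of neighbour classes is uniform
    within every block (conditions (2) and (3) of Definition 1)."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    block = {v: labels[v] for v in nodes}   # condition (1): start from labels
    while True:
        def sig(v):
            parts = [block[v]]
            if forward:    # children's classes must match, condition (3)
                parts.append(frozenset(block[w] for w in succ[v]))
            if backward:   # parents' classes must match, condition (2)
                parts.append(frozenset(block[u] for u in pred[v]))
            return tuple(parts)
        new = {v: sig(v) for v in nodes}
        if len(set(new.values())) == len(set(block.values())):
            break          # no block was split: the partition is stable
        block = new
    groups = defaultdict(list)
    for v in sorted(nodes):
        groups[block[v]].append(v)
    return sorted(groups.values())

# The tree of Figure 2: a1 with children b2 and b4; b2 -> c3; b4 -> c5, d6.
nodes = [1, 2, 3, 4, 5, 6]
labels = {1: "a", 2: "b", 3: "c", 4: "b", 5: "c", 6: "d"}
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
print(bisimilarity_partition(nodes, labels, edges, forward=False))   # ~B
print(bisimilarity_partition(nodes, labels, edges, backward=False))  # ~F
print(bisimilarity_partition(nodes, labels, edges))                  # ~F&B
```

The three printed partitions match Figure 2(b)–(d): ∼_B groups {b2, b4} and {c3, c5}, ∼_F keeps only {c3, c5} together, and ∼_F&B separates every node.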
2.2 Stability

The different types of bisimilarity listed in the previous section can be computed by first partitioning the data nodes on label, and then finding the coarsest stable refinements of the initial partition [16, 4, 10]. A partition is successor stable iff all nodes in a block have incoming edges from nodes in the same set of blocks, and is predecessor stable iff all nodes in a block have outgoing edges to the same set of blocks [10]. The coarsest successor, predecessor, and successor-and-predecessor stable refinements of a label partition equal the partitions on B-bisimilarity, F-bisimilarity and F&B-bisimilarity, respectively [4, 10].
CHAPTER 4. INCLUDED PAPERS
Definition 2 (Partition stability [16, 10]). Given a directed graph G = 〈V, E〉, then D ⊆ V is successor stable with respect to B ⊆ V if either all or none of the nodes in D are pointed to from nodes in B (meaning D ⊆ E(B) or D ∩ E(B) = ∅), and D is predecessor stable with respect to B if either none or all of the nodes in D point to nodes in B (meaning D ⊆ E⁻¹(B) or D ∩ E⁻¹(B) = ∅).

For any combination of successor and predecessor stability, a partition P of V is said to be stable with respect to a block B if all blocks in P are stable with respect to B. A partition P is stable with respect to another partition Q if it is stable with respect to all blocks in Q. P is said to be stable if it is stable with respect to itself.
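Definition 2 translates directly into a brute-force checker (our own sketch for illustration; the set-containment tests make it quadratic, nowhere near the bounds pursued in this paper):

```python
def is_stable(partition, edges, direction="succ"):
    """Check Definition 2: every block D must contain either all or none of
    the nodes in E(B) (succ) or E^-1(B) (pred), for every block B."""
    def image(block):
        if direction == "succ":
            return {w for (v, w) in edges if v in block}   # E(B)
        return {u for (u, v) in edges if v in block}       # E^-1(B)
    for B in partition:
        img = image(B)
        for D in partition:
            hit = D & img
            if hit and hit != D:   # some but not all nodes of D: unstable
                return False
    return True

# Label partition of the Figure 2 tree (a1 -> b2, b4; b2 -> c3; b4 -> c5, d6):
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
label_partition = [{1}, {2, 4}, {3, 5}, {6}]
print(is_stable(label_partition, edges, "succ"))  # True: already succ-stable
print(is_stable(label_partition, edges, "pred"))  # False: only b4 points into {d6}
```

The label partition is already successor stable (matching ∼_B in Figure 2b) but not predecessor stable, since b2 and b4 disagree on the block {d6}.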
Figure 3 shows cases where a block can be split to achieve different types of stability. The block D is not stable with respect to B, but we can split it into blocks that are: Assume that D is stable with respect to a union of blocks S such that B ⊂ S. We can split D into blocks that are stable with respect to both B and S \ B, shown as D_B, D_BS and D_S in the figure. Stabilizing also with respect to S \ B is crucial for obtaining an O(m log n) running time in the partition stabilization algorithm explained in the next section.
Figure 3: Refining D into blocks D_B, D_BS and D_S that are stable with respect to B and S \ B. Panels: (a) succ-stability; (b) pred-stability.
2.3 Stabilizing graph partitions

We now go through Paige and Tarjan's algorithm for refinement to the coarsest predecessor stable partition [16], extended to simultaneous successor and predecessor stability by Kaushik et al. [10], as shown in Algorithm 1. The input to the algorithm is a partition P and the set of flags Directions ⊆ {Succ, Pred}, which specifies whether P is to be successor and/or predecessor stabilized.

Figure 4 illustrates an example run of the algorithm with Directions = {Succ, Pred}. In addition to the current partition P, the algorithm maintains a partition X, where the blocks are unions of blocks in P. Initially X contains a single block that is the union of all blocks in P, and the algorithm maintains the loop invariant that P is stable with respect to X by Definition 2. The algorithm terminates when the partitions P and X
Algorithm 1 Graph partition stabilization.
1: ⊲ P is the initial partition.
2: function StabilizeGraph(P, Directions):
3:   for dir ∈ Directions:
4:     InitialRefinement(P, dir)
5:   X ← {⋃_{B∈P} B}
6:   while P ≠ X:
7:     Extract B ∈ P such that B ⊂ S ∈ X and |B| ≤ |S|/2.
8:     Replace S by B and S \ B in X.
9:     for dir ∈ Directions:
10:      StabilizeWRT(copy of B, P, dir)
are equal, which means P must be stable with respect to itself. But the loop invariant may not be true for the given input partition initially: Blocks containing both roots and non-roots are not successor stable with respect to X, because non-roots have incoming edges from the single block S ∈ X, while roots do not. Similarly, blocks containing both leaves and non-leaves are not predecessor stable with respect to X. In Algorithm 1 initial stability is achieved by calls to InitialRefinement(), which splits blocks in a simple linear pass. Initial splitting is illustrated by the step from line (a) to line (b) in Figure 4.
The algorithm repeatedly selects a block S ∈ X that is a compound union of blocks from P, and selects a splitter block B ⊂ S with size at most half of S. Then S is replaced by B and S \ B in X, as shown when extracting B = {a₂, a₉, a₁₅} between lines (b) and (c) in Figure 4. The call to StabilizeWRT() uses the strategies depicted in Figure 3 to stabilize P with respect to both B and S \ B, to make sure P is stable with respect to the new X. It is important to use a copy of B as splitter, as the stabilization may cause B itself to be split. The step from line (b) to (c) in the figure shows that a block of nodes labeled a is split when successor stabilizing with respect to B = {a₂, a₉, a₁₅}.
Efficient implementation of the above requires some attention to detail [16]: The partition X can be realized through a set 𝒳 containing sets of pointers to blocks in P, such that for each 𝒮 ∈ 𝒳 we have S = ⋃_{B∈𝒮} B for the related S ∈ X. To extract a B ⊂ S ∈ X in constant time, we also maintain the set of compound unions 𝒞 = {𝒮 ∈ 𝒳 | 1 < |𝒮|}. A block B ⊂ S such that |B| ≤ |S|/2 can be found by choosing the smallest of the two first blocks in any 𝒮 ∈ 𝒞. Note that we can check if P = X by checking whether 𝒞 is empty, and neither P, X nor 𝒳 need to be materialized, as they are never used directly. You only need to maintain 𝒞, and a 𝒫 ⊆ P containing the blocks in P not in some 𝒮 ∈ 𝒞. For inserting and removing elements in constant time, the sets are implemented as doubly linked lists. In addition, each v ∈ B has a pointer to B, and each B ∈ 𝒮 has a pointer to 𝒮. This allows keeping the data structures updated throughout the evaluation.
As all operations in the while-loop excluding the calls to StabilizeWRT() are constant time operations on linked lists, the complexity of the loop is bounded by the number of splitter blocks selected, which again is bounded by the number of times a node may be part of such a splitter. Splitter blocks at most half the size of a compound union are
Figure 4: Algorithm 1 doing successor and predecessor stabilization on a label partition of the data from Figure 1a. The white boxes are the current blocks in P, while the gray boxes are the current blocks in X. Line (a) shows the initial label partition. Step (a)–(b) shows refinement separating roots from non-roots and leaves from non-leaves, and steps (b)–(l) show simultaneous predecessor and successor stabilization.
always selected, and no node in this block is part of a splitter again before the block itself has become a compound union. This means that the number of times a given node is part of a splitter block is O(log n), and that the total number of splitter blocks used is O(n log n) [16].
Algorithm 2 shows how all blocks in a partition P can be successor (or predecessor) stabilized with respect to a block B ∈ P and S \ B, where B ⊂ S ∈ X, in time linear in the number of edges going out from (or into) B [16]. For successor stability, only blocks D pointed to from B are affected, and they are stabilized with respect to both B and S \ B without involving S \ B directly. This is done by maintaining for each node v ∈ V the number of times it is pointed to from each set S ∈ X, and storing references to these records from the related edges. We can then differentiate between nodes pointed to from B, pointed to from both B and S \ B, and pointed to only from S \ B. Nodes from the first two categories are moved into new blocks, while the rest are untouched.

As the cost of a single call to StabilizeWRT() is bounded by the number of nodes in the splitter block and the number of outgoing (or incoming) edges, the total cost for the calls is O((m + n) log n), as a given node or edge is used for splitting O(log n) times. Assuming n ∈ O(m), the cost of Algorithm 1 is O(n) for the initial refinement, O(n log n) for the
Algorithm 2 Stabilizing with respect to a block.
1: function StabilizeWRT(B, P, dir):
2:   Assume dir = Succ. (or Pred)
3:   for D ∈ P pointed to from B: (or pointing into B)
4:     Initialize D_B and D_BS and associate with D.
5:   for v ∈ D ∈ P pointed to from B: (or pointing into B)
6:     if v is pointed to only from B: (or pointing only into B)
7:       D′ ← D_B
8:     else:
9:       D′ ← D_BS
10:    if D′ ∉ P: Insert D′ after D in P.
11:    Move v from D to D′.
12:    if D = ∅: Remove D from P.
while-loop excluding StabilizeWRT(), and O(m log n) for the StabilizeWRT() calls, giving a total of O(m log n) [16, 10].
3 Linear Time Stabilization

Linear time computation of F&B-bisimilarity for DAG data has been attempted earlier. The SAM algorithm [13] partitions the data separately on B-bisimilarity and F-bisimilarity, and then combines the partitions by putting two nodes in the same final block iff they are in the same blocks in both partitions. It builds on the following theorem, which is stated without proof or reference: "Node n₁ and node n₂ satisfy F&B-bisimulation if and only if they satisfy F-bisimulation and B-bisimulation." The only if part is of course true, but the if part is not, as can be seen from the partitioning of a tree with six nodes in Figure 2. Here c₃ ∼_B c₅ and c₃ ∼_F c₅, but c₃ ≁_F&B c₅, because for the parent nodes b₂ ≁_F&B b₄. Also note that the running time analysis of the SAM algorithm assumes that the number of edges to and from each node can be viewed as a constant.
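The counterexample can be checked mechanically. The sketch below (our own illustration; `refine` is a naive quadratic fixpoint, not the SAM algorithm itself) computes the three bisimilarity partitions of the Figure 2 tree and combines the B- and F-partitions the way SAM would:

```python
from collections import defaultdict

def refine(nodes, labels, edges, forward, backward):
    """Naive maximum-bisimulation block assignment via fixpoint splitting."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    block = {v: labels[v] for v in nodes}
    while True:
        def sig(v):
            parts = [block[v]]
            if forward:
                parts.append(frozenset(block[w] for w in succ[v]))
            if backward:
                parts.append(frozenset(block[u] for u in pred[v]))
            return tuple(parts)
        new = {v: sig(v) for v in nodes}
        if len(set(new.values())) == len(set(block.values())):
            return new
        block = new

nodes = [1, 2, 3, 4, 5, 6]
labels = {1: "a", 2: "b", 3: "c", 4: "b", 5: "c", 6: "d"}
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]

b = refine(nodes, labels, edges, forward=False, backward=True)
f = refine(nodes, labels, edges, forward=True, backward=False)
fb = refine(nodes, labels, edges, forward=True, backward=True)

# SAM's combination: same final block iff same block in BOTH partitions.
sam = {v: (b[v], f[v]) for v in nodes}
print(sam[3] == sam[5])  # True: the combination keeps c3 and c5 together...
print(fb[3] == fb[5])    # False: ...but c3 and c5 are not F&B-bisimilar.
```

The combined partition groups c₃ and c₅, while the true F&B partition separates them, confirming that the "if" direction of the quoted theorem fails.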
3.1 Stabilizing DAG partitions

We now present an algorithm for refining a partition of the nodes in a DAG either to successor stability or to predecessor stability. It is based on two different previous algorithms: Paige and Tarjan's loglinear algorithm for stabilizing general graphs [16], and Dovier, Piazza and Policriti's algorithm for computing F-bisimilarity on unlabeled graphs [3], which has linear complexity for DAG data. A difference between these two algorithms is that the former is given an initial partition as input, which is then refined, while the latter starts with the set of singleton blocks, from which the final partition is constructed. These are called negative and positive strategies, respectively [3]. Dovier et al. describe how their algorithm can be extended to compute F-bisimilarity for labeled data, but when developing an algorithm for refining to simultaneous successor and predecessor stability
in the next section, we use the result of a predecessor stabilization as input to a successor stabilization, and hence cannot use a positive strategy.

Dovier et al.'s algorithm initially computes the rank of each node in the DAG, which is the length of the longest path from the node to a leaf. We extend the notion of rank to both directions in the DAG:
Definition 3 (Rank). In a DAG G, the successor and predecessor rank of v ∈ V(G) is defined as:

rank_Succ(v) = 0 if v is a root in G, and rank_Succ(v) = 1 + max_{〈u,v〉∈E(G)} rank_Succ(u) otherwise;

rank_Pred(v) = 0 if v is a leaf in G, and rank_Pred(v) = 1 + max_{〈v,w〉∈E(G)} rank_Pred(w) otherwise.
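Both ranks can be computed with one Kahn-style topological pass each. A sketch (our own, using a precomputed adjacency structure rather than the in-place edge counters of Algorithm 4 below):

```python
from collections import defaultdict, deque

def ranks(nodes, edges, direction="pred"):
    """Longest-path ranks of Definition 3: rank_Pred is the longest path down
    to a leaf; rank_Succ is the same computation on the reversed DAG."""
    if direction == "succ":
        edges = [(w, v) for (v, w) in edges]  # reversing swaps the two notions
    children, parents = defaultdict(list), defaultdict(list)
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)
    rank = {}
    remaining = {v: len(children[v]) for v in nodes}
    queue = deque(v for v in nodes if remaining[v] == 0)  # leaves: rank 0
    while queue:
        v = queue.popleft()
        rank[v] = 1 + max((rank[w] for w in children[v]), default=-1)
        for u in parents[v]:
            remaining[u] -= 1
            if remaining[u] == 0:            # all of u's children are ranked
                queue.append(u)
    return rank

# The Figure 2 tree: leaves get pred-rank 0, the root gets succ-rank 0.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
print(ranks(nodes, edges, "pred"))  # {3: 0, 5: 0, 6: 0, 2: 1, 4: 1, 1: 2}
print(ranks(nodes, edges, "succ"))  # {1: 0, 2: 1, 4: 1, 3: 2, 5: 2, 6: 2}
```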
Algorithm 3 shows our modification of Paige and Tarjan's algorithm [16] based on Dovier et al.'s principles [3]. It refines a partition of a DAG either to predecessor or to successor stability, and runs in linear time, due to the order in which splitter blocks are chosen.
Algorithm 3 DAG partition stabilization.
1: ⊲ Assume sets are ordered.
2: function StabilizeDAG(P, dir):
3:   RefineAndSortOnRank(P, dir)
4:   X ← {⋃_{B∈P} B}
5:   while P ≠ X:
6:     Extract first B ∈ P such that B ⊂ S ∈ X.
7:     Replace S by B and S \ B in X.
8:     StabilizeWRT(copy of B, P, dir)
A run of the algorithm with dir = Pred is illustrated in Figure 5. Instead of only separating between leaves and non-leaves (or roots and non-roots) as in Algorithm 1, blocks are initially split such that predecessor (or successor) rank is uniform within each block, and sorted, such that the rank is monotonically increasing in the partition. This is done in the function RefineAndSortOnRank(), which is described later in this section. An initial refinement and sorting on predecessor rank is shown when going from line (a) to line (b) in Figure 5.

In a partition that respects a given type of rank, let the rank of a block be equal to the rank of the contained nodes. The following lemma implies that the initial refinement on rank does not split blocks unnecessarily:

Lemma 1. Given nodes u, v ∈ V(G) and a partition P of G, if P is successor stable then [u]_=P = [v]_=P ⇒ rank_Succ(u) = rank_Succ(v), and if P is predecessor stable then [u]_=P = [v]_=P ⇒ rank_Pred(u) = rank_Pred(v).
Figure 5: Using Algorithm 3 to predecessor stabilize a label partition of the data from Figure 1a. The first step shows refinement on predecessor rank.
Proof. For F-bisimilarity, [u]_∼F = [v]_∼F ⇒ rank_Pred(u) = rank_Pred(v) [3]. If P is predecessor stable, then =_P is an F-bisimulation [4], and therefore a refinement of the partitioning on ∼_F, such that [u]_=P = [v]_=P ⇒ [u]_∼F = [v]_∼F. The case for successor stability is symmetric.
The next lemma implies that if blocks are chosen as splitters in order of their rank, a node will be part of a block that is used as a splitter at most once. This property is used to achieve linear complexity in our algorithm.

Lemma 2 ([3]). Given a DAG G, a partition P of G that respects predecessor rank and a block B ∈ P, predecessor stabilization of P with respect to B only splits blocks D where rank_Pred(D) > rank_Pred(B).

This is symmetric for successor stabilization and successor rank, as a reversed DAG is also a DAG.
In Algorithm 3 the blocks in the current partition P are maintained ordered on rank. This is implemented through a detail in our method for stabilization with respect to a block in Algorithm 2, which differs from the original description [16]: The new blocks D_B and D_BS are inserted into P at the position after the old block D, and not at the end of P. The sets of blocks which make up the unions in X are also ordered such that their concatenation yields an ordered list of blocks. This is maintained by inserting B followed by S \ B at the original position of S in X. Notice how blocks never change positions during the stabilization shown in lines (b)–(i) in Figure 5.
In Dovier et al.'s algorithm, rank is computed by performing a depth-first topological traversal of the DAG [3]. Because we need to refine a given partition on rank, as opposed to constructing a partition on rank from scratch, the problem is slightly more involved. Algorithm 4 shows how a partitioning can be refined and sorted on successor or predecessor rank in a single pass. The algorithm traverses the DAG with a hybrid between a topological sort and a breadth-first search, implemented using edge counters and a queue. Blocks are refined and sorted on the fly.
Algorithm 4 Refining and sorting on rank.
1: function RefineAndSortOnRank(P, dir):
2:   Assume dir = Succ (or dir = Pred)
3:   Q is a queue.
4:   for v ∈ V:
5:     v.count ← |{x | 〈x, v〉 ∈ E}| (or |{x | 〈v, x〉 ∈ E}|)
6:     if v.count = 0:
7:       v.rank_dir ← 0
8:       PushBack(Q, v)
9:   for B ∈ P:
10:    B.currRank ← −1
11:  while Q:
12:    v ← PopFront(Q)
13:    Let B ∋ v.
14:    if v.rank_dir ≠ B.currRank:
15:      B.rankedB ← {}
16:      Append B.rankedB at the end of P.
17:      B.currRank ← v.rank_dir
18:    Move v from B to B.rankedB
19:    Remove B from P if empty.
20:    for x where 〈v, x〉 ∈ E: (or 〈x, v〉 ∈ E)
21:      x.count ← x.count − 1
22:      if x.count = 0:
23:        x.rank_dir ← v.rank_dir + 1
24:        PushBack(Q, x)
Lemma 3. Algorithm 4 refines and orders P on successor (or predecessor) rank in O(m + n) time.

Proof (sketch). Because the queue is initialized with the roots, and a node is added to the queue when the last parent is popped from the queue, the nodes are queued and popped in order of successor rank, and this distance is calculated from the parent node with the greatest successor rank. As the successor rank of the nodes that are moved to a new block grows monotonically, only one associated block B.rankedB is created per successor rank found in block B, and all such blocks are appended to P in sorted order. As a node is only queued once, and the cost of processing a node is proportional to the number of outgoing edges, the total running time of the algorithm is O(m + n). The case for predecessor rank is symmetric.
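The invariant that Algorithm 4 establishes can be sketched compactly (our own illustration; for brevity it uses a precomputed rank map and a stable sort instead of the single queue-driven pass, so it is O(n log n) rather than linear, but the resulting partition is the same):

```python
from collections import defaultdict

def refine_and_sort_on_rank(partition, rank):
    """Split every block so rank is uniform within it, then order the blocks
    by increasing rank -- the invariant RefineAndSortOnRank() establishes."""
    refined = []
    for block in partition:
        by_rank = defaultdict(list)
        for v in sorted(block):
            by_rank[rank[v]].append(v)     # one sub-block per rank in the block
        refined.extend(by_rank.values())
    refined.sort(key=lambda b: rank[b[0]])  # stable: equal-rank order is kept
    return refined

# Label partition of the Figure 2 tree with its predecessor ranks.
pred_rank = {3: 0, 5: 0, 6: 0, 2: 1, 4: 1, 1: 2}
print(refine_and_sort_on_rank([[1], [2, 4], [3, 5], [6]], pred_rank))
# -> [[3, 5], [6], [2, 4], [1]]: rank is monotonically increasing
```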
Theorem 1. Algorithm 3 yields the coarsest refinement of a partition of a DAG that is successor (or predecessor) stable.

Proof (sketch). For predecessor stability, the only differences from Paige and Tarjan's algorithm are the initial refinement and the order in which splitter blocks are chosen.
By Lemma 1, blocks are not refined unnecessarily when refining on rank, and the order of split operations is not used in the correctness proof for the original algorithm [16]. Successor stability is symmetric.
To implement Algorithm 3 we use the same underlying data structures as for Algorithm 1: doubly linked lists 𝒞 and 𝒫 realizing X and P. In Algorithm 3, the extract operation is implemented by removing the first B ∈ 𝒮 from the first 𝒮 ∈ 𝒞. The replace operation is implemented by moving B from 𝒮 to the end of 𝒫, and if only one block B′ is left in 𝒮, this B′ is also moved from 𝒮 to 𝒫, and 𝒮 is removed from 𝒞.
Theorem 2. The running time of Algorithm 3 is O(m + n).

Proof (sketch). We analyze the cost of StabilizeWRT() separately. Outside the while-loop, the call to RefineAndSortOnRank() has cost O(m + n) by Lemma 3, and the construction of 𝒞 and 𝒫 has cost O(|P|) ⊆ O(n). As splitters are chosen in order of their rank, by Lemma 2 a splitter block is not later split itself. This means that each node is only part of a splitter once, and that the while-loop is run O(n) times. The loop condition is implemented by checking if 𝒞 ≠ ∅. As all the operations on linked lists inside the loop have complexity O(1), the total cost of the while-loop is O(n). The StabilizeWRT() function is called O(n) times, and the cost of one call is linear in the number of nodes and edges used [16]. As nodes are only part of a splitter block once, edges are also only used for splitting once, and the total cost is O(m + n).
3.2 Stabilizing Trees

We now present an algorithm for finding the coarsest successor and predecessor stable refinement of a partition of the nodes in a tree. It uses the solution for DAGs from the previous section to refine a partition first to the coarsest predecessor stable refinement, and then to the coarsest successor stable refinement, as shown in Algorithm 5. For trees this yields a partition that is still predecessor stable, as we prove in the following.
Algorithm 5 Stabilization for trees.
1: function StabilizeTree(P, Directions):
2:   if Pred ∈ Directions:
3:     StabilizeDAG(P, Pred)
4:   if Succ ∈ Directions:
5:     StabilizeDAG(P, Succ)
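Algorithm 5's order of operations can be mimicked with a naive stabilizer (our own sketch; `coarsest_stable_refinement` is a quadratic fixpoint stand-in for StabilizeDAG(), not the linear-time algorithm). On the Figure 2 tree, predecessor stabilization followed by successor stabilization yields the all-singleton F&B partition:

```python
from collections import defaultdict

def coarsest_stable_refinement(partition, edges, direction):
    """Quadratic fixpoint stand-in for StabilizeDAG(): succ-stability splits a
    block on the set of blocks its nodes' parents lie in, pred-stability on
    the set of blocks their children lie in."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    nodes = [v for block in partition for v in block]
    block = {v: i for i, blk in enumerate(partition) for v in blk}
    nbrs = pred if direction == "succ" else succ
    while True:
        new = {v: (block[v], frozenset(block[u] for u in nbrs[v]))
               for v in nodes}
        if len(set(new.values())) == len(set(block.values())):
            break                    # no block was split: stable
        block = new
    groups = defaultdict(list)
    for v in sorted(nodes):
        groups[block[v]].append(v)
    return sorted(groups.values())

def stabilize_tree(label_partition, edges):
    """Algorithm 5 with Directions = {Succ, Pred}: predecessor first."""
    p = coarsest_stable_refinement(label_partition, edges, "pred")
    return coarsest_stable_refinement(p, edges, "succ")

edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
print(stabilize_tree([[1], [2, 4], [3, 5], [6]], edges))
# -> [[1], [2], [3], [4], [5], [6]], the F&B partition of Figure 2d
```

The intermediate predecessor-stable partition equals the ∼_F partition of Figure 2c, and the second pass refines it without breaking predecessor stability, as Theorem 3 states.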
Figure 6 shows with a continuation of Figure 5 how Algorithm 5 is used to find a successor and predecessor stable refinement. The starting point in the figure is a predecessor stable refinement found after calling StabilizeDAG(P, Pred). This partition is then successor stabilized by calling StabilizeDAG(P, Succ), which first refines and sorts on successor rank, shown between lines (i) and (j) in the figure, and then uses the blocks in the current P as splitters in order, shown in lines (j)–(t). Compare this partition with the F&B-bisimilarity partition in Figure 1d.
Figure 6: Continuing Figure 5: Line (i) shows the predecessor stable partition, step (i)–(j) shows successor rank refinement, and steps (j)–(t) show successor stabilization.
Theorem 3. If a predecessor stable partition of the nodes in a tree is refined to successor stability, the resulting partition is still predecessor stable.

Proof (sketch). Blocks are split in three ways: to refine on rank, with respect to a block B ∈ P, or as a side effect with respect to S \ B in Algorithm 2. By Lemma 1, the first type of split does not cause any split that would not eventually be caused by an algorithm that iteratively refines P with respect to a random block B ∈ P, and from the correctness proof of Paige and Tarjan's algorithm [16], neither does splitting with respect to S \ B. We now use induction on the refinement steps, and show that the partition P remains predecessor stable. It is true initially by assumption. The induction step is to split a block D ∈ P on successor stability with respect to a block B ∈ P.
B will split D into two parts D_B and D_S, containing the nodes pointed to and not pointed to from B, respectively. The splitting of D may only affect the predecessor stability of the new blocks D_B and D_S with respect to their descendants, and of the set of blocks ℬ pointing into D with respect to D_B and D_S. After the split, B ⊆ E⁻¹(D_B) and B ∩ E⁻¹(D_S) = ∅, and for all other B′ ∈ ℬ we have that B′ ∩ E⁻¹(D_B) = ∅ and B′ ⊆ E⁻¹(D_S), because these B′ by assumption have pointers into D, but by the fact that the data is a tree, do not have pointers into D_B.

For any block G pointed to from some node in D, we have D ⊆ E⁻¹(G) by the initial assumption of predecessor stability, meaning all nodes in D point into G. This means that all nodes in D_B and D_S also point into G, and thus D_B and D_S are predecessor stable with respect to G.
Figure 7a illustrates this theorem. By contrast, assuming the data was not a tree, the<br />
blocks B ′ ∈ B would not necessarily be predecessor stable after splitting D, as shown in<br />
Figure 7b.<br />
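The argument above can be checked mechanically on a toy instance. The following Python sketch reconstructs the Figure 7a scenario (all names such as image, preimage and is_p_stable are ours, not from the paper): splitting D on successor stability with respect to B leaves both B and B′ predecessor stable with respect to the two new blocks.

```python
def image(E, B):        # E(B): nodes pointed to from B
    return {v for (u, v) in E if u in B}

def preimage(E, B):     # E^{-1}(B): nodes pointing into B
    return {u for (u, v) in E if v in B}

def is_p_stable(D, B, E):
    # Predecessor stability: D inside E^{-1}(B) or disjoint from it.
    pre = preimage(E, B)
    return D <= pre or not (D & pre)

# A small tree: b1, b2 in B point into D, and b3 in B' points into D.
E = {("b1", "d1"), ("b2", "d2"), ("b3", "d3"),
     ("d1", "g1"), ("d2", "g2"), ("d3", "g3")}
B, Bp, D = {"b1", "b2"}, {"b3"}, {"d1", "d2", "d3"}

# Split D on successor stability with respect to B.
D_B = D & image(E, B)   # nodes in D pointed to from B
D_S = D - image(E, B)   # the rest of D
assert D_B == {"d1", "d2"} and D_S == {"d3"}

# Both B and B' remain predecessor stable w.r.t. the new blocks,
# because each node in D has a unique parent in a tree.
for block in (B, Bp):
    assert is_p_stable(block, D_B, E) and is_p_stable(block, D_S, E)
```

In a DAG (Figure 7b), a node in D_B could additionally have a parent in B′, which is exactly what the final loop would then detect as broken predecessor stability.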
Theorem 4. Algorithm 5 finds the coarsest refinement of a partition of the nodes in a tree that is successor and predecessor stable in O(n) time.
Paper 6: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees
Figure 7: Splitting D for successor stability w.r.t. B in a partition P, (a) for a tree and (b) for a DAG. This does not impact predecessor stability for any block given tree data, but may break predecessor stability in a DAG.
Proof. From Theorems 1, 2 and 3, and the fact that m ∈ O(n) for tree data.

Corollary 1. The F&B-index can be built in O(n) time for tree data using Algorithm 5.

Proof (sketch). The coarsest successor and predecessor stable refinement of a partitioning on label gives the maximum F&B-bisimulation [10], and the summary structure can be constructed from the contraction graph [10].
4 Related Work

There are many variations of structure summaries for graph data, such as partitionings on similarity [8, 15], the F+B-index [10], the A(k)-index [12], and the D(k)-index [2]. Note that an implication of Theorem 3 is that the F+B-index and the F&B-index are identical for tree data. The cost of matching in the summary can be reduced by label-partitioning it and using specialized joins [1], or using multi-level structure indexing [18]. For queries using ancestor–descendant edges on graph data, different graph encodings offer trade-offs between space usage and query time [21]. For a general overview of indexing and search in XML, see the survey by Gou and Chirkova [6]. There is previous research on updates of bisimilarity partitions [11, 20, 17]. Some of these methods trade update time for coarseness, as refinements of bisimilarity may be cheaper to compute after a data update. Single-directional bisimulations are used in many fields, such as modal logic, concurrency theory, set theory, formal verification, etc. [3], but to our knowledge, F&B-bisimulation is not frequently used outside XML search.
5 Conclusions and Future Work

In this paper we have improved the running time for refining a partition to the coarsest simultaneous successor and predecessor stability for tree data from O(n log n) to O(n),
CHAPTER 4. INCLUDED PAPERS
and with that the computation of F&B-bisimilarity, and construction of the F&B-index.¹ An incorrect linear algorithm for DAGs has been presented recently [13], and it would be interesting to know whether the problem is actually solvable in linear time for DAGs.

A natural extension of our work would be to reduce the cost of updates in F&B-bisimilarity partitions for trees. A particularly interesting direction would be to improve indexing performance for typical XML document collections, where there is a large number of small independent documents. It may be possible to iteratively add documents to the index with (expected) cost dependent only on the size of the documents.
Acknowledgments

This material is based upon work supported by the iAd Project funded by the Research Council of Norway and the Norwegian University of Science and Technology. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of the funding agencies.
References

[1] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[2] Qun Chen, Andrew Lim, and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. In Proc. SIGMOD, 2003.

[3] Agostino Dovier, Carla Piazza, and Alberto Policriti. A fast bisimulation algorithm. In Proc. CAV, 2001.

[4] Jean-Claude Fernandez. An implementation of an efficient algorithm for bisimulation equivalence. Sci. Comput. Program., 13(2–3), 1990.

[5] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. VLDB, 1997.

[6] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. Knowl. and Data Eng., 2007.

[7] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees (extended version). Technical Report IDI-TR-2010-10, NTNU, Trondheim, Norway, 2010.

[8] Monika R. Henzinger, Thomas A. Henzinger, and Peter W. Kopke. Computing simulations on finite and infinite graphs. In Proc. FOCS, 1995.

[9] Paris C. Kanellakis and Scott A. Smolka. CCS expressions, finite state processes, and three problems of equivalence. In Proc. PODC, 1983.
¹ See the extended version of this paper for some performance experiments [7].
[10] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[11] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Pradeep Shenoy. Updates for structure indexes. In Proc. VLDB, 2002.

[12] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. ICDE, 2002.

[13] Xianmin Liu, Jianzhong Li, and Hongzhi Wang. SAM: An efficient algorithm for F&B-index construction. In Proc. APWeb/WAIM, 2007.

[14] Robin Milner. Communication and Concurrency. Prentice-Hall, Inc., 1989.

[15] Tova Milo and Dan Suciu. Index structures for path expressions. In Proc. ICDT, 1999.

[16] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 1987.

[17] Diptikalyan Saha. An incremental bisimulation algorithm. In Proc. FSTTCS, 2007.

[18] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data & Knowl. Eng., 2008.

[19] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[20] Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. Incremental maintenance of XML structural indexes. In Proc. SIGMOD, 2004.

[21] Jeffrey Xu Yu and Jiefeng Cheng. Graph reachability queries: A survey. In Managing and Mining Graph Data, Advances in Database Systems. Springer US, 2010.
A Experiments
To investigate the practical impact of the algorithms presented, we have run a small experiment on the common XML tree benchmarks DBLP², XMark³ and Treebank⁴. The general graph algorithm and the specialized tree algorithm were implemented in C++ and run on an AMD Athlon 3500+. We generated XMark data of roughly the same size as the Treebank benchmark, and used a prefix of the DBLP data of similar size. Only XML elements and attributes, without their values, were indexed. Properties of the data sets and the constructed indexes are listed in Figure 8. DBLP is a shallow tree, while XMark and Treebank are much deeper. Note that Treebank has a larger number of labels, and a structure that gives a large number of bisimulation partitions.

² http://dblp.uni-trier.de/xml/dblp.xml
³ http://www.xml-benchmark.org/downloads.html
⁴ http://www.cs.washington.edu/research/xmldatasets/data/treebank/
              Nodes     Labels   B-partitions   F-partitions   F&B-partitions
DBLP      2 830 635         33             88            292            2 479
XMark     2 048 193         77            547         31 230          484 615
Treebank  2 437 667        251        338 749        443 639        2 277 127

Figure 8: Data set properties.
Figure 9a shows the number of times each edge was used in a stabilization operation on average throughout the refinement, while Figure 9b shows the time in milliseconds divided by the number of edges in the graph. For the DBLP data, the original algorithm is actually slightly faster for all partition types. The explanation can be found in the number of times each edge is used, which is close to optimal. The specialized tree algorithm has some linear-time overhead, because the refinement on maximum distance is slightly more involved than the separation of roots and leaves in the graph algorithm. Partitioning on B-bisimilarity and F-bisimilarity has roughly the same cost with both algorithms on all the data sets, again because the graph algorithm is very far from its worst-case behavior.
Figure 9: Computing the different bisimulations (B, F and F&B) on DBLP, XMark and Treebank, using the original algorithm for graphs (StabilizeGraph) or the specialized algorithm for trees (StabilizeTree). (a) Avg. uses per edge. (b) Avg. time per edge (in ms).
On F&B-index construction for XMark data, the tree algorithm uses one third of the edge operations, and is almost twice as fast as the general graph algorithm. Note that the 5.9 uses of each edge in the graph algorithm is far from the worst-case number of uses per edge. For the Treebank benchmark, the number of times an edge is used is smaller, and so is the difference between the algorithms. A greater part of the cost is the overhead of maintaining the blocks in the partition, because the number of blocks is close to the number of nodes in the tree, as can be seen in Figure 8.
B Bi-directional Bisimulation and Stability
Most work on bisimulation considers only forward bisimulation on edge-labeled graph data, and most work on stabilizing partitions considers only predecessor stabilization on node-labeled graph data. When bi-directional bisimulation and the connection to predecessor and successor stabilization were introduced [10], extensions of these previous results were omitted, probably due to space restrictions. We therefore go through and extend the results for bisimulation [14], partition stabilization [16] and their connection [9] to the bi-directional case for node- and edge-labeled data. The extensions may be trivial, but this appendix also serves as an easily accessible reference.

Assume a node- and edge-labeled graph G = 〈V, E〉 with node set V and edge set E ⊆ V × V. The label of a node v ∈ V is L(v) ∈ A_V, and the label of an edge e ∈ E is L(e) ∈ A_E. Let A = A_V ∪ A_E. For a given α ∈ A_V, let V_α be the set of nodes with this label, V_α = {v | v ∈ V, L(v) = α}. And similarly, for a given α ∈ A_E, let E_α = {e | e ∈ E, L(e) = α}.
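As a concrete reading of this setup, the following Python sketch (the representation and helper names are our own assumptions, not from the paper) models a small node- and edge-labeled graph and the selector sets V_α and E_α:

```python
# A tiny node- and edge-labeled graph G = <V, E>.
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (3, 4)}
L_node = {1: "a", 2: "b", 3: "b", 4: "c"}           # L(v) in A_V
L_edge = {(1, 2): "x", (1, 3): "x", (3, 4): "y"}    # L(e) in A_E

def V_alpha(alpha):
    # V_alpha: the nodes carrying label alpha.
    return {v for v in V if L_node[v] == alpha}

def E_alpha(alpha):
    # E_alpha: the edges carrying label alpha.
    return {e for e in E if L_edge[e] == alpha}

assert V_alpha("b") == {2, 3}
assert E_alpha("x") == {(1, 2), (1, 3)}
```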
B.1 Bisimulation

Definition 4 (Node- and edge-labeled F&B-bisimulation). R ⊆ V × V is a F&B-bisimulation if v R v′ implies that the following Properties are satisfied:
1. L(v) = L(v′)
2. u E_α v ⇒ ∃u′ : u′ E_α v′ ∧ u R u′
3. u′ E_α v′ ⇒ ∃u : u E_α v ∧ u R u′
4. v E_α w ⇒ ∃w′ : v′ E_α w′ ∧ w R w′
5. v′ E_α w′ ⇒ ∃w : v E_α w ∧ w R w′
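For finite graphs, Properties 1–5 can be checked by brute force. The Python sketch below (edges represented as (u, α, v) triples; all names are ours) verifies a candidate relation R, here for two bisimilar one-edge paths:

```python
def is_fb_bisimulation(V, E, L, R):
    """E: set of (u, alpha, v) labeled edges; L: node labels; R: relation."""
    for (v, vp) in R:
        if L[v] != L[vp]:                                           # Property 1
            return False
        for (u, a, w) in E:
            # Property 2: each labeled in-edge of v is matched at v'.
            if w == v and not any((x, a, vp) in E and (u, x) in R for x in V):
                return False
            # Property 3: each labeled in-edge of v' is matched at v.
            if w == vp and not any((x, a, v) in E and (x, u) in R for x in V):
                return False
            # Property 4: each labeled out-edge of v is matched at v'.
            if u == v and not any((vp, a, x) in E and (w, x) in R for x in V):
                return False
            # Property 5: each labeled out-edge of v' is matched at v.
            if u == vp and not any((v, a, x) in E and (x, w) in R for x in V):
                return False
    return True

# Two bisimilar paths labeled a -> b, with edge label x.
V = {1, 2, 3, 4}
E = {(1, "x", 2), (3, "x", 4)}
L = {1: "a", 2: "b", 3: "a", 4: "b"}

assert is_fb_bisimulation(V, E, L, {(1, 3), (2, 4)})
assert not is_fb_bisimulation(V, E, L, {(1, 3)})  # Property 4 fails at (1, 3)
```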
This is illustrated in Figure 10. The definition becomes equal to the definition of F&B-bisimulation for node-labeled data [10] when there is only a single edge label α, and equal to the original (forward) bisimulation definition [14] when there is a single node label and we ignore Properties 2 and 3. Various types of simulation [8] are defined by ignoring at least Properties 3 and 5.

The following lemma trivially extends the original for forward bisimulation in edge-labeled graphs [14, Proposition 4.1].
Lemma 4 (Constructing F&B-bisimulations). Assuming that R_i and R_j are F&B-bisimulations, then the following are F&B-bisimulations:

(1) ∅
(2) Id_V = {〈v, v〉 | v ∈ V }
(3) R_i^{-1} = {〈v′, v〉 | 〈v, v′〉 ∈ R_i}
(4) R_i ∪ R_j = {〈v, v′〉 | 〈v, v′〉 ∈ R_i ∨ 〈v, v′〉 ∈ R_j}
(5) R_i R_j = {〈v, v″〉 | 〈v, v′〉 ∈ R_i, 〈v′, v″〉 ∈ R_j}

Figure 10: Illustration of F&B-bisimulation.
Proof. Property 1 is trivially satisfied for (1)–(5), as it is an equality. (1) is true as there are no v R v′. For (2), substitute u for u′ and w for w′ in Properties 2–5. (3) is true by the symmetry of Properties 2 and 3, and of Properties 4 and 5. (4) is true by substituting R_i ∪ R_j for R in Properties 2–5. For (5), assume that v R_i R_j v″. Then there is some v′ such that v R_i v′ and v′ R_j v″. For Property 2, let there be some u such that u E_α v. Then, as v R_i v′ and R_i is a F&B-bisimulation, there exists a u′ such that u′ E_α v′ and u R_i u′. As v′ R_j v″ and R_j is a F&B-bisimulation, there exists a u″ such that u″ E_α v″ and u′ R_j u″, and the relation u R_i R_j u″ follows. Proofs of Properties 3–5 are similar.
Corollary 2 (Union of bisimulation refinements). It follows from Lemma 4(4) that given an initial relation S, the union of all F&B-bisimulations that are refinements of S is also a F&B-bisimulation. Trivially, this union is also a refinement of S.
B.2 Bisimilarity

We know from Corollary 2 that there exists a unique maximum F&B-bisimulation, of which all other F&B-bisimulations are refinements:

Definition 5 (F&B-bisimilarity). Nodes v and v′ are F&B-bisimilar, written v ∼_{F&B} v′, iff v R v′ for some F&B-bisimulation R. Equivalently,

∼_{F&B} = ⋃ {R | R is a F&B-bisimulation}

If the type of bisimulation is clear from the context, such as in the following, we write just ∼ for ∼_{F&B}.
The next theorem is a generalization of the following Corollary 3, which is taken from Milner's original results [14]. This more general result will be useful when linking F&B-bisimulations with partition P&S-stability in Section B.3.

Theorem 5 (Largest bisimulation partition refinement is equivalence). If Q is a partition of the nodes in a graph, then the largest F&B-bisimulation R that is a refinement of =_Q is an equivalence relation.
Proof. Reflexivity: We have that Id_V ⊆ =_Q as =_Q is an equivalence relation, and that Id_V is a F&B-bisimulation by Lemma 4(2). Hence Id_V ⊆ R, as R contains all such F&B-bisimulations by Corollary 2. Symmetry: If v R v′, then v R_i v′ for some F&B-bisimulation R_i ⊆ R. We have that v′ R_i^{-1} v by definition of the inverse, and R_i^{-1} is a F&B-bisimulation by Lemma 4(3). For any v′ R_i^{-1} v, we have that v′ =_Q v, because v R_i v′ implies v =_Q v′ by assumption, and =_Q is an equivalence relation and hence symmetric. This means that R_i^{-1} ⊆ =_Q, and that R_i^{-1} ⊆ R by the maximality of R, and v′ R v follows. Transitivity: If v R v′ and v′ R v″, then for some F&B-bisimulations R_i ⊆ R and R_j ⊆ R we have that v R_i v′ and v′ R_j v″. By Lemma 4(5), R_i R_j is a F&B-bisimulation. For any x and x′, if x R_i x′ and x′ R_j x″, then x =_Q x′ and x′ =_Q x″ follow from R_i, R_j ⊆ R ⊆ =_Q. Because =_Q is an equivalence relation and hence transitive, x =_Q x″ follows, and hence R_i R_j ⊆ =_Q. By the maximality of R we have that R_i R_j ⊆ R, and therefore v R v″.
The next corollary is a simple extension of the original result for single-directional bisimulations in edge-labeled data [14, Proposition 2].

Corollary 3 (Bisimilarity is equivalence). It follows from Theorem 5 that F&B-bisimilarity is an equivalence relation.
The following lemma states that for F&B-bisimilarity, the word implies in Definition 4 can be exchanged with iff. The proof is almost identical to the proof for the single-directional bisimulation for edge-labeled data [14, Lemma 3 + Proposition 4].
Lemma 5. Properties 1–5 in Definition 4 with ∼ substituted for R imply v ∼ v′.

Proof. Define ∼′ to be a relation such that v ∼′ v′ iff Properties 1–5 are satisfied with ∼ replaced for R. We have that v ∼ v′ implies v ∼′ v′, because v ∼ v′ implies the properties, and the properties imply v ∼′ v′ by the definition of ∼′. To prove that the properties imply v ∼ v′, we prove that v ∼′ v′ implies v ∼ v′, i.e., ∼′ ⊆ ∼. Assume that v ∼′ v′. Property 1 is trivially satisfied. By the definition of ∼′, for all u such that u E_α v, there exists a u′ such that u′ E_α v′ and u ∼ u′, which, as shown above, implies u ∼′ u′, satisfying Property 2. Properties 3–5 are shown similarly.
B.3 Stability

The following is a reformulation of Definition 2 of successor and predecessor stability, adding edge labels:

Definition 6 (Stability). A set D ⊆ V is S_α-stable with respect to a set B ⊆ V iff D ⊆ E_α(B) or D ∩ E_α(B) = ∅, and P_α-stable with respect to B iff D ⊆ E_α^{-1}(B) or D ∩ E_α^{-1}(B) = ∅. Furthermore, D is S-stable or P-stable with respect to B iff D is S_α-stable or P_α-stable with respect to B for all α ∈ A_E, respectively.

If D is both successor stable and predecessor stable with respect to B, we shall say that D is P&S-stable with respect to B, or simply stable if there is no ambiguity.
For any combination of successor and predecessor stability, a partition Q of V is said to be stable with respect to a block B if all blocks in Q are stable with respect to B. A partition Q is stable with respect to another partition Q′ if it is stable with respect to all blocks in Q′. Q is said to be stable if it is stable with respect to itself.
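For a single edge label, these stability notions can be tested directly. The following Python sketch (helper names are ours) checks a block and a whole partition for P&S-stability; note that a partition into singleton blocks is trivially stable:

```python
def succ(E, B):   # E(B): successors of the nodes in B
    return {v for (u, v) in E if u in B}

def pred(E, B):   # E^{-1}(B): predecessors of the nodes in B
    return {u for (u, v) in E if v in B}

def block_stable(D, B, E):
    # D is stable w.r.t. B: inside or disjoint from both E(B) and E^{-1}(B).
    s, p = succ(E, B), pred(E, B)
    return (D <= s or not (D & s)) and (D <= p or not (D & p))

def partition_stable(P, E):
    # A partition is stable iff every block is stable w.r.t. every block.
    return all(block_stable(D, B, E) for D in P for B in P)

E = {(1, 3)}  # only node 1 points into {3}
assert not partition_stable([{1, 2}, {3}], E)  # {1,2} not P-stable w.r.t. {3}
assert partition_stable([{1}, {2}, {3}], E)    # singletons are always stable
```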
B.4 Stability and bisimulation

The next lemma states that a F&B-bisimulation can be found by P&S-stabilizing a partition, extending the link between single-directional (forward) bisimulation and the relational coarsest (predecessor) stable partition problem [9, Lemma 3].

Lemma 6 (Stability implies bisimulation). If a partition Q respects node labels and is P&S-stable, then =_Q is a F&B-bisimulation.
Proof. We must prove that v =_Q v′ implies Properties 1–5 in Definition 4. Property 1 is satisfied as node labeling is assumed to be respected by the partition. For Property 2, assume that D ∈ Q is the block containing v and v′, and that there is an edge u E_α v for some α. Let B ∈ Q be the block containing u. We have that D ∩ E_α(B) ≠ ∅ from u E_α v, and hence D ⊆ E_α(B) from the successor stability. This means that for any v′ ∈ D, there is some u′ ∈ B, i.e., u =_Q u′, such that u′ E_α v′. Properties 3–5 are proved similarly.
Corollary 4. Given a partition Q, by Definition 5 and Lemma 6 there is a unique coarsest P&S-stable refinement of Q, of which all other P&S-stable refinements of Q are again refinements.
Lemma 7 (Bisimulation and equivalence implies stability). If a relation R is a F&B-bisimulation and an equivalence relation, then the equivalence classes of R give a P&S-stable partition respecting node labels.

Proof. Let Q be the partition arising from R. Q respects labels by Property 1 in Definition 4. Successor stability: We prove that for blocks D, B ∈ Q, we have that D ∩ E_α(B) ≠ ∅ implies D ⊆ E_α(B). Assume D ∩ E_α(B) ≠ ∅, i.e., there is some node v ∈ D such that v ∈ E_α(B), and hence some u ∈ B such that u E_α v. For each v′ ∈ D, we have that v R v′, and by Property 2 in Definition 4 there is some u′ such that u′ E_α v′ and u R u′, i.e., u′ ∈ B and v′ ∈ E_α(B). Hence, D ⊆ E_α(B). Predecessor stability: Identical proof substituting E_α with E_α^{-1}.
Theorem 6 (Maximum refinements). Given an initial partition Q, if R is the maximum F&B-bisimulation that is a refinement of =_Q, and P is the coarsest stable partition refining Q respecting node labels, then =_P is equal to R.

Proof. Follows from Lemmas 6 and 7.
B.5 Implementing Stability

The following two lemmas are trivial extensions of Paige and Tarjan's original description [16].
Lemma 8 (Stability inherited under refinement). For any type of stability, if a partition Q_2 is a refinement of Q_1 and Q_1 is stable with respect to a partition Q, then so is Q_2.

Proof. For any block D_2 ∈ Q_2 there is some block D_1 ∈ Q_1 such that D_2 ⊆ D_1. We show successor stability. Assume a given α ∈ A_E for S_α-stability, or any α for general S-stability. For any block B ∈ Q, if D_1 ⊆ E_α(B) then D_2 ⊆ E_α(B), or if D_1 ∩ E_α(B) = ∅ then D_2 ∩ E_α(B) = ∅. Predecessor stability is symmetric.
Lemma 9 (Stability inherited under union). For any type of stability, if a partition Q is stable with respect to both B_1 ⊆ V and B_2 ⊆ V, then Q is stable with respect to B_1 ∪ B_2.

Proof. For successor stability, for any block D ∈ Q and a given or any α ∈ A_E, if at least one of D ⊆ E_α(B_1) and D ⊆ E_α(B_2) is true, then D ⊆ E_α(B_1 ∪ B_2) is true. If both of D ∩ E_α(B_1) = ∅ and D ∩ E_α(B_2) = ∅ are true, then D ∩ E_α(B_1 ∪ B_2) = ∅ is true. Predecessor stability is symmetric.
When using the procedure StabilizeWRT() in our stabilization algorithms, an in-place stabilization is performed, while in the theoretical description [16] of the original algorithm, there is a function split(B, Q), which returns the coarsest refinement of Q that is predecessor stable with respect to B. This is naturally generalized to stabilizing on P-stability, S-stability and S&P-stability. The following functions are handy for the definitions:

clean(Q) = {B | B ∈ Q ∧ B ≠ ∅}
sep(C, Q) = clean({D ∩ C | D ∈ Q} ∪ {D \ C | D ∈ Q})

Note that sep is commutative, i.e., sep(C_1, sep(C_2, Q)) = sep(C_2, sep(C_1, Q)).
Definition 7 (α-Split).

split_{S_α}(B, Q) = sep(E_α(B), Q)
split_{P_α}(B, Q) = sep(E_α^{-1}(B), Q)
split_{P&S_α}(B, Q) = split_{P_α}(B, split_{S_α}(B, Q))
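The functions clean and sep, and the split functions of Definition 7, translate almost verbatim to code. The sketch below (single edge label assumed; blocks represented as frozensets; all names ours) also exercises the commutativity of sep:

```python
def succ(E, B):
    return frozenset(v for (u, v) in E if u in B)

def pred(E, B):
    return frozenset(u for (u, v) in E if v in B)

def clean(Q):
    return {B for B in Q if B}              # drop empty blocks

def sep(C, Q):
    return clean({D & C for D in Q} | {D - C for D in Q})

def split_S(B, Q, E):                       # split on successor stability
    return sep(succ(E, B), Q)

def split_P(B, Q, E):                       # split on predecessor stability
    return sep(pred(E, B), Q)

def split_PS(B, Q, E):
    return split_P(B, split_S(B, Q, E), E)

E = {(1, 2), (1, 3)}
Q0 = {frozenset({1, 2, 3})}

# One P&S-split of the trivial partition separates the root from its children.
Q1 = split_PS(frozenset({1, 2, 3}), Q0, E)
assert Q1 == {frozenset({1}), frozenset({2, 3})}

# sep is commutative.
a, b = frozenset({1}), frozenset({2})
assert sep(a, sep(b, Q0)) == sep(b, sep(a, Q0))
```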
Lemma 10 (Splitting and stability). A partition Q is:

S-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_{S_α}(B, Q)
P-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_{P_α}(B, Q)
P&S-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_{P&S_α}(B, Q)

Proof. Follows directly from Definitions 6 and 7.
Lemma 11 (α-splitting is monotone). Given a set B ⊆ V, if a partition Q_2 is a refinement of a partition Q_1, then split_{S_α}(B, Q_2) is a refinement of split_{S_α}(B, Q_1), and split_{P_α}(B, Q_2) is a refinement of split_{P_α}(B, Q_1).

Proof. For any block D_2 ∈ Q_2, there is a block D_1 ∈ Q_1 such that D_2 ⊆ D_1, and for split_{S_α} we have D_2 ∩ E_α(B) ⊆ D_1 ∩ E_α(B) and D_2 \ E_α(B) ⊆ D_1 \ E_α(B). Similarly, for split_{P_α} we have D_2 ∩ E_α^{-1}(B) ⊆ D_1 ∩ E_α^{-1}(B) and D_2 \ E_α^{-1}(B) ⊆ D_1 \ E_α^{-1}(B).
Lemma 12 (Correctness of splitting). For an initial partition Q_0, let R be the coarsest refinement of Q_0 that is P&S-stable, and let Q_i and Q′ be refinements of Q_0 such that R is a refinement of both Q_i and Q′. If U is a union of blocks in Q′, then for any α we have that R is a refinement of split_{S_α}(U, Q_i), of split_{P_α}(U, Q_i), and hence of split_{P&S_α}(U, Q_i).

Proof. Since U is a union of blocks from Q′, it is also a union of blocks from R, which refines Q′. As R is S-stable, we have R = split_{S_α}(U, R) for any α. From Lemma 11 we have that split_{S_α}(U, R) is a refinement of split_{S_α}(U, Q_i) when R is a refinement of Q_i. Hence, R is a refinement of split_{S_α}(U, Q_i). P-stability is symmetric for split_{P_α}.
Algorithm 6 shows the generic framework for stabilization used in Paige and Tarjan's algorithm [16], extended to our more general case.
Algorithm 6 Generic stabilization
    Given a partition Q_0.
    i ← 0
    Until no change is possible:
        Find some set U that is a union of blocks in Q_i
            such that Q_i ≠ split_{P&S_α}(U, Q_i) for some α ∈ A_E.
        Q_{i+1} ← split_{P&S_α}(U, Q_i)
        i ← i + 1
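Algorithm 6 can be sketched for a single edge label by always choosing the union U to be a single block of the current partition (a block is trivially a union of blocks). The helper names below are ours, not from the paper:

```python
def succ(E, B): return frozenset(v for (u, v) in E if u in B)
def pred(E, B): return frozenset(u for (u, v) in E if v in B)
def sep(C, Q):  return {X for D in Q for X in (D & C, D - C) if X}

def split_PS(U, Q, E):
    # Split on successor stability, then predecessor stability, w.r.t. U.
    return sep(pred(E, U), sep(succ(E, U), Q))

def stabilize(Q, E):
    """Refine Q until no block U yields a proper P&S-split (Algorithm 6)."""
    changed = True
    while changed:
        changed = False
        for U in list(Q):
            Qn = split_PS(U, Q, E)
            if Qn != Q:
                Q, changed = Qn, True
                break
    return Q

# Two roots (1 and 4) with two leaf children each: the coarsest stable
# refinement of the trivial partition groups the roots and the leaves.
E = {(1, 2), (1, 3), (4, 5), (4, 6)}
Q = stabilize({frozenset({1, 2, 3, 4, 5, 6})}, E)
assert Q == {frozenset({1, 4}), frozenset({2, 3, 5, 6})}
```

This naive loop rescans the whole partition after each split and so runs far slower than the O(m log n) bookkeeping of Paige and Tarjan, but it computes the same fixpoint.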
Corollary 5 (Stabilization correct). It follows from Lemma 12 that Algorithm 6 maintains the invariant that the coarsest P&S-stable refinement of the initial partition Q_0 is also a refinement of the current partition Q_i.

Theorem 7 (Stabilization terminates). Algorithm 6 terminates, and then Q_i is the coarsest P&S-stable refinement of Q_0.
Proof. As long as Q_i is not P&S-stable, by Lemma 10 there exists a union U of blocks in Q_i and an α ∈ A_E such that split_{X_α}(U, Q_i) ≠ Q_i. When split_{X_α}(U, Q_i) = Q_i for all U and α, Q_i is P&S-stable by Lemma 10, and by Corollary 5, Q_i is the coarsest P&S-stable refinement of Q_0.
Corollary 6. It follows from Theorems 6 and 7 that F&B-bisimilarity can be computed by using Algorithm 6 to compute the coarsest P&S-stable refinement of the label partition.
Appendix A

Other Papers

“No. Better research needed. Fire your research person.
No fishnet stockings. Never. Not in this band.”
– Gene Simmons
Paper 7<br />
<strong>Nils</strong> <strong>Grimsmo</strong><br />
On performance <strong>and</strong> cache effects in substring indexes<br />
Technical Report IDI-TR-2007-04, Norwegian University <strong>of</strong> Science <strong>and</strong> Technology,<br />
Trondheim, Norway, 2007<br />
Abstract This report evaluates the performance of uncompressed and compressed substring indexes on build time, space usage and search performance. It is shown how the structures react to increasing data size, alphabet size and repetitiveness in the data. Conclusions are drawn as to when different data structures should be used. The main contribution is the strong relationship identified between time performance and locality in the data structures. As an example, it is found that for byte-sized alphabets, suffix tree construction can be sped up by a factor of 16, and query lookup by a factor of 8, if dynamic arrays are used instead of linked lists to store the lists of children for each node, at the cost of about 20% more space. And for enhanced suffix arrays, query lookup is up to twice as fast if the data structure is stored as an array of structs instead of a set of arrays, at no extra space cost.
Research process
I started on this PhD right after finishing my master's degree [16], where the topic was substring indexes, such as suffix trees and suffix arrays. XML search was the topic my PhD supervisors had planned for me to work on, but I felt that I had some results on substring indexing I should try to publish first.
Retrospective view
In retrospect, pursuing this research direction was a bad idea, as I was fresh out of school and tried to publish in a field that was outside the expertise of my supervisors. Also, in the years before I started working on this topic, research had been published that drastically changed the field, and my results turned out to be outdated and uninteresting for the research community. The results of this venture were published in a technical report [17].
Paper 7: On performance and cache effects in substring indexes

1 Introduction
The suffix tree is a versatile substring index, first introduced by Weiner [44] (see more accessible descriptions by McCreight and Ukkonen [29, 42]). It can be used to search for occurrences of patterns in a string, and to solve many other problems in combinatorial pattern matching [16]. The suffix tree and similar structures are mostly used in computational biology, where the sequences considered do not have word boundaries, and methods such as inverted lists are not suitable.
1.1 Suffix Tree Definition
A suffix tree for a string T of length n from the alphabet Σ of size σ is a trie of all n + 1 suffixes of the string padded with a unique terminal, edge-compacted such that every internal node is branching. Since the padded string has n + 1 suffixes, and the unique terminal ensures that no padded suffix is a prefix of another, the tree has n + 1 leaf nodes. There are at most n internal nodes, since they all have at least two children. This means that if edges are represented as pointers into the string, the suffix tree needs Θ(n) space measured in computer words, or Θ(n log n) bits, which is slightly more than the optimal O(n log σ) bits, the space needed to store the indexed string.
In addition to the parent–child edges, each internal (non-root) node has a suffix link to another internal node [29]. If the edges from the root to an internal node spell χα, where χ is a string of length 1, this node has a suffix link to the internal node which represents the string α. These links are used in construction algorithms, and to solve some problems, such as longest common substring [16].
A suffix tree can be built in Θ(n) time when the alphabet size is constant [44] or integer [5]. As with any trie, it can be checked whether a pattern P of length m is contained in the tree in O(m) time, if node child lookup is Θ(1). Since all leaf nodes below the tree position found in such a lookup represent unique matches, and all internal nodes are branching, at most 2z − 1 nodes must then be visited to find all z matches, giving a total time of O(m + z). The problem of finding all matches of a pattern in a string or a set of strings is known as the occurrence listing problem. A suffix tree has asymptotically optimal construction time for integer alphabets, and optimal search time for constant alphabets. Optimal search time for larger alphabets can be achieved with perfect hashing, at the cost of a longer construction time [10].
A generalised suffix tree [16] for a set of strings S = {T_1, ..., T_d} is an edge-compacted trie of all suffixes of each string T_i padded with a unique terminator string $_i. A string T_i of length n_i can be added to the tree in Θ(n_i) time by slightly modifying a suffix tree construction algorithm, and can be removed in Θ(n_i) time.
1.2 Alternatives to Suffix Trees
A high constant factor in the space needed for suffix trees makes them unsuited for solving problems with large strings on commodity computers, as they do not work well on disk [2]. The space-efficient representation by Kurtz [22] needs 13 bytes per string symbol on average¹. This has led to the development of many alternative index structures.
The suffix array [27] is a simplification of the suffix tree, consisting only of an array A with the sorted order of the suffixes of the padded string, such that T[A[i]..n+1] < T[A[i+1]..n+1] for all 1 ≤ i ≤ n. The order in A is equal to the order of the leaf nodes seen in an ordered depth-first traversal of the suffix tree for the string. The space usage is n⌈log n⌉ bits, plus n⌈log σ⌉ bits for storing the text. Lookup by binary search is O(m log n), which can be improved to O(m + log n) by using additional arrays storing longest common prefixes (LCPs) between the indexed suffixes [27]. After finding the left and right border of the suffixes matching the pattern, the hits are read by a sequential scan in the array. Suffix arrays can be constructed in Θ(n) time [18, 20, 21], but algorithms with higher worst-case bounds are usually faster in practice [28, 40].
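The lookup scheme just described can be sketched as follows. This is a minimal C++ illustration with names of my own choosing (build_sa, find_range); it uses plain comparison sorting rather than the linear-time construction algorithms cited, and omits the terminal padding for brevity.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Naive suffix array construction by comparison sorting.  This is far
// from the linear-time algorithms cited above, but it produces the
// same array layout: sa holds the start positions of the suffixes of t
// in lexicographic order.
std::vector<int> build_sa(const std::string& t) {
    std::vector<int> sa(t.size());
    for (std::size_t i = 0; i < t.size(); ++i) sa[i] = static_cast<int>(i);
    std::sort(sa.begin(), sa.end(), [&t](int a, int b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// O(m log n) lookup: binary search for the left and right borders of
// the suffixes that start with pattern p.  The z hits are then read by
// a sequential scan of sa between the two borders.
std::pair<int, int> find_range(const std::string& t,
                               const std::vector<int>& sa,
                               const std::string& p) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), p,
        [&t](int a, const std::string& q) {
            return t.compare(a, q.size(), q) < 0;  // suffix prefix < pattern
        });
    auto hi = std::upper_bound(sa.begin(), sa.end(), p,
        [&t](const std::string& q, int a) {
            return t.compare(a, q.size(), q) > 0;  // suffix prefix > pattern
        });
    return {static_cast<int>(lo - sa.begin()),
            static_cast<int>(hi - sa.begin())};
}
```

For "banana" the sorted suffix order is a, ana, anana, banana, na, nana, so a lookup of "ana" yields the border pair (1, 3), and the two hit positions are read from sa[1] and sa[2].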
A “hybrid” between suffix arrays and trees is the enhanced suffix array, which has the same functionality as suffix trees, but uses less space in the worst case. It is a suffix array with additional fields, and can be built in linear time. Abouelhoda et al. [1] show how to use the enhanced suffix array to replace any algorithm doing top–down, bottom–up or suffix link traversal in a suffix tree. The steps of looking up a pattern are similar to those in a suffix tree implemented with sibling lists. Listing hits is done as in suffix arrays, by a sequential scan between a left and right border. Enhanced suffix arrays are expected to list large numbers of hits much faster than suffix trees in practice, due to the improved locality.
An alternative non-linear suffix tree construction is the write-only top-down construction (wotd) [11], in which the suffix tree is constructed in O(n²) time (expected Θ(n log n)). Sibling nodes are stored adjacently in memory. This spatial locality makes it faster than linear-time suffix trees in some cases.
Many compressed substring indexes have been introduced in the last ten years. See [34] for an introduction, time and space bounds, and an extensive list of references. The family of LZ-indexes [19, 8, 32] uses structures based on the Ziv–Lempel decomposition of the string [45]. Other indexes are based on suffix arrays, such as the family of compressed suffix arrays [23, 15, 38, 14, 24]. The FM-index family [7, 12, 9, 26] also builds on suffix arrays, but uses a different search method. These structures offer various trade-offs between index size, construction time, and query performance. Many of them do not depend on keeping a copy of the text, and are hence called self-indexes.
Sadakane [39] has shown that it is also possible to combine a compressed suffix array with a balanced parenthesis representation of a suffix tree and various other structures to get full functionality, with bottom–up, top–down and suffix link traversal of the tree.
1.3 Dynamic Problems
Suffix trees have been equalled by other structures in terms of construction time and search speed, and surpassed on space usage and practical performance when listing many hits. The only problems in which suffix trees have not been matched (at least in asymptotic terms, to the author's knowledge) are problems concerning dynamic sets of strings, in which strings are added to and removed from the set. Generalised suffix trees can be used to solve many of these problems optimally.

¹ When supporting strings up to 500 MB.
Problems with static sets of strings can be solved by using suffix arrays or similar structures, by indexing the concatenation of the strings, separated by a unique symbol ∉ Σ. Also, any dynamic decomposable indexing problem can be solved by using a hierarchy of static indexes of varying sizes, at the cost of an O(log |S|) overhead on build time and search performance [35]. Grimsmo [13] shows the performance of hierarchies of suffix arrays compared to a generalised suffix tree.
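The hierarchy-of-static-indexes idea can be sketched for a simple membership index. This is my own insert-only illustration of the logarithmic method, not the structures from [35] or [13]; the class name LogIndex is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Logarithmic method sketch: a dynamic membership index built from a
// hierarchy of static sorted arrays, where level i holds either
// nothing or a sorted array of exactly 2^i keys.  Insertion merges
// carried levels upwards; a query binary-searches every level, giving
// the O(log |S|) overhead over a single static index.
class LogIndex {
    std::vector<std::vector<int>> levels_;  // levels_[i].size() is 0 or 2^i
public:
    void insert(int key) {
        std::vector<int> carry{key};  // a static "index" of size 2^0
        for (std::size_t i = 0;; ++i) {
            if (i == levels_.size()) levels_.emplace_back();
            if (levels_[i].empty()) {
                levels_[i] = std::move(carry);  // free slot: drop the carry here
                return;
            }
            // Occupied: merge two sorted arrays of size 2^i into one of
            // size 2^(i+1) and continue upwards.
            std::vector<int> merged(carry.size() + levels_[i].size());
            std::merge(carry.begin(), carry.end(),
                       levels_[i].begin(), levels_[i].end(), merged.begin());
            levels_[i].clear();
            carry = std::move(merged);
        }
    }
    bool contains(int key) const {
        for (const auto& lvl : levels_)  // one probe per level
            if (std::binary_search(lvl.begin(), lvl.end(), key)) return true;
        return false;
    }
};
```

Each key is re-merged O(log |S|) times over its lifetime, mirroring the build-time overhead stated above; deletions and the substring-index case are left out of this sketch.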
A problem related to the occurrence listing problem is the document listing problem for sets of strings: given a pattern, find all documents in which it occurs. Muthukrishnan [31] shows how to solve this problem optimally by using a suffix tree with additional information. The technique used can be adapted to suffix arrays and similar structures. An open problem (to the author's knowledge) is solving the document listing problem optimally for dynamic sets of documents.
1.4 Report Overview
Section 2 details how to efficiently implement a suffix tree using dynamic arrays for the parent–child relationship, with a low space overhead compared to sibling lists. Section 3 describes some of the choices made in the tested implementation of enhanced suffix arrays. Section 4 features extensive tests of many substring indexes, on build time and search performance, with many types of test data. It is shown how the indexes react to increasing data size, varying alphabet size, and varying text randomness. Section 5 draws conclusions from the experiments, and gives some guidelines for when the various types of index structures are suitable.
2 Implementation of Suffix Trees
The suffix tree data structure consists of a set of nodes. These nodes are linked together with parent–child edges and suffix links. One of the major choices when implementing suffix trees is how to represent the parent–child relationship. McCreight [29] describes the use of linked lists (called sibling lists) and hash tables. His article states that using an array of pointers "would be fast to search, but slow to initialise and prohibitive in size for a large alphabet." This was true in 1976.
2.1 Erroneous Assumptions on Using Arrays for Child Lists
Various authors have made assumptions about the efficiency of different ways to implement suffix trees. Bedathur and Haritsa [2] claim that using arrays would result in wasted space, as there would be many null pointers. They say this would be especially severe in the lower parts of the tree. This implies that the authors do not consider the possibility of storing the pointers in unsorted dynamic arrays, or that they consider such a solution to be inefficient. Tian et al. [41, page 288] make similar claims, referring to [29] and [2].
This report presents results showing that using dynamic arrays for storing the parent–child relationship clearly outperforms using sibling lists in terms of speed, at the cost of using slightly more space.
2.2 Parent–Child Relationship
Suffix tree construction is usually described as being Θ(n), under the assumption that the size of the alphabet can be viewed as a constant. The most common way of implementing suffix trees is using sibling lists. This is a simple solution, which gives a construction time of O(nσ), and is effective for small alphabets, such as DNA. The alphabet factor is most visible in trees for highly random strings from large alphabets, where there are fewer nodes, but a higher average branching factor.
When considering general alphabets (sorting only possible by comparison), the running time of any suffix tree construction algorithm has a worst case of Ω(n log n), as the problem has sorting complexity [6]. Farach [5] shows how to construct suffix trees for integer alphabets (σ ≤ n) in Θ(n) time by recursively constructing a suffix tree for the odd-numbered suffixes, building a tree for the even-numbered suffixes from the information found, and then merging these trees.
McCreight [29] proposed to use hashing as an alternative to sibling lists. Kurtz [22] shows how to do this effectively. Since there is an initial space overhead for each hash table, a single table is used for all nodes in the tree. The keys in this table are pairs of parent node numbers and the first symbols on edges, and the values are child node numbers. With this scheme, the expected construction time is Θ(n), but finding all occurrences when searching can be very slow, as a lookup for children must be done on all possible symbols in Σ. With sibling lists, all z occurrences are found in O(mσ + z), while with a hash map the expected time is O(m + zσ). It is possible to implement a combination of a hash map and a linked list, where the children of a node are linked together, as shown by Grimsmo [13]. The proposed structure uses less space than the combination of sibling lists and a hash table, but it is complex and slow in practice. Simply using both sibling lists and hashing would probably be faster, at the cost of using more space.
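A minimal sketch of the single shared hash table, assuming 32-bit node numbers and byte symbols; the struct and member names are my own, not Kurtz's.

```cpp
#include <cstdint>
#include <unordered_map>

// One shared hash table for all parent–child edges: the key packs a
// parent node number with the first symbol on the edge, the value is
// the child node number.  Looking up one child is expected O(1), but
// enumerating all children of a node needs one probe per possible
// symbol in the alphabet, which is what makes occurrence listing slow.
struct ChildTable {
    std::unordered_map<std::uint64_t, std::uint32_t> edges;

    static std::uint64_t key(std::uint32_t parent, unsigned char symbol) {
        return (static_cast<std::uint64_t>(parent) << 8) | symbol;
    }
    void link(std::uint32_t parent, unsigned char symbol, std::uint32_t child) {
        edges[key(parent, symbol)] = child;
    }
    // Returns the child reached on `symbol`, or -1 if there is none.
    std::int64_t child(std::uint32_t parent, unsigned char symbol) const {
        auto it = edges.find(key(parent, symbol));
        return it == edges.end() ? -1 : static_cast<std::int64_t>(it->second);
    }
};
```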
Bedathur and Haritsa [2] propose storing all child pointers inside the nodes in their disk-based suffix tree construction for very small alphabets (σ = 4). However, with this approach, the array storing the pointers must be of a constant size.
2.3 Sibling List Implementation
The implementation by Kurtz [22] is the fastest space-efficient implementation of linear-time suffix trees known to the author. Nodes are stored as integer fields in an array, and node references are indexes into this array. The following layout is used for internal nodes:
• first-child – Pointer to the first node in the list of children
• branch-brother – Pointer to the next sibling
• suffix-link – Pointer to the node α, if this is node χα
• head-position – The starting position of the second suffix in the string represented by this node
• depth – The depth of the node

The fields head-position and depth are used to find the string the incoming edge represents (see [22]). These are used instead of start and end positions because they do not change for a given node during construction. The number of internal nodes is not known in advance, and internal nodes are therefore kept in a dynamic array. There are always exactly n + 1 leaf nodes, which can be stored in a static array. The only field needed in leaf nodes is the branch-brother pointer.
As the space needed is at most 5n words for the internal nodes, and n words for the leaf nodes, an upper limit for the total space needed for the suffix tree is 6n words. A word must be ⌈log n⌉ + 1 bits to index all nodes and string positions. A total of 24n bytes is needed if 32-bit words are assumed, plus n bytes for storing the string. Kurtz shows how to reduce this to 20n bytes, by storing suffix-link in the branch-brother field of the last child. Another trick shown is using small nodes, which are internal nodes with fewer fields, exploiting redundant information in the tree. On average, the space usage is about 13n bytes, when supporting strings up to 500 MB.
When the number of children per node is high, child lookup is a costly part of tree traversal, and even more so on a computer architecture where cache misses are expensive. When traversing a sibling list, two cache misses are expected per child visited: one for looking up fields in the node, and one for extracting the first symbol on the incoming edge to the child.
2.4 Child Array Implementation
On modern computers, a miss in the level 1 cache costs around 20 cycles, while a miss in the level 2 cache costs around 200 cycles, more than four times the cost of an integer division [17] (and more if pipelined instructions are considered). This implies that using dynamic arrays and copying could be preferable in many applications where linked lists were previously considered the best option. Below follows a description of how to implement suffix trees with child arrays efficiently.

Each internal node refers to a child array. Child arrays come in a set of predefined sizes. Arrays of the same size are stored together in a container, giving allocation and de-allocation in Θ(1) time by storing free locations in a linked list, with the pointers saved inside the free slots. When a node needs room for more children, a new child array of larger size is allocated, the child pointers are copied there, and the old array is released.
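The allocation scheme just described can be sketched as one container holding child arrays of a single size class, with the free list threaded through the free slots themselves. The class and member names are my own invention, not the thesis code.

```cpp
#include <cstdint>
#include <vector>

// One container of child arrays of a single fixed size: released
// arrays are threaded into a free list whose "next" pointers live
// inside the freed slots, so both allocation and release take
// Theta(1) time.
class SlotPool {
    static constexpr std::uint32_t NONE = 0xffffffffu;
    std::vector<std::uint32_t> storage_;  // all arrays of this size class
    std::uint32_t slot_words_;            // words per child array
    std::uint32_t free_head_ = NONE;      // head of the free list

public:
    explicit SlotPool(std::uint32_t slot_words) : slot_words_(slot_words) {}

    // Returns the position of a fresh child array inside the container.
    std::uint32_t allocate() {
        if (free_head_ != NONE) {            // reuse a released slot
            std::uint32_t pos = free_head_;
            free_head_ = storage_[pos];      // next pointer stored in the slot
            return pos;
        }
        std::uint32_t pos = static_cast<std::uint32_t>(storage_.size());
        storage_.resize(storage_.size() + slot_words_);
        return pos;
    }

    void release(std::uint32_t pos) {        // push the slot onto the free list
        storage_[pos] = free_head_;
        free_head_ = pos;
    }

    std::uint32_t& word(std::uint32_t pos, std::uint32_t i) {
        return storage_[pos + i];
    }
};
```

A full implementation would keep one such pool per predefined child array size, and growing a node's child array becomes allocate-copy-release across two pools.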
Table 1 shows the layout of nodes in the implementation used. Two new fields replace first-child and branch-brother in internal nodes: carr-size gives which child array size is used, while carr-pos gives the position of the child array in its container. Since there is no branch-brother field, leaf nodes do not need to be explicitly represented at all. The reasoning is the same as for a hash table implementation [22]. The fields in-symb and in-ref listed in the table are the space needed for each node in its parent's child array. In a traditional sibling list implementation, a lookup in the string is required for each child considered when looking up a specific symbol. This is avoided here by storing the first symbol on the incoming edge interleaved with the node references, which reduces the number of cache misses. While an internal node is identified by its index in memory, a leaf node can be identified by its head-position. The depth can be found by subtracting this from the length of the string. A bit in each node reference is needed to distinguish between internal and leaf nodes. As relatively few cache misses are expected compared to the number of nodes considered in child traversal, this should prove more efficient than sibling lists.
Table 1: Node layouts with child arrays

  Large         B | Small         B | Leaf          B
  carr-size     1 | carr-size     1 |
  carr-pos      4 | carr-pos      4 |
  suffix-link   4 | dist          1 |
  hpos          4 |                 |
  depth         4 |                 |
  Child array space usage
  in-symb       1 | in-symb       1 | in-symb       1
  in-ref        4 | in-ref        4 | in-ref        4
  Sum          22 | Sum          11 | Sum           5
The term array doubling is often used when describing dynamic arrays, but it is a bit misleading, as the growth factor can be different from 2. The amortised asymptotic cost of inserting n values into the array is Θ(n) for any constant growth factor. If a growth factor of 2 is used, and n values are inserted into a dynamic array, the worst-case total number of insertions and re-insertions is about 3n. If the growth factor is 1.1, the worst case is about 12n. Because of the cache effects on modern computers, copying is very cheap, and using a low growth factor is affordable.
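The 3n and 12n figures can be checked with a small simulation (my own code, counting the initial insertions plus the copies performed at each reallocation):

```cpp
#include <cstddef>

// Counts the total number of element writes (insertions plus copies at
// each reallocation) when n values are appended to a dynamic array
// that multiplies its capacity by `growth` whenever it fills up.
std::size_t total_writes(std::size_t n, double growth) {
    std::size_t cap = 1, size = 0, writes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == cap) {                 // array full: reallocate
            std::size_t next = static_cast<std::size_t>(cap * growth);
            if (next <= cap) next = cap + 1;  // guard against truncation
            writes += size;                // copy everything into the new array
            cap = next;
        }
        ++size;
        ++writes;                          // the insertion itself
    }
    return writes;
}
```

For n just past a power of two, a growth factor of 2 gives close to 3n writes, while 1.1 gives roughly 12n, matching the worst cases stated above.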
The total space needed for this implementation, disregarding overhead from dynamic arrays, is 27n bytes in the worst case. This is 27/20 = 1.35 times more than for sibling lists. In practice, the space usage is lower, as the number of internal nodes is usually around 0.5n to 0.7n [22]. The space wasted internally in each child array should also be considered, in addition to the overhead due to the growth factor in the storage for the groups of child arrays. With a growth factor of 1.1, the cost for child arrays is 17 · 1.1 + (5 + 5) · 1.1² ≈ 30.8, while for sibling lists it is 16 · 1.1 + 4 ≈ 21.6. This gives a ratio of 30.8/21.6 ≈ 1.43. This is also lower in practice, as the out-degree of internal nodes often has a very skewed distribution, which can be taken into account when configuring the predefined child array sizes.
3 Implementation of Enhanced Suffix Arrays
An enhanced suffix array is also tested in the following experiments. The implementation is the author's translation of the pseudo-code from Abouelhoda et al. [1] into C++. Some choices for the data structures affect the performance, especially query lookup, and are therefore discussed here.

The data structures needed for substring search with an enhanced suffix array are the suffix array itself, the LCP table and the child table (CLD), all of which have the same length. The first choice to be made in an implementation is whether to store the three as separate arrays, or as an array of structs with three fields. As the fields for a given "position" in the enhanced suffix array are often read together during search, the latter choice should give better locality and performance.
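The two layouts can be sketched as plain types; these are illustrative declarations of my own, not the thesis code.

```cpp
#include <cstdint>
#include <vector>

// Separate arrays (the esa1 variant below): the three fields for
// position i live in three different memory regions, so reading them
// together during search can cost up to three cache misses.
struct EsaSeparate {
    std::vector<std::uint32_t> sa, lcp, cld;
};

// Array of structs (the esa2 variant): the fields for position i are
// adjacent, so one cache line typically serves all three reads.
struct EsaEntry {
    std::uint32_t sa, lcp, cld;
};
using EsaPacked = std::vector<EsaEntry>;
```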
Abouelhoda et al. describe how to store the LCP table in a byte array, overflowing into another data structure, where lookup is done by binary search. This was not done here, because it gave a considerable slowdown on some of the benchmarks, as they had a high average LCP. On the file w3c2 (see Table 2), there was a 36% overflow with 1-byte LCP fields, and 11% overflow with 2 bytes.
The CLD table can also be stored in an overflowing byte array [1]. This gave a very low overflow on the tests run here, but still a considerable slowdown during search. This is because the "nodes" close to the root of the virtual tree are the most likely to overflow, as they "span" larger portions of the enhanced suffix array. Even though only a small portion of all nodes overflow, a large portion of the nodes traversed during a query lookup are expected to have overflowing CLD fields.
In the implementations tested, all three mentioned fields were stored as four-byte integers, resulting in a space usage of 3 · 4n + n = 13n bytes including storage for the text itself. If the LCP and CLD fields are stored as overflowing bytes, the expected space usage is 4n + 3 · n = 7n bytes. Performance for an implementation where the CLD values were stored in bytes overflowing into a hash table is also given in some of the tests on query lookup in Section 4.7.

For the tests on tree traversal operations in Section 4.9, two additional fields were used. The "left" and "right" suffix link pointers were stored in 2 · 4n bytes, and the inverse suffix array used to create them was stored in 4n bytes. Abouelhoda et al. [1] describe how to store the suffix link pointers in an expected 2 · n bytes.
4 Experiments
Below follow experiments testing many substring index implementations on various types of text data and queries. The experiments were designed to emphasise the strengths and weaknesses of the various methods. Construction time, space usage and query performance were tested.
4.1 Test Data
Both "real world" and synthetic data were used in the tests. The largest tests from the Canterbury Corpus [3] and the tests from Manzini and Ferragina's test collection [30] were used, in addition to some artificial data generated by the author. The results for these tests are given in Section 4.3.

To underline the properties of the methods, other synthetic data was also used: space-separated words from a Zipf distribution (Section 4.4), uniform random data with varying alphabet size (Section 4.5), and first-order Markov data with a Zipf distribution on the transitions (Section 4.6).
4.2 Tested Implementations
The following implementations were tested:
• STa – The author's implementation of suffix trees using child arrays. See Section 2.4.
• STs – Using sibling lists. See Section 2.3.
• Kur – The suffix tree implementation by Kurtz [22], taken from MUMmer 3.15². It was compiled using the internal flag STREE_HUGE, which gives a maximum data size of 500 MB.
• wotde – The Write Only Top Down Eager³ suffix tree construction algorithm [11].
• DS – Deep-shallow suffix array construction algorithm⁴ [28]. Default construction parameters used.
• esa1[b] – Enhanced suffix array. The implementation is a translation of the pseudo-code from [1] into C++. Node fields stored in separate arrays. Initial suffix array built with deep-shallow. Suffix b denotes that CLD values were stored in bytes overflowing into a hash map.
• esa2[b] – Enhanced suffix array stored as an array of structs.
• BPR – The Bucket Pointer Refinement suffix array construction algorithm⁵ [40]. Default construction parameters used.
• LZ – LZ-index⁶ [33], implementation by Navarro and Arroyuelo.
• CCSA – Compressed compact suffix array⁷ [24], implementation by Mäkinen and González.
² http://mummer.sourceforge.net/
³ http://bibiserv.techfak.uni-bielefeld.de/wotd/
⁴ http://www.mfn.unipmn.it/~manzini/lightweight/
⁵ http://bibiserv.techfak.uni-bielefeld.de/bpr/
⁶ http://pizzachili.dcc.uchile.cl/indexes/LZ-index/
⁷ http://pizzachili.dcc.uchile.cl/indexes/Compressed_Compact_Suffix_Array/
• FM – FM-index⁸ [8], implementation (version 2) by Ferragina and Venturini.
• AFFM – Alphabet-Friendly FM-index⁹ [9], implementation (version 2) by González.
• RLFM – Run-Length FM-index¹⁰ [25], implementation by Mäkinen and González.
• SSA – Succinct suffix array¹¹ [26] (FM family), implementation (version 2) by Mäkinen and González.
All programs were compiled with gcc 4.1.2 with -O3 optimisation, and used glibc 2.3.6. They were run on an AMD Athlon 64 3500+ with 512 KB L2 cache, running Debian Sid with Linux 2.6.16. The kernel had the perfctr [37] patch, which makes hardware performance counters readable. The PAPI [36] interface was used to read the variable PAPI_L2_TCM, giving level 2 cache misses. The times given are all wall-clock timings, excluding the time for reading files into memory. The memory usage given is VmRSS (resident set size) and VmHWM (resident set size peak) from /proc/$pid/status (what is used by the tools top and ps). When running the tests, all initialisation of measurement variables and reading of data was done before each timing was started, and all related calculations were done after it was stopped.
4.3 General Tests
The following tests include data from the Canterbury Corpus [3] and Manzini and Ferragina's test collection [30]. In addition, the 35th Fibonacci string is used. Note that the odd numbered¹² Fibonacci strings (counting from 1) give the maximum number of nodes in suffix trees (2n + 1), and give the worst case in many non-linear suffix array construction algorithms [40]. The test a025 is 2^25 subsequent a's. Table 2 gives some statistics for the tests: length, alphabet size, LCPs, and first order empirical entropy [34].
Table 3 shows the construction speed for the tested methods, given as symbols indexed per second. The rationale for giving this instead of absolute time is easier comparison between the tests shown in the tables, and clearer differentiation between the best methods in the plots that follow. An "x" in the tables denotes that the test crashed, while "-" denotes that it took too long to complete. "N/A" denotes that the implementation was not designed to handle this alphabet size. Table 4 shows symbols indexed per L2 cache miss. It is expected that this is related to the construction performance.
The suffix trees with child arrays clearly perform better than the sibling list variants, except on the small DNA tests, where they are slightly slower, and on the Fibonacci string, which has an alphabet of size 2. The advantage comes from the improved locality, as can be read from Table 4. Kur and STs have similar behaviour, with the former being slightly faster. This is probably because the latter was originally developed to index dynamic sets of documents. Wotde has a varying performance, which seems dependent on the average
⁸ http://pizzachili.dcc.uchile.cl/indexes/FM-indexV2/
⁹ http://pizzachili.dcc.uchile.cl/indexes/Alphabet-Friendly_FM-Index/
¹⁰ http://pizzachili.dcc.uchile.cl/indexes/Run_Length_FM_index/
¹¹ http://pizzachili.dcc.uchile.cl/indexes/Succinct_Suffix_Array/
¹² In the even numbered Fibonacci strings, a is always followed by b.
APPENDIX A. OTHER PAPERS

Table 2: Test statistics. The first three come from the Canterbury Corpus [3], while the next 10 come from Manzini and Ferragina's test collection [30].

#   Name             Length     σ    Med. LCP  Avg. LCP  Max. LCP  Entr. (1)  Description
1   bible.txt        4047393    63   11        14        551       3.27       King James bible
2   E.coli           4638690    4    11        17        2815      1.98       Escherichia coli
3   world192.txt     2473400    94   13        23        559       3.66       CIA world fact book
4   chr22.dna        34553758   5    13        1979      199999    1.88       Human chromo. 22
5   etext99          105277340  146  12        1109      286352    3.57       Project Gutenberg
6   gcc-3.0.tar      86630400   150  21        8603      856970    3.80       gcc 3.0 source files
7   howto            39422105   197  12        268       70720     3.92       Linux HOWTO text
8   jdk13c           69728899   113  113       679       37334     3.54       HTML and Java
9   linux-2.4.5.tar  116254720  256  17        479       136035    3.90       Linux Kernel 2.4.5
10  rctail96         114711151  93   61        282       26597     3.47       Reuters news in XML
11  rfc              116421901  120  19        93        3445      3.40       RFC text files
12  sprot34.dat      109617186  66   32        89        7373      3.93       Swiss Prot database
13  w3c2             104201579  256  114       42300     990053    4.06       HTML from w3c.org
14  fib035           9227465    2    2306866   2435423   5702885   0.59       35th Fibonacci string
15  rand254          104857600  254  3         2.86      6         7.98       Uniform, |σ| = 254
16  a025             33554432   1    16777217  16777216  33554431  0.00       a repeated 2^25 times
Table 3: Symbols indexed per second in thousands.

    STa   STs   Kur    wotde  skew  DS     BPR    esa1   esa2  LZ     CCSA  FM    AFFM  RLFM  SSA
1   1948  1126  1475   1556   531   3935   3430   1897   1816  3201   1011  2525  630   2286  2127
2   1370  1046  1377   1256   518   3770   3374   1728   1658  4504   869   2365  783   2108  2251
3   2335  1263  1745   1857   570   4833   3241   1934   2056  3211   1129  2884  522   2622  2310
4   1212  932   1165   22     426   2900   2250   1236   1325  3753   730   1712  669   1825  1888
5   1332  632   700    127    309   1736   1292   938    996   1873   606   1430  341   1240  1200
6   2227  1044  1273   12     398   1300   1813   913    920   2457   630   1804  350   1122  1036
7   1677  707   791    468    391   2763   2128   1372   1401  2110   796   2006  332   1781  1659
8   3321  2282  3598   178    429   1251   1341   880    892   3553   579   1360  387   965   879
9   2049  883   1032   258    366   2520   2086   1348   1365  N/A    N/A   N/A   N/A   N/A   N/A
10  2302  1198  1447   287    369   1091   1039   741    767   2708   519   1216  371   855   801
11  1842  864   1005   533    356   2224   1463   1171   1216  2212   712   1769  410   1550  1433
12  2029  778   925    499    351   1904   1261   1029   1087  2445   658   1594  473   1356  1267
13  3052  1509  2138   -      403   1178   1484   825    853   N/A    N/A   N/A   N/A   N/A   N/A
14  4586  5346  12375  -      889   91     267    89     89    21400  73    502   73    76    76
15  606   30    32     1957   473   2470   1218   1020   1094  647    592   x     191   1158  1252
16  4547  5931  14767  -      4338  41708  22805  12904  9677  55966  3290  -     1012  9558  10168
Table 4: Symbols indexed per L2 cache miss.

    STa    STs     Kur     wotde    skew   DS     BPR     esa1   esa2   LZ    CCSA   FM    AFFM   RLFM   SSA
1   0.30   0.15    0.13    0.18     0.039  0.48   0.34    0.20   0.22   0.44  0.15   0.34  0.13   0.34   0.34
2   0.16   0.13    0.13    0.14     0.038  0.43   0.37    0.17   0.19   0.61  0.15   0.30  0.11   0.31   0.31
3   0.40   0.18    0.16    0.27     0.042  0.67   0.40    0.23   0.26   0.47  0.18   0.44  0.14   0.44   0.44
4   0.18   0.14    0.15    0.012    0.036  0.29   0.32    0.14   0.16   0.47  0.11   0.22  0.096  0.23   0.23
5   0.22   0.092   0.088   0.057    0.032  0.18   0.16    0.11   0.12   0.24  0.088  0.16  0.082  0.15   0.15
6   0.49   0.16    0.15    0.00087  0.036  0.19   0.15    0.13   0.14   0.34  0.10   0.23  0.11   0.19   0.19
7   0.28   0.091   0.082   0.11     0.033  0.29   0.24    0.15   0.17   0.29  0.11   0.23  0.11   0.22   0.22
8   1.2    0.56    0.54    0.031    0.037  0.17   0.093   0.12   0.12   0.43  0.085  0.17  0.090  0.14   0.14
9   0.44   0.14    0.13    0.068    0.035  0.28   0.19    0.16   0.17   N/A   N/A    N/A   N/A    N/A    N/A
10  0.50   0.20    0.19    0.023    0.036  0.11   0.069   0.083  0.088  0.33  0.067  0.14  0.067  0.10   0.10
11  0.36   0.13    0.13    0.068    0.036  0.24   0.13    0.14   0.15   0.30  0.10   0.21  0.10   0.19   0.19
12  0.39   0.11    0.11    0.052    0.035  0.20   0.11    0.13   0.13   0.31  0.093  0.19  0.093  0.17   0.17
13  0.97   0.26    0.26    -        0.036  0.16   0.12    0.11   0.12   N/A   N/A    N/A   N/A    N/A    N/A
14  11     10      18      -        0.063  0.015  0.0095  0.014  0.014  4.1   0.013  0.11  0.013  0.014  0.014
15  0.099  0.0031  0.0033  0.18     0.056  0.34   0.16    0.14   0.15   0.12  0.12   x     0.089  0.25   0.25
16  23     28      49      -        1.9    130    55      26     14     419   20     -     4.1    32     32
LCP (see Table 2). For tests larger than 5 MB, it is faster than the regular suffix trees only on the random data. The trees using sibling lists seem unsuited for random data with large alphabets. One might have expected the time and cache performance of the suffix trees to be directly proportional to the alphabet size, but it also depends on other properties of the data. A test where only the alphabet size is varied is given in Section 4.5.

The cost of building the enhanced suffix array is rather high. A reason for this is the way data is laid out in memory. The improvement of esa2 over esa1 does not show very well in this test on construction performance, as the data fields are built by separate algorithms. The tests in Section 4.7 show that esa2 is often significantly faster on query lookup. In general, the regular suffix tree with child arrays has faster construction than both the enhanced suffix array and wotde.

DS and BPR are faster than the linear time suffix trees on most of these tests. The exception is those with very high LCP.

All the compressed structures except LZ use a suffix array built with deep-shallow as a starting point for their construction. The build times for the compressed representations seem very affordable. The LZ index has fast construction, and is even faster than all non-compressed methods on many of the tests. For all the compressed indexes there is a strong relationship between time and cache performance. This can be seen when comparing, for each method, the results on the different tests in Tables 3 and 4.

Query performance is not shown in this test because it depends on too many parameters, such as the length of the query, the number of hits, and the data distribution. Query performance is evaluated in Section 4.7.

Table 5 shows the memory usage for the various implementations, in bytes per symbol after the construction was finished. The peak space usage is shown in Table 6. The memory usage of the application as seen externally is measured, which could give slightly inconsistent results because of space re-allocation. One method might use all its allocated memory, while another may have allocated more just before it finished.

STa uses 15-25% more memory than STs. Both methods use more memory on the DNA tests, probably because of a smaller average branching factor and many internal nodes. Apart from that, all non-compressed methods have a rather constant space usage. Both wotde and esa1/2 use slightly less space than the regular suffix trees. Among the construction algorithms for non-compressed structures, BPR is the only one using considerable extra space during construction. DS is much more space efficient, and has virtually no space overhead.

The FM index is a clear winner on space usage on most of the tests. The exception seems to be those with a large alphabet. The other indexes from the FM family, AFFM, RLFM and SSA, fare much better there. The LZ index uses more space than a suffix array on some of the smaller tests, but around half the space on the larger tests.

The working space used during construction varies for the compressed indexes, as seen in Table 6. CCSA and AFFM need around 8-9 bytes per symbol, while the rest of the FM family needs a little more than 6. The working space for the LZ index strongly depends on the properties of the text data, as the tree data structures built utilise its repetitions.
Table 5: Space usage in bytes per symbol indexed after construction.

    STa  STs  Kur  wotde  skew  DS   BPR  esa1  esa2  LZ   CCSA  FM    AFFM  RLFM  SSA
1   16   13   13   11     7.3   5.5  5.4  13    13    6.2  1.8   1.1   2.4   1.4   1.4
2   20   17   16   11     6.5   5.4  5.3  13    13    4.4  3.0   1.00  1.0   1.3   1.0
3   16   13   13   11     7.4   5.8  5.6  14    14    7.4  1.9   1.4   1.8   1.7   1.9
4   19   16   16   11     5.8   5.1  5.0  13    13    2.4  2.5   0.62  0.66  0.92  0.69
5   16   13   13   10     5.2   5.0  5.0  13    13    2.8  1.9   0.68  0.83  0.98  1.0
6   16   13   13   11     5.5   5.0  5.0  13    13    2.5  1.1   0.59  0.96  0.85  1.1
7   16   13   13   11     5.8   5.0  5.0  13    13    2.9  1.6   0.75  2.0   0.98  1.1
8   16   13   13   11     5.3   5.0  5.0  13    13    1.9  0.70  0.43  0.76  0.73  1.2
9   16   13   13   11     5.3   5.0  5.0  13    13    N/A  N/A   N/A   N/A   N/A   N/A
10  15   12   12   10     5.2   5.0  5.0  13    13    2.0  0.98  0.46  0.73  0.79  1.1
11  16   13   13   11     5.3   5.0  5.0  13    13    2.5  1.2   0.60  0.84  0.85  1.0
12  15   13   13   10     5.2   5.0  5.0  13    13    2.5  1.4   0.57  0.90  0.88  1.1
13  16   13   14   -      5.2   5.0  5.0  13    13    N/A  N/A   N/A   N/A   N/A   N/A
14  19   16   17   -      5.2   5.2  5.2  13    13    1.3  0.64  0.39  0.49  0.85  0.69
15  13   9.5  8.4  6.4    5.0   5.0  5.0  13    13    5.8  4.0   x     2.5   2.0   1.5
16  19   16   17   -      5.1   5.1  5.0  13    13    1.1  0.50  -     3.5   0.69  0.49
Table 6: Peak space usage during construction, in bytes per symbol.

    STa  STs  Kur  wotde  skew  DS   BPR  esa1  esa2  LZ   CCSA  FM   AFFM  RLFM  SSA
1   16   13   13   11     18    5.5  11   13    13    6.7  8.3   6.5  10    6.7   6.7
2   20   17   16   11     18    5.4  10   13    13    4.9  8.4   6.5  11    6.7   6.7
3   16   13   13   11     18    5.8  12   14    14    7.9  8.6   6.8  9.1   7.0   7.0
4   19   16   16   12     18    5.1  10   13    13    4.1  8.4   6.2  9.8   6.3   6.3
5   16   13   13   10     17    5.0  10   13    13    5.6  8.5   6.1  9.0   6.3   6.3
6   16   13   13   11     18    5.1  10   13    13    5.8  8.5   6.1  8.2   6.3   6.3
7   16   13   13   11     18    5.1  11   13    13    6.6  8.4   6.1  9.7   6.3   6.3
8   16   13   13   11     18    5.0  10   13    13    4.2  8.5   6.1  7.7   6.3   6.3
9   16   13   13   11     17    5.0  11   13    13    N/A  N/A   N/A  N/A   N/A   N/A
10  15   12   12   10     17    5.0  10   13    13    4.7  8.5   6.1  8.0   6.3   6.3
11  16   13   13   11     17    5.0  10   13    13    5.4  8.5   6.1  8.3   6.3   6.3
12  15   13   13   10     17    5.0  10   13    13    5.7  8.5   6.1  8.4   6.3   6.3
13  16   13   14   -      17    5.0  11   13    13    N/A  N/A   N/A  N/A   N/A   N/A
14  19   16   17   -      17    5.2  10   13    13    1.4  8.3   6.3  7.6   6.5   6.5
15  13   9.5  8.4  7.7    13    5.0  11   13    13    20   9.3   x    12    6.3   6.3
16  19   16   17   -      17    5.1  10   13    13    1.1  8.4   -    59    6.3   6.3
4.4 Test on Increasing Data Size
Figure 1 gives the results of indexing increasing amounts of space separated word data from a Zipf distribution (parameter s = 1.0). The figure shows symbols indexed per second and per L2 cache miss. Notice that these two quantities are closely related, both here and in the following tests.

The suffix tree STa is almost three times faster than STs, even though they logically perform the same steps. This is because of better locality, which can be read from the plot for cache performance. STa traverses a contiguous array to look up the child of a node, while STs traverses sibling nodes at "random" memory locations. DS and BPR here exhibit the most non-linear performance, suggesting that for larger data, linear time suffix array constructions would be faster. Wotde, which is also a non-linear method, is faster than STs, but slower than STa.
One would expect that the reason STa and STs show a slightly non-linear behaviour is that a smaller relative proportion of the tree fits in cache. In the construction algorithms the tree is traversed in a seemingly random pattern along a certain depth. The average depth is the expected longest common prefix of closest pairs among the suffixes added so far. For random data, this is log_σ n [4]. But the cache performance is more linear than the time performance for both STa and STs. The additional slowdown might be due to memory management overhead.

The LZ index shows great indexing performance. It has a more linear behaviour than DS, and is faster when the data size reaches 100 MB. For all the other compressed implementations, the time and cache misses for the initial construction of the suffix array by deep-shallow have been subtracted. This is done because this construction in many cases dominates the running time, and made it very hard to interpret the cost of building the compressed representations themselves in Sections 4.5 and 4.6. For FM, RLFM and SSA, the cost of the compression is less than the cost of the initial build. An interesting feature in the plot for cache performance is that FM, RLFM and SSA are nearly identical. This must be because they have similar access patterns to their data structures. Figure 3 shows the space usage for the compressed methods on this test. FM is twice as space efficient as the next method for 100 MB.

Figure 4 shows a similar test, but with uniform random data (σ = 20). DS and esa1/2 show approximately the same performance as in the last test, but the other methods do not. Note that BPR here seems to degrade less than DS, and might have been the fastest method with even more data. Wotde is now much faster, while the other suffix trees are slower. Random data gives a lower average LCP, which benefits wotde, but also a higher average out degree in the nodes, which is bad for the regular suffix trees. The worst effect is seen for the sibling lists, where the performance is halved from what was seen for Zipf data in Figure 1. All compressed structures perform slightly worse on the random data. This is probably because there are fewer regularities in the text, and the index structures grow larger.
[Plot: symbols indexed per second against data size in MiB; (a) STa, STs, wotde, skew, DS, esa2, BPR, LZ; (b) CCSA, FM, AFFM, RLFM, SSA.]
Figure 1: Indexing performance on increasing data size, Zipf word data. Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per L2 miss against data size in MiB; same series as Figure 1.]
Figure 2: Continuing Figure 1. Indexing performance on increasing data size, Zipf word data. Construction of initial suffix array not included for compressed indexes.

[Plot: bytes per symbol against data size in MiB; series: CCSA, FM, AFFM, RLFM, SSA, LZ.]
Figure 3: Space usage for compressed structures after construction, Zipf word data, increasing size.
4.5 Increasing Alphabet Size
Figure 6 shows performance when indexing 10 MB of uniformly random data with an increasing alphabet size.

DS here clearly outperforms the other methods for large alphabets, but has a rather peculiar behaviour. This suffix array construction algorithm combines two different sorting techniques, where one utilises the results from the other. The behaviour seen may be due to the individual performance of these, and the way they are combined. Wotde shows great performance as the alphabet increases, because of the decreasing average LCP. The opposite behaviour is seen for the other suffix trees, where an increasing number of children for internal nodes slows down the construction. For STs the performance decreases by a factor of 30 as the alphabet size goes from 2 to 254. The same effect is seen to a lesser degree for STa, with a drop factor of 2.

The construction times of all compressed indexes except CCSA seem strongly dependent on the alphabet size, but the number of cache misses is not. Seemingly, a larger alphabet results in more computation, but as the data accesses are already rather random, there are not more cache misses, even though the data structures are larger. Figure 8 shows the space usage in bytes per symbol. The FM index is extremely efficient for small alphabets, and is best of all methods up to an alphabet size of around 128. Remember that this very artificial test gives the worst case space usage for the compressed indexes, as the entropy of the data is maximal for the given alphabet size. The space usage of around
2 bytes per symbol with an alphabet size of 254 for the most space efficient structures is impressive.

[Plot: symbols indexed per second against data size in MiB; (a) STa, STs, wotde, skew, DS, esa2, BPR, LZ; (b) CCSA, FM, AFFM, RLFM, SSA.]
Figure 4: Indexing performance on increasing data size, random data (σ = 20). Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per L2 miss against data size in MiB; same series as Figure 4.]
Figure 5: Continuing Figure 4. Indexing performance on increasing data size, random data (σ = 20). Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per second against alphabet size (2 to 256); (a) STa, STs, wotde, skew, DS, esa2, BPR, LZ; (b) CCSA, FM, AFFM, RLFM, SSA.]
Figure 6: Indexing performance on increasing alphabet size, uniform data. Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per L2 miss against alphabet size (2 to 256); same series as Figure 6.]
Figure 7: Continuing Figure 6. Indexing performance on increasing alphabet size, uniform data. Construction of initial suffix array not included for compressed indexes.

[Plot: bytes per symbol against alphabet size (2 to 256).]
Figure 8: Space usage for compressed structures after construction, on increasing alphabet size.
4.6 Markov Data
Many of the methods tested have time and space efficiencies which are dependent on the randomness of the data. Less random data gives a higher average LCP, degrading the performance of some of the non-linear construction methods. Figure 9 shows results for indexing first order Markov data with a Zipfian transition distribution with varying parameter s. An alphabet size of 20 was used. In a general Zipf distribution, the probability of selecting element k is given as

    p(k; s, n) = k^(-s) / (Σ_{i=1}^{n} i^(-s))

Setting s = 0 gives a uniform distribution. For the data used, the average LCP approximately doubled each time the parameter s increased by 1, from 4.7 for s = 0 to 2160 for s = 10.
Wotde has the greatest dependence on the LCP, and is the only method showing a continuous drop in performance here. DS has a strong peak at s = 4, and a total breakdown at s = 7, where the average LCP is 324. The peak probably comes from the
distribution of work between the two methods deep-shallow combines, as was discussed in Section 4.5. The performances of the regular suffix trees increase as the data grows less random, due to lower average branching factors.

[Plot: symbols indexed per second against increasing regularity (s = 0 to 10); (a) STa, STs, wotde, skew, DS, esa2, BPR, LZ; (b) CCSA, FM, AFFM, RLFM, SSA.]
Figure 9: Indexing performance on first order Markov data with Zipfian transition distribution and increasing parameter s (σ = 20). Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per L2 miss against increasing regularity (s = 0 to 10); same series as Figure 9.]
Figure 10: Continuing Figure 9. Indexing performance on first order Markov data with Zipfian transition distribution and increasing parameter s (σ = 20). Construction of initial suffix array not included for compressed indexes.
[Plot: bytes per symbol against increasing regularity (s = 0 to 10); series: CCSA, FM, AFFM, RLFM, SSA, LZ.]
Figure 11: Space usage for compressed structures after construction on Markov data (σ = 20).
LZ shows an extreme performance on this data, which is very repetitive for large s. The curve for LZ continues rising to 35 million symbols per second at s = 10, with 70 symbols indexed per L2 miss. FM, RLFM and SSA also show a significant increase in performance with the regularity. Note that for these methods the bulk of the time used is the initial deep-shallow construction, which is subtracted here. To get more robust performance, a linear time suffix array construction algorithm could be used.

Figure 11 shows the space efficiency for the compressed indexes. As in the previous tests, the FM index is the most space efficient. CCSA and LZ also show very good trends. Although the space usage for LZ is low, it does not match the extreme time performance that was shown in Figure 9a. Various tradeoffs between space usage and query performance could be made in the implementation. All the compressed methods have a dependency on higher order entropy in their asymptotic space usage [34], but the implementations show this to a varying extent here.

For large parameters s, the data has LCPs higher than what you would see in any real world data of reasonable size. In these pure Markov chain strings, the LCP between almost all neighbouring suffixes is very high. The average LCP here should probably be
compared with the median LCP seen in Table 2. At s = 6, the average LCP is 172, and the median LCP is 159.
4.7 Query Lookup
As previously mentioned, many parameters influence query performance. One such parameter is the total data size. Figure 12 shows the number of queries per second and per L2 cache miss, on an increasing amount of Zipf word data (s = 1, σ = 20). Queries of length 30 were run, and only one hit was reported (to isolate lookup time). The performance drop for the suffix trees is due to the increasing number of nodes seen in the downward traversal of the tree, and the number of children considered in child lookup. The latter hits STs harder than STa. The decrease for the suffix array is due to the increasing number of jumps in the binary search. Wotde has the best performance in this test, followed by STa.
Esa1 performs rather poorly, and even worse than STs, which logically performs the same steps. This is related to the number of cache misses, which is many times as high. A sibling traversal is also performed in esa1, but it has bad locality for two reasons. The first is an implementational detail: the information for one "position" in the enhanced suffix array is spread over different arrays, giving unnecessary cache misses. This is implemented differently in esa2, which has performance similar to STs. The second reason is an inherently bad locality in the enhanced suffix array: for an internal "node" close to the root, the values read to traverse the child nodes are spread over a large area. This is the reason esa2 has much slower lookup than STa, and also a more degrading performance as the amount of data increases. The variants esa1b and esa2b have the CLD field stored in bytes, overflowing into a hash table (see Section 3). Even though the total overflow is negligible in terms of space, the query lookup performance is roughly halved, because many nodes in the upper part of the tree overflow.
The query lookup performance of the compressed indexes differs greatly. SSA is<br />
almost 20 times faster than FM on 100 MB of data, but 20 times slower than the fastest<br />
suffix tree. FM has poor time performance, but better cache performance than the other<br />
methods. The author does not know the implementation well enough to comment on this<br />
properly. Note the different scales on the y-axes for the compressed indexes. Searching in the<br />
compressed structures involves many recursive lookups for each logical step in the search.<br />
The values to be read are spread throughout the data structures, giving bad locality.<br />
Figure 14 shows query lookup on data with varying alphabet size. Queries <strong>of</strong> length<br />
30 were issued on 10 MB <strong>of</strong> uniform r<strong>and</strong>om data. Lookup performance degrades in the<br />
suffix tree variants <strong>and</strong> the enhanced suffix array when the alphabet size grows larger.<br />
This is because these methods traverse lists of children in trees, and the average lengths<br />
of these lists increase with the alphabet size. The effect hits STs, esa1 and esa2 worst, as<br />
they have poor locality in the child traversal.<br />
The performance of many of the compressed indexes depends strongly on the alphabet<br />
size. Only CCSA and FM have alphabet-independent asymptotic lookup times [34].<br />
CCSA shows an increase in performance as the alphabet size increases. This is because<br />
the average length <strong>of</strong> the match between the search pattern <strong>and</strong> the suffixes considered in<br />
Paper 7: On performance <strong>and</strong> cache effects in substring indexes<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 12: Query lookup on increasing data size, Zipf word data. (a) Queries per second<br />
for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b. (b) Queries per second for CCSA,<br />
FM, AFFM, RLFM, SSA and LZ. X-axis: data size in MiB.<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 13: Continuing Figure 12. Query lookup on increasing data size, Zipf word data.<br />
(a) Queries per L2 miss for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b. (b) Queries<br />
per L2 miss for CCSA, FM, AFFM, RLFM, SSA and LZ. X-axis: data size in MiB.<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 14: Query lookup on increasing alphabet size, uniform data. (a) Queries per second<br />
for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b. (b) Queries per second for STa,<br />
STs, wotde, skew, DS, esa2, BPR and LZ. X-axis: alphabet size (2 to 256).<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 15: Continuing Figure 14. Query lookup on increasing alphabet size, uniform data.<br />
(a) Queries per L2 miss for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b. (b) Queries<br />
per L2 miss for STa, STs, wotde, skew, DS, esa2, BPR and LZ. X-axis: alphabet size.<br />
the binary search decreases, resulting in fewer character comparisons. The implementation<br />
tested does not use the trick of keeping track of how many symbols match at the left<br />
and right borders of the binary search to reduce the expected number of character comparisons.<br />
The methods AFFM, RLFM and SSA all have the same asymptotic lookup cost<br />
[34], but SSA is faster in practice, and has the best query performance of all compressed<br />
indexes for small alphabets.<br />
4.8 Reporting Hits<br />
In the previous tests on query performance a single hit was reported for each query, but<br />
for some applications the efficiency when listing large numbers <strong>of</strong> occurrences is more<br />
relevant. Figure 16 shows the performance when reporting an increasing number <strong>of</strong> hits<br />
from 100 MB <strong>of</strong> uniform data with an alphabet size <strong>of</strong> 20. All search functions were<br />
modified to cap the number <strong>of</strong> hits, <strong>and</strong> a varying number was requested. The lengths <strong>of</strong><br />
the queries were set such that the number <strong>of</strong> hits would be at least what was requested.<br />
Although suffix trees are asymptotically optimal here, both for sibling lists and child<br />
arrays, the suffix arrays have an advantage in practice. After finding the left and right<br />
borders of a match, the values between them are read sequentially, giving excellent spatial<br />
locality and performance. At its peak, DS is more than 15 times faster than STa. The<br />
enhanced suffix array is not included, as its performance is almost identical to that of the<br />
regular suffix array. Wotde is significantly faster than the regular suffix trees in this test,<br />
due to a more compact representation and better locality. The slight drop in performance<br />
seen for the fastest methods above 3000 hits is due to the overhead of the dynamic<br />
container used to hold the hits. This could easily be avoided for the suffix arrays, as the<br />
number of hits is known after finding the left and right borders.<br />
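The reporting pattern described above can be sketched as follows (toy code, names ours, not the benchmarked implementation): two binary searches locate the borders, after which the hits form a contiguous slice of the suffix array, so their number is known before any hit is materialised.<br />

```python
def borders(text, sa, pattern):
    """Return (l, r) such that sa[l:r] holds exactly the suffixes that
    start with pattern. r - l is the hit count, known up front."""
    def search(upper):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(pattern)]
            # Lower border: first prefix >= pattern.
            # Upper border: first prefix > pattern.
            if prefix < pattern or (upper and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return search(False), search(True)

def report_hits(text, sa, pattern):
    l, r = borders(text, sa, pattern)
    return sa[l:r]   # sequential read: the locality advantage noted above
```

The slice sa[l:r] is read sequentially, which is exactly the spatial-locality advantage the text attributes to suffix arrays.<br />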
The LZ index shows the best performance among the compressed indexes, and delivers<br />
hits nearly as fast as a suffix tree with sibling lists. The other compressed structures are<br />
3–5 orders of magnitude slower than the suffix array, but in many applications, thousands<br />
of hits per second is sufficient.<br />
4.9 Tree Operations<br />
Suffix trees can be used for many purposes other than substring search. They are used in<br />
bioinformatics to find many properties of strings; Gusfield [16] lists numerous applications.<br />
Because suffix trees are costly in space, Abouelhoda et al. [1] show how to replace<br />
suffix trees with enhanced suffix arrays in all algorithms doing bottom-up, top-down or<br />
suffix link traversal in the tree. Sadakane [39] has taken this one step further <strong>and</strong> shown<br />
how to use a succinct representation <strong>of</strong> a suffix tree on top <strong>of</strong> any compressed suffix array,<br />
supporting the same operations.<br />
Table 7 shows the performance <strong>of</strong> suffix trees, enhanced suffix arrays <strong>and</strong> a compressed<br />
suffix tree (CST ) on various tree operations. The compressed suffix tree implementation<br />
by Mäkinen [43] was used. The longest common substring (LCSS) between two strings<br />
is found by building a tree for the first string, <strong>and</strong> then traversing it matching suffixes <strong>of</strong><br />
the other string, following parent-to-child <strong>and</strong> suffix links in the tree. (The construction<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 16: Requesting an increasing number of hits, log10 scale on both axes. (a) Hits<br />
reported per second and (b) hits reported per L2 miss, for STa, STs, wotde, DS, LZ,<br />
CCSA, FM, AFFM, RLFM and SSA. X-axis: requested number of hits.<br />
Table 7: Tree operations. Showing time in seconds for construction, bottom–up <strong>and</strong><br />
top–down traverse, <strong>and</strong> longest common substring search.<br />
(a) 50 MB DNA data.<br />
STa STs esa1 esa2 CST<br />
Build 50 59 76 77 2604<br />
Top–down 14 16 1.9 2.5 16<br />
Bottom–up 9.7 9.3 9.8 9.9 23<br />
LCSS 24 33 4.0 3.7 313<br />
Memory 1586 1461 1313 1313 226<br />
Memory (peak) 1961 1837 1621 1621 364<br />
(b) 50 MB protein data.<br />
STa STs esa1 esa2 CST<br />
Build 42 137 75 76 4788<br />
Top–down 11 14 1.9 2.5 17<br />
Bottom–up 6.4 6.3 7.0 7.3 24<br />
LCSS 22 76 10 6.1 721<br />
Memory 1362 1236 1331 1364 251<br />
Memory (peak) 1737 1611 1514 1513 391<br />
time for the index is excluded.) On LCSS, esa2 is much faster than esa1, because many<br />
fields are read in each “node”, <strong>and</strong> these are stored close to each other in esa2. In general,<br />
esa2 is as fast or faster than the regular suffix trees on tree traversal.<br />
The compressed suffix tree is very competitive on top–down and bottom–up traversal,<br />
where the operations on nodes are performed in constant time, but it is around 100 times<br />
slower than esa2 on LCSS, where they are not [39].<br />
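For illustration, LCSS can also be computed with a simple dynamic program. This is not the linear-time suffix-link traversal used by the benchmarked implementations, only a compact O(|a|·|b|) sketch of what the operation computes.<br />

```python
def lcss(a, b):
    """Longest common substring of a and b via dynamic programming.
    The benchmarked implementations do this in O(|a|+|b|) with suffix
    links; this quadratic version only illustrates the problem."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)           # match lengths for previous row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1    # extend the diagonal run
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```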
5 Conclusion<br />
It has been shown that performance depends strongly on locality even for data structures<br />
resident in primary memory. This should be considered when implementing indexes,<br />
as can be seen when comparing the performance of STa with STs, and esa2 with esa1.<br />
Which index structure should be chosen for which task depends on the relative importance<br />
of space usage, construction time, query lookup time and hit reporting.<br />
• Suffix trees with dynamic arrays can be as much as 20 times faster on construction<br />
than those with sibling lists for byte-sized alphabets, as seen in Figure 6a, and 10 times faster<br />
on query lookup, as seen in Figure 14. The array representation requires about 20%<br />
more space.<br />
• In applications where large numbers <strong>of</strong> hits must be reported, suffix arrays are<br />
strongly preferable to suffix trees, as they list hits 1–2 orders of magnitude faster.<br />
See Figure 16a.<br />
• For fast lookup for small numbers <strong>of</strong> hits, a suffix tree variant is the most effective<br />
index, if you have enough memory. See Figures 12a <strong>and</strong> 14a.<br />
• The lookup performance of the enhanced suffix array depends on the alphabet<br />
size, and a regular suffix array may be faster if the alphabet is sufficiently large.<br />
See Figure 14a.<br />
• The deep-shallow suffix array construction should in general be chosen over BPR,<br />
as it has a much lower space overhead. See Table 6.<br />
• Among the implementations tested here, the FM index is the most space efficient,<br />
as long as the alphabet size is not too large. See Table 5.<br />
• The LZ-index is fast on construction <strong>and</strong> listing hits. As it is more space efficient<br />
than a suffix array, it would be the structure <strong>of</strong> choice in many situations.<br />
• The wotdeager suffix tree is fast on lookup and reporting hits, but its construction<br />
easily breaks down for non-random data, as seen for some of the benchmarks in<br />
Table 3.<br />
• Tree traversal algorithms are faster with enhanced suffix arrays than suffix trees.<br />
See Table 7.<br />
In general, when designing the layout <strong>of</strong> data structures, it is important to consider the<br />
access patterns in construction <strong>and</strong> search to maximise locality. A few more computational<br />
steps during lookup will <strong>of</strong>ten be cheaper than a cache miss.<br />
Acknowledgements.<br />
The author would like to thank all the authors who have made their code publicly available,<br />
and Øystein Torbjørnsen, Magnus Lie Hetland and Tor Egge for helpful feedback<br />
on this article.<br />
References<br />
[1] M. I. Abouelhoda, S. Kurtz, <strong>and</strong> E. Ohlebusch. Replacing suffix trees with enhanced<br />
suffix arrays. J. <strong>of</strong> Discrete Algorithms, 2(1):53–86, 2004.<br />
[2] S. J. Bedathur <strong>and</strong> J. R. Haritsa. Engineering a fast online persistent suffix tree<br />
construction. In Proc. ICDE, page 720, 2004.<br />
[3] The canterbury corpus. http://corpus.canterbury.ac.nz/.<br />
[4] L. Devroye. Note on the average depth <strong>of</strong> tries. Computing, 28:367–371, 1982.<br />
[5] M. Farach. Optimal suffix tree construction with large alphabets. In Proc. FOCS,<br />
pages 137–143, 1997.<br />
[6] M. Farach-Colton, P. Ferragina, and S. M. Muthukrishnan. On the sorting-complexity<br />
of suffix tree construction. J. ACM, 47(6):987–1011, 2000.<br />
[7] P. Ferragina <strong>and</strong> G. Manzini. Opportunistic data structures with applications. In<br />
Proc. FOCS, page 390, 2000.<br />
[8] P. Ferragina <strong>and</strong> G. Manzini. Indexing compressed text. J. ACM, 52(4):552–581,<br />
2005.<br />
[9] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. An alphabet-friendly FM-index.<br />
In Proc. SPIRE, pages 150–160, 2004.<br />
[10] M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with O(1)<br />
worst case access time. J. ACM, 31(3):538–544, 1984.<br />
[11] R. Giegerich, S. Kurtz, <strong>and</strong> J. Stoye. Efficient implementation <strong>of</strong> lazy suffix trees.<br />
S<strong>of</strong>tware: Practice <strong>and</strong> Experience, 33:1035–1049, 2003.<br />
[12] S. Grabowski, V. Mäkinen, and G. Navarro. First Huffman, then Burrows-Wheeler:<br />
An alphabet-independent FM-index. In Proc. SPIRE, 2004.<br />
[13] N. <strong>Grimsmo</strong>. Dynamic indexes vs. static hierarchies for substring search. Master’s<br />
thesis, Norwegian University <strong>of</strong> Science <strong>and</strong> Technology, 2005.<br />
[14] R. Grossi, A. Gupta, <strong>and</strong> J. S. Vitter. High-order entropy-compressed text indexes.<br />
In Proc. SODA, pages 841–850, 2003.<br />
[15] R. Grossi <strong>and</strong> J. S. Vitter. Compressed suffix arrays <strong>and</strong> suffix trees with applications<br />
to text indexing <strong>and</strong> string matching (extended abstract). In Proc. STOC, pages<br />
397–406, 2000.<br />
[16] D. Gusfield. Algorithms on strings, trees, <strong>and</strong> sequences: <strong>Computer</strong> science <strong>and</strong><br />
computational biology. Cambridge University Press, 1997.<br />
[17] Intel Corporation. IA-32 Intel R○ Architecture Optimization Reference Manual, 2006.<br />
[18] J. Kärkkäinen <strong>and</strong> P. S<strong>and</strong>ers. Simple linear work suffix array construction. In Proc.<br />
ICALP, 2003.<br />
[19] J. Kärkkäinen <strong>and</strong> E. Ukkonen. Lempel-Ziv parsing <strong>and</strong> sublinear-size index structures<br />
for string matching. In Proc. WSP, pages 141–155, 1996.<br />
[20] D. K. Kim, J. S. Sim, H. Park, <strong>and</strong> K. Park. Linear-time construction <strong>of</strong> suffix<br />
arrays. In Proc. CPM, pages 186–199, 2003.<br />
[21] P. Ko <strong>and</strong> S. Aluru. Space efficient linear time construction <strong>of</strong> suffix arrays. In Proc.<br />
CPM, 2003.<br />
[22] S. Kurtz. Reducing the space requirement <strong>of</strong> suffix trees. S<strong>of</strong>tware: Practice <strong>and</strong><br />
Experience, 29(13):1149–1171, 1999.<br />
[23] V. Mäkinen. Compact suffix array. In CPM, pages 305–319, 2000.<br />
[24] V. Mäkinen <strong>and</strong> G. Navarro. Compressed compact suffix arrays. In CPM, pages<br />
420–433, 2004.<br />
[25] V. Mäkinen <strong>and</strong> G. Navarro. Run-length FM-index. In Proc. DIMACS, pages 17–19,<br />
2004.<br />
[26] V. Mäkinen <strong>and</strong> G. Navarro. Succinct suffix arrays based on run-length encoding.<br />
In Proc. CPM, 2005.<br />
[27] U. Manber <strong>and</strong> G. Myers. Suffix arrays: A new method for on-line string searches.<br />
SIAM J. on Computing, 22(5):935–948, 1993.<br />
[28] G. Manzini <strong>and</strong> P. Ferragina. Engineering a lightweight suffix array construction<br />
algorithm. Algorithmica, 40(1):33–50, 2004.<br />
[29] E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM,<br />
23(2):262–272, 1976.<br />
[30] Manzini <strong>and</strong> Ferragina’s test collection.<br />
http://www.mfn.unipmn.it/~manzini/lightweight/.<br />
[31] S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc.<br />
SODA, pages 657–666, 2002.<br />
[32] G. Navarro. The LZ-index: A text index based on the Ziv-Lempel trie. Technical<br />
Report TR/DCC-2003-1, Dept. <strong>of</strong> <strong>Computer</strong> Science, Univ. <strong>of</strong> Chile, 2003.<br />
[33] G. Navarro. Indexing text using the Ziv-Lempel trie. J. of Discrete Algorithms,<br />
2(1):87–114, 2004.<br />
[34] G. Navarro <strong>and</strong> V. Mäkinen. Compressed full-text indexes. Technical Report<br />
TR/DCC-2006-6, Dept. <strong>of</strong> Comp. Sci., U. <strong>of</strong> Chile, 2006.<br />
[35] M. H. Overmars <strong>and</strong> J. van Leeuwen. Some principles for dynamizing decomposable<br />
searching problems. Technical Report RUU-CS-80-1, Rijksuniversitet Utrecht, 1980.<br />
[36] Performance Application Programming Interface.<br />
http://icl.cs.utk.edu/papi/.<br />
[37] Perfctr Hardware performance counters.<br />
http://user.it.uu.se/~mikpe/linux/perfctr/.<br />
[38] K. Sadakane. New text indexing functionalities <strong>of</strong> the compressed suffix arrays. J.<br />
<strong>of</strong> Algorithms, 48(2):294–313, 2003.<br />
[39] K. Sadakane. Compressed suffix trees with full functionality. Theory <strong>of</strong> Computing<br />
Systems (Online), 2007.<br />
[40] K. Schürmann <strong>and</strong> J. Stoye. An incomplex algorithm for fast suffix array construction.<br />
In Proc. ALENEX, 2005.<br />
[41] Y. Tian, S. Tata, A. Hankins, <strong>and</strong> M. Patel. Practical methods for constructing<br />
suffix trees. VLDB J., 14(3):281–299, 2005.<br />
[42] E. Ukkonen. On-line construction <strong>of</strong> suffix trees. Algorithmica, 14(5):249–260, 1995.<br />
[43] Niko Välimäki, Wolfgang Gerlach, Kashyap Dixit, <strong>and</strong> Veli Mäkinen. Engineering a<br />
compressed suffix tree implementation. In Proc. WEA, pages 217–228, 2007.<br />
[44] P. Weiner. Linear pattern matching algorithms. In Proc. SWAT, pages 1–11, 1973.<br />
[45] J. Ziv <strong>and</strong> A. Lempel. A universal algorithm for sequential data compression. IEEE<br />
Transactions on <strong>Information</strong> Theory, 23(3):337–343, 1977.<br />
Paper 8<br />
Truls A. Bjørklund, <strong>Nils</strong> <strong>Grimsmo</strong>, Johannes Gehrke <strong>and</strong> Øystein Torbjørnsen<br />
Inverted Indexes vs. Bitmap Indexes in Decision Support Systems<br />
Proceedings of the 18th ACM Conference on Information and Knowledge Management<br />
(CIKM 2009)<br />
Abstract Bitmap indexes are widely used in Decision Support Systems (DSSs) to improve<br />
query performance. In this paper, we evaluate the use <strong>of</strong> compressed inverted<br />
indexes with adapted query processing strategies from <strong>Information</strong> Retrieval as an alternative.<br />
In a thorough experimental evaluation on both synthetic data <strong>and</strong> data from the<br />
Star Schema Benchmark, we show that inverted indexes are more compact than bitmap<br />
indexes in almost all cases. This compactness combined with efficient query processing<br />
strategies results in inverted indexes outperforming bitmap indexes for most queries, <strong>of</strong>ten<br />
significantly.<br />
My role as an author<br />
This paper and the ideas it contains were mostly the work of Bjørklund. I took part in<br />
some brainstorming sessions, discussions about the implementation, and the writing process.<br />
A technical contribution of mine was to configure the FastBit system we compared against<br />
so that it did not crash for large queries.<br />
Retrospective view<br />
Bjørklund <strong>and</strong> I share supervisors, started on our PhDs roughly at the same time, <strong>and</strong> we<br />
were both given tasks related to indexing <strong>and</strong> search. Common denominators have been<br />
columnar storage and multi-way operators. In retrospect it could have been advantageous<br />
if we had been given even closer topics and had shared more of the implementation work<br />
in our research projects.<br />
Paper 8: Inverted Indexes vs. Bitmap Indexes in Decision Support Systems<br />
1 Introduction<br />
Decision Support Systems (DSSs) support queries over large amounts <strong>of</strong> structured data,<br />
<strong>and</strong> bitmap indexes are <strong>of</strong>ten used to improve the efficiency <strong>of</strong> important query classes<br />
involving selection predicates <strong>and</strong> joins [16, 17].<br />
Bitmap indexes were formerly also used in <strong>Information</strong> Retrieval (IR), but are today<br />
mainly replaced by inverted indexes. Part <strong>of</strong> the reason why inverted indexes gained<br />
popularity in IR was that they easily support integrating new fields required to support<br />
ranked queries. The switch from bitmap indexes to inverted indexes lead to a flood <strong>of</strong><br />
research on efficient inverted indexes [25, 30, 6, 13, 24, 3, 31, 29], <strong>and</strong> inverted indexes<br />
are now the preferred indexing method in search engines [30].<br />
In this paper, we are asking (<strong>and</strong> answering) the question: What are the trade-<strong>of</strong>fs<br />
<strong>of</strong> using inverted indexes in DSSs, <strong>and</strong> should they be considered a serious alternative<br />
to bitmap indexes? The main contributions <strong>of</strong> this paper are (1) the study <strong>of</strong> how to<br />
use <strong>and</strong> implement inverted indexes in DSSs, <strong>and</strong> (2) a thorough performance evaluation<br />
that compares inverted indexes <strong>and</strong> bitmap indexes in DSSs. In particular, we compare<br />
inverted indexes with FastBit, 1 a state-<strong>of</strong>-the-art bitmap query processing <strong>and</strong> indexing<br />
system based on WAH-compressed bitmap indexes [27].<br />
2 Background<br />
A st<strong>and</strong>ard bitmap index has one bitmap per distinct value for the indexed attribute, with<br />
1’s at positions for tuples with the represented value, <strong>and</strong> 0’s elsewhere. Bitmaps can be<br />
combined using bitwise operators to answer complex boolean queries. For attributes with<br />
few distinct values, bitmap indexes are relatively compact, but their space usage increases<br />
linearly with the cardinality. One approach to limit the space usage <strong>of</strong> bitmap indexes<br />
for high-cardinality attributes is compression. WAH [27] is one of several compression<br />
schemes that have been introduced. Although some schemes give more compact indexes, WAH<br />
supports efficient query processing. This combined with the fact that FastBit is openly<br />
available motivates the use <strong>of</strong> WAH-compressed bitmap indexes in the experiments in this<br />
paper.<br />
WAH-compression is a form <strong>of</strong> word-aligned run-length encoding for bitmap indexes,<br />
where consecutive words containing only 0’s or 1’s are stored as fill words, <strong>and</strong> other words<br />
are stored literally [26, 27]. WAH-compressed bitmaps for high cardinality attributes are<br />
relatively compact because most words contain only 0’s.<br />
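As a toy sketch of the idea (not FastBit's implementation, and without WAH compression), a bitmap index keeps one bitmap per distinct value and answers boolean predicates with bitwise operations; Python integers stand in for arbitrary-length bitmaps here, and all names are ours.<br />

```python
def build_bitmap_index(column):
    """One bitmap per distinct value; bit t is set iff tuple t has the value."""
    bitmaps = {}
    for tid, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << tid)
    return bitmaps

def tids(bitmap):
    """Decode a bitmap back to a sorted list of tuple identifiers."""
    out, tid = [], 0
    while bitmap:
        if bitmap & 1:
            out.append(tid)
        bitmap >>= 1
        tid += 1
    return out
```

A query such as "color = red AND size = L" is then answered as tids(color_index["red"] & size_index["L"]); OR and NOT map to the corresponding bitwise operators.<br />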
In IR, inverted indexes consist <strong>of</strong> a search structure for all searchable words called a<br />
dictionary, <strong>and</strong> lists <strong>of</strong> references to documents containing each searchable word, called<br />
inverted lists. An inverted index for an attribute in a DSS consists <strong>of</strong> a dictionary <strong>of</strong> the<br />
distinct values in the attribute, with pointers to inverted lists that reference tuples with<br />
the given value through tuple identifiers (TIDs). To reduce both space usage <strong>and</strong> the<br />
I/O requirements in query processing, the inverted lists are <strong>of</strong>ten compressed by storing<br />
the deltas between the sorted references [30]. This approach makes small values more<br />
1 http://sdm.lbl.gov/fastbit/<br />
239
APPENDIX A. OTHER PAPERS<br />
likely, <strong>and</strong> several compression schemes that represent small values compactly have been<br />
suggested. According to a recent study, PForDelta [31] is currently the most efficient<br />
method [29], and is therefore used in this paper. PForDelta stores deltas in a word-aligned<br />
version of bit packing, which also includes exceptions to enable storing larger<br />
values than the chosen number <strong>of</strong> bits allows [31].<br />
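The delta-compression principle can be sketched with a simpler scheme than PForDelta: encode the gaps between sorted TIDs as variable-length bytes. The function names are ours; PForDelta itself bit-packs the deltas with a fixed width plus exceptions, but the point is the same — sorting makes the deltas small.<br />

```python
def delta_varint_encode(tid_list):
    """Encode a sorted TID list as deltas in LEB128-style varints."""
    out, prev = bytearray(), 0
    for t in tid_list:
        d = t - prev          # small because the list is sorted
        prev = t
        while d >= 0x80:
            out.append((d & 0x7F) | 0x80)   # 7 payload bits, continuation bit
            d >>= 7
        out.append(d)
    return bytes(out)

def delta_varint_decode(data):
    tid_list, cur, d, shift = [], 0, 0, 0
    for byte in data:
        d |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += d              # undo the delta encoding
            tid_list.append(cur)
            d, shift = 0, 0
    return tid_list
```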
Two overall query processing approaches exist in search engines. Document-at-atime<br />
strategies avoid materializing intermediate results by processing all inverted lists<br />
in a query in parallel [6, 24], <strong>and</strong> are well suited for boolean query processing. They<br />
can be combined with skipping, which is used in search engines to avoid reading <strong>and</strong><br />
decompressing parts <strong>of</strong> inverted lists that are not required to process a query [13]. We<br />
give a brief description <strong>of</strong> how we use these ideas in the query processing in this paper in<br />
the next section.<br />
3 Query Processing<br />
Recall that we use document-at-a-time strategies that avoid materializing intermediate<br />
results to process inverted index queries in this paper. We support three operators which<br />
can be combined to answer complex queries. They all support skipping to the next result<br />
with a given minimum TID value, in addition to st<strong>and</strong>ard Volcano-style iteration [9].<br />
The SCAN operator can iterate through an inverted list. To support skipping, each kth<br />
TID in each inverted list is stored in an external list. The external list is kept in memory<br />
during scans, <strong>and</strong> supports binary searches to find the correct part <strong>of</strong> the inverted list to<br />
process when skipping.<br />
The OR operator provides an iterator interface over the sorted merge <strong>of</strong> its multiple<br />
input iterators. The iterators are organized in a priority-queue based on a heap, which is<br />
maintained to make sure that the input with the smallest next TID is at the top. Skipping<br />
in the OR operator is based on a breadth-first search in the heap. A skip may not result<br />
in an actual skip for a given input iterator. If so, we know that neither <strong>of</strong> its children<br />
in the heap can do any skipping either, <strong>and</strong> we therefore avoid testing. After the search,<br />
we make sure that only the part <strong>of</strong> the heap involving iterators that actually skipped is<br />
maintained. This approach is reasonably efficient when actually performing skips in both<br />
large <strong>and</strong> small fractions <strong>of</strong> the set <strong>of</strong> input iterators.<br />
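Ignoring the skipping optimisation, the OR operator's sorted merge can be sketched as a standard heap-based k-way merge over materialised lists (a simplified illustration, not the paper's iterator-based implementation):<br />

```python
import heapq

def or_merge(lists):
    """Merge sorted TID lists into one sorted, duplicate-free stream,
    mirroring the OR operator's heap of input iterators."""
    heap = []
    for idx, lst in enumerate(lists):
        if lst:
            heapq.heappush(heap, (lst[0], idx, 0))  # (tid, list id, position)
    out = []
    while heap:
        tid, idx, pos = heapq.heappop(heap)
        if not out or out[-1] != tid:   # suppress duplicate TIDs
            out.append(tid)
        if pos + 1 < len(lists[idx]):   # advance this input and re-heap
            heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
    return out
```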
The AND operator expects that the input iterators are sorted in ascending order according<br />
to the expected number <strong>of</strong> returned results. To find the next result, we start with<br />
a c<strong>and</strong>idate from the iterator with the fewest number <strong>of</strong> expected results. We then try to<br />
skip to the c<strong>and</strong>idate in the other input iterators, re-starting with a new c<strong>and</strong>idate if the<br />
current c<strong>and</strong>idate is absent in one iterator. A c<strong>and</strong>idate found in all inputs is returned<br />
as a result. To support skipping, we start with the value to skip to as the c<strong>and</strong>idate <strong>and</strong><br />
proceed as in normal iteration.<br />
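A sketch of the AND strategy over materialised lists (simplified: binary search over whole lists stands in for the external skip lists described above, and all names are ours):<br />

```python
from bisect import bisect_left

def and_intersect(lists):
    """Intersect sorted TID lists, driving from the shortest list and
    skipping into the others with binary search, in the spirit of the
    AND operator."""
    lists = sorted(lists, key=len)      # fewest expected results first
    result = []
    for cand in lists[0]:
        ok = True
        for other in lists[1:]:
            pos = bisect_left(other, cand)   # skip to the first TID >= cand
            if pos == len(other) or other[pos] != cand:
                ok = False                   # candidate absent: restart
                break
        if ok:
            result.append(cand)
    return result
```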
[Figure omitted; only the caption is recoverable.]<br />
Figure 1: Index sizes. X-axis gives attribute cardinality (2 up to 10^7), y-axis size in MB.<br />
Two panels: uniform data and Zipf data with k = 1.5. Series: uncompressed attribute,<br />
FastBit, InvInd, and InvInd w/skip.<br />
4 Experiments<br />
To investigate the trade-<strong>of</strong>fs between inverted indexes <strong>and</strong> bitmaps, we experiment with<br />
FastBit <strong>and</strong> our inverted index solutions with <strong>and</strong> without support for skipping. We<br />
present results from experiments with synthetic data <strong>and</strong> data from the Star Schema<br />
Benchmark (SSB) [14].<br />
All experiments are run on a quad-core Intel Xeon CPU at 3.2 GHz with 16 GB main
memory. All indexes are stored on disk, but queries are run warm by performing 10
runs during one invocation of the system and reporting the average of the last 8, thus
measuring the in-memory query processing performance. We run FastBit version 1.0.5
(implemented in C++), with extra stack space to enable processing queries with many
operands. Our approaches are implemented in Java (version 1.6). We use additional
warm-up for our system to enable run-time optimizations in the Java virtual machine
that reduce variance between runs. Additional warm-up did not change the performance
of FastBit.
4.1 Synthetic Data<br />
To experiment with synthetic data, we generate two tables. Both tables have 10 million
tuples and 8 indexed attributes with maximum cardinalities ranging from 2 through all powers
of 10 up to 10 million. The attributes in the first table follow a uniform distribution, while
a Zipf distribution (with k = 1.5) is used in the other.
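A scaled-down sketch of this data generation (the function name and the use of Python's `random.choices` are ours; the paper does not specify its generator):

```python
import random

def make_table(num_tuples, cardinalities, zipf_k=None, seed=0):
    """Scaled-down sketch of the synthetic tables: one column per
    maximum cardinality, values drawn uniformly or from a Zipf
    distribution with exponent zipf_k."""
    rng = random.Random(seed)
    table = {}
    for card in cardinalities:
        if zipf_k is None:
            weights = [1.0] * card                        # uniform
        else:
            weights = [1.0 / rank ** zipf_k for rank in range(1, card + 1)]
        table[card] = rng.choices(range(card), weights=weights, k=num_tuples)
    return table

# 10,000 tuples instead of 10 million; cardinalities 2, 10 and 100.
uniform = make_table(10_000, [2, 10, 100])
zipf = make_table(10_000, [2, 10, 100], zipf_k=1.5)
```

With k = 1.5 the smallest ranks dominate, so for high maximum cardinalities many values never occur at all; this is the effect behind the smaller Zipf index sizes discussed in Section 4.1.1.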
APPENDIX A. OTHER PAPERS<br />
4.1.1 Index Size<br />
The sizes of the uncompressed attributes in the synthetic tables and their indexes are
shown in Figure 1. When using standard PForDelta compression on the attribute with
cardinality 2 in the table with uniform data, each value is represented with 4 bits in
the most compact index. The reason a lower number of bits results in a larger
index is that the implementation of PForDelta may introduce artificial extra exceptions
when using a small number of bits per value [31]. Bitmap indexes are known to be
compact when the cardinality is 2, and FastBit outperforms our approaches in this case.
PForDelta results in compact indexes for higher-cardinality attributes, and most of the
space usage for the highest-cardinality attribute comes from the dictionary (62 of
91 MB). The WAH-compressed bitmaps for the same attribute can in the worst case contain
nearly 3 computer words per tuple, resulting in a space usage of over 228 MB on a 64-bit
architecture. The actual results are significantly better, but compressed inverted indexes
are clearly more compact.
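The worst-case figure can be checked by direct arithmetic; this reconstruction of the bound assumes MB here means MiB (2^20 bytes):

```python
num_tuples = 10_000_000      # tuples in each synthetic table
words_per_tuple = 3          # worst-case WAH words per tuple
bytes_per_word = 8           # computer word on a 64-bit architecture

worst_case_mb = num_tuples * words_per_tuple * bytes_per_word / 2**20
print(f"{worst_case_mb:.1f} MB")  # 228.9 MB, i.e. "over 228 MB"
```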
Indexes for Zipf-distributed attributes are more compact than for uniformly distributed
attributes with the same maximum cardinality, because skewed distributions make it less
likely that the actual cardinality equals the maximum.
4.1.2 Query Processing
To experiment with query processing performance, we test four different query types,
all of which vary the attribute on which there is a single value predicate:
1. Query type SCAN: finds all tuples with value 0 for a varied attribute.
2. Query type skewed AND: finds all tuples having value 0 for the attribute with
cardinality 10, in addition to 0 in one other varied attribute.
3. Query type OR: finds all tuples with values in the lower half range for a varied
attribute.
4. Query type AND-OR: finds all tuples with value in the lower half range for the
attribute with cardinality 100,000, and value 0 for another varied attribute.
All queries compute the sum of the primary keys of the matching tuples, to ensure that
the output from the index is used to perform table look-ups. In the table with uniform
distributions, there were no tuples with value 0 for the highest-cardinality attribute, so
all single-value predicates on this attribute were changed to require the value 2.
The results are shown in Figure 2.
Compared to bitmaps, decompressed inverted lists are well suited for looking up other
attributes for the qualifying tuples, a factor contributing to faster scans for uniform data.
The difference in index size also seems to have an impact. All scans are relatively slow for
Zipfian data because we always search for the most common attribute value in a skewed
distribution, except for the highest-cardinality attribute, as noted above.
Skewed AND favors methods capable of taking advantage of the different densities of
the operands. Inverted indexes with skipping are therefore efficient for uniform data,
but introduce overhead for Zipfian data, because both operands are dense when the most
common values in skewed distributions are accessed. FastBit performs well on dense
Figure 2: Results from running queries on generated tables (uniform data in row (a), Zipf data in row (b)), showing query time in seconds for varying cardinalities. Panels: SCAN query (logscale), skewed AND, OR query, AND-OR query (logscale). Series: FastBit, InvInd, InvInd w/skip.
Figure 3: Size of indexes (MB) for the foreign key columns custkey, partkey and suppkey in SSB. Series: uncompressed attribute, FastBit, InvInd, InvInd w/skip.
operands, both because it can combine multiple logical TIDs using one CPU instruction,
and because it applies the operator before extracting the tuple references. Because no
input is smaller than the output of an AND operator, FastBit decodes fewer references
than the inverted indexes.
The multi-way OR operators in our solution scale better than FastBit with respect to
the number of inputs, for both tables.
The idea of skipping in OR operators is ideally suited for query type AND-OR, but it is
only useful when the other operands to the AND return data that enables reasonable skip
lengths, which occurs for high-cardinality attributes with uniform distributions.
4.2 Star Schema Benchmark<br />
Star schemas represent a best practice for organizing data in decision support systems,
and are characterized by a central fact table that references several smaller dimension
tables. Typical queries on such schemas involve joins of the fact table with relevant
dimension tables, called star joins. Bitmap indexes can be constructed over the foreign
keys in the fact table to speed up such joins, and are then called join indexes [16, 17].
We experiment with using inverted indexes as an alternative to bitmap indexes for this
purpose in the Star Schema Benchmark (SSB) [14]. We use Scale Factor 1 and precalculate
the foreign keys that match the queries in SSB; they are submitted as part of
the query to the tested systems. We also avoid calculating the exact answer, and rather
let all queries return the sum of an attribute of the fact table. This isolates the effects
of the indexes while making sure the returned results are suitable for further look-ups in
the fact table. There are four dimension tables in SSB, but we avoid constructing join
indexes for the Date table because FastBit is unable to process the queries involving all
tables without a very large stack.
4.2.1 Index Size<br />
Figure 3 shows the join index sizes in both systems. FastBit has significantly larger indexes
because the foreign keys have relatively high cardinalities. The attribute custkey is partly
sorted, resulting in longer runs in the WAH-compressed indexes, and the relative difference
between FastBit and inverted indexes is therefore smaller in that case.
Figure 4: Query processing time in seconds for SSB queries Q2.1, Q2.2, Q2.3, Q3.1, Q3.2, Q3.3/3.4, Q4.1/4.2 and Q4.3. Series: FastBit, InvInd, InvInd w/skip.
4.2.2 Join Processing<br />
The query processing results for SSB are given in Figure 4. Within each set of queries, the
predicates on the dimension tables become increasingly selective, making FastBit perform
better relative to inverted indexes because the OR operators that combine the tuples
representing each qualifying foreign key have fewer inputs. Query 2.3 has skewed input to
an AND operator, making skipping important for performance. The OR operator providing
dense input to the AND is over the attribute with the lowest cardinality, contributing to
the smaller performance difference between FastBit and inverted indexes for this query.
5 Related Work<br />
Several alternatives to the compression schemes discussed in this paper have been suggested,
both for bitmaps [4, 10, 5, 15, 21] and for inverted indexes [25, 20, 1]. Experiments
have shown that the query processing efficiency of WAH remains attractive, even though
there are approaches resulting in smaller indexes. WAH is known to result in smaller indexes
when the table is sorted on the indexed attribute [19, 12]. Due to space restrictions,
we do not experiment with sorted tables in this paper. Experiments with compression in
inverted indexes in IR have shown that PForDelta is currently the most efficient technique
[29], and further improvements to the technique have been suggested recently [28].
As an alternative to compression, there are several approaches that reduce the number
of bitmaps in the index [17, 7, 8, 11, 22]. Strategies for operating on many bitmaps by
processing two at a time have been explored for WAH-compressed bitmap indexes [26],
and a recent paper suggests using multi-way operators for bitmaps, but the idea is not
tested [12]. Query processing approaches for inverted indexes in IR have focused on
term-at-a-time strategies in addition to the document-at-a-time approach used in this paper
[6, 24, 18, 2, 3, 23].
6 Conclusions<br />
In this paper, we have evaluated the applicability of compressed inverted indexes as an
alternative to bitmap indexes in DSSs. Inverted indexes are generally significantly more
space efficient. The only case where WAH-compressed bitmaps are clearly more compact
is when the cardinality of the indexed attribute is very low. FastBit performs well on
simple queries with dense operands, but inverted indexes are better in other cases, often
significantly.
Acknowledgments: This material is based upon work supported by New York State
Science Technology and Academic Research under agreement number C050061, by Grant
NFR 162349, by the National Science Foundation under Grants 0534404 and 0627680,
and by the iAd Project funded by the Research Council of Norway. Any opinions, findings
and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect the views of NYSTAR, the National Science Foundation or
the Research Council of Norway.
References<br />
[1] V. N. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. ADC, 2004.
[2] V. N. Anh and A. Moffat. Simplified similarity scoring using term ranks. In Proc. SIGIR, 2005.
[3] V. N. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proc. SIGIR, 2006.
[4] G. Antoshenkov. Byte-aligned bitmap compression. In Proc. DCC, 1995.
[5] T. Apaydin, G. Canahuate, H. Ferhatosmanoglu, and A. S. Tosun. Approximate encoding for direct access and query processing over compressed bitmaps. In Proc. VLDB, 2006.
[6] E. W. Brown. Fast evaluation of structured queries for information retrieval. In Proc. SIGIR, 1995.
[7] C.-Y. Chan and Y. E. Ioannidis. Bitmap index design and evaluation. SIGMOD Rec., 27(2), 1998.
[8] C.-Y. Chan and Y. E. Ioannidis. An efficient bitmap encoding scheme for selection queries. In Proc. SIGMOD, 1999.
[9] G. Graefe. Volcano: an extensible and parallel query evaluation system. IEEE Trans. on Knowl. and Data Eng., 6(1), 1994.
[10] T. Johnson. Performance measurements of compressed bitmap indices. In Proc. VLDB, 1999.
[11] N. Koudas. Space efficient bitmap indexing. In Proc. CIKM, 2000.
[12] D. Lemire, O. Kaser, and K. Aouiche. Sorting improves word-aligned bitmap indexes. CoRR, abs/0901.3751, 2009.
[13] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4), 1996.
[14] E. O'Neil, P. O'Neil, and X. Chen. The Star Schema Benchmark. http://www.cs.umb.edu/~poneil/StarSchemaB.PDF.
[15] E. O'Neil, P. O'Neil, and K. Wu. Bitmap index design choices and their performance implications. In Proc. IDEAS, 2007.
[16] P. O'Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Rec., 24(3), 1995.
[17] P. O'Neil and D. Quass. Improved query performance with variant indexes. In Proc. SIGMOD, 1997.
[18] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. J. Am. Soc. Inf. Sci., 47(10), 1996.
[19] A. Pinar, T. Tao, and H. Ferhatosmanoglu. Compressing bitmap indices by data reorganization. In Proc. ICDE, 2005.
[20] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.
[21] M. Stabno and R. Wrembel. RLH: Bitmap compression technique based on run-length and Huffman encoding. Information Systems, 2008.
[22] K. Stockinger, K. Wu, and A. Shoshani. Evaluation strategies for bitmap indices with binning. In Database and Expert Systems Applications, 2004.
[23] T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In Proc. SIGIR, 2007.
[24] T. Strohman, H. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proc. SIGIR, 2005.
[25] I. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Academic Press, 1999.
[26] K. Wu, E. Otoo, and A. Shoshani. On the performance of bitmap indices for high cardinality attributes. In Proc. VLDB, 2004.
[27] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. ACM Trans. Database Syst., 31(1), 2006.
[28] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW, 2009.
[29] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, 2008.
[30] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.
[31] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, 2006.
Paper 9<br />
Truls A. Bjørklund, Michaela Götz, Johannes Gehrke and Nils Grimsmo
Search Your Friends And Not Your Enemies
Submitted to Proceedings of the VLDB Endowment (VLDB 2011)
Abstract: More and more data is accumulated inside social networks. Keyword search
provides a simple interface for exploring this content. However, a lot of the content is
private, and a search system must enforce the privacy settings of the social network.
In this paper, we present a workload-aware keyword search system with access control.
We develop a range of cost models that vary in sophistication and accuracy. These
cost models provide input to an optimization algorithm that selects the ideal solution
for a given workload. With our cost models we find designs that outperform previous
approaches by up to a factor of 3. We also address the query processing strategy in
the system, and develop a novel union operator called HeapUnion that speeds up query
processing by a factor of between 1.1 and 2.3 compared to the best previous solution. We
believe that both our cost models and our novel union operator will be of independent
interest for future work.
My role as an author<br />
Bjørklund was the main contributor to this paper. I supported him during debugging
sessions for the cost models, offered a shoulder to cry on when they had to be modified,
and helped during the writing process.
Paper 9: Search Your Friends And Not Your Enemies<br />
1 Introduction<br />
More and more data is accumulated inside social networks, where users tweet, update their
status, chat, post photos, and comment on each other's lives. From a user's perspective, a
lot of her content is private and should not be accessible to everybody. To limit arbitrary
information flow, social networks enable users to adjust their privacy settings, e.g., to
ensure that only friends can see the content they have posted. Keyword search provides
a simple interface for the exploration of content in social networks. It is the challenge
of enabling search over the content while enforcing privacy settings that motivates the
research in this paper.
Search over collections of documents — without taking social networks into account
— is a well-studied problem [29], and a search query consisting of keywords is usually
answered using an inverted index [31]. An inverted index consists of a look-up structure
over all unique terms in the indexed document collection, and a posting list for each term.
The posting list contains the document identifiers (IDs) of all documents that contain the
term.
Search in a social network is more challenging because each user sees a unique subset
of the document collection. We view a social network as a directed graph where each node
represents a user and a directed edge represents a (one-way) friendship. We assume in this
paper that the social network implements the following privacy setting: a user has access
to the documents authored by herself and to the documents authored by her friends. We
rank the results of a query according to recency, with the most recent documents ranked
highest; then we retrieve the top-k results.
Figure 1(a) shows an example of a social network with four users, where User 4 is
friends with Users 1 and 3, and User 2 is friends with Users 1 and 4. All the users have
posted documents, and each document has an ID which is shown in its top right corner.
User 3 has posted Documents 2 and 5, and in our model she can search across Documents
2, 5, and 7.
In previous work we developed a conceptual framework for solutions to this problem
[8]. Since we need to both search and enforce access control, we have characterized
solutions along two axes: the index axis and the access axis. The index axis captures the
idea that instead of creating one single inverted index over all the content in the social
network, we may create several inverted indexes, each containing a subset of the content.
A set of inverted indexes and their content is called an index design. The access axis
mirrors the index axis and describes the meta-data used to filter out inaccessible results;
the meta-data is organized into author-lists. For the purposes of this paper, an author-list
contains the IDs of all documents authored by a set of users. An access design describes
a set of author-lists.
Previous work experimented with a few extreme points in this solution space; it showed
that two of the most promising solutions both use an index design with a single index
containing all users' documents, while the access designs of the two approaches differ.
The first approach is called the user design and has one author-list per user that contains
the document IDs posted by that particular user. The second approach is called the friends
design; it also has one author-list per user, but this author-list contains the documents
Figure 1: Social Network and Basic Designs. (a) Example social network with posted documents. (b) User design author-lists: User 1: 6 4 1; User 2: 3; User 3: 5 2; User 4: 7. (c) Friends design author-lists: User 1: 7 6 4 3 1; User 2: 7 6 4 3 1; User 3: 7 5 2; User 4: 7 6 5 4 2 1.
posted by the user and all of her friends. The author-lists for the user and friends designs
for our example from Figure 1(a) are shown in Figures 1(b) and 1(c), respectively. In
both of these designs, a keyword query from a user is processed in the single inverted
index. To enforce access control, the results from the index are intersected with a set
of author-lists covering all friends of the user. In the friends design, all friends of the
user are represented in the author-list for the user, whereas in the user design, we need
to calculate the union of the author-lists for all friends.
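The two basic designs can be sketched on the example of Figure 1. Note that the friend sets below are inferred from the author-lists shown in Figures 1(b) and 1(c), since the text only lists some of the friendships explicitly:

```python
# Friend sets F[u] (each user is included in her own set) and authored
# document IDs per user; a higher ID means more recently posted.
F = {1: {1, 2, 4}, 2: {1, 2, 4}, 3: {3, 4}, 4: {1, 3, 4}}
authored = {1: [1, 4, 6], 2: [3], 3: [2, 5], 4: [7]}

def user_design(authored):
    """User design: one author-list per user, her own documents only."""
    return {u: sorted(docs, reverse=True) for u, docs in authored.items()}

def friends_design(authored, F):
    """Friends design: one author-list per user, covering all her friends."""
    return {u: sorted((d for v in F[u] for d in authored[v]), reverse=True)
            for u in authored}

friends_design(authored, F)[4]  # -> [7, 6, 5, 4, 2, 1], as in Figure 1(c)
```

The trade-off described below is visible here: posting a document touches one list in the user design but up to |O_u| lists in the friends design, while querying touches one list in the friends design but up to |F_u| lists in the user design.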
Because of the promising performance of the user and friends designs in previous work [8],
we have chosen them as the basis for our solutions. Note that the two designs have very
different trade-offs. When a user posts a new document, only a single author-list must
be updated in the user design. In the friends design, however, the author-lists of all
users that are friends with the posting user must be updated. During queries, only one
author-list is accessed with the friends design, whereas one author-list for each friend of
the user must be accessed with the user design.
In this paper we bridge the gap between the two extremes to enable efficient search
in a social network with access control. We propose an intermediate strategy that combines
the best of both the friends design and the user design: low search costs and low
update costs. Our solution starts with the user design and then judiciously adds selected
additional author-lists to improve query performance.<br />
The remainder of the paper explains our solution in detail. In Section 2, we introduce
notation and define the problem we address. Then, we develop efficient query processing
strategies based on a novel union operator (Section 3). Furthermore, we develop a set of
cost models with various degrees of sophistication and accuracy, and use them as the basis
for workload optimization to find the ideal access design for a particular workload (Section 4).
In a thorough experimental evaluation, we demonstrate the efficiency of our new union
operator, validate the accuracy of our cost models, and compare access designs resulting
from workload optimization based on different cost models to previous work (Section 5).
Related work is discussed in Section 6, and we conclude in Section 7. We believe that both
our novel union operator and our cost models will be of independent interest for future
work.
2 Problem Definition<br />
In this section, we introduce the notation used to define the problem addressed in this
paper.
Search Data and Query Model. We view a social network as a directed graph
〈V, E〉, where each node u ∈ V represents a user. There is an edge 〈u, v〉 ∈ E if user v is
a friend of user u, denoted v ∈ F_u; alternatively, u is one of v's followers, denoted
u ∈ O_v. We always have u ∈ F_u and u ∈ O_u.
We consider workloads that consist of two different operations: posting new documents
and issuing queries. A new document, which we also refer to as an update, consists of a set
of terms. We call the user who posted document d the author of d, and we also
say that the user authored d. Each new document is assigned a unique document ID;
more recently posted documents have higher IDs. Let n_u denote the number of documents
authored by user u, and let N = ∑_u n_u denote the total number of documents in the
system.
A query submitted by a user u consists of a set of keywords. As mentioned in the
previous section, we assume that only documents authored by users in F_u are accessible
to u, and that the ranking is based on recency. Thus the results of a keyword query are
the k documents that (1) contain the query keywords, (2) are authored by a user in F_u,
and (3) have the k highest document IDs among the documents that satisfy (1) and (2).
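The definition above can be stated as a naive reference implementation (a sketch of the semantics, not the system's index-based evaluation; the toy documents are invented for illustration):

```python
def query_results(u, keywords, F, authored, doc_terms, k):
    """Reference semantics of a top-k query by user u: documents that
    contain all query keywords, are authored by a user in F[u], and
    have the k highest IDs (recency ranking)."""
    accessible = {d for v in F[u] for d in authored[v]}
    matches = [d for d in accessible if keywords <= doc_terms[d]]
    return sorted(matches, reverse=True)[:k]

# Toy data (invented for illustration):
F = {3: {3, 4}}
authored = {3: [2, 5], 4: [7]}
doc_terms = {2: {"a", "b"}, 5: {"b"}, 7: {"a", "c"}}

query_results(3, {"a"}, F, authored, doc_terms, k=1)  # -> [7]
```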
Beyond User and Friends Designs. The relative merits of the user and friends
designs motivate the solution in this paper. Recall from the previous section that in the
user design, updates are efficient because only u's author-list is updated when u posts
a document; queries, however, need to access the author-lists of all users in F_u. In the
friends design, queries are more efficient because only the author-list of u is accessed.
Updates, however, need to change the author-lists of all users in O_u.
In our approach, we start out with the user design. In addition, we add one author-list
l_u for each user u; l_u contains the IDs of all documents authored by a selected set of users
L_u ⊆ F_u. When user u submits a query, there is no need to access specific author-lists for
users in L_u, and queries therefore become more efficient as more users are represented in
L_u. On the other hand, representing more users in L_u leads to higher update costs. We
therefore determine the contents of L_u (and thus l_u) based on the workload characteristics.
System Model. Keyword search over collections where updates are interleaved with
queries is typically supported with a hierarchy of indexes [19, 10, 17, 21]. New data is
accumulated in a small updatable structure that also supports concurrent queries, while
the rest of the hierarchy consists of a set of read-only indexes. The read-only indexes
change through merges that form larger read-only indexes, resulting in a hierarchy where
the largest indexes contain the least recent documents.
An index hierarchy is well suited for search in social networks, especially when used
in combination with an access design that adapts to the workload. The time at which
indexes are merged represents an opportunity to modify the access design and adapt it to
the current workload, so that different indexes in the hierarchy potentially have different
access designs. In our system, we process top-k queries with recency as the ranking
criterion, and the largest indexes would therefore rarely be accessed. Our workload-aware
algorithms can be used to find suitable access designs in these cases.
In this paper, we focus on selecting an access design for an index with a stratified
workload, where all updates are processed before all queries. A stratified workload closely
approximates the workloads for the indexes at the various levels in the hierarchy, except that
the update costs can vary slightly based on the structure of the hierarchy [19, 10, 17, 21].
Recall that a system based on a hierarchy of indexes, where each index supports stratified
workloads, will still support interleaved workloads overall, because of the small updatable
part of the index. The updatable part supports interleaved workloads, but we
have verified experimentally that because it is small, its processing times for stratified
workloads approximate the processing times for interleaved workloads very well (within
single percentage points), because the constant costs in query processing dominate. The
problem we focus on in this paper is thus important for a system that supports an
interleaved workload of updates and queries, and we will address both query processing and
adaptation to workloads based on cost models.
3 Query processing

Our main-memory search system constructs indexes by accumulating batches of documents in an updatable structure where the lists are compressed using VByte [24]. The batches are combined in the end to form the complete index, where the lists are compressed using PForDelta [32, 30].
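VByte encodes each document-ID delta in 7-bit groups with a flag bit per byte. The following sketch illustrates one common VByte variant (the exact byte layout used by the system may differ, and the function names here are illustrative):

```python
def vbyte_encode(gaps):
    """Encode positive integers (document-ID deltas) with VByte:
    7 data bits per byte; the high bit marks the final byte of a value."""
    out = bytearray()
    for n in gaps:
        while n >= 128:
            out.append(n & 0x7F)   # low 7 bits, continuation implied
            n >>= 7
        out.append(n | 0x80)       # terminating byte
    return bytes(out)

def vbyte_decode(data):
    """Decode a VByte-encoded byte string back to the list of integers."""
    out, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:            # terminating byte of this value
            out.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return out
```

Small deltas fit in a single byte, while a list with few entries has large average deltas and therefore costs more bytes per posting, which is why delta size matters for update costs.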
The system answers queries by computing the intersection of a posting list with a union of author-lists.¹ Note that similar types of queries occur in several scenarios, e.g., in star joins in data warehouses [23]. We process the queries with the template shown in Figure 2, which is a combination of three operators (intersection, union and list iterator) that all support the following interface:

• Init(), initialize the operator.

¹ Notice that our presentation focuses on single-term queries. This is done to simplify the presentation and does not reflect a limitation of our system.
Paper 9: Search Your Friends And Not Your Enemies
Figure 2: Query Template (an intersection ⋂ of the posting list p_t with a union ⋃ of the author-lists a_1, a_2, ..., a_n).
• Current(), retrieve the current result.
• Next(), forward to and retrieve the next result.
• SkipTo(val), forward to and retrieve the next result with value ≤ val.
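In Python-like terms, the shared interface could be sketched as the following abstract base class (names mirror the text; the actual system is implemented in Java):

```python
from abc import ABC, abstractmethod

class Operator(ABC):
    """Common interface for the list iterator, union and intersection
    operators. Results are document IDs in descending order; None is
    used here as the end-of-results sentinel."""

    @abstractmethod
    def init(self):
        """Initialize the operator."""

    @abstractmethod
    def current(self):
        """Retrieve the current result."""

    @abstractmethod
    def next(self):
        """Forward to and retrieve the next result."""

    @abstractmethod
    def skip_to(self, val):
        """Forward to and retrieve the next result with value <= val
        (results are sorted descending, so skipping moves to smaller IDs)."""
```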
All results are returned sorted by descending document ID to facilitate efficient ranking on recency. The top-k ranked results are therefore found by calling Next() on the intersection operator k times. A standard intersection operator alternates between the inputs and skips forward to find a value that occurs in both. With such a solution, the implementation of the SkipTo(val) operation in the union operator is essential to the overall processing strategy. The union operator can for example merge all the inputs first and perform all operations on the merged list [23]. Another strategy is to perform all skips on all inputs and return the maximum result. This last strategy essentially calculates the intersection of the posting list and each single author-list and then returns the union of the results [23].
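A standard two-input intersection over this interface can be sketched as follows; this is a simplified stand-alone version with plain descending-sorted lists standing in for the operators, and the names are illustrative:

```python
class DescendingList:
    """Minimal list iterator over document IDs sorted in descending order."""
    def __init__(self, doc_ids):
        self.docs = sorted(doc_ids, reverse=True)
        self.pos = -1
    def next(self):
        self.pos += 1
        return self.current()
    def skip_to(self, val):
        # Forward to the next result with value <= val.
        self.pos = max(self.pos, 0)
        while self.pos < len(self.docs) and self.docs[self.pos] > val:
            self.pos += 1
        return self.current()
    def current(self):
        return self.docs[self.pos] if 0 <= self.pos < len(self.docs) else None

def intersect_top_k(a, b, k):
    """Alternate between the inputs, skipping forward until a value occurs
    in both; stop once k results (the top k by recency) are found."""
    results = []
    x, y = a.next(), b.next()
    while x is not None and y is not None and len(results) < k:
        if x == y:
            results.append(x)
            x, y = a.next(), b.next()
        elif x > y:
            x = a.skip_to(y)   # a is ahead (larger ID); skip it down to y
        else:
            y = b.skip_to(x)
    return results
```

For example, intersecting [9, 7, 5, 3, 1] with [8, 7, 3, 2] yields [7, 3], and with k = 1 only [7] is produced before the loop stops.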
Raman et al. have introduced a union operator called Lazy Merge, which is based on the idea that if the number of skip operations in the union is large compared to the length of an input, it would have been ideal to pre-merge this input into a list of intermediate results. Lazy Merge adaptively merges an input into the intermediate results when the number of skip operations exceeds the length of the input times a constant α. The strategy never uses more than twice the processing time of a solution that pre-merges the optimal set of inputs [23]. However, the approach is not well suited for our workloads for two reasons. First, we usually process top-k queries and therefore only need the first k results from the intersection. It is inefficient to merge a set of complete inputs when only a small fraction of the results is used during query processing. Second, we often have a large number of inputs to the union, which can lead to many merges in Lazy Merge. The analysis of Lazy Merge does not take the actual cost of merges into consideration, and this can result in far-from-optimal processing strategies with many inputs.

To address these issues, we develop a union operator called HeapUnion, which is described next. The implementation of the other operators is described in Appendix A.
3.1 HeapUnion

HeapUnion, our novel union operator, is designed to be efficient regardless of whether all or only a fraction of the results are actually needed, and to scale gracefully to a very large number of inputs. To achieve these goals, HeapUnion is based on a binary heap. The heap contains one entry for each input operator, and it is ordered based on the value obtained from calling Current() on each input operator (referred to as the current value for the input operator). The same overall strategy is commonly used in information retrieval during inverted index construction [29].

We support the standard operator interface by always having the input with the highest current value at the top of the heap, so that this value is also the current value for HeapUnion. The heap is initialized the first time Next() or SkipTo(val) is called. When the first call is a Next() (SkipTo(val), respectively), HeapUnion calls Next() (SkipTo(val)) on all inputs, and the heap is constructed using Floyd's algorithm [13]. Floyd's algorithm calls a sub-procedure, heapify(), which constructs a legal heap from an entry with two legal sub-heaps as children. The heapify() operation has worst-case complexity logarithmic in the size of the heap, and Floyd's algorithm runs in linear time [13]. We will also use these operations during heap maintenance.
After initialization, HeapUnion works as shown in the pseudocode in Algorithm 1. The Current() operation either returns the current value of the input operator at the top of the heap, or indicates that there are no more results if the heap is empty. The Next() operation forwards the input with the current value, and calls heapify() to ensure that the input with the new highest current value is at the top of the heap. The worst-case complexity of this operation is thus logarithmic in the number of input operators.

The SkipTo(val) operation is based on a breadth-first search (BFS) in the heap. When forwarding to a value val, only the inputs with a current value > val actually need to be forwarded. Because the heap is organized according to the current values of all inputs, we know that if a given input has a current value ≤ val, the same is true for all of its descendants. If we determine that no skip is necessary for a given input, we thus also know that no skip is required for any of its children in the heap, and there is no need to process the children in the BFS. Furthermore, if an input is not forwarded, we know that its position in the heap relative to its children will not change. We also take advantage of this observation by calling heapify() only for the inputs where an actual skip occurred, and use a complete run of Floyd's algorithm only in the worst case.

Because HeapUnion never pre-merges any inputs, its efficiency does not depend on the total number of results, only on the number of returned results. It is therefore well suited for top-k queries. Furthermore, the BFS in the heap during skip operations ensures scalability with the number of inputs. If only one input is forwarded during a skip operation, the processing cost is logarithmic in the number of inputs; if all inputs are forwarded, the total maintenance cost becomes linear in the number of inputs.
Algorithm 1 HeapUnion Operator

function Init():
    Allocate the heap
function Next():
    heap[0].Next()
    heapify(0)
    return Current()
function SkipTo(val):
    Perform a breadth-first search in the heap from the root:
    while the BFS queue is not empty:
        if the current input is forwarded by SkipTo(val):
            add it to a LIFO list of entries for heap reorganization
            add its children to the BFS queue
    Call heapify() for the forwarded inputs in the LIFO list
    return Current()
function Current():
    if size(heap) == 0:
        return Eof
    else:
        return heap[0].Current()
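A compact executable sketch of HeapUnion follows. It uses Python's heapq (a min-heap, so document IDs are negated to keep the highest current value at the top) rather than the paper's hand-rolled heap; the lazy skip loop, which stops as soon as the top entry is already ≤ val, captures the same pruning idea as the BFS in Algorithm 1, but is a simplification, not the paper's exact implementation:

```python
import heapq

class DescendingList:
    """Minimal input operator: document IDs sorted in descending order."""
    def __init__(self, doc_ids):
        self.docs = sorted(doc_ids, reverse=True)
        self.pos = -1
    def next(self):
        self.pos += 1
        return self.current()
    def skip_to(self, val):
        self.pos = max(self.pos, 0)
        while self.pos < len(self.docs) and self.docs[self.pos] > val:
            self.pos += 1
        return self.current()
    def current(self):
        return self.docs[self.pos] if 0 <= self.pos < len(self.docs) else None

class HeapUnion:
    """Union of many descending inputs via a heap keyed on current value."""
    def __init__(self, inputs):
        self.inputs = list(inputs)
        self.heap = []
        self.started = False
    def _build(self):
        # Floyd's algorithm (heapq.heapify) builds the heap in linear time;
        # exhausted inputs are dropped.  The index i breaks ties so that
        # iterators are never compared directly.
        self.heap = [(-it.current(), i, it)
                     for i, it in enumerate(self.inputs)
                     if it.current() is not None]
        heapq.heapify(self.heap)
        self.started = True
    def current(self):
        return -self.heap[0][0] if self.heap else None
    def _forward_top(self, advance):
        _, i, it = self.heap[0]
        if advance(it) is None:
            heapq.heappop(self.heap)          # input exhausted
        else:
            heapq.heapreplace(self.heap, (-it.current(), i, it))
    def next(self):
        if not self.started:
            for it in self.inputs:
                it.next()
            self._build()
        else:
            self._forward_top(lambda it: it.next())
        return self.current()
    def skip_to(self, val):
        if not self.started:
            for it in self.inputs:
                it.skip_to(val)
            self._build()
        else:
            # Only inputs with current value > val can need forwarding,
            # and they sit near the top of the heap; stop as soon as the
            # top is already <= val.
            while self.heap and -self.heap[0][0] > val:
                self._forward_top(lambda it: it.skip_to(val))
        return self.current()
```

Since the author-lists in our setting are disjoint, the sketch does not deduplicate results across inputs.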
4 Cost Models

The efficiency of our system for a particular workload depends on the contents of the additional author-lists, and selecting a good set of lists is therefore essential. For each user u, any subset of F_u can be included in L_u, which leads to 2^{∑_{u∈V} |F_u|} = 2^{|E|} possible designs. We use cost models to investigate this large space; the models serve as the basis for optimization algorithms that find the best access design for a stratified workload of updates and queries.

In related work, Silberstein et al. have developed a simple cost model to address the problem of finding the most recent events in users' event feeds [25]. It is straightforward to adapt their model to our problem, and we will refer to this model as Simple. Simple assumes that the cost of a search is linear in the number of accessed author-lists, and that the cost of constructing a list is linear in the number of document IDs in the list. It thus captures the intuition that including more users in the additional author-lists leads to lower search costs and higher update costs. A formal description of Simple is found in Figure 3. Silberstein et al. have shown that when using Simple as a basis for optimization, a globally optimal solution can be found by making local decisions for each friend pair. This implies that only ∑_{u∈V} |F_u| = |E| different designs have to be explored in order to
find the optimal one [25].
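Under Simple, the decision for each friend pair is independent: including friend v in L_u saves one author-list access per query by u, and costs one posting per document v authors. A sketch of the resulting local rule (the constants and the workload representation are illustrative, not the system's actual parameters):

```python
def simple_local_design(friend_doc_counts, queries_by_u,
                        c_list=1.0, c_update=1.0):
    """Pick L_u under the Simple model: friend v is included exactly when
    the modeled query saving outweighs the modeled update cost."""
    L_u = set()
    for v, docs_by_v in friend_doc_counts.items():
        saving = c_list * queries_by_u   # one fewer list access per query
        cost = c_update * docs_by_v      # v's documents appended to l_u
        if saving >= cost:
            L_u.add(v)
    return L_u
```

With friend_doc_counts = {"v": 10, "w": 1000} and 100 queries by u, only "v" is included; a prolific poster like "w" is left out because updating l_u would dominate.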
Unfortunately, Simple is not accurate enough for our application scenario. Figure 4 compares actual running times of our system to predictions from Simple. The tested workloads are described in Appendix B, but for now it suffices to know that a low limit indicates that the additional author-lists are almost empty, while a high limit indicates that the additional author-lists include most or all friends. The search estimates for Simple are clearly far from accurate, and as we will see in Section 5, this inaccuracy can lead to access designs that are up to 30% slower than designs from more accurate models. In this section, we introduce two more accurate models called Monotonic and Non-monotonic.
4.1 Monotonic

Monotonic is designed to be an accurate yet tractable cost model, where only a small number of access designs must be checked to find a globally optimal solution. The model for updates in Monotonic is the same as in Simple, but we use a different approach to model query costs. Monotonic has one cost model for each operator, and query costs are estimated by combining the models for all operators in the query. The cost model for an operator describes the cost of each method supported by the operator. For operators that have other operators as inputs, like HeapUnion and Intersection, the cost is calculated by combining the cost of operations within the operator with the cost of method calls on the inputs. To find the cost of the queries we use in this paper, we combine the operators according to the template in Figure 2, and calculate the cost of k Next() calls on the Intersection to retrieve the top-k results (assuming there are at least k).

Monotonic is described in Figure 3. Skip(s) is a model for SkipTo(val). If SkipTo(val) forwards the current value of an operator by Δv document IDs, we model the cost of the operation as Skip(r·Δv/N), where r is the number of results of the operator. The number of results for an operator is the number of times we can call Next() and retrieve a new result. Monotonic and Non-monotonic use the same models for List Iterator and Intersection, but they have different models for HeapUnion. We will now explain Monotonic's model for HeapUnion; models for the other operators are explained in Appendix C.²
HeapUnion. Let us assume that HeapUnion has m inputs k_1, ..., k_m. We use r_i to denote the number of results from input i, and define R = ∑_{i=1}^{m} r_i. We assume that the cost of initialization within the HeapUnion operator itself is negligible, and therefore model the cost of initialization as the sum of the initialization costs of all inputs.

Recall from Section 3.1 that the first call to either Next() or SkipTo(val) will involve construction of the heap; we therefore have two different cases in the models for Next() and SkipTo(val), depending on whether it is the first or a subsequent call. The cost of the first call to Next() includes the cost of calling Next() on all m inputs, and the cost of the heap construction using Floyd's algorithm. For heap construction, we model the cost of each call to heapify() as a constant, c_h. With Floyd's algorithm, heapify() is called for half the heap entries, and then recursively every time there is a reorganization. The average-case complexity of Floyd's algorithm is well known, and the number of relocations in the

² We determined values for the constants with microbenchmarks as described in Appendix E.
Update cost (|l_u| = n):
- Simple: c_update · n
- Monotonic: c_update · n
- Non-monotonic: c_1B · n if N/n < b_1; c_2B · n if b_1 ≤ N/n < b_2; c_3B · n otherwise

Query cost (user u):
- Simple: c_list · (|F_u| − |L_u| + 1)
- Monotonic and Non-monotonic: combined from the operator models below

List Iterator:
- Init(): c_einit for an empty list; c_init otherwise
- Next(): c_next
- Skip(s): c_c + (s/b) · c_d + scan(s) · c_sc if s ≤ b; Skip(b) + log(s/b) · c_g otherwise

Intersection:
- Init(): k_1.Init() + k_2.Init()
- Next(): k_1.Next() + (t − 1) · k_1.Skip(1 + r_1/r_2) + t · k_2.Skip(((t − 1)/t) · (1 + r_2/r_1) + r_2/(r_1 · t))
- Skip(s): not used

Monotonic HeapUnion:
- Init(): ∑_{i=1}^{m} k_i.Init()
- Next(): ∑_{i=1}^{m} k_i.Next() + (γ + 1/2) · m · c_h on the first call; ∑_{i=1}^{m} (r_i/R) · k_i.Next() + (γ + 1) · c_h otherwise
- Skip(s): ∑_{i=1}^{m} k_i.Skip(r_i · s/R) + (γ + 1/2) · m · c_h on the first call; ∑_{i=1}^{m} min(1, r_i · s/R) · k_i.Skip(max(1, r_i · s/R)) + (γ + 1) · m_s · c_h otherwise

Non-monotonic HeapUnion:
- Init(): ∑_{i=1}^{m} k_i.Init()
- Next(): ∑_{i=1}^{m} k_i.Next() + (γ + 1/2) · m · c_h on the first call; ∑_{i=1}^{m} (r_i/R) · (k_i.Next() + log(1 + p_i) · c_h) otherwise
- Skip(s): ∑_{i=1}^{m} k_i.Skip(r_i · s/R) + (γ + 1/2) · m · c_h on the first call; ∑_{i=1}^{m} min(1, r_i · s/R) · k_i.Skip(max(1, r_i · s/R)) + max(m_s · γ + min(m_s, m/2), m_s · (∑_{j=1}^{m} (r_j/R) · log(1 + p_j)) − h(⌊m_s⌋)) · c_h otherwise

Figure 3: Overview of Cost Estimates
heap is approximately γm = 0.744m [27]. Thus we model the cost of heap construction as (γ + 1/2) · m · c_h.

A first call to SkipTo(val) involves skipping in all inputs, in addition to heap construction. Given that s results are skipped in this operation, we simply assume that the number of entries skipped on input i is r_i · s/R, resulting in a cost of ∑_{i=1}^{m} k_i.Skip(r_i · s/R). The cost of heap construction is modeled as explained for Next() above.

Subsequent calls to Next() involve a call to Next() for the input at the top of the heap and a heap reorganization. We estimate the cost of the call to Next() for the input at the top as a weighted average over the inputs. The model for the cost of heap maintenance is simple: we assume that there will be γ relocations when a single operator is forwarded, leading to γ + 1 calls to heapify().

Subsequent calls to SkipTo(val) will potentially forward all inputs, and then reorganize the heap according to the new current values of the inputs. On average, each input will be forwarded past r_i · s/R entries. However, HeapUnion will ensure not to call SkipTo(val) for inputs that will not be forwarded. Therefore, when the average skip length is less than 1, we model the cost as r_i · s/R calls that skip 1 entry. To find the cost of heap maintenance we estimate the number of forwarded inputs as m_s = ∑_{i=1}^{m} min(s · r_i/R, 1). Assuming that there will be as many relocations in the heap as when constructing a heap with m_s entries, the cost of heap maintenance is (γ + 1) · m_s · c_h.
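As a numerical illustration of the heap-maintenance terms above, with γ ≈ 0.744 and c_h treated as an assumed unit cost:

```python
GAMMA = 0.744  # expected relocations per entry in Floyd's heap construction

def heap_build_cost(m, c_h=1.0):
    """First call: heapify() for half of the m entries plus ~GAMMA*m
    recursive reorganizations, i.e. (GAMMA + 1/2) * m * c_h."""
    return (GAMMA + 0.5) * m * c_h

def skip_maintenance_cost(s, results, c_h=1.0):
    """Subsequent SkipTo: m_s = sum_i min(s*r_i/R, 1) estimated forwarded
    inputs, each modeled as (GAMMA + 1) heapify() calls."""
    R = sum(results)
    m_s = sum(min(s * r / R, 1.0) for r in results)
    return (GAMMA + 1.0) * m_s * c_h
```

For example, building the heap over 10 inputs is modeled as 1.244 · 10 = 12.44 units, and a skip of 2 results over four equally long inputs forwards an estimated m_s = 2 of them.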
Optimization Algorithm. Although Monotonic is more complex than Simple, it still has the property that testing only |E| access designs is sufficient to find the optimal approach. We describe how this is done in two steps. The following lemma shows that we can find a globally optimal solution by choosing the contents of the additional author-list for each user individually; it follows directly from the definitions in Figure 3.

Lemma 1. In Monotonic, the costs and savings from including the users L_u in the additional author-list for u are not affected by the contents of l_v when v ≠ u.

Lemma 1 reduces the number of access designs to test in the optimization algorithm from 2^{∑_{u∈V} |F_u|} to ∑_{u∈V} 2^{|F_u|}.

Theorem 2. If Monotonic estimates that for a user u and a given workload, the performance is improved if v ∈ F_u is included in L_u, then Monotonic will predict that it also leads to a performance improvement to include user w in L_u if w ∈ F_u, 0 < n_w < n_v, and c_d − c_sc/(2b) ≥ c_g/ln 2.

The proof of Theorem 2 is found in Appendix D. We notice that in our case, c_d − c_sc/(2b) ≥ c_g/ln 2 translates to 330.6578 ≥ 114.0751, which holds with a significant margin. Theorem 2 implies that if we sort all friends of u based on the number of documents they post, the optimal contents of L_u is a prefix of this sorted list. We thus need to check only |E| designs in total to find the optimal solution. Furthermore, notice that there is no cost associated with including users who do not post documents in the additional author-lists, and it is therefore always beneficial to do so.
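Theorem 2 turns the per-user search into a prefix scan: sort u's friends by ascending posting count and evaluate only the |F_u| + 1 prefixes. A sketch, where cost_of stands for any modeled workload cost of a candidate L_u:

```python
def best_prefix_design(friend_doc_counts, cost_of):
    """Evaluate only prefixes of the friends sorted by ascending number of
    posted documents; return the prefix with the lowest modeled cost."""
    order = sorted(friend_doc_counts, key=friend_doc_counts.get)
    best, best_cost = set(), cost_of(set())
    for i in range(1, len(order) + 1):
        candidate = set(order[:i])
        cost = cost_of(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best
```

Summed over all users this evaluates ∑_{u∈V} (|F_u| + 1), i.e. on the order of |E| designs, instead of the exponential full space.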
Validation. Monotonic's predictions for our test workloads are presented in Figure 4. The search estimates are much more accurate than Simple's, but there is still room for improvement when limit is close to 0 for Workload 2. Inaccurate modeling of the cost
Figure 4: Accuracy of cost models for queries and updates in Workload 1 and Workload 2 (search-time and update-time panels plot processing time (s) against limit, comparing actual time with the predictions of Simple, Monotonic and Non-monotonic).
of heap maintenance is one factor that contributes to this error, and we will therefore improve this aspect in Non-monotonic.
4.2 Non-monotonic

Non-monotonic is designed to be more accurate than Monotonic, at the price of a less efficient optimization algorithm for finding the globally optimal access design. To achieve better accuracy, two aspects of Monotonic are changed: the model for heap maintenance is extended, and a slightly more advanced update model is used. The formal description of Non-monotonic is found in Figure 3.

The update model from Monotonic is extended by taking list compression into account. During accumulation, the lists are compressed using VByte, which implies that lists with few entries result in long deltas that use more storage space. The model assumes that the cost of updating a list depends on the number of bytes used by VByte to represent the average delta length.

The models for heap maintenance in subsequent calls to Next() and SkipTo(val) reflect
that the cost of heap maintenance often depends on the total number of inputs as well as on the number of forwarded inputs. Given that input i is at the top of the heap when Next() is called, let p_i denote the number of inputs that will have Current() values larger than input i after the call to Next(). We estimate p_i as ∑_{k=1}^{m} min(r_k/r_i, 1), and assume that the heap maintenance cost when input i is at the top of the heap is log(1 + p_i) · c_h. By calculating a weighted average over all inputs, we end up with the average cost of heap maintenance for a Next() shown in Figure 3.

The model for heap maintenance in SkipTo(val) is slightly more complex, and the maximum of two different estimates is used: (1) The first alternative is similar to the estimate in Monotonic, but incorporates that Floyd's algorithm will never call heapify() for more than half the inputs, which yields the estimate (γ · m_s + min(m_s, m/2)) · c_h. (2) The other alternative reflects that the cost can be logarithmic in the number of inputs when the number of forwarded inputs is low. We have already estimated the average number of calls to heapify() when only one input is forwarded in the model for Next(), denoted h_next in the following. We now assume that all forwarded inputs will lead to h_next heapify() operations, but compensate for the fact that many of the inputs are not at the top of the heap when heapify() is called. The compensation is achieved with the function h(m_s) in Figure 3, which returns the minimum possible total distance from the root to m_s entries in the heap.
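The p_i estimate and the resulting average Next() maintenance cost can be written out as follows (a base-2 logarithm is assumed here, and c_h is an assumed per-heapify() unit cost):

```python
import math

def expected_larger_inputs(results):
    """p_i = sum_k min(r_k / r_i, 1): the expected number of inputs whose
    Current() value exceeds that of input i after its Next() (the sum
    includes k = i itself, as in the model)."""
    return [sum(min(r_k / r_i, 1.0) for r_k in results) for r_i in results]

def next_maintenance_cost(results, c_h=1.0):
    """Weighted average over inputs of log(1 + p_i) * c_h, weighting input
    i by the probability r_i / R that it is at the top of the heap."""
    R = sum(results)
    p = expected_larger_inputs(results)
    return sum((r_i / R) * math.log2(1.0 + p_i) * c_h
               for r_i, p_i in zip(results, p))
```

For two inputs with result counts [2, 1], p is [1.5, 2.0]: the long input expects half an overtaking input plus itself, the short one expects both.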
Optimization Algorithm. Lemma 1 also holds for Non-monotonic. However, the proof of Theorem 2 does not hold due to the extensions in Non-monotonic. As a result, an optimization algorithm based on Non-monotonic must test ∑_{u∈V} 2^{|F_u|} access designs to find the optimal approach. In social networks, users have hundreds of friends, and it is therefore not feasible to test all possibilities. We thus choose to limit the solution space to the space explored by the optimization algorithm based on Monotonic. If Non-monotonic is actually more accurate than Monotonic, the resulting design should be at least as efficient.

Validation. Non-monotonic's predictions on our two test workloads are shown in Figure 4. The extended update model represents a slight improvement. For search costs, Non-monotonic is clearly more accurate than Monotonic for low limits in Workload 2. This reflects that the model for heap maintenance actually plays a significant role.
5 Experiments

To validate the efficiency of our system, we conduct a set of experiments. We compare our query processing strategy based on HeapUnion to Lazy Merge, and test how the solutions based on optimization with different cost models perform compared to the user design and the friends design. Our experiments are based on the system described in Section 3, which is implemented in Java. The experiments are run on a computer with an Intel Xeon 3.2 GHz CPU with 6 MB cache and 16 GB main memory. The computer runs RedHat Enterprise 5.3 and Java version 1.6.
5.1 Efficiency of HeapUnion

To test the efficiency of HeapUnion, we compare it to the alternative suggested by Raman et al. by exchanging the HeapUnion operator in the queries with Lazy Merge [23]. To do so, we have implemented a version of Lazy Merge that stores the intermediate results in an uncompressed list where skips are implemented with galloping search [6]. As explained in Section 3, the parameter α describes how eagerly Lazy Merge pre-merges inputs into the intermediate results. When α is set to 0, Lazy Merge pre-merges all inputs; when it is set to ∞, no inputs are pre-merged.

We use the workloads from Appendix B in the experiments, but limit the design to the user design because this provides a real test of the solutions' ability to process queries with many author-lists efficiently. We report the time spent processing the 100,000 queries for each of the workloads, and vary α between 0 and ∞. We have argued that one of the reasons why Lazy Merge is not ideal for our workloads is that we typically process top-k queries. To isolate this effect, we process the workloads both when returning only the top-100 results and when returning all results.
The results from the experiments are shown in Figure 5. Notice that HeapUnion does not depend on α, and its cost is therefore constant. When using Lazy Merge, the difference between the cost of top-100 queries and retrieving all results increases with α. This reflects the inadequacy of approaches that pre-merge when processing top-k queries. The processing time with HeapUnion is clearly dependent on the number of retrieved results, and HeapUnion is therefore an attractive solution for top-k queries, as expected.

Lazy Merge consistently performs best with α set to one of the extreme values. Poor performance in the other cases is mainly caused by the large number of merges resulting from many inputs with different lengths. Which extreme value of α performs best for a particular workload depends on the average length of the author-lists compared to the posting list. HeapUnion outperforms all configurations of Lazy Merge for both workloads, with a speed-up between 1.13 and 2.36, reflecting that the high number of inputs is processed efficiently regardless of workload characteristics.
5.2 Workload Optimization

We have tested the accuracy of the different cost models in Section 4, but the key success factor for a cost model is whether using it in an optimization algorithm leads to efficient designs. We therefore conduct a set of experiments to compare the access designs suggested by the different cost models and their associated optimization algorithms. To do so, we use the networks and documents from the workloads in Appendix B, and combine them with several different sets of queries. We first test workloads with different numbers of queries, generated by letting a random user search for a random term. We also experiment with workloads where the users who search follow a Zipfian distribution with exponent 1.5, to see how skew affects the efficiency of the different approaches. We compare the designs based on Simple, Monotonic and Non-monotonic to the user design and the friends design.
APPENDIX A. OTHER PAPERS
Figure 5: HeapUnion vs. Lazy Merge varying α. (Two panels, Workload 1 and Workload 2; series: Lazy Merge and HeapUnion, each for top-100 and top-∞; x-axis: α from 0 to ∞ on a logarithmic scale; y-axis: processing time in ns.)
The results of our experiments are shown in Figure 6, where the first line shows the results for workloads with uniformly distributed queries. To get a better view of the relative differences between the methods, the second line shows the performance of all approaches relative to the best of the user design and the friends design.

For the workloads with uniformly distributed queries, Simple often leads to designs that are slower than choosing the best of the user design and the friends design, due to the inaccuracies in its predictions. Compared to Monotonic and Non-monotonic, the designs from Simple are up to 30% slower. Monotonic and Non-monotonic generally lead to reasonable designs that are comparable to or faster than the basic approaches. However, for Workload 2, both approaches lead to sub-optimal designs when queries are frequent relative to updates. This reflects the inaccuracies in the estimates for low limits for Workload 2 in Figure 4. However, Non-monotonic clearly results in better designs than Monotonic in this case, so the additional complexity pays off.

The results from Workload 2 with Zipfian queries are shown in the third line in Figure 6, denoted Workload 2Z. We have omitted the results for Workload 1Z due to space constraints, but they are similar to those for Workload 2Z. The results show that our overall solution performs much better than the basic designs when there is skew, with a speed-up of up to 3.4. With skew, the optimization problem is simple because a few users submit almost all queries, and all cost models are able to reflect that these users should have additional author-lists with nearly all their friends represented. Non-monotonic is still slightly better than the others, but the difference is not significant.
6 Related Work

Due to its clear commercial value, there has recently been some interest in search in social networks. Search engines like Google and Bing already support search over public Twitter
Paper 9: Search Your Friends And Not Your Enemies
Figure 6: Performance for workloads with different fractions of documents vs. queries. (Five panels: processing time in ns for Workload 1 and Workload 2, relative performance for Workload 1 and Workload 2, and processing time for Workload 2Z; series: user design, friends design, Simple, Monotonic and Non-monotonic; x-axis: number of queries up to 4·10⁶.)
posts, and they plan to include private posts as well.³ Facebook allows users to search their friends' posts. However, the details of these commercial solutions have not been published.

The problem of supporting keyword search over content in a social network was described in [8]. The problem is related to access control both in information retrieval [9, 26] and for structured data [7]. The solutions we explore in this paper are related to the

³ http://www.networkworldme.com/v1/news.aspx?v=1&nid=3236&sec=netmanagement
solution explored by Silberstein et al. on how to support retrieving the most recent events in users' news feeds in a social network [25]. While our problem is more complex because we support keyword search, the procedure for solving it is along the same lines.

Cost models have been used to estimate the efficiency of processing strategies in both information retrieval and databases [29, 10, 11, 20]. There exist advanced cost models for evaluating different index construction and update strategies [10, 29]. For search queries, however, simple models are most commonly used, sometimes without verification on an actual system [11]. We use simple cost models for updates in this paper, but relatively advanced models are required to predict search performance accurately. Manegold et al. outline a generic strategy to estimate the memory access costs in databases [20]. The memory access costs in our advanced models can be considered part of the model for the list iterator. However, our focus is on estimating processing time, whereas Manegold et al. focus on memory access costs.

The problems of calculating unions and particularly intersections of lists have attracted a lot of attention, both through the introduction of new algorithms with theoretical analyses [18, 16, 1, 4] and through experimental investigations [5, 2]. Algorithms for single operations have also been combined into full query plans [23, 12]. Most algorithms assume that the input sets are uncompressed. Motivated by the ability to store more data and in some cases improve computational efficiency, work from the IR community relaxes this assumption by introducing synchronization points in the compressed lists to speed up random look-ups [14, 22]. In addition, there exists work on algorithms adapted to new hardware trends [28]. Our intersection operator is based on the ideas from Demaine et al. [16], and we use a strategy similar to the one introduced by Culpepper and Moffat for synchronization points in the lists, but we use a novel union operator. Another discriminating factor is that we attempt to estimate the exact running time of the queries, while most previous work focuses on either asymptotic complexity or experimental running times.
7 Conclusions

In this paper we have presented a system for keyword search in social networks with access control, designed to perform well for a wide range of workloads. We have addressed the query processing strategy and developed an efficient HeapUnion operator that yields a speed-up between 1.13 and 2.36 compared to previous solutions. We have also developed cost models for the system and used them as the basis for workload optimization. With an accurate cost model, workload optimization leads to performance comparable to or better than the best basic designs; the speed-up is up to 3.4 for workloads with skew.

We believe that the ideas discussed here have even wider applicability than search in social networks. First, queries similar to the ones discussed in this paper occur in several settings, such as star joins in data warehouses [23], and HeapUnion might be a practical and efficient solution in such application scenarios as well. Furthermore, workload optimization might be applicable to other problems. In particular, faceted
search is supported by storing meta-data about the documents, and several different storage schemes are possible [15]. Workload optimization could be used to find an optimal storage scheme for the meta-data, an interesting direction for future work.
References

[1] R. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Proc. CPM, 2004.

[2] R. Baeza-Yates and A. Salinger. Fast intersection algorithms for sorted sequences. Algorithms and Applications, 2010.

[3] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439), 1999.

[4] J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Trans. Algorithms, 4(1), 2008.

[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. J. Exp. Algorithmics, 14, 2009.

[6] J. L. Bentley and A. C.-C. Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3), 1976.

[7] E. Bertino, S. Jajodia, and P. Samarati. Database security: research and practice. Inf. Syst., 20(7), 1995.

[8] T. A. Bjørklund, M. Götz, and J. Gehrke. Search in social networks with access control. In Proc. KEYS, 2010.

[9] S. Büttcher and C. L. A. Clarke. A security model for full-text file system search in multi-user environments. In Proc. FAST, 2005.

[10] S. Büttcher, C. L. A. Clarke, and B. Lushman. Hybrid index maintenance for growing text collections. In Proc. SIGIR, 2006.

[11] B. B. Cambazoglu, V. Plachouras, and R. Baeza-Yates. Quantifying performance and quality gains in distributed web search engines. In Proc. SIGIR, 2009.

[12] E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In Proc. ICALP, 2005.

[13] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001.

[14] J. S. Culpepper and A. Moffat. Compact set representation for information retrieval. In Proc. SPIRE, 2007.

[15] D. Dash, J. Rao, N. Megiddo, A. Ailamaki, and G. Lohman. Dynamic faceted search for discovery-driven analysis. In Proc. CIKM, 2008.

[16] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In Proc. SODA, 2000.

[17] S. Gurajada and S. Kumar P. On-line index maintenance using horizontal partitioning. In Proc. CIKM, 2009.

[18] F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Informatica, 1(2), 1971.

[19] N. Lester, A. Moffat, and J. Zobel. Fast on-line index construction by geometric partitioning. In Proc. CIKM, 2005.

[20] S. Manegold, P. Boncz, and M. L. Kersten. Generic database cost models for hierarchical memory systems. In Proc. VLDB, 2002.

[21] G. Margaritis and S. V. Anastasiadis. Low-cost management of inverted files for online full-text search. In Proc. CIKM, 2009.

[22] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4), 1996.

[23] V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and F.-L. Ling. Lazy, adaptive rid-list intersection, and its application to index anding. In Proc. SIGMOD, 2007.

[24] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.

[25] A. Silberstein, J. Terrace, B. F. Cooper, and R. Ramakrishnan. Feeding frenzy: Selectively materializing users' event feeds. In Proc. SIGMOD, 2010.

[26] A. Singh, M. Srivatsa, and L. Liu. Efficient and secure search of enterprise file systems. In Proc. ICWS, 2007.

[27] R. Sprugnoli. Recurrence relations on heaps. Algorithmica, 15(5), 1996.

[28] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. Proc. VLDB Endow., 2(1), 2009.

[29] I. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Academic Press, 1999.

[30] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, 2008.

[31] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.

[32] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, 2006.
A Query Processing Operators
A.1 List Iterator

The List Iterator is used to iterate through posting lists and author-lists in our system. Recall from Section 3 that the lists are compressed using PForDelta, a form of delta compression commonly used to compress inverted indexes in search engines [32, 30]. With delta compression, an entry in a sorted list is stored as the delta from the previous entry. Small values are therefore likely, and compression is achieved by using a representation where small values require little space.
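As an illustration, the delta step can be sketched as follows. This is a minimal sketch of plain delta encoding and decoding only; the actual PForDelta codec additionally bit-packs the deltas and patches exceptional values.

```python
def delta_encode(sorted_ids):
    """Store each entry in a sorted list as the difference from its predecessor."""
    prev = 0
    deltas = []
    for x in sorted_ids:
        deltas.append(x - prev)  # small when entries are dense
        prev = x
    return deltas

def delta_decode(deltas):
    """Reconstruct the original sorted list by prefix summation."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```

Encoding [3, 7, 12] yields the deltas [3, 4, 5], which a bit-packed representation can store in far fewer bits than the raw document IDs.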
To achieve efficient decompression rates, we decompress batches of b = 128 values at a time [32]. With this in mind, the implementation of Init(), Current() and Next() is straightforward. During initialization, we allocate an array to store the batch being processed in an uncompressed format. A standard call to Next() will forward the current result to the next in the batch, and every bth call will trigger decompression of a new batch.

The use of delta compression complicates the implementation of SkipTo(val), because previous entries in the list must be decompressed to find the value of an entry; straightforward random look-ups are therefore impossible. We can implement SkipTo(val) by calling Next() until the returned value is ≥ val, but that is potentially inefficient when many entries are skipped. We enable more efficient random look-ups by storing the first value of each batch uncompressed in an auxiliary array [14]. A SkipTo(val) is processed by first checking whether the entry to skip to is located in the current batch. If not, we do a galloping search in the auxiliary array to find the correct batch and decompress it [6]. When the correct batch is stored in the intermediate array, we forward within that batch until the correct answer is found.
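A simplified sketch of this scheme follows. Python lists stand in for the compressed batches, the small batch size and class layout are illustrative, and a binary search over the auxiliary array stands in for the galloping search; the sketch also assumes forward-only skips, as in the paper.

```python
import bisect

class ListIterator:
    """Batched, delta-encoded sorted list with SkipTo via an auxiliary
    array of uncompressed batch-first values (a sketch, not the real codec)."""

    def __init__(self, values, b=128):
        self.batches, self.firsts = [], []
        for i in range(0, len(values), b):
            chunk = values[i:i + b]
            self.firsts.append(chunk[0])  # auxiliary array: first value per batch
            self.batches.append([chunk[j] - chunk[j - 1]
                                 for j in range(1, len(chunk))])
        self._decompress(0)

    def _decompress(self, i):
        # Rebuild batch i in the intermediate array by prefix summation.
        self.cur = i
        vals = [self.firsts[i]]
        for d in self.batches[i]:
            vals.append(vals[-1] + d)
        self.vals = vals

    def skip_to(self, val):
        # If val lies beyond the current batch, locate the right batch in the
        # auxiliary array (binary search here, galloping search in the paper)
        # and decompress it; then forward within that batch.
        if self.cur + 1 < len(self.firsts) and val >= self.firsts[self.cur + 1]:
            self._decompress(bisect.bisect_right(self.firsts, val) - 1)
        j = bisect.bisect_left(self.vals, val)
        if j == len(self.vals):              # answer starts the next batch
            if self.cur + 1 == len(self.firsts):
                return None                  # past the end of the list
            self._decompress(self.cur + 1)
            return self.vals[0]
        return self.vals[j]
```

For example, with the list [1, 5, 9, 20, 33, 47, 60, 88] and b = 3, skip_to(10) decompresses the second batch and returns 20.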
A.2 Intersection

The intersection operator sorts its inputs according to their expected number of results. The number of results for an operator is the number of times we can call Next() and retrieve a new result. Both Next() and SkipTo(val) begin with the input with the fewest results, and call Next() or SkipTo(val) on it, respectively. From then on, the two methods are similar. Both alternate between the inputs and try to skip forward to the last value returned from the other input. When the same value is found in both inputs, this value is part of the intersection and is therefore returned. The outlined processing strategy is similar to the one introduced by Demaine et al. [16].
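The alternation can be sketched as follows for the two-input case. This is a sketch of the idea, not the actual operator: the inputs are plain sorted Python lists, and bisect with a lower bound stands in for SkipTo on an iterator.

```python
import bisect

def intersect(a, b):
    """Intersect two sorted lists by alternately skipping each input forward
    to the last value seen in the other, starting from the smaller input."""
    if len(a) > len(b):
        a, b = b, a               # start with the input with fewest results
    out = []
    i = j = 0
    while i < len(a):
        v = a[i]
        j = bisect.bisect_left(b, v, j)   # skip b forward to >= v
        if j == len(b):
            break                          # b exhausted
        if b[j] == v:
            out.append(v)                  # same value in both inputs
            i += 1
        else:
            i = bisect.bisect_left(a, b[j], i)  # b overshot: skip a forward
    return out
```

On the inputs [1, 3, 7, 9] and [3, 4, 7, 10, 12], the operator alternates between the lists and emits [3, 7].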
B Test Workloads

Two workloads with different characteristics, denoted Workload 1 and Workload 2, are used to test the accuracy of our cost models and the efficiency of our system. Workload 1 has 10,000 users and Workload 2 has 100,000. In both networks, users have 100 friends each, and the friendships are generated with Barabási's preferential attachment model [3].
Documents in both workloads were obtained from a crawl of Twitter in February 2010. In Workload 1, we assign 1,500,000 documents to the users in a random process where there is a strong correlation between the number of documents posted and the number of followers, while the 2,500,000 documents in Workload 2 are assigned to users without such a correlation. Each workload also has 100,000 top-100 queries generated by choosing a random user who searches for a random term in the documents (except stop words).

We use a set of pre-generated designs to test the accuracy of the cost models. Motivated by the fact that search performance improves if more users are represented in the additional author-lists, and that it is less costly to include a user who posts infrequently than one who posts frequently, we generate a set of designs with the following simple strategy: we represent a user u in the additional author-lists of all her followers, O_u, if n_u < limit. By choosing a set of different values for limit, we obtain a range of designs in between empty additional author-lists and additional author-lists that include all friends.
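The design-generation strategy can be sketched as follows. The function and variable names are ours: `followers` maps each user u to her set of followers O_u, and `post_counts` maps u to n_u.

```python
def generate_design(followers, post_counts, limit):
    """Build one pre-generated design: user u is represented in the
    additional author-lists of all her followers O_u when n_u < limit."""
    additional = {u: set() for u in followers}   # follower -> users included
    for u, o_u in followers.items():
        if post_counts.get(u, 0) < limit:        # infrequent poster: cheap to include
            for f in o_u:
                additional[f].add(u)
    return additional
```

Sweeping limit from 0 to infinity then yields the range of designs between empty additional author-lists (limit = 0) and author-lists containing all friends.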
C Cost Models for Operators

Monotonic and Non-monotonic use the same models for the List Iterator and Intersection operators, and the underlying intuition for these models is presented here. A more formal description is given in Figure 3.
C.1 List Iterator

The List Iterator operator is initialized by using a look-up structure to find the list to process. Recall that we decompress lists in batches of b values at a time. An array for the b intermediate results is therefore allocated during initialization, and the first batch of values is decompressed. However, if the look-up indicates that the list is empty, neither intermediate array allocation nor decompression is required. In conclusion, we model the cost of Init() for list iterators with two constants, c_init and c_einit, depending on whether the list is empty or not.

Every bth call to Next() will trigger decompression of a new batch. However, we estimate the average cost of a call to Next() to be constant, c_next.

The model for the SkipTo(val) operation is slightly more complex and depends on the number of entries skipped, denoted s. As described in Appendix A.1, a galloping search is used to find the correct batch to decompress if many entries are skipped, and we therefore discriminate between the cases s ≤ b and s > b. If s ≤ b, we assume that there is some constant cost associated with a skip operation, c_c. It might be necessary to decompress a new batch of values to find the correct return value. We model the cost of decompressing a segment as a constant, c_d, and the probability that it happens as s/b. In addition, we have to forward within the correct batch of values until we find the value we seek. The number of scanned values is lower than s when we need to decompress a new batch, and the average number of entries scanned in our implementation is scan(s) = 1 + (2b − 1)s/(2b) − s²/(2b). We assume that scanning an entry also has a constant cost, c_sc. When s > b, the logarithmic cost of the galloping search is incorporated into the model as c_g log(s/b), and the rest of the cost is equal to Skip(b).
C.2 Intersection

For the intersection operator, we restrict our analysis to two inputs because that is what we use in this paper. The initialization within the Intersection operator is assumed to have negligible cost overall, and the cost of Init() is the sum of the initialization costs in the two inputs, k_1 and k_2. The inputs are assumed to have r_1 and r_2 results, respectively, with r_1 < r_2.

As explained in Appendix A.2, the Intersection operator works similarly for Next() and SkipTo(val) operations, but we focus on Next() in this paper because that is the method we will use. The processing of a Next() call will begin with a call to Next() on input k_1. Then, there will be a set of skips in each input until we find a value that occurs in both. We assume that the deltas between entries in the inputs are N/r_1 and N/r_2 document IDs, respectively. Furthermore, we assume that the inputs are independent, so that the deltas between results of the Intersection can be estimated as N²/(r_1 r_2). The amount skipped in one round of skips (one skip in each input) is estimated to be N/r_1 + N/r_2. We now calculate t, the expected number of rounds with skips, as:

t = 1 + (N²/(r_1 r_2) − N/r_1) / (N/r_1 + N/r_2) = 1 + (N − r_2)/(r_1 + r_2)

Because input 1 will be forwarded with a Next() call in the first round, there will be t − 1 skips in it with average length (N/r_1 + N/r_2)/(N/r_1) = 1 + r_1/r_2. Similarly, there will be t skips in input 2, but one of the skips will, according to our assumptions, arrive at the last value from input 1, which results in a shorter skip in the last round. The average skip length is therefore:

((t − 1)/t)(1 + r_2/r_1) + r_2/(r_1 t)
D Proof of Theorem 2

Proof. To see this, we will show that the additional costs from also including user w are never larger per document posted by w than the costs of including v per document posted by v, and that the savings during query processing from also including w are at least as large per posted document as for v.

The cost of updates to l_u is linear in the number of document IDs. The costs of including the documents from v and w in l_u are therefore c_update·n_v and c_update·n_w, respectively. Per document for each user, the update cost is thus the same, namely c_update.

We will now focus on queries. Recall that the queries are processed with the query template in Figure 2. The only things that will change in the query plans for search queries from user u when L_u changes are the inputs to HeapUnion. Notice that what HeapUnion returns for a particular call to SkipTo(val) or Next() will not change, and the processing cost for the intersection operator and the iterator for the posting list is thus unaffected. We therefore only need to show that for the methods for Monotonic HeapUnion in Figure 3, the savings from including w in addition to v in L_u are at least as large per posted document as when only including v, assuming that all inputs to the union are List Iterators. We will go through each of the methods in order.

Init(): When v is included in L_u, the cost will decrease by c_init if |L_u| > 0 before v was included, because the included list does not require initialization. Otherwise, the cost will remain constant because we have to initialize l_u instead. When w is also included in L_u, the cost will definitely decrease by c_init because we know that |L_u| > 0 before w was included. Because c_init/n_v < c_init/n_w when n_v > n_w, the savings during Init() are at least as large per entry for w as for v.

Next(), first call: From the formula in Figure 3, it is clear that when assuming that |l_u| > 0 before v is included, the savings from including v are c_next + (γ + 1/2)c_h (it is less if |l_u| = 0). The savings from including w in L_u are at least the same (or c_next + 2(γ + 1/2)c_h if adding w to L_u makes L_u = F_u, in which case no heap is required). By a similar argument as for Init(), the savings per n_w for w are at least as large as per n_v for v.
Next(), subsequent calls: Because all inputs to the union are List Iterators, the cost of the call to Next() for the input at the top of the heap remains the same when inputs are removed. There will be no savings from the heap maintenance costs either, unless including w makes L_u = F_u (when no heap is required). It is therefore straightforward to conclude that the savings from including w are at least as large per n_w as the savings from including v per n_v.
Skip(s), first call: We will first consider the cost of the calls to skip in the inputs that are processed in this operation. Notice that when including a new user in L_u, the resulting changes in skip costs are: 1) each skip in l_u will be longer, and the added length depends on the number of documents posted by the new user; and 2) the skips in the list that is included will no longer be necessary. In the following, we will refer to the cost of performing a skip of length s in a List Iterator as LI.Skip(s). Given that there are |l_u| entries in the additional author-list before v or w is included, the savings from including v are:

S_v = LI.Skip(s·n_v/R) + LI.Skip(s·|l_u|/R) − LI.Skip(s·(|l_u| + n_v)/R)

The savings from including w as well are:

S_w = LI.Skip(s·n_w/R) + LI.Skip(s·(|l_u| + n_v)/R) − LI.Skip(s·(|l_u| + n_v + n_w)/R)
We need to show that S_v/n_v ≤ S_w/n_w, or, in particular, that:

LI.Skip(s·n_w/R)/n_w − LI.Skip(s·n_v/R)/n_v
+ (LI.Skip(s·(|l_u| + n_v)/R) − LI.Skip(s·|l_u|/R))/n_v
− (LI.Skip(s·(|l_u| + n_v + n_w)/R) − LI.Skip(s·(|l_u| + n_v)/R))/n_w ≥ 0
We will now show that LI.Skip(s) is concave, a fact we will use to show the above. First, the second derivative of LI.Skip(s) when s < b is −c_sc/b, which is negative because c_sc and b are both positive constants. When s > b, the second derivative is −c_g/(s² ln 2), which is also negative. Furthermore, LI.Skip(s) is continuous at s = b. To show that it is concave, it thus remains to show that the derivative as s approaches b from below is not smaller than when s approaches b from above. By calculating the derivatives and finding the limits, we see that the following must hold:

c_d − c_sc/(2b) ≥ c_g/ln 2

And this holds by assumption in the theorem. We thus know that LI.Skip(s) is concave.
For a concave function f(x) and two points x_1 and x_2 with x_1 < x_2, the slope between the two points, (f(x_2) − f(x_1))/(x_2 − x_1), will never increase when either x_1, x_2 or both increase. We can therefore conclude that:

(LI.Skip(s·(|l_u| + n_v)/R) − LI.Skip(s·|l_u|/R))/n_v ≥ (LI.Skip(s·(|l_u| + n_v + n_w)/R) − LI.Skip(s·(|l_u| + n_v)/R))/n_w

It remains to show that LI.Skip(s·n_w/R)/n_w ≥ LI.Skip(s·n_v/R)/n_v. To do so, we return to f(x) as discussed above, and substitute x_1 = 0. As long as f(0) ≥ 0, the average slope between (0, 0) and (x_2, f(x_2)) is smaller than between (0, 0) and (x_3, f(x_3)) when x_3 < x_2, because f(x) is concave. Because LI.Skip(0) = c_c + c_sc is positive, as no costs are negative, we can conclude that LI.Skip(s·n_w/R)/n_w ≥ LI.Skip(s·n_v/R)/n_v.

We know that the theorem holds for the heap construction from the description of Next(), first call, above.
Skip(s), subsequent calls: The pro<strong>of</strong> for the savings in calls to SkipTo(val) for the<br />
inputs is mostly as for the first call to Skip(s) above. The only difference occurs when the<br />
skip length for an input is less than 1. We thus need to prove that this resulting function<br />
273
APPENDIX A. OTHER PAPERS<br />
is still concave, <strong>and</strong> that it is not negative for s = 0. Because it is linear when s < 1,<br />
the second derivative is 0. Furthermore, for the derivative not to increase at s = 1, we<br />
require that:

$$c_c + \frac{c_d}{b} + \left(1 + \frac{2b-1}{2b} - \frac{1}{2b}\right)c_{sc} \ge \frac{c_d}{b} + \left(\frac{2b-1}{2b} - \frac{2}{2b}\right)c_{sc}$$

This holds when:

$$c_c + \left(1 + \frac{1}{2b}\right)c_{sc} \ge 0$$
This holds because all costs are positive. Furthermore, when the skip length is 0, the cost<br />
is 0, so we reach the same conclusion as for Skip(s), first call, above.<br />
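The algebraic reduction above can be sanity-checked mechanically. In the sketch below, the functions simply restate the two sides of the derivative condition; the first constant triple uses the estimates from Table 1 and the second is an arbitrary example, since only positivity matters:

```python
def lhs(c_c, c_d, c_sc, b):
    # Left side: condition for the derivative not to increase at s = 1.
    return c_c + c_d / b + (1 + (2 * b - 1) / (2 * b) - 1 / (2 * b)) * c_sc

def rhs(c_d, c_sc, b):
    # Right side of the same condition.
    return c_d / b + ((2 * b - 1) / (2 * b) - 2 / (2 * b)) * c_sc

def simplified(c_c, c_sc, b):
    # The difference lhs - rhs reduces to c_c + (1 + 1/(2b)) * c_sc.
    return c_c + (1 + 1 / (2 * b)) * c_sc

for b in (1, 8, 128):
    for c_c, c_d, c_sc in [(24.8305, 330.6624, 1.1889), (1.0, 2.0, 3.0)]:
        diff = lhs(c_c, c_d, c_sc, b) - rhs(c_d, c_sc, b)
        assert abs(diff - simplified(c_c, c_sc, b)) < 1e-9
        assert diff >= 0  # holds because all costs are positive
```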
The model for the cost of heap maintenance in subsequent calls to SkipTo(val) is linear in the number of forwarded inputs. Hence, we proceed to show that the number of forwarded inputs decreases at least as much per n_w when w is included in the additional author-list as it does per n_v when only v is included. The reduction when including v in L_u is:
$$R_v = \min\left(\frac{s|l_u|}{R},\, 1\right) + \min\left(\frac{sn_v}{R},\, 1\right) - \min\left(\frac{s(|l_u| + n_v)}{R},\, 1\right)$$
And the reduction when also including w is:

$$R_w = \min\left(\frac{s(|l_u| + n_v)}{R},\, 1\right) + \min\left(\frac{sn_w}{R},\, 1\right) - \min\left(\frac{s(|l_u| + n_v + n_w)}{R},\, 1\right)$$
We need to show that $\frac{R_w}{n_w} \ge \frac{R_v}{n_v}$. It is clear that $\frac{\min(sn_v/R,\,1)}{n_v} \le \frac{\min(sn_w/R,\,1)}{n_w}$ when $n_w < n_v$, and it remains to show that:

$$\frac{\min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right) - \min\left(\frac{s(|l_u|+n_v+n_w)}{R},\, 1\right)}{n_w} - \frac{\min\left(\frac{s|l_u|}{R},\, 1\right) - \min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right)}{n_v} \ge 0$$
Assume first that $\frac{s(|l_u|+n_v+n_w)}{R} \le 1$. In that case, we see that the statement sums to 0 because $|l_u|, n_v, n_w \ge 0$. There are three more possible cases:
a) $\frac{s(|l_u|+n_v+n_w)}{R} > 1$ and $\frac{s(|l_u|+n_v)}{R} < 1$,

b) $\frac{s(|l_u|+n_v)}{R} \ge 1$ and $\frac{s|l_u|}{R} < 1$, and
Paper 9: Search Your Friends And Not Your Enemies<br />
Figure 7: (Left) update workload with x list entries (c_update = 1001.7939); (right) search workload with x accessed author-lists (c_list = 24111.2354).
c) $\frac{s|l_u|}{R} \ge 1$.
In case a), $-\min\left(\frac{s(|l_u|+n_v+n_w)}{R},\, 1\right)$, which is negative, is bounded below by $-1$, so we subtract less than when the statement summed to 0, and the statement is therefore positive. In case b), $\min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right)$ and $-\min\left(\frac{s(|l_u|+n_v+n_w)}{R},\, 1\right)$ cancel. Furthermore, $\min\left(\frac{s|l_u|}{R},\, 1\right) < \min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right)$, making the statement positive. In case c), the result is 0.
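The inequality just established can also be spot-checked numerically. The sketch below draws random parameter settings with n_w < n_v (the ranges are arbitrary) and verifies that R_w per n_w is at least R_v per n_v:

```python
import random

def reduction(prefix, n, s, R):
    # Reduction in forwarded inputs when a list of n documents is
    # merged into an additional author-list that already covers
    # `prefix` entries, following the R_v / R_w expressions above.
    return (min(s * prefix / R, 1) + min(s * n / R, 1)
            - min(s * (prefix + n) / R, 1))

random.seed(0)
for _ in range(10_000):
    s = random.uniform(0.01, 10)
    R = random.uniform(1, 1000)
    l_u = random.uniform(0, 100)
    n_v = random.uniform(1, 100)
    n_w = random.uniform(0.01, n_v)  # n_w < n_v, as in the proof
    R_v = reduction(l_u, n_v, s, R)
    R_w = reduction(l_u + n_v, n_w, s, R)
    assert R_w / n_w >= R_v / n_v - 1e-9
```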
We have now shown that the costs of including w in addition to v in the additional author-list for u are exactly the same per document posted by w, n_w, as the costs of including only v are per n_v. Furthermore, we have shown that the savings per posted document are at least as large for all relevant functions in HeapUnion when also including w as when including only v in L_u. Because including v by assumption leads to a performance improvement, also including w thus leads to a further performance improvement.
E Microbenchmarks
This section contains an overview of the microbenchmarks used to determine the constants in the cost models in Section 4. All constants are summarized in Table 1.

The constants used in Simple are estimated from a complete workload. We process Workload 2 from Appendix B and measure update cost as a function of author-list length, and search cost as a function of the number of merged lists in the queries. Results and estimates are shown in Figure 7.

The costs of initialization and Next() operations in list iterators are estimated in an experiment using author-lists with deltas between entries ranging from 1 to 100,000. The results are shown in Figure 8. In a similar experiment using posting lists instead of author-lists, the estimate for c_init was higher. This is probably because we use a different look-up structure in the inverted index, and we therefore have two values for c_init.
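The estimation of c_init and c_next amounts to an ordinary least-squares fit of a line to measured times. The sketch below reproduces the idea on synthetic timings; the "true" constants and noise level are invented for illustration (chosen near the reported magnitudes), not re-measurements:

```python
import random

# Synthetic timings following the assumed linear model
# time(x) = c_init + x * c_next. The "true" constants below are
# invented for illustration, not re-measurements.
true_init, true_next = 1260.0, 15.0
random.seed(1)
xs = [10 ** k for k in range(6)]
ys = [true_init + x * true_next + random.gauss(0, 5) for x in xs]

# Ordinary least squares for a line y = a + b * x.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b_hat = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
a_hat = my - b_hat * mx

assert abs(b_hat - true_next) < 0.1   # recovered c_next
assert abs(a_hat - true_init) < 100   # recovered c_init
```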
Figure 8: Time spent on x next operations with different deltas. For author-lists (above), estimate with c_init = 1261.6450 and c_next = 15.0138. For posting-lists (below), estimate with c_init-posting = 1675.6671.
The cost of accessing an empty list is estimated with an experiment where we have a user who is friends with x users that do not post documents. We perform 100,000 queries on behalf of this user and record the time per search. The results are shown in Figure 9 along with the estimates.

The cost of skipping in the list iterator involves several constants. We first estimate the scan cost with an experiment that tests all scan lengths up to b = 128. The results and the resulting estimates are shown in Figure 9. The rest of the constants for skipping in list iterators are estimated from a set of experiments where skips of various lengths (1 to 10,000) are performed in 1000 lists. All results from these experiments are shown in Figure 10. We use these results to first estimate c_c and c_d by considering all skip lengths below b and subtracting the average scan cost (from the above experiment). The constant for long skips is estimated by subtracting the estimated cost of skipping b values from the longer skips. The results and resulting estimates from these experiments are shown in Figure 11.

The cost of a heapify() operation is estimated by running union queries over a set of iterators that simply return numbers at given intervals. We are thus able to predetermine the number of heapify() operations. To estimate the constants in construction, we
test a workload with 100,000 lists and accumulate 1,000 entries in each list. The deltas between the entries vary from 1 to 100,000, which results in different numbers of bytes used during accumulation. Results and estimates are shown in Figure 12.

Figure 9: Left: Time spent on x initializations of empty lists (c_einit = 543.7537). Right: Cost of scanning x entries in an array (c_sc = 1.1889).
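A heap-based union over synthetic interval iterators, with a predictable number of heap-maintenance operations, can be sketched as follows. This mimics the benchmark setup in spirit only; it is not the HeapUnion implementation from the paper, and Python heapq's sift operations stand in for heapify():

```python
import heapq

def union_count_ops(intervals, limit):
    # K-way merge over iterators that return multiples of fixed
    # intervals, counting heap-maintenance (sift) operations. This
    # mimics the benchmark setup only in spirit; it is not the
    # paper's HeapUnion, and duplicate values are kept.
    iters = [iter(range(step, limit, step)) for step in intervals]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)
    ops = 0
    out = []
    while heap:
        val, i = heap[0]          # peek at the smallest head
        out.append(val)
        nxt = next(iters[i], None)
        if nxt is None:
            heapq.heappop(heap)   # input exhausted
        else:
            heapq.heapreplace(heap, (nxt, i))
        ops += 1                  # one sift per produced value
    return out, ops

values, ops = union_count_ops([2, 3], 10)
assert values == sorted(values)
assert ops == len(values)  # maintenance count is predetermined
```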
Figure 10: Cost of SkipTo(val) in the list iterator, for 1 to 10,000 skipped entries.
Figure 11: Determination of skip constants for short skips (left: c_c = 24.8305 and c_d = 330.6624) and long skips (right: c_g = 79.0708).
Figure 12: Left: Time per heapify operation (c_h = 9.9073). Right: Microbenchmark for construction (c_1b = 990.3406, c_2b = 1082.8243 and c_3b = 1135.9749).
Constant         Estimate
c_update         1001.7939
c_list           24111.2354
c_init           1261.6450
c_init-posting   1675.6671
c_einit          543.7537
c_next           15.0138
c_sc             1.1889
c_c              24.8305
c_d              330.6624
c_g              79.0708
c_h              9.9073
c_1b             990.3406
c_2b             1082.8243
c_3b             1135.9749

Table 1: Estimates from microbenchmarks.