
Nils Grimsmo

Bottom Up and Top Down —
Twig Pattern Matching on Indexed Trees

Thesis for the degree of philosophiae doctor

Trondheim, 2010-09-02

Norwegian University of Science and Technology.
Faculty of Information Technology, Mathematics and Electrical Engineering.
Department of Computer and Information Science.


NTNU
Norwegian University of Science and Technology

Thesis for the degree of philosophiae doctor

Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science

© Nils Grimsmo

ISBN 978-82-471-2723-0 (printed version)
ISBN 978-82-471-2724-7 (electronic version)
ISSN 1503-8181

Doctoral theses at NTNU, 2011:96

Printed by NTNU-trykk


Preface

This thesis is submitted to the Norwegian University of Science and Technology (NTNU) in partial fulfillment of the requirements for the degree of philosophiae doctor. The doctoral work has been performed at the Department of Computer and Information Science, NTNU, Trondheim, with Bjørn Olstad as main supervisor, and Øystein Torbjørnsen and Magnus Lie Hetland as co-supervisors.

The candidate was supported by the Research Council of Norway under grant NFR 162349, and by the iAD project, also funded by the Research Council of Norway.



Summary

This PhD thesis is a collection of papers presented with a general introduction to the topic, which is twig pattern matching (TPM) on indexed tree data. TPM is a pattern matching problem where occurrences of a query tree are found in a usually much larger data tree. This has applications in XML search, where the data is tree shaped and the queries specify tree patterns. The included papers present contributions on how to construct and use structure indexes, which can speed up pattern matching, and on how to efficiently join together results for the different parts of the query with so-called twig joins.

• Paper 1 [18] shows how to perform more efficient matching of root-to-leaf query paths in so-called path indexes, by using new opportunistic algorithms on existing data structures.

• Paper 2 [19] proves a tight bound on the worst-case space usage for a data structure used to implement path indexes.

• Paper 3 [24] presents an XML indexing system which combines existing techniques in a novel way, and performs orders of magnitude better than existing commercial and open-source systems.

• Paper 4 [20] reviews and creates a taxonomy for the many advances in the field of TPM on indexed data, and proposes opportunities for further research.

• Paper 5 [21] bridges the gap between worst-case optimality and practical performance in current twig join algorithms.

• Paper 6 [22] improves the construction cost of so-called forward and backward path indexes for tree data from loglinear to linear.



Acknowledgments

The day-to-day supervision of the PhD work during the first years was mostly done by the external supervisor Dr. Øystein Torbjørnsen from Fast Search and Transfer, who has been a good source of ideas and clever technical solutions. Dr. Magnus Lie Hetland from my department has been supervising the last year, and has given substantial help both scientifically and in the writing process of some papers. The visits of my formal supervisor Dr. Bjørn Olstad have been inspirational. The discussions with Dr. Felix Weigel during his internship at FAST resulted in many ideas. I would like to thank fellow PhD student Truls Amundsen Bjørklund for good times, fruitful discussions and honest feedback during our work together.

Thank you Nina, for your patience, beauty and delicious cooking.



Contents

Preface
Summary
Acknowledgments
Contents

1 Introduction
1.1 Indexing/search in semi-structured data
1.2 Use-case: XML
1.2.1 XPath and XQuery
1.3 Abstract problem: Twig Pattern Matching
1.3.1 Research scope: TPM on indexed data
1.4 Research questions

2 Background
2.1 Twig joins
2.1.1 Twig join work-flow
2.1.2 Result enumeration
2.1.2.1 Single output query node
2.1.3 Simple intermediate result architecture
2.1.4 Tree position encoding
2.1.5 Partial match filtering
2.1.6 Intermediate result construction
2.1.7 Merging input streams
2.1.8 Data locality and updatability
2.1.9 Twig join conclusion
2.2 Partitioning data
2.2.1 Motivation for fragmentation
2.2.2 Path partitioning
2.2.3 Backward and forward path partitioning
2.2.4 Balancing fragmentation
2.3 Reading data
2.3.1 Skipping
2.3.1.1 Skipping child matches
2.3.1.2 Skipping parent matches
2.3.1.3 Holistic skipping
2.3.2 Virtual streams
2.3.2.1 Virtual matches for non-branching internal query nodes
2.3.2.2 Tree position encoding allowing ancestor reconstruction
2.3.2.3 Virtual matches for branching query nodes
2.4 Related problems and solutions

3 Research Summary
3.1 Formalities
3.2 Publications and research process
3.2.1 Paper 1
3.2.2 Paper 2
3.2.3 Paper 3
3.2.4 Paper 4
3.2.5 Paper 5
3.2.6 Paper 6
3.3 Research methodology
3.4 Evaluation of contributions
3.4.1 Research questions revisited
3.4.2 Opportunities revisited
3.5 Future work
3.5.1 Strong structure summaries for independent documents
3.5.2 A simpler fast optimal twig join
3.5.3 Simpler and faster evaluation with non-output nodes
3.5.4 Ultimate data access shoot-out
3.6 Conclusions

Bibliography

4 Included papers
Paper 1: Faster Path Indexes for Search in XML Data
Paper 2: On the Size of Generalised Suffix Trees Extended with String ID Lists
Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
Paper 4: Towards Unifying Advances in Twig Join Algorithms
Paper 5: Fast Optimal Twig Joins
Paper 6: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees

A Other Papers
Paper 7: On performance and cache effects in substring indexes
Paper 8: Inverted Indexes vs. Bitmap Indexes in Decision Support Systems
Paper 9: Search Your Friends And Not Your Enemies


Chapter 1

Introduction

“Research is formalized curiosity. It is poking and prying with a purpose.”
– Zora Neale Hurston

The thesis is submitted as a paper collection bound together by a general introduction. This chapter presents the context of the research, which is indexing and querying semi-structured data, and the abstract problem investigated, which is twig pattern matching (TPM). Chapter 2 gives a high-level introduction to techniques used in state-of-the-art TPM on indexed data. Chapter 3 lists the included published papers with short qualitative assessments, evaluates the total contribution of this thesis, and proposes future work.

1.1 Indexing/search in semi-structured data

So-called semi-structured data gives both flexibility and expressive power, and is commonly used for storing and exchanging data in heterogeneous information systems. In the semi-structured data model, documents have a structure that specifies how the different parts of the content relate to each other. This means information is contained both in the structure and the content. Documents are usually structurally self-contained, meaning that the structure can be understood from the document alone, without additional meta-data.

The focus of this thesis is algorithms and data structures for indexing and querying semi-structured data, where queries specify both structure and content. The use of semi-structured data can functionally cover both traditional structure-oriented and content-oriented data management, and the thesis therefore touches the fields of both databases and information retrieval.




1.2 Use-case: XML

XML is a simple yet flexible markup language [46], and has become the de facto standard for storing semi-structured data. An example XML document is shown in Figure 1.1. Standard XML has a tree model, where there are mainly three types of nodes in a document tree: element, attribute and text. All internal nodes in the document tree are of type element, and are given by start and end tags, such as the node with name book in the example. Text and attribute nodes are always leaf nodes. Text nodes have simple string values, while attributes have both a name and a value, such as the ISBN node in the example.

<library>
  <book ISBN="...">
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
  ...
</library>

Figure 1.1: Example XML document

1.2.1 XPath and XQuery

XPath [45] and XQuery [47] have become standard languages for querying XML. Comparing the two, XPath is a simpler declarative language, while XQuery is a more complex language that uses XPath expressions as building blocks. The XPath expression in Figure 1.2a asks for the title of all books coauthored by Kant and Gödel.

In XPath single and double forward slashes specify child and descendant relationships between nodes, respectively. Square brackets contain predicates, and the rightmost node not part of a predicate is the output node, also called the return node. XPath queries are trees, and the tree representation of the example is shown in Figure 1.2b. In XPath there are 11 so-called axes in addition to descendant and child: parent, ancestor, following-sibling, preceding-sibling, following, preceding, attribute, namespace, self, descendant-or-self and ancestor-or-self [45]. There can also be more complex value predicates than simple tests on string equality, using functions such as count(), contains(), sum(), etc.

XQuery is a powerful language where small programs are built with path expressions as building blocks, in so-called FLWOR expressions (for, let, where, order, return). Figure 1.3 shows an XQuery program similar to the XPath expression in Figure 1.2a, which in addition orders books by title and retrieves both title and ISBN.




//book[author/text()="Kant"][author/text()="Gödel"]/title

(a) Expression.

[Tree representation: a book root node with two author children whose text values are "Kant" and "Gödel", and a title child as the output node.]

(b) Tree representation.

Figure 1.2: XPath example finding books coauthored by Kant and Gödel.

for $b in doc("lib.xml")/library/book
let $t := $b/title
where $b/author = "Kant" and $b/author = "Gödel"
order by $t
return ($t, $b/@ISBN)

Figure 1.3: Example XQuery.

1.3 Abstract problem: Twig Pattern Matching

In XPath a large number of functions can be used in value predicates, and thirteen different axes dictate the relationships between nodes. The many details in the language make it hard to reason about the complexity of evaluation algorithms and hard to implement prototypes. TPM is a more abstract tree matching problem that covers a subset of XPath. It is of academic interest because a TPM solution covers the majority of the workload in most XML search systems [15].

In TPM both query and data are node-labeled trees, as shown in the example in Figure 1.4. Node predicates are on label equality, and all nodes have the same type. There are two types of query edges that dictate the relationship between data nodes in a match, ancestor–descendant (A–D) and parent–child (P–C), denoted in figures by double and single edges, respectively. The result of a TPM query is the set of mappings of query nodes to data nodes that both respect the node labels and satisfy the A–D and P–C relationships specified by the query edges.

In settings with XML document collections, the data is a forest of trees, but this can easily be transformed into a single tree by adding a virtual super-root node.
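To pin down these semantics, the following minimal sketch (illustrative only, not one of the algorithms discussed later) enumerates TPM matches by brute force: it tries every label-respecting mapping of query nodes to data nodes and keeps those that satisfy the edge relationships. All names are hypothetical, and the running time is exponential.

from itertools import product

def tpm_matches(qlabels, qedges, dlabels, dparent):
    """Brute-force TPM matcher (for semantics, not efficiency).

    qlabels: query node -> label; qedges: list of (u, v, type) with
    type in {"A-D", "P-C"}; dlabels: data node -> label;
    dparent: data node -> parent data node (None for the root).
    """
    def is_ancestor(a, d):
        d = dparent[d]
        while d is not None:
            if d == a:
                return True
            d = dparent[d]
        return False

    qnodes = list(qlabels)
    # Candidate data nodes per query node, filtered on label equality.
    candidates = [[d for d in dlabels if dlabels[d] == qlabels[q]]
                  for q in qnodes]
    for combo in product(*candidates):
        m = dict(zip(qnodes, combo))
        if all(dparent[m[v]] == m[u] if t == "P-C" else is_ancestor(m[u], m[v])
               for u, v, t in qedges):
            yield m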




[Figure: a query tree with nodes a¹, b¹ and c¹ on the left, and a data tree with nodes a₁–a₄, b₁–b₆, c₁–c₆, d₁, d₂, e₁ and f₁ on the right.]

Figure 1.4: TPM example with a query tree on the left and a data tree on the right. One of the matches for the query in the data is shown with arrows from query nodes to data nodes. In the following, query nodes are drawn with circles and data nodes with rounded rectangles. Node labels are written with typewriter font, and the superscripts in query nodes and subscripts in data nodes are used to identify the nodes (together with the labels).

1.3.1 Research scope: TPM on indexed data

The scope of this thesis is twig pattern matching on indexed data, and we assume that the processes of preparing the index and evaluating queries are separate. For this strategy to be viable, the cost of index construction must be justified by the performance gain in query evaluation compared to evaluation without an index.

We use the following abstract view of an index: It is a mechanism which provides a function from some feature of a node to the nodes in the data tree that have this feature. The simplest non-trivial such feature is the node label, as used in the index shown in Figure 1.5a. In a typical implementation, entries in a so-called dictionary on labels point to so-called occurrence lists containing the nodes with a matching label.

When indexing on label, a query can be evaluated by reading the label-matching data nodes for each query node, and joining these into full query matches. The number of full query matches may be small compared to the total number of query node matches read, but if the labels on the query nodes are selective, far fewer data nodes will be processed than when evaluating the query on the data tree without an index.

Indexing on node label can be extended to indexing on path labels, the string of labels from the root to a node, as illustrated in Figure 1.5b. This can again be extended to classify nodes not only on the labels of the ancestor nodes on the path above, but also on the labels of the children in the subtree below. These indexing strategies, called structure indexing, will be discussed in the next chapter, together with so-called twig join algorithms. A small sketch of the two simplest variants follows below.
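As a minimal sketch (assuming a simple Node class; all names are illustrative, not from the included papers), both a label index and a path index can be built in one preorder traversal:

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def build_indexes(root):
    """Build a label index and a path index in one preorder traversal.

    Returns (label_index, path_index): label_index maps a label to the
    occurrence list of nodes with that label, and path_index maps the
    string of labels from the root down to a node to its occurrence list.
    """
    label_index, path_index = {}, {}

    def visit(node, path):
        path = path + (node.label,)
        label_index.setdefault(node.label, []).append(node)
        path_index.setdefault(" ".join(path), []).append(node)
        for child in node.children:
            visit(child, path)

    visit(root, ())
    return label_index, path_index

On the data tree from Figure 1.4 this would, for example, map the label b to b₁, …, b₆ and the path "c a a b" to b₂, b₃ and b₅.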




(a) Indexing on label:

a: a₁ a₂ a₃ a₄
b: b₁ b₂ b₃ b₄ b₅ b₆
c: c₁ c₂ c₃ c₄ c₅ c₆
d: d₁ d₂
e: e₁
f: f₁

(b) Indexing on path:

c: c₁
c a: a₁
c a a: a₂ a₄
c a a a: a₃
c a a b: b₂ b₃ b₅
c d: d₂

Figure 1.5: Indexing the data tree from Figure 1.4.

1.4 Research questions

The following are the main research questions I have investigated during the work with this thesis:

• RQ1: How can matches for tree queries be joined more efficiently?

• RQ2: How can pattern matching in the dictionary be done more efficiently?

• RQ3: How can structure indexes be constructed faster and using less space?

These questions will be revisited in Section 3.4.1, where I will evaluate to what extent they have been answered by my research. Note that more efficient query evaluation can mean either that all or most queries are evaluated using less time, or that queries from some important group are evaluated using less time. Preferably, faster evaluation for one group of queries should not cause slower evaluation for other groups.



Chapter 2

Background

“Research is what I’m doing when I don’t know what I’m doing.”
– Wernher von Braun

This chapter presents some underlying concepts for state-of-the-art approaches for TPM on indexed data, which will hopefully ease the understanding of the contributions in the research papers included in this thesis. A high-level conceptual overview is given instead of an in-depth description of details in state-of-the-art solutions, because this is better covered by the included papers where the specific techniques are discussed.

The following discussion divides the problem of TPM on indexed data into three somewhat orthogonal issues: how to construct full query matches from individual query node matches in so-called twig joins, how to partition the underlying data nodes such that as few as possible are read to evaluate a query, and how to efficiently read streams of data nodes during a join.

Notation. The following notation is used in the discussion: A graph G has node set V_G and edge set E_G ⊆ V_G × V_G. All graphs are directed. A graph is a tree if all nodes have one incoming edge except the root, which has zero incoming edges. Nodes with zero outgoing edges are called leaves. A graph is called a forest if it consists of many unconnected trees, i.e., if all nodes have at most one incoming edge. If a relation R relates x to y, this may be denoted both xRy and ⟨x, y⟩ ∈ R and x ↦ y ∈ R. We primarily use angle brackets for graph edges, as in ⟨u, v⟩ ∈ E_G, and the “maps to” arrow for mappings of query nodes to data nodes, as in q ↦ d ∈ M. The transitive closure of a relation R is denoted by R*. In the problems discussed there will mostly be a query tree Q and a data tree D, where each node v ∈ V_Q ∪ V_D has a Label(v) ∈ A. Assume |A| ∈ O(|D|) for simplicity. Each query edge ⟨u, v⟩ ∈ E_Q has an EdgeType(u, v) ∈ {“A–D”, “P–C”}, specifying an ancestor–descendant or a parent–child relationship.

Remember from Section 1.3 that in TPM we have a single node type, and we only differentiate nodes by label, while in XML there are different node types. We can generalize TPM to cover this by using different label codings for different node types, such as for example starting element node labels with “<”.
example starting element node labels with “


2.1 Twig joins

2.1.1 Twig join work-flow

Figure 2.1 shows the work-flow of twig join algorithms, which evaluate a query in two phases: the purpose of the first phase is to construct intermediate results, and of the second to enumerate and output the set of full query matches O. The first phase has two components, where the first merges the streams S_v, materializing the candidate match set I_v for each v ∈ V_Q, into a single stream S, materializing the total input set I.

[Figure: the per-query-node input streams (a₁ a₂ …, b₁ b₂ …, c₁ c₂ …) feed Phase 1, Component 1, the input stream merger, which produces a single stream of pairs (c¹ ↦ c₁, b¹ ↦ b₁, …); Phase 1, Component 2, the intermediate result construction, builds the intermediate results, from which Phase 2, the result enumeration, outputs full query matches.]

Figure 2.1: Work-flow of twig join algorithms.

Figure 2.2 illustrates why the two phases are temporally separate, as in the worst case, all the data must be read before it is known whether or not the nodes in the input are useful. On the other hand, use of the two components in Phase 1 can be temporally overlapping, because Component 2 reads data and query node pairs from Component 1 in some order that can be implemented without lookahead in the individual streams. Note that for some combinations of query and data, the construction of intermediate results is not necessary for linear evaluation (as we exploit in Paper 3 included in Chapter 4).

[Figure: a query with root a¹ and children b¹ and c¹, and a data tree in which b₁, …, bₙ appear before cₙ₊₁, and c₁, …, cₙ appear before bₙ₊₁.]

Figure 2.2: Example showing why Phase 1 and Phase 2 are temporally separate. When the input streams are sorted in tree preorder, it cannot be known whether b₁, …, bₙ are part of a query match before cₙ₊₁ is seen, or whether c₁, …, cₙ are part of a query match before bₙ₊₁ is seen. Note that there is no stream ordering such that all twig queries can be evaluated without storing intermediate results [10].

To understand the design choices in the approach depicted in Figure 2.1, it is easiest to start with the last step, result enumeration, and work backwards. Section 2.1.2 sketches a generic algorithm for enumerating results, and Section 2.1.3 sketches the layout of a generic data structure that enables evaluating that algorithm in linear time. With this as a starting point, I go through various techniques and strategies for implementing the generic approach. Section 2.1.4 briefly presents a common tree position encoding that makes it possible to decide A–D and P–C relationships between data nodes in the various streams in constant time. Section 2.1.5 describes two common data node filtering strategies, and Section 2.1.6 shows how one of these can be used to realize the conceptual data structure from Section 2.1.3 in linear time. Section 2.1.7 describes the input stream merge component, where filtering strategies can be used for practical speedups.

2.1.2 Result enumeration

Algorithm 1 gives a high-level description of how to output all unique query matches that can be constructed from the input. The approach is a generalization of what is used in state-of-the-art twig joins [7, 27, 39, 33]. The algorithm recursively constructs full query matches from partial matches that are known to be part of full query matches, denoted here as partial full matches. Formally, a partial full match is an M′ such that M′ ⊆ M for some full query match M ∈ 𝓜. The set of all partial full matches is 𝓜′ = {M′ | ∃M ∈ 𝓜 : M′ ⊆ M}.

Algorithm 1 Result enumeration

Denote the set of partial full matches by 𝓜′.
Start with M′ = {}, an empty partial full twig match.
Assume any fixed ordering of the nodes in Q, and let v ∈ Q be the first node in this ordering.
For all v′ such that {v ↦ v′} ∈ 𝓜′:
    Call Recurse(v ↦ v′).

The function Recurse(u ↦ u′):
    Insert u ↦ u′ into M′.
    If |M′| = |Q|:
        Output M′.
    Otherwise:
        Let v be the node following u in Q.
        For all v ↦ v′ such that M′ ∪ {v ↦ v′} ∈ 𝓜′:
            Recurse(v ↦ v′).
    Remove u ↦ u′ from M′.

Example 1. We evaluate the query and data in Figure 1.4 using Algorithm 1, and order query nodes in tree preorder. A candidate match for query node a¹ that is part of a full match is data node a₁, and hence one of the top-level calls to Recurse will be with the parameter u ↦ u′ set to a¹ ↦ a₁. After this pair has been inserted into M′, we consider the query node b¹, which follows a¹ in tree preorder. Since M′ = {a¹ ↦ a₁}, and M′ ∪ {b¹ ↦ b₁} is a partial full match, b¹ ↦ b₁ is one of the pairs we recurse with. In that recursive call we have M′ = {a¹ ↦ a₁, b¹ ↦ b₁}, and consider matches for the final query node c¹. As {a¹ ↦ a₁, b¹ ↦ b₁} ∪ {c¹ ↦ c₁} is a partial full match, we again recurse with c¹ ↦ c₁, and output the new M′, since it is a complete full match.

Assume that the set of partial full matches 𝓜′ does not have to be materialized, and that given a partial full match M′ ∈ 𝓜′, where all nodes u preceding v have a mapping u ↦ u′ ∈ M′, all v ↦ v′ such that M′ ∪ {v ↦ v′} ∈ 𝓜′ can be traversed in time linear in their number. Under these assumptions the algorithm can be evaluated in O(|O| · |Q|) time, linear in the total number of data nodes in the output. The intuition is that each recursive call constructs in constant time a partial full match not seen before, and that each unique partial full match yields at least one unique full query match.
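As a minimal sketch of this recursion (not the thesis implementation), assume a hypothetical helper extensions(match, v) that returns exactly the data nodes v′ for which match ∪ {v ↦ v′} is a partial full match; Section 2.1.3 describes a data structure that supports such traversal in linear time.

def enumerate_matches(query_order, extensions):
    """Output all full query matches, mirroring Algorithm 1.

    query_order: the query nodes in a fixed order.
    extensions(match, v): the data nodes v2 such that extending the
    partial full match `match` with v -> v2 gives a partial full match.
    """
    match = {}

    def recurse(i):
        if i == len(query_order):
            yield dict(match)  # a complete full match
            return
        v = query_order[i]
        for v2 in extensions(match, v):
            match[v] = v2
            yield from recurse(i + 1)
            del match[v]

    yield from recurse(0)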

2.1.2.1 Single output query node

In TPM the answers in the result set are all legal ways of matching the query nodes to the data nodes, but in many information retrieval settings other semantics may be more useful. In the XPath language [45] queries have a single output node, and the result set contains all matches for this query node that are part of some full query match. In the XQuery language [47], which is used for more complex information retrieval and processing, there can be any number of output and non-output nodes in the query.

Only minor changes are needed in Algorithm 1 for this generalized case with both output and non-output query nodes. A simple solution is to put the output query nodes first in the fixed ordering, and stop the recursion before non-output nodes are considered. Note that practical data structures that enable linear enumeration for any combination of output and non-output nodes [7] are not as simple as the data structures described in the following sections.

2.1.3 Simple intermediate result architecture

Figure 2.2 illustrated why it is not possible to output query matches directly by just inspecting the heads of the streams for each query node. In the example all the nodes labeled c must be read before it can be known whether or not any of the nodes labeled b are useful, and vice versa.

The purpose of storing intermediate results is to organize the data nodes in such a way that an implementation of the approach in Algorithm 1 can be evaluated efficiently. If the query nodes are ordered in tree preorder, it is natural to maintain for each u ↦ u′ that is part of a full query match, for each child v of u, the list of pairs v ↦ v′ used together with u ↦ u′ in some full query match. Figure 2.3 illustrates this strategy. In addition to the lists of pointers to useful child query node matches for each pair, there must be a list of pointers to the data nodes that match the query root in full query matches.

[Figure: for each query node, the list of label-matching data nodes, with pointer lists from each useful pair to the usable matches for the child query nodes, and a separate list of pointers to the full match roots.]

Figure 2.3: Generic intermediate results for the data tree in Figure 1.4.




This data structure takes O(|I| + |O| · |Q|) space, linear in the size of the input and output, because the lists of data nodes take O(|I|) space, and each root pointer or child match pointer is used at least once in Algorithm 1, which has time complexity O(|O| · |Q|).

The following intuition shows how this data structure can be used to efficiently implement Algorithm 1 when query nodes are ordered in tree preorder: (i) The pairs v ↦ v′ in the initial calls in the outer for-loop are trivially found by traversing the list of pointers to full match roots. (ii) In a recursive call, after u ↦ u′ has been added to M′, the current M′ is a partial full match by assumption. Let v be the node following u in preorder, and let p be the parent of v (possibly p = u). All query nodes preceding v have a mapping in M′, and assume M′(p) = p′. Let Q_p and Q_v be the subgraphs resulting from removing the edge ⟨p, v⟩ from Q. These subqueries can be matched independently when the mapping of both p and v is fixed in a way such that EdgeType(p, v) is satisfied. If v ↦ v′ is used in some full query match together with p ↦ p′, we know that ⟨p′, v′⟩ satisfies EdgeType(p, v). Then, if M′ is a partial full match, M′ ∪ {v ↦ v′} must also be a partial full match.

Example 2. This example illustrates how to implement the data access for Example 1 using the data structure in Figure 2.3. The first match for the query root a¹ that is part of a full match is the data node a₁, and hence the first non-empty partial full match in Algorithm 1 is M′ = {a¹ ↦ a₁}. When considering the next query node in preorder, b¹, we see from the pointers in the data structure that b₂ is the first data node usable together with a₁. Hence the next partial full match is M′ = {a¹ ↦ a₁, b¹ ↦ b₂}. Then, when considering the next query node c¹, we see that the data node c₅ is the only data node usable with a₁, the current match for a¹, the parent of query node c¹. We insert c¹ ↦ c₅ to get the full match M′ = {a¹ ↦ a₁, b¹ ↦ b₂, c¹ ↦ c₅}.

2.1.4 Tree position encoding

To construct the intermediate results efficiently it must be decidable from position information following the data nodes whether or not they satisfy A–D and P–C relationships. A common solution is the interval-based BEL encoding [56], where each node is given the integer numbers begin, end and level, as shown in Figure 2.4.

[Figure: a tree with BEL triples; the root (1,10,1) has children (2,5,2) and (6,9,2), the first with leaves (3,3,3) and (4,4,3), the second with leaves (7,7,3) and (8,8,3).]

Figure 2.4: The BEL encoding for a tree, with begin, end and level numbers.

This encoding is similar to preorder and postorder traversal numbers, and can be computed in a depth-first traversal of the tree. The reason the encoding is often preferred is probably that the begin and end numbers correspond to the document position of opening and closing tags in XML.




With the BEL encoding, a node a is an ancestor of a node b iff a.begin < b.begin and b.begin < a.end, and it is a parent if also a.level + 1 = b.level. Sorting on begin or end numbers respectively gives the same sorting orders as preorder and postorder traversal numbers.
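A small sketch of the encoding and the resulting relationship tests (reusing the illustrative Node class from the index sketch in Section 1.3.1; leaves reuse their begin number as end, which reproduces the numbers in Figure 2.4):

from itertools import count

def assign_bel(node, counter=None, level=1):
    """Assign begin/end/level numbers in one depth-first traversal."""
    counter = counter if counter is not None else count(1)
    node.begin = next(counter)
    node.level = level
    for child in node.children:
        assign_bel(child, counter, level + 1)
    # A leaf occupies a single position; an internal node also consumes
    # a position for its closing tag.
    node.end = node.begin if not node.children else next(counter)

def is_ancestor(a, b):
    return a.begin < b.begin and b.begin < a.end

def is_parent(a, b):
    return is_ancestor(a, b) and a.level + 1 == b.level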

There exists a large number of tree position encodings with different properties [50]. Some allow decision of more types of node relationships, and some allow reconstruction of related nodes. They differ in the computational cost of evaluating relationships, space usage, and how well they handle updates in the data tree.

2.1.5 Partial match filtering

When constructing intermediate results it is often possible to filter out some query and data node pairs that will never be part of a full query match. In current twig join algorithms filtering is used both for practical speedup [5, 27, 33], and/or as a necessity for worst-case efficient result enumeration [7].

A filtering strategy does not have to be perfect, but it must certainly not remove pairs that are part of full query matches. In other words, it can have false positives, but not false negatives. Most filtering strategies are based on the observation that if there is some subquery (a subgraph of the query), such that the pair v ↦ v′ is not part of any match for the subquery, then v ↦ v′ is not part of any match for the entire query, and can safely be thrown away [21].

The two most common filtering strategies are illustrated in Figure 2.5. The first is based on checking if query prefix paths are matched [5, 27, 33], and the second on checking if query subtrees are matched [7, 39, 33]. The prefix path of a query node is the subquery containing the nodes on the path from the root down to the node.

[Figure: (a) a query tree with root a¹, children b¹ and c¹, and children f¹ and b² under b¹; (b) the data tree from Figure 1.4 with the prefix path matchers highlighted; (c) the same data tree with the subtree matchers highlighted.]

(a) Query. (b) Matching prefix paths. (c) Matching subtrees.

Figure 2.5: Matching query parts.

We call a pair v ↦ v′ that is part of a prefix path match for v a prefix path matcher. Filtering query and data node pairs on whether or not they are prefix path matchers is easy to implement with an inductive strategy: Assuming that v ∈ Q has parent u, the pair v ↦ v′ is a prefix path matcher for v if and only if there exists a pair u ↦ u′ that is a prefix path matcher for u such that ⟨u′, v′⟩ satisfies the A–D or P–C relationship specified by EdgeType(u, v) [5]. Prefix path filtering is easiest to implement when data nodes are seen in tree preorder, where ancestors are seen before descendants.
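A sketch of this inductive strategy during a preorder traversal (illustrative names only; the query is assumed to be given as a list of (label, parent index, edge type) triples in query preorder, with parent index None for the root):

def prefix_path_matchers(data_root, qnodes):
    """Yield (query index, data node) for every prefix path matcher.

    qnodes: list of (label, parent index or None, edge type) triples.
    A stack of open ancestors carries, for each data node, the set of
    query nodes for which it is a prefix path matcher.
    """
    stack = []  # (data node, set of matched query indices)

    def visit(d):
        matched = set()
        for qi, (qlabel, qparent, qedge) in enumerate(qnodes):
            if qlabel != d.label:
                continue
            if qparent is None:
                matched.add(qi)  # the query root matches on label alone
            elif qedge == "A-D":
                if any(qparent in m for _, m in stack):
                    matched.add(qi)
            elif stack and qparent in stack[-1][1]:  # "P-C"
                matched.add(qi)
        for qi in matched:
            yield (qi, d)
        stack.append((d, matched))
        for child in d.children:
            yield from visit(child)
        stack.pop()

    yield from visit(data_root)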

Example 3. Figure 2.5b illustrates prefix path match checking. The pair a¹ ↦ a₁ is trivially a prefix path matcher, and b¹ ↦ b₃ must then be a prefix path matcher because EdgeType(a¹, b¹) = “A–D” and ⟨a₁, b₃⟩ ∈ E_D*. This again implies that f¹ ↦ f₁ must be a prefix path matcher because EdgeType(b¹, f¹) = “P–C” and ⟨b₃, f₁⟩ ∈ E_D.

Filtering pairs on whether or not they are subtree matchers can be implemented with a similar strategy: The pair v ↦ v′ is a subtree matcher if and only if for each child w of v, there exists a subtree matcher w ↦ w′ such that ⟨v′, w′⟩ satisfies the A–D or P–C relationship specified by EdgeType(v, w) [7]. Subtree match filtering is easiest to implement when data nodes are seen in tree postorder.

Example 4. Figure 2.5c illustrates subtree match checking. The pairs f¹ ↦ f₁, b² ↦ b₄ and c¹ ↦ c₅ are trivially subtree matchers because the query nodes are leaves. The pair b¹ ↦ b₃ is a subtree matcher because f¹ ↦ f₁ is a subtree matcher and ⟨b₃, f₁⟩ ∈ E_D satisfies EdgeType(b¹, f¹) = “P–C”, and because b² ↦ b₄ is a subtree matcher and ⟨b₃, b₄⟩ ∈ E_D* satisfies EdgeType(b¹, b²) = “A–D”. The pair a¹ ↦ a₁ is a subtree matcher because b¹ ↦ b₃ is a subtree matcher and ⟨a₁, b₃⟩ ∈ E_D* satisfies EdgeType(a¹, b¹) = “A–D”, and because c¹ ↦ c₅ is a subtree matcher and ⟨a₁, c₅⟩ ∈ E_D satisfies EdgeType(a¹, c¹) = “P–C”.

2.1.6 Intermediate result construction

The filtering on matched subtrees described in the previous section is strongly related to a strategy that can be used to efficiently build a data structure that realizes the conceptual structure depicted in Figure 2.3. What is described in the following is a slight simplification of what is used in the Twig²Stack [7] algorithm, which was the first twig join algorithm with cost linear in the size of the input data and the output result set.

The reason preorder processing of data nodes and filtering on matched prefix paths is not a suitable starting point for a worst-case efficient algorithm, is that even though paths in the data do match paths in the query, it is hard to figure out on the fly during preorder processing whether or not other paths in the query can use the same branching nodes. On the other hand, with postorder processing matches for the query can be constructed bottom up by combining subtree matches into bigger subtree matches. The storage order of data nodes in the index does not have to be changed for postorder processing, as a preorder stream of match pairs can be translated to a postorder stream with a stack: When a pair v ↦ v′ is read in preorder, all pairs u ↦ u′ on the stack such that u′ is not an ancestor of v′ are popped off and processed one by one, before v ↦ v′ is pushed on the stack.
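This translation can be sketched as follows (a minimal sketch, assuming each data node carries the BEL numbers from Section 2.1.4; pairs for the same data node stay stacked together, so they come back out with their query order reversed, as Section 2.1.7 requires):

def preorder_to_postorder(pairs):
    """Translate a stream of (query node, data node) pairs from data
    node preorder to data node postorder using a stack."""
    stack = []
    for v, d in pairs:
        # Pop every pair whose data node neither equals d nor is an
        # ancestor of d (interval test on the BEL begin/end numbers).
        while stack and not (stack[-1][1].begin <= d.begin <= stack[-1][1].end):
            yield stack.pop()
        stack.append((v, d))
    while stack:
        yield stack.pop()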

When following the strategy from Sections 2.1.2 and 2.1.3, the key to efficient enumeration of results is the ability to efficiently find usable subtree matches. Given a candidate v ↦ v′, we need to find, for all children w of v, the list of matchers w ↦ w′ such that ⟨v′, w′⟩ satisfies EdgeType(v, w). Subtree matches for the query root are trivially full query matches.




The overall strategy for the proposed data structure is to maintain for each query node v a list of disjoint trees T_v consisting of node matches from the stream S_v, as shown in Figure 2.6. Some additional dummy nodes are used to bind the trees together. For each data node in the trees for a query node, there is a list of pointers to usable child query node matches. P–C matches are pointed to directly, while A–D matches are found in the entire subtrees pointed to.

[Figure: the lists of trees T_a¹, T_b¹ and T_c¹ over the data nodes, with pointers from each pair to its usable child query node matches.]

Figure 2.6: Postorder construction of intermediate results for the data and query in Figure 1.4.

Postorder construction <strong>of</strong> intermediate results for the data <strong>and</strong> query in<br />

Algorithm 2 shows how this data structure can be constructed, specifying the processing of a single pair v ↦ v′ in postorder. For each query node v, there is a list T_v of disjoint trees consisting of subtree matchers v ↦ v′ where v′ ∈ S_v. When processing a pair v ↦ v′, the trees where the root data nodes are descendants of v′ are joined into single trees, both in the lists T_w for the children w of v, and in the list T_v for v itself. For P–C edges, pointers from v ↦ v′ to w ↦ w′ denote single direct child matches, while for A–D edges, pointers denote that entire subtrees contain matches. A pair v ↦ v′ is only added if there is at least one pointer for each child w of v, and this effectively implements subtree match filtering as described in Section 2.1.5.

Example 5. Figure 2.7 shows the step processing a¹ ↦ a₁ when constructing intermediate results for the data and query from Figure 1.4 with Algorithm 2. The trees at the end of T_b¹, where the roots are b₂ and a dummy node, are joined into a single tree. So are the trees at the end of T_c¹, where the roots are c₃, c₄ and c₅. Pointers are added from a¹ ↦ a₁ to the tree of descendants in T_b¹, and to the child match c¹ ↦ c₅ in T_c¹. Since a¹ ↦ a₁ has pointers both to matches for b¹ and c¹, it is a subtree match, and is added to T_a¹.

When evaluating the input I with Algorithm 2, the total number of calls to the procedure Process() would be Σ_{v∈V_Q} |S_v| = |I|, and the total number of rounds in the for-loop would be Σ_{v∈V_Q} |I_v| · b_v ∈ O(|I| · b_Q), where b_v is the number of children of v and b_Q is the maximal number of children for any node in Q.

27



Algorithm 2 Postorder intermediate result construction

Function Process(v ↦ v′):
    For each child w of v:
        Let T′_w be the trees at the end of T_w whose root nodes are descendants of v′.
        If EdgeType(v, w) = “P–C”:
            Add pointers from v ↦ v′ to all w ↦ w′ in T′_w where depth(w′) = depth(v′) + 1.
        If |T′_w| > 1:
            Replace T′_w by a dummy node with the trees from T′_w as children.
        If EdgeType(v, w) = “A–D” and |T′_w| > 0:
            Add a descendant pointer from v ↦ v′ to the single node in T′_w.
    If v ↦ v′ does not have at least one pointer per child w of v:
        Discard v ↦ v′ and return failure.
    Remove from the end of T_v all roots whose data nodes are descendants of v′, add them as children of v ↦ v′, and append v ↦ v′ to T_v.

[Figure: two snapshots of the lists of trees T_v, before and after processing a¹ ↦ a₁; the trees rooted at b₂ and a dummy node in T_b¹, and at c₃, c₄ and c₅ in T_c¹, are joined, and a¹ ↦ a₁ is appended to T_a¹.]

(a) Before adding a¹ ↦ a₁. (b) After adding a¹ ↦ a₁.

Figure 2.7: A step in postorder construction of intermediate results for the data and query in Figure 1.4. Dotted boxes give the current list of trees T_v for each v ∈ V_Q.




Apart from constant time operations for each input v ↦ v′ and each child w of v, there is some non-trivial cost associated with merging trees and adding pointers to P–C and A–D child matches. A merge attempt either inspects only one tree root and does not change T_v, or inspects k > 1 roots, removes k − 1 roots from T_v and adds a new one. This means that the cost of merge operations is bounded by the number of attempts and the sizes of the trees, i.e., Σ_{v∈V_Q} O(|I_v| + |I_v| · b_v). Now consider the cost of adding pointers from matches for a query node v to matches for a child query node w. If EdgeType(v, w) = “A–D”, then only a single edge is added from each v ↦ v′. If EdgeType(v, w) = “P–C”, then only a single edge is added to each w ↦ w′, as a node can have only one parent. In conclusion, the total cost of using Algorithm 2 is Σ_{v∈V_Q} O(|I_v| + |I_v| · b_v) ⊆ O(|I| + |I| · b_Q).

What is presented here is a slight simplification of the Twig²Stack algorithm [7]. The main difference between the above depiction and Twig²Stack is that in the latter, the data structure for each query node is a list of trees of stacks of nodes, instead of simply a list of trees of nodes. Many alternative twig join algorithms have been presented [27, 39, 33] in the years following the publication of the Twig²Stack algorithm. What is common to these algorithms is that they have improved practical performance, but higher worst-case complexity in the result enumeration phase. An example is the TwigList algorithm, which stores intermediate nodes in simple vectors instead of trees, and implements a weaker form of subtree filtering, where all query edges are considered to have type A–D.

2.1.7 Merging input streams

The final component missing to implement the strategy in Figure 2.1 is the input stream merger. The input to the merge is one preorder sorted stream representing I_v for each v ∈ Q, and the desired output is a sorted stream representing I. The sort order required for using the approach from Section 2.1.6 is that the pairs v ↦ v′ ∈ I are sorted primarily on the preorder of the data nodes, and secondarily on the postorder of the query nodes. This means that after translating the stream into data node postorder with a stack, the new stream is sorted secondarily on query node preorder. This is required by Algorithm 2 for cases where a single data node matches multiple query nodes, as a data node could hide useful children of itself if the sorting was not secondarily on query node preorder.

The simplest merge approach is to traverse the query in postorder, and find some minimum v ↦ v′ by taking a preorder minimal v′ that is the head of a stream I_v for a postorder minimal v. This takes Θ(|Q|) time per extraction, and gives a total cost of Θ(|I| · |Q|) for the merge. An asymptotically better approach is to organize the individual streams in a priority queue implemented with a binary heap, sorted primarily on the heads of the streams and secondarily on the query nodes. Extractions then take O(log |Q|) time, and the total cost is O(|I| log |Q|) [11].
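A sketch of the heap-based merger using Python's heapq (illustrative, not from the papers; each stream is assumed preorder-sorted and tagged with its query node's postorder rank):

import heapq

def merge_streams(streams):
    """k-way merge into one stream of (query node, data node) pairs,
    sorted primarily on data node preorder (BEL begin number) and
    secondarily on query node postorder rank.

    streams: list of (postorder rank, query node, iterable of data
    nodes in preorder).
    """
    heap = []
    for rank, qnode, nodes in streams:
        it = iter(nodes)
        head = next(it, None)
        if head is not None:
            # id() breaks remaining ties without comparing node objects.
            heapq.heappush(heap, (head.begin, rank, id(it), head, qnode, it))
    while heap:
        begin, rank, _, d, qnode, it = heapq.heappop(heap)
        yield qnode, d
        head = next(it, None)
        if head is not None:
            heapq.heappush(heap, (head.begin, rank, id(it), head, qnode, it))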

Since the preorder and postorder tree traversal numbers we are sorting on are bounded by the size of the input, the sorting complexity is not loglinear, but linear under the unit cost assumption. The entire set I can be put in a single array, and sorted using radix sort in Θ(|I|) time [11]. As the intermediate result construction is already O(|I| · b_Q), the radix sort approach gives no advantage over the heap based approach when log |Q| ∼ b_Q. Since the latter uses much less memory in practice, Θ(|Q|) instead of Θ(|I|), it is preferable in most real-world scenarios.




Some of the newer twig join algorithms storing intermediate results in preorder [27, 33] use an O(|I| · |Q|) input stream merge component that implements a weak form of subtree match filtering, where all query edges are considered to have type A–D [5]. The merger uses only O(|Q|) memory and is very fast in practice because queries are typically small. It returns data nodes in a relaxed preorder, where the ordering is only guaranteed between matches for query nodes related by ancestry. This stream is not easily translated into postorder, and hence the merger is not used for postorder processing algorithms [21].

2.1.8 Data locality <strong>and</strong> updatability<br />

This chapter does in general not make a distinction between data stored in main memory<br />

<strong>and</strong> on disk, but in practical implementations it is important to consider the costs <strong>of</strong><br />

different access patterns in different media. While main memory on modern computers<br />

does not really have a uniform memory access cost, due to the use <strong>of</strong> caches, we can<br />

design usable systems that use r<strong>and</strong>om memory reads <strong>and</strong> writes. On the other h<strong>and</strong>, if<br />

the data is so large it must reside on disk, a system that uses a lot <strong>of</strong> r<strong>and</strong>om access will<br />

not be efficient in practice.<br />

Consider now the different phases and components in our twig join strategy. The input stream merger is assumed to only inspect stream heads and store a minimal amount of state. Hence it should work well on an architecture where the candidate matches for each query node are streamed from disk. The intermediate result construction, as shown in Algorithm 2, inspects in each call a number of tree roots stored contiguously at the end of the current list of trees for each query node. This in itself is simple to implement with good spatial locality, but it should also be considered how the layout of data affects the result enumeration phase. Luckily, if intermediate nodes are streamed onto disk and inserted into blocks in postorder, most nodes that are close in the data tree will be stored close together on disk. This strategy gives fairly good spatial locality during result enumeration [7].

The problem of intermediate results exceeding the size of main memory can be avoided in many practical cases, by observing that when the uppermost candidate match for the root query node is closed, none of the data nodes seen so far in the tree preorder will be used in any match involving data nodes later in the tree preorder [7]. This means that when the uppermost query root match candidate is closed, the current intermediate data can be used to enumerate the current set of query matches, before this data is discarded.

Example 6. Consider the data in Figure 1.4, and an algorithm that pushes nodes onto a stack in preorder and pops them off in postorder. When the data node b_6 is processed, it causes the popping of a_1, and there are no more a-nodes on the stack. As a match for the query node a_1 must be above the match for any other query node in a full query match, no nodes preceding b_6 in the data will be involved in a match together with nodes following and including b_6. Hence we can enumerate results, and delete the current intermediate data structures.
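A hypothetical sketch of the flushing trick follows; the data nodes arrive in preorder as (begin, end, label) triples, root_label marks candidate matches for the query root, and the two callbacks stand in for the real enumeration and intermediate result components:

```python
def process(stream, root_label, enumerate_results, discard_intermediate):
    """stream: data nodes in preorder as (begin, end, label) triples."""
    open_roots = []  # end values of currently open candidate root matches
    for begin, end, label in stream:
        # Pop candidate root matches that end before this node begins.
        while open_roots and open_roots[-1] < begin:
            open_roots.pop()
            if not open_roots:
                # The uppermost root candidate just closed: nothing seen so
                # far can combine with later nodes, so enumerate and discard.
                enumerate_results()
                discard_intermediate()
        if label == root_label:
            open_roots.append(end)
        # ... feed (begin, end, label) to intermediate result construction ...
    enumerate_results()
```

In Example 6, the flush triggers exactly when b_6 pops a_1 off the stack of open root candidates.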

In many practical cases with large amounts of data, the underlying information is stored in a large number of independent documents of moderate size, and in these cases the above trick is always applicable. Data updates are also easy to handle in such a setting. A way of encoding global data node positions is by combining document identifiers and local node position encodings, such as BEL, and this simplifies updates: Updating a document can be viewed as deleting it and then re-adding it with a new document identifier, as is common in search systems for unstructured data [51]. Note that when the data is a single large tree that cannot easily be partitioned into independent documents, we need a node position encoding that has affordable cost for tree updates. There exist a number of such encodings with different properties [50].
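One plausible rendering of such a combined position encoding, here on top of the BEL-style (begin, end, level) intervals from Section 2.1.4; the type and field names are illustrative:

```python
from typing import NamedTuple

class Pos(NamedTuple):
    """Global node position: a document identifier combined with a local
    BEL-style (begin, end, level) encoding."""
    doc: int
    begin: int
    end: int
    level: int

def is_ancestor(a: Pos, d: Pos) -> bool:
    """A-D relationship: same document and proper interval containment."""
    return a.doc == d.doc and a.begin < d.begin and d.end < a.end

def is_parent(a: Pos, d: Pos) -> bool:
    """P-C relationship: an ancestor exactly one level up."""
    return is_ancestor(a, d) and a.level + 1 == d.level
```

Deleting or re-adding a document then only touches positions with that document identifier, while positions in all other documents remain valid.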

2.1.9 Twig join conclusion

We have now discussed all the components in a state-of-the-art twig join algorithm, and the costs of the different components are:

• input stream merge: O(|I| log |Q|) for the heap-based approach,
• intermediate result construction: O(|I| · b_Q),
• and result enumeration: O(|O| · |Q|).

This gives a total combined data, query and result complexity of O(|I| log |Q| + |I| · b_Q + |O| · |Q|). Commonly the size of the query is viewed as a constant, and twig join algorithms are called linear and optimal if the combined data and result complexity is O(|I| + |O|).

2.2 Partitioning data

In the previous discussion it was assumed that the data nodes were partitioned on label in the index. This section considers the advantages and challenges that arise from more advanced indexing strategies.

2.2.1 Motivation for fragmentation

Let us first recap the introduction to the general strategy for TPM on indexed data from Section 1.3.1: The index is a mechanism which provides a function from some feature of a node to the set of nodes in the data that have this feature.

The main motivation for using an index is of course reading and processing less data during query processing. If node labels are selective then simple label partitioning is an efficient approach, but this is not always the case. Figure 2.8 shows a case with many label-matches for the individual query nodes in the data, but only a few full matches for the query.

The above example may be unrealistic, but reconsider the data in Figure 1.1 and the query in Figure 1.2 on page 14. If the given library has billions of books, then the cost of reading the data nodes labeled book will be huge compared to the size of the output result set. This motivates the use of a more fragmented partitioning of the data to improve the selectivity of query nodes. Note that another way of improving performance in these cases is to use skipping, discussed later in Section 2.3.1.

[Figure 2.8: Partitioning on label. (a) Example query and data, showing the first of four matches. (b) Example query and streams read; marked stream nodes are useful.]

2.2.2 Path partitioning

A natural extension of label partitioning is to partition data nodes on the paths by which they are reachable [37, 13, 36, 8]. Section 2.1.5 described how useless data nodes could be filtered out during intermediate result construction if they did not match prefix paths in the query. When indexing data nodes on prefix path, the same filtering is performed in advance, and we only process data nodes from classes where the prefix paths match the prefix paths in the query.

To identify useful partitions when evaluating a query, we need some form of dictionary. In Figure 1.5b on page 17 a simple dictionary of path strings was used in the index, but this approach does not have attractive worst-case properties. There may be many unique paths in the data, and the size of this naive dictionary can be O(|D|²) if the tree is deep. A more robust approach is to use a dictionary tree called a path summary, where shared prefixes of paths are only encoded once.

Figure 2.9a shows the path partitioning for the data tree in Figure 2.8a. A path summary can be constructed from this partitioning by creating one node for each block in the partition, and creating edges between summary nodes whenever there are edges between data nodes in the related blocks, as shown on the left in Figure 2.9b.
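Equivalently, the summary can be built in a single traversal that classifies each data node by its prefix path as it goes. A minimal sketch, assuming label and children accessors on the data tree (the names are illustrative):

```python
def build_path_summary(root, label, children):
    """Builds a path summary for the tree under root. Returns (edges,
    classification): edges maps (parent summary id, label) to a summary
    node id, and classification maps each data node to the summary node
    for its prefix path. Id 0 is a virtual super-root."""
    edges = {}
    classification = {}
    next_sid = 1
    stack = [(root, 0)]
    while stack:
        node, parent_sid = stack.pop()
        key = (parent_sid, label(node))
        sid = edges.get(key)
        if sid is None:
            sid = edges[key] = next_sid
            next_sid += 1
        classification[id(node)] = sid
        for child in children(node):
            stack.append((child, sid))
    return edges, classification
```

The summary gets one node per distinct root-to-node label path, so shared prefixes are encoded only once.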

Prefix path matches for each query node can be found individually by using a matching algorithm on the summary tree, but this may give many individual matches that never take part in full query matches. A robust and efficient way to find useful prefix path matches is to index the summary itself on label, and use a twig join algorithm to evaluate queries directly on the summary to find relevant nodes [2].

[Figure 2.9: Partitioning on prefix path. (a) Path partition. (b) Summary, query and streams read.]

Figure 2.9b shows how a query is evaluated on the data from Figure 2.8a. The legal mappings of query nodes to summary nodes are found, and streams of related data nodes are read. Note that in this particular example there is only one match in the summary for each query node. If one query node matches multiple summary nodes, the streams of data nodes related to each of these summary nodes must somehow be merged into a single stream [8, 3].

An extra bonus with path indexing applies for non-branching queries where the leaf node is the only output node. For such queries, the data nodes related to the summary nodes matched by the output query node can be read directly from the index, without using any join. It is known that these data nodes have the necessary ancestor nodes for matching the path query, and hence the ancestors do not have to be read. As an example, if the query node a_2 in Figure 2.9b is removed, and a_4 is the only output node, all matching data nodes can be read from the extents of the matching summary nodes, which in this case is only a_p5.

2.2.3 Backward and forward path partitioning

With path indexing, data nodes are placed in the same block in the partition iff they have the same incoming paths. An alternative recursive formalization is that two nodes are in the same block iff they have the same label and both have parents from the same block [36]. This can be extended by also requiring that the nodes have children from the same blocks, yielding a partition on the sets of incoming and outgoing paths [28]. With this backward and forward path partitioning, branching queries can be evaluated more efficiently than with simple backward path partitioning [28].

Figure 2.10 shows backward and forward path partitioning, the resulting structure summary, and the evaluation of a query. Note that in this example, each query node matches only one summary node, but this is not always the case.

[Figure 2.10: Forward and backward path partitioning. (a) F&B partition. (b) Summary, query and streams read.]

Recall that for simple path indexing, non-branching queries with a single leaf output node could be evaluated without joins, by just reading the matches for the output leaf. The reason was that the existence of useful ancestor nodes was implied from the summary matching. A bonus with backward and forward path indexing is that the existence of useful matches both for ancestor and descendant query nodes is implied from the summary matching, even for branching queries. Assuming there is a match for the query in the backward and forward path summary, as shown in Figure 2.10b, all data nodes related to matched summary nodes are guaranteed to be part of at least one full query match [28]. In the example, it is certain that all data nodes classified by a_s2 have children classified by a_s3 and b_s4, and that data nodes classified by b_s4 have children classified by a_s5 and b_s6.

A second bonus is that no joins are necessary for any query with a single output node. All matching can be performed in the summary, and relevant data nodes matching the output query node can be read directly from the blocks related to the matched summary nodes.

The formal definitions of the recursive partitioning strategies sketched above are based on binary relations called bisimulations [36, 28], which can be computed in O(|E| log |V|) time for any graph [38].

2.2.4 Balancing fragmentation

The main advantage of the finer partitioning schemes is that the increased fragmentation usually leads to less data read from the index, as smaller blocks are read for each query node. This does not come without a cost. A larger number of blocks in the partition gives a larger summary structure, which leads to more expensive initial matching in the summary. Over-refined partitioning can even increase query evaluation cost before the expenses of summary matching are considered: If the properties used to classify data nodes in the index are more detailed than what a query describes, then many partition blocks have to be read and merged for each query node. To sum up, we need to balance, on the one hand, the number of false positives read for each query node, and on the other, the cost of summary matching and merging blocks.

A way of reducing the cost of summary matching for strong structural indexing with high fragmentation is to use a multi-level strategy. For example, the data can be indexed with backward and forward path partitioning, the resulting backward and forward path summary indexed with simple backward path partitioning [52], and the resulting backward path summary indexed with label partitioning [2]. Figure 2.11 shows conceptually how a query can be evaluated with such an index.

[Figure 2.11: Multi-level partitioning, showing (a) the query, (b) the backward path summary, (c) the backward and forward path summary, and (d) the underlying data.]

The path-based partitioning strategies presented here give over-refinement in some practical cases, and various weaker indexing schemes have been developed. There exist indexing schemes with fragmentation between simple path indexing and backward and forward path indexing [28], schemes considering only shorter local paths [30], and schemes that adapt to the query workload [6].

A way of reducing the impact of the fragmentation problem is to use an adaptive strategy with different partitioning schemes for different classes of data nodes. For example, label partitioning could be used for nodes with infrequent labels, and path indexing for nodes with frequent labels. A static strategy used for XML data is to use path indexing for XML element nodes (tags), and value indexing for XML text values and attributes [29]. This works well in practice, because the alphabet of element node names is typically small, while the alphabet of text values is typically very large. It also lets the same index be used for both semi-structured and unstructured query processing, as simple text values can be looked up directly in the index.

2.3 Reading data

Section 2.1 was concerned with how to join together full query matches from individual node matches, while Section 2.2 was concerned with how to partition the data in such a way that as few matches as possible had to be read for each individual query node. In Section 2.1.7 it was mentioned that some input stream mergers implement an inexpensive form of weak filtering to relieve the more expensive filtering in the intermediate result construction component. This section considers how to store the data in a way such that this filtering can be performed more efficiently, even without reading all individual query node matches.

2.3.1 Skipping

Consider evaluating the query “Britney Grimsmo” on a regular web search engine. The simplest way to find the co-occurrences of the two terms is to go through the two lists of occurrences in parallel. But if there are many more occurrences of “Britney” than of “Grimsmo”, it is more efficient to read through the hits for the second term and somehow search for related matches in the list of hits for the first term, skipping those that are irrelevant. This can be done either with some sort of binary search, if all the data is stored uncompressed in main memory, or with a specialized data structure, such as a skip list or a B-tree.

Skipping in joins of streams of different length is also relevant for queries on semi-structured data, as shown by the example library query in Figure 1.2a, where there are probably far fewer matches for “Gödel” than for book. Unfortunately, skipping is not always trivial to implement for tree queries. In the following it is discussed how to efficiently skip through matches when aligning streams for parent and child query nodes, and how to make sure the skipping is efficient for the query as a whole. Figure 2.12 illustrates different issues in connection with skipping twig joins.

[Figure 2.12: Cases where different skipping techniques are needed for efficient processing. (a) Query. (b) Descendants easily skipped with a B-tree or similar. (c) XR-tree needed to skip ancestors. (d) Holistic skipping preferred. (e) Holistic skipping with XR-tree needed.]

2.3.1.1 Skipping child matches

For a given parent and child query node pair, we want to efficiently forward the related streams of data nodes to the first pair of stream positions where a match for the parent query node is an ancestor (or parent) of the match for the child query node. We will first consider the simpler problem of forwarding a stream of matches for the child query node to catch up with the parent's stream.

Figure 2.12b shows a case where there is a current match b_1 for the query node b_1, and we want to find the first match for the child query node c_1 which is a descendant of b_1. In other words, we desire the first match for c_1 that follows b_1 in preorder. If this node also precedes b_1 in postorder, it must be a descendant. For example, let b_1.begin = k, using the BEL encoding described in Section 2.1.4. We can binary search for the first match for c_1 with begin value greater than k, which is c_q. As also c_q.end < b_1.end, we have that c_q must be a descendant of b_1.
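A minimal sketch of this forwarding step, assuming the child stream is an in-memory list of (begin, end) pairs sorted on begin:

```python
from bisect import bisect_right

def forward_child_stream(child_stream, pos, b):
    """child_stream: list of (begin, end) pairs sorted on begin. Forwards
    from position pos to the first match beginning after b.begin, and
    reports whether that match is a descendant of b = (begin, end)."""
    b_begin, b_end = b
    i = bisect_right(child_stream, (b_begin, float("inf")), lo=pos)
    if i < len(child_stream):
        begin, end = child_stream[i]
        # Follows b in preorder; a descendant iff it also ends within b.
        return i, end < b_end
    return i, False
```

If the stream resides on disk, the binary search would be replaced by a lookup in a skip list or B-tree, as noted above.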

2.3.1.2 Skipping parent matches

Skipping through matches for a parent query node to find the first data node that is an ancestor of the current match for a child query node is not as simple, as illustrated in Figure 2.12c. Given a current match c_1 ↦ c_1, the first ancestor of c_1 in the stream for b_1 could be any node before c_1 in the preorder. Preceding ancestors are mixed with preceding non-ancestors in the preorder sorting. They are spread throughout the stream, and hence it is hard to predict where they are. Note that sorting in postorder does not solve this problem, as an ancestor of a node can be any node later in the postorder.

Implementing parent match skipping requires specialized data structures. The XR-tree is a B-tree variant that can be used to retrieve all r ancestors of a node from n candidates in O(log n + r) time [25, 26]. The data structure maintains one tree for each node label α, and conceptually stores for each node u with Label(u) = α the list of ancestors of u that have label α. Linear space usage is achieved by removing the redundancy of storing common ancestors multiple times.

2.3.1.3 Holistic skipping

A common method for forwarding the streams associated with the nodes in a query is to pick an edge, and repeatedly forward the streams for the parent and child node until they are aligned, then pick another edge, and so on, until the entire query is aligned. Unfortunately, this procedure can lead to suboptimal behavior, as illustrated by Figure 2.12d. As the query edge ⟨a_1, b_1⟩ is satisfied initially, the streams for b_1 and c_1 will be repeatedly forwarded until the edge ⟨b_1, c_1⟩ is satisfied. This means all data nodes b_2 … b_p and c_2 … c_p will be inspected.

To avoid this pitfall, the query should be considered holistically when forwarding streams. A robust approach is to repeatedly forward streams top-down and bottom-up in the query [12]. In the top-down pass, a stream is forwarded until the current head follows the head of the stream of the parent query node in preorder. In the bottom-up pass, a stream is forwarded until the current head follows the heads of the streams of all child query nodes in postorder. If this approach was followed in Figure 2.12d, nearly all the useless nodes labeled b and c would be skipped past.
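The two passes might be sketched as follows. This is a simplified illustration, not the algorithm from [12]: streams are in-memory lists with cursors, preorder is compared on begin values and postorder on end values, and heads are advanced one step at a time where a real implementation would skip ahead using binary search or an index structure:

```python
def align_holistically(parent_of, streams):
    """parent_of: dict mapping each query node (numbered in query preorder)
    to its parent query node, None for the root. streams: dict mapping each
    query node to a two-item list [matches, cursor], where matches is a
    list of (begin, end) pairs sorted on begin."""
    def head(q):
        matches, cursor = streams[q]
        return matches[cursor] if cursor < len(matches) else None

    def advance(q):
        streams[q][1] += 1

    children = {q: [c for c, p in parent_of.items() if p == q]
                for q in parent_of}
    changed = True
    while changed:
        changed = False
        # Top-down: forward a stream until its head follows the head of
        # the parent's stream in preorder (compared on begin).
        for q in sorted(parent_of):
            p = parent_of[q]
            while (p is not None and head(q) is not None
                   and head(p) is not None and head(q)[0] < head(p)[0]):
                advance(q)
                changed = True
        # Bottom-up: forward a stream until its head follows the heads of
        # all child streams in postorder (compared on end).
        for q in sorted(parent_of, reverse=True):
            while head(q) is not None and any(
                    head(c) is not None and head(q)[1] < head(c)[1]
                    for c in children[q]):
                advance(q)
                changed = True
```

A stream running empty means the query has no further matches; the sketch simply stops forwarding in that case.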

Figure 2.12e shows a case where both holistic skipping and specialized data structures for ancestor skipping are needed for an optimal skipping strategy.

2.3.2 Virtual streams

In the previous section we considered how data nodes could be efficiently skipped past if they were not part of query matches. This section describes a different approach: Reconstruction of data nodes needed for a complete query match.

2.3.2.1 Virtual matches for non-branching internal query nodes

Reconstructing a data node based on the existence of other data nodes requires some information about the structure the known nodes are part of. A good starting point is to use structural summaries, such as those described in Section 2.2, and store with each data node the classification of the structure it is part of.

Example 7. For the query and data in Figure 2.13, the data nodes are classified on path, with the path summary shown on the right in the figure. Given a current matching a_1 ↦ a_2 and a_4 ↦ a_4, we know that the current data nodes have paths specified by a_p2 and a_p5, and we can deduce by using the path summary that they must have a node on the path between them specified by b_p4.
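The deduction in Example 7 amounts to a walk up the summary's parent pointers. A small sketch, with the summary of Figure 2.13 hard-coded as illustration:

```python
# Path summary of Figure 2.13, kept as parent pointers (illustrative).
summary_parent = {"b_p1": None, "a_p2": "b_p1", "a_p3": "a_p2",
                  "b_p4": "a_p2", "a_p5": "b_p4", "b_p6": "b_p4"}

def implied_nodes_between(lower, upper):
    """Summary nodes strictly between lower and upper, or None if upper
    is not a proper ancestor of lower in the summary."""
    path, s = [], summary_parent[lower]
    while s is not None and s != upper:
        path.append(s)
        s = summary_parent[s]
    return path if s == upper else None

# The matching a_1 -> a_2 (class a_p2) and a_4 -> a_4 (class a_p5) implies
# a data node classified by b_p4 on the path between the two data nodes:
assert implied_nodes_between("a_p5", "a_p2") == ["b_p4"]
```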

[Figure 2.13: Virtual match for non-branching internal query node.]

As illustrated above, the existence of matches for non-branching internal query nodes can be implied by the matches for above and below query nodes and information from pattern matching in the path summary. This means no data nodes have to be read for non-branching internal query nodes, as long as they are not output nodes.

2.3.2.2 Tree position encoding allowing ancestor reconstruction

When virtualizing output nodes, matches cannot just be implied, but must be reconstructed, at least to an extent such that they can be uniquely identified. The most common approach to implementing this is to use a node encoding which allows ancestor reconstruction, such as Dewey [44], which encodes positions with strings of integers. With the Dewey encoding, the tree position of a node is encoded as the position of the parent concatenated with the child number of the node, as shown in Figure 2.14. This means that the string length of a position encoding is equal to the depth of the corresponding data node.
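A minimal sketch of Dewey positions as integer tuples, showing that every ancestor position can be reconstructed from a position alone:

```python
def dewey(parent_pos, child_number):
    """A node's Dewey position: the parent's position extended with the
    node's child number."""
    return parent_pos + (child_number,)

def ancestors(pos):
    """All proper ancestor positions, nearest ancestor first."""
    return [pos[:i] for i in range(len(pos) - 1, 0, -1)]

root = (1,)
node = dewey(dewey(root, 2), 3)          # the node "1.2.3" in Figure 2.14
assert ancestors(node) == [(1, 2), (1,)]
assert len(node) == 3                    # encoding length equals node depth
```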

[Figure 2.14: Dewey encoding of node positions. The root 1 has children 1.1, 1.2 and 1.3, and node 1.2 has children 1.2.1, 1.2.2 and 1.2.3.]

2.3.2.3 Virtual matches for branching query nodes

Using virtual matches for branching query nodes is considerably harder than for non-branching query nodes, and requires an encoding allowing ancestor reconstruction even when the query nodes to be virtualized are not output nodes. An example is shown in Figure 2.15. Even though it is known, both for a_2 ↦ a_7 and a_4 ↦ a_4, that there is an above data node classified by a_p2 in the path summary, it cannot be determined from the summary matching alone that it is the same above data node.

[Figure 2.15: Virtual match for branching internal query node.]

Luckily, with the Dewey encoding, the lowest common ancestor of two data nodes can be determined by finding the longest common prefix of the Dewey strings. This means that virtual matches for branching query nodes can be generated by combining node position encoding with information from the path summary matching [53, 34].

Example 8. In the example in Figure 2.15, data nodes a_7 and a_4 have Dewey position encodings 1.1.2 and 1.1.1.1. The longest common prefix of these strings is 1.1, which has length 2. Since both a_p3 and a_p5 have the ancestor a_p2 at depth 2, it is determined that a_7 and a_4 have a common ancestor with structure belonging to the group specified by a_p2, and the query is matched.
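Example 8 thus reduces to a longest-common-prefix computation plus a depth lookup among the summary ancestors. A small sketch with the classifications from the example hard-coded:

```python
def longest_common_prefix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]

a7 = (1, 1, 2)     # Dewey 1.1.2, classified by a_p3 in the summary
a4 = (1, 1, 1, 1)  # Dewey 1.1.1.1, classified by a_p5 in the summary
depth = len(longest_common_prefix(a7, a4))
# Both a_p3 and a_p5 have the summary ancestor a_p2 at depth 2, so the
# lowest common ancestor of a_7 and a_4 is classified by a_p2, and the
# branching query node is matched without reading the data node itself.
assert depth == 2
```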

An advantage of using virtual matches for internal query nodes is that you avoid the issues with skipping parent matches discussed in Section 2.3.1.2. A downside is that the Dewey encoding requires O(d) space per data node, linear in the maximal depth of the data.

2.4 Related problems and solutions

This background chapter featured a high-level description of some of the concepts that are used in the research papers included in this thesis. A number of related issues that I have not covered are briefly listed below.

The strategy of using indexing and joins is considered to be the most efficient out of three common strategies [15], where the other two are navigation and subsequence matching. In navigation-based approaches, the data tree is indexed similarly to join-based approaches, but instead of joining together matches for different parts of the query, only the matches for one part are read from the index, and for each such partial match, the rest of the query is matched by navigating in the data tree. Subsequence indexing is a radically different indexing approach, where both data and query trees are converted to sequences with special properties, such that a subsequence match for the query sequence in the data sequence indicates a tree match [48, 40, 49].

When the underlying data is not a tree, but a general graph, many aspects of pattern matching become more complex. In Section 2.1.4 it was described how a simple tree position encoding using constant space per data node could be used to decide P–C and A–D relationships in constant time. Unfortunately, this is not as simple in general graphs, where encodings giving constant time decision are expensive to compute and store, while encodings with small space requirements that can be computed efficiently give expensive decision of node relationships [55].

Note that for general graph data, partitioning nodes on their sets of incoming and/or outgoing paths is actually PSPACE-complete [43]. Luckily there exist refinements of this ideal partition that are tractable. Recursively partitioning nodes on their label and the blocks of their parent and/or child nodes gives a refinement that is usually very close to the ideal in practice [36, 29].

As discussed in Section 2.1.2.1, a difference between XPath evaluation and TPM is that in the former, the result is all matches for the output node in the query, while in the latter, the result is all legal combinations of matches for the query nodes. There is a third output type called filtering [42], where the output is the set of documents where there exists at least one match for the query. This is useful in many information retrieval settings.

The full XPath language is complex, and the first algorithms for evaluating XPath queries in polynomial time were presented only years after the language standard was formalized [14]. A method for extending a TPM solution to cover the different axes in XPath is to use an algorithm that rewrites queries to use a smaller set of axes. What is missing from TPM is any notion of order between matches for query nodes not related by ancestry, and it has been shown that all XPath queries can be rewritten to queries using only the self, child, descendant and following axes [57]. Adding checks of the relative ordering of data nodes in a match can be done with post-processing, with checks during result enumeration, or by modifying the intermediate result construction. The latter is required for optimal evaluation.

There are many pattern matching problems on trees that are related to TPM, such as ordered and unordered tree inclusion, path inclusion, region inclusion, child inclusion and the subtree problem [31]. Of these, the most similar to twig pattern matching is unordered tree inclusion (UTI). The differences between UTI and TPM are that in the former all query edges are of type A–D, and that for any match function M, M(u) = M(v) if and only if u = v. The latter requirement makes UTI NP-complete [31].

Chapter 3

Research Summary

“There is nothing like looking, if you want to find something.
You certainly usually find something, if you look,
but it is not always quite the something you were after.”
– J.R.R. Tolkien

This chapter gives a brief overview of the research contained in this thesis. Section 3.2 lists the papers included, describes the research process and the roles of my co-authors, and gives a short evaluation of each paper in retrospect. Section 3.3 discusses the methodology used in my research, and Section 3.4 tries to evaluate the contributions in the papers in the light of the research questions from Section 1.4. Section 3.5 gives a short list of future work I find interesting, and Section 3.6 concludes the evaluation of the research.

3.1 Formalities

The thesis is a paper collection submitted for partial fulfillment of the requirements for the degree of philosophiae doctor. I have been enrolled in a four year PhD program at the Department of Computer and Information Science at the Norwegian University of Science and Technology. Three years' worth of financing was given by the Research Council of Norway under the grant NFR 162349, and one year of financing was given by the department in exchange for 25% duty work during my stay.

I started in the PhD program in the summer of 2005, and it has taken an additional year to complete it. From August 2008 I was on a full and later partial sick leave due to tendinitis in both hands, which I had acquired from rock climbing. During the partial sick leave I started setting up voice recognition software for programming, and most of the C++ implementation used in Paper 3 was actually written using this setup. Six extra months of financing were given by the iAD project sponsored by the Research Council of Norway, because I had been without a supervisor for an extended period.

In the PhD program it is mandatory to take five courses, and I have taken the following:

• DT8101 Highly Concurrent Algorithms
• DT8102 Database Management Systems
• DT8108 Topics in Information Technology
• TDT4215 Knowledge in Document Collections
• TT8001 Pattern Recognition

In my duty work for the department I have worked with the following:

• The Nordic Collegiate Programming Contest
• TDT4120 Algorithms and Data Structures
• TDT4215 Algorithm Construction, Advanced Course
• TDT4287 Algorithms for Bioinformatics

3.2 Publications and research process

After finishing my Master's thesis on substring indexing [16] I wanted to continue this research, but this venture was abandoned for various reasons described in Appendix A (Paper 7), and resulted in a technical report [17]. After that I started the research on XML indexing, which resulted in the papers listed below. In addition I have co-authored some other papers on indexing and search in other types of data together with fellow PhD student Truls Amundsen Bjørklund and others. These are listed in Appendix A (Papers 8 and 9).

3.2.1 Paper 1

Authors: Nils Grimsmo.
Title: Faster Path Indexes for Search in XML Data [18].
Publication: Proceedings of the Nineteenth Australasian Database Conference (ADC 2008).
Abstract: This article describes how to implement efficient memory resident path indexes for semi-structured data. Two techniques are introduced, and they are shown to be significantly faster than previous methods when facing path queries using the descendant axis and wild-cards.¹ The first is conceptually simple and combines inverted lists, selectivity estimation, hit expansion and brute force search. The second uses suffix trees with additional statistics and multiple entry points into the query. The entry points are partially evaluated in an order based on estimated cost until one of them is complete. Many path index implementations are tested, using paths generated both from statistical models and DTDs.

¹ Note that wild-cards are not mentioned in Chapter 2, but can be supported by fixing the level difference between matches for above and below query nodes when necessary, or simply by reading all data nodes for the wild-card query node.

Research process

When I started working with XML search, my supervisor Dr. Torbjørnsen had a plan for the architecture of a commercial XML indexing system. This system was to use path indexing for XML structure elements and inverted files for textual content. My first assignment was to design an efficient path index, and as a first step, only look at single paths. I used my experience from string indexing, and came up with one solution based on joins, and one based on suffix trees. Both used selectivity estimates and opportunistic optimizations.

Retrospective view

The solutions I came up with for solving the problem at hand were rather advanced, especially the opportunistic algorithm using suffix trees. The focus of the work was on practical performance, and given the interplay between the path index and the value index that was planned in the underlying design, I feel that the solutions I found were good. On the other hand, from a more theoretical viewpoint the worst-case behavior of the solutions is not attractive, as individually matching paths can give large intermediate results. An asymptotically better approach is to use a twig join algorithm on the path summary, as described in Section 2.2.2.

As the size of the path index is small compared to the size of the data for many document collections, it is a fair question whether path index lookups are important for total query evaluation time in a complete search system.

3.2.2 Paper 2

Authors: Nils Grimsmo and Truls Amundsen Bjørklund.
Title: On the Size of Generalised Suffix Trees Extended with String ID Lists [19].
Publication: Technical Report IDI-TR-2007-01, Norwegian University of Science and Technology, Trondheim, Norway, 2007.
Abstract: The document listing problem can be solved with linear preprocessing and optimal search time by using a generalised suffix tree, additional data structures and constant time range minimum queries. A simpler solution is to use a generalised suffix tree in which internal nodes are extended with a list of all string IDs seen in the subtree below the respective node. This report makes some remarks on the size of such a structure. For the case of a set of equal length strings, a bound of Θ(n√n) for the worst case space usage of such lists is given, where n is the total length of the strings.

Research process

During the implementation of the system for Paper 1, I found some recent work which used an extension of generalized suffix trees [58]. Suffix trees have attractive asymptotic complexity, but as the authors did not comment on the complexity of their extension, I felt that I should investigate it before I based my work on their solution. As there was no space for these results in Paper 1, they were published as a technical report.

Roles of the authors

Bjørklund helped write and simplify the proofs, and helped write the paper itself.

Retrospective view

This report only investigated worst-case space usage as a function of the total length of the strings indexed. It could also have been of interest to find the best-case and average-case space usage, and to express the complexity as a function of the number of strings and their average length.

3.2.3 Paper 3

Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Øystein Torbjørnsen.
Title: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes [24].
Publication: Proceedings of the Second International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2010).
Abstract: XML indexing and search has become an important topic, and twig joins are key building blocks in XML search systems. This paper describes a novel approach using a nested loop twig join algorithm, which combines several existing techniques to speed up evaluation of XML queries. We combine structural summaries, path indexing and prefix path partitioning to reduce the amount of data read by the join. This effect is amplified by only reading data for leaf query nodes, and inferring data for internal nodes from the structural summary. Skipping is used to speed up merges where query leaves have differing selectivity. Multiple access methods are implemented as materialized views instead of succinct secondary indexes for better locality. This redundancy is made affordable in terms of space by using compression in a back-end with columnar storage. We have implemented an experimental prototype, which shows a speedup of two orders of magnitude on XPath queries with value predicates, when compared to existing open source and commercial systems using a subset of the techniques. Space usage is also improved.

Research process

This was an attempt to build an academic prototype for a high performance XML search system, based on design ideas by my supervisor Dr. Torbjørnsen and myself. The system turned out rather complex and had many features, and in my view at the time, substantial academic contributions. The main contribution was the virtualization of internal query nodes, which I thought was a novelty. A paper was submitted to XSym 2009, but was rejected, mainly because of missing references to previous work, but also because of a weak description of the join algorithms used. Virtual streams for branching internal query nodes had been presented in independent papers from 2004 [53] and 2005 [34]. We then rewrote the presentation of the paper, and the final form was more of an experimental systems paper featuring some minor novelties.

Roles of the authors

Both Bjørklund and Torbjørnsen were part of all phases of this work, except the implementation of the system. Bjørklund contributed mainly with knowledge about columnar storage and compression schemes. Torbjørnsen contributed with general knowledge of database systems and search engines, and also had many of the initial design ideas for the system.

Retrospective view

In this research I mostly compared my full system with other full systems, and tried to explain performance differences in the experiments based on which features each of the systems used. The experiments would probably have been more academically interesting if, in addition, I had extended my system so that all the different features could be turned on and off to isolate their effects.

I learned two important lessons from this project. The first is that I should have spent considerably more time reading previous research, and less time focusing on my own ideas in the beginning. The second lesson is that it is not very tactical for a PhD student working mostly on his own to spend a year implementing a large system, unless it is certain that it will bear fruit in terms of publications. Large systems should be left to large research groups with long term projects.

3.2.4 Paper 4

Authors: Nils Grimsmo and Truls Amundsen Bjørklund.
Title: Towards Unifying Advances in Twig Join Algorithms [20].
Publication: Proceedings of the 21st Australasian Database Conference (ADC 2010).
Abstract: Twig joins are key building blocks in current XML indexing systems, and numerous algorithms and useful data structures have been introduced. We give a structured, qualitative analysis of recent advances, which leads to the identification of a number of opportunities for further improvements. Cases where combining competing or orthogonal techniques would be advantageous are highlighted, such as algorithms avoiding redundant computations and schemes for cheaper intermediate result management. We propose some direct improvements over existing solutions, such as reduced memory usage and stronger filters for bottom-up algorithms. In addition we identify cases where previous work has been overlooked or not used to its full potential, such as for virtual streams, or the benefits of previous techniques have been underestimated, such as for skipping joins. Using the identified opportunities as a guide for future work, we are hopefully one step closer to unification of many advances in twig join algorithms.

Research process

After the disappointment of having reinvented the wheel when working with Paper 3, I started doing a more thorough literature review to get a deeper understanding of the different aspects of XML indexing and twig joins in particular. My notes from this study gradually grew into a mixture of a survey paper and a long list of research ideas. As most conferences do not accept survey papers, and because I did not feel I had time for the process of journal publication, the focus of the paper was turned to the list of research opportunities.

Roles of the authors

Bjørklund helped structure the paper, was part of discussions about the different research opportunities listed, and helped write the final version.

Retrospective view

This paper features a long list of ideas. Some of these may be of interest to other researchers, while others probably are not. It would of course have been a much greater contribution to the research community if this work had been presented at a more visible venue, such as a journal, but this would have required much more time, and probably the help of co-authors experienced in the field of twig joins.

In retrospect, it may be backwards to write such a literature review at the end of a PhD, and I think my academic development would have benefited from doing it at an earlier stage. On the other hand, with no experience I probably would not have been able to analyze previous work with the same understanding.

3.2.5 Paper 5

Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland.
Title: Fast Optimal Twig Joins [21].
Publication: Proceedings of the 36th International Conference on Very Large Data Bases (VLDB 2010).
Abstract: In XML search systems twig queries specify predicates on node values and on the structural relationships between nodes, and a key operation is to join individual query node matches into full twig matches. Linear time twig join algorithms exist, but many non-optimal algorithms with better average-case performance have been introduced recently. These use somewhat simpler data structures that are faster in practice, but have exponential worst-case time complexity. In this paper we explore and extend the solution space spanned by previous approaches. We introduce new data structures and improved strategies for filtering out useless data nodes, yielding combinations that are both worst-case optimal and faster in practice. An experimental study shows that our best algorithm outperforms previous approaches by an average factor of three on common benchmarks. On queries with at least one unselective leaf node, our algorithm can be an order of magnitude faster, and it is never more than 20% slower on any tested benchmark query.

Research process

This work started off as an idea for space savings in twig joins, listed as Opportunity 3 in Paper 4, but then I noticed that the same idea could be used to close the gap between worst-case behavior and practical performance in current twig join algorithms. Also, the creation of the framework for classifying strategies that we presented in this paper was inspired by Opportunity 4 in Paper 4, which suggested extending the range of different filtering strategies used in current twig joins.

As many of the recently presented algorithms used very similar underlying data structures, I implemented a minimalistic system where different filtering strategies and storage techniques could be switched on and off. The result was a system spanning a large space of solutions, in which the interplay between the effects of the different features could be analyzed.

Roles of the authors

Bjørklund was part of the idea phase that led up to this paper, took part in discussions about the development of the system, and gave much constructive feedback during the writing phase. Hetland helped simplify our formalizations, was of great help writing the theoretical part of the paper, and had the idea for how to implement the post-processing for our preorder storage algorithm TJStrictPre.

Retrospective view

I feel that the most important contribution in this paper is the framework we developed for classifying the different filtering strategies used in twig joins, and the analysis of how the strategies affect practical and worst-case performance.

On the day before the submission of the camera-ready copy of the paper, I started thinking about how our novel input stream merger called getPart could be modified to return data nodes strictly in order. It would indeed be interesting to know whether such a modification would increase the cost of the merge, because with strict ordering, there are actually more possibilities during intermediate result construction, as you can use a global stack [21, Appendix H]. My current impression is that with a modified getPart merger, it should be possible to create an algorithm that is much simpler and more elegant than the TJStrictPre algorithm presented in this paper, but still as fast.

3.2.6 Paper 6

Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland.

Title: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees [22].

Publication: Proceedings of the 7th International XML Database Symposium on Database and XML Technologies (XSym 2010).

Abstract: The F&B-index is used to speed up pattern matching in tree and graph data, and is based on the maximum F&B-bisimulation, which can be computed in loglinear time for graphs. It has been shown that the maximum F-bisimulation can be computed in linear time for DAGs. We build on this result, and introduce a linear algorithm for computing the maximum F&B-bisimulation for tree data. It first computes the maximum F-bisimulation, and then refines this to a maximal B-bisimulation. We prove that the result equals the maximum F&B-bisimulation.
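As an illustration of the two phases, here is a minimal sketch of my own (it uses hashing rather than the worst-case linear partition refinement of the paper, so it is expected linear only, and it is not the published algorithm): forward classes are computed bottom-up from the label and the set of child classes, and are then refined top-down on the parent's refined class.

    # Sketch only: F&B-style classes for a node-labeled tree via hashing.
    # labels: node -> label; children: node -> list of child nodes.
    def fb_classes(labels, children, root):
        f, f_ids = {}, {}                  # forward classes, bottom-up
        def forward(v):
            kids = frozenset(forward(c) for c in children[v])
            f[v] = f_ids.setdefault((labels[v], kids), len(f_ids))
            return f[v]
        forward(root)
        fb, fb_ids = {}, {}                # backward refinement, top-down
        def backward(v, parent_class):
            fb[v] = fb_ids.setdefault((f[v], parent_class), len(fb_ids))
            for c in children[v]:
                backward(c, fb[v])
        backward(root, None)
        return fb

For deep trees the recursion should be replaced by explicit stacks.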

Research process

When I started working on this project, the idea was to write a paper presenting three contributions: Linear F&B-index construction for trees, subtree addition with cost dependent only on the size of the addition, and a publish/subscribe solution that indexed queries on what document structures they matched. Note that the second part is Opportunity 9 in Paper 4. After I had found a solution to the first part and implemented it, I found a published but incorrect algorithm for solving this problem, and felt that perhaps this deserved a publication of its own.

Our initial contribution was two-fold: Linear construction of single-directional bisimilarity for DAGs, and linear construction of two-directional bisimilarity for trees. One week before the submission deadline for XSym I found a paper which described a solution to the first problem, and we had to rewrite the paper slightly. We had to describe how our solution was different from theirs, and justify why these differences were necessary for solving the second problem, two-directional bisimilarity for trees.

Roles of the authors

Bjørklund and Hetland both participated in discussions about how to solve the problems, and in the writing of the paper.

Retrospective view

Some of the reviewers at XSym commented that the paper would benefit from more examples illustrating the algorithms, and also that the experiments were superfluous because of the asymptotic improvement. The experiments were therefore removed from the published version to give room for more examples [22], but they can be found in the extended technical report version [23]. The experiments are on standard XML benchmark data, where it turns out the old loglinear algorithm performs rather well. It would perhaps be more interesting to see experiments on random data showing the loglinear worst case, but this could turn out to be a difficult exercise in constructing data.

3.3 Research methodology

Most of the work presented here can be considered constructive research: Given a computational problem and the set of existing solutions, find a computationally cheaper solution. Cheaper can mean that the asymptotic running time is lower, as in Paper 6 listed above; it can mean that an implementation runs faster for a given set of instances of the problem on a given computer, as in Papers 1 and 3; or both, as in Paper 5.

Empirical methods are needed to draw conclusions about the practical performance of solutions in general. Unfortunately, this is a weak part of many subfields of computer science, such as XML indexing, and it is also a weak part in most of my research. Most papers on twig joins and XPath evaluation use around three standard data sources, run maybe five to ten standard queries on each data set, and then draw conclusions about which solution is faster. These queries were typically written by some paper author to highlight a specific difference between two solutions. This can almost be called qualitative research. Using a few standard data sets may be the only feasible solution, but it should be possible to create a large population of standard queries from which samples could be drawn. Another solution is to create a statistical model from which the data and queries are generated [17]. This allows drawing conclusions about which solution is better given properties of the data and queries, such as label alphabet distribution, tree size and shape, etc.

Some of the work presented here can also be considered descriptive research, such as the taxonomy of techniques in Paper 4, and the classification of filtering strategies in previous approaches in Paper 5.

3.4 Evaluation of contributions

This section revisits the research questions from Section 1.4, and tries to evaluate to what extent they have been answered by the research papers that constitute this thesis.

3.4.1 Research questions revisited

Below follows a brief list of my contributions, grouped by research question.

1. RQ1: How can matches for tree queries be joined more efficiently?

• Paper 3: The XLeaf system uses multiple materialized views and selectivity estimates to reduce the amount of data read during XPath evaluation. Query leaves are looked up on either value, path or tag, and then possibly filtered on path and/or value.

• Paper 3: Information from the summary matching is used to determine whether or not a linear join can be employed. This is stronger than previous approaches, which determined this only from properties of the query [42].

• Paper 3: Both the linear and the nested loop join in the XLeaf system store intermediate results of negligible size, and achieve this by never materializing virtual matches for internal query nodes, using only implicit prefixes of descendant leaf Deweys. Previous approaches for virtual matches explicitly generate node identifiers [53, 34].

• Paper 3: The compressed Dewey encoding used in the XLeaf system allows faster joins, because the Deweys can be compared without decompression (see the sketch after this list).

• Paper 5: The TJStrictPost and TJStrictPre algorithms combine worst-case optimality [7] with good practical performance [33], bridging a gap between current twig join algorithms.

• Paper 5: The getPart input stream merger gives inexpensive weak full match filtering, and a considerable speedup during intermediate result construction compared to the previous getNext input stream merger [5], which gave weak subtree match filtering.

2. RQ2: How can pattern matching in the dictionary be done more efficiently?

• Paper 1: EsIe3 is a novel strategy for indexing paths that combines tuple indexing, selectivity estimates, and nested-loop lookups. It was shown experimentally to be much faster than previous approaches based on merge joins [32].

• Paper 1: Smfe is a new opportunistic approach to using suffix trees for matching path patterns, which was considerably faster than previous similar methods [58].

• Paper 3: To join branch matches when using path indexing, it must be known how different data paths match the query paths, to make sure different branches in the query use the same data nodes for branching points. To avoid storing the potentially exponential number of ways the paths can match the data, the XLeaf system stores, for each path matched, the set of usable matches for the nearest ancestor branching nodes.

• Paper 3: Using the meta-information from the path summary matching, for each leaf stream alignment, the joins in the XLeaf system use a simple bottom-up then top-down traversal of the query tree, which determines which branching nodes can be used, and whether or not the leaf Deweys match down to these branching points.

3. RQ3: How can structure indexes be constructed faster and using less space?

• Paper 2: We determined the worst-case space complexity of a previous path indexing strategy building on an extension of suffix trees [58]. This gives a more balanced view of the usefulness of this data structure.

• Paper 3: We showed in the XLeaf system how multiple materialized views of a structure index could be made affordable with compression and shared dictionaries in a column store back-end. The system used less space for three materialized views than a state-of-the-art system without such compression [4] used for one view.

• Paper 6: The cost of construction of the forward and backward path index for tree data was reduced from loglinear to linear, improving the usability of these strong structure indexes [28].
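The Dewey comparisons mentioned under RQ1 are easiest to see on uncompressed codes; the following lines are my own illustration, the point of XLeaf being that its compressed encoding supports the same two operations without decompressing:

    # Dewey codes as tuples of ints, e.g. (1, 3, 2) for the node 1.3.2.
    def is_ancestor(a, d):
        # a is an ancestor of d iff a is a proper prefix of d
        return len(a) < len(d) and d[:len(a)] == a

    def document_order(a, b):
        # lexicographic comparison of Deweys gives preorder document order
        return (a > b) - (a < b)

    print(is_ancestor((1, 3), (1, 3, 2, 1)))   # True
    print(document_order((1, 2), (1, 3, 1)))   # -1: 1.2 precedes 1.3.1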

3.4.2 Opportunities revisited

In Paper 4 we listed a number of different research opportunities, and below I list the developments I have found for each of them, in my own research and others'. It may be advantageous to skip this section and revisit it after reading the paper. As Paper 4 was written after the XLeaf system in Paper 3 was developed, I already had partial answers to some of the questions I posed, but at the time of writing Paper 4, I did not know whether or not Paper 3 would be published.

1. Removing redundant computation in top-down one-phase joins - This is a point we did not pursue in the twig joins in Paper 5, something which could have resulted in even faster input stream merging.

2. Top-down memory usage - The TJStrict algorithms in Paper 5 use strict matching equivalent to what was proposed here to reduce the number of nodes added to intermediate results.

3. Bottom-up memory usage - It is possible to add some more inexpensive space savings to the TJStrict algorithms as proposed in Paper 4, and to use early enumeration when the topmost query root match is closed [7]. More aggressive space savings, requiring more dynamic storage of intermediate results, have been presented recently [35].

4. Stronger filters - This topic was treated extensively in Paper 5.

5. Unification or assimilation - Paper 5 may have brought us a small step closer to unifying top-down and bottom-up approaches, but as stated in the conclusion, it would be nice to see a solution that is simpler and more elegant than TJStrictPre, but still as efficient.

6. Holistic effective ancestor skipping - As this is just the combination of two previous techniques (see Section 2.3.1), it probably holds little academic interest.

7. Simpler and faster skipping data structures - I have seen no advances on this point, and I am not sure whether or not it is of academic interest.

8. Updates in stronger summaries - During the background research for Paper 6 we found papers covering updates in indexes based on single-directional bisimulations for graph data [28, 54, 41], but none for the bidirectional case. This is probably of less practical interest than the following opportunity.

9. Hashing F&B summaries - My original plan for Paper 6 featured this as the main contribution, but I have not had time to explore it. I still believe investigating this could yield interesting results.

10. Exploring summary structures and how to search them - As noted in Paper 4, the recently proposed ideas of multi-level structure indexes [52] and of using twig join algorithms to do path summary matching [2] may be part of the ultimate structure indexing package, but a thorough experimental investigation of what benefits the different techniques give in different cases is still missing.

11. Access methods for multiple matching partitions - A recent paper [3] compared the advanced access methods in iTwigJoin [8] with simply merge joining the different streams of matching nodes for a query node, and found that the latter scaled better. The method we used in our XLeaf system in the case of multiple matching paths was to read a stream of label-matching nodes, and filter on path using a bit vector. This is probably more efficient than merging path matches when accessing on path is not much more selective than accessing on tag. An optimizer could choose which access method to use based on cost estimates.

12. Improved virtual streams - Paper 3 made some progress on how to store summary matching meta-information and how to avoid materializing virtual node positions, but I believe that a better solution is waiting to be designed, and that it would use features both from Virtual Cursors [53], TJFast [34] and XLeaf [24].

13. Holistic skipping among leaf streams - As noted in Paper 4, Virtual Cursors [53] is very close to implementing holistic skipping and so-called optimal data access. The only piece missing is to always generate virtual internal query node matches from the leaf below with the greatest Dewey, as done in the XLeaf system when the simple linear join is used.

14. Identifying and using difficulty classes - The XLeaf system decides whether or not a linear join algorithm can be used based on whether or not the tree depth of a node is fixed after the path summary matching. This is stronger than previous approaches [42].

3.5 Future work

Below I list a few directions I would pursue as a continuation of my research.

3.5.1 Strong structure summaries for independent documents

This was proposed as Opportunity 9 in Paper 4. The conjecture is that if the underlying data is a set of unconnected graphs, then the structure summary for the forward and backward path partition will also be a set of unconnected graphs. This is shown for a collection of independent tree-shaped documents in Figure 3.1. Note that for trees, the document roots are partitioned into blocks represented by roots in the summary. Also, the only change caused by adding a virtual root for the documents is the addition of a virtual root in the summary.

[Figure: documents D_1, D_2, D_3, D_4, D_5, ..., D_n above; summary trees S_1, S_2, S_3, ..., S_m below.]

Figure 3.1: Shape of forward and backward path summary (below) for a set of independent documents with virtual root (above).

The idea is that when adding a new document, its nodes will either be mapped to summary nodes in a single tree in the summary, or a new tree will be created in the summary. When a new document is seen, the structure summary tree for this document can be constructed independently of the summaries for the previous documents. If this summary tree is label-isomorphic to an existing top-level subtree in the global summary, we translate the classification of the document nodes to the equivalent existing classes; otherwise, the document summary is added as a new top-level subtree in the summary. This means we need a lookup structure on the shapes of the summary trees, for example with hashing of sorted trees, or some form of tree automaton.
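For example, such a lookup structure could hash a canonical serialization of each summary tree, with child serializations sorted so that the form is independent of sibling order. The sketch below is an assumption of mine about how this could look, not a design from the papers:

    # Canonical form for label-isomorphism of rooted trees;
    # labels are assumed not to contain parentheses.
    def canonical(tree):
        label, subtrees = tree             # tree = (label, list of subtrees)
        return '(' + label + ''.join(sorted(map(canonical, subtrees))) + ')'

    summary_index = {}                     # canonical form -> top-level subtree
    doc_summary = ('a', [('b', []), ('c', [('b', [])])])
    key = canonical(doc_summary)           # '(a(b)(c(b)))'
    existing = summary_index.setdefault(key, doc_summary)

If the key was already present, the document's node classes are translated to those of the existing subtree; otherwise the new subtree has just been registered.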

3.5.2 A simpler fast optimal twig join

The getPart input stream merger presented in Paper 5 has one major flaw: similarly to the original getNext input stream merger [5], it outputs data nodes in so-called local preorder [21]. As discussed in our paper, a global strict preorder is needed when using a global stack to open nodes in preorder and close them in postorder, which is needed for on-the-fly postorder subtree match filtering. In TJStrictPre, a postorder processing pass over the data is used to perform the subtree match filtering, which is required for optimality.

As full weak match filtering removes many nodes, it may be possible that a combination of a getPart variant with global preorder output and a simple tree-based intermediate result construction as described in Section 2.1.6 would be competitive with TJStrictPre. Such an algorithm would be significantly more elegant and simpler to implement.
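To show the kind of processing a strict global preorder enables, here is a generic sketch (mine, assuming nodes carry region encodings) of opening nodes in preorder and closing them in postorder with a single global stack; the on-the-fly subtree match filtering discussed above would hook into the close events:

    # nodes: iterable of (start, end, payload), sorted on start (strict preorder).
    def open_close_events(nodes):
        stack = []
        for start, end, payload in nodes:
            while stack and stack[-1][1] < start:   # top ends before this node starts,
                yield ('close', stack.pop())        # so it is closed in postorder
            stack.append((start, end, payload))
            yield ('open', (start, end, payload))
        while stack:
            yield ('close', stack.pop())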

3.5.3 Simpler and faster evaluation with non-output nodes

As described in Section 2.1.2.1, XPath queries have a single output node. Using a regular twig join algorithm and removing duplicates during post-processing can be very inefficient, but the Full Match Theorem from Paper 5 makes it simple to modify a twig join algorithm to get complexity linear in the input and output also with a single output node. As long as nodes are filtered first on being subtree matchers in postorder, and then on being prefix path matchers in preorder, all nodes are part of a full match. We can then simply output the remaining matches for the output query node. Note that the second filtering pass is easier to implement when using tree-based intermediate data structures as suggested in Section 3.5.2.
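To make the two passes concrete, the following is a sketch of mine for twigs using only the child axis (the descendant axis and the streaming setting add bookkeeping, and this is not the algorithm of the papers): pass one computes subtree matchers in postorder, pass two keeps prefix path matchers in preorder.

    # data tree: (label, list of children), nodes assumed to be distinct objects;
    # query: root id q_root plus qlabel/qchildren dicts over query node ids.
    def full_match_filter(data_root, q_root, qlabel, qchildren):
        S = {}                             # id(node) -> query nodes it subtree-matches
        def post(d):
            label, kids = d
            for k in kids:
                post(k)
            S[id(d)] = {q for q in qlabel
                        if qlabel[q] == label and
                        all(any(qc in S[id(k)] for k in kids)
                            for qc in qchildren[q])}
        post(data_root)
        kept = {q: [] for q in qlabel}
        def pre(d, allowed):
            here = S[id(d)] & allowed      # subtree matchers that also prefix-match
            for q in here:
                kept[q].append(d)
            below = {q_root} | {qc for q in here for qc in qchildren[q]}
            for k in d[1]:
                pre(k, below)
        pre(data_root, {q_root})
        return kept                        # every kept node is part of a full match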

The two-pass full match filtering approach may also be useful when dealing with multiple output nodes in so-called generalized tree pattern matching [9, 7]. In cases where output nodes form a connected tree, linear enumeration is easily obtained: Assume we use intermediate data structures similar to those constructed by Algorithm 2 (page 28), and that the enumeration in Algorithm 1 (page 22) traverses the query in preorder. We can then start the recursive enumeration at the root of the output subtree, recurse only on output query nodes, and output the current match M′ when its size is equal to the number of output query nodes.

Based on this approach, it may also be possible to formulate a simpler and more elegant enumeration algorithm for the general case with possibly unconnected output nodes [7].

3.5.4 Ultimate data access shoot-out

It would be interesting to know which would perform better in practice: a system using virtual streams and skipping, improved as suggested in Opportunities 12 and 13 in Paper 4, or a system with explicit streams and skipping, improved as suggested in Opportunity 6. With virtual streams the total amount of data read is probably lower (also when skipping), and the advanced data structures for ancestor skipping are avoided. On the other hand, node encodings enabling the ancestor reconstruction needed for virtual streams have O(d) space usage per encoded position, where d is the maximal depth of the data.

Deciding which is the better approach in practice is no simple task. It would require experienced programmers with good knowledge of modern hardware, man-hours to optimize both solutions sufficiently, and a thorough empirical evaluation.

3.6 Conclusions

This thesis presents a number of contributions on the construction and use of structure indexes, and on twig join algorithms. A strength of the thesis is that it combines work on the practical efficiency of implementations with work on theoretical aspects, read: asymptotic complexity. A weakness is that it does not consider the broader picture: What is search in XML data used for in practice? And are the advances in twig joins presented here useful for real-world systems? I believe that the answer to the latter question is yes. In Paper 3 we built a system that combined new and previous techniques in a new way, such that a large group of queries were evaluated orders of magnitude faster than in current state-of-the-art open source and commercial systems. This should definitely be of interest for system implementors. Also, I believe that the improved filtering strategies and intermediate data structures introduced in Paper 5 are applicable independently of the data partitioning and data access methods used. Whether or not the faster construction of F&B-indexes for trees presented in Paper 6 is useful depends on whether or not F&B-indexing is considered favorable over, for example, simple path indexing in the future. The use of the recently introduced multi-level indexing [52] may make F&B-indexes more attractive in use.

A lesson learned during my research is a general strategy for how to process trees, which inspired the name of the thesis. When implementing virtual matches in Paper 3, candidate sets for branching nodes are calculated bottom-up in the query tree, before they are chosen top-down. The Full Match Theorem from Paper 5 states that only nodes that are part of a full match remain after filtering nodes bottom-up on matched subtrees, and then top-down on matched prefix paths. In Paper 6 it is shown how the maximum forward and backward bisimulation can be computed for tree data by first computing the maximum forward bisimulation bottom-up in the tree, and then refining to a maximal backward bisimulation top-down.

The work presented in this thesis builds heavily on the research that has been presented by the community during the last ten years. Hopefully some of my contributions will also be built on, and result in further advances in the field.



Bibliography

[1] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002. 2.1.1

[2] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008. 2.2.2, 2.2.4, 10

[3] Radim Bača and Michal Krátký. On the efficiency of a prefix path holistic algorithm. In Proc. XSym, 2009. 2.2.2, 11

[4] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In Proc. SIGMOD, 2006. 3

[5] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002. 2.1.1, 2.1.5, 2.1.5, 2.1.7, 1, 3.5.2

[6] Qun Chen, Andrew Lim, and Kian Win Ong. D(k)-index: an adaptive structural summary for graph-structured data. In Proc. SIGMOD, 2003. 2.2.4

[7] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig²Stack: bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006. 2.1.1, 2.1.2, 2.1.2.1, 2.1.5, 2.1.5, 2.1.6, 2.1.6, 2.1.8, 1, 3, 3.5.3

[8] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005. 2.2.2, 2.2.2, 11

[9] Zhimin Chen, H. V. Jagadish, Laks V. S. Lakshmanan, and Stelios Paparizos. From tree patterns to generalized tree patterns: on efficient evaluation of XQuery. In Proc. VLDB, 2003. 3.5.3

[10] Byron Choi, Malika Mahoui, and Derick Wood. On the optimality of holistic algorithms for twig queries. In Proc. DEXA, 2003. 2.2

[11] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. McGraw-Hill Higher Education, 2001. 2.1.7

[12] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing cursor movement in holistic twig joins. In Proc. CIKM, 2005. 2.3.1.3

[13] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. VLDB, 1997. 2.2.2

[14] Georg Gottlob, Christoph Koch, and Reinhard Pichler. Efficient algorithms for processing XPath queries. In Proc. VLDB, 2002. 2.4

[15] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. Knowl. and Data Eng., 2007. 1.3, 2.4

[16] Nils Grimsmo. Dynamic indexes vs. static hierarchies for substring search. Master's thesis, Norwegian University of Science and Technology, 2005. 3.2, 7

[17] Nils Grimsmo. On performance and cache effects in substring indexes. Technical Report IDI-TR-2007-04, NTNU, Trondheim, Norway, 2007. 3.2, 3.3, 7

[18] Nils Grimsmo. Faster path indexes for search in XML data. In Proc. ADC, 2008. (document), 3.2.1

[19] Nils Grimsmo and Truls Amundsen Bjørklund. On the size of generalised suffix trees extended with string ID lists. Technical Report IDI-TR-2007-01, NTNU, Trondheim, Norway, 2007. (document), 3.2.2

[20] Nils Grimsmo and Truls Amundsen Bjørklund. Towards unifying advances in twig join algorithms. In Proc. ADC, 2010. (document), 3.2.4

[21] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Fast optimal twig joins. In Proc. VLDB, 2010. (document), 2.1.5, 2.1.7, 3.2.5, 3.2.5, 3.5.2

[22] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees. In Proc. XSym, 2010. (document), 3.2.6, 3.2.6

[23] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees (extended version). Technical Report IDI-TR-2010-10, NTNU, Trondheim, Norway, 2010. 3.2.6

[24] Nils Grimsmo, Truls Amundsen Bjørklund, and Øystein Torbjørnsen. XLeaf: Twig evaluation with skipping loop joins and virtual nodes. In Proc. DBKDA, 2010. (document), 3.2.3, 12

[25] Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. XR-tree: Indexing XML data for efficient structural joins. In Proc. ICDE, 2003. 2.3.1.2

[26] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003. 2.3.1.2

[27] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007. 2.1.1, 2.1.2, 2.1.5, 2.1.6, 2.1.7

[28] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002. 2.2.3, 2.2.3, 2.2.4, 3, 8

[29] Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, and Raghu Ramakrishnan. On the integration of structure indexes and inverted lists. In Proc. SIGMOD, 2004. 2.2.4, 2.4

[30] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. ICDE, 2002. 2.2.4

[31] Pekka Kilpeläinen. Tree matching problems with applications to structured text databases. Technical Report A-1992-6, Department of Computer Science, University of Helsinki, 1992. 2.4

[32] Krishna P. Leela and Jayant R. Haritsa. Schema-conscious XML indexing. Information Systems, 32:344–364, 2007. 2

[33] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008. 2.1.1, 2.1.2, 2.1.5, 2.1.6, 2.1.7, 1

[34] Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, and Ting Chen. From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005. 2.3.2.3, 3.2.3, 1, 12

[35] Federica Mandreoli, Riccardo Martoglia, and Pavel Zezula. Principles of holism for sequential twig pattern matching. The VLDB Journal, 18(6), 2009. 3

[36] Tova Milo and Dan Suciu. Index structures for path expressions. In Proc. ICDT, 1999. 2.2.2, 2.2.3, 2.2.3, 2.4

[37] Svetlozar Nestorov, Jeffrey D. Ullman, Janet L. Wiener, and Sudarshan S. Chawathe. Representative objects: Concise representations of semistructured, hierarchical data. In Proc. ICDE, 1997. 2.2.2

[38] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 1987. 2.2.3

[39] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007. 2.1.1, 2.1.2, 2.1.5, 2.1.6

[40] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer sequences. In Proc. ICDE, 2004. 2.4

[41] Diptikalyan Saha. An incremental bisimulation algorithm. In Proc. FSTTCS, 2007. 8

[42] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008. 2.4, 1, 14

[43] L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential time (preliminary report). In Proc. STOC, 1973. 2.4

[44] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proc. SIGMOD, 2002. 2.3.2.2

[45] W3C. XPath 1.0, 1999. http://w3.org/TR/xpath. 1.2.1, 1.2.1, 2.1.2.1

[46] W3C. Extensible markup language (XML) 1.0 (fourth edition), 2006. http://www.w3.org/TR/2006/REC-xml-20060816/. 1.2

[47] W3C. XQuery 1.0, 2007. http://w3.org/TR/xquery. 1.2.1, 2.1.2.1

[48] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In Proc. SIGMOD, pages 110–121, 2003. 2.4

[49] Haixun Wang and Xiaofeng Meng. On the sequencing of tree structures for XML indexing. In Proc. ICDE, pages 372–383, 2005. 2.4

[50] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval. PhD thesis, Ludwig-Maximilians-Universität München, 2006. 2.1.4, 2.1.8

[51] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing gigabytes (2nd ed.): compressing and indexing documents and images. 1999. 2.1.8

[52] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data & Knowl. Eng., 2008. 2.2.4, 10, 3.6

[53] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004. 2.3.2.3, 3.2.3, 1, 12, 13

[54] Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. Incremental maintenance of XML structural indexes. In Proc. SIGMOD, 2004. 8

[55] Jeffrey Xu Yu and Jiefeng Cheng. Graph reachability queries: A survey. In Managing and Mining Graph Data, Advances in Database Systems. Springer US, 2010. 2.4

[56] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001. 2.1.1, 2.1.4

[57] Ning Zhang, Varun Kacholia, and M. Tamer Özsu. A succinct physical storage scheme for efficient evaluation of path queries in XML. In Proc. ICDE, 2004. 2.4

[58] Liang Zuopeng, Hu Kongfa, Ye Ning, and Dong Yisheng. An efficient index structure for XML based on generalized suffix tree. Information Systems, 32:283–294, 2007. 3.2.2, 2, 3


Chapter 4

Included papers

"As for me, all I know is that I know nothing."
– Socrates

This chapter contains the research papers that constitute the main body of this thesis. They have been reformatted to fit the format used here, and figures and tables have been moved, resized and rotated to become readable. Some other papers and reports I have authored and co-authored can be found in Appendix A.



Paper 1

Nils Grimsmo

Faster Path Indexes for Search in XML Data

Proceedings of the Nineteenth Australasian Database Conference (ADC 2008)

Abstract This article describes how to implement efficient memory resident path indexes for semi-structured data. Two techniques are introduced, and they are shown to be significantly faster than previous methods when facing path queries using the descendant axis and wild-cards. The first is conceptually simple and combines inverted lists, selectivity estimation, hit expansion and brute force search. The second uses suffix trees with additional statistics and multiple entry points into the query. The entry points are partially evaluated in an order based on estimated cost until one of them is complete. Many path index implementations are tested, using paths generated both from statistical models and DTDs.


1 Introduction

With the advent of XML and query languages such as XPath and XQuery came the need for efficient ways to query the structure of XML documents. This article focuses on settings where a document collection can be indexed in advance, as opposed to querying on the fly. For efficient solutions to the latter problem, see for example Gottlob et al. (2005). An important component in many systems indexing XML is a path index (Bertino and Kim 1989) summarising and indexing all unique paths seen in the document collection. (Other names for similar structures are representative objects (Nestorov et al. 1997), DataGuides (Goldman and Widom 1997), and access support relations (Kemper and Moerkotte 1992).) For many document collections following schemas, this set of paths will be small compared to the total size of the data. The path index is in some way connected to a value index (or content index), which allows search for words or values.

XPath is a query language allowing search for regular path expressions in XML documents. It is a simple declarative language, but techniques used for XPath queries can also be components in more advanced procedural query languages such as XQuery. Many FLWOR expressions can also be rewritten to simpler XPath expressions (Michiels et al. 2007).

Assume the XML document shown in Figure 1, and the XPath query /a//c/"foo".¹ There are two matches for the path part of the query, and four matches for the value predicate "foo", but only one match for the entire query, which is the third occurrence of "foo". Note that each unique path is only indexed once in a path index. There are two occurrences of the path /a/b in the example document, but there will only be one in a path index.

¹ Let path/"term" be short for path[normalize-space(text())="term"].

An efficient path index is important when indexing large heterogeneous document collections for structural querying. If the data has a very homogeneous structure, the number of unique paths seen is small, and the implementation of the path index itself is not significant. An example where document structure could be very heterogeneous is an enterprise search engine, indexing all information generated by a business. This could be composed of the content from multiple databases, repositories of reports, etc. Another case is search engines for the semantic web, where structural search is a key feature. It is hard to imagine a future for the semantic web without search engines that scale as well as current web search engines.

1.1 Previous approaches

Many approaches for indexing XML and supporting path queries have been proposed in the last ten years. They can mainly be divided into node indexing, path indexing and sequence indexing. In node indexing, one makes an inverted index for all XML tags, and one for all data values and terms. Given a simple path query, look up all the tags and values, and merge the results. To be able to merge hits for XML elements, an encoding which tells whether a node is a child or descendant of another node is needed. Common solutions are the range based (docid, start:end, level) encoding (Zhang et al. 2001), the (post, pre) encoding (Grust 2002) and the prefix based Dewey order encoding (Tatarinov et al. 2002).

<a>              1
  foo            1.1
  <b>foo</b>     1.2(.1)
  <b>            1.3
    <c>foo</c>   1.3.1(.1)
    <b>          1.3.2
      bar        1.3.2.1
      <a>        1.3.2.2
        bar      1.3.2.2.1
        <b></b>  1.3.2.2.2
      </a>
    </b>
  </b>
  <b>foo</b>     1.4(.1)
  <c>bar</c>     1.5(.1)
</a>

Figure 1: Example XML document. Dewey order encoding of elements shown on the right.

A “brute force” merge between element hits may be extremely inefficient. A lot of research has gone into finding better merge algorithms in terms of time and IO-complexity: MPMGJN (Zhang et al. 2001), EE/EA-join (Li and Moon 2001), tree-merge and stack-tree (Al-Khalifa et al. 2002), Anc Des (Chien et al. 2002), PathStack and TreeStack (Bruno et al. 2002), XR-tree and XR-stack (Wang and Ooi 2003), TSGeneric (Jiang et al. 2003), and TJFast (Lu et al. 2005). Even though these algorithms are very efficient in terms of their input and output, many systems which use them perform a lot of unnecessary work. Assume for example the query /a/b/c/"foo", and that a is the first element of the path for half of the data values stored in the database, while c is seen only a few times. To do the merge, either the entire occurrence list for a must be read from disk, or every occurrence of c must be looked up in some data structure for a on disk. It is obvious that methods which utilise the varying selectivity of the query elements or the full paths will be faster.

Various methods which use index structures other than inverted lists have been proposed. The index fabric (Cooper et al. 2001) maintains a layered Patricia trie of all paths seen in the data. It is organised in multiple levels, so that a path query will result in a number of (block) lookups logarithmic in the size of the total data. A problem with this index structure is that only queries starting at the document root element are efficiently supported.

When the structure of XML documents is highly regular, the number of unique paths seen in a collection will be very small compared to the total size of the data. This set of unique paths can often fit in main memory, and be searched very efficiently. The first use of path indexing known to the author is by Bertino and Kim (1989), while perhaps the best known is the DataGuide (Goldman and Widom 1997) used in the Lore DBMS for semistructured data.

Let a path summary denote a summary of all paths which is not indexed for fast matching. Perhaps the simplest use of path summaries is by Buneman et al. (2005), where all paths are extracted from the data, and maintained as an in-memory “skeleton”. For each path seen, there is an on-disk vector containing all terms and values seen below instances of this path. Weaknesses of this approach are that a full search for matching paths in the skeleton is required when paths do not start with the root element, and that a brute force (or binary) search through the vector is necessary for queries with value predicates. The ToXin system (Rizzolo and Mendelzon 2001) improves the latter by maintaining for each path an index over the values seen below it. The strength of ToXin is an efficient matching of twig queries, by storing navigational information for the data in the nodes in the index. A further improvement is ToXop (Barta et al. 2005), in which query plans are made based on the selectivity of the path query elements, and clever combinations of merges and searches are used. A potential weakness is that if a query does not start with a root element, a brute force search through the path summary is required to match the path expression.

An enhancement over a brute force search through the summary is to make an inverted index over the paths on their tags. Given a path query, look up the individual tags in a path index and merge the results. This is used in SphinX (Poola and Haritsa 2007), where there is a value index for each path (as in ToXin). In the case where the path index is of considerable size, the merging can be costly. The systems APEX (Chung et al. 2002) and XIST (Runapongsa et al. 2004) address this by maintaining index entries for sub-paths of lengths greater than one on demand.

A simple and elegant system for XML indexing using path indexing in an RDBMS is XRel (Yoshikawa et al. 2001). One of the four tables used is a mapping from paths to integer identifiers. All text and values indexed have a reference to the path under which they reside, and path matching is done using simple LIKE queries with wild-cards in the path table. Similar solutions are used in many systems based on RDBMSs.

A problem with keeping a separate value index for each path is cases where many paths match the query. The worst case scenario is when the query consists of only a value predicate. This results in many disk accesses, unless the indexes are stored in some interleaved fashion, grouped on the value key. An alternative is to have a single value index, where the occurrences of a value are stored with their parent path ID. After the entry for a value has been found, the occurrences are filtered on matching path IDs. If the occurrence list is large, it can be stored sorted on path ID, and pointers into the list can be used to avoid having to read all of it (known as skip lists).

When the path summary fits in main memory, the choice of index structures which are suitable for implementing it is greater than if it had to reside on disk. One structure which is only efficient in main memory is the suffix tree (McCreight 1976). PIGST (Zuopeng et al. 2007) is a system maintaining a generalised suffix tree as the path index.² See Section 2.4 for a description of this solution. A more common use of (often pruned) suffix trees is selectivity estimation for optimising query plans (Aboulnaga et al. 2001, Chen et al. 2001).

A method very different from node and path indexing is sequence indexing, where all documents are converted to a sequence representation, and searching is done by subsequence matching. ViST³ (Wang et al. 2003) is a system using this approach. An advantage is that searching for twig queries can be done without merging partial results. A problem with ViST is that the index has quadratic size in the worst case, if the trees indexed are very deep. PRIX (Rao and Moon 2004) solves this by taking a different approach to the sequencing, using Prüfer sequences. Wang and Meng (2005) use a representation similar to that in ViST, but using a more clever sequencing they optimise for smaller indexes and faster queries. The querying process also becomes much simpler than in ViST and PRIX.

² The authors make some extensions which make the suffix trees super-linear in size, seemingly without considering this.
³ Stands for Virtual Suffix Tree, but only due to a misconception from the author.

1.2 Contributions

This article describes how to do efficient path matching. It is assumed that there is an overlying system similar to what is common when using path indexing (see Section 2.1).

• It is shown that combining an inverted index for the path summary with brute force search is in practice much faster than merging path element hits. The methods introduced exploit the varying selectivity of the query path elements.

• It is shown how the use of a generalised suffix tree can be enhanced by adding statistics to the tree nodes, and changing the way searches are performed. Multiple entry points into the query are partially evaluated in parallel, depending on the evaluation cost.

• Many path index implementations are compared, using paths generated from statistical models and from DTDs.

2 Path index implementation

Below follow descriptions of various solutions for implementing path summaries.

2.1 Assumptions

This article only addresses the implementation of the path index, and assumes that an overlying system with the following design is using it: All values and terms seen in the document collection are indexed by ordinary inverted lists. Stored in each entry is information encoding document ID, position in the document, the value's parent path identifier, and a local specifier for the path instance (range based or prefix based). Figure 2 shows the value index for the example XML document in Figure 1, with path ID and Dewey order encoding of path instance shown. Document ID and position are omitted for brevity. The enumeration of the paths is shown in Figure 3.

"foo": 1 1.1<br />

2 1.2.1<br />

3 1.3.1.1<br />

2 1.4.1<br />

"bar": 4 1.3.2.1<br />

5 1.3.2.2.1<br />

7 1.5.1<br />

Figure 2: Value index for example XML from Figure 1. Storing path ID <strong>and</strong> path instance<br />

Dewey order. Document ID <strong>and</strong> position omitted.<br />

1. /a
2. /a/b
3. /a/b/c
4. /a/b/b
5. /a/b/b/a
6. /a/b/b/a/b
7. /a/c

Figure 3: Enumeration of unique paths seen in the XML in Figure 1. This is the set of paths which would be indexed by a path index.

Given a non-branching XPath query with a value predicate, all paths matching are found with the path index. The value is then looked up in the value index, and the hits are filtered with the set of matching paths. For the query /a//c/"foo", the paths matching /a//c in the example XML are numbers 3 and 7. The only occurrence in the lists for "foo" with a path which matched is (3 1.3.1.1). In the case of XPath queries without value predicates, an index for occurrences of XML tags should be maintained in addition. The value index may also have the occurrence lists split/sorted on path ID for faster filtering, as in ToXin (Rizzolo and Mendelzon 2001) and SphinX (Poola and Haritsa 2007).
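As a toy illustration of this flow, using the data of Figures 2 and 3 (the code is mine, not from the article), suppose the path index has already returned the IDs of the paths matching /a//c:

    VALUE_INDEX = {   # value -> list of (path ID, Dewey of path instance)
        "foo": [(1, "1.1"), (2, "1.2.1"), (3, "1.3.1.1"), (2, "1.4.1")],
        "bar": [(4, "1.3.2.1"), (5, "1.3.2.2.1"), (7, "1.5.1")],
    }

    matching_paths = {3, 7}   # result of matching /a//c in the path index
    hits = [h for h in VALUE_INDEX["foo"] if h[0] in matching_paths]
    print(hits)               # [(3, '1.3.1.1')] -- the single answer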

It is assumed that representatives for the unique paths seen in a document collection fit in main memory, and furthermore that any index structure linear in their size does too. Only "simple" path expressions are considered, not twigs. An example of a twig XPath query is /a/b[c/"foo" and b/"bar"]. It is assumed that a system using the path index here would perform two queries (one for each branch in the twig), and merge the results. Here the merge would check for a common prefix of length three in the Dewey encoding of the paths. Note that the problem is much more involved in general; unordered tree inclusion, for instance, is even NP-complete (Kilpeläinen 1992). The reason twigs are not treated here is that in most cases the leaves of the twigs will be value predicates (as in the given query), which will have to be looked up in the value index in any case, given the overall system design.

Below follow descriptions of the various path index data structures and matching approaches.

2.2 Brute force search

The simplest way to implement a path index is to store the paths seen in a list, and perform brute force searches for path expressions through this list. Given regular path expressions, a deterministic finite automaton (DFA) for the query can be built (Aho et al. 1986). A DFA can be exponential in the size of the query, but in most cases queries can be considered to be of constant length. Using a DFA gives a linear time scan through the data. For document collections with large data but a small schema, a brute force search may be a sufficient solution, as the scan through memory is relatively cheap compared to the disk accesses needed for the lookup into the value index.
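A minimal sketch of this approach, assuming paths are stored as strings of tag names separated by /, as in Figure 3. Translating the path expression to java.util.regex gives a backtracking matcher rather than a true DFA, but serves to illustrate the single-pass scan; the sketch anchors the expression at both ends, which corresponds to a rooted query required to match the end of a path:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class BruteForcePathIndex {
    private final List<String> paths = new ArrayList<>(); // e.g. "/a/b/c"

    void add(String path) { paths.add(path); }

    /** Translates a simple path expression to a regex. An empty step stems
     *  from "//" and marks the descendant axis; "*" is a wild-card. */
    static Pattern toPattern(String expr) {
        StringBuilder regex = new StringBuilder("^");
        String[] steps = expr.split("/");   // steps[0] is empty (leading "/")
        boolean descendant = false;
        for (int i = 1; i < steps.length; i++) {
            if (steps[i].isEmpty()) { descendant = true; continue; }
            if (descendant) { regex.append("(/[^/]+)*"); descendant = false; }
            regex.append('/')
                 .append(steps[i].equals("*") ? "[^/]+" : Pattern.quote(steps[i]));
        }
        return Pattern.compile(regex.append('$').toString());
    }

    List<String> search(String expr) {
        Pattern p = toPattern(expr);
        List<String> result = new ArrayList<>();
        for (String path : paths)           // linear scan through all paths
            if (p.matcher(path).matches())
                result.add(path);
        return result;
    }
}

For example, toPattern("/a//c") yields ^/a(/[^/]+)*/c$, which matches paths 3 and 7 in Figure 3.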

2.3 Inverted list solutions

When inverting paths, each tag in a path is treated as a symbol. For each symbol, the index contains a list of the positions in which it occurs, given as path ID and position within the path. An index for the paths in Figure 3 is shown in Figure 4. Whether the index should store pairs of path ID and position, or store the path ID and a list of occurrence positions within the path, depends on the expected lengths of these lists. In an implementation not using compression, an additional integer would be needed for storing the length of each list. This means the latter approach is more space efficient when the expected list length is greater than two, which could happen with recursive document schemas. The approach using simple pairs was chosen for simplicity in the implementations used here.

a: 1,1 2,1 3,1 4,1 5,1 5,4 6,1 6,4 7,1
b: 2,2 3,2 4,2 4,3 5,2 5,3 6,2 6,3 6,5
c: 3,3 7,2

Figure 4: Inverted lists for the paths seen in the example XML, storing path ID and position within path.

Given a path query using only the child axis, each element is looked up, and the results are merged where the elements are adjacent in paths. Assuming the query //a/b/c and merging left to right, first merge the hits for a and b, and keep all hits with adjacent elements. The reason all hits are needed, even though the final output is only path IDs, is that it is not yet known which hits for a/b have adjacent hits for c. After merging with c in the example, a match in path 3 is left, from position 1 to 3.

For the descendant axis, hits need not be adjacent, only in the correct order. Given that the element hits are merged left to right, only the hit with the leftmost right border needs to be passed on to the next step in the merge. For the query //a//b//c, first merge the hits for a and b, and keep at most a single match in each path, one with b as far to the left as possible. Then merge this hit set with the hits for c.

For queries using both the child and descendant axes, the hits for elements in a parent–child relationship should be merged first, then elements in an ancestor–descendant relationship. This is because the former needs all intermediate hits, while the latter does not.
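To make the two merge steps concrete, below is a minimal sketch under the representation of Figure 4, with occurrences as (path ID, position) pairs; the initial hit set is built directly from the inverted list of the leftmost element. The quadratic nested loops are for clarity only; the actual merges would be single linear passes over lists sorted on path ID and position:

import java.util.ArrayList;
import java.util.List;

/** A partial match in a path, spanning positions from..to. */
class PathHit {
    final int pathId, from, to;
    PathHit(int pathId, int from, int to) {
        this.pathId = pathId; this.from = from; this.to = to;
    }
}

class PathMerges {
    /** Child axis: a hit survives if the next element occurs at the position
     *  immediately after its right border, in the same path. All surviving
     *  hits are kept. */
    static List<PathHit> mergeChild(List<PathHit> hits, List<int[]> next) {
        List<PathHit> out = new ArrayList<>();
        for (PathHit h : hits)
            for (int[] occ : next)                 // occ = {pathId, position}
                if (occ[0] == h.pathId && occ[1] == h.to + 1)
                    out.add(new PathHit(h.pathId, h.from, occ[1]));
        return out;
    }

    /** Descendant axis: the next element only has to occur somewhere after
     *  the right border; keep the extension with the leftmost right border. */
    static List<PathHit> mergeDescendant(List<PathHit> hits, List<int[]> next) {
        List<PathHit> out = new ArrayList<>();
        for (PathHit h : hits) {
            int best = Integer.MAX_VALUE;
            for (int[] occ : next)
                if (occ[0] == h.pathId && occ[1] > h.to)
                    best = Math.min(best, occ[1]);
            if (best != Integer.MAX_VALUE)
                out.add(new PathHit(h.pathId, h.from, best));
        }
        return out;
    }
}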

2.3.1 Indexing tuples

The performance of path queries using the child axis can be greatly improved by indexing pairs, triples, or even longer substrings in the inverted lists for the paths. What is indexed can be decided dynamically, as in APEX (Chung et al. 2002) or XIST (Runapongsa et al. 2004), or statically, as is done here for simplicity. If the size of the data (in this case the paths) is large compared to the alphabet, the space overhead associated with starting a list in the index is small compared to the size of the list contents. In this case, an index for all pairs will not require much more space than an index for all single elements. It is expected that using pairs or triples will greatly reduce query cost in practice, as these will probably have much better selectivity than single elements.
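A sketch of statically indexing singles and pairs (triples are analogous); the map-based layout and the string keys are illustrative, not how the tested implementations store their lists:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TupleIndex {
    // key: a single tag like "a", or a pair like "a/b";
    // value: occurrences as {pathId, position} pairs
    private final Map<String, List<int[]>> lists = new HashMap<>();

    void addPath(int pathId, String[] tags) {
        for (int i = 0; i < tags.length; i++) {
            put(tags[i], pathId, i + 1);                         // singles
            if (i + 1 < tags.length)
                put(tags[i] + "/" + tags[i + 1], pathId, i + 1); // pairs
        }
    }

    private void put(String key, int pathId, int pos) {
        lists.computeIfAbsent(key, k -> new ArrayList<>())
             .add(new int[] {pathId, pos});
    }

    List<int[]> occurrences(String key) {
        return lists.getOrDefault(key, List.of());
    }
}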

2.3.2 Extending possible hits

Given queries using the child axis, a simple trick can be used to improve performance. Assume only single elements are indexed, and the query is /a/b/c. If a is the root element of every second path, b exists in around half the paths, but c is seldom seen, the size of the result will be very small compared to the total cost of the merges. The cost of merging large and small hit sets can be reduced by performing binary searches in the large set. The merges can also be done out of order to reduce the cost.

A simpler and more efficient solution is possible. Since all paths reside in main memory, checking single elements in the paths is very cheap. Take the hits for the most selective element, and for each one, check whether it is preceded and succeeded by the needed elements. This avoids merging with larger sets due to poorer selectivity. The method can be combined with indexing pairs and triples.
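A sketch of the extension check, assuming the paths are kept in memory as arrays of tags (the names are hypothetical):

class HitExtension {
    /** Checks whether a hit for the most selective element of a
     *  child-axis-only query extends to a full match. path holds the tags of
     *  the path the hit occurred in, pos is the 0-based position of the hit,
     *  and qPos is the index of the selective element within the query. */
    static boolean extendsToFullMatch(String[] path, int pos,
                                      String[] query, int qPos) {
        int start = pos - qPos;            // where query[0] would have to sit
        if (start < 0 || start + query.length > path.length)
            return false;
        for (int i = 0; i < query.length; i++)
            if (!path[start + i].equals(query[i]))
                return false;              // a preceding or succeeding element mismatches
        return true;
    }
}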

2.3.3 Estimate, choose, brute

Expensive merges on the descendant axis can also be avoided when a part of the query has good selectivity. Assume the XPath query /a//c, where c has good selectivity but a does not. As the paths are relatively short strings, a brute force search through the set of paths containing c should be more efficient than a merge. The memory management overhead of handling intermediate hit sets is also avoided.
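A sketch of this strategy, reusing the TupleIndex and BruteForcePathIndex sketches from above: the occurrence list sizes act as selectivity estimates, the contiguous part with the fewest hits selects a candidate set of paths, and the full expression is then checked by brute force. pathsById is assumed to map path IDs to path strings (IDs starting at 1, as in Figure 3):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class EstimateChooseBrute {
    /** parts: the contiguous (child-axis-only) parts of the query, e.g.
     *  {"a", "c"} for /a//c, limited to the tuple lengths indexed. */
    static List<Integer> search(TupleIndex index, String[] parts, String expr,
                                String[] pathsById) {
        String best = parts[0];            // part with the fewest occurrences
        for (String part : parts)
            if (index.occurrences(part).size() < index.occurrences(best).size())
                best = part;
        Set<Integer> candidates = new HashSet<>();
        for (int[] occ : index.occurrences(best))
            candidates.add(occ[0]);
        Pattern p = BruteForcePathIndex.toPattern(expr);
        List<Integer> result = new ArrayList<>();
        for (int pathId : candidates)      // brute force through candidates only
            if (p.matcher(pathsById[pathId]).matches())
                result.add(pathId);
        return result;
    }
}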

2.4 Suffix tree solutions

A generalised suffix tree for a set of strings is a compacted trie for all suffixes of the strings (McCreight 1976). An example tree is shown in Figure 5. This structure can be built in time and space linear in the total length of the strings for constant and integer alphabets (Farach 1997). For general alphabets the complexity is O(n log |Σ|). The implementation used in this article combines the child arrays from Grimsmo (2005, 2007) and hashing. The index can decide whether a given string exists as a substring in the set in expected time linear in the length of the string, and all hits can then be extracted in time linear in their number. The set of paths in a path summary can be seen as a set of strings, where XML tags are string symbols, and indexed with the suffix tree. This requires that the suffix tree implementation can handle the possibly large alphabet of XML tags. If this is not the case, the paths can be spelled out separated by delimiters, and these longer strings can be indexed. In the implementation used here, tags were mapped to an integer alphabet used as string symbols.

Figure 5: Generalised suffix tree for the strings abba and abbb.

If only the child axis is used, all matches are contiguous sub-paths, and the node whose subtree represents all hits can be found in optimal time. But not all occurrences of the sub-path are needed, only the set of paths containing them. As an example, a/b occurs twice in the path /a/b/b/a/b, but given the query //a/b//"foo", only the fact that it occurs is of interest. In PIGST (Zuopeng et al. 2007) this is solved by storing in each internal node of the tree the set of path IDs seen in the subtree below. For random paths from a uniform distribution, the average ratio between the size of the subtree and the size of the list of path IDs will be inversely proportional to the average path length. Another argument for their solution is that the nodes in a suffix tree built with any known linear construction algorithm have bad spatial locality, while the path IDs stored in the lists in PIGST have perfect locality. This is of importance also in main memory, because of the cache effects of modern computers. No bounds on space usage or construction time for their extended suffix tree are given (Zuopeng et al. 2007), but both are super-linear, even in the average case (Grimsmo and Bjørklund 2007). Finding the set of strings which contain a given substring is known as the document listing problem, and can actually be solved optimally with linear preprocessing (Muthukrishnan 2002).


2.4.1 Intersect, brute

The straightforward way to search for a path expression using the descendant axis is to first do a search for the node representing the first part of the query (using only the child axis), and then do a full recursive search of the subtree below. To avoid this, PIGST does a separate search for each part of the query, intersects the resulting sets of path IDs, and performs a brute force search through the set of possible paths. Note that this merging on the descendant axis is different from what is described earlier in Section 2.3, as sets of path IDs are intersected, not sets of hits in paths, where in-path order is of importance.

A variant of this is to take only the smallest set of path IDs, and perform a brute force search through the respective paths. This is similar to what was described for inverted files in Section 2.3.3. One difference is that if a part of a query using only the child axis has been matched in the tree, the size of the hit set does not need to be estimated, as it is known. Another is that the set of matches is extracted without overhead, no matter how long the query part is. For inverted lists, merging or hit expansion was necessary if the part was longer than the tuples indexed.

Skipping the intersection and just doing the brute force search through the smallest set should pay off when the total length of the paths in the smallest set of possible paths is less than the number of path IDs in the largest set. This could happen often if the paths were generated by a source with a skewed distribution.

2.4.2 Selective suffix tree traversal

Below follows the description of a novel algorithm using multiple entry points into the query; pseudo-code is given in Algorithm 1. The nodes of an ordinary generalised suffix tree are each extended with a number giving the size of the subtree below the node. This allows for more intelligent traversal of trees and queries. Two variants are used, where the second uses two suffix trees.

For the first variant, the number of entry points into a query is equal to the number of parts separated by the descendant axis. Given the query /a/b//c//d/e, there are entries starting at a, c and d.

All entry points are kept in a priority queue, ordered on the expected cost of evaluating them one step further. As soon as one is completely evaluated, the matches are extracted. Each entry point may during evaluation be at multiple positions in the suffix tree, if wild-cards or the descendant axis have been used. The cost of moving an entry point forward in the query is the sum of the costs for moving downward at each position held in the tree, with a reduction for having advanced further into the query. Evaluating a step on the child axis costs 1, except when the next symbol is a wild-card, where the cost is equal to the number of children of the current node in the suffix tree. For the descendant axis, the cost of moving one step forward in the query is equal to the size of the subtree below the current node. For an entry point which started inside the query, and has been expanded all the way to the end, the cost of evaluating it "one step further" is an estimate of the cost of a brute force search through the paths with the partial match, which must be done to get a full match.


Input: path expression P, suffix tree ST
Output: set of matching paths

Q ← PriQueue(getEntryPoints(P));
while not complete(front(Q)) do
    ep ← pop(Q);
    next ← {};
    foreach p ∈ ep.positions do
        next ← next ∪ advance(p);
    end
    c ← 0;
    foreach p ∈ next do
        c ← c + nextAdvanceCost(p);
    end
    ep.positions ← next;
    ep.advanceCost ← c;
    push(Q, ep);
end
ep ← front(Q);
return ep.matches;

Algorithm 1: Selective suffix tree traversal

The other variant of this method also uses a suffix tree for the reverse representation of the paths. It has additional entry points moving backwards from the last element in each part of the query, doing the matching in the second suffix tree. For the example query, there would also be entry points moving backwards from b, c and e. A variant using only the suffix tree for the reversed paths is also included in the tests.

2.4.3 Skipping leading wild-cards

A simple variant of the basic use of a suffix tree can be used when the query starts with a wild-card. The leading wild-cards can be omitted from the query, and after the hits have been retrieved, they can be filtered on their starting positions in the paths.

3 Experimental results

The tests were run on an AMD64 3500+ running Linux 2.6.16 compiled for AMD64. All tested implementations were written in Java, and run with 64-bit Sun Java 1.5.0_06. As the Java virtual machine often shows a radical speedup from "warming up" (optimising byte-code), and as many of the solutions shared code, some measures had to be taken to give a fair treatment. Each test was run repeatedly with all implementations, until none of them showed a deviation of more than 2% in total running time for the test from the last attempt. This ensured that all implementations got a proper and fair warmup.


Some care also had to be taken when measuring the memory usage, as Java relies on garbage collection. The garbage collector was called multiple times before the indexing process started, and the base memory usage was measured⁴. It was called again multiple times after the indexing finished, and the difference to the base memory usage was then recorded. If the garbage collector was run only a single time, the space usage measured differed greatly between runs.

4 Runtime.totalMemory() − Runtime.freeMemory()
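In outline, the measurement looked as follows; the repetition count is illustrative, as the text above only states that the collector was invoked multiple times:

class MemoryMeasurement {
    /** Invokes the garbage collector repeatedly before reading the used
     *  heap; a single System.gc() call gave widely varying measurements. */
    static long usedMemoryAfterGc() {
        for (int i = 0; i < 10; i++)    // repetition count is illustrative
            System.gc();
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Measures the memory attributable to building an index. */
    static long indexMemory(Runnable buildIndex) {
        long base = usedMemoryAfterGc();  // base usage before indexing
        buildIndex.run();
        return usedMemoryAfterGc() - base;
    }
}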

3.1 Data generation

The paths used in the tests were generated both from statistical models and from DTDs. A uniform distribution and Zipf distributions were used, in addition to first order Markov models with underlying Zipf distributions.

The DTDs used for path generation were taken from the benchmarks DBLP (Ley 2007), GedML (Kay 1999), Michigan (Runapongsa et al. 2003), XBench (Yao et al. 2003), XMach-1 (Böhme and Rahm 2001), XMark (Schmidt et al. 2001) and XOO7 (Bressan et al. 2003). Paths were generated by a breadth first search through the space of possible paths defined by the DTDs. Some of the DTDs were not used alone, as they were small and/or non-recursive. In tests using multiple DTDs, a breadth first search through their complete space was done, such that all paths of length k were generated before any path of length k + 1. For the data from statistical distributions, path lengths were drawn randomly from 1 to 10.

Queries for paths from DTDs were generated as follows: The length of the query was drawn from its distribution (default: uniform from 1 to 5). Then, for each position in the query, the use of the child or descendant axis was chosen (default: p(desc) = 0.3). If the descendant axis was not used at all, a random path of the specified length was drawn and used as the query. If not, a random path of at least the specified length was drawn; then descendant axis operators were inserted into the query at random locations, and random path elements next to them were removed until the query had the specified length. This procedure ensured that all queries related to the generated data. The probability that a query required matches to start at a root element was set to 0.5. All queries were required to match the end of a path. Finally, the probability that a path element was substituted with a wild-card was by default 0.1.

3.2 Methods tested

The following methods were used in the tests:

Br  Brute force search through all strings. (See Section 2.2)

MgInv  Inverted lists and merging. (2.3)

MgIe1  Inverted lists, selective entry point in contiguous part, expanding on child axis, merging on descendant axis. (2.3.2)

MgIe2  Indexing singles and pairs. (2.3.1+2.3.2)

MgIe3  Indexing singles, pairs and triples. (2.3.1+2.3.2)

EsIe[1,2,3]  Estimating the contiguous part with fewest hits, extracting possible paths, filtering with brute force. (2.3.2+2.3.3)

St  Straightforward use of suffix tree. (2.4)

Ss  Suffix tree, skipping leading wild-cards, and filtering on start position in string. (2.4.3)

InSe  Suffix tree with path ID lists in internal nodes. Intersection of path ID sets on the descendant axis and brute force through the result, as described in (Zuopeng et al. 2007). (2.4.1)

EsSe  Suffix tree with path ID lists in internal nodes. Finding the contiguous part of the query (no descendant axis) with fewest matches, and brute force search through the corresponding set of paths. (2.3.3+2.4.1)

Sm[f,r,2]  Suffix tree(s) with multiple entry points. Testing a single tree with forward strings, reversed strings, and two trees, one forward and one reversed. (2.4.2)

Smfe  Suffix tree enhanced with path ID lists, using multiple entry points, using only the forward tree. (2.4.1+2.4.2)

3.3 Tests using various data sources

Table 1 shows query performance for the tested methods. 10000 paths were indexed, drawn from the different data sources. 5000 queries of length 1 to 5 were run, with a 0.3 probability of using the descendant axis and a 0.1 probability of wild-cards. See later tests for variations over this. Table 2 shows more measures for the test using all the DTDs.

             %    Br MgInv MgIe1 MgIe2 MgIe3 EsIe1 EsIe2 EsIe3   St   Ss InSe EsSe  Smf  Smr  Sm2 Smfe
U 10       1.8  1068  5637  1426   755   715   807   206   164  350  365  241  150  148  224  148   99
U 100      0.2  1012   937   164    82    81    96    25    24   81   51   68   57   22   29   26   19
Z 100,0.7  0.4  1021  1719   306   186   182   117    41    37  126  106   92   55   35   44   41   28
Z 100,1.0  0.9  1038  3658   691   448   437   223    89    77  249  241  156   76   73   90   77   55
Z 100,1.3  2.3  1094  8067  1541  1097  1052   483   219   181  543  549  320  157  165  275  166  117
MZ 100,0.7 0.2  1013   957   177    94    94    96    28    26   92   73   64   48   23   33   26   19
MZ 100,1.0 0.3  1020  1132   230   139   136   108    35    33  144  128   76   46   27   38   30   21
MZ 100,1.3 0.5  1038  1394   307   205   204   137    48    46  215  205   92   50   36   53   40   25
DBLP.dtd   2.4  1241  9455  2283  1724  1673   804   326   271  985 1012  559  262  239  533  286  170
GedML.dtd  1.2  1181  7504  1583  1205  1205   460   119   117  731  737  394   92   75  151   98   56
XMark.dtd  3.2  1347  9754  2382  1731  1689   942   401   359 1097 1137  590  339  336 1074  307  263
*.dtd      0.6  1097  3222   715   600   599   161    78    73  365  369  201   62   61  115   85   50

Table 1: Microseconds per query, average. Testing a uniform distribution (parameter |Σ|), Zipf distributions (|Σ|, s), first order Markov models with underlying Zipf (|Σ|, s), and various DTDs. The second column shows the average query selectivity in per cent.

          Br MgInv MgIe1 MgIe2 MgIe3 EsIe1 EsIe2 EsIe3  St  Ss InSe EsSe Smf Smr Sm2 Smfe
µs/q dev 336  3214  1107  1063  1074   304   220   231 680 691  355  181 170 365 245  132
µs/path    0     2     2     3     5     2     3     6  14  14   21   21  14  17  33   22
b/elem     4    18    18    33    50    18    33    50  48  48   76   76  48  69 114   76

Table 2: Indexing paths from all DTDs. Showing the standard deviation of microseconds per query, microseconds per path when indexing, and bytes per path element in the complete index.

The brute force solution (Br) serves as a base case for comparison. It is faster than some methods on many of the tests, as these methods have to merge very large sets. The performance is also related to the query selectivity. For the test with the poorest average selectivity, the fastest method is only five times faster than the brute force search, while where the selectivity is best, the fastest method is more than 50 times faster. The simplest merging method is MgInv, which looks up every single (non-wild-card) element of the path query and merges the results, first on the child axis, then on the descendant axis. In the case of uniform data (|Σ| = 100) it is comparable with the brute force solution, but it is much slower on many other tests.

The methods named MgIe* improve on this by finding entry points into the contiguous parts of the queries, keeping the hits that expand with a match, and then merging only on the descendant axis. These methods are only faster than Br on the artificial tests with rather uniform data. As the probability of using the descendant axis is 0.3, there will frequently be parts of the query still with low selectivity after expanding over the child axis. Notice that indexing pairs (MgIe2) already gives a significant speedup, while the speedup from indexing triples (MgIe3) is less dramatic. Table 2 shows that MgIe2 uses almost twice as much memory as MgIe1, and MgIe3 almost three times as much, as expected. The total memory used for indexing 10000 paths with MgIe3 was measured to 2.6 MB.

The methods which combine inversion and brute force (EsIe*) see a greater speedup from indexing pairs and triples. They also have a significant speedup over merging in general. On the test using multiple DTDs, EsIe3 is more than eight times faster than MgIe3. Indexing triples does not help the latter much if the shortest contiguous parts of the query are single elements with poor selectivity.

The straightforward use of a suffix tree (St) has better performance than any of the merging variants (Mg*), but is slower than the combinations of indexing, selectivity estimation and brute force (Es*). When the suffix tree encounters a use of the descendant axis, it must traverse an entire subtree, which is a costly operation if the first part of the query was not very selective. The space usage for the suffix tree is similar to that of indexing triples (*Ie3). The improvement of Ss over St on some of the tests comes from queries starting with wild-cards: St must branch to every child of the root node in the tree, while Ss skips this part of the query and filters hits on their starting positions in the paths afterwards. The reason Ss is sometimes slower is probably the overhead of the filtering. St can be fast when a query starts with a wild-card if the next elements are very selective, so that the branching is effectively cut off.

The method InSe, based on PIGST (Zuopeng et al. 2007), is faster than St on all, and faster than Ss on most, of the tests. It uses a suffix tree enhanced with path ID lists, path set intersection and brute force search, as described in Section 2.4.1. As predicted, the related method EsSe is considerably faster on the tests with non-uniform data. It skips the intersection, and performs the brute force search through the smallest set, exploiting the varying selectivity of the parts of the query. It should be noted that EsIe3 is faster than InSe on all the tests, while EsSe has a similar performance. EsIe3 also uses less space and has faster index construction (see Table 2).

The suffix trees using multiple entry points into the query (Sm[f,r,2]) have very good performance. Smf is more than three times faster than InSe on the test using multiple DTDs, and also faster than EsIe3. Using the forward representation of the paths (Smf) is more efficient than using the backward representation (Smr). There are two reasons for this. The first is the probabilities for requiring a match at the beginning and at the end of a path, which are 0.5 and 1.0; when these were swapped, Smr was correspondingly faster on the tests using uniform and Zipf data. For paths generated from DTDs and Markov chains, Smf was still faster. Notice also that Smr uses more memory, and not all of it comes from storing the reverse strings. The number of internal nodes in a suffix tree is upper bounded by the size of the input, but depends on its characteristics, and a larger number of internal nodes gives more expensive tree traversals. It seems the reverse paths from DTDs have Markov properties that give a larger tree.

Building two suffix trees and using both forward and reverse entry points (Sm2) does not increase performance. More entry points are partially evaluated, seemingly without reducing the total cost. It is possible that a more intelligent and well tuned implementation would give better results. Sm2 used the most memory of all the implementations, totalling 5.8 MB on the test with all DTDs.

Smfe combines features from Smf and InSe, and turns out to beat all other methods on query performance. A drawback compared to Smf is the increased construction time and memory usage, which comes from using the data structures from InSe.

In the subsequent tests, only the fastest representatives from each group of implementations are shown.

3.4 Increasing number of paths

Figure 6 shows how the number of hits returned per second changes as the number of paths indexed increases. The paths were generated from all the DTDs, and otherwise the default parameters were used. Hits are counted instead of queries to give a better perspective on what happens when the size of the data is large. A "higher is better" measure is used to improve the visualisation of the difference between the best methods. Smfe, Smf, EsSe and EsIe3 out-scale the other methods by a good margin, with the first showing the best performance. The reason for the various drops and rises in the graph may be that the distribution of the paths generated from the DTDs changes as the maximal path length increases. A similar test on paths from a Zipf distribution did not show the same drops.

Figure 6: Increasing number of paths. Hits per second. [Line plot: average hits per second against number of paths indexed (100 to 10000, logarithmic axis), for Br, MgIe3, EsIe3, Ss, EsSe, Smf and Smfe; curve data not recoverable from the text extraction.]

3.5 Descendant axis

Figure 7 shows hits returned per second as the probability of using the descendant axis increases. The paths were generated from the DTDs, and the default parameters were used, except that the probability of wild-cards was set to zero, in order to isolate the effect of using the descendant axis. 10000 paths were indexed. Here EsSe is fastest when the probability of using the descendant axis is low, while Smfe is fastest when it is high. EsSe depends on finding contiguous parts of the query with good selectivity, which is harder when all parts are very short.

Figure 7: Increasing probability of descendant axis. Hits per second. [Line plot: average hits per second against probability of the descendant axis (0 to 1), for Br, MgIe3, EsIe3, Ss, EsSe, Smf and Smfe; curve data not recoverable from the text extraction.]

Note that the simple use of a suffix tree (Ss) has the best performance when there is no use of the descendant axis, but poor performance otherwise. The other methods using suffix trees could switch to the simple search technique when this is expected to be profitable.


3.6 Wild-cards

Increasing the probability of wild-cards gives different results than increasing the use of the descendant axis, as shown in Figure 8. The descendant axis was not used at all in this test. Here the suffix trees are much faster than the other methods. For the trees, branching on a wild-card is much less critical than branching on the descendant axis, as the former is cut off as soon as a mismatch is seen, while the latter introduces a full search of the subtree.

Figure 8: Increasing probability of wild-card. Hits per second. [Line plot: average hits per second against probability of wild-cards (0 to 0.5), for Br, MgIe3, EsIe3, Ss, EsSe, Smf and Smfe; curve data not recoverable from the text extraction.]

It is interesting to note that the simple use of a suffix tree, only skipping leading wild-cards (Ss), is here much more efficient than the implementation supporting multiple entry points into the query (Smf), even though only a single entry point is created. The realisation of the search automaton in Ss is just a recursive program, while Smf maintains a set of state objects, giving a lot of overhead.

3.7 Index construction

Figure 9 shows the indexing performance of the various indexes as the number of paths increases. Single representatives for methods using the same data structures were tested: MgIe1 represents MgInv and EsIe1, MgIe2 represents EsIe2, MgIe3 represents EsIe3, St represents Ss and Smf, and finally InSe represents EsSe and Smfe. The methods tested are all asymptotically linear in the worst case, except InSe. The performance drop observed is probably due to the overhead of memory management and cache effects. The construction of the suffix trees is slower than the construction of inverted lists, but as the time cost of adding a path is less than 30 µs, this would constitute a negligible part of the cost of indexing XML in a complete system, if it is assumed that the path index can reside in main memory while the value index must reside on disk.

Figure 9: Paths indexed per second. [Line plot: average paths indexed per second against number of paths (100 to 10000), for MgIe1, MgIe2, MgIe3, St, Sm2 and InSe; curve data not recoverable from the text extraction.]

4 Conclusion and future work

The most advanced method using suffix trees (Smfe) is the winner on query performance in these tests. Whether it should be chosen as the path index component in a larger system depends on the amount of main memory available, and on whether or not the performance of the path index significantly impacts the performance of the complete system. Another issue is the complexity of the implementation: the suffix tree itself is a rather complex structure, and its use as described here may be hard to grasp.

The method combining inverted lists, extension over the child axis and brute force (EsIe3) would be the author's choice when implementing a larger system. It is conceptually simple, easy to implement, and has good performance. If memory usage is an issue, EsIe2 or EsIe1 could be used. These methods could also be modified to work on disk when the path index does not fit in main memory. The paths themselves could be stored in the entries in their inverted lists, allowing the extension technique to work, at the cost of using more disk space and IO.

In future work, the author would like to add the fast path summaries introduced here to an existing system for indexing XML, and compare this with various implementations, both with and without path summaries, such as Lore (Goldman and Widom 1997), ToXin (Rizzolo and Mendelzon 2001), XISS (Li and Moon 2001), XIST (Runapongsa et al. 2004), Ctree (Zou et al. 2004), SphinX (Poola and Haritsa 2007) and MXI (Yan and Liang 2005).

Acknowledgements

The author would like to thank Øystein Torbjørnsen, Svein-Olaf Hvasshovd and Truls Amundsen Bjørklund for helpful feedback on this article.


References

Aboulnaga, A., Alameldeen, A. R. & Naughton, J. F. (2001), Estimating the selectivity of XML path expressions for internet scale applications, in 'Proc. VLDB', pp. 591–600.

Aho, A., Sethi, R. & Ullman, J. (1986), Compilers: Principles, Techniques and Tools, Addison Wesley, Reading.

Al-Khalifa, S., Jagadish, H., Koudas, N., Patel, J. & Srivastava, D. (2002), Structural joins: A primitive for efficient XML query pattern matching, in 'Proc. ICDE', pp. 141–152.

Barta, A., Consens, M. P. & Mendelzon, A. O. (2005), Benefits of path summaries in an XML query optimizer supporting multiple access methods, in 'Proc. VLDB', pp. 133–144.

Bertino, E. & Kim, W. (1989), 'Indexing techniques for queries on nested objects', IEEE Transactions on Knowledge and Data Engineering 1(2), 196–214.

Böhme, T. & Rahm, E. (2001), 'XMach-1: A benchmark for XML data management', Proc. German database conference BTW, pp. 264–273.

Bressan, S., Lee, M., Li, Y., Lacroix, Z. & Nambiar, U. (2003), 'The XOO7 benchmark', Proc. EEXTT'02, pp. 146–147.

Bruno, N., Koudas, N. & Srivastava, D. (2002), Holistic twig joins: Optimal XML pattern matching, in 'Proc. SIGMOD', pp. 310–321.

Buneman, P., Choi, B., Fan, W., Hutchison, R., Mann, R. & Viglas, S. D. (2005), Vectorizing and querying large XML repositories, in 'Proc. ICDE', pp. 261–272.

Chen, Z., Jagadish, H., Korn, F., Koudas, N., Muthukrishnan, S., Ng, R. & Srivastava, D. (2001), 'Counting twig matches in a tree', Proc. ICDE.

Chien, S., Vagena, Z., Zhang, D., Tsotras, V. & Zaniolo, C. (2002), 'Efficient structural joins on indexed XML documents', Proc. VLDB, pp. 263–274.

Chung, C.-W., Min, J.-K. & Shim, K. (2002), APEX: An adaptive path index for XML data, in 'Proc. SIGMOD', pp. 121–132.

Cooper, B., Sample, N., Franklin, M. J., Hjaltason, G. R. & Shadmon, M. (2001), A fast index for semistructured data, in 'Proc. VLDB', pp. 341–350.

Farach, M. (1997), Optimal suffix tree construction with large alphabets, in 'Proc. FOCS', pp. 137–143.

Goldman, R. & Widom, J. (1997), DataGuides: Enabling query formulation and optimization in semistructured databases, in 'Proc. VLDB', pp. 436–445.

Gottlob, G., Koch, C. & Pichler, R. (2005), 'Efficient algorithms for processing XPath queries', ACM Trans. Database Syst. 30(2), 444–491.

Grimsmo, N. (2005), Dynamic indexes vs. static hierarchies for substring search, Master's thesis, Norwegian University of Science and Technology.

Grimsmo, N. (2007), On performance and cache effects in substring indexes, Technical Report IDI-TR-2007-04, NTNU, Trondheim, Norway.

Grimsmo, N. & Bjørklund, T. A. (2007), On the size of generalised suffix trees extended with string ID lists, Technical Report IDI-TR-2007-01, NTNU, Trondheim, Norway.

Grust, T. (2002), Accelerating XPath location steps, in 'Proc. SIGMOD', pp. 109–120.

Jiang, H., Wang, W., Lu, H. & Yu, J. (2003), Holistic twig joins on indexed XML documents, in 'Proc. VLDB', pp. 273–284.

Kay, M. H. (1999), 'GedML'. http://users.breathe.com/mhkay/gedml/.

Kemper, A. & Moerkotte, G. (1992), 'Access support relations: an indexing method for object bases', Inf. Syst. 17(2), 117–145.

Kilpeläinen, P. (1992), Tree matching problems with applications to structured text databases, Technical Report A-1992-6, Department of Computer Science, University of Helsinki.

Ley, M. (2007), 'DBLP XML Records'. http://www.informatik.uni-trier.de/~ley/db/.

Li, Q. & Moon, B. (2001), Indexing and querying XML data for regular path expressions, in 'Proc. VLDB', pp. 361–370.

Lu, J., Ling, T., Chan, C. & Chen, T. (2005), From region encoding to extended Dewey: On efficient processing of XML twig pattern matching, in 'Proc. VLDB', pp. 193–204.

McCreight, E. M. (1976), 'A space-economical suffix tree construction algorithm', J. ACM 23(2), 262–272.

Michiels, P., Mihaila, G. A. & Simeon, J. (2007), Put a tree pattern in your algebra, in 'Proc. ICDE', pp. 246–255.

Muthukrishnan, S. (2002), Efficient algorithms for document retrieval problems, in 'Proc. SODA', pp. 657–666.

Nestorov, S., Ullman, J. D., Wiener, J. L. & Chawathe, S. S. (1997), Representative objects: Concise representations of semistructured, hierarchical data, in 'Proc. ICDE', pp. 79–90.

Poola, L. & Haritsa, J. (2007), 'Schema-conscious XML indexing', Information Systems 32, 344–364.

Rao, P. & Moon, B. (2004), PRIX: Indexing and querying XML using Prüfer sequences, in 'Proc. ICDE', IEEE Computer Society, Washington, DC, USA, p. 288.

Rizzolo, F. & Mendelzon, A. (2001), Indexing XML data with ToXin, in 'Proc. WebDB'.

Runapongsa, K., Patel, J., Bordawekar, R. & Padmanabhan, S. (2004), XIST: An XML index selection tool, in 'Proc. XSym'.

Runapongsa, K., Patel, J., Jagadish, H. & Al-Khalifa, S. (2003), The Michigan Benchmark: A microbenchmark for XML query processing systems, in 'Proc. EEXTT'02', Springer.

Schmidt, A. R., Waas, F., Kersten, M. L., Florescu, D., Manolescu, I., Carey, M. J. & Busse, R. (2001), The XML Benchmark Project, Technical Report INS-R0103, CWI, Amsterdam, The Netherlands.

Tatarinov, I., Viglas, S. D., Beyer, K., Shanmugasundaram, J., Shekita, E. & Zhang, C. (2002), Storing and querying ordered XML using a relational database system, in 'Proc. SIGMOD', pp. 204–215.

Wang, H. & Meng, X. (2005), On the sequencing of tree structures for XML indexing, in 'Proc. ICDE', pp. 372–383.

Wang, H. & Ooi, B. (2003), XR-tree: Indexing XML data for efficient structural joins, in 'Proc. ICDE', pp. 253–264.

Wang, H., Park, S., Fan, W. & Yu, P. (2003), 'ViST: a dynamic index method for querying XML data by tree structures', Proc. SIGMOD, pp. 110–121.

Yan, L. & Liang, Z. (2005), 'Multiple schema based XML indexing', Lecture Notes in Computer Science 3619, 891–900.

Yao, B. B., Özsu, M. T. & Keenleyside, J. (2003), XBench - a family of benchmarks for XML DBMSs, in 'Proc. EEXTT'02', pp. 162–164.

Yoshikawa, M., Amagasa, T., Shimura, T. & Uemura, S. (2001), 'XRel: a path-based approach to storage and retrieval of XML documents using relational databases', ACM Transactions on Internet Technology 1(1), 110–141.

Zhang, C., Naughton, J., DeWitt, D., Luo, Q. & Lohman, G. (2001), 'On supporting containment queries in relational database management systems', SIGMOD Rec. 30(2), 425–436.

Zou, Q., Liu, S. & Chu, W. W. (2004), Ctree: a compact tree for indexing XML data, in 'Proc. WIDM', pp. 39–46.

Zuopeng, L., Kongfa, H., Ning, Y. & Yisheng, D. (2007), 'An efficient index structure for XML based on generalized suffix tree', Information Systems 32, 283–294.

Paper 2

Nils Grimsmo and Truls Amundsen Bjørklund

On the Size of Generalised Suffix Trees Extended with String ID Lists

Technical Report IDI-TR-2007-01, Norwegian University of Science and Technology, Trondheim, Norway, 2007.

Abstract  The document listing problem can be solved with linear preprocessing and optimal search time by using a generalised suffix tree, additional data structures and constant time range minimum queries. A simpler solution is to use a generalised suffix tree in which internal nodes are extended with a list of all string IDs seen in the subtree below the respective node. This report makes some remarks on the size of such a structure. For the case of a set of equal length strings, a bound of Θ(n√n) for the worst case space usage of such lists is given, where n is the total length of the strings.


1 Document listing problem

The occurrence listing problem is: given a string T ∈ Σ* of length n, which can be preprocessed, and a pattern P of length m, find all z occurrences of P in T. This classical problem can be solved with Θ(n) preprocessing and optimal O(m + z) time queries with a suffix tree [4] if |Σ| is constant. The document listing problem is: given a set of strings T = {T_1, ..., T_t} of total length n, which can be preprocessed, find all y strings in T which contain the pattern P. This can also be solved with Θ(n) preprocessing and O(m + y) time queries, by augmenting a generalised suffix tree with additional data and using constant time range minimum queries [5].

This report considers the complexity of a simpler solution, which does not have linear time preprocessing or linear space usage, but has been used to solve a problem in information retrieval [8] without considering the complexity.

2 Suffix tree definition

A suffix tree [4] for a string T of length n over the alphabet Σ is a compacted trie containing the first n suffixes of T$, where $ ∉ Σ is a unique terminator. Compaction here means that all internal non-branching nodes in the trie have been removed, joining adjacent edges. As $ is not seen in T, no suffix is a prefix of another, and all n suffixes are represented by a unique leaf node. Since all internal nodes are branching, their number is strictly upper bounded by n. If the edges in the tree are represented as pointers into T, the suffix tree can be represented in Θ(n) space. It can be constructed in Θ(n) time if the alphabet is constant [7, 4, 6] or integer [1].

Given a pattern P of length m, it can be decided whether it occurs in T by trying to follow its path downwards from the root in the suffix tree. The time complexity is O(m) if child lookup is Θ(1), which is trivially achieved if |Σ| is constant. The parent–child relationship is typically implemented with linked sibling lists. An expected lookup time of O(m) can be achieved with hashing [4]. Perfect hashing [2] can be used to get O(m) worst case lookup, at the cost of a longer construction time. After the position representing P in the tree has been located, all z hits can be extracted in Θ(z) time, as all leaf nodes in the subtree correspond to unique hits and all internal nodes in the subtree are branching.

3 Generalised suffix tree

A generalised suffix tree for a set of strings T = {T_1, ..., T_s} of total length n = n_1 + ... + n_s is a compacted trie containing, for each T_i, all the first n_i suffixes of T_i$_i, where $_i is a unique terminator string. The tree will have n leaf nodes, and can be constructed in Θ(n) time and space [3].


4 String ID list extension

A generalised suffix tree is an asymptotically optimal substring index for a set of strings from a constant alphabet. The cost of a query is linear in the size of the input and output. But in many cases, one does not want to extract the set of hit positions, but the set of strings in which the hits occur. In cases where substrings are repeated in the strings indexed, a string ID can be seen many times in a subtree, and the set of unique string IDs will be smaller than the set of hit positions. To avoid this overhead, each node can store a list of the string IDs seen in its subtree. This can speed up the search for strings with matches. A drawback is a non-linear space usage in the worst case, which is shown below.
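A sketch of how such lists can be computed, as a bottom-up pass over an already built generalised suffix tree; the minimal Node representation is hypothetical, not taken from [8]:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class Node {
    List<Node> children = new ArrayList<>();
    int stringId = -1;      // set on leaves: the string this suffix belongs to
    List<Integer> idList;   // filled in by the pass below

    /** Fills idList in every node with the sorted set of string IDs seen in
     *  its subtree. The total size of all these lists is the quantity
     *  bounded in Sections 5 and 6. */
    void computeIdLists() {
        TreeSet<Integer> ids = new TreeSet<>();
        if (children.isEmpty()) {
            ids.add(stringId);
        } else {
            for (Node c : children) {
                c.computeIdLists();
                ids.addAll(c.idList);   // duplicates collapse in the set
            }
        }
        idList = new ArrayList<>(ids);
    }
}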

5 A lower bound for worst case space usage

Here follows a proof of a lower bound of Ω(n√n) for the worst case space usage of suffix trees extended with string ID lists. This is shown through the Θ(n√n) space usage for a family of sets of strings.

Assume we have the set of strings T = {T_1, ..., T_t}, where the strings are the rotations of (1, ..., t), such that

    T_i = (σ(i), σ(i + 1), ..., σ(i + t − 1)),  where σ(j) = ((j − 1) mod t) + 1    (1)

The total length of the strings is n = t². Figure 1 shows a generalised suffix tree for this set of strings for t = 4.

Figure 1: Generalised suffix tree for the padded strings (1, 2, 3, 4, $_1), (2, 3, 4, 1, $_2), (3, 4, 1, 2, $_3) and (4, 1, 2, 3, $_4). The number of unique string IDs in each subtree is shown in the internal nodes.


Let U = (σ(1), ..., σ(2t − 1)). All strings in T are substrings of U; Figure 2 shows this for t = 4. Consider the substring U[p...q], where 1 ≤ p ≤ t and p ≤ q < p + t, which is equal to U[p + t ... q + t] if q < t. It is seen in T_i, where i ≤ p or q < i, which gives a total of p + (t − q) locations.

If p + (t − q) > 1, there will be an internal node representing the substring U[p...q], as it is followed by U[q + 1] in some string, and by an end of string symbol in the padded string T_{q−t+1}$_{q−t+1}. Hence the total number of string IDs stored in the lists will be

    |L| = Σ_{p=1}^{t} Σ_{q=p}^{p+t−2} (p + (t − q)) = Σ_{p=1}^{t} Σ_{q=0}^{t−2} (t − q) = t Σ_{k=2}^{t} k = (t³ + t² − 2t) / 2    (2)

Inserting t = √n, we get

    |L| = (n√n + n − 2√n) / 2 ∈ Θ(n√n)    (3)

1 2 3 4 1 2 3
1 2 3 4 $_1
  2 3 4 1 $_2
    3 4 1 2 $_3
      4 1 2 3 $_4

Figure 2: Showing how the strings in T are substrings of U for t = 4. For each T_i ∈ T, each substring of length less than t is seen in two places in U.

As the space usage for a normal generalised suffix tree is Θ(n), we get a total space usage of Θ(n√n) for a tree extended with string ID lists for this family of sets of strings, and a bound of Ω(n√n) for the worst case space usage of this data structure in general.
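As a quick sanity check of Equation 2 (not part of the original report), the double sum can be evaluated directly and compared against the closed form:

class LowerBoundCheck {
    public static void main(String[] args) {
        for (long t = 2; t <= 300; t++) {
            long sum = 0;
            for (long p = 1; p <= t; p++)
                for (long q = p; q <= p + t - 2; q++)
                    sum += p + (t - q);       // number of locations of U[p..q]
            long closedForm = (t * t * t + t * t - 2 * t) / 2;
            if (sum != closedForm)
                throw new AssertionError("mismatch at t = " + t);
        }
        System.out.println("Equation 2 verified for t = 2..300");
    }
}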

6 An upper bound for strings of equal length

Assume we have T = {T_1, ..., T_t}, with equal string lengths |T_i| = r. The total length of the strings is n = tr. Since there are fewer than n internal nodes in the tree, and each list contains at most t IDs, the total length of the lists is bounded as

    |L| < tn = t(tr) = t²r ∈ O(t²r)    (4)

Seen differently, the jth suffix of each string has length r − j + 1, so there can be at most r − j internal nodes above the leaf node representing the suffix, to whose string ID lists it can contribute. This gives the bound

    |L| ≤ t Σ_{j=1}^{r} (r − j) = t Σ_{k=0}^{r−1} k = t (r − 1)r / 2 ∈ O(tr²)    (5)

Together, these two bounds give

    |L| ∈ O(min(t²r, tr²))    (6)

which is maximal for r = t. This gives n = t² = r², and the bound

    |L| ∈ O(n√n)    (7)

Combining with Equation 3, we get a bound of Θ(n√n) for the worst case space usage for equal length strings.

References

[1] Martin Farach. Optimal suffix tree construction with large alphabets. In Proc. FOCS, 1997.

[2] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3), 1984.

[3] Daniel M. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

[4] Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2), 1976.

[5] S. Muthu Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. SODA, 2002.

[6] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(5), 1995.

[7] Peter Weiner. Linear pattern matching algorithms. In Proc. SWAT, 1973.

[8] Liang Zuopeng, Hu Kongfa, Ye Ning, and Dong Yisheng. An efficient index structure for XML based on generalized suffix tree. Information Systems, 32:283–294, 2007.



Paper 3

Nils Grimsmo, Truls Amundsen Bjørklund and Øystein Torbjørnsen

XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes

Proceedings of the Second International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2010)

Abstract  XML indexing and search have become an important topic, and twig joins are key building blocks in XML search systems. This paper describes a novel approach using a nested loop twig join algorithm, which combines several existing techniques to speed up evaluation of XML queries. We combine structural summaries, path indexing and prefix path partitioning to reduce the amount of data read by the join. This effect is amplified by only reading data for leaf query nodes, and inferring data for internal nodes from the structural summary. Skipping is used to speed up merges where query leaves have differing selectivity. Multiple access methods are implemented as materialized views instead of succinct secondary indexes for better locality. This redundancy is made affordable in terms of space by using compression in a back-end with columnar storage. We have implemented an experimental prototype, which shows a speedup of two orders of magnitude on XPath queries with value predicates, when compared to existing open source and commercial systems using a subset of the techniques. Space usage is also improved.


1 Introduction

XML has evolved into the dominant data format for exchange of structured and semi-structured data over the Internet. It may also be used for integration of data from disparate sources. When XML is generated from structured databases it is regular and has a known schema, while in other scenarios data is ad hoc and without any schema.

XPath [31] and XQuery [32] are languages used to query XML data. XPath is a simple declarative language that supports matching of structure and content. XQuery is a more expressive iterative language, but uses XPath as a fundamental building block. This paper focuses on performance for a subset of XPath called twig pattern matching, which covers a majority of queries used in practice [11].

An XPath query is a tree, where the relations between the query nodes specify the relations between the data nodes in a possible match, and value predicates specify the contents of attributes and text nodes in the XML. The example query /lib/book[author="Kant" and author="Gödel"] asks for all books coauthored by Kant and Gödel. A parent-child relation is denoted by "/", an ancestor-descendant relation by "//", and "[]" encloses a predicate. Figure 1 shows a document where this query would have a match.

<lib>
  <book>
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
    ...
  </book>
  ...
</lib>

Figure 1: Example XML document.

A typical system supporting XPath parses the XML and stores some information about each data node, usually its name, type and value, and its relation to other nodes. This can be stored in a table, and is usually sorted on document and node order for faster structural joins on the nodes in the query tree. When evaluating queries, it may be more advantageous to access nodes by name, value, natural order, or other fields. Multiple secondary indexes can be added for direct lookup on all the fields stored for a node.

To find all matches for the example query, many current systems would read six lists of nodes from indexes, looking up the four XML element nodes on name (lib, book, author and author), and the two text nodes on value ("Kant" and "Gödel"). These would then be joined to give full matches, using the structural information stored about each node. In a library with millions of books, there would be just as many book nodes, and even more author nodes to join.

The main contributions of this paper are: 1) a twig join algorithm combining previous techniques in a novel way; 2) implementation of an experimental prototype; 3) experiments showing two orders of magnitude speedup over other systems for queries with value predicates, and reduced space usage.

2 Background

Indexing and querying XML has been a major research area in both the database and information retrieval communities over the last ten years, resulting in many different approaches.

2.1 Schema Agnostic XML Indexing

An early approach to XML indexing, which is usable with simple schemas, is to translate the XML schema to a relational schema, and put shredded XML data into an RDBMS [28]. However, the XML schema can be complex, subject to updates, unknown, or non-existent. This flexibility of XML may be considered a strength.

In schema agnostic XML indexing, all element, attribute and text nodes are extracted, and stored with information about their type, name, value and position in the tree. During query evaluation a list of matching data nodes is read for each query node, and these are joined into complete matches for the query tree. To allow joining matches for individual query nodes into tree pattern matches, the positional information must allow deciding node relationships, such as ancestor-descendant and parent-child. The most well-known tree position encoding is the regional BEL encoding [37], which gives a node's begin and end position in the document, and its level (depth) in the tree.

Figure 2 shows an example using the BEL encoding on the data extracted for the XML document in Figure 1. The data is usually indexed on name for XML element nodes, and on value for text nodes. When querying, one list of nodes is typically read for each node in the query. Retrieving element nodes on name (tag) is called tag partitioning (or tag streaming [6]). To evaluate the XPath query //lib/book, the nodes u and v (for lib and book) satisfying u.begin < v.begin ∧ v.end < u.end ∧ u.level + 1 = v.level, in addition to the type and name requirements, must be found.
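To make the regional predicates concrete, the containment tests can be written as follows; a minimal Python sketch (the node values are taken from Figure 2, except lib's end position, which is not shown there and is assumed here):

from dataclasses import dataclass

@dataclass
class Node:
    # Regional BEL encoding: begin/end position in the document, and tree level.
    doc: int
    begin: int
    end: int
    level: int

def is_descendant(u, v):
    # v is a descendant of u iff v's region nests inside u's region.
    return u.doc == v.doc and u.begin < v.begin and v.end < u.end

def is_child(u, v):
    # Parent-child additionally requires adjacent levels.
    return is_descendant(u, v) and u.level + 1 == v.level

lib = Node(1, 1, 20, 1)    # end = 20 is an assumed value
book = Node(1, 2, 12, 2)
assert is_child(lib, book)  # //lib/book matches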

2.2 Tree-aware Joins

If the node lists are sorted on doc and begin values, a linear merge can be used when the schema and query are simple. But in other cases, matches may be formed from the lists out of order. Consider the document in Figure 3, and the query //a[c and d]. A linear merge of the lists would miss one of the two pairs of c and d nodes usable together.

Using a full cross join on the lists of data nodes matching each query node, and checking the output for legal matches, is not feasible for large data, as it can give intermediate results exponential in the size of the query, even when the final answer is small.

The first specialized tree joins focused on splitting the query tree into binary relationships and stitching the results into complete matches. Specialized loop joins gave optimal O(I + O) complexity for ancestor-descendant relationships [37], where I is the size of the input lists, and O is the size of the output.


doc  begin  end  level  type   name     value
1    1      ...  1      Elem.  lib
1    2      12   2      Elem.  book
1    3      5    3      Elem.  title
1    4      4    4      Text            "Kritik..."
1    6      8    3      Elem.  author
1    7      7    4      Text            "Kant"
1    9      11   3      Elem.  author
1    10     10   4      Text            "Gödel"
1    13     ...  2      Elem.  article
...  ...    ...  ...    ...    ...      ...

Figure 2: Data extracted for tag partitioning.

[Figure 3: An XML document on which a linear merge breaks for the query //a[c and d]; the example document's markup did not survive extraction.]

When joining an ancestor and a descendant list, a "marker" is left in the descendant list at matching positions, and the descendant list is rewound to this position when the ancestor list is forwarded.

Later stack-based joins gave O(I + O) complexity also for parent-child relationships [2]. These maintained a stack of nested ancestor candidates in a single traversal of the two lists. Any current descendant candidate would be a descendant of all the nodes on the stack, and possibly the child of the top node.
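The idea behind such stack-based joins can be sketched compactly in Python; this illustrates the principle, not the exact published algorithm [2] (nodes are (begin, end) pairs from a single document, both lists sorted on begin):

def stack_join(ancs, descs):
    # Emit all (a, d) pairs where a's region nests d's, in O(I + O) time.
    out, stack, i = [], [], 0
    for d in descs:
        # Push every ancestor candidate starting before d, keeping a chain
        # of nested regions on the stack.
        while i < len(ancs) and ancs[i][0] < d[0]:
            while stack and stack[-1][1] < ancs[i][0]:
                stack.pop()              # region closed before this candidate
            stack.append(ancs[i])
            i += 1
        while stack and stack[-1][1] < d[0]:
            stack.pop()                  # region closed before d
        out.extend((a, d) for a in stack)  # every region left on the stack nests d
    return out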

As evaluating the query node relationships separately still gave useless intermediate results of exponential size, multi-way path and twig join algorithms were introduced [4]. By maintaining multiple stacks, these achieved optimal O(I + O) complexity using only O(d) memory, for path queries and for twig queries using only the descendant axis. Later algorithms, which break the O(d) memory bound, achieve optimal complexity for all combinations of ancestor-descendant and parent-child edges [5].

2.3 Skipping Tree Joins

When the lists of matches for the different query nodes have similar size, regular tag partitioning is a fairly efficient solution, but that is often not the case. Take for example the query //book[author="Kant"], evaluated on a library with millions of books. The leaf predicate probably has good selectivity, while the other nodes do not.


Skipping can be used to improve efficiency in such cases. Consider the query //a//d, and the problem of forwarding the lists for the two query nodes to the first position where a match for a is an ancestor of a match for d. Skipping forward in the list for d to find the first d_j which can be a descendant of the current node a_i for a is trivially done by finding the first d_j with d_j.begin > a_i.begin. If d_k is the last d node with d_k.end < a_i.end, then all d descendants of a_i (if any) are stored contiguously from d_j to d_k.

However, this technique cannot be used to skip in the ancestor list, as any node with a lower begin value is a candidate, and the actual ancestors can be spread evenly among a large number of such candidates. Specialized data structures can be used to skip efficiently in both ancestor and descendant lists [33].
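For the descendant side, the forward skip described above is just a binary search on the begin values; a minimal Python sketch (illustrative names):

import bisect

def skip_descendants(d_begins, pos, a_begin):
    # Forward the d cursor to the first d_j that can be a descendant of the
    # current a_i, i.e. the first entry with begin > a_i.begin. The cursor
    # never moves backwards.
    return max(pos, bisect.bisect_right(d_begins, a_begin))

No such one-sided search exists for the ancestor list, which is exactly why structures like the XR-tree are needed there.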

2.4 Adding Structural Summaries

A different way of avoiding many useless nodes in the query node match lists is to do some partial matching in a preprocessing step. In the approach called path indexing, some of the structure of the data is extracted and put in a path summary [10], which is a tree containing all label-unique root-to-leaf paths seen in the data. This is used in a query preprocessing step for partially identifying structural matches for the query.

Figure 4 shows the structural data extracted from the example in Figure 1. Note that each path seen in the data (e.g. /lib/book/author) is only added once to the summary. The data stored for each document node is then linked to nodes in the summary in some way. One approach is shown in Figure 5. Summary tree nodes are given unique integer path IDs, which are referenced by the indexed data nodes. Here the Dewey encoding [30] is used to enumerate data node positions.

[Figure 4: Path summary tree for the document in Figure 1; only the root's label ("root: 0") survived extraction. From Figure 5, the path IDs are: root 0, /lib 1, /lib/article 2, /lib/book 3, /lib/book/author 4, /lib/book/author/text() 5, /lib/book/title 6, /lib/book/title/text() 7.]
Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins <strong>and</strong> Virtual Nodes<br />

doc  Dewey    pathID  value
1    1        1
1    1.1      3
1    1.1.2    6
1    1.1.2.1  7       "Kritik..."
1    1.1.3    4
1    1.1.3.1  5       "Kant"
1    1.1.4    4
1    1.1.4.1  5       "Gödel"
1    1.2      2
...  ...      ...     ...

Figure 5: Data extracted for prefix path partitioning.

<strong>and</strong> leaves in the query. Lists <strong>of</strong> data nodes matching the related summary nodes can the<br />

be read. This is called prefix path partitioning (or prefix path streaming [6], as opposed<br />

to tag streaming). It usually results in reading less data, as the set <strong>of</strong> matching paths is<br />

<strong>of</strong>ten much more selective than the node name.<br />

2.5 Virtual Nodes

The Dewey encoding, which is used in Figure 5, has an advantage over the BEL encoding when combined with structural summaries, because it allows ancestor reconstruction. Only data node lists for branching and leaf query nodes need to be read, because the matching of non-branching internal nodes is implied by the structural matching of the prefix paths of the checked nodes.

This approach is taken one step further by generating virtual nodes for the internal query nodes from the lists of leaf node matches [36]. Given the Dewey and pathID of a data node matching a leaf query node, the Deweys and pathIDs of all data nodes above it can be inferred. The Dewey of an ancestor at depth d is a length-d prefix of the Dewey of any descendant of the node, and the pathID can be found by going up to depth d in the summary tree.
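A minimal Python sketch of this reconstruction, assuming the summary is available as parent and depth mappings indexed by path ID (illustrative names, not the paper's API):

def virtual_ancestor(dewey, path_id, d, parent, depth):
    # Infer the Dewey and pathID of the depth-d ancestor of a leaf match,
    # without reading any physical list for internal query nodes.
    while depth[path_id] > d:      # walk up the summary tree to depth d
        path_id = parent[path_id]
    return dewey[:d], path_id      # length-d prefix of the leaf's Dewey

# With the summary of Figures 4 and 5, the depth-2 ancestor of "Kant"
# (Dewey (1,1,3,1), pathID 5) is the book node (Dewey (1,1), pathID 3):
parent = {1: 0, 2: 1, 3: 1, 4: 3, 5: 4, 6: 3, 7: 6}
depth = {0: 0, 1: 1, 2: 2, 3: 2, 4: 3, 5: 4, 6: 3, 7: 4}
assert virtual_ancestor((1, 1, 3, 1), 5, 2, parent, depth) == ((1, 1), 3)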

Using Dewey to enumerate nodes requires non-linear space, and shows poor performance for some degenerate cases, but it is commonly used in practice, for example in Microsoft SQL Server [22]. Space usage can be improved by using various compression schemes [13], and updatability can be attained [21].

2.6 Column Storage

A trend in database systems research over the last decade has been investigating how performance can be improved for read-intensive workloads. An important contribution in this field is column store databases [29], where each column in a table is stored separately. MonetDB/XQuery [20] is a well-known XPath/XQuery system using columnar storage.


Advantages of column stores include reading less data when not all columns in a table are needed for a query, better cache behavior when block processing is used within a column, and better compression, since techniques such as run-length encoding (RLE) are easy to apply [14, 1]. And as tuples can be materialized late, there can be less computational work, especially if block processing is used [1].

Using compression can give benefits beyond the obvious reduced space usage. Compression of inverted lists is commonly done in regular search engines to reduce disk I/O [35], and using compression can reduce the memory bus bottleneck in database systems [23].

3 The XLeaf System

XLeaf is a novel combination of many previous techniques for evaluating twig queries. It combines structural summaries, prefix path partitioning, virtual node lists for internal query nodes, skipping joins, multiple access methods, compression and column storage. Like most other academic XML search prototypes, our system supports the descendant and child axes, and simple value predicates.

The main difference between our approach and the majority of work on twig joins is that we use a nested loop join, where the size of the state is linear in the size of the query. Most approaches read the input lists once (streaming), and maintain larger intermediate results.

3.1 Query Evaluation

Algorithm 1 gives an overview of the process of evaluating a query in our system. First the query is matched against the summary, as described in Section 3.2. Then an access method for candidate matches for each leaf query node is chosen, as described in Section 3.3.

For queries with a single return node, as in XPath, it is also often possible to avoid looping, and use a simple linear join. Such a join is correct when the depth of a branching node is fixed, as the query will not have out-of-order matches. The simple linear join is described in Section 3.4, and the general looping join in Section 3.5.

Note that in the following, only output, branching, and leaf query nodes need to be considered. For internal query nodes with one child, matching is implied by the matching of the root-to-node paths of the nearest branching and leaf node descendants in the query tree. On the other hand, for a parent query node with multiple children, it is essential that all of these can use the same node for the parent in the data to construct a legal match.

3.2 Summary matching

Our summary is indexed using an inverted index over the root-to-node paths in the summary tree, similar to what is described in [12]. Paths in the document tree are viewed as strings of node names, where the names are dictionary coded and stored as integers.


Algorithm 1 Overall query evaluation
 1: procedure Evaluate(Q)
 2:   SummaryMatch(Q)
 3:   for l_q ∈ Q.leaves
 4:     ChooseView(l_q)
 5:   if no matches possible
 6:     return
 7:   if ∀b_q ∈ Q.branching : b_q.minDepth = b_q.maxDepth
 8:     LinearJoin(Q)
 9:   else
10:     LoopingJoin(Q)

The first step of the summary matching is finding the matches for the individual root-to-leaf paths in the query. The most selective element name in each path is looked up in the inverted index, and a list of candidate paths is retrieved. This list is then matched against the pattern.

For each query node, the matching path IDs are saved, along with their respective possible candidates for the parent branching node. This is more robust than storing all legal combinations of matches for the query nodes above, as their total number can be exponential in the size of the query. This typically happens with deeply recursive schemas.

After finding the individual root-to-leaf path matches, the query tree is traversed bottom-up and then top-down to prune away node matches which can never be part of a complete match.

3.3 Data access methods

It is common in XML indexing systems to index element nodes on name and text nodes on value. For prefix path partitioning, nodes must also be accessible on path. In our system we use multiple access methods for all node types, and attempt to choose the most efficient one for each query leaf node during twig evaluation. Lists for internal nodes are virtual (see Sections 3.4 and 3.5).

One approach for implementing multiple access methods is storing a table with all nodes in the data, and having multiple indexes point into this storage. But if reading matches from such an index means following pointers into a node table, this will give poor spatial locality. It also makes it inefficient to use compression schemes with expensive random lookups. To avoid this, we have used multiple materialized views of the data table, with different sort orders. The views we create are shown in Figure 6, where the leading fields give the sort order.

The base table is built while scanning the data, while the other tables are sorted using a stable ternary quicksort. As the sorting is stable, and the source is sorted on the first two columns, the target table is in order after sorting on only the first target column.


t_base  (doc, dewey, nameID, pathID, value)
t_value (value, doc, dewey, nameID, pathID)
t_path  (pathID, doc, dewey, value)
t_name  (nameID, doc, dewey, pathID, value)

Figure 6: Materialized views used.

The mappings for nameID and pathID are stored in tables, which are read into separate in-memory data structures for fast lookup. Note that the field nameID also indicates the type of the node (element, attribute, text, etc.), and that for text nodes, the name of the parent element node is stored. In cases where there are multiple pathID matches for a leaf in a query, the hits have to be merged when read from the table t_path. In many cases, it is cheaper to scan the t_name matches, and filter out non-matching nodes on pathID.

3.4 Simple Linear Join

Assume that, as in XPath, the query has one output node, and that there is one list of data node matches for each leaf node q in the query, which can be read with Read(q) and moved one data node forward with a call to Advance(q).

The linear join shown in Algorithm 2 is used when, for each branching query node, there is a fixed tree depth for the matching summary nodes. The correctness of using a linear join in this case follows directly from the proofs for TwigStack [4]. There, matches for root-to-leaf paths are output from the first step of their join algorithm sorted root-to-leaf on the document order of the matched nodes, and fed to a linear merge, which is the second and last step. Since the depths of all branching nodes are fixed, the lists for the leaf node matches will already be in the correct order, from the fact that the data is a tree.

The reason no pathID ever needs to be read in Algorithm 2 is that the incoming leaf lists only contain structural matches for the root-to-leaf paths, with the branching node matches above fixed in depth. A simple comparison of Deweys determines a common branching node in the data.
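In code, this final check is just a prefix comparison; a small Python sketch (illustrative names):

def same_branching_node(leaf_deweys, m):
    # With the branching node's depth fixed at m, all leaf matches share a
    # data node for it iff their Deweys agree on the first m components.
    prefix = leaf_deweys[0][:m]
    return all(d[:m] == prefix for d in leaf_deweys)

# "Kant" and "Gödel" in Figure 5, with the branching node book at depth 2:
assert same_branching_node([(1, 1, 3, 1), (1, 1, 4, 1)], 2)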

3.5 Loop Join

For cases where a simple linear join cannot guarantee a correct result, a nested loop join is used. There are two main modifications from the simple linear join. Markers are left in the lists, to which the list cursors are rewound when necessary. And for a given candidate alignment of the leaf lists, it must be checked whether it is possible to choose candidates for the branching query nodes which are usable for all the leaf matches.

In previous loop join approaches [4], where there are explicit lists for internal query nodes, child list markers are updated when the parent's list cursor is forwarded, but this method cannot be used directly in our approach.


Algorithm 2 Simple linear join
 1: procedure LinearJoin(Q)
 2:   q_s := Q.selectingNode                 ⊲ Assume a leaf
 3:   while SimpleNextMatch(Q.root)
 4:     Output(q_s)
 5:     Advance(q_s)

⊲ Note: non-branching, non-leaf, non-selecting nodes are ignored.
 6: procedure SimpleNextMatch(q)
 7:   if IsLeaf(q)
 8:     q.d := Read(q).dewey
 9:     return q.d
10:   m := q.minDepth                        ⊲ Equals q.maxDepth
11:   x := max_{q_c ∈ q.children} q_c.d
12:   while true
13:     for q_c ∈ q.children
14:       Align(q_c, x, m)
15:       x_c := SimpleNextMatch(q_c)
16:       if x_c = ∅ return ∅
17:       x := max(x_c, x)
18:     if no change since last step
19:       q.d := (x_1, ..., x_m)
20:       return q.d

21: procedure Align(q, x, m)
22:   while Read(q).dewey < (x_1, ..., x_m)  ⊲ Lexicographic
23:     if not Advance(q) return

First we advance the list cursors to alignment based on the Deweys matched down to the minimal possible depth of the branching nodes above. This depth is the maximum over the leaf matches of the minimal depth of a candidate for the branching node (max-of-the-min) from the summary matching phase. Then the list positions are marked, before we iterate through the possible alignments. When list number i in the order is forwarded, list i + 1 is rewound to the mark, before it is aligned based on Dewey, and the new mark is saved if it differs from the previous one. To speed up the iteration, the lists for all leaf query nodes are ordered on the expected number of hits.

The procedure depicted in Algorithm 3 checks whether a given alignment is a match. First, the query is traversed bottom-up, and common candidates for the branching nodes are chosen. Then it is traversed top-down, choosing the uppermost common candidate for each branching node. Finally, it is checked that the chosen branching nodes are the same physical nodes, by comparing the Deweys. Note that the top-down pass and the final bottom-up pass can be completed in one top-down traversal, as the Dewey for an internal node need not be materialized, but is given from the Dewey of any leaf descendant and the chosen depth.

Even though bit-vectors are used to implement the sets of candidates, using Algorithm 3 is expensive. For cases where many alignments are mismatches, it is cheaper to check the Deweys to the max-of-the-min depth first. This gives false positives, and matches must then be checked with Algorithm 3. The max-of-the-min used here can be calculated from the branching nodes usable for the current leaf node matches, instead of from all branching node candidates from the summary matching.

In cases where the output node is not a leaf value node, but matches XML element nodes, care must be taken to iterate through all candidates, and not to choose duplicates (line 20 in Algorithm 3). Also, to output entire data subtrees as matches, the table t_base (see Section 3.3) must be scanned or searched during query processing to retrieve all nodes which have the given Dewey as a prefix.

3.6 Skipping

Skipping is crucial for efficient merges, and is included in Algorithm 2 by modifying the procedure Align to use underlying data structures to skip forward to the next item matching the parameter x. Skipping is implemented in our system by combining a "one level skip list" [35] and search. Every k-th value is extracted from the columns on which merges will be performed (doc and dewey), and put in a separate column. k is chosen to be a power of two (usually 32 in our system) to allow for arithmetic using shifts. Note that this column can also double as checkpoints in compression (see Section 3.7). When forwarding lists to values, the first few values in the smaller column are checked, and if no match is found, a binary search is performed there. Then a segment of the full column is scanned linearly.
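A Python sketch of this two-level forward seek (assuming k = 32 and a sorted column; for brevity, the initial linear probe of the sparse column is replaced by a direct binary search):

import bisect

K = 32  # every K-th key is copied to the sparse skip column

def build_skip(column):
    return column[::K]

def forward_to(column, skip, pos, target):
    # Move a cursor forward to the first entry >= target: first locate the
    # last sparse key strictly below the target ...
    block = bisect.bisect_left(skip, target) - 1
    i = max(pos, block * K)        # everything before this point is < target
    # ... then scan the remaining segment of the full column linearly.
    while i < len(column) and column[i] < target:
        i += 1
    return i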

3.7 Storage Back-end

The first iteration of this prototype used MonetDB/SQL [20] as a back-end for storing the data. This was changed to our own column store back-end because the overhead of using the communication interface to the server back-end was significant, and because of the lack of compression in the open source version of MonetDB.

Our minimalistic implementation uses memory mapped files to write to and read from disk, and uses compression to reduce the space usage. A column adapter chooses which compression method to use based on statistics from the data. Different compression schemes are favorable depending on whether the columns are self-ordered, and whether they have few or many distinct values.

Supported storage methods for integer types are raw, bit packing, run-length encoding (RLE), delta encoding and dictionary encoding. The last two are more expensive, and are used only in the maximum compression variant in the experiments (see Section 4). Typically RLE or delta coding is chosen for (partially) sorted columns, and bit packing or dictionary encoding for unsorted columns. The delta encoding is done using VByte [27], with checkpoints for faster random access. The dictionary encoded column sorts values on frequency, and uses VByte to store the coded values.


Algorithm 3 Checking leaf alignment
⊲ Note: non-branching, non-leaf, non-selecting nodes are ignored.
 1: procedure CheckLeafAlignment(Q)
 2:   q_t := Q.topBranchingNode
 3:   Candidates(q_t)                        ⊲ Bottom-up
 4:   if q_t.C = ∅
 5:     return false
 6:   ChooseUppermost(q_t)                   ⊲ Top-down
 7:   if not CheckMatch(q_t)                 ⊲ Bottom-up
 8:     return false

 9: procedure Candidates(q)
10:   if IsLeaf(q)
11:     q.p := Read(q).pathID
12:     return q.C := {q.p}
13:   else
14:     return q.C := ⋂_{q_i ∈ Children(q)} ParentCand(q_i)

15: procedure ParentCand(q)
16:   return ⋂_{c_i ∈ Candidates(q)} q.parMatchCand[c_i]

17: procedure ChooseUppermost(q)
18:   if IsLeaf(q)
19:     return
20:   q.p := arg min_{c_i ∈ q.C} c_i.depth
21:   for q_i ∈ q.children
22:     q_i.C := {c_i | c_i ∈ q_i.C ∧ q.p ∈ q.parMatchCand[c_i]}
23:     ChooseUppermost(q_i)

24: procedure CheckMatch(q)
25:   if IsLeaf(q)
26:     q.d := Read(q).dewey
27:     return true
28:   for q_i ∈ q.children
29:     if not CheckMatch(q_i)
30:       return false
31:   x := q.children[1].d
32:   m := q.p.depth
33:   q.d := (x_1, ..., x_m)
34:   for q_i ∈ q.children
35:     if LCP(q.d, q_i.d) < m
36:       return false
37:   return true

Many column types are stored as multiple physical columns. For example, an RLE column is stored as separate "values" and "cumulative count" columns.

Strings are stored raw, or using a dictionary. For the raw strings, a column of pointers is used for random access. Dictionaries are shared between the different materialized views. Logical columns consisting of several physical columns are compressed recursively if this is considered favorable.

In the current prototype, no explicit indexing is done, and a simple binary search is used. In the following experiments, all queries are run warm to simplify the experimental setup. The poor spatial locality of the binary search is then not much of an issue, but for later experiments, running large batches of queries over more data, indexing should be considered.

However, some of the column types have specialized search implementations. For RLE, the "values" column is searched, and the "cumulative count" column is used to translate to row numbers. For the delta encoding, a search in the checkpoints is done first, before decoding the final block of coded values. For the dictionary encoding, the dictionary is searched first, before searching for the coded value in the data column.
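For the RLE case, such a search might look as follows in Python (a sketch for a sorted column; illustrative names):

import bisect

def rle_encode(column):
    # Encode a column as separate "values" and "cumulative count" columns.
    values, cum = [], []
    for v in column:
        if values and values[-1] == v:
            cum[-1] += 1
        else:
            values.append(v)
            cum.append((cum[-1] if cum else 0) + 1)
    return values, cum

def first_row_of(values, cum, v):
    # Search the "values" column, then translate to a row number using the
    # "cumulative count" column.
    i = bisect.bisect_left(values, v)
    if i == len(values) or values[i] != v:
        return None
    return 0 if i == 0 else cum[i - 1]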

A note should be given on the encoding of Deweys. A variable number of bytes is used per element in the Dewey, and the first bits of an element number give the number of bytes in the element. This allows comparing two Deweys with a simple string comparison to find which is smaller, which is useful when merging lists of nodes. To check whether two Deweys match to a given depth in the document tree, however, the codes must be parsed when compared.
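The paper does not give the exact bit layout; the following Python sketch shows one such order-preserving scheme, where the top two bits of the first byte give the element's byte length (an assumed layout):

def encode_element(v):
    # Minimal-length encoding with the byte count tagged in the top two bits,
    # so that byte-wise comparison of encodings orders elements numerically.
    for extra in range(4):               # 1 to 4 bytes, 6 + 8*extra payload bits
        if v < 1 << (6 + 8 * extra):
            payload = v.to_bytes(extra + 1, "big")
            return bytes([(extra << 6) | payload[0]]) + payload[1:]
    raise ValueError("element too large for this sketch")

def encode_dewey(dewey):
    return b"".join(encode_element(v) for v in dewey)

# Byte-wise (string) comparison agrees with document order:
assert encode_dewey((1, 2)) < encode_dewey((1, 2, 1)) < encode_dewey((1, 3))
assert encode_dewey((1, 63)) < encode_dewey((1, 64))  # 1-byte vs 2-byte element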

4 Experimental Results

This section presents results for query performance evaluation of various XPath implementations. Experiments were performed on an AMD Athlon 64 Processor 3500+ at 2.2 GHz, with 4 GB main memory, running Linux 2.6.28.

The open source MonetDB/XQuery (MoXQ) [3] and an anonymous commercial system (SysA) were tested. They were included because they feature some, but not all, of the techniques from Section 3.1. SysA uses path indexing and multiple access methods, but reads lists for internal query nodes. MoXQ uses tag partitioning with no skipping, and has a column store back-end. The release from August 2008 was used. We have also tested the systems Berkeley DB XML, X-Hive and eXist, but the results are omitted, as these systems had poorer performance than MoXQ and SysA, and the results did not yield further insights.

For our own prototype we have included performance results for the compression schemes none (XLeaf_none), lightweight (XLeaf_light) and maximum (XLeaf_max). The latter includes delta and dictionary encoding for integer types.

It should be noted that a comparison of a minimalistic prototype with full-fledged systems with features like transaction management is not fair. Ideally, algorithms should be compared implemented in the same framework, but this was not done in these preliminary experiments, due to time constraints.


Some queries have been rewritten to counting queries to avoid measuring the overhead of printing answers, and because outputting hundreds of thousands of results is not a probable user scenario. All queries were given 2 warm-up runs, and were executed 10 times.

4.1 Indexing Performance and Space Usage

Figure 7 shows the indexing performance of the tested methods on the DBLP corpus [17] and the artificial benchmark XMark [26]. The table shows megabytes indexed per second, and the size of the index divided by the size of the original unparsed data.

                  MoXQ   SysA   XLeaf_none  XLeaf_light  XLeaf_max
DBLP (440 MB)
  MB/s            1.92   0.27   0.43        0.45         0.30
  Space           2.6    14.1   5.0         1.84         1.59
XMark (1118 MB)
  MB/s            8.8    0.45   1.39        1.41         1.06
  Space           1.78   10.8   4.2         1.32         1.20

Figure 7: Indexing performance.

MoXQ has the fastest indexing, and is fairly space efficient. The increased space requirements for adding the additional access methods in SysA may make it unusable in some scenarios. The space usage for our uncompressed variant (XLeaf_none) may also be too high, but the two compressed variants are very space efficient, and use less space than MoXQ, even though they feature multiple access methods like SysA, and use the Dewey encoding, which has more redundancy than the BEL encoding used in MoXQ. Note that building the lightweight compressed index XLeaf_light is faster than building the uncompressed index, while the heavily compressed variant is much slower, but only reduces storage requirements by 10–15%.

4.2 DBLP Queries

Figure 8 shows some queries on DBLP taken from [25], and Figure 9 gives the running times.

#   Query                                                                       Hits
P1  //inproceedings[author="Jim Gray"][year="1990"]/@key                        6
P2  //www[editor]/url/text()                                                    5
P3  //book/author[text()="C. J. Date"]/text()                                   13
P4  //inproceedings[title/text()="Semantic Analysis Patterns."]/author/text()   2

Figure 8: DBLP queries.


              P1     P2     P3     P4
MoXQ          1815   131    257    1151
SysA          972    2.5    0.53   7.2
XLeaf_none    0.126  0.064  0.039  0.058
XLeaf_light   0.130  0.064  0.044  0.058
XLeaf_max     0.42   0.123  0.044  0.067

Figure 9: Query performance. Run-times in milliseconds.

The results for MoXQ are worse than one might expect, especially on P1 and P4. This is because the data has a very flat and repetitive structure, with many node matches for //inproceedings. SysA has better performance, due to the use of path indexing. But it is still slow on P1, probably because merging the three branches is expensive. In P2, the total number of matches for //www is low.

Our implementations perform very well in this experiment, because only data for leaf nodes is read, and skipping is applied in the join. The differences between XLeaf_none and XLeaf_light are almost negligible. We expected the lightweight compression to give better performance due to lower data bandwidth requirements. One reason this is not the case may be that queries are run hot, which reduces the bandwidth bottleneck due to caching effects, in combination with slightly more expensive computation for XLeaf_light. XLeaf_max has worse performance, due to even more computation, and perhaps worse memory access patterns, because of dictionary compression for integer types.

#   Query                                                                                  Hits
A1  count(/site/closed_auctions/closed_auction/annotation/description/text/keyword)        40726
A2  count(//closed_auction//keyword)                                                       124843
A3  count(/site/closed_auctions/closed_auction//keyword)                                   124843
A4  count(/site/closed_auctions/closed_auction[annotation/description/text/keyword]/date)  26570
A5  count(/site/closed_auctions/closed_auction[descendant::keyword]/date)                  53342
A6  count(/site/people/person[profile/gender and profile/age]/name)                        32242
V1  ... keyword[text()=" preventions "] ...                                                55
V2  ... keyword[text()=" preventions "] ...                                                145
V3  ... keyword[text()=" preventions "] ...                                                145
V4  ... keyword[text()=" preventions "] ... date[text()="06/27/1998"] ...                  11
V5  ... keyword[text()=" tempests "] ... date[text()="04/18/1999"] ...                     12
V6  ... gender[text()="male"] ... age[text()="18"] ... name[text()="Mehrdad Takano"] ...   19

Figure 10: XMark queries from XPathMark. Queries V1–V6 are equal to queries A1–A6 with added value predicates.

4.3 XPathMark A Queries

Queries A1–A6 on the XMark data are from XPathMark [9, 8], an XPath equivalent of the XQuery benchmark XMark. XMark is artificially generated, and has properties rather different from DBLP, which has a flat and repetitive structure. XMark is a deeper tree, following a recursive schema.


              A1   A2   A3   A4   A5    A6   V1     V2    V3    V4    V5     V6
MoXQ          214  85   110  263  156   348  272    249   262   323   294    456
SysA          3.9  352  348  178  2128  446  11.8   398   371   30    46     97
XLeaf_none    9.1  72   73   56   153   181  0.160  0.27  0.29  0.32  0.144  4.3
XLeaf_light   8.5  94   94   57   180   196  0.24   0.38  0.28  0.58  0.20   4.6
XLeaf_max     9.6  93   94   166  190   708  0.21   0.32  0.34  3.6   0.70   24

Figure 11: Query performance. Run-times in milliseconds.

MoXQ and SysA seem to have about equal performance, with the former in the lead. In these tests, MoXQ performs relatively much better than on DBLP, because value predicates are not used, and the data tree is differently shaped. The numbers of matches for nodes near the root of the query are usually low, and the query leaf matches make up the majority. This gives a much cheaper join relative to the total number of matches.

Note the performance of MoXQ and XLeaf_light (or SysA) on queries A1 and A2, which highlights a key difference between the two systems. MoXQ is twice as fast on A2 as on A1, even though it has three times as many hits. On A1 it must merge matches for seven internal nodes, but on A2 only two of these are involved. XLeaf_light, on the other hand, looks up the first query on pathID, and all nodes are matches. The second query is looked up on nameID and filtered on pathID, as there are multiple matches for the latter. Note that the performance on A3 is the same as on A2, as the executions are identical on our system. The fact that MoXQ is almost as fast as XLeaf_none on A2, even though it reads more data and does more work, indicates a better implementation.

XLeaf_light is faster on A4 than on A3, even though the former is branching. The reason is that the leaves can be looked up directly on pathID. Hits are found efficiently using skipping. A5 is three times as slow, even though there are only two times as many hits. This is because the predicate leaf on keyword has more matches. A6 is also much slower than A4, even though the number of hits is similar, again because the individual query leaves have more matches.

Comparing SysA and XLeaf_light on queries A2 and A3 may show the benefit of using a column store back-end. SysA also looks up nodes on path in these queries, but is more than three times slower. Note that for the branching queries A4–A6, its disadvantage of reading matches for branching nodes does not show as much as on the DBLP queries, as the data is deeper and more "tree shaped", with more matches for the query leaves than for the branching nodes.

4.4 XPathMark A Queries, with Value Predicates

Queries V1–V6 are the same as A1–A6, but with added value predicates. These were chosen to give as many hits as possible.

MoXQ is slightly slower with value predicates, probably because of the extra text nodes to be read and merged. SysA is also slower on the first three queries, but faster on the last three, because looking up the leaves on value is much more selective. There are, however, still many matches for the branching node, giving an unnecessarily expensive join. A comparison of SysA and XLeaf_light, both using path indexing, shows the benefits of our system's features. Our implementation avoids handling the large number of matches for the internal query nodes, and is 20–200 times faster.

Query V6 is more expensive than the others for XLeaf_light, because the result for one of the leaves is large, and the merge is more expensive, even when skipping is used and the simple linear merge is allowed.

#   Query                                                                   Hits
S1  /dblp/inproceedings[@key="conf/3dica/RohalyH00"]/booktitle/text()       1
S2  //inproceedings[@key="conf/3dica/RohalyH00"]/booktitle/text()           1
S3  //*[@key="conf/3dica/RohalyH00"]/booktitle/text()                       1
S4  //*[@*="conf/3dica/RohalyH00"]/booktitle/text()                         1
B0  /dblp//author[text()="Michael Stonebraker"]/text()                      215
B1  /dblp/*/author[text()="Michael Stonebraker"]/text()                     215
B2  /dblp/*[author/text()="Michael Stonebraker"]/title/text()               215
B3  /dblp/*[author/text()="Michael Stonebraker"]
        [author/text()="Hector Garcia-Molina"]/title/text()                 4
B4  /dblp/*[author="Michael Stonebraker"][author="Hector Garcia-Molina"]
        [@key="journals/corr/cs-DB-0310006"]/title/text()                   1
B5  /dblp/*[author="Michael Stonebraker"][author="Hector Garcia-Molina"]
        [@key="journals/corr/cs-DB-0310006"][year > 1950]/title/text()      1

Figure 12: Queries with decreasing path specification, and queries with increasing branching complexity, on DBLP.

4.5 Queries with Decreasing Path Specification

Queries S1–S4 in Figure 12 are queries on DBLP with decreasingly specified paths, using an increasing number of wildcards.

              S1     S2     S3     S4     B0     B1     B2      B3      B4      B5
MoXQ          1549   1387   4553   4694   1464   2609   2704    2940    2948    3029
SysA          160    289    21911  21932  1450   1247   146273  146353  149293  19127
XLeaf_none    0.054  0.054  0.59   0.93   0.084  0.085  0.63    0.25    0.159   0.20
XLeaf_light   0.055  0.054  0.50   0.96   0.086  0.088  1.00    0.26    0.156   0.192
XLeaf_max     0.060  0.062  0.51   0.98   0.114  0.104  2.2     0.78    0.21    0.27

Figure 13: Query performance, with run-times in milliseconds.

This test shows that our system variants are more robust for partially specified paths. With MoXQ and SysA, the query cost increases greatly from query S2 to S3, because a larger number of nodes become candidates for the branching node. Query S4 is a degenerate query which would probably never be seen in a real system, but it can be argued that S3 is not unrealistic. Our implementation is very fast on all queries. One leaf is consistently looked up on value, while the other is looked up on pathID for S1 and S2, and on nameID for S3 and S4, which is the reason the last two queries are slower.


4.6 DBLP Queries with Increasing Branching Complexity

Queries B0–B5 in Figure 12 have increasing branching complexity; Figure 13 shows their performance results.

Comparing B0 and B1, the main cost for MoXQ seems to be merging with the list for /dblp/*, which is large. Additional branches give no significant extra cost. Note that this node is critical for the semantics of queries B2–B5. The poor results for SysA on B2–B5 are due to a "performance bug": the system sometimes chooses to use nested loop lookups, even though other approaches would be more efficient.

Our implementation is also affected by the added branches in the queries, with some interesting effects. B3 is faster than B2, because the join algorithm takes advantage of the two author lists being more selective. The common result is smaller, and more can be skipped when joining with the title. A similar argument holds for B4 vs B3, but for B5, the added predicate on year has low selectivity, and only adds to the cost.

5 Related Work

Loop joins have been used previously in MPMGJN [37] for pairs of nodes, and in PathMPMJ [4] for path queries, while in our system loop joins are used for full twigs, and with no physical lists of matches for internal nodes.

Most twig join algorithms are non-looping, and read the input lists once. The original TwigStack [4] maintains one stack of nested matches for each query node, and outputs individual root-to-leaf path matches, which are merged in a second phase. More complex intermediate result management is used by the later single-phase join algorithms, such as Twig²Stack [5], HolisticTwigStack [16], TwigList [24] and TwigFast [18].

Structural summaries [10] are a prerequisite for prefix path partitioning, which was previously used in iTwigJoin [6]. A difference between iTwigJoin and our approach is that iTwigJoin uses materialized lists for all query nodes, and uses a specialized join to combine pairs of lists for directly related query nodes.

Generating matches for internal query nodes on the fly is done in the Virtual Cursors [36] algorithm and in TJFast [19], which use tag partitioning. A disadvantage of TJFast is that instead of using a structural summary, the name path and Dewey are stored compressed in the leaf lists. This makes reads unnecessarily expensive and increases space usage. A disadvantage of Virtual Cursors is that non-branching internal nodes are not ignored during evaluation, because generated internal nodes are not guaranteed to have matching prefix paths. This also causes useless node candidates for branching internal nodes. A shortcoming of both previous approaches, and also of ours, is that much work is repeated during construction of internal nodes.

Skipping joins have been used previously with tag partitioning, for example in TSGeneric+ [15] and TwigOptimal [7]. When physical lists are used for the internal query nodes, skipping in a list to catch up with a descendant is not trivial, and specialized data structures like XR-trees [33] must be used. An advantage of virtual node lists is that skipping for internal nodes can be implicit. This was used previously in Virtual Cursors [36], where B-trees were used to skip through leaf node matches.


Multiple access methods for query node matches, implemented through materialized views, are used for XML search in Microsoft SQL Server [22]. Their system uses prefix path partitioning similar to ours, but reads data node lists for internal query nodes, and uses row-oriented storage. MonetDB/XQuery [3] has columnar storage, but it uses tag partitioning, and does not feature compression.

6 Conclusion and Future Work

This paper has investigated how to combine various techniques for twig matching. A prototype was developed to investigate the possible benefits. When compared with two existing systems, which use only some of these techniques, a speedup of two orders of magnitude was shown for queries with value predicates.

Future work includes modifying the experimental prototype to allow switching features on and off individually, to investigate both their individual and combined benefits thoroughly. This may give more insight than comparing with other full systems. More features, like optimal twig join algorithms, should also be added to our system.

Acknowledgment

Thank you to Felix Weigel for fruitful discussions. The first author was supported by the Research Council of Norway under grant NFR 162349.

References

[1] Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-stores vs. row-stores: how different are they really? In Proc. SIGMOD, 2008.

[2] S. Al-Khalifa, H.V. Jagadish, N. Koudas, J.M. Patel, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002.

[3] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In Proc. SIGMOD, 2006.

[4] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002.

[5] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig²Stack: bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006.

[6] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.

[7] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing cursor movement in holistic twig joins. In Proc. CIKM, 2005.

[8] M. Franceschet. XPathMark. http://sole.dimi.uniud.it/~massimo.franceschet/xpathmark/ (Accessed: 2009-03-06).

[9] M. Franceschet. XPathMark: an XPath benchmark for XMark generated data. In Proc. XSYM, 2005.

[10] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. VLDB, 1997.

[11] Gang Gou and R. Chirkova. Efficiently querying large XML data repositories: A survey. IEEE Trans. on Knowl. and Data Eng., 2007.

[12] Nils Grimsmo. Faster path indexes for search in XML data. In Proc. ADC, 2008.

[13] T. Härder, M. Haustein, C. Mathis, and M. Wagner. Node labeling schemes for dynamic XML documents reconsidered. Data & Knowl. Engineering, 2006.

[14] Allison L. Holloway and David J. DeWitt. Read-optimized databases, in depth. Proc. VLDB Endow., 2008.

[15] H. Jiang, W. Wang, H. Lu, and J.X. Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003.

[16] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007.

[17] Michael Ley. DBLP XML Records, Jan. 2007. http://www.informatik.uni-trier.de/~ley/db/ (Accessed: 2008-10-22).

[18] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.

[19] J. Lu, T.W. Ling, C.Y. Chan, and T. Chen. From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005.

[20] Monet. MonetDB web page. http://monetdb.cwi.nl/ (Accessed: 2009-02-14).

[21] Patrick O'Neil, Elizabeth O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, and Nigel Westbury. ORDPATHs: insert-friendly XML node labels. In Proc. SIGMOD, 2004.

[22] Shankar Pal, Istvan Cseri, Oliver Seeliger, Gideon Schaller, Leo Giakoumakis, and Vasili Zolotov. Indexing XML data stored in a relational database. In Proc. VLDB, 2004.

[23] Peter Boncz, Marcin Zukowski, and Niels Nes. MonetDB/X100: Hyper-pipelining query execution. In Proc. CIDR, 2005.

[24] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007.

[25] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer sequences. In Proc. ICDE, 2004.

[26] Albrecht Schmidt, Florian Waas, Martin Kersten, Michael J. Carey, Ioana Manolescu, and Ralph Busse. XMark: a benchmark for XML data management. In Proc. VLDB, 2002.

[27] Falk Scholer, Hugh E. Williams, John Yiannis, and Justin Zobel. Compression of inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.

[28] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt, and Jeffrey F. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proc. VLDB, 1999.

[29] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-Store: a column-oriented DBMS. In Proc. VLDB, 2005.

[30] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proc. SIGMOD, 2002.

[31] W3C. XPath 1.0, 1999. http://www.w3.org/TR/xpath (Accessed: 2009-04-29).

[32] W3C. XQuery 1.0, 2007. http://www.w3.org/TR/xquery/ (Accessed: 2009-04-29).

[33] H. Jiang, H. Lu, W. Wang, and B.C. Ooi. XR-tree: Indexing XML data for efficient structural joins. In Proc. ICDE, 2003.

[34] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval. PhD thesis, Ludwig-Maximilians-Universität München, 2006.

[35] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing gigabytes (2nd ed.): compressing and indexing documents and images. Morgan Kaufmann, 1999.

[36] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[37] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001.


Paper 4

Nils Grimsmo and Truls Amundsen Bjørklund

Towards Unifying Advances in Twig Join Algorithms

Proceedings of the 21st Australasian Database Conference (ADC 2010)

Abstract  Twig joins are key building blocks in current XML indexing systems, and numerous algorithms and useful data structures have been introduced. We give a structured, qualitative analysis of recent advances, which leads to the identification of a number of opportunities for further improvements. Cases where combining competing or orthogonal techniques would be advantageous are highlighted, such as algorithms avoiding redundant computations and schemes for cheaper intermediate result management. We propose some direct improvements over existing solutions, such as reduced memory usage and stronger filters for bottom-up algorithms. In addition we identify cases where previous work has been overlooked or not used to its full potential, such as for virtual streams, or where the benefits of previous techniques have been underestimated, such as for skipping joins. Using the identified opportunities as a guide for future work, we are hopefully one step closer to a unification of many advances in twig join algorithms.


1 Introduction

Twig matching is the most heavily used building block in systems offering search in XML with languages like XPath and XQuery [12]. XML has become the de facto standard for storage of semi-structured data, and the standard for data exchange between disjoint information systems. XPath is a declarative language, and XQuery is an iterative language which uses XPath as a building block. XPath queries can be evaluated in polynomial time [11].

Most academic work related to indexing and querying XML focuses on the twig matching problem, which is equivalent to a subset of XPath: Given a labeled data tree and a labeled query tree, find all matchings of the query nodes to the data nodes, where the data nodes satisfy the ancestor-descendant (a-d) and parent-child (p-c) relationships specified by the query tree edges.

The example in Figure 1 shows the relation between twig matching and XML search. The tree in part (a) is an abstraction of the XML document in (c). Real XML separates element (tag), attribute and text nodes, but in the abstract model there is only one type of node. The XPath and XQuery examples in (d) both specify the same structure as the abstract twig query in (b), where double edges symbolize a-d relationships.

[Figure: the abstract data tree (a), the twig query (b) and the XML data (c) are not reproduced here. The queries in (d) are:]

XPath:   //a[.//b][c]

XQuery:  for $na in //a
         let $nb := $na//b
         where $na/c
         order by $na
         return ($na, $nb)

Figure 1: XML and twig matching relation. (a) Abstract data tree. (b) Twig query. (c) XML data. (d) XPath (above) and XQuery (below).

This work focuses on twig matching in indexed data trees. In a typical setting, all data nodes with the same label are stored together, using some encoding which specifies tree positions. To evaluate a query, one stream of data nodes with a matching label is read for each query node, and the streams are joined to form twig matches.

This paper gives a structured analysis of recent advances in twig join algorithms, which leads to the identification of a number of opportunities for further improvements. Some direct improvements are identified, such as reduced memory usage in bottom-up algorithms and stronger top-down filters. We highlight cases where new combinations of competing and orthogonal techniques would have clear advantages, but also cases where important previous work has, in our view, been compared with unfairly. We note some open challenges, such as updatability in strong structural summaries, and more efficient detection of cases where simpler and faster algorithms can be used (Section 3.7).

The analysis explores techniques for avoiding redundant computations (Section 3.1), schemes for intermediate result management (Section 3.2), top-down filters for bottom-up algorithms (Section 3.3), skipping joins (Section 3.4), refined access methods (Section 3.5) and virtual streams (Section 3.6).

¹ See errata in Appendix A.

2 Background: Concepts and Techniques

This section goes through some fundamental concepts and techniques which are useful for the understanding of the later algorithms. First we formally define the problem.

Definition 1 (Twig matching). Given a rooted unordered labeled query tree Q and a rooted ordered labeled data tree D, find all complete matchings of the nodes in Q, such that the matched nodes in D follow the structural requirements given by the a-d and p-c edges in Q.

Note that there is a slight difference between the semantics of twig matching and XPath. A twig query returns all legal combinations of node matches, while in XPath there is a single query return node. The generality of returning all legal combinations of matches in twig matching may have become the academic focus point because it is useful for the flexibility of XQuery. XPath can also specify more than a-d and p-c relationships, but a majority of XPath queries in practice use only the a-d and p-c axes [12].

Many early approaches to search in semi-structured data used combinations of indexing and tree navigation, but the main focus over the last decade has been on indexing with inverted lists and structural joins of streams of query node matches [1]. This paper only considers twig join algorithms.

Indexing and node encoding are critical for the efficiency of twig joins. Usually data nodes are indexed (partitioned) on node labels, using for example inverted lists. Two aspects of how data is stored inside partitions are important: how the position of a node is encoded, and how the nodes in the partition are ordered. For most algorithms, nodes are stored in depth-first traversal pre-order, such that ascendants are seen before descendants. The positional information stored with the nodes must make it possible to decide a-d and p-c relationships. The most common is the regional begin,end,level (BEL) encoding [29], which is used in the data extraction example in Figure 2. It reflects the positions of opening and closing tags in XML (see Figure 1c). The begin and end numbers are not the same as pre- and post-order traversal numbers, but give the same sorting orders.
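As a concrete illustration, the following sketch assigns BEL labels matching Figure 2 and decides a-d and p-c relationships from them. The nested-tuple tree format and the function names are our own assumptions, not taken from any of the systems discussed.

    def assign_bel(node, level, counter, out):
        # node = (label, [children]); appends (label, begin, end, level) to out in pre-order.
        label, children = node
        counter[0] += 1
        begin = counter[0]
        slot = len(out)
        out.append(None)                   # reserve this node's pre-order slot
        for child in children:
            assign_bel(child, level + 1, counter, out)
        if children:                       # internal nodes consume a number for end;
            counter[0] += 1                # leaves keep end == begin, as in Figure 2
        out[slot] = (label, begin, counter[0], level)
        return out

    def is_ancestor(a, d):                 # a-d: a's region strictly contains d's
        return a[1] < d[1] and d[2] < a[2]

    def is_parent(a, d):                   # p-c: containment plus adjacent levels
        return is_ancestor(a, d) and d[3] == a[3] + 1

    # The data tree from Figure 2a: a(b, a(c, b), c)
    nodes = assign_bel(("a", [("b", []), ("a", [("c", []), ("b", [])]), ("c", [])]), 1, [0], [])
    assert ("a", 1, 8, 1) in nodes and ("b", 5, 5, 3) in nodes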

In the following, let T_q denote the stream of matches for query node q, and C_q denote the current data node in this stream. For simplicity, polynomial factors in the size of the query are ignored in asymptotic notation.

The early history of twig joins is shown in Figure 3. An early approach to schema-agnostic XML indexing was to store nodes with the BEL encoding in an RDBMS, and to specify query node relations as a number of inequalities. But these theta-joins are expensive. Specialized loop structural joins which leveraged the knowledge that the encoded data is a tree were introduced [29, 1]. These have O(I + O) cost for evaluating an a-d relationship, where I and O are the sizes of the input and output streams, but quadratic worst-case cost for p-c relationships. Stack joins were introduced to get optimal evaluation for all binary structural joins.

[Figure: (a) shows the data tree a_{1,8}(b_{2,2}, a_{3,6}(c_{4,4}, b_{5,5}), c_{7,7}) and (c) a query matching a nodes with b and c descendants; the graphics and the match list (d) are not reproduced. The extracted index (b) is:]

    tag  B  E  L
    a    1  8  1
    a    3  6  2
    b    2  2  2
    b    5  5  3
    c    4  4  3
    c    7  7  2

Figure 2: Tree indexing and querying example. (a) Data. (b) Extracted. (c) Query. (d) Matches.

[Figure: flow chart not reproduced; the lineage it shows is listed below.]

- Leveraging RDBMS (theta-joins).
- Specialized loop joins, a-d optimal: MPMGJN [29], Tree-Merge-Desc/-Anc [1].
- Stack joins, a-d & p-c optimal: Stack-Tree-Desc/-Anc [1]; skipping: Anc des B+ [7], XR-Stack [14].
- Path m-way loop: PathMPMJ [4]; twig m-way loop: not explored.
- Path m-way stack, a-d & p-c optimal: PathStack [4].
- Twig m-way stack, a-d optimal: TwigStack [4].

Figure 3: The early history of twig joins. Continued in Figure 8.

A problem with combining the evaluation of a number of binary relationships to answer a query is that the intermediate results may be of size exponential in the query, even if the output is small. This led to the introduction of multi-way join algorithms.

Stacks are key data structures in most modern twig join algorithms. Their use here is motivated by their use in depth-first tree traversals. To join streams of ascendants and descendants, a stack of currently nested ancestor nodes is maintained. Nodes are popped off the ancestor stack when a non-contained (disjoint) node is seen in either stream. In a path or twig multi-way algorithm, there must be one stack S_{q_i} for each internal query node q_i. The matches for the different query nodes must be processed in total pre-order, to ensure that ancestor nodes are added before descendants need them.
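To make the mechanism concrete, here is a minimal sketch of such a stack-based binary a-d join in the spirit of Stack-Tree-Desc; it assumes BEL-labeled nodes as produced by the earlier sketch, sorted on begin values, and the function name is our own.

    def stack_join_ad(ancestors, descendants):
        # ancestors, descendants: lists of (label, begin, end, level) in pre-order.
        # Returns every (a, d) pair where a's region contains d's.
        out, stack, i = [], [], 0
        for d in descendants:
            # push all ancestor candidates starting before d
            while i < len(ancestors) and ancestors[i][1] < d[1]:
                while stack and stack[-1][2] < ancestors[i][1]:
                    stack.pop()                # pop nodes disjoint with the new one
                stack.append(ancestors[i])
                i += 1
            while stack and stack[-1][2] < d[1]:
                stack.pop()                    # pop nodes ending before d begins
            for a in stack:                    # everything left on the stack nests d
                out.append((a, d))
        return out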

In each step of the PathStack algorithm [4], the current data node is used to clean all stacks by popping non-containing nodes, before it is pushed on stack. Figure 4b shows the stacks for a query when evaluated on the data in Figure 4a, right after the node c_1 has been pushed. When the current query node is a leaf, all related matches are output. To enable linear-time enumeration of the matches encoded in the stacks, each data node pushed onto a stack has a pointer to the closest containing data node in the parent stack, which was the top of the parent stack when the data node was pushed. Nodes above it on the parent stack cannot be ascendants, as the data nodes are read in pre-order. In the example, b_2 and a_2 are not usable together. Because a stack contains only nested nodes, the space needed is O(d), where d is the maximal depth of the data tree.

[Figure: data tree and stacks not reproduced here.]

Figure 4: Data structures for PathStack. (a) Data tree. (b) Query with stacks after pushing c_1.

A technique for getting path matches sorted on higher query node matches first is critical for the efficiency of TwigStack and other twig multi-way algorithms. Delaying out-of-order output is achieved by maintaining so-called self- and inherit-lists for each stacked node [1]. The lists for the data and query in Figure 4 are shown in Figure 5. As a node is popped off stack, the contents of its lists are appended to the inherit-lists of the node below on the same stack, if there is one. This maintains correct output order. See for example the lists for b_2 and b_1 in the example. But if the popped node can use some ancestor node in the parent stack which the node below in its own stack cannot, the contents of the lists must be appended to the self-lists there. This is decided from the inter-stack pointers. In the example, popping node b_3 results in adding (a_2 b_3 c_1) to the self-list of a_2. PathStack has O(I + O) complexity both with and without delayed output, where I is now the total input size.

[Figure: stack contents not reproduced here.]

Figure 5: Stack nodes with final self- and inherit-lists for the data and query in Figure 4. Darker nodes popped first.


TwigStack [4] was the first holistic twig join algorithm. Using PathStack on each root-to-leaf path in a twig query and merging the matches may lead to many useless intermediate results, because path matches need not be part of complete matches. TwigStack improved on this, and achieved O(I + O) complexity for queries with a-d edges only. It is a two-phase algorithm, where the first phase outputs matches for each root-to-leaf path, and the second phase merge joins the path matches. The first phase does two things which are critical for the performance of the algorithm: it only outputs path matches which can possibly be part of some complete query match, and it outputs paths sorted on higher query nodes first, using the technique from [1]. This allows a linear merge in phase two.

TwigStack does additional checking before pushing nodes on stack compared to PathStack. The data node at the head of the stream for a query node q is not pushed on stack before it has a so-called "solution extension", which means that the heads of the streams of all child query nodes are contained by C_q, and that the child nodes recursively satisfy this property. Also, a node is not pushed on stack unless there is a usable ancestor data node on the stack for the parent query node.

Pseudo-code for TwigStack is shown in Algorithm 1 (adapted from [4]). It is included here to ease the depiction of the improvements discussed in the following sections. Each query node q has an associated stream T_q with current element C_q, and a stack S_q. The algorithm revolves around a recursive function getNext(q), which returns a (locally) uppermost query node in the subtree of q which has a solution extension. If the parent of the returned q has a usable ancestor data node on stack, this means C_q is part of a full solution extension identified earlier, and C_q is pushed on S_q. A path match is found when a leaf node is pushed on stack, but output is delayed to make sure paths are ordered on the query nodes top-down (called "blocking" in [4]). Note that actually pushing a leaf node on stack is unnecessary, as it will be popped right off.

Algorithm 1 TwigStack

 1: function TwigStack(Q)
 2:   while not atEnd(Q)
 3:     q := getNext(Q.root)
 4:     if not isRoot(q)
 5:       cleanStack(S_parent(q), C_q)
 6:     if isRoot(q) or not empty(S_parent(q))
 7:       cleanStack(S_q, C_q)
 8:       push(S_q, C_q, top(S_parent(q)))
 9:       if isLeaf(q)
10:         outputPathsDelayed(C_q)
11:         pop(S_q)
12:     advance(T_q)
13:   mergePathSolutions()

14: function getNext(q)
15:   if isLeaf(q)
16:     return q
17:   for q_i ∈ children(q)
18:     q_j := getNext(q_i)
19:     if q_j ≠ q_i
20:       return q_j
21:   q_min := argmin over q_i ∈ children(q) of C_{q_i}.begin
22:   q_max := argmax over q_i ∈ children(q) of C_{q_i}.begin
23:   while C_q.end < C_{q_max}.begin
24:     advance(T_q)
25:   if C_q.begin < C_{q_min}.begin
26:     return q
27:   else
28:     return q_min

The getNext() traversal is bottom-up, and is shortcut if some node does not have a solution extension (see line 20). Leaves trivially have solution extensions. The traversal has the side effect of advancing the treated query node at least until it contains all its children (line 23). If it does not contain all children at this point, the child currently with the first pre-order data node (lowest begin value) is returned, to be forwarded in line 12.
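For concreteness, a direct Python transcription of getNext() (lines 14-28) is sketched below; the Stream class is an assumed stand-in for the indexed streams, and end-of-stream handling is simplified.

    class Stream:
        # An assumed stand-in for an indexed stream of (begin, end) matches in pre-order.
        def __init__(self, nodes):
            self.nodes, self.pos = nodes, 0
        def head(self):
            return self.nodes[self.pos]
        def advance(self):
            self.pos += 1
        def at_end(self):
            return self.pos >= len(self.nodes)

    def get_next(q, streams, children):
        # q: query node id; streams[q]: Stream; children[q]: list of child ids.
        if not children[q]:
            return q                                # leaves trivially qualify (lines 15-16)
        for qi in children[q]:
            qj = get_next(qi, streams, children)
            if qj != qi:
                return qj                           # shortcut (lines 19-20)
        q_min = min(children[q], key=lambda c: streams[c].head()[0])
        q_max = max(children[q], key=lambda c: streams[c].head()[0])
        # forward C_q until it contains the child head with the largest begin value (line 23)
        while not streams[q].at_end() and streams[q].head()[1] < streams[q_max].head()[0]:
            streams[q].advance()
        if not streams[q].at_end() and streams[q].head()[0] < streams[q_min].head()[0]:
            return q                                # C_q has a solution extension
        return q_min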

Figure 6 shows the state of the algorithm when evaluating the query in Figure 2c, right after node b_{5,5} has been processed. After the first call to getNext(a), when all the streams were at their start positions, a itself was returned, as it had a solution extension, and C_a = a_{1,8} was pushed on stack. For the second call to getNext(a), this was not the case, and b was returned, with head C_b = b_{2,2}. Since b_{2,2} had a usable ancestor a_{1,8} on the parent stack S_a, a_{1,8} must have had a solution extension in which the subtree rooted at b_{2,2} was usable. So C_b = b_{2,2} was pushed on its own stack S_b, and since it was a leaf, the path match (a_{1,8} b_{2,2}) was output. After all paths have been found, they are merge joined.

[Figure: stacks and path matches not reproduced here. The streams are T_a = a_{1,8}, a_{3,6}; T_b = b_{2,2}, b_{5,5}; T_c = c_{4,4}, c_{7,7}.]

Figure 6: TwigStack state when evaluating the query in Figure 2c, after processing node b_{5,5}. (a) Streams. (b) Stacks. (c) Path matches.

TwigStack's suboptimality for mixed a-d and p-c queries comes from having to output path matches without knowing whether the data nodes used can satisfy all their p-c relationships. The algorithm cannot always decide this from the nodes on the stacks and the heads of the streams. For the example in Figure 7, it cannot be decided whether the path matches (a_1, b_1), ..., (a_1, b_n) are part of a full match before the node c_{n+1} is seen.

[Figure: query and data trees not reproduced here.]

Figure 7: Bad case for TwigStack. (a) Query. (b) Data.

For a-d-only queries, queries where a p-c edge never follows an a-d edge [24], or data with a non-recursive schema [9], twig joins can be solved with cost linear in the size of the input and the output, using O(d) memory. Sadly, recursive schemas, where nodes with a given label may nest nodes with the same label, are common in XML in practice [8], and so are mixed queries of the type mentioned above. No algorithm can solve the general problem given tag streaming, linear index size, and a query evaluation memory requirement of O(d) [24]. One alternative is storing multiple sort orders on disk, instead of only tree pre-order. This would require Ω(m^{min(m,d)} · D) disk space in the worst case, where m is the number of structurally recursive labels and D is the size of the document [9]. Another alternative is to do multiple scans of the streams, but this would require Ω(d^t) passes in the worst case, where t is a linear metric on the complexity of the query [9]. So the only viable alternatives left seem to be relaxing the O(d) space requirement, or using something other than tag partitioning. The following section investigates this, but also many practical speedups to TwigStack.

3 Advances

A multitude of different improvements have been presented since the introduction of TwigStack. Figure 8 gives an overview of these, with a separation between improved join algorithms and changes to how data is indexed and accessed. The rest of this paper is devoted to a structured review of these advances. Our goal is to identify further improvements, and to shed light on whether it is likely that combining these advances is possible and beneficial.

3.1 Avoiding Redundant Computation

TwigStack may perform many redundant checks in the calls to getNext(). Each time a node is returned, the full subtree below it has been inspected. The TJEssential [19] algorithm improved on three specific deficiencies, exemplified in Figure 9.

The first deficiency stems from self-nested matching nodes. For query node a and data nodes a_1 to a_p in the example, it is unnecessary to recursively check the full subtrees below b and d in each round while pushing the nodes onto S_a. The usefulness of a_2, ..., a_p can be seen from the fact that a_1 had a new solution extension, and that a_2, ..., a_p contain b_1 and d_1, the heads of the streams of a's children.

The second observation concerns the order in which child nodes are inspected. If the child b is inspected before d in line 18 of Algorithm 1, getNext(a) will call getNext(b) before getNext(d) shortcuts the search. There will be m − 1 redundant calls to getNext(b) while forwarding the leaf node e.

The third observation is that many useless calls can be made after a stream has reached its end. Assuming that b_2 was the last b node in Figure 9, no a node later in the tree order would ever be pushed onto stack, and T_a could be forwarded to its end. Also, if S_b was empty, any descendant of b in the query could have its stream forwarded to the end, as the remaining nodes could not be part of a solution.

TJEssential is a total rewrite of TwigStack, and is more complex than the original algorithm. TwigStack+ [30] is a less involved modification, which only changes the getNext() procedure, such that it does not return before a solution is found. TwigStack+ does not catch any of the three above cases, but reduces computation for scattered node matches in practice.


[Figure: flow chart not reproduced; the advances it maps are listed below, split between changing algorithms and changing data and data access.]

Changing algorithms:
- Nested loop twig m-way (fast on simple queries?).
- 2-phase holistic top-down, a-d optimal: TwigStack [4].
- Avoiding redundant computation: TJEssential [19], TwigStack+ [30], TJEssential* [19], TwigStack+B [30]; single-phase essential?
- 1-phase bottom-up, a-d & p-c optimal: Twig²Stack [5].
- 1-phase top-down, a-d & p-c optimal: HolisticTwigStack [16].
- Simplified intermediate result management, bottom-up: TwigList [23].
- Simplified intermediate result management, top-down: TwigFast [20]; unification?
- Bottom-up + filtering: Twig²Stack+PStack [5], Twig²Stack+TStack [3], TwigMix [20]; optimal algorithm?

Changing data and data access:
- Refined access methods (tag, tag+level, path): iTwigJoin [6]; (twig): TwigVersion [27]; optimal data access?
- Virtual streams: Virtual Cursors [28], TJFast [21], (TwigOptimal [10]).
- Skipping in leaf streams: Virtual Cursors [28]; holistic skipping in matched leaf streams?
- Skipping joins: TwigStackXB [4]; efficient ancestor skipping: TSGeneric+ [15]; holistic skipping: TwigOptimal [10]; holistic ancestor skipping? Combination possible?

Figure 8: Advances and opportunities in twig joins.


[Figure: data and query trees not reproduced here.]

Figure 9: Giving redundant checks in TwigStack. (a) Data. (b) Query.

Opportunity 1 (Removing redundant computation in top-down one-phase joins). The improvement of TwigStack+ can trivially be ported to recent algorithms such as HolisticTwigStack and TwigFast, which improve other aspects of TwigStack (see Section 3.2). A challenge is to do the same for all three improvements of TJEssential. Also, case three above could be extended to more efficient aligning for multi-document XML collections.

3.2 Top-down vs. Bottom-up

There are two main lines of algorithmic improvement over TwigStack which give optimal evaluation of mixed a-d and p-c queries by relaxing the O(d) memory requirement: bottom-up algorithms which read nodes in post-order, and later algorithms which go back to top-down and pre-order. Differences between these are illustrated in Figure 10.

Twig²Stack [5] generates a single combined stream with post-order sorting for all query node matches with the help of a single stack. With post-order processing it can be decided whether an entire subtree has a match at the time the top node is seen.

Figure 10c shows the hierarchies of stacks built while processing a query. For each query node, a list of trees of stacks is maintained. A data node strictly nests all nodes below it in the stack, and all nodes in child stacks in the tree. The lists of trees are stored sorted in post-order, and are linked together by a common root when an ancestor node is processed. Because of the post-order, the nodes to be linked will always be found at the end of the list, and the new root will always be put at the end. The order naturally maintains itself, and good locality is achieved.

Instead of each node on stack having a pointer to an ancestor node on a parent stack as in TwigStack, each stacked data node has, for each related child query node, a list of pointers to top stack nodes matching the query axis relationship. Nodes are only added if a-d and p-c relationships can be satisfied, and p-c pointers are only added when levels are correct, as seen for the a and c nodes in the example.


[Figure: the six panels are not reproduced here.]

Figure 10: (a) Data. (b) Query. (c) Hierarchies of stacks for Twig²Stack. (d) Intervals for TwigList. Curved arrows are sibling pointers. (e) Lists of stacks for HolisticTwigStack right before c_5 is processed. Previously popped nodes shown in gray. (f) Intervals for TwigFast after c_5 has been processed. Curved arrows are ancestor pointers.


TwigList [23] is a simplification of Twig²Stack using simple lists and intervals given by pointers, which improves performance in practice. For each query node, there is a post-order list of the data nodes used so far. Each node in a list has, for each child query node, a single recorded interval of contained nodes, as shown in Figure 10d. Interval start and end positions are recorded as nodes are pushed and popped on and off the global stack. All descendant data nodes are processed in between. Compared with the lists of pointers in Twig²Stack, enumeration of matches is not as efficient for p-c edges, but sibling pointers can remedy this.
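The following sketch shows the interval recording for a single a-d edge a//b; the filtering of useless nodes that TwigList also performs is omitted, and the names are our own.

    def twiglist_intervals(stream):
        # stream: pre-order (label, begin, end) nodes, labels 'a' or 'b', for the
        # query a//b. Returns (list_a, list_b): list_a holds (node, start, end),
        # where [start, end) is the node's interval into list_b, and it ends up
        # in post-order, as in TwigList.
        list_a, list_b, stack = [], [], []
        def pop_until(begin):
            while stack and stack[-1][0][2] < begin:   # top's end precedes `begin`
                a, start = stack.pop()
                list_a.append((a, start, len(list_b))) # close the recorded interval
        for node in stream:
            pop_until(node[1])
            if node[0] == 'a':
                stack.append((node, len(list_b)))      # interval start recorded on push
            else:
                list_b.append(node)
        pop_until(float('inf'))                        # flush remaining stacked nodes
        return list_a, list_b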

HolisticTwigStack [16] is a modification of TwigStack which uses pre-order processing, but maintains complex stack structures like Twig²Stack. The argument against Twig²Stack was its high memory usage, caused by the fact that all query leaf matches are kept in memory until the tree is completely processed, as they could be part of a match. HolisticTwigStack differentiates between the top-most branching node and its ancestors, for which a regular stack is used, and lower query nodes, which have multiple linked lists of stacks, as shown in Figure 10e. Each query node match has one pointer to the first descendant in pre-order for each child query node. For "lower" query nodes, new data nodes are pushed onto the current stack if contained, otherwise a new stack is created and appended to the list. As a match for an "upper" query node is popped, the node below on stack must inherit the pointers. Node a_1 would inherit the pointers from both a_2 and a_4 in the example in Figure 10e, and the related lists of child matches would be linked.

TwigFast [20] is a simplification of HolisticTwigStack similar to TwigList. There is one list containing matches for each query node, naturally sorted in pre-order, and data nodes in the lists have pointers giving the interval of contained matches for child query nodes, as shown in Figure 10f. Each data node put into the list has a pointer to its closest ancestor in the same list, and there is a "tail pointer", which gives the last position where a node can be the ancestor of following nodes in the streams. These pointers are used for the construction of the intervals.

Different advantages of top-down and bottom-up algorithms can be seen in Figure 10. A top-down algorithm can avoid storing b_1 and c_2, while a bottom-up algorithm is unable to decide that these nodes cannot be part of a solution. On the other hand, a bottom-up algorithm can decide that a_2 is not usable, because it cannot satisfy the p-c relationship between a and c. Both approaches can decide that a_3 is not useful because it does not have a b descendant.

The worst-case space complexity of twig pattern matching is an open problem, and the known bounds are Ω(max(d, u)) and O(I), where u is the number of nodes which are part of a solution [24]. However, practical space savings are possible.

Opportunity 2 (Top-down memory usage). TwigStack treats queries as a-d only in the stack construction part of phase one. A node returned from getNext() is pushed on stack if it has a usable ancestor on the parent stack, even if the query specifies a p-c relationship. For example, c_3 does not have to be pushed on stack in Figure 10e, because it does not have a usable parent. Strictly checking p-c relationships before adding intermediate results would reduce memory usage in practice. This optimization was identified for TJFast [21] (see Section 3.6), but the later HolisticTwigStack and TwigFast do not take advantage of this opportunity.

Opportunity 3 (Bottom-up memory usage). Assume a query node q with a p-c relationship to its parent query node. If a candidate match for q is pushed onto a stack in Twig²Stack, and the data node below it on the stack does not have an incoming pointer, the node below will never get a matching parent, and can be popped off the stack. For example, the node c_6 could be dropped in Figure 10c. Also, when stack trees for q are merged, some ancestor data node a_i is seen. Then all the stack trees which do not get or have an incoming pointer can be dropped, as all later candidates for the parent query node will come after in the post-order. In the example, the stack trees containing the single nodes c_2 and c_3 could be dropped when a_1 is seen.

Note that this improvement is hard to transfer directly to TwigList unless the lists are implemented as linked lists, which is by far inferior to using arrays and array doubling on modern hardware, as done in TwigList [22]. Another solution is to keep one list per level for query nodes which have a p-c relationship to their parent query node. Siblings would then be stored contiguously, and interval pointers would implicitly refer to a list on a given level. When the first ancestor of a segment of nodes in need of a parent is seen, the useless nodes can be overwritten. This modification would also make sibling pointers unnecessary and improve the efficiency of result enumeration. Figure 11 shows the proposed approach for the data and query in Figure 10, where gray list items can be overwritten.

[Figure: per-level list layout not reproduced here.]

Figure 11: Proposal for multi-level lists for TwigList.
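One possible reading of this proposal, sketched as a data structure only: matches are appended to the list for their own level, an interval implicitly addresses one level's list, and a segment that never receives a parent can be truncated and overwritten. The names and interface are our assumptions.

    from collections import defaultdict

    class PerLevelLists:
        # One growable array per tree level; node = (label, begin, end, level).
        def __init__(self):
            self.lists = defaultdict(list)
        def append(self, node):
            self.lists[node[3]].append(node)
        def mark(self, level):
            # interval start for a parent about to collect children at `level`
            return len(self.lists[level])
        def interval(self, level, start):
            # an interval is (level, start, end); the target list is implicit
            return (level, start, len(self.lists[level]))
        def truncate(self, level, start):
            # overwrite a segment of nodes that never received a parent
            del self.lists[level][start:]

A parent candidate at level l would record mark(l + 1) when pushed and interval(l + 1, start) when popped; when an ancestor arrives and the segment since the last mark has no incoming pointer, truncate() reclaims it.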

3.3 Filtering

Low-memory top-down approaches have been used as filters for bottom-up algorithms, to reduce space usage by avoiding useless nodes. Note that this does not result in a perfect solution. Assume that node a_1 in Figure 10a had a different label. An O(d)-space top-down pre-order approach could not decide that b_2 in the example was not part of a match, and a bottom-up algorithm would have to keep it in memory until the entire tree was read. Figure 12c-e shows the effects of the different previously proposed filters.

[Figure: the panels are not reproduced here.]

Figure 12: Filtering approaches for bottom-up processing. (a) Query. (b) No filter. (c) PathStack pop filter. (d) Solution extension filter. (e) TwigStack pop filter. (f) Opportunity: PathStack useful. (g) Opportunity: TwigStack useful. Filtered nodes shown in gray.

PathStack Pop Filter. In the original Twig²Stack paper [5], PathStack was proposed as a pre-filter to allow early result enumeration. PathStack is run as usual, but without its result enumeration. As disjoint nodes are popped off their stacks, they are passed to Twig²Stack. When the bottom node is popped from the stack of the root query node, all results can be output, and the hierarchical stacks destroyed. A side effect of this procedure is that only nodes that are part of some prefix path match are used (these are not necessarily part of a full root-to-leaf path match). In Figure 12c, node c_1 is avoided. Note that one data node may result in the popping of multiple nodes on multiple stacks, and that Twig²Stack must receive descendants before ascendants.

Solution Extension Filter. TwigMix [20] is an algorithm which combines the simplified data structures of TwigList with the getNext() function from TwigStack as a filter. This combination gives efficient evaluation for queries involving p-c edges, and reduced memory usage in practice. An advantage of this approach over Twig²Stack+PathStack is that there is no overhead of maintaining an extra set of stacks, and that internal nodes are filtered holistically. The downside is that nodes are added without even having a possible parent or ancestor. Figure 12d shows that node a_2 is filtered, because it never has a solution extension (it misses a b node below), while nodes c_1 and b_1 are not filtered.

TwigStack Pop Filter. TwigStack can also be used as a filter for Twig²Stack [3]. A node is never added to the hierarchical stacks if it is not popped from a top-down stack in TwigStack. As a node is never pushed on stack if it does not have a usable ancestor, which again has a solution extension, this gives additional filtering, at the cost of maintaining the top-down stacks. Figure 12e shows the improvements over both PathStack and solution extensions as filters. An issue is that Twig²Stack expects the stream of nodes to be in post-order, and that TwigStack may pop nodes off stacks out of this order. When a node is returned from getNext(), only the related stack and the parent stack are inspected. Also, TwigStack does not keep leaf matches on stack, but nested leaf matches may arrive later. In [3] this is solved by keeping an extra queue of data nodes into which popped nodes are placed if the algorithm decides that later popped nodes may precede them.

A different solution could be to allow nested nodes on query leaf stacks, and to inspect all stacks when popping disjoint nodes to ensure post-order, as with the PathStack filter. Also, Twig²Stack does not actually need to see nodes in strict post-order, only descendants before ascendants. Hence, not all stacks in the query would have to be inspected, only the ascendant and descendant stacks of the current node.

Opportunity 4 (Stronger filters). There are further possibilities for filters with O(d) space usage. Instead of using all nodes popped off stacks in PathStack, one could use only the nodes which would be used in a full path match. As leaf nodes are pushed on stack, a simplified enumeration algorithm could be run, tagging nodes which take part in solutions. As can be seen in Figure 12f, this is an improvement over the previous PathStack filter, but only partially over the solution extension filter, which to a greater extent filters matches for higher query nodes (leaves trivially have solution extensions), while the "PathStack useful" filter works well on lower query nodes. Note that as the bottom-up algorithms to a greater extent handle upper nodes themselves, a filter is of most use if it removes lower query node candidates effectively. An even stronger filter would be to use only nodes which would have been output as parts of path matches in TwigStack, as shown in Figure 12g. None of [5, 20, 3] compare with using any other type of filter. A thorough comparison should cover both the practical space reductions the filters give, their absolute costs, and how their use affects the total computational cost.

Opportunity 5 (Unification or assimilation). When comparing the absolute performance gains presented in the respective papers, TwigFast is the winner on performance for pure tag streaming. As this is a very important result, it should be verified independently. Before TwigFast is picked as the method of choice, at least the following should be answered: (i) Can the improvements discussed in Section 3.1 be applied? (ii) Is it superior to improved top-down and bottom-up combinations? (iii) Does the picture change when random access gets more expensive compared to computation? [20] does not comment on the spatial locality of the memory access patterns in the intermediate result data structures of TwigFast, while they are very good for TwigList [23].

3.4 Skipping Joins

Skipping is a useful technique when the streams to be joined have very different sizes. Skipping is used to jump forward in a stream, to avoid reading and processing parts of the stream which cannot contain useful nodes. Figure 13 shows cases where different skipping techniques and data structures can be used.

Simple B-tree skipping can be used to skip in descendant streams, and to some extent in ancestor streams. It is trivial to skip in the descendant stream to find the first possible contained node, which is the first node with a larger begin value. In Figure 13b, T_c is forwarded from c_1 to c_q to find the first possible descendant of b_1.
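A sketch of this skip follows, with binary search standing in for a B-tree probe; bisect over an in-memory array of begin values is an assumption made for illustration.

    import bisect

    def skip_to_descendants(begins, pos, ancestor_begin):
        # begins: the stream's begin values, sorted; pos: current cursor position.
        # Returns the first position at or after pos whose begin value is larger
        # than the ancestor's begin, i.e. the first possibly contained node.
        return max(pos, bisect.bisect_right(begins, ancestor_begin))

    # E.g. forwarding T_c past c_1, ..., c_p in Figure 13b takes one probe
    # instead of p sequential reads.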


[Figure: the panels are not reproduced here.]

Figure 13: Benefits of skipping techniques. (a) Query. (b) Descendants easily skipped with B-tree. (c) Skip past discarded ascendant. (d) XR-tree needed to skip ascendants. (e) Holistic skipping preferred. (f) Holistic skipping with XR-tree needed.

But skipping to find the next ascendant of a node using the same approach is not effective, as any node with a lower begin value may be a match. A trick for ancestor skipping was introduced in [7]: if a node b_i is popped off stack S_b due to disjointness with the current data node for some query node, T_b is forwarded to the first node not contained by the popped node, i.e. a b_j such that b_i.end < b_j.begin. An example of this can be seen in Figure 13c. If c_2 pops b_1 off the stack, T_b can be forwarded beyond b_1 to b_q, because no descendant of b_1 could be useful.

XR-trees enable ancestor skipping in the general case [14]. Figure 13d shows an example where the above trick cannot be used. The XR-tree is a B-tree variant which can retrieve all R ancestors or descendants of a node from N candidates in O(log N + R) time. Typically one tree is built for each tag. To find all a ascendants of a node d_k, find the a_i with the nearest preceding begin value, and then all a ascendants of a_i in the XR-tree for a. Conceptually, the XR-tree contains for each node the list of its ascendants, which gives quadratic space usage when implemented naively. Linear space usage is achieved by not storing information redundantly inside XR-tree nodes, and by storing common information in internal XR-tree nodes.

TSGeneric+ (also called XRTwig) [15] extends the use of the XR-tree to TwigStack, and makes two major modifications to the algorithm. The first is to skip forward to containment of the first child in the getNext() procedure (see line 23 in Algorithm 1). The second change is more involved. Before calling getNext() on all children in line 18, a "broken" edge in the query sub-tree is repeatedly picked, and the two related nodes are "zig-zag" forwarded until they match. This is only done if the query node does not have data nodes on the stack. Choosing which edge to fix is done either top-down, bottom-up or by statistics.

Holistic skipping was introduced in the TwigOptimal [10] algorithm, which uses B-trees. Figure 13e shows a case where the approach from TSGeneric+ would be very expensive, reading all of the nodes b_2-b_p and c_2-c_p to fix the edge between b and c. TwigOptimal processes the query bottom-up, then top-down. In the bottom-up phase, nodes are forwarded to contain their descendants, and in the top-down phase, nodes are forwarded until they are contained by their parents. To avoid as many data structure reads as possible, nodes are forwarded to "virtual positions", which have only begin values. When a full traversal has not forwarded any node, the node with the minimal current begin value is forwarded to a real data node.

The name of the TwigOptimal algorithm may be slightly misleading, as the optimality holds only given skip structures on begin values. The algorithm is only compared with TSGeneric+ using simple B-trees. The effects of the two contributions, holistic skipping and the virtual positions, are not tested separately. TwigOptimal would be efficient on neither the example in Figure 13c nor the one in 13d. The approach is best when there are more matches for the lower query nodes. A common exception to this is queries with leaf value predicates in XML. [10] mentions skipping to the closest ancestor and then backtracking to the first ancestor as a possible practical speed-up.

Opportunity 6 (Holistic effective ancestor skipping). Figure 13f shows a case where both TSGeneric+ with XR-trees and TwigOptimal would fail to be efficient. The former would zig-zag join $a_2$–$a_p$ and $b_2$–$b_p$, and the latter would be unable to forward $T_a$ to $a_q$ without checking at least all of $a_2$–$a_p$ for ancestry of $c_q$. Combining holistic skipping and data structures for efficient ancestor skipping is required in a robust solution.

Opportunity 7 (Simpler and faster skipping data structures). The XR-tree is a dynamic data structure which supports insertions and deletions [14]. In regular keyword search engines, simpler data structures are usually preferred over the heavier B-trees when the data is static or semi-static. Similarly simple data structures should also be created for efficient ancestor skipping. If their use is still expensive, techniques similar to the trick used to skip past discarded ascendants should be applied when possible.

3.5 Refined Access Methods

There are alternatives to indexing and accessing data by node labels, such as using label and level, or the root-to-node path strings of labels (called tag+level and prefix path streaming [6]). With refined partitioning, some method must be used to identify the useful partitions for each query node. For prefix path streaming, this would be the partitions with data paths matching the root-to-node downward paths in the query.

Structural summaries are directory structures used to classify nodes based on their surrounding structure. They were first used in combination with tree traversals, but have later been integrated with pure partitioning schemes [18]. The most common is a path summary, which is a minimal tree containing all unique root-to-node label paths seen in the data. The data nodes associated with a summary node are called the extent of the node. Figure 14a shows the path summary for the data in Figure 14b, and the extents are shown in Figure 14d.
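As an illustration of how such a summary can be built, the following Python sketch (assumed node and field names, not code from the cited papers) constructs a path summary and its extents in a single traversal, extending the summary on the fly in the manner described under Opportunity 8 below:

    from dataclasses import dataclass, field

    @dataclass
    class SummaryNode:
        sid: int
        label: str
        children: dict = field(default_factory=dict)  # label -> SummaryNode

    def build_path_summary(root):
        # Build a path summary for a labeled data tree and collect extents.
        # `root` is assumed to have .label and .children; summary nodes are
        # created on the fly, so the traversal stays deterministic.
        summary = SummaryNode(1, root.label)
        extents = {1: [root]}
        counter = [1]

        def walk(data_node, summary_node):
            for child in data_node.children:
                s = summary_node.children.get(child.label)
                if s is None:  # new root-to-node label path: extend summary
                    counter[0] += 1
                    s = SummaryNode(counter[0], child.label)
                    summary_node.children[child.label] = s
                    extents[s.sid] = []
                extents[s.sid].append(child)  # data node joins the extent
                walk(child, s)

        walk(root, summary)
        return summary, extents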

Figure 14: Structural partitioning example. (a) is the path summary for (b), with extents shown in (d). (b) is the F&B summary for (c), with extents shown in (e).

Many alternative summary structures have been devised for general graphs. A structure which is also directly useful for trees is the stronger F&B-index [17], where two nodes are in the same equivalence class iff they have the same prefix path and have children of the same equivalence classes. In the example, (b) is the F&B summary of the tree in (c). For graphs, the F&B index can be found in $O(m \log n)$ time, where $m$ and $n$ are the number of edges and nodes in the graph. It is not known whether the F&B index can be found more efficiently for trees.
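For trees, the definition suggests a straightforward bottom-up computation; the following sketch (our illustration only, not the $O(m \log n)$ graph algorithm from [17]) assigns equivalence classes by keying each node on its path summary class together with the set of its children's classes:

    def fb_classes(root, path_class):
        # Assign F&B equivalence classes bottom-up. `path_class(node)` is
        # assumed to give the node's path summary class, so two nodes get
        # the same class iff they share prefix path and child class sets.
        classes = {}     # canonical key -> class id
        node_class = {}  # node -> class id

        def visit(node):
            child_classes = frozenset(visit(c) for c in node.children)
            key = (path_class(node), child_classes)
            if key not in classes:
                classes[key] = len(classes) + 1
            node_class[node] = classes[key]
            return classes[key]

        visit(root)
        return node_class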

Opportunity 8 (Updates in stronger summaries). Simple path summaries are usually small and easily updateable. When traversing a data tree for indexing, the path summary is used as a deterministic automaton, where new nodes are added on the fly when needed. Data nodes can be put in the correct extents immediately. If a data tree is updated, only the data nodes whose extent changes are affected. An interesting question is the updatability of stronger structural summaries. In the worst case for the F&B index, the structure of an entire containing subtree below the global root could change if a data node is added or removed, by causing or removing equivalence with another subtree. What are the implications of strategies lessening the restrictions on the F&B index? Would this give critical "fragmentation effects" in practice? And are updates cheaper in coarser variants, such as the F+B-index [17]?


Opportunity 9 (Hashing F&B summaries). In some search scenarios there are many small documents which are parts of a virtual tree. Document updates can be implemented as document deletes and inserts. With simple path summaries, documents can be added with cost linear in the document size, by traversing the summary deterministically. However, more refined summaries are not deterministic. Are stronger summaries like the F&B index suitable in this model? A challenge is that matching a new document in the F&B index has cost linear in the size of the summary in the worst case, not in the size of the document. Assume now that $a_2$, $a_{10}$ and $a_{15}$ in Figure 14c are document roots. The structure of each document is classified by a node at depth two in the F&B summary in Figure 14b. If a new document is added below $b_1$, it will either have the structure defined by $a_{[2]}$ or $a_{[7]}$, or a new subtree will be added below $b_{[1]}$. One possibility is to index the F&B summary by hashing each level-2 subtree, as these represent full document structures. When a new document is indexed, a summary of the document structure can be built and hashed, to identify a match in the global F&B index.

TwigVersion [27] is a twig matching approach which introduces a novel two-layer indexing scheme, with an F&B summary of the data and a path summary of the F&B summary. This reduces the expense of matching in the F&B index. But as the authors only compare against twig join algorithms which do not use structural summaries, and also introduce many other ideas, it is hard to assess the usefulness of the two-layer approach itself. They compare their two-layer approach with a pure F&B index, but do not state how they search in it.

A common way to use path summaries is to match each individual root-to-leaf path, and prune away matches which cannot be part of a full match [6, 2]. Another solution, which is more robust for large path summaries, is to label-partition the summary and run a full twig join algorithm on it. In [3] a novel combination of Twig²Stack and TwigStack is used for matching in large path summaries (see Section 3.3).

Opportunity 10 (Exploring summary structures and how to search them). Many twig join algorithms have leveraged the benefits of path summaries. Stronger summaries like the F&B index are not as commonly used, maybe because of their worst-case size and implementational complexity. Using different algorithms to search various types and combinations of summaries has not been thoroughly explored. An evaluation should address the total benefits of different single- and multi-level strategies, but also detail the local cost of specific matching methods in specific summary types of different sizes.

Multi-stream access may be required for a single query node when partitioning on path, as there may be many path matches. One solution is to merge the streams. Another is to have a tag-partitioned store, and to filter the data nodes on matching path IDs [28]. A speedup to this approach is to chain together nodes with the same path [18]. This is also useful when indexing text nodes on value and integrating structure information.

S³ [13] is a twig matching system which takes all possible combinations of individual streams matching query nodes based on prefix path, and solves each combination separately, merging the results of each evaluation. This approach does not give polynomial worst-case bounds.


Blocking is the reason for the sub-optimality of TwigStack. When partial matches must be output to evaluate the data and query in Figure 15a, it is because $b_1$ and $c_1$ block each other's access to $c_2$ and $b_2$, respectively.

Figure 15: Cases of data and query blocking with (a) tag streaming, (b) tag+level streaming, (c) prefix path streaming. Adapted from [6].

Using more fine-grained partitioning and streaming solves some blocking issues, because there are multiple stream heads. A partitioning which refines another always inherits the reduced blocking [6]. In tag+level streaming there is no blocking when only p-c edges are used below the query root [6]. But in mixed queries, such as in Figure 15b, blocking can still occur. There the stream for tag $d$ at level 3 is $[d_1, d_2]$ and the stream for $c$ at level 4 is $[c_1, c_2]$. There are only two matches for the query, and the data nodes $c_1$ and $d_1$ block each other.

Prefix path streaming results in no blocking when there is a single branching node in the query. It solves the case in Figure 15b optimally, but not the one in 15c. There the stream for the path abac is $[c_1, c_2]$, and the stream for the path ababe is $[e_1, e_2]$. Even though $c_2$ is also usable in the match with root $a_1$, it cannot be known whether or not $c_1$ is usable, because $e_1$ blocks $e_2$.

iTwigJoin [6] uses a specialized approach for accessing multiple useful streams, which supports any partitioning scheme. In its variant of getNext($q$), it considers, for each matching stream and for each child of $q$, the streams which are usable together with this stream. This reduces the amount of blocking when more fine-grained partitioning is used. The space usage of iTwigJoin is $O(d)$, and the running time is $O(t(I + O))$ when no blocking occurs, where $t$ is the number of useful streams.

Opportunity 11 (Access methods for multiple matching partitions). Strategies for accessing multiple useful streams include merging, filtering and chaining of input, informed merging access as in iTwigJoin, and merging the output of multiple joins. Many authors do not argue for the rationale of their choice of how to access multiple useful partitions for a node. A new access paradigm is presented in [6], but it is only compared with tag streaming. The benefit of the method for accessing multiple matching streams is not separated from the benefit of reduced total input size. The technique reduces the number of intermediate paths output by phase one in TwigStack, and would undoubtedly reduce the amount of memory needed for intermediate results in time-optimal top-down algorithms like HolisticTwigStack and TwigFast, but it is not certain whether it is a win-win on both memory and speed in practice.

3.6 Virtual Streams

Another approach that can lead to reading less input is using "virtual streams" for internal nodes, by inferring the existence of nodes from their descendants. This requires a position encoding which allows ancestor position reconstruction [26], such as Dewey [25]. Ancestor label paths must also be inferable, and a path summary is an excellent tool for this. Consider the example in Figure 16, where streams of nodes matching leaf label paths are shown. For the node 1.2.1.2 with path (4) a.a.b.a, it can be inferred that one candidate for the query root is 1.2 with path (2) a.a.
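A small sketch of this kind of inference (hypothetical helper names; Dewey numbers represented as tuples): the ancestors of a leaf match are exactly the prefixes of its Dewey number, and the corresponding label path tells which prefixes carry the right label.

    def ancestor_candidates(dewey, label_path, wanted_label):
        # Infer virtual matches for an internal query node from a leaf
        # match. `dewey` is the leaf's Dewey number as a tuple, e.g.
        # (1, 2, 1, 2), and `label_path` the root-to-leaf labels "a.a.b.a".
        # Runs in time linear in the depth of the leaf match.
        labels = label_path.split(".")
        for depth in range(1, len(dewey) + 1):
            if labels[depth - 1] == wanted_label:
                # The length-`depth` prefix is an ancestor-or-self whose
                # ending tag matches the internal query node.
                yield dewey[:depth], ".".join(labels[:depth])

    # For leaf 1.2.1.2 with path a.a.b.a, the candidates for a query node
    # labeled "a" are 1, 1.2 and 1.2.1.2 itself.
    print(list(ancestor_candidates((1, 2, 1, 2), "a.a.b.a", "a")))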

Figure 16: Virtual stream example. (a) Data. (b) Path summary. (c) Extents of summary nodes. (d) Query. (e) Leaf streams.

Virtual Cursors is an implementation of virtual streams using Deweys and path summaries [28]. Generating a "next" match for an internal query node is done by going through the prefixes of a leaf node's Dewey and path, and using those where the ending tag is correct. The search stops when the new Dewey is lexicographically greater than the previous one, meaning later in the pre-order. [28] does not give details on how ancestor candidates are generated, but this can be done in time linear in the depth of the leaf match used.

Forwarding the entire query is done by repeatedly picking a leaf with a "broken" path, forwarding it to containment by the maximal ancestor, and then forwarding all ancestors virtually to contain the leaf. In the system described in [28], tag streaming with path ID filtering was used, and B-trees were used for skipping during leaf forwarding.

Other virtual stream approaches have later been introduced. TJFast [21] is an independently developed algorithm which does not use a structural summary, but stores with each data node the root-to-node label path and the Dewey encoding together in a compressed format. Label paths are matched for each node processed. An improvement over Virtual Cursors as described is that path matching is also done when generating internal nodes, giving fewer useless candidates. Also, non-branching internal nodes can be ignored during query evaluation, because they are implicit from the path matchings of the nodes below and above. TJFast does not produce streams for internal nodes, but maintains sets of currently possible candidates.

TwigVersion [27] and S³ [13] (see Section 3.5) are non-holistic approaches which combine structural summaries and inference of internal node matches. TwigVersion computes sets of matches bottom-up. Each child query node generates a set of candidates for its parent query node based on its own matches, and these sets are then intersected. S³ uses the potentially exponential number of ways a query matches the summary, and evaluates each such match, merging the results. For one summary matching, it looks at the query leaf nodes pairwise, and merge-joins sets based on the lowest common ancestor query nodes. This could give large useless intermediate results. The holistic skipping algorithm TwigOptimal [10] does partially implement virtual streams through its "virtual positions" (see Section 3.4).

Opportunity 12 (Improved virtual streams). To reduce the number of matches and make it possible to ignore non-branching internal nodes, only structurally matching internal nodes should be generated. A structural summary can be used to avoid repeated matching of paths. However, how to store the path matching information is not obvious. Given a matching path for a leaf query node, there may be an exponential number of combinations for matching the query nodes above. Should the matches be calculated on the fly as in TJFast, kept in independent sets for each node above a leaf match, or encoded in stacks? Or is it enough to store candidates for the lowest branching node above each leaf match, if the query nodes on a path are processed bottom-up?
leaf match, if the query nodes on a path are processed bottom-up?<br />

It is common to store Deweys in a succinct format to reduce space usage, but in<br />

addition, some scheme should also be devised to reduce the redundancy <strong>of</strong> using related<br />

Deweys in ascendant <strong>and</strong> descendant nodes. It is also preferable if node encodings do<br />

not have to be fully de-compressed to be compared during query evaluation, but that the<br />

compressed storage format allows for cheap direct comparisons.<br />
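One simple scheme in this spirit (our illustration only, not a proposal from the works cited here) is prefix-delta encoding of consecutive Deweys in a stream, storing each as the shared prefix length plus the remaining suffix:

    def delta_encode(deweys):
        # Encode each Dewey relative to its predecessor as
        # (shared prefix length, remaining suffix components).
        prev = ()
        for d in deweys:
            k = 0
            while k < min(len(prev), len(d)) and prev[k] == d[k]:
                k += 1
            yield k, d[k:]
            prev = d

    def delta_decode(pairs):
        prev = ()
        for k, suffix in pairs:
            d = prev[:k] + tuple(suffix)
            yield d
            prev = d

    # (1,2,1,1) and (1,2,1,3) share a length-3 prefix, so the second
    # Dewey is stored as (3, (3,)).
    codes = list(delta_encode([(1, 2, 1, 1), (1, 2, 1, 3)]))
    assert list(delta_decode(codes)) == [(1, 2, 1, 1), (1, 2, 1, 3)]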

Opportunity 13 (Holistic skipping among leaf streams). In some sense, virtual streams perform skipping by not generating unrelated matches for internal query nodes. The Virtual Cursors algorithm does perform skipping which is "path-holistic", in the way broken root-to-leaf paths are fixed. The order in which leaves are picked is not specified [28], but query node pre-order could have been used. If the lexicographically largest broken leaf were picked, the skipping would become truly holistic.

The work in [28], in combination with some intermediate result handling method from Section 3.2, may be a suitable starting point in the hunt for the "ultimate" twig matching approach, but the work is rarely compared with, or even referenced.

3.7 Query Difficulty Classes

As mentioned in Section 2, the "difficulty" of twig joins comes from the mixture of a-d edges followed by p-c edges in queries, in combination with structurally recursive labels in the data. [24] shows that when an a-d edge is never followed by a p-c edge downwards in a query, the query can be evaluated in linear time and $O(d)$ space. When there in addition is a single return node (as in XPath), it can also be evaluated in $O(1)$ space. If, after combining all the advances listed in this paper, faster evaluation methods still exist for some classes of queries, practical implementations should take advantage of this.

Opportunity 14 (Identifying and using difficulty classes). Can the correctness of using a simpler matching algorithm be decided not from the query alone, but from the query and the data together? Structural summaries give possibilities for this. What happens if there are only single-path candidates for some query nodes? What happens when the tree level for matches of a query node is fixed? What happens if data node matches with given paths for some query nodes fix the path matches for other query nodes?

In [2], additional information is collected in the path summary, noting whether a node always has a given child, and whether there is at most one such child. This information is used there to simplify query evaluation when there are non-return nodes in the query, as in XPath. Could such statistics also allow the detection of more cases where query evaluation can be simplified for general twig matching?

4 Conclusion

We have given a structured analysis of recent advances and identified a number of opportunities for further research, focusing both on join algorithms and on index organization strategies. Hopefully this has given an overview which has brought us one step further towards a unification of the numerous advances in this field.

One conclusion is that, given its sheer volume, it seems nearly impossible to consider all related work when presenting new twig join techniques. The field would benefit greatly from an open source repository of algorithms and data structures.

References

[1] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002.

[2] Andrei Arion, Angela Bonifati, Ioana Manolescu, and Andrea Pugliese. Path summaries and path partitioning in modern XML databases. In Proc. WWW, 2006.

[3] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[4] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002.

[5] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig²Stack: Bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006.

[6] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.

[7] S.-Y. Chien, Z. Vagena, D. Zhang, V. J. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. In Proc. VLDB, 2002.

[8] Byron Choi. What are real DTDs like. Technical Report MS-CIS-02-05, University of Pennsylvania, 2002.

[9] Byron Choi, Malika Mahoui, and Derick Wood. On the optimality of holistic algorithms for twig queries. In Proc. DEXA, 2003.

[10] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing cursor movement in holistic twig joins. In Proc. CIKM, 2005.

[11] Georg Gottlob, Christoph Koch, and Reinhard Pichler. Efficient algorithms for processing XPath queries. In Proc. VLDB, 2002.

[12] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. IEEE Trans. Knowl. Data Eng., 2007.

[13] Sayyed Kamyar Izadi, Theo Härder, and Mostafa S. Haghjoo. S³: Evaluation of tree-pattern XML queries supported by structural summaries. Data Knowl. Eng., 2009.

[14] Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. XR-tree: Indexing XML data for efficient structural joins. In Proc. ICDE, 2003.

[15] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003.

[16] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007.

[17] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[18] Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, and Raghu Ramakrishnan. On the integration of structure indexes and inverted lists. In Proc. SIGMOD, 2004.

[19] Guoliang Li, Jianhua Feng, Yong Zhang, and Lizhu Zhou. Efficient holistic twig joins in leaf-to-root combining with root-to-leaf way. In Proc. Advances in Databases: Concepts, Systems and Applications, 2007.

[20] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.

[21] Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, and Ting Chen. From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005.

[22] Lu Qin. Personal correspondence, 2009.

[23] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007.

[24] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008.

[25] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proc. SIGMOD, 2002.

[26] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval. PhD thesis, Ludwig-Maximilians-Universität München, 2006.

[27] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data Knowl. Eng., 2008.

[28] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[29] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001.

[30] Junfeng Zhou, Min Xie, and Xiaofeng Meng. TwigStack+: Holistic twig join pruning using extended solution extension. Wuhan University Journal of Natural Sciences, 2007.

A Errata

1. The algorithms TwigList [23], HolisticTwigStack [16] and TwigFast [20] are incorrectly referred to as optimal in Figure 8, in Section 3.2, and in Section 3.5. Only the algorithm Twig²Stack [5] is optimal as described.



Paper 5

Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland

Fast Optimal Twig Joins

Proceedings of the 36th International Conference on Very Large Data Bases (VLDB 2010).

Abstract. In XML search systems twig queries specify predicates on node values and on the structural relationships between nodes, and a key operation is to join individual query node matches into full twig matches. Linear-time twig join algorithms exist, but many non-optimal algorithms with better average-case performance have been introduced recently. These use somewhat simpler data structures that are faster in practice, but have exponential worst-case time complexity. In this paper we explore and extend the solution space spanned by previous approaches. We introduce new data structures and improved strategies for filtering out useless data nodes, yielding combinations that are both worst-case optimal and faster in practice. An experimental study shows that our best algorithm outperforms previous approaches by an average factor of three on common benchmarks. On queries with at least one unselective leaf node, our algorithm can be an order of magnitude faster, and it is never more than 20% slower on any tested benchmark query.


1 Introduction

XML has become the de facto standard for storing and transferring semistructured data due to its simplicity and flexibility [6], with XPath and XQuery as the standard query languages. XML documents have tree structure, where elements (tags) are internal tree nodes, and attributes and text values are leaf nodes. Information may be encoded both in structure and in content, and query languages need the expressive power to specify both.

Twig pattern matching (TPM) is an abstract matching problem on trees, which covers a subset of XPath, which again is a subset of XQuery. TPM is important because it represents the majority of the workload in XML search systems [6]. Both data and queries (twigs) in TPM are node-labeled trees, with no distinction between node types. Figure 1 shows a twig query and data with a match. A match is a mapping of query nodes to data nodes that respects the labels and the ancestor-descendant (A–D) and parent-child (P–C) relationships specified by the query edges, represented by double and single lines respectively in the figures here.

Figure 1: Twig query and data with matches.

Twig joins are algorithms for evaluating TPM queries on indexed data, where the index typically has one list of data nodes for each label. A query is evaluated by reading the label-matching data nodes for each query node, and combining these into full query matches. There exist algorithms that perform twig joins in worst-case optimal time [3], but current non-optimal algorithms that use simpler data structures are faster in practice [10, 11].

In this paper we present twig join algorithms that achieve worst-case optimality without sacrificing practical performance. Our main contributions are (i) a classification of filtering methods as weak or strict, and a discussion of how filtering influences practical and worst-case performance; (ii) level split vectors, a data structure yielding linear-time result enumeration with almost no practical overhead; (iii) getPart, a method for merging input streams that gives additional inexpensive filtering and practical speedup; (iv) TJStrictPost and TJStrictPre, worst-case optimal algorithms that unify and extend previous filtering strategies; and (v) a thorough experimental comparison of the effects of combining different techniques. Compared to the fastest previous solution, our best algorithm is on average three times as fast, and never more than 20% slower.

The scope of this paper is twig joins reading simple streams from label-partitioned data. See Section 6 for orthogonal related work that introduces other assumptions on how to partition and access the underlying data.

2 Background

A schema-agnostic system for indexing labeled trees usually maintains one list of data nodes per label. Each node is stored with position information that enables checking A–D and P–C relationships in constant time. A common approach is to assign intervals to nodes, such that containment reflects ancestry. Tree depth can then be used to determine parenthood [15].
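For illustration, a minimal sketch of such an interval ("region") encoding and the constant-time relationship checks (field names are our assumptions, not prescribed by [15]):

    from dataclasses import dataclass

    @dataclass
    class PosNode:
        begin: int  # preorder rank where the node's interval opens
        end: int    # rank where it closes; descendants nest strictly inside
        level: int  # depth in the data tree

    def is_ancestor(a, d):
        # A-D check: a's interval strictly contains d's.
        return a.begin < d.begin and d.end < a.end

    def is_parent(a, d):
        # P-C check: containment plus adjacent levels.
        return is_ancestor(a, d) and a.level + 1 == d.level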

An early approach to twig joins was to evaluate query tree edges separately using binary joins, but when A–D edges are involved, this can give huge intermediate results even when the final set of matches is small [2]. This deficiency led to the introduction of multi-way twig join algorithms. TwigStack [2] can evaluate twig queries without P–C edges in linear time. It only uses memory linear in the maximum depth of the data tree. However, when P–C and A–D edges are mixed, more memory is needed to evaluate queries in linear time [13]. The example in Figure 2 hints at why.

More recent algorithms, which are used as a starting point for our methods, relax the memory requirement to be linear in the size of the input to the join. They follow a general scheme illustrated in Figure 3. The scheme has two phases, where the first phase has two components. The first component merges the stream of data node matches for each query node into a single stream of query and data node pairs. The second component organizes these matches into an intermediate data structure where matched A–D and P–C relationships are registered. This structure is used to enumerate results in the second phase.
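A sketch of the first component under these assumptions (a k-way heap merge of the per-query-node streams into one stream of pairs, using the begin values of the region encoding sketched above):

    import heapq

    def merge_streams(streams):
        # Merge per-query-node match streams into one stream of
        # (query node, data node) pairs ordered by the data nodes' begin
        # values. `streams` maps each query node to an iterator of matches.
        # Ties between equal data nodes must in a real merger be broken by
        # query node order (see Definitions 3 and 5); id(q) is a stand-in.
        heap = []
        for q, it in streams.items():
            first = next(it, None)
            if first is not None:
                heapq.heappush(heap, (first.begin, id(q), first, q, it))
        while heap:
            _, _, d, q, it = heapq.heappop(heap)
            yield q, d
            nxt = next(it, None)
            if nxt is not None:
                heapq.heappush(heap, (nxt.begin, id(q), nxt, q, it))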

The algorithms broadly fall into two categories. So-called top-down and bottom-up algorithms process and store the data nodes in preorder and postorder, respectively, and filter data nodes on matched prefix paths and subtrees before they are added to the intermediate results. Many algorithms use both types of filtering, which means the processing is a hybrid of top-down and bottom-up.

Figure 2: Hard case with restricted memory. It cannot be known whether $b_1, \ldots, b_n$ are useful before $c_{n+1}$ is seen, or whether $c_1, \ldots, c_n$ are useful before $b_{n+1}$ is seen.

Figure 3: Work-flow of twig join algorithms, with input stream merge component, intermediate result construction component, and result enumeration phase.

Twig²Stack [3] was the first linear twig join algorithm. It reorders the input into a single postorder stream to build intermediate results bottom-up. The data nodes matching a query node are stored in a composite data structure (postorder-sorted lists of trees of stacks), as shown in Figure 4a. Matches are added to the intermediate results only if relations to child query nodes are satisfied, and each match has a list of pointers to usable child query node matches.

Figure 4: Intermediate result data structures when evaluating the query in Figure 1. (a) Twig²Stack. (b) TwigList.

HolisticTwigStack [8] uses similar data structures, but builds intermediate results top-down in preorder, and filters matches on whether there is a usable match for the parent query node. It uses the getNext function from the TwigStack algorithm [2] as its input stream merge component, which implements an inexpensive, weaker form of bottom-up filtering. The combined filtering does not give worst-case optimality, but results in faster average-case evaluation of queries than for Twig²Stack [8].

One approach for improving practical performance is using simpler data structures for intermediate results. TwigList [11] evaluates data nodes in postorder like Twig²Stack, but stores intermediate nodes in simple vectors, and does not differentiate between A–D and P–C relationships in the construction phase. Given a parent and child query node, the descendants of a match for the parent are found in an interval in the child's vector, as shown in Figure 4b. Interval start indexes are set as nodes are pushed onto a global stack in preorder, and end indexes are set as nodes are popped off the global stack in postorder. A node is added to the intermediate results if all descendant intervals are non-empty. Compared to Twig²Stack, this gives weaker filtering and non-linear worst-case performance, but is more efficient in practice [11], according to the authors because of less computational overhead and better spatial locality.

TwigFast [10] is an algorithm that uses data structures similar to those of TwigList, but stores data nodes in preorder. It uses the same preorder filtering as HolisticTwigStack, and inherits postorder filtering from the getNext input stream merge component. There are several algorithms that utilize both types of filtering, but among these TwigFast has the best practical performance [1, 3, 10]. Figure 5 gives an overview of twig join algorithms and the various properties that are introduced in Section 3.

    Algorithm           Ref.     Filtering order   Path check   Subtree check   Interm. results   Optimal
    getNext             [2]      GN                none         weak            N/A               N/A
    TwigStack           [2]      GN+pre            strict       weak            complex           no
    Twig²Stack          [3]      post              none         strict          complex           yes
    ½T²S                [3]      pre+post          weak         strict          complex           yes
    ½T²S                [1]      GN+pre+post       weak         strict          complex           yes
    HolisticTwigStack   [8]      GN+pre            weak         weak            complex           no
    TwigList            [11]     post              none         weak            vectors           no
    TwigMix             [10]     GN+pre+post       weak         weak            vectors           incorrect
    TwigFast            [10]     GN+pre            weak         weak            vectors           no
    TJStrictPost        Sect. 4  pre+post          strict       strict          vectors           yes
    TJStrictPre         Sect. 4  GN+pre(+post)     strict       strict          vectors           yes

Figure 5: Previous combinations of prefix path and subtree filtering. The intermediate result storage order is given by the last item in the "filtering order" column. GN is the node order returned by the getNext input stream merger.

3 Premises for Performance

To make algorithms that are both fast in practice and worst-case optimal, we need an understanding of how filtering strategies and data structures impact performance.

For any graph $G$, let $V(G)$ be its node set and $E(G)$ be its edge set. Let a matching problem $M$ be a triple $\langle Q, D, I\rangle$, where $Q$ is a query tree, $D$ is a data tree, and $I \subseteq V(Q) \times V(D)$ is a relation such that for $q \mapsto q' \in I$ the node label of $q$ equals the node label of $q'$. Each edge $\langle p, q\rangle \in E(Q)$ has a label $L(\langle p, q\rangle) \in \{\text{A–D}, \text{P–C}\}$, specifying an ancestor–descendant or parent–child relationship. Let a node map for $M$ be any function $M \subseteq I$. Assume a given $M = \langle Q, D, I\rangle$ when not otherwise specified.

Definition 1 (Weak/strict edge satisfaction). The node map $M$ weakly satisfies a downward edge $e = \langle p, q\rangle \in E(Q)$ iff $M(p)$ is an ancestor of $M(q)$, and strictly satisfies $e$ iff $M(p)$ and $M(q)$ are related as specified by $L(e)$.


Definition 2 (Match). Given subgraphs $Q'' \subseteq Q' \subseteq Q$, the node map $M : V(Q') \to V(D)$ is a weak (strict) match for $Q''$ iff all edges in $Q''$ are weakly (strictly) satisfied by $M$. If $Q'' = Q$ we call $M$ a weak (strict) full match.

Where no confusion arises, the term weak (strict) match may also be used for $M(Q)$. We denote the set of unique strict full matches by $O$. As is common, we view the size of the query as a constant, and call a twig join algorithm linear and optimal if the combined data and result complexity is $O(I + O)$ [3].¹

The results presented in the following all apply to both weak and strict matching, unless otherwise specified. The following lemma implies that we can use filtering strategies that only consider parts of the query.

Lemma 1 (Filtering). If there exists a $Q' \subseteq Q$ containing $q$ where no match $M'$ for $Q'$ contains $q \mapsto q'$, then there exists no match $M$ for $Q$ containing $q \mapsto q'$.

Proof. By contraposition. Given a match $M \ni q \mapsto q'$ for $Q$, for any $Q' \subseteq Q$ containing $q$, the match $M \setminus \{p \mapsto p' \mid p \notin Q'\}$ matches $Q'$ and contains $q \mapsto q'$.

3.1 Preorder Filtering on Matched Paths

Many current algorithms use the getNext input stream merge component [2], which returns data nodes in a relaxed preorder that only dictates the order of matches for query nodes related by ancestry. This is not detailed in the original description [2] and is easy to miss.² The TwigMix algorithm incorrectly assumes strict preorder [10] (see Appendix E).

Definition 3 (Match preorder). The sequence of pairs $q_1 \mapsto q'_1, \ldots, q_k \mapsto q'_k \in V(Q) \times V(D)$ is in global match preorder iff for any $i < j$ either (1) $q'_i$ precedes $q'_j$ in tree preorder, or (2) $q'_i = q'_j$ and $q_i$ precedes $q_j$ in tree postorder. The sequence is in local match preorder if (1) and (2) hold for any $i < j$ where $q_i = q_j$ or $\langle q_i, q_j\rangle \in E(Q)$ or $\langle q_j, q_i\rangle \in E(Q)$.

The following definition formalizes a filtering criterion commonly used when processing data nodes in preorder, local or global.

Definition 4 (Prefix path match). $M$ is a prefix path match for $q_k \in Q$ iff it is a match for the (simple) path $q_1 \ldots q_k$, where $q_1$ is the root of $Q$.

To implement prefix path match filtering, preorder algorithms maintain the set of open nodes, i.e., the ancestors, at the current position in the tree. Most algorithms have one stack of open data nodes for each query node, and given a current pair $q \mapsto q'$ pop non-ancestors of $q'$ from the stacks of $q$ and its parent [2, 8, 10]. Weak filtering can then be implemented by checking if (i) $q$ is the root, or (ii) the stack for the parent of $q$ is non-empty. If $q'$ is not filtered out, it is pushed onto the stack for $q$, and added to the intermediate results. This can be extended to strict checking of P–C edges by inspecting the top node on the parent's stack. Strict prefix path matching is rarely used in practice, as can be seen from the fourth column in Figure 5.

¹ For Twig²Stack the combined data, query and result complexity is $O(I \log Q + I b_Q + OQ)$, where $b_Q$ is the maximum branching factor in the query [3]. The TJStrict algorithms we present in Section 4 have the same complexity when using a heap-based input stream merger, and $O(IQ + OQ)$ complexity when using a getNext-based input stream merger. Note that the total numbers of data node references in the input and output are $|I|$ and $|O| \cdot |Q|$, respectively.

² To be precise, getNext also returns matches for sibling query nodes in order.
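A sketch of the weak prefix path check just described (illustrative structure only, reusing is_ancestor from the region encoding sketch in Section 2):

    def weak_prefix_filter(q, d, stacks, parent):
        # Weak prefix path filtering for a pair q -> d arriving in match
        # preorder. stacks[q] holds the open data node matches for query
        # node q; parent[q] is q's parent query node, or None at the root.
        p = parent[q]
        # Pop matches that are no longer open, i.e., not ancestors of d.
        for s in [stacks[q]] + ([stacks[p]] if p is not None else []):
            while s and not is_ancestor(s[-1], d):
                s.pop()
        # (i) the root always passes; (ii) otherwise some open match for
        # the parent query node must exist (weak: A-D vs P-C is ignored).
        if p is None or stacks[p]:
            stacks[q].append(d)
            return True  # keep d in the intermediate results
        return False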

The implementation of prefix path checks is the reason for the secondary ordering on query node postorder for match pairs in Definition 3. Without the secondary ordering, problems arise when multiple query nodes have the same label: a data node could be misinterpreted as a usable ancestor of itself when checking for non-empty stacks, or hide a proper parent of itself when checking top stack elements.

Algorithms storing intermediate results in postorder use a global stack for the query [3, 11], and inspection of the top stack node cannot be used to implement prefix path matching when the query contains A–D edges, as ancestors may be hidden deep in the stack. Extending these algorithms to implement prefix path filtering requires maintaining additional data structures.

The choice of preorder filtering does not influence optimality, as illustrated by the following example.

Example 1. Assume non-branching data generated by $/(\alpha_1/)^n \ldots (\alpha_m/)^n \beta/\gamma$, and the query $/\!/\alpha_1 /\!/ \ldots /\!/ \alpha_m / \gamma$, where $\alpha_1, \ldots, \alpha_m, \beta, \gamma$ are all distinct labels. Unless it is signaled bottom-up that the pattern $\alpha_m/\gamma$ is not matched below, the result enumeration phase will take $\Omega(n^m)$ time, because all combinations of matches for $\alpha_1, \ldots, \alpha_m$ will be tested.

3.2 Postorder Filtering on Matched Subtrees

The ordering of match pairs required by most bottom-up algorithms is symmetric to the global preorder case:

Definition 5 (Match postorder). A sequence of pairs $q_1 \mapsto q'_1, \ldots, q_k \mapsto q'_k$ is in match postorder iff for any $i < j$ either (1) $q'_i$ precedes $q'_j$ in tree postorder, or (2) $q'_i = q'_j$ and $q_i$ precedes $q_j$ in tree preorder.

The second property is required for the correctness of both Twig²Stack [3], where a data node could hide proper children of itself, and TwigList [11], where a node could be added as a descendant of itself.

Definition 6 (Subtree match). $M$ is a subtree match for $q \in Q$ iff it is a match for the subtree rooted at $q$.

Example 1 also illustrates why strict subtree match checking is required for optimality, because no node labeled $\gamma$ is a direct child of a node labeled $\alpha_m$ in the data. As described in Section 2, Twig²Stack and TwigList respectively implement strict and weak subtree match filtering, and for these algorithms strict filtering is required for optimality. TwigList could be extended to strict filtering by traversing descendant intervals to look for direct children, but this would have quadratic cost if implemented naively, as descendant intervals may overlap.

In algorithms storing data in preorder, nodes are added to the intermediate results after passing the preorder checks. If nodes are stored in arrays, later removing a node failing a postorder check from the middle of an array would incur significant cost. Note that many of the algorithms listed in Figure 5 inherit weak subtree match filtering from the input stream merger used, as described later in Section 4.4.

3.3 Result Enumeration

Even if the strict subtree match checking sketched for TwigList above could be implemented efficiently, results would still not be enumerated in linear time, as usable child nodes may be scattered throughout descendant intervals, as shown by the following example:

Example 2. Assume a tree constructed from the nodes $\{a_1, \ldots, a_n, b_1, \ldots, b_{2n}\}$, labeled $a$ and $b$. Let each node $a_i$ have a left child $b_i$, a right child $b_{n+i}$, and a middle child $a_{i+1}$ if $i < n$. Given the query $/\!/a/b$, each node $a_i$ is part of two matches, one with $b_i$ and one with $b_{n+i}$, but to find these two matches, $2n - 2i$ useless $b$-nodes must be traversed in the descendant interval. This gives a total enumeration cost of $\Omega(n^2)$.

The following theorem formalizes properties of the intermediate result storage in Twig²Stack that are key to its optimality.

Theorem 1 (Linear time result enumeration [3]). The result set $O$ can be enumerated in $\Theta(O)$ time if (i) the data nodes $d$ such that $root(Q) \mapsto d$ is part of a strict full match can be found in time linear in their number, and (ii) given a pair $q \mapsto q'$, for each child $c$ of $q$, the pairs $c \mapsto c'$ that are part of a strict subtree match for $q$ together with $q \mapsto q'$ can be enumerated in time linear in their number.

3.4 Full Matches

When different types of filtering strategies are combined, it may be interesting to know when additional filtering passes will not remove more nodes.

Theorem 2 (Full Match). (i) A pair $q \mapsto q'$ is part of a full match iff (ii) it is part of a prefix path match that only uses pairs that are part of subtree matches.

Proof (sketch). (i $\Rightarrow$ ii) Follows from Lemma 1. (i $\Leftarrow$ ii) Let $M = \langle Q, D, I\rangle$ be the initial matching problem, and $M' = \langle Q, D, I'\rangle$ be the matching problem where $I'$ is the set of pairs that are part of subtree matches in $M$. The theorem is true for pairs with the query root, as for the query root a subtree match is a full match. Assume that there is a prefix path match $M_{\downarrow q} \ni q \mapsto q'$ for $q$ in $M'$, and that $p$ is the parent of $q$. By construction, $M_{\downarrow q} \ni p \mapsto p'$ is also a prefix path match for $p$. We use induction on the query node depth, and prove that if $p \mapsto p'$ is part of a full match $M_p$ for $p$, then $q \mapsto q'$ must be part of a full match for $q$. Let $Q_q$ be the subtree rooted at $q$, and $Q_p = Q \setminus Q_q$. Let $M'_p = M_p \setminus \{r \mapsto r' \mid r \in Q_q\}$. By the assumption $q \mapsto q' \in I'$, there exists a subtree match $M_{\uparrow q} \ni q \mapsto q'$ for $q$. Then the node map $M_q = M'_p \cup M_{\uparrow q}$ is a full match for $q$, because (1) $p \mapsto p' \in M'_p$ and $q \mapsto q' \in M_{\uparrow q}$ must satisfy the edge $\langle p, q\rangle$ as they are used together in $M_{\downarrow q}$, and (2) $Q_p$ and $Q_q$ can be matched independently when the mapping of $p$ and $q$ is fixed.


In other words, if nodes are filtered first in postorder on strictly matched subtrees, and then in preorder on strictly matched prefix paths, the intermediate result set contains only data nodes that are part of the final result. The opposite is not true: in the example in Figure 1, the pair $c \mapsto c_3$ would not be removed if strict prefix path filtering were followed by strict subtree match filtering.

Note that strict full match filtering is not necessary for optimal enumeration by Theorem 1, and the optimal algorithms we present in the following do not use it. They use prefix path match filtering followed by subtree match filtering, where the former is only used to speed up practical performance. On the other hand, the input stream merge component we introduce in Section 4.4 gives inexpensive weak full match filtering.

4 Fast Optimal Twig Joins

In this section we create an algorithmic framework that permits any combination of preorder and postorder filtering. First we introduce a new data structure that enables strict subtree match checking and linear result enumeration.

4.1 Level Split Vectors

A key to the practical performance of TwigList and TwigFast is the storage of intermediate nodes in simple vectors [11], but this scheme makes it hard to improve worst-case behavior. To enable direct access to usable child matches below both A–D and P–C edges, we split the intermediate result vectors for query nodes below P–C edges, such that there is one vector for each data tree level observed, as shown in Figure 6. Given a node in the intermediate results, matches for a child query node below an A–D edge can be found in a regular descendant interval, while the matches for a child query node below a P–C edge can be found in a child interval in a child vector. This vector can be identified by the depth of the parent data node plus one.

In the following we assume that level split vectors can be accessed by level in amortized<br />

constant time. This is true if level split vectors are stored in dynamic arrays, <strong>and</strong> the<br />

depth <strong>of</strong> the deepest data node is asymptotically bounded by the size <strong>of</strong> the input, that<br />

is, d ∈ O(I). If this bound does not hold, which can be the case when |I| ≪ |D|, expected<br />

linear performance can be achieved by storing level split vectors in hash maps.<br />
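The following C++ sketch illustrates one way such a level split store could be laid out, with a dense dynamic-array variant and a hash-map variant for the sparse case; the type and member names are illustrative assumptions, not the implementation used in the experiments.

    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    // Illustrative node record; begin/end/level is the BEL encoding
    // referred to again in Appendix A.
    struct Node {
        int begin, end, level;
    };

    // Intermediate-result store for a query node below a P-C edge:
    // one vector per observed data tree level, so the child matches of
    // a node at level h are looked up in the vector for level h + 1.
    struct LevelSplitVectors {
        std::vector<std::vector<Node>> byLevel;   // dense dynamic array

        std::vector<Node>& vectorFor(int level) {
            // Amortized constant-time access when d is in O(I).
            if (static_cast<std::size_t>(level) >= byLevel.size())
                byLevel.resize(level + 1);
            return byLevel[level];
        }
    };

    // Sparse variant for very deep trees (|I| << |D|): a hash map gives
    // expected constant-time access per level instead.
    struct SparseLevelSplitVectors {
        std::unordered_map<int, std::vector<Node>> byLevel;
        std::vector<Node>& vectorFor(int level) { return byLevel[level]; }
    };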

When nodes are added to the intermediate results in postorder, checking for non-empty descendant and child intervals inductively implies strict subtree match filtering. This is illustrated for our example in Figure 6. As each check takes constant time, the intermediate results can be constructed in Θ(I) time, as for TwigList [11]. Result enumeration with this data structure is a trivial recursive iteration through nodes and their child and descendant intervals, which is almost identical to the enumeration in TwigList. The difference is that no extra check is necessary for P–C relationships, and that the result enumeration is Θ(O) by Theorem 1 when strict subtree match filtering is applied.


[Figure: intermediate result vectors b₁ b₂ b₄ b₃ b₅ b₆ and a₄ a₁, with level split vectors 1: c₁; 2: c₂; 3: c₅; 4: c₄ c₆; 5: c₃.]

Figure 6: Intermediate data structures using level split vectors and strict subtree match filtering. As opposed to in Figure 4b, a₂ does not satisfy.

4.2 The TJStrictPost Algorithm

Algorithm 1 shows the general framework we use for postorder construction of intermediate results, extending algorithms like TwigList [11]. It allows any combination of the preorder and postorder checks described in Section 3, from none to weak and strict, and allows using either simple vectors or level split vectors. A global stack is used to maintain the set of open data nodes, and if prefix path matching is implemented, a local stack for each query node is maintained in parallel. The input stream merge component used is a priority queue implemented with a binary heap. The postorder storage approach used here requires global ordering, and cannot read local preorder input (see Appendix E).
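To make the merge component concrete, the following minimal C++ sketch shows one way such a binary-heap merger over per-query-node streams could look; the Stream and HeapMerger types and all names are illustrative assumptions, not the thesis implementation.

    #include <cstddef>
    #include <queue>
    #include <vector>

    struct Node { int begin, end, level; };   // BEL-encoded position

    // One preorder stream of label-matching data nodes per query node.
    struct Stream {
        std::vector<Node> nodes;   // sorted by begin value (preorder)
        std::size_t pos = 0;
        bool eof() const { return pos == nodes.size(); }
        const Node& head() const { return nodes[pos]; }
        void fwd() { ++pos; }
    };

    // Binary-heap merge of all streams into one global preorder
    // sequence of (query node, data node) pairs.
    struct HeapMerger {
        struct Entry { int begin; int queryNode; };
        struct Later {
            bool operator()(const Entry& a, const Entry& b) const {
                return a.begin > b.begin;   // min-heap on preorder position
            }
        };
        std::vector<Stream>& streams;
        std::priority_queue<Entry, std::vector<Entry>, Later> heap;

        explicit HeapMerger(std::vector<Stream>& s) : streams(s) {
            for (std::size_t q = 0; q < s.size(); ++q)
                if (!s[q].eof()) heap.push({s[q].head().begin, (int)q});
        }

        // Returns false at Eof; otherwise yields the globally next pair.
        bool next(int& queryNode, Node& d) {
            if (heap.empty()) return false;
            Entry e = heap.top(); heap.pop();
            Stream& s = streams[e.queryNode];
            queryNode = e.queryNode;
            d = s.head();
            s.fwd();
            if (!s.eof()) heap.push({s.head().begin, e.queryNode});
            return true;
        }
    };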

Algorithm 1 Postorder construction.

While ¬Eof:
    Read next q ↦ d.
    While non-ancestors of d on global stack:
        Pop q′ ↦ d′ from global and local stack.
        If q′ ↦ d′ satisfies postorder checks:
            Set interval end index for d′ in the vector of each child of q′.
            Add d′ to intermediate results for q′.
    If d satisfies preorder checks:
        For each child of q, set interval start index for d.
        Push q ↦ d on global stack.
        Push d on local stack for q.
Clean remaining nodes from the stacks.
Enumerate results.

The correctness of Algorithm 1 follows from the correctness of the filtering strategies described in Sections 3.1 and 3.2, and the correctness of TwigList [11], with the enumeration algorithm trivially extended to use child intervals when level split vectors are used.

We now define the TJStrictPost algorithm, which builds on this framework. Detailed pseudocode can be found in Appendix A. The algorithm uses level split vectors, and, as opposed to the previous twig join algorithms listed in Figure 5, it includes strict checking of both matched prefix paths and subtrees. The former is implemented by checking the top data node on the local stack of the parent query node, while the latter is implemented by checking for non-empty child and descendant intervals. A Θ(I + O) running time follows from the discussion in Section 4.1.

4.2.1 A note on TwigList:

As noted in the original description, chaining nodes with the same tree level into linked lists inside descendant intervals can improve practical performance in TwigList [11]. However, as the first child match with the correct level must still be searched for, further changes are needed to achieve linear worst-case evaluation. This can be implemented by maintaining a vector for each query node with the previous match on each tree level at any time. A node must then be given pointers to such previous matches as it is pushed onto the stack in TwigList. When the node is popped off the stack, it can then be checked whether any children have been found, and intermediate results can be enumerated in linear time, assuming that d ∈ O(I), as for our solution.
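A minimal C++ sketch of the suggested bookkeeping is given below; the Match and QueryNodeResults names are illustrative, not taken from TwigList itself.

    #include <cstddef>
    #include <vector>

    // Each stored match links to the previous match on the same tree
    // level, recorded at push time from a per-level "latest match" table.
    struct Match {
        int begin, end, level;
        int prevSameLevel;   // index of previous same-level match, or -1
    };

    struct QueryNodeResults {
        std::vector<Match> matches;     // in insertion (stack push) order
        std::vector<int> lastOnLevel;   // level -> index of latest match

        void push(int begin, int end, int level) {
            if (static_cast<std::size_t>(level) >= lastOnLevel.size())
                lastOnLevel.resize(level + 1, -1);
            matches.push_back({begin, end, level, lastOnLevel[level]});
            lastOnLevel[level] = static_cast<int>(matches.size()) - 1;
        }
    };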

4.3 The TJStrictPre Algorithm

Algorithm 2 shows the general framework we use to construct intermediate results in preorder, extending algorithms like TwigFast [10]. It supports any combination of preorder and postorder filtering, simple or level split vectors, and input in global or local preorder. As opposed to with postorder storage, nodes are inserted directly into intermediate result vectors after they have passed a prefix path check. Local stacks store references to open nodes in the intermediate results. If strict subtree match filtering is required, or weak subtree match filtering is not implied by the input stream merger, intermediate results are filtered bottom-up in a post-processing pass.

TwigFast is reported to have faster average-case query evaluation than previous twig joins [10], and we hope to match this performance in a worst-case linear algorithm. The TJStrictPre algorithm is similar to TJStrictPost: it uses strict checking of prefix paths and subtrees, and stores intermediate results in level split vectors. See detailed pseudocode in Appendix B. TJStrictPre uses the getPart input stream merger, which is an improvement of getNext described in Section 4.4. If the post-processing pass can be performed in linear time, then the algorithm can evaluate twig queries in Θ(I + O) time, by the same argument as for TJStrictPost.

The filtering pass is implemented by cleaning intermediate result vectors bottom-up in the query, in-place overwriting data nodes not satisfying subtree matches, as described in detail in Appendix D. The indexes of kept nodes are stored in a separate vector for each query node, and are used to translate old start and end indexes into new positions.


Algorithm 2 Preorder construction.

While ¬Eof:
    Read next q ↦ d.
    For the stack of both q's parent and q itself:
        Pop non-ancestors of d, and set their end indexes.
    If d satisfies preorder checks:
        For each child of q, set interval start index for d.
        Add d to intermediate results for q.
        Push reference to d on stack for q.
Clean stacks.
Clean intermediate results with postorder checks.
Enumerate results.

To achieve linear traversal, the intermediate result vector of a node is traversed in parallel with the index vectors of the children after they have been cleaned. For level split vectors, there is one separate index vector per used level, and a separate vector iterator is used per level when the parent is cleaned. Also, there is an array giving the level of each stored data node in preorder, such that split and non-split child vectors can be traversed in parallel. Start values are updated as nodes are pushed onto a stack in preorder, while end values are updated as nodes are popped off in postorder.
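The following simplified C++ sketch, which ignores level splitting, illustrates the two ingredients: in-place compaction that records the surviving original indexes, and a monotone iterator that translates old interval start indexes; cleanInPlace and IndexTranslator are hypothetical names introduced only for this sketch.

    #include <cstddef>
    #include <vector>

    // In-place compaction: keep entries passing `ok`, and record the
    // ORIGINAL index of each survivor (the C vector of Appendix D), so
    // old interval endpoints can be translated afterwards.
    template <typename T, typename Pred>
    std::vector<std::size_t> cleanInPlace(std::vector<T>& v, Pred ok) {
        std::vector<std::size_t> kept;
        std::size_t j = 0;
        for (std::size_t i = 0; i < v.size(); ++i)
            if (ok(v[i])) { v[j++] = v[i]; kept.push_back(i); }
        v.resize(j);
        return kept;
    }

    // Monotone translation of old start indexes into the cleaned
    // vector: amortized constant work per call when parent matches are
    // visited in order (cf. FwdIter in Appendix D).
    struct IndexTranslator {
        const std::vector<std::size_t>& kept;
        std::size_t it = 0;
        std::size_t fwd(std::size_t oldPos) {
            while (it < kept.size() && kept[it] < oldPos) ++it;
            return it;   // new index of first survivor at or after oldPos
        }
    };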

4.4 The getPart Input Stream Merger

The getNext input stream merge component implements weak subtree match filtering in Θ(I) time, and is used to improve practical performance in many current algorithms using preorder storage [8, 10]. Assume in the following discussion that there is one preorder stream of label-matching data nodes associated with each query node. The input stream merger repeatedly returns pairs containing a query node and the data node at the head of its stream, implementing Comp. 1 in Figure 3.

The getNext function processes the query bottom-up to find a query node that satisfies the following three properties: (1) when its stream is forwarded at least until its head follows the heads of the streams of the children in postorder, it still precedes them in preorder; (2) all children satisfy properties 1 and 2; and (3) if there is a parent, it does not satisfy 1 and 2. Property 2 implies that weak subtree filtering is achieved, and Property 3 implies that local preorder by Definition 3 is achieved.

The procedure is efficient if leaf query nodes have relatively few matches, which can be the case in practice in XML search when all query leaf nodes are selective text value predicates. However, if the internal query nodes are more selective than the leaf nodes, or if not all leaves are selective, the overhead of using the getNext function may outweigh the benefits.

To improve practical performance we introduce the getPart function, which requires the following property in addition to the above three: (4) if there is a parent, then the current head of stream is a descendant of a data node that was the head of stream for the parent in some previous subtree match for the parent. This inductively implies that nodes returned are also weak prefix path matches, and from the ordering of the filtering steps, the result is weak full match filtering by Theorem 2. To allow forwarding streams to find such nodes, the algorithm can no longer be stateless, as shown by the following example:

Example 3. Assume that the heads of the streams for query nodes a₁ and b₁ in Figure 1 are a₃ and b₂, respectively. Then it cannot be known by only inspecting heads of streams whether or not any usable ancestors of b₂ were seen before a₃, and b₂ must be returned regardless.

Property 4 is implemented in getPart by maintaining, for each query node, the latest data node in tree postorder that has been part of a weak full match. This value is updated when a query node is found to satisfy all four properties. To ensure progress, streams are forwarded top-down in the query to match the stored value or the current head for the parent node. Note that multiple top-down and bottom-up passes may be needed to find a satisfying node, but each such pass forwards at least one stream past useless matches. See detailed pseudocode in Appendix C.

5 Experiments

The following experiments explore the effects of weak and strict matching of prefix paths and subtrees, different input stream merge functions, and level split vectors.

We have used the DBLP, XMark and Treebank benchmark data, and run the commonly used DBLP queries from the PRIX paper [12], the XMark queries from the XPathMark suite part A [5] (except queries 7 and 8, which are not twigs), and the Treebank queries from the TwigList paper [11]. In addition, we have created some artificial data and queries. Details can be found in Appendix F. The experiments were run on a computer with an AMD Athlon 64 3500+ processor and 4 GB of RAM. All queries were warmed up by 3 runs and then run 100 times, or until at least 10 seconds had passed, measuring average running time.

All algorithms are implemented in C++, and features are turned on or off at compile time to make sure the overhead of complex methods does not affect simpler methods. Feature combinations are coded with 5-letter tags. We use Heap, getNext and getPart for merging the input streams, and store intermediate results in prEorder or pOstorder. We use no (-), Weak or Strict prefix path match filtering, and no (-), Weak or Strict subtree match filtering. Intermediate results are stored in simple vectors (-) or Level split vectors. The previous algorithms TwigList and TwigFast are denoted by HO-W- and NEWW-, respectively, while TJStrictPost and TJStrictPre are denoted by HOSSL and PESSL. Note that filtering checks are not performed in intermediate result construction if the given filtering level is already achieved by the input stream merger. Strict subtree match filtering is implemented by descendant interval traversal when not using level split vectors. With preorder storage an extra filtering pass is used to implement subtree match filtering. Worst-case optimal algorithms match the pattern ***SL.


We present no performance comparisons with the Twig²Stack algorithm because it does not fit into our general framework. Getting an accurate and fair comparison would be an exercise in independent program optimization. TwigList is previously reported to be 2–8 times faster than Twig²Stack for queries similar to ours [11].

5.1 Checked Paths and Subtrees

Figure 7 shows results for running the XMark query //annotation/keyword, with cost divided into different components and phases. Filtering lowers the cost of building intermediate data structures because their size is reduced, and the cost of enumerating results because redundant traversal is avoided. Note that this query was chosen because it shows the potential effects of prefix path and subtree match filtering, and it may not be representative.

[Figure: bar chart for methods HO---, HO-W-, HO-S-, HO-SL, HOW--, HOWW-, HOWS-, HOWSL, HOS--, HOSW-, HOSS- and HOSSL; x-axis 0–0.1 seconds; bars split into merge, construct and enumerate phases.]

Figure 7: Query //annotation/keyword on XMark data. Cost divided into merging input, building intermediate results, and result enumeration.

Figure 8 shows the effects of prefix path vs. subtree match filtering averaged over all queries on DBLP, XMark and Treebank. Heap input stream merging was used because it allows all filtering levels, and postorder storage was used to avoid extra filtering passes. Each timing has been normalized to 1 for the fastest method for each query. Raw timings are listed in Appendix G. As opposed to in Figure 7, there is on average little difference between the methods using at least some filtering both in preorder and postorder. The benefits of asymptotic constant time checks when using level split vectors seem to be outweighed by the cost of maintaining and accessing them, but only by a small margin.

5.2 Reading Input

Figure 9 shows the effect of using different input stream mergers. The labels in the artificial data tree used are Zipf distributed (s = 1), with a, b, y and z being the labels on 30%, 13%, 1.0% and 1.0% of the nodes, respectively. The data and queries were chosen to shed light on both the benefits and the possible overhead of using the advanced input methods.

For the first query, //a/b[y][z], the leaves are very selective, and both getNext and getPart very efficiently filter away most of the nodes. The input stream merging is slightly more efficient for the simpler getNext.


                          Prefix path
                -             W             S            all
              avg   max     avg   max     avg   max    avg   max
Subtree  --   1.84  3.5     1.29  1.61    1.23  1.62   1.45  3.5
         W-   1.24  1.45    1.03  1.13    1.01  1.06   1.09  1.45
         S-   1.24  1.46    1.03  1.10    1.02  1.08   1.10  1.46
         SL   1.32  1.55    1.09  1.19    1.06  1.20   1.16  1.55
         all  1.41  3.5     1.11  1.61    1.08  1.62   1.20  3.5

Figure 8: The effect of parent match filtering vs. child match filtering, running DBLP, XPathMark and Treebank queries. Query times are normalized to 1 for the fastest method for each query; the table shows the arithmetic mean and maximum of the normalized times.

[Figure: bar charts for methods HO---, HE---, NE-W-, PEWW-, HOSSL, HESSL, NESSL and PESSL; panel (a) x-axis 0–3 s, panel (b) 0–2 s, panel (c) 0–15 s.]

Figure 9: Running queries on Zipf data. (a) Selective leaves. (b) Selective internal nodes. (c) No selective nodes.

In the second query, //y/z[a][b], the internal nodes are selective, while the leaves are not. Here getPart efficiently filters away many nodes, while getNext does not, making it even slower than the simple heap due to the additional complexity. The third query shows a case where getPart performs worse than both the other methods. In this query, //a[a[a][a]][a[a][a]], all query nodes have very low selectivity, and are equally probable. The filtering has almost no effect, and only causes overhead. Note the cost difference between HOSSL and HESSL, which is due to the additional filtering pass over the large intermediate results.

5.3 Combined Benefits

Figure 10 shows the effects of combining different input stream mergers and additional filtering strategies. The same queries as in Figure 8 are evaluated, and the first column shows the same tendencies: there is not much difference between the strategies as long as at least weak match filtering is performed on both prefix paths and subtrees.


                           Input stream merger
                  HO          HE          NE          PE           all
                avg  max    avg  max    avg  max    avg   max    avg  max
Match    ---    7.0  33     6.6  30                              6.8  33
filter.  -W-    4.5  23     7.5  34     3.7  19                  5.2  34
         -S-    4.5  23     7.5  34     3.7  18                  5.2  34
         -SL    4.8  25     8.3  38     3.8  19                  5.6  38
         W--    4.9  24     4.9  23                              4.9  24
         WW-    3.7  19     5.4  25     3.2  15     1.02  1.11   3.3  25
         WS-    3.7  19     5.4  26     3.2  15     1.04  1.15   3.3  26
         WSL    3.9  20     5.7  27     3.2  15     1.08  1.22   3.5  27
         S--    4.8  24     4.8  24                              4.8  24
         SW-    3.7  19     5.2  26     3.2  15     1.03  1.12   3.3  26
         SS-    3.7  19     5.1  26     3.2  15     1.05  1.17   3.3  26
         SSL    3.9  20     5.5  27     3.2  15     1.05  1.20   3.4  27
         all    4.4  33     6.0  38     3.4  19     1.04  1.22   4.1  38

Figure 10: Input mergers vs. filtering strategies.

In the second column all methods using any subtree match filtering are more expensive, because with preorder storage, subtree match filtering is performed in a second pass over the intermediate results. A second pass is also used for subtree match filtering in the third and fourth columns, but in practice the getNext and getPart components have already filtered away more nodes, the intermediate results are smaller, and the second pass is less expensive.

Note the difference between using getNext and getPart. The new method is more than three times as fast on average, and is more than one order of magnitude faster for queries where only some of the leaf nodes are selective. The getPart function also fast-forwards through useless matches for the leaves that are not selective, while getNext passes all leaf matches on to the intermediate result construction component. Also note that the overhead of using PESSL, the fastest worst-case optimal method, is at most 20% in any benchmark query tested.

6 Related Work

This work is based on the assumption that label partitioning and simple streams are used. Orthogonal previous work investigates how the underlying data can be accessed. If the streams support skipping, both unnecessary I/O and computation can be avoided [7]. Our getPart algorithm, which is detailed in Appendix C, can be modified to use any underlying skipping technology by changing the implementation of FwdToAncOf() and FwdToDescOf(). Refined partitioning schemes with structure indexing can be used to reduce the number of data nodes read for each query node [4, 9]. Our twig join algorithms are independent of the partitioning scheme used, assuming multiple partition blocks matching a single query node are merged when read. Another technique is to use a node encoding that allows reconstruction of data node ancestors, and use virtual streams for the internal query nodes [14]. Our getPart algorithm could be changed to generate virtual internal query node matches from leaf query node matches, as complete query subtrees are always traversed. For a broader view on XML indexing see the survey by Gou and Chirkova [6]. XPath queries can be rewritten to use only the axes self, child, descendant, and following [16]. To add support for the following axis, we would have to add additional logic for how to forward streams, and modify the data structures to store start indexes for the new relationship.

7 Conclusions and Future Work

In this paper we have shown how worst-case optimality and fast evaluation in practice can be combined in twig joins. We have performed experiments that span and extend the space of the fastest previous solutions. For common benchmark queries our new, worst-case optimal algorithms are on average three times as fast as earlier approaches. Sometimes they are more than an order of magnitude faster, and they are never more than 20% slower.

In future work we would like to combine the new techniques with previous orthogonal techniques such as skipping, refined partitioning and virtual streams. Also, it would be interesting to see an elegant worst-case linear algorithm reading local preorder input and producing preorder sorted results, that does not perform a post-processing pass over the data, and does not need the assumption d ∈ O(I).

Acknowledgments. This material is based upon work supported by the iAd Project funded by the Research Council of Norway, and the Norwegian University of Science and Technology. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of the funding agencies.

References

[1] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[2] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002.

[3] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig²Stack: Bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006.

[4] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.

[5] Massimo Franceschet. XPathMark web page. http://sole.dimi.uniud.it/~massimo.franceschet/xpathmark/.

[6] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. IEEE Trans. Knowl. and Data Eng., 2007.

[7] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003.

[8] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007.

[9] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[10] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.

[11] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007.

[12] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer sequences. In Proc. ICDE, 2004.

[13] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008.

[14] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[15] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001.

[16] Ning Zhang, Varun Kacholia, and M. Tamer Özsu. A succinct physical storage scheme for efficient evaluation of path queries in XML. In Proc. ICDE, 2004.


A TJStrictPost Pseudocode

Algorithm 3 shows more detailed pseudocode for the TJStrictPost algorithm described in Section 4.2. Tree positions are assumed to be encoded using BEL (begin, end, level) [15]. The function GetVector(q, level) returns the regular intermediate result vector if q is below an A–D edge, or the split vector given by level if q is below a P–C edge.
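For reference, the BEL predicates the pseudocode relies on can be sketched in C++ as follows; this restates the standard containment encoding [15] rather than thesis code.

    // The BEL containment encoding [15]: u is an ancestor of v exactly
    // when u.begin < v.begin and v.end < u.end; a parent additionally
    // satisfies u.level + 1 == v.level.
    struct Bel {
        int begin, end, level;
    };

    inline bool isAncestor(const Bel& u, const Bel& v) {
        return u.begin < v.begin && v.end < u.end;
    }

    inline bool isParent(const Bel& u, const Bel& v) {
        return isAncestor(u, v) && u.level + 1 == v.level;
    }

    // Preorder compares begin values, postorder compares end values --
    // which is why ProcessGlobalDisjoint below can pop non-ancestors
    // of d by comparing stacked end values against d.end.
    inline bool precedesInPreorder(const Bel& u, const Bel& v) {
        return u.begin < v.begin;
    }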

Algorithm 3 TJStrictPost

1:  function EvaluateGlobal():
2:      (c, q, d) ← MergedStreamHead()
3:      while c ≠ Eof:
4:          ProcessGlobalDisjoint(d)
5:          if Open(q, d):
6:              Push(S_global, (d.end, q))
7:          (c, q, d) ← MergedStreamHead()
8:      ProcessGlobalDisjoint(∞)
9:  function ProcessGlobalDisjoint(d):
10:     while S_global ≠ ∅ ∧ Top(S_global).end < d.end:
11:         (·, q) ← Pop(S_global)
12:         Close(q)
13: function Open(q, d):
14:     if CheckParentMatch(q, d):
15:         u ← new Intermediate(d)
16:         MarkStart(q, u)
17:         Push(S_local[q], u)
18:         return true
19:     else:
20:         return false
21: function Close(q):
22:     u ← Pop(S_local[q])
23:     MarkEnd(q, u)
24:     if CheckChildMatch(q, u):
25:         Append(GetVector(q, u.d.level), u)
26: function MarkStart(q, u):
27:     for r ∈ q.children:
28:         u.start[r] ← GetVector(r, u.d.level + 1).size + 1
29: function MarkEnd(q, u):
30:     for r ∈ q.children:
31:         u.end[r] ← GetVector(r, u.d.level + 1).size
32: function CheckParentMatch(q, d):
33:     if Axis(q) = "//":
34:         return IsRoot(q) ∨ S_local[Parent(q)] ≠ ∅
35:     else:
36:         if IsRoot(q): return d.level = 1
37:         else: return S_local[Parent(q)] ≠ ∅ ∧ d.level = Top(S_local[Parent(q)]).level + 1
38: function CheckChildMatch(q, u):
39:     for r ∈ q.children:
40:         if u.end[r] < u.start[r]: return false
41:     return true


B TJStrictPre Pseudocode

Algorithm 4 shows more detailed pseudocode for the TJStrictPre algorithm described in Section 4.3.

Algorithm 4 TJStrictPre

1:  function EvaluateLocalTopDown():
2:      (c, q, d) ← MergedStreamHead()
3:      while c ≠ Eof:
4:          if ¬IsRoot(q):
5:              ProcessLocalDisjoint(Parent(q), d)
6:          ProcessLocalDisjoint(q, d)
7:          Open(q, d)
8:          (c, q, d) ← MergedStreamHead()
9:      for q ∈ Q:
10:         ProcessLocalDisjoint(q, ∞)
11:     FilterPass(Q.root)
12: function ProcessLocalDisjoint(q, d):
13:     while Top(S_local[q]).d.end < d.end:
14:         Close(q)
15: function Open(q, d):
16:     if CheckParentMatch(q, d):
17:         u ← new Intermediate(d)
18:         V ← GetVector(q, d.level)
19:         if ¬IsLeaf(q):
20:             MarkStart(q, u)
21:             Push(S_local[q], (V, V.size))
22:         Append(V, u)
23:         return ¬IsLeaf(q)
24:     else:
25:         return false
26: function Close(q):
27:     if ¬IsLeaf(q):
28:         (V, i) ← Pop(S_local[q])
29:         MarkEnd(q, V[i])

C GetPart Function

Pseudocode for the getPart function is shown in Algorithm 5; what is conceptually different from the previous getNext function is colored dark blue.

GetPart forwards nodes both to catch up with the parent and child streams, whereas getNext only does the latter. The getNext algorithm is completely stateless, and only inspects stream heads. When a match for a query subtree is found, the stream for the subtree root node is read and forwarded. Then it is not possible to know, in the next call on this point in the query, whether the child subtrees were once part of a match or not. In the getPart function we save one extra value per query node, stored in the M array, namely the latest match in the tree postorder which was part of a weak match for the entire query. When considering a query subtree, the currently interesting data nodes are those that are either part of a match using a previous head in the parent stream, or part of a new match using the current head in the parent stream (see Lines 9–13).


Algorithm 5 GetPart

1:  function MergedStreamHead():
2:      while true:
3:          (c, d, q) ← GetPart(Q.root)
4:          if c ≠ MisMatch:
5:              if c ≠ Eof:
6:                  Fwd(q)
7:              return (c, d, q)
8:  function GetPart(q):
9:      if ¬IsRoot(q):
10:         p ← Parent(q)
11:         FwdToDescOf(q, M[p])
12:         if ¬Eof(q) ∧ ¬Eof(p) ∧ M[p].end < H(q).end:
13:             FwdToDescOf(q, H(p))
14:     if IsLeaf(q):
15:         if Eof(q): return (Eof, ⊥, q)
16:         else: return (Match, H(q), q)
17:     (d_min, q_min) ← (∞, ⊥) ; (d_max, q_max) ← (0, ⊥)
18:     for r ∈ q.children:
19:         (c_r, d_r, q_r) ← GetPart(r)
20:         if c_r ≠ Eof:
21:             if c_r = MisMatch: flag MisMatch
22:             elif q_r ≠ r: return (Match, d_r, q_r)
23:             if d_r.begin < d_min.begin: (d_min, q_min) ← (d_r, q_r)
24:             if d_r.begin > d_max.begin: (d_max, q_max) ← (d_r, q_r)
25:         else:
26:             FwdToEof(q)
27:     if q_min = ⊥: return (Eof, ⊥, q)
28:     FwdToAncOf(q, d_max)
29:     if flagged MisMatch:
30:         return (MisMatch, ⊥, q_r)
31:
32:     if ¬Eof(q) ∧ H(q).begin < d_min.begin:
33:         if IsRoot(q) ∨ H(q).end < M[p].end:
34:             if M[q].end < H(q).end: M[q] ← H(q)
35:             return (Match, H(q), q)
36:     else:
37:         if d_min.begin < M[q].end:
38:             return (Match, d_min, q_min)
39:         else:
40:             if Eof(q): return (Eof, ⊥, q)
41:             else: return (MisMatch, ⊥, q)
42: function FwdToEof(q):
43:     while ¬Eof(q): Fwd(q)
44: function FwdToDescOf(q, d):
45:     while ¬Eof(q) ∧ H(q).begin ≤ d.begin: Fwd(q)
46: function FwdToAncOf(q, d):
47:     while ¬Eof(q) ∧ H(q).end ≤ d.begin: Fwd(q)

The forwarding of streams based on child stream heads is very similar to that in getNext (Lines 17–28). Unless the search is short-circuited (Line 22), the stream is forwarded at least until the head is an ancestor of all the child heads (Line 28). The query node itself is returned if an ancestor of the child heads was found, and unless the previous M value is an ancestor of the current head, it is updated. When a child query node is returned, it is known whether or not it is part of a match.³

D Extra Filtering Pass

Algorithm 6 gives pseudocode for the extra filtering pass used to obtain strict subtree match filtering when using preorder storage in TJStrictPre.

Algorithm 6 FilterPass

1:  function CleanUp(q):
2:      if Axis(q) = "//":
3:          CleanUpVector(GetVector(q, ·), C[q])
4:      else:
5:          for h ∈ used levels:
6:              CleanUpVector(GetVector(q, h), C_h[q])
7:  function CleanUpVector(V, C):
8:      i ← j ← 0
9:      while i < V.size:
10:         if CheckChildMatch(q, V[i]):
11:             V[j] ← V[i]
12:             Append(C, i)
13:             j ← j + 1
14:         i ← i + 1
15:     Resize(V, j)
16: function FilterPass(q):
17:     if IsLeaf(q): return
18:     for r ∈ q.children:
19:         FilterPass(r)
20:     if NonLeafChildren(q) ≠ ∅:
21:         for u ∈ AllNodes(q):
22:             FilterPassPost(q, u.d)
23:             for r ∈ NonLeafChildren(q):
24:                 u.start[r] ← FwdIter(r, u.start[r], u.d)
25:             Push(S_local[q], u)
26:         FilterPassPost(q, ∞)
27:     CleanUp(q)
28: function FilterPassPost(q, d):
29:     while S_local[q] ≠ ∅ ∧ Top(S_local[q]).end < d.end:
30:         u ← Pop(S_local[q])
31:         for r ∈ NonLeafChildren(q):
32:             u.end[r] ← FwdIter(r, u.end[r], u.d)
33: function FwdIter(q, pos, d):
34:     if Axis(q) = "//":
35:         while I[q] < C[q].size ∧ C[q][I[q]] < pos:
36:             I[q] ← I[q] + 1
37:         return I[q]
38:     else:
39:         h ← d.level + 1
40:         while I_h[q] < C_h[q].size ∧ C_h[q][I_h[q]] < pos:
41:             I_h[q] ← I_h[q] + 1
42:         return I_h[q]

³ Thanks to Radim Bača for pointing out an error in the originally published pseudocode, where Lines 30–31 in Algorithm 5 were different.


During the clean-up (Lines 1–15), nodes failing checks are overwritten, and the C vectors record which values were not dropped. The query nodes are visited bottom-up by the FilterPass function, updating the vectors of one query node at a time, based on the cleanup in non-leaf child query nodes. The FilterPass and FilterPassPost functions go through all data nodes in preorder and postorder respectively, updating interval start and end indexes.

The AllNodes call returns a special iterator to all intermediate data nodes for a query node, sorted in total preorder. For query nodes with an incoming P–C edge and level split vectors, the order in which nodes were inserted on different levels was recorded during construction in TJStrictPre. Details are omitted in Algorithm 4, where an extra statement must be added after Line 22, storing a reference to the used vector.

The FwdIter function contains the logic for updating the start and end indexes. Each query node has an iterator for each vector, which is utilized when traversing the matches for the parent query node. In essence, the segments of child and descendant intervals which contain references to nodes that were not saved during a cleanup pass are discarded.

E GetNext and Postorder

Many algorithms use the getNext function [2] for merging the input streams instead of a heap or linear scan [8, 10], because it cheaply filters away many useless nodes by implementing weak subtree match filtering. In this appendix we show why using getNext with postorder intermediate result construction gives problems regardless of whether local or global stacks are used.

The getNext function does not return the data nodes in strict preorder, as assumed in the correctness proof for the TwigMix algorithm [10], but in local preorder (see Definition 3). As explained in Section 3.1, the top-down algorithms using getNext maintain one stack or equivalent structure for each internal query node. When a new match for a given query node is seen, the typical strategy [2] is popping non-ancestor nodes off the parent query node's stack, and the query node's own stack.

If a global stack is combined with using getNext input, errors may occur, as for the example query and data in Figure 11. With local preorder the ordering between the nodes not related through ancestry is not fixed. Assume that the ordering is

〈a₁ ↦ a₁, a₁ ↦ a₂, b₁ ↦ b₁, c₁ ↦ c₁, d₁ ↦ d₁, c₁ ↦ c₂, e₁ ↦ e₁〉.

Then c₁ ↦ c₂ will pop a₁ ↦ a₂ off the stack before e₁ ↦ e₁ is observed, and e₁ ↦ e₁ will never be added as a descendant of a₁ ↦ a₂.

But using local stacks and a bottom-up approach also gives errors, because data nodes are added to the intermediate structures as they are popped off the stack when using postorder storage, which is too late when using getNext and local stacks. If the typical approach is used, c₁ ↦ c₂ will only pop b₁ off the stack of b₁ and c₁ off the stack of c₁. Then d₁ ↦ d₁ will never be added to the child structures of b₁ ↦ b₁, because it is popped too late.

It may be possible to modify the local stack approach to work with postorder storage and getNext input, but this would require carefully popping nodes on ancestor and descendant stacks in the right order.


[Figure: (a) a query tree over nodes a₁, b₁, c₁, d₁, e₁; (b) a data tree over nodes a₁, a₂, b₁, c₁, c₂, d₁, e₁.]

Figure 11: Problematic case with local preorder input and postorder storage. (a) Query. (b) Data.

F Benchmark Data and Queries

Figure 12 gives some details on the benchmark data used in our experiments, and Figure 13 lists the queries we have used.

Name      Size      Nodes       Source
DBLP      676 MB    35 900 666  http://dblp.uni-trier.de/xml
XMark     1 118 MB  32 298 988  http://www.xml-benchmark.org
Treebank  83 MB     3 829 512   http://www.cs.washington.edu/research/xmldatasets

Figure 12: Benchmark datasets used in experiments.


#   Data      Hits     Source  XPath
D1  DBLP      6        [12]    //inproceedings[author/text()="Jim Gray"][year/text()="1990"]/@key
D2  DBLP      21       [12]    //www[editor]/url
D3  DBLP      13       [12]    //book/author[text()="C. J. Date"]
D4  DBLP      2        [12]    //inproceedings[title/text()="Semantic Analysis Patterns."]/author
X1  XMark     40 726   [5]     /site/closed_auctions/closed_auction/annotation/description/text/keyword
X2  XMark     124 843  [5]     //closed_auction//keyword
X3  XMark     124 843  [5]     /site/closed_auctions/closed_auction//keyword
X4  XMark     40 726   [5]     /site/closed_auctions/closed_auction[annotation/description/text/keyword]/date
X5  XMark     124 843  [5]     /site/closed_auctions/closed_auction[.//keyword]/date
X6  XMark     32 242   [5]     /site/people/person[profile/gender][profile/age]/name
T1  Treebank  1 183    [11]    //S/VP//PP[.//NP/VBN]/IN
T2  Treebank  152      [11]    //S/VP/PP[IN]/NP/VBN
T3  Treebank  381      [11]    //S/VP//PP[.//NN][.//NP[.//CD]/VBN]/IN
T4  Treebank  1 185    [11]    //S[.//VP][.//NP]/VP/PP[IN]/NP/VBN
T5  Treebank  94 535   [11]    //EMPTY[.//VP/PP//NNP][.//S[.//PP//JJ]//VBN]//PP/NP//_NONE_

Figure 13: Benchmark queries used in experiments.


G Extended Results

Figure 14 shows the timings that the aggregates in Section 5 are based on.

       D1   D2   D3   D4   X1   X2   X3   X4   X5   X6   T1   T2   T3   T4   T5
HO---  2.53 0.43 1.07 1.59 0.57 0.11 0.11 0.75 0.25 0.25 0.31 0.30 0.40 0.47 0.50
HO-W-  1.42 0.25 0.44 1.12 0.43 0.10 0.11 0.60 0.24 0.21 0.18 0.17 0.24 0.32 0.32
HO-S-  1.42 0.26 0.44 1.11 0.43 0.10 0.11 0.61 0.24 0.21 0.17 0.18 0.24 0.32 0.33
HO-SL  1.70 0.28 0.48 1.22 0.46 0.10 0.11 0.65 0.26 0.23 0.18 0.19 0.24 0.34 0.33
HOW--  1.96 0.31 0.32 1.18 0.34 0.08 0.09 0.49 0.19 0.25 0.22 0.22 0.32 0.42 0.42
HOWW-  1.26 0.19 0.33 0.92 0.32 0.08 0.08 0.46 0.19 0.21 0.16 0.17 0.22 0.31 0.30
HOWS-  1.34 0.19 0.32 0.91 0.31 0.08 0.08 0.45 0.19 0.21 0.16 0.16 0.21 0.30 0.32
HOWSL  1.31 0.20 0.31 0.96 0.31 0.09 0.09 0.46 0.20 0.23 0.17 0.17 0.23 0.33 0.32
HOS--  2.03 0.31 0.32 1.17 0.32 0.08 0.08 0.47 0.19 0.25 0.21 0.19 0.27 0.35 0.40
HOSW-  1.27 0.20 0.31 0.91 0.30 0.08 0.08 0.44 0.19 0.21 0.16 0.15 0.21 0.29 0.29
HOSS-  1.27 0.21 0.32 0.92 0.30 0.08 0.08 0.44 0.19 0.21 0.17 0.15 0.21 0.30 0.30
HOSSL  1.32 0.20 0.31 0.97 0.30 0.08 0.08 0.46 0.19 0.23 0.17 0.18 0.23 0.34 0.31
HE---  2.46 0.40 1.07 1.48 0.55 0.09 0.09 0.70 0.20 0.23 0.30 0.29 0.36 0.45 0.48
HE-W-  2.73 0.40 1.25 1.67 0.72 0.09 0.10 0.89 0.21 0.27 0.35 0.34 0.43 0.50 0.54
HE-S-  2.74 0.40 1.25 1.66 0.72 0.09 0.10 0.89 0.21 0.27 0.34 0.34 0.43 0.49 0.54
HE-SL  3.14 0.42 1.47 1.83 0.80 0.09 0.10 0.97 0.23 0.32 0.37 0.40 0.45 0.54 0.58
HEW--  1.97 0.31 0.34 1.13 0.34 0.08 0.08 0.48 0.19 0.24 0.23 0.22 0.28 0.42 0.42
HEWW-  2.13 0.32 0.34 1.24 0.38 0.08 0.09 0.53 0.19 0.27 0.29 0.28 0.35 0.45 0.49
HEWS-  2.13 0.32 0.34 1.24 0.39 0.08 0.09 0.52 0.20 0.28 0.29 0.28 0.35 0.44 0.49
HEWSL  2.33 0.33 0.35 1.31 0.42 0.08 0.09 0.57 0.20 0.32 0.28 0.30 0.37 0.48 0.51
HES--  1.99 0.31 0.34 1.16 0.33 0.08 0.08 0.47 0.19 0.24 0.22 0.19 0.28 0.33 0.40
HESW-  2.15 0.32 0.34 1.26 0.36 0.08 0.09 0.51 0.20 0.28 0.25 0.21 0.31 0.35 0.45
HESS-  2.14 0.33 0.34 1.24 0.37 0.08 0.09 0.51 0.20 0.29 0.24 0.21 0.31 0.35 0.45
HESSL  2.35 0.33 0.36 1.33 0.40 0.08 0.09 0.55 0.21 0.34 0.26 0.24 0.36 0.38 0.48
NOWW-  0.96 0.22 0.09 0.71 0.49 0.11 0.15 0.85 0.34 0.30 0.08 0.08 0.16 0.34 0.22
NE-W-  1.03 0.25 0.08 0.93 0.59 0.12 0.15 0.99 0.40 0.33 0.08 0.09 0.17 0.34 0.27
NE-S-  1.03 0.25 0.08 0.89 0.68 0.12 0.16 1.10 0.39 0.35 0.08 0.09 0.17 0.34 0.29
NE-SL  1.06 0.26 0.08 0.92 0.77 0.12 0.16 1.20 0.42 0.36 0.09 0.09 0.17 0.34 0.29
NEWW-  0.94 0.23 0.08 0.72 0.51 0.11 0.14 0.87 0.35 0.31 0.07 0.07 0.15 0.31 0.23
NEWS-  0.95 0.23 0.08 0.72 0.54 0.11 0.15 0.90 0.36 0.32 0.08 0.08 0.15 0.32 0.24
NEWSL  0.95 0.23 0.08 0.73 0.55 0.11 0.15 0.93 0.37 0.33 0.08 0.08 0.15 0.32 0.24
NESW-  0.95 0.23 0.08 0.74 0.49 0.11 0.14 0.87 0.36 0.31 0.07 0.07 0.15 0.30 0.23
NESS-  0.93 0.23 0.08 0.72 0.51 0.11 0.15 0.89 0.36 0.32 0.08 0.07 0.15 0.31 0.23
NESSL  0.94 0.23 0.08 0.73 0.54 0.11 0.15 0.92 0.37 0.33 0.08 0.08 0.15 0.31 0.24
PEWW-  0.22 0.04 0.08 0.05 0.33 0.06 0.09 0.43 0.16 0.20 0.06 0.06 0.04 0.13 0.10
PEWS-  0.22 0.04 0.08 0.05 0.34 0.06 0.09 0.44 0.16 0.21 0.06 0.06 0.04 0.13 0.10
PEWSL  0.22 0.04 0.08 0.05 0.36 0.06 0.09 0.49 0.18 0.22 0.06 0.06 0.05 0.13 0.10
PESW-  0.22 0.04 0.08 0.05 0.31 0.06 0.09 0.45 0.18 0.20 0.06 0.06 0.04 0.12 0.10
PESS-  0.22 0.04 0.08 0.05 0.33 0.07 0.09 0.43 0.16 0.21 0.06 0.06 0.04 0.13 0.10
PESSL  0.22 0.04 0.08 0.05 0.35 0.06 0.10 0.44 0.17 0.22 0.06 0.06 0.05 0.13 0.10

Figure 14: Time in seconds for all tested methods on all benchmark queries. All queries were warmed up by 3 runs and then run 100 times, or until at least 10 seconds had passed, measuring average running time.


H Bad Behavior

In this appendix we list some experiments showing the super-linear behavior of previous twig join algorithms.

Figure 15a shows the exponential behavior of TwigList (HO-W-) and TwigFast (NEWW-) with the data and query from Example 1. The data is /(α₁/)ⁿ…(αₘ/)ⁿβ/γ with m = 10 and n = 100, and the query is //α₁//…//αₖ/γ, with k varying from 1 to 7.

[Figure: (a) log-scale query times for HO-W-, NEWW- and PESSL as k varies from 1 to 7; (b) query times for PESS- and PESSL as the total number of nodes varies from 10 000 to 40 000.]

Figure 15: (a) Exponential behavior without strict matching. (b) Quadratic behavior without optimal enumeration. Query time in seconds.

Figure 15b shows the results of an experiment based on Example 2, with n varying up to 10 000. The simple query //a/b has quadratic cost even when strict prefix path and subtree match filtering is used, if P–C child matches are not directly accessible. Many of the a nodes in the data are nested, and have a small number of b children, but a large number of b descendants. For approaches using simple vectors, overlapping descendant intervals are scanned for direct children, and this results in quadratic running time.



Paper 6

Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland

Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees

Proceedings of the 7th International XML Database Symposium on Database and XML Technologies (XSym 2010).

Abstract. The F&B-index is used to speed up pattern matching in tree and graph data, and is based on the maximum F&B-bisimulation, which can be computed in loglinear time for graphs. It has been shown that the maximum F-bisimulation can be computed in linear time for DAGs. We build on this result, and introduce a linear algorithm for computing the maximum F&B-bisimulation for tree data. It first computes the maximum F-bisimulation, and then refines this to a maximal B-bisimulation. We prove that the result equals the maximum F&B-bisimulation.


1 Introduction

Structure indexes are used to reduce the cost of pattern matching in labeled trees and graphs [5, 15, 10], by capturing structure properties of the data in a structure summary, where some or all of the matching can be performed. Efficient construction of such indexes is important for their practical usefulness [15], and in this paper we reduce the construction cost of the F&B-index [10] for tree data.

In a structure index, data nodes are typically partitioned into blocks based on properties of the surrounding nodes. A structure summary typically has one node per block in the partition, and edges between summary nodes where there are edges between data nodes in the related blocks. Matching in structure summaries is usually more efficient than partitioning the data nodes on label and using structural joins to find full query matches [15, 10].

In a path index, data nodes are classified by the labels on the paths by which they are reachable [5, 15]. For tree data this equals partitioning nodes on their label and the block of the parent node. Figures 1c and 1b show path partitioning and the related summary for the example data in Figure 1a. With path indexes, non-branching queries can be evaluated without processing joins [15, 19].

[Figure: (a) an example query twig and a data tree of a- and b-labeled nodes; (b) the corresponding path summary with nodes bs₁, as₂, as₃, bs₄, as₅, bs₆; (c) the path partition; (d) the F&B partition.]

Figure 1: Partitioning strategies. Superscripts and subscripts give node identifiers. (a) Example query and data; single/double query edges specify parent–child/ancestor–descendant relationships. (b) Path summary. (c) Path partition. (d) F&B partition.

A natural extension of a path index is the F&B-index, where nodes are partitioned on their label, the partitions of the parents, and the partitions of the children, as shown in Figure 1d. This gives an index where more of the pattern matching can be performed on the summary, and also branching queries can be evaluated without processing joins [10].

The focus of this paper is efficient computation of the maximum simultaneous forward and backward bisimulation (F&B-bisimulation), which is the underlying concept used to partition nodes in the F&B-index. It can be computed in time loglinear in the number of edges in the graph [10]. A linear construction algorithm for directed acyclic graphs (DAGs) has been presented recently [13], but we show that it is incorrect. On the other hand, there exists a correct algorithm which can compute either the maximum forward bisimulation (F-bisimulation) or the maximum backward bisimulation (B-bisimulation) in linear time for DAGs [3]. We extend this algorithm to compute the maximum F&B-bisimulation in linear time for tree data. This has relevance for applications where the underlying data is known to be tree shaped, such as in many uses of XML [6].

2 Background

In this section we present different types of bisimulations, and show how they can be computed by first partitioning on label, and then stabilizing the graph.

We use the following notation: A directed graph G = 〈V, E〉 has node set V and edge set E ⊆ V × V. Let n = |V| and m = |E|. For X ⊆ V, E(X) = {w | ∃v ∈ X : vEw} and E⁻¹(X) = {u | ∃v ∈ X : uEv}. Each node v ∈ V has label L(v). A partition P of V is a set of blocks, such that each node v ∈ V is contained in exactly one block. For an equivalence relation ∼ ⊆ V × V, the equivalence class containing v ∈ V is denoted by [v]∼. The equivalence relation arising from the partition P is denoted =_P. A relation R₂ is a refinement of R₁ iff R₂ ⊆ R₁. A partition P₂ is a refinement of the coarser P₁ iff =_P₂ ⊆ =_P₁. Let the contraction graph of a partition P be a graph with one node for each equivalence class of =_P, and an edge 〈[u]_=P, [v]_=P〉 whenever 〈u, v〉 ∈ E.
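For concreteness, the successor and predecessor set operators E(X) and E⁻¹(X) could be computed as in the following C++ sketch; the explicit edge-set representation is an illustrative assumption chosen only for clarity, not an efficient implementation.

    #include <set>
    #include <utility>

    using NodeId = int;
    using EdgeSet = std::set<std::pair<NodeId, NodeId>>;

    // E(X): nodes reachable from X by one edge.
    std::set<NodeId> successors(const EdgeSet& E, const std::set<NodeId>& X) {
        std::set<NodeId> out;
        for (const auto& e : E)
            if (X.count(e.first)) out.insert(e.second);
        return out;
    }

    // E^{-1}(X): nodes with an edge into X.
    std::set<NodeId> predecessors(const EdgeSet& E, const std::set<NodeId>& X) {
        std::set<NodeId> out;
        for (const auto& e : E)
            if (X.count(e.second)) out.insert(e.first);
        return out;
    }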

The structure summary built for a structure index is typically isomorphic to the contraction graph for the data partition. For a partitioning to be useful, it must yield a summary that somehow simulates the data, such that pattern matching in the summary gives the same results as pattern matching in the data, or at least no false negatives. If nodes are partitioned into blocks where nodes in some way simulate each other, then the contraction graph also simulates the data in the same way.
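As a concrete (and naive) illustration of the contraction graph, the following sketch, with names of our choosing, contracts a partition given as frozensets of nodes:

def contraction_graph(blocks, edges):
    """One node per block of the partition, and an edge <[u], [v]>
    whenever <u, v> is an edge of the data graph.

    blocks: iterable of frozensets of data nodes; edges: pairs (u, v).
    """
    block_of = {v: b for b in blocks for v in b}
    nodes = set(blocks)
    contracted = {(block_of[u], block_of[v]) for (u, v) in edges}
    return nodes, contracted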

2.1 Bisimulation and Bisimilarity

Broadly speaking, bisimulations relate nodes that have the same label and related neighbors. We use the following properties of a binary relation R ⊆ V × V to formally define the different types of bisimulation:


vRv′ ⇒ L(v) = L(v′)  (1)

vRv′ ⇒ (uEv ⇒ ∃u′ : u′Ev′ ∧ uRu′) ∧ (u′Ev′ ⇒ ∃u : uEv ∧ uRu′)  (2)

vRv′ ⇒ (vEw ⇒ ∃w′ : v′Ew′ ∧ wRw′) ∧ (v′Ew′ ⇒ ∃w : vEw ∧ wRw′)  (3)

Definition 1 (Bisimulations [14, 10]). A relation R is a B-bisimulation iff it satisfies (1) and (2) above, an F-bisimulation iff it satisfies (1) and (3), and an F&B-bisimulation iff it satisfies (1), (2) and (3).

For each type, there exists a unique maximum bisimulation, of which all other bisimulations are refinements [14, 10]. We say that two nodes are bisimilar if there exists a bisimulation that relates them, i.e., they are related by the maximum bisimulation. Since bisimilarity is an equivalence relation, it can be used to partition the nodes [14, 10]. When nodes u and v are backward, forward, and forward-and-backward bisimilar, we write u ∼_B v, u ∼_F v and u ∼_F&B v, respectively. Figure 2 illustrates the different types of bisimilarity.
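For illustration, the definition can be checked mechanically. The following naive predicate is our sketch (quadratic, for small examples only); it tests whether a given relation R, a set of node pairs, satisfies properties (1)–(3), i.e., whether it is an F&B-bisimulation:

def is_fb_bisimulation(R, labels, edges):
    """Check properties (1)-(3) for relation R on the graph given by
    labels (node -> label) and edges (set of (u, v) pairs)."""
    pred = {v: {u for (u, w) in edges if w == v} for v in labels}
    succ = {v: {w for (u, w) in edges if u == v} for v in labels}
    for v, v2 in R:
        if labels[v] != labels[v2]:                        # property (1)
            return False
        for u in pred[v]:                                  # property (2), one half
            if not any((u, u2) in R for u2 in pred[v2]):
                return False
        for u2 in pred[v2]:                                # property (2), other half
            if not any((u, u2) in R for u in pred[v]):
                return False
        for w in succ[v]:                                  # property (3), one half
            if not any((w, w2) in R for w2 in succ[v2]):
                return False
        for w2 in succ[v2]:                                # property (3), other half
            if not any((w, w2) in R for w in succ[v]):
                return False
    return True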

[Figure: panels (a) Data; (b) ∼_B; (c) ∼_F; (d) ∼_F&B; node listings omitted. The data tree has root a₁ with children b₂ and b₄, where b₂ has child c₃, and b₄ has children c₅ and d₆.]
Figure 2: Partitioning on different types of bisimilarity.

Two graphs are said to be bisimilar if there exists a total surjective bisimulation from the nodes in one graph to the nodes in the other. For a given graph, the smallest bisimilar graph is unique, and is exactly the contraction for bisimilarity [3]. The F&B-bisimilarity contraction is the basis for the F&B-index [10].

2.2 Stability

The different types of bisimilarity listed in the previous section can be computed by first partitioning the data nodes on label, and then finding the coarsest stable refinements of the initial partition [16, 4, 10]. A partition is successor stable iff all nodes in a block have incoming edges from nodes in the same set of blocks, and is predecessor stable iff all nodes in a block have outgoing edges to the same set of blocks [10]. The coarsest successor, predecessor, and successor-and-predecessor stable refinements of a label partition equal the partitions on B-bisimilarity, F-bisimilarity and F&B-bisimilarity, respectively [4, 10].


Definition 2 (Partition stability [16, 10]). Given a directed graph G = 〈V, E〉, then D ⊆ V is successor stable with respect to B ⊆ V if either all or none of the nodes in D are pointed to from nodes in B (meaning D ⊆ E(B) or D ∩ E(B) = ∅), and D is predecessor stable with respect to B if either none or all of the nodes in D point to nodes in B (meaning D ⊆ E⁻¹(B) or D ∩ E⁻¹(B) = ∅).

For any combination of successor and predecessor stability, a partition P of V is said to be stable with respect to a block B if all blocks in P are stable with respect to B. A partition P is stable with respect to another partition Q if it is stable with respect to all blocks in Q. P is said to be stable if it is stable with respect to itself.
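Definition 2 translates directly into set comparisons. A small sketch (our names; it recomputes E(B) and E⁻¹(B) from scratch rather than maintaining them incrementally):

def image(B, edges):
    """E(B): nodes pointed to from B."""
    return {v for (u, v) in edges if u in B}

def preimage(B, edges):
    """E^{-1}(B): nodes pointing into B."""
    return {u for (u, v) in edges if v in B}

def succ_stable(D, B, edges):
    """D is successor stable w.r.t. B iff D ⊆ E(B) or D ∩ E(B) = ∅."""
    img = image(B, edges)
    return D <= img or not (D & img)

def pred_stable(D, B, edges):
    """D is predecessor stable w.r.t. B iff D ⊆ E^{-1}(B) or D ∩ E^{-1}(B) = ∅."""
    pre = preimage(B, edges)
    return D <= pre or not (D & pre)

def partition_stable(P, edges):
    """P is stable iff every block is succ- and pred-stable w.r.t. every block."""
    return all(succ_stable(D, B, edges) and pred_stable(D, B, edges)
               for D in P for B in P)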

Figure 3 shows cases where a block can be split to achieve different types of stability. The block D is not stable with respect to B, but we can split it into blocks that are: Assume that D is stable with respect to a union of blocks S such that B ⊂ S. We can split D into blocks that are stable with respect to both B and S \ B, shown as D_B, D_BS and D_S in the figure. Stabilizing also with respect to S \ B is crucial for obtaining an O(m log n) running time in the partition stabilization algorithm explained in the next section.

[Figure: panels (a) Succ-stability and (b) Pred-stability, showing D split into D_B, D_BS and D_S relative to B and S.]
Figure 3: Refining D into blocks that are stable with respect to B and S \ B.

2.3 Stabilizing graph partitions

We now go through Paige and Tarjan's algorithm for refinement to the coarsest predecessor stable partition [16], extended to simultaneous successor and predecessor stability by Kaushik et al. [10], as shown in Algorithm 1. The input to the algorithm is a partition P and the set of flags Directions ⊆ {Succ, Pred}, which specifies whether P is to be successor and/or predecessor stabilized.

Figure 4 illustrates an example run of the algorithm with Directions = {Succ, Pred}. In addition to the current partition P, the algorithm maintains a partition X, where the blocks are unions of blocks in P. Initially X contains a single block that is the union of all blocks in P, and the algorithm maintains the loop invariant that P is stable with respect to X by Definition 2.


Algorithm 1 Graph partition stabilization.
1: ⊲ P is the initial partition.
2: function StabilizeGraph(P, Directions):
3:     for dir ∈ Directions:
4:         InitialRefinement(P, dir)
5:     X ← {⋃_{B∈P} B}
6:     while P ≠ X:
7:         Extract B ∈ P such that B ⊂ S ∈ X and |B| ≤ |S|/2.
8:         Replace S by B and S \ B in X.
9:         for dir ∈ Directions:
10:            StabilizeWRT(copy of B, P, dir)

The algorithm terminates when the partitions P and X are equal, which means P must be stable with respect to itself. The loop invariant may not hold for the given input partition initially, however: Blocks containing both roots and non-roots are not successor stable with respect to X, because non-roots have incoming edges from the single block S ∈ X, while roots do not. Similarly, blocks containing both leaves and non-leaves are not predecessor stable with respect to X. In Algorithm 1, initial stability is achieved by calls to InitialRefinement(), which splits blocks in a simple linear pass. Initial splitting is illustrated by the step from line (a) to line (b) in Figure 4.

The algorithm repeatedly selects a block S ∈ X that is a compound union of blocks from P, and selects a splitter block B ⊂ S with size at most half of S. Then S is replaced by B and S \ B in X, as shown when extracting B = {a₂, a₉, a₁₅} between lines (b) and (c) in Figure 4. The call to StabilizeWRT() uses the strategies depicted in Figure 3 to stabilize P with respect to both B and S \ B, to make sure P is stable with respect to the new X. It is important to use a copy of B as splitter, as the stabilization may cause B itself to be split. The step from line (b) to (c) in the figure shows that a block of nodes labeled a is split when successor stabilizing with respect to B = {a₂, a₉, a₁₅}.

Efficient implementation of the above requires some attention to detail [16]: The partition X can be realized through a set 𝒳 containing sets of pointers to blocks in P, such that for each S ∈ X we have S = ⋃_{B∈𝒮} B for the related 𝒮 ∈ 𝒳. To extract a B ⊂ S ∈ X in constant time, we also maintain the set of compound unions 𝒞 = {𝒮 ∈ 𝒳 | 1 < |𝒮|}. A block B ⊂ S such that |B| ≤ |S|/2 can be found by choosing the smallest of the first two blocks in any 𝒮 ∈ 𝒞. Note that we can check whether P = X by checking whether 𝒞 is empty, and neither P, X nor 𝒳 need to be materialized, as they are never used directly. We only need to maintain 𝒞, and a 𝒫 ⊆ P containing the blocks in P not in some 𝒮 ∈ 𝒞. To support inserting and removing elements in constant time, the sets are implemented as doubly linked lists. In addition, each v ∈ B has a pointer to B, and each B ∈ 𝒮 has a pointer to 𝒮. This allows keeping the data structures updated throughout the evaluation.
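A sketch of how these records could look; this is our Python approximation, using plain lists where the real implementation needs intrusive doubly linked lists to get the stated O(1) bounds:

from dataclasses import dataclass, field

@dataclass
class Block:                 # a block B of P
    nodes: list              # the nodes of B (a doubly linked list in [16])
    union: "UnionS" = None   # back-pointer to the 𝒮 ∈ 𝒳 containing B

@dataclass
class UnionS:                # an S ∈ X, kept as its set 𝒮 of blocks
    blocks: list = field(default_factory=list)

def extract_splitter(C):
    """From the first compound union in 𝒞, pick B ⊂ S with |B| ≤ |S|/2:
    the smaller of the union's first two blocks."""
    S = C[0]
    b1, b2 = S.blocks[0], S.blocks[1]
    return b1 if len(b1.nodes) <= len(b2.nodes) else b2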

As all operations in the while-loop excluding the calls to StabilizeWRT() are constant-time operations on linked lists, the complexity of the loop is bounded by the number of splitter blocks selected, which again is bounded by the number of times a node may be part of such a splitter.


[Figure: partition snapshots (a)–(l); node listings omitted.]
Figure 4: Algorithm 1 doing successor and predecessor stabilization on a label partition of the data from Figure 1a. The white boxes are the current blocks in P, while the gray boxes are the current blocks in X. Line (a) shows the initial label partition. Step (a)–(b) shows refinement separating roots from non-roots and leaves from non-leaves, and steps (b)–(l) show simultaneous predecessor and successor stabilization.

Splitter blocks at most half the size of a compound union are always selected, and no node in such a block is part of a splitter again before the block itself has become a compound union. This means that the number of times a given node is part of a splitter block is O(log n), and that the total number of splitter blocks used is O(n log n) [16].

Algorithm 2 shows how all blocks in a partition P can be successor (or predecessor) stabilized with respect to a block B ∈ P and S \ B, where B ⊂ S ∈ X, in time linear in the number of edges going out from (or into) B [16]. For successor stability, only blocks D pointed to from B are affected, and they are stabilized with respect to both B and S \ B without involving S \ B directly. This is done by maintaining, for each node v ∈ V, the number of times it is pointed to from each set S ∈ X, and storing references to these records from the related edges. We can then differentiate between nodes pointed to only from B, pointed to from both B and S \ B, and pointed to only from S \ B. Nodes from the first two categories are moved into new blocks, while the rest are untouched.

As the cost of a single call to StabilizeWRT() is bounded by the number of nodes in the splitter block and the number of outgoing (or incoming) edges, the total cost for the calls is O((m + n) log n), as a given node or edge is used for splitting O(log n) times.


Algorithm 2 Stabilizing with respect to a block.
1: function StabilizeWRT(B, P, dir):
2:     Assume dir = Succ. (or Pred)
3:     for D ∈ P pointed to from B: (or pointing into B)
4:         Initialize D_B and D_BS and associate with D.
5:     for v ∈ D ∈ P pointed to from B: (or pointing into B)
6:         if v is pointed to only from B: (or pointing only into B)
7:             D′ ← D_B
8:         else:
9:             D′ ← D_BS
10:        if D′ ∉ P: Insert D′ after D in P.
11:        Move v from D to D′.
12:        if D = ∅: Remove D from P.

Assuming n ∈ O(m), the cost of Algorithm 1 is O(n) for the initial refinement, O(n log n) for the while-loop excluding StabilizeWRT(), and O(m log n) for the StabilizeWRT() calls, giving a total of O(m log n) [16, 10].
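To make the splitting step concrete, here is a from-scratch Python sketch of the three-way classification in Algorithm 2 for the successor direction. It is ours, recomputes the "pointed to from B / from S \ B" information instead of maintaining the per-node counters, and is therefore not linear time, but it produces the same blocks:

def stabilize_wrt_succ(B, S, P, edges):
    """Split the blocks of P so they become successor stable w.r.t.
    both B and S \ B (B a splitter block inside the compound union S).

    P: list of sets of nodes; edges: set of pairs (u, v).
    """
    from_B = {v for (u, v) in edges if u in B}
    from_rest = {v for (u, v) in edges if u in S and u not in B}
    new_P = []
    for D in P:
        D_B = (D & from_B) - from_rest   # pointed to only from B
        D_BS = D & from_B & from_rest    # pointed to from both B and S \ B
        rest = D - from_B                # untouched nodes stay in D
        # the real algorithm inserts D_B and D_BS right after D in P
        new_P.extend(block for block in (rest, D_B, D_BS) if block)
    return new_P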

3 Linear Time Stabilization

Linear time computation of F&B-bisimilarity for DAG data has been attempted earlier. The SAM algorithm [13] partitions the data separately on B-bisimilarity and F-bisimilarity, and then combines the partitions by putting two nodes in the same final block iff they are in the same blocks in both partitions. It builds on the following theorem, which is stated without proof or reference: "Node n₁ and node n₂ satisfy F&B-bisimulation if and only if they satisfy F-bisimulation and B-bisimulation." The only if part is of course true, but the if part is not, as can be seen from the partitioning of the six-node tree in Figure 2. Here c₃ ∼_B c₅ and c₃ ∼_F c₅, but c₃ ≁_F&B c₅, because for the parent nodes b₂ ≁_F&B b₄. Also note that the running time analysis of the SAM algorithm assumes that the number of edges to and from each node can be viewed as a constant.
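The counterexample is easy to verify mechanically. The sketch below is a deliberately naive fixpoint refinement (ours, unrelated to SAM and far from linear time): starting from the label partition, it repeatedly splits blocks on the sets of blocks of parents and/or children until nothing changes, which yields the ∼_B, ∼_F and ∼_F&B partitions on small inputs.

def refine(labels, edges, use_pred, use_succ):
    """Naive bisimilarity partitioning; returns a dict node -> block id."""
    pred = {v: [u for (u, w) in edges if w == v] for v in labels}
    succ = {v: [w for (u, w) in edges if u == v] for v in labels}
    block = dict(labels)                     # initial partition on label
    while True:
        sig = {v: (block[v],
                   frozenset(block[u] for u in pred[v]) if use_pred else None,
                   frozenset(block[w] for w in succ[v]) if use_succ else None)
               for v in labels}
        ids = {}
        for v in sorted(labels):
            ids.setdefault(sig[v], len(ids))
        new = {v: ids[sig[v]] for v in labels}
        if len(set(new.values())) == len(set(block.values())):
            return new
        block = new

# The tree of Figure 2: a1 -> b2, b4; b2 -> c3; b4 -> c5, d6.
labels = {1: "a", 2: "b", 3: "c", 4: "b", 5: "c", 6: "d"}
edges = {(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)}
b = refine(labels, edges, use_pred=True, use_succ=False)    # ~B
f = refine(labels, edges, use_pred=False, use_succ=True)    # ~F
fb = refine(labels, edges, use_pred=True, use_succ=True)    # ~F&B
assert b[3] == b[5] and f[3] == f[5] and fb[3] != fb[5]     # SAM's premise fails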

3.1 Stabilizing DAG partitions

We now present an algorithm for refining a partition of the nodes in a DAG either to successor stability or to predecessor stability. It is based on two different previous algorithms: Paige and Tarjan's loglinear algorithm for stabilizing general graphs [16], and Dovier, Piazza and Policriti's algorithm for computing F-bisimilarity on unlabeled graphs [3], which has linear complexity for DAG data. A difference between these two algorithms is that the former is given an initial partition as input, which is then refined, while the latter starts with the set of singleton blocks, from which the final partition is constructed. These are called negative and positive strategies, respectively [3]. Dovier et al. describe how their algorithm can be extended to compute F-bisimilarity for labeled data, but when developing an algorithm for refining to simultaneous successor and predecessor stability in the next section, we use the result of a predecessor stabilization as input to a successor stabilization, and hence cannot use a positive strategy.

Dovier et al.'s algorithm initially computes the rank of each node in the DAG, which is the length of the longest path from the node to a leaf. We extend the notion of rank to both directions in the DAG:

Definition 3 (Rank). In a DAG G, the successor and predecessor ranks of v ∈ V(G) are defined as:

    rank_Succ(v) = 0                                      if v is a root in G
    rank_Succ(v) = 1 + max_{〈u,v〉 ∈ E(G)} rank_Succ(u)    otherwise

    rank_Pred(v) = 0                                      if v is a leaf in G
    rank_Pred(v) = 1 + max_{〈v,w〉 ∈ E(G)} rank_Pred(w)    otherwise

Algorithm 3 shows our modification of Paige and Tarjan's algorithm [16] based on Dovier et al.'s principles [3]. It refines a partition of a DAG either to predecessor or to successor stability, and runs in linear time, due to the order in which splitter blocks are chosen.

Algorithm 3 DAG partition stabilization.
1: ⊲ Assume sets are ordered.
2: function StabilizeDAG(P, dir):
3:     RefineAndSortOnRank(P, dir)
4:     X ← {⋃_{B∈P} B}
5:     while P ≠ X:
6:         Extract first B ∈ P such that B ⊂ S ∈ X.
7:         Replace S by B and S \ B in X.
8:         StabilizeWRT(copy of B, P, dir)

A run of the algorithm with dir = Pred is illustrated in Figure 5. Instead of only separating between leaves and non-leaves (or roots and non-roots) as in Algorithm 1, blocks are initially split such that predecessor (or successor) rank is uniform within each block, and sorted, such that the rank is monotonically increasing in the partition. This is done in the function RefineAndSortOnRank(), which is described later in this section. An initial refinement and sorting on predecessor rank is shown when going from line (a) to line (b) in Figure 5.

In a partition that respects a given type of rank, let the rank of a block be equal to the rank of the contained nodes. The following lemma implies that the initial refinement on rank does not split blocks unnecessarily:

Lemma 1. Given nodes u, v ∈ V(G) and a partition P of G, if P is successor stable then [u]_{=P} = [v]_{=P} ⇒ rank_Succ(u) = rank_Succ(v), and if P is predecessor stable then [u]_{=P} = [v]_{=P} ⇒ rank_Pred(u) = rank_Pred(v).


[Figure: partition snapshots (a)–(i); node listings omitted.]
Figure 5: Using Algorithm 3 to predecessor stabilize a label partition of the data from Figure 1a. The first step shows refinement on predecessor rank.

Proof. For F-bisimilarity, [u]_{∼F} = [v]_{∼F} ⇒ rank_Pred(u) = rank_Pred(v) [3]. If P is predecessor stable, then =_P is an F-bisimulation [4], and therefore a refinement of the partitioning on ∼_F, such that [u]_{=P} = [v]_{=P} ⇒ [u]_{∼F} = [v]_{∼F}. The case for successor stability is symmetric.

The next lemma implies that if blocks are chosen as splitters in order of their rank, a node will be part of a block that is used as a splitter at most once. This property is used to achieve linear complexity in our algorithm.

Lemma 2 ([3]). Given a DAG G, a partition P of G that respects predecessor rank, and a block B ∈ P, predecessor stabilization of P with respect to B only splits blocks D where rank_Pred(D) > rank_Pred(B).

This is symmetric for successor stabilization and successor rank, as a reversed DAG is also a DAG.

In Algorithm 3, the blocks in the current partition P are kept ordered on rank. This is implemented through a detail in our method for stabilization with respect to a block in Algorithm 2, which differs from the original description [16]: The new blocks D_B and D_BS are inserted into P at the position after the old block D, and not at the end of P. The sets of blocks which make up the unions in X are also ordered such that their concatenation yields an ordered list of blocks. This is maintained by inserting B followed by S \ B at the original position of S in X. Notice how blocks never change positions during the stabilization shown in lines (b)–(i) in Figure 5.

In Dovier et al.'s algorithm, rank is computed by performing a depth-first topological traversal of the DAG [3]. Because we need to refine a given partition on rank, as opposed to constructing a partition on rank from scratch, the problem is slightly more involved. Algorithm 4 shows how a partition can be refined and sorted on successor or predecessor rank in a single pass. The algorithm traverses the DAG with a hybrid between a topological sort and a breadth-first search, implemented using edge counters and a queue. Blocks are refined and sorted on the fly.


Algorithm 4 Refining and sorting on rank.
1: function RefineAndSortOnRank(P, dir):
2:     Assume dir = Succ. (or dir = Pred)
3:     Q is a queue.
4:     for v ∈ V:
5:         v.count ← |{x | 〈x, v〉 ∈ E}| (or |{x | 〈v, x〉 ∈ E}|)
6:         if v.count = 0:
7:             v.rank_dir ← 0
8:             PushBack(Q, v)
9:     for B ∈ P:
10:        B.currRank ← −1
11:    while Q:
12:        v ← PopFront(Q)
13:        Let B ∋ v.
14:        if v.rank_dir ≠ B.currRank:
15:            B.rankedB ← {}
16:            Append B.rankedB at the end of P.
17:            B.currRank ← v.rank_dir
18:        Move v from B to B.rankedB.
19:        Remove B from P if empty.
20:        for x where 〈v, x〉 ∈ E: (or 〈x, v〉 ∈ E)
21:            x.count ← x.count − 1
22:            if x.count = 0:
23:                x.rank_dir ← v.rank_dir + 1
24:                PushBack(Q, x)

Lemma 3. Algorithm 4 refines and orders P on successor (or predecessor) rank in O(m + n) time.

Proof (sketch). Because the queue is initialized with the roots, and a node is added to the queue when its last parent is popped from the queue, the nodes are queued and popped in order of successor rank, and this rank is calculated from the parent node with the greatest successor rank. As the successor rank of the nodes that are moved to a new block grows monotonically, only one associated block B.rankedB is created per successor rank found in block B, and all such blocks are appended to P in sorted order. As a node is only queued once, and the cost of processing a node is proportional to the number of outgoing edges, the total running time of the algorithm is O(m + n). The case for predecessor rank is symmetric.
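For concreteness, here is our Python rendering of Algorithm 4 for dir = Succ (dir = Pred is obtained by swapping the edge directions). The variable names follow the pseudocode; the list-based block handling is a simplification of the linked-list structures.

from collections import deque

def refine_and_sort_on_rank(P, nodes, edges):
    """Split the blocks of P so that successor rank is uniform per block,
    returning the new blocks ordered by increasing rank. O(m + n).

    P: list of lists of nodes; edges: set of pairs (u, v).
    """
    parents = {v: [u for (u, w) in edges if w == v] for v in nodes}
    children = {v: [w for (u, w) in edges if u == v] for v in nodes}
    count = {v: len(parents[v]) for v in nodes}   # unprocessed parents
    rank = {v: 0 for v in nodes if count[v] == 0}
    Q = deque(rank)                               # roots have rank 0
    block_of = {v: i for i, B in enumerate(P) for v in B}
    curr_rank = {i: -1 for i in range(len(P))}    # B.currRank
    ranked = {}                                   # block id -> current B.rankedB
    new_P = []
    while Q:
        v = Q.popleft()                           # pops are nondecreasing in rank
        b = block_of[v]
        if rank[v] != curr_rank[b]:               # open a fresh B.rankedB ...
            curr_rank[b] = rank[v]
            ranked[b] = []
            new_P.append(ranked[b])               # ... appended in rank order
        ranked[b].append(v)                       # move v from B to B.rankedB
        for x in children[v]:
            count[x] -= 1
            if count[x] == 0:                     # v was x's last and thus
                rank[x] = rank[v] + 1             # highest-ranked parent
                Q.append(x)
    return new_P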

Theorem 1. Algorithm 3 yields the coarsest refinement of a partition of a DAG that is successor (or predecessor) stable.

Proof (sketch). For predecessor stability, the only differences from Paige and Tarjan's algorithm are the initial refinement and the order in which splitter blocks are chosen. By Lemma 1, blocks are not refined unnecessarily when refining on rank, and the order of split operations is not used in the correctness proof for the original algorithm [16]. Successor stability is symmetric.

To implement Algorithm 3 we use the same underlying data structures as for Algorithm 1: doubly linked lists 𝒞 and 𝒫 realizing X and P. In Algorithm 3, the extract operation is implemented by removing the first B ∈ 𝒮 from the first 𝒮 ∈ 𝒞. The replace operation is implemented by moving B from 𝒮 to the end of 𝒫, and if only one block B′ is left in 𝒮, this B′ is also moved from 𝒮 to 𝒫, and 𝒮 is removed from 𝒞.

Theorem 2. The running time of Algorithm 3 is O(m + n).

Proof (sketch). We analyze the cost of StabilizeWRT() separately. Outside the while-loop, the call to RefineAndSortOnRank() has cost O(m + n) by Lemma 3, and the construction of 𝒞 and 𝒫 has cost O(|P|) ⊆ O(n). As splitters are chosen in order of their rank, by Lemma 2 a splitter block is not later split itself. This means that each node is only part of a splitter once, and that the while-loop runs O(n) times. The loop condition is implemented by checking whether 𝒞 ≠ ∅. As all the operations on linked lists inside the loop have complexity O(1), the total cost of the while-loop is O(n). The StabilizeWRT() function is called O(n) times, and the cost of one call is linear in the number of nodes and edges used [16]. As nodes are only part of a splitter block once, edges are also only used for splitting once, and the total cost is O(m + n).

3.2 Stabilizing Trees

We now present an algorithm for finding the coarsest successor and predecessor stable refinement of a partition of the nodes in a tree. It uses the solution for DAGs from the previous section to refine a partition first to the coarsest predecessor stable refinement, and then to the coarsest successor stable refinement, as shown in Algorithm 5. For trees this yields a partition that is still predecessor stable, as we prove in the following.

Algorithm 5 Stabilization for trees.
1: function StabilizeTree(P, Directions):
2:     if Pred ∈ Directions:
3:         StabilizeDAG(P, Pred)
4:     if Succ ∈ Directions:
5:         StabilizeDAG(P, Succ)

Figure 6 shows, continuing Figure 5, how Algorithm 5 is used to find a successor and predecessor stable refinement. The starting point in the figure is the predecessor stable refinement found after calling StabilizeDAG(P, Pred). This partition is then successor stabilized by calling StabilizeDAG(P, Succ), which first refines and sorts on successor rank, shown between lines (i) and (j) in the figure, and then uses the blocks in the current P as splitters in order, shown in lines (j)–(t). Compare this partition with the F&B-bisimilarity partition in Figure 1d.


[Figure: partition snapshots (i)–(t); node listings omitted.]
Figure 6: Continuing Figure 5: Line (i) shows the predecessor stable partition, step (i)–(j) shows successor rank refinement, and steps (j)–(t) show successor stabilization.

Theorem 3. If a predecessor stable partition of the nodes in a tree is refined to successor stability, the resulting partition is still predecessor stable.

Proof (sketch). Blocks are split in three ways: to refine on rank, with respect to a block B ∈ P, or as a side effect with respect to S \ B in Algorithm 2. By Lemma 1, the first type of split does not cause any split that would not eventually be caused by an algorithm that iteratively refines P with respect to a random block B ∈ P, and from the correctness proof of Paige and Tarjan's algorithm [16], neither does splitting with respect to S \ B. We now use induction on the refinement steps, and show that the partition P remains predecessor stable. It is true initially by assumption. The induction step is to split a block D ∈ P on successor stability with respect to a block B ∈ P.

B will split D into two parts D_B and D_S, containing the nodes pointed to and not pointed to from B, respectively. The splitting of D may only affect the predecessor stability of the new blocks D_B and D_S with respect to their descendants, and of the set of blocks ℬ pointing into D with respect to D_B and D_S. After the split, B ⊆ E⁻¹(D_B) and B ∩ E⁻¹(D_S) = ∅, and for all other B′ ∈ ℬ we have that B′ ∩ E⁻¹(D_B) = ∅ and B′ ⊆ E⁻¹(D_S), because these B′ by assumption have pointers into D, but by the fact that the data is a tree, do not have pointers into D_B.

For any block G pointed to from some node in D, D ⊆ E⁻¹(G) by the initial assumption of predecessor stability, meaning all nodes in D point into G. This means that all nodes in D_B and D_S also point into G, and thus D_B and D_S are predecessor stable with respect to G.

Figure 7a illustrates this theorem. By contrast, if the data were not a tree, the blocks B′ ∈ ℬ would not necessarily be predecessor stable after splitting D, as shown in Figure 7b.
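Theorem 3 can be sanity-checked with the same kind of naive refinement used for the Section 3 counterexample (again our quadratic sketch, not the paper's algorithm): predecessor stabilize first by splitting on children blocks, then successor stabilize by splitting on parent blocks, and observe that another predecessor pass changes nothing.

def refine_dir(block, nbrs):
    """Coarsest refinement where nodes in a block agree on their set of
    neighbor blocks (naive fixpoint). block: dict node -> block id."""
    while True:
        sig = {v: (block[v], frozenset(block[u] for u in nbrs[v])) for v in block}
        ids = {}
        for v in sorted(block):
            ids.setdefault(sig[v], len(ids))
        new = {v: ids[sig[v]] for v in block}
        if len(set(new.values())) == len(set(block.values())):
            return new
        block = new

def as_blocks(block):
    """View a node -> id map as a set of blocks, ignoring id numbering."""
    inv = {}
    for v, b in block.items():
        inv.setdefault(b, set()).add(v)
    return {frozenset(s) for s in inv.values()}

# The tree of Figure 2 again: a1 -> b2, b4; b2 -> c3; b4 -> c5, d6.
labels = {1: "a", 2: "b", 3: "c", 4: "b", 5: "c", 6: "d"}
edges = {(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)}
pred = {v: [u for (u, w) in edges if w == v] for v in labels}
succ = {v: [w for (u, w) in edges if u == v] for v in labels}

P1 = refine_dir(dict(labels), succ)   # predecessor stable: split on children
P2 = refine_dir(P1, pred)             # then successor stable: split on parents
assert as_blocks(refine_dir(P2, succ)) == as_blocks(P2)   # still pred-stable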

Theorem 4. Algorithm 5 finds the coarsest refinement of a partition of the nodes in a tree that is successor and predecessor stable, in O(n) time.


[Figure: panels (a) For a tree and (b) For a DAG, showing blocks B, B′, D and G.]
Figure 7: Splitting D for successor stability w.r.t. B in a partition P. This does not impact predecessor stability for any block given tree data, but may break predecessor stability in a DAG.

Proof. Follows from Theorems 1, 2 and 3, and the fact that m ∈ O(n) for tree data.

Corollary 1. The F&B-index can be built in O(n) time for tree data using Algorithm 5.

Proof (sketch). The coarsest successor and predecessor stable refinement of a partitioning on label gives the maximum F&B-bisimulation [10], and the summary structure can be constructed from the contraction graph [10].

4 Related Work

There are many variations of structure summaries for graph data, such as partitionings on similarity [8, 15], the F+B-index [10], the A(k)-index [12], and the D(k)-index [2]. Note that an implication of Theorem 3 is that the F+B-index and the F&B-index are identical for tree data. The cost of matching in the summary can be reduced by label-partitioning it and using specialized joins [1], or by using multi-level structure indexing [18]. For queries using ancestor–descendant edges on graph data, different graph encodings offer trade-offs between space usage and query time [21]. For a general overview of indexing and search in XML, see the survey by Gou and Chirkova [6]. There is previous research on updates of bisimilarity partitions [11, 20, 17]. Some of these methods trade update time for coarseness, as refinements of bisimilarity may be cheaper to compute after a data update. Single-directional bisimulations are used in many fields, such as modal logic, concurrency theory, set theory, and formal verification [3], but to our knowledge, F&B-bisimulation is not frequently used outside XML search.

5 Conclusions and Future Work

In this paper we have improved the running time for refining a partition to the coarsest simultaneous successor and predecessor stability for tree data from O(n log n) to O(n), and with that the computation of F&B-bisimilarity, and the construction of the F&B-index.¹ An incorrect linear algorithm for DAGs has been presented recently [13], and it would be interesting to know whether the problem is actually solvable in linear time for DAGs.

¹ See the extended version of this paper for some performance experiments [7].

A natural extension of our work would be to reduce the cost of updates in F&B-bisimilarity partitions for trees. A particularly interesting direction would be to improve indexing performance for typical XML document collections, where there is a large number of small independent documents. It may be possible to iteratively add documents to the index with (expected) cost dependent only on the size of the documents.

Acknowledgments

This material is based upon work supported by the iAd Project funded by the Research Council of Norway and the Norwegian University of Science and Technology. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of the funding agencies.

References

[1] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[2] Qun Chen, Andrew Lim, and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. In Proc. SIGMOD, 2003.

[3] Agostino Dovier, Carla Piazza, and Alberto Policriti. A fast bisimulation algorithm. In Proc. CAV, 2001.

[4] Jean-Claude Fernandez. An implementation of an efficient algorithm for bisimulation equivalence. Sci. Comput. Program., 13(2–3), 1990.

[5] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. VLDB, 1997.

[6] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. Knowl. and Data Eng., 2007.

[7] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees (extended version). Technical Report IDI-TR-2010-10, NTNU, Trondheim, Norway, 2010.

[8] Monika R. Henzinger, Thomas A. Henzinger, and Peter W. Kopke. Computing simulations on finite and infinite graphs. In Proc. FOCS, 1995.

[9] Paris C. Kanellakis and Scott A. Smolka. CCS expressions, finite state processes, and three problems of equivalence. In Proc. PODC, 1983.

[10] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[11] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Pradeep Shenoy. Updates for structure indexes. In Proc. VLDB, 2002.

[12] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. ICDE, 2002.

[13] Xianmin Liu, Jianzhong Li, and Hongzhi Wang. SAM: An efficient algorithm for F&B-index construction. In Proc. APWeb/WAIM, 2007.

[14] Robin Milner. Communication and Concurrency. Prentice-Hall, Inc., 1989.

[15] Tova Milo and Dan Suciu. Index structures for path expressions. In Proc. ICDT, 1999.

[16] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 1987.

[17] Diptikalyan Saha. An incremental bisimulation algorithm. In Proc. FSTTCS, 2007.

[18] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data & Knowl. Eng., 2008.

[19] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[20] Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. Incremental maintenance of XML structural indexes. In Proc. SIGMOD, 2004.

[21] Jeffrey Xu Yu and Jiefeng Cheng. Graph reachability queries: A survey. In Managing and Mining Graph Data, Advances in Database Systems. Springer US, 2010.

A Experiments

To investigate the practical impact of the algorithms presented, we have run a small experiment on the common XML tree benchmarks DBLP², XMark³ and Treebank⁴. The general graph algorithm and the specialized tree algorithm were implemented in C++ and run on an AMD Athlon 3500+. We generated XMark data of roughly the same size as the Treebank benchmark, and used a prefix of the DBLP data of similar size. Only XML elements and attributes, without their values, were indexed. Properties of the data sets and the constructed indexes are listed in Figure 8. DBLP is a shallow tree, while XMark and Treebank are much deeper. Note that Treebank has a larger number of labels, and a structure that gives a large number of bisimulation partitions.

² http://dblp.uni-trier.de/xml/dblp.xml
³ http://www.xml-benchmark.org/downloads.html
⁴ http://www.cs.washington.edu/research/xmldatasets/data/treebank/


              Nodes      Labels   B-partitions   F-partitions   F&B-partitions
  DBLP        2 830 635      33             88            292             2479
  XMark       2 048 193      77            547         31 230          484 615
  Treebank    2 437 667     251        338 749        443 639        2 277 127

Figure 8: Data set properties.

Figure 9a shows the number of times each edge was used in a stabilization operation on average throughout the refinement, while Figure 9b shows the time in milliseconds divided by the number of edges in the graph. For the DBLP data, the original algorithm is actually slightly faster for all partition types. The explanation can be found in the number of times each edge is used, which is close to optimal. The specialized tree algorithm has some linear-time overhead, because the refinement on maximum distance is slightly more involved than the separation of roots and leaves in the graph algorithm. Partitioning on B-bisimilarity and F-bisimilarity has roughly the same cost with both algorithms on all the data sets, again because the graph algorithm is very far from its worst-case behavior.

[Figure: two bar charts over the data sets DBLP, XMark and Treebank and the partition types B, F and F&B, comparing StabilizeGraph and StabilizeTree. Panels: (a) Avg. uses per edge; (b) Avg. time per edge (in ms).]
Figure 9: Computing different bisimulations using the original algorithm for graphs or the specialized algorithm for trees.

On F&B-index construction for XMark data, the tree algorithm uses one third of the edge operations, and is almost twice as fast as the general graph algorithm. Note that the 5.9 uses of each edge in the graph algorithm is far from the worst-case number of uses per edge. For the Treebank benchmark, the number of times an edge is used is smaller, and so is the difference between the algorithms. A greater part of the cost is the overhead of maintaining the blocks in the partition, because the number of blocks is close to the number of nodes in the tree, as can be seen in Figure 8.


B Bi-directional Bisimulation and Stability

Most work on bisimulation considers only forward bisimulation on edge-labeled graph data, and most work on stabilizing partitions considers only predecessor stabilization on node-labeled graph data. When bi-directional bisimulation and its connection to predecessor and successor stabilization were introduced [10], extensions of these previous results were omitted, probably due to space restrictions. We therefore go through and extend the results for bisimulation [14], partition stabilization [16] and their connection [9] to the bi-directional case for node- and edge-labeled data. The extensions may be trivial, but this appendix also serves as an easily accessible reference.

Assume a node- and edge-labeled graph G = 〈V, E〉 with node set V and edge set E ⊆ V × V. The label of a node v ∈ V is L(v) ∈ A_V, and the label of an edge e ∈ E is L(e) ∈ A_E. Let A = A_V ∪ A_E. For a given α ∈ A_V, let V_α be the set of nodes with this label, V_α = {v | v ∈ V, L(v) = α}. Similarly, for a given α ∈ A_E, let E_α = {e | e ∈ E, L(e) = α}.

B.1 Bisimulation

Definition 4 (Node- and edge-labeled F&B-bisimulation). R ⊆ V × V is an F&B-bisimulation if v R v′ implies that the following properties are satisfied:

1. L(v) = L(v′)
2. u E_α v ⇒ ∃u′ : u′ E_α v′ ∧ u R u′
3. u′ E_α v′ ⇒ ∃u : u E_α v ∧ u R u′
4. v E_α w ⇒ ∃w′ : v′ E_α w′ ∧ w R w′
5. v′ E_α w′ ⇒ ∃w : v E_α w ∧ w R w′

This is illustrated in Figure 10. The definition becomes equal to the definition of F&B-bisimulation for node-labeled data [10] when there is only a single edge label α, and equal to the original (forward) bisimulation definition [14] when there is a single node label and we ignore Properties 2 and 3. Various types of simulation [8] are defined by ignoring at least Properties 3 and 5.

The following lemma trivially extends the original for forward bisimulation in edge-labeled graphs [14, Proposition 4.1].

Lemma 4 (Constructing F&B-bisimulations). Assuming that R_i and R_j are F&B-bisimulations, the following are also F&B-bisimulations:

(1) ∅
(2) Id_V = {〈v, v〉 | v ∈ V}
(3) R_i⁻¹ = {〈v′, v〉 | 〈v, v′〉 ∈ R_i}
(4) R_i ∪ R_j = {〈v, v′〉 | 〈v, v′〉 ∈ R_i ∨ 〈v, v′〉 ∈ R_j}
(5) R_i R_j = {〈v, v′′〉 | 〈v, v′〉 ∈ R_i, 〈v′, v′′〉 ∈ R_j}


[Figure: diagram of nodes u, v, w and u′, v′, w′ connected by E_α edges, with R relating each unprimed node to its primed counterpart; drawing omitted.]
Figure 10: Illustration of F&B-bisimulation.


Proof. Property 1 is trivially satisfied for (1)–(5), as it is an equality. (1) is true as there are no v R v′. For (2), substitute u for u′ and w for w′ in Properties 2–5. (3) is true by the symmetry of Properties 2 and 3, and of Properties 4 and 5. (4) is true by substituting R_i ∪ R_j for R in Properties 2–5. For (5), assume that v R_i R_j v′′. Then there is some v′ such that v R_i v′ and v′ R_j v′′. For Property 2, let there be some u such that u E_α v. Then, as v R_i v′ and R_i is an F&B-bisimulation, there exists a u′ such that u′ E_α v′ and u R_i u′. As v′ R_j v′′ and R_j is an F&B-bisimulation, there exists a u′′ such that u′′ E_α v′′ and u′ R_j u′′, and the relation u R_i R_j u′′ follows. Proofs of Properties 3–5 are similar.

Corollary 2 (Union of bisimulation refinements). It follows from Lemma 4(4) that given an initial relation S, the union of all F&B-bisimulations that are refinements of S is also an F&B-bisimulation. Trivially, this union is also a refinement of S.

B.2 Bisimilarity

We know from Corollary 2 that there exists a unique maximum F&B-bisimulation, of which all other F&B-bisimulations are refinements:

Definition 5 (F&B-bisimilarity). Nodes v and v′ are F&B-bisimilar, written v ∼_F&B v′, iff v R v′ for some F&B-bisimulation R. Equivalently,

∼_F&B = ⋃ {R | R is an F&B-bisimulation}

If the type of bisimulation is clear from the context, such as in the following, we write just ∼ for ∼_F&B.

The next theorem is a generalization of the following Corollary 3, which is taken from Milner's original results [14]. This more general result will be useful when linking F&B-bisimulations with partition P&S-stability in Section B.3.

Theorem 5 (Largest bisimulation partition refinement is an equivalence). If Q is a partition of the nodes in a graph, then the largest F&B-bisimulation R that is a refinement of =_Q is an equivalence relation.


Proof. Reflexivity: We have that Id_V ⊆ =_Q as =_Q is an equivalence relation, and that Id_V is an F&B-bisimulation by Lemma 4(2). Hence Id_V ⊆ R, as R contains all such F&B-bisimulations by Corollary 2. Symmetry: If v R v′, then v R_i v′ for some F&B-bisimulation R_i ⊆ R. We have that v′ R_i⁻¹ v by definition of the inverse, and R_i⁻¹ is an F&B-bisimulation by Lemma 4(3). For any v′ R_i⁻¹ v, we have that v′ =_Q v, because v R_i v′ implies v =_Q v′ by assumption, and =_Q is an equivalence relation and hence symmetric. This means that R_i⁻¹ ⊆ =_Q, and that R_i⁻¹ ⊆ R by the maximality of R, and v′ R v follows. Transitivity: If v R v′ and v′ R v′′, then for some F&B-bisimulations R_i ⊆ R and R_j ⊆ R we have that v R_i v′ and v′ R_j v′′. By Lemma 4(5), R_i R_j is an F&B-bisimulation. For any x, x′ and x′′, if x R_i x′ and x′ R_j x′′, then x =_Q x′ and x′ =_Q x′′ follow from R_i, R_j ⊆ R ⊆ =_Q. Because =_Q is an equivalence relation and hence transitive, x =_Q x′′ follows, and hence R_i R_j ⊆ =_Q. By the maximality of R we have that R_i R_j ⊆ R, and therefore v R v′′.

The next corollary is a simple extension of the original result for single-directional bisimulations in edge-labeled data [14, Proposition 2].

Corollary 3 (Bisimilarity is an equivalence). It follows from Theorem 5 that F&B-bisimilarity is an equivalence relation.

The following lemma states that for F&B-bisimilarity, the word implies in Definition 4 can be exchanged with iff. The proof is almost identical to the proof for the single-directional bisimulation for edge-labeled data [14, Lemma 3 + Proposition 4].

Lemma 5. Properties 1–5 in Definition 4, with ∼ substituted for R, imply v ∼ v′.

Proof. Define ∼′ to be a relation such that v ∼′ v′ iff Properties 1–5 are satisfied with ∼ substituted for R. We have that v ∼ v′ implies v ∼′ v′, because v ∼ v′ implies the properties, and the properties imply v ∼′ v′ by the definition of ∼′. To prove that the properties imply v ∼ v′, we prove that v ∼′ v′ implies v ∼ v′, i.e., ∼′ ⊆ ∼. Assume that v ∼′ v′. Property 1 is trivially satisfied. By the definition of ∼′, for all u such that u E_α v, there exists a u′ such that u′ E_α v′ and u ∼ u′, which, as shown above, implies u ∼′ u′, satisfying Property 2. Properties 3–5 are shown similarly.

B.3 Stability

The following is a reformulation of Definition 2 of successor and predecessor stability, adding edge labels:

Definition 6 (Stability). A set D ⊆ V is S_α-stable with respect to a set B ⊆ V iff D ⊆ E_α(B) or D ∩ E_α(B) = ∅, and P_α-stable with respect to B iff D ⊆ E_α⁻¹(B) or D ∩ E_α⁻¹(B) = ∅. Furthermore, D is S-stable or P-stable with respect to B iff D is S_α-stable or P_α-stable with respect to B for all α ∈ A_E, respectively.

If D is both successor stable and predecessor stable with respect to B, we say that D is P&S-stable with respect to B, or simply stable if there is no ambiguity.

For any combination of successor and predecessor stability, a partition Q of V is said to be stable with respect to a block B if all blocks in Q are stable with respect to B. A partition Q is stable with respect to another partition Q′ if it is stable with respect to all blocks in Q′. Q is said to be stable if it is stable with respect to itself.

B.4 Stability and bisimulation

The next lemma states that an F&B-bisimulation can be found by P&S-stabilizing a partition, extending the link between single-directional (forward) bisimulation and the relational coarsest (predecessor) stable partition problem [9, Lemma 3].

Lemma 6 (Stability implies bisimulation). If a partition Q respects node labels and is P&S-stable, then =_Q is an F&B-bisimulation.

Proof. We must prove that v =_Q v′ implies Properties 1–5 in Definition 4. Property 1 is satisfied as node labeling is assumed to be respected by the partition. For Property 2, assume that D ∈ Q is the block containing v and v′, and that there is an edge u E_α v for some α. Let B ∈ Q be the block containing u. We have that D ∩ E_α(B) ≠ ∅ from u E_α v, and hence D ⊆ E_α(B) from the successor stability. This means that for any v′ ∈ D, there is some u′ ∈ B, i.e., u =_Q u′, such that u′ E_α v′. Properties 3–5 are proved similarly.

Corollary 4. Given a partition Q, by Definition 5 and Lemma 6 there is a unique coarsest P&S-stable refinement of Q, of which all other P&S-stable refinements of Q are again refinements.

Lemma 7 (Bisimulation and equivalence implies stability). If a relation R is an F&B-bisimulation and an equivalence relation, then the equivalence classes of R give a P&S-stable partition respecting node labels.

Proof. Let Q be the partition arising from R. Q respects labels by Property 1 in Definition 4. Successor stability: We prove that for blocks D, B ∈ Q, we have that D ∩ E_α(B) ≠ ∅ implies D ⊆ E_α(B). Assume D ∩ E_α(B) ≠ ∅, i.e., there is some node v ∈ D such that v ∈ E_α(B), and hence some u ∈ B such that u E_α v. For each v′ ∈ D, we have that v R v′, and by Property 2 in Definition 4 there is some u′ such that u′ E_α v′ and u R u′, i.e., u′ ∈ B and v′ ∈ E_α(B). Hence, D ⊆ E_α(B). Predecessor stability: The proof is identical, substituting E_α with E_α⁻¹.

Theorem 6 (Maximum refinements). Given an initial partition Q, if R is the maximum F&B-bisimulation that is a refinement of =_Q, and P is the coarsest stable partition refining Q respecting node labels, then =_P is equal to R.

Proof. Follows from Lemmas 6 and 7.

B.5 Implementing Stability

The following two lemmas are trivial extensions of Paige and Tarjan's original description [16].


Lemma 8 (Stability inherited under refinement). For any type of stability, if a partition Q_2 is a refinement of Q_1 and Q_1 is stable with respect to a partition Q, then so is Q_2.

Proof. For any block D_2 ∈ Q_2 there is some block D_1 ∈ Q_1 such that D_2 ⊆ D_1. We show successor stability. Assume a given α ∈ A_E for S_α-stability, or any α for general S-stability. For any block B ∈ Q, if D_1 ⊆ E_α(B) then D_2 ⊆ E_α(B), and if D_1 ∩ E_α(B) = ∅ then D_2 ∩ E_α(B) = ∅. Predecessor stability is symmetric.

Lemma 9 (Stability inherited under union). For any type of stability, if a partition Q is stable with respect to both B_1 ⊆ V and B_2 ⊆ V, then Q is stable with respect to B_1 ∪ B_2.

Proof. For successor stability, for any block D ∈ Q and a given or any α ∈ A_E, if at least one of D ⊆ E_α(B_1) and D ⊆ E_α(B_2) is true, then D ⊆ E_α(B_1 ∪ B_2) is true. If both of D ∩ E_α(B_1) = ∅ and D ∩ E_α(B_2) = ∅ are true, then D ∩ E_α(B_1 ∪ B_2) = ∅ is true. Predecessor stability is symmetric.

When using the procedure StabilizeWRT() in our stabilization algorithms, an in-place stabilization is performed, while in the theoretical description [16] of the original algorithm, there is a function split(B, Q), which returns the coarsest refinement of Q that is predecessor stable with respect to B. This is naturally generalized to stabilizing on P-stability, S-stability and P&S-stability. The following functions are handy for the definitions:

clean(Q) = {B | B ∈ Q ∧ B ≠ ∅}
sep(C, Q) = clean({D ∩ C | D ∈ Q} ∪ {D \ C | D ∈ Q})

Note that sep is commutative, i.e., sep(C_1, sep(C_2, Q)) = sep(C_2, sep(C_1, Q)).

Definition 7 (α-Split).

split_Sα(B, Q) = sep(E_α(B), Q)
split_Pα(B, Q) = sep(E_α⁻¹(B), Q)
split_P&Sα(B, Q) = split_Pα(B, split_Sα(B, Q))
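For readers who prefer code, the following is a minimal C++ sketch of clean, sep and split_Sα (a hypothetical, set-based transcription chosen for clarity; it does not attempt the complexity bounds of the actual algorithms in [16]):

#include <set>
#include <vector>

using Block = std::set<int>;
using Partition = std::vector<Block>;
using Adj = std::vector<std::vector<int>>;  // per-node α-successor lists

// clean(Q): drop empty blocks.
Partition clean(const Partition& q) {
    Partition out;
    for (const Block& b : q)
        if (!b.empty()) out.push_back(b);
    return out;
}

// sep(C, Q): replace every block D by D ∩ C and D \ C.
Partition sep(const Block& c, const Partition& q) {
    Partition out;
    for (const Block& d : q) {
        Block inter, diff;
        for (int v : d)
            (c.count(v) ? inter : diff).insert(v);
        out.push_back(inter);
        out.push_back(diff);
    }
    return clean(out);
}

// E_α(B): the α-successors of the nodes in B.
Block image(const Adj& succ, const Block& b) {
    Block out;
    for (int v : b)
        out.insert(succ[v].begin(), succ[v].end());
    return out;
}

// split_Sα(B, Q) = sep(E_α(B), Q); split_Pα is identical, but built
// on the predecessor lists.
Partition splitS(const Adj& succ, const Block& b, const Partition& q) {
    return sep(image(succ, b), q);
}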

Lemma 10 (Splitting and stability). A partition Q is:

S-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_Sα(B, Q)
P-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_Pα(B, Q)
P&S-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_P&Sα(B, Q)

Proof. Follows directly from Definitions 6 and 7.

Lemma 11 (α-splitting is monotone). Given a set B ⊆ V, if a partition Q_2 is a refinement of a partition Q_1, then split_Sα(B, Q_2) is a refinement of split_Sα(B, Q_1), and split_Pα(B, Q_2) is a refinement of split_Pα(B, Q_1).

Proof. For any block D_2 ∈ Q_2, there is a block D_1 ∈ Q_1 such that D_2 ⊆ D_1, and for split_Sα we have D_2 ∩ E_α(B) ⊆ D_1 ∩ E_α(B) and D_2 \ E_α(B) ⊆ D_1 \ E_α(B). Similarly, for split_Pα we have D_2 ∩ E_α⁻¹(B) ⊆ D_1 ∩ E_α⁻¹(B) and D_2 \ E_α⁻¹(B) ⊆ D_1 \ E_α⁻¹(B).


Lemma 12 (Correctness of splitting). For an initial partition Q_0, let R be the coarsest refinement of Q_0 that is P&S-stable, and let Q_i and Q′ be refinements of Q_0 such that R is a refinement of both Q_i and Q′. If U is a union of blocks in Q′, then for any α we have that R is a refinement of split_Sα(U, Q_i), of split_Pα(U, Q_i), and hence of split_P&Sα(U, Q_i).

Proof. Since U is a union of blocks from Q′ it is also a union of blocks from R, which refines Q′. As R is S-stable we have R = split_Sα(U, R) for any α. From Lemma 11 we have that split_Sα(U, R) is a refinement of split_Sα(U, Q_i) when R is a refinement of Q_i. Hence, R is a refinement of split_Sα(U, Q_i). P-stability is symmetric for split_Pα.

Algorithm 6 shows the generic framework for stabilization used in Paige and Tarjan's algorithm [16], extended to our more general case.

Algorithm 6 Generic stabilization
  Given a partition Q_0.
  i ← 0
  Until no change is possible,
    find some set U that is a union of blocks in Q_i
      such that Q_i ≠ split_P&Sα(U, Q_i) for some α ∈ A_E.
    Q_{i+1} ← split_P&Sα(U, Q_i).
    i ← i + 1
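As a concrete (if deliberately naive) reading of Algorithm 6, the following self-contained C++ sketch repeats the small helpers from the sketch after Definition 7 and then splits on every block and every label, in both directions, until a fixpoint is reached. By Lemma 10, single blocks suffice as the sets U, and since sep never merges blocks, any change strictly increases the block count, which also bounds the number of iterations:

#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Block = std::set<int>;
using Partition = std::vector<Block>;
using Adj = std::vector<std::vector<int>>;  // per-node neighbour lists

static Partition sep(const Block& c, const Partition& q) {
    Partition out;
    for (const Block& d : q) {
        Block inter, diff;
        for (int v : d) (c.count(v) ? inter : diff).insert(v);
        if (!inter.empty()) out.push_back(inter);
        if (!diff.empty()) out.push_back(diff);
    }
    return out;
}

static Block image(const Adj& e, const Block& b) {
    Block out;
    for (int v : b) out.insert(e[v].begin(), e[v].end());
    return out;
}

// succ[a] and pred[a] hold the forward and backward adjacency lists
// for edge label a. Returns the coarsest P&S-stable refinement of q
// (cf. Theorem 7).
Partition stabilize(Partition q, const std::vector<Adj>& succ,
                    const std::vector<Adj>& pred) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < q.size() && !changed; ++i) {
            const Block b = q[i];  // copy: q is rebuilt below
            for (std::size_t a = 0; a < succ.size() && !changed; ++a) {
                // split_P&Sα(B, Q) = sep(E_α⁻¹(B), sep(E_α(B), Q))
                Partition r = sep(image(pred[a], b),
                                  sep(image(succ[a], b), q));
                if (r.size() != q.size()) {  // a split strictly adds blocks
                    q = std::move(r);
                    changed = true;
                }
            }
        }
    }
    return q;
}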

Corollary 5 (Stabilization correct). It follows from Lemma 12 that Algorithm 6 maintains the invariant that the coarsest P&S-stable refinement of the initial partition Q_0 is also a refinement of the current partition Q_i.

Theorem 7 (Stabilization terminates). Algorithm 6 terminates, and then Q_i is the coarsest P&S-stable refinement of Q_0.

Proof. As long as Q_i is not P&S-stable, by Lemma 10 there exists a union U of blocks in Q_i and an α ∈ A_E such that split_Xα(U, Q_i) ≠ Q_i for some X ∈ {S, P, P&S}. When split_Xα(U, Q_i) = Q_i for all U and α, Q_i is P&S-stable by Lemma 10, and by Corollary 5, Q_i is the coarsest P&S-stable refinement of Q_0.

Corollary 6. It follows from Theorem 6 and Theorem 7 that F&B-bisimilarity can be computed by using Algorithm 6 to compute the coarsest P&S-stable refinement of the label partition.



Appendix A

Other Papers

“No. Better research needed. Fire your research person.
No fishnet stockings. Never. Not in this band.”
– Gene Simmons



Paper 7

Nils Grimsmo
On performance and cache effects in substring indexes
Technical Report IDI-TR-2007-04, Norwegian University of Science and Technology, Trondheim, Norway, 2007

Abstract This report evaluates the performance of uncompressed and compressed substring indexes on build time, space usage and search performance. It is shown how the structures react to increasing data size, alphabet size and repetitiveness in the data. Conclusions are drawn as to when the different data structures should be used. The main contribution is the strong relationship identified between time performance and locality in the data structures. As an example, it is found that for byte-sized alphabets, suffix tree construction can be sped up by a factor of 16, and query lookup by a factor of 8, if dynamic arrays are used to store the lists of children for each node instead of linked lists, at the cost of using about 20% more space. For enhanced suffix arrays, query lookup is up to twice as fast if the data structure is stored as an array of structs instead of a set of arrays, at no extra space cost.

Research process

I started on this PhD right after finishing my master's degree [16], where the topic was substring indexes, such as suffix trees and suffix arrays. XML search was the topic my PhD supervisors had planned for me to work on, but I felt that I had some results on substring indexing I should try to publish first.

Retrospective view

In retrospect, pursuing this research direction was a bad idea, as I was fresh out of school and tried to publish in a field that was outside the expertise of my supervisors. Also, in the years just before I started working on this topic, some research had been published that drastically changed the field, and my research turned out to be outdated and uninteresting for the research community. The results of this venture were published in a technical report [17].


1 Introduction

The suffix tree is a versatile substring index, first introduced by Weiner [44] (see more accessible descriptions by McCreight and Ukkonen [29, 42]). It can be used to search for occurrences of patterns in a string, and to solve many other problems in combinatorial pattern matching [16]. The suffix tree and similar structures are mostly used in computational biology, where the sequences considered do not have word boundaries, and methods such as inverted lists are not suitable.

1.1 Suffix Tree Definition

A suffix tree for a string T of length n from the alphabet Σ of size σ is a trie of all n + 1 suffixes of the string padded with a unique terminal, edge compacted such that every internal node is branching. Since the padded string has n + 1 suffixes, and the unique terminal ensures that no padded suffix is a prefix of another, the tree has n + 1 leaf nodes. There are at most n internal nodes, since they all have at least two children. This means that if edges are represented as pointers into the string, the suffix tree needs Θ(n) space measured in computer words, or Θ(n log n) bits, which is slightly more than the optimal O(n log σ) bits, the space needed to store the string indexed.

In addition to the parent–child edges, each internal (non-root) node has a suffix link to another internal node [29]. If the edges from the root to an internal node spell χα, where χ is a string of length 1, this node has a suffix link to the internal node which represents the string α. These links are used in construction algorithms, and to solve some problems, such as longest common substring [16].

A suffix tree can be built in Θ(n) time when the alphabet size is constant [44] or integer [5]. As with any trie, it can be checked whether a pattern P of length m is contained in the tree in O(m) time, if node child lookup is Θ(1). Since all leaf nodes below the tree position found in such a lookup represent unique matches, and all internal nodes are branching, at most 2z − 1 nodes must then be visited to find all z matches, giving a total time of O(m + z). The problem of finding all matches of a pattern in a string or a set of strings is known as the occurrence listing problem. A suffix tree has asymptotically optimal construction time for integer alphabets, and optimal search time for constant alphabets. Optimal search time for larger alphabets can be achieved with perfect hashing, at the cost of a longer construction time [10].
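To make the descent-then-scan pattern concrete, here is a toy C++ sketch (not the thesis implementation): it builds an uncompacted suffix trie, so construction is O(n²) rather than Θ(n), but the lookup shows the O(m) descent followed by the subtree scan that gives the O(m + z) listing bound in the edge-compacted tree:

#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::map<char, std::unique_ptr<Node>> child;
    int leaf = -1;  // suffix start position if this node ends a suffix
};

// Insert every suffix of t, padded with a terminal assumed absent in t.
Node buildTrie(const std::string& t) {
    const std::string s = t + '$';
    Node root;
    for (std::size_t i = 0; i < s.size(); ++i) {
        Node* cur = &root;
        for (std::size_t j = i; j < s.size(); ++j) {
            auto& next = cur->child[s[j]];
            if (!next) next = std::make_unique<Node>();
            cur = next.get();
        }
        cur->leaf = static_cast<int>(i);
    }
    return root;
}

void collect(const Node& n, std::vector<int>& hits) {
    if (n.leaf >= 0) hits.push_back(n.leaf);
    for (const auto& [sym, ch] : n.child) collect(*ch, hits);
}

std::vector<int> occurrences(const Node& root, const std::string& p) {
    const Node* cur = &root;
    for (char c : p) {  // O(m) descent, one child lookup per symbol
        auto it = cur->child.find(c);
        if (it == cur->child.end()) return {};
        cur = it->second.get();
    }
    std::vector<int> hits;
    collect(*cur, hits);  // in a compacted tree this visits at most 2z − 1 nodes
    return hits;
}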

A generalised suffix tree [16] for a set of strings S = {T_1, . . . , T_d} is an edge compacted trie of all suffixes of each string T_i padded with a unique terminator string $_i. A string T_i of length n_i can be added to the tree in Θ(n_i) time by slightly modifying a suffix tree construction algorithm, and can be removed in Θ(n_i) time.

1.2 Alternatives to Suffix Trees

A high constant factor in the space needed for suffix trees makes them unsuited for solving problems with large strings on commodity computers, as they do not work well on disk [2]. The space-efficient representation by Kurtz [22] needs 13 bytes per string symbol on average¹. This has led to the development of many alternative index structures.

¹ When supporting strings up to 500 MB.

The suffix array [27] is a simplification of the suffix tree, consisting only of an array A with the sorted order of the suffixes of the padded string, such that T[A[i] . . . n + 1] < T[A[i + 1] . . . n + 1] for all 1 ≤ i ≤ n. The order in A is equal to the order of the leaf nodes seen in an ordered depth-first traversal of the suffix tree for the string. The space usage is n⌈log n⌉ bits, plus n⌈log σ⌉ bits for storing the text. Lookup by binary search is O(m log n), which can be improved to O(m + log n) by using additional arrays storing longest common prefixes (LCPs) between the suffixes indexed [27]. After finding the left and right border of the suffixes matching the pattern, the hits are read by a sequential scan in the array. Suffix arrays can be constructed in Θ(n) time [18, 20, 21], but algorithms with higher worst-case bounds are usually faster in practice [28, 40].
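The O(m log n) lookup is short enough to sketch in full; the following self-contained C++ fragment uses a deliberately naive comparison-based sort for construction (the linear-time algorithms cited above are far more involved), and two binary searches to find the borders of the matching range:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Naive construction: sort suffix start positions by suffix.
std::vector<int> buildSuffixArray(const std::string& t) {
    std::vector<int> sa(t.size());
    for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = static_cast<int>(i);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// Returns [lo, hi) in sa such that every suffix in the range has p as
// a prefix; each comparison is O(m), giving O(m log n) in total.
std::pair<int, int> findRange(const std::string& t,
                              const std::vector<int>& sa,
                              const std::string& p) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), p,
        [&](int pos, const std::string& pat) {
            return t.compare(pos, pat.size(), pat) < 0;
        });
    auto hi = std::upper_bound(sa.begin(), sa.end(), p,
        [&](const std::string& pat, int pos) {
            return t.compare(pos, pat.size(), pat) > 0;
        });
    return {static_cast<int>(lo - sa.begin()),
            static_cast<int>(hi - sa.begin())};
}

The z hits are then read off as sa[lo], . . . , sa[hi − 1], which is exactly the sequential scan between a left and a right border described above.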

A “hybrid” between suffix arrays and trees is the enhanced suffix array, which has the same functionality as suffix trees, but uses less space in the worst case. It is a suffix array with additional fields, and can be built in linear time. Abouelhoda et al. [1] show how to use the enhanced suffix array to replace any algorithm doing top–down, bottom–up or suffix link traversal in a suffix tree. The steps of looking up a pattern are similar to what is done in a suffix tree implemented with sibling lists. Listing hits is done as in suffix arrays, by a sequential scan between a left and a right border. Enhanced suffix arrays are expected to list large numbers of hits much faster than suffix trees in practice, due to the improved locality.

An alternative non-linear suffix tree construction is the write-only top-down construction (wotd) [11], in which the suffix tree is constructed in O(n²) time (expected Θ(n log n)). Sibling nodes are stored adjacently in memory. This spatial locality makes it faster than linear-time suffix trees in some cases.

Many compressed substring indexes have been introduced over the last ten years. See [34] for an introduction, time and space bounds, and an extensive list of references. The family of LZ-indexes [19, 8, 32] uses structures based on the Ziv-Lempel decomposition of the string [45]. Other indexes are based on suffix arrays, such as the family of compressed suffix arrays [23, 15, 38, 14, 24]. The FM-index family [7, 12, 9, 26] also builds on suffix arrays, but uses a different search method. These structures offer various tradeoffs between index size, construction time, and query performance. Many of these structures do not depend upon keeping a copy of the text, and are hence called self-indexes.

Sadakane [39] has shown that it is also possible to combine a compressed suffix array with a balanced parenthesis representation of a suffix tree and various other structures to get full functionality, with bottom–up, top–down and suffix link traversal of the tree.

1.3 Dynamic Problems

Suffix trees have been equalled by other structures in terms of construction time and search speed, and surpassed on space usage and practical performance when listing many hits. The only problems in which suffix trees have not been matched (at least in asymptotic terms, to the author's knowledge) are problems concerning dynamic sets of strings, in which strings are added to and removed from the set. Generalised suffix trees can be used to solve many of these problems optimally.

Problems with static sets of strings can be solved by using suffix arrays or similar structures, by indexing the concatenation of the strings, separated by a unique symbol ∉ Σ. Also, any dynamic decomposable indexing problem can be solved by using a hierarchy of static indexes of varying sizes, at the cost of an O(log |S|) overhead on build time and search performance [35]. Grimsmo [13] shows the performance of hierarchies of suffix arrays compared to a generalised suffix tree.

A problem related to the occurrence listing problem is the document listing problem for sets of strings: given a pattern, find all documents in which it occurs. Muthukrishnan [31] shows how to solve this problem optimally by using a suffix tree with additional information. The technique used can be adapted to suffix arrays and similar structures. An open problem (to the author's knowledge) is solving the document listing problem optimally for dynamic sets of documents.

1.4 Report Overview

Section 2 details how to efficiently implement a suffix tree using dynamic arrays for the parent–child relationship, with a low space overhead compared to sibling lists. Section 3 describes some of the choices made in the tested implementation of enhanced suffix arrays. Section 4 features extensive tests of many substring indexes, on build time and search performance, with many types of test data. It is shown how the indexes react to increasing data size, varying alphabet size, and varying text randomness. Section 5 draws conclusions from the experiments, and gives some guidelines for when the various types of index structures are suitable.

2 Implementation of Suffix Trees

The suffix tree data structure consists of a set of nodes. These nodes are linked together with parent–child edges and suffix links. One of the major choices when implementing suffix trees is how to represent the parent–child relationship. McCreight [29] describes the use of linked lists (called sibling lists) and hash tables. His article states that using an array of pointers “would be fast to search, but slow to initialise and prohibitive in size for a large alphabet.” This was true in 1976.

2.1 Erroneous Assumptions on Using Arrays for Child Lists

Various authors have made assumptions about the efficiency of different ways to implement suffix trees. Bedathur and Haritsa [2] claim that using arrays would result in wasted space, as there would be a lot of null pointers. They say this would be especially severe in the lower parts of the tree. This implies that the authors do not consider the possibility of storing the pointers in an unsorted array and using dynamic arrays, or that they consider such a solution to be inefficient. Tian et al. [41, page 288] make similar claims, referring to [29] and [2].


This report presents results showing that using dynamic arrays for storing the parent–child relationship clearly outperforms using sibling lists in terms of speed, at the cost of using slightly more space.

2.2 Parent-Child Relationship

Suffix tree construction is usually described as being Θ(n), with the assumption that the size of the alphabet can be viewed as a constant. The most common way of implementing suffix trees is using sibling lists. This is a simple solution, which gives a construction time of O(nσ), and is effective for small alphabets, such as DNA. This alphabet factor is most visible in trees for highly random strings from large alphabets, where there are fewer nodes, but a higher average branching factor.

When considering general alphabets (sorting only possible by comparison), the running time of any suffix tree construction algorithm has a worst-case bound of Ω(n log n), as it has sorting complexity [6]. Farach [5] shows how to construct suffix trees for integer alphabets (σ ≤ n) in Θ(n) time by recursively constructing a suffix tree for the odd-numbered suffixes, building a tree for the even-numbered ones from the information found, and then merging these trees.

McCreight [29] proposed to use hashing as an alternative to sibling lists. Kurtz [22] shows how to do this effectively. Since there is an initial space overhead for each hash table, a single table is used for all nodes in the tree. The keys in this table are pairs of parent node numbers and the first symbols of the edges, and the values are child node numbers. With this scheme, the expected construction time is Θ(n), but finding all occurrences when searching can be very slow, as a lookup for children must be done on all possible symbols in Σ. With sibling lists, all z occurrences are found in O(mσ + z), while with a hash map the expected time is O(m + zσ). It is possible to implement a combination of a hash map and a linked list, where the children of a node are linked together, as shown by Grimsmo [13]. The proposed structure uses less space than what is needed for the combination of the sibling lists and the hash table, but it is complex and slow in practice. Simply using both sibling lists and hashing would probably be faster, at the cost of using more space.
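A minimal sketch of the single-table scheme might look as follows in C++ (ChildTable, link and child are illustrative names, not from the report; packing the (parent, symbol) pair into one 64-bit key avoids a custom hash functor):

#include <cstdint>
#include <unordered_map>

struct ChildTable {
    std::unordered_map<std::uint64_t, std::int32_t> edges;

    static std::uint64_t key(std::int32_t parent, unsigned char symbol) {
        return (static_cast<std::uint64_t>(parent) << 8) | symbol;
    }
    void link(std::int32_t parent, unsigned char symbol, std::int32_t child) {
        edges[key(parent, symbol)] = child;
    }
    // The child reached from parent on symbol, or -1 if there is none.
    std::int32_t child(std::int32_t parent, unsigned char symbol) const {
        auto it = edges.find(key(parent, symbol));
        return it == edges.end() ? -1 : it->second;
    }
};

Note how the structure gives Θ(1) expected lookup on a specific symbol, but offers no way of enumerating the children of a node except by probing all σ possible symbols, which is where the O(m + zσ) listing time above comes from.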

Bedathur and Haritsa [2] propose storing all child pointers inside the nodes in their disk-based suffix tree construction for very small alphabets (σ = 4). However, when using this approach, the array storing the pointers must be of a constant size.

2.3 Sibling List Implementation

The implementation by Kurtz [22] is the fastest space-efficient implementation of linear-time suffix trees known to the author. Nodes are stored as integer fields in an array, and node references are indexes into this array. The following layout is used for internal nodes:

• first-child – Pointer to the first node in the list of children.
• branch-brother – Pointer to the next sibling.
• suffix-link – Pointer to the node α, if this is node χα.
• head-position – The starting position of the second suffix in the string represented in this node.
• depth – The depth of the node.

The fields head-position and depth are used to find the string the incoming edge represents (see [22]). These are used instead of start and end positions because they do not change for a given node during the construction. The number of internal nodes is not known in advance, so the internal nodes are kept in a dynamic array. There are always exactly n + 1 leaf nodes, which can be stored in a static array. The only field needed in leaf nodes is the branch-brother pointer.

As the space needed is at most 5n words for the internal nodes, and n words for the leaf nodes, an upper limit for the total space needed for the suffix tree is 6n words. A word must be ⌈log n⌉ + 1 bits to index all nodes and string positions. A total of 24n bytes is needed if 32-bit words are assumed, plus n bytes for storing the string. Kurtz shows how to reduce this to 20n bytes, by storing suffix-link in the branch-brother field of the last child. Another trick shown is using small nodes, which are internal nodes with fewer fields, exploiting redundant information in the tree. On average, the space usage is about 13n bytes, when supporting strings up to 500 MB.

When the number of children per node is high, child lookup is a costly part of tree traversal, and even more so on a computer architecture where cache misses are expensive. When traversing a sibling list, two cache misses are expected per child visited: one for looking up fields in the node, and one for extracting the first symbol on the incoming edge to the child.

2.4 Child Array Implementation

On modern computers, a miss in the level 1 cache costs around 20 cycles, while a miss in the level 2 cache costs around 200 cycles, more than four times the cost of an integer division [17] (and more if pipelined instructions are considered). This implies that using dynamic arrays and copying could be preferable in many applications where linked lists were previously considered the best option. Below follows a description of how to implement suffix trees with child arrays efficiently.

Each internal node refers to a child array. Child arrays come in a set of predefined sizes. Arrays of the same size are stored together in a container, giving allocation and de-allocation in Θ(1) time by storing free locations in a linked list, with the pointers saved inside the free slots. When a node needs room for more children, a new child array of a larger size is allocated, the child pointers are copied there, and the old array is released.

Table 1 shows the layout of nodes in the implementation used. Two new fields replace first-child and branch-brother in internal nodes. Which child array size is used is given in carr-size, while the position of the child array in the container is given in carr-pos. Since there is no branch-brother field, leaf nodes do not need to be explicitly represented at all. The reasoning is the same as for a hash table implementation [22]. The fields in-symb and in-ref listed in the table are the space needed for each node in its parent's child array. In a traditional sibling list implementation, a lookup in the string is required for each child considered when looking up a specific symbol. This is avoided here by storing the first symbol on the incoming edges interleaved with the node references, which reduces the number of cache misses. While an internal node is identified by its index in memory, a leaf node can be identified by its head-position. The depth can be found by subtracting this from the length of the string. A bit in each node reference is needed to distinguish between internal and leaf nodes. As relatively few cache misses are expected compared to the number of nodes considered in child traversal, this should prove more efficient than sibling lists.

Table 1: Node layouts with child arrays (field sizes in bytes)

                 Large   Small   Leaf
  carr-size          1       1
  carr-pos           4       4
  suffix-link        4
  dist                       1
  hpos               4
  depth              4
  Child array space usage:
  in-symb            1       1      1
  in-ref             4       4      4
  Sum               22      11      5
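In C++, the layouts in Table 1 might be sketched as below (field names follow the table; the structs are illustrative, and packed attributes or #pragma pack would be needed to actually reach the stated byte counts, since compilers pad structs by default). The Sum rows count the node itself plus its 5-byte slot in the parent's child array:

#include <cstdint>

struct LargeNode {            // 1 + 4 + 4 + 4 + 4 = 17 bytes unpadded
    std::uint8_t  carr_size;  // which predefined child array size is used
    std::uint32_t carr_pos;   // position of the child array in its container
    std::uint32_t suffix_link;
    std::uint32_t hpos;       // head position
    std::uint32_t depth;
};

struct SmallNode {            // 1 + 4 + 1 = 6 bytes unpadded
    std::uint8_t  carr_size;
    std::uint32_t carr_pos;
    std::uint8_t  dist;       // see the small-node trick in [22]
};

struct ChildSlot {            // one entry in a parent's child array: 5 bytes
    std::uint8_t  in_symb;    // first symbol on the incoming edge
    std::uint32_t in_ref;     // node reference; one bit marks leaf vs internal
};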

The term array doubling is often used when describing dynamic arrays, but it is a bit misleading, as the growth factor can be different from 2. The total amortised cost of inserting n values into the array is Θ(n) for any constant growth factor. If a growth factor of 2 is used, and n values are inserted into a dynamic array, the worst-case total number of insertions and re-insertions is about 3n. If the growth factor is 1.1, the worst case is about 12n. Because of the cache effects on modern computers, copying is very cheap, and using a low growth factor is affordable.
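These constants follow from a geometric series: with growth factor g, the number of elements re-inserted by all the copy steps is at most n(1 + 1/g + 1/g² + · · ·) = n · g/(g − 1), so the worst-case total is n(1 + g/(g − 1)) operations. For g = 2 this gives n(1 + 2) = 3n, and for g = 1.1 it gives n(1 + 11) = 12n, matching the figures above.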

The total space needed for this implementation, disregarding overhead from dynamic arrays, is 27n bytes in the worst case. This is 27/20 = 1.35 times more than for sibling lists. In practice, the space usage is lower, as the number of internal nodes is usually around 0.5n to 0.7n [22]. The space wasted internally in each child array should also be considered, which comes in addition to the overhead due to the growth factor in the storage for the groups of child arrays. With a growth factor of 1.1, the cost for child arrays is 17 · 1.1 + (5 + 5) · 1.1² ≈ 30.8, while for sibling lists it is 16 · 1.1 + 4 ≈ 21.6. This gives a ratio of 30.8/21.6 ≈ 1.43. This is also lower in practice, as the out-degree of internal nodes often has a very skewed distribution, which can be taken into account when configuring the predefined child array sizes.


3 Implementation of Enhanced Suffix Arrays

An enhanced suffix array is also tested in the following experiments. The implementation is the author's translation of the pseudo-code from Abouelhoda et al. [1] into C++. Some choices for the data structures affect the performance, especially query lookup, and are therefore discussed here.

The data structures needed for substring search with an enhanced suffix array are the suffix array itself, the LCP table and the child table (CLD), all of which have the same length. The first choice to be made in an implementation is whether to store the three as separate arrays, or as an array of structs with three fields. As the fields for a given “position” in the enhanced suffix array are often read together during search, the latter choice should give better locality and performance.
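The two layouts are easy to contrast in C++ (the type and field names below are illustrative, not the report's code):

#include <cstdint>
#include <vector>

struct EsaSoA {                     // "set of arrays" layout (esa1)
    std::vector<std::int32_t> suf;  // the suffix array itself
    std::vector<std::int32_t> lcp;  // LCP table
    std::vector<std::int32_t> cld;  // child table
};

struct EsaEntry {
    std::int32_t suf, lcp, cld;
};
using EsaAoS = std::vector<EsaEntry>;  // "array of structs" layout (esa2)

With the first layout, the three fields for one position live in three different cache lines; with the second they usually share one, which is consistent with the up-to-twofold query lookup speedups for esa2 reported later in this paper.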

Abouelhoda et al. describe how to store the LCP table in a byte array, overflowing into another data structure where lookup is done by binary search. This was not done here, because it gave a considerable slowdown on some of the benchmarks, as they had a high average LCP. On the file w3c2 (see Table 2), there was a 36% overflow with 1-byte LCP fields, and an 11% overflow with 2 bytes.

The CLD table can also be stored in an overflowing byte array [1]. This gave a very low overflow on the tests run here, but still a considerable slowdown during search. This is because the “nodes” close to the root of the virtual tree are the most likely to overflow, as they “span” larger portions of the enhanced suffix array. Even though a small portion of all nodes overflow, a large portion of the nodes traversed during a query lookup are expected to have overflowing CLD fields.

In the implementations tested, all three mentioned fields were stored as four-byte integers, resulting in a space usage of 3 · 4n + n = 13n bytes including storage for the text itself. If the LCP and CLD fields are stored as overflowing bytes, the expected space usage is 4n + 3 · n = 7n bytes. Performance for an implementation where the CLD values were stored in bytes overflowing into a hash table is also given in some of the tests on query lookup in Section 4.7.

For the tests on tree traversal operations in Section 4.9, two additional fields were used. The “left” and “right” suffix link pointers were stored in 2 · 4n bytes, and the inverse suffix array used to create them was stored in 4n bytes. Abouelhoda et al. [1] describe how to store the suffix link pointers in an expected 2 · n bytes.

4 Experiments

Below follow experiments testing many substring index implementations on various types of text data and queries. The experiments were designed to emphasise the strengths and weaknesses of the various methods. Construction time, space usage and query performance were tested.


4.1 Test Data

Both “real world” and synthetic data were used in the tests. The largest tests from the Canterbury Corpus [3] and the tests from Manzini and Ferragina's test collection [30] were used, in addition to some artificial data generated by the author. The results for these tests are given in Section 4.3.

To underline the properties of the methods, other synthetic data was also used: space-separated words from a Zipf distribution (Section 4.4), uniform random data with varying alphabet size (Section 4.5), and first-order Markov data with a Zipf distribution on the transitions (Section 4.6).

4.2 Tested Implementations

The following implementations were tested:

• STa – The author's implementation of suffix trees using child arrays. See Section 2.4.
• STs – Using sibling lists. See Section 2.3.
• Kur – The suffix tree implementation by Kurtz [22], taken from MUMmer 3.15². It was compiled using the internal flag STREE HUGE, which gives a maximum data size of 500 MB.
• wotde – The Write-Only Top-Down Eager³ suffix tree construction algorithm [11].
• DS – The deep-shallow suffix array construction algorithm⁴ [28]. Default construction parameters were used.
• esa1 [b] – Enhanced suffix array. The implementation is a translation of the pseudo-code from [1] into C++. Node fields are stored in separate arrays. The initial suffix array is built with deep-shallow. The suffix b denotes that CLD values were stored in bytes overflowing into a hash map.
• esa2 [b] – Enhanced suffix array stored as an array of structs.
• BPR – The Bucket Pointer Refinement suffix array construction algorithm⁵ [40]. Default construction parameters were used.
• LZ – LZ-index [33], implementation by Navarro and Arroyuelo⁶.
• CCSA – Compressed compact suffix array [24], implementation by Mäkinen and González⁷.

² http://mummer.sourceforge.net/
³ http://bibiserv.techfak.uni-bielefeld.de/wotd/
⁴ http://www.mfn.unipmn.it/~manzini/lightweight/
⁵ http://bibiserv.techfak.uni-bielefeld.de/bpr/
⁶ http://pizzachili.dcc.uchile.cl/indexes/LZ-index/
⁷ http://pizzachili.dcc.uchile.cl/indexes/Compressed_Compact_Suffix_Array/


• FM – FM-index [8], implementation (version 2) by Ferragina and Venturini⁸.
• AFFM – Alphabet-Friendly FM-index [9], implementation (version 2) by González⁹.
• RLFM – Run-Length FM-index [25], implementation by Mäkinen and González¹⁰.
• SSA – Succinct suffix array [26] (FM family), implementation (version 2) by Mäkinen and González¹¹.

All programs were compiled with gcc 4.1.2 with -O3 optimisation, and used glibc 2.3.6. They were run on an AMD Athlon 64 3500+ with 512 KB L2 cache, running Debian Sid with Linux 2.6.16. The kernel had the perfctr [37] patch, which makes hardware performance counters readable. The PAPI [36] interface was used to read the variable PAPI_L2_TCM, giving level 2 cache misses. The times given are all wall-clock timings, excluding the time for reading files into memory. The memory usage given is VmRSS (resident set size) and VmHWM (resident set size peak) from /proc/$pid/status (what is reported by the tools top and ps). When running the tests, all initialisation of measurement variables and reading of data was done before each timing was started, and all related calculations were done after it was stopped.

4.3 General Tests

The following tests include data from the Canterbury Corpus [3] and Manzini and Ferragina's test collection [30]. In addition, the 35th Fibonacci string is used. Note that the odd-numbered¹² Fibonacci strings (counting from 1) give the maximum number of nodes in suffix trees (2n + 1), and give the worst case in many non-linear suffix array construction algorithms [40]. The test a025 is 2²⁵ subsequent a's. Table 2 gives some statistics for the tests: length, alphabet size, LCPs, and first-order empirical entropy [34].

Table 3 shows the construction speed for the tested methods, given as symbols indexed per second. The rationale for giving this instead of absolute time is to allow easier comparison between the tests shown in the tables, and clearer differentiation between the best methods in the plots that follow. An “x” in the tables denotes that the test crashed, while “-” denotes that it took too long to complete. “N/A” denotes that the implementation was not designed to handle this alphabet size. Table 4 shows symbols indexed per L2 cache miss. It is expected that this is related to the construction performance.

The suffix trees with child arrays clearly perform better than the sibling list variants, except on the small DNA tests, where they are slightly slower, and on the Fibonacci string, which has an alphabet of size 2. The advantage comes from the improved locality, as can be read from Table 4. Kur and STs have similar behaviour, with the former being slightly faster. This is probably because the latter was originally developed to index dynamic sets of documents. Wotde has a varying performance, which seems dependent on the average

⁸ http://pizzachili.dcc.uchile.cl/indexes/FM-indexV2/
⁹ http://pizzachili.dcc.uchile.cl/indexes/Alphabet-Friendly_FM-Index/
¹⁰ http://pizzachili.dcc.uchile.cl/indexes/Run_Length_FM_index/
¹¹ http://pizzachili.dcc.uchile.cl/indexes/Succinct_Suffix_Array/
¹² In the even-numbered Fibonacci strings, a is always followed by b.


Table 2: Test statistics. The first three come from the Canterbury Corpus [3], while the next 10 come from Manzini and Ferragina's test collection [30].

 #   Name             Length      σ    Med. LCP  Avg. LCP  Max. LCP  Entr. (1)  Description
 1   bible.txt          4047393   63         11        14       551       3.27  King James bible
 2   E.coli             4638690    4         11        17      2815       1.98  Escherichia coli
 3   world192.txt       2473400   94         13        23       559       3.66  CIA world fact book
 4   chr22.dna         34553758    5         13      1979    199999       1.88  Human chromo. 22
 5   etext99          105277340  146         12      1109    286352       3.57  Project Gutenberg
 6   gcc-3.0.tar       86630400  150         21      8603    856970       3.80  gcc 3.0 source files
 7   howto             39422105  197         12       268     70720       3.92  Linux HOWTO text
 8   jdk13c            69728899  113        113       679     37334       3.54  HTML and Java
 9   linux-2.4.5.tar  116254720  256         17       479    136035       3.90  Linux Kernel 2.4.5
10   rctail96         114711151   93         61       282     26597       3.47  Reuters news in XML
11   rfc              116421901  120         19        93      3445       3.40  RFC text files
12   sprot34.dat      109617186   66         32        89      7373       3.93  Swiss Prot database
13   w3c2             104201579  256        114     42300    990053       4.06  HTML from w3c.org
14   fib035             9227465    2    2306866   2435423   5702885       0.59  35th Fibonacci string
15   rand254          104857600  254          3      2.86         6       7.98  Uniform, |σ| = 254
16   a025              33554432    1   16777217  16777216  33554431       0.00  a repeated 2^25 times


Table 3: Symbols indexed per second, in thousands.

 #    STa   STs    Kur  wotde  skew     DS    BPR   esa1  esa2     LZ  CCSA    FM  AFFM  RLFM    SSA
 1   1948  1126   1475   1556   531   3935   3430   1897  1816   3201  1011  2525   630  2286   2127
 2   1370  1046   1377   1256   518   3770   3374   1728  1658   4504   869  2365   783  2108   2251
 3   2335  1263   1745   1857   570   4833   3241   1934  2056   3211  1129  2884   522  2622   2310
 4   1212   932   1165     22   426   2900   2250   1236  1325   3753   730  1712   669  1825   1888
 5   1332   632    700    127   309   1736   1292    938   996   1873   606  1430   341  1240   1200
 6   2227  1044   1273     12   398   1300   1813    913   920   2457   630  1804   350  1122   1036
 7   1677   707    791    468   391   2763   2128   1372  1401   2110   796  2006   332  1781   1659
 8   3321  2282   3598    178   429   1251   1341    880   892   3553   579  1360   387   965    879
 9   2049   883   1032    258   366   2520   2086   1348  1365    N/A   N/A   N/A   N/A   N/A    N/A
10   2302  1198   1447    287   369   1091   1039    741   767   2708   519  1216   371   855    801
11   1842   864   1005    533   356   2224   1463   1171  1216   2212   712  1769   410  1550   1433
12   2029   778    925    499   351   1904   1261   1029  1087   2445   658  1594   473  1356   1267
13   3052  1509   2138      -   403   1178   1484    825   853    N/A   N/A   N/A   N/A   N/A    N/A
14   4586  5346  12375      -   889     91    267     89    89  21400    73   502    73    76     76
15    606    30     32   1957   473   2470   1218   1020  1094    647   592     x   191  1158   1252
16   4547  5931  14767      -  4338  41708  22805  12904  9677  55966  3290     -  1012  9558  10168


Table 4: Symbols indexed per L2 cache miss.

 #     STa     STs    Kur    wotde   skew      DS     BPR   esa1   esa2    LZ   CCSA    FM   AFFM   RLFM    SSA
 1    0.30    0.15   0.13     0.18  0.039    0.48    0.34   0.20   0.22  0.44   0.15  0.34   0.13   0.34   0.34
 2    0.16    0.13   0.13     0.14  0.038    0.43    0.37   0.17   0.19  0.61   0.15  0.30   0.11   0.31   0.31
 3    0.40    0.18   0.16     0.27  0.042    0.67    0.40   0.23   0.26  0.47   0.18  0.44   0.14   0.44   0.44
 4    0.18    0.14   0.15    0.012  0.036    0.29    0.32   0.14   0.16  0.47   0.11  0.22  0.096   0.23   0.23
 5    0.22   0.092  0.088    0.057  0.032    0.18    0.16   0.11   0.12  0.24  0.088  0.16  0.082   0.15   0.15
 6    0.49    0.16   0.15  0.00087  0.036    0.19    0.15   0.13   0.14  0.34   0.10  0.23   0.11   0.19   0.19
 7    0.28   0.091  0.082     0.11  0.033    0.29    0.24   0.15   0.17  0.29   0.11  0.23   0.11   0.22   0.22
 8     1.2    0.56   0.54    0.031  0.037    0.17   0.093   0.12   0.12  0.43  0.085  0.17  0.090   0.14   0.14
 9    0.44    0.14   0.13    0.068  0.035    0.28    0.19   0.16   0.17   N/A    N/A   N/A    N/A    N/A    N/A
10    0.50    0.20   0.19    0.023  0.036    0.11   0.069  0.083  0.088  0.33  0.067  0.14  0.067   0.10   0.10
11    0.36    0.13   0.13    0.068  0.036    0.24    0.13   0.14   0.15  0.30   0.10  0.21   0.10   0.19   0.19
12    0.39    0.11   0.11    0.052  0.035    0.20    0.11   0.13   0.13  0.31  0.093  0.19  0.093   0.17   0.17
13    0.97    0.26   0.26        -  0.036    0.16    0.12   0.11   0.12   N/A    N/A   N/A    N/A    N/A    N/A
14      11      10     18        -  0.063   0.015  0.0095  0.014  0.014   4.1  0.013  0.11  0.013  0.014  0.014
15   0.099  0.0031 0.0033     0.18  0.056    0.34    0.16   0.14   0.15  0.12   0.12     x  0.089   0.25   0.25
16      23      28     49        -    1.9     130      55     26     14   419     20     -    4.1     32     32


LCP (see Table 2). For tests larger than 5 MB, it is faster than the regular suffix trees only on the random data. The trees using sibling lists seem unsuited for random data with large alphabets. One might have expected the time and cache performance for the suffix trees to be directly proportional to the alphabet size, but it also depends on other properties of the data. A test where only the alphabet size is varied is given in Section 4.5.

The cost of building the enhanced suffix array is rather high. A reason for this is the way data is laid out in memory. The improvement of esa2 over esa1 does not show very well in this test on construction performance, as the data fields are built by separate algorithms. The tests in Section 4.7 show that esa2 is often significantly faster on query lookup. In general, the regular suffix tree with child arrays has faster construction than both the enhanced suffix array and wotde.

DS and BPR are faster than the linear-time suffix trees on most of these tests. The exception is those with a very high LCP.

All the compressed structures except LZ use a suffix array built with deep-shallow as a starting point for their construction. The build times for the compressed representations seem very affordable. The LZ index has fast construction, and is even faster than all non-compressed methods on many of the tests. For all the compressed indexes there is a strong relationship between time and cache performance. This can be seen when comparing, for each method, the results on the different tests in Tables 3 and 4.

Query performance is not shown in this test because it depends on too many parameters, such as the length of the query, the number of hits, and the data distribution. Query performance is evaluated in Section 4.7.

Table 5 shows the memory usage for the various implementations, in bytes per symbol after the construction was finished. The peak space usage is shown in Table 6. The memory usage of the application as seen externally is measured, which could give slightly inconsistent results because of space re-allocation. One method might use all its allocated memory, while another may have allocated more just before it finished.

STa uses 15-25% more memory than STs. Both methods use more memory on the DNA tests, probably because of a smaller average branching factor and many internal nodes. Apart from that, all non-compressed methods have a rather constant space usage. Both wotde and esa1/2 use slightly less space than the regular suffix trees. Among the construction algorithms for non-compressed structures, BPR is the only one using considerable extra space during construction. DS is much more space-efficient, and has virtually no space overhead.

The FM index is a clear winner on space usage on most of the tests. The exception seems to be those with a large alphabet. The other indexes from the FM family, AFFM, RLFM and SSA, fare much better there. The LZ index uses more space than a suffix array on some of the smaller tests, but around half the space on the larger tests.

The working space used during construction varies for the compressed indexes, as seen in Table 6. CCSA and AFFM need around 8-9 bytes per symbol, while the rest of the FM family needs a little more than 6. The working space for the LZ index strongly depends on the properties of the text data, as the tree data structures built utilise its repetitions.


Table 5: Space usage in bytes per symbol indexed after construction.

 #   STa  STs  Kur  wotde  skew   DS  BPR  esa1  esa2   LZ  CCSA    FM  AFFM  RLFM   SSA
 1    16   13   13     11   7.3  5.5  5.4    13    13  6.2   1.8   1.1   2.4   1.4   1.4
 2    20   17   16     11   6.5  5.4  5.3    13    13  4.4   3.0  1.00   1.0   1.3   1.0
 3    16   13   13     11   7.4  5.8  5.6    14    14  7.4   1.9   1.4   1.8   1.7   1.9
 4    19   16   16     11   5.8  5.1  5.0    13    13  2.4   2.5  0.62  0.66  0.92  0.69
 5    16   13   13     10   5.2  5.0  5.0    13    13  2.8   1.9  0.68  0.83  0.98   1.0
 6    16   13   13     11   5.5  5.0  5.0    13    13  2.5   1.1  0.59  0.96  0.85   1.1
 7    16   13   13     11   5.8  5.0  5.0    13    13  2.9   1.6  0.75   2.0  0.98   1.1
 8    16   13   13     11   5.3  5.0  5.0    13    13  1.9  0.70  0.43  0.76  0.73   1.2
 9    16   13   13     11   5.3  5.0  5.0    13    13  N/A   N/A   N/A   N/A   N/A   N/A
10    15   12   12     10   5.2  5.0  5.0    13    13  2.0  0.98  0.46  0.73  0.79   1.1
11    16   13   13     11   5.3  5.0  5.0    13    13  2.5   1.2  0.60  0.84  0.85   1.0
12    15   13   13     10   5.2  5.0  5.0    13    13  2.5   1.4  0.57  0.90  0.88   1.1
13    16   13   14      -   5.2  5.0  5.0    13    13  N/A   N/A   N/A   N/A   N/A   N/A
14    19   16   17      -   5.2  5.2  5.2    13    13  1.3  0.64  0.39  0.49  0.85  0.69
15    13  9.5  8.4    6.4   5.0  5.0  5.0    13    13  5.8   4.0     x   2.5   2.0   1.5
16    19   16   17      -   5.1  5.1  5.0    13    13  1.1  0.50     -   3.5  0.69  0.49


Table 6: Peak space usage during construction, in bytes per symbol.

 #   STa  STs  Kur  wotde  skew   DS  BPR  esa1  esa2   LZ  CCSA   FM  AFFM  RLFM  SSA
 1    16   13   13     11    18  5.5   11    13    13  6.7   8.3  6.5    10   6.7  6.7
 2    20   17   16     11    18  5.4   10    13    13  4.9   8.4  6.5    11   6.7  6.7
 3    16   13   13     11    18  5.8   12    14    14  7.9   8.6  6.8   9.1   7.0  7.0
 4    19   16   16     12    18  5.1   10    13    13  4.1   8.4  6.2   9.8   6.3  6.3
 5    16   13   13     10    17  5.0   10    13    13  5.6   8.5  6.1   9.0   6.3  6.3
 6    16   13   13     11    18  5.1   10    13    13  5.8   8.5  6.1   8.2   6.3  6.3
 7    16   13   13     11    18  5.1   11    13    13  6.6   8.4  6.1   9.7   6.3  6.3
 8    16   13   13     11    18  5.0   10    13    13  4.2   8.5  6.1   7.7   6.3  6.3
 9    16   13   13     11    17  5.0   11    13    13  N/A   N/A  N/A   N/A   N/A  N/A
10    15   12   12     10    17  5.0   10    13    13  4.7   8.5  6.1   8.0   6.3  6.3
11    16   13   13     11    17  5.0   10    13    13  5.4   8.5  6.1   8.3   6.3  6.3
12    15   13   13     10    17  5.0   10    13    13  5.7   8.5  6.1   8.4   6.3  6.3
13    16   13   14      -    17  5.0   11    13    13  N/A   N/A  N/A   N/A   N/A  N/A
14    19   16   17      -    17  5.2   10    13    13  1.4   8.3  6.3   7.6   6.5  6.5
15    13  9.5  8.4    7.7    13  5.0   11    13    13   20   9.3    x    12   6.3  6.3
16    19   16   17      -    17  5.1   10    13    13  1.1   8.4    -    59   6.3  6.3


4.4 Test on Increasing Data Size

Figures 1 and 2 give the results of indexing increasing amounts of space-separated word data from a Zipf distribution (parameter s = 1.0). The figures show symbols indexed per second and per L2 cache miss. Notice that these two quantities are closely related, both here and in the following tests.

The suffix tree STa is almost three times faster than STs, even though they logically perform the same steps. This is because of better locality, which can be read from the plot for cache performance. STa traverses a contiguous array to look up the child of a node, while STs traverses sibling nodes with “random” memory locations. DS and BPR here exhibit the most non-linear performance, suggesting that for larger data, linear-time suffix array constructions would be faster. Wotde, which is also a non-linear method, is faster than STs, but slower than STa.

One would expect that the reason STa and STs show a slightly non-linear behaviour is the fact that a smaller relative proportion of the tree fits in cache. In the construction algorithms the tree is traversed in a seemingly random pattern along a certain depth. The average depth is the expected longest common prefix of closest pairs between the suffixes added so far. For random data, this is log_σ n [4]. But the cache performance is more linear than the time performance both for STa and STs. The additional slowdown might be due to memory management overhead.

The LZ index shows great indexing performance. It has a more linear behaviour than DS, and is faster when the data size reaches 100 MB. For all the other compressed implementations, the time and cache misses for the initial construction of the suffix array by deep-shallow have been subtracted. This is done because this construction in many cases dominates the running time, and made it very hard to interpret the cost of building the compressed representations themselves in Sections 4.5 and 4.6. For FM, RLFM and SSA, the cost of the compression is less than the cost of the initial build. An interesting feature in the plot for cache performance is that FM, RLFM, and SSA are nearly identical. This must be because they have similar access patterns to their data structures. Figure 3 shows the space usage for the compressed methods on this test. FM is twice as space-efficient as the next method at 100 MB.

Figure 4 shows a similar test, but with uniform r<strong>and</strong>om data (σ = 20). DS <strong>and</strong> esa1/2<br />

show approximately the same performance as in the last test, but the other methods do<br />

not. Note that BPR here seems has less decrease than DS, <strong>and</strong> may have been the fastest<br />

method with even more data. Wotde is now much faster, while the other suffix trees are<br />

slower. R<strong>and</strong>om data gives lower average LCP, which benefits wotde, but also a higher<br />

average out degree in the nodes, which is bad for the regular suffix trees. The worst effect<br />

is seen for the sibling lists, where the performance is halved from what was seen for Zipf<br />

data in Figure 1. All compressed structures perform slightly worse on the r<strong>and</strong>om data.<br />

This is probably because there are fewer regularities in the text, <strong>and</strong> the index structures<br />

grow larger.<br />


[Figure 1: Indexing performance on increasing data size, Zipf word data. Construction of initial suffix array not included for compressed indexes. (a) Symbols indexed per second (symb/s) for STa, STs, wotde, skew, DS, esa2, BPR and LZ; (b) symbols indexed per second for CCSA, FM, AFFM, RLFM and SSA. X-axis: data size in MiB.]


[Figure 2: Continuing Figure 1. Indexing performance on increasing data size, Zipf word data. Construction of initial suffix array not included for compressed indexes. (a) Symbols indexed per L2 miss (symb/L2m) for STa, STs, wotde, skew, DS, esa2, BPR and LZ; (b) symbols indexed per L2 miss for CCSA, FM, AFFM, RLFM and SSA. X-axis: data size in MiB.]


[Figure 3: Space usage for compressed structures after construction, Zipf word data, increasing size. Y-axis: bytes per symbol (B/symb); series: CCSA, FM, AFFM, RLFM, SSA, LZ. X-axis: data size in MiB.]

4.5 Increasing Alphabet Size

Figure 6 shows performance when indexing 10 MB of uniformly random data with an increasing alphabet size.

DS here clearly outperforms the other methods for large alphabets, but has rather peculiar behaviour. This suffix array construction algorithm combines two different sorting techniques, where one utilises the results of the other. The behaviour seen may be due to the individual performance of these, and to the way they are combined. Wotde shows great performance as the alphabet grows, because of the decreasing average LCP. The opposite behaviour is seen for the other suffix trees, where an increasing number of children per internal node slows down the construction. For STs the performance decreases by a factor of 30 as the alphabet size goes from 2 to 254. The same effect is seen to a lesser degree for STa, with a drop factor of 2.

The construction times of all compressed indexes except CCSA seem strongly dependent on the alphabet size, but the number of cache misses is not. Seemingly, a larger alphabet results in more computation, but as the data accesses are already rather random, there are no more cache misses, even though the data structures are larger. Figure 8 shows the space usage in bytes per symbol. The FM index is extremely efficient for small alphabets, and is the best of all methods up to an alphabet size of around 128. Remember that this very artificial test gives the worst-case space usage for the compressed indexes, as the entropy of the data is maximal for the given alphabet size. The space usage of around 2 bytes per symbol with an alphabet size of 254 for the most space-efficient structures is impressive.


[Figure 4: Indexing performance on increasing data size, random data (σ = 20). Construction of initial suffix array not included for compressed indexes. (a) Symbols indexed per second for STa, STs, wotde, skew, DS, esa2, BPR and LZ; (b) symbols indexed per second for CCSA, FM, AFFM, RLFM and SSA. X-axis: data size in MiB.]


[Figure 5: Continuing Figure 4. Indexing performance on increasing data size, random data (σ = 20). Construction of initial suffix array not included for compressed indexes. (a) Symbols indexed per L2 miss for STa, STs, wotde, skew, DS, esa2, BPR and LZ; (b) symbols indexed per L2 miss for CCSA, FM, AFFM, RLFM and SSA.]


[Figure 6: Indexing performance on increasing alphabet size, uniform data. Construction of initial suffix array not included for compressed indexes. (a) Symbols indexed per second for STa, STs, wotde, skew, DS, esa2, BPR and LZ; (b) symbols indexed per second for CCSA, FM, AFFM, RLFM and SSA. X-axis: alphabet size 2-256.]


[Figure 7: Continuing Figure 6. Indexing performance on increasing alphabet size, uniform data. Construction of initial suffix array not included for compressed indexes. (a) Symbols indexed per L2 miss for STa, STs, wotde, skew, DS, esa2, BPR and LZ; (b) symbols indexed per L2 miss for CCSA, FM, AFFM, RLFM and SSA.]


[Figure 8: Space usage for compressed structures after construction, on increasing alphabet size. Y-axis: bytes per symbol; series shown: STa, STs, wotde, skew, DS, esa2, BPR, LZ. X-axis: alphabet size 2-256.]

4.6 Markov Data

Many of the methods tested have time and space efficiencies that depend on the randomness of the data. Less random data gives a higher average LCP, degrading the performance of some of the non-linear construction methods. Figure 9 shows results for indexing first-order Markov data with a Zipfian transition distribution with varying parameter s. An alphabet size of 20 was used. In a general Zipf distribution, the probability of selecting element k is given as

p(k; s, n) = \frac{k^{-s}}{\sum_{i=1}^{n} i^{-s}}

Setting s = 0 gives a uniform distribution. For the data used, the average LCP approximately doubled each time the parameter s increased by 1, from 4.7 for s = 0 to 2160 for s = 10.
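As an aside, a data generator can be sketched directly from this formula. The following Java snippet is hypothetical (it is not the generator actually used in the experiments, and the class name and seed are arbitrary); it draws symbols 1..n by inverse-transform sampling on a precomputed CDF.

```java
import java.util.Random;

// Samples element k in 1..n with probability proportional to k^(-s).
final class ZipfSampler {
    private final double[] cdf;
    private final Random rng = new Random(42);   // arbitrary seed

    ZipfSampler(int n, double s) {
        cdf = new double[n];
        double sum = 0;
        for (int k = 1; k <= n; k++) sum += Math.pow(k, -s);  // normaliser
        double acc = 0;
        for (int k = 1; k <= n; k++) {
            acc += Math.pow(k, -s) / sum;
            cdf[k - 1] = acc;
        }
    }

    int next() {                       // returns a value in 1..n
        double u = rng.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {              // first index with cdf >= u
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo + 1;
    }
}
```

Note that s = 0 indeed yields a uniform CDF, matching the remark above.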

Wotde has the greatest dependence on the LCP, and is the only method showing a continuous drop in performance here. DS has a strong peak at s = 4 and a total breakdown at s = 7, where the average LCP is 324. The peak probably comes from the distribution of work between the two methods deep-shallow combines, as was discussed in Section 4.5. The performance of the regular suffix trees increases as the data grows less random, due to lower average branching factors.


[Figure 9: Indexing performance on first-order Markov data with Zipfian transition distribution and increasing parameter s (σ = 20). Construction of initial suffix array not included for compressed indexes. (a) Symbols indexed per second for STa, STs, wotde, skew, DS, esa2, BPR and LZ; (b) symbols indexed per second for CCSA, FM, AFFM, RLFM and SSA. X-axis: increasing regularity (s = 0-10).]


[Figure 10: Continuing Figure 9. Indexing performance on first-order Markov data with Zipfian transition distribution and increasing parameter s (σ = 20). Construction of initial suffix array not included for compressed indexes. (a) Symbols indexed per L2 miss for STa, STs, wotde, skew, DS, esa2, BPR and LZ; (b) symbols indexed per L2 miss for CCSA, FM, AFFM, RLFM and SSA.]



[Figure 11: Space usage for compressed structures after construction on Markov data (σ = 20). Y-axis: bytes per symbol; series: CCSA, FM, AFFM, RLFM, SSA, LZ. X-axis: increasing regularity (s = 0-10).]

LZ shows extreme performance on this data, which is very repetitive for large s. The curve for LZ continues rising to 35 million symbols per second at s = 10, with 70 symbols indexed per L2 miss. FM, RLFM and SSA also show a significant increase in performance with the regularity. Note that for these methods the bulk of the time used is the initial deep-shallow construction, which is subtracted here. To get more robust performance, a linear-time suffix array construction algorithm could be used.

Figure 11 shows the space efficiency of the compressed indexes. As in the previous tests, the FM index is the most space efficient. CCSA and LZ also show very good trends. Although the space usage of LZ is low, it does not match the extreme time performance that was shown in Figure 9a. Various trade-offs between space usage and query performance could be made in the implementation. All the compressed methods have a dependency on higher-order entropy in their asymptotic space usage [34], but the implementations show this to a varying extent here.

For large parameters s, the data has LCPs higher than what would be seen in any real-world data of reasonable size. In these pure Markov chain strings, the LCP between almost all neighbouring suffixes is very high. The average LCP here should probably be compared with the median LCP seen in Table 2. At s = 6, the average LCP is 172, and the median LCP is 159.



4.7 Query Lookup

As previously mentioned, many parameters influence query performance. One such parameter is the total data size. Figure 12 shows the number of queries per second and per L2 cache miss, on an increasing amount of Zipf word data (s = 1, σ = 20). Queries of length 30 were run, and only one hit was reported (to isolate lookup time). The performance drop for the suffix trees is due to the increasing number of nodes seen in downward traversal in the tree, and to the number of children considered in child lookup. The latter hits STs worse than STa. The decrease for the suffix array is due to the increasing number of jumps in the binary search. Wotde has the best performance in this test, followed by STa.

Esa1 performs rather poorly, and even worse than STs, which logically performs the same steps. This is related to the number of cache misses, which is many times as high. A sibling traversal is also performed in esa1, but it has bad locality for two reasons. The first is an implementational detail: the information for one "position" in the enhanced suffix array is spread over different arrays, giving unnecessary cache misses. This is implemented differently in esa2, which has performance similar to STs. The second reason is an inherently bad locality in the enhanced suffix array: for an internal "node" close to the root, the values read to traverse the child nodes are spread over a large area. This is why esa2 has much slower lookup than STa, and also degrades more as the amount of data increases. The variants esa1b and esa2b have the CLD field stored in bytes, overflowing into a hash table (see Section 3). Even though the total overflow is negligible in terms of space, the query lookup performance is roughly halved, because many nodes in the upper part of the tree overflow.

The query lookup performance of the compressed indexes differs greatly. SSA is almost 20 times faster than FM on 100 MB of data, but 20 times slower than the fastest suffix tree. FM has poor time performance, but better cache performance than the other methods. The author does not know the implementation well enough to comment on this properly. Note the different scales on the y-axis for compressed indexes. Searching in the compressed structures involves many recursive lookups for each logical step in the search. The values to be read are spread throughout the data structures, giving bad locality.

Figure 14 shows query lookup on data with varying alphabet size. Queries of length 30 were issued on 10 MB of uniform random data. Lookup performance degrades in the suffix tree variants and the enhanced suffix array when the alphabet size grows larger. This is because these methods traverse lists of children in trees, and the average lengths of these lists increase with the alphabet. The effect hits STs, esa1 and esa2 worst, as they have poor locality in the child traversal.

The performances of many of the compressed indexes depend strongly on the alphabet size. Only CCSA and FM have alphabet-independent asymptotic lookup times [34]. CCSA shows an increase in performance as the alphabet size increases. This is because the average length of the match between the search pattern and the suffixes considered in the binary search decreases, resulting in fewer character comparisons. The implementation tested does not use the trick of keeping track of how many symbols match on the left and right borders of the binary search to reduce the expected number of character comparisons. The methods AFFM, RLFM and SSA all have the same asymptotic lookup cost [34], but SSA is faster in practice, and has the best query performance of all compressed indexes for small alphabets.
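For reference, the plain binary-search lookup discussed here can be sketched as follows. This is a hedged Java sketch under assumed names (SaLookup and its methods are not from the benchmarked code); it deliberately omits the left/right-border matching trick, so each probe compares the pattern against a suffix at a "random" text position, which is the source of the cache misses measured in these tests.

```java
final class SaLookup {
    // Left border: index of the first suffix that is >= pat in
    // lexicographic order (O(m log n) character comparisons).
    static int lowerBound(byte[] text, int[] sa, byte[] pat) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            // Each probe touches a suffix at an unrelated text position.
            if (compareSuffix(text, sa[mid], pat) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // < 0 if the suffix at 'start' sorts before pat, 0 if pat is a
    // prefix of the suffix, > 0 otherwise.
    static int compareSuffix(byte[] text, int start, byte[] pat) {
        for (int i = 0; i < pat.length; i++) {
            if (start + i >= text.length) return -1;  // suffix ran out first
            int d = (text[start + i] & 0xFF) - (pat[i] & 0xFF);
            if (d != 0) return d;
        }
        return 0;
    }
}
```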


[Figure 12: Query lookup on increasing data size, Zipf word data. (a) Queries per second for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b; (b) queries per second for CCSA, FM, AFFM, RLFM, SSA and LZ. X-axis: data size in MiB.]


[Figure 13: Continuing Figure 12. Query lookup on increasing data size, Zipf word data. (a) Queries per L2 miss for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b; (b) queries per L2 miss for CCSA, FM, AFFM, RLFM, SSA and LZ.]


[Figure 14: Query lookup on increasing alphabet size, uniform data. (a) Queries per second for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b; (b) queries per second for STa, STs, wotde, skew, DS, esa2, BPR and LZ. X-axis: alphabet size 2-256.]


[Figure 15: Continuing Figure 14. Query lookup on increasing alphabet size, uniform data. (a) Queries per L2 miss for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b; (b) queries per L2 miss for STa, STs, wotde, skew, DS, esa2, BPR and LZ.]



4.8 Reporting Hits

In the previous tests on query performance a single hit was reported for each query, but for some applications the efficiency when listing large numbers of occurrences is more relevant. Figure 16 shows the performance when reporting an increasing number of hits from 100 MB of uniform data with an alphabet size of 20. All search functions were modified to cap the number of hits, and a varying number was requested. The lengths of the queries were set such that the number of hits would be at least what was requested.

Although suffix trees are asymptotically optimal here, both for sibling lists and child arrays, the suffix arrays have an advantage in practice: after finding the left and right borders of a match, the values between them are read sequentially, giving great spatial locality and performance. DS is more than 15 times faster than STa at the most. The enhanced suffix array is not included, as it has performance almost identical to the regular suffix array. Wotde is significantly faster than the regular suffix trees in this test, due to a more compact representation and better locality. The slight drop in performance for more than 3000 hits seen for the fastest methods is due to the overhead of the dynamic container used to hold the hits. This could easily be avoided for the suffix arrays, as the number of hits is known after finding the left and right borders.

The LZ index shows the best performance among the compressed indexes, and delivers hits nearly as fast as a suffix tree with sibling lists. The other compressed structures are 3-5 orders of magnitude slower than the suffix array, but in many applications, thousands of hits per second is sufficient.
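A sketch of why suffix arrays report hits so quickly, and of how the container overhead noted above could be avoided: once the borders of the match range are known, the hits are one contiguous slice of the array. The snippet below assumes SaLookup from the earlier lookup sketch; it is illustrative, not the benchmarked code.

```java
final class SaReport {
    // Reports up to 'cap' hit positions for 'pat'. hi - lo is the exact
    // occurrence count, so the result array is sized up front, avoiding
    // dynamic-container overhead.
    static int[] reportHits(byte[] text, int[] sa, byte[] pat, int cap) {
        int lo = SaLookup.lowerBound(text, sa, pat);
        int hi = upperBound(text, sa, pat);
        int n = Math.min(hi - lo, cap);
        int[] hits = new int[n];
        // One contiguous read of sa[lo .. lo+n): great spatial locality.
        System.arraycopy(sa, lo, hits, 0, n);
        return hits;
    }

    // Right border: first suffix that does not have pat as a prefix
    // (mirror image of lowerBound).
    static int upperBound(byte[] text, int[] sa, byte[] pat) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (SaLookup.compareSuffix(text, sa[mid], pat) <= 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```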

4.9 Tree Operations

Suffix trees can be used for many purposes other than substring search. They are used in bio-informatics to find many properties of strings; Gusfield [16] lists numerous applications. Because suffix trees are costly in space, Abouelhoda et al. [1] show how to replace suffix trees with enhanced suffix arrays in all algorithms doing bottom-up, top-down or suffix-link traversal in the tree. Sadakane [39] has taken this one step further and shown how to use a succinct representation of a suffix tree on top of any compressed suffix array, supporting the same operations.

Table 7 shows the performance of suffix trees, enhanced suffix arrays and a compressed suffix tree (CST) on various tree operations. The compressed suffix tree implementation by Mäkinen [43] was used. The longest common substring (LCSS) between two strings is found by building a tree for the first string, and then traversing it matching suffixes of the other string, following parent-to-child and suffix links in the tree. (The construction time for the index is excluded.)


[Figure 16: Requesting an increasing number of hits; log10 scale on both axes. (a) Hits reported per second; (b) hits reported per L2 miss. Series: STa, STs, wotde, DS, LZ, CCSA, FM, AFFM, RLFM, SSA. X-axis: requested number of hits.]


Table 7: Tree operations, showing time in seconds for construction, bottom-up and top-down traversal, and longest common substring (LCSS) search.

(a) 50 MB DNA data.

                   STa    STs   esa1   esa2    CST
  Build             50     59     76     77   2604
  Top-down          14     16    1.9    2.5     16
  Bottom-up        9.7    9.3    9.8    9.9     23
  LCSS              24     33    4.0    3.7    313
  Memory          1586   1461   1313   1313    226
  Memory (peak)   1961   1837   1621   1621    364

(b) 50 MB protein data.

                   STa    STs   esa1   esa2    CST
  Build             42    137     75     76   4788
  Top-down          11     14    1.9    2.5     17
  Bottom-up        6.4    6.3    7.0    7.3     24
  LCSS              22     76     10    6.1    721
  Memory          1362   1236   1331   1364    251
  Memory (peak)   1737   1611   1514   1513    391

On LCSS, esa2 is much faster than esa1, because many fields are read in each "node", and these are stored close to each other in esa2. In general, esa2 is as fast as or faster than the regular suffix trees on tree traversal.

The compressed suffix tree is very competitive on top-down and bottom-up traversal, where the operations on nodes are performed in constant time, but it is around 100 times slower than esa2 on LCSS, where they are not [39].

5 Conclusion

It has been shown that performance is strongly dependent on locality also for data structures resident in primary memory. This should be considered when implementing indexes, as can be seen when comparing the performance of STa with STs, and esa2 with esa1. Which index structure should be chosen for which task depends on which matters most: space usage, construction time, query lookup time, or the speed of reporting hits.

• Suffix trees with dynamic arrays can be as much as 20 times faster on construction than sibling lists for byte-sized alphabets, as seen in Figure 6a, and 10 times faster on query lookup, as seen in Figure 14. The array representation requires about 20% more space.


• In applications where large numbers of hits must be reported, suffix arrays are strongly preferable over suffix trees, as they list hits 1-2 orders of magnitude faster. See Figure 16a.

• For fast lookup of small numbers of hits, a suffix tree variant is the most effective index, if you have enough memory. See Figures 12a and 14a.

• The lookup performance of the enhanced suffix array depends on the alphabet size, and a regular suffix array may be faster if the alphabet is sufficiently large. See Figure 14a.

• The deep-shallow suffix array construction should in general be chosen over BPR, as it has a much lower space overhead. See Table 6.

• Among the implementations tested here, the FM index is the most space efficient, as long as the alphabet size is not too large. See Table 5.

• The LZ-index is fast on construction and on listing hits. As it is more space efficient than a suffix array, it would be the structure of choice in many situations.

• The wotdeager suffix tree is fast on lookup and reporting hits, but the construction easily breaks down for non-random data, as seen for some of the benchmarks in Table 3.

• Tree traversal algorithms are faster with enhanced suffix arrays than with suffix trees. See Table 7.

In general, when designing the layout of data structures, it is important to consider the access patterns in construction and search to maximise locality. A few more computational steps during lookup will often be cheaper than a cache miss.

Acknowledgements.

The author would like to thank all the authors who have made their code publicly available, and Øystein Torbjørnsen, Magnus Lie Hetland and Tor Egge for helpful feedback on this article.

References

[1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing suffix trees with enhanced suffix arrays. J. of Discrete Algorithms, 2(1):53-86, 2004.

[2] S. J. Bedathur and J. R. Haritsa. Engineering a fast online persistent suffix tree construction. In Proc. ICDE, page 720, 2004.

[3] The Canterbury corpus. http://corpus.canterbury.ac.nz/.

[4] L. Devroye. Note on the average depth of tries. Computing, 28:367-371, 1982.

[5] M. Farach. Optimal suffix tree construction with large alphabets. In Proc. FOCS, pages 137-143, 1997.

[6] M. Farach-Colton, P. Ferragina, and S. M. Muthukrishnan. On the sorting complexity of suffix tree construction. J. ACM, 47(6):987-1011, 2000.

[7] P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Proc. FOCS, page 390, 2000.

[8] P. Ferragina and G. Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005.

[9] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. An alphabet-friendly FM-index. In Proc. SPIRE, pages 150-160, 2004.

[10] M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538-544, 1984.

[11] R. Giegerich, S. Kurtz, and J. Stoye. Efficient implementation of lazy suffix trees. Software: Practice and Experience, 33:1035-1049, 2003.

[12] S. Grabowski, V. Mäkinen, and G. Navarro. First Huffman, then Burrows-Wheeler: An alphabet-independent FM-index. In Proc. SPIRE, 2004.

[13] N. Grimsmo. Dynamic indexes vs. static hierarchies for substring search. Master's thesis, Norwegian University of Science and Technology, 2005.

[14] R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proc. SODA, pages 841-850, 2003.

[15] R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In Proc. STOC, pages 397-406, 2000.

[16] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

[17] Intel Corporation. IA-32 Intel® Architecture Optimization Reference Manual, 2006.

[18] J. Kärkkäinen and P. Sanders. Simple linear work suffix array construction. In Proc. ICALP, 2003.

[19] J. Kärkkäinen and E. Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. WSP, pages 141-155, 1996.

[20] D. K. Kim, J. S. Sim, H. Park, and K. Park. Linear-time construction of suffix arrays. In Proc. CPM, pages 186-199, 2003.

[21] P. Ko and S. Aluru. Space efficient linear time construction of suffix arrays. In Proc. CPM, 2003.

[22] S. Kurtz. Reducing the space requirement of suffix trees. Software: Practice and Experience, 29(13):1149-1171, 1999.

[23] V. Mäkinen. Compact suffix array. In Proc. CPM, pages 305-319, 2000.

[24] V. Mäkinen and G. Navarro. Compressed compact suffix arrays. In Proc. CPM, pages 420-433, 2004.

[25] V. Mäkinen and G. Navarro. Run-length FM-index. In Proc. DIMACS, pages 17-19, 2004.

[26] V. Mäkinen and G. Navarro. Succinct suffix arrays based on run-length encoding. In Proc. CPM, 2005.

[27] U. Manber and G. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. on Computing, 22(5):935-948, 1993.

[28] G. Manzini and P. Ferragina. Engineering a lightweight suffix array construction algorithm. Algorithmica, 40(1):33-50, 2004.

[29] E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262-272, 1976.

[30] Manzini and Ferragina's test collection. http://www.mfn.unipmn.it/~manzini/lightweight/.

[31] S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. SODA, pages 657-666, 2002.

[32] G. Navarro. The LZ-index: A text index based on the Ziv-Lempel trie. Technical Report TR/DCC-2003-1, Dept. of Computer Science, Univ. of Chile, 2003.

[33] G. Navarro. Indexing text using the Ziv-Lempel trie. J. of Discrete Algorithms, 2(1):87-114, 2004.

[34] G. Navarro and V. Mäkinen. Compressed full-text indexes. Technical Report TR/DCC-2006-6, Dept. of Computer Science, Univ. of Chile, 2006.

[35] M. H. Overmars and J. van Leeuwen. Some principles for dynamizing decomposable searching problems. Technical Report RUU-CS-80-1, Rijksuniversiteit Utrecht, 1980.

[36] Performance Application Programming Interface. http://icl.cs.utk.edu/papi/.

[37] Perfctr hardware performance counters. http://user.it.uu.se/~mikpe/linux/perfctr/.

[38] K. Sadakane. New text indexing functionalities of the compressed suffix arrays. J. of Algorithms, 48(2):294-313, 2003.

[39] K. Sadakane. Compressed suffix trees with full functionality. Theory of Computing Systems (online), 2007.

[40] K. Schürmann and J. Stoye. An incomplex algorithm for fast suffix array construction. In Proc. ALENEX, 2005.

[41] Y. Tian, S. Tata, A. Hankins, and M. Patel. Practical methods for constructing suffix trees. VLDB J., 14(3):281-299, 2005.

[42] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(5):249-260, 1995.

[43] N. Välimäki, W. Gerlach, K. Dixit, and V. Mäkinen. Engineering a compressed suffix tree implementation. In Proc. WEA, pages 217-228, 2007.

[44] P. Weiner. Linear pattern matching algorithms. In Proc. SWAT, pages 1-11, 1973.

[45] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, 1977.



Paper 8

Truls A. Bjørklund, Nils Grimsmo, Johannes Gehrke and Øystein Torbjørnsen

Inverted Indexes vs. Bitmap Indexes in Decision Support Systems

Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009)

Abstract Bitmap indexes are widely used in Decision Support Systems (DSSs) to improve query performance. In this paper, we evaluate the use of compressed inverted indexes with adapted query processing strategies from Information Retrieval as an alternative. In a thorough experimental evaluation on both synthetic data and data from the Star Schema Benchmark, we show that inverted indexes are more compact than bitmap indexes in almost all cases. This compactness combined with efficient query processing strategies results in inverted indexes outperforming bitmap indexes for most queries, often significantly.

My role as an author

This paper and the ideas it contains were mostly the work of Bjørklund. I took part in some brainstorming of ideas, in discussions about the implementation, and in the writing process. A technical contribution of mine was to configure the FastBit system we compared against so that it did not crash for large queries.

Retrospective view

Bjørklund and I share supervisors, started on our PhDs at roughly the same time, and were both given tasks related to indexing and search. Common denominators have been columnar storage and multi-way operators. In retrospect it could have been advantageous for us if we had been given even closer topics, and had shared more implementation in our research projects.


1 Introduction

Decision Support Systems (DSSs) support queries over large amounts of structured data, and bitmap indexes are often used to improve the efficiency of important query classes involving selection predicates and joins [16, 17].

Bitmap indexes were formerly also used in Information Retrieval (IR), but have today been largely replaced by inverted indexes. Part of the reason why inverted indexes gained popularity in IR was that they easily support integrating the new fields required to support ranked queries. The switch from bitmap indexes to inverted indexes led to a flood of research on efficient inverted indexes [25, 30, 6, 13, 24, 3, 31, 29], and inverted indexes are now the preferred indexing method in search engines [30].

In this paper, we are asking (and answering) the question: What are the trade-offs of using inverted indexes in DSSs, and should they be considered a serious alternative to bitmap indexes? The main contributions of this paper are (1) the study of how to use and implement inverted indexes in DSSs, and (2) a thorough performance evaluation that compares inverted indexes and bitmap indexes in DSSs. In particular, we compare inverted indexes with FastBit (http://sdm.lbl.gov/fastbit/), a state-of-the-art bitmap query processing and indexing system based on WAH-compressed bitmap indexes [27].

2 Background

A standard bitmap index has one bitmap per distinct value of the indexed attribute, with 1's at the positions of tuples with the represented value, and 0's elsewhere. Bitmaps can be combined using bitwise operators to answer complex boolean queries. For attributes with few distinct values, bitmap indexes are relatively compact, but their space usage increases linearly with the cardinality. One approach to limiting the space usage of bitmap indexes for high-cardinality attributes is compression. WAH [27] is one of several compression schemes that have been introduced. Although there are schemes giving more compact indexes, WAH supports efficient query processing. This, combined with the fact that FastBit is openly available, motivates the use of WAH-compressed bitmap indexes in the experiments in this paper.

WAH compression is a form of word-aligned run-length encoding for bitmap indexes, where consecutive words containing only 0's or only 1's are stored as fill words, and other words are stored literally [26, 27]. WAH-compressed bitmaps for high-cardinality attributes are relatively compact because most words contain only 0's.
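As a rough illustration of the fill-word idea (explicitly not FastBit's actual format, which packs 31 payload bits per 32-bit word), the following Java sketch collapses runs of all-0 or all-1 payload words into a single counted fill word; the word layout in the comments is an assumption for this toy.

```java
import java.util.ArrayList;
import java.util.List;

// Toy layout (assumed): a literal word has its top bit clear and carries
// 31 payload bits; a fill word has the top bit set, a fill-value bit,
// and a 30-bit run length counted in words.
final class WahSketch {
    static List<Integer> compress(int[] words) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int w = words[i];
            if (w == 0 || w == 0x7FFFFFFF) {      // all-0 or all-1 payload
                int run = 1;
                while (i + run < words.length && words[i + run] == w) run++;
                int fillValue = (w == 0) ? 0 : 1;
                out.add((1 << 31) | (fillValue << 30) | run);
                i += run;
            } else {
                out.add(w);                        // stored literally
                i++;
            }
        }
        return out;
    }
}
```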

In IR, inverted indexes consist of a search structure over all searchable words, called a dictionary, and of lists of references to the documents containing each searchable word, called inverted lists. An inverted index for an attribute in a DSS consists of a dictionary of the distinct values in the attribute, with pointers to inverted lists that reference tuples with the given value through tuple identifiers (TIDs). To reduce both the space usage and the I/O requirements in query processing, the inverted lists are often compressed by storing the deltas between the sorted references [30]. This approach makes small values more likely, and several compression schemes that represent small values compactly have been suggested. According to a recent study, PForDelta [31] is currently the most efficient method [29], and it is therefore used in this paper. PForDelta stores deltas in a word-aligned version of bit packing, which also includes exceptions to enable storing larger values than the chosen number of bits allows [31].


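To show why sorted TID references compress well, here is a hedged Java sketch of delta-gap encoding using simple variable-byte coding. This is not PForDelta (which bit-packs a fixed width per block and patches exceptions); it only illustrates the gap idea described above, and the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;

final class GapCoder {
    // Turns a sorted inverted list into gap values, each written in
    // variable-byte form: 7 payload bits per byte, high bit set on
    // continuation bytes.
    static byte[] encodeGaps(int[] sortedTids) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int tid : sortedTids) {
            int gap = tid - prev;      // dense lists give small gaps
            prev = tid;
            while (gap >= 0x80) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);            // final byte has high bit clear
        }
        return out.toByteArray();
    }
}
```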

Two overall query processing approaches exist in search engines. Document-at-a-time strategies avoid materializing intermediate results by processing all inverted lists in a query in parallel [6, 24], and are well suited for boolean query processing. They can be combined with skipping, which is used in search engines to avoid reading and decompressing the parts of inverted lists that are not required to process a query [13]. We give a brief description of how we use these ideas in the query processing in this paper in the next section.

3 Query Processing

Recall that we use document-at-a-time strategies that avoid materializing intermediate results to process inverted index queries in this paper. We support three operators which can be combined to answer complex queries. They all support skipping to the next result with a given minimum TID value, in addition to standard Volcano-style iteration [9].

The SCAN operator can iterate through an inverted list. To support skipping, every kth TID in each inverted list is stored in an external list. The external list is kept in memory during scans, and supports binary searches to find the correct part of the inverted list to process when skipping.

The OR operator provides an iterator interface over the sorted merge of its multiple input iterators. The iterators are organized in a priority queue based on a heap, which is maintained so that the input with the smallest next TID is at the top. Skipping in the OR operator is based on a breadth-first search in the heap. A skip may not result in an actual skip for a given input iterator. If so, we know that neither of its children in the heap can do any skipping either, and we therefore avoid testing them. After the search, we make sure that only the part of the heap involving iterators that actually skipped is maintained. This approach is reasonably efficient both when skips are performed in a large and in a small fraction of the set of input iterators.

The AND operator expects its input iterators to be sorted in ascending order according to the expected number of returned results. To find the next result, we start with a candidate from the iterator with the fewest expected results. We then try to skip to the candidate in the other input iterators, restarting with a new candidate if the current candidate is absent from one iterator. A candidate found in all inputs is returned as a result. To support skipping, we start with the value to skip to as the candidate and proceed as in normal iteration.
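A compact Java sketch of the AND algorithm just described, written against an assumed iterator interface (TidIterator and AndOperator are hypothetical names; the paper's actual operator classes and the OR heap logic are not shown):

```java
import java.util.function.IntConsumer;

// Assumed iterator interface over one inverted list or sub-operator.
interface TidIterator {
    int next();           // next TID, or Integer.MAX_VALUE when exhausted
    int skipTo(int tid);  // next TID >= tid, or Integer.MAX_VALUE
}

final class AndOperator {
    // inputs[0] is assumed to have the fewest expected results.
    static void intersect(TidIterator[] inputs, IntConsumer emit) {
        int candidate = inputs[0].next();
        while (candidate != Integer.MAX_VALUE) {
            boolean match = true;
            for (int i = 1; i < inputs.length && match; i++) {
                int t = inputs[i].skipTo(candidate);
                if (t != candidate) {
                    // Candidate absent: restart from the value we landed on.
                    candidate = inputs[0].skipTo(t);
                    match = false;
                }
            }
            if (match) {
                emit.accept(candidate);   // found in all inputs
                candidate = inputs[0].next();
            }
        }
    }
}
```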


[Figure 1: Index sizes. X-axis gives attribute cardinality. Top panel: uniform data; bottom panel: Zipf data, k = 1.5. Y-axis: MB. Series: uncompressed attribute, FastBit, InvInd, InvInd w/skip.]

4 Experiments

To investigate the trade-offs between inverted indexes and bitmaps, we experiment with FastBit and with our inverted index solutions with and without support for skipping. We present results from experiments with synthetic data and with data from the Star Schema Benchmark (SSB) [14].

All experiments are run on a quad-core Intel Xeon CPU at 3.2 GHz with 16 GB main memory. All indexes are stored on disk, but queries are run warm by performing 10 runs during one invocation of the system and reporting the average of the last 8, thus measuring in-memory query processing performance. We run FastBit version 1.0.5 (implemented in C++), with extra stack space to enable processing queries with many operands. Our approaches are implemented in Java (version 1.6). We use additional warm-up for our system to enable run-time optimizations in the Java virtual machine that reduce variance between runs. Additional warm-up did not change the performance of FastBit.

4.1 Synthetic Data

To experiment with synthetic data we generate two tables. Both tables have 10 million tuples and 8 indexed attributes with maximum cardinalities ranging from 2, via all powers of 10, up to 10 million. The attributes in the first table follow a uniform distribution, while a Zipf distribution (with k = 1.5) is used in the other.


4.1.1 Index Size

The sizes of the uncompressed attributes in the synthetic tables and of their indexes are shown in Figure 1. When using standard PForDelta compression on the attribute with cardinality 2 in the table with uniform data, each value is represented with 4 bits in the most compact index. The reason why a lower number of bits results in a larger index is that the implementation of PForDelta may introduce artificial extra exceptions when using a small number of bits per value [31]. Bitmap indexes are known to be compact when the cardinality is 2, and FastBit outperforms our approaches in this case. PForDelta results in compact indexes for higher-cardinality attributes, and most of the space usage for the highest-cardinality attribute comes from the dictionary (62 out of 91 MB). The WAH-compressed bitmaps for the same attribute can in the worst case contain nearly 3 computer words per tuple, resulting in a space usage of over 228 MB on a 64-bit architecture. The actual results are significantly better, but compressed inverted indexes are clearly more compact.

Indexes for Zipf-distributed attributes are more compact than for uniformly distributed attributes with the same maximum cardinality, because skewed distributions make it less likely that the actual cardinality equals the maximum.

4.1.2 Query Processing

To experiment with query processing performance, we test four different query types, which all vary the attribute on which there is a single-value predicate:

1. Query type SCAN: Finds all tuples with value 0 for a varied attribute.
2. Query type skewed AND: Finds all tuples having value 0 for the attribute with cardinality 10, in addition to 0 in one other varied attribute.
3. Query type OR: Finds all tuples having values in the lower half range for a varied attribute.
4. Query type AND-OR: Finds all tuples with value in the lower half range for the attribute with cardinality 100000, and value 0 for another varied attribute.

All queries compute the sum of the primary keys of the matching tuples, to ensure that the output from the index is used to perform table look-ups. In the table with uniform distributions there were no tuples with value 0 for the highest-cardinality attribute, so all single-value predicates on this attribute were changed to require the value 2. The results are shown in Figure 2.

Compared to bitmaps, decompressed inverted lists are well suited for looking up other attributes for the qualifying tuples, a factor contributing to faster scans for uniform data. The difference in index size also seems to have an impact. All scans are relatively slow for Zipfian data because we always search for the most common attribute value in a skewed distribution, except for the highest-cardinality attribute as noted above.

Skewed AND favors methods capable of taking advantage of the different densities of the operands. Inverted indexes with skipping are therefore efficient for uniform data, but introduce overhead for Zipfian data, because both operands are dense when the most common values in skewed distributions are accessed. FastBit performs well on dense operands, both because it can combine multiple logical TIDs using one CPU instruction, and because it applies the operator before extracting the tuple references. Because neither input is smaller than the output for AND operators, FastBit decodes fewer references than the inverted indexes.


[Figure 2: Results from running queries on generated tables, showing query time in seconds for varying cardinalities. Columns: SCAN query (log scale), skewed AND, OR query, AND-OR query (log scale); rows: uniform data (a1-a4) and Zipf data (b1-b4). Series: FastBit, InvInd, InvInd w/skip.]


[Figure 3: Size of indexes for foreign key columns in SSB (custkey, partkey, suppkey). Y-axis: MB. Series: uncompressed attribute, FastBit, InvInd, InvInd w/skip.]


The multi-way OR operators in our solution demonstrate better scalability than FastBit with respect to the number of inputs, for both tables.

The idea of skipping in OR operators is ideally suited for query type AND-OR, but it is only useful when the other operands to the AND return data that enables reasonable skip lengths, which occurs for high-cardinality attributes with uniform distributions.

4.2 Star Schema Benchmark

Star schemas represent a best practice for how to organize data in decision support systems, and are characterized by a central fact table that references several smaller dimension tables. Typical queries on such schemas involve joins of the fact table with the relevant dimension tables, called star joins. Bitmap indexes can be constructed over the foreign keys in the fact table to speed up such joins, and are then called join indexes [16, 17]. We experiment with using inverted indexes as an alternative to bitmap indexes for this purpose in the Star Schema Benchmark (SSB) [14]. We use Scale Factor 1, and pre-calculate the foreign keys that match the queries in SSB. They are submitted as part of the query to the tested systems. We also avoid calculating the exact answer, and rather let all queries return the sum of an attribute of the fact table. This isolates the effects of the indexes while making sure the returned results are suitable for further look-ups in the fact table. There are four dimension tables in SSB, but we avoid constructing join indexes for the Date table, because FastBit is unable to process the queries involving all tables without a very large stack size.
tables without a very significant stack size.<br />

4.2.1 Index Size<br />

Figure 3 shows the join index sizes in both systems. FastBit has significantly larger indexes<br />

because the foreign keys have relatively high cardinalities. The attribute custkey is partly<br />

sorted, resulting in longer runs in WAH-compressed indexes, <strong>and</strong> the relative difference<br />

between FastBit <strong>and</strong> inverted indexes is therefore smaller in that case.<br />

[Figure 4: Query processing time in seconds for SSB queries (Q2.1 through Q4.3), comparing FastBit, InvInd, and InvInd w/skip.]

4.2.2 Join Processing

The query processing results for SSB are given in Figure 4. Within each set of queries, the predicates on the dimension tables become increasingly selective, making FastBit perform better relative to inverted indexes because the OR operators that combine the tuples representing each qualifying foreign key have fewer inputs. Query 2.3 has skewed input to an AND operator, making skipping important for performance. The OR operator providing dense input to the AND is over the attribute with the lowest cardinality, contributing to the smaller performance difference between FastBit and inverted indexes for this query.
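To make the evaluation pattern concrete: for each dimension, the posting lists of the qualifying foreign keys are combined with an OR, and the per-dimension results are then combined with an AND. The Java sketch below is our illustration of that shape (a naive in-memory version; the map-based join index and all names are assumptions, not the tested systems' code).

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class StarJoin {
    // postings.get(key) holds the sorted fact-table row IDs that reference this foreign key.
    // The OR over all qualifying keys of one dimension.
    static int[] or(Map<Integer, int[]> postings, Set<Integer> qualifyingKeys) {
        TreeSet<Integer> acc = new TreeSet<>(); // simple multi-way union
        for (int key : qualifyingKeys) {
            int[] list = postings.getOrDefault(key, new int[0]);
            for (int rid : list) acc.add(rid);
        }
        return acc.stream().mapToInt(Integer::intValue).toArray();
    }

    // The AND between two dimensions' results, as a plain merge.
    static int[] and(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0, j = 0; i < a.length && j < b.length; ) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out.add(a[i]); i++; j++; }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}

The skipping discussed above replaces the plain merge in and() with long forward jumps whenever one dimension is much more selective than the other.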

5 Related Work

Several alternatives to the compression schemes discussed in this paper have been suggested, both for bitmaps [4, 10, 5, 15, 21] and inverted indexes [25, 20, 1]. Experiments have shown that the query processing efficiency of WAH remains attractive, even though there are approaches resulting in smaller indexes. WAH is known to result in smaller indexes when the table is sorted on the indexed attribute [19, 12]. Due to space restrictions, we do not experiment with sorted tables in this paper. Experiments with compression in inverted indexes in IR have shown that PForDelta currently is the most efficient technique [29], and further improvements to the technique have also been suggested recently [28].

As an alternative to compression, there are several approaches that reduce the number of bitmaps in the index [17, 7, 8, 11, 22]. Strategies for operating on many bitmaps by processing two at a time have been explored for WAH-compressed bitmap indexes [26], and a recent paper suggests using multi-way operators for bitmaps, but the idea is not tested [12]. Query processing approaches in inverted indexes in IR have focused on term-at-a-time strategies in addition to the document-at-a-time approach used in this paper [6, 24, 18, 2, 3, 23].

6 Conclusions

In this paper, we have evaluated the applicability of compressed inverted indexes as an alternative to bitmap indexes in DSSs. Inverted indexes are generally significantly more space efficient. The only case where WAH-compressed bitmaps are clearly more compact is when the cardinality of the indexed attribute is very low. FastBit performs well on simple queries with dense operands, but inverted indexes are better in other cases, often significantly.

Acknowledgments: This material is based upon work supported by New York State Science, Technology and Academic Research under agreement number C050061, by Grant NFR 162349, by the National Science Foundation under Grants 0534404 and 0627680, and by the iAd Project funded by the Research Council of Norway. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NYSTAR, the National Science Foundation, or the Research Council of Norway.

References

[1] V. N. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. ADC, 2004.

[2] V. N. Anh and A. Moffat. Simplified similarity scoring using term ranks. In Proc. SIGIR, 2005.

[3] V. N. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proc. SIGIR, 2006.

[4] G. Antoshenkov. Byte-aligned bitmap compression. In Proc. DCC, 1995.

[5] T. Apaydin, G. Canahuate, H. Ferhatosmanoglu, and A. S. Tosun. Approximate encoding for direct access and query processing over compressed bitmaps. In Proc. VLDB, 2006.

[6] E. W. Brown. Fast evaluation of structured queries for information retrieval. In Proc. SIGIR, 1995.

[7] C.-Y. Chan and Y. E. Ioannidis. Bitmap index design and evaluation. SIGMOD Rec., 27(2), 1998.

[8] C.-Y. Chan and Y. E. Ioannidis. An efficient bitmap encoding scheme for selection queries. In Proc. SIGMOD, 1999.

[9] G. Graefe. Volcano - an extensible and parallel query evaluation system. IEEE Trans. on Knowl. and Data Eng., 6(1), 1994.

[10] T. Johnson. Performance measurements of compressed bitmap indices. In Proc. VLDB, 1999.

[11] N. Koudas. Space efficient bitmap indexing. In Proc. CIKM, 2000.

[12] D. Lemire, O. Kaser, and K. Aouiche. Sorting improves word-aligned bitmap indexes. CoRR, abs/0901.3751, 2009.

[13] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4), 1996.

[14] E. O'Neil, P. O'Neil, and X. Chen. The Star Schema Benchmark. http://www.cs.umb.edu/~poneil/StarSchemaB.PDF.

[15] E. O'Neil, P. O'Neil, and K. Wu. Bitmap index design choices and their performance implications. In Proc. IDEAS, 2007.

[16] P. O'Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Rec., 24(3), 1995.

[17] P. O'Neil and D. Quass. Improved query performance with variant indexes. In Proc. SIGMOD, 1997.

[18] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. J. Am. Soc. Inf. Sci., 47(10), 1996.

[19] A. Pinar, T. Tao, and H. Ferhatosmanoglu. Compressing bitmap indices by data reorganization. In Proc. ICDE, 2005.

[20] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.

[21] M. Stabno and R. Wrembel. RLH: Bitmap compression technique based on run-length and Huffman encoding. Information Systems, 2008.

[22] K. Stockinger, K. Wu, and A. Shoshani. Evaluation strategies for bitmap indices with binning. In Database and Expert Systems Applications, 2004.

[23] T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In Proc. SIGIR, 2007.

[24] T. Strohman, H. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proc. SIGIR, 2005.

[25] I. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Academic Press, 1999.

[26] K. Wu, E. Otoo, and A. Shoshani. On the performance of bitmap indices for high cardinality attributes. In Proc. VLDB, 2004.

[27] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. ACM Trans. Database Syst., 31(1), 2006.

[28] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW, 2009.

[29] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, 2008.

[30] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.

[31] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, 2006.


Paper 9

Truls A. Bjørklund, Michaela Götz, Johannes Gehrke and Nils Grimsmo

Search Your Friends And Not Your Enemies

Submitted to Proceedings of the VLDB Endowment (VLDB 2011)

Abstract: More and more data is accumulated inside social networks. Keyword search provides a simple interface for exploring this content. However, a lot of the content is private, and a search system must enforce the privacy settings of the social network. In this paper, we present a workload-aware keyword search system with access control. We develop a range of cost models that vary in sophistication and accuracy. These cost models provide input to an optimization algorithm that selects the ideal solution for a given workload. With our cost models we find designs that outperform previous approaches by up to a factor of 3. We also address the query processing strategy in the system, and develop a novel union operator called HeapUnion that speeds up query processing by a factor between 1.1 and 2.3 compared to the best previous solution. We believe that both our cost models and our novel union operator will be of independent interest for future work.

My role as an author

Bjørklund was the main contributor for this paper. I supported him during debugging sessions for the cost models, offered a shoulder to cry on when they had to be modified, and helped during the writing process.


1 Introduction

More and more data is accumulated inside social networks where users tweet, update their status, chat, post photos, and comment on each other's lives. From a user's perspective, a lot of her content is private and should not be accessible to everybody. To limit arbitrary information flow, social networks enable users to adjust their privacy settings, e.g., to ensure that only friends can see the content they have posted. Keyword search provides a simple interface for the exploration of content in social networks. It is the challenge of enabling search over the content while enforcing privacy settings that motivates the research in this paper.

Search over collections of documents (without taking social networks into account) is a well-studied problem [29], and a search query consisting of keywords is usually answered using an inverted index [31]. An inverted index consists of a look-up structure over all unique terms in the indexed document collection, and a posting list for each term. The posting list contains the document identifiers (IDs) of all documents that contain the term.

Search in a social network is more challenging because each user sees a unique subset of the document collection. We view a social network as a directed graph where each node represents a user and a directed edge represents a (one-way) friendship. We assume in this paper that the social network implements the following privacy setting: a user has access to the documents authored by herself and to the documents authored by her friends. We rank the results of a query according to recency, with the most recent documents ranked highest; then we retrieve the top-k results.

Figure 1(a) shows an example of a social network with four users, where User 4 is friends with Users 1 and 3, and User 2 is friends with Users 1 and 4. All the users have posted documents, and each document has an ID which is shown in its top right corner. User 3 has posted Documents 2 and 5, and in our model she can search across Documents 2, 5, and 7.

In previous work we developed a conceptual framework for solutions to this problem [8]. Since we need to both search and enforce access control, we have characterized solutions along two axes: the index axis and the access axis. The index axis captures the idea that instead of creating one single inverted index over all the content in the social network, we may create several inverted indexes, each containing a subset of the content. A set of inverted indexes and their content is called an index design. The access axis mirrors the index axis and describes the meta-data used to filter out inaccessible results; the meta-data is organized into author-lists. For the purpose of this paper, an author-list contains the IDs of all documents authored by a set of users. An access design describes a set of author-lists.

Previous work experimented with a few extreme points in this solution space; it showed that two of the most promising solutions both use an index design with a single index containing all users' documents, while the access designs in the two approaches differ. The first approach is called user design and has one author-list per user that contains the document IDs posted by that particular user. The second approach is called friends design; it also has one author-list per user, but this author-list contains the documents posted by the user and all of her friends. The author-lists for the user and friends designs for our example from Figure 1(a) are shown in Figures 1(b) and 1(c), respectively. In both of these designs, a keyword query from a user is processed in the single inverted index. To enforce access control, the results from the index are intersected with a set of author-lists containing all friends of the user. In the friends design, all friends of the user are represented in the author-list for the user, whereas in the user design, we need to calculate the union of the author-lists for all friends.

[Figure 1: Social Network and Basic Designs. (a) An example social network with posted documents. (b) User Design author-lists: User 1: 6 4 1; User 2: 3; User 3: 5 2; User 4: 7. (c) Friends Design author-lists: User 1: 7 6 4 3 1; User 2: 7 6 4 3 1; User 3: 7 5 2; User 4: 7 6 5 4 2 1.]

Because of the promising performance of user and friends designs in previous work [8], we have chosen them as the basis for our solutions. Note that the two designs have very different tradeoffs. When a user posts a new document, only a single author-list must be updated in the user design. In the friends design, however, the author-lists for all users that are friends with the posting user must be updated. During queries, only one author-list is accessed with the friends design, whereas one author-list for each friend of the user must be accessed with the user design.

In this paper we bridge the gap between the two extremes to enable efficient search in a social network with access control. We propose an intermediate strategy that combines the best of both the friends design and the user design: low search costs and low update costs. Our solution starts with the user design and then judiciously adds selected additional author-lists to improve query performance.

The remainder of the paper explains our solution in detail. In Section 2, we introduce notation and define the problem we address. Then, we develop efficient query processing strategies based on a novel union operator (Section 3). Furthermore, we develop a set of cost models with various degrees of sophistication and accuracy, and use them as bases for workload optimization to find the ideal access design for a particular workload (Section 4). In a thorough experimental evaluation, we demonstrate the efficiency of our new union operator, validate the accuracy of our cost models, and compare access designs resulting from workload optimization based on different cost models to previous work (Section 5). Related work is discussed in Section 6 and we conclude in Section 7. We believe that both our novel union operator and our cost models will be of independent interest for future work.

2 Problem Definition

In this section, we will introduce some notation which is used to define the problem addressed in this paper.

Search Data and Query Model. We view a social network as a directed graph ⟨V, E⟩, where each node u ∈ V represents a user. There is an edge ⟨u, v⟩ ∈ E if user v is a friend of user u, denoted v ∈ F_u, or alternatively that u is one of v's followers, denoted u ∈ O_v. We always have u ∈ F_u and u ∈ O_u.

We consider workloads that consist of two different operations: posting new documents and issuing queries. A new document, which we also refer to as an update, consists of a set of terms. We will call the user who posted document d the author of d, and we will also say that the user authored d. The new document gets assigned a unique document ID; more recently posted documents have higher IDs. Let n_u denote the number of documents authored by user u, and let N = Σ_u n_u denote the total number of documents in the system.

A query submitted by a user u consists of a set of keywords. As mentioned in the previous section, we assume that only documents authored by users in F_u are accessible to u, and that the ranking is based on recency. Thus the results of a keyword query are the k documents that (1) contain the query keywords, (2) are authored by a user in F_u, and (3) have the k highest document IDs among the documents that satisfy (1) and (2).

Beyond User and Friends Designs. The relative merits of the user and friends designs motivate the solution in this paper. Recall from the previous section that in the user design, updates are efficient because only u's author-list is updated when u posts a document; queries, however, need to access the author-lists of all users in F_u. In the friends design, queries are more efficient because only the author-list of u is accessed. Updates, however, need to change the author-lists of all users in O_u.

In our approach, we start out with the user design. In addition, we add one author-list l_u for each user u; l_u contains the IDs of all documents authored by a selected set of users L_u ⊆ F_u. When user u submits a query, there is no need to access specific author-lists for users in L_u, and queries therefore become more efficient as more users are represented in L_u. On the other hand, representing more users in L_u leads to higher update costs. We therefore determine the contents of L_u (and thus l_u) based on the workload characteristics.

System Model. Keyword search over collections where updates are interleaved with queries is typically supported with a hierarchy of indexes [19, 10, 17, 21]. New data is accumulated in a small updatable structure that also supports concurrent queries, while the rest of the hierarchy consists of a set of read-only indexes. The read-only indexes change through merges that form larger read-only indexes, resulting in a hierarchy where the largest indexes contain the least recent documents.

An index hierarchy is well suited for search in social networks, especially when used in combination with an access design that adapts to the workload. The time at which indexes are merged represents an opportunity to modify the access design and adapt it to the current workload, so that different indexes in the hierarchy potentially have different access designs. In our system, we process top-k queries with recency used as ranking, and the largest indexes would therefore rarely be accessed. Our workload-aware algorithms can be used to find suitable access designs in these cases.

In this paper, we focus on selecting an access design for an index with a stratified workload, where all updates are processed before all queries. A stratified workload closely approximates the workloads for the indexes at the various levels in the hierarchy, except that the update costs can vary slightly based on the structure of the hierarchy [19, 10, 17, 21]. Recall that a system based on a hierarchy of indexes, where each index supports stratified workloads, will still support interleaved workloads overall, because of the small updatable part of the index. The updatable part supports interleaved workloads, but we have verified experimentally that, because it is small, its processing times for stratified workloads approximate the processing times for interleaved workloads very well (within single percentage points), because the constant costs in query processing dominate. The problem we focus on in this paper is thus important for a system that supports an interleaved workload of updates and queries, and we will address both query processing and adaptation to workloads based on cost models.

3 Query Processing

Our main memory-based search system constructs indexes by accumulating batches of documents in an updatable structure where the lists are compressed using VByte [24]. The batches are combined in the end to form the complete index, where the lists are compressed using PForDelta [32, 30].
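As a hedged illustration of the accumulation format, here is a generic VByte sketch in Java, assuming the common convention where a set high bit terminates a value (the system's exact byte layout is not specified here):

import java.io.ByteArrayOutputStream;

final class VByte {
    // Encodes one non-negative delta; small deltas take one byte, larger ones more.
    static void encode(int value, ByteArrayOutputStream out) {
        while (value >= 0x80) {
            out.write(value & 0x7F); // low 7 bits; high bit 0 means "more bytes follow"
            value >>>= 7;
        }
        out.write(value | 0x80);     // high bit 1 marks the last byte of the value
    }

    // Decodes one value starting at pos[0], advancing pos[0] past it.
    static int decode(byte[] in, int[] pos) {
        int value = 0;
        int shift = 0;
        while (true) {
            int b = in[pos[0]++] & 0xFF;
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) return value; // stop bit seen
            shift += 7;
        }
    }
}

Under this convention a delta below 2^7 fits in one byte and a delta below 2^{14} in two, which is why long posting lists with small gaps compress well.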

The system answers queries by computing the intersection of a posting list with a union of author-lists.¹ Note that similar types of queries occur in several scenarios, e.g., in star joins in data warehouses [23]. We process the queries with the template shown in Figure 2, which is a combination of three operators (intersection, union and list iterator) that all support the following interface:

• Init(), initialize the operator.
• Current(), retrieve the current result.
• Next(), forward to and retrieve the next result.
• SkipTo(val), forward to the next result with value ≤ val.

[Figure 2: Query Template: an intersection (∩) whose inputs are the posting list p_t and a union (∪) of the author-lists a_1, a_2, ..., a_n.]

¹Notice that our presentation focuses on single-term queries. This is done to simplify the presentation and does not reflect a limitation of our system.
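Rendered as Java, the shared interface could look as follows (a sketch; the names and the EOF sentinel are our assumptions, not the system's actual code). Because results arrive in descending document-ID order, SkipTo(val) moves toward smaller IDs:

interface ResultOperator {
    int EOF = Integer.MIN_VALUE; // assumed sentinel for "no more results"

    void init();          // Init(): initialize the operator
    int current();        // Current(): retrieve the current result
    int next();           // Next(): forward to and retrieve the next result
    int skipTo(int val);  // SkipTo(val): forward to the next result with value <= val
}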

All results are returned in sorted order based on descending document IDs to facilitate efficient ranking on recency. The top-k ranked results are therefore found by calling Next() on the intersection operator k times. A standard intersection operator alternates between the inputs and skips forward to find a value that occurs in both. With such a solution, the implementation of the SkipTo(val) operation in the union operator is essential to the overall processing strategy. The union operator can for example merge all the inputs first and perform all operations on the merged list [23]. Another strategy is to perform all skips on all inputs and return the maximum result. This last strategy essentially calculates the intersection of the posting list and each single author-list and then returns the union of the results [23].

Raman et al. have introduced a union operator called Lazy Merge, which is based on the idea that if the number of skip operations in the union is large compared to the length of an input, it would have been ideal to pre-merge this input into a list of intermediate results. Lazy Merge adaptively merges an input into the intermediate results when the number of skip operations exceeds the length of the input times a constant α. The strategy never uses more than twice the processing time of a solution that pre-merges the optimal set of inputs [23]. However, the approach is not well suited for our workloads, for two reasons. First, we usually process top-k queries and therefore only need the first k results from the intersection. It is inefficient to merge a set of complete inputs when only a small fraction of the results are used during query processing. Second, we often have a large number of inputs to the union, which can lead to many merges in Lazy Merge. The analysis in Lazy Merge does not take the actual cost of merges into consideration, and this can result in far-from-optimal processing strategies with many inputs.

To address these issues, we develop a union operator called HeapUnion, which is described next. The implementation of the other operators is described in Appendix A.


3.1 HeapUnion

HeapUnion, our novel union operator, is designed to be efficient regardless of whether all or only a fraction of the results are actually needed, and to scale gracefully to a very large number of inputs. To achieve these goals, HeapUnion is based on a binary heap. The heap contains one entry for each input operator, and it is ordered based on the value obtained from calling Current() on each input operator (referred to as the current value for the input operator). The same overall strategy is commonly used in information retrieval during inverted index construction [29].

We support the standard operator interface by always having the input with the highest current value at the top of the heap, so that this value is also the current value for HeapUnion. The heap is initialized the first time Next() or SkipTo(val) is called. When the first call is a Next() (SkipTo(val) resp.), HeapUnion calls Next() (SkipTo(val)) on all inputs, and the heap is constructed using Floyd's algorithm [13]. Floyd's algorithm calls a sub-procedure called heapify(), which can be used to construct a legal heap from an entry with two legal sub-heaps as children. The heapify() operation has logarithmic worst-case complexity in the size of the heap, and Floyd's algorithm runs in linear time [13]. We will also use these operations during heap maintenance.

After initialization, HeapUnion works as shown in the pseudo code in Algorithm 1. The Current() operation either returns the current value from the input operator at the top of the heap, or it indicates that there are no more results if the heap is empty. The Next() operation forwards the input with the current value, and calls heapify() to ensure that the input with the new highest current value is at the top of the heap. The worst-case complexity of this operation is thus logarithmic in the number of input operators.

The SkipTo(val) operation is based on a breadth-first search (BFS) in the heap. When forwarding to a value val, only the inputs with a current value > val actually need to be forwarded. Because the heap is organized according to the current value for all inputs, we know that if a given input has a current value ≤ val, the same is true for all of its descendants. If we determine that no skip is necessary in a given input, we thus also know that no skip is required in any of its children in the heap, and there is no need to process the children in the BFS. Furthermore, if an input is not forwarded, we know that its position in the heap relative to its children will not change. We also take advantage of this observation by calling heapify() only for the inputs where an actual skip occurred, and use a complete run of Floyd's algorithm only in the worst case.

Because HeapUnion never pre-merges any inputs, its efficiency does not depend on the total number of results, only on the number of returned results. It is therefore well suited for top-k queries. Furthermore, the BFS in the heap during skip operations ensures scalability with the number of inputs. If only one input is forwarded during a skip operation, the processing cost is logarithmic in the number of inputs; if all inputs are forwarded, the total maintenance cost becomes linear in the number of inputs.


Algorithm 1 HeapUnion Operator

function Init():
    Allocate heap

function Next():
    heap[0].Next()
    heapify(0)
    return Current()

function SkipTo(val):
    Start a breadth-first search in the heap from the root
    while the BFS queue is not empty:
        Pop the next entry from the BFS queue
        if its iterator is forwarded in SkipTo(val):
            Add it to a LIFO-list of entries for heap reorganization
            Add its children to the BFS queue
    Call heapify() for the forwarded inputs in LIFO-list order
    return Current()

function Current():
    if size(heap) == 0:
        return Eof
    else:
        return heap[0].Current()
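For concreteness, the following Java sketch spells out Algorithm 1 over the ResultOperator interface sketched above; it is our illustration under those assumptions, not the paper's implementation. The heap is a max-heap on Current() values, and SkipTo(val) prunes the BFS below inputs that are already at or below val:

import java.util.ArrayDeque;

final class HeapUnion implements ResultOperator {
    private final ResultOperator[] heap; // heap[0] holds the input with the largest current value
    private final int size;
    private boolean built = false;

    HeapUnion(ResultOperator[] inputs) {
        this.heap = inputs.clone();
        this.size = inputs.length;
    }

    public void init() {
        for (ResultOperator in : heap) in.init();
    }

    public int current() {
        return size == 0 ? EOF : heap[0].current();
    }

    public int next() {
        if (!built) {                       // first call: forward all inputs, then build
            for (ResultOperator in : heap) in.next();
            build();
        } else {
            heap[0].next();                 // forward only the input at the top
            heapify(0);
        }
        return current();
    }

    public int skipTo(int val) {
        if (!built) {                       // first call: skip in all inputs, then build
            for (ResultOperator in : heap) in.skipTo(val);
            build();
            return current();
        }
        ArrayDeque<Integer> bfs = new ArrayDeque<>();
        ArrayDeque<Integer> touched = new ArrayDeque<>(); // LIFO: reorganize bottom-up
        if (size > 0) bfs.add(0);
        while (!bfs.isEmpty()) {
            int i = bfs.poll();
            if (heap[i].current() <= val) continue; // whole subtree is already <= val
            heap[i].skipTo(val);
            touched.push(i);
            int left = 2 * i + 1;
            if (left < size) bfs.add(left);
            if (left + 1 < size) bfs.add(left + 1);
        }
        while (!touched.isEmpty()) heapify(touched.pop());
        return current();
    }

    private void build() {                  // Floyd's algorithm: linear-time construction
        built = true;
        for (int i = size / 2 - 1; i >= 0; i--) heapify(i);
    }

    private void heapify(int i) {           // sift entry i down below larger children
        while (true) {
            int l = 2 * i + 1, r = l + 1, top = i;
            if (l < size && heap[l].current() > heap[top].current()) top = l;
            if (r < size && heap[r].current() > heap[top].current()) top = r;
            if (top == i) return;
            ResultOperator tmp = heap[i]; heap[i] = heap[top]; heap[top] = tmp;
            i = top;
        }
    }
}

Exhausted inputs report EOF = Integer.MIN_VALUE, so they sink to the bottom of the max-heap without any special casing.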

4 Cost Models

The efficiency of our system for a particular workload depends on the contents of the additional author-lists, and selecting a good set of lists is therefore essential. For each user u, any subset of F_u can be included in L_u, which leads to 2^{Σ_{u∈V} |F_u|} = 2^{|E|} possible designs. We use cost models to investigate this large space; the models are used as bases for optimization algorithms to find the best access design for a stratified workload of updates and queries.

In related work, Silberstein et al. have developed a simple cost model to address the problem of finding the most recent events in users' event feeds [25]. It is straightforward to adapt their model to our problem, and we will refer to this model as Simple. Simple assumes that the cost of a search is linear in the number of accessed author-lists, and that the cost of constructing a list is linear in the number of document IDs in the list. It thus captures the intuition that including more users in the additional author-lists leads to lower search costs and higher update costs. A formal description of Simple is found in Figure 3. Silberstein et al. have shown that when using Simple as a basis for optimization, a globally optimal solution can be found by making local decisions for each friend pair. This implies that only Σ_{u∈V} |F_u| = |E| different designs have to be explored in order to find the optimal one [25].
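To see how the local decision plays out, consider (our illustration, using Simple's constants from Figure 3) a stratified workload in which user u issues q_u queries and friend v posts n_v documents. Including v in L_u saves c_list on every query by u and costs c_update for every document v posts, so v should be included exactly when

    q_u · c_list > n_v · c_update,

a comparison that can be made independently for each friend pair.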

Unfortunately, Simple is not accurate enough for our application scenario. Figure 4 compares actual running times of our system to predictions from Simple. The tested workloads are described in Appendix B, but for now it suffices to know that a low limit indicates that the additional author-lists are almost empty, while a high limit indicates that the additional author-lists include most or all friends. The search estimates for Simple are clearly far from accurate, and as we will see in Section 5, this inaccuracy can lead to access designs that are up to 30% slower than designs from more accurate models. In this section, we introduce two more accurate models called Monotonic and Non-monotonic.

4.1 Monotonic

Monotonic is designed to be an accurate yet tractable cost model, where only a small number of access designs must be checked to find a globally optimal solution. The model for updates in Monotonic is the same as in Simple, but we use a different approach to model query costs. Monotonic has one cost model for each operator, and query costs are estimated by combining the models for all operators in the query. The cost model for an operator describes the cost of each method supported by the operator. For operators that have other operators as inputs, like HeapUnion and Intersection, the cost is calculated by combining the cost of operations within the operator with the cost of method calls on the inputs. To find the cost of the queries we use in this paper, we combine the operators according to the template in Figure 2, and calculate the cost of k Next() calls for the Intersection to retrieve the top-k results (assuming there are at least k).

Monotonic is described in Figure 3. Skip(s) is a model for SkipTo(val). If SkipTo(val) forwards the current value of an operator past Δv document IDs, we model the cost of the operation with Skip(r·Δv/N), where r is the number of results of the operator. The number of results for an operator is the number of times we can call Next() and retrieve a new result. Monotonic and Non-monotonic use the same models for List Iterator and Intersection, but they have different models for HeapUnion. We will now explain Monotonic's model for HeapUnion; models for the other operators are explained in Appendix C.²

HeapUnion. Let us assume that HeapUnion has m inputs k_1, ..., k_m. We use r_i to denote the number of results from input i, and define R = Σ_{i=1}^{m} r_i. We assume that the cost of initialization within the HeapUnion operator itself is negligible, and therefore model the cost of initialization as the sum of the initialization costs for all inputs.

Recall from Section 3.1 that the first call to either Next() or SkipTo(val) will involve construction of the heap; we therefore have two different cases in the model for Next() and SkipTo(val), depending on whether it is the first or a subsequent call. The cost of the first call to Next() includes the cost of calling Next() on all m inputs, and the cost of the heap construction using Floyd's algorithm. For heap construction, we model the cost of each call to heapify() as being constant, c_h. With Floyd's algorithm, heapify() is called for half the heap entries, and then recursively every time there is a reorganization. The average-case complexity of Floyd's algorithm is well known, and the number of relocations in the heap is approximately γm = 0.744m [27]. Thus we model the cost of heap construction as (γ + 1/2)·m·c_h.

²We determined values for the constants with microbenchmarks as described in Appendix E.

[Figure 3: Overview of Cost Estimates. For an author-list with |l_u| = n, Simple and Monotonic model the update cost as c_update·n, while Non-monotonic uses a piecewise model (c_1b·n if N/n < b_1, c_2b·n if b_1 ≤ N/n < b_2, and c_3b·n otherwise) that follows the VByte-encoded delta length. Simple models the query cost for user u as c_list·(|F_u| − |L_u| + 1); Monotonic and Non-monotonic instead compose per-operator estimates of Init(), Next(), and Skip(s) for the list iterator, the intersection, and HeapUnion, with separate cases for the first and subsequent calls, including the heap construction term (γ + 1/2)·m·c_h and the maintenance terms derived in the text.]
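As a quick sanity check of the magnitudes involved (our numbers, not the paper's): for m = 1000 inputs the model gives

    (γ + 1/2) · m · c_h = (0.744 + 0.5) · 1000 · c_h ≈ 1244 · c_h,

so construction stays linear in m, rather than the m·log m cost of building the heap by repeated insertion.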

A first call to SkipTo(val) involves skipping in all inputs, in addition to heap construction. Given that s results are skipped in this operation, we simply assume that the number of entries skipped on input i is r_i·s/R, resulting in a cost of Σ_{i=1}^{m} k_i.Skip(r_i·s/R). The cost of heap construction is modeled as explained for Next() above.

Subsequent calls to Next() involve a call to Next() for the input at the top of the heap and a heap reorganization. We estimate the cost of the call to Next() for the input at the top as a weighted average over the inputs. The model for the cost of heap maintenance is simple: we assume that there will be γ relocations when a single operator is forwarded, leading to γ + 1 calls to heapify().

Subsequent calls to SkipTo(val) will potentially forward all inputs, and then reorganize the heap according to the new current values of the inputs. On average, each input will be forwarded past r_i·s/R entries. However, HeapUnion will ensure not to call SkipTo(val) for inputs that will not be forwarded. Therefore, when the average skip length is less than 1, we model the cost as r_i·s/R calls that skip 1 entry. To find the cost of heap maintenance we estimate the number of forwarded inputs as m_s = Σ_{i=1}^{m} min(s·r_i/R, 1). Assuming that there will be as many relocations in the heap as when constructing a heap with m_s entries, the cost of heap maintenance is (γ + 1)·m_s·c_h.
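As an illustration of the estimate (our example): with m = 3 inputs, r = (100, 10, 1), R = 111, and a skip of length s = 5, we get

    m_s = min(1, 500/111) + min(1, 50/111) + min(1, 5/111) ≈ 1 + 0.45 + 0.05 ≈ 1.50,

so the modeled maintenance cost is (γ + 1) · 1.50 · c_h ≈ 2.6 · c_h.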

Optimization Algorithm. Although Monotonic is more complex than Simple, it still has the property that testing only |E| access designs is sufficient to find the optimal approach. We will describe how this is done in two steps. The following lemma shows that we can find a globally optimal solution by choosing the contents of the additional author-list for each user individually; it follows directly from the definitions in Figure 3.

Lemma 1. In Monotonic, the costs and savings from including the users L_u in the additional author-list for u are not affected by the contents of l_v when v ≠ u.

Lemma 1 reduces the number of access designs to test in the optimization algorithm from 2^{Σ_{u∈V} |F_u|} to Σ_{u∈V} 2^{|F_u|}.

Theorem 2. If Monotonic estimates that for a user u and a given workload, the performance is improved if v ∈ F_u is included in L_u, then Monotonic will predict that it leads to a performance improvement to also include user w in L_u if w ∈ F_u, 0 < n_w < n_v, and c_d − c_sc/(2b) ≥ c_g/ln 2.

The proof of Theorem 2 is found in Appendix D. We notice that in our case, c_d − c_sc/(2b) ≥ c_g/ln 2 translates to 330.6578 ≥ 114.0751, which holds with a significant margin. Theorem 2 implies that if we sort all friends of u based on the number of documents they post, the optimal contents of L_u is a prefix of this sorted list. We thus need to check |E| designs in total in order to find the optimal solution. Furthermore, notice that there is no cost associated with including users who do not post documents in the additional author-lists, and it is therefore always beneficial to do so.
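The resulting search can be written compactly; the following Java sketch is our rendering of it, with the cost model abstracted behind a callback (all names and the callback are assumptions). Friends are sorted by ascending post count, and each of the |F_u| prefixes is evaluated once:

import java.util.Arrays;
import java.util.function.ToDoubleFunction;

final class PrefixOptimizer {
    // friendPosts[i] is the number of documents posted by the i-th friend of u.
    // costModel evaluates the modeled workload cost of a candidate L_u, given as
    // the post counts of the included friends. Returns the best prefix length.
    static int bestPrefixSize(int[] friendPosts, ToDoubleFunction<int[]> costModel) {
        int[] sorted = friendPosts.clone();
        Arrays.sort(sorted); // ascending: friends who post least are included first
        int bestK = 0;
        double bestCost = costModel.applyAsDouble(new int[0]); // empty L_u
        for (int k = 1; k <= sorted.length; k++) {
            double cost = costModel.applyAsDouble(Arrays.copyOf(sorted, k));
            if (cost < bestCost) {
                bestCost = cost;
                bestK = k;
            }
        }
        return bestK;
    }
}

Summed over all users this evaluates Σ_{u∈V} |F_u| = |E| candidate designs, as the text states.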

Validation. Monotonic’s predictions for our test workloads are presented in Figure 4.<br />

The search estimates are much more accurate than Simple’s, but there is still room for<br />

improvement when limit is close to 0 for Workload 2. Inaccurate modeling <strong>of</strong> the cost<br />

260


Paper 9: Search Your Friends And Not Your Enemies<br />

Search time<br />

Update time<br />

2.5 ·1011 limit<br />

·10 12 limit<br />

Processing time (s)<br />

2<br />

1.5<br />

1<br />

0.5<br />

2.5<br />

2<br />

1.5<br />

1<br />

0.5<br />

0<br />

0 0.5 1 1.5<br />

·10 4<br />

Search time<br />

0<br />

0 0.5 1 1.5<br />

·10 4<br />

Update time<br />

Actual time<br />

Simple<br />

Monotonic<br />

Non-monotonic<br />

Processing time (s)<br />

·10 11<br />

1.4<br />

1.2<br />

1<br />

0.8<br />

0.6<br />

0.4<br />

0.2<br />

0<br />

0 1,000 2,000 3,000 4,000 5,000<br />

limit<br />

·10 12<br />

1.5<br />

1<br />

0.5<br />

0<br />

0 1,000 2,000 3,000 4,000 5,000<br />

limit<br />

Figure 4: Accuracy <strong>of</strong> cost models for queries <strong>and</strong> updates in Workload 1 <strong>and</strong> Workload 2<br />

<strong>of</strong> heap maintenance is one factor that contributes to this error, <strong>and</strong> we will therefore<br />

improve this aspect in Non-monotonic.<br />

4.2 Non-monotonic

Non-monotonic is designed to be more accurate than Monotonic, at the price of sacrificing the efficiency of the optimization algorithm that finds the globally optimal access design. To achieve better accuracy, two aspects of Monotonic are changed: the model for heap maintenance is extended, and a slightly more advanced update model is used. The formal description of Non-monotonic is found in Figure 3.

The update model from Monotonic is extended by taking list compression into account. During accumulation, the lists are compressed using VByte, which implies that lists with few entries result in long deltas that use more storage space. The model assumes that the cost of updating a list depends on the number of bytes used by VByte to represent the average delta length.
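As an illustration of why a piecewise model fits (the breakpoints below are VByte's 7-bit group boundaries, our assumption for the constants b_1 and b_2 in Figure 3): the average delta in a list with n entries over N documents is roughly N/n, which VByte stores in one byte when N/n < 2^7, in two bytes when N/n < 2^{14}, and so on, so the per-entry update cost steps up as lists get sparser.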

The models for heap maintenance in subsequent calls to Next() and SkipTo(val) reflect that the cost of heap maintenance often depends on the total number of inputs as well as on the number of forwarded inputs. Given that input i is at the top of the heap when Next() is called, let p_i denote the number of inputs that will have Current() values larger than input i after the call to Next(). We estimate p_i as Σ_{k=1}^{m} min(r_k/r_i, 1), and assume that the heap maintenance cost when input i is at the top of the heap is log(1 + p_i)·c_h. By calculating a weighted average over all inputs, we end up with an average cost of heap maintenance for a Next() as shown in Figure 3.

The model for heap maintenance in SkipTo(val) is slightly more complex, and the maximum of two different estimates is used: (1) The first alternative is similar to the estimate in Monotonic, but incorporates that Floyd's algorithm will never call heapify() for more than half the inputs, which yields the following estimate: (γ·m_s + min(m_s, m/2))·c_h. (2) The other alternative reflects that the cost can be logarithmic in the number of inputs when the number of forwarded inputs is low. We have already estimated the average number of calls to heapify() when only one input is forwarded in the model for Next(), denoted h_next in the following. We now assume that all forwarded inputs will lead to h_next heapify() operations, but compensate for the fact that many of the inputs are not at the top of the heap when heapify() is called. The compensation is achieved with the function h(m_s) in Figure 3, which returns the minimum possible total distance from the root to m_s entries in the heap.

Optimization Algorithm. Lemma 1 also holds for Non-monotonic. However, the proof of Theorem 2 does not hold due to the extensions in Non-monotonic. As a result, an optimization algorithm based on Non-monotonic must test Σ_{u∈V} 2^{|F_u|} access designs to find the optimal approach. In social networks, users have hundreds of friends, and it is therefore not feasible to test all possibilities. We thus choose to limit the solution space to the same space as the optimization algorithm based on Monotonic explores. If Non-monotonic is actually more accurate than Monotonic, the resulting design should be at least as efficient.

Validation. Non-monotonic’s predictions on our two test workloads are shown in Figure<br />

4. The extended update model represents a slight improvement. For search costs,<br />

Non-monotonic is clearly more accurate than Monotonic for low limits in Workload 2.<br />

This reflects that the model for heap maintenance actually plays a significant role.<br />

5 Experiments

To validate the efficiency of our system, we conduct a set of experiments. We compare our query processing strategy based on HeapUnion to Lazy Merge, and test how the solutions based on optimizations with different cost models perform compared to the user design and the friends design. Our experiments are based on the system described in Section 3, which is implemented in Java. The experiments are run on a computer with an Intel Xeon 3.2 GHz CPU with 6 MB cache and 16 GB main memory. The computer is running RedHat Enterprise 5.3 and Java version 1.6.

5.1 Efficiency of HeapUnion

To test the efficiency of HeapUnion, we compare it to the alternative suggested by Raman et al. by exchanging the HeapUnion operator in the queries with Lazy Merge [23]. To do so, we have implemented a version of Lazy Merge that stores the intermediate results in an uncompressed list where skips are implemented with galloping search [6]. As explained in Section 3, the parameter α describes how eager Lazy Merge is at pre-merging inputs into the intermediate results. When setting α to 0, Lazy Merge pre-merges all inputs, and when setting it to ∞, no inputs are pre-merged.

We use the workloads from Appendix B in the experiments, but limit the design to the user design because this provides a real test of the solutions' ability to process queries with many author-lists efficiently. We report the time spent processing the 100,000 queries for each of the workloads, and vary α between 0 and ∞. We have argued that one of the reasons why Lazy Merge is not ideal for our workloads is that we typically process top-k queries. To isolate this effect, we process the workload both when only returning the top-100 results and when returning all results.

The results from the experiments are shown in Figure 5. Notice that HeapUnion does not depend on α, and its cost is therefore constant. When using Lazy Merge, the difference between the cost of top-100 queries and retrieving all results increases with the size of α. This reflects the inadequacy of approaches that pre-merge when processing top-k queries. The processing time with HeapUnion is clearly dependent on the number of retrieved results, and HeapUnion is therefore an attractive solution for top-k queries, as expected.

Lazy Merge consistently performs best with α set to one of the extreme values. Poor performance in the other cases is mainly caused by the large number of merges resulting from many inputs with different lengths. Which extreme value of α performs best for a particular workload depends on the average length of the author-lists compared to the posting list. HeapUnion outperforms all configurations of Lazy Merge for both workloads with a speed-up between 1.13 and 2.36, reflecting that the high number of inputs is processed efficiently regardless of workload characteristics.

5.2 Workload Optimization

We have tested the accuracy of the different cost models in Section 4, but the key success factor for a cost model is whether using it in an optimization algorithm leads to efficient designs. We therefore conduct a set of experiments to compare the access designs suggested by the different cost models and associated optimization algorithms. To do so, we use the networks and documents from the workloads in Appendix B, and combine them with several different sets of queries. We first test workloads with different numbers of queries generated by letting a random user search for a random term. We also experiment with workloads where the users who search follow a Zipfian distribution with exponent 1.5, to see how skew affects the efficiency of the different approaches. We compare the designs based on Simple, Monotonic, and Non-monotonic to the user design and the friends design.

[Figure 5: HeapUnion vs. Lazy Merge varying α. Two panels (Workload 1 and Workload 2) plot processing time (ns) against α from 0 to ∞, with series Lazy Merge (top-100), Lazy Merge (top-∞), HeapUnion (top-100) and HeapUnion (top-∞).]

The results of our experiments are shown in Figure 6, where the first row shows the results for workloads with uniformly distributed queries. To get a better view of the relative differences between the methods, the second row shows the performance of all approaches relative to the best of the user design and the friends design.

For the workloads with uniformly distributed queries, Simple often leads to designs that are slower than choosing the best of the user design and the friends design, due to the inaccuracies in prediction. Compared to Monotonic and Non-monotonic, the designs from Simple are up to 30% slower. Monotonic and Non-monotonic generally lead to reasonable designs that are comparable to or faster than the basic approaches. However, for Workload 2, both approaches lead to sub-optimal designs when queries are frequent relative to updates. This reflects the inaccuracies in the estimates for low limits for Workload 2 in Figure 4. However, Non-monotonic clearly results in better designs than Monotonic in this case, so the additional complexity pays off.

The results from Workload 2 with Zipfian queries are shown in the third row of Figure 6, denoted Workload 2Z. We have omitted the results for Workload 1Z due to space constraints, but they are similar to those for Workload 2Z. The results show that our overall solution performs much better than the basic designs when there is skew, with a speed-up of up to 3.4. With skew, the optimization problem is simple because a few users submit almost all queries, and all cost models are able to reflect that these users should have additional author-lists with nearly all their friends represented. Non-monotonic is still slightly better than the others, but the difference is not significant.

[Figure 6: Performance for workloads with different fractions of documents vs. queries. Rows: processing time (ns) against #queries (·10^6) for Workload 1 and Workload 2; relative performance against #queries for the same workloads; and processing time for Workload 2Z. Series: user design, friends design, Simple, Monotonic, Non-monotonic.]

6 Related Work

Due to its clear commercial value, there has recently been some interest in search in social networks. Search engines like Google and Bing already support search over public Twitter posts, and they plan to include private posts as well.³ Facebook allows users to search their friends' posts. However, the details of these commercial solutions have not been published.

³ http://www.networkworldme.com/v1/news.aspx?v=1&nid=3236&sec=netmanagement

The problem of supporting keyword search over content in a social network was described in [8]. The problem is related to access control in both information retrieval [9, 26] and structured data [7]. The solutions we explore in this paper are related to the solution explored by Silberstein et al. on how to support retrieving the most recent events in users' news feeds in a social network [25]. While our problem is more complex because we support keyword search, the procedure for solving it is along the same lines.

Cost models have been used to estimate the efficiency of processing strategies in both information retrieval and databases [29, 10, 11, 20]. There exist advanced cost models for evaluating different index construction and update strategies [10, 29]. For search queries, however, simple models are most commonly used, sometimes without verification on an actual system [11]. We use simple cost models for updates in this paper, but relatively advanced models are required to predict search performance accurately. Manegold et al. outline a generic strategy for estimating the memory access costs in databases [20]. The memory access costs in our advanced models can be considered part of the model for the list iterator. However, our focus is on estimating processing time, whereas Manegold et al. focus on memory access costs.

The problems of calculating unions and particularly intersections of lists have attracted a lot of attention, both through the introduction of new algorithms with theoretical analyses [18, 16, 1, 4] and through experimental investigations [5, 2]. Algorithms for single operations have also been combined into full query plans [23, 12]. Most algorithms assume that the input sets are uncompressed. Motivated by the ability to store more data and in some cases improve computational efficiency, work from the IR community relaxes this assumption by introducing synchronization points in the compressed lists to speed up random look-ups [14, 22]. In addition, there exists work on algorithms adapted to new hardware trends [28]. Our intersection operator is based on the ideas from Demaine et al. [16], and we use a strategy similar to the one introduced by Culpepper and Moffat for synchronization points in the lists, but we use a novel union operator. Another discriminating factor is that we attempt to estimate the exact running time of the queries, while most previous work focuses on either asymptotic complexity or experimental running times.

7 Conclusions

In this paper we have presented a system for keyword search in social networks with access control, which is designed to perform well for a wide range of workloads. We have addressed the query processing strategy and developed an efficient HeapUnion operator that results in a speed-up between 1.13 and 2.36 compared to previous solutions. We have also developed cost models for the system and used them as bases for workload optimization. With an accurate cost model, workload optimization leads to comparable or better performance than the best basic designs; the speed-up is up to 3.4 for workloads with skew.

We believe that the ideas discussed here have even wider applicability than search in social networks. First, queries similar to the ones we have discussed in this paper occur in several settings, such as star joins in data warehouses [23], and HeapUnion might be a practical and efficient solution in such application scenarios as well. Furthermore, workload optimization might be applicable to other problems. In particular, faceted search is supported by storing meta-data about the documents, and several different storage schemes are possible [15]. Workload optimization could be used to find an optimal storage scheme for the meta-data, an interesting research area for future work.

References

[1] R. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Proc. CPM, 2004.
[2] R. Baeza-Yates and A. Salinger. Fast intersection algorithms for sorted sequences. Algorithms and Applications, 2010.
[3] A. L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286(5439), 1999.
[4] J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Trans. Algorithms, 4(1), 2008.
[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. J. Exp. Algorithmics, 14, 2009.
[6] J. L. Bentley and A. C.-C. Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3), 1976.
[7] E. Bertino, S. Jajodia, and P. Samarati. Database security: research and practice. Inf. Syst., 20(7), 1995.
[8] T. A. Bjørklund, M. Götz, and J. Gehrke. Search in social networks with access control. In Proc. KEYS, 2010.
[9] S. Büttcher and C. L. A. Clarke. A security model for full-text file system search in multi-user environments. In Proc. FAST, 2005.
[10] S. Büttcher, C. L. A. Clarke, and B. Lushman. Hybrid index maintenance for growing text collections. In Proc. SIGIR, 2006.
[11] B. B. Cambazoglu, V. Plachouras, and R. Baeza-Yates. Quantifying performance and quality gains in distributed web search engines. In Proc. SIGIR, 2009.
[12] E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In Proc. ICALP, 2005.
[13] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001.
[14] J. S. Culpepper and A. Moffat. Compact set representation for information retrieval. In Proc. SPIRE, 2007.
[15] D. Dash, J. Rao, N. Megiddo, A. Ailamaki, and G. Lohman. Dynamic faceted search for discovery-driven analysis. In Proc. CIKM, 2008.
[16] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In Proc. SODA, 2000.
[17] S. Gurajada and S. Kumar P. On-line index maintenance using horizontal partitioning. In Proc. CIKM, 2009.
[18] F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Informatica, 1(2), 1971.
[19] N. Lester, A. Moffat, and J. Zobel. Fast on-line index construction by geometric partitioning. In Proc. CIKM, 2005.
[20] S. Manegold, P. Boncz, and M. L. Kersten. Generic database cost models for hierarchical memory systems. In Proc. VLDB, 2002.
[21] G. Margaritis and S. V. Anastasiadis. Low-cost management of inverted files for online full-text search. In Proc. CIKM, 2009.
[22] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4), 1996.
[23] V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and F.-L. Ling. Lazy, adaptive rid-list intersection, and its application to index anding. In Proc. SIGMOD, 2007.
[24] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.
[25] A. Silberstein, J. Terrace, B. F. Cooper, and R. Ramakrishnan. Feeding frenzy: Selectively materializing users' event feeds. In Proc. SIGMOD, 2010.
[26] A. Singh, M. Srivatsa, and L. Liu. Efficient and secure search of enterprise file systems. In Proc. ICWS, 2007.
[27] R. Sprugnoli. Recurrence relations on heaps. Algorithmica, 15(5), 1996.
[28] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. Proc. VLDB Endow., 2(1), 2009.
[29] I. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Academic Press, 1999.
[30] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, 2008.
[31] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.
[32] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, 2006.


A Query Processing Operators

A.1 List Iterator

The List Iterator is used to iterate through posting lists and author-lists in our system. Recall from Section 3 that the lists are compressed using PForDelta. PForDelta is a form of delta compression which is commonly used to compress inverted indexes in search engines [32, 30]. With delta compression, an entry in a sorted list is stored as the delta from the previous entry. Small values are therefore likely, and compression is achieved by using a representation where small values require little space.

To achieve efficient decompression rates, we decompress batches of b = 128 values at a time [32]. With this in mind, the implementation of Init(), Current() and Next() is straightforward. During initialization, we allocate an array to store the batch being processed in an uncompressed format. A standard call to Next() will forward the current result to the next in the batch, and every bth call will result in decompression of a new batch.

The use of delta compression complicates the implementation of SkipTo(val), because previous entries in the list must be decompressed to find the value of an entry; straightforward random look-ups are therefore impossible. We can implement SkipTo(val) by calling Next() until the returned value is ≥ val, but that is potentially inefficient when many entries are skipped. We enable more efficient random look-ups by storing the first value of each batch uncompressed in an auxiliary array [14]. A SkipTo(val) is processed by first checking whether the entry to skip to is located in the current batch. If not, we do a galloping search in the auxiliary array to find the correct batch and decompress it [6]. When the correct batch is stored in the intermediate array, we forward within that batch until the correct answer is found.
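The following sketch illustrates this look-up scheme in simplified form, assuming a non-empty list, with decompress_batch standing in for PForDelta decoding and gallop_to being the galloping search sketched in Section 5.1; all names are illustrative rather than the system's actual code.

    class ListIterator:
        # Simplified forward-only iterator over a batch-compressed list.
        def __init__(self, first_values, decompress_batch):
            self.first_values = first_values    # first value of each batch [14]
            self.decompress = decompress_batch  # batch index -> sorted values
            self.batch_no, self.pos = 0, 0
            self.batch = decompress_batch(0)

        def skip_to(self, val):
            # Return the first remaining entry >= val, or None if exhausted.
            if val > self.batch[-1]:
                # Beyond the current batch: gallop in the auxiliary array to
                # the last batch whose first value is <= val, decompress it [6].
                i = gallop_to(self.first_values, self.batch_no + 1, val + 1) - 1
                self.batch_no = max(i, self.batch_no + 1)
                if self.batch_no >= len(self.first_values):
                    return None
                self.batch = self.decompress(self.batch_no)
                self.pos = 0
            # Forward within the batch until the correct answer is found.
            while self.pos < len(self.batch) and self.batch[self.pos] < val:
                self.pos += 1
            if self.pos == len(self.batch):
                # val falls in the gap after this batch; the answer, if
                # any, is the first entry of the next batch.
                self.batch_no += 1
                if self.batch_no >= len(self.first_values):
                    return None
                self.batch = self.decompress(self.batch_no)
                self.pos = 0
            return self.batch[self.pos]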

A.2 Intersection

The intersection operator sorts its inputs according to their expected number of results. The number of results for an operator is the number of times we can call Next() and retrieve a new result. Both Next() and SkipTo(val) begin with the input with the fewest results, and call Next() or SkipTo(val), respectively. From then on, the two methods are similar. Both of them alternate between the inputs and try to skip forward to the last value returned from the other input. When the same value is found in both inputs, this value is part of the intersection and is therefore returned. The outlined processing strategy is similar to the one introduced by Demaine et al. [16].
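A minimal sketch of this alternating strategy for two inputs, using the skip_to interface from the List Iterator sketch above (again our own illustration):

    def intersect(small, large):
        # Yield the intersection of two sorted inputs exposing skip_to(val),
        # starting from the input with the fewest results and alternating
        # skips between the two, in the spirit of Demaine et al. [16].
        candidate = small.skip_to(0)
        while candidate is not None:
            match = large.skip_to(candidate)   # skip to the last value
            if match is None:                  # returned by the other input
                return
            if match == candidate:
                yield candidate                # same value in both inputs
                match += 1                     # move strictly past it
            candidate = small.skip_to(match)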

B Test Workloads

Two workloads with different characteristics, denoted Workload 1 and Workload 2, are used to test the accuracy of our cost models and the efficiency of our system. Workload 1 has 10,000 users and Workload 2 has 100,000. In both networks, users have 100 friends each, and the friendships are generated with Barabasi's preferential attachment model [3].


Documents in both workloads were obtained from a crawl of Twitter in February 2010. In Workload 1, we assign 1,500,000 documents to the users in a random process where there is a strong correlation between the number of documents posted and the number of followers, while the 2,500,000 documents in Workload 2 are assigned to users without such a correlation. Each workload also has 100,000 top-100 queries, generated by choosing a random user who searches for a random term in the documents (except stop words).

We use a set of pre-generated designs to test the accuracy of the cost models. Motivated by the fact that search performance improves if more users are represented in the additional author-lists, and that it is less costly to include a user who posts infrequently than one who posts frequently, we generate a set of designs with the following simple strategy: we represent a user u in the additional author-lists of all her followers, O_u, if n_u < limit. By choosing a set of different values for limit, we obtain a range of designs in between empty additional author-lists and additional author-lists that include all friends.
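A minimal sketch of this design-generation strategy, where followers maps each user u to the set O_u and num_posts gives n_u (both dictionary names are our own):

    def generate_design(followers, num_posts, limit):
        # Represent user u in the additional author-lists of all of u's
        # followers O_u whenever u has posted fewer than limit documents.
        design = {}
        for u, o_u in followers.items():
            if num_posts[u] < limit:
                for f in o_u:
                    design.setdefault(f, set()).add(u)
        return design

Sweeping limit from 0 upwards then produces the range of designs between the two extremes.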

C Cost Models for Operators

Monotonic and Non-monotonic use the same models for the List Iterator and Intersection operators, and the underlying intuition for these models is presented here. A more formal description is given in Figure 3.

C.1 List Iterator

The list iterator operator is initialized by using a look-up structure to find the list to process. Recall that we decompress lists in batches of b values at a time. An array for the b intermediate results is therefore allocated during initialization, and the first batch of values is decompressed. However, if the look-up indicates that the list is empty, neither intermediate array allocation nor decompression is required. In conclusion, we model the cost of Init() for list iterators with two constants, c_init and c_einit, depending on whether the list is empty or not.

Every bth call to Next() will trigger decompression of a new batch. However, we estimate the average cost of a call to Next() to be constant, c_next.

The model for the SkipTo(val) operation is slightly more complex and depends on the number of entries skipped, denoted s. As described in Appendix A.1, a galloping search is used to find the correct batch to decompress if many entries are skipped, and we therefore discriminate between the cases s ≤ b and s > b. If s ≤ b, we assume that there is some constant cost associated with a skip operation, c_c. It might be necessary to decompress a new batch of values to find the correct return value. We model the cost of decompressing a segment as a constant, c_d, and the probability that it happens as s/b. In addition, we have to forward within the correct batch of values until we find the value we seek. The number of scanned values is lower than s when we need to decompress a new batch, and the average number of entries scanned in our implementation is

$$\mathrm{scan}(s) = 1 + \frac{(2b-1)s}{2b} - \frac{s^2}{2b}.$$

We assume that scanning an entry also has a constant cost, c_sc. When s > b, the logarithmic cost of the galloping search is incorporated into the model as c_g log(s/b), and the rest of the cost is equal to Skip(b).
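Assembled into one function, the modeled skip cost can be transcribed as follows (with the constants from Table 1, and assuming the logarithm is base 2):

    import math

    def skip_cost(s, b=128, c_c=24.8305, c_d=330.6624, c_sc=1.1889, c_g=79.0708):
        # Modeled cost (ns) of SkipTo over s entries in a List Iterator.
        if s <= b:
            scanned = 1 + (2 * b - 1) * s / (2 * b) - s ** 2 / (2 * b)
            # Constant skip cost, decompression with probability s/b,
            # plus the expected scan within the batch.
            return c_c + (s / b) * c_d + scanned * c_sc
        # Long skips: galloping search in the auxiliary array, plus the
        # cost of a skip of length b within the located batch.
        return c_g * math.log2(s / b) + skip_cost(b, b, c_c, c_d, c_sc, c_g)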

C.2 Intersection

For the intersection operator, we restrict our analysis to two inputs, because that is what we use in this paper. The initialization within the Intersection operator is assumed to have negligible cost overall, and the cost of Init() is the sum of the initialization costs in the two inputs, k_1 and k_2. The inputs are assumed to have r_1 and r_2 results, respectively, with r_1 < r_2.

As explained in Appendix A.2, the Intersection operator works similarly for Next() and SkipTo(val) operations, but we focus on Next() in this paper because that is the method we will use. The processing of a Next() call will begin with a call to Next() on input k_1. Then, there will be a set of skips in each input until we find a value that occurs in both. We assume that the deltas between entries in the inputs are N/r_1 and N/r_2 document IDs, respectively. Furthermore, we assume that the inputs are independent, so that the deltas between results of the Intersection can be estimated as N^2/(r_1 r_2). The amount skipped in one round of skips (one skip in each input) is estimated to N/r_1 + N/r_2. We now calculate t, the expected number of rounds with skips, as:

$$t = 1 + \frac{\frac{N^2}{r_1 r_2} - \frac{N}{r_1}}{\frac{N}{r_1} + \frac{N}{r_2}} = 1 + \frac{N - r_2}{r_1 + r_2}$$

Because input 1 will be forwarded with a Next() call in the first round, there will be t − 1 skips in it, with average length (N/r_1 + N/r_2)/(N/r_1) = 1 + r_1/r_2. Similarly, there will be t skips in input 2, but one of the skips will, according to our assumptions, arrive at the last value from input 1, which results in a shorter skip in the last round. The average skip length is therefore:

$$\frac{t-1}{t}\left(1 + \frac{r_2}{r_1}\right) + \frac{r_2}{r_1 t}$$
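As a concrete illustration (with numbers of our own choosing): for N = 1,000,000 documents and inputs with r_1 = 1,000 and r_2 = 10,000 results, the model gives t = 1 + (1,000,000 − 10,000)/(1,000 + 10,000) = 91 rounds per produced result. The t − 1 = 90 skips in input 1 then have average length 1 + r_1/r_2 = 1.1 entries, while the skips in input 2 average (90/91)(1 + 10) + 10/91 ≈ 11.0 entries.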

D Proof of Theorem 2

Proof. To see this, we will show that the additional costs from also including user w are never larger per document posted by w than the costs of including v per document posted by v, and that the savings during query processing from also including w are at least as large per posted document as for v.

The cost of updates to l_u is linear in the number of document IDs. The costs of including the documents from v and w in l_u are therefore c_update · n_v and c_update · n_w, respectively. Per document for each user, the update cost is thus the same, namely c_update.

We will now focus on queries. Recall that the queries are processed with the query template in Figure 2. The only things that will change in the query plans for search queries from user u when L_u changes are the inputs to HeapUnion. Notice that what HeapUnion returns for a particular call to SkipTo(val) or Next() will not change, and the processing cost for the intersection operator and the iterator for the posting list is thus unaffected. We therefore only need to show that, for the methods for Monotonic HeapUnion in Figure 3, the savings from including w in addition to v in L_u are at least as large per posted document as when only including v, assuming that all inputs to the union are List Iterators. We will go through each of the methods in order.

Init(): When v is included in L_u, the cost will decrease by c_init if |L_u| > 0 before v was included, because the included list does not require initialization. Otherwise, the cost will remain constant, because we have to initialize l_u instead. When w is also included in L_u, the cost will definitely decrease by c_init, because we know that |L_u| > 0 before w was included. Because c_init/n_v < c_init/n_w when n_v > n_w, the savings during Init() are at least as large per entry for w as for v.

Next(), first call: From the formula in Figure 3, it is clear that, assuming |l_u| > 0 before v is included, the savings from including v are c_next + (γ + 1/2)c_h (it is less if |l_u| = 0). The savings from including w in L_u are at least the same (or c_next + 2(γ + 1/2)c_h if adding w to L_u makes L_u = F_u, in which case no heap is required). By a similar argument as for Init(), the savings per n_w for w are at least as large as per n_v for v.

Next(), subsequent calls: Because all inputs to the union are List Iterators, the cost of the call to Next() for the input at the top of the heap remains the same when inputs are removed. There will be no savings from the heap maintenance costs either, unless including w makes L_u = F_u (when no heap is required). It is therefore straightforward to conclude that the savings from including w are at least as large per n_w as the savings from including v per n_v.

Skip(s), first call: We first consider the cost of the calls to skip in the inputs that are processed in this operation. Notice that when including a new user in L_u, the resulting changes in skip costs are: 1) each skip in l_u will be longer, and the added length depends on the number of documents posted by the new user; and 2) the skips in the list that is included will no longer be necessary. In the following, we refer to the cost of performing a skip of length s in a List Iterator as LI.Skip(s). Given that there are |l_u| entries in the additional author-list before v or w is included, the savings from including v are:

$$S_v = LI.Skip\left(\frac{s n_v}{R}\right) + LI.Skip\left(\frac{s|l_u|}{R}\right) - LI.Skip\left(\frac{s(|l_u| + n_v)}{R}\right)$$

The savings from including w as well are:

$$S_w = LI.Skip\left(\frac{s n_w}{R}\right) + LI.Skip\left(\frac{s(|l_u| + n_v)}{R}\right) - LI.Skip\left(\frac{s(|l_u| + n_v + n_w)}{R}\right)$$

We need to show that S_v/n_v ≤ S_w/n_w, or, in particular, that:

$$\frac{LI.Skip\left(\frac{s n_w}{R}\right)}{n_w} - \frac{LI.Skip\left(\frac{s n_v}{R}\right)}{n_v} + \frac{LI.Skip\left(\frac{s(|l_u|+n_v)}{R}\right) - LI.Skip\left(\frac{s(|l_u|+n_v+n_w)}{R}\right)}{n_w} - \frac{LI.Skip\left(\frac{s|l_u|}{R}\right) - LI.Skip\left(\frac{s(|l_u|+n_v)}{R}\right)}{n_v} \ge 0$$

We will now show that LI.Skip(s) is concave, a fact we will use to show the above. First, the second derivative of LI.Skip(s) when s < b is −c_sc/b, which is negative because c_sc and b are both positive constants. When s > b, the second derivative is −c_g/(s^2 ln 2), which is also negative. Furthermore, LI.Skip(s) is continuous at s = b. To show that it is concave, it thus remains to show that the derivative as s approaches b from below is not smaller than when s approaches b from above. By calculating the derivatives and finding the limits, we see that the following must hold:

$$c_d - \frac{c_{sc}}{2b} \ge \frac{c_g}{\ln 2}$$

And this holds by assumption in the theorem. We thus know that LI.Skip(s) is concave.

For a concave function f(x) and two points x_1 and x_2 where x_1 < x_2, it follows that the slope between the two points, (f(x_2) − f(x_1))/(x_2 − x_1), will never increase when either x_1, x_2 or both increase. We can therefore conclude that:

$$\frac{LI.Skip\left(\frac{s(|l_u|+n_v)}{R}\right) - LI.Skip\left(\frac{s|l_u|}{R}\right)}{n_v} \ge \frac{LI.Skip\left(\frac{s(|l_u|+n_v+n_w)}{R}\right) - LI.Skip\left(\frac{s(|l_u|+n_v)}{R}\right)}{n_w}$$

It remains to show that LI.Skip(s n_w/R)/n_w ≥ LI.Skip(s n_v/R)/n_v. To do so, we return to f(x) as discussed above, and substitute x_1 = 0. As long as f(0) ≥ 0, the average slope between (0, 0) and (x_2, f(x_2)) is smaller than between (0, 0) and (x_3, f(x_3)) when x_3 < x_2, because f(x) is concave. Because LI.Skip(0) = c_c + c_sc is positive, as no costs are negative, we can conclude that LI.Skip(s n_w/R)/n_w ≥ LI.Skip(s n_v/R)/n_v.

We know that the theorem holds for the heap construction from the description of Next(), first call, above.

Skip(s), subsequent calls: The proof for the savings in calls to SkipTo(val) for the inputs is mostly as for the first call to Skip(s) above. The only difference occurs when the skip length for an input is less than 1. We thus need to prove that the resulting function is still concave, and that it is not negative for s = 0. Because it is linear when s < 1, the second derivative is 0. Furthermore, for the derivative not to increase at s = 1, we require that:

$$c_c + \frac{c_d}{b} + \left(1 + \frac{2b-1}{2b} - \frac{1}{2b}\right)c_{sc} \ge \frac{c_d}{b} + \left(\frac{2b-1}{2b} - \frac{2}{2b}\right)c_{sc}$$

This holds when:

$$c_c + \left(1 + \frac{1}{2b}\right)c_{sc} \ge 0$$

which holds because all costs are positive. Furthermore, when the skip length is 0, the cost is 0, so we reach the same conclusion as for Skip(s), first call, above.

The model for the cost of heap maintenance in subsequent calls to SkipTo(val) is linear in the number of forwarded inputs. Hence, we proceed to show that the number of forwarded inputs decreases at least as much per n_w when w is included in the additional author-list as per n_v when only v is included. The reduction when including v in L_u is:

$$R_v = \min\left(\frac{s|l_u|}{R},\, 1\right) + \min\left(\frac{s n_v}{R},\, 1\right) - \min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right)$$

And the reduction when also including w is:

$$R_w = \min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right) + \min\left(\frac{s n_w}{R},\, 1\right) - \min\left(\frac{s(|l_u|+n_v+n_w)}{R},\, 1\right)$$

We need to show that R_w/n_w ≥ R_v/n_v. It is clear that min(s n_v/R, 1)/n_v ≤ min(s n_w/R, 1)/n_w when n_w < n_v, and it remains to show that:

$$\frac{\min\left(\frac{s(|l_u|+n_v)}{R}, 1\right) - \min\left(\frac{s(|l_u|+n_v+n_w)}{R}, 1\right)}{n_w} - \frac{\min\left(\frac{s|l_u|}{R}, 1\right) - \min\left(\frac{s(|l_u|+n_v)}{R}, 1\right)}{n_v} \ge 0$$

Assume first that s(|l_u| + n_v + n_w)/R ≤ 1. In that case, we see that the statement sums to 0, because |l_u|, n_v, n_w ≥ 0. There are three more possible cases:

a) s(|l_u| + n_v + n_w)/R > 1 and s(|l_u| + n_v)/R < 1,

b) s(|l_u| + n_v)/R ≥ 1 and s|l_u|/R < 1, and

c) s|l_u|/R ≥ 1.

[Figure 7: (Left) update workload with x list entries (c_update = 1001.7939); (right) search workload with x accessed author-lists (c_list = 24111.2354). Both panels plot processing time (ns), against #authorings (·10^9) on the left and #lists on the right, with actual times and estimates.]

In case a), −min(s(|l_u| + n_v + n_w)/R, 1), which is negative, is limited, so we subtract less relative to when the statement summed to 0, and the statement is positive. In case b), min(s(|l_u| + n_v)/R, 1) and −min(s(|l_u| + n_v + n_w)/R, 1) cancel. Furthermore, min(s|l_u|/R, 1) < min(s(|l_u| + n_v)/R, 1), making the statement positive. In case c), the result is 0.

We have now shown that the costs of including w in addition to v in the additional author-list for u are exactly the same per document posted by w, n_w, as the cost of including only v per n_v. Furthermore, we have shown that the savings per posted document are at least as large for all relevant functions in HeapUnion when also including w as when including only v in L_u. Because it by assumption leads to a performance improvement to include v, it will thus lead to a further performance improvement to also include w.

E Microbenchmarks

This section contains an overview of the microbenchmarks used to determine the constants in the cost models in Section 4. All constants are summarized in Table 1.

The constants used in Simple are estimated from a complete workload. We process Workload 2 from Appendix B and measure the update cost as a function of author-list length, and the search cost as a function of the number of merged lists in the queries. Results and estimates are shown in Figure 7.
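As an illustration of how such a constant can be estimated (a sketch under our own naming, not the actual benchmark harness), a least-squares fit of the linear model time = c · x to the measured points:

    def fit_linear_constant(xs, times):
        # Least-squares estimate of c in the model time = c * x, where
        # workload i processed xs[i] units in times[i] nanoseconds.
        return sum(x * t for x, t in zip(xs, times)) / sum(x * x for x in xs)

    # e.g. fit_linear_constant(entries_written, measured_ns) would yield an
    # estimate of c_update like the 1001.7939 reported in Figure 7.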

The costs of initialization and Next() operations in list iterators are estimated in an experiment using author-lists with deltas between entries ranging from 1 to 100,000. The results are shown in Figure 8. In a similar experiment using posting lists instead of author-lists, the estimate for c_init was higher. This is probably because we use a different look-up structure in the inverted index, and we therefore have two values for c_init.

[Figure 8: Time spent on x Next() operations with different deltas (1 to 100,000). For author-lists (above), estimate with c_init = 1261.6450 and c_next = 15.0138; for posting-lists (below), estimate with c_init-posting = 1675.6671.]

The cost of accessing an empty list is estimated with an experiment where we have a user who is friends with x users that do not post documents. We perform 100,000 queries on behalf of this user and record the time per search. The results are shown in Figure 9, along with the estimates.

The cost of skipping in the list iterator involves several constants. We first estimate the scan cost with an experiment that tests all scan lengths up to b = 128. The results and the resulting estimates are shown in Figure 9. The rest of the constants for skipping in list iterators are estimated from a set of experiments where skips of various lengths (1 to 10,000) are performed in 1000 lists. All results from these experiments are shown in Figure 10. We use these results to first estimate c_c and c_d by considering all skip lengths below b and subtracting the average scan cost (from the above experiment). The constant for long skips is estimated by subtracting the estimated cost of skipping b values from the longer skips. The results and resulting estimates from these experiments are shown in Figure 11.

[Figure 9: Left: time spent on x initializations of empty lists (c_einit = 543.7537). Right: cost of scanning x entries in an array (c_sc = 1.1889).]

The cost of a heapify() operation is estimated by running union queries over a set of iterators that simply return numbers with given intervals. We are thus able to predetermine the number of heapify() operations. To estimate the constants in construction, we test a workload with 100,000 lists and accumulate 1,000 entries in each list. The deltas between the entries vary from 1 to 100,000, which results in different numbers of bytes used during accumulation. Results and estimates are shown in Figure 12.

[Figure 10: Cost of SkipTo(val) in the list iterator. Processing time (ns) against the number of skip operations, with one series per skip length from 1 to 10,000 skipped entries.]

[Figure 11: Determination of skip constants for short skips (left: c_c = 24.8305 and c_d = 330.6624) and long skips (right: c_g = 79.0708). Both panels plot processing time (ns) against skip length, with results and estimates.]

[Figure 12: Left: time per heapify operation (c_h = 9.9073), plotting processing time (ns) against the number of heapify operations (·10^7). Right: microbenchmark for construction (c_1b = 990.3406, c_2b = 1082.8243 and c_3b = 1135.9749), plotting processing time per entry (ns) against delta.]

Constant          Estimate
c_update          1001.7939
c_list            24111.2354
c_init            1261.6450
c_init-posting    1675.6671
c_einit           543.7537
c_next            15.0138
c_sc              1.1889
c_c               24.8305
c_d               330.6624
c_g               79.0708
c_h               9.9073
c_1b              990.3406
c_2b              1082.8243
c_3b              1135.9749

Table 1: Estimates from microbenchmarks
