Disputation Nils Grimsmo - Department of Computer and Information ...
Nils Grimsmo

Bottom Up and Top Down —
Twig Pattern Matching on Indexed Trees

Thesis for the degree of philosophiae doctor

Trondheim, 2010-09-02

Norwegian University of Science and Technology.
Faculty of Information Technology, Mathematics and Electrical Engineering.
Department of Computer and Information Science.
NTNU
Norwegian University of Science and Technology
Thesis for the degree of philosophiae doctor
Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science

© Nils Grimsmo

ISBN 978-82-471-2723-0 (printed version)
ISBN 978-82-471-2724-7 (electronic version)
ISSN 1503-8181

Doctoral theses at NTNU, 2011:96

Printed by NTNU-trykk
Preface

This thesis is submitted to the Norwegian University of Science and Technology (NTNU) in partial fulfillment of the requirements for the degree of philosophiae doctor. The doctoral work has been performed at the Department of Computer and Information Science, NTNU, Trondheim, with Bjørn Olstad as main supervisor, and Øystein Torbjørnsen and Magnus Lie Hetland as co-supervisors.

The candidate was supported by the Research Council of Norway under grant NFR 162349, and by the iAD project, also funded by the Research Council of Norway.
Summary

This PhD thesis is a collection of papers presented with a general introduction to the topic, which is twig pattern matching (TPM) on indexed tree data. TPM is a pattern matching problem where occurrences of a query tree are found in a usually much larger data tree. This has applications in XML search, where the data is tree shaped and the queries specify tree patterns. The included papers present contributions on how to construct and use structure indexes, which can speed up pattern matching, and on how to efficiently join together results for the different parts of the query with so-called twig joins.

• Paper 1 [18] shows how to match root-to-leaf query paths more efficiently in so-called path indexes, by using new opportunistic algorithms on existing data structures.

• Paper 2 [19] proves a tight bound on the worst-case space usage of a data structure used to implement path indexes.

• Paper 3 [24] presents an XML indexing system which combines existing techniques in a novel way, and performs orders of magnitude better than existing commercial and open-source systems.

• Paper 4 [20] reviews and creates a taxonomy for the many advances in the field of TPM on indexed data, and proposes opportunities for further research.

• Paper 5 [21] bridges the gap between worst-case optimality and practical performance in current twig join algorithms.

• Paper 6 [22] improves the construction cost of so-called forward and backward path indexes for tree data from log-linear to linear.
Acknowledgments

The day-to-day supervision of the PhD work during the first years was mostly done by the external supervisor Dr. Øystein Torbjørnsen from Fast Search and Transfer, who has been a good source of ideas and clever technical solutions. Dr. Magnus Lie Hetland from my department has supervised during the last year, and has given substantial help both scientifically and in the writing process of some papers. The visits of my formal supervisor Dr. Bjørn Olstad have been inspirational. The discussions with Dr. Felix Weigel during his internship at FAST resulted in many ideas. I would like to thank fellow PhD student Truls Amundsen Bjørklund for good times, fruitful discussions and honest feedback during our work together.

Thank you Nina, for your patience, beauty and delicious cooking.
Contents

Preface
Summary
Acknowledgments
Contents
1 Introduction
  1.1 Indexing/search in semi-structured data
  1.2 Use-case: XML
    1.2.1 XPath and XQuery
  1.3 Abstract problem: Twig Pattern Matching
    1.3.1 Research scope: TPM on indexed data
  1.4 Research questions
2 Background
  2.1 Twig joins
    2.1.1 Twig join work-flow
    2.1.2 Result enumeration
      2.1.2.1 Single output query node
    2.1.3 Simple intermediate result architecture
    2.1.4 Tree position encoding
    2.1.5 Partial match filtering
    2.1.6 Intermediate result construction
    2.1.7 Merging input streams
    2.1.8 Data locality and updatability
    2.1.9 Twig join conclusion
  2.2 Partitioning data
    2.2.1 Motivation for fragmentation
    2.2.2 Path partitioning
    2.2.3 Backward and forward path partitioning
    2.2.4 Balancing fragmentation
  2.3 Reading data
    2.3.1 Skipping
      2.3.1.1 Skipping child matches
      2.3.1.2 Skipping parent matches
      2.3.1.3 Holistic skipping
    2.3.2 Virtual streams
      2.3.2.1 Virtual matches for non-branching internal query nodes
      2.3.2.2 Tree position encoding allowing ancestor reconstruction
      2.3.2.3 Virtual matches for branching query nodes
  2.4 Related problems and solutions
3 Research Summary
  3.1 Formalities
  3.2 Publications and research process
    3.2.1 Paper 1
    3.2.2 Paper 2
    3.2.3 Paper 3
    3.2.4 Paper 4
    3.2.5 Paper 5
    3.2.6 Paper 6
  3.3 Research methodology
  3.4 Evaluation of contributions
    3.4.1 Research questions revisited
    3.4.2 Opportunities revisited
  3.5 Future work
    3.5.1 Strong structure summaries for independent documents
    3.5.2 A simpler fast optimal twig join
    3.5.3 Simpler and faster evaluation with non-output nodes
    3.5.4 Ultimate data access shoot-out
  3.6 Conclusions
Bibliography
4 Included papers
  Paper 1: Faster Path Indexes for Search in XML Data
  Paper 2: On the Size of Generalised Suffix Trees Extended with String ID Lists
  Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
  Paper 4: Towards Unifying Advances in Twig Join Algorithms
  Paper 5: Fast Optimal Twig Joins
  Paper 6: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees
A Other Papers
  Paper 7: On performance and cache effects in substring indexes
  Paper 8: Inverted Indexes vs. Bitmap Indexes in Decision Support Systems
  Paper 9: Search Your Friends And Not Your Enemies
Chapter 1

Introduction

“Research is formalized curiosity.
It is poking and prying with a purpose.”
– Zora Neale Hurston

The thesis is submitted as a paper collection bound together by a general introduction. This chapter presents the context of the research, which is indexing and querying semi-structured data, and the abstract problem investigated, which is twig pattern matching (TPM). Chapter 2 gives a high-level introduction to techniques used in state-of-the-art TPM on indexed data. Chapter 3 lists the included published papers with short qualitative assessments, evaluates the total contribution of this thesis, and proposes future work.
1.1 Indexing/search in semi-structured data

So-called semi-structured data gives both flexibility and expressive power, and is commonly used for storing and exchanging data in heterogeneous information systems. In the semi-structured data model, documents have a structure that specifies how the different parts of the content relate to each other. This means information is contained both in the structure and the content. Documents are usually structurally self-contained, meaning that the structure can be understood from the document alone, without additional meta-data.

The focus of this thesis is algorithms and data structures for indexing and querying semi-structured data, where queries specify both structure and content. The use of semi-structured data can functionally cover both traditional structure-oriented and content-oriented data management, and the thesis therefore touches the fields of both databases and information retrieval.
1.2 Use-case: XML

XML is a simple yet flexible markup language [46], and has become the de facto standard for storing semi-structured data. An example XML document is shown in Figure 1.1. Standard XML has a tree model, with mainly three types of nodes in a document tree: element, attribute and text. All internal nodes in the document tree are of type element, and are given by start and end tags, such as the node with name book in the example. Text and attribute nodes are always leaf nodes. Text nodes have simple string values, while attributes have both a name and a value, such as the ISBN node in the example.
<library>
  <book ISBN="...">
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
  ...
</library>

Figure 1.1: Example XML document
1.2.1 XPath and XQuery

XPath [45] and XQuery [47] have become standard languages for querying XML. Comparing the two, XPath is a simpler declarative language, while XQuery is a more complex language that uses XPath expressions as building blocks. The XPath expression in Figure 1.2a asks for the title of all books coauthored by Kant and Gödel.

In XPath, single and double forward slashes specify child and descendant relationships between nodes, respectively. Square brackets contain predicates, and the rightmost node not part of a predicate is the output node, also called the return node. XPath queries are trees, and the tree representation of the example is shown in Figure 1.2b. In XPath there are 11 so-called axes in addition to descendant and child: parent, ancestor, following-sibling, preceding-sibling, following, preceding, attribute, namespace, self, descendant-or-self and ancestor-or-self [45]. There can also be more complex value predicates than simple tests on string equality, using functions such as count(), contains(), sum(), etc.
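The child/descendant distinction and simple value predicates can be tried out with Python's standard-library ElementTree, which implements a small XPath subset. This is only an illustrative sketch; the document is the one from Figure 1.1, with a placeholder ISBN value.

```python
import xml.etree.ElementTree as ET

# The example document from Figure 1.1 (the ISBN value is a placeholder).
doc = ET.fromstring("""
<library>
  <book ISBN="...">
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
</library>""")

# ElementTree's XPath subset: .// descendant steps and [tag='text'] predicates.
titles = [t.text for t in
          doc.findall(".//book[author='Kant'][author='Gödel']/title")]
print(titles)  # ['Kritik der Unvollständigkeit']
```

Full XPath, with all axes and value functions, needs a complete engine; ElementTree's subset is enough for this structural example.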
XQuery is a powerful language where small programs are built with path expressions as building blocks, in so-called FLWOR expressions (for, let, where, order, return). Figure 1.3 shows an XQuery program similar to the XPath expression in Figure 1.2a, which in addition orders books by title and retrieves both title and ISBN.
//book[author/text()="Kant"][author/text()="Gödel"]/title

(a) Expression.

(b) Tree representation.

Figure 1.2: XPath example finding books coauthored by Kant and Gödel.
for $b in doc("lib.xml")/library/book
let $t := $b/title
where $b/author = "Kant" and $b/author = "Gödel"
order by $t
return ($t, $b/@ISBN)

Figure 1.3: Example XQuery.
1.3 Abstract problem: Twig Pattern Matching

In XPath a large number of functions can be used in value predicates, and thirteen different axes dictate the relationships between nodes. The many details in the language make it hard to reason about the complexity of evaluation algorithms and hard to implement prototypes. TPM is a more abstract tree matching problem that covers a subset of XPath. It is of academic interest because a TPM solution covers the majority of the workload in most XML search systems [15].

In TPM both query and data are node-labeled trees, as shown in the example in Figure 1.4. Node predicates are on label equality, and all nodes have the same type. There are two types of query edges that dictate the relationship between data nodes in a match, ancestor–descendant (A–D) and parent–child (P–C), denoted in figures by double and single edges, respectively. The result of a TPM query is the set of mappings of query nodes to data nodes that both respect the node labels and satisfy the A–D and P–C relationships specified by the query edges.

In settings with XML document collections, the data is a forest of trees, but this can easily be transformed into a single tree by adding a virtual super-root node.
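The TPM semantics just defined can be pinned down with a tiny brute-force matcher. This is an illustrative sketch with my own helper names, not code from the thesis; indexed evaluation, the topic of this thesis, exists precisely to avoid this kind of exhaustive search.

```python
def node(name, label, children=()):
    """A tree node; the same dict shape serves for query and data trees."""
    return {"name": name, "label": label, "children": list(children)}

def descendants(n):
    for c in n["children"]:
        yield c
        yield from descendants(c)

def twig_matches(q_root, axes, d_root):
    """Brute-force TPM: yield every mapping {query name: data name} that
    respects labels and the per-edge axes ("PC" or "AD").  Exponential in
    general -- for illustration only."""
    def match_at(qn, dn):
        if qn["label"] != dn["label"]:
            return
        mappings = [{qn["name"]: dn["name"]}]
        for qc in qn["children"]:
            cands = (dn["children"] if axes[qc["name"]] == "PC"
                     else list(descendants(dn)))
            child_maps = [m for dc in cands for m in match_at(qc, dc)]
            # Subtrees under different query children are independent,
            # so child mappings can be combined freely.
            mappings = [{**m, **cm} for m in mappings for cm in child_maps]
            if not mappings:
                return
        yield from mappings
    for dn in [d_root, *descendants(d_root)]:  # query root may match anywhere
        yield from match_at(q_root, dn)

# Query a//b (one A-D edge) against a small data tree:
data = node("c1", "c", [node("a1", "a", [node("b1", "b"),
                                         node("c2", "c", [node("b2", "b")])])])
query = node("qa", "a", [node("qb", "b")])
matches = list(twig_matches(query, {"qb": "AD"}, data))
# two matches: qb -> b1 (a child) and qb -> b2 (a deeper descendant)
```

Note that the mapping need not be injective: two query nodes with the same label may map to the same data node.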
Figure 1.4: TPM example with a query tree on the left and a data tree on the right. One of the matches for the query in the data is shown with arrows from query nodes to data nodes. In the following, query nodes are drawn with circles and data nodes with rounded rectangles. Node labels are written with typewriter font, and the superscripts in query nodes and subscripts in data nodes are used to identify the nodes (together with the labels).
1.3.1 Research scope: TPM on indexed data

The scope of this thesis is twig pattern matching on indexed data, and we assume that the processes of preparing the index and evaluating queries are separate. For this strategy to be viable, the cost of index construction must be justified by the performance gain in query evaluation compared to evaluation without an index.
We use the following abstract view of an index: it is a mechanism which provides a function from some feature of a node to the nodes in the data tree that have this feature. The simplest non-trivial such feature is the node label, as used in the index shown in Figure 1.5a. In a typical implementation, entries in a so-called dictionary on label point to so-called occurrence lists containing the nodes with matching label.

When indexing on label, a query can be evaluated by reading the label-matching data nodes for each query node, and joining these into full query matches. The number of full query matches may be small compared to the total number of query node matches read, but if the labels on the query nodes are selective, far fewer data nodes will be processed than when evaluating the query on the data tree without an index.
Indexing on node label can be extended to indexing on path labels, the string of labels from the root to a node, as illustrated in Figure 1.5b. This can again be extended to classify nodes not only on the labels of the ancestor nodes on the path above, but also on the labels of the children in the subtree below. These indexing strategies, called structure indexing, will be discussed in the next chapter, together with so-called twig join algorithms.
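Both dictionaries of Figure 1.5 can be built in a single preorder pass. The following is only a sketch under my own naming; real systems store compressed occurrence lists with position encodings rather than plain name lists.

```python
from collections import defaultdict

def node(name, label, children=()):
    return {"name": name, "label": label, "children": list(children)}

def build_indexes(root):
    """One preorder pass building both a label index and a path-label
    index: dictionary entries point to occurrence lists of node names."""
    by_label, by_path = defaultdict(list), defaultdict(list)
    def visit(n, path):
        path = path + (n["label"],)           # root-to-node label string
        by_label[n["label"]].append(n["name"])
        by_path[path].append(n["name"])
        for c in n["children"]:
            visit(c, path)
    visit(root, ())
    return by_label, by_path

data = node("c1", "c", [node("a1", "a", [node("a2", "a", [node("b1", "b")])]),
                        node("d1", "d")])
by_label, by_path = build_indexes(data)
# by_label["a"] lists a1 and a2; by_path[("c", "a", "a")] lists only a2
```

The path index partitions each label's occurrence list by context, which is what lets a path query read fewer nodes than the plain label index.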
(a) Indexing on label:

a → a1, a2, a3, a4
b → b1, b2, b3, b4, b5, b6
c → c1, c2, c3, c4, c5, c6
d → d1, d2
e → e1
f → f1

(b) Indexing on path:

c → c1
c a → a1
c a a → a2, a4
c a a a → a3
c a a b → b2, b3, b5
c d → d2

Figure 1.5: Indexing the data tree from Figure 1.4.
1.4 Research questions

The following are the main research questions I have investigated during the work with this thesis:

• RQ1: How can matches for tree queries be joined more efficiently?

• RQ2: How can pattern matching in the dictionary be done more efficiently?

• RQ3: How can structure indexes be constructed faster and using less space?

These questions will be revisited in Section 3.4.1, where I will evaluate to what extent they have been answered by my research. Note that more efficient query evaluation can mean either that all or most queries are evaluated using less time, or that queries from some important group are evaluated using less time. Preferably, faster evaluation for one group of queries should not cause slower evaluation for other groups.
Chapter 2

Background

“Research is what I’m doing
when I don’t know what I’m doing.”
– Werner von Braun

This chapter presents some underlying concepts for state-of-the-art approaches to TPM on indexed data, which will hopefully ease the understanding of the contributions in the research papers included in this thesis. A high-level conceptual overview is given instead of an in-depth description of details in state-of-the-art solutions, because the latter is better covered by the included papers, where the specific techniques are discussed.

The following discussion divides the problem of TPM on indexed data into three somewhat orthogonal issues: how to construct full query matches from individual query node matches in so-called twig joins, how to partition the underlying data nodes such that as few as possible are read to evaluate a query, and how to efficiently read streams of data nodes during a join.
Notation. The following notation is used in the discussion: A graph G has node set V_G and edge set E_G ⊆ V_G × V_G. All graphs are directed. A graph is a tree if all nodes have one incoming edge except the root, which has zero incoming edges. Nodes with zero outgoing edges are called leaves. A graph is called a forest if it consists of many unconnected trees, i.e., if all nodes have zero or one incoming edges. If a relation R relates x to y, this may be denoted both xRy, ⟨x, y⟩ ∈ R and x ↦ y ∈ R. We primarily use angle brackets for graph edges, as in ⟨u, v⟩ ∈ E_G, and the “maps to” arrow for mappings of query nodes to data nodes, as in q ↦ d ∈ M. The transitive closure of a relation R is denoted by R*. In the problems discussed there will mostly be a query tree Q and a data tree D, where each node v ∈ V_Q ∪ V_D has a Label(v) ∈ A. Assume |A| ∈ O(|D|) for simplicity. Each query edge ⟨u, v⟩ ∈ E_Q has an EdgeType(u, v) ∈ {“A–D”, “P–C”}, specifying an ancestor–descendant or a parent–child relationship.
Remember from Section 1.3 that in TPM we have a single node type, and only differentiate nodes by label, while in XML there are different node types. We can generalize TPM to cover this by using different label codings for different node types, such as for example starting element node labels with “<”.
2.1 Twig joins
to enumerate and output the set of full query matches O. The first phase has two components, where the first merges the streams S_v, materializing I_v for each v ∈ V_Q, into a single stream S, materializing the total input set I.
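The stream merger of Component 1 can be sketched with the standard library's k-way heap merge, assuming each stream is sorted by a preorder position; the stream contents and names below are mine, for illustration.

```python
import heapq

# Each stream yields (preorder position, query node, data node) triples,
# sorted by position; merging preserves one global document order.
def merge_streams(*streams):
    yield from heapq.merge(*streams)

s_a = [(1, "q_a", "a1"), (5, "q_a", "a2")]   # stream for query node q_a
s_b = [(2, "q_b", "b1"), (7, "q_b", "b2")]   # stream for query node q_b
merged = [d for _, _, d in merge_streams(s_a, s_b)]
# merged == ["a1", "b1", "a2", "b2"]
```

heapq.merge keeps only one head element per stream in memory, which matches the no-lookahead property needed by the next component.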
Phase 1, Component 1: Input stream merger → Phase 1, Component 2: Intermediate result construction → Phase 2: Result enumeration

Figure 2.1: Work-flow of twig join algorithms.
Figure 2.2 illustrates why the two phases are temporally separate: in the worst case, all the data must be read before it is known whether or not the nodes in the input are useful. On the other hand, use of the two components in Phase 1 can be temporally overlapping, because Component 2 reads data and query node pairs from Component 1 in some order that can be implemented without lookahead in the individual streams. Note that for some combinations of query and data, the construction of intermediate results is not necessary for linear evaluation (as we exploit in Paper 3, included in Chapter 4).
Figure 2.2: Example showing why Phase 1 and Phase 2 are temporally separate. When the input streams are sorted in tree preorder, it cannot be known whether b_1, ..., b_n are part of a query match before c_{n+1} is seen, or whether c_1, ..., c_n are part of a query match before b_{n+1} is seen. Note that there is no stream ordering such that all twig queries can be evaluated without storing intermediate results [10].
To understand the design choices in the approach depicted in Figure 2.1, it is easiest to start with the last step, result enumeration, and work backwards. Section 2.1.2 sketches a generic algorithm for enumerating results, and Section 2.1.3 sketches the layout of a generic data structure that enables evaluating that algorithm in linear time. With this as a starting point, I go through various techniques and strategies for implementing the generic approach. Section 2.1.4 briefly presents a common tree position encoding that makes it possible to decide A–D and P–C relationships between data nodes in the various streams in constant time. Section 2.1.5 describes two common data node filtering strategies, and Section 2.1.6 shows how one of these can be used to realize the conceptual data structure from Section 2.1.3 in linear time. Section 2.1.7 describes the input stream merge component, where filtering strategies can be used for practical speedups.
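The tree position encoding referred to for Section 2.1.4 is commonly a (start, end, level) region encoding, as used throughout the twig join literature. A sketch of the constant-time axis tests follows; the helper names are mine.

```python
from collections import namedtuple

Pos = namedtuple("Pos", "start end level")

def is_ancestor(u, v):
    """A-D test in O(1): u's interval strictly contains v's."""
    return u.start < v.start and v.end < u.end

def is_parent(u, v):
    """P-C test in O(1): containment plus adjacent levels."""
    return is_ancestor(u, v) and u.level + 1 == v.level

def encode(root):
    """Assign a Pos to every node of a {'children': [...]} tree
    in one depth-first pass."""
    pos, counter = {}, 0
    def visit(n, level):
        nonlocal counter
        counter += 1
        start = counter
        for c in n["children"]:
            visit(c, level + 1)
        counter += 1
        pos[id(n)] = Pos(start, counter, level)
    visit(root, 0)
    return pos

# a(b(c), d): b is c's parent; a is c's ancestor but not its parent.
c = {"children": []}; d = {"children": []}
b = {"children": [c]}; a = {"children": [b, d]}
pos = encode(a)
```

Because the tests only compare integers, a stream element can carry its Pos and be related to any other element without touching the tree itself.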
2.1.2 Result enumeration

Algorithm 1 gives a high-level description of how to output all unique query matches that can be constructed from the input. The approach is a generalization of what is used in state-of-the-art twig joins [7, 27, 39, 33]. The algorithm recursively constructs full query matches from partial matches that are known to be part of full query matches, denoted here as partial full matches. Formally, a partial full match is an M′ such that M′ ⊆ M for some full query match M ∈ ℳ. The set of all partial full matches is ℳ′ = {M′ | ∃M ∈ ℳ : M′ ⊆ M}.
Algorithm 1 Result enumeration

Denote the set of partial full matches by ℳ′.
Start with M′ = {}, an empty partial full twig match.
Assume any fixed ordering of the nodes in Q, and let v ∈ Q be the first node in this ordering.
For all v′ such that {v ↦ v′} ∈ ℳ′:
    Call Recurse(v ↦ v′).

The function Recurse(u ↦ u′):
    Insert u ↦ u′ into M′.
    If |M′| = |Q|:
        Output M′.
    Otherwise:
        Let v be the node following u in Q.
        For all v ↦ v′ such that M′ ∪ {v ↦ v′} ∈ ℳ′:
            Recurse(v ↦ v′).
    Remove u ↦ u′ from M′.
Example 1. We evaluate the query and data in Figure 1.4 using Algorithm 1, and order query nodes in tree preorder. A candidate match for query node a¹ that is part of a full match is data node a₁, and hence one of the top-level calls to Recurse will be with the parameter u ↦ u′ set to a¹ ↦ a₁. After this pair has been inserted into M′, we consider the query node b¹, which follows a¹ in tree preorder. Since M′ = {a¹ ↦ a₁}, and M′ ∪ {b¹ ↦ b₁} is a partial full match, b¹ ↦ b₁ is one of the pairs we recurse with. In that recursive call we have M′ = {a¹ ↦ a₁, b¹ ↦ b₁}, and consider matches for the final query node c¹. As {a¹ ↦ a₁, b¹ ↦ b₁} ∪ {c¹ ↦ c₁} is a partial full match, we again recurse with c¹ ↦ c₁, and output the new M′, since it is a complete full match.
Assume that the set of partial full matches ℳ′ does not have to be materialized, and that given a partial full match M′ ∈ ℳ′, where all nodes u preceding v have a mapping u ↦ u′ ∈ M′, all v ↦ v′ such that M′ ∪ {v ↦ v′} ∈ ℳ′ can be traversed in time linear in their number. Under these assumptions the algorithm can be evaluated in O(|O| · |Q|) time, linear in the total number of data nodes in the output. The intuition is that each recursive call constructs in constant time a partial full match not seen before, and that each unique partial full match yields at least one unique full query match.
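Algorithm 1 transcribes almost line for line into Python. In this sketch the partial-full-match test is supplied as an oracle built from a known set of full matches, purely for illustration; real twig joins answer the same test from the intermediate result structure instead.

```python
def enumerate_results(order, streams, is_partial_full):
    """Algorithm 1: order is the fixed query-node ordering, streams[q]
    the candidate data nodes for q, and is_partial_full the membership
    test for the set of partial full matches."""
    out, m = [], {}
    def recurse(i):
        q = order[i]
        for d in streams[q]:
            m[q] = d                          # Insert u -> u' into M'
            if is_partial_full(m):
                if i + 1 == len(order):
                    out.append(dict(m))       # |M'| = |Q|: output
                else:
                    recurse(i + 1)
            del m[q]                          # Remove u -> u' from M'
    recurse(0)
    return out

# Oracle built from two known full matches (illustration only).
full = [{"a": "a1", "b": "b1", "c": "c1"},
        {"a": "a1", "b": "b2", "c": "c5"}]
is_pf = lambda m: any(all(M[q] == d for q, d in m.items()) for M in full)
streams = {"a": ["a1"], "b": ["b1", "b2"], "c": ["c1", "c5"]}
results = enumerate_results(["a", "b", "c"], streams, is_pf)
# results == full: only extensions inside some full match are explored
```

The pruning is visible in the trace: the combination b1 with c5 is rejected by the oracle before any deeper recursion, which is exactly what gives the output-linear bound.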
2.1.2.1 Single output query node

In TPM the answers in the result set are all legal ways of matching the query nodes to the data nodes, but in many information retrieval settings other semantics may be more useful. In the XPath language [45] queries have a single output node, and the result set contains all matches for this query node that are part of some full query match. In the XQuery language [47], which is used for more complex information retrieval and processing, there can be any number of output and non-output nodes in the query.

Only minor changes are needed in Algorithm 1 for this generalized case with both output and non-output query nodes. A simple solution is to put the output query nodes first in the fixed ordering, and stop the recursion before non-output nodes are considered. Note that practical data structures that enable linear enumeration for any combination of output and non-output nodes [7] are not as simple as the data structures described in the following sections.
2.1.3 Simple intermediate result architecture

Figure 2.2 illustrated why it is not possible to output query matches directly by just inspecting the heads of the streams for each query node. In the example all the nodes labeled c must be read before it can be known whether or not any of the nodes labeled b are useful, and vice versa.

The purpose of storing intermediate results is to organize the data nodes in such a way that an implementation of the approach in Algorithm 1 can be evaluated efficiently. If the query nodes are ordered in tree preorder, it is natural to maintain, for each u ↦ u′ that is part of a full query match and each child v of u, the list of pairs v ↦ v′ used together with u ↦ u′ in some full query match. Figure 2.3 illustrates this strategy. In addition to the lists of pointers to useful child query node matches for each pair, there must be a list of pointers to the data nodes that match the query root in full query matches.
Figure 2.3: Generic intermediate results for the data tree in Figure 1.4.
CHAPTER 2. BACKGROUND
This data structure takes O(|I| + |O| · |Q|) space, linear in the size of the input and output, because the lists of data nodes take O(|I|) space, and each root pointer or child match pointer is used at least once in Algorithm 1, which has time complexity O(|O| · |Q|).

The following intuition shows how this data structure can be used to efficiently implement Algorithm 1 when query nodes are ordered in tree preorder: (i) The pairs v ↦ v′ in the initial calls in the outer for-loop are trivially found by traversing the list of pointers to full match roots. (ii) In a recursive call, after u ↦ u′ has been added to M′, the current M′ is a partial full match by assumption. Let v be the node following u in preorder, and let p be the parent of v (possibly p = u). All query nodes preceding v have a mapping in M′; assume M′(p) = p′. Let Q_p and Q_v be the subgraphs resulting from removing the edge 〈p, v〉 from Q. These subqueries can be matched independently when the mappings of both p and v are fixed in a way such that EdgeType(p, v) is satisfied. If v ↦ v′ is used in some full query match together with p ↦ p′, we know that 〈p′, v′〉 satisfies EdgeType(p, v). Then, if M′ is a partial full match, M′ ∪ {v ↦ v′} must also be a partial full match.
Example 2. This example illustrates how to implement the data access for Example 1 using the data structure in Figure 2.3. The first match for the query root a1 that is part of a full match is the data node a1, and hence the first non-empty partial full match in Algorithm 1 is M′ = {a1 ↦ a1}. When considering the next query node in preorder, b1, we see from the pointers in the data structure that b2 is the first data node usable together with a1. Hence the next partial full match is M′ = {a1 ↦ a1, b1 ↦ b2}. Then, when considering the next query node c1, we see that the data node c5 is the only data node usable with a1, the current match for the parent of query node c1. We insert c1 ↦ c5 to get the full match M′ = {a1 ↦ a1, b1 ↦ b2, c1 ↦ c5}.
2.1.4 Tree position encoding

To construct the intermediate results efficiently, it must be decidable from position information stored with the data nodes whether or not they satisfy A–D and P–C relationships. A common solution is the interval-based BEL encoding [56], where each node is given integer numbers begin, end and level, as shown in Figure 2.4.
Figure 2.4: The BEL encoding for a tree, with begin, end and level numbers.
This encoding is similar to preorder and postorder traversal numbers, and can be computed in a depth-first traversal of the tree. The reason the encoding is often preferred is probably that the begin and end numbers correspond to the document positions of opening and closing tags in XML.
With the BEL encoding, a node a is an ancestor of a node b iff a.begin < b.begin and b.begin < a.end, and it is a parent if also a.level + 1 = b.level. Sorting on begin or end numbers respectively gives the same sorting orders as preorder and postorder traversal numbers.
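As a concrete sketch, the following Python code (all names are ours, for illustration only) computes the BEL numbers in a single depth-first traversal, using the numbering convention visible in Figure 2.4 where a leaf gets end = begin, and implements the two relationship tests just described.

```python
def bel_encode(children, root):
    """Compute (begin, end, level) for each node by depth-first traversal.

    children: adjacency dict mapping a node to its list of children.
    Leaves get end == begin, matching the numbering in Figure 2.4.
    """
    enc = {}
    counter = [1]
    def visit(node, level):
        begin = counter[0]
        counter[0] += 1
        kids = children.get(node, [])
        for c in kids:
            visit(c, level + 1)
        end = counter[0] if kids else begin
        if kids:
            counter[0] += 1            # the closing tag consumes a position
        enc[node] = (begin, end, level)
    visit(root, 1)
    return enc

def is_ancestor(enc, a, b):
    # a is an ancestor of b iff a.begin < b.begin < a.end
    return enc[a][0] < enc[b][0] < enc[a][1]

def is_parent(enc, a, b):
    # parent additionally requires a.level + 1 == b.level
    return is_ancestor(enc, a, b) and enc[a][2] + 1 == enc[b][2]
```

For the seven-node tree of Figure 2.4 this reproduces exactly the numbers shown there, e.g. (1, 10, 1) for the root and (3, 3, 3) for the first leaf.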
There exists a large number of tree position encodings with different properties [50]. Some allow deciding more types of node relationships, and some allow reconstructing related nodes. They differ in the computational cost of evaluating relationships, their space usage, and how well they handle updates to the data tree.
2.1.5 Partial match filtering

When constructing intermediate results it is often possible to filter out some query and data node pairs that will never be part of a full query match. In current twig join algorithms filtering is used both for practical speedup [5, 27, 33] and as a necessity for worst-case efficient result enumeration [7].

A filtering strategy does not have to be perfect, but it must never remove pairs that are part of full query matches. In other words, it may have false positives, but not false negatives. Most filtering strategies are based on the observation that if there is some subquery (a subgraph of the query) such that the pair v ↦ v′ is not part of any match for the subquery, then v ↦ v′ is not part of any match for the entire query, and can safely be thrown away [21].
The two most common filtering strategies are illustrated in Figure 2.5. The first is based on checking if query prefix paths are matched [5, 27, 33], and the second on checking if query subtrees are matched [7, 39, 33]. The prefix path of a query node is the subquery containing the nodes on the path from the root down to the node.
Figure 2.5: Matching query parts. (a) Query. (b) Matching prefix paths. (c) Matching subtrees.
We call a pair v ↦ v′ that is part of a prefix path match for v a prefix path matcher. Filtering query and data node pairs on whether or not they are prefix path matchers is easy to implement with an inductive strategy: assuming that v ∈ Q has parent u, the pair v ↦ v′ is a prefix path matcher for v if and only if there exists a pair u ↦ u′ that is a prefix path matcher for u such that 〈u′, v′〉 satisfies the A–D or P–C relationship specified by EdgeType(u, v) [5]. Prefix path filtering is easiest to implement when data nodes are seen in tree preorder, where ancestors are seen before descendants.
Example 3. Figure 2.5b illustrates prefix path match checking. The pair a1 ↦ a1 is trivially a prefix path matcher, and b1 ↦ b3 must then be a prefix path matcher because EdgeType(a1, b1) = “A–D” and 〈a1, b3〉 ∈ E_D*. This again implies that f1 ↦ f1 must be a prefix path matcher because EdgeType(b1, f1) = “P–C” and 〈b3, f1〉 ∈ E_D.
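The inductive strategy can be sketched as follows in Python. The helper satisfies is a hypothetical predicate testing the A–D or P–C relationship between two data node positions (e.g. with the BEL conditions from Section 2.1.4); the scan over all previously found matchers of the parent is kept naive for clarity, whereas practical algorithms maintain stacks of current ancestors instead.

```python
def prefix_path_matchers(pairs_in_preorder, parent, edge_type, satisfies):
    """Keep only pairs that are prefix path matchers.

    pairs_in_preorder: (query_node, data_position) pairs in data preorder,
    so ancestors are always seen before descendants.
    parent[v]: the parent query node of v, or None for the query root.
    satisfies(edge, u_pos, v_pos): hypothetical A-D / P-C relationship test.
    """
    matchers = {}                  # query node -> list of matcher positions
    for v, v_pos in pairs_in_preorder:
        u = parent.get(v)
        if u is None:              # the query root trivially matches
            matchers.setdefault(v, []).append(v_pos)
        elif any(satisfies(edge_type[(u, v)], u_pos, v_pos)
                 for u_pos in matchers.get(u, [])):
            matchers.setdefault(v, []).append(v_pos)
    return matchers
```

Because the input is in data preorder, all potential matchers for the parent query node have already been recorded when a pair is inspected, which is exactly the induction used in the text.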
Filtering pairs on whether or not they are subtree matchers can be implemented with a similar strategy: the pair v ↦ v′ is a subtree matcher if and only if, for each child w of v, there exists a subtree matcher w ↦ w′ such that 〈v′, w′〉 satisfies the A–D or P–C relationship specified by EdgeType(v, w) [7]. Subtree match filtering is easiest to implement when data nodes are seen in tree postorder.
Example 4. Figure 2.5c illustrates subtree match checking. The pairs f1 ↦ f1, b2 ↦ b4 and c1 ↦ c5 are trivially subtree matchers because the query nodes are leaves. The pair b1 ↦ b3 is a subtree matcher because f1 ↦ f1 is a subtree matcher and 〈b3, f1〉 ∈ E_D satisfies EdgeType(b1, f1) = “P–C”, and because b2 ↦ b4 is a subtree matcher and 〈b3, b4〉 ∈ E_D* satisfies EdgeType(b1, b2) = “A–D”. The pair a1 ↦ a1 is a subtree matcher because b1 ↦ b3 is a subtree matcher and 〈a1, b3〉 ∈ E_D* satisfies EdgeType(a1, b1) = “A–D”, and because c1 ↦ c5 is a subtree matcher and 〈a1, c5〉 ∈ E_D satisfies EdgeType(a1, c1) = “P–C”.
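Mirroring the prefix path case, inductive subtree match filtering can be sketched by processing pairs in data postorder, so that matchers for all children of a query node are already known when a potential parent is inspected. Again, satisfies is a hypothetical relationship predicate and the scan is kept naive for clarity.

```python
def subtree_matchers(pairs_in_postorder, children, edge_type, satisfies):
    """Keep only pairs that are subtree matchers.

    pairs_in_postorder: (query_node, data_position) pairs in data postorder.
    children[v]: list of child query nodes of v (leaves pass trivially).
    satisfies(edge, v_pos, w_pos): hypothetical A-D / P-C relationship test.
    """
    matchers = {}                  # query node -> list of matcher positions
    for v, v_pos in pairs_in_postorder:
        ok = all(any(satisfies(edge_type[(v, w)], v_pos, w_pos)
                     for w_pos in matchers.get(w, []))
                 for w in children.get(v, []))
        if ok:
            matchers.setdefault(v, []).append(v_pos)
    return matchers
```

For query leaves the inner check is vacuously true, matching the "trivially subtree matchers" case in Example 4.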
2.1.6 Intermediate result construction

The filtering on matched subtrees described in the previous section is strongly related to a strategy that can be used to efficiently build a data structure realizing the conceptual structure depicted in Figure 2.3. What is described in the following is a slight simplification of what is used in the Twig²Stack [7] algorithm, which was the first twig join algorithm with cost linear in the size of the input data and the output result set.

The reason preorder processing of data nodes and filtering on matched prefix paths is not a suitable starting point for a worst-case efficient algorithm is that even though paths in the data do match paths in the query, it is hard to figure out on the fly, during preorder processing, whether or not other paths in the query can use the same branching nodes. With postorder processing, on the other hand, matches for the query can be constructed bottom up by combining subtree matches into bigger subtree matches. The storage order of data nodes in the index does not have to be changed for postorder processing, as a preorder stream of match pairs can be translated to a postorder stream with a stack: when a pair v ↦ v′ is read in preorder, all pairs u ↦ u′ on the stack such that u′ is not an ancestor of v′ are popped off and processed one by one, before v ↦ v′ is pushed onto the stack.
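The stack-based translation can be sketched as a small Python generator, assuming each pair carries the data node's (begin, end) interval so that ancestry is testable with the BEL condition from Section 2.1.4.

```python
def to_postorder(pairs_in_preorder):
    """Translate a preorder stream of pairs into postorder with a stack.

    Each pair is (query_node, (begin, end)); a node on the stack is an
    ancestor of the new node iff its interval contains the new begin.
    """
    stack = []
    for pair in pairs_in_preorder:
        begin = pair[1][0]
        # pop every pair whose data node is not an ancestor of the new one
        while stack and not (stack[-1][1][0] < begin < stack[-1][1][1]):
            yield stack.pop()
        stack.append(pair)
    while stack:                   # flush the remaining open ancestors
        yield stack.pop()
```

The stack holds exactly the open ancestors of the current node, so each pair is pushed and popped once and the translation is linear in the stream length.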
When following the strategy from Sections 2.1.2 and 2.1.3, the key to efficient enumeration of results is the ability to efficiently find usable subtree matches. Given a candidate v ↦ v′, we need to find, for all children w of v, the list of matchers w ↦ w′ such that 〈v′, w′〉 satisfies EdgeType(v, w). Subtree matches for the query root are trivially full query matches.
The overall strategy for the proposed data structure is to maintain for each query node v a list of disjoint trees T_v consisting of node matches from the stream S_v, as shown in Figure 2.6. Some additional dummy nodes are used to bind the trees together. For each data node in the trees for a query node, there is a list of pointers to usable child query node matches. P–C matches are pointed to directly, while A–D matches are found in the entire subtrees pointed to.
Figure 2.6: Postorder construction of intermediate results for the data and query in Figure 1.4.
Algorithm 2 shows how this data structure can be constructed, specifying the processing of a single pair v ↦ v′ in postorder. For each query node v, there is a list T_v of disjoint trees consisting of subtree matchers v ↦ v′ where v′ ∈ S_v. When processing a pair v ↦ v′, the trees whose root data nodes are descendants of v′ are joined into single trees, both in the lists T_w for the children w of v, and in the list T_v for v itself. For P–C edges, pointers from v ↦ v′ to w ↦ w′ denote single direct child matches, while for A–D edges, pointers denote that entire subtrees contain matches. A pair v ↦ v′ is only added if there is at least one pointer for each child w of v, and this effectively implements subtree match filtering as described in Section 2.1.5.
Example 5. Figure 2.7 shows the step processing a1 ↦ a1 when constructing intermediate results for the data and query from Figure 1.4 with Algorithm 2. The trees at the end of T_b1, whose roots are b2 and a dummy node, are joined into a single tree. So are the trees at the end of T_c1, whose roots are c3, c4 and c5. Pointers are added from a1 ↦ a1 to the tree of descendants in T_b1, and to the child match c1 ↦ c5 in T_c1. Since a1 ↦ a1 has pointers to matches both for b1 and c1, it is a subtree match, and is added to T_a1.
When evaluating the input I with Algorithm 2, the total number of calls to the procedure Process() would be ∑_{v∈V_Q} |S_v| = |I|, and the total number of rounds in the for-loop would be ∑_{v∈V_Q} |I_v| · b_v ∈ O(|I| · b_Q), where b_v is the number of children of v and b_Q is the maximal number of children for any node in Q. Apart from constant time
Algorithm 2 Postorder intermediate result construction
Function Process(v ↦ v′):
    For each child w of v:
        Let T′_w be the trees at the end of T_w whose root nodes are descendants of v′.
        If EdgeType(v, w) = “P–C”:
            Add pointers from v ↦ v′ to all w ↦ w′ in T′_w where depth(w′) = depth(v′) + 1.
        If |T′_w| > 1:
            Replace T′_w by a dummy node with the trees from T′_w as children.
        If EdgeType(v, w) = “A–D” and |T′_w| > 0:
            Add a descendant pointer from v ↦ v′ to the single node in T′_w.
    If v ↦ v′ does not have at least one pointer per child w of v:
        Discard v ↦ v′ and return failure.
    Remove from the end of T_v all roots whose data nodes are descendants of v′,
    add them as children of v ↦ v′, and append v ↦ v′ to T_v.
Figure 2.7: A step in postorder construction of intermediate results for the data and query in Figure 1.4, (a) before and (b) after adding a1 ↦ a1. Dotted boxes give the current list of trees T_v for each v ∈ V_Q.
operations for each input v ↦ v′ and each child w of v, there is some non-trivial cost associated with merging trees and adding pointers to P–C and A–D child matches. A merge attempt either inspects only one tree root and does not change T_v, or inspects k > 1 roots, removes k − 1 roots from T_v and adds a new one. This means that the cost of merge operations is bounded by the number of attempts and the sizes of the trees, i.e., ∑_{v∈V_Q} O(|I_v| + |I_v| · b_v). Now consider the cost of adding pointers from matches for a query node v to matches for a child query node w. If EdgeType(v, w) = “A–D”, then only a single edge is added from each v ↦ v′. If EdgeType(v, w) = “P–C”, then only a single edge is added to each w ↦ w′, as a node can have only one parent. In conclusion, the total cost of using Algorithm 2 is ∑_{v∈V_Q} O(|I_v| + |I_v| · b_v) ⊆ O(|I| + |I| · b_Q).
What is presented here is a slight simplification of the Twig²Stack algorithm [7]. The main difference between the above depiction and Twig²Stack is that in the latter, the data structure for each query node is a list of trees of stacks of nodes, instead of simply a list of trees of nodes. Many alternative twig join algorithms have been presented [27, 39, 33] in the years following the publication of the Twig²Stack algorithm. What is common to these algorithms is that they have improved practical performance, but higher worst-case complexity in the result enumeration phase. An example is the TwigList algorithm, which stores intermediate nodes in simple vectors instead of trees, and implements a weaker form of subtree filtering, where all query edges are considered to have type A–D.
2.1.7 Merging input streams

The final component missing to implement the strategy in Figure 2.1 is the input stream merger. The input to the merge is one preorder-sorted stream representing I_v for each v ∈ V_Q, and the desired output is a sorted stream representing I. The sort order required for the approach from Section 2.1.6 is that the pairs v ↦ v′ ∈ I are sorted primarily on the preorder of the data nodes, and secondarily on the postorder of the query nodes. This means that after translating the stream into data node postorder with a stack, the new stream is sorted secondarily on query node preorder. This is required by Algorithm 2 for cases where a single data node matches multiple query nodes, as a data node could hide useful children of itself if the sorting were not secondarily on query node preorder.
The simplest merge approach is to traverse the query in postorder, and find some minimal v ↦ v′ by taking a preorder-minimal v′ that is the head of a stream I_v for a postorder-minimal v. This takes Θ(|Q|) time per extraction, and gives a total cost of Θ(|I| · |Q|) for the merge. An asymptotically better approach is to organize the individual streams in a priority queue implemented with a binary heap, sorted primarily on the heads of the streams and secondarily on the query nodes. Extractions then take O(log |Q|) time, and the total cost is O(|I| log |Q|) [11].
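The heap-based merger can be sketched with Python's heapq, where each stream is an iterator of (begin, end) positions sorted in preorder, keyed by the query node's postorder rank so that ties on data position are broken on query node postorder; all names are illustrative.

```python
import heapq

def merge_streams(streams):
    """Merge per-query-node streams into one stream sorted primarily on
    data node preorder (begin) and secondarily on query node postorder.

    streams: dict mapping a query node's postorder rank to an iterator of
    (begin, end) data positions, each iterator sorted in preorder.
    """
    heap = []
    for rank, it in streams.items():
        head = next(it, None)
        if head is not None:
            heapq.heappush(heap, (head[0], rank, head, it))
    while heap:
        _, rank, head, it = heapq.heappop(heap)
        yield rank, head
        nxt = next(it, None)              # refill from the same stream
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], rank, nxt, it))
```

Each of the |I| extractions costs O(log |Q|) heap work, giving the O(|I| log |Q|) total mentioned above, with only one buffered head per stream.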
Since the preorder and postorder tree traversal numbers we are sorting on are bounded by the size of the input, the sorting complexity is not loglinear, but linear under the unit cost assumption: the entire set I can be put in a single array and sorted using radix sort in Θ(|I|) time [11]. As the intermediate result construction is already O(|I| · b_Q), the radix sort approach gives no advantage over the heap-based approach when log |Q| ∼ b_Q. Since the latter uses much less memory in practice, Θ(|Q|) instead of Θ(|I|), it is preferable in most real-world scenarios.
Some of the newer twig join algorithms storing intermediate results in preorder [27, 33] use an O(|I| · |Q|) input stream merge component that implements a weak form of subtree match filtering, where all query edges are considered to have type A–D [5]. The merger uses only O(|Q|) memory and is very fast in practice because queries are typically small. It returns data nodes in a relaxed preorder, where the ordering is only guaranteed between matches for query nodes related by ancestry. This stream is not easily translated into postorder, and hence the merger is not used by postorder processing algorithms [21].
2.1.8 Data locality and updatability

This chapter generally does not distinguish between data stored in main memory and on disk, but in practical implementations it is important to consider the costs of different access patterns on different media. While main memory on modern computers does not really have uniform memory access cost, due to the use of caches, we can design usable systems that use random memory reads and writes. On the other hand, if the data is so large that it must reside on disk, a system that uses a lot of random access will not be efficient in practice.
Consider now the different phases and components in our twig join strategy. The input stream merger is assumed to only inspect stream heads and store a minimal amount of state. Hence it should work well on an architecture where the candidate matches for each query node are streamed from disk. The intermediate result construction, as shown in Algorithm 2, inspects in each call a number of tree roots stored contiguously at the end of the current list of trees for each query node. This in itself is simple to implement with good spatial locality, but it should also be considered how the layout of data affects the result enumeration phase. Luckily, if intermediate nodes are streamed to disk and inserted into blocks in postorder, most nodes that are close in the data tree will be stored close together on disk. This strategy gives fairly good spatial locality during result enumeration [7].
The problem of intermediate results exceeding the size of main memory can be avoided in many practical cases, by observing that when the uppermost candidate match for the root query node is closed, none of the data nodes seen so far in the tree preorder will be used in any match involving data nodes later in the tree preorder [7]. This means that when the uppermost query root match candidate is closed, the current intermediate data can be used to enumerate the current set of query matches, before this data is discarded.
Example 6. Consider the data in Figure 1.4, and an algorithm that pushes nodes onto a stack in preorder and pops them off in postorder. When the data node b6 is processed, it causes the popping of a1, and there are no more a-nodes on the stack. As a match for the query node a1 must be above the match for any other query node in a full query match, no nodes preceding b6 in the data will be involved in a match together with nodes following and including b6. Hence we can enumerate results, and delete the current intermediate data structures.
In many practical cases with large amounts of data, the underlying information is stored in a large number of independent documents of moderate size, and in these cases the above trick is always applicable. Data updates are also easy to handle in such a setting. A way of encoding global data node positions is to combine document identifiers with local node position encodings, such as BEL, and this simplifies updates: updating a document can be viewed as deleting it and then re-adding it with a new document identifier, as is common in search systems for unstructured data [51]. Note that when the data is a single large tree that cannot easily be partitioned into independent documents, we need a node position encoding that has affordable cost for tree updates. There exist a number of such encodings with different properties [50].
2.1.9 Twig join conclusion

We have now discussed all the components in a state-of-the-art twig join algorithm, and the costs of the different components are:
• input stream merge: O(|I| log |Q|) for the heap-based approach,
• intermediate result construction: O(|I| · b_Q),
• result enumeration: O(|O| · |Q|).
This gives a total combined data, query and result complexity of O(|I| log |Q| + |I| · b_Q + |O| · |Q|). Commonly the size of the query is viewed as a constant, and twig join algorithms are called linear and optimal if the combined data and result complexity is O(|I| + |O|).
2.2 Partitioning data

In the previous discussion it was assumed that the data nodes were partitioned on label in the index. This section considers the advantages and challenges that arise from more advanced indexing strategies.
2.2.1 Motivation for fragmentation

Let us first recap the introduction to the general strategy for TPM on indexed data from Section 1.3.1: the index is a mechanism which provides a function from some feature of a node to the set of nodes in the data that have this feature.

The main motivation for using an index is of course reading and processing less data during query processing. If node labels are selective then simple label partitioning is an efficient approach, but this is not always the case. Figure 2.8 shows a case with many label matches for the individual query nodes in the data, but only a few full matches for the query.
The above example may be unrealistic, but reconsider the data in Figure 1.1 and the query in Figure 1.2 on page 14. If the given library has billions of books, then the cost of reading the data nodes labeled book will be huge compared to the size of the output result set. This motivates the use of a more fragmented partitioning of the data to improve the selectivity of query nodes. Note that another way of improving performance in these cases is to use skipping, discussed later in Section 2.3.1.
Figure 2.8: Partitioning on label. (a) Example query and data, showing the first of four matches. (b) Example query and streams read; marked stream nodes are useful.
2.2.2 Path partitioning

A natural extension of label partitioning is to partition data nodes on the paths by which they are reachable [37, 13, 36, 8]. Section 2.1.5 described how useless data nodes could be filtered out during intermediate result construction if they did not match prefix paths in the query. When indexing data nodes on prefix path, the same filtering is performed in advance, and we only process data nodes from classes where the prefix paths match the prefix paths in the query.

To identify useful partitions when evaluating a query, we need some form of dictionary. In Figure 1.5b on page 17 a simple dictionary of path strings was used in the index, but this approach does not have attractive worst-case properties. There may be many unique paths in the data, and the size of this naive dictionary can be O(|D|²) if the tree is deep. A more robust approach is to use a dictionary tree called a path summary, where shared prefixes of paths are only encoded once.
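A path summary can be sketched as a dictionary trie built in one traversal of the data tree: each distinct prefix path gets exactly one summary node, so shared prefixes are stored once. The concrete representation below (adjacency dicts, integer summary node ids) is our own illustration, not the thesis's data structure.

```python
def build_path_summary(children, label, root):
    """Build a path summary and the partition it induces.

    children: adjacency dict of the data tree; label[n]: label of node n.
    Returns (summary_children, node_to_class), where summary_children maps
    a summary node id to {label: child summary id}, and node_to_class maps
    each data node to the summary node (block) for its prefix path.
    """
    summary_children = {}
    node_to_class = {}
    next_id = [0]
    def summary_child(s, lab):
        kids = summary_children.setdefault(s, {})
        if lab not in kids:            # first time this prefix path is seen
            next_id[0] += 1
            kids[lab] = next_id[0]
        return kids[lab]
    def visit(n, s):
        c = summary_child(s, label[n])
        node_to_class[n] = c           # n's block is its prefix path's node
        for m in children.get(n, []):
            visit(m, c)
    visit(root, 0)                     # 0 is a virtual summary root
    return summary_children, node_to_class
```

The summary has one node per distinct prefix path, so its size is bounded by the number of such paths rather than their total length, which is what makes it more robust than a dictionary of path strings.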
Figure 2.9a shows the path partitioning for the data tree in Figure 2.8a. A path<br />
summary can be constructed from this partitioning by creating one node for each block<br />
in the partition, <strong>and</strong> creating edges between summary nodes whenever there are edges<br />
between data nodes in the related blocks, as shown on the left in Figure 2.9b.<br />
Prefix path matches for each query node can be found individually by using a matching<br />
algorithm on the summary tree, but this may give many individual matches that never<br />
take part in full query matches. A robust <strong>and</strong> efficient way to find useful prefix path<br />
matches is to index the summary itself on label, and use a twig join algorithm to evaluate
queries directly on the summary to find relevant nodes [2].<br />
(a) Path partition.<br />
(b) Summary, query <strong>and</strong> streams read.<br />
Figure 2.9: Partitioning on prefix path.<br />
Figure 2.9b shows how a query is evaluated on the data from Figure 2.8a. The legal<br />
mappings <strong>of</strong> query nodes to summary nodes are found, <strong>and</strong> streams <strong>of</strong> related data nodes<br />
are read. Note that in this particular example there is only one match in the summary<br />
for each query node. If one query node matches multiple summary nodes, the streams<br />
<strong>of</strong> data nodes related to each <strong>of</strong> these summary nodes must somehow be merged into a<br />
single stream [8, 3].<br />
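Such a merge can be done lazily with a heap over the stream heads. A minimal sketch, assuming streams of (begin, end) pairs sorted by begin value, as in a BEL-style encoding:

```python
import heapq

def merged_stream(*extents):
    # Merge several document-ordered streams (each sorted by begin value)
    # into a single stream, as needed when one query node matches several
    # summary nodes. heapq.merge is lazy: it keeps only one head per
    # input stream in memory at a time.
    return heapq.merge(*extents, key=lambda node: node[0])

s1 = [(2, 3), (9, 12)]   # extent of one matching summary node
s2 = [(4, 5), (10, 11)]  # extent of another matching summary node
merged = list(merged_stream(s1, s2))
```

The merged sequence is again sorted by begin value, so the join algorithm can consume it exactly like a single-block stream.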
An extra bonus with path indexing applies for non-branching queries where the leaf<br />
node is the only output node. For such queries, the data nodes related to the summary<br />
nodes matched by the output query node can be read directly from the index, without<br />
using any join. It is known that these data nodes have the necessary ancestor nodes<br />
for matching the path query, <strong>and</strong> hence the ancestors do not have to be read. As an<br />
example, if the query node a 2 in Figure 2.9b is removed, <strong>and</strong> a 4 is the only output node,<br />
all matching data nodes can be read from the extents <strong>of</strong> the matching summary nodes,<br />
which in this case is only a p5 .<br />
2.2.3 Backward <strong>and</strong> forward path partitioning<br />
With path indexing, data nodes are placed in the same block in the partition iff they<br />
have the same incoming paths. An alternative recursive formalization is that two nodes<br />
are in the same block iff they have the same label <strong>and</strong> both have parents from the same<br />
block [36]. This can be extended by also requiring that the nodes have children from the same blocks, yielding a partition on the sets of incoming and outgoing paths [28].
With this backward <strong>and</strong> forward path partitioning, branching queries can be evaluated<br />
more efficiently than with simple backward path partitioning [28].<br />
Figure 2.10 shows backward <strong>and</strong> forward path partitioning, the resulting structure<br />
summary, <strong>and</strong> the evaluation <strong>of</strong> a query. Note that in this example, each query node<br />
matches only one summary node, but this is not always the case.<br />
Recall that for simple path indexing, non-branching queries with a single leaf output<br />
node could be evaluated without joins, by just reading the matches for the output leaf.<br />
The reason was that the existence <strong>of</strong> useful ancestor nodes was implied from the summary<br />
matching. A bonus with backward <strong>and</strong> forward path indexing is that the existence <strong>of</strong><br />
(a) F&B partition.<br />
(b) Summary, query <strong>and</strong> streams read.<br />
Figure 2.10: Forward <strong>and</strong> backward path partitioning.<br />
useful matches both for ancestor <strong>and</strong> descendant query nodes is implied from the summary<br />
matching, even for branching queries. Assuming there is a match for the query in the<br />
backward <strong>and</strong> forward path summary, as shown in Figure 2.10b, it is known that all data<br />
nodes related to matched summary nodes are guaranteed to be part <strong>of</strong> at least one full<br />
query match [28]. In the example, it is certain that all data nodes classified by a s2 have children classified by a s3 and b s4, and that data nodes classified by b s4 have children classified by a s5 and b s6.
A second bonus is that no joins are necessary for any query with a single output<br />
node. All matching can be performed in the summary, <strong>and</strong> relevant data nodes matching<br />
the output query node can be read directly from the blocks related to the matched<br />
summary nodes.<br />
The formal definitions of the recursive partitioning strategies sketched above are based on binary relations called bisimulations [36, 28], which can be computed in O(|E| log |V|) time for any graph [38].
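A naive fixpoint computation of the backward variant (same label, parents in the same block) can be sketched as below; the forward variant refines on the children's blocks symmetrically, and the Paige-Tarjan algorithm [38] achieves the O(|E| log |V|) bound. This is an illustrative sketch with assumed accessor functions, not an implementation from the thesis.

```python
def backward_bisimulation_blocks(nodes, label, parent):
    # Start from the label partition and repeatedly split blocks until two
    # nodes share a block iff they have the same label and their parents
    # share a block. Quadratic worst case, unlike Paige-Tarjan.
    block = {u: (label(u),) for u in nodes}
    while True:
        refined = {u: (label(u), block[parent(u)] if parent(u) else None)
                   for u in nodes}
        if len(set(refined.values())) == len(set(block.values())):
            return refined  # no block was split: fixpoint reached
        block = refined

# Tiny example: x and y have same-block parents, z does not.
labels = {"r": "a", "x": "b", "y": "b", "z": "b"}
parents = {"r": None, "x": "r", "y": "r", "z": "x"}
blocks = backward_bisimulation_blocks(list(labels), labels.get, parents.get)
```

Here x and y end up in the same block (both labeled b with an a-labeled root parent), while z is split off because its parent lies in a different block.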
2.2.4 Balancing fragmentation<br />
The main advantage <strong>of</strong> the finer partitioning schemes is that the increased fragmentation<br />
usually leads to less data read from the index, as smaller blocks are read for each query<br />
node. This does not come without a cost. A larger number <strong>of</strong> blocks in the partition<br />
gives a larger summary structure, which leads to more expensive initial matching in<br />
the summary. Over-refined partitioning can even give increased query evaluation cost<br />
without considering the expenses <strong>of</strong> summary matching. If the properties <strong>of</strong> nodes used<br />
to classify data nodes in the index are more detailed than what a query describes, then<br />
many partition blocks would have to be read and merged for each query node. To sum up, we need to balance, on the one hand, the number of false positives read for each query node and, on the other, the cost of summary matching and merging blocks.
A way of reducing the cost of summary matching for strong structural indexing with high fragmentation is to use a multi-level strategy. For example, the data can be indexed with backward and forward path partitioning, the resulting backward and forward path summary indexed with simple backward path partitioning [52], and the resulting backward path summary indexed with label partitioning [2]. Figure 2.11 shows conceptually how a query can be evaluated with such an index.
Figure 2.11: Multi-level partitioning. Showing (a) the query (b) backward path summary<br />
(c) backward <strong>and</strong> forward path summary, <strong>and</strong> (d) the underlying data.<br />
The path-based partitioning strategies presented here give over-refinement in some<br />
practical cases, <strong>and</strong> various weaker indexing schemes have been developed. There exist<br />
indexing schemes with fragmentation between simple path indexing <strong>and</strong> backward <strong>and</strong><br />
forward path indexing [28], schemes considering only shorter local paths [30], <strong>and</strong> schemes<br />
that adapt to query workload [6].<br />
A way of reducing the impact of the fragmentation problem is to use an adaptive strategy with different partitioning schemes for different classes of data nodes. For example, label partitioning could be used for nodes with infrequent labels, and path indexing for nodes with frequent labels. A static strategy used for XML data is to use path indexing for XML element nodes (tags), and value indexing for XML text values and attributes [29]. This works well in practice, because the alphabet of element node names is typically small, while the alphabet of text values is typically very large. Also, it lets the same index be used for both semi-structured and unstructured query processing, as simple text values can be looked up directly in the index.
2.3 Reading data<br />
Section 2.1 was concerned with how to join together full query matches from individual<br />
node matches, while Section 2.2 was concerned with how to partition the data in such a<br />
way that as few matches as possible had to be read for each individual query node. In<br />
Section 2.1.7 it was mentioned that some input stream mergers implement an inexpensive
form <strong>of</strong> weak filtering to relieve the more expensive filtering in the intermediate result<br />
construction component. This section considers how to store the data in a way such that<br />
this filtering can be performed more efficiently, even without reading all individual query<br />
node matches.<br />
2.3.1 Skipping<br />
Consider evaluating the query “Britney <strong>Grimsmo</strong>” on a regular web search engine. The<br />
simplest way to find the co-occurrences <strong>of</strong> the two terms is to go through the two lists<br />
<strong>of</strong> occurrences in parallel. But if there are many more occurrences <strong>of</strong> “Britney” than<br />
“<strong>Grimsmo</strong>”, it is more efficient to read through the hits for the second term <strong>and</strong> somehow<br />
search for related matches in the list <strong>of</strong> hits for the first term, skipping those that are<br />
irrelevant. This can be performed either with some sort <strong>of</strong> binary search if all the data is<br />
stored uncompressed in main memory, or with a specialized data structure, such as skip lists or B-trees.
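The skipping idea for the two-term example can be sketched with plain binary search over sorted in-memory document-id lists; in a real engine, skip lists or B-trees would replace the in-memory search. The list contents are made up for illustration.

```python
from bisect import bisect_left

def intersect_with_skipping(short, long):
    # Scan the shorter postings list and binary-search the longer one,
    # skipping past irrelevant entries: O(|short| log |long|) comparisons
    # instead of O(|short| + |long|) for a plain parallel scan.
    hits, lo = [], 0
    for doc in short:
        lo = bisect_left(long, doc, lo)  # resume search from last position
        if lo < len(long) and long[lo] == doc:
            hits.append(doc)
    return hits

grimsmo = [3, 17]                          # few occurrences
britney = [1, 2, 3, 5, 8, 13, 17, 21, 40]  # many occurrences
cooccurrences = intersect_with_skipping(grimsmo, britney)
```

Note that the search in the longer list resumes from the previous hit position, so the whole pass never moves backwards.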
Skipping in joins of streams of different length is also relevant for queries on semi-structured data, as shown by the example library query in Figure 1.2a, where there are probably far fewer matches for “Gödel” than for .
is not always trivial to implement for tree queries. In the following it will be discussed<br />
how to efficiently skip through matches when aligning streams for parent <strong>and</strong> child query<br />
nodes, <strong>and</strong> how to make sure the skipping is efficient for the query as a whole. Figure 2.12<br />
illustrates different issues in connection with skipping twig joins.<br />
Figure 2.12: Cases where different skipping techniques are needed for efficient processing.<br />
(a) Query. (b) Descendants easily skipped with B-tree or similar. (c) XR-tree needed<br />
to skip ascendants. (d) Holistic skipping preferred. (e) Holistic skipping with XR-tree<br />
needed.<br />
2.3.1.1 Skipping child matches<br />
For a given parent <strong>and</strong> child query node pair, we want to efficiently forward the related<br />
streams <strong>of</strong> data nodes to the first pair <strong>of</strong> stream positions where a match for the parent<br />
query node is an ancestor (or parent) <strong>of</strong> the match for the child query node. We will first<br />
consider the simpler problem <strong>of</strong> forwarding a stream <strong>of</strong> matches for the child query node<br />
to catch up with the parent’s stream.<br />
Figure 2.12b shows a case where there is a current match b 1 for the query node b 1 , <strong>and</strong><br />
we want to find the first match for the child query node c 1 which is a descendant <strong>of</strong> b 1 . In<br />
other words, we desire the first match for c 1 that follows b 1 in preorder. If this node
also precedes b 1 in postorder, it must be a descendant. For example, let b 1 .begin = k,<br />
using the BEL encoding described in Section 2.1.4. We can binary search for the first<br />
match for c 1 with begin value greater than k, which is c q . Since also c q .end < b 1 .end, it follows that c q must be a descendant of b 1 .
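This descendant-skipping step can be sketched as follows, with stream nodes as (begin, end) pairs from the BEL encoding. The function name and stream contents are illustrative assumptions, not the thesis implementation.

```python
from bisect import bisect_right

def skip_to_descendant(stream, b):
    # Forward a child-query-node stream (sorted by begin value) past all
    # nodes that cannot be descendants of the parent match b. Binary
    # search finds the first node following b in preorder (begin > b's
    # begin); the postorder check (end < b's end) confirms descendance.
    begins = [node[0] for node in stream]
    i = bisect_right(begins, b[0])
    if i < len(stream) and stream[i][1] < b[1]:
        return stream[i]
    # If the first following node is disjoint from b, so is every later
    # one, since begin values only grow: no descendant of b remains.
    return None

stream_c = [(1, 20), (3, 4), (6, 7)]  # matches for the child query node
```

For a parent match b = (2, 10), the search skips the preceding node (1, 20) and lands on (3, 4), which is indeed nested inside b.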
2.3.1.2 Skipping parent matches<br />
Skipping through matches for a parent query node to find the first data node that is an<br />
ancestor <strong>of</strong> the current match for a child query node is not as simple, as illustrated in<br />
Figure 2.12c. Given a current match c 1 ↦ c 1 , the first ancestor of c 1 in the stream for
b 1 could be any node before c 1 in the preorder. Preceding ancestors are mixed with<br />
preceding non-ancestors in the preorder sorting. They are spread throughout the stream,<br />
<strong>and</strong> hence it is hard to predict where they are. Note that sorting in postorder does not<br />
solve this problem, as an ancestor <strong>of</strong> a node is any node later in the postorder.<br />
Implementing parent match skipping requires specialized data structures. The XR-tree<br />
is a B-tree variant that can be used to retrieve all r ancestors <strong>of</strong> a node from n c<strong>and</strong>idates<br />
in O(log n + r) time [25, 26]. The data structure maintains one tree for each node label<br />
α, <strong>and</strong> conceptually stores for each node u with Label(u) = α, the list <strong>of</strong> ancestors <strong>of</strong> u<br />
that have label α. Linear space usage is achieved by removing the redundancy <strong>of</strong> storing<br />
common ancestors multiple times.<br />
2.3.1.3 Holistic skipping<br />
A common method for forwarding the streams associated with the nodes in a query is to<br />
pick an edge, <strong>and</strong> repeatedly forward the streams for the parent <strong>and</strong> child node until they<br />
are aligned, then pick another edge, <strong>and</strong> so on, until the entire query is aligned. Unfortunately,<br />
this procedure can lead to suboptimal behavior, as illustrated by Figure 2.12d. As<br />
the query edge 〈a 1 , b 1 〉 is satisfied initially, the streams for b 1 <strong>and</strong> c 1 will be repeatedly<br />
forwarded until the edge 〈b 1 , c 1 〉 is satisfied. This means all data nodes b 2 . . . b p <strong>and</strong><br />
c 2 . . . c p will be inspected.<br />
To avoid this pitfall, the query should be considered holistically when forwarding<br />
streams. A robust approach is to repeatedly forward streams top-down <strong>and</strong> bottom-up in<br />
the query [12]. In the top-down pass, a stream is forwarded until the current head follows<br />
the head for the stream <strong>of</strong> the parent query node in preorder. In the bottom-up pass, a<br />
stream is forwarded until the current head follows the heads <strong>of</strong> the streams <strong>of</strong> all child<br />
query nodes in postorder. If this approach were followed in Figure 2.12d, nearly all the
useless nodes labeled b <strong>and</strong> c would be skipped past.<br />
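The two passes can be sketched as below, over streams of (begin, end) pairs sorted by begin value. This is a simplified illustration of the idea in [12], not the published algorithm; names and stream contents are assumptions.

```python
def holistic_forward(children, order, streams):
    # Repeatedly forward stream heads until, for every query edge, the
    # child head follows the parent head in preorder (greater begin) and
    # the parent head follows the child head in postorder (greater end).
    # `order` lists the query nodes top-down.
    pos = {q: 0 for q in order}
    moved = True
    while moved:
        moved = False
        for q in order:                       # top-down pass
            if pos[q] >= len(streams[q]):
                return pos                    # a stream ran dry: no match left
            for c in children[q]:
                while (pos[c] < len(streams[c])
                       and streams[c][pos[c]][0] < streams[q][pos[q]][0]):
                    pos[c] += 1
                    moved = True
        for q in reversed(order):             # bottom-up pass
            for c in children[q]:
                while (pos[q] < len(streams[q]) and pos[c] < len(streams[c])
                       and streams[q][pos[q]][1] < streams[c][pos[c]][1]):
                    pos[q] += 1
                    moved = True
    return pos

# Chain query a -> b -> c; only the second a-interval nests useful b and c.
streams = {"a": [(1, 10), (20, 30)],
           "b": [(2, 3), (21, 25)],
           "c": [(22, 23)]}
pos = holistic_forward({"a": ["b"], "b": ["c"], "c": []},
                       ["a", "b", "c"], streams)
```

In the example, the bottom-up pass pulls the streams for b and then a forward past their useless first entries, so the cursors align on the nested triple (20, 30) ⊇ (21, 25) ⊇ (22, 23).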
Figure 2.12e shows a case where both holistic skipping <strong>and</strong> specialized data structures<br />
for ancestor skipping are needed for an optimal skipping strategy.<br />
2.3.2 Virtual streams<br />
In the previous section we considered how data nodes could be efficiently skipped past<br />
if they were not part <strong>of</strong> query matches. This section describes a different approach:<br />
Reconstruction <strong>of</strong> data nodes needed for a complete query match.<br />
2.3.2.1 Virtual matches for non-branching internal query nodes<br />
Reconstructing a data node based on the existence <strong>of</strong> other data nodes requires some<br />
information about the structure the known nodes are part <strong>of</strong>. A good starting point is<br />
using structural summaries, such as those described in Section 2.2, and storing with each
data node the classification <strong>of</strong> the structure it is part <strong>of</strong>.<br />
Example 7. For the query <strong>and</strong> data in Figure 2.13 the data nodes are classified on<br />
path, with the path summary shown on the right in the figure. Given a current matching<br />
a 1 ↦ a 2 and a 4 ↦ a 4 , we know that the current data nodes have paths specified by a p2 and
a p5 , <strong>and</strong> we can deduce by using the path summary that they must have a node on the<br />
path between them specified by b p4 .<br />
Figure 2.13: Virtual match for non-branching internal query node.<br />
As illustrated above, the existence <strong>of</strong> matches for non-branching internal query nodes<br />
can be implied by the matches for above <strong>and</strong> below query nodes <strong>and</strong> information from<br />
pattern matching in the path summary. This means no data nodes have to be read for<br />
non-branching internal query nodes, as long as they are not output nodes.<br />
2.3.2.2 Tree position encoding allowing ancestor reconstruction<br />
When virtualizing output nodes, matches cannot just be implied, but must be reconstructed,<br />
at least to an extent such that they can be uniquely identified. The most<br />
common approach to implementing this is to use a node encoding which allows ancestor<br />
reconstruction, such as Dewey [44], which encodes positions with strings <strong>of</strong> integers. With<br />
the Dewey encoding the tree position <strong>of</strong> a node is encoded as the position <strong>of</strong> the parent<br />
concatenated with the child number <strong>of</strong> the node, as shown in Figure 2.14. This means<br />
that the string length <strong>of</strong> a position encoding is equal to the depth <strong>of</strong> the corresponding<br />
data node.<br />
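Dewey code assignment can be sketched as a preorder walk. This is an illustrative sketch; the tree accessor is an assumption.

```python
def dewey_codes(root, children):
    # The root gets code [1]; every other node gets its parent's code with
    # its 1-based child number appended, so code length equals node depth.
    codes = {}
    def visit(u, code):
        codes[u] = code
        for i, c in enumerate(children(u), start=1):
            visit(c, code + [i])
    visit(root, [1])
    return codes

# The shape of the tree in Figure 2.14: the root has three children,
# and the second child has three children of its own.
kids = {"r": ["u", "v", "w"], "v": ["x", "y", "z"]}
codes = dewey_codes("r", lambda u: kids.get(u, []))
```

The second child of the root gets code 1.2, and its second child gets code 1.2.2, matching the figure.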
[Tree with root position 1, its children 1.1, 1.2 and 1.3, and the children of node 1.2 at positions 1.2.1, 1.2.2 and 1.2.3.]
Figure 2.14: Dewey encoding <strong>of</strong> node positions.<br />
2.3.2.3 Virtual matches for branching query nodes<br />
Using virtual matches for branching query nodes is considerably harder than for non-branching
query nodes, <strong>and</strong> requires an encoding allowing ancestor reconstruction even<br />
when the query nodes to be virtualized are not output nodes. An example is shown in<br />
Figure 2.15. Even though it is known that there is an above data node classified by a p2<br />
in the path summary, both for a 2 ↦ a 7 and a 4 ↦ a 4 , it cannot be determined from the
summary matching alone that it is the same above data node.<br />
Figure 2.15: Virtual match for branching internal query node.<br />
Luckily, with the Dewey encoding, the lowest common ancestor <strong>of</strong> two data nodes can<br />
be determined by finding the longest common prefix <strong>of</strong> the Dewey strings. This means<br />
that virtual matches for branching query nodes can be generated by combining node<br />
position encoding with information from the path summary matching [53, 34].<br />
Example 8. In the example in Figure 2.15, data nodes a 7 <strong>and</strong> a 4 have Dewey position<br />
encodings 1.1.2 <strong>and</strong> 1.1.1.1. The longest common prefix <strong>of</strong> these strings is 1.1, which<br />
has length 2. Since both a p3 <strong>and</strong> a p5 have the ancestor a p2 at depth 2, it is determined<br />
that a 7 <strong>and</strong> a 4 have a common ancestor with structure belonging to the group specified<br />
by a p2 , <strong>and</strong> the query is matched.<br />
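The check from Example 8 can be sketched as follows: the longest common prefix of the two Dewey codes gives the depth of the lowest common ancestor, which is compared against the depth at which the path summary places the virtualised branching node. The function names are illustrative assumptions.

```python
def common_prefix_len(code1, code2):
    # Depth of the lowest common ancestor of two nodes, obtained from the
    # longest common prefix of their Dewey codes.
    n = 0
    for a, b in zip(code1, code2):
        if a != b:
            break
        n += 1
    return n

def virtual_branch_match(code1, code2, summary_anc_depth):
    # The summary places the required common ancestor at a known depth;
    # the two matches combine iff their real common ancestor is at least
    # that deep, i.e. they agree on the first summary_anc_depth steps.
    return common_prefix_len(code1, code2) >= summary_anc_depth

# Example 8: a7 has code 1.1.2 and a4 has code 1.1.1.1; the summary puts
# the shared ancestor (specified by a_p2) at depth 2.
ok = virtual_branch_match([1, 1, 2], [1, 1, 1, 1], 2)
```

With the common prefix 1.1 of length 2, the two matches are confirmed to share a data-node ancestor classified by the required summary node, so the query matches.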
An advantage of using virtual matches for internal query nodes is that one avoids the issues with skipping parent matches discussed in Section 2.3.1.2. A downside is that the Dewey encoding requires O(d) space per data node, linear in the maximal depth of the data.
2.4 Related problems <strong>and</strong> solutions<br />
This background chapter featured a high-level description <strong>of</strong> some <strong>of</strong> the concepts that<br />
are used in the research papers included in this thesis. A number <strong>of</strong> related issues that I<br />
have not covered are briefly listed below.<br />
The strategy <strong>of</strong> using indexing <strong>and</strong> joins is considered to be the most efficient out<br />
<strong>of</strong> three common strategies [15], where the other two are navigation <strong>and</strong> subsequence<br />
matching. In navigation-based approaches, the data tree is indexed similarly to join-based approaches, but instead of joining together matches for different parts of the query,
only the matches for one part are read from the index, <strong>and</strong> for each such partial match,<br />
the rest <strong>of</strong> the query is matched by navigating in the data tree. Subsequence indexing is<br />
a radically different indexing approach, where both data <strong>and</strong> query trees are converted to<br />
sequences with special properties, such that a subsequence match for the query sequence<br />
in the data sequence indicates a tree match [48, 40, 49].<br />
When the underlying data is not a tree, but a general graph, many aspects <strong>of</strong> pattern<br />
matching become more complex. In Section 2.1.4 it was described how a simple tree<br />
position encoding using constant space per data node could be used to decide P–C <strong>and</strong><br />
A–D relationships in constant time. Unfortunately, this is not as simple in general graphs,<br />
where encodings giving constant time decision are expensive to compute <strong>and</strong> store, while<br />
encodings with small space requirements that can be computed efficiently give expensive<br />
decision <strong>of</strong> node relationships [55].<br />
Note that for general graph data, partitioning nodes on their sets <strong>of</strong> incoming <strong>and</strong>/or<br />
outgoing paths is actually PSPACE-complete [43]. Luckily there exist refinements of this
ideal partition that are tractable. Recursively partitioning nodes on their label <strong>and</strong> the<br />
blocks <strong>of</strong> their parent <strong>and</strong>/or child nodes gives a refinement that is usually very close to<br />
the ideal in practice [36, 29].<br />
As discussed in Section 2.1.2.1, a difference between XPath evaluation <strong>and</strong> TPM is<br />
that in the former, the result is all matches for the output node in the query, while in<br />
the latter, the result is all legal combinations <strong>of</strong> matches for the query nodes. There is<br />
a third output type called filtering [42] where the output is the set <strong>of</strong> documents where<br />
there exists at least one match for the query. This is useful in many information retrieval<br />
settings.<br />
The full XPath language is complex, and the first algorithms for evaluating XPath queries in polynomial time were presented only years after the language standard was formalized [14]. A method for extending a TPM solution to cover the different axes in XPath is to use an algorithm to rewrite queries to use a smaller set of axes. What is
missing from TPM is any notion <strong>of</strong> order between matches for query nodes not related by<br />
ancestry, <strong>and</strong> it has been shown that all XPath queries can be rewritten to queries using<br />
only the self, child, descendant <strong>and</strong> following axes [57]. Adding checks <strong>of</strong> the relative<br />
ordering <strong>of</strong> data nodes in a match can be done with post-processing, checks during result<br />
enumeration, or by modifying the intermediate result construction. The latter is required<br />
for optimal evaluation.<br />
There are many pattern matching problems on trees that are related to TPM, such as<br />
ordered <strong>and</strong> unordered tree inclusion, path inclusion, region inclusion, child inclusion <strong>and</strong><br />
the subtree problem [31]. Of these, the most similar to twig pattern matching is unordered<br />
tree inclusion (UTI). The differences between UTI <strong>and</strong> TPM are that in the former all<br />
query edges are <strong>of</strong> type A–D, <strong>and</strong> that for any match function M, M(u) = M(v) if <strong>and</strong><br />
only if u = v. The latter requirement makes UTI NP-complete [31].<br />
Chapter 3<br />
Research Summary<br />
“There is nothing like looking, if you want to find something.<br />
You certainly usually find something, if you look,<br />
but it is not always quite the something you were after.”<br />
– J.R.R. Tolkien<br />
This chapter gives a brief overview <strong>of</strong> the research contained in this thesis. Section 3.2<br />
lists the papers included, describes the research process <strong>and</strong> the roles <strong>of</strong> my co-authors,<br />
<strong>and</strong> gives a short evaluation <strong>of</strong> each paper in retrospect. Section 3.3 discusses the methodology<br />
used in my research, <strong>and</strong> Section 3.4 tries to evaluate the contributions in the papers<br />
in the light <strong>of</strong> the research questions from Section 1.4. Section 3.5 gives a short list <strong>of</strong><br />
future work I find interesting, and Section 3.6 concludes the evaluation of the research.
3.1 Formalities<br />
The thesis is a paper collection submitted for partial fulfillment <strong>of</strong> the requirements for<br />
the degree <strong>of</strong> philosophiae doctor. I have been enrolled in a four year PhD program<br />
at the <strong>Department</strong> <strong>of</strong> <strong>Computer</strong> <strong>and</strong> <strong>Information</strong> Science at the Norwegian University<br />
<strong>of</strong> Science <strong>and</strong> Technology. Three years worth <strong>of</strong> financing was given by the Research<br />
Council <strong>of</strong> Norway under the grant NFR 162349, <strong>and</strong> one year <strong>of</strong> financing was given by<br />
the department in exchange for 25% duty work during my stay.<br />
I started in the PhD program in the summer of 2005, and it has taken an additional year to
complete it. From August 2008 I was on a full <strong>and</strong> later partial sick leave due to tendinitis<br />
in both h<strong>and</strong>s, which I had acquired from rock climbing. During the partial sick leave<br />
I started setting up voice recognition s<strong>of</strong>tware for programming, <strong>and</strong> most <strong>of</strong> the C++<br />
implementation used in Paper 3 was actually written using this setup. Six extra months of financing were given by the iAD project, sponsored by the Research Council of Norway, because I had been without a supervisor for an extended period.
In the PhD program it is m<strong>and</strong>atory to take five courses, <strong>and</strong> I have taken the following:<br />
• DT8101 Highly Concurrent Algorithms<br />
• DT8102 Database Management Systems<br />
• DT8108 Topics in <strong>Information</strong> Technology<br />
• TDT4215 Knowledge in Document Collections<br />
• TT8001 Pattern Recognition<br />
In my duty work for the department I have worked with the following:<br />
• The Nordic Collegiate Programming Contest<br />
• TDT4120 Algorithms <strong>and</strong> Data Structures<br />
• TDT4215 Algorithm Construction, Advanced Course<br />
• TDT4287 Algorithms for Bioinformatics<br />
3.2 Publications <strong>and</strong> research process<br />
After finishing my master's thesis on substring indexing [16] I wanted to continue this
research, but this venture was ab<strong>and</strong>oned for various reasons described in Appendix A<br />
(Paper 7), <strong>and</strong> resulted in a technical report [17]. After that I started the research on<br />
XML indexing, which resulted in the papers listed below. In addition I have co-authored<br />
some other papers on indexing <strong>and</strong> search in other types <strong>of</strong> data together with fellow PhD<br />
student Truls Amundsen Bjørklund <strong>and</strong> others. These are listed in Appendix A (Papers 8<br />
<strong>and</strong> 9).<br />
3.2.1 Paper 1<br />
Authors: <strong>Nils</strong> <strong>Grimsmo</strong>.<br />
Title: Faster Path Indexes for Search in XML Data [18].<br />
Publication: Proceedings of the Nineteenth Australasian Database Conference (ADC 2008).
Abstract: This article describes how to implement efficient memory resident path indexes<br />
for semi-structured data. Two techniques are introduced, <strong>and</strong> they are shown to<br />
be significantly faster than previous methods when facing path queries using the descendant<br />
axis <strong>and</strong> wild-cards. 1 The first is conceptually simple <strong>and</strong> combines inverted lists,<br />
selectivity estimation, hit expansion <strong>and</strong> brute force search. The second uses suffix trees<br />
with additional statistics <strong>and</strong> multiple entry points into the query. The entry points are<br />
partially evaluated in an order based on estimated cost until one <strong>of</strong> them is complete.<br />
Many path index implementations are tested, using paths generated both from statistical<br />
models <strong>and</strong> DTDs.<br />
Research process<br />
When I started working with XML search my supervisor Dr. Torbjørnsen had a plan<br />
for the architecture <strong>of</strong> a commercial XML indexing system. This system was to use<br />
1 Note that wild-cards are not mentioned in Chapter 2, but can be supported by fixing the level<br />
difference between matches for above <strong>and</strong> below query nodes when necessary, or simply by reading all<br />
data nodes for the wild-card query node.<br />
path indexing for XML structure elements <strong>and</strong> inverted files for textual content. My<br />
first assignment was to design an efficient path index and, as a first step, to look only at single
paths. I used my experience from string indexing, <strong>and</strong> came up with one solution based<br />
on joins, <strong>and</strong> one based on suffix trees. Both used selectivity estimates <strong>and</strong> opportunistic<br />
optimizations.<br />
Retrospective view
The solutions I came up with for solving the problem at hand were rather advanced,
especially the opportunistic algorithm using suffix trees. The focus of the work was on
practical performance, and given the interplay between the path index and the value
index that was planned in the underlying design, I feel that the solutions I found were
good. On the other hand, from a more theoretical viewpoint the worst-case behavior of
the solutions is not attractive, as individually matching paths can give large intermediate
results. An asymptotically better approach is to use a twig join algorithm on the path
summary, as described in Section 2.2.2.

As the size of the path index is small compared to the size of the data for many
document collections, it is a fair question whether path index lookups are
important for total query evaluation time in a complete search system.
3.2.2 Paper 2
Authors: Nils Grimsmo and Truls Amundsen Bjørklund.
Title: On the Size of Generalised Suffix Trees Extended with String ID Lists [19].
Publication: Technical Report IDI-TR-2007-01, Norwegian University of Science and
Technology, Trondheim, Norway, 2007.
Abstract: The document listing problem can be solved with linear preprocessing and
optimal search time by using a generalised suffix tree, additional data structures and
constant-time range minimum queries. A simpler solution is to use a generalised suffix
tree in which internal nodes are extended with a list of all string IDs seen in the subtree
below the respective node. This report makes some remarks on the size of such a structure.
For the case of a set of equal-length strings, a bound of Θ(n√n) for the worst-case space
usage of such lists is given, where n is the total length of the strings.
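The extended structure can be illustrated with a naive generalised suffix *trie* (quadratic construction, unlike the linear-size suffix tree the report concerns), where every node keeps the set of string IDs occurring below it. This sketch is mine, meant only to show the ID lists whose total size the report bounds:

```python
def build_trie_with_id_lists(strings):
    """Naive generalised suffix trie: insert every suffix of every string,
    and record at each node the set of IDs of strings whose suffixes pass
    through it (the 'string ID list' of the report)."""
    root = {"ids": set(), "children": {}}
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            node["ids"].add(sid)
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"ids": set(), "children": {}})
                node["ids"].add(sid)
    return root

def document_listing(root, pattern):
    """Return the IDs of all strings containing pattern: walk down the
    trie and report the ID list at the node reached."""
    node = root
    for ch in pattern:
        if ch not in node["children"]:
            return set()
        node = node["children"][ch]
    return node["ids"]
```

Search time is proportional to the pattern length plus the answer size; the price is the space consumed by the ID lists, which is what the Θ(n√n) bound addresses.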
Research process
During the implementation of the system for Paper 1, I found some recent work which
used an extension of generalized suffix trees [58]. Suffix trees have attractive asymptotic
complexity, but as the authors did not comment on the complexity of their extension, I
felt that I should investigate it before I based my work on their solution. As there was
no space for these results in Paper 1, they were published as a technical report.

Roles of the authors
Bjørklund helped write and simplify the proofs, and helped write the paper itself.
CHAPTER 3. RESEARCH SUMMARY
Retrospective view
This report only investigated worst-case space usage as a function of the total length of
the strings indexed. It could also have been of interest to find the best-case and average-case
space usage, and to express the complexity as a function of the number of strings
and their average length.
3.2.3 Paper 3
Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Øystein Torbjørnsen.
Title: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes [24].
Publication: Proceedings of the Second International Conference on Advances in
Databases, Knowledge, and Data Applications (DBKDA 2010).
Abstract: XML indexing and search has become an important topic, and twig joins
are key building blocks in XML search systems. This paper describes a novel approach
using a nested loop twig join algorithm, which combines several existing techniques to
speed up evaluation of XML queries. We combine structural summaries, path indexing
and prefix path partitioning to reduce the amount of data read by the join. This effect
is amplified by only reading data for leaf query nodes, and inferring data for internal
nodes from the structural summary. Skipping is used to speed up merges where query
leaves have differing selectivity. Multiple access methods are implemented as materialized
views instead of succinct secondary indexes for better locality. This redundancy is made
affordable in terms of space by using compression in a back-end with columnar storage.
We have implemented an experimental prototype, which shows a speedup of two orders
of magnitude on XPath queries with value predicates, when compared to existing open
source and commercial systems using a subset of the techniques. Space usage is also
improved.
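The virtualization of internal query nodes exploits the fact that a Dewey ID encodes the whole root-to-node path, so an ancestor match is an implicit prefix of a leaf's Dewey and never needs to be materialized. A minimal sketch of that idea (my simplification: real joins also handle axes and level constraints):

```python
def ancestor_dewey(leaf_dewey, depth):
    """The match for an internal query node at the given depth is just a
    prefix of the leaf's Dewey ID -- no identifier is generated for it."""
    return leaf_dewey[:depth]

def leaves_join_at(dewey_a, dewey_b, branch_depth):
    """Two leaf matches can share a branching-node match iff their Deweys
    agree on the prefix down to the branching node's depth."""
    return dewey_a[:branch_depth] == dewey_b[:branch_depth]
```

Because prefixes order and compare like the full IDs, the join never has to read or store separate streams for internal query nodes.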
Research process
This was an attempt to build an academic prototype for a high performance XML search
system, based on design ideas by my supervisor Dr. Torbjørnsen and myself. The system
turned out rather complex and had many features, and in my view at the time, substantial
academic contributions. The main contribution was the virtualization of internal query
nodes, which I thought was a novelty. A paper was submitted to XSym 2009, but was
rejected, mainly because of missing references to previous work, but also because of a
weak description of the join algorithms used. Virtual streams for branching internal
query nodes had been presented in independent papers from 2004 [53] and 2005 [34]. We
then rewrote the presentation of the paper, and the final form was more of an experimental
systems paper that featured some minor novelties.
Roles of the authors
Both Bjørklund and Torbjørnsen were part of all phases of this work, except the implementation
of the system. Bjørklund contributed mainly with knowledge about columnar
storage and compression schemes. Torbjørnsen contributed with general knowledge of
database systems and search engines, and also had many of the initial design ideas for
the system.
Retrospective view
In this research I mostly compared my full system with other full systems, and tried to
explain performance differences in the experiments based on which features each of the
systems used. The experiments would probably have been more academically interesting
if, in addition, I had extended my system such that all the different features could be
turned on and off to isolate their effects.

I learned two important lessons from this project. The first is that I should have
spent considerably more time reading previous research, and less time focusing on my
own ideas in the beginning. The second lesson is that it is not very tactical for a PhD
student working mostly on his own to spend a year implementing a large system, unless
it is certain that it will bear fruit in terms of publications. Large systems should be left
to large research groups with long-term projects.
3.2.4 Paper 4
Authors: Nils Grimsmo and Truls Amundsen Bjørklund.
Title: Towards Unifying Advances in Twig Join Algorithms [20].
Publication: Proceedings of the 21st Australasian Database Conference (ADC 2010).
Abstract: Twig joins are key building blocks in current XML indexing systems, and numerous
algorithms and useful data structures have been introduced. We give a structured,
qualitative analysis of recent advances, which leads to the identification of a number of
opportunities for further improvements. Cases where combining competing or orthogonal
techniques would be advantageous are highlighted, such as algorithms avoiding redundant
computations and schemes for cheaper intermediate result management. We propose some
direct improvements over existing solutions, such as reduced memory usage and stronger
filters for bottom-up algorithms. In addition we identify cases where previous work has
been overlooked or not used to its full potential, such as for virtual streams, or the benefits
of previous techniques have been underestimated, such as for skipping joins. Using
the identified opportunities as a guide for future work, we are hopefully one step closer
to unification of many advances in twig join algorithms.
Research process
After the disappointment of having reinvented the wheel when working with Paper 3,
I started doing a more thorough literature review to get a deeper understanding of the
different aspects of XML indexing, and of twig joins in particular. My notes from this study
gradually grew into a mixture of a survey paper and a long list of research ideas. As most
conferences do not accept survey papers, and because I did not feel I had time for the
process of journal publication, the focus of the paper was turned to the list
of research opportunities.
Roles of the authors
Bjørklund helped structure the paper, was part of discussions about the different research
opportunities listed, and helped write the final version.
Retrospective view
This paper features a long list of ideas. Some of these may be of interest to other researchers,
while others probably are not. It would of course have been a much greater
contribution to the research community if this work had been presented at a more visible
venue, such as a journal, but this would have required much more time, and probably the help
of co-authors experienced in the field of twig joins.

In retrospect, it may be backwards to write such a literature review at the end of a
PhD, and I think my academic development would have benefited from doing it at an
earlier stage. On the other hand, with no experience I probably would not have been able
to analyze previous work with the same understanding.
3.2.5 Paper 5
Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland.
Title: Fast Optimal Twig Joins [21].
Publication: Proceedings of the 36th International Conference on Very Large Data Bases
(VLDB 2010).
Abstract: In XML search systems twig queries specify predicates on node values and
on the structural relationships between nodes, and a key operation is to join individual
query node matches into full twig matches. Linear time twig join algorithms exist, but
many non-optimal algorithms with better average-case performance have been introduced
recently. These use somewhat simpler data structures that are faster in practice, but have
exponential worst-case time complexity. In this paper we explore and extend the solution
space spanned by previous approaches. We introduce new data structures and improved
strategies for filtering out useless data nodes, yielding combinations that are both worst-case
optimal and faster in practice. An experimental study shows that our best algorithm
outperforms previous approaches by an average factor of three on common benchmarks.
On queries with at least one unselective leaf node, our algorithm can be an order of
magnitude faster, and it is never more than 20% slower on any tested benchmark query.
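The filtering condition behind such algorithms can be stated compactly: a data node is worth keeping for a query node only if its label matches and every child of the query node is matched somewhere in the data node's subtree. A naive sketch of this subtree-match check (my illustration of the condition itself, not one of the paper's optimal algorithms):

```python
def descendants(node):
    """Yield all proper descendants of a data node."""
    for child in node["children"]:
        yield child
        yield from descendants(child)

def roots_subtree_match(node, query):
    """True iff `node` roots a match of the twig `query`, where a query is
    a tuple (label, child_query, ...) and all edges are descendant edges.
    Nodes failing this test can be filtered out before the join."""
    label, *subqueries = query
    if node["label"] != label:
        return False
    return all(any(roots_subtree_match(d, q) for d in descendants(node))
               for q in subqueries)
```

The optimal algorithms in the paper enforce (strengthenings of) this condition without the repeated subtree scans the naive version performs.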
Research process
This work started off as an idea for space saving in twig joins, listed as Opportunity 3 in
Paper 4, but then I noticed that the same idea could be used to close the gap between
worst-case behavior and practical performance in current twig join algorithms. Also, the
creation of the framework for classifying strategies we presented in this paper was inspired
by Opportunity 4 in Paper 4, which suggested extending the range of different filtering
strategies used in current twig joins.

As many of the recently presented algorithms used very similar underlying data structures,
I implemented a minimalistic system where different filtering strategies and storage
techniques could be switched on and off. The result was a system that spanned a
large space of solutions, where the interplay between the effects of the different features could
be analyzed.
Roles of the authors
Bjørklund was part of the idea phase that led up to this paper and of the discussions about the
development of the system, and gave much constructive feedback during the writing phase.
Hetland helped simplify our formalizations, was of great help writing the theoretical part
of the paper, and had the idea for how to implement the post-processing for our preorder
storage algorithm TJStrictPre.
Retrospective view
I feel that the most important contribution in this paper is the framework we developed
for classifying the different filtering strategies used in twig joins, and the analysis of how
the strategies affect practical and worst-case performance.

On the day before the submission of the camera-ready copy of the paper, I started
thinking about how our novel input stream merger called getPart could be modified to
return data nodes strictly in order. It would indeed be interesting to know whether such a
modification would increase the cost of the merge, because with strict ordering there are
actually more possibilities during intermediate result construction, as you can use a global
stack [21, Appendix H]. My current impression is that with a modified getPart merger,
it should be possible to create an algorithm that is much simpler and more elegant than
the TJStrictPre algorithm presented in this paper, but still as fast.
3.2.6 Paper 6
Authors: Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland.
Title: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation
for Node-Labeled Trees [22].
Publication: Proceedings of the 7th International XML Database Symposium on
Database and XML Technologies (XSym 2010).
Abstract: The F&B-index is used to speed up pattern matching in tree and graph data,
and is based on the maximum F&B-bisimulation, which can be computed in loglinear
time for graphs. It has been shown that the maximum F-bisimulation can be computed
in linear time for DAGs. We build on this result, and introduce a linear algorithm for
computing the maximum F&B-bisimulation for tree data. It first computes the maximum
F-bisimulation, and then refines this to a maximal B-bisimulation. We prove that
the result equals the maximum F&B-bisimulation.
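For trees, one direction of bisimulation can be computed in a single bottom-up pass by hashing the pair (label, set of child blocks). Which direction this corresponds to (F or B) depends on the edge-orientation convention, so treat this purely as my sketch of partition computation on trees, not as the paper's algorithm:

```python
def bisimulation_blocks(root):
    """Partition the nodes of a labeled tree: two nodes share a block iff
    they have the same label and the same *set* of child blocks. One
    bottom-up pass, linear up to dictionary operations."""
    block_of_sig = {}   # (label, frozenset of child block ids) -> block id
    block_of = {}       # id(node) -> block id

    def visit(node):
        sig = (node["label"],
               frozenset(visit(child) for child in node["children"]))
        if sig not in block_of_sig:
            block_of_sig[sig] = len(block_of_sig)
        block_of[id(node)] = block_of_sig[sig]
        return block_of_sig[sig]

    visit(root)
    return block_of
```

The difficulty the paper addresses is the *simultaneous* F&B case, where the two directions must be refined against each other; the one-directional pass above is only the easy half.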
Research process
When I started working on this project, the idea was to write a paper presenting three
contributions: linear F&B-index construction for trees, subtree addition with cost dependent
only on the size of the addition, and a publish/subscribe solution that indexed
queries on what document structures they matched. Note that the second part is Opportunity
9 in Paper 4. After I had found a solution to the first part and implemented it, I
found a published but incorrect algorithm for solving this problem, and felt that perhaps this
deserved a publication of its own.
Our initial contribution was two-fold: linear construction of single-directional bisimilarity
for DAGs, and linear construction of two-directional bisimilarity for trees. One
week before the submission deadline for XSym I found a paper which described a solution
to the first problem, and we had to rewrite the paper slightly. We had to describe how
our solution was different from theirs, and justify why these differences were necessary
for solving the second problem, two-directional bisimilarity for trees.
Roles of the authors
Bjørklund and Hetland both participated in discussions about how to solve the problems,
and in the writing of the paper.
Retrospective view
Some of the reviewers at XSym commented that the paper would benefit from more
examples illustrating the algorithms, and also that the experiments were superfluous
because of the asymptotic improvement. The experiments were therefore removed from
the published version to give room for more examples [22], but can be found in the
extended technical report version [23]. The experiments are on standard XML benchmark
data, where it turns out the old loglinear algorithm performs rather well. It would perhaps
be more interesting to see experiments on random data showing the loglinear worst case,
but this could turn out to be a difficult exercise in constructing data.
3.3 Research methodology
Most of the work presented here can be considered constructive research: given a computational
problem and the set of existing solutions, find a computationally cheaper solution.
Cheaper can mean that the asymptotic running time is less, such as in Paper 6 listed
above, or it can mean that an implementation runs faster for a given set of instances of
the problem on a given computer, such as in Papers 1 and 3, or both, such as in Paper 5.

Empirical methods are needed to draw conclusions about the practical performance
of solutions in general. Unfortunately, this is a weak part of many subfields of computer
science, such as XML indexing, and it is also a weak part in most of my research. Most
papers on twig joins and XPath evaluation use around three standard data sources, run
maybe five to ten standard queries on each data set, and then draw conclusions about
which solution is faster. These queries were typically written by some paper author to
highlight a specific difference between two solutions. This can almost be called qualitative
research. Using a few standard data sets may be the only feasible solution, but it should
be possible to create a large population of standard queries from which samples could be
drawn. Another solution is to create a statistical model from which the data and queries
are generated [17]. This allows drawing conclusions about which solution is better given
properties of the data and queries, such as label alphabet distribution, tree size and shape,
etc.
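Such model-based generation can be sketched as a small random tree generator. The parameters below (weighted label alphabet, per-slot branching probability, depth cap) are illustrative choices of mine, not those of the generator in [17]:

```python
import random

def random_tree(labels, weights, max_depth, branch_p, rng):
    """Draw a labeled tree from a simple statistical model: the label
    comes from a weighted alphabet, and children are appended with
    probability branch_p each until a failure, up to the depth cap."""
    node = {"label": rng.choices(labels, weights)[0], "children": []}
    if max_depth > 0:
        while rng.random() < branch_p:
            node["children"].append(
                random_tree(labels, weights, max_depth - 1, branch_p, rng))
    return node
```

Varying the model parameters then lets one study how solutions behave as a function of data properties, rather than of a handful of fixed benchmark files.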
Some of the work presented here can also be considered descriptive research, such as
the taxonomy of techniques in Paper 4, and the classification of filtering strategies in
previous approaches in Paper 5.
3.4 Evaluation of contributions
This section revisits the research questions from Section 1.4, and tries to evaluate to what
extent they have been answered by the research papers that constitute this thesis.

3.4.1 Research questions revisited
Below follows a brief list of my contributions, grouped by research question.
1. RQ1: How can matches for tree queries be joined more efficiently?
• Paper 3: The XLeaf system uses multiple materialized views and selectivity
estimates to reduce the amount of data read during XPath evaluation. Query
leaves are looked up on either value, path or tag, and then possibly filtered on
path and/or value.
• Paper 3: Information from the summary matching is used to determine whether
or not a linear join can be employed. This is stronger than previous approaches,
which determined this only from properties of the query [42].
• Paper 3: Both the linear and the nested loop join in the XLeaf system store
intermediate results of negligible size, and achieve this by never materializing
virtual matches for internal query nodes, using only implicit prefixes of descendant
leaf Deweys. Previous approaches for virtual matches explicitly generate
node identifiers [53, 34].
• Paper 3: The compressed Dewey encoding used in the XLeaf system allows
faster joins, because the Deweys can be compared without decompression.
• Paper 5: The TJStrictPost and TJStrictPre algorithms combine worst-case
optimality [7] with good practical performance [33], bridging a gap between
current twig join algorithms.
• Paper 5: The getPart input stream merger gives inexpensive weak full match
filtering, and a considerable speedup during intermediate result construction
compared to the previous getNext input stream merger [5], which gave weak
subtree match filtering.
2. RQ2: How can pattern matching in the dictionary be done more efficiently?
• Paper 1: EsIe3 was a novel path indexing strategy combining tuple indexing,
selectivity estimates, and nested-loop lookups. It was shown experimentally
to be much faster than previous approaches based on merge joins [32].
• Paper 1: Smfe was a new opportunistic approach to using suffix trees for
matching path patterns, which was considerably faster than previous similar
methods [58].
• Paper 3: To join branch matches when using path indexing, it must be
known how different data paths match the query paths, to make sure different
branches in the query use the same data nodes for branching points. To
avoid storing the potentially exponential number of ways the paths can match
the data, the XLeaf system stores, for each path matched, the set of usable
matches for the nearest ancestor branching nodes.
• Paper 3: Using the meta-information from the path summary matching, for
each leaf stream alignment, the joins in the XLeaf system use a simple bottom-up
then top-down traversal of the query tree, which determines which branching
nodes can be used, and whether or not the leaf Deweys match down to
these branching points.
3. RQ3: How can structure indexes be constructed faster and using less space?
• Paper 2: We determined the worst-case space complexity of a previous path
indexing strategy building on an extension of suffix trees [58]. This gives a
more balanced view of the usefulness of this data structure.
• Paper 3: We showed in the XLeaf system how multiple materialized views
of a structure index could be made affordable with compression and shared
dictionaries in a column store back-end. The system used less space for three
materialized views than a state-of-the-art system without such compression [4]
uses for one view.
• Paper 6: The cost of construction of the forward and backward path index for
tree data was reduced from loglinear to linear, improving the usability of these
strong structure indexes [28].
3.4.2 Opportunities revisited
In Paper 4 we listed a number of different research opportunities, and below I list the
developments I have found for each of them, in my own research and others'. It may be
advantageous to skip this section and revisit it after reading the paper. As Paper 4 was
written after the XLeaf system in Paper 3 was developed, I already had partial answers
to some of the questions I posed, but at the time of writing Paper 4, I did not know
whether or not Paper 3 would be published.
1. Removing redundant computation in top-down one-phase joins - This is a point we
did not pursue in the twig joins in Paper 5, something which could have resulted in
even faster input stream merging.
2. Top-down memory usage - The TJStrict algorithms in Paper 5 use strict matching
equivalent to what was proposed here to reduce the number of nodes added to
intermediate results.
3. Bottom-up memory usage - It is possible to add some more inexpensive space savings
to the TJStrict algorithms as proposed in Paper 4, and to use early enumeration
when the topmost query root match is closed [7]. More aggressive space savings,
requiring more dynamic storage of intermediate results, have been presented recently
[35].
4. Stronger filters - This topic was treated extensively in Paper 5.
5. Unification or assimilation - Paper 5 may have brought us a small step closer to
unifying top-down and bottom-up approaches, but as stated in the conclusion, it
would be nice to see a solution that is simpler and more elegant than TJStrictPre,
but still as efficient.
6. Holistic effective ancestor skipping - As this is just the combination of two previous
techniques (see Section 2.3.1), it probably holds little academic interest.
7. Simpler and faster skipping data structures - I have seen no advances on this point,
and I am not sure whether or not it is of academic interest.
8. Updates in stronger summaries - During the background research for Paper 6 we
found papers covering updates in indexes based on single-directional bisimulations
for graph data [28, 54, 41], but none for the bidirectional case. This is probably of
less practical interest than the following opportunity.
9. Hashing F&B summaries - My original plan for Paper 6 featured this as the main
contribution, but I have not had time to explore it. I still believe investigating
this could yield interesting results.
10. Exploring summary structures and how to search them - As noted in Paper 4, the
recently proposed ideas of multi-level structure indexes [52] and using twig join
algorithms to do path summary matching [2] may be part of the ultimate structure
indexing package, but a thorough experimental investigation of what benefits the
different techniques give in different cases is still missing.
11. Access methods for multiple matching partitions - A recent paper [3] compared the
advanced access methods in iTwigJoin [8] with simply merge joining the different
streams of matching nodes for a query node, and found that the latter scaled better.
The method we used in our XLeaf system in the case of multiple matching paths
was to read a stream of label-matching nodes, and filter on path using a bit vector.
This is probably more efficient than merging path matches when accessing on path
is not much more selective than accessing on tag. An optimizer could choose which
access method to use based on cost estimates.
12. Improved virtual streams - Paper 3 made some progress on how to store summary
matching meta-information and how to avoid materializing virtual node positions,
but I believe that a better solution is waiting to be designed, and that it would use
features both from Virtual Cursors [53], TJFast [34] and XLeaf [24].
13. Holistic skipping among leaf streams - As noted in Paper 4, Virtual Cursors [53] comes very close to implementing holistic skipping and so-called optimal data access. The only piece missing is to always generate virtual internal query node matches from the leaf below with the greatest Dewey code, as is done in the XLeaf system when the simple linear join is used.
14. Identifying and using difficulty classes - The XLeaf system decides whether or not a linear join algorithm can be used based on whether the tree depth of a node is fixed after the path summary matching. This is a stronger criterion than in previous approaches [42].
3.5 Future work
Below I list a few directions I would pursue as a continuation of my research.
3.5.1 Strong structure summaries for independent documents
This was proposed as Opportunity 9 in Paper 4. The conjecture is that if the underlying data is a set of unconnected graphs, then the structure summary for the forward and backward path partition will also be a set of unconnected graphs. This is shown for a collection of independent tree-shaped documents in Figure 3.1. Note that for trees, the document roots are partitioned into blocks represented by roots in the summary. Also, the only change caused by adding a virtual root for the documents is the addition of a virtual root in the summary.
[Figure 3.1 shows documents D_1 ... D_n above and summary trees S_1 ... S_m below.]
Figure 3.1: Shape of the forward and backward path summary (below) for a set of independent documents with a virtual root (above).
The idea is that when adding a new document, its nodes will either be mapped to summary nodes in a single tree in the summary, or a new tree will be created in the summary. When a new document is seen, the structure summary tree for this document can be constructed independently of the summaries for the previous documents. If this summary tree is label-isomorphic to an existing top-level subtree in the global summary, we translate the classification of the document nodes to the equivalent existing classes; otherwise, the document summary is added as a new top-level subtree in the summary. This means we need a lookup structure on the shapes of the summary trees, for example hashing of sorted trees, or some form of tree automaton.
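The lookup on summary tree shapes could, for instance, hash a canonically sorted form of each tree. A minimal sketch, assuming sibling order is irrelevant for summary identity and representing a tree as (label, [children]); the names are illustrative, not the thesis' data structures:

```python
def canonical_form(tree):
    """Bottom-up canonical form of a labeled tree given as (label, [children]).

    Sibling forms are sorted, so two trees get the same (hashable) form
    exactly when they are label-isomorphic up to sibling order.
    """
    label, children = tree
    return (label, tuple(sorted(canonical_form(c) for c in children)))

def classify(document_summary, known_summaries):
    """Return the existing top-level summary tree equal in shape to this
    document's summary, or register the new one. `known_summaries` maps
    canonical forms to previously seen summary trees."""
    key = canonical_form(document_summary)
    return known_summaries.setdefault(key, document_summary)
```

If `classify` returns an existing tree, the document's node classes can be translated to the equivalent existing classes; otherwise the new summary becomes a new top-level subtree, as described above.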
3.5.2 A simpler fast optimal twig join
The getPart input stream merger presented in Paper 5 has one major flaw: like the original getNext input stream merger [5], it outputs data nodes in so-called local preorder [21]. As discussed in our paper, a global strict preorder is needed when using a global stack to open nodes in preorder and close them in postorder, which in turn is needed for on-the-fly postorder subtree match filtering. In TJStrictPre, a postorder processing pass over the data is therefore used to perform the subtree match filtering, which is required for optimality.

As full weak match filtering removes many nodes, a combination of a getPart variant with global preorder output and a simple tree-based intermediate result construction as described in Section 2.1.6 may be competitive with TJStrictPre. Such an algorithm would be significantly more elegant and simpler to implement.
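The global stack discipline referred to above can be sketched as follows, assuming nodes arrive in global strict preorder with (start, end) region encodings. This is an illustration of the open/close pattern, not TJStrictPre itself:

```python
def open_close_events(nodes):
    """Turn a stream of (start, end) regions, sorted in global preorder,
    into 'open' events in preorder and 'close' events in postorder,
    using a single global stack. The close event is the point where
    on-the-fly postorder subtree-match filtering could hook in."""
    stack = []
    for node in nodes:
        # Anything on the stack that ends before this node starts is
        # fully processed and can be closed (in postorder).
        while stack and stack[-1][1] < node[0]:
            yield ("close", stack.pop())
        stack.append(node)
        yield ("open", node)
    while stack:
        yield ("close", stack.pop())
```

With only local preorder, a node may arrive after a non-ancestor that is still open, which is why a global strict preorder is needed for this single-stack scheme to be correct.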
3.5.3 Simpler and faster evaluation with non-output nodes
As described in Section 2.1.2.1, XPath queries have a single output node. Using a regular twig join algorithm and removing duplicates during post-processing can be very inefficient, but the Full Match Theorem from Paper 5 makes it simple to modify a twig join algorithm to get complexity linear in the input and output also with a single output node. As long as nodes are filtered first on being subtree matchers in postorder, and then on being prefix path matchers in preorder, all remaining nodes are part of a full match. We can then simply output the remaining matches for the output query node. Note that the second filtering pass is easier to implement when using tree-based intermediate data structures as suggested in Section 3.5.2.
The two-pass full match filtering approach may also be useful when dealing with multiple output nodes in so-called generalized tree pattern matching [9, 7]. In cases where the output nodes form a connected tree, linear enumeration is easily obtained: Assume we use intermediate data structures similar to those constructed by Algorithm 2 (page 28), and that the enumeration in Algorithm 1 (page 22) traverses the query in preorder. We can then start the recursive enumeration at the root of the output subtree, recurse only on output query nodes, and output the current match M′ when its size equals the number of output query nodes.
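The enumeration restricted to a connected subtree of output query nodes can be sketched as below. The intermediate-result layout (a dict from (query node, data match) to compatible child matches) is an assumption for illustration, not Algorithm 2 verbatim:

```python
from itertools import product

def enumerate_output_matches(q, m, out_children, result_tree):
    """Yield every output-node combination below data match m of query node q.

    out_children[q] lists q's output query children (the output nodes form
    a connected subtree); result_tree[(q, m)][c] lists the data matches of
    output child c that are compatible with m. Layout is illustrative.
    """
    kids = out_children.get(q, [])
    if not kids:
        yield {q: m}
        return
    # One independent choice per output child: enumerate each child's
    # partial matches, then take the cartesian product across children.
    per_child = [
        [sub for cm in result_tree[(q, m)][c]
             for sub in enumerate_output_matches(c, cm, out_children, result_tree)]
        for c in kids
    ]
    for combo in product(*per_child):
        match = {q: m}
        for part in combo:
            match.update(part)
        yield match
```

Each emitted dict covers exactly the output query nodes reachable from the root of the output subtree, so emission happens precisely when the partial match has full size.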
Based on this approach, it may also be possible to formulate a simpler and more elegant enumeration algorithm for the general case with possibly unconnected output nodes [7].
3.5.4 Ultimate data access shoot-out
It would be interesting to know which would perform better in practice: a system using virtual streams and skipping, improved as suggested in Opportunities 12 and 13 in Paper 4, or a system with explicit streams and skipping, improved as suggested in Opportunity 6. With virtual streams the total amount of data read is probably lower (also when skipping), and the advanced data structures for ancestor skipping are avoided. On the other hand, node encodings enabling the ancestor reconstruction needed for virtual streams have O(d) space usage per encoded position, where d is the maximal depth of the data.

Deciding which is the better approach in practice is no simple task. It would require experienced programmers with good knowledge of modern hardware, man-hours to optimize both solutions sufficiently, and a thorough empirical evaluation.
3.6 Conclusions
This thesis presents a number of contributions on the construction and use of structure indexes, and on twig join algorithms. A strength of the thesis is that it combines work on the practical efficiency of implementations with work on theoretical aspects, that is, asymptotic complexity. A weakness is that it does not consider the broader picture: What is search in XML data used for in practice? And are the advances in twig joins presented here useful for real-world systems? I believe that the answer to the latter question is yes. In Paper 3 we built a system that combined new and previous techniques in a new way, such that a large group of queries were evaluated orders of magnitude faster than in current state-of-the-art open source and commercial systems. This should definitely be of interest to system implementors. Also, I believe that the improved filtering strategies and intermediate data structures introduced in Paper 5 are applicable independently of the data partitioning and data access methods used. Whether the faster construction of F&B-indexes for trees presented in Paper 6 is useful depends on whether F&B-indexing will be considered favorable over, for example, simple path indexing in the future. The recently introduced multi-level indexing [52] may make F&B-indexes more attractive in use.
A lesson learned during my research is a general strategy for how to process trees, which inspired the name of the thesis. When implementing virtual matches in Paper 3, candidate sets for branching nodes are calculated bottom-up in the query tree, before matches are chosen top-down. The Full Match Theorem from Paper 5 states that only nodes part of a full match remain after filtering nodes bottom-up on matched subtrees, and then top-down on matched prefix paths. In Paper 6 it is shown how the maximum forward and backward bisimulation can be computed for tree data by first computing the maximum forward bisimulation bottom-up in the tree, and then refining it to the maximum backward bisimulation top-down.
The work presented in this thesis builds heavily on the research presented by the community during the last ten years. Hopefully some of my contributions will also be built on, and result in further advances in the field.
Bibliography

[1] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002. 2.1.1

[2] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008. 2.2.2, 2.2.4, 10

[3] Radim Bača and Michal Krátký. On the efficiency of a prefix path holistic algorithm. In Proc. XSym, 2009. 2.2.2, 11

[4] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger, and Jens Teubner. MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In Proc. SIGMOD, 2006. 3

[5] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002. 2.1.1, 2.1.5, 2.1.5, 2.1.7, 1, 3.5.2

[6] Qun Chen, Andrew Lim, and Kian Win Ong. D(k)-index: an adaptive structural summary for graph-structured data. In Proc. SIGMOD, 2003. 2.2.4

[7] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig2Stack: bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006. 2.1.1, 2.1.2, 2.1.2.1, 2.1.5, 2.1.5, 2.1.6, 2.1.6, 2.1.8, 1, 3, 3.5.3

[8] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005. 2.2.2, 2.2.2, 11

[9] Zhimin Chen, H. V. Jagadish, Laks V. S. Lakshmanan, and Stelios Paparizos. From tree patterns to generalized tree patterns: on efficient evaluation of XQuery. In Proc. VLDB, 2003. 3.5.3

[10] Byron Choi, Malika Mahoui, and Derick Wood. On the optimality of holistic algorithms for twig queries. In Proc. DEXA, 2003. 2.2

[11] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. McGraw-Hill Higher Education, 2001. 2.1.7

[12] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing cursor movement in holistic twig joins. In Proc. CIKM, 2005. 2.3.1.3

[13] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. VLDB, 1997. 2.2.2

[14] Georg Gottlob, Christoph Koch, and Reinhard Pichler. Efficient algorithms for processing XPath queries. In Proc. VLDB, 2002. 2.4

[15] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. IEEE Trans. Knowl. and Data Eng., 2007. 1.3, 2.4

[16] Nils Grimsmo. Dynamic indexes vs. static hierarchies for substring search. Master's thesis, Norwegian University of Science and Technology, 2005. 3.2, 7

[17] Nils Grimsmo. On performance and cache effects in substring indexes. Technical Report IDI-TR-2007-04, NTNU, Trondheim, Norway, 2007. 3.2, 3.3, 7

[18] Nils Grimsmo. Faster path indexes for search in XML data. In Proc. ADC, 2008. (document), 3.2.1

[19] Nils Grimsmo and Truls Amundsen Bjørklund. On the size of generalised suffix trees extended with string ID lists. Technical Report IDI-TR-2007-01, NTNU, Trondheim, Norway, 2007. (document), 3.2.2

[20] Nils Grimsmo and Truls Amundsen Bjørklund. Towards unifying advances in twig join algorithms. In Proc. ADC, 2010. (document), 3.2.4

[21] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Fast optimal twig joins. In Proc. VLDB, 2010. (document), 2.1.5, 2.1.7, 3.2.5, 3.2.5, 3.5.2

[22] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees. In Proc. XSym, 2010. (document), 3.2.6, 3.2.6

[23] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees (extended version). Technical Report IDI-TR-2010-10, NTNU, Trondheim, Norway, 2010. 3.2.6

[24] Nils Grimsmo, Truls Amundsen Bjørklund, and Øystein Torbjørnsen. XLeaf: Twig evaluation with skipping loop joins and virtual nodes. In Proc. DBKDA, 2010. (document), 3.2.3, 12

[25] Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. XR-tree: Indexing XML data for efficient structural joins. In Proc. ICDE, 2003. 2.3.1.2

[26] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003. 2.3.1.2

[27] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007. 2.1.1, 2.1.2, 2.1.5, 2.1.6, 2.1.7

[28] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002. 2.2.3, 2.2.3, 2.2.4, 3, 8

[29] Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, and Raghu Ramakrishnan. On the integration of structure indexes and inverted lists. In Proc. SIGMOD, 2004. 2.2.4, 2.4

[30] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. ICDE, 2002. 2.2.4

[31] Pekka Kilpeläinen. Tree matching problems with applications to structured text databases. Technical Report A-1992-6, Department of Computer Science, University of Helsinki, 1992. 2.4

[32] Krishna P. Leela and Jayant R. Haritsa. Schema-conscious XML indexing. Information Systems, 32:344–364, 2007. 2

[33] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008. 2.1.1, 2.1.2, 2.1.5, 2.1.6, 2.1.7, 1

[34] Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, and Ting Chen. From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005. 2.3.2.3, 3.2.3, 1, 12

[35] Federica Mandreoli, Riccardo Martoglia, and Pavel Zezula. Principles of holism for sequential twig pattern matching. The VLDB Journal, 18(6), 2009. 3

[36] Tova Milo and Dan Suciu. Index structures for path expressions. In Proc. ICDT, 1999. 2.2.2, 2.2.3, 2.2.3, 2.4

[37] Svetlozar Nestorov, Jeffrey D. Ullman, Janet L. Wiener, and Sudarshan S. Chawathe. Representative objects: Concise representations of semistructured, hierarchial data. In Proc. ICDE, 1997. 2.2.2

[38] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 1987. 2.2.3

[39] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007. 2.1.1, 2.1.2, 2.1.5, 2.1.6

[40] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer sequences. In Proc. ICDE, 2004. 2.4

[41] Diptikalyan Saha. An incremental bisimulation algorithm. In Proc. FSTTCS, 2007. 8

[42] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008. 2.4, 1, 14

[43] L. J. Stockmeyer and A. R. Meyer. Word problems requiring exponential time (preliminary report). In Proc. STOC, 1973. 2.4

[44] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proc. SIGMOD, 2002. 2.3.2.2

[45] W3C. XPath 1.0, 1999. http://w3.org/TR/xpath. 1.2.1, 1.2.1, 2.1.2.1

[46] W3C. Extensible markup language (XML) 1.0 (fourth edition), 2006. http://www.w3.org/TR/2006/REC-xml-20060816/. 1.2

[47] W3C. XQuery 1.0, 2007. http://w3.org/TR/xquery. 1.2.1, 2.1.2.1

[48] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In Proc. SIGMOD, pages 110–121, 2003. 2.4

[49] Haixun Wang and Xiaofeng Meng. On the sequencing of tree structures for XML indexing. In Proc. ICDE, pages 372–383, 2005. 2.4

[50] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval. PhD thesis, Ludwig-Maximilians-Universität München, 2006. 2.1.4, 2.1.8

[51] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing gigabytes (2nd ed.): compressing and indexing documents and images. 1999. 2.1.8

[52] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data & Knowl. Eng., 2008. 2.2.4, 10, 3.6

[53] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004. 2.3.2.3, 3.2.3, 1, 12, 13

[54] Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. Incremental maintenance of XML structural indexes. In Proc. SIGMOD, 2004. 8

[55] Jeffrey Xu Yu and Jiefeng Cheng. Graph reachability queries: A survey. In Managing and Mining Graph Data, Advances in Database Systems. Springer US, 2010. 2.4

[56] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001. 2.1.1, 2.1.4

[57] Ning Zhang, Varun Kacholia, and M. Tamer Özsu. A succinct physical storage scheme for efficient evaluation of path queries in XML. In Proc. ICDE, 2004. 2.4

[58] Liang Zuopeng, Hu Kongfa, Ye Ning, and Dong Yisheng. An efficient index structure for XML based on generalized suffix tree. Information Systems, 32:283–294, 2007. 3.2.2, 2, 3
Chapter 4
Included papers

"As for me, all I know is that I know nothing."
– Socrates

This chapter contains the research papers that constitute the main body of this thesis. They have been reformatted to fit the format used here, and figures and tables have been moved, resized and rotated to keep them readable. Some other papers and reports I have authored and co-authored can be found in Appendix A.
Paper 1
Nils Grimsmo
Faster Path Indexes for Search in XML Data
Proceedings of the Nineteenth Australasian Database Conference (ADC 2008)

Abstract This article describes how to implement efficient memory-resident path indexes for semi-structured data. Two techniques are introduced, and they are shown to be significantly faster than previous methods for path queries using the descendant axis and wild-cards. The first is conceptually simple and combines inverted lists, selectivity estimation, hit expansion and brute force search. The second uses suffix trees with additional statistics and multiple entry points into the query. The entry points are partially evaluated in an order based on estimated cost until one of them is complete. Many path index implementations are tested, using paths generated both from statistical models and DTDs.
1 Introduction
With the advent of XML and query languages such as XPath and XQuery came the need for efficient ways to query the structure of XML documents. This article focuses on settings where a document collection can be indexed in advance, as opposed to querying on the fly. For efficient solutions to the latter problem, see for example Gottlob et al. (2005). An important component in many systems indexing XML is a path index (Bertino and Kim 1989) summarising and indexing all unique paths seen in the document collection. (Other names for similar structures are representative objects (Nestorov et al. 1997), DataGuides (Goldman and Widom 1997), and access support relations (Kemper and Moerkotte 1992).) For many document collections following schemas, this set of paths will be small compared to the total size of the data. The path index is in some way connected to a value index (or content index), which allows search for words or values.

XPath is a query language allowing search for regular path expressions in XML documents. It is a simple declarative language, but techniques used for XPath queries can also be components in more advanced procedural query languages such as XQuery. Many FLWOR expressions can also be rewritten to simpler XPath expressions (Michiels et al. 2007).
Assume the XML document shown in Figure 1, and the XPath query /a//c/"foo"¹. There are two matches for the path part of the query, and four matches for the value predicate "foo", but only one match for the entire query, which is the third occurrence of "foo". Note that each unique path is only indexed once in a path index. There are two occurrences of the path /a/b in the example document, but there will only be one entry for it in a path index.
An efficient path index is important when indexing large heterogeneous document collections for structural querying. If the data has a very homogeneous structure, the number of unique paths seen is small, and the implementation of the path index itself is not significant. An example where document structure could be very heterogeneous is an enterprise search engine, indexing all information generated by a business. This could be composed of the content from multiple databases, repositories of reports, etc. Another case is search engines for the semantic web, where structural search is a key feature. It is hard to imagine a future for the semantic web without search engines that scale as well as current web search engines.
1.1 Previous approaches
Many approaches for indexing XML and supporting path queries have been proposed in the last ten years. They can mainly be divided into node indexing, path indexing and sequence indexing. In node indexing, one makes an inverted index for all XML tags, and one for all data values and terms. Given a simple path query, look up all the tags and values, and merge the results. To be able to merge hits for XML elements, an encoding which tells whether a node is a child or descendant of another node is needed. Common solutions are the range based (docid, start:end, level) encoding (Zhang et al. 2001), the
¹ Let path/"term" be short for path[normalize-space(text())="term"].
[Figure 1 body not reproduced: the element markup was lost in extraction, leaving only the text values ("foo", "bar") and the Dewey codes 1 through 1.5(.1).]
Figure 1: Example XML document. Dewey order encoding of elements shown on the right.
(post, pre) encoding (Grust 2002), and the prefix-based Dewey order encoding (Tatarinov et al. 2002).
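The ancestor/descendant tests these encodings support can be sketched as follows; the field names and the child-test variant are illustrative assumptions, not any particular system's format:

```python
from collections import namedtuple

# Range-based encoding: (docid, start:end, level), illustrative field names.
Region = namedtuple("Region", "docid start end level")

def is_ancestor_range(a, d):
    """a is an ancestor of d iff d's region nests strictly inside a's,
    within the same document."""
    return a.docid == d.docid and a.start < d.start and d.end < a.end

def is_child_range(a, d):
    """The level field turns the descendant test into a child test."""
    return is_ancestor_range(a, d) and d.level == a.level + 1

def is_ancestor_dewey(a, d):
    """Dewey test: a, e.g. (1, 3), is an ancestor of d, e.g. (1, 3, 2),
    iff a's code is a proper prefix of d's."""
    return len(a) < len(d) and d[:len(a)] == a
```

Both tests are constant time per pair (up to the depth-dependent prefix comparison for Dewey codes), which is what makes merge-based structural joins over sorted occurrence lists feasible.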
A "brute force" merge between element hits may be extremely inefficient. A lot of research has gone into finding better merge algorithms in terms of time and IO-complexity: MPMGJN (Zhang et al. 2001), EE/EA-join (Li and Moon 2001), tree-merge and stack-tree (Al-Khalifa et al. 2002), Anc Des (Chien et al. 2002), PathStack and TreeStack (Bruno et al. 2002), XR-tree and XR-stack (Wang and Ooi 2003), TSGeneric (Jiang et al. 2003), and TJFast (Lu et al. 2005). Even though these algorithms are very efficient in terms of their input and output, many systems which use them perform a lot of unnecessary work. Assume for example the query /a/b/c/"foo", and that a is the first element of the path for half of the data values stored in the database, while c is seen only a few times. To do the merge, either the entire occurrence list for a must be read from disk, or every occurrence of c must be looked up in some data structure for a on disk. It is obvious that methods which utilise the varying selectivity of the query elements or the full paths will be faster.
Various methods which use index structures other than inverted lists have been proposed. The index fabric (Cooper et al. 2001) maintains a layered Patricia trie of all paths seen in the data. It is organised in multiple levels, so that a path query results in a number of block lookups logarithmic in the total size of the data. A problem with this index structure is that only queries starting at the document root element are efficiently supported.
When the structure of XML documents is highly regular, the number of unique paths seen in a collection will be very small compared to the total size of the data. This set of unique paths can often fit in main memory, and be searched very efficiently. The first use of path indexing known to the author is by Bertino and Kim (1989), while perhaps the best known is the DataGuide (Goldman and Widom 1997) used in the Lore DBMS for semistructured data.
Let a path summary denote a summary of all paths which is not indexed for fast matching. Perhaps the simplest use of path summaries is by Buneman et al. (2005), where all paths are extracted from the data and maintained as an in-memory "skeleton". For each path seen, there is an on-disk vector containing all terms and values seen below instances of this path. Weaknesses of this approach are that a full search for matching paths in the skeleton is required when paths do not start with the root element, and that a brute force (or binary) search through the vector is necessary for queries with value predicates. The ToXin system (Rizzolo and Mendelzon 2001) improves the latter by maintaining for each path an index over the values seen below it. The strength of ToXin is efficient matching of twig queries, achieved by storing navigational information for the data in the nodes of the index. A further improvement is ToXop (Barta et al. 2005), in which query plans are made based on the selectivity of the path query elements, and clever combinations of merges and searches are used. A potential weakness is that if a query does not start with a root element, a brute force search through the path summary is required to match the path expression.
An enhancement over a brute force search through the summary is to make an inverted index over the paths on their tags. Given a path query, the individual tags are looked up in the path index and the results merged. This is used in SphinX (Poola and Haritsa 2007), where there is a value index for each path (as in ToXin). When the path index is of considerable size, the merging can be costly. The systems APEX (Chung et al. 2002) and XIST (Runapongsa et al. 2004) address this by maintaining index entries for sub-paths of lengths greater than one on demand.
A simple and elegant system for XML indexing using path indexing in an RDBMS is XRel (Yoshikawa et al. 2001). One of the four tables used is a mapping from paths to integer identifiers. All text and values indexed have a reference to the path under which they reside, and path matching is done using simple LIKE queries with wild-cards in the path table. Similar solutions are used in many systems based on RDBMSs.
A problem with keeping a separate value index for each path arises when many paths match the query. The worst case scenario is when the query consists of only a value predicate. This results in many disk accesses, unless the indexes are stored in some interleaved fashion, grouped on the value key. An alternative is to have a single value index, where the occurrences of a value are stored with their parent path ID. After the entry for a value has been found, the occurrences are filtered on matching path IDs. If the occurrence list is large, it can be stored sorted on path ID, and pointers into the list can be used to avoid having to read all of it (known as skip lists).
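As a sketch of this single-index alternative, the filtering step might look as follows, with the occurrence list kept sorted on path ID and a binary search used to skip to each matching ID. The data layout and names are illustrative assumptions, not the implementation evaluated later in this article.

```python
from bisect import bisect_left

# Occurrence list for one value, kept sorted on path ID:
# (path ID, Dewey order of the path instance) pairs.
occurrences = [(1, "1.1"), (2, "1.2.1"), (2, "1.4.1"), (3, "1.3.1.1")]

def filter_on_paths(occurrences, matching_path_ids):
    """Keep only occurrences under a matching path, skipping directly to
    each path ID with a binary search instead of scanning the whole list."""
    hits = []
    for pid in sorted(matching_path_ids):
        i = bisect_left(occurrences, (pid, ""))  # first entry for pid, if any
        while i < len(occurrences) and occurrences[i][0] == pid:
            hits.append(occurrences[i])
            i += 1
    return hits
```

With matching path IDs {3, 7}, the filter skips past the entries for paths 1 and 2 and returns only the single occurrence under path 3.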
When the path summary fits in main memory, the choice of index structures suitable for implementing it is greater than if it had to reside on disk. One structure which is only efficient in main memory is the suffix tree (McCreight 1976). PIGST (Zuopeng et al. 2007) is a system maintaining a generalised suffix tree as the path index.² See Section 2.4 for a description of this solution. A more common use of (often pruned) suffix trees is selectivity estimation for optimising query plans (Aboulnaga et al. 2001, Chen et al. 2001).
A method very different from node and path indexing is sequence indexing, where all documents are converted to a sequence representation, and searching is done by subsequence matching. ViST³ (Wang et al. 2003) is a system using this approach. An advantage is that twig queries can be answered without merging partial results. A problem with ViST is that the index has quadratic size in the worst case, if the indexed trees are very deep. PRIX (Rao and Moon 2004) solves this by taking a different approach to the sequencing, using Prüfer sequences. Wang and Meng (2005) use a representation similar to that in ViST, but with a cleverer sequencing they optimise for smaller indexes and faster queries. The querying process also becomes much simpler than in ViST and PRIX.
1.2 Contributions
This article describes how to do efficient path matching. It is assumed that there is an overlying system similar to what is common when using path indexing (see Section 2.1).
• It is shown that combining an inverted index for the path summary with brute force search is in practice much faster than merging path element hits. The methods introduced exploit the varying selectivity of the query path elements.
• It is shown how the use of a generalised suffix tree can be enhanced by adding statistics to the tree nodes and changing the way searches are performed. Multiple entry points into the query are partially evaluated in parallel, depending on the evaluation cost.
• Many path index implementations are compared, using paths generated from statistical models and from DTDs.
2 Path index implementation
Below follow descriptions of various solutions for implementing path summaries.
2.1 Assumptions
This article only addresses the implementation of the path index, and assumes that an overlying system with the following design is using it: All values and terms seen in the document collection are indexed by ordinary inverted lists. Stored in each entry is information encoding the document ID, the position in the document, the value's parent path identifier, and a local specifier for the path instance (range based or prefix based). Figure 2 shows the value index for the example XML document in Figure 1, with the path ID and the Dewey order encoding of the path instance shown. Document ID and position are omitted for brevity. The enumeration of the paths is shown in Figure 3.

² The authors make some extensions which make the suffix trees super-linear in size, seemingly without considering this.
³ Stands for Virtual Suffix Tree, but only due to a misconception by the authors.
"foo": 1,1.1  2,1.2.1  3,1.3.1.1  2,1.4.1
"bar": 4,1.3.2.1  5,1.3.2.2.1  7,1.5.1

Figure 2: Value index for the example XML from Figure 1, storing path ID and path instance Dewey order. Document ID and position omitted.
1. /a
2. /a/b
3. /a/b/c
4. /a/b/b
5. /a/b/b/a
6. /a/b/b/a/b
7. /a/c

Figure 3: Enumeration of the unique paths seen in the XML in Figure 1. This is the set of paths which would be indexed by a path index.
Given a non-branching XPath query with a value predicate, all matching paths are found with the path index. The value is then looked up in the value index, and the hits are filtered with the set of matching paths. For the query /a//c/"foo", the paths matching /a//c in the example XML are numbers 3 and 7. The only occurrence in the list for "foo" under a matching path is (3, 1.3.1.1). For XPath queries without value predicates, an index over occurrences of XML tags should be maintained in addition. The value index may also have the occurrence lists split/sorted on path ID for faster filtering, as in ToXin (Rizzolo and Mendelzon 2001) and SphinX (Poola and Haritsa 2007).
It is assumed that representatives for the unique paths seen in a document collection fit in main memory, and further that any index structure linear in their size does as well. Only “simple” path expressions are considered, not twigs. An example of a twig XPath query is /a/b[c/"foo" and b/"bar"]. It is assumed that a system using the path index here would perform two queries (one for each branch in the twig), and merge the results. Here the merge would check for a common prefix of length three in the Dewey encoding of the paths. Note that the problem is much more involved in general; unordered tree inclusion is even NP-complete (Kilpeläinen 1992). The reason twigs are not treated here is that in most cases the leaves of the twigs will be value predicates (as in the given query), which will have to be looked up in the value index in any case, given the overall system design.
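Such a branch merge can be sketched as follows, under the assumption that branch hits are given as Dewey order strings and that both branches must descend from the same node, i.e. agree on a Dewey prefix whose length corresponds to the depth of the branching node. The function names are illustrative, not from the implementation.

```python
def dewey_prefix(dewey, k):
    """First k components of a Dewey order string,
    e.g. ("1", "3") for "1.3.1.1" with k = 2."""
    return tuple(dewey.split(".")[:k])

def merge_branches(hits_a, hits_b, k):
    """Keep hits from branch a that share a length-k Dewey prefix with
    some hit from branch b, i.e. both descend from the same tree node."""
    prefixes_b = {dewey_prefix(d, k) for d in hits_b}
    return [d for d in hits_a if dewey_prefix(d, k) in prefixes_b]
```

On the example data, the "foo" branch hit 1.3.1.1 and the "bar" branch hit 1.3.2.1 agree on the prefix 1.3 of the shared /a/b node, so the twig matches there.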
Below follow descriptions of the various path index data structures and matching approaches.
2.2 Brute force search
The simplest way to implement a path index is to store the paths seen in a list, and perform brute force searches for path expressions through this list. Given regular path expressions, a deterministic finite automaton (DFA) for the query can be built (Aho et al. 1986). A DFA can be exponential in the size of the query, but in most cases queries can be considered to be of constant length. Using a DFA gives a linear time scan through the data. For document collections with much data but a small schema, a brute force search may be a sufficient solution, as the scan through memory is relatively cheap compared to the disk accesses needed for the lookup into the value index.
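As an illustration of this approach, a path expression can be compiled into a regular expression (a regex engine's compiled pattern standing in for the DFA) and matched against every stored path. The sketch below assumes paths are stored as '/'-separated tag strings and that queries must match to the end of a path, as required in the experiments later in this article; the names are illustrative.

```python
import re

def compile_query(query):
    """Compile a path expression into a regex over '/'-separated tag
    strings.  '/' is the child axis, '//' the descendant axis, and '*'
    a one-tag wild-card.  Matches are anchored at the end of the path."""
    rx, i = "^", 0
    while i < len(query):
        if query.startswith("//", i):
            # zero or more whole tags may sit in between
            rx += r"(?:[^/]+/)*" if rx == "^" else r"/(?:[^/]+/)*"
            i += 2
        elif query[i] == "/":
            if rx != "^":
                rx += "/"
            i += 1
        else:
            j = query.find("/", i)
            j = len(query) if j == -1 else j
            tag = query[i:j]
            rx += "[^/]+" if tag == "*" else re.escape(tag)
            i = j
    return re.compile(rx + "$")

def brute_force(paths, query):
    """Return the 1-based IDs of all paths matching the expression."""
    rx = compile_query(query)
    return [pid for pid, path in enumerate(paths, 1) if rx.match(path)]

# The unique paths of Figure 3.
paths = ["a", "a/b", "a/b/c", "a/b/b", "a/b/b/a", "a/b/b/a/b", "a/c"]
```

For /a//c this yields paths 3 and 7, as in the running example.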
2.3 Inverted list solutions
When inverting paths, each tag in a path is treated as a symbol. For each symbol, the index contains a list of the positions in which it occurs, given as path ID and position within the path. An index for the paths in Figure 3 is shown in Figure 4. Whether the index should store pairs of path ID and position, or store the path ID and a list of occurrence positions within the path, depends on the expected lengths of these lists. In an implementation not using compression, an additional integer would be needed for storing the length of each list. This means the latter approach is more space efficient when the expected list length is greater than two, which could happen with recursive document schemas. The approach using simple pairs was chosen for simplicity in the implementations used here.
a: 1,1 2,1 3,1 4,1 5,1 5,4 6,1 6,4 7,1
b: 2,2 3,2 4,2 4,3 5,2 5,3 6,2 6,3 6,5
c: 3,3 7,2

Figure 4: Inverted lists for the paths seen in the example XML, storing path ID and position within the path.
Given a path query using only the child axis, each element is looked up, and the results are merged where the elements are adjacent in paths. For the query //a/b/c, merging left to right, first merge the hits for a and b, and keep all hits with adjacent elements. The reason all hits are needed, even though the final output is only path IDs, is that it is not yet known which hits for a/b have adjacent hits for c. After merging with c in the example, a match in path 3 is left, from position 1 to 3.
For the descendant axis, hits need not be adjacent, only in the correct order. Given that the element hits are merged left to right, only the hit with the leftmost right border needs to be passed on to the next step in the merge. For the query //a//b//c, first merge the hits for a and b, and keep at most a single match in each path, one with b as far to the left as possible. Then merge this hit set with the hits for c.
For queries using both the child and descendant axis, the hits for elements in a parent–child relationship should be merged first, then elements in an ancestor–descendant relationship. This is because the former needs all intermediate hits, while the latter does not.
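The two merge steps can be sketched as follows, with hits represented as (path ID, start, end) spans, where a single-element hit at position p is (pid, p, p). This is an illustrative sketch, not the implementation used in the tests.

```python
def merge_child(hits_a, hits_b):
    """Child axis: join spans from a with spans from b that start
    immediately after them, in the same path.  All combined spans are
    kept, since later child-axis steps need the exact positions."""
    by_start = {(pid, s): e for pid, s, e in hits_b}
    out = []
    for pid, s, e in hits_a:
        if (pid, e + 1) in by_start:
            out.append((pid, s, by_start[(pid, e + 1)]))
    return out

def merge_desc(hits_a, hits_b):
    """Descendant axis: per path, keep at most one combined span, the
    one whose right border is leftmost; order matters, adjacency does not."""
    best = {}
    for pid, s, e in hits_a:
        for pid2, s2, e2 in hits_b:
            if pid2 == pid and s2 > e and (pid not in best or e2 < best[pid][2]):
                best[pid] = (pid, s, e2)
    return sorted(best.values())

# The single-element hits of Figure 4, as (path ID, start, end) spans.
a = [(1,1,1), (2,1,1), (3,1,1), (4,1,1), (5,1,1), (5,4,4), (6,1,1), (6,4,4), (7,1,1)]
b = [(2,2,2), (3,2,2), (4,2,2), (4,3,3), (5,2,2), (5,3,3), (6,2,2), (6,3,3), (6,5,5)]
c = [(3,3,3), (7,2,2)]
```

For //a/b/c, merging left to right leaves the single match in path 3 from position 1 to 3, as described in the text.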
2.3.1 Indexing tuples
The performance of path queries using the child axis can be greatly improved by indexing pairs, triples, or even longer substrings in the inverted lists for the paths. What is indexed can be decided dynamically, as in APEX (Chung et al. 2002) or XIST (Runapongsa et al. 2004), or statically, as is done here for simplicity. If the size of the data (in this case the paths) is large compared to the alphabet, the space overhead associated with starting a list in the index is small compared to the size of the list contents. In this case, an index for all pairs will not require much more space than an index for all single elements.
It is expected that using pairs or triples will greatly reduce query cost in practice, as these probably will have much better selectivity than single elements.
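Static indexing of singles and pairs can be sketched as follows, with entries in the (path ID, position) format of Figure 4; the names are illustrative.

```python
from collections import defaultdict

def build_index(paths, max_tuple=2):
    """Inverted lists over all tag tuples up to length max_tuple.
    Each entry is (path ID, 1-based start position within the path)."""
    index = defaultdict(list)
    for pid, path in enumerate(paths, 1):
        for i in range(len(path)):
            for length in range(1, max_tuple + 1):
                if i + length <= len(path):
                    index[tuple(path[i:i + length])].append((pid, i + 1))
    return index

# The paths of Figure 3, as tag sequences.
paths = [("a",), ("a", "b"), ("a", "b", "c"), ("a", "b", "b"),
         ("a", "b", "b", "a"), ("a", "b", "b", "a", "b"), ("a", "c")]
index = build_index(paths)
```

Even on this tiny example the list for the pair (b, b) has three entries where the list for b alone has nine, which is the selectivity gain the tuples are meant to provide.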
2.3.2 Extending possible hits
For queries using the child axis, a simple trick can be used to improve performance. Assume only single elements are indexed, and the query is /a/b/c. If a is the root element of every second path, and b exists in around half the paths, but c is seldom seen, the size of the result will be very small compared to the total cost of the merges. The cost of merging large and small hit sets can be reduced by performing binary searches in the large set. The merges can also be done out of order to reduce the cost.
A simpler and more efficient solution is possible. Since all paths reside in main memory, checking single elements in the paths is very cheap. Take the hits for the most selective element, and for each one, check whether it is preceded and succeeded by the needed elements. This avoids merges with larger hit sets from elements with poorer selectivity. The method can be combined with indexing pairs and triples.
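A sketch of this check, assuming a child-axis-only query given as a list of tags, hits for the chosen element from the single-element index, and the paths available in memory. The names are illustrative, and a root-anchored query would additionally require the match to start at position 1.

```python
def extend_hits(paths, query, sel, hits):
    """query: tags joined by the child axis; sel: index in the query of
    the most selective tag; hits: (path ID, 1-based position) entries
    for that tag.  Each hit is checked directly against the in-memory
    path, avoiding merges with the less selective elements."""
    out = []
    for pid, pos in hits:
        path = paths[pid - 1]
        start = pos - 1 - sel              # 0-based start of the full match
        if start < 0 or start + len(query) > len(path):
            continue
        if all(path[start + k] == tag for k, tag in enumerate(query)):
            out.append((pid, start + 1))   # report 1-based start
    return out

# The paths of Figure 3, as tag sequences.
paths = [("a",), ("a", "b"), ("a", "b", "c"), ("a", "b", "b"),
         ("a", "b", "b", "a"), ("a", "b", "b", "a", "b"), ("a", "c")]
```

For the query /a/b/c with c as the selective element, only the two hits for c are inspected, instead of merging the long lists for a and b.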
2.3.3 Estimate, choose, brute
Expensive merges on the descendant axis can also be avoided when a part of the query has good selectivity. Assume the XPath query /a//c, where c has good selectivity, but a does not. As the paths are relatively short strings, a brute force search through the set of paths containing c should be more efficient than a merge. The memory management overhead of handling intermediate hit sets is also avoided.
2.4 Suffix tree solutions
A generalised suffix tree for a set of strings is a compacted trie for all suffixes of the strings (McCreight 1976). An example tree is shown in Figure 5. This structure can be built in time and space linear in the total length of the strings for constant and integer alphabets (Farach 1997). For general alphabets the complexity is O(n log |Σ|). The implementation used in this article combines the child arrays from Grimsmo (2005, 2007) and hashing. The index can decide whether a given string exists as a substring in the set in expected time linear in the length of the string, and all hits can then be extracted in time linear in their number. The set of paths in a path summary can be seen as a set of strings, where XML tags are string symbols, and indexed with the suffix tree. This requires that the suffix tree implementation can handle the possibly large alphabet of XML tags. If this is not the case, the paths can be spelled out separated by delimiters, and these longer strings can be indexed. In the implementation used here, tags were mapped to an integer alphabet used as string symbols.
Figure 5: Generalised suffix tree for the strings abba and abbb.
If only the child axis is used, all matches are contiguous sub-paths, and the node whose subtree represents all hits can be found in optimal time. But not all occurrences of the sub-path are needed, only the set of paths containing them. As an example, a/b occurs twice in the path /a/b/b/a/b, but given the query //a/b//"foo", only the fact that it occurs is of interest. In PIGST (Zuopeng et al. 2007) this is solved by storing in each internal node of the tree the set of path IDs seen in the subtree below. For random paths from a uniform distribution, the average ratio between the size of the subtree and the size of the list of path IDs will be inversely proportional to the average path length. Another argument for their solution is that the nodes in a suffix tree built with any known linear construction algorithm have bad spatial locality, while the path IDs stored in the lists in PIGST have perfect locality. This is of importance also in main memory, because of the caches of modern computers. No bounds on space usage or construction time for their extended suffix tree are given (Zuopeng et al. 2007), but both are super-linear, even in the average case (Grimsmo and Bjørklund 2007). Finding the set of strings which contain a given substring is known as the document listing problem, and can actually be solved optimally with linear preprocessing (Muthukrishnan 2002).
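The effect of storing path ID lists in internal nodes can be illustrated with a naive, uncompacted suffix trie. Its construction is quadratic, unlike a proper compacted suffix tree, so this is a sketch of the lookup behaviour only, not of the structures compared in this article.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.path_ids = set()   # IDs of all paths with a suffix through here

def build_suffix_trie(paths):
    """Insert every suffix of every path; each node records which paths
    pass through it, so document-listing answers are read off directly."""
    root = Node()
    for pid, path in enumerate(paths, 1):
        for i in range(len(path)):
            node = root
            for tag in path[i:]:
                node = node.children.setdefault(tag, Node())
                node.path_ids.add(pid)
    return root

def paths_containing(root, query):
    """IDs of the paths containing the child-axis-only query as a sub-path."""
    node = root
    for tag in query:
        if tag not in node.children:
            return set()
        node = node.children[tag]
    return node.path_ids

# The paths of Figure 3, as tag sequences.
paths = [("a",), ("a", "b"), ("a", "b", "c"), ("a", "b", "b"),
         ("a", "b", "b", "a"), ("a", "b", "b", "a", "b"), ("a", "c")]
trie = build_suffix_trie(paths)
```

Note that a/b occurs twice in path 6, but the node for it stores the path ID only once, which is exactly the set-of-paths answer the query needs.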
2.4.1 Intersect, brute
The straightforward way to search for a path expression using the descendant axis is to first search for the node representing the first part of the query (using only the child axis), and then do a full recursive search of the subtree below. To avoid this, PIGST does a separate search for each part of the query, intersects the resulting sets of path IDs, and performs a brute force search through the set of possible paths. Note that this merging on the descendant axis is different from what is described earlier in Section 2.3, as sets of path IDs are intersected, not sets of hits in paths, where in-path order is of importance.
A variant of this is to take only the smallest set of path IDs, and perform a brute force search through the respective paths. This is similar to what was described for inverted files in Section 2.3.3. One difference is that if a part of a query using only the child axis has been matched in the tree, the size of the hit set does not need to be estimated, as it is known. Another is that the set of matches is extracted without overhead, no matter how long the query part is. For inverted lists, merging or hit expansion was necessary if the part was longer than the tuples indexed.
Skipping the intersection and just doing the brute force search through the smallest set should pay off when the total length of the paths in the smallest set of possible paths is less than the number of path IDs in the largest set. This could happen often if the paths were generated by a source with a skewed distribution.
2.4.2 Selective suffix tree traversal
Below follows the description of a novel algorithm using multiple entry points into the query. Pseudo-code is given in Algorithm 1. The nodes of an ordinary generalised suffix tree are each extended with a number giving the size of the subtree below the node. This allows for more intelligent traversal of trees and queries. Two variants are used, where the second uses two suffix trees.
For the first variant, the number of entry points into a query is equal to the number of parts separated by the descendant axis. Given the query /a/b//c//d/e, there are entries starting at a, c and d.
All entry points are kept in a priority queue, ordered on the expected cost of evaluating them one step further. As soon as one is completely evaluated, the matches are extracted. Each entry point may during evaluation be at multiple positions in the suffix tree, if wild-cards or the descendant axis have been used. The cost of moving an entry point forward in the query is the sum of the costs for moving downward at each position held in the tree, with a reduction for having advanced further into the query. Evaluating a step of the child axis costs 1, except when the next symbol is a wild-card, where the cost is equal to the number of children of the current node in the suffix tree. For the descendant axis, the cost of moving one step forward in the query is equal to the size of the subtree below the current node. For an entry point which started inside the query, and has been expanded all the way to the end, the cost of evaluating it “one step further” is an estimate of the cost of a brute force search through the paths with the partial match, which must be done to get a full match.
Input: path expression P, suffix tree ST
Output: set of matching paths
Q ← PriQueue(getEntryPoints(P));
while not complete(front(Q)) do
    ep ← pop(Q);
    next ← {};
    foreach p ∈ ep.positions do
        next ← next ∪ advance(p);
    end
    c ← 0;
    foreach p ∈ next do
        c ← c + nextAdvanceCost(p);
    end
    ep.positions ← next;
    ep.advanceCost ← c;
    push(Q, ep);
end
ep ← front(Q);
return ep.matches;

Algorithm 1: Selective suffix tree traversal
The other variant of this method also uses a suffix tree for the reverse representation of the paths. It has additional entry points moving backwards from the last element in each part of the query, doing the matching in the second suffix tree. For the example query, there would also be entry points moving backwards from b, c and e. A variant using only the suffix tree for the reversed paths is also included in the tests.
2.4.3 Skipping leading wild-cards
A simple variant of the basic use of a suffix tree can be used when the query starts with a wild-card. The leading wild-cards can be omitted from the query, and after the hits have been retrieved, they can be filtered on their starting positions in the paths.
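A sketch of this filter, assuming the suffix tree lookup returns (path ID, start position) hits for the query with its leading wild-cards stripped; the names are illustrative.

```python
def filter_leading_wildcards(hits, n_wild, root_anchored):
    """hits: (path ID, 1-based start) for the query without its leading
    wild-cards.  A hit is valid if enough tags precede it to stand in
    for the wild-cards: exactly n_wild tags if the query was anchored
    at the root, at least n_wild otherwise."""
    if root_anchored:
        return [(pid, s) for pid, s in hits if s == n_wild + 1]
    return [(pid, s) for pid, s in hits if s >= n_wild + 1]

# Hits for the sub-path a/b in the paths of Figure 3.
hits_ab = [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (6, 4)]
```

For a query with one leading wild-card and no root anchoring, only the occurrence starting at position 4 in path 6 has a preceding tag to consume the wild-card.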
3 Experimental results
The tests were run on an AMD64 3500+ running Linux 2.6.16 compiled for AMD64. All tested implementations were written in Java, and run with 64-bit Sun Java 1.5.0_06. As the Java virtual machine often shows a radical speedup from “warming up” (optimising byte-code), and as many of the solutions shared code, some measures had to be taken to give them fair treatment. Each test was run repeatedly with all implementations, until none of them showed a deviation of more than 2% in total running time for the test from the previous attempt. This ensured that all implementations got a proper and fair warmup.
Some care also had to be taken when measuring the memory usage, as Java relies on garbage collection. The garbage collector was called multiple times before the indexing process started, and the base memory usage was measured.⁴ It was called again multiple times after the indexing finished, and the difference from the base memory usage was then recorded. If the garbage collector was run only a single time, the space usage measured differed greatly between runs.
3.1 Data generation
The paths used in the tests were generated both from statistical models and from DTDs. A uniform distribution and Zipf distributions were used, in addition to first order Markov models with underlying Zipf distributions.
The DTDs used for path generation were taken from the benchmarks DBLP (Ley 2007), GedML (Kay 1999), Michigan (Runapongsa et al. 2003), XBench (Yao et al. 2003), XMach-1 (Böhme and Rahm 2001), XMark (Schmidt et al. 2001) and XOO7 (Bressan et al. 2003). Paths were generated by a breadth first search through the space of possible paths defined by the DTDs. Some of the DTDs were not used alone, as they were small and/or non-recursive. In tests using multiple DTDs, a breadth first search through their complete space was done, such that all paths of length k were generated before any path of length k + 1. For the data from statistical distributions, path lengths were drawn randomly from 1 to 10.
Queries for paths from DTDs were generated as follows: The length of the query was drawn from its distribution (default: uniform from 1 to 5). Then, for each position in the query, the use of the child or descendant axis was chosen (default: p(desc) = 0.3). If the descendant axis was not used at all, a random path of the specified length was drawn and used as the query. If not, a random path of at least the specified length was drawn. Then the descendant axis operators were inserted into the query at random locations, and random path elements next to them were removed until the query had the specified length. This procedure ensured that all queries related to the generated data. The probability that a query required matches to start at a root element was set to 0.5. All queries were required to match the end of a path. Finally, the probability that a path element was substituted with a wild-card was by default 0.1.
3.2 Methods tested
The following methods were used in the tests:
Br Brute force search through all strings. (See Section 2.2)
MgInv Inverted lists and merging. (2.3)
MgIe1 Inverted lists, selective entry point in contiguous part, expanding on child axis, merging on descendant axis. (2.3.2)
MgIe2 Indexing singles and pairs. (2.3.1+2.3.2)

⁴ Runtime.totalMemory() - Runtime.freeMemory()
MgIe3 Indexing singles, pairs and triples. (2.3.1+2.3.2)
EsIe[1,2,3] Estimating the contiguous part with fewest hits, extracting possible paths, filtering with brute force. (2.3.2+2.3.3)
St Straightforward use of a suffix tree. (2.4)
Ss Suffix tree, skipping leading wild-cards, and filtering on start position in the string. (2.4.3)
InSe Suffix tree with path ID lists in internal nodes. Intersection of path ID sets on the descendant axis and brute force through the result, as described in (Zuopeng et al. 2007). (2.4.1)
EsSe Suffix tree with path ID lists in internal nodes. Finding the contiguous part of the query (no descendant axis) with fewest matches, and brute force search through the corresponding set of paths. (2.3.3+2.4.1)
Sm[f,r,2] Suffix tree(s) with multiple entry points. Testing a single tree with forward strings, a single tree with reversed strings, and two trees, one forward and one reversed. (2.4.2)
Smfe Suffix tree enhanced with path ID lists, using multiple entry points, using only the forward tree. (2.4.1+2.4.2)
3.3 Tests using various data sources
Table 1 shows query performance for the tested methods. 10000 paths were indexed, drawn from the different data sources. 5000 queries of length 1 to 5 were run, with a 0.3 probability of using the descendant axis and a 0.1 probability of wild-cards. See later tests for variations over this. Table 2 shows more measures for the test using all the DTDs.
% Br MgInv MgIe1 MgIe2 MgIe3 EsIe1 EsIe2 EsIe3 St Ss InSe EsSe Smf Smr Sm2 Smfe
U 10 1.8 1068 5637 1426 755 715 807 206 164 350 365 241 150 148 224 148 99
U 100 0.2 1012 937 164 82 81 96 25 24 81 51 68 57 22 29 26 19
Z 100,0.7 0.4 1021 1719 306 186 182 117 41 37 126 106 92 55 35 44 41 28
Z 100,1.0 0.9 1038 3658 691 448 437 223 89 77 249 241 156 76 73 90 77 55
Z 100,1.3 2.3 1094 8067 1541 1097 1052 483 219 181 543 549 320 157 165 275 166 117
MZ 100,0.7 0.2 1013 957 177 94 94 96 28 26 92 73 64 48 23 33 26 19
MZ 100,1.0 0.3 1020 1132 230 139 136 108 35 33 144 128 76 46 27 38 30 21
MZ 100,1.3 0.5 1038 1394 307 205 204 137 48 46 215 205 92 50 36 53 40 25
DBLP.dtd 2.4 1241 9455 2283 1724 1673 804 326 271 985 1012 559 262 239 533 286 170
GedML.dtd 1.2 1181 7504 1583 1205 1205 460 119 117 731 737 394 92 75 151 98 56
XMark.dtd 3.2 1347 9754 2382 1731 1689 942 401 359 1097 1137 590 339 336 1074 307 263
*.dtd 0.6 1097 3222 715 600 599 161 78 73 365 369 201 62 61 115 85 50

Table 1: Microseconds per query, average. Testing a uniform distribution (parameter |Σ|), Zipf distributions (|Σ|, s), first order Markov models with underlying Zipf (|Σ|, s), and various DTDs. The second column shows average query selectivity in per cent.
Paper 1: Faster Path Indexes for Search in XML Data

            Br MgInv MgIe1 MgIe2 MgIe3 EsIe1 EsIe2 EsIe3   St   Ss InSe EsSe  Smf  Smr  Sm2 Smfe
µs/q dev   336  3214  1107  1063  1074   304   220   231  680  691  355  181  170  365  245  132
µs/path      0     2     2     3     5     2     3     6   14   14   21   21   14   17   33   22
b/elem       4    18    18    33    50    18    33    50   48   48   76   76   48   69  114   76

Table 2: Indexing paths from all DTDs. Showing the standard deviation of microseconds per query, microseconds per path when indexing, and bytes per path element in the complete index.

The brute force solution (Br) serves as a base case for comparison. It is faster than some methods on many of the tests, as these methods have to merge very large sets. Performance is also related to query selectivity: on the test with the poorest average selectivity, the fastest method is only five times faster than the brute force search, while on the test where the selectivity is best, the fastest method is more than 50 times faster. The simplest merging method is MgInv, which looks up every single (non-wild-card) element of the path query and merges the results, first on the child axis, then on the descendant axis. In the case of uniform data (|Σ| = 100) it is comparable with the brute force solution, but it is much slower on many other tests.
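The element-at-a-time lookup and child-axis merge performed by MgInv can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: names and the toy index are invented, hits are (path ID, position) pairs, and the descendant-axis merge would compare positions with an inequality instead of exact adjacency.

```python
def merge_child_axis(hits_a, hits_b):
    """Join two postings lists of (path_id, position) hits on the child
    axis: keep the b-hits that directly follow an a-hit in the same path."""
    a_set = set(hits_a)
    return [(pid, pos) for (pid, pos) in hits_b if (pid, pos - 1) in a_set]

# Hypothetical inverted index over two label paths:
#   path 0: a/b/c     path 1: b/a/b
index = {
    "a": [(0, 0), (1, 1)],
    "b": [(0, 1), (1, 0), (1, 2)],
    "c": [(0, 2)],
}
# Evaluate the child-axis query a/b by merging the two postings lists:
print(merge_child_axis(index["a"], index["b"]))  # [(0, 1), (1, 2)]
```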
The methods named MgIe* improve this by finding entry points into the contiguous parts of the queries, keeping the hits that expand with a match, and then merging only on the ascendant axis. These methods are only faster than Br on the artificial tests with rather uniform data. As the probability of using the descendant axis is 0.3, there will frequently be parts of the query that still have low selectivity after expanding over the child axis. Notice that indexing pairs (MgIe2) also gives a significant speedup, while the speedup from indexing triples (MgIe3) is less dramatic. Table 2 shows that MgIe2 uses almost twice as much memory as MgIe1, and MgIe3 almost three times as much, as expected. The total memory used for indexing 10000 paths with MgIe3 was measured at 2.6 MB.

The methods which combine inversion and brute force (EsIe*) gain a greater speedup from indexing pairs and triples. They also have a significant speedup over merging in general. On the test using multiple DTDs, EsIe3 is more than eight times faster than MgIe3. Indexing triples does not help the latter much if the shortest contiguous parts of the query are single elements with poor selectivity.
The straightforward use of a suffix tree (St) has better performance than any of the merging variants (Mg*), but is slower than the combinations of indexing, selectivity estimation and brute force (Es*). When the suffix tree encounters a use of the descendant axis, it must traverse an entire subtree, which is a costly operation if the first part of the query was not very selective. The space usage for the suffix tree is similar to that of indexing triples (*Ie3). The improvement of Ss over St on some of the tests comes from queries starting with wild-cards. St must branch to every child of the root node in the tree, while Ss skips this part of the query and filters hits on their starting positions in the paths afterwards. The reason Ss is sometimes slower is probably the overhead of the filtering. St can be fast when a query starts with a wild-card if the next elements are very selective, so that the branching is effectively cut off.
CHAPTER 4. INCLUDED PAPERS

The method InSe, based on PIGST (Zuopeng et al. 2007), is faster than St on all of the tests, and faster than Ss on most of them. It uses a suffix tree enhanced with path ID lists, path set intersection and brute force search, as described in Section 2.4.1. As predicted, the related method EsSe is considerably faster on the tests with non-uniform data. It skips the intersection and performs the brute force search through the smallest set, exploiting the varying selectivity of the parts of the query. It should be noted that EsIe3 is faster than InSe on all the tests, while EsSe has similar performance. EsIe3 also uses less space and has faster index construction (see Table 2).
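The selectivity-exploiting idea behind EsSe, skipping the intersection and brute-force verifying only the smallest candidate set, can be sketched roughly as below. The function and parameter names are hypothetical, not the paper's implementation; `matches_full_query` stands in for the brute force check of a candidate path.

```python
def esse_style_search(inverted, parts, matches_full_query):
    """Look up each contiguous query part in an inverted index, pick the
    smallest candidate set, and verify only those paths by brute force,
    instead of intersecting all candidate sets."""
    candidate_sets = [inverted.get(p, set()) for p in parts]
    smallest = min(candidate_sets, key=len)
    return {pid for pid in smallest if matches_full_query(pid)}

# Toy index mapping query parts to candidate path IDs:
index = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3}}
print(esse_style_search(index, ["a", "b", "c"], lambda pid: pid == 3))  # {3}
```

Only the most selective part ("c", one candidate) is verified, so the cost of the check is bounded by the best selectivity among the query parts.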
The suffix trees using multiple entry points into the query (Sm[f,r,2]) have very good performance. Smf is more than three times faster than InSe on the test using multiple DTDs, and also faster than EsIe3. Using the forward representation of the paths (Smf) is more efficient than using the backward representation (Smr). There are two reasons for this. The first is the probabilities of requiring a match at the beginning and at the end of a path, which are 0.5 and 1.0 respectively. When these were swapped, Smr was correspondingly faster on the tests using uniform and Zipf data. For paths generated from DTDs and Markov chains, Smf was still faster. Notice that Smr uses more memory. Not all of it comes from storing the reverse strings: the number of internal nodes in a suffix tree is upper bounded by the size of the input, but depends on its characteristics, and a larger number of internal nodes gives more expensive tree traversals. It seems the reverse paths from DTDs have Markov properties that give a larger tree.
Building two suffix trees and using both forward and reverse entry points (Sm2) does not increase performance. More entry points are partially evaluated, seemingly without reducing the total cost. It is possible that a more intelligent and well tuned implementation would give better results. Sm2 used the most memory of all implementations, totalling 5.8 MB on the test with all DTDs.

Smfe combines features from Smf and InSe, and turns out to beat all methods on query performance. A drawback compared with Smf is the increased construction time and memory usage, which comes from using the data structures from InSe.

In the subsequent tests, only the fastest representatives from each group of implementations are shown.
3.4 Increasing number of paths

Figure 6 shows how the number of hits returned per second changes as the number of paths indexed increases. The paths were generated from all the DTDs, and otherwise the default parameters were used. Hits are counted instead of queries to get a better perspective of what happens when the size of the data is large. A "higher is better" measure is used to improve the visualisation of the difference between the best methods. Smfe, Smf, EsSe and EsIe3 out-scale the other methods by a good margin, with the first showing the best performance. The reason for the various drops and rises in the graph may be that the distribution of the paths generated from the DTDs changes as the maximal path length increases. A similar test on paths from a Zipf distribution did not show the same drops.
[Figure: average hits per second against the number of paths indexed (100 to 10000), for MgIe3, EsIe3, Ss, EsSe, Smf, Smfe and Br.]

Figure 6: Increasing number of paths. Hits per second.

3.5 Descendant axis

Figure 7 shows hits returned per second as the probability of using the descendant axis increases. The paths were generated from the DTDs, and the default parameters were used, except that the probability of wild-cards was set to zero. This was to isolate the effect of using the descendant axis. 10000 paths were indexed. Here EsSe is fastest when the probability of using the descendant axis is low, while Smfe is fastest when it is high. EsSe depends on finding contiguous parts of the query with good selectivity, which is harder when all parts are very short.
[Figure: average hits per second against the probability of the descendant axis (0 to 1), for MgIe3, EsIe3, Ss, EsSe, Smf, Smfe and Br.]

Figure 7: Increasing probability of descendant axis. Hits per second.
Note that the simple use of a suffix tree (Ss) has the best performance when there is no use of the descendant axis, but poor performance otherwise. The other methods using suffix trees could switch to the simple search technique when this is expected to be profitable.
3.6 Wild-cards

Increasing the probability of wild-cards gives different results than increasing the use of the descendant axis, as shown in Figure 8. The descendant axis was not used at all in this test. Here the suffix trees are much faster than the other methods. For the trees, branching on a wild-card is much less critical than branching on the descendant axis, as the former is cut off as soon as a mismatch is seen, while the latter introduces a full search of the subtree.
[Figure: average hits per second against the probability of a wild-card (0 to 0.5), for MgIe3, EsIe3, Ss, EsSe, Smf, Smfe and Br.]

Figure 8: Increasing probability of wild-card. Hits per second.
It is interesting to note that the simple use of a suffix tree, only skipping leading wild-cards (Ss), is here much more efficient than the implementation supporting multiple entry points into the query (Smf), even though only a single entry point is created. The realisation of the search automaton in Ss is just a recursive program, while Smf maintains a set of state objects, which gives a lot of overhead.
3.7 Index construction

Figure 9 shows the indexing performance of the various indexes as the number of paths increases. Single representatives for methods using the same data structure were tested: MgIe1 represents MgInv and EsIe1, MgIe2 represents EsIe2, MgIe3 represents EsIe3, St represents Ss and Smf, and finally InSe represents EsSe and Smfe. The methods tested are all asymptotically linear in the worst case, except InSe. The performance drop observed is probably due to the overhead of memory management and cache effects. The construction of the suffix trees is slower than the construction of inverted lists, but as the time cost of adding a path is less than 30 µs, this would constitute a negligible part of the cost of indexing XML in a complete system, assuming that the path index can reside in main memory while the value index must reside on disk.
[Figure: average paths indexed per second against the number of paths (100 to 10000), for MgIe1, MgIe2, MgIe3, St, Sm2 and InSe.]

Figure 9: Paths indexed per second.
4 Conclusion and future work

The most advanced method using suffix trees (Smfe) is the winner on query performance in these tests. Whether it should be chosen as the path index component in a larger system depends on the amount of main memory available and on whether or not the performance of the path index significantly impacts the performance of the complete system. Another issue is the complexity of the implementation. The suffix tree itself is a rather complex structure, and its use as described here may be hard to grasp.

The method combining inverted lists, extension over the child axis and brute force (EsIe3) would be the authors' choice when implementing a larger system. It is conceptually simple, easy to implement, and has good performance. If memory usage is an issue, EsIe2 or EsIe1 could be used. These methods could also be modified to work on disk when the path index does not fit in main memory. The paths themselves could be stored in the entries in their inverted lists, allowing the extension technique to work, at the cost of using more disk space and IO.

In future work, the authors would like to add the fast path summaries introduced here to an existing system for indexing XML, and compare this with various implementations, both with and without path summaries, such as Lore (Goldman and Widom 1997), ToXin (Rizzolo and Mendelzon 2001), XISS (Li and Moon 2001), XIST (Runapongsa et al. 2004), Ctree (Zou et al. 2004), SphinX (Poola and Haritsa 2007) and MXI (Yan and Liang 2005).
Acknowledgements

The author would like to thank Øystein Torbjørnsen, Svein-Olaf Hvasshovd and Truls Amundsen Bjørklund for helpful feedback on this article.
References

Aboulnaga, A., Alameldeen, A. R. & Naughton, J. F. (2001), Estimating the selectivity of XML path expressions for internet scale applications, in 'Proc. VLDB', pp. 591–600.

Aho, A., Sethi, R. & Ullman, J. (1986), Compilers: Principles, Techniques and Tools, Addison Wesley, Reading.

Al-Khalifa, S., Jagadish, H., Koudas, N., Patel, J. & Srivastava, D. (2002), Structural joins: A primitive for efficient XML query pattern matching, in 'Proc. ICDE', pp. 141–152.

Barta, A., Consens, M. P. & Mendelzon, A. O. (2005), Benefits of path summaries in an XML query optimizer supporting multiple access methods, in 'Proc. VLDB', pp. 133–144.

Bertino, E. & Kim, W. (1989), 'Indexing techniques for queries on nested objects', IEEE Transactions on Knowledge and Data Engineering 1(2), 196–214.

Böhme, T. & Rahm, E. (2001), 'XMach-1: A benchmark for XML data management', Proc. German database conference BTW, pp. 264–273.

Bressan, S., Lee, M., Li, Y., Lacroix, Z. & Nambiar, U. (2003), 'The XOO7 benchmark', Proc. EEXTT'02, pp. 146–147.

Bruno, N., Koudas, N. & Srivastava, D. (2002), Holistic twig joins: Optimal XML pattern matching, in 'Proc. SIGMOD', pp. 310–321.

Buneman, P., Choi, B., Fan, W., Hutchison, R., Mann, R. & Viglas, S. D. (2005), Vectorizing and querying large XML repositories, in 'Proc. ICDE', pp. 261–272.

Chen, Z., Jagadish, H., Korn, F., Koudas, N., Muthukrishnan, S., Ng, R. & Srivastava, D. (2001), 'Counting twig matches in a tree', Proc. ICDE.

Chien, S., Vagena, Z., Zhang, D., Tsotras, V. & Zaniolo, C. (2002), 'Efficient structural joins on indexed XML documents', Proc. VLDB 2, 263–274.

Chung, C.-W., Min, J.-K. & Shim, K. (2002), APEX: An adaptive path index for XML data, in 'Proc. SIGMOD', pp. 121–132.

Cooper, B., Sample, N., Franklin, M. J., Hjaltason, G. R. & Shadmon, M. (2001), A fast index for semistructured data, in 'Proc. VLDB', pp. 341–350.

Farach, M. (1997), Optimal suffix tree construction with large alphabets, in 'Proc. FOCS', pp. 137–143.

Goldman, R. & Widom, J. (1997), DataGuides: Enabling query formulation and optimization in semistructured databases, in 'Proc. VLDB', pp. 436–445.
Gottlob, G., Koch, C. & Pichler, R. (2005), 'Efficient algorithms for processing XPath queries', ACM Trans. Database Syst. 30(2), 444–491.

Grimsmo, N. (2005), Dynamic indexes vs. static hierarchies for substring search, Master's thesis, Norwegian University of Science and Technology.

Grimsmo, N. (2007), On performance and cache effects in substring indexes, Technical Report IDI-TR-2007-04, NTNU, Trondheim, Norway.

Grimsmo, N. & Bjørklund, T. A. (2007), On the size of generalised suffix trees extended with string ID lists, Technical Report IDI-TR-2007-01, NTNU, Trondheim, Norway.

Grust, T. (2002), Accelerating XPath location steps, in 'Proc. SIGMOD', pp. 109–120.

Jiang, H., Wang, W., Lu, H. & Yu, J. (2003), Holistic twig joins on indexed XML documents, in 'Proc. VLDB', pp. 273–284.

Kay, M. H. (1999), 'GedML'. http://users.breathe.com/mhkay/gedml/.

Kemper, A. & Moerkotte, G. (1992), 'Access support relations: an indexing method for object bases', Inf. Syst. 17(2), 117–145.

Kilpeläinen, P. (1992), Tree matching problems with applications to structured text databases, Technical Report A-1992-6, Department of Computer Science, University of Helsinki.

Ley, M. (2007), 'DBLP XML Records'. http://www.informatik.uni-trier.de/~ley/db/.

Li, Q. & Moon, B. (2001), Indexing and querying XML data for regular path expressions, in 'Proc. VLDB', pp. 361–370.

Lu, J., Ling, T., Chan, C. & Chen, T. (2005), From region encoding to extended Dewey: On efficient processing of XML twig pattern matching, in 'Proc. VLDB', pp. 193–204.

McCreight, E. M. (1976), 'A space-economical suffix tree construction algorithm', J. ACM 23(2), 262–272.

Michiels, P., Mihaila, G. A. & Simeon, J. (2007), Put a tree pattern in your algebra, in 'Proc. ICDE', pp. 246–255.

Muthukrishnan, S. (2002), Efficient algorithms for document retrieval problems, in 'Proc. SODA', pp. 657–666.

Nestorov, S., Ullman, J. D., Wiener, J. L. & Chawathe, S. S. (1997), Representative objects: Concise representations of semistructured, hierarchical data, in 'Proc. ICDE', pp. 79–90.

Poola, L. & Haritsa, J. (2007), 'Schema-conscious XML indexing', Information Systems 32, 344–364.
Rao, P. & Moon, B. (2004), PRIX: Indexing and querying XML using Prüfer sequences, in 'ICDE '04: Proceedings of the 20th International Conference on Data Engineering', IEEE Computer Society, Washington, DC, USA, p. 288.

Rizzolo, F. & Mendelzon, A. (2001), Indexing XML data with ToXin, in 'Proc. WebDB', Vol. 2001.

Runapongsa, K., Patel, J., Bordawekar, R. & Padmanabhan, S. (2004), XIST: An XML index selection tool, in 'Proc. XSym'.

Runapongsa, K., Patel, J., Jagadish, H. & Al-Khalifa, S. (2003), The Michigan Benchmark: A microbenchmark for XML query processing systems, in 'Proc. EEXTT'02', Springer.

Schmidt, A. R., Waas, F., Kersten, M. L., Florescu, D., Manolescu, I., Carey, M. J. & Busse, R. (2001), The XML Benchmark Project, Technical Report INS-R0103, CWI, Amsterdam, The Netherlands.

Tatarinov, I., Viglas, S. D., Beyer, K., Shanmugasundaram, J., Shekita, E. & Zhang, C. (2002), Storing and querying ordered XML using a relational database system, in 'Proc. SIGMOD', pp. 204–215.

Wang, H. & Meng, X. (2005), On the sequencing of tree structures for XML indexing, in 'Proc. ICDE', pp. 372–383.

Wang, H. & Ooi, B. (2003), XR-tree: Indexing XML data for efficient structural joins, in 'Proc. ICDE', pp. 253–264.

Wang, H., Park, S., Fan, W. & Yu, P. (2003), 'ViST: a dynamic index method for querying XML data by tree structures', Proc. SIGMOD, pp. 110–121.

Yan, L. & Liang, Z. (2005), 'Multiple schema based XML indexing', Lecture Notes in Computer Science 3619, 891–900.

Yao, B. B., Özsu, M. T. & Keenleyside, J. (2003), XBench - a family of benchmarks for XML DBMSs, in 'Proc. EEXTT'02', pp. 162–164.

Yoshikawa, M., Amagasa, T., Shimura, T. & Uemura, S. (2001), 'XRel: a path-based approach to storage and retrieval of XML documents using relational databases', ACM Transactions on Internet Technology 1(1), 110–141.

Zhang, C., Naughton, J., DeWitt, D., Luo, Q. & Lohman, G. (2001), 'On supporting containment queries in relational database management systems', SIGMOD Rec. 30(2), 425–436.

Zou, Q., Liu, S. & Chu, W. W. (2004), Ctree: a compact tree for indexing XML data, in 'Proc. WIDM', pp. 39–46.

Zuopeng, L., Kongfa, H., Ning, Y. & Yisheng, D. (2007), 'An efficient index structure for XML based on generalized suffix tree', Information Systems 32, 283–294.
Paper 2

Nils Grimsmo and Truls Amundsen Bjørklund
On the Size of Generalised Suffix Trees Extended with String ID Lists
Technical Report IDI-TR-2007-01, Norwegian University of Science and Technology, Trondheim, Norway, 2007.

Abstract: The document listing problem can be solved with linear preprocessing and optimal search time by using a generalised suffix tree, additional data structures and constant time range minimum queries. A simpler solution is to use a generalised suffix tree in which internal nodes are extended with a list of all string IDs seen in the subtree below the respective node. This report makes some remarks on the size of such a structure. For the case of a set of equal-length strings, a bound of Θ(n√n) for the worst case space usage of such lists is given, where n is the total length of the strings.
1 Document listing problem

The occurrence listing problem is: given a string T ∈ Σ* of length n, which can be preprocessed, and a pattern P of length m, find all z occurrences of P in T. This classical problem can be solved with Θ(n) preprocessing and optimal O(m + z) time queries with a suffix tree [4] if |Σ| is constant. The document listing problem is: given a set of strings T = {T_1, . . . , T_t} of total length n, which can be preprocessed, find all y strings in T which contain the pattern P. This can also be solved with Θ(n) preprocessing and O(m + y) time queries, by augmenting a generalised suffix tree with additional data and using constant time range minimum queries [5].

This report considers the complexity of a simpler solution, which does not have linear time preprocessing or linear space usage, but which has been used to solve a problem in information retrieval [8] without its complexity being considered.
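To fix the problem statement, here is a deliberately naive formulation of document listing as code. This is only a definition by example, scanning every string, and not one of the suffix-tree solutions discussed in this report.

```python
def document_listing(strings, pattern):
    """Naive document listing: return the IDs of all strings that
    contain `pattern`. Time is proportional to the total text length
    per query; the structures in this report avoid the full scan."""
    return [i for i, s in enumerate(strings) if pattern in s]

# The rotations of (1, ..., 4) used later in the report, written as text:
docs = ["1234", "2341", "3412", "4123"]
print(document_listing(docs, "34"))  # [0, 1, 2]
```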
2 Suffix tree definition

A suffix tree [4] for a string T of length n from the alphabet Σ is a compacted trie containing the first n suffixes of T$, where $ ∉ Σ is a unique terminator. Compaction here means that all internal non-branching nodes in the trie have been removed, joining adjacent edges. As $ is not seen in T, no suffix is a prefix of another, and each of the n suffixes is represented by a unique leaf node. Since all internal nodes are branching, their number is strictly upper bounded by n. If the edges in the tree are represented as pointers into T, the suffix tree can be represented in Θ(n) space. It can be constructed in Θ(n) time if the alphabet is constant [7, 4, 6] or integer [1].

Given a pattern P of length m, it can be decided whether it occurs in T by trying to follow its path downwards from the root of the suffix tree. The time complexity is O(m) if child lookup is Θ(1), which is trivially achieved if |Σ| is constant. The parent-child relationship is typically implemented with linked sibling lists. An expected lookup time of O(m) can be achieved with hashing [4]. Perfect hashing [2] can be used to get O(m) worst case lookup, at the cost of a longer construction time. After the position representing P in the tree has been located, all z hits can be extracted in Θ(z) time, as all leaf nodes in the subtree correspond to unique hits and all internal nodes in the subtree are branching.
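The descent-then-collect query described above can be illustrated with a toy structure. Note the hedge: this is an uncompacted suffix trie, quadratic to build, not the Θ(n) compacted suffix tree of [4]; it only shows the O(m) descent and the Θ(z) leaf collection.

```python
def build_suffix_trie(text):
    """Uncompacted suffix trie for text + '$'. Each node is a dict from
    symbol to child; a leaf stores the start position of its suffix
    under the multi-character key "leaf" (which cannot clash with a symbol)."""
    root = {}
    t = text + "$"
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i

    return root

def occurrences(root, pattern):
    """Descend on the pattern, then collect every leaf in the subtree;
    each leaf is a distinct occurrence, as in the text above."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "leaf":
                out.append(child)
            else:
                stack.append(child)
    return sorted(out)

print(occurrences(build_suffix_trie("abab"), "ab"))  # [0, 2]
```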
3 Generalised suffix tree

A generalised suffix tree for a set of strings T = {T_1, . . . , T_s} of total length n = n_1 + · · · + n_s is a compacted trie containing, for each T_i, all the first n_i suffixes of T_i$_i, where $_i is a unique terminator. The tree will have n leaf nodes, and can be constructed in Θ(n) time and space [3].
4 String ID list extension

A generalised suffix tree is an asymptotically optimal substring index for a set of strings from a constant alphabet: the cost of a query is linear in the size of the input and output. In many cases, however, one does not want to extract the set of hit positions, but the set of strings in which the hits occur. In cases where substrings are repeated in the indexed strings, a string ID can be seen many times in a subtree, and the set of unique string IDs will be smaller than the set of hit positions. To avoid this overhead, each node can store a list of the string IDs seen in its subtree, which can speed up the search for strings with matches. A drawback is a non-linear space usage in the worst case, which is shown below.
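The extension is easy to picture in code. The sketch below uses an uncompacted generalised suffix trie for brevity (a real implementation would use a compacted suffix tree, and would store lists rather than sets); every node records the set of string IDs occurring in its subtree, so document listing reduces to one descent plus reading off the list.

```python
from collections import defaultdict

def build_gst_with_id_lists(strings):
    """Uncompacted generalised suffix trie where each node records the
    string IDs seen in its subtree (the extension described above)."""
    children = defaultdict(dict)   # node id -> {symbol: child node id}
    ids = defaultdict(set)         # node id -> string IDs below the node
    next_node = [1]                # node 0 is the root
    for sid, s in enumerate(strings):
        padded = list(s) + [("$", sid)]     # unique terminator per string
        for i in range(len(s)):
            node = 0
            ids[0].add(sid)
            for sym in padded[i:]:
                nxt = children[node].get(sym)
                if nxt is None:
                    nxt = next_node[0]
                    next_node[0] += 1
                    children[node][sym] = nxt
                node = nxt
                ids[node].add(sid)
    return children, ids

def strings_containing(children, ids, pattern):
    """Document listing: descend on the pattern, then read the ID list."""
    node = 0
    for sym in pattern:
        if sym not in children[node]:
            return set()
        node = children[node][sym]
    return ids[node]

children, ids = build_gst_with_id_lists(["abab", "abba"])
print(sorted(strings_containing(children, ids, "ab")))  # [0, 1]
print(sorted(strings_containing(children, ids, "bb")))  # [1]
```

The total size of `ids` over all nodes is exactly the quantity |L| analysed in the following sections.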
5 A lower bound for worst case space usage

Here follows a proof of a lower bound of Ω(n√n) for the worst case space usage of suffix trees extended with string ID lists. This is shown through the Θ(n√n) space usage for a family of sets of strings.

Assume we have the set of strings T = {T_1, . . . , T_t}, where the strings are the rotations of (1, . . . , t), such that

T_i = ((0 + i) mod (t + 1), . . . , (t − 1 + i) mod (t + 1))    (1)

The total length of the strings is n = t². Figure 1 shows a generalised suffix tree for this set of strings for t = 4.

Figure 1: Generalised suffix tree for the padded strings (1, 2, 3, 4, $_1), (2, 3, 4, 1, $_2), (3, 4, 1, 2, $_3) and (4, 1, 2, 3, $_4). The number of unique string IDs in each subtree is shown in the internal nodes.
Let U = (1 mod (t + 1), . . . , (2t − 1) mod (t + 1)). All strings in T are substrings of U; Figure 2 shows this for t = 4. Consider the substring U[p . . . q], where 1 ≤ p ≤ t and p ≤ q < p + t, which is equal to U[p + t . . . q + t] if q < t. It is seen in T_i where i ≤ p or q < i, which gives a total of p + (t − q) locations.

If p + (t − q) > 1, there will be an internal node representing the substring U[p . . . q], as it is followed by U[q + 1] in some string, and by an end of string symbol in the padded string T_{q−t+1}$_{q−t+1}. Hence the total number of string IDs stored in the lists will be
|L| = ∑_{p=1}^{t} ∑_{q=p}^{p+t−2} (p + (t − q)) = ∑_{p=1}^{t} ∑_{q=0}^{t−2} (t − q) = t ∑_{k=2}^{t} k = (t³ + t² − 2t) / 2    (2)

Inserting t = √n, we get

|L| = (n√n + n − 2√n) / 2 ∈ Θ(n√n)    (3)
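The closed form in Equation 2 can be sanity-checked numerically by evaluating the double sum directly (a spot check of the algebra, not part of the proof):

```python
def id_list_total(t):
    """Evaluate the double sum of Equation 2 directly:
    sum over 1 <= p <= t, p <= q <= p + t - 2 of p + (t - q)."""
    return sum(p + (t - q)
               for p in range(1, t + 1)
               for q in range(p, p + t - 1))

def closed_form(t):
    # (t^3 + t^2 - 2t) / 2, always an integer
    return (t**3 + t**2 - 2 * t) // 2

for t in range(2, 20):
    assert id_list_total(t) == closed_form(t)
print(id_list_total(4))  # 36, matching (64 + 16 - 8) / 2
```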
    U:    1 2 3 4 1 2 3
    T_1:  1 2 3 4 $_1
    T_2:    2 3 4 1 $_2
    T_3:      3 4 1 2 $_3
    T_4:        4 1 2 3 $_4

Figure 2: Showing how the strings in T are substrings of U for t = 4. For each T_i ∈ T, each substring of length less than t is seen in two places in U.
As the space usage for a normal generalised suffix tree is Θ(n), we get a total space usage of Θ(n√n) for a tree extended with string ID lists for this family of sets of strings, and a bound of Ω(n√n) for the worst case space usage of this data structure in general.
6 An upper bound for strings of equal length

Assume we have T = {T_1, . . . , T_t}, with equal string lengths |T_i| = r. The total length of the strings is n = tr. Since there are fewer than n internal nodes in the tree, and each list contains at most t IDs, the total length of the lists is bounded as

|L| < tn = t(tr) = t²r ∈ O(t²r)    (4)
Seen differently, the jth suffix of each string has length r − j + 1, so there can be at most r − j internal nodes above the leaf node representing the suffix, to each of which it can contribute its string ID. This gives the bound

|L| ≤ t ∑_{j=1}^{r} (r − j) = t ∑_{k=0}^{r−1} k = t (r − 1)r / 2 ∈ O(tr²)    (5)
Together, these two bounds give

|L| ∈ O(min(t²r, tr²))    (6)

which is maximal for r = t. This gives n = t² = r², and the bound

|L| ∈ O(n√n)    (7)

Combining with Equation 3, we get a bound of Θ(n√n) for the worst case space usage for equal-length strings.
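Equation 6 can also be inspected numerically: for a fixed total length n = tr, the bound min(t²r, tr²) equals n · min(t, r), which is maximised when t = r = √n, matching Equation 7. The snippet below is a spot check over divisor pairs of one n, not part of the proof.

```python
def bound(t, r):
    # Equation 6: min(t^2 r, t r^2), which equals (t * r) * min(t, r)
    return min(t * t * r, t * r * r)

n = 36 * 36  # n = 1296, so sqrt(n) = 36
pairs = [(t, n // t) for t in range(1, n + 1) if n % t == 0]
best = max(pairs, key=lambda p: bound(*p))
print(best, bound(*best))  # (36, 36) 46656, i.e. n * sqrt(n)
```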
References
[1] Martin Farach. Optimal suffix tree construction with large alphabets. In Proc. FOCS, 1997.
[2] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3), 1984.
[3] Daniel M. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[4] Edward M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2), 1976.
[5] S. Muthu Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc. SODA, 2002.
[6] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(5), 1995.
[7] Peter Weiner. Linear pattern matching algorithms. In Proc. SWAT, 1973.
[8] Liang Zuopeng, Hu Kongfa, Ye Ning, and Dong Yisheng. An efficient index structure for XML based on generalized suffix tree. Information Systems, 32:283–294, 2007.
Paper 3
Nils Grimsmo, Truls Amundsen Bjørklund and Øystein Torbjørnsen
XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
Proceedings of the Second International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2010)
Abstract XML indexing and search has become an important topic, and twig joins are key building blocks in XML search systems. This paper describes a novel approach using a nested loop twig join algorithm, which combines several existing techniques to speed up evaluation of XML queries. We combine structural summaries, path indexing and prefix path partitioning to reduce the amount of data read by the join. This effect is amplified by only reading data for leaf query nodes, and inferring data for internal nodes from the structural summary. Skipping is used to speed up merges where query leaves have differing selectivity. Multiple access methods are implemented as materialized views instead of succinct secondary indexes, for better locality. This redundancy is made affordable in terms of space by using compression in a back-end with columnar storage. We have implemented an experimental prototype, which shows a speedup of two orders of magnitude on XPath queries with value predicates, when compared to existing open source and commercial systems using a subset of the techniques. Space usage is also improved.
Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
1 Introduction<br />
XML has become the dominant data format for exchange of structured and semi-structured data over the Internet. It may also be used to integrate data from disparate sources. When XML is generated from structured databases it is regular and has a known schema, while in other scenarios the data is ad hoc and without any schema.
XPath [31] and XQuery [32] are languages used to query XML data. XPath is a simple declarative language that supports matching of structure and content. XQuery is a more expressive iterative language, but uses XPath as a fundamental building block. This paper focuses on performance on a subset of XPath called twig pattern matching, which covers a majority of queries used in practice [11].
An XPath query is a tree, where the relations between the query nodes specify the relations between the data nodes in a possible match, and value predicates specify the contents of attributes and text nodes in the XML. The example query /lib/book[author="Kant" and author="Gödel"] asks for all books coauthored by Kant and Gödel. A parent-child relation is denoted by "/", an ancestor-descendant relation by "//", and "[]" encloses a predicate. Figure 1 shows a document where this query would have a match.
<lib>
  <book>
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
  ...
</lib>
Figure 1: Example XML document.
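The example query can be run directly with Python's standard-library ElementTree, whose limited XPath dialect supports the [tag='text'] predicate; the document literal below is our rendering of the example.

```python
# Evaluating /lib/book[author="Kant" and author="Gödel"] over the example
# document, using the stdlib's restricted XPath support.
import xml.etree.ElementTree as ET

doc = """<lib>
  <book>
    <title>Kritik der Unvollständigkeit</title>
    <author>Kant</author>
    <author>Gödel</author>
  </book>
</lib>"""

root = ET.fromstring(doc)  # root is the lib element
# The conjunction of predicates is expressed by chaining them:
hits = root.findall("./book[author='Kant'][author='Gödel']")
assert [b.findtext("title") for b in hits] == ["Kritik der Unvollständigkeit"]
```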
A typical system supporting XPath parses the XML and stores some information about each data node, usually its name, type and value, and its relation to other nodes. This can be stored in a table, and is usually sorted on document and node order for faster structural joins on the nodes in the query tree. When evaluating queries, it may be more advantageous to access nodes either by name, value, natural order, or other fields. Multiple secondary indexes can be added for direct lookup on all the fields stored for a node.
To find all matches for the example query, many current systems would read six lists of nodes from indexes, looking up the four XML element nodes on name (lib, book, author and author), and the two text nodes on value ("Kant" and "Gödel"). These would then be joined to give full matches, using the structural information stored about each node. In a library with millions of books, there would be just as many book nodes, and even more author nodes to join.
The main contributions of this paper are: 1) A twig join algorithm combining previous techniques in a novel way; 2) Implementation of an experimental prototype; 3) Experiments showing two orders of magnitude speedup over other systems on queries with value predicates, and reduced space usage.
2 Background
Indexing and querying XML has been a major research area in both the database and information retrieval communities over the last ten years, resulting in many different approaches.
2.1 Schema Agnostic XML Indexing<br />
An early approach to XML indexing, usable with simple schemas, is to translate the XML schema to a relational schema and put shredded XML data into an RDBMS [28]. However, the XML schema can be complex, subject to updates, unknown, or non-existent; this flexibility may be considered a strength of XML.
In schema agnostic XML indexing, all element, attribute and text nodes are extracted, and stored with information about their type, name, value and position in the tree. During query evaluation a list of matching data nodes is read for each query node, and these are joined into complete matches for the query tree. To allow joining matches for individual query nodes into tree pattern matches, the positional information must make it possible to decide node relationships, such as ancestor-descendant and parent-child. The most well-known tree position encoding is the regional BEL encoding [37], which gives a node's begin and end position in the document, and its level (depth) in the tree.
Figure 2 shows an example using the BEL encoding on the data extracted for the XML document in Figure 1. The data is usually indexed on name for XML element nodes, and on value for text nodes. When querying, one list of nodes is typically read for each node in the query. Retrieving element nodes on name (tag) is called tag partitioning (or tag streaming [6]). To evaluate the XPath query //lib/book, the nodes u and v (for lib and book) satisfying u.begin < v.begin ∧ v.end < u.end ∧ u.level + 1 = v.level, in addition to the type and name requirements, must be found.
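The parent-child condition above can be checked directly on BEL tuples. The sketch below uses the rows of Figure 2; the tuple type is ours, and the end value for lib (elided as ". . ." in the figure) is an assumed placeholder.

```python
# Direct check of the BEL parent-child condition from the text.
from collections import namedtuple

Node = namedtuple("Node", "begin end level name")

def is_child(u: Node, v: Node) -> bool:
    """True if v is a child of u under the regional BEL encoding."""
    return u.begin < v.begin and v.end < u.end and u.level + 1 == v.level

lib   = Node(1, 14, 1, "lib")     # end value assumed; elided in Figure 2
book  = Node(2, 12, 2, "book")
title = Node(3, 5, 3, "title")
assert is_child(lib, book)
assert is_child(book, title)
assert not is_child(lib, title)   # descendant but not child: levels differ by 2
```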
2.2 Tree-aware Joins<br />
If the node lists are sorted on doc and begin values, a linear merge can be used when the schema and query are simple. But in other cases, matches may be formed from the lists out of order. Consider the document in Figure 3 and the query //a[c and d]. A linear merge of the lists would miss one of the two pairs of c and d nodes that are usable together.
Using a full cross join on the lists of data nodes matching each query node, and checking the output for legal matches, is not feasible for large data, as it can give intermediate results exponential in the size of the query, even when the final answer is small.
The first specialized tree joins focused on splitting the query tree into binary relationships and stitching the results into complete matches. Specialized loop joins gave optimal O(I + O) complexity for ancestor-descendant relationships [37], where I is the size of the input lists and O is the size of the output. When joining an ancestor and a descendant
doc begin end level type name value<br />
1 1 . . . 1 Elem. lib<br />
1 2 12 2 Elem. book<br />
1 3 5 3 Elem. title<br />
1 4 4 4 Text “Kritik. . . ”<br />
1 6 8 3 Elem. author<br />
1 7 7 4 Text “Kant”<br />
1 9 11 3 Elem. author<br />
1 10 10 4 Text “Gödel”<br />
1 13 . . . 2 Elem. article<br />
. . . . . . . . . . . . . . . . . . . . .<br />
Figure 2: Data extracted for tag partitioning.<br />
Figure 3: Breaking linear merge for //a[c and d].
list, a "marker" is left at matching positions in the descendant list, and the descendant list is rewound to this position when the ancestor list is forwarded.
Later, stack-based joins gave O(I + O) complexity also for parent-child relationships [2]. These maintained a stack of nested ancestor candidates in a single traversal of the two lists. Any current descendant candidate would be a descendant of all the nodes on the stack, and possibly the child of the top node.
As evaluating the query node relationships separately still gave useless intermediate results of exponential size, multi-way path and twig join algorithms were introduced [4]. By maintaining multiple stacks, these achieved optimal O(I + O) complexity using only O(d) memory, where d is the document depth, for path queries and for twig queries using only the descendant axis. Later algorithms, which break the O(d) memory bound, achieve optimal complexity for all combinations of ancestor-descendant and parent-child edges [5].
2.3 Skipping Tree Joins<br />
When the lists of matches for the different query nodes have similar size, regular tag partitioning is a fairly efficient solution, but that is often not the case. Take for example the query //book[author="Kant"], evaluated on a library with millions of books. The leaf predicate probably has good selectivity, while the other nodes do not.
Skipping can be used to improve efficiency in such cases. Consider the query //a//d, and the problem of forwarding in the lists for the two query nodes to the first position where a match for a is an ancestor of a match for d. Skipping forward in the list for d to find the first d_j which can be a descendant of the current node a_i for a is trivially done by finding the first d_j with d_j.begin > a_i.begin. If d_k is the last d node with d_k.end < a_i.end, then all d descendants of a_i (if any) are stored contiguously from d_j to d_k.
However, this technique cannot be used to skip in the ancestor list, as any node with a lower begin value is a candidate, and the actual ancestors can be spread evenly among a large number of such candidates. Specialized data structures can be used to skip efficiently in both ancestor and descendant lists [33].
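The descendant-list skip described above maps directly onto binary search. A sketch, assuming the d nodes are not nested among themselves (so their end values are sorted too); real systems skip via index structures rather than in-memory arrays.

```python
# Find the contiguous block of d nodes nested inside the current a candidate.
import bisect

def descendant_range(d_begins, d_ends, a_begin, a_end):
    """Return (j, k) so that d nodes j..k-1 are nested inside (a_begin, a_end)."""
    j = bisect.bisect_right(d_begins, a_begin)  # first d with begin > a.begin
    k = bisect.bisect_left(d_ends, a_end)       # first d with end >= a.end
    return j, k

d_begins = [3, 8, 20]
d_ends   = [4, 9, 21]
j, k = descendant_range(d_begins, d_ends, 1, 10)
assert (j, k) == (0, 2)  # the first two d nodes lie inside a = (1, 10)
```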
2.4 Adding Structural Summaries<br />
A different way of avoiding many useless nodes in the query node match lists is to do some partial matching in a preprocessing step. In the approach called path indexing, some of the structure of the data is extracted and put in a path summary [10], which is a tree containing all label-unique root-to-leaf paths seen in the data. This is used in a query preprocessing step to partially identify structural matches for the query.
Figure 4 shows the structural data extracted from the example in Figure 1. Note that each path seen in the data (e.g. /lib/book/author) is added only once to the summary. The data stored for each document node is then linked to nodes in the summary in some way. One approach is shown in Figure 5. Summary tree nodes are given unique integer path IDs, which are referenced by the indexed data nodes. Here the Dewey encoding [30] is used to enumerate data node positions.
Figure 4: Path summary tree for the example document, rooted at path ID 0.
doc Dewey pathID value<br />
1 1 1<br />
1 1.1 3<br />
1 1.1.2 6<br />
1 1.1.2.1 7 “Kritik. . . ”<br />
1 1.1.3 4<br />
1 1.1.3.1 5 “Kant”<br />
1 1.1.4 4<br />
1 1.1.4.1 5 “Gödel”<br />
1 1.2 2<br />
. . . . . . . . . . . .<br />
Figure 5: Data extracted for prefix path partitioning.
The query is first matched against the summary, which yields the matching path IDs for the branching nodes and leaves in the query. Lists of data nodes matching the related summary nodes can then be read. This is called prefix path partitioning (or prefix path streaming [6], as opposed to tag streaming). It usually results in reading less data, as the set of matching paths is often much more selective than the node name.
2.5 Virtual Nodes<br />
The Dewey encoding, which is used in Figure 5, has an advantage over the BEL encoding when combined with structural summaries, because it allows ancestor reconstruction. Only data node lists for branching and leaf query nodes need to be read, because matching of non-branching internal nodes is implied by the structural matching of the prefix paths of the checked nodes.
This approach is taken one step further by generating virtual nodes for the internal query nodes from the lists of leaf node matches [36]. Given the Dewey and pathID of a data node matching a leaf query node, the Deweys and pathIDs of all data nodes above it can be inferred. The Dewey of an ancestor at depth d is the length-d prefix of the Dewey of any of its descendants, and its pathID can be found by going up to depth d in the summary tree.
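Ancestor reconstruction can be sketched in a few lines. The parent_path table below is our encoding of the summary implied by Figure 5 (pathID → parent pathID, with 0 as the root); the function itself follows the prefix rule just described.

```python
# Summary of Figure 5: lib=1, article=2, book=3, author=4, author-text=5,
# title=6, title-text=7; root has path ID 0.
parent_path = {1: 0, 2: 1, 3: 1, 4: 3, 5: 4, 6: 3, 7: 6}

def ancestor(dewey, path_id, depth):
    """Virtual node (dewey, pathID) for the ancestor at the given depth."""
    p = path_id
    for _ in range(len(dewey) - depth):  # walk up the summary tree
        p = parent_path[p]
    return dewey[:depth], p              # Dewey is just the prefix

# Leaf "Kant": Dewey 1.1.3.1, pathID 5 (Figure 5). Its depth-2 ancestor is
# the book node: Dewey 1.1, pathID 3.
assert ancestor((1, 1, 3, 1), 5, 2) == ((1, 1), 3)
```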
Using Dewey to enumerate nodes requires non-linear space and shows poor performance in some degenerate cases, but it is commonly used in practice, for example in Microsoft SQL Server [22]. Space usage can be improved by using various compression schemes [13], and updatability can be attained [21].
2.6 Column Storage<br />
A trend in database systems research over the last decade has been investigating how performance can be improved for read-intensive workloads. An important contribution in this field is column store databases [29], where each column in a table is stored separately. MonetDB/XQuery [20] is a well-known XPath/XQuery system using columnar storage.
Advantages of column stores include reading less data when not all columns in a table are needed for a query, better cache behavior when block processing is used within a column, and better compression, as techniques such as run-length encoding (RLE) are easy to apply [14, 1]. And as tuples can be materialized late, there can be less computational work, especially if block processing is used [1].
Using compression can give benefits beyond the obvious reduction in space usage. Compression of inverted lists is commonly done in regular search engines to reduce disk I/O [35], and using compression can reduce the memory bus bottleneck in database systems [23].
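Run-length encoding, mentioned above as a natural fit for sorted columns, is small enough to show in full; this is a minimal sketch, not the paper's implementation.

```python
# RLE over a column: encode runs of equal values as (value, run_length) pairs.
from itertools import groupby

def rle_encode(column):
    """Encode a column as (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(column)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the column."""
    return [v for v, n in pairs for _ in range(n)]

col = [5, 5, 5, 5, 7, 7, 9]  # e.g. a sorted pathID column
enc = rle_encode(col)
assert enc == [(5, 4), (7, 2), (9, 1)]
assert rle_decode(enc) == col
```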
3 The XLeaf System<br />
XLeaf is a novel combination of many previous techniques for evaluating twig queries. It combines structural summaries, prefix path partitioning, virtual node lists for internal query nodes, skipping joins, multiple access methods, compression and column storage. Like most other academic XML search prototypes, our system supports the descendant and child axes, and simple value predicates.
The main difference between our approach and the majority of work on twig joins is that we use a nested loop join, where the size of the state is linear in the size of the query. Most approaches read the input lists once (streaming), and maintain larger intermediate results.
3.1 Query Evaluation<br />
Algorithm 1 gives an overview of how a query is evaluated in our system. First the query is matched against the summary, as described in Section 3.2. Then an access method for candidate matches is chosen for each leaf query node, as described in Section 3.3.
For queries with a single return node, as in XPath, it is also often possible to avoid looping and use a simple linear join. Such a join is correct when the depth of each branching node is fixed, as the query will then not have out-of-order matches. The simple linear join is described in Section 3.4, and the general looping join in Section 3.5.
Note that in the following, only output, branching and leaf query nodes need to be considered. For internal query nodes with one child, matching is implied by the matching of the root-to-node paths of the nearest branching and leaf node descendants in the query tree. For a parent query node with multiple children, on the other hand, it is essential that all of them can use the same data node as the parent to construct a legal match.
3.2 Summary matching<br />
Our summary is indexed using an inverted index over the root-to-node paths in the summary tree, similar to what is described in [12]. Paths in the document tree are viewed as strings of node names, where the names are dictionary coded and stored as integers.
Algorithm 1 Overall query evaluation
1: procedure Evaluate(Q)
2:   SummaryMatch(Q)
3:   for l_q ∈ Q.leaves
4:     ChooseView(l_q)
5:   if no matches possible
6:     return
7:   if ∀ b_q ∈ Q.branching : b_q.minDepth = b_q.maxDepth
8:     LinearJoin(Q)
9:   else
10:    LoopingJoin(Q)
The first step of the summary matching is finding the matches for the individual root-to-leaf paths in the query. The most selective element name in each path is looked up in the inverted index, and a list of candidate paths is retrieved. This list is then matched against the pattern.
For each query node, the matching path IDs are saved, along with their respective candidates for the parent branching node. This is more robust than storing all legal combinations of matches for the query nodes above, as their total number can be exponential in the size of the query. This typically happens with deeply recursive schemas.
After finding the individual root-to-leaf path matches, the query tree is traversed bottom-up and then top-down to prune away node matches which can never be part of a complete match.
3.3 Data access methods<br />
It is common in XML indexing systems to index element nodes on name and text nodes on value. For prefix path partitioning, nodes must also be accessible by path. In our system we use multiple access methods for all node types, and attempt to choose the most efficient one for each query leaf node during twig evaluation. Lists for internal nodes are virtual (see Sections 3.4 and 3.5).
One approach to implementing multiple access methods is storing a table with all nodes in the data, and having multiple indexes point into this storage. But if reading matches from such an index means following pointers into a node table, this gives bad spatial locality. It also makes it inefficient to use compression schemes with expensive random lookups. To avoid this, we use multiple materialized views of the data table, with different sort orders. The views we create are shown in Figure 6, where underlined fields give the sort order.
The base table is built while scanning the data, and the other tables are derived from it using a stable ternary quicksort. As the sort is stable, and the source is sorted on the first two columns, a target table is in order after sorting on the first target column only.
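Why one stable sort on the first target column suffices can be shown concretely: the base table is in (doc, dewey) order, so a stable sort on, say, pathID leaves rows with equal pathID in (doc, dewey) order, which is exactly the order t_path needs. Python's stable sorted() stands in for the paper's stable ternary quicksort; the rows are our rendering of Figure 5.

```python
# (doc, dewey, pathID) rows in document order, i.e. sorted on (doc, dewey).
base = [
    (1, (1,), 1),        # lib
    (1, (1, 1), 3),      # book
    (1, (1, 1, 3), 4),   # author
    (1, (1, 1, 4), 4),   # author
    (1, (1, 2), 2),      # article
]

# Stable sort on pathID alone...
t_path = sorted(base, key=lambda row: row[2])
# ...gives the same order as a full sort on (pathID, doc, dewey).
assert t_path == sorted(base, key=lambda row: (row[2], row[0], row[1]))
```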
t_base (doc, dewey, nameID, pathID, value)
t_value (value, doc, dewey, nameID, pathID)
t_path (pathID, doc, dewey, value)
t_name (nameID, doc, dewey, pathID, value)
Figure 6: Materialized views used.
The mappings for nameID and pathID are stored in tables, which are read into separate in-memory data structures for fast lookup. Note that the field nameID also indicates the type of the node (element, attribute, text, etc.), and that for text nodes, the name of the parent element node is stored. Where a query leaf has multiple matching pathIDs, the hits have to be merged when read from the table t_path. In many cases it is cheaper to scan the t_name matches and filter out non-matching nodes on pathID.
3.4 Simple Linear Join<br />
Assume that, as in XPath, the query has one output node, and that there is one list of data node matches for each leaf node q in the query, which can be read with Read(q) and moved one data node forward with a call to Advance(q).
The linear join shown in Algorithm 2 is used when, for each branching query node, there is a fixed tree depth for the matching summary nodes. The correctness of using a linear join in this case follows directly from the proofs for TwigStack [4]. There, matches for root-to-leaf paths are output from the first step of the join algorithm sorted root-to-leaf on the document order of the matched nodes, and fed to a linear merge, which is the second and last step. Since the depths of all branching nodes are fixed, the lists of leaf node matches will already be in the correct order, from the fact that the data is a tree.
The reason no pathID ever needs to be read in Algorithm 2 is that the incoming leaf lists only contain structural matches for the root-to-leaf paths, with the branching node matches above them fixed in depth. A simple comparison of Deweys determines a common branching node in the data.
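The Dewey comparison used here is a prefix test: two leaf matches share the (fixed-depth) branching node exactly when their Deweys agree on the first m components. A one-function sketch:

```python
def same_branching_node(dewey1, dewey2, m):
    """True if both leaves descend from one data node at depth m."""
    return dewey1[:m] == dewey2[:m]

# The two author text nodes of Figure 5 share the book node at depth 2.
assert same_branching_node((1, 1, 3, 1), (1, 1, 4, 1), 2)
assert not same_branching_node((1, 1, 3, 1), (1, 2, 1), 2)
```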
3.5 Loop Join<br />
For cases where a simple linear join cannot guarantee a correct result, a nested loop join is used. There are two main modifications from the simple linear join. Markers are left in the lists, to which the list cursors are rewound when necessary. And for a given candidate alignment of the leaf lists, it must be checked whether it is possible to choose candidates for the branching query nodes which are usable for all the leaf matches.
In previous loop join approaches [4], where there are explicit lists for internal query nodes, child list markers are updated when the parent's list cursor is forwarded, but this method cannot be used directly in our approach. First we advance the list cursors to alignment based on the Deweys matched down to the minimal possible depth of the above
Algorithm 2 Simple linear join
1: procedure LinearJoin(Q)
2:   q_s := Q.selectingNode    ⊲ Assume a leaf
3:   while SimpleNextMatch(Q.root)
4:     Output(q_s)
5:     Advance(q_s)
⊲ Note: non-branching, non-leaf, non-selecting nodes are ignored.
6: procedure SimpleNextMatch(q)
7:   if IsLeaf(q)
8:     q.d := Read(q).dewey
9:     return q.d
10:  m := q.minDepth    ⊲ Equals q.maxDepth
11:  x := max_{q_c ∈ q.children} q_c.d
12:  while true
13:    for q_c ∈ q.children
14:      Align(q_c, x, m)
15:      x_c := SimpleNextMatch(q_c)
16:      if x_c = ∅ return ∅
17:      x := max(x_c, x)
18:    if no change since last step
19:      q.d := (x_1, . . . , x_m)
20:      return q.d
21: procedure Align(q, x, m)
22:   while Read(q).dewey < (x_1, . . . , x_m)    ⊲ Lexicographic
23:     if not Advance(q) return
branching nodes. This depth is the maximum over the leaf matches of the minimal depth of a candidate for the branching node (max-of-the-min) from the summary matching phase. Then list positions are marked, before we iterate through the possible alignments. When list number i in the order is forwarded, list i + 1 is rewound to the mark, before it is aligned based on Dewey, and the new mark is saved if it differs from the previous one. To speed up the iteration, the lists for all leaf query nodes are ordered on the expected number of hits.
The procedure depicted in Algorithm 3 checks whether a given alignment is a match. First, the query is traversed bottom-up, and common candidates for the branching nodes are chosen. Then it is traversed top-down, choosing the uppermost common candidate for each branching node. Finally, it is checked that the chosen branching nodes are the same physical nodes, by comparing the Deweys. Note that the top-down pass and the final bottom-up pass can be completed in one top-down traversal, as the Dewey for an internal node need not be materialized, but is given by the Dewey of any leaf descendant and
the chosen depth.<br />
Even though bit-vectors are used to implement the sets of candidates, using Algorithm 3 is expensive. For cases where many alignments are mismatches, it is cheaper to first check the Deweys down to the max-of-the-min depth. This gives false positives, so matches must still be verified with Algorithm 3. The max-of-the-min used here can be calculated from the branching nodes usable for the current leaf node matches, instead of from all branching node candidates from the summary matching.
In cases where the output node is not a leaf value node, but matches XML element nodes, care must be taken to iterate through all candidates and not to choose duplicates (line 20 in Algorithm 3). Also, to output entire data subtrees as matches, the table t_base (see Section 3.3) must be scanned or searched during query processing to retrieve all nodes which have the given Dewey as a prefix.
3.6 Skipping<br />
Skipping is crucial for efficient merges, and is included in Algorithm 2 by modifying the procedure Align to use underlying data structures to skip forward to the next item matching the parameter x. Skipping is implemented in our system by combining a "one level skip list" [35] with search. Every k-th value is extracted from the columns on which merges will be performed (doc and dewey), and put in a separate column. k is chosen to be a power of two (usually 32 in our system) to allow arithmetic using shifts. Note that this column can also double as check-points in compression (see Section 3.7). When forwarding a list to a value, the first few values in the smaller column are checked, and if no match is found, a binary search is performed there. Then a segment of the full column is scanned linearly.
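The forwarding scheme can be sketched as follows, under the assumptions stated in the text (every k-th key copied to a skip column, k a power of two so the segment index is a shift); the binary search over the skip column stands in for the check-then-bisect step.

```python
# One-level skip list forwarding: bisect the small skip column, then scan
# one k-sized segment of the full column.
import bisect

K_SHIFT = 5          # k = 32, as in the system described
K = 1 << K_SHIFT

def forward_to(column, skip, target):
    """Index of the first element >= target; skip = column[::K]."""
    seg = bisect.bisect_right(skip, target) - 1  # segment whose first key <= target
    i = max(seg, 0) << K_SHIFT                   # segment start, via a shift
    while i < len(column) and column[i] < target:
        i += 1                                   # linear scan inside the segment
    return i

column = list(range(0, 2000, 2))                 # sorted keys 0, 2, ..., 1998
skip = column[::K]                               # every 32nd key
assert forward_to(column, skip, 1001) == column.index(1002)
```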
3.7 Storage Back-end<br />
The first iteration of this prototype used MonetDB/SQL [20] as a back-end for storing the data. This was replaced by our own column store back-end because the overhead of using the communication interface to the server back-end was significant, and because the open source version of MonetDB lacks compression.
Our minimalistic implementation uses memory-mapped files to write to and read from disk, and uses compression to reduce space usage. A column adapter chooses which compression method to use based on statistics of the data. Different compression schemes are favorable depending on whether the columns are self-ordered, and whether they have few or many distinct values.
Supported storage methods for integer types are raw, bit packing, run length encoding
(RLE), delta encoding and dictionary encoding. The last two are more expensive, and
are used only in the maximum compression variant in the experiments (see Section 4).
Typically RLE or delta coding is chosen for (partially) sorted columns, and bit packing or
dictionary encoding for unsorted columns. The delta encoding is done using VByte [27],
with checkpoints for faster random access. The dictionary encoded column sorts values
on frequency, and uses VByte to store the coded values. Many column types are stored
Paper 3: XLeaf: Twig Evaluation with Skipping Loop Joins and Virtual Nodes
Algorithm 3 Checking leaf alignment
    ⊲ Note: non-branching, non-leaf, non-selecting nodes are ignored.
 1: procedure CheckLeafAlignment(Q)
 2:   q_t := Q.topBranchingNode
 3:   Candidates(q_t)                ⊲ Bottom-up
 4:   if q_t.C = ∅
 5:     return false
 6:   ChooseUppermost(q_t)           ⊲ Top-down
 7:   if not CheckMatch(q_t)         ⊲ Bottom-up
 8:     return false
 9: procedure Candidates(q)
10:   if IsLeaf(q)
11:     q.p := Read(q).pathID
12:     return q.C := {q.p}
13:   else
14:     return q.C := ⋂_{q_i ∈ Children(q)} ParentCand(q_i)
15: procedure ParentCand(q)
16:   return ⋂_{c_i ∈ Candidates(q)} q.parMatchCand[c_i]
17: procedure ChooseUppermost(q)
18:   if IsLeaf(q)
19:     return
20:   q.p := arg min_{c_i ∈ q.C} c_i.depth
21:   for q_i ∈ q.children
22:     q_i.C := {c_i | c_i ∈ q_i.C ∧ q.p ∈ q_i.parMatchCand[c_i]}
23:     ChooseUppermost(q_i)
24: procedure CheckMatch(q)
25:   if IsLeaf(q)
26:     q.d := Read(q).dewey
27:     return true
28:   for q_i ∈ q.children
29:     if not CheckMatch(q_i)
30:       return false
31:   x := q.children[1].d
32:   m := q.p.depth
33:   q.d := (x_1, . . . , x_m)
34:   for q_i ∈ q.children
35:     if LCP(q.d, q_i.d) < m
36:       return false
37:   return true
CHAPTER 4. INCLUDED PAPERS
as multiple physical columns. For example, an RLE column is stored as separate “values”
and “cumulative count” columns.
Strings are stored raw, or using a dictionary. For the raw strings, a column of pointers
is used for random access. Dictionaries are shared between the different materialized
views. Logical columns consisting of several physical columns are compressed
recursively if this is considered favorable.
In the current prototype, no explicit indexing is done, and a simple binary search is
used. In the following experiments, all queries are run warm to simplify the experimental
setup. The poor spatial locality of the binary search is then less of an issue, but
for later experiments, running large batches of queries over more data, indexing should
be considered.
However, some of the column types have specialized search implementations. For
RLE, the “values” column is searched, and the “cumulative count” column is used to
translate to row numbers. For the delta encoding, a search in the checkpoints is done
first, before decoding the final block of coded values. For the dictionary encoding, the
dictionary is searched first, before searching for the coded value in the data column.
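To make the RLE case concrete, here is a minimal sketch (class and method names are ours) of an RLE column stored as the two physical columns described above, where the cumulative counts translate between run indices and row numbers:

```python
from bisect import bisect_left, bisect_right

class RleColumn:
    """Run-length-encoded column stored as two physical columns:
    run "values" and "cumulative count" (total rows up to and
    including each run)."""

    def __init__(self, rows):
        self.values, self.cum = [], []
        total = 0
        for v in rows:
            total += 1
            if self.values and self.values[-1] == v:
                self.cum[-1] = total        # extend the current run
            else:
                self.values.append(v)       # start a new run
                self.cum.append(total)

    def __getitem__(self, row):
        # The first run whose cumulative count exceeds `row` covers it.
        return self.values[bisect_right(self.cum, row)]

    def find(self, v):
        """Row number of the first occurrence of v, or -1.
        Assumes the column is self-ordered (run values sorted),
        so the short values column can be binary-searched."""
        run = bisect_left(self.values, v)
        if run == len(self.values) or self.values[run] != v:
            return -1
        return 0 if run == 0 else self.cum[run - 1]
```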
A note should be made on the encoding of Deweys. A variable number of bytes is
used per element in the Dewey, and the first bits of an element number give the number of
bytes in the element. This allows comparing two Deweys with a simple string comparison
to find which is smaller, which is useful when merging lists of nodes. To check whether two
Deweys match to a given depth in the document tree, the codes must be parsed when
compared.
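The exact bit layout is not specified in the text; the following sketch uses an invented layout (leading bits 0/10/110 selecting one, two or three bytes) that has the stated property: byte-wise string comparison of whole encoded Deweys agrees with their lexicographic (pre-order document) order:

```python
def encode_elem(n):
    """Order-preserving, length-prefixed code for one Dewey element.
    The leading bits give the number of bytes, and all shorter codes
    sort (byte-wise) before all longer ones."""
    if n < 1 << 7:
        return bytes([n])                        # 0xxxxxxx
    if n < (1 << 7) + (1 << 14):
        m = n - (1 << 7)
        return bytes([0x80 | (m >> 8), m & 0xFF])  # 10xxxxxx + 1 byte
    m = n - (1 << 7) - (1 << 14)
    assert m < 1 << 21
    return bytes([0xC0 | (m >> 16), (m >> 8) & 0xFF, m & 0xFF])

def encode_dewey(path):
    # Concatenating prefix-free, order-preserving element codes keeps
    # byte-wise comparison consistent with tuple comparison.
    return b"".join(encode_elem(e) for e in path)

a, b = encode_dewey((1, 200, 3)), encode_dewey((1, 200, 10))
assert (a < b) == ((1, 200, 3) < (1, 200, 10))
assert encode_dewey((1, 200)) < a   # an ancestor sorts before its subtree
```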
4 Experimental Results
This section presents results for query performance evaluation of various XPath implementations.
Experiments were performed on an AMD Athlon 64 Processor 3500+ at 2.2
GHz, with 4 GB main memory, running Linux 2.6.28.
The open source MonetDB/XQuery (MoXQ) [3] and an anonymous commercial system
(SysA) were tested. They were included because they feature some, but not all, of
the techniques from Section 3.1. SysA uses path indexing and multiple access methods,
but reads lists for internal query nodes. MoXQ uses tag partitioning with no skipping,
and has a column store back-end. The release from August 2008 was used. We also
tested the systems Berkeley DB XML, X-Hive and eXist, but the results are omitted, as
these systems had poorer performance than MoXQ and SysA, and the results did not
yield further insights.
For our own prototype we have included performance results for the compression
schemes none (XLeaf_none), lightweight (XLeaf_light) and maximum (XLeaf_max). The latter
includes delta and dictionary encoding for integer types.
It should be noted that comparing a minimalistic prototype with full-fledged systems
offering features like transaction management is not fair. Ideally, algorithms should be
compared within the same framework, but this was not done in these preliminary
experiments, due to time constraints.
Some queries have been rewritten to counting queries to avoid measuring the overhead
of printing answers, and because outputting hundreds of thousands of results is not a
probable user scenario. All queries were given 2 warm-up runs, and were executed 10
times.
4.1 Indexing Performance and Space Usage
Figure 7 shows the indexing performance of the tested methods on the DBLP corpus [17]
and the artificial benchmark XMark [26]. The table shows megabytes indexed per second,
and the size of the index divided by the size of the original unparsed data.
                  MoXQ   SysA   XLeaf_none   XLeaf_light   XLeaf_max
  DBLP (440 MB)
    MB/s          1.92   0.27   0.43         0.45          0.30
    Space         2.6    14.1   5.0          1.84          1.59
  XMark (1118 MB)
    MB/s          8.8    0.45   1.39         1.41          1.06
    Space         1.78   10.8   4.2          1.32          1.20

Figure 7: Indexing performance.
MoXQ has the fastest indexing, and is fairly space efficient. The increased space
requirements from adding the additional access methods in SysA may make it unusable
in some scenarios. The space usage of our uncompressed variant (XLeaf_none) may also
be too high, but the two compressed variants are very space efficient, and use less space
than MoXQ, even though they feature multiple access methods like SysA, and use the
Dewey encoding, which has more redundancy than the BEL encoding used in MoXQ.
Note that building the lightweight compressed index XLeaf_light is faster than building
the uncompressed index, while the heavily compressed variant is much slower, but only
reduces storage requirements by a further 10-15%.
4.2 DBLP Queries
Figure 8 shows some queries on DBLP taken from [25], and Figure 9 gives the running
times.
  #   Query                                                                      Hits
  P1  //inproceedings[author="Jim Gray"][year="1990"]/@key                       6
  P2  //www[editor]/url/text()                                                   5
  P3  //book/author[text()="C. J. Date"]/text()                                  13
  P4  //inproceedings[title/text()="Semantic Analysis Patterns."]/author/text()  2

Figure 8: DBLP queries.
               P1     P2     P3     P4
  MoXQ         1815   131    257    1151
  SysA         972    2.5    0.53   7.2
  XLeaf_none   0.126  0.064  0.039  0.058
  XLeaf_light  0.130  0.064  0.044  0.058
  XLeaf_max    0.42   0.123  0.044  0.067

Figure 9: Query performance. Run-time in milliseconds.
The results for MoXQ are worse than one might expect, especially on P1 and P4. This
is because the data has a very flat and repetitive structure, with many node matches for
//inproceedings. SysA has better performance, due to its use of path indexing, but it
is still slow on P1, probably because merging the three branches is expensive. In P2, the
total number of matches for //www is low.
Our implementations perform very well in this experiment, because only data for leaf
nodes is read, and skipping is applied in the join. The differences between XLeaf_none
and XLeaf_light are almost negligible. We expected the lightweight compression to give
better performance due to lower data bandwidth requirements. One reason this is not
the case may be that queries are run warm, which reduces the bandwidth bottleneck due to
caching effects, in combination with slightly more expensive computation for XLeaf_light.
XLeaf_max has worse performance, due to even more computation, and perhaps worse
memory access patterns, because of the dictionary compression for integer types.
  #   Query                                                                               Hits
  A1  count(/site/closed_auctions/closed_auction/annotation/description/text/keyword)     40726
  A2  count(//closed_auction//keyword)                                                    124843
  A3  count(/site/closed_auctions/closed_auction//keyword)                                124843
  A4  count(/site/closed_auctions/closed_auction[annotation/description/text/keyword]/date)  26570
  A5  count(/site/closed_auctions/closed_auction[descendant::keyword]/date)               53342
  A6  count(/site/people/person[profile/gender and profile/age]/name)                     32242
  V1  ... keyword[text()=" preventions "] ...                                             55
  V2  ... keyword[text()=" preventions "] ...                                             145
  V3  ... keyword[text()=" preventions "] ...                                             145
  V4  ... keyword[text()=" preventions "] ... date[text()="06/27/1998"] ...               11
  V5  ... keyword[text()=" tempests "] ... date[text()="04/18/1999"] ...                  12
  V6  ... gender[text()="male"] ... age[text()="18"] ... name[text()="Mehrdad Takano"] ...  19

Figure 10: XMark queries from XPathMark. Queries V1-V6 are equal to queries A1-A6
with added value predicates.
4.3 XPathMark A Queries
Queries A1-A6 on the XMark data are from XPathMark [9, 8], an XPath equivalent of the
XQuery benchmark XMark. XMark is artificially generated, and has properties rather
different from DBLP, which has a flat and repetitive structure; XMark is a deeper tree,
following a recursive schema.
               A1    A2   A3   A4   A5    A6   V1     V2    V3    V4    V5     V6
  MoXQ         214   85   110  263  156   348  272    249   262   323   294    456
  SysA         3.9   352  348  178  2128  446  11.8   398   371   30    46     97
  XLeaf_none   9.1   72   73   56   153   181  0.160  0.27  0.29  0.32  0.144  4.3
  XLeaf_light  8.5   94   94   57   180   196  0.24   0.38  0.28  0.58  0.20   4.6
  XLeaf_max    9.6   93   94   166  190   708  0.21   0.32  0.34  3.6   0.70   24

Figure 11: Query performance. Run-time in milliseconds.
MoXQ and SysA seem to have about equal performance, with the former in the lead.
In these tests, the former performs relatively much better than on DBLP, because value
predicates are not used, and the data tree is differently shaped. The number of matches
for nodes near the root of the query is usually low, and the query leaf matches make up
the majority. This gives a much cheaper join relative to the total number of matches.
Note the performance of MoXQ and XLeaf_light (or SysA) on queries A1 and A2, which
highlights a key difference between the two systems. MoXQ is twice as fast on A2 as on A1,
even though A2 has three times as many hits. On A1 it must merge matches for seven
internal nodes, but on A2 only two of these are involved. XLeaf_light, on the other hand,
looks up the first query on pathID, and all nodes are matches. The second query is looked
up on nameID and filtered on pathID, as there are multiple matches for the latter. Note
that the performance on A3 is the same as on A2, as the executions are identical in our
system. The fact that MoXQ is almost as fast as XLeaf_none on A2, even though it reads
more data and does more work, indicates a better implementation.
XLeaf_light is faster on A4 than A3, even though the former is branching. The reason
is that the leaves can be looked up directly on pathID, and hits are found efficiently using
skipping. A5 is three times as slow, even though there are only two times as many hits.
This is because the predicate leaf on keyword has more matches. A6 is also much slower
than A4, even though the number of hits is similar, again because the individual query
leaves have more matches.
Comparing SysA and XLeaf_light on queries A2 and A3 may show the benefit of using
a column store back-end. SysA also looks up nodes on path in these queries, but is more
than three times slower. Note that for the branching queries A4-A6, its disadvantage of
reading matches for branching nodes does not show as much as on the DBLP queries, as
the data is deeper and more “tree shaped”, with more matches for the query leaves than
for the branching nodes.
4.4 XPathMark A Queries, with Value Predicates
Queries V1-V6 are the same as A1-A6, but with added value predicates. These were
chosen to give as many hits as possible.
MoXQ is slightly slower with value predicates, probably because of the extra text
nodes to be read and merged. SysA is also slower on the first three queries, but faster
on the last three, because looking up the leaves on value is much more selective. There
are, however, still many matches for the branching node, giving an unnecessarily expensive
join. A comparison of SysA and XLeaf_light, both using path indexing, shows the benefits
of our system's features. Our implementation avoids handling the large number of matches
for the internal query nodes, and is 20-200 times faster.
Query V6 is more expensive than the others for XLeaf_light, because the result for one
of the leaves is large, and the merge is more expensive, even when skipping is used and
the simple linear merge is allowed.
  #   Query                                                                Hits
  S1  /dblp/inproceedings[@key="conf/3dica/RohalyH00"]/booktitle/text()    1
  S2  //inproceedings[@key="conf/3dica/RohalyH00"]/booktitle/text()        1
  S3  //*[@key="conf/3dica/RohalyH00"]/booktitle/text()                    1
  S4  //*[@*="conf/3dica/RohalyH00"]/booktitle/text()                      1
  B0  /dblp//author[text()="Michael Stonebraker"]/text()                   215
  B1  /dblp/*/author[text()="Michael Stonebraker"]/text()                  215
  B2  /dblp/*[author/text()="Michael Stonebraker"]/title/text()            215
  B3  /dblp/*[author/text()="Michael Stonebraker"]
          [author/text()="Hector Garcia-Molina"]/title/text()              4
  B4  /dblp/*[author="Michael Stonebraker"][author="Hector Garcia-Molina"]
          [@key="journals/corr/cs-DB-0310006"]/title/text()                1
  B5  /dblp/*[author="Michael Stonebraker"][author="Hector Garcia-Molina"]
          [@key="journals/corr/cs-DB-0310006"][year > 1950]/title/text()   1

Figure 12: Queries with decreasing path specification, and queries with increasing branching
complexity, on DBLP.
4.5 Queries with Decreasing Path Specification
Queries S1-S4 in Figure 12 are queries on DBLP with decreasingly specified paths,
using an increasing number of wild-cards.
               S1     S2     S3     S4     B0     B1     B2      B3      B4      B5
  MoXQ         1549   1387   4553   4694   1464   2609   2704    2940    2948    3029
  SysA         160    289    21911  21932  1450   1247   146273  146353  149293  19127
  XLeaf_none   0.054  0.054  0.59   0.93   0.084  0.085  0.63    0.25    0.159   0.20
  XLeaf_light  0.055  0.054  0.50   0.96   0.086  0.088  1.00    0.26    0.156   0.192
  XLeaf_max    0.060  0.062  0.51   0.98   0.114  0.104  2.2     0.78    0.21    0.27

Figure 13: Query performance, with run-time in milliseconds.
This test shows that our systems are more robust for partially specified paths. With
MoXQ and SysA, the query cost increases greatly from query S2 to S3, because a larger
number of nodes become candidates for the branching node. Query S4 is a degenerate
query which would probably never be seen in a real system, but it can be argued that S3
is not unrealistic. Our implementation is very fast on all queries. One leaf is consistently
looked up on value, while the other is looked up on pathID for S1 and S2, and on nameID
for S3 and S4, which is the reason the last two queries are slower.
4.6 DBLP Queries with Increasing Branching Complexity
Queries B0-B5 in Figure 13 show performance results for queries with increasing branching
complexity.
Comparing B0 and B1, the main cost for MoXQ seems to be merging with the list
for /dblp/*, which is large. Additional branches add no significant extra cost. Note that
this node is critical for the semantics of queries B2-B5. The poor results for SysA on
B2-B5 are due to a “performance bug”: the system sometimes chooses to use nested
loop lookups, even though other approaches would be more efficient.
Our implementation is also affected by the added branches in the queries, with some
interesting effects. B3 is faster than B2, because the join algorithm takes advantage of
the two author lists being more selective. The common result is smaller, and more can
be skipped when joining with the title. A similar argument holds for B4 vs B3, but for
B5, the added predicate on year has low selectivity, and only adds to the cost.
5 Related Work
Loop joins have been used previously in MPMGJN [37] for pairs of nodes, and in
PathMPMJ [4] for path queries, while in our system loop joins are used for full twigs,
and with no physical lists of matches for internal nodes.
Most twig join algorithms are non-looping, and read the input lists once. The original
TwigStack [4] maintains one stack of nested matches for each query node, and outputs
individual root-to-leaf path matches, which are merged in a second phase. More complex
intermediate result management is used by the later single-phase join algorithms, such
as Twig²Stack [5], HolisticTwigStack [16], TwigList [24] and TwigFast [18].
Structural summaries [10] are a prerequisite for prefix path partitioning, which was
used previously in iTwigJoin [6]. A difference between iTwigJoin and our approach is
that iTwigJoin uses materialized lists for all query nodes, and uses a specialized join to
combine pairs of lists for directly related query nodes.
Generating matches for internal query nodes on the fly is done in the Virtual Cursors
[36] algorithm and in TJFast [19], which use tag partitioning. A disadvantage of TJFast
is that instead of using a structural summary, the name path and Dewey are stored
compressed in the leaf lists. This makes reads unnecessarily expensive and increases space
usage. A disadvantage of Virtual Cursors is that non-branching internal nodes are not
ignored during evaluation, because generated internal nodes are not guaranteed to have
matching prefix paths. This also causes useless node candidates for branching internal
nodes. A shortcoming of both previous approaches, and also of ours, is that much work is
repeated during construction of internal nodes.
Skipping joins have been used previously with tag partitioning in for example
TSGeneric+ [15] and TwigOptimal [7]. When physical lists are used for the internal query
nodes, skipping in a list to catch up with a descendant is not trivial, and specialized data
structures like XR-trees [33] must be used. An advantage of virtual node lists is that
skipping for internal nodes can be implicit. This was used previously in Virtual Cursors [36],
where B-trees were used to skip through leaf node matches.
Multiple access methods for query node matches, implemented through materialized
views, are used for XML search in Microsoft SQL Server [22]. Their system uses prefix
path partitioning similar to ours, but reads data node lists for internal query nodes, and
uses row oriented storage. MonetDB/XQuery [3] has columnar storage, but it uses tag
partitioning, and does not feature compression.
6 Conclusion and Future Work
This paper has investigated how to combine various techniques for twig matching. A prototype
was developed to investigate possible benefits. When compared with two existing
systems, which use only some of these techniques, a speedup of two orders of magnitude
was shown for queries with value predicates.
Future work includes modifying the experimental prototype to allow switching features
on and off individually, to investigate both their individual and combined benefits
thoroughly. This may give more insight than comparing with other full systems. More
features, like optimal twig join algorithms, should also be added to our system.
Acknowledgment
Thanks to Felix Weigel for fruitful discussions. The first author was supported by the
Research Council of Norway under grant NFR 162349.
References
[1] Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-stores vs. row-stores:
how different are they really? In Proc. SIGMOD, 2008.
[2] S. Al-Khalifa, H.V. Jagadish, N. Koudas, J.M. Patel, and D. Srivastava. Structural
joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002.
[3] Peter Boncz, Torsten Grust, Maurice van Keulen, Stefan Manegold, Jan Rittinger,
and Jens Teubner. MonetDB/XQuery: a fast XQuery processor powered by a relational
engine. In Proc. SIGMOD, 2006.
[4] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal
XML pattern matching. In Proc. SIGMOD, 2002.
[5] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant
Agrawal, and K. Selçuk Candan. Twig²Stack: bottom-up processing of generalized-tree-pattern
queries over XML documents. In Proc. VLDB, 2006.
[6] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig
pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.
[7] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing
cursor movement in holistic twig joins. In Proc. CIKM, 2005.
[8] M. Franceschet. XPathMark. http://sole.dimi.uniud.it/~massimo.franceschet/xpathmark/
(Accessed: 2009-03-06).
[9] M. Franceschet. XPathMark: an XPath benchmark for XMark generated data. In
Proc. XSYM, 2005.
[10] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and
optimization in semistructured databases. In Proc. VLDB, 1997.
[11] Gang Gou and R. Chirkova. Efficiently querying large XML data repositories: A
survey. Knowl. and Data Eng., 2007.
[12] Nils Grimsmo. Faster path indexes for search in XML data. In Proc. ADC, 2008.
[13] T. Härder, M. Haustein, C. Mathis, and M. Wagner. Node labeling schemes for
dynamic XML documents reconsidered. Data & Knowl. Engineering, 2006.
[14] Allison L. Holloway and David J. DeWitt. Read-optimized databases, in depth. Proc.
VLDB Endow., 2008.
[15] H. Jiang, W. Wang, H. Lu, and J.X. Yu. Holistic twig joins on indexed XML
documents. In Proc. VLDB, 2003.
[16] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient
processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA,
2007.
[17] Michael Ley. DBLP XML Records, Jan. 2007. http://www.informatik.uni-trier.de/~ley/db/
(Accessed: 2008-10-22).
[18] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.
[19] J. Lu, T.W. Ling, C.Y. Chan, and T. Chen. From region encoding to extended Dewey:
On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005.
[20] Monet. MonetDB web page. http://monetdb.cwi.nl/ (Accessed: 2009-02-14).
[21] Patrick O'Neil, Elizabeth O'Neil, Shankar Pal, Istvan Cseri, Gideon Schaller, and
Nigel Westbury. ORDPATHs: insert-friendly XML node labels. In Proc. SIGMOD,
2004.
[22] Shankar Pal, Istvan Cseri, Oliver Seeliger, Gideon Schaller, Leo Giakoumakis, and
Vasili Zolotov. Indexing XML data stored in a relational database. In Proc. VLDB,
2004.
[23] Peter Boncz, Marcin Zukowski, and Niels Nes. MonetDB/X100: Hyper-pipelining
query execution. In Proc. CIDR, 2005.
[24] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast.
In Proc. DASFAA, 2007.
[25] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer
sequences. In Proc. ICDE, 2004.
[26] Albrecht Schmidt, Florian Waas, Martin Kersten, Michael J. Carey, Ioana
Manolescu, and Ralph Busse. XMark: a benchmark for XML data management.
In Proc. VLDB, 2002.
[27] Falk Scholer, Hugh E. Williams, John Yiannis, and Justin Zobel. Compression of
inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.
[28] Jayavel Shanmugasundaram, Kristin Tufte, Chun Zhang, Gang He, David J. DeWitt,
and Jeffrey F. Naughton. Relational databases for querying XML documents:
Limitations and opportunities. In Proc. VLDB, 1999.
[29] Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack,
Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat
O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. C-Store: a column-oriented DBMS.
In Proc. VLDB, 2005.
[30] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene
Shekita, and Chun Zhang. Storing and querying ordered XML using a relational
database system. In Proc. SIGMOD, 2002.
[31] W3C. XPath 1.0, 1999. http://www.w3.org/TR/xpath (Accessed: 2009-04-29).
[32] W3C. XQuery 1.0, 2007. http://www.w3.org/TR/xquery/ (Accessed: 2009-04-29).
[33] H. Jiang, H. Lu, W. Wang, and B.C. Ooi. XR-tree: Indexing XML data for efficient
structural joins. In Proc. ICDE, 2003.
[34] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval.
PhD thesis, Ludwig-Maximilians-Universität München, 2006.
[35] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes (2nd ed.):
Compressing and Indexing Documents and Images. 1999.
[36] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin
Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.
[37] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On
supporting containment queries in relational database management systems. SIGMOD
Rec., 2001.
Paper 4
Nils Grimsmo and Truls Amundsen Bjørklund
Towards Unifying Advances in Twig Join Algorithms
Proceedings of the 21st Australasian Database Conference (ADC 2010)
Abstract  Twig joins are key building blocks in current XML indexing systems, and numerous
algorithms and useful data structures have been introduced. We give a structured,
qualitative analysis of recent advances, which leads to the identification of a number of
opportunities for further improvement. Cases where combining competing or orthogonal
techniques would be advantageous are highlighted, such as algorithms avoiding redundant
computations and schemes for cheaper intermediate result management. We propose some
direct improvements over existing solutions, such as reduced memory usage and stronger
filters for bottom-up algorithms. In addition we identify cases where previous work has
been overlooked or not used to its full potential, such as for virtual streams, or where the
benefits of previous techniques have been underestimated, such as for skipping joins. Using
the identified opportunities as a guide for future work, we are hopefully one step closer
to a unification of many advances in twig join algorithms.
115
Paper 4: Towards Unifying Advances in Twig Join Algorithms
1 Introduction

Twig matching is the most heavily used building block for systems offering search in XML with languages like XPath and XQuery [12]. XML has become the de-facto standard for storage of semi-structured data, and the standard for data exchange between disjoint information systems. XPath is a declarative language, and XQuery is an iterative language which uses XPath as a building block. XPath queries can be evaluated in polynomial time [11].
Most academic work related to indexing and querying XML focuses on the twig matching problem, which is equivalent to a subset of XPath: Given a labeled data tree and a labeled query tree, find all matchings of the query nodes to the data nodes, where the data nodes satisfy the ancestor-descendant (a-d) and parent-child (p-c) relationships specified by the query tree edges.
The example in Figure 1 shows the relation between twig matching and XML search. The tree in part (a) is an abstraction of the XML document in (c). Real XML separates element (tag), attribute and text nodes, but in the abstract model there is only one type of node. The XPath and XQuery examples in (d) both specify the same structure as the abstract twig query in (b), where double edges symbolize a-d relationships.
[Figure 1 (a)-(c): the abstract data tree, the twig query and the XML document are not reproducible in this extraction.]
(d) XPath and XQuery:
//a[.//b][c]
for $na in //a
let $nb := $na//b
where $na/c
order by $na
return ($na, $nb)
Figure 1: XML and twig matching relation. (a) Abstract data tree. (b) Twig query. (c) XML Data. (d) XPath (above) and XQuery (below).
This work focuses on twig matching in indexed data trees. In a typical setting, all data nodes with the same label are stored together, using some encoding which specifies tree positions. To evaluate a query, one stream of data nodes with matching label is read for each query node, and the streams are joined to form twig matches.
This paper gives a structured analysis of recent advances in twig join algorithms, which leads to the identification of a number of opportunities for further improvements. Some direct improvements are identified, such as reduced memory usage in bottom-up algorithms and stronger top-down filters. We highlight cases where new combinations of competing and orthogonal techniques would have clear advantages, but also cases where important previous work has, in our view, been compared to unfairly. We note
1 See errata in Appendix A.
CHAPTER 4. INCLUDED PAPERS
some open challenges, such as updatability in strong structural summaries, and more efficient detection of cases where simpler and faster algorithms can be used (Section 3.7).

The analysis explores techniques for avoiding redundant computations (Section 3.1), schemes for intermediate result management (Section 3.2), top-down filters for bottom-up algorithms (Section 3.3), skipping joins (Section 3.4), refined access methods (Section 3.5) and virtual streams (Section 3.6).
2 Background: Concepts and Techniques

This section goes through some fundamental concepts and techniques which are useful for the understanding of later algorithms. First we formally define the problem.

Definition 1 (Twig matching). Given a rooted unordered labeled query tree Q and a rooted ordered labeled data tree D, find all complete matchings of the nodes in Q, such that the matched nodes in D follow the structural requirements given by the a-d and p-c edges in Q.
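A brute-force reference for Definition 1 can make the semantics concrete. The sketch below is our own illustration, not from the paper: the data tree of Figure 2a is represented with labels and parent pointers, and every combination of candidate nodes is checked against the query edges.

```python
from itertools import product

# Data tree of Figure 2a: node 0 is the root a; parent[i] gives i's parent.
labels = ["a", "b", "a", "c", "b", "c"]
parent = [None, 0, 0, 2, 2, 0]

def is_anc(a, d):
    # walk parent pointers from d upwards
    while parent[d] is not None:
        d = parent[d]
        if d == a:
            return True
    return False

def matches(qnodes, qedges):
    """qnodes: query node -> label; qedges: (parent, child, 'ad'|'pc').
    Yields complete matchings, query node -> data node id."""
    cand = {q: [i for i, l in enumerate(labels) if l == lab]
            for q, lab in qnodes.items()}
    order = list(qnodes)
    for combo in product(*(cand[q] for q in order)):
        m = dict(zip(order, combo))
        if all((parent[m[c]] == m[p]) if ax == "pc" else is_anc(m[p], m[c])
               for p, c, ax in qedges):
            yield m

# //a[.//b][c]: an a-d edge to b and a p-c edge to c.
res = list(matches({"qa": "a", "qb": "b", "qc": "c"},
                   [("qa", "qb", "ad"), ("qa", "qc", "pc")]))
```

The holistic algorithms discussed below exist precisely because this enumeration is exponential in the query size; the sketch is only a correctness reference.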
Note that there is a slight difference between the semantics of twig matching and XPath. A twig query returns all legal combinations of node matches, while in XPath there is a single query return node. The generality of returning all legal combinations of matches in twig matching may have become the academic focus because it is useful for the flexibility in XQuery. XPath can also specify more than a-d and p-c relationships, but a majority of XPath queries in practice use only the a-d and p-c axes [12].
Many early approaches to search in semi-structured data used combinations of indexing and tree navigation, but the main focus in the last decade has been on indexing with inverted lists and structural joins of streams of query node matches [1]. This paper only considers twig join algorithms.
Indexing and node encoding is critical for the efficiency of twig joins. Usually data nodes are indexed (partitioned) on node labels, using for example inverted lists. Two aspects of how data is stored inside partitions are important: How the position of a node is encoded, and how the nodes in the partition are ordered. For most algorithms nodes are stored in depth-first traversal pre-order, such that ancestors are seen before descendants. The positional information which follows nodes must allow deciding a-d and p-c relationships. The most common is the regional begin,end,level (BEL) encoding [29], which is used in the data extraction example in Figure 2. It reflects the positions of opening and closing tags in XML (see Figure 1c). The begin and end numbers are not the same as pre- and post-order traversal numbers, but give the same sorting orders.
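The structural predicates over the BEL encoding can be sketched as follows (our own minimal illustration; dicts stand in for stored index entries):

```python
def is_ancestor(a, d):
    # a-d relationship: a's region strictly contains d's region
    return a["B"] < d["B"] and d["E"] < a["E"]

def is_parent(p, c):
    # p-c relationship: containment plus adjacent levels
    return is_ancestor(p, c) and c["L"] == p["L"] + 1

# Entries from the extracted table in Figure 2b.
a18 = {"B": 1, "E": 8, "L": 1}
a36 = {"B": 3, "E": 6, "L": 2}
b55 = {"B": 5, "E": 5, "L": 3}
```

For example, a 1,8 is an ancestor but not the parent of b 5,5 , while a 3,6 is its parent.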
In the following, let T q denote the stream of matches for query node q, and C q denote the current data node in this stream. For simplicity, polynomial factors in the size of the query are ignored in asymptotic notation.
The early history of twig joins is shown in Figure 3. An early approach for schema-agnostic XML indexing was to store nodes with BEL encoding in an RDBMS, and specify query node relations as a number of inequalities. But these theta-joins are expensive. Specialized loop structural joins which leveraged the knowledge that the data encoded is a
(a) Data: tree with BEL-encoded nodes a 1,8 (root), its children b 2,2 , a 3,6 and c 7,7 , and the children c 4,4 and b 5,5 of a 3,6 .
(b) Extracted:
tag B E L
a   1 8 1
a   3 6 2
b   2 2 2
b   5 5 3
c   4 4 3
c   7 7 2
(c) Query: a with an a-d edge to b and a p-c edge to c.
(d) Matches:
(a 1,8 b 2,2 c 7,7 )
(a 1,8 b 5,5 c 7,7 )
(a 3,6 b 5,5 c 4,4 )
Figure 2: Tree indexing and querying example.
tree were introduced [29, 1]. These have O(I + O) cost for evaluating an a-d relationship, where I and O are the sizes of the input and output streams, but quadratic worst-case cost for p-c relationships. Stack joins were introduced to get optimal evaluation for all binary structural joins.
Leveraging RDBMS.
Specialized loop joins (a-d optimal): MPMGJN [29], Tree-Merge-Desc/-Anc [1].
Stack joins (a-d & p-c optimal): Stack-Tree-Desc/-Anc [1].
Path m-way loop: PathMPMJ [4].
Path m-way stack (a-d & p-c optimal): PathStack [4].
Twig m-way loop: not explored.
Twig m-way stack (a-d optimal): TwigStack [4].
Skipping: Anc des B+ [7], XR-Stack [14].
Figure 3: The early history of twig joins. Continued in Figure 8.
A problem with combining the evaluation of a number of binary relationships to answer a query is that the intermediate results may be of size exponential in the query, even if the output is small. This led to the introduction of multi-way join algorithms.
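A toy illustration of this blowup (our own construction): a single a node containing n b nodes but no c node makes a binary a-b join produce n intermediate pairs, although the twig query a[.//b][.//c] has no match at all.

```python
n = 1000
a_nodes = [(1, 2 * n + 2)]                        # one a containing everything
b_nodes = [(2 * i, 2 * i + 1) for i in range(1, n + 1)]
c_nodes = []                                      # no c matches at all

# Binary a-d join a-b, then join the result with the (empty) c stream.
ab = [(x, y) for x in a_nodes for y in b_nodes
      if x[0] < y[0] and y[1] < x[1]]
full = [(x, y, z) for (x, y) in ab for z in c_nodes
        if x[0] < z[0] and z[1] < x[1]]
# ab holds n intermediate pairs; full is empty.
```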
Stacks are key data structures in most modern twig join algorithms. Their use here is motivated by their use in depth-first tree traversals. To join streams of ancestors and descendants, a stack of currently nested ancestor nodes is maintained. Nodes are popped off the ancestor stack when a non-contained (disjoint) node is seen in either stream. In a path or twig multi-way algorithm, there must be one stack S qi for each internal query node q i . The matches for different query nodes must be processed in total pre-order, to ensure that ancestor nodes are added before descendants need them.
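This cleaning discipline is the core of the binary stack join; the following compact sketch is in the spirit of Stack-Tree-Desc [1] (our own transliteration, with nodes as (begin, end) tuples):

```python
def stack_join_desc(ancestors, descendants):
    """Both inputs sorted by begin (pre-order); returns (anc, desc) pairs
    sorted on the descendant."""
    stack, out, ai = [], [], 0
    for d in descendants:
        # push candidate ancestors that start before d ...
        while ai < len(ancestors) and ancestors[ai][0] < d[0]:
            a = ancestors[ai]
            ai += 1
            while stack and stack[-1][1] < a[0]:   # ... popping disjoint nodes
                stack.pop()
            stack.append(a)
        while stack and stack[-1][1] < d[0]:       # clean against d itself
            stack.pop()
        out.extend((a, d) for a in stack)          # every stacked node nests d
    return out

# The a and b streams of Figure 2.
pairs = stack_join_desc([(1, 8), (3, 6)], [(2, 2), (5, 5)])
```

Since the stack only holds nested nodes, its size is bounded by the depth of the data tree.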
In each step of the PathStack algorithm [4], the current data node is used to clean all stacks by popping non-containing nodes, before it is pushed on stack. Figure 4b shows the stacks for a query when evaluated on the data in Figure 4a, right after the node c 1 has been pushed. When the current query node is a leaf, all related matches are output. To enable linear time enumeration of the matches encoded in the stacks, each data node pushed onto a stack has a pointer to the closest containing data node in the parent stack,
which would be the top of the parent stack as the data node was pushed. Nodes above it on the parent stack cannot be ancestors, as the data nodes are read in pre-order. In the example, b 2 and a 2 are not usable together. Because a stack only contains nested nodes, the space needed is O(d), where d is the maximal depth of the data tree.
(a) Data tree with nodes a 1 , b 1 , b 2 , a 2 , b 3 , c 1 (tree layout lost in extraction).
(b) Query path a-b-c with stacks after pushing c 1 : S a = [a 1 , a 2 ], S b = [b 1 , b 2 , b 3 ], S c = [c 1 ] (bottom to top).
Figure 4: Data structures for PathStack.
A technique for getting path matches sorted on higher query node matches first is critical for the efficiency of TwigStack and other twig multi-way algorithms. Delaying out-of-order output is achieved by maintaining so-called self- and inherit-lists for each stacked node [1]. The lists for the data and query in Figure 4 are shown in Figure 5. As a node is popped off stack, the contents of its lists are appended to the inherit-lists of the node below it on the same stack, if there is one. This is to maintain correct output order. See for example the lists for b 2 and b 1 in the example. But if the popped node can use some ancestor node in the parent stack which the node below in its own stack cannot, the contents of the lists must be appended to the self-lists there. This is decided from the inter-stack pointers. In the example, popping node b 3 results in adding (a 2 b 3 c 1 ) to the self-list of a 2 . PathStack has O(I + O) complexity both with and without delaying output, where I is now the total input size.
node  self-list                                        inherit-list
a 2   (a 2 b 3 c 1 )                                   ∅
a 1   (a 1 b 1 c 1 )(a 1 b 2 c 1 )(a 1 b 3 c 1 )       (a 2 b 3 c 1 )
b 3   (b 3 c 1 )                                       ∅
b 2   (b 2 c 1 )                                       (b 3 c 1 )
b 1   (b 1 c 1 )                                       (b 2 c 1 )(b 3 c 1 )
Figure 5: Stack nodes with final self- and inherit-lists for the data and query in Figure 4. Darker nodes popped first.
TwigStack [4] was the first holistic twig join algorithm. Using PathStack on each root-to-leaf path in a twig query and merging the matches may lead to many useless intermediate results, because path matches need not be part of complete matches. TwigStack improved on this, and achieved O(I + O) complexity for queries with a-d edges only. It is a two-phase algorithm, where the first phase outputs matches for each root-to-leaf path, and the second phase merge joins the path matches. The first phase does two things which are critical for the performance of the algorithm: It only outputs path matches which can possibly be part of some complete query match, and it outputs paths sorted on higher query nodes first, using the technique from [1]. This allows a linear merge in phase two.
TwigStack does additional checking before pushing nodes on stack compared to PathStack. The data node at the head of the stream for a query node q is not pushed on stack before it has a so-called "solution extension", which means that the heads of the streams of all child query nodes are contained by C q , and that the child nodes recursively satisfy this property. Also, a node is not pushed on stack unless there is a usable ancestor data node on the stack for the parent query node.
Pseudo-code for TwigStack is shown in Algorithm 1 (adapted from [4]). It is included here to ease the depiction of the improvements discussed in the following sections. Each query node q has an associated stream T q with current element C q , and a stack S q . The algorithm revolves around a recursive function getNext(q), which returns a (locally) uppermost query node in the subtree of q which has a solution extension. If the parent of the returned q has a usable ancestor data node on stack, this means C q is part of a full solution extension identified earlier, and C q is pushed on S q . A path match is found when a leaf node is pushed on stack, but output is delayed to make sure paths are ordered on the query nodes top down (called "blocking" in [4]). Note that actually pushing a leaf node on stack is unnecessary, as it will be popped right off.
Algorithm 1 TwigStack
1:  function TwigStack(Q)
2:    while not atEnd(Q)
3:      q := getNext(Q.root)
4:      if not isRoot(q)
5:        cleanStack(S parent(q) , C q )
6:      if isRoot(q) or not empty(S parent(q) )
7:        cleanStack(S q , C q )
8:        push(S q , C q , top(S parent(q) ))
9:        if isLeaf (q)
10:         outputPathsDelayed(C q )
11:         pop(S q )
12:     advance(T q )
13:   mergePathSolutions()
14: function getNext(q)
15:   if isLeaf (q)
16:     return q
17:   for q i ∈ children(q)
18:     q j := getNext(q i )
19:     if q j ≠ q i
20:       return q j
21:   q min := arg min q i ∈ children(q) {C q i .begin}
22:   q max := arg max q i ∈ children(q) {C q i .begin}
23:   while C q .end < C q max .begin
24:     advance(C q )
25:   if C q .begin < C q min .begin
26:     return q
27:   else
28:     return q min
The getNext() traversal is bottom up, and is short-cut if some node does not have a solution extension (see line 20). Leaves trivially have solution extensions. The traversal has the side effect of advancing the treated query node at least until it contains all its
children (line 23). If it does not contain all children at this point, the child currently with the first pre-order data node (lowest begin value) is returned to be forwarded in line 12.
Figure 6 shows the state of the algorithm when evaluating the query in Figure 2c, right after node b 5,5 has been processed. After the first call to getNext(a), when all the streams were at their start position, a itself was returned as it had a solution extension, and C a = a 1,8 was pushed on stack. For the second call to getNext(a), this was not the case, and b was returned, with head C b = b 2,2 . Since b 2,2 had a usable ancestor a 1,8 on the parent stack S a , a 1,8 must have had a solution extension, in which the subtree rooted at b 2,2 was usable. So C b = b 2,2 was pushed on its own stack S b , and since it was a leaf, the path match (a 1,8 b 2,2 ) was output. After all paths have been found they are merge joined.
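The two getNext(a) calls of this walk-through can be reproduced with a direct transliteration of lines 14-28 of Algorithm 1 (the Stream class and the query representation are our own):

```python
INF = (float("inf"), float("inf"))   # sentinel for an exhausted stream

class Stream:
    def __init__(self, nodes):       # nodes: (begin, end), sorted by begin
        self.nodes, self.i = nodes, 0
    def cur(self):
        return self.nodes[self.i] if self.i < len(self.nodes) else INF
    def advance(self):
        self.i += 1

def get_next(q, children, T):
    if not children[q]:                       # leaves trivially qualify
        return q
    for qi in children[q]:
        qj = get_next(qi, children, T)
        if qj != qi:                          # shortcut: no solution ext. below
            return qj
    q_min = min(children[q], key=lambda c: T[c].cur()[0])
    q_max = max(children[q], key=lambda c: T[c].cur()[0])
    while T[q].cur()[1] < T[q_max].cur()[0]:  # advance q until it contains
        T[q].advance()                        # all children's stream heads
    if T[q].cur()[0] < T[q_min].cur()[0]:
        return q                              # q has a solution extension
    return q_min

children = {"a": ["b", "c"], "b": [], "c": []}
T = {"a": Stream([(1, 8), (3, 6)]),
     "b": Stream([(2, 2), (5, 5)]),
     "c": Stream([(4, 4), (7, 7)])}
first = get_next("a", children, T)    # "a": a 1,8 has a solution extension
T["a"].advance()                      # a 1,8 is pushed and its stream advanced
second = get_next("a", children, T)   # now "b" is returned, with head b 2,2
```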
(a) Streams: T a = a 1,8 , a 3,6 , ⊥; T b = b 2,2 , b 5,5 , ⊥; T c = c 4,4 , c 7,7 , ⊥.
(b) Stacks: S a = [a 1,8 , a 3,6 ], S b = [b 2,2 , b 5,5 ], S c = [c 4,4 ] (bottom to top).
(c) Path matches:
(a 1,8 b 2,2 )
(a 1,8 b 5,5 )
(a 1,8 c 4,4 )
(a 3,6 b 5,5 )
(a 3,6 c 7,7 )
Figure 6: TwigStack state when evaluating query in Figure 2c, after processing node b 5,5 .
TwigStack suboptimality for mixed a-d and p-c queries comes from having to output path matches without knowing whether the data nodes used can satisfy all their p-c relationships. The algorithm cannot always decide this from the nodes on the stacks and the heads of the streams. For the example in Figure 7 it cannot be decided if the path matches (a 1 , b 1 ), . . . , (a 1 , b n ) are part of a full match before the node c n+1 is seen.
(a) Query: a with children b and c.
(b) Data: root a 1 with children b 1 , . . . , b n , a 2 and c n+1 ; a 2 with children b n+1 and c 1 , . . . , c n (tree layout partially inferred from the text).
Figure 7: Bad case for TwigStack.
For a-d-only queries, queries where a p-c edge never follows an a-d edge [24], or on data with a non-recursive schema [9], twig joins can be solved with cost linear in the size of the input and the output, using O(d) memory. Sadly, recursive schemas, where nodes with a given label may nest nodes with the same label, are common in XML in practice [8], and so are mixed queries of the type mentioned above. No algorithm can solve the general problem given tag streaming, linear index size, and a query evaluation memory
requirement of O(d) [24]. One alternative is storing multiple sort orders on disk, instead of only tree pre-order. This would require Ω(m^min{m,d} · D) disk space in the worst case, where m is the number of structurally recursive labels and D is the size of the document [9]. Another alternative is to do multiple scans of the streams, but this would require Ω(d^t) passes in the worst case, where t is a linear metric on the complexity of the query [9]. So the only viable alternatives left seem to be relaxing the O(d) space requirement, or using something different from tag partitioning. The following section investigates this, but also many practical speedups to TwigStack.
3 Advances

A multitude of different improvements have been presented after the introduction of TwigStack. Figure 8 gives an overview of these, with a separation between improved join algorithms and changes to how data is indexed and accessed. The rest of this paper is devoted to a structured review of these advances. Our goal is to identify further improvements, and to shed light on whether it is likely that combining these advances is possible and beneficial.
3.1 Avoiding Redundant Computation

TwigStack may perform many redundant checks in the calls to getNext(). Each time a node is returned, the full subtree below it has been inspected. The TJEssential [19] algorithm improved three specific deficiencies, exemplified in Figure 9.

The first deficiency is from self-nested matching nodes. For query node a and data nodes a 1 to a p in the example, it is unnecessary to recursively check the full subtrees below b and d in each round while pushing the nodes onto S a . The usefulness of a 2 , . . . , a p can be seen from the fact that a 1 had a new solution extension, and that a 2 , . . . , a p contain b 1 and d 1 , the heads of the streams of a's children.
The second observation concerns the order in which child nodes are inspected. If the child b is inspected before d in line 18 of Algorithm 1, getNext(a) will call getNext(b) before getNext(d) shortcuts the search. There will be m − 1 redundant calls to getNext(b) while forwarding the leaf node e.

The third observation is that many useless calls could be made after a stream has reached its end. Assuming that b 2 was the last b-node in Figure 9, no a-node later in the tree order would ever be pushed onto stack, and T a could be forwarded to its end. Also, if S b was empty, any descendant of b in the query could have its stream forwarded to the end, as the remaining nodes could not be part of a solution.
TJEssential is a total rewrite of TwigStack, and is more complex than the original algorithm. TwigStack + [30] is a less involved modification, which only changes the getNext() procedure, such that it does not return before a solution is found. TwigStack + does not catch any of the three cases above, but reduces computation for scattered node matches in practice.
Changing algorithms:
– Nested loop twig m-way (fast on simple queries?).
– 2-phase holistic top-down, a-d optimal: TwigStack [4].
– Avoiding redundant computation: TJEssential [19], TwigStack + [30], TJEssential* [19], TwigStack + B [30]. Single phase essential?
– 1-phase bottom-up, a-d & p-c optimal: Twig 2 Stack [5].
– 1-phase top-down, a-d & p-c optimal: HolisticTwigStack [16].
– Simplified intermediate result management, bottom-up: TwigList [23].
– Simplified intermediate result management, top-down: TwigFast [20]. Unification?
– Bottom-up + filtering: Twig 2 Stack+PStack [5], Twig 2 Stack+TStack [3], TwigMix [20].
Changing data and data access:
– Refined access methods (tag, tag+level, path): iTwigJoin [6]; (twig): TwigVersion [27]. Optimal data access?
– Virtual streams: Virtual Cursors [28], TJFast [21], (TwigOptimal [10]).
– Skipping in leaf streams: Virtual Cursors [28]. Holistic skipping in matched leaf streams?
– Skipping joins: TwigStackXB [4]; efficient ancestor skipping: TSGeneric+ [15]; holistic skipping: TwigOptimal [10]. Holistic ancestor skipping?
Combination possible? Optimal algorithm?
Figure 8: Advances and opportunities in twig joins.
(a) Data: self-nested nodes a 1 , . . . , a p above b 1 and d 1 , with a q , e 1 , . . . , e m , b 2 and c 1 , . . . , c n further below (tree layout lost in extraction).
(b) Query: a with children b and d; c below b, e below d.
Figure 9: Giving redundant checks in TwigStack.
Opportunity 1 (Removing redundant computation in top-down one-phase joins). The improvement of TwigStack + can trivially be ported to recent algorithms such as HolisticTwigStack and TwigFast, which improve other aspects of TwigStack (see Section 3.2). A challenge is to do the same for all three improvements of TJEssential. Also, case three above could be extended to more efficient aligning for multi-document XML collections.
3.2 Top-down vs. Bottom-up

There are two main lines of algorithmic improvements over TwigStack which give optimal evaluation of mixed a-d and p-c queries by relaxing the O(d) memory requirement: bottom-up algorithms which read nodes in post-order, and later algorithms which go back to top-down and pre-order. Differences between these are illustrated in Figure 10.
Twig 2 Stack [5] generates a single combined stream with post-order sorting for all query node matches with the help of a single stack. With post-order processing it can be decided whether an entire subtree has a match at the time the top node is seen.

Figure 10c shows the hierarchies of stacks built while processing a query. For each query node, a list of trees of stacks is maintained. A data node strictly nests all nodes below it in the stack, and all nodes in child stacks in the tree. The lists of trees are stored sorted in post-order, and are linked together by a common root if an ancestor node is processed. Due to the post-order, the nodes to be linked will always be found at the end of the list, and the new root will always be put at the end. The order naturally maintains itself, and good locality is achieved.
Instead of each node on stack having a pointer to an ancestor node on a parent stack as in TwigStack, each stacked data node has, for each related child query node, a list of pointers to top stack nodes matching the query axis relationship. Nodes are only added if a-d and p-c relationships can be satisfied, and p-c pointers are only added when levels are correct, as seen for the a and c nodes in the example.
[Figure 10 panels (a)-(f): the data tree, the query and the four data structure illustrations are not reproducible in this extraction.]
Figure 10: (a) Data. (b) Query. (c) Hierarchies of stacks for Twig 2 Stack. (d) Intervals for TwigList. Curved arrows are sibling pointers. (e) Lists of stacks for HolisticTwigStack right before c 5 is processed. Previously popped nodes shown in gray. (f) Intervals for TwigFast after c 5 has been processed. Curved arrows are ancestor pointers.
TwigList [23] is a simplification of Twig 2 Stack using simple lists and intervals given by pointers, which improves performance in practice. For each query node, there is a post-order list of the data nodes used so far. Each node in a list has, for each child query node, a single recorded interval of contained nodes, as shown in Figure 10d. Interval start and end positions are recorded as nodes are pushed onto and popped off the global stack. All descendant data nodes are processed in between. Compared with the lists of pointers in Twig 2 Stack, enumeration of matches is not as efficient for p-c edges, but sibling pointers can remedy this.
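The interval bookkeeping can be sketched as follows (our own illustration, run on the data and query of Figure 2; TwigList's filtering of useless nodes and its handling of p-c levels are omitted):

```python
def twiglist_intervals(nodes, qchildren):
    """nodes: (label, begin, end) in pre-order; qchildren maps a query
    label to its child query labels. Returns per-query-node lists and,
    for each listed node, one (start, end) interval into each child list."""
    lists = {q: [] for q in qchildren}     # per-query-node match lists
    ivals = {q: [] for q in qchildren}     # parallel child intervals
    stack = []                             # global stack: (end, qlabel, index)

    def close(entry):                      # interval end = child list size now
        end, q, i = entry
        for c in qchildren[q]:
            s, _ = ivals[q][i][c]
            ivals[q][i][c] = (s, len(lists[c]))

    for lab, b, e in nodes:
        while stack and stack[-1][0] < b:  # pop nodes disjoint from (b, e)
            close(stack.pop())
        if lab in lists:
            i = len(lists[lab])
            lists[lab].append((b, e))
            # interval start = size of each child list at push time
            ivals[lab].append({c: (len(lists[c]), None)
                               for c in qchildren[lab]})
            stack.append((e, lab, i))
    while stack:
        close(stack.pop())
    return lists, ivals

lists, ivals = twiglist_intervals(
    [("a", 1, 8), ("b", 2, 2), ("a", 3, 6),
     ("c", 4, 4), ("b", 5, 5), ("c", 7, 7)],
    {"a": ["b", "c"], "b": [], "c": []})
```

For example, the second a node, a 3,6 , ends up with the b-interval [1, 2) and the c-interval [0, 1), i.e. exactly b 5,5 and c 4,4 .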
HolisticTwigStack [16] is a modification of TwigStack which uses pre-order processing, but maintains complex stack structures like Twig 2 Stack. The argument against Twig 2 Stack was its high memory usage, caused by the fact that all query leaf matches are kept in memory until the tree is completely processed, as they could be part of a match. HolisticTwigStack differentiates between the top-most branching node and its ancestors, for which a regular stack is used, and lower query nodes, which have multiple linked lists of stacks, as shown in Figure 10e. Each query node match has one pointer to the first descendant in pre-order for each child query node. For "lower" query nodes, new data nodes are pushed onto the current stack if contained; otherwise a new stack is created and appended to the list. As a match for an "upper" query node is popped, the node below it on stack must inherit the pointers. Node a 1 would inherit the pointers from both a 2 and a 4 in the example in Figure 10e, and the related lists of child matches would be linked.
TwigFast [20] is a simplification <strong>of</strong> HolisticTwigStack similar to TwigList. There is<br />
one list containing matches for each query node, naturally sorted in pre-order, <strong>and</strong> data<br />
nodes in the lists have pointers giving the interval <strong>of</strong> contained matches for child query<br />
nodes, as shown in Figure 10f. Each data node put into the list has a pointer to its closest<br />
ancestor in the same list, <strong>and</strong> there is a “tail pointer”, which gives the last position where<br />
a node can be the ancestor <strong>of</strong> following nodes in the streams. These pointers are used for<br />
the construction <strong>of</strong> the intervals.<br />
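As a concrete illustration of how such interval pointers support result enumeration, the following sketch expands one match entry into full result tuples. The list and dictionary layout is an assumption for illustration only, not the actual TwigFast structures:

```python
from itertools import product

# Hypothetical sketch of TwigFast/TwigList-style intermediate lists: one
# append-only list of matches per query node, where each entry stores, for
# every child query node, a half-open interval into the child's list.

def expand(lists, children, qnode, index):
    """Return all result tuples rooted at entry `index` of `qnode`'s list."""
    entry = lists[qnode][index]
    kids = children.get(qnode, [])
    if not kids:
        return [(entry["node"],)]
    per_child = []
    for child in kids:
        lo, hi = entry["intervals"][child]   # interval into the child's list
        tuples = []
        for i in range(lo, hi):
            tuples.extend(expand(lists, children, child, i))
        per_child.append(tuples)
    # One match from each child branch (cross product), flattened.
    return [(entry["node"],) + sum(combo, ()) for combo in product(*per_child)]

# Query a//b, with data node a1 containing b1 and b2:
children = {"a": ["b"], "b": []}
lists = {
    "a": [{"node": "a1", "intervals": {"b": (0, 2)}}],
    "b": [{"node": "b1"}, {"node": "b2"}],
}
print(expand(lists, children, "a", 0))  # [('a1', 'b1'), ('a1', 'b2')]
```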
Different advantages <strong>of</strong> top-down <strong>and</strong> bottom-up algorithms can be seen in Figure<br />
10. A top-down algorithm can avoid storing b 1 <strong>and</strong> c 2 , while a bottom-up algorithm<br />
is unable to decide that these nodes cannot be part <strong>of</strong> a solution. On the other h<strong>and</strong>, a<br />
bottom-up algorithm can decide that a 2 is not usable, because it cannot satisfy the p-c<br />
relationship between a <strong>and</strong> c. Both approaches can decide that a 3 is not useful because<br />
it does not have a b descendant.<br />
The worst case space complexity of twig pattern matching is an open problem, and
the known bounds are Ω(max(d, u)) and O(I), where u is the number of nodes which are
part of a solution [24]. However, practical space savings are possible.
Opportunity 2 (Top-down memory usage). TwigStack treats queries as a-d only in the<br />
stack construction part <strong>of</strong> phase one. A node returned from getNext() is pushed on stack if<br />
it has a usable ancestor on the parent stack, even if the query specifies a p-c relationship.<br />
For example, c 3 does not have to be pushed on stack in Figure 10e, because it does not
have a usable parent. Strictly checking p-c relationships before adding intermediate results
would reduce memory usage in practice. This optimization was identified for TJFast [21]<br />
CHAPTER 4. INCLUDED PAPERS
(see Section 3.6), but the later HolisticTwigStack <strong>and</strong> TwigFast do not take advantage <strong>of</strong><br />
this opportunity.<br />
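A minimal sketch of the strict check, assuming (begin, end, level) region encodings and a simple stack of ancestor candidates; the names and layout are illustrative, not from any of the cited implementations:

```python
# Opportunity 2 sketch: before pushing a candidate, verify a p-c edge
# strictly instead of treating it as a-d, as TwigStack's phase one does.

def can_push(node, parent_stack, edge_is_pc):
    """node and stack entries are (begin, end, level) region encodings."""
    if not parent_stack:
        return False
    top = parent_stack[-1]
    contained = top[0] < node[0] and node[1] < top[1]
    if not contained:
        return False
    if edge_is_pc:
        return node[2] == top[2] + 1   # strict parent, not just an ancestor
    return True

# A c node at level 3 under a usable ancestor at level 1, as for c3 in
# Figure 10e: rejected for a p-c edge, accepted for an a-d edge.
print(can_push((5, 6, 3), [(1, 10, 1)], edge_is_pc=True))   # False
print(can_push((5, 6, 3), [(1, 10, 1)], edge_is_pc=False))  # True
```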
Opportunity 3 (Bottom-up memory usage). Assume a query node q with a p-c relationship<br />
to the parent query node. If a c<strong>and</strong>idate match for q is pushed onto a stack in<br />
Twig 2 Stack, <strong>and</strong> the data node below on the stack does not have an incoming pointer,<br />
this means the node below will never get a matching parent, <strong>and</strong> can be popped <strong>of</strong>f stack.<br />
For example, the node c 6 could be dropped in Figure 10c. Also, stack trees for q are
merged when some ancestor data node a i is seen. Then all the stack trees which neither
have nor receive an incoming pointer can be dropped, as all later candidates for the parent query
node will come after in post-order. In the example, the stack trees containing the single
nodes c 2 <strong>and</strong> c 3 could be dropped when a 1 is seen.<br />
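A highly simplified single-stack sketch of this pruning rule, with an explicit incoming-pointer flag standing in for Twig 2 Stack's pointer structure (the whole layout is an assumption for illustration):

```python
# Opportunity 3 sketch: when a new candidate for a p-c query node is
# pushed, an entry directly below that has no incoming pointer can never
# gain a matching parent, so it is dropped.

def push_with_pruning(stack, node):
    if stack and not stack[-1]["incoming"]:
        stack.pop()                      # e.g. c6 in Figure 10c is dropped
    stack.append({"node": node, "incoming": False})

stack = []
push_with_pruning(stack, "c6")           # c6 never receives a pointer
push_with_pruning(stack, "c7")           # pushing c7 prunes c6
print([e["node"] for e in stack])        # ['c7']
```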
Note that improvements such as this are hard to transfer directly to TwigList unless the
lists are implemented as linked lists. But linked lists are by far inferior to using arrays and array
doubling on modern hardware, as done in TwigList [22]. Another solution is to keep one<br />
list for each level for query nodes which have a p-c relationship to the parent query node.<br />
Siblings would then be stored contiguously, <strong>and</strong> interval pointers would implicitly be to<br />
a list on a given level. When the first ancestor <strong>of</strong> a segment <strong>of</strong> nodes in need <strong>of</strong> a parent<br />
is seen, the useless nodes can be over-written. This modification would also make sibling<br />
pointers unnecessary <strong>and</strong> improve efficiency <strong>of</strong> result enumeration. Figure 11 shows the<br />
proposed approach for the data <strong>and</strong> query in Figure 10, where gray list items can be<br />
overwritten.<br />
Figure 11: Proposal for multi-level lists for TwigList.<br />
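The proposed per-level layout can be sketched as follows; the (level, position) addressing is an assumption for illustration:

```python
# Sketch of multi-level lists for TwigList: matches for a query node with
# a p-c edge to its parent go into one array per data-tree level, so
# siblings end up contiguous and an interval pointer is (level, lo, hi).
from collections import defaultdict

levels = defaultdict(list)               # level -> contiguous sibling array

def add_match(node_id, level):
    levels[level].append(node_id)
    return len(levels[level]) - 1        # position used by interval pointers

add_match("c1", 1)
add_match("c5", 3)
add_match("c4", 4)
add_match("c6", 4)                       # c4 and c6 stored contiguously
print(levels[4])  # ['c4', 'c6']
```

An interval pointer from a parent match then names a level and a contiguous range, so sibling pointers become unnecessary during enumeration.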
3.3 Filtering<br />
Low memory top-down approaches have been used as filters to bottom-up algorithms to<br />
reduce space usage by avoiding useless nodes. Note that this does not result in a perfect<br />
solution. Assume that node a 1 in Figure 10a had a different label. An O(d) space top-down
pre-order approach could not decide that b 2 in the example was not part of a match, and
a bottom-up algorithm would have to keep it in memory until the entire tree was read.
Figures 12c-e show the effects of different previously proposed filters.
PathStack Pop Filter. In the original Twig 2 Stack paper [5], PathStack was proposed<br />
as a pre-filter to allow early result enumeration. PathStack is run as usual, but<br />
without its result enumeration. As disjoint nodes are popped <strong>of</strong>f their stacks, they are<br />
Paper 4: Towards Unifying Advances in Twig Join Algorithms

Figure 12: Filtering approaches for bottom-up processing. Filtered nodes shown in gray.
Panels: (a) Query. (b) No filter. (c) PathStack pop filter. (d) Solution extension filter.
(e) TwigStack pop filter. (f) Opportunity: PathStack useful. (g) Opportunity: TwigStack useful.
passed to Twig 2 Stack. When the bottom node is popped from the stack <strong>of</strong> the root query<br />
node, all results can be output, <strong>and</strong> the hierarchical stacks destroyed. A side effect <strong>of</strong> this<br />
procedure is that only nodes that are part <strong>of</strong> some prefix path match are used (these are<br />
not necessarily part <strong>of</strong> a full root-to-leaf path match). In Figure 12c, node c 1 is avoided.<br />
Note that one data node may result in the popping <strong>of</strong> multiple nodes on multiple stacks,<br />
<strong>and</strong> that Twig 2 Stack must receive descendants before ascendants.<br />
Solution Extension Filter. TwigMix [20] is an algorithm which combines the simplified<br />
data structures in TwigList with the getNext() function from TwigStack as a filter.<br />
This combination gives efficient evaluation for queries involving p-c edges, <strong>and</strong> reduced<br />
memory usage in practice. An advantage <strong>of</strong> this approach over Twig 2 Stack+PathStack is<br />
that there is no overhead <strong>of</strong> maintaining an extra set <strong>of</strong> stacks, <strong>and</strong> that internal nodes<br />
are filtered holistically. The downside is that nodes may be added without even having a
possible parent or ancestor. Figure 12d shows that node a 2 is filtered, because it never
has a solution extension (it misses a b node below), while nodes c 1 and b 1 are not filtered.
TwigStack Pop Filter. TwigStack can also be used as a filter for Twig 2 Stack [3]. A<br />
node is never added to the hierarchical stacks if it is not popped from a top-down stack in<br />
TwigStack. As a node is never pushed on stack if it does not have a usable ancestor, which<br />
again has a solution extension, this gives additional filtering, at the cost <strong>of</strong> maintaining the<br />
top-down stacks. Figure 12e shows the improvements both over PathStack <strong>and</strong> solution<br />
extension as filters. An issue is that Twig 2 Stack expects the stream <strong>of</strong> nodes to be in<br />
post-order, <strong>and</strong> that TwigStack may pop nodes <strong>of</strong>f stacks out <strong>of</strong> this order. When a node<br />
is returned from getNext(), only the related stack <strong>and</strong> the parent stack are inspected.<br />
Also, TwigStack does not keep leaf matches on stack, but nested leaf matches may arrive<br />
later. In [3] this is solved by keeping an extra queue <strong>of</strong> data nodes into which popped<br />
nodes are placed if the algorithm decides later popped nodes may precede them.<br />
A different solution could be to allow nested nodes on query leaf stacks, <strong>and</strong> to inspect<br />
all stacks when popping disjoint nodes to ensure post-order, as with the PathStack filter.<br />
Also, Twig 2 Stack does not actually need to see nodes in strict post-order, but only to
see descendants before ascendants. Hence, not all stacks in the query would have to be
inspected, only the ascendant and descendant stacks of the current node.
Opportunity 4 (Stronger filters). There are further possibilities for filters with O(d)<br />
space usage. Instead of using all nodes popped off stacks in PathStack, one could use
only the nodes which would take part in a full path match. As leaf nodes are pushed on stack, a
simplified enumeration algorithm could be run, tagging nodes which take part in solutions.<br />
As can be seen in Figure 12f, this is an improvement over the previous PathStack filter, but<br />
only partially over the solution extension filter, which to a greater extent filters matches<br />
for higher query nodes. Leaves trivially have solution extensions. The “PathStack<br />
useful” filter works well on lower query nodes. Note that as the bottom-up algorithms<br />
to a greater extent h<strong>and</strong>le upper nodes themselves, a filter is <strong>of</strong> most use if it removes<br />
lower query node c<strong>and</strong>idates effectively. An even stronger filter would be to only use<br />
nodes which would have been output as parts <strong>of</strong> path matches in TwigStack, as shown in<br />
Figure 12g. None of [5, 20, 3] compare with using any other type of filter. A thorough
comparison should measure the practical space reductions the filters give, their absolute
costs, and how their use affects the total computational cost.
Opportunity 5 (Unification or assimilation). When comparing absolute performance<br />
gains presented in the respective papers, TwigFast is the winner on performance for pure<br />
tag streaming. As this is a very important result, it should be verified independently.<br />
Before TwigFast is picked as the method <strong>of</strong> choice, at least the following should be answered:<br />
(i) Can the improvements discussed in Section 3.1 be applied? (ii) Is it superior<br />
to improved top-down <strong>and</strong> bottom-up combinations? (iii) Does the picture change when<br />
r<strong>and</strong>om access gets more expensive compared to computation? [20] does not comment on<br />
the spatial locality <strong>of</strong> memory access patterns in the intermediate result data structures<br />
in TwigFast, while they are very good for TwigList [23].<br />
3.4 Skipping Joins<br />
Skipping is a useful technique when the streams to be joined have very different sizes.<br />
Skipping is used to jump forward in a stream to avoid reading <strong>and</strong> processing parts <strong>of</strong><br />
the streams which cannot contain useful nodes. Figure 13 shows cases where different<br />
skipping techniques <strong>and</strong> data structures can be used.<br />
Simple B-tree skipping can be used to skip in descendant streams, and to some
extent in ancestor streams. It is trivial to skip in the descendant stream to find the first
possible contained node, which is the first node with a larger begin value. In Figure 13b,
T c is forwarded from c 1 to c q to find the first possible descendant of b 1 .
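A minimal sketch of this skip, with bisection over a sorted array of begin values standing in for the B-tree:

```python
# Descendant-stream skipping: jump to the first stream node whose begin
# value exceeds the ancestor's, i.e. the first node that can still be
# contained by it. A sorted array plus bisect stands in for the B-tree.
from bisect import bisect_right

def skip_to_first_possible_descendant(begins, ancestor_begin):
    return bisect_right(begins, ancestor_begin)

c_begins = [2, 4, 6, 30]   # begin values of nodes in stream T_c
print(skip_to_first_possible_descendant(c_begins, 10))  # 3: skips c1..c3
```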
Figure 13: Benefits <strong>of</strong> skipping techniques. (a) Query. (b) Descendants easily skipped<br />
with B-tree. (c) Skip past discarded ascendant. (d) XR-tree needed to skip ascendants.<br />
(e) Holistic skipping preferred. (f) Holistic skipping with XR-tree needed.<br />
But skipping to find the next ascendant <strong>of</strong> a node using the same approach is not<br />
effective, as any node with a lower begin value may be a match. A trick for ancestor<br />
skipping was introduced in [7]. If a node b i is popped <strong>of</strong>f stack S b due to disjointness with<br />
the current data node in some query node, T b is forwarded to the first node not contained<br />
by the popped node, a b j such that b i .end < b j .begin. An example <strong>of</strong> this can be seen<br />
in Figure 13c. If c 2 pops b 1 <strong>of</strong>f stack, T b can be forwarded beyond b 1 to b q , because no<br />
descendant <strong>of</strong> b 1 could be useful.<br />
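The trick can be sketched with a sorted array standing in for the index structure; once b i is popped, the stream jumps past its entire subtree:

```python
# Pop-triggered ancestor skip from [7]: every node contained in the popped
# node b_i is useless, so T_b jumps to the first b_j with b_i.end < b_j.begin.
from bisect import bisect_right

def forward_past_popped(begins, popped_end):
    return bisect_right(begins, popped_end)

b_begins = [3, 5, 7, 9, 50]   # b1 = (3, 40) contains b2..bp; b_q begins at 50
print(forward_past_popped(b_begins, 40))  # 4: jump straight to b_q
```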
XR-trees enable ancestor skipping in the general case [14]. Figure 13d shows an
example where the above trick cannot be used. The XR-tree is a B-tree variant which can
retrieve all R ancestors or descendants of a node from N candidates in O(log N + R)
time. Typically one tree is built for each tag. To find all a ascendants of a node d k ,
find the node a i with the nearest preceding begin value, and then retrieve all a ascendants of
a i from the XR-tree for a. Conceptually the XR-tree contains, for each node, the list of
its ascendants, which gives quadratic space usage when implemented naively. Linear space
usage is achieved by not storing information redundantly internally in XR-tree nodes, and
by storing common information in internal XR-tree nodes.
TSGeneric+ (also called XRTwig) [15] extends the use of the XR-tree to TwigStack,
and makes two major modifications to the algorithm. The first is to skip forward to
containment <strong>of</strong> the first child in the getNext() procedure (see line 23 in Algorithm 1).<br />
The second change is more involved. Before calling getNext() on all children in line 18, a<br />
“broken” edge in the query sub-tree is repeatedly picked, <strong>and</strong> the two related nodes are<br />
“zig-zag” forwarded until they match. This is only done if the query node does not have<br />
data nodes on the stack. The edge to fix is chosen either top-down, bottom-up, or by
statistics.
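The zig-zag forwarding of one broken a-d edge can be sketched over streams of (begin, end) region encodings; this is a hedged illustration of the idea, not TSGeneric+'s actual cursor interface:

```python
# Zig-zag fix of a broken a-d edge: advance whichever stream head rules
# out containment until the two heads nest.

def zigzag_fix(anc, dsc):
    i = j = 0
    while i < len(anc) and j < len(dsc):
        a, d = anc[i], dsc[j]
        if a[1] < d[0]:                          # a ends before d begins
            i += 1
        elif not (a[0] < d[0] and d[1] < a[1]):  # d not inside a
            j += 1
        else:
            return i, j                          # heads now satisfy the edge
    return None                                  # no match on this edge

anc = [(1, 4), (20, 40)]
dsc = [(6, 7), (9, 10), (22, 23)]
print(zigzag_fix(anc, dsc))  # (1, 2): both streams are skipped forward
```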
Holistic skipping was introduced in the TwigOptimal [10] algorithm, which uses<br />
B-trees. Figure 13e shows a case where the approach from TSGeneric+ would be very<br />
expensive, reading all nodes b 2 -b p <strong>and</strong> c 2 -c p to fix the edge between b <strong>and</strong> c. TwigOptimal<br />
processes the query bottom-up then top-down. In the bottom-up phase, nodes are<br />
forwarded to contain their descendants, <strong>and</strong> in the top-down phase, nodes are forwarded<br />
until they are contained by their parent. To avoid as many data structure reads as possible,<br />
nodes are forwarded to “virtual positions”, which have only begin values. When a<br />
full traversal does not forward any node, the node with the minimal current begin value is
forwarded to a real data node.<br />
The name of the TwigOptimal algorithm may be slightly misleading, as the optimality
holds only given skip structures on begin values. Only TSGeneric+ using simple B-trees
is compared with. The effects of the two contributions, holistic skipping and the virtual
positions, are not tested separately. TwigOptimal would be efficient on neither the
example in Figure 13c nor the one in 13d. The approach is best when there are more matches for
lower query nodes. A common exception to this is queries with leaf value predicates
in XML. [10] mentions skipping to the closest ancestor and then backtracking to the first
ancestor as a possible practical speed-up.
Opportunity 6 (Holistic effective ancestor skipping). Figure 13f shows a case where<br />
both TSGeneric+ with XR-trees <strong>and</strong> TwigOptimal would fail to be efficient. The former<br />
would zig-zag join a 2 -a p <strong>and</strong> b 2 -b p , <strong>and</strong> the latter would be unable to forward T a to a q<br />
without checking at least all <strong>of</strong> a 2 -a p for ancestry <strong>of</strong> c q . Combining holistic skipping <strong>and</strong><br />
data structures for efficient ancestor skipping is required in a robust solution.<br />
Opportunity 7 (Simpler <strong>and</strong> faster skipping data structures). The XR-tree is a dynamic<br />
data structure which supports insertions <strong>and</strong> deletions [14]. In regular keyword search<br />
engines, simpler data structures are usually preferred to the heavier B-trees when the<br />
data is static or semi-static. Similar simpler data structures should also be created for<br />
efficient ancestor skipping. If their use is still expensive, techniques similar to the trick<br />
used to skip past discarded ascendants should be applied when possible.<br />
3.5 Refined Access Methods<br />
There are alternatives to indexing <strong>and</strong> accessing data by node labels, such as using label<br />
<strong>and</strong> level, or the root-to-node path strings <strong>of</strong> labels (called tag+level <strong>and</strong> prefix path<br />
streaming [6]). With refined partitioning some method must be used to identify the useful<br />
partitions for each query node. For prefix path streaming this would be the partitions<br />
with data paths matching the root-to-node downward paths in the query.<br />
Structural summaries are directory structures used to classify nodes based on their<br />
surrounding structure. They were first used in combination with tree traversals, but have
later been integrated with pure partitioning schemes [18]. The most common is a path<br />
summary, which is a minimal tree containing all unique root-to-node label paths seen in<br />
the data. The data nodes associated with a summary node is called the extent <strong>of</strong> the<br />
node. Figure 14a shows the path summary for the data in Figure 14b, <strong>and</strong> the extents<br />
are shown in Figure 14d.<br />
Figure 14: Structural partitioning example. (a) is path summary for (b), with extents<br />
shown in (d). (b) is F&B summary for (c), with extents shown in (e).<br />
Many alternative summary structures have been devised for general graphs. A structure<br />
which is also directly useful for trees is the stronger F&B-index [17], where two nodes<br />
are in the same equivalence class if they have the same prefix path, <strong>and</strong> have children <strong>of</strong><br />
the same equivalence classes. In the example, (b) is the F&B summary <strong>of</strong> the tree in (c).<br />
For graphs, the F&B index can be found in O(m log n) time, where m and n are the number
of edges and nodes in the graph. It is not known whether the F&B index can be found
more efficiently for trees.<br />
Opportunity 8 (Updates in stronger summaries). Simple path summaries are usually<br />
small, <strong>and</strong> are easily updateable. When traversing a data tree for indexing, the path<br />
summary is used as a deterministic automaton, where new nodes are added on the fly<br />
when needed. Data nodes can be put in the correct extents immediately. If a data tree is<br />
updated, only the data nodes whose extent changes are affected. An interesting question<br />
is the updatability <strong>of</strong> stronger structural summaries. In the worst case for the F&B<br />
index, the structure <strong>of</strong> an entire containing subtree below the global root could change<br />
if a data node is added or removed, by causing or removing equivalence with another<br />
subtree. What are the implications <strong>of</strong> strategies lessening the restrictions on the F&B<br />
index? Would this give critical “fragmentation effects” in practice? And are updates<br />
cheaper in coarser variants, such as the F+B-index [17]?<br />
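The deterministic-automaton construction of a simple path summary mentioned above can be sketched as follows; the nested-dict tree encoding is an assumption for illustration:

```python
# Build a path summary while traversing the data tree: the summary is a
# deterministic automaton over labels, new summary nodes are added on the
# fly, and each data node lands immediately in the correct extent.

def build_path_summary(data_root):
    next_id = [0]
    trans = {}     # (summary node, label) -> summary node
    extents = {}   # summary node -> list of data node ids

    def visit(node, summary_state):
        key = (summary_state, node["label"])
        if key not in trans:
            next_id[0] += 1
            trans[key] = next_id[0]      # add a summary node on the fly
        s = trans[key]
        extents.setdefault(s, []).append(node["id"])
        for child in node.get("children", []):
            visit(child, s)

    visit(data_root, 0)
    return trans, extents

tree = {"label": "a", "id": 1, "children": [
    {"label": "b", "id": 2}, {"label": "b", "id": 3},
]}
trans, extents = build_path_summary(tree)
print(extents)  # {1: [1], 2: [2, 3]}
```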
Opportunity 9 (Hashing F&B summaries). In some search scenarios, there are many<br />
small documents, which are parts <strong>of</strong> a virtual tree. Document updates can be implemented<br />
as document deletes <strong>and</strong> inserts. With simple path summaries, documents can be<br />
added with cost linear in the document size, by traversing the summary deterministically.<br />
However, more refined summaries are not deterministic. Are stronger summaries like the<br />
F&B index suitable in this model? A challenge is that matching a new document in the<br />
F&B index has cost linear in the size <strong>of</strong> the summary in the worst case, not the document.<br />
Assume now that a 2 , a 10 <strong>and</strong> a 15 in Figure 14c are document roots. The structure <strong>of</strong> each<br />
document is classified by a node on depth two in the F&B summary in Figure 14b. If a<br />
new document is added below b 1 , it will either have the structure defined by a [2] or a [7] ,<br />
or a new subtree will be added below b [1] . One possibility is to index the F&B summary<br />
by hashing each level 2 subtree, as these represent full document structures. When a new<br />
document is indexed, a summary <strong>of</strong> the document structure can be built <strong>and</strong> hashed, to<br />
identify a match in the global F&B index.<br />
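One possible, entirely hypothetical realization of this hashing idea canonicalizes each document's structure before hashing; note that sorting and de-duplicating child forms only approximates F&B (bisimulation) equivalence:

```python
# Hash a document's structure so that structurally equivalent documents
# hash alike; the hash then indexes the level-2 subtrees of the global
# F&B summary. The canonical form below is an assumption.
import hashlib

def canonical(node):
    """Label plus the sorted set of child canonical forms."""
    kids = sorted(set(canonical(c) for c in node.get("children", [])))
    return node["label"] + "(" + ",".join(kids) + ")"

def structure_hash(doc_root):
    return hashlib.sha256(canonical(doc_root).encode()).hexdigest()

d1 = {"label": "a", "children": [{"label": "b"}, {"label": "c"}]}
d2 = {"label": "a", "children": [{"label": "c"}, {"label": "b"}]}
print(structure_hash(d1) == structure_hash(d2))  # True: same structure class
```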
TwigVersion [27] is a twig matching approach which introduces a novel two-layer<br />
indexing scheme, with an F&B summary <strong>of</strong> the data, <strong>and</strong> a path summary <strong>of</strong> the F&B<br />
summary. This reduces the expense <strong>of</strong> matching in the F&B index. But as they only<br />
compare to twig join algorithms which do not use structural summaries, <strong>and</strong> also introduce<br />
many other ideas, it is hard to assess the usefulness of the two-layer approach itself. They
compare their two-layer approach with a pure F&B index, but do not state how they
search in it.
A common way to use path summaries is to match each individual root-to-leaf path,<br />
<strong>and</strong> prune away matches which cannot be part <strong>of</strong> a full match [6, 2]. Another solution,<br />
which is more robust for large path summaries, is to label partition the summary <strong>and</strong> run<br />
a full twig join algorithm on it. In [3] a novel combination <strong>of</strong> Twig 2 Stack <strong>and</strong> TwigStack<br />
is used for matching in large path summaries (see Section 3.3).<br />
Opportunity 10 (Exploring summary structures <strong>and</strong> how to search them). Many twig<br />
join algorithms have leveraged the benefits <strong>of</strong> path summaries. Stronger summaries like<br />
the F&B index are not as commonly used, maybe because <strong>of</strong> worst case size <strong>and</strong> implementational<br />
complexity. Using different algorithms to search various types <strong>and</strong> combinations<br />
<strong>of</strong> summaries has not been thoroughly explored. An evaluation should address the total<br />
benefits <strong>of</strong> different single- <strong>and</strong> multi-level strategies, but also detail the local cost <strong>of</strong><br />
specific matching methods in specific summary types <strong>of</strong> different sizes.<br />
Multi-stream access may be required for a single query node when partitioning on<br />
path, as there may be many path matches. One solution is to merge the streams. Another<br />
is to have a tag partitioned store, <strong>and</strong> filter the data nodes on matching path ID [28]. A<br />
speedup to this approach is to chain together nodes with the same path [18]. This is also<br />
useful when indexing text nodes on value <strong>and</strong> integrating structure information.<br />
S 3 [13] is a twig matching system which takes all possible combinations <strong>of</strong> individual<br />
streams matching query nodes based on prefix path, <strong>and</strong> solves each combination separately,<br />
merging the results <strong>of</strong> each evaluation. This approach does not give polynomial<br />
worst-case bounds.<br />
Blocking is the reason for the sub-optimality of TwigStack. When partial matches
must be output to evaluate the data and query in Figure 15a, it is because b 1 and c 1
block each other's access to c 2 and b 2 , respectively.
Figure 15: Cases <strong>of</strong> data <strong>and</strong> query blocking with (a) tag streaming, (b) tag+level streaming,<br />
(c) prefix path streaming. Adapted from [6].<br />
Using more fine grained partitioning <strong>and</strong> streaming solves some blocking issues, because<br />
there are multiple heads <strong>of</strong> streams. A partitioning which refines another always<br />
inherits the reduced blocking [6]. In tag+level streaming there is no blocking when only p-c
edges are used below the query root [6]. But in mixed queries, such as in Figure 15b,
blocking can still occur. There the stream for tag d level 3 is [d 1 , d 2 ] <strong>and</strong> the stream for<br />
c level 4 is [c 1 , c 2 ]. There are only two matches for the query, <strong>and</strong> data nodes c 1 <strong>and</strong> d 1<br />
block each other.<br />
Prefix path streaming results in no blocking when there is a single branching node in<br />
the query. It solves the case in Figure 15b optimally, but not the one in 15c. There the<br />
stream for the path abac is [c 1 , c 2 ], and the stream for the path ababe is [e 1 , e 2 ]. Even
though c 2 is also usable in the match with root a 1 , it cannot be known whether or not c 1
is usable, because e 1 blocks e 2 .
iTwigJoin [6] uses a specialized approach for accessing multiple useful streams, which<br />
supports any partitioning scheme. In its variant <strong>of</strong> getNext(q), it considers for each<br />
matching stream, the streams which are usable together with this stream, for each child<br />
<strong>of</strong> q. This reduces the amount <strong>of</strong> blocking when more fine grained partitioning is used.<br />
The space usage for iTwigJoin is O(d), <strong>and</strong> the running time is O(t(I + O)) when no<br />
blocking occurs, where t is the number <strong>of</strong> useful streams.<br />
Opportunity 11 (Access methods for multiple matching partitions). Strategies for accessing<br />
multiple useful streams include merging, filtering <strong>and</strong> chaining <strong>of</strong> input, informed<br />
merging access like in iTwigJoin, <strong>and</strong> merging the output <strong>of</strong> multiple joins. Many authors<br />
do not argue for the rationale <strong>of</strong> their choice <strong>of</strong> how to access multiple useful partitions<br />
for a node. A new access paradigm is presented in [6], but only tag streaming is compared<br />
with. The benefit <strong>of</strong> the method for accessing multiple matching streams is not<br />
separated from the benefit <strong>of</strong> reduced total input size. The technique reduces the number<br />
<strong>of</strong> intermediate paths output by phase one in TwigStack, <strong>and</strong> would undoubtedly reduce<br />
the amount of memory needed for intermediate results in time-optimal top-down algorithms
like HolisticTwigStack and TwigFast, but it is not certain whether it is a win on both
memory and speed in practice.
3.6 Virtual Streams<br />
Another approach that can lead to reading less input is using “virtual streams” for internal<br />
nodes, by inferring the existence <strong>of</strong> nodes from their descendants. This requires a position<br />
encoding which allows ancestor position reconstruction [26], such as Dewey [25]. Ancestor
label paths must also be inferable, and a path summary is an excellent tool for this.
Consider the example in Figure 16, where streams <strong>of</strong> nodes matching leaf label paths are<br />
shown. For the node 1.2.1.2 with path (4) a.a.b.a, it can be inferred that one c<strong>and</strong>idate<br />
for the query root is 1.2 with path (2) a.a.<br />
Figure 16: Virtual stream example. (a) Data. (b) Path summary. (c) Extents <strong>of</strong> summary<br />
nodes. (d) Query. (e) Leaf streams.<br />
Virtual Cursors is an implementation of virtual streams using Deweys and path summaries [28]. Generating a “next” match for an internal query node is done by going through the prefixes of a leaf node's Dewey and path, and using those where the ending tag is correct. The search stops when the new Dewey is lexicographically greater than the previous, meaning later in the pre-order. [28] does not give details on how ancestor candidates are generated, but this can be done in time linear in the depth of the leaf match used.

Forwarding the entire query is done by repeatedly picking a leaf with a “broken” path, forwarding it to containment by the maximal ancestor, and then forwarding all ancestors virtually to contain the leaf. In the system described by [28], tag streaming with path ID filtering was used, and B-trees were used for skipping during leaf forwarding.
Other virtual stream approaches have later been introduced. TJFast [21] is an independently developed algorithm which does not use a structural summary, but stores with each data node the root-to-node label path and the Dewey encoding together in a compressed format. Label paths are matched for each node processed. An improvement over Virtual Cursors as described is that path matching is also done when generating internal nodes, giving fewer useless candidates. Also, non-branching internal nodes can be ignored during query evaluation, because they are implicit from the path matchings of the nodes below and above. TJFast does not produce streams for internal nodes, but maintains sets of currently possible candidates.
TwigVersion [27] and S³ [13] (see Section 3.5) are non-holistic approaches which combine structural summaries and inference of internal node matches. TwigVersion computes sets of matches bottom-up. Each child query node generates a set of candidates for its parent query node based on its own matches, and these sets are then intersected. S³ uses the potentially exponential number of ways a query matches the summary, and evaluates each such match, merging the results. For one summary matching, it looks at the query leaf nodes pairwise, and merge-joins sets based on the lowest common ancestor query nodes. This could give large useless intermediate results. The holistic skipping algorithm TwigOptimal [10] does partially implement virtual streams through its “virtual positions” (see Section 3.4).
Opportunity 12 (Improved virtual streams). To reduce the number of matches and make it possible to ignore non-branching internal nodes, only structurally matching internal nodes should be generated. A structural summary can be used to avoid repeated matching of paths. However, how to store path matching information is not obvious. Given a matching path for a leaf query node, there may be an exponential number of combinations for matching the query nodes above. Should the matches be calculated on the fly as in TJFast, kept in independent sets for each node above a leaf match, or encoded in stacks? Or is it enough to store candidates for the lowest branching node above each leaf match, if the query nodes on a path are processed bottom-up?
It is common to store Deweys in a succinct format to reduce space usage, but in addition, some scheme should also be devised to reduce the redundancy of storing related Deweys in ancestor and descendant nodes. It is also preferable if node encodings do not have to be fully decompressed to be compared during query evaluation, but that the compressed storage format allows for cheap direct comparisons.
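One way to get such cheap direct comparisons is an order-preserving variable-length byte encoding of Dewey components, so that plain byte-wise comparison of two encoded Deweys agrees with their document order. This is a sketch under assumed bounds, not a scheme taken from the literature surveyed here:

```python
def encode_component(c):
    """Order-preserving variable-length encoding of one Dewey component:
    one byte for c < 128, two bytes for 128 <= c < 16512 (an assumed
    bound for this sketch).  Because every two-byte first byte (>= 0x80)
    sorts after every one-byte encoding, byte-wise comparison of the
    concatenated encodings agrees with component-wise Dewey order."""
    if c < 128:
        return bytes([c])
    if c < 16512:
        c -= 128
        return bytes([0x80 | (c >> 8), c & 0xFF])
    raise ValueError("component too large for this sketch")

def encode_dewey(dewey):
    """Concatenate the component encodings; the result is comparable
    without any decompression."""
    return b"".join(encode_component(c) for c in dewey)
```

For instance, `encode_dewey([1, 2, 1, 2]) < encode_dewey([1, 3])` holds, matching the pre-order of the corresponding nodes in Figure 16.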
Opportunity 13 (Holistic skipping among leaf streams). In some sense, virtual streams are skipping by not generating unrelated matches for internal query nodes. The Virtual Cursors algorithm does perform skipping which is “path-holistic”, in the way broken root-to-leaf paths are fixed. The order in which leaves are picked is not specified [28], but query node pre-order could have been used. If the lexicographically largest broken leaf was picked, the skipping would become truly holistic.

The work in [28], in combination with some intermediate result handling method from Section 3.2, may be a suitable starting point in the hunt for the “ultimate” twig matching approach, but this work is seldom compared with, or even referenced.
3.7 Query Difficulty Classes
As mentioned in Section 2, the “difficulty” of twig joins comes from the mixture of a-d edges followed by p-c edges in queries, in combination with structurally recursive labels in the data. [24] shows that when an a-d edge is never followed by a p-c edge downwards in a query, it can be evaluated in linear time and O(d) space. When there in addition is a single return node (as in XPath), it can also be evaluated in O(1) space. If, after combining all the advances listed in this paper, faster evaluation methods still exist for some classes of queries, practical implementations should take advantage of this.
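The class of [24] (no a-d edge followed downwards by a p-c edge) can be recognized with a simple traversal of the query tree. The representation below is a hypothetical sketch, not taken from the cited paper:

```python
def in_easy_class(children, edge_kind, root):
    """Check whether no a-d edge is followed by a p-c edge on any
    root-to-leaf path of the query.  `children` maps each query node to
    its list of children, and `edge_kind[(p, q)]` is "ad" or "pc".
    Both names and the representation are assumptions for this sketch."""
    def walk(node, seen_ad):
        for child in children.get(node, []):
            kind = edge_kind[(node, child)]
            if kind == "pc" and seen_ad:
                return False                 # a-d edge above a p-c edge
            if not walk(child, seen_ad or kind == "ad"):
                return False
        return True
    return walk(root, False)
```

A system could run this check once per query and dispatch to a simpler linear-time, O(d)-space evaluator when it succeeds.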
Opportunity 14 (Identifying and using difficulty classes). Can the correctness of using a simpler matching algorithm be decided not only from the query, but also from the query and the data? Structural summaries give possibilities for this. What happens if there are only single path candidates for some query nodes? What happens when the tree level for matches of a query node is fixed? What happens if data node matches with given paths for some query nodes fix path matches for other query nodes?
In [2] additional information is collected in a path summary, noting whether a node always has a given child, and whether there is at most one child. This information is used there to simplify query evaluation when there are non-return nodes in the query, such as in XPath. Could such statistics also allow detection of more cases where query evaluation can be simplified for general twig matching?
4 Conclusion

We have given a structured analysis of recent advances, and identified a number of opportunities for further research, focusing both on join algorithms and index organization strategies. Hopefully this has given an overview which has led us one step further towards unification of the numerous advances in this field.

One conclusion is that, given its sheer volume, it seems nearly impossible to consider all related work when presenting new twig join techniques. The field would benefit greatly from an open source repository of algorithms and data structures.
References

[1] S. Al-Khalifa, H. V. Jagadish, N. Koudas, J. M. Patel, and D. Srivastava. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE, 2002.

[2] Andrei Arion, Angela Bonifati, Ioana Manolescu, and Andrea Pugliese. Path summaries and path partitioning in modern XML databases. In Proc. WWW, 2006.

[3] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[4] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002.
Paper 4: Towards Unifying Advances in Twig Join Algorithms
[5] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig²Stack: Bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006.

[6] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.

[7] S. Y. Chien, Z. Vagena, D. Zhang, V. J. Tsotras, and C. Zaniolo. Efficient structural joins on indexed XML documents. In Proc. VLDB, 2002.

[8] Byron Choi. What are real DTDs like. Technical Report MS-CIS-02-05, University of Pennsylvania, 2002.

[9] Byron Choi, Malika Mahoui, and Derick Wood. On the optimality of holistic algorithms for twig queries. In Proc. DEXA, 2003.

[10] Marcus Fontoura, Vanja Josifovski, Eugene Shekita, and Beverly Yang. Optimizing cursor movement in holistic twig joins. In Proc. CIKM, 2005.

[11] Georg Gottlob, Christoph Koch, and Reinhard Pichler. Efficient algorithms for processing XPath queries. In Proc. VLDB, 2002.

[12] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. Knowl. and Data Eng., 2007.

[13] Sayyed Kamyar Izadi, Theo Härder, and Mostafa S. Haghjoo. S³: Evaluation of tree-pattern XML queries supported by structural summaries. Data Knowl. Eng., 2009.

[14] Haifeng Jiang, Hongjun Lu, Wei Wang, and Beng Chin Ooi. XR-tree: Indexing XML data for efficient structural joins. In Proc. ICDE, 2003.

[15] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003.

[16] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007.

[17] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[18] Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, and Raghu Ramakrishnan. On the integration of structure indexes and inverted lists. In Proc. SIGMOD, 2004.

[19] Guoliang Li, Jianhua Feng, Yong Zhang, and Lizhu Zhou. Efficient holistic twig joins in leaf-to-root combining with root-to-leaf way. In Proc. Advances in Databases: Concepts, Systems and Applications, 2007.
CHAPTER 4. INCLUDED PAPERS
[20] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.

[21] Jiaheng Lu, Tok Wang Ling, Chee-Yong Chan, and Ting Chen. From region encoding to extended Dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, 2005.

[22] Lu Qin. Personal correspondence, 2009.

[23] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007.

[24] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008.

[25] Igor Tatarinov, Stratis D. Viglas, Kevin Beyer, Jayavel Shanmugasundaram, Eugene Shekita, and Chun Zhang. Storing and querying ordered XML using a relational database system. In Proc. SIGMOD, 2002.

[26] Felix Weigel. Structural summaries as a core technology for efficient XML retrieval. PhD thesis, Ludwig-Maximilians-Universität München, 2006.

[27] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data & Knowl. Eng., 2008.

[28] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[29] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001.

[30] Junfeng Zhou, Min Xie, and Xiaofeng Meng. TwigStack⁺: Holistic twig join pruning using extended solution extension. Wuhan University Journal of Natural Sciences, 2007.
A Errata

1. Algorithms TwigList [23], HolisticTwigStack [16] and TwigFast [20] are incorrectly referred to as optimal in Figure 8, in Section 3.2, and in Section 3.5. Only algorithm Twig²Stack [5] is optimal as described.
Paper 5

Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland

Fast Optimal Twig Joins

Proceedings of the 36th International Conference on Very Large Data Bases (VLDB 2010).
Abstract. In XML search systems twig queries specify predicates on node values and on the structural relationships between nodes, and a key operation is to join individual query node matches into full twig matches. Linear-time twig join algorithms exist, but many non-optimal algorithms with better average-case performance have been introduced recently. These use somewhat simpler data structures that are faster in practice, but have exponential worst-case time complexity. In this paper we explore and extend the solution space spanned by previous approaches. We introduce new data structures and improved strategies for filtering out useless data nodes, yielding combinations that are both worst-case optimal and faster in practice. An experimental study shows that our best algorithm outperforms previous approaches by an average factor of three on common benchmarks. On queries with at least one unselective leaf node, our algorithm can be an order of magnitude faster, and it is never more than 20% slower on any tested benchmark query.
1 Introduction
XML has become the de facto standard for storing and transferring semistructured data due to its simplicity and flexibility [6], with XPath and XQuery as the standard query languages. XML documents have tree structure, where elements (tags) are internal tree nodes, and attributes and text values are leaf nodes. Information may be encoded both in structure and content, and query languages need the expressive power to specify both.

Twig pattern matching (TPM) is an abstract matching problem on trees, which covers a subset of XPath, which in turn is a subset of XQuery. TPM is important because it represents the majority of the workload in XML search systems [6]. Both data and queries (twigs) in TPM are node-labeled trees, with no distinction between node types. Figure 1 shows a twig query and data with a match. A match is a mapping of query nodes to data nodes that respects labels and the ancestor-descendant (A–D) and parent-child (P–C) relationships specified by the query edges, respectively represented by double and single lines in figures here.
[Figure 1 renders as graphics: a twig query and a data tree over the labeled node instances a₁...a₄, b₁...b₆, c₁...c₆, x₁, x₂, y₁, z₁, z₂.]

Figure 1: Twig query and data with matches.
Twig joins are algorithms for evaluating TPM queries on indexed data, where the index typically has one list of data nodes for each label. A query is evaluated by reading the label-matching data nodes for each query node, and combining these into full query matches. There exist algorithms that perform twig joins in worst-case optimal time [3], but current non-optimal algorithms that use simpler data structures are faster in practice [10, 11].

In this paper we present twig join algorithms that achieve worst-case optimality without sacrificing practical performance. Our main contributions are (i) a classification of filtering methods as weak or strict, and a discussion of how filtering influences practical and worst-case performance; (ii) level split vectors, a data structure yielding linear-time result enumeration with almost no practical overhead; (iii) getPart, a method for merging input streams that gives additional inexpensive filtering and practical speedup; (iv) TJStrictPost and TJStrictPre, worst-case optimal algorithms that unify and extend previous filtering strategies; and (v) a thorough experimental comparison of the effects of combining different techniques. Compared to the fastest previous solution, our best algorithm is on average three times as fast, and never more than 20% slower.
The scope of this paper is twig joins reading simple streams from label-partitioned data. See Section 6 for orthogonal related work that introduces other assumptions on how to partition and access the underlying data.
2 Background
A schema-agnostic system for indexing labeled trees usually maintains one list of data nodes per label. Each node is stored with position information that enables checking A–D and P–C relationships in constant time. A common approach is to assign intervals to nodes, such that containment reflects ancestry. Tree depth can then be used to determine parenthood [15].
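Such an interval scheme can be sketched as follows; the field names and the exact labeling are assumptions for illustration, not the encoding of any particular cited system:

```python
from dataclasses import dataclass

@dataclass
class Pos:
    """Interval position label: a node's interval strictly contains the
    intervals of exactly its descendants, and depth distinguishes a
    parent from other ancestors."""
    start: int
    end: int
    depth: int

def is_ancestor(a, d):
    """A-D check in constant time: containment of intervals."""
    return a.start < d.start and d.end < a.end

def is_parent(a, d):
    """P-C check: ancestry plus a depth difference of exactly one."""
    return is_ancestor(a, d) and a.depth == d.depth - 1
```

Both predicates are O(1), which is what makes the structural joins discussed below efficient.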
An early approach to twig joins was to evaluate query tree edges separately using binary joins, but when A–D edges are involved, this can give huge intermediate results even when the final set of matches is small [2]. This deficiency led to the introduction of multi-way twig join algorithms. TwigStack [2] can evaluate twig queries without P–C edges in linear time. It only uses memory linear in the maximum depth of the data tree. However, when P–C and A–D edges are mixed, more memory is needed to evaluate queries in linear time [13]. The example in Figure 2 hints at why.
More recent algorithms, which are used as a starting point for our methods, relax the memory requirement to be linear in the size of the input to the join. They follow a general scheme illustrated in Figure 3. The scheme has two phases, where the first phase has two components. The first component merges the stream of data node matches for each query node into a single stream of query and data node pairs. The second component organizes these matches into an intermediate data structure where matched A–D and P–C relationships are registered. This structure is used to enumerate results in the second phase.
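The first component, merging position-sorted per-query-node streams into one stream of pairs, can be sketched with a heap. Names and the stream representation are assumptions; the tie-breaking rules of Definitions 3 and 5 below are omitted:

```python
import heapq

def merge_streams(streams):
    """Merge per-query-node streams of data-node matches into a single
    stream of (query node, data node) pairs, ordered by data node
    position.  `streams` maps a query node to an iterable of
    (position, data node) pairs, each already position-sorted."""
    entries = list(streams.items())            # [(query node, stream)]
    iters = [iter(s) for _, s in entries]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first[0], i, first[1]))
    while heap:
        pos, i, node = heapq.heappop(heap)
        yield entries[i][0], node
        nxt = next(iters[i], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], i, nxt[1]))
```

The stream index `i` in the heap tuples breaks position ties deterministically; a full implementation would break ties by query node order instead, as required by the match orderings defined in Section 3.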
The algorithms broadly fall into two categories. So-called top-down and bottom-up algorithms process and store the data nodes in preorder and postorder, respectively, and filter data nodes on matched prefix paths and subtrees before they are added to the intermediate results. Many algorithms use both types of filtering, which means the processing is a hybrid of top-down and bottom-up.

[Figure 2 renders as graphics: a small query and a data tree over nodes a₁, a₂, b₁...bₙ₊₁, c₁...cₙ₊₁.]

Figure 2: Hard case with restricted memory. It cannot be known whether b₁, ..., bₙ are useful before cₙ₊₁ is seen, or whether c₁, ..., cₙ are useful before bₙ₊₁ is seen.

[Figure 3 renders as graphics: per-label input streams a₁a₂..., b₁b₂..., c₁c₂... feed Component 1, which emits a merged stream of pairs (c₁↦c₁, b₁↦b₁, ...) into Component 2; the intermediate results it builds are read by Phase 2, which enumerates matches such as (a₁, b₂, c₅) and (a₁, b₃, c₅).]

Figure 3: Work-flow of twig join algorithms, with input stream merge component, intermediate result construction component, and result enumeration phase.
Twig²Stack [3] was the first linear twig join algorithm. It reorders the input into a single postorder stream to build intermediate results bottom-up. The data nodes matching a query node are stored in a composite data structure (postorder-sorted lists of trees of stacks), as shown in Figure 4a. Matches are added to the intermediate results only if relations to child query nodes are satisfied, and each match has a list of pointers to usable child query node matches.
[Figure 4 renders as graphics: (a) Twig²Stack's postorder-sorted lists of trees of stacks, and (b) TwigList's plain vectors with descendant intervals, both populated with the a, b and c matches from Figure 1.]

Figure 4: Intermediate result data structures when evaluating the query in Figure 1. (a) Twig²Stack. (b) TwigList.
HolisticTwigStack [8] uses similar data structures, but builds intermediate results top-down in preorder, and filters matches on whether there is a usable match for the parent query node. It uses the getNext function from the TwigStack algorithm [2] as its input stream merge component, which implements an inexpensive weaker form of bottom-up filtering. The combined filtering does not give worst-case optimality, but results in faster average-case evaluation of queries than for Twig²Stack [8].
One approach for improving practical performance is using simpler data structures for intermediate results. TwigList [11] evaluates data nodes in postorder like Twig²Stack, but stores intermediate nodes in simple vectors, and does not differentiate between A–D and P–C relationships in the construction phase. Given a parent and child query node, the descendants of a match for the parent are found in an interval in the child's vector, as shown in Figure 4b. Interval start indexes are set as nodes are pushed onto a global stack in preorder, and end indexes are set as nodes are popped off the global stack in postorder. A node is added to the intermediate results if all descendant intervals are non-empty. Compared to Twig²Stack, this gives weaker filtering and non-linear worst-case performance, but it is more efficient in practice [11], according to the authors because of less computational overhead and better spatial locality.
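The interval construction just described can be sketched as follows. The event-based input representation and all names are assumptions for this sketch, and A–D and P–C edges are deliberately not distinguished, matching the construction phase described above:

```python
def twiglist_intervals(events, query_children):
    """TwigList-style intermediate results.  `events` lists the matched
    data nodes in document order as (kind, query node, data node)
    tuples with kind "open" or "close", properly nested.  Each accepted
    match is appended to a plain vector per query node, together with
    the [start, end) interval of each child query node's vector that
    holds its descendants."""
    vectors = {q: [] for q in query_children}
    stack = []                                   # global stack, preorder
    for kind, q, node in events:
        if kind == "open":
            # record where each child vector currently ends
            starts = {c: len(vectors[c]) for c in query_children[q]}
            stack.append((q, node, starts))
        else:                                    # "close": postorder
            q, node, starts = stack.pop()
            intervals = {c: (s, len(vectors[c])) for c, s in starts.items()}
            # keep the node only if every descendant interval is non-empty
            if all(s < e for s, e in intervals.values()):
                vectors[q].append((node, intervals))
    return vectors
```

This mirrors why the filtering is weak: a node is kept as soon as each child vector grew, without checking P–C edges or which descendants are actually usable.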
TwigFast [10] is an algorithm that uses data structures similar to those of TwigList, but stores data nodes in preorder. It uses the same preorder filtering as HolisticTwigStack, and inherits postorder filtering from the getNext input stream merge component. There are several algorithms that utilize both types of filtering, but among these TwigFast has the best practical performance [1, 3, 10]. Figure 5 gives an overview of twig join algorithms, and various properties that are introduced in Section 3.
Algorithm          Ref.     Filtering order  Path check  Subtree check  Interm. results  Optimal
getNext            [2]      GN               none        weak           N/A              N/A
TwigStack          [2]      GN+pre           strict      weak           complex          no
Twig²Stack         [3]      post             none        strict         complex          yes
½T²S               [3]      pre+post         weak        strict         complex          yes
½T²S               [1]      GN+pre+post      weak        strict         complex          yes
HolisticTwigStack  [8]      GN+pre           weak        weak           complex          no
TwigList           [11]     post             none        weak           vectors          no
TwigMix            [10]     GN+pre+post      weak        weak           vectors          incorrect
TwigFast           [10]     GN+pre           weak        weak           vectors          no
TJStrictPost       Sect. 4  pre+post         strict      strict         vectors          yes
TJStrictPre        Sect. 4  GN+pre(+post)    strict      strict         vectors          yes

Figure 5: Previous combinations of prefix path and subtree filtering. Intermediate result storage order is given by the last item in “Filtering order”. GN is the node order returned by the getNext input stream merger.
3 Premises for Performance

To make algorithms that are both fast in practice and worst-case optimal, we need an understanding of how filtering strategies and data structures impact performance.
For any graph G, let V(G) be its node set and E(G) be its edge set. Let a matching problem M be a triple 〈Q, D, I〉, where Q is a query tree, D is a data tree, and I ⊆ V(Q) × V(D) is a relation such that for q ↦ q′ ∈ I the node label of q equals the node label of q′. Each edge 〈p, q〉 ∈ E(Q) has a label L(〈p, q〉) ∈ {“A–D”, “P–C”}, specifying an ancestor-descendant or parent-child relationship. Let a node map for M be any function M ⊆ I. Assume a given M = 〈Q, D, I〉 when not otherwise specified.
Definition 1 (Weak/strict edge satisfaction). The node map M weakly satisfies a downward edge e = 〈p, q〉 ∈ E(Q) iff M(p) is an ancestor of M(q), and strictly satisfies e iff M(p) and M(q) are related as specified by L(e).
Definition 2 (Match). Given subgraphs Q″ ⊆ Q′ ⊆ Q, the node map M : V(Q′) → V(D) is a weak (strict) match for Q″ iff all edges in Q″ are weakly (strictly) satisfied by M. If Q″ = Q we call M a weak (strict) full match.

Where no confusion arises, the term weak (strict) match may also be used for M(Q). We denote the set of unique strict full matches by O. As is common, we view the size of the query as a constant, and call a twig join algorithm linear and optimal if the combined data and result complexity is O(I + O) [3].¹
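Definitions 1 and 2 translate directly into a check over a node map. The helpers below are assumed predicates over position labels (the ASCII labels "A-D"/"P-C" stand for the paper's edge labels), not code from the paper:

```python
def satisfies(M, query_edges, is_ancestor, is_parent, strict):
    """Check whether the node map M (query node -> position label)
    weakly satisfies every downward query edge, or, if `strict` is set,
    satisfies each edge exactly as its "A-D"/"P-C" label demands.
    `query_edges` maps an edge (p, q) to its label."""
    for (p, q), label in query_edges.items():
        if strict and label == "P-C":
            if not is_parent(M[p], M[q]):
                return False
        elif not is_ancestor(M[p], M[q]):
            return False                 # weak mode only needs ancestry
    return True
```

With interval-plus-depth position labels, a P–C edge mapped to a grandparent-grandchild pair passes the weak check but fails the strict one, which is precisely the weak/strict distinction of Definition 1.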
The results presented in the following all apply to both weak and strict matching, unless otherwise specified. The following lemma implies that we can use filtering strategies that only consider parts of the query.

Lemma 1 (Filtering). If there exists a Q′ ⊆ Q containing q where no match M′ for Q′ contains q ↦ q′, then there exists no match M for Q containing q ↦ q′.

Proof. By contraposition. Given a match M ∋ q ↦ q′ for Q, for any Q′ ⊆ Q containing q, the match M \ {p ↦ p′ | p ∉ Q′} matches Q′ and contains q ↦ q′.
3.1 Preorder Filtering on Matched Paths
Many current algorithms use the getNext input stream merge component [2], which returns data nodes in a relaxed preorder, which only dictates the order of matches for query nodes related by ancestry. This is not detailed in the original description [2] and is easy to miss.² The TwigMix algorithm incorrectly assumes strict preorder [10] (see Appendix E).
Definition 3 (Match preorder). The sequence of pairs q₁ ↦ q₁′, ..., qₖ ↦ qₖ′ ∈ V(Q) × V(D) is in global match preorder iff for any i < j either (1) qᵢ′ precedes qⱼ′ in tree preorder, or (2) qᵢ′ = qⱼ′ and qᵢ precedes qⱼ in tree postorder. The sequence is in local match preorder if (1) and (2) hold for any i < j where qᵢ = qⱼ or 〈qᵢ, qⱼ〉 ∈ E(Q) or 〈qⱼ, qᵢ〉 ∈ E(Q).
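The global ordering of Definition 3 can be realized as a sort key; the preorder and postorder rank tables are assumed inputs for this sketch:

```python
def global_match_preorder_key(pair, pre, post):
    """Sort key for Definition 3: order match pairs primarily by the
    data node's preorder rank, secondarily by the query node's
    postorder rank.  `pre` ranks data nodes, `post` ranks query nodes."""
    q, d = pair
    return (pre[d], post[q])
```

Sorting the merged pair stream by this key yields global match preorder, including the secondary query-node ordering whose purpose is explained below.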
The following definition formalizes a filtering criterion commonly used when processing data nodes in preorder, local or global.

Definition 4 (Prefix path match). M is a prefix path match for qₖ ∈ Q iff it is a match for the (simple) path q₁ ... qₖ, where q₁ is the root of Q.
To implement prefix path match filtering, preorder algorithms maintain the set of open nodes, i.e., the ancestors, at the current position in the tree. Most algorithms have one stack of open data nodes for each query node, and given a current pair q ↦ q′ pop non-ancestors of q′ from the stacks of q and its parent [2, 8, 10]. Weak filtering can then be implemented by checking if (i) q is the root, or (ii) the stack for the parent of q is non-empty. If q′ is not filtered out, it is pushed onto the stack for q, and added to the intermediate results. This can be extended to strict checking of P–C edges by inspecting the top node on the parent's stack. Strict prefix path matching is rarely used in practice, as can be seen from the fourth column in Figure 5.

¹ For Twig²Stack the combined data, query and result complexity is O(I log Q + I·b_Q + O·Q), where b_Q is the maximum branching factor in the query [3]. The TJStrict algorithms we present in Section 4 have the same complexity when using a heap-based input stream merger, and O(I·Q + O·Q) complexity when using a getNext-based input stream merger. Note that the total numbers of data node references in the input and output are |I| and |O|·|Q|, respectively.

² To be precise, getNext also returns matches for sibling query nodes in order.
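The weak and strict stack checks described above can be sketched as follows, assuming the stacks have already been popped down to ancestors of the current data node; all names are hypothetical:

```python
def prefix_path_check(q, d, stacks, query_parent, depth, edge_label):
    """Prefix path filtering for a current pair q -> d.  `stacks[p]`
    holds the open data-node matches for query node p (already popped
    down to ancestors of d), `query_parent` maps a query node to its
    parent, and `depth[x]` is x's tree depth.  Weak filtering only
    requires some open match for the parent query node; the strict
    check for a "P-C" edge also inspects the top of the parent's stack."""
    p = query_parent.get(q)
    if p is None:                        # (i) q is the query root
        return True
    if not stacks[p]:                    # (ii) parent stack non-empty
        return False
    if edge_label == "P-C":
        return depth[stacks[p][-1]] == depth[d] - 1
    return True
```

Only nodes passing this check would be pushed onto the stack for q and added to the intermediate results.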
The implementation of prefix path checks is the reason for the secondary ordering on query node postorder for match pairs in Definition 3. Without the secondary ordering, problems arise when multiple query nodes have the same label: a data node could be misinterpreted as a usable ancestor of itself when checking for non-empty stacks, or hide a proper parent of itself when checking top stack elements.

Algorithms storing intermediate results in postorder use a global stack for the query [3, 11], and inspection of the top stack node cannot be used to implement prefix path matching when the query contains A–D edges, as ancestors may be hidden deep in the stack. Extending these algorithms to implement prefix path filtering requires maintaining additional data structures.
The choice <strong>of</strong> preorder filtering does not influence optimality, as illustrated by the<br />
following example.<br />
Example 1. Assume non-branching data generated by /(α_1/)^n . . . (α_m/)^n β/γ, and the
query ⫽α_1⫽ . . . ⫽α_m/γ, where α_1, . . . , α_m, β, γ are all distinct labels. Unless it is signaled
bottom-up that the pattern α_m/γ is not matched below, the result enumeration phase will
take Ω(n^m) time, because all combinations of matches for α_1, . . . , α_m will be tested.
3.2 Postorder Filtering on Matched Subtrees

The ordering of match pairs required by most bottom-up algorithms is symmetric with
the global preorder case:
Definition 5 (Match postorder). A sequence of pairs q_1↦→q′_1, . . . , q_k↦→q′_k is in match
postorder iff for any i < j either (1) q′_i precedes q′_j in tree postorder, or (2) q′_i = q′_j and
q_i precedes q_j in tree preorder.
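Definition 5 can be read as a two-key comparator. A minimal sketch, assuming each pair carries the data node's tree postorder rank and the query node's tree preorder rank (the field names are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative match pair: query node by tree preorder rank,
// data node by tree postorder rank.
struct MatchPair { int qPre; int dPost; };

// Match postorder (Definition 5): primary key is the data node's tree
// postorder; ties on the same data node are broken by the query node's
// tree preorder.
bool matchPostorderLess(const MatchPair& a, const MatchPair& b) {
    if (a.dPost != b.dPost) return a.dPost < b.dPost;
    return a.qPre < b.qPre;
}
```

Sorting any pair sequence with this comparator, e.g. via `std::sort`, yields match postorder.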
The second property is required for the correctness of both Twig²Stack [3], where a
data node could hide proper children of itself, and TwigList [11], where a node could be
added as a descendant of itself.
Definition 6 (Subtree match). M is a subtree match for q ∈ Q iff it is a match for the
subtree rooted at q.
Example 1 also illustrates why strict subtree match checking is required for optimality,
because no node labeled γ is a direct child of a node labeled α_m in the data. As
described in Section 2, Twig²Stack and TwigList respectively implement strict and weak
subtree match filtering, and for these algorithms strict filtering is required for optimality.
TwigList could be extended to strict filtering by traversing descendant intervals to
look for direct children, but this would have quadratic cost if implemented naively, as
descendant intervals may overlap.
In algorithms storing data in preorder, nodes are added to the intermediate results
after passing preorder checks. If nodes are stored in arrays, later removing a node failing
a postorder check from the middle of an array would incur significant cost. Note that
many of the algorithms listed in Figure 5 inherit weak subtree match filtering from the
input stream merger used, as described later in Section 4.4.
3.3 Result Enumeration

Even if the strict subtree match checking sketched for TwigList above could be implemented
efficiently, results would still not be enumerated in linear time, as usable child
nodes may be scattered throughout descendant intervals, as shown by the following example:
Example 2. Assume a tree constructed from the nodes {a_1, . . . , a_n, b_1, . . . , b_2n}, labeled
a and b. Let each node a_i have a left child b_i, a right child b_{n+i}, and a middle child a_{i+1}
if i < n. Given the query ⫽a/b, each node a_i is part of two matches, one with b_i and one
with b_{n+i}, but to find these two matches, 2n − 2i useless b-nodes must be traversed in the
descendant interval. This gives a total enumeration cost of Ω(n²).
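The Ω(n²) bound in Example 2 can be verified by simple counting: each a_i scans 2(n − i) useless b-nodes, which sums to n(n − 1) over the whole tree. A small sketch of this arithmetic:

```cpp
#include <cassert>

// Total useless b-nodes scanned in Example 2: the descendant interval
// of a_i holds 2(n - i) b-nodes that are not its children, so the
// enumeration touches sum_{i=1..n} 2(n - i) = n(n - 1) useless entries.
long uselessTraversals(long n) {
    long total = 0;
    for (long i = 1; i <= n; ++i) total += 2 * (n - i);
    return total;
}
```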
The following theorem formalizes properties of the intermediate result storage in
Twig²Stack that are key to its optimality.

Theorem 1 (Linear time result enumeration [3]). The result set O can be enumerated in
Θ(O) time if (i) the data nodes d such that root(Q)↦→d is part of a strict full match can
be found in time linear in their number, and (ii) given a pair q↦→q′, for each child c of
q, the pairs c↦→c′ that are part of a strict subtree match for q together with q↦→q′ can be
enumerated in time linear in their number.
3.4 Full Matches

When different types of filtering strategies are combined, it may be interesting to know
when additional filtering passes will not remove more nodes.

Theorem 2 (Full Match). (i) A pair q↦→q′ is part of a full match iff (ii) it is part of a
prefix path match that only uses pairs that are part of subtree matches.
Proof sketch. (i ⇒ ii) Follows from Lemma 1. (i ⇐ ii) Let M = 〈Q, D, I〉 be the initial
matching problem, and M′ = 〈Q, D, I′〉 be the matching problem where I′ is the set of
pairs that are part of subtree matches in M. The theorem is true for pairs with the query
root, as for the query root a subtree match is a full match. Assume that there is a prefix
path match M↓q ∋ q↦→q′ for q in M′, and that p is the parent of q. By construction,
M↓q ∋ p↦→p′ is also a prefix path match for p. We use induction on the query node
depth, and prove that if p↦→p′ is part of a full match M_p for p, then q↦→q′ must be
part of a full match for q. Let Q_q be the subtree rooted at q, and Q_p = Q \ Q_q. Let
M′_p = M_p \ {r↦→r′ | r ∈ Q_q}. By the assumption q↦→q′ ∈ I′, there exists a subtree match
M↑q ∋ q↦→q′ for q. Then the node map M_q = M′_p ∪ M↑q is a full match for q, because
(1) p↦→p′ ∈ M′_p and q↦→q′ ∈ M↑q must satisfy the edge 〈p, q〉 as they are used together
in M↓q, and (2) Q_p and Q_q can be matched independently when the mapping of p and q
is fixed.
In other words, if nodes are filtered first in postorder on strictly matched subtrees, and
then in preorder on strictly matched prefix paths, the intermediate result set contains only
data nodes that are part of the final result. The opposite is not true: in the example in
Figure 1, the pair c↦→c_3 would not be removed if strict prefix path filtering was followed
by strict subtree match filtering.

Note that strict full match filtering is not necessary for optimal enumeration by
Theorem 1, and that the optimal algorithms we present in the following do not use it. They
use prefix path match filtering followed by subtree match filtering, where the former is
only used to speed up practical performance. On the other hand, the input stream merge
component we introduce in Section 4.4 gives inexpensive weak full match filtering.
4 Fast Optimal Twig Joins

In this section we create an algorithmic framework that permits any combination of
preorder and postorder filtering. First we introduce a new data structure that enables
strict subtree match checking and linear result enumeration.
4.1 Level Split Vectors

A key to the practical performance of TwigList and TwigFast is the storage of intermediate
nodes in simple vectors [11], but this scheme makes it hard to improve worst-case behavior.
To enable direct access to usable child matches below both A–D and P–C edges, we
split the intermediate result vectors for query nodes below P–C edges, such that there is
one vector for each data tree level observed, as shown in Figure 6. Given a node in the
intermediate results, matches for a child query node below an A–D edge can be found in
a regular descendant interval, while the matches for a child query node below a P–C edge
can be found in a child interval in a child vector. This vector can be identified by the
depth of the parent data node plus one.

In the following we assume that level split vectors can be accessed by level in amortized
constant time. This is true if level split vectors are stored in dynamic arrays, and the
depth of the deepest data node is asymptotically bounded by the size of the input, that
is, d ∈ O(I). If this bound does not hold, which can be the case when |I| ≪ |D|, expected
linear performance can be achieved by storing level split vectors in hash maps.
When nodes are added to the intermediate results in postorder, checking for non-empty
descendant and child intervals inductively implies strict subtree match filtering.
This is illustrated for our example in Figure 6. As each check takes constant time,
the intermediate results can be constructed in Θ(I) time, as for TwigList [11]. Result
enumeration with this data structure is a trivial recursive iteration through nodes and
their child and descendant intervals, which is almost identical to the enumeration in
TwigList. The difference is that no extra check is necessary for P–C relationships, and
that the result enumeration is Θ(O) by Theorem 1 when strict subtree match filtering is
applied.
[Figure 6 content: vectors for b (b_1, b_2, b_4, b_3, b_5, b_6) and a (a_4, a_1), with the matches for c split into one vector per data tree level: 1: c_1; 2: c_2; 3: c_5; 4: c_4, c_6; 5: c_3.]

Figure 6: Intermediate data structures using level split vectors and strict subtree match
filtering. As opposed to in Figure 4b, a_2 does not satisfy.
4.2 The TJStrictPost Algorithm

Algorithm 1 shows the general framework we use for postorder construction of intermediate
results, extending algorithms like TwigList [11]. It allows using any combination of
the preorder and postorder checks described in Section 3, from none to weak and strict,
and allows using either simple vectors or level split vectors. A global stack is used to
maintain the set of open data nodes, and if prefix path matching is implemented, a local
stack for each query node is maintained in parallel. The input stream merge component
used is a priority queue implemented with a binary heap. The postorder storage approach
used here requires global ordering, and cannot read local preorder input (see Appendix E).
Algorithm 1 Postorder construction.
While ¬Eof:
    Read next q↦→d.
    While non-ancestors of d on global stack:
        Pop q′↦→d′ from global and local stack.
        If q′↦→d′ satisfies postorder checks:
            Set interval end index for d′ in the vector of each child of q′.
            Add d′ to intermediate results for q′.
    If d satisfies preorder checks:
        For each child of q, set interval start index for d.
        Push q↦→d on global stack.
        Push d on local stack for q.
Clean remaining nodes from the stacks.
Enumerate results.
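The ancestor test behind the "pop non-ancestors of d" step can be sketched with region encoding (our own minimal layout; the paper's nodes also carry levels and interval indexes):

```cpp
#include <cassert>
#include <vector>

// Region encoding: a is an ancestor of d iff a's range encloses d's.
struct Region { int start, end; };

bool isAncestor(const Region& a, const Region& d) {
    return a.start < d.start && d.end < a.end;
}

// Stack cleaning as in Algorithm 1: input arrives in preorder, so any
// stack entry that does not enclose the next node d can never be an
// ancestor of any later node either, and is safe to pop.
void popNonAncestors(std::vector<Region>& stack, const Region& d) {
    while (!stack.empty() && !isAncestor(stack.back(), d)) stack.pop_back();
}
```

Because each node is pushed and popped at most once, the cleaning loop costs amortized constant time per input pair.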
The correctness of Algorithm 1 follows from the correctness of the filtering strategies
described in Sections 3.1 and 3.2, and the correctness of TwigList [11], with the
enumeration algorithm trivially extended to use child intervals when level split vectors are
used.
We now define the TJStrictPost algorithm, which builds on this framework. Detailed
pseudocode can be found in Appendix A. The algorithm uses level split vectors, and, as
opposed to the previous twig join algorithms listed in Figure 5, it includes strict checking
of both matched prefix paths and subtrees. The former is implemented by checking the top
data node on the local stack of the parent query node, while the latter is implemented by
checking for non-empty child and descendant intervals. A Θ(I + O) running time follows
from the discussion in Section 4.1.
4.2.1 A Note on TwigList

As noted in the original description, chaining nodes with the same tree level into linked
lists inside descendant intervals can improve practical performance in TwigList [11]. However,
as the first child match with the correct level must still be searched for, further
changes are needed to achieve linear worst-case evaluation. This can be implemented by
maintaining a vector for each query node holding the previous match on each tree level at
any time. A node must then be given pointers to such previous matches as it is pushed
on the stack in TwigList. When the node is popped off the stack, it can then be checked
whether any children have been found, and intermediate results can be enumerated in
linear time, assuming that d ∈ O(I), as for our solution.
4.3 The TJStrictPre Algorithm

Algorithm 2 shows the general framework we use to construct intermediate results in preorder,
extending algorithms like TwigFast [10]. It supports any combination of preorder
and postorder filtering, simple or level split vectors, and input in global or local preorder.
In contrast to postorder storage, nodes are inserted directly into the intermediate result
vectors after they have passed a prefix path check. Local stacks store references to open
nodes in the intermediate results. If strict subtree match filtering is required, or weak
subtree match filtering is not implied by the input stream merger, intermediate results
are filtered bottom-up in a post-processing pass.
TwigFast is reported to have faster average-case query evaluation than previous twig
joins [10], and we hope to match this performance in a worst-case linear algorithm. The
TJStrictPre algorithm is similar to TJStrictPost: it uses strict checking of prefix paths
and subtrees, and stores intermediate results in level split vectors. See detailed pseudocode
in Appendix B. TJStrictPre uses the getPart input stream merger, which is an
improvement of getNext described in Section 4.4. If the post-processing pass can be performed
in linear time, then the algorithm can evaluate twig queries in Θ(I + O) time, by
the same argument as for TJStrictPost.
The filtering pass is implemented by cleaning intermediate result vectors bottom-up
in the query, in-place overwriting data nodes not satisfying subtree matches, as described
in detail in Appendix D. The indexes of kept nodes are stored in a separate vector for
each query node, and are used to translate old start and end indexes into new positions.
Algorithm 2 Preorder construction.
While ¬Eof:
    Read next q↦→d.
    For the stack of both q's parent and q itself:
        Pop non-ancestors of d, and set their end indexes.
    If d satisfies preorder checks:
        For each child of q, set interval start index for d.
        Add d to intermediate results for q.
        Push reference to d on stack for q.
Clean stacks.
Clean intermediate results with postorder checks.
Enumerate results.
To achieve linear traversal, the intermediate result vector of a node is traversed in parallel
with the index vectors of the children after they have been cleaned. For level split vectors,
there is one separate index vector per used level, and a separate vector iterator is used
per level when the parent is cleaned. Also, there is an array giving the level of each stored
data node in preorder, such that split and non-split child vectors can then be traversed
in parallel. Start values are updated as nodes are pushed onto a stack in preorder, while
end values are updated as nodes are popped off in postorder.
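The in-place cleaning step might be sketched as follows: entries failing the subtree check are overwritten, and a vector of new positions lets stored start and end indexes be rebased afterwards. Function and parameter names are ours, not the paper's.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Compact v in place, keeping entries for which `keep` holds. Returns
// newIndex so an old interval [s, e) over v can be translated to new
// positions; dropped entries get newIndex[i] == -1.
std::vector<int> compactInPlace(std::vector<int>& v,
                                const std::function<bool(int)>& keep) {
    std::vector<int> newIndex(v.size(), -1);
    size_t write = 0;
    for (size_t read = 0; read < v.size(); ++read) {
        if (keep(v[read])) {
            newIndex[read] = (int)write;
            v[write++] = v[read];  // overwrite failed entries in place
        }
    }
    v.resize(write);
    return newIndex;
}
```

One such pass per query node, bottom-up, keeps the whole post-processing step linear in the size of the intermediate results.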
4.4 The getPart Input Stream Merger

The getNext input stream merge component implements weak subtree match filtering
in Θ(I) time, and is used to improve practical performance in many current algorithms
using preorder storage [8, 10]. Assume in the following discussion that there is one preorder
stream of label-matching data nodes associated with each query node. The input stream
merger repeatedly returns pairs containing a query node and the data node at the head
of its stream, implementing Comp. 1 in Figure 3.
The getNext function processes the query bottom-up to find a query node that satisfies
the following three properties: (1) when its stream is forwarded at least until its head
follows the heads of the streams of the children in postorder, it still precedes them in
preorder, (2) all children satisfy properties 1 and 2, and (3) if there is a parent, it does
not satisfy 1 and 2. Property 2 implies that weak subtree filtering is achieved, and
Property 3 implies that local preorder by Definition 3 is achieved.
The procedure is efficient if leaf query nodes have relatively few matches, which can
be the case in practice in XML search when all query leaf nodes are selective text value
predicates. However, if the internal query nodes are more selective than the leaf nodes,
or if not all leaves are selective, the overhead of using the getNext function may outweigh
the benefits.
To improve practical performance we introduce the getPart function, which requires
the following property in addition to the above three: (4) if there is a parent, then the
current head of stream is a descendant of a data node that was the head of stream for
the parent in some previous subtree match for the parent. This inductively implies that
nodes returned are also weak prefix path matches, and from the ordering of the filtering
steps, the result is weak full match filtering by Theorem 2. To allow forwarding streams
to find such nodes, the algorithm can no longer be stateless, as shown by the following
example:
Example 3. Assume that the heads of the streams for query nodes a_1 and b_1 in Figure 1
are a_3 and b_2, respectively. Then it cannot be known by only inspecting heads of streams
whether or not any usable ancestors of b_2 were seen before a_3, and b_2 must be returned
regardless.
Property 4 is implemented in getPart by maintaining, for each query node, the data
node latest in the tree postorder that has been part of a weak full match. This value is
updated when a query node is found to satisfy all four properties. To ensure progress,
streams are forwarded top-down in the query to match the stored value or the current
head for the parent node. Note that multiple top-down and bottom-up passes may be
needed to find a satisfying node, but each such pass forwards at least one stream past
useless matches. See detailed pseudocode in Appendix C.
5 Experiments

The following experiments explore the effects of weak and strict matching of prefix paths
and subtrees, different input stream merge functions, and level split vectors.
We have used the DBLP, XMark and Treebank benchmark data, and run the commonly
used DBLP queries from the PRIX paper [12], the XMark queries from the XPathMark
suite part A [5] (except queries 7 and 8, which are not twigs), and the Treebank
queries from the TwigList paper [11]. In addition, we have created some artificial data and
queries. Details can be found in Appendix F. The experiments were run on a computer
with an AMD Athlon 64 3500+ processor and 4 GB of RAM. All queries were warmed
up by 3 runs and then run 100 times, or until at least 10 seconds had passed, measuring
average running time.
All algorithms are implemented in C++, and features are turned on or off at compile
time to make sure the overhead of complex methods does not affect simpler methods.
Feature combinations are coded with 5-letter tags. We use Heap, getNext and getPart
for merging the input streams, and store intermediate results in prEorder or pOstorder.
We use no (-), Weak or Strict prefix path match filtering, and no (-), Weak or Strict
subtree match filtering. Intermediate results are stored in simple vectors (-) or Level
split vectors. The previous algorithms TwigList and TwigFast are denoted by HO-W- and
NEWW-, respectively, while TJStrictPost and TJStrictPre are denoted by HOSSL and
PESSL. Note that filtering checks are not performed in intermediate result construction
if the given filtering level is already achieved by the input stream merger. Strict subtree
match filtering is implemented by descendant interval traversal when not using level split
vectors. With preorder storage an extra filtering pass is used to implement subtree match
filtering. Worst-case optimal algorithms match the pattern ***SL.
We present no performance comparisons with the Twig²Stack algorithm because it
does not fit into our general framework. Getting an accurate and fair comparison would
be an exercise in independent program optimization. TwigList is previously reported to
be 2–8 times faster than Twig²Stack for queries similar to ours [11].
5.1 Checked Paths and Subtrees

Figure 7 shows results for running the XMark query ⫽annotation/keyword, with cost
divided into different components and phases. Filtering lowers the cost of building intermediate
data structures because their size is reduced, and the cost of enumerating
results because redundant traversal is avoided. Note that this query was chosen because
it shows the potential effects of prefix path and subtree match filtering, and it may not
be representative.
[Figure 7 content: horizontal bar chart over the methods HO---, HO-W-, HO-S-, HO-SL, HOW--, HOWW-, HOWS-, HOWSL, HOS--, HOSW-, HOSS- and HOSSL, with each bar (0 to about 0.1 seconds) split into merge, construct and enumerate phases.]

Figure 7: Query //annotation/keyword on XMark data. Cost divided into merging
input, building intermediate results, and result enumeration.
Figure 8 shows the effects of prefix path vs. subtree match filtering averaged over all
queries on DBLP, XMark and Treebank. Heap input stream merging was used because it
allows all filtering levels, and postorder storage was used to avoid extra filtering passes.
Each timing has been normalized to 1 for the fastest method for each query. Raw timings
are listed in Appendix G. As opposed to in Figure 7, there is on average little difference
between the methods using at least some filtering both in preorder and postorder. The
benefits of asymptotically constant-time checks when using level split vectors seem to be
outweighed by the cost of maintaining and accessing them, but only by a small margin.
5.2 Reading Input

Figure 9 shows the effect of using different input stream mergers. The labels in the
artificial data tree used are Zipf distributed (s = 1), with a, b, y and z being the labels on
30%, 13%, 1.0% and 1.0% of the nodes, respectively. The data and queries were chosen
to shed light on both the benefits and the possible overhead of using the advanced input
methods.
                        Prefix path
              -            W            S            (all)
 Subtree      avg  max     avg  max     avg  max     avg  max
  --          1.84 3.5     1.29 1.61    1.23 1.62    1.45 3.5
  W-          1.24 1.45    1.03 1.13    1.01 1.06    1.09 1.45
  S-          1.24 1.46    1.03 1.10    1.02 1.08    1.10 1.46
  SL          1.32 1.55    1.09 1.19    1.06 1.20    1.16 1.55
  (all)       1.41 3.5     1.11 1.61    1.08 1.62    1.20 3.5

Figure 8: The effect of parent match filtering vs. child match filtering, running the DBLP,
XPathMark and Treebank queries. Query times are normalized to 1 for the fastest method
for each query; arithmetic mean and maximum of the normalized times are shown.

[Figure 9 content: three bar charts comparing HO---, HE---, NE-W-, PEWW-, HOSSL, HESSL, NESSL and PESSL, with a different time scale per panel.]

Figure 9: Running queries on Zipf data. (a) Selective leaves. (b) Selective internal nodes.
(c) No selective nodes.

For the first query, ⫽a/b[y][z], the leaves are very selective, and both getNext and
getPart very efficiently filter away most of the nodes. The input stream merging is slightly
more efficient for the simpler getNext. In the second query, ⫽y/z[a][b], the internal nodes
are selective, while the leaves are not. Here getPart efficiently filters away many nodes,
while getNext does not, making it even slower than the simple heap, due to the additional
complexity. The third query shows a case where getPart performs worse than both the
other methods. In this query, ⫽a[a[a][a]][a[a][a]], all query nodes have very low selectivity,
and are equally probable. The filtering has almost no effect, and only causes overhead.
Note the cost difference between HOSSL and HESSL, which is due to the additional filtering
pass over the large intermediate results.
5.3 Combined Benefits

Figure 10 shows the effects of combining different input stream mergers and additional
filtering strategies. The same queries as in Figure 8 are evaluated, and the first column
shows the same tendencies: there is not much difference between the strategies as long
as at least weak match filtering is done on both prefix path and subtree.
                              Input stream merger
                HO          HE          NE          PE            (all)
 Filtering      avg max     avg max     avg max     avg  max      avg max
  ---           7.0 33      6.6 30                                6.8 33
  -W-           4.5 23      7.5 34      3.7 19                    5.2 34
  -S-           4.5 23      7.5 34      3.7 18                    5.2 34
  -SL           4.8 25      8.3 38      3.8 19                    5.6 38
  W--           4.9 24      4.9 23                                4.9 24
  WW-           3.7 19      5.4 25      3.2 15      1.02 1.11     3.3 25
  WS-           3.7 19      5.4 26      3.2 15      1.04 1.15     3.3 26
  WSL           3.9 20      5.7 27      3.2 15      1.08 1.22     3.5 27
  S--           4.8 24      4.8 24                                4.8 24
  SW-           3.7 19      5.2 26      3.2 15      1.03 1.12     3.3 26
  SS-           3.7 19      5.1 26      3.2 15      1.05 1.17     3.3 26
  SSL           3.9 20      5.5 27      3.2 15      1.05 1.20     3.4 27
  (all)         4.4 33      6.0 38      3.4 19      1.04 1.22     4.1 38

Figure 10: Input mergers vs. filtering strategies.
In the second column all methods using any subtree match filtering are more expensive,
because with preorder storage, subtree match filtering is performed in a second pass over
the intermediate results. A second pass is also used for subtree match filtering in
the third and fourth columns, but in practice the getNext and getPart components have
already filtered away more nodes, the intermediate results are smaller, and the second
pass is less expensive.

Note the difference between using getNext and getPart. The new method is more than
three times as fast on average, and is more than one order of magnitude faster for queries
where only some of the leaf nodes are selective. The getPart function also fast-forwards
through useless matches for the leaves that are not selective, while getNext passes all
leaf matches on to the intermediate result construction component. Also note that the
maximum overhead of using PESSL, the fastest worst-case optimal method, is at most
20% in any benchmark query tested.
6 Related Work

This work is based on the assumption that label partitioning and simple streams are used.
Orthogonal previous work investigates how the underlying data can be accessed. If the
streams support skipping, both unnecessary I/O and computation can be avoided [7].
Our getPart algorithm, which is detailed in Appendix C, can be modified to use any
underlying skipping technology by changing the implementation of FwdToAncOf() and
FwdToDescOf(). Refined partitioning schemes with structure indexing can be used to
reduce the number of data nodes read for each query node [4, 9]. Our twig join algorithms
are independent of the partitioning scheme used, assuming multiple partition
blocks matching a single query node are merged when read. Another technique is to use a
node encoding that allows reconstruction of data node ancestors, and use virtual streams
for the internal query nodes [14]. Our getPart algorithm could be changed to generate
virtual internal query node matches from leaf query node matches, as complete query
subtrees are always traversed. For a broader view on XML indexing see the survey by
Gou and Chirkova [6]. XPath queries can be rewritten to use only the axes self, child,
descendant, and following [16]. To add support for the following axis, we would
have to add additional logic for how to forward streams, and modify the data structures
to store start indexes for the new relationship.
7 Conclusions and Future Work
In this paper we have shown how worst-case optimality and fast evaluation in practice can be combined in twig joins. We have performed experiments that span out and extend the space of the fastest previous solutions. For common benchmark queries our new and worst-case optimal algorithms are on average three times as fast as earlier approaches. Sometimes they are more than an order of magnitude faster, and they are never more than 20% slower.

In future work we would like to combine the new techniques with previous orthogonal techniques such as skipping, refined partitioning and virtual streams. Also, it would be interesting to see an elegant worst-case linear algorithm reading local preorder input and producing preorder sorted results, that does not perform a post-processing pass over the data, and does not need the assumption d ∈ O(I).

Acknowledgments This material is based upon work supported by the iAd Project funded by the Research Council of Norway, and the Norwegian University of Science and Technology. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of the funding agencies.
References

[1] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[2] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proc. SIGMOD, 2002.

[3] Songting Chen, Hua-Gang Li, Junichi Tatemura, Wang-Pin Hsiung, Divyakant Agrawal, and K. Selçuk Candan. Twig²Stack: Bottom-up processing of generalized-tree-pattern queries over XML documents. In Proc. VLDB, 2006.
[4] Ting Chen, Jiaheng Lu, and Tok Wang Ling. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. SIGMOD, 2005.

[5] Massimo Franceschet. XPathMark web page. http://sole.dimi.uniud.it/~massimo.franceschet/xpathmark/.

[6] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. Knowl. and Data Eng., 2007.

[7] Haifeng Jiang, Wei Wang, Hongjun Lu, and Jeffrey Xu Yu. Holistic twig joins on indexed XML documents. In Proc. VLDB, 2003.

[8] Zhewei Jiang, Cheng Luo, Wen-Chi Hou, Qiang Zhu, and Dunren Che. Efficient processing of XML twig pattern: A novel one-phase holistic solution. In Proc. DEXA, 2007.

[9] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[10] Jiang Li and Junhu Wang. Fast matching of twig patterns. In Proc. DEXA, 2008.

[11] Lu Qin, Jeffrey Xu Yu, and Bolin Ding. TwigList: Make twig pattern matching fast. In Proc. DASFAA, 2007.

[12] Praveen Rao and Bongki Moon. PRIX: Indexing and querying XML using Prüfer sequences. In Proc. ICDE, 2004.

[13] Mirit Shalem and Ziv Bar-Yossef. The space complexity of processing XML twig queries over indexed documents. In Proc. ICDE, 2008.

[14] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[15] Chun Zhang, Jeffrey Naughton, David DeWitt, Qiong Luo, and Guy Lohman. On supporting containment queries in relational database management systems. SIGMOD Rec., 2001.

[16] Ning Zhang, Varun Kacholia, and M. Tamer Özsu. A succinct physical storage scheme for efficient evaluation of path queries in XML. In Proc. ICDE, 2004.
A TJStrictPost Pseudocode

Algorithm 3 shows more detailed pseudocode for the TJStrictPost algorithm described in Section 4.2. Tree positions are assumed to be encoded using BEL (begin, end, level) [15]. The function GetVector(q, level) returns the regular intermediate result vector if q is below an A–D edge, or the split vector given by level if q is below a P–C edge.
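The containment tests implied by the BEL encoding, used throughout the algorithms in these appendices, can be sketched in Python (the class and function names are ours, for illustration only):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BEL:
    """Tree position as a (begin, end, level) interval encoding [15]."""
    begin: int
    end: int
    level: int

def is_ancestor(a: BEL, d: BEL) -> bool:
    # a is a proper ancestor of d iff d's interval nests strictly inside a's.
    return a.begin < d.begin and d.end < a.end

def is_parent(a: BEL, d: BEL) -> bool:
    # Parent-child (P-C) additionally requires adjacent levels.
    return is_ancestor(a, d) and d.level == a.level + 1
```

Checks such as CheckParentMatch in Algorithm 3 reduce to these interval and level comparisons, which is why only ends and levels need to be kept on the stacks.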
Algorithm 3 TJStrictPost
1: function EvaluateGlobal():
2:   (c, q, d) ← MergedStreamHead()
3:   while c ≠ Eof:
4:     ProcessGlobalDisjoint(d)
5:     if Open(q, d):
6:       Push(S_global, (d.end, q))
7:     (c, q, d) ← MergedStreamHead()
8:   ProcessGlobalDisjoint(∞)
9: function ProcessGlobalDisjoint(d):
10:   while S_global ≠ ∅ ∧ Top(S_global).end < d.end:
11:     (·, q) ← Pop(S_global)
12:     Close(q)
13: function Open(q, d):
14:   if CheckParentMatch(q, d):
15:     u ← new Intermediate(d)
16:     MarkStart(q, u)
17:     Push(S_local[q], u)
18:     return true
19:   else:
20:     return false
21: function Close(q):
22:   u ← Pop(S_local[q])
23:   MarkEnd(q, u)
24:   if CheckChildMatch(q, u):
25:     Append(GetVector(q, u.d.level), u)
26: function MarkStart(q, u):
27:   for r ∈ q.children:
28:     u.start[r] ← GetVector(r, u.d.level + 1).size + 1
29: function MarkEnd(q, u):
30:   for r ∈ q.children:
31:     u.end[r] ← GetVector(r, u.d.level + 1).size
32: function CheckParentMatch(q, d):
33:   if Axis(q) = "⫽":
34:     return IsRoot(q) ∨ S_local[Parent(q)] ≠ ∅
35:   else:
36:     if IsRoot(q): return d.level = 1
37:     else: return S_local[Parent(q)] ≠ ∅ ∧ d.level = Top(S_local[Parent(q)]).level + 1
38: function CheckChildMatch(q, u):
39:   for r ∈ q.children:
40:     if u.end[r] < u.start[r]: return false
41:   return true
B TJStrictPre Pseudocode

Algorithm 4 shows more detailed pseudocode for the TJStrictPre algorithm described in Section 4.3.
Algorithm 4 TJStrictPre
1: function EvaluateLocalTopDown():
2:   (c, q, d) ← MergedStreamHead()
3:   while c ≠ Eof:
4:     if ¬IsRoot(q):
5:       ProcessLocalDisjoint(Parent(q), d)
6:     ProcessLocalDisjoint(q, d)
7:     Open(q, d)
8:     (c, q, d) ← MergedStreamHead()
9:   for q ∈ Q:
10:     ProcessLocalDisjoint(q, ∞)
11:   FilterPass(Q.root)
12: function ProcessLocalDisjoint(q, d):
13:   while Top(S_local[q]).d.end < d.end:
14:     Close(q)
15: function Open(q, d):
16:   if CheckParentMatch(q, d):
17:     u ← new Intermediate(d)
18:     V ← GetVector(q, d.level)
19:     if ¬IsLeaf(q):
20:       MarkStart(q, u)
21:       Push(S_local[q], (V, V.size))
22:     Append(V, u)
23:     return ¬IsLeaf(q)
24:   else:
25:     return false
26: function Close(q):
27:   if ¬IsLeaf(q):
28:     (V, i) ← Pop(S_local[q])
29:     MarkEnd(q, V[i])
C GetPart Function

Pseudocode for the getPart function is shown in Algorithm 5, where what is conceptually different from the previous getNext function is colored dark blue.

GetPart forwards nodes both to catch up with the parent and child streams, whereas getNext only does the latter. The getNext algorithm is completely stateless, and only inspects stream heads. When a match for a query subtree is found, the stream for the subtree root node is read and forwarded. It is then not possible to know, in the next call on this point in the query, whether the child subtrees were once part of a match or not. In the getPart function we save one extra value per query node, stored in the M array, namely the latest match in the tree postorder which was part of a weak match for the entire query. When considering a query subtree, the currently interesting data nodes are those that are either part of a match using a previous head in the parent stream, or part of a new match using the current head in the parent stream (see Lines 9–13).
Algorithm 5 GetPart
1: function MergedStreamHead():
2:   while true:
3:     (c, d, q) ← GetPart(Q.root)
4:     if c ≠ MisMatch:
5:       if c ≠ Eof:
6:         Fwd(q)
7:       return (c, d, q)
8: function GetPart(q):
9:   if ¬IsRoot(q):
10:     p ← Parent(q)
11:     FwdToDescOf(q, M[p])
12:     if ¬Eof(q) ∧ ¬Eof(p) ∧ M[p].end < H(q).end:
13:       FwdToDescOf(q, H(p))
14:   if IsLeaf(q):
15:     if Eof(q): return (Eof, ⊥, q)
16:     else: return (Match, H(q), q)
17:   (d_min, q_min) ← (∞, ⊥); (d_max, q_max) ← (0, ⊥)
18:   for r ∈ q.children:
19:     (c_r, d_r, q_r) ← GetPart(r)
20:     if c_r ≠ Eof:
21:       if c_r = MisMatch: flag MisMatch
22:       elif q_r ≠ r: return (Match, d_r, q_r)
23:       if d_r.begin < d_min.begin: (d_min, q_min) ← (d_r, q_r)
24:       if d_r.begin > d_max.begin: (d_max, q_max) ← (d_r, q_r)
25:     else:
26:       FwdToEof(q)
27:   if q_min = ⊥: return (Eof, ⊥, q)
28:   FwdToAncOf(q, d_max)
29:   if flagged MisMatch:
30:     return (MisMatch, ⊥, q_r)
31:
32:   if ¬Eof(q) ∧ H(q).begin < d_min.begin:
33:     if IsRoot(q) ∨ H(q).end < M[p].end:
34:       if M[q].end < H(q).end: M[q] ← H(q)
35:       return (Match, H(q), q)
36:     else:
37:       if d_min.begin < M[q].end:
38:         return (Match, d_min, q_min)
39:   else:
40:     if Eof(q): return (Eof, ⊥, q)
41:     else: return (MisMatch, ⊥, q)
42: function FwdToEof(q):
43:   while ¬Eof(q): Fwd(q)
44: function FwdToDescOf(q, d):
45:   while ¬Eof(q) ∧ H(q).begin ≤ d.begin: Fwd(q)
46: function FwdToAncOf(q, d):
47:   while ¬Eof(q) ∧ H(q).end ≤ d.begin: Fwd(q)
The forwarding of streams based on child stream heads is very similar to that in getNext (Lines 17–28). Unless the search is short-circuited (Line 22), the stream is forwarded at least until the head is an ancestor of all the child heads (Line 28). The query node itself is returned if an ancestor of the child heads was found, and unless the previous M value is an ancestor of the current head, it is updated. When a child query node is returned, it is known whether or not it is part of a match.³
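The two interval-based forwarding helpers at the bottom of Algorithm 5 can be sketched in Python over plain (begin, end) pairs. The `Stream` class and its in-memory backing are assumptions of this sketch, not the paper's interface; a real implementation would read from an index and could use any skipping technology, as noted in Section 6:

```python
class Stream:
    """A label-partitioned input stream of (begin, end) intervals in preorder.
    Backed by an in-memory list here purely for illustration."""
    def __init__(self, positions):
        self._pos = sorted(positions)  # preorder = sorted by begin
        self._i = 0

    def eof(self):
        return self._i >= len(self._pos)

    def head(self):
        return self._pos[self._i]

    def fwd(self):
        self._i += 1

def fwd_to_desc_of(stream, d):
    # Skip heads that begin at or before d.begin: in preorder they
    # can never be descendants of d.
    while not stream.eof() and stream.head()[0] <= d[0]:
        stream.fwd()

def fwd_to_anc_of(stream, d):
    # Skip heads whose interval ends before d begins: their subtrees
    # cannot contain d, so they can never be ancestors of d.
    while not stream.eof() and stream.head()[1] <= d[0]:
        stream.fwd()
```

Both helpers only ever move the stream forward, which is what makes the overall merging linear in the input.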
D Extra Filtering Pass

Algorithm 6 gives pseudocode for the extra filtering pass used to obtain strict subtree match filtering when using preorder storage in TJStrictPre.
Algorithm 6 FilterPass
1: function CleanUp(q):
2:   if Axis(q) = "⫽":
3:     CleanUpVector(GetVector(q, ·), C[q])
4:   else:
5:     for h ∈ used levels:
6:       CleanUpVector(GetVector(q, h), C_h[q])
7: function CleanUpVector(V, C):
8:   i ← j ← 0
9:   while i < V.size:
10:     if CheckChildMatch(q, V[i]):
11:       V[j] ← V[i]
12:       Append(C, i)
13:       j ← j + 1
14:     i ← i + 1
15:   Resize(V, j)
16: function FilterPass(q):
17:   if IsLeaf(q): return
18:   for r ∈ q.children:
19:     FilterPass(r)
20:   if NonLeafChildren(q) ≠ ∅:
21:     for u ∈ AllNodes(q):
22:       FilterPassPost(q, u.d)
23:       for r ∈ NonLeafChildren(q):
24:         u.start[r] ← FwdIter(r, u.start[r], u.d)
25:       Push(S_local[q], u)
26:     FilterPassPost(q, ∞)
27:   CleanUp(q)
28: function FilterPassPost(q, d):
29:   while S_local[q] ≠ ∅ ∧ Top(S_local[q]).end < d.end:
30:     u ← Pop(S_local[q])
31:     for r ∈ NonLeafChildren(q):
32:       u.end[r] ← FwdIter(r, u.end[r], u.d)
33: function FwdIter(q, pos, d):
34:   if Axis(q) = "⫽":
35:     while I[q] < C[q].size ∧ C[q][I[q]] < pos:
36:       I[q] ← I[q] + 1
37:     return I[q]
38:   else:
39:     h ← d.level + 1
40:     while I_h[q] < C_h[q].size ∧ C_h[q][I_h[q]] < pos:
41:       I_h[q] ← I_h[q] + 1
42:     return I_h[q]
³ Thanks to Radim Bača for pointing out an error in the original published pseudocode, where Lines 30–31 in Algorithm 5 were different.
During the clean-up (Lines 1–15), nodes failing checks are overwritten, and the C vectors record which values were not dropped. The query nodes are visited bottom-up by the FilterPass function, updating the vectors of one query node at a time, based on the clean-up in non-leaf child query nodes. The FilterPass and FilterPassPost functions go through all data nodes in preorder and postorder respectively, updating interval start and end indexes.

The AllNodes call returns a special iterator over all intermediate data nodes for a query node, sorted in total preorder. For query nodes with an incoming P–C edge and level split vectors, the order in which nodes were inserted on different levels was recorded during construction in TJStrictPre. Details are omitted in Algorithm 4, where an extra statement must be added after Line 22, storing a reference to the used vector.

The FwdIter function contains the logic for updating the start and end indexes. Each query node has an iterator for each vector, which is utilized when traversing the matches for the parent query node. In essence, the segments of child and descendant intervals which contain references to nodes which were not saved during a clean-up pass are discarded.
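The stable in-place compaction performed by CleanUpVector, together with the survivor-index recording in the C vector, can be sketched in Python (names are ours; `keep` stands in for the CheckChildMatch test):

```python
def clean_up_vector(vector, keep):
    """Stable in-place compaction, as in CleanUpVector of Algorithm 6.

    Survivors are shifted left over the dropped entries, and the
    original index of each survivor is recorded (the C vector), so that
    interval endpoints referring to old positions can later be
    translated by FwdIter-style scans."""
    kept_indices = []
    j = 0
    for i, u in enumerate(vector):
        if keep(u):
            vector[j] = u            # overwrite a dropped slot
            kept_indices.append(i)   # old index of survivor number j
            j += 1
    del vector[j:]                   # Resize(V, j)
    return kept_indices
```

Because the compaction is stable, the surviving entries remain sorted, which is what lets the parent's interval indexes be updated with forward-only iterators.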
E GetNext and Postorder

Many algorithms use the getNext function [2] for merging the input streams instead of a heap or linear scan [8, 10], because it cheaply filters away many useless nodes by implementing weak subtree match filtering. In this appendix we show why using getNext with postorder intermediate result construction gives problems regardless of whether local or global stacks are used.

The getNext function does not return the data nodes in strict preorder, as assumed in the correctness proof for the TwigMix algorithm [10], but in local preorder (see Definition 3). As explained in Section 3.1, the top-down algorithms using getNext maintain one stack or equivalent structure for each internal query node. When a new match for a given query node is seen, the typical strategy [2] is popping non-ancestor nodes off the parent query node's stack, and the query node's own stack.
If a global stack is combined with using getNext input, errors may occur, as for the example query and data in Figure 11. With local preorder the ordering between the nodes not related through ancestry is not fixed. Assume that the ordering is

⟨a1↦a1, a1↦a2, b1↦b1, c1↦c1, d1↦d1, c1↦c2, e1↦e1⟩.

Then c1↦c2 will pop a1↦a2 off the stack before e1↦e1 is observed, and e1↦e1 will never be added as a descendant of a1↦a2.
But using local stacks and a bottom-up approach also gives errors, because data nodes are added to the intermediate structures as they are popped off the stack when using postorder storage, which is too late when using getNext and local stacks. If the typical approach is used, c1↦c2 will only pop b1 off the stack of b1 and c1 off the stack of c1. Then d1↦d1 will never be added to the child structures of b1↦b1, because it is popped too late.

It may be possible to modify the local stack approach to work with postorder storage and getNext input, but this would require carefully popping nodes on ancestor and descendant stacks in the right order.
(a) Query. (b) Data. [Tree diagrams not recoverable from the extraction.]
Figure 11: Problematic case with local preorder input and postorder storage.
F Benchmark Data and Queries

Figure 12 gives some details on the benchmark data used in our experiments, and Figure 13 lists the queries we have used.
Name      Size      Nodes       Source
DBLP      676 MB    35 900 666  http://dblp.uni-trier.de/xml
XMark     1 118 MB  32 298 988  http://www.xml-benchmark.org
Treebank  83 MB     3 829 512   http://www.cs.washington.edu/research/xmldatasets

Figure 12: Benchmark datasets used in experiments.
#   Data      Hits     Source  XPath
D1  DBLP      6        [12]    //inproceedings[author/text()="Jim Gray"][year/text()="1990"]/@key
D2  DBLP      21       [12]    //www[editor]/url
D3  DBLP      13       [12]    //book/author[text()="C. J. Date"]
D4  DBLP      2        [12]    //inproceedings[title/text()="Semantic Analysis Patterns."]/author
X1  XMark     40 726   [5]     /site/closed_auctions/closed_auction/annotation/description/text/keyword
X2  XMark     124 843  [5]     //closed_auction//keyword
X3  XMark     124 843  [5]     /site/closed_auctions/closed_auction//keyword
X4  XMark     40 726   [5]     /site/closed_auctions/closed_auction[annotation/description/text/keyword]/date
X5  XMark     124 843  [5]     /site/closed_auctions/closed_auction[.//keyword]/date
X6  XMark     32 242   [5]     /site/people/person[profile/gender][profile/age]/name
T1  Treebank  1 183    [11]    //S/VP//PP[.//NP/VBN]/IN
T2  Treebank  152      [11]    //S/VP/PP[IN]/NP/VBN
T3  Treebank  381      [11]    //S/VP//PP[.//NN][.//NP[.//CD]/VBN]/IN
T4  Treebank  1 185    [11]    //S[.//VP][.//NP]/VP/PP[IN]/NP/VBN
T5  Treebank  94 535   [11]    //EMPTY[.//VP/PP//NNP][.//S[.//PP//JJ]//VBN]//PP/NP//_NONE_

Figure 13: Benchmark queries used in experiments.
G Extended Results

Figure 14 shows the timings that the aggregates in Section 5 are based on.
Method  D1   D2   D3   D4   X1   X2   X3   X4   X5   X6   T1   T2   T3   T4   T5
HO---  2.53 0.43 1.07 1.59 0.57 0.11 0.11 0.75 0.25 0.25 0.31 0.30 0.40 0.47 0.50
HO-W-  1.42 0.25 0.44 1.12 0.43 0.10 0.11 0.60 0.24 0.21 0.18 0.17 0.24 0.32 0.32
HO-S-  1.42 0.26 0.44 1.11 0.43 0.10 0.11 0.61 0.24 0.21 0.17 0.18 0.24 0.32 0.33
HO-SL  1.70 0.28 0.48 1.22 0.46 0.10 0.11 0.65 0.26 0.23 0.18 0.19 0.24 0.34 0.33
HOW--  1.96 0.31 0.32 1.18 0.34 0.08 0.09 0.49 0.19 0.25 0.22 0.22 0.32 0.42 0.42
HOWW-  1.26 0.19 0.33 0.92 0.32 0.08 0.08 0.46 0.19 0.21 0.16 0.17 0.22 0.31 0.30
HOWS-  1.34 0.19 0.32 0.91 0.31 0.08 0.08 0.45 0.19 0.21 0.16 0.16 0.21 0.30 0.32
HOWSL  1.31 0.20 0.31 0.96 0.31 0.09 0.09 0.46 0.20 0.23 0.17 0.17 0.23 0.33 0.32
HOS--  2.03 0.31 0.32 1.17 0.32 0.08 0.08 0.47 0.19 0.25 0.21 0.19 0.27 0.35 0.40
HOSW-  1.27 0.20 0.31 0.91 0.30 0.08 0.08 0.44 0.19 0.21 0.16 0.15 0.21 0.29 0.29
HOSS-  1.27 0.21 0.32 0.92 0.30 0.08 0.08 0.44 0.19 0.21 0.17 0.15 0.21 0.30 0.30
HOSSL  1.32 0.20 0.31 0.97 0.30 0.08 0.08 0.46 0.19 0.23 0.17 0.18 0.23 0.34 0.31
HE---  2.46 0.40 1.07 1.48 0.55 0.09 0.09 0.70 0.20 0.23 0.30 0.29 0.36 0.45 0.48
HE-W-  2.73 0.40 1.25 1.67 0.72 0.09 0.10 0.89 0.21 0.27 0.35 0.34 0.43 0.50 0.54
HE-S-  2.74 0.40 1.25 1.66 0.72 0.09 0.10 0.89 0.21 0.27 0.34 0.34 0.43 0.49 0.54
HE-SL  3.14 0.42 1.47 1.83 0.80 0.09 0.10 0.97 0.23 0.32 0.37 0.40 0.45 0.54 0.58
HEW--  1.97 0.31 0.34 1.13 0.34 0.08 0.08 0.48 0.19 0.24 0.23 0.22 0.28 0.42 0.42
HEWW-  2.13 0.32 0.34 1.24 0.38 0.08 0.09 0.53 0.19 0.27 0.29 0.28 0.35 0.45 0.49
HEWS-  2.13 0.32 0.34 1.24 0.39 0.08 0.09 0.52 0.20 0.28 0.29 0.28 0.35 0.44 0.49
HEWSL  2.33 0.33 0.35 1.31 0.42 0.08 0.09 0.57 0.20 0.32 0.28 0.30 0.37 0.48 0.51
HES--  1.99 0.31 0.34 1.16 0.33 0.08 0.08 0.47 0.19 0.24 0.22 0.19 0.28 0.33 0.40
HESW-  2.15 0.32 0.34 1.26 0.36 0.08 0.09 0.51 0.20 0.28 0.25 0.21 0.31 0.35 0.45
HESS-  2.14 0.33 0.34 1.24 0.37 0.08 0.09 0.51 0.20 0.29 0.24 0.21 0.31 0.35 0.45
HESSL  2.35 0.33 0.36 1.33 0.40 0.08 0.09 0.55 0.21 0.34 0.26 0.24 0.36 0.38 0.48
NOWW-  0.96 0.22 0.09 0.71 0.49 0.11 0.15 0.85 0.34 0.30 0.08 0.08 0.16 0.34 0.22
NE-W-  1.03 0.25 0.08 0.93 0.59 0.12 0.15 0.99 0.40 0.33 0.08 0.09 0.17 0.34 0.27
NE-S-  1.03 0.25 0.08 0.89 0.68 0.12 0.16 1.10 0.39 0.35 0.08 0.09 0.17 0.34 0.29
NE-SL  1.06 0.26 0.08 0.92 0.77 0.12 0.16 1.20 0.42 0.36 0.09 0.09 0.17 0.34 0.29
NEWW-  0.94 0.23 0.08 0.72 0.51 0.11 0.14 0.87 0.35 0.31 0.07 0.07 0.15 0.31 0.23
NEWS-  0.95 0.23 0.08 0.72 0.54 0.11 0.15 0.90 0.36 0.32 0.08 0.08 0.15 0.32 0.24
NEWSL  0.95 0.23 0.08 0.73 0.55 0.11 0.15 0.93 0.37 0.33 0.08 0.08 0.15 0.32 0.24
NESW-  0.95 0.23 0.08 0.74 0.49 0.11 0.14 0.87 0.36 0.31 0.07 0.07 0.15 0.30 0.23
NESS-  0.93 0.23 0.08 0.72 0.51 0.11 0.15 0.89 0.36 0.32 0.08 0.07 0.15 0.31 0.23
NESSL  0.94 0.23 0.08 0.73 0.54 0.11 0.15 0.92 0.37 0.33 0.08 0.08 0.15 0.31 0.24
PEWW-  0.22 0.04 0.08 0.05 0.33 0.06 0.09 0.43 0.16 0.20 0.06 0.06 0.04 0.13 0.10
PEWS-  0.22 0.04 0.08 0.05 0.34 0.06 0.09 0.44 0.16 0.21 0.06 0.06 0.04 0.13 0.10
PEWSL  0.22 0.04 0.08 0.05 0.36 0.06 0.09 0.49 0.18 0.22 0.06 0.06 0.05 0.13 0.10
PESW-  0.22 0.04 0.08 0.05 0.31 0.06 0.09 0.45 0.18 0.20 0.06 0.06 0.04 0.12 0.10
PESS-  0.22 0.04 0.08 0.05 0.33 0.07 0.09 0.43 0.16 0.21 0.06 0.06 0.04 0.13 0.10
PESSL  0.22 0.04 0.08 0.05 0.35 0.06 0.10 0.44 0.17 0.22 0.06 0.06 0.05 0.13 0.10
Figure 14: Time for all tested methods on all benchmark queries. All queries were warmed up by 3 runs and then run 100 times, or until at least 10 seconds had passed, measuring average running-time.
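The measurement protocol above can be sketched as a small Python harness. The caption is ambiguous about how the two bounds interact; this sketch adopts one reading and stops at whichever of the run count and the time budget is reached first:

```python
import time

def benchmark(run_query, warmups=3, runs=100, min_seconds=10.0):
    """Average running time: warm up, then repeat timed runs until
    either the run count or the accumulated time budget is reached."""
    for _ in range(warmups):
        run_query()                  # warm-up runs are not timed
    total, count = 0.0, 0
    while count < runs and total < min_seconds:
        t0 = time.perf_counter()
        run_query()
        total += time.perf_counter() - t0
        count += 1
    return total / count
```

`time.perf_counter` is used because it is monotonic and has the highest available resolution for interval timing.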
H Bad Behavior

In this appendix we list some experiments showing the super-linear behavior of previous twig join algorithms.
Figure 15a shows the exponential behavior of TwigList (HO-W-) and TwigFast (NEWW-) with the data and query from Example 1. The data is /(α1/)^n … (αm/)^n β/γ with m = 10 and n = 100, and the query is ⫽α1⫽ … ⫽αk/γ, with k varying from 1 to 7.
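The pathological input of Example 1 can be generated with a short sketch (labels a1 … am, b, c stand in for α1 … αm, β, γ; the nested-element serialization is our assumption):

```python
def make_path_doc(m, n):
    """Build the single-path document (a1/)^n ... (am/)^n b/c as a
    nested XML-like string: n copies of each of the m labels, then
    the two trailing elements."""
    tags = [f"a{i}" for i in range(1, m + 1) for _ in range(n)] + ["b", "c"]
    opening = "".join(f"<{t}>" for t in tags)
    closing = "".join(f"</{t}>" for t in reversed(tags))
    return opening + closing
```

Each descendant step in the query can match at n different positions along the path, which is the source of the exponential blow-up without strict matching.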
(a) Varying query parameter k, 1–7 (log-scale time; methods HO-W-, NEWW-, PESSL). (b) Varying total nodes, 10 000–40 000 (methods PESS-, PESSL).
Figure 15: (a) Exponential behavior without strict matching. (b) Quadratic behavior without optimal enumeration. Query time in seconds.
Figure 15b shows the results of an experiment based on Example 2 with varying n = 10 000. The simple query ⫽a/b has quadratic cost even when strict prefix path and subtree match filtering is used, if P–C child matches are not directly accessible. Many of the a nodes in the data are nested, and have a small number of b children, but a large number of b descendants. For approaches using simple vectors, overlapping descendant intervals are scanned for direct children, and this results in quadratic running time.
Paper 6

Nils Grimsmo, Truls Amundsen Bjørklund and Magnus Lie Hetland

Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees

Proceedings of the 7th International XML Database Symposium on Database and XML Technologies (XSym 2010).
Abstract The F&B-index is used to speed up pattern matching in tree and graph data, and is based on the maximum F&B-bisimulation, which can be computed in loglinear time for graphs. It has been shown that the maximum F-bisimulation can be computed in linear time for DAGs. We build on this result, and introduce a linear algorithm for computing the maximum F&B-bisimulation for tree data. It first computes the maximum F-bisimulation, and then refines this to a maximal B-bisimulation. We prove that the result equals the maximum F&B-bisimulation.
1 Introduction
Structure indexes are used to reduce the cost of pattern matching in labeled trees and graphs [5, 15, 10], by capturing structural properties of the data in a structure summary, where some or all of the matching can be performed. Efficient construction of such indexes is important for their practical usefulness [15], and in this paper we reduce the construction cost of the F&B-index [10] for tree data.

In a structure index, data nodes are typically partitioned into blocks based on properties of the surrounding nodes. A structure summary typically has one node per block in the partition, and edges between summary nodes where there are edges between data nodes in the related blocks. Matching in structure summaries is usually more efficient than partitioning the data nodes on label and using structural joins to find full query matches [15, 10].

In a path index, data nodes are classified by the labels on the paths by which they are reachable [5, 15]. For tree data this equals partitioning nodes on their label and the block of the parent node. Figures 1c and 1b show path partitioning and the related summary for the example data in Figure 1a. With path indexes, non-branching queries can be evaluated without processing joins [15, 19].
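For tree data, this partitioning rule can be sketched in Python, identifying each block by its root-to-node label path (the representation, a preorder-ordered dict of labels plus a parent map, is an assumption of this sketch):

```python
def path_partition(labels, parent):
    """Partition tree nodes into path-index blocks.

    For trees, a node's block is fully determined by its own label and
    its parent's block, so one top-down pass suffices. Assumes `labels`
    lists nodes in preorder (parents before children)."""
    block = {}           # node -> block id (the label path from the root)
    blocks = {}          # block id -> set of nodes in that block
    for v in labels:
        p = parent.get(v)
        key = (block[p] if p is not None else ()) + (labels[v],)
        block[v] = key
        blocks.setdefault(key, set()).add(v)
    return blocks
```

The path summary then has one node per block, with an edge from each block to the blocks of its members' children.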
(a) Example query and data. Single/double query edges specify parent–child/ancestor–descendant relationships. (b) Path summary. (c) Path partition. (d) F&B partition. [Tree diagrams not recoverable from the extraction.]
Figure 1: Partitioning strategies. Superscripts and subscripts give node identifiers.
A natural extension of a path index is the F&B-index, where nodes are partitioned on their label, the partitions of their parents, and the partitions of their children, as shown in Figure 1d. This gives an index where more of the pattern matching can be performed on the summary, and also branching queries can be evaluated without processing joins [10].
The focus of this paper is efficient computation of the maximum simultaneous forward and backward bisimulation (F&B-bisimulation), which is the underlying concept used to partition nodes in the F&B-index. It can be computed in time loglinear in the number of edges in the graph [10]. A linear construction algorithm for directed acyclic graphs (DAGs) has been presented recently [13], but we show that it is incorrect. On the other hand, there exists a correct algorithm which can compute either the maximum forward bisimulation (F-bisimulation) or the maximum backward bisimulation (B-bisimulation) in linear time for DAGs [3]. We extend this algorithm to compute the maximum F&B-bisimulation in linear time for tree data. This has relevance for applications where the underlying data is known to be tree shaped, such as in many uses of XML [6].
2 Background

In this section we present different types of bisimulations, and show how they can be computed by first partitioning on label, and then stabilizing the graph.
We use the following notation: A directed graph G = ⟨V, E⟩ has node set V and edge set E ⊆ V × V. Let n = |V| and m = |E|. For X ⊆ V, E(X) = {w | ∃v ∈ X : vEw} and E⁻¹(X) = {u | ∃v ∈ X : uEv}. Each node v ∈ V has label L(v). A partition P of V is a set of blocks, such that each node v ∈ V is contained in exactly one block. For an equivalence relation ∼ ⊆ V × V, the equivalence class containing v ∈ V is denoted by [v]_∼. The equivalence relation arising from the partition P is denoted =_P. A relation R₂ is a refinement of R₁ iff R₂ ⊆ R₁. A partition P₂ is a refinement of the coarser P₁ iff =_P₂ ⊆ =_P₁. Let the contraction graph of a partition P be a graph with one node for each equivalence class of =_P, and an edge ⟨[u]_=P, [v]_=P⟩ whenever ⟨u, v⟩ ∈ E.
The structure summary built for a structure index is typically isomorphic with the contraction graph for the data partition. For a partitioning to be useful, it must yield a summary that somehow simulates the data, such that pattern matching in the summary gives the same results as pattern matching in the data, or at least no false negatives. If nodes are partitioned into blocks where nodes in some way simulate each other, then the contraction graph also simulates the data in the same way.
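The contraction graph is straightforward to build from a partition. As an illustration (our own sketch, not from the paper; the node numbering and the `contraction_graph` helper are our invention), using the six-node tree discussed later in Figure 2:

```python
def contraction_graph(partition, edges):
    """Build the contraction graph of a partition: one node per block,
    and an edge <[u], [v]> whenever some edge <u, v> exists in the data."""
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    return {(block_of[u], block_of[v]) for (u, v) in edges}

# The six-node tree of Figure 2, partitioned on B-bisimilarity:
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
partition = [[1], [2, 4], [3, 5], [6]]          # blocks 0..3
print(contraction_graph(partition, edges))       # five data edges contract to three
```

Five data edges collapse to three summary edges, which is exactly the size reduction a structure index exploits.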
2.1 Bisimulation and Bisimilarity

Broadly speaking, bisimulations relate nodes that have the same label and related neighbors. We use the following properties of a binary relation R ⊆ V × V to formally define the different types of bisimulation:
Paper 6: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees
vRv′ ⇒ L(v) = L(v′) (1)

vRv′ ⇒ (uEv ⇒ ∃u′ : u′Ev′ ∧ uRu′) ∧ (u′Ev′ ⇒ ∃u : uEv ∧ uRu′) (2)

vRv′ ⇒ (vEw ⇒ ∃w′ : v′Ew′ ∧ wRw′) ∧ (v′Ew′ ⇒ ∃w : vEw ∧ wRw′) (3)
Definition 1 (Bisimulations [14, 10]). A relation R is a B-bisimulation iff it satisfies (1) and (2) above, an F-bisimulation iff it satisfies (1) and (3), and an F&B-bisimulation iff it satisfies (1), (2) and (3).

For each type, there exists a unique maximum bisimulation, of which all other bisimulations are refinements [14, 10]. We say that two nodes are bisimilar if there exists a bisimulation that relates them, i.e., they are related by the maximum bisimulation. Since bisimilarity is an equivalence relation, it can be used to partition the nodes [14, 10]. When nodes u and v are backward, forward, and forward and backward bisimilar, we write u ∼_B v, u ∼_F v and u ∼_F&B v, respectively. Figure 2 illustrates the different types of bisimilarity.
Figure 2: Partitioning on different types of bisimilarity. Panels: (a) the data tree; (b) ∼_B; (c) ∼_F; (d) ∼_F&B.
Two graphs are said to be bisimilar if there exists a total surjective bisimulation from the nodes in one graph to the nodes in the other. For a given graph, the smallest bisimilar graph is unique, and is exactly the contraction for bisimilarity [3]. The F&B-bisimilarity contraction is the basis for the F&B-index [10].
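The maximum bisimulations can be computed naively by fixpoint refinement: start from the label partition and split blocks until the set of neighbor classes is uniform within every block. A minimal sketch (our own illustration, not one of the paper's algorithms; each iteration costs roughly O(nm), far from the bounds discussed below), run on the six-node tree of Figure 2 as reconstructed from the text's discussion:

```python
from collections import defaultdict

def bisimilarity_partition(nodes, labels, edges, forward=True, backward=True):
    """Fixpoint computation of the maximum F-, B- or F&B-bisimulation:
    keep splitting blocks until the set of neighbour classes is uniform
    within every block (conditions (2) and (3) of Definition 1)."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    block = {v: labels[v] for v in nodes}   # condition (1): start from labels
    while True:
        def sig(v):
            parts = [block[v]]
            if forward:    # children's classes must match, condition (3)
                parts.append(frozenset(block[w] for w in succ[v]))
            if backward:   # parents' classes must match, condition (2)
                parts.append(frozenset(block[u] for u in pred[v]))
            return tuple(parts)
        new = {v: sig(v) for v in nodes}
        if len(set(new.values())) == len(set(block.values())):
            break          # no block was split: the partition is stable
        block = new
    groups = defaultdict(list)
    for v in sorted(nodes):
        groups[block[v]].append(v)
    return sorted(groups.values())

# The tree of Figure 2: a1 with children b2 and b4; b2 -> c3; b4 -> c5, d6.
nodes = [1, 2, 3, 4, 5, 6]
labels = {1: "a", 2: "b", 3: "c", 4: "b", 5: "c", 6: "d"}
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
print(bisimilarity_partition(nodes, labels, edges, forward=False))   # ~B
print(bisimilarity_partition(nodes, labels, edges, backward=False))  # ~F
print(bisimilarity_partition(nodes, labels, edges))                  # ~F&B
```

The three printed partitions match Figure 2(b)–(d): ∼_B groups {b2, b4} and {c3, c5}, ∼_F keeps only {c3, c5} together, and ∼_F&B separates every node.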
2.2 Stability

The different types of bisimilarity listed in the previous section can be computed by first partitioning the data nodes on label, and then finding the coarsest stable refinements of the initial partition [16, 4, 10]. A partition is successor stable iff all nodes in a block have incoming edges from nodes in the same set of blocks, and is predecessor stable iff all nodes in a block have outgoing edges to the same set of blocks [10]. The coarsest successor, predecessor, and successor-and-predecessor stable refinements of a label partition equal the partitions on B-bisimilarity, F-bisimilarity and F&B-bisimilarity, respectively [4, 10].
CHAPTER 4. INCLUDED PAPERS
Definition 2 (Partition stability [16, 10]). Given a directed graph G = 〈V, E〉, then D ⊆ V is successor stable with respect to B ⊆ V if either all or none of the nodes in D are pointed to from nodes in B (meaning D ⊆ E(B) or D ∩ E(B) = ∅), and D is predecessor stable with respect to B if either none or all of the nodes in D point to nodes in B (meaning D ⊆ E⁻¹(B) or D ∩ E⁻¹(B) = ∅).

For any combination of successor and predecessor stability, a partition P of V is said to be stable with respect to a block B if all blocks in P are stable with respect to B. A partition P is stable with respect to another partition Q if it is stable with respect to all blocks in Q. P is said to be stable if it is stable with respect to itself.
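Definition 2 translates directly into a brute-force checker (our own sketch for illustration; the set-containment tests make it quadratic, nowhere near the bounds pursued in this paper):

```python
def is_stable(partition, edges, direction="succ"):
    """Check Definition 2: every block D must contain either all or none of
    the nodes in E(B) (succ) or E^-1(B) (pred), for every block B."""
    def image(block):
        if direction == "succ":
            return {w for (v, w) in edges if v in block}   # E(B)
        return {u for (u, v) in edges if v in block}       # E^-1(B)
    for B in partition:
        img = image(B)
        for D in partition:
            hit = D & img
            if hit and hit != D:   # some but not all nodes of D: unstable
                return False
    return True

# Label partition of the Figure 2 tree (a1 -> b2, b4; b2 -> c3; b4 -> c5, d6):
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
label_partition = [{1}, {2, 4}, {3, 5}, {6}]
print(is_stable(label_partition, edges, "succ"))  # True: already succ-stable
print(is_stable(label_partition, edges, "pred"))  # False: only b4 points into {d6}
```

The label partition is already successor stable (matching ∼_B in Figure 2b) but not predecessor stable, since b2 and b4 disagree on the block {d6}.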
Figure 3 shows cases where a block can be split to achieve different types of stability. The block D is not stable with respect to B, but we can split it into blocks that are: Assume that D is stable with respect to a union of blocks S such that B ⊂ S. We can split D into blocks that are stable with respect to both B and S \ B, shown as D_B, D_BS and D_S in the figure. Stabilizing also with respect to S \ B is crucial for obtaining an O(m log n) running time in the partition stabilization algorithm explained in the next section.
Figure 3: Refining D into blocks D_B, D_BS and D_S that are stable with respect to B and S \ B. Panels: (a) succ-stability; (b) pred-stability.
2.3 Stabilizing graph partitions

We now go through Paige and Tarjan's algorithm for refinement to the coarsest predecessor stable partition [16], extended to simultaneous successor and predecessor stability by Kaushik et al. [10], as shown in Algorithm 1. The input to the algorithm is a partition P and the set of flags Directions ⊆ {Succ, Pred}, which specifies whether P is to be successor and/or predecessor stabilized.

Figure 4 illustrates an example run of the algorithm with Directions = {Succ, Pred}. In addition to the current partition P, the algorithm maintains a partition X, where the blocks are unions of blocks in P. Initially X contains a single block that is the union of all blocks in P, and the algorithm maintains the loop invariant that P is stable with respect to X by Definition 2. The algorithm terminates when the partitions P and X
Algorithm 1 Graph partition stabilization.
1: ⊲ P is the initial partition.
2: function StabilizeGraph(P, Directions):
3:   for dir ∈ Directions:
4:     InitialRefinement(P, dir)
5:   X ← {⋃_{B∈P} B}
6:   while P ≠ X:
7:     Extract B ∈ P such that B ⊂ S ∈ X and |B| ≤ |S|/2.
8:     Replace S by B and S \ B in X.
9:     for dir ∈ Directions:
10:      StabilizeWRT(copy of B, P, dir)
are equal, which means P must be stable with respect to itself. But the loop invariant may not be true for the given input partition initially: Blocks containing both roots and non-roots are not successor stable with respect to X, because non-roots have incoming edges from the single block S ∈ X, while roots do not. Similarly, blocks containing both leaves and non-leaves are not predecessor stable with respect to X. In Algorithm 1 initial stability is achieved by calls to InitialRefinement(), which splits blocks in a simple linear pass. Initial splitting is illustrated by the step from line (a) to line (b) in Figure 4.
The algorithm repeatedly selects a block S ∈ X that is a compound union of blocks from P, and selects a splitter block B ⊂ S with size at most half of S. Then S is replaced by B and S \ B in X, as shown when extracting B = {a₂, a₉, a₁₅} between lines (b) and (c) in Figure 4. The call to StabilizeWRT() uses the strategies depicted in Figure 3 to stabilize P with respect to both B and S \ B, to make sure P is stable with respect to the new X. It is important to use a copy of B as splitter, as the stabilization may cause B itself to be split. The step from line (b) to (c) in the figure shows that a block of nodes labeled a is split when successor stabilizing with respect to B = {a₂, a₉, a₁₅}.
Efficient implementation of the above requires some attention to detail [16]: The partition X can be realized through a set 𝒳 containing sets of pointers to blocks in P, such that for each 𝒮 ∈ 𝒳 we have S = ⋃_{B∈𝒮} B for the related S ∈ X. To extract a B ⊂ S ∈ X in constant time, we also maintain the set of compound unions 𝒞 = {𝒮 ∈ 𝒳 | 1 < |𝒮|}. A block B ⊂ S such that |B| ≤ |S|/2 can be found by choosing the smallest of the two first blocks in any 𝒮 ∈ 𝒞. Note that we can check if P = X by checking whether 𝒞 is empty, and neither P, X nor 𝒳 need to be materialized, as they are never used directly. You only need to maintain 𝒞, and a 𝒫 ⊆ P containing the blocks in P not in some 𝒮 ∈ 𝒞. For inserting and removing elements in constant time, the sets are implemented as doubly linked lists. In addition, each v ∈ B has a pointer to B, and each B ∈ 𝒮 has a pointer to 𝒮. This allows keeping the data structures updated throughout the evaluation.
As all operations in the while-loop excluding the calls to StabilizeWRT() are constant time operations on linked lists, the complexity of the loop is bounded by the number of splitter blocks selected, which again is bounded by the number of times a node may be part of such a splitter. Splitter blocks at most half the size of a compound union are
Figure 4: Algorithm 1 doing successor and predecessor stabilization on a label partition of the data from Figure 1a. The white boxes are the current blocks in P, while the gray boxes are the current blocks in X. Line (a) shows the initial label partition. Step (a)–(b) shows refinement separating roots from non-roots and leaves from non-leaves, and steps (b)–(l) show simultaneous predecessor and successor stabilization.
always selected, and no node in this block is part of a splitter again before the block itself has become a compound union. This means that the number of times a given node is part of a splitter block is O(log n), and that the total number of splitter blocks used is O(n log n) [16].
Algorithm 2 shows how all blocks in a partition P can be successor (or predecessor) stabilized with respect to a block B ∈ P and S \ B, where B ⊂ S ∈ X, in time linear in the number of edges going out from (or into) B [16]. For successor stability, only blocks D pointed to from B are affected, and they are stabilized with respect to both B and S \ B without involving S \ B directly. This is done by maintaining for each node v ∈ V the number of times it is pointed to from each set S ∈ X, and storing references to these records from the related edges. We can then differentiate between nodes pointed to from B, pointed to from both B and S \ B, and pointed to only from S \ B. Nodes from the first two categories are moved into new blocks, while the rest are untouched.

As the cost of a single call to StabilizeWRT() is bounded by the number of nodes in the splitter block and the number of outgoing (or incoming) edges, the total cost for the calls is O((m + n) log n), as a given node or edge is used for splitting O(log n) times. Assuming n ∈ O(m), the cost of Algorithm 1 is O(n) for the initial refinement, O(n log n) for the
Algorithm 2 Stabilizing with respect to a block.
1: function StabilizeWRT(B, P, dir):
2:   Assume dir = Succ. (or Pred)
3:   for D ∈ P pointed to from B: (or pointing into B)
4:     Initialize D_B and D_BS and associate with D.
5:   for v ∈ D ∈ P pointed to from B: (or pointing into B)
6:     if v is pointed to only from B: (or pointing only into B)
7:       D′ ← D_B
8:     else:
9:       D′ ← D_BS
10:    if D′ ∉ P: Insert D′ after D in P.
11:    Move v from D to D′.
12:    if D = ∅: Remove D from P.
while-loop excluding StabilizeWRT(), and O(m log n) for the StabilizeWRT() calls, giving a total of O(m log n) [16, 10].
3 Linear Time Stabilization

Linear time computation of F&B-bisimilarity for DAG data has been attempted earlier. The SAM algorithm [13] partitions the data separately on B-bisimilarity and F-bisimilarity, and then combines the partitions by putting two nodes in the same final block iff they are in the same blocks in both partitions. It builds on the following theorem, which is stated without proof or reference: "Node n₁ and node n₂ satisfy F&B-bisimulation if and only if they satisfy F-bisimulation and B-bisimulation." The only if part is of course true, but the if part is not, as can be seen from the partitioning of a tree with six nodes in Figure 2. Here c₃ ∼_B c₅ and c₃ ∼_F c₅, but c₃ ≁_F&B c₅, because for the parent nodes b₂ ≁_F&B b₄. Also note that the running time analysis of the SAM algorithm assumes that the number of edges to and from each node can be viewed as a constant.
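The counterexample can be checked mechanically. The sketch below (our own illustration; `refine` is a naive quadratic fixpoint, not the SAM algorithm itself) computes the three bisimilarity partitions of the Figure 2 tree and combines the B- and F-partitions the way SAM would:

```python
from collections import defaultdict

def refine(nodes, labels, edges, forward, backward):
    """Naive maximum-bisimulation block assignment via fixpoint splitting."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    block = {v: labels[v] for v in nodes}
    while True:
        def sig(v):
            parts = [block[v]]
            if forward:
                parts.append(frozenset(block[w] for w in succ[v]))
            if backward:
                parts.append(frozenset(block[u] for u in pred[v]))
            return tuple(parts)
        new = {v: sig(v) for v in nodes}
        if len(set(new.values())) == len(set(block.values())):
            return new
        block = new

nodes = [1, 2, 3, 4, 5, 6]
labels = {1: "a", 2: "b", 3: "c", 4: "b", 5: "c", 6: "d"}
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]

b = refine(nodes, labels, edges, forward=False, backward=True)
f = refine(nodes, labels, edges, forward=True, backward=False)
fb = refine(nodes, labels, edges, forward=True, backward=True)

# SAM's combination: same final block iff same block in BOTH partitions.
sam = {v: (b[v], f[v]) for v in nodes}
print(sam[3] == sam[5])  # True: the combination keeps c3 and c5 together...
print(fb[3] == fb[5])    # False: ...but c3 and c5 are not F&B-bisimilar.
```

The combined partition groups c₃ and c₅, while the true F&B partition separates them, confirming that the "if" direction of the quoted theorem fails.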
3.1 Stabilizing DAG partitions

We now present an algorithm for refining a partition of the nodes in a DAG either to successor stability or to predecessor stability. It is based on two different previous algorithms: Paige and Tarjan's loglinear algorithm for stabilizing general graphs [16], and Dovier, Piazza and Policriti's algorithm for computing F-bisimilarity on unlabeled graphs [3], which has linear complexity for DAG data. A difference between these two algorithms is that the former is given an initial partition as input, which is then refined, while the latter starts with the set of singleton blocks, from which the final partition is constructed. These are called negative and positive strategies, respectively [3]. Dovier et al. describe how their algorithm can be extended to compute F-bisimilarity for labeled data, but when developing an algorithm for refining to simultaneous successor and predecessor stability
in the next section, we use the result of a predecessor stabilization as input to a successor stabilization, and hence cannot use a positive strategy.

Dovier et al.'s algorithm initially computes the rank of each node in the DAG, which is the length of the longest path from the node to a leaf. We extend the notion of rank to both directions in the DAG:
Definition 3 (Rank). In a DAG G, the successor and predecessor rank of v ∈ V(G) is defined as:

rank_Succ(v) = 0 if v is a root in G, and rank_Succ(v) = 1 + max_{〈u,v〉∈E(G)} rank_Succ(u) otherwise;

rank_Pred(v) = 0 if v is a leaf in G, and rank_Pred(v) = 1 + max_{〈v,w〉∈E(G)} rank_Pred(w) otherwise.
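Both ranks can be computed with one Kahn-style topological pass each. A sketch (our own, using a precomputed adjacency structure rather than the in-place edge counters of Algorithm 4 below):

```python
from collections import defaultdict, deque

def ranks(nodes, edges, direction="pred"):
    """Longest-path ranks of Definition 3: rank_Pred is the longest path down
    to a leaf; rank_Succ is the same computation on the reversed DAG."""
    if direction == "succ":
        edges = [(w, v) for (v, w) in edges]  # reversing swaps the two notions
    children, parents = defaultdict(list), defaultdict(list)
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)
    rank = {}
    remaining = {v: len(children[v]) for v in nodes}
    queue = deque(v for v in nodes if remaining[v] == 0)  # leaves: rank 0
    while queue:
        v = queue.popleft()
        rank[v] = 1 + max((rank[w] for w in children[v]), default=-1)
        for u in parents[v]:
            remaining[u] -= 1
            if remaining[u] == 0:            # all of u's children are ranked
                queue.append(u)
    return rank

# The Figure 2 tree: leaves get pred-rank 0, the root gets succ-rank 0.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
print(ranks(nodes, edges, "pred"))  # {3: 0, 5: 0, 6: 0, 2: 1, 4: 1, 1: 2}
print(ranks(nodes, edges, "succ"))  # {1: 0, 2: 1, 4: 1, 3: 2, 5: 2, 6: 2}
```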
Algorithm 3 shows our modification of Paige and Tarjan's algorithm [16] based on Dovier et al.'s principles [3]. It refines a partition of a DAG either to predecessor or to successor stability, and runs in linear time, due to the order in which splitter blocks are chosen.
Algorithm 3 DAG partition stabilization.
1: ⊲ Assume sets are ordered.
2: function StabilizeDAG(P, dir):
3:   RefineAndSortOnRank(P, dir)
4:   X ← {⋃_{B∈P} B}
5:   while P ≠ X:
6:     Extract first B ∈ P such that B ⊂ S ∈ X.
7:     Replace S by B and S \ B in X.
8:     StabilizeWRT(copy of B, P, dir)
A run of the algorithm with dir = Pred is illustrated in Figure 5. Instead of only separating between leaves and non-leaves (or roots and non-roots) as in Algorithm 1, blocks are initially split such that predecessor (or successor) rank is uniform within each block, and sorted, such that the rank is monotonically increasing in the partition. This is done in the function RefineAndSortOnRank(), which is described later in this section. An initial refinement and sorting on predecessor rank is shown when going from line (a) to line (b) in Figure 5.

In a partition that respects a given type of rank, let the rank of a block be equal to the rank of the contained nodes. The following lemma implies that the initial refinement on rank does not split blocks unnecessarily:

Lemma 1. Given nodes u, v ∈ V(G) and a partition P of G, if P is successor stable then [u]_=P = [v]_=P ⇒ rank_Succ(u) = rank_Succ(v), and if P is predecessor stable then [u]_=P = [v]_=P ⇒ rank_Pred(u) = rank_Pred(v).
Figure 5: Using Algorithm 3 to predecessor stabilize a label partition of the data from Figure 1a. The first step shows refinement on predecessor rank.
Proof. For F-bisimilarity, [u]_∼F = [v]_∼F ⇒ rank_Pred(u) = rank_Pred(v) [3]. If P is predecessor stable, then =_P is an F-bisimulation [4], and therefore a refinement of the partitioning on ∼_F, such that [u]_=P = [v]_=P ⇒ [u]_∼F = [v]_∼F. The case for successor stability is symmetric.
The next lemma implies that if blocks are chosen as splitters in order of their rank, a node will be part of a block that is used as a splitter at most once. This property is used to achieve linear complexity in our algorithm.

Lemma 2 ([3]). Given a DAG G, a partition P of G that respects predecessor rank and a block B ∈ P, predecessor stabilization of P with respect to B only splits blocks D where rank_Pred(D) > rank_Pred(B).

This is symmetric for successor stabilization and successor rank, as a reversed DAG is also a DAG.
In Algorithm 3 the blocks in the current partition P are maintained ordered on rank. This is implemented through a detail in our method for stabilization with respect to a block in Algorithm 2, which differs from the original description [16]: The new blocks D_B and D_BS are inserted into P at the position after the old block D, and not at the end of P. The sets of blocks which make up the unions in X are also ordered such that their concatenation yields an ordered list of blocks. This is maintained by inserting B followed by S \ B at the original position of S in X. Notice how blocks never change positions during the stabilization shown in lines (b)–(i) in Figure 5.
In Dovier et al.'s algorithm, rank is computed by performing a depth-first topological traversal of the DAG [3]. Because we need to refine a given partition on rank, as opposed to constructing a partition on rank from scratch, the problem is slightly more involved. Algorithm 4 shows how a partitioning can be refined and sorted on successor or predecessor rank in a single pass. The algorithm traverses the DAG with a hybrid between a topological sort and a breadth-first search, implemented using edge counters and a queue. Blocks are refined and sorted on the fly.
Algorithm 4 Refining and sorting on rank.
1: function RefineAndSortOnRank(P, dir):
2:   Assume dir = Succ (or dir = Pred)
3:   Q is a queue.
4:   for v ∈ V:
5:     v.count ← |{x | 〈x, v〉 ∈ E}| (or |{x | 〈v, x〉 ∈ E}|)
6:     if v.count = 0:
7:       v.rank_dir ← 0
8:       PushBack(Q, v)
9:   for B ∈ P:
10:    B.currRank ← −1
11:  while Q:
12:    v ← PopFront(Q)
13:    Let B ∋ v.
14:    if v.rank_dir ≠ B.currRank:
15:      B.rankedB ← {}
16:      Append B.rankedB at the end of P.
17:      B.currRank ← v.rank_dir
18:    Move v from B to B.rankedB
19:    Remove B from P if empty.
20:    for x where 〈v, x〉 ∈ E: (or 〈x, v〉 ∈ E)
21:      x.count ← x.count − 1
22:      if x.count = 0:
23:        x.rank_dir ← v.rank_dir + 1
24:        PushBack(Q, x)
Lemma 3. Algorithm 4 refines and orders P on successor (or predecessor) rank in O(m + n) time.

Proof (sketch). Because the queue is initialized with the roots, and a node is added to the queue when the last parent is popped from the queue, the nodes are queued and popped in order of successor rank, and this distance is calculated from the parent node with the greatest successor rank. As the successor rank of the nodes that are moved to a new block grows monotonically, only one associated block B.rankedB is created per successor rank found in block B, and all such blocks are appended to P in sorted order. As a node is only queued once, and the cost of processing a node is proportional to the number of outgoing edges, the total running time of the algorithm is O(m + n). The case for predecessor rank is symmetric.
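The invariant that Algorithm 4 establishes can be sketched compactly (our own illustration; for brevity it uses a precomputed rank map and a stable sort instead of the single queue-driven pass, so it is O(n log n) rather than linear, but the resulting partition is the same):

```python
from collections import defaultdict

def refine_and_sort_on_rank(partition, rank):
    """Split every block so rank is uniform within it, then order the blocks
    by increasing rank -- the invariant RefineAndSortOnRank() establishes."""
    refined = []
    for block in partition:
        by_rank = defaultdict(list)
        for v in sorted(block):
            by_rank[rank[v]].append(v)     # one sub-block per rank in the block
        refined.extend(by_rank.values())
    refined.sort(key=lambda b: rank[b[0]])  # stable: equal-rank order is kept
    return refined

# Label partition of the Figure 2 tree with its predecessor ranks.
pred_rank = {3: 0, 5: 0, 6: 0, 2: 1, 4: 1, 1: 2}
print(refine_and_sort_on_rank([[1], [2, 4], [3, 5], [6]], pred_rank))
# -> [[3, 5], [6], [2, 4], [1]]: rank is monotonically increasing
```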
Theorem 1. Algorithm 3 yields the coarsest refinement of a partition of a DAG that is successor (or predecessor) stable.

Proof (sketch). For predecessor stability, the only differences from Paige and Tarjan's algorithm are the initial refinement and the order in which splitter blocks are chosen.
By Lemma 1, blocks are not refined unnecessarily when refining on rank, and the order of split operations is not used in the correctness proof for the original algorithm [16]. Successor stability is symmetric.
To implement Algorithm 3 we use the same underlying data structures as for Algorithm 1: doubly linked lists 𝒞 and 𝒫 realizing X and P. In Algorithm 3, the extract operation is implemented by removing the first B ∈ 𝒮 from the first 𝒮 ∈ 𝒞. The replace operation is implemented by moving B from 𝒮 to the end of 𝒫, and if only one block B′ is left in 𝒮, this B′ is also moved from 𝒮 to 𝒫, and 𝒮 is removed from 𝒞.
Theorem 2. The running time of Algorithm 3 is O(m + n).

Proof (sketch). We analyze the cost of StabilizeWRT() separately. Outside the while-loop, the call to RefineAndSortOnRank() has cost O(m + n) by Lemma 3, and the construction of 𝒞 and 𝒫 has cost O(|P|) ⊆ O(n). As splitters are chosen in order of their rank, by Lemma 2 a splitter block is not later split itself. This means that each node is only part of a splitter once, and that the while-loop is run O(n) times. The loop condition is implemented by checking if 𝒞 ≠ ∅. As all the operations on linked lists inside the loop have complexity O(1), the total cost of the while-loop is O(n). The StabilizeWRT() function is called O(n) times, and the cost of one call is linear in the number of nodes and edges used [16]. As nodes are only part of a splitter block once, edges are also only used for splitting once, and the total cost is O(m + n).
3.2 Stabilizing Trees

We now present an algorithm for finding the coarsest successor and predecessor stable refinement of a partition of the nodes in a tree. It uses the solution for DAGs from the previous section to refine a partition first to the coarsest predecessor stable refinement, and then to the coarsest successor stable refinement, as shown in Algorithm 5. For trees this yields a partition that is still predecessor stable, as we prove in the following.
Algorithm 5 Stabilization for trees.
1: function StabilizeTree(P, Directions):
2:   if Pred ∈ Directions:
3:     StabilizeDAG(P, Pred)
4:   if Succ ∈ Directions:
5:     StabilizeDAG(P, Succ)
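Algorithm 5's order of operations can be mimicked with a naive stabilizer (our own sketch; `coarsest_stable_refinement` is a quadratic fixpoint stand-in for StabilizeDAG(), not the linear-time algorithm). On the Figure 2 tree, predecessor stabilization followed by successor stabilization yields the all-singleton F&B partition:

```python
from collections import defaultdict

def coarsest_stable_refinement(partition, edges, direction):
    """Quadratic fixpoint stand-in for StabilizeDAG(): succ-stability splits a
    block on the set of blocks its nodes' parents lie in, pred-stability on
    the set of blocks their children lie in."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    nodes = [v for block in partition for v in block]
    block = {v: i for i, blk in enumerate(partition) for v in blk}
    nbrs = pred if direction == "succ" else succ
    while True:
        new = {v: (block[v], frozenset(block[u] for u in nbrs[v]))
               for v in nodes}
        if len(set(new.values())) == len(set(block.values())):
            break                    # no block was split: stable
        block = new
    groups = defaultdict(list)
    for v in sorted(nodes):
        groups[block[v]].append(v)
    return sorted(groups.values())

def stabilize_tree(label_partition, edges):
    """Algorithm 5 with Directions = {Succ, Pred}: predecessor first."""
    p = coarsest_stable_refinement(label_partition, edges, "pred")
    return coarsest_stable_refinement(p, edges, "succ")

edges = [(1, 2), (1, 4), (2, 3), (4, 5), (4, 6)]
print(stabilize_tree([[1], [2, 4], [3, 5], [6]], edges))
# -> [[1], [2], [3], [4], [5], [6]], the F&B partition of Figure 2d
```

The intermediate predecessor-stable partition equals the ∼_F partition of Figure 2c, and the second pass refines it without breaking predecessor stability, as Theorem 3 states.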
Figure 6 shows with a continuation of Figure 5 how Algorithm 5 is used to find a successor and predecessor stable refinement. The starting point in the figure is a predecessor stable refinement found after calling StabilizeDAG(P, Pred). This partition is then successor stabilized by calling StabilizeDAG(P, Succ), which first refines and sorts on successor rank, shown between lines (i) and (j) in the figure, and then uses the blocks in the current P as splitters in order, shown in lines (j)–(t). Compare this partition with the F&B-bisimilarity partition in Figure 1d.
Figure 6: Continuing Figure 5: Line (i) shows the predecessor stable partition, step (i)–(j) shows successor rank refinement, and steps (j)–(t) show successor stabilization.
Theorem 3. If a predecessor stable partition of the nodes in a tree is refined to successor stability, the resulting partition is still predecessor stable.

Proof (sketch). Blocks are split in three ways: to refine on rank, with respect to a block B ∈ P, or as a side effect with respect to S \ B in Algorithm 2. By Lemma 1, the first type of split does not cause any split that would not eventually be caused by an algorithm that iteratively refines P with respect to a random block B ∈ P, and from the correctness proof of Paige and Tarjan's algorithm [16], neither does splitting with respect to S \ B. We now use induction on the refinement steps, and show that the partition P remains predecessor stable. It is true initially by assumption. The induction step is to split a block D ∈ P on successor stability with respect to a block B ∈ P.
B will split D into two parts D_B and D_S, containing the nodes pointed to and not pointed to from B, respectively. The splitting of D may only affect the predecessor stability of the new blocks D_B and D_S with respect to their descendants, and of the set of blocks ℬ pointing into D with respect to D_B and D_S. After the split, B ⊆ E⁻¹(D_B) and B ∩ E⁻¹(D_S) = ∅, and for all other B′ ∈ ℬ we have that B′ ∩ E⁻¹(D_B) = ∅ and B′ ⊆ E⁻¹(D_S), because these B′ by assumption have pointers into D, but by the fact that the data is a tree, do not have pointers into D_B.

For any block G pointed to from some node in D, we have D ⊆ E⁻¹(G) by the initial assumption of predecessor stability, meaning all nodes in D point into G. This means that all nodes in D_B and D_S also point into G, and thus D_B and D_S are predecessor stable with respect to G.
Figure 7a illustrates this theorem. By contrast, assuming the data was not a tree, the<br />
blocks B ′ ∈ B would not necessarily be predecessor stable after splitting D, as shown in<br />
Figure 7b.<br />
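The argument above can be checked mechanically on a toy instance. The following Python sketch reconstructs the Figure 7a scenario (all names such as image, preimage and is_p_stable are ours, not from the paper): splitting D on successor stability with respect to B leaves both B and B′ predecessor stable with respect to the two new blocks.

```python
def image(E, B):        # E(B): nodes pointed to from B
    return {v for (u, v) in E if u in B}

def preimage(E, B):     # E^{-1}(B): nodes pointing into B
    return {u for (u, v) in E if v in B}

def is_p_stable(D, B, E):
    # Predecessor stability: D inside E^{-1}(B) or disjoint from it.
    pre = preimage(E, B)
    return D <= pre or not (D & pre)

# A small tree: b1, b2 in B point into D, and b3 in B' points into D.
E = {("b1", "d1"), ("b2", "d2"), ("b3", "d3"),
     ("d1", "g1"), ("d2", "g2"), ("d3", "g3")}
B, Bp, D = {"b1", "b2"}, {"b3"}, {"d1", "d2", "d3"}

# Split D on successor stability with respect to B.
D_B = D & image(E, B)   # nodes in D pointed to from B
D_S = D - image(E, B)   # the rest of D
assert D_B == {"d1", "d2"} and D_S == {"d3"}

# Both B and B' remain predecessor stable w.r.t. the new blocks,
# because each node in D has a unique parent in a tree.
for block in (B, Bp):
    assert is_p_stable(block, D_B, E) and is_p_stable(block, D_S, E)
```

In a DAG (Figure 7b), a node in D_B could additionally have a parent in B′, which is exactly what the final loop would then detect as broken predecessor stability.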
Theorem 4. Algorithm 5 finds the coarsest refinement of a partition of the nodes in a tree that is successor and predecessor stable in O(n) time.
Paper 6: Linear Computation of the Maximum Simultaneous Forward and Backward Bisimulation for Node-Labeled Trees
Figure 7: Splitting D for successor stability w.r.t. B in a partition P, (a) for a tree and (b) for a DAG. This does not impact predecessor stability for any block given tree data, but may break predecessor stability in a DAG.
Proof. From Theorems 1, 2 and 3, and the fact that m ∈ O(n) for tree data.

Corollary 1. The F&B-index can be built in O(n) time for tree data using Algorithm 5.

Proof (sketch). The coarsest successor and predecessor stable refinement of a partitioning on label gives the maximum F&B-bisimulation [10], and the summary structure can be constructed from the contraction graph [10].
4 Related Work

There are many variations of structure summaries for graph data, such as partitionings on similarity [8, 15], the F+B-index [10], the A(k)-index [12], and the D(k)-index [2]. Note that an implication of Theorem 3 is that the F+B-index and the F&B-index are identical for tree data. The cost of matching in the summary can be reduced by label-partitioning it and using specialized joins [1], or using multi-level structure indexing [18]. For queries using ancestor–descendant edges on graph data, different graph encodings offer trade-offs between space usage and query time [21]. For a general overview of indexing and search in XML, see the survey by Gou and Chirkova [6]. There is previous research on updates of bisimilarity partitions [11, 20, 17]. Some of these methods trade update time for coarseness, as refinements of bisimilarity may be cheaper to compute after a data update. Single-directional bisimulations are used in many fields, such as modal logic, concurrency theory, set theory, formal verification, etc. [3], but to our knowledge, F&B-bisimulation is not frequently used outside XML search.
5 Conclusions and Future Work

In this paper we have improved the running time for refining a partition to the coarsest simultaneous successor and predecessor stability for tree data from O(n log n) to O(n),
CHAPTER 4. INCLUDED PAPERS
and with that the computation of F&B-bisimilarity, and construction of the F&B-index.¹ An incorrect linear algorithm for DAGs has been presented recently [13], and it would be interesting to know whether the problem is actually solvable in linear time for DAGs.

A natural extension of our work would be to reduce the cost of updates in F&B-bisimilarity partitions for trees. A particularly interesting direction would be to improve indexing performance for typical XML document collections, where there is a large number of small independent documents. It may be possible to iteratively add documents to the index with (expected) cost dependent only on the size of the documents.
Acknowledgments

This material is based upon work supported by the iAd Project funded by the Research Council of Norway and the Norwegian University of Science and Technology. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of the funding agencies.
References

[1] Radim Bača, Michal Krátký, and Václav Snášel. On the efficient search of an XML twig query in large DataGuide trees. In Proc. IDEAS, 2008.

[2] Qun Chen, Andrew Lim, and Kian Win Ong. D(k)-index: An adaptive structural summary for graph-structured data. In Proc. SIGMOD, 2003.

[3] Agostino Dovier, Carla Piazza, and Alberto Policriti. A fast bisimulation algorithm. In Proc. CAV, 2001.

[4] Jean-Claude Fernandez. An implementation of an efficient algorithm for bisimulation equivalence. Sci. Comput. Program., 13(2–3), 1990.

[5] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. VLDB, 1997.

[6] Gang Gou and Rada Chirkova. Efficiently querying large XML data repositories: A survey. Knowl. and Data Eng., 2007.

[7] Nils Grimsmo, Truls Amundsen Bjørklund, and Magnus Lie Hetland. Linear computation of the maximum simultaneous forward and backward bisimulation for node-labeled trees (extended version). Technical Report IDI-TR-2010-10, NTNU, Trondheim, Norway, 2010.

[8] Monika R. Henzinger, Thomas A. Henzinger, and Peter W. Kopke. Computing simulations on finite and infinite graphs. In Proc. FOCS, 1995.

[9] Paris C. Kanellakis and Scott A. Smolka. CCS expressions, finite state processes, and three problems of equivalence. In Proc. PODC, 1983.
¹ See the extended version of this paper for some performance experiments [7].
[10] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Henry F. Korth. Covering indexes for branching path queries. In Proc. SIGMOD, 2002.

[11] Raghav Kaushik, Philip Bohannon, Jeffrey F. Naughton, and Pradeep Shenoy. Updates for structure indexes. In Proc. VLDB, 2002.

[12] Raghav Kaushik, Pradeep Shenoy, Philip Bohannon, and Ehud Gudes. Exploiting local similarity for indexing paths in graph-structured data. In Proc. ICDE, 2002.

[13] Xianmin Liu, Jianzhong Li, and Hongzhi Wang. SAM: An efficient algorithm for F&B-index construction. In Proc. APWeb/WAIM, 2007.

[14] Robin Milner. Communication and Concurrency. Prentice-Hall, Inc., 1989.

[15] Tova Milo and Dan Suciu. Index structures for path expressions. In Proc. ICDT, 1999.

[16] Robert Paige and Robert E. Tarjan. Three partition refinement algorithms. SIAM J. Comput., 1987.

[17] Diptikalyan Saha. An incremental bisimulation algorithm. In Proc. FSTTCS, 2007.

[18] Xin Wu and Guiquan Liu. XML twig pattern matching using version tree. Data & Knowl. Eng., 2008.

[19] Beverly Yang, Marcus Fontoura, Eugene Shekita, Sridhar Rajagopalan, and Kevin Beyer. Virtual cursors for XML joins. In Proc. CIKM, 2004.

[20] Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. Incremental maintenance of XML structural indexes. In Proc. SIGMOD, 2004.

[21] Jeffrey Xu Yu and Jiefeng Cheng. Graph reachability queries: A survey. In Managing and Mining Graph Data, Advances in Database Systems. Springer US, 2010.
A Experiments
To investigate the practical impact of the algorithms presented, we have run a small experiment on the common XML tree benchmarks DBLP², XMark³ and Treebank⁴. The general graph algorithm and the specialized tree algorithm were implemented in C++ and run on an AMD Athlon 3500+. We generated XMark data of roughly the same size as the Treebank benchmark, and used a prefix of the DBLP data of similar size. Only XML elements and attributes, without their values, were indexed. Properties of the data sets and the constructed indexes are listed in Figure 8. DBLP is a shallow tree, while XMark and Treebank are much deeper. Note that Treebank has a larger number of labels, and a structure that gives a large number of bisimulation partitions.

² http://dblp.uni-trier.de/xml/dblp.xml
³ http://www.xml-benchmark.org/downloads.html
⁴ http://www.cs.washington.edu/research/xmldatasets/data/treebank/
              Nodes     Labels   B-partitions   F-partitions   F&B-partitions
DBLP      2 830 635         33             88            292            2 479
XMark     2 048 193         77            547         31 230          484 615
Treebank  2 437 667        251        338 749        443 639        2 277 127

Figure 8: Data set properties.
Figure 9a shows the number of times each edge was used in a stabilization operation on average throughout the refinement, while Figure 9b shows the time in milliseconds divided by the number of edges in the graph. For the DBLP data, the original algorithm is actually slightly faster for all partition types. The explanation can be found in the number of times each edge is used, which is close to optimal. The specialized tree algorithm has some linear-time overhead, because the refinement on maximum distance is slightly more involved than the separation of roots and leaves in the graph algorithm. Partitioning on B-bisimilarity and F-bisimilarity has roughly the same cost with both algorithms on all the data sets, again because the graph algorithm is very far from its worst-case behavior.
Figure 9: Computing the different bisimulations (B, F and F&B) on DBLP, XMark and Treebank, using the original algorithm for graphs (StabilizeGraph) or the specialized algorithm for trees (StabilizeTree). (a) Avg. uses per edge. (b) Avg. time per edge (in ms).
On F&B-index construction for XMark data, the tree algorithm uses one third of the edge operations, and is almost twice as fast as the general graph algorithm. Note that the 5.9 uses of each edge in the graph algorithm is far from the worst-case number of uses per edge. For the Treebank benchmark, the number of times an edge is used is smaller, and so is the difference between the algorithms. A greater part of the cost is the overhead of maintaining the blocks in the partition, because the number of blocks is close to the number of nodes in the tree, as can be seen in Figure 8.
B Bi-directional Bisimulation and Stability
Most work on bisimulation considers only forward bisimulation on edge-labeled graph data, and most work on stabilizing partitions considers only predecessor stabilization on node-labeled graph data. When bi-directional bisimulation and the connection to predecessor and successor stabilization were introduced [10], extensions of these previous results were omitted, probably due to space restrictions. We therefore go through and extend the results for bisimulation [14], partition stabilization [16] and their connection [9] to the bi-directional case for node- and edge-labeled data. The extensions may be trivial, but this appendix also serves as an easily accessible reference.

Assume a node- and edge-labeled graph G = 〈V, E〉 with node set V and edge set E ⊆ V × V. The label of a node v ∈ V is L(v) ∈ A_V, and the label of an edge e ∈ E is L(e) ∈ A_E. Let A = A_V ∪ A_E. For a given α ∈ A_V, let V_α be the set of nodes with this label, V_α = {v | v ∈ V, L(v) = α}. And similarly, for a given α ∈ A_E, let E_α = {e | e ∈ E, L(e) = α}.
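As a concrete reading of this setup, the following Python sketch (the representation and helper names are our own assumptions, not from the paper) models a small node- and edge-labeled graph and the selector sets V_α and E_α:

```python
# A tiny node- and edge-labeled graph G = <V, E>.
V = {1, 2, 3, 4}
E = {(1, 2), (1, 3), (3, 4)}
L_node = {1: "a", 2: "b", 3: "b", 4: "c"}           # L(v) in A_V
L_edge = {(1, 2): "x", (1, 3): "x", (3, 4): "y"}    # L(e) in A_E

def V_alpha(alpha):
    # V_alpha: the nodes carrying label alpha.
    return {v for v in V if L_node[v] == alpha}

def E_alpha(alpha):
    # E_alpha: the edges carrying label alpha.
    return {e for e in E if L_edge[e] == alpha}

assert V_alpha("b") == {2, 3}
assert E_alpha("x") == {(1, 2), (1, 3)}
```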
B.1 Bisimulation

Definition 4 (Node- and edge-labeled F&B-bisimulation). R ⊆ V × V is a F&B-bisimulation if v R v′ implies that the following Properties are satisfied:
1. L(v) = L(v′)
2. u E_α v ⇒ ∃u′ : u′ E_α v′ ∧ u R u′
3. u′ E_α v′ ⇒ ∃u : u E_α v ∧ u R u′
4. v E_α w ⇒ ∃w′ : v′ E_α w′ ∧ w R w′
5. v′ E_α w′ ⇒ ∃w : v E_α w ∧ w R w′
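For finite graphs, Properties 1–5 can be checked by brute force. The Python sketch below (edges represented as (u, α, v) triples; all names are ours) verifies a candidate relation R, here for two bisimilar one-edge paths:

```python
def is_fb_bisimulation(V, E, L, R):
    """E: set of (u, alpha, v) labeled edges; L: node labels; R: relation."""
    for (v, vp) in R:
        if L[v] != L[vp]:                                           # Property 1
            return False
        for (u, a, w) in E:
            # Property 2: each labeled in-edge of v is matched at v'.
            if w == v and not any((x, a, vp) in E and (u, x) in R for x in V):
                return False
            # Property 3: each labeled in-edge of v' is matched at v.
            if w == vp and not any((x, a, v) in E and (x, u) in R for x in V):
                return False
            # Property 4: each labeled out-edge of v is matched at v'.
            if u == v and not any((vp, a, x) in E and (w, x) in R for x in V):
                return False
            # Property 5: each labeled out-edge of v' is matched at v.
            if u == vp and not any((v, a, x) in E and (x, w) in R for x in V):
                return False
    return True

# Two bisimilar paths labeled a -> b, with edge label x.
V = {1, 2, 3, 4}
E = {(1, "x", 2), (3, "x", 4)}
L = {1: "a", 2: "b", 3: "a", 4: "b"}

assert is_fb_bisimulation(V, E, L, {(1, 3), (2, 4)})
assert not is_fb_bisimulation(V, E, L, {(1, 3)})  # Property 4 fails at (1, 3)
```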
This is illustrated in Figure 10. The definition becomes equal to the definition of F&B-bisimulation for node-labeled data [10] when there is only a single edge label α, and equal to the original (forward) bisimulation definition [14] when there is a single node label and we ignore Properties 2 and 3. Various types of simulation [8] are defined by ignoring at least Properties 3 and 5.

The following lemma trivially extends the original for forward bisimulation in edge-labeled graphs [14, Proposition 4.1].
Lemma 4 (Constructing F&B-bisimulations). Assuming that R_i and R_j are F&B-bisimulations, then the following are F&B-bisimulations:

(1) ∅
(2) Id_V = {〈v, v〉 | v ∈ V }
(3) R_i^{-1} = {〈v′, v〉 | 〈v, v′〉 ∈ R_i}
(4) R_i ∪ R_j = {〈v, v′〉 | 〈v, v′〉 ∈ R_i ∨ 〈v, v′〉 ∈ R_j}
(5) R_i R_j = {〈v, v″〉 | 〈v, v′〉 ∈ R_i, 〈v′, v″〉 ∈ R_j}

Figure 10: Illustration of F&B-bisimulation.
Proof. Property 1 is trivially satisfied for (1)–(5), as it is an equality. (1) is true as there are no v R v′. For (2), substitute u for u′ and w for w′ in Properties 2–5. (3) is true by the symmetry of Properties 2 and 3, and of Properties 4 and 5. (4) is true by substituting R_i ∪ R_j for R in Properties 2–5. For (5), assume that v R_i R_j v″. Then there is some v′ such that v R_i v′ and v′ R_j v″. For Property 2, let there be some u such that u E_α v. Then, as v R_i v′ and R_i is a F&B-bisimulation, there exists a u′ such that u′ E_α v′ and u R_i u′. As v′ R_j v″ and R_j is a F&B-bisimulation, there exists a u″ such that u″ E_α v″ and u′ R_j u″, and the relation u R_i R_j u″ follows. Proofs of Properties 3–5 are similar.
Corollary 2 (Union of bisimulation refinements). It follows from Lemma 4(4) that given an initial relation S, the union of all F&B-bisimulations that are refinements of S is also a F&B-bisimulation. Trivially, this union is also a refinement of S.
B.2 Bisimilarity

We know from Corollary 2 that there exists a unique maximum F&B-bisimulation, of which all other F&B-bisimulations are refinements:

Definition 5 (F&B-bisimilarity). Nodes v and v′ are F&B-bisimilar, written v ∼_{F&B} v′, iff v R v′ for some F&B-bisimulation R. Equivalently,

∼_{F&B} = ⋃ {R | R is a F&B-bisimulation}

If the type of bisimulation is clear from the context, such as in the following, we write just ∼ for ∼_{F&B}.
The next theorem is a generalization of the following Corollary 3, which is taken from Milner's original results [14]. This more general result will be useful when linking F&B-bisimulations with partition P&S-stability in Section B.3.

Theorem 5 (Largest bisimulation partition refinement is equivalence). If Q is a partition of the nodes in a graph, then the largest F&B-bisimulation R that is a refinement of =_Q is an equivalence relation.
Proof. Reflexivity: We have that Id_V ⊆ =_Q as =_Q is an equivalence relation, and that Id_V is a F&B-bisimulation by Lemma 4(2). Hence Id_V ⊆ R, as R contains all such F&B-bisimulations by Corollary 2. Symmetry: If v R v′, then v R_i v′ for some F&B-bisimulation R_i ⊆ R. We have that v′ R_i^{-1} v by definition of the inverse, and R_i^{-1} is a F&B-bisimulation by Lemma 4(3). For any v′ R_i^{-1} v, we have that v′ =_Q v, because v R_i v′ implies v =_Q v′ by assumption, and =_Q is an equivalence relation and hence symmetric. This means that R_i^{-1} ⊆ =_Q, and that R_i^{-1} ⊆ R by the maximality of R, and v′ R v follows. Transitivity: If v R v′ and v′ R v″, then for some F&B-bisimulations R_i ⊆ R and R_j ⊆ R we have that v R_i v′ and v′ R_j v″. By Lemma 4(5), R_i R_j is a F&B-bisimulation. For any x and x′, if x R_i x′ and x′ R_j x″, then x =_Q x′ and x′ =_Q x″ follow from R_i, R_j ⊆ R ⊆ =_Q. Because =_Q is an equivalence relation and hence transitive, x =_Q x″ follows, and hence R_i R_j ⊆ =_Q. By the maximality of R we have that R_i R_j ⊆ R, and therefore v R v″.
The next corollary is a simple extension of the original result for single-directional bisimulations in edge-labeled data [14, Proposition 2].

Corollary 3 (Bisimilarity is equivalence). It follows from Theorem 5 that F&B-bisimilarity is an equivalence relation.
The following lemma states that for F&B-bisimilarity, the word implies in Definition 4 can be exchanged with iff. The proof is almost identical to the proof for the single-directional bisimulation for edge-labeled data [14, Lemma 3 + Proposition 4].
Lemma 5. Properties 1–5 in Definition 4 with ∼ substituted for R imply v ∼ v′.

Proof. Define ∼′ to be a relation such that v ∼′ v′ iff Properties 1–5 are satisfied with ∼ replaced for R. We have that v ∼ v′ implies v ∼′ v′, because v ∼ v′ implies the properties, and the properties imply v ∼′ v′ by the definition of ∼′. To prove that the properties imply v ∼ v′, we prove that v ∼′ v′ implies v ∼ v′, i.e., ∼′ ⊆ ∼. Assume that v ∼′ v′. Property 1 is trivially satisfied. By the definition of ∼′, for all u such that u E_α v, there exists a u′ such that u′ E_α v′ and u ∼ u′, which, as shown above, implies u ∼′ u′, satisfying Property 2. Properties 3–5 are shown similarly.
B.3 Stability

The following is a reformulation of Definition 2 of successor and predecessor stability, adding edge labels:

Definition 6 (Stability). A set D ⊆ V is S_α-stable with respect to a set B ⊆ V iff D ⊆ E_α(B) or D ∩ E_α(B) = ∅, and P_α-stable with respect to B iff D ⊆ E_α^{-1}(B) or D ∩ E_α^{-1}(B) = ∅. Furthermore, D is S-stable or P-stable with respect to B iff D is S_α-stable or P_α-stable with respect to B for all α ∈ A_E, respectively.

If D is both successor stable and predecessor stable with respect to B, we shall say that D is P&S-stable with respect to B, or simply stable if there is no ambiguity.
For any combination of successor and predecessor stability, a partition Q of V is said to be stable with respect to a block B if all blocks in Q are stable with respect to B. A partition Q is stable with respect to another partition Q′ if it is stable with respect to all blocks in Q′. Q is said to be stable if it is stable with respect to itself.
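For a single edge label, these stability notions can be tested directly. The following Python sketch (helper names are ours) checks a block and a whole partition for P&S-stability; note that a partition into singleton blocks is trivially stable:

```python
def succ(E, B):   # E(B): successors of the nodes in B
    return {v for (u, v) in E if u in B}

def pred(E, B):   # E^{-1}(B): predecessors of the nodes in B
    return {u for (u, v) in E if v in B}

def block_stable(D, B, E):
    # D is stable w.r.t. B: inside or disjoint from both E(B) and E^{-1}(B).
    s, p = succ(E, B), pred(E, B)
    return (D <= s or not (D & s)) and (D <= p or not (D & p))

def partition_stable(P, E):
    # A partition is stable iff every block is stable w.r.t. every block.
    return all(block_stable(D, B, E) for D in P for B in P)

E = {(1, 3)}  # only node 1 points into {3}
assert not partition_stable([{1, 2}, {3}], E)  # {1,2} not P-stable w.r.t. {3}
assert partition_stable([{1}, {2}, {3}], E)    # singletons are always stable
```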
B.4 Stability and bisimulation

The next lemma states that a F&B-bisimulation can be found by P&S-stabilizing a partition, extending the link between single-directional (forward) bisimulation and the relational coarsest (predecessor) stable partition problem [9, Lemma 3].

Lemma 6 (Stability implies bisimulation). If a partition Q respects node labels and is P&S-stable, then =_Q is a F&B-bisimulation.
Proof. We must prove that v =_Q v′ implies Properties 1–5 in Definition 4. Property 1 is satisfied as node labeling is assumed to be respected by the partition. For Property 2, assume that D ∈ Q is the block containing v and v′, and that there is an edge u E_α v for some α. Let B ∈ Q be the block containing u. We have that D ∩ E_α(B) ≠ ∅ from u E_α v, and hence D ⊆ E_α(B) from the successor stability. This means that for any v′ ∈ D, there is some u′ ∈ B, i.e., u =_Q u′, such that u′ E_α v′. Properties 3–5 are proved similarly.
Corollary 4. Given a partition Q, by Definition 5 and Lemma 6 there is a unique coarsest P&S-stable refinement of Q, of which all other P&S-stable refinements of Q are again refinements.
Lemma 7 (Bisimulation and equivalence implies stability). If a relation R is a F&B-bisimulation and an equivalence relation, then the equivalence classes of R give a P&S-stable partition respecting node labels.

Proof. Let Q be the partition arising from R. Q respects labels by Property 1 in Definition 4. Successor stability: We prove that for blocks D, B ∈ Q, we have that D ∩ E_α(B) ≠ ∅ implies D ⊆ E_α(B). Assume D ∩ E_α(B) ≠ ∅, i.e., there is some node v ∈ D such that v ∈ E_α(B), and hence some u ∈ B such that u E_α v. For each v′ ∈ D, we have that v R v′, and by Property 2 in Definition 4 there is some u′ such that u′ E_α v′ and u R u′, i.e., u′ ∈ B and v′ ∈ E_α(B). Hence, D ⊆ E_α(B). Predecessor stability: Identical proof substituting E_α with E_α^{-1}.
Theorem 6 (Maximum refinements). Given an initial partition Q, if R is the maximum F&B-bisimulation that is a refinement of =_Q, and P is the coarsest stable partition refining Q respecting node labels, then =_P is equal to R.

Proof. Follows from Lemmas 6 and 7.
B.5 Implementing Stability

The following two lemmas are trivial extensions of Paige and Tarjan's original description [16].
Lemma 8 (Stability inherited under refinement). For any type of stability, if a partition Q_2 is a refinement of Q_1 and Q_1 is stable with respect to a partition Q, then so is Q_2.

Proof. For any block D_2 ∈ Q_2 there is some block D_1 ∈ Q_1 such that D_2 ⊆ D_1. We show successor stability. Assume a given α ∈ A_E for S_α-stability, or any α for general S-stability. For any block B ∈ Q, if D_1 ⊆ E_α(B) then D_2 ⊆ E_α(B), or if D_1 ∩ E_α(B) = ∅ then D_2 ∩ E_α(B) = ∅. Predecessor stability is symmetric.
Lemma 9 (Stability inherited under union). For any type of stability, if a partition Q is stable with respect to both B_1 ⊆ V and B_2 ⊆ V, then Q is stable with respect to B_1 ∪ B_2.

Proof. For successor stability, for any block D ∈ Q and a given or any α ∈ A_E, if at least one of D ⊆ E_α(B_1) and D ⊆ E_α(B_2) is true, then D ⊆ E_α(B_1 ∪ B_2) is true. If both of D ∩ E_α(B_1) = ∅ and D ∩ E_α(B_2) = ∅ are true, then D ∩ E_α(B_1 ∪ B_2) = ∅ is true. Predecessor stability is symmetric.
When using the procedure StabilizeWRT() in our stabilization algorithms, an in-place stabilization is performed, while in the theoretical description [16] of the original algorithm, there is a function split(B, Q), which returns the coarsest refinement of Q that is predecessor stable with respect to B. This is naturally generalized to stabilizing on P-stability, S-stability and S&P-stability. The following functions are handy for the definitions:

clean(Q) = {B | B ∈ Q ∧ B ≠ ∅}
sep(C, Q) = clean({D ∩ C | D ∈ Q} ∪ {D \ C | D ∈ Q})

Note that sep is commutative, i.e., sep(C_1, sep(C_2, Q)) = sep(C_2, sep(C_1, Q)).
Definition 7 (α-Split).

split_{S_α}(B, Q) = sep(E_α(B), Q)
split_{P_α}(B, Q) = sep(E_α^{-1}(B), Q)
split_{P&S_α}(B, Q) = split_{P_α}(B, split_{S_α}(B, Q))
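The functions clean and sep, and the split functions of Definition 7, translate almost verbatim to code. The sketch below (single edge label assumed; blocks represented as frozensets; all names ours) also exercises the commutativity of sep:

```python
def succ(E, B):
    return frozenset(v for (u, v) in E if u in B)

def pred(E, B):
    return frozenset(u for (u, v) in E if v in B)

def clean(Q):
    return {B for B in Q if B}              # drop empty blocks

def sep(C, Q):
    return clean({D & C for D in Q} | {D - C for D in Q})

def split_S(B, Q, E):                       # split on successor stability
    return sep(succ(E, B), Q)

def split_P(B, Q, E):                       # split on predecessor stability
    return sep(pred(E, B), Q)

def split_PS(B, Q, E):
    return split_P(B, split_S(B, Q, E), E)

E = {(1, 2), (1, 3)}
Q0 = {frozenset({1, 2, 3})}

# One P&S-split of the trivial partition separates the root from its children.
Q1 = split_PS(frozenset({1, 2, 3}), Q0, E)
assert Q1 == {frozenset({1}), frozenset({2, 3})}

# sep is commutative.
a, b = frozenset({1}), frozenset({2})
assert sep(a, sep(b, Q0)) == sep(b, sep(a, Q0))
```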
Lemma 10 (Splitting and stability). A partition Q is:

S-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_{S_α}(B, Q)
P-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_{P_α}(B, Q)
P&S-stable iff ∀α ∈ A_E : ∀B ∈ Q : Q = split_{P&S_α}(B, Q)

Proof. Follows directly from Definitions 6 and 7.
Lemma 11 (α-splitting is monotone). Given a set B ⊆ V, if a partition Q_2 is a refinement of a partition Q_1, then split_{S_α}(B, Q_2) is a refinement of split_{S_α}(B, Q_1), and split_{P_α}(B, Q_2) is a refinement of split_{P_α}(B, Q_1).

Proof. For any block D_2 ∈ Q_2, there is a block D_1 ∈ Q_1 such that D_2 ⊆ D_1, and for split_{S_α} we have D_2 ∩ E_α(B) ⊆ D_1 ∩ E_α(B) and D_2 \ E_α(B) ⊆ D_1 \ E_α(B). Similarly, for split_{P_α} we have D_2 ∩ E_α^{-1}(B) ⊆ D_1 ∩ E_α^{-1}(B) and D_2 \ E_α^{-1}(B) ⊆ D_1 \ E_α^{-1}(B).
Lemma 12 (Correctness of splitting). For an initial partition Q_0, let R be the coarsest refinement of Q_0 that is P&S-stable, and let Q_i and Q′ be refinements of Q_0 such that R is a refinement of both Q_i and Q′. If U is a union of blocks in Q′, then for any α we have that R is a refinement of split_{S_α}(U, Q_i), of split_{P_α}(U, Q_i), and hence of split_{P&S_α}(U, Q_i).

Proof. Since U is a union of blocks from Q′, it is also a union of blocks from R, which refines Q′. As R is S-stable, we have R = split_{S_α}(U, R) for any α. From Lemma 11 we have that split_{S_α}(U, R) is a refinement of split_{S_α}(U, Q_i) when R is a refinement of Q_i. Hence, R is a refinement of split_{S_α}(U, Q_i). P-stability is symmetric for split_{P_α}.
Algorithm 6 shows the generic framework for stabilization used in Paige and Tarjan's algorithm [16], extended to our more general case.
Algorithm 6 Generic stabilization
    Given a partition Q_0.
    i ← 0
    Until no change is possible:
        Find some set U that is a union of blocks in Q_i
            such that Q_i ≠ split_{P&S_α}(U, Q_i) for some α ∈ A_E.
        Q_{i+1} ← split_{P&S_α}(U, Q_i)
        i ← i + 1
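Algorithm 6 can be sketched for a single edge label by always choosing the union U to be a single block of the current partition (a block is trivially a union of blocks). The helper names below are ours, not from the paper:

```python
def succ(E, B): return frozenset(v for (u, v) in E if u in B)
def pred(E, B): return frozenset(u for (u, v) in E if v in B)
def sep(C, Q):  return {X for D in Q for X in (D & C, D - C) if X}

def split_PS(U, Q, E):
    # Split on successor stability, then predecessor stability, w.r.t. U.
    return sep(pred(E, U), sep(succ(E, U), Q))

def stabilize(Q, E):
    """Refine Q until no block U yields a proper P&S-split (Algorithm 6)."""
    changed = True
    while changed:
        changed = False
        for U in list(Q):
            Qn = split_PS(U, Q, E)
            if Qn != Q:
                Q, changed = Qn, True
                break
    return Q

# Two roots (1 and 4) with two leaf children each: the coarsest stable
# refinement of the trivial partition groups the roots and the leaves.
E = {(1, 2), (1, 3), (4, 5), (4, 6)}
Q = stabilize({frozenset({1, 2, 3, 4, 5, 6})}, E)
assert Q == {frozenset({1, 4}), frozenset({2, 3, 5, 6})}
```

This naive loop rescans the whole partition after each split and so runs far slower than the O(m log n) bookkeeping of Paige and Tarjan, but it computes the same fixpoint.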
Corollary 5 (Stabilization correct). It follows from Lemma 12 that Algorithm 6 maintains the invariant that the coarsest P&S-stable refinement of the initial partition Q_0 is also a refinement of the current partition Q_i.

Theorem 7 (Stabilization terminates). Algorithm 6 terminates, and then Q_i is the coarsest P&S-stable refinement of Q_0.
Proof. As long as Q_i is not P&S-stable, by Lemma 10 there exists a union U of blocks in Q_i and an α ∈ A_E such that split_{X_α}(U, Q_i) ≠ Q_i. When split_{X_α}(U, Q_i) = Q_i for all U and α, Q_i is P&S-stable by Lemma 10, and by Corollary 5, Q_i is the coarsest P&S-stable refinement of Q_0.
Corollary 6. It follows from Theorems 6 and 7 that F&B-bisimilarity can be computed by using Algorithm 6 to compute the coarsest P&S-stable refinement of the label partition.
Appendix A

Other Papers

“No. Better research needed. Fire your research person.
No fishnet stockings. Never. Not in this band.”
– Gene Simmons
Paper 7<br />
<strong>Nils</strong> <strong>Grimsmo</strong><br />
On performance <strong>and</strong> cache effects in substring indexes<br />
Technical Report IDI-TR-2007-04, Norwegian University <strong>of</strong> Science <strong>and</strong> Technology,<br />
Trondheim, Norway, 2007<br />
Abstract This report evaluates the performance of uncompressed and compressed substring indexes on build time, space usage and search performance. It is shown how the structures react to increasing data size, alphabet size and repetitiveness in the data. Conclusions are drawn as to when different data structures should be used. The main contribution is the strong relationship identified between time performance and locality in the data structures. As an example, it is found that for byte-sized alphabets, suffix tree construction can be sped up by a factor of 16, and query lookup by a factor of 8, if dynamic arrays are used instead of linked lists to store the lists of children for each node, at the cost of about 20% more space. And for enhanced suffix arrays, query lookup is up to twice as fast if the data structure is stored as an array of structs instead of a set of arrays, at no extra space cost.
Research process
I started on this PhD right after finishing my master's degree [16], where the topic was substring indexes, such as suffix trees and suffix arrays. XML search was the topic my PhD supervisors had planned for me to work on, but I felt that I had some results on substring indexing I should try to publish first.
Retrospective view
In retrospect, pursuing this research direction was a bad idea, as I was fresh out of school and tried to publish in a field that was outside the expertise of my supervisors. Also, in the years before I started working on this topic, research had been published that drastically changed the field, and my results turned out to be outdated and uninteresting for the research community. The results of this venture were published in a technical report [17].
Paper 7: On performance and cache effects in substring indexes

1 Introduction
The suffix tree is a versatile substring index, first introduced by Weiner [44] (see more accessible descriptions by McCreight and Ukkonen [29, 42]). It can be used to search for occurrences of patterns in a string, and to solve many other problems in combinatorial pattern matching [16]. The suffix tree and similar structures are mostly used in computational biology, where the sequences considered do not have word boundaries, and methods such as inverted lists are not suitable.
1.1 Suffix Tree Definition
A suffix tree for a string T of length n from the alphabet Σ of size σ is a trie of all n + 1 suffixes of the string padded with a unique terminal, edge-compacted such that every internal node is branching. Since the padded string has n + 1 suffixes, and the unique terminal ensures that no padded suffix is a prefix of another, the tree has n + 1 leaf nodes. There are at most n internal nodes, since they all have at least two children. This means that if edges are represented as pointers into the string, the suffix tree needs Θ(n) space measured in computer words, or Θ(n log n) bits, which is slightly more than the optimal O(n log σ) bits, the space needed to store the indexed string.
In addition to the parent–child edges, each internal (non-root) node has a suffix link to another internal node [29]. If the edges from the root to an internal node spell χα, where χ is a string of length 1, this node has a suffix link to the internal node which represents the string α. These links are used in construction algorithms, and to solve some problems, such as longest common substring [16].
A suffix tree can be built in Θ(n) time when the alphabet size is constant [44] or integer [5]. As with any trie, it can be checked whether a pattern P of length m is contained in the tree in O(m) time, if node child lookup is Θ(1). Since all leaf nodes below the tree position found in such a lookup represent unique matches, and all internal nodes are branching, at most 2z − 1 nodes must then be visited to find all z matches, giving a total time of O(m + z). The problem of finding all matches of a pattern in a string or a set of strings is known as the occurrence listing problem. A suffix tree has asymptotically optimal construction time for integer alphabets, and optimal search time for constant alphabets. Optimal search time for larger alphabets can be achieved with perfect hashing, at the cost of a longer construction time [10].
A generalised suffix tree [16] for a set of strings S = {T_1, ..., T_d} is an edge-compacted trie of all suffixes of each string T_i padded with a unique terminator string $_i. A string T_i of length n_i can be added to the tree in Θ(n_i) time by slightly modifying a suffix tree construction algorithm, and can be removed in Θ(n_i) time.
1.2 Alternatives to Suffix Trees
A high constant factor in the space needed for suffix trees makes them unsuited for solving problems with large strings on commodity computers, as they do not work well on disk [2]. The space-efficient representation by Kurtz [22] needs 13 bytes per string symbol on average¹. This has led to the development of many alternative index structures.
The suffix array [27] is a simplification of the suffix tree, consisting only of an array A with the sorted order of the suffixes of the padded string, such that T[A[i]..n+1] < T[A[i+1]..n+1] for all 1 ≤ i ≤ n. The order in A is equal to the order of the leaf nodes seen in an ordered depth-first traversal of the suffix tree for the string. The space usage is n⌈log n⌉ bits, plus n⌈log σ⌉ bits for storing the text. Lookup by binary search is O(m log n), which can be improved to O(m + log n) by using additional arrays storing longest common prefixes (LCPs) between the indexed suffixes [27]. After finding the left and right border of the suffixes matching the pattern, the hits are read by a sequential scan in the array. Suffix arrays can be constructed in Θ(n) time [18, 20, 21], but algorithms with higher worst-case bounds are usually faster in practice [28, 40].
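The lookup scheme just described can be sketched as follows. This is a minimal C++ illustration with names of my own choosing (build_sa, find_range); it uses plain comparison sorting rather than the linear-time construction algorithms cited, and omits the terminal padding for brevity.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Naive suffix array construction by comparison sorting.  This is far
// from the linear-time algorithms cited above, but it produces the
// same array layout: sa holds the start positions of the suffixes of t
// in lexicographic order.
std::vector<int> build_sa(const std::string& t) {
    std::vector<int> sa(t.size());
    for (std::size_t i = 0; i < t.size(); ++i) sa[i] = static_cast<int>(i);
    std::sort(sa.begin(), sa.end(), [&t](int a, int b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// O(m log n) lookup: binary search for the left and right borders of
// the suffixes that start with pattern p.  The z hits are then read by
// a sequential scan of sa between the two borders.
std::pair<int, int> find_range(const std::string& t,
                               const std::vector<int>& sa,
                               const std::string& p) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), p,
        [&t](int a, const std::string& q) {
            return t.compare(a, q.size(), q) < 0;  // suffix prefix < pattern
        });
    auto hi = std::upper_bound(sa.begin(), sa.end(), p,
        [&t](const std::string& q, int a) {
            return t.compare(a, q.size(), q) > 0;  // suffix prefix > pattern
        });
    return {static_cast<int>(lo - sa.begin()),
            static_cast<int>(hi - sa.begin())};
}
```

For "banana" the sorted suffix order is a, ana, anana, banana, na, nana, so a lookup of "ana" yields the border pair (1, 3), and the two hit positions are read from sa[1] and sa[2].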
A “hybrid” between suffix arrays and trees is the enhanced suffix array, which has the same functionality as suffix trees, but uses less space in the worst case. It is a suffix array with additional fields, and can be built in linear time. Abouelhoda et al. [1] show how to use the enhanced suffix array to replace any algorithm doing top–down, bottom–up or suffix link traversal in a suffix tree. The steps of looking up a pattern are similar to those in a suffix tree implemented with sibling lists. Listing hits is done as in suffix arrays, by a sequential scan between a left and right border. Enhanced suffix arrays are expected to list large numbers of hits much faster than suffix trees in practice, due to the improved locality.
An alternative non-linear suffix tree construction is the write-only top-down construction (wotd) [11], in which the suffix tree is constructed in O(n²) time (expected Θ(n log n)). Sibling nodes are stored adjacently in memory. This spatial locality makes it faster than linear-time suffix trees in some cases.
Many compressed substring indexes have been introduced in the last ten years. See [34] for an introduction, time and space bounds, and an extensive list of references. The family of LZ-indexes [19, 8, 32] uses structures based on the Ziv–Lempel decomposition of the string [45]. Other indexes are based on suffix arrays, such as the family of compressed suffix arrays [23, 15, 38, 14, 24]. The FM-index family [7, 12, 9, 26] also builds on suffix arrays, but uses a different search method. These structures offer various trade-offs between index size, construction time, and query performance. Many of them do not depend on keeping a copy of the text, and are hence called self-indexes.
Sadakane [39] has shown that it is also possible to combine a compressed suffix array with a balanced parenthesis representation of a suffix tree and various other structures to get full functionality, with bottom–up, top–down and suffix link traversal of the tree.
1.3 Dynamic Problems
Suffix trees have been equalled by other structures in terms of construction time and search speed, and surpassed on space usage and practical performance when listing many hits. The only problems in which suffix trees have not been matched (at least in asymptotic terms, to the author's knowledge) are problems concerning dynamic sets of strings, in which strings are added to and removed from the set. Generalised suffix trees can be used to solve many of these problems optimally.

¹ When supporting strings up to 500 MB.
Problems with static sets of strings can be solved by using suffix arrays or similar structures, by indexing the concatenation of the strings, separated by a unique symbol ∉ Σ. Also, any dynamic decomposable indexing problem can be solved by using a hierarchy of static indexes of varying sizes, at the cost of an O(log |S|) overhead on build time and search performance [35]. Grimsmo [13] shows the performance of hierarchies of suffix arrays compared to a generalised suffix tree.
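The hierarchy-of-static-indexes idea can be sketched for a simple membership index. This is my own insert-only illustration of the logarithmic method, not the structures from [35] or [13]; the class name LogIndex is hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Logarithmic method sketch: a dynamic membership index built from a
// hierarchy of static sorted arrays, where level i holds either
// nothing or a sorted array of exactly 2^i keys.  Insertion merges
// carried levels upwards; a query binary-searches every level, giving
// the O(log |S|) overhead over a single static index.
class LogIndex {
    std::vector<std::vector<int>> levels_;  // levels_[i].size() is 0 or 2^i
public:
    void insert(int key) {
        std::vector<int> carry{key};  // a static "index" of size 2^0
        for (std::size_t i = 0;; ++i) {
            if (i == levels_.size()) levels_.emplace_back();
            if (levels_[i].empty()) {
                levels_[i] = std::move(carry);  // free slot: drop the carry here
                return;
            }
            // Occupied: merge two sorted arrays of size 2^i into one of
            // size 2^(i+1) and continue upwards.
            std::vector<int> merged(carry.size() + levels_[i].size());
            std::merge(carry.begin(), carry.end(),
                       levels_[i].begin(), levels_[i].end(), merged.begin());
            levels_[i].clear();
            carry = std::move(merged);
        }
    }
    bool contains(int key) const {
        for (const auto& lvl : levels_)  // one probe per level
            if (std::binary_search(lvl.begin(), lvl.end(), key)) return true;
        return false;
    }
};
```

Each key is re-merged O(log |S|) times over its lifetime, mirroring the build-time overhead stated above; deletions and the substring-index case are left out of this sketch.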
A problem related to the occurrence listing problem is the document listing problem for sets of strings: given a pattern, find all documents in which it occurs. Muthukrishnan [31] shows how to solve this problem optimally by using a suffix tree with additional information. The technique used can be adapted to suffix arrays and similar structures. An open problem (to the author's knowledge) is solving the document listing problem optimally for dynamic sets of documents.
1.4 Report Overview
Section 2 details how to efficiently implement a suffix tree using dynamic arrays for the parent–child relationship, with a low space overhead compared to sibling lists. Section 3 describes some of the choices made in the tested implementation of enhanced suffix arrays. Section 4 features extensive tests of many substring indexes, on build time and search performance, with many types of test data. It is shown how the indexes react to increasing data size, varying alphabet size, and varying text randomness. Section 5 draws conclusions from the experiments, and gives some guidelines for when the various types of index structures are suitable.
2 Implementation of Suffix Trees
The suffix tree data structure consists of a set of nodes. These nodes are linked together with parent–child edges and suffix links. One of the major choices when implementing suffix trees is how to represent the parent–child relationship. McCreight [29] describes the use of linked lists (called sibling lists) and hash tables. His article states that using an array of pointers "would be fast to search, but slow to initialise and prohibitive in size for a large alphabet." This was true in 1976.
2.1 Erroneous Assumptions on Using Arrays for Child Lists
Various authors have made assumptions about the efficiency of different ways to implement suffix trees. Bedathur and Haritsa [2] claim that using arrays would result in wasted space, as there would be many null pointers. They say this would be especially severe in the lower parts of the tree. This implies that the authors do not consider the possibility of storing the pointers in unsorted dynamic arrays, or that they consider such a solution to be inefficient. Tian et al. [41, page 288] make similar claims, referring to [29] and [2].
This report presents results showing that using dynamic arrays for storing the parent–child relationship clearly outperforms using sibling lists in terms of speed, at the cost of using slightly more space.
2.2 Parent–Child Relationship
Suffix tree construction is usually described as being Θ(n), under the assumption that the size of the alphabet can be viewed as a constant. The most common way of implementing suffix trees is using sibling lists. This is a simple solution, which gives a construction time of O(nσ), and is effective for small alphabets, such as DNA. The alphabet factor is most visible in trees for highly random strings from large alphabets, where there are fewer nodes, but a higher average branching factor.
When considering general alphabets (sorting only possible by comparison), the running time of any suffix tree construction algorithm has a worst case of Ω(n log n), as the problem has sorting complexity [6]. Farach [5] shows how to construct suffix trees for integer alphabets (σ ≤ n) in Θ(n) time by recursively constructing a suffix tree for the odd-numbered suffixes, building a tree for the even-numbered suffixes from the information found, and then merging these trees.
McCreight [29] proposed to use hashing as an alternative to sibling lists. Kurtz [22] shows how to do this effectively. Since there is an initial space overhead for each hash table, a single table is used for all nodes in the tree. The keys in this table are pairs of parent node numbers and the first symbols on edges, and the values are child node numbers. With this scheme, the expected construction time is Θ(n), but finding all occurrences when searching can be very slow, as a lookup for children must be done on all possible symbols in Σ. With sibling lists, all z occurrences are found in O(mσ + z), while with a hash map the expected time is O(m + zσ). It is possible to implement a combination of a hash map and a linked list, where the children of a node are linked together, as shown by Grimsmo [13]. The proposed structure uses less space than the combination of sibling lists and a hash table, but it is complex and slow in practice. Simply using both sibling lists and hashing would probably be faster, at the cost of using more space.
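A minimal sketch of the single shared hash table, assuming 32-bit node numbers and byte symbols; the struct and member names are my own, not Kurtz's.

```cpp
#include <cstdint>
#include <unordered_map>

// One shared hash table for all parent–child edges: the key packs a
// parent node number with the first symbol on the edge, the value is
// the child node number.  Looking up one child is expected O(1), but
// enumerating all children of a node needs one probe per possible
// symbol in the alphabet, which is what makes occurrence listing slow.
struct ChildTable {
    std::unordered_map<std::uint64_t, std::uint32_t> edges;

    static std::uint64_t key(std::uint32_t parent, unsigned char symbol) {
        return (static_cast<std::uint64_t>(parent) << 8) | symbol;
    }
    void link(std::uint32_t parent, unsigned char symbol, std::uint32_t child) {
        edges[key(parent, symbol)] = child;
    }
    // Returns the child reached on `symbol`, or -1 if there is none.
    std::int64_t child(std::uint32_t parent, unsigned char symbol) const {
        auto it = edges.find(key(parent, symbol));
        return it == edges.end() ? -1 : static_cast<std::int64_t>(it->second);
    }
};
```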
Bedathur and Haritsa [2] propose storing all child pointers inside the nodes in their disk-based suffix tree construction for very small alphabets (σ = 4). However, with this approach, the array storing the pointers must be of a constant size.
2.3 Sibling List Implementation
The implementation by Kurtz [22] is the fastest space-efficient implementation of linear-time suffix trees known to the author. Nodes are stored as integer fields in an array, and node references are indexes into this array. The following layout is used for internal nodes:
• first-child – Pointer to the first node in the list of children
• branch-brother – Pointer to the next sibling
• suffix-link – Pointer to the node α, if this is node χα
• head-position – The starting position of the second suffix in the string represented by this node
• depth – The depth of the node

The fields head-position and depth are used to find the string the incoming edge represents (see [22]). These are used instead of start and end positions because they do not change for a given node during construction. The number of internal nodes is not known in advance, and internal nodes are therefore kept in a dynamic array. There are always exactly n + 1 leaf nodes, which can be stored in a static array. The only field needed in leaf nodes is the branch-brother pointer.
As the space needed is at most 5n words for the internal nodes, and n words for the leaf nodes, an upper limit for the total space needed for the suffix tree is 6n words. A word must be ⌈log n⌉ + 1 bits to index all nodes and string positions. A total of 24n bytes is needed if 32-bit words are assumed, plus n bytes for storing the string. Kurtz shows how to reduce this to 20n bytes, by storing suffix-link in the branch-brother field of the last child. Another trick shown is using small nodes, which are internal nodes with fewer fields, exploiting redundant information in the tree. On average, the space usage is about 13n bytes, when supporting strings up to 500 MB.
When the number of children per node is high, child lookup is a costly part of tree traversal, and even more so on a computer architecture where cache misses are expensive. When traversing a sibling list, two cache misses are expected per child visited: one for looking up fields in the node, and one for extracting the first symbol on the incoming edge to the child.
2.4 Child Array Implementation
On modern computers, a miss in the level 1 cache costs around 20 cycles, while a miss in the level 2 cache costs around 200 cycles, more than four times the cost of an integer division [17] (and more if pipelined instructions are considered). This implies that using dynamic arrays and copying could be preferable in many applications where linked lists were previously considered the best option. Below follows a description of how to implement suffix trees with child arrays efficiently.

Each internal node refers to a child array. Child arrays come in a set of predefined sizes. Arrays of the same size are stored together in a container, giving allocation and de-allocation in Θ(1) time by storing free locations in a linked list, with the pointers saved inside the free slots. When a node needs room for more children, a new child array of larger size is allocated, the child pointers are copied there, and the old array is released.
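The allocation scheme just described can be sketched as one container holding child arrays of a single size class, with the free list threaded through the free slots themselves. The class and member names are my own invention, not the thesis code.

```cpp
#include <cstdint>
#include <vector>

// One container of child arrays of a single fixed size: released
// arrays are threaded into a free list whose "next" pointers live
// inside the freed slots, so both allocation and release take
// Theta(1) time.
class SlotPool {
    static constexpr std::uint32_t NONE = 0xffffffffu;
    std::vector<std::uint32_t> storage_;  // all arrays of this size class
    std::uint32_t slot_words_;            // words per child array
    std::uint32_t free_head_ = NONE;      // head of the free list

public:
    explicit SlotPool(std::uint32_t slot_words) : slot_words_(slot_words) {}

    // Returns the position of a fresh child array inside the container.
    std::uint32_t allocate() {
        if (free_head_ != NONE) {            // reuse a released slot
            std::uint32_t pos = free_head_;
            free_head_ = storage_[pos];      // next pointer stored in the slot
            return pos;
        }
        std::uint32_t pos = static_cast<std::uint32_t>(storage_.size());
        storage_.resize(storage_.size() + slot_words_);
        return pos;
    }

    void release(std::uint32_t pos) {        // push the slot onto the free list
        storage_[pos] = free_head_;
        free_head_ = pos;
    }

    std::uint32_t& word(std::uint32_t pos, std::uint32_t i) {
        return storage_[pos + i];
    }
};
```

A full implementation would keep one such pool per predefined child array size, and growing a node's child array becomes allocate-copy-release across two pools.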
Table 1 shows the layout of nodes in the implementation used. Two new fields replace first-child and branch-brother in internal nodes: carr-size gives which child array size is used, while carr-pos gives the position of the child array in its container. Since there is no branch-brother field, leaf nodes do not need to be explicitly represented at all. The reasoning is the same as for a hash table implementation [22]. The fields in-symb and in-ref listed in the table are the space needed for each node in its parent's child array. In a traditional sibling list implementation, a lookup in the string is required for each child considered when looking up a specific symbol. This is avoided here by storing the first symbol on the incoming edge interleaved with the node references, which reduces the number of cache misses. While an internal node is identified by its index in memory, a leaf node can be identified by its head-position. The depth can be found by subtracting this from the length of the string. A bit in each node reference is needed to distinguish between internal and leaf nodes. As relatively few cache misses are expected compared to the number of nodes considered in child traversal, this should prove more efficient than sibling lists.
Table 1: Node layouts with child arrays

  Large         B | Small         B | Leaf          B
  carr-size     1 | carr-size     1 |
  carr-pos      4 | carr-pos      4 |
  suffix-link   4 | dist          1 |
  hpos          4 |                 |
  depth         4 |                 |
  Child array space usage
  in-symb       1 | in-symb       1 | in-symb       1
  in-ref        4 | in-ref        4 | in-ref        4
  Sum          22 | Sum          11 | Sum           5
The term array doubling is often used when describing dynamic arrays, but it is a bit misleading, as the growth factor can be different from 2. The amortised asymptotic cost of inserting n values into the array is Θ(n) for any constant growth factor. If a growth factor of 2 is used, and n values are inserted into a dynamic array, the worst-case total number of insertions and re-insertions is about 3n. If the growth factor is 1.1, the worst case is about 12n. Because of the cache effects on modern computers, copying is very cheap, and using a low growth factor is affordable.
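The 3n and 12n figures can be checked with a small simulation (my own code, counting the initial insertions plus the copies performed at each reallocation):

```cpp
#include <cstddef>

// Counts the total number of element writes (insertions plus copies at
// each reallocation) when n values are appended to a dynamic array
// that multiplies its capacity by `growth` whenever it fills up.
std::size_t total_writes(std::size_t n, double growth) {
    std::size_t cap = 1, size = 0, writes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == cap) {                 // array full: reallocate
            std::size_t next = static_cast<std::size_t>(cap * growth);
            if (next <= cap) next = cap + 1;  // guard against truncation
            writes += size;                // copy everything into the new array
            cap = next;
        }
        ++size;
        ++writes;                          // the insertion itself
    }
    return writes;
}
```

For n just past a power of two, a growth factor of 2 gives close to 3n writes, while 1.1 gives roughly 12n, matching the worst cases stated above.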
The total space needed for this implementation, disregarding overhead from dynamic arrays, is 27n bytes in the worst case. This is 27/20 = 1.35 times more than for sibling lists. In practice, the space usage is lower, as the number of internal nodes is usually around 0.5n to 0.7n [22]. The space wasted internally in each child array should also be considered, in addition to the overhead due to the growth factor in the storage for the groups of child arrays. With a growth factor of 1.1, the cost for child arrays is 17 · 1.1 + (5 + 5) · 1.1² ≈ 30.8, while for sibling lists it is 16 · 1.1 + 4 ≈ 21.6. This gives a ratio of 30.8/21.6 ≈ 1.43. This is also lower in practice, as the out-degree of internal nodes often has a very skewed distribution, which can be taken into account when configuring the predefined child array sizes.
3 Implementation of Enhanced Suffix Arrays
An enhanced suffix array is also tested in the following experiments. The implementation is the author's translation of the pseudo-code from Abouelhoda et al. [1] into C++. Some choices for the data structures affect the performance, especially query lookup, and are therefore discussed here.

The data structures needed for substring search with an enhanced suffix array are the suffix array itself, the LCP table and the child table (CLD), all of which have the same length. The first choice to be made in an implementation is whether to store the three as separate arrays, or as an array of structs with three fields. As the fields for a given "position" in the enhanced suffix array are often read together during search, the latter choice should give better locality and performance.
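The two layouts can be sketched as plain types; these are illustrative declarations of my own, not the thesis code.

```cpp
#include <cstdint>
#include <vector>

// Separate arrays (the esa1 variant below): the three fields for
// position i live in three different memory regions, so reading them
// together during search can cost up to three cache misses.
struct EsaSeparate {
    std::vector<std::uint32_t> sa, lcp, cld;
};

// Array of structs (the esa2 variant): the fields for position i are
// adjacent, so one cache line typically serves all three reads.
struct EsaEntry {
    std::uint32_t sa, lcp, cld;
};
using EsaPacked = std::vector<EsaEntry>;
```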
Abouelhoda et al. describe how to store the LCP table in a byte array, overflowing into another data structure, where lookup is done by binary search. This was not done here, because it gave a considerable slowdown on some of the benchmarks, as they had a high average LCP. On the file w3c2 (see Table 2), there was a 36% overflow with 1-byte LCP fields, and 11% overflow with 2 bytes.
The CLD table can also be stored in an overflowing byte array [1]. This gave a very low overflow on the tests run here, but still a considerable slowdown during search. This is because the "nodes" close to the root of the virtual tree are the most likely to overflow, as they "span" larger portions of the enhanced suffix array. Even though only a small portion of all nodes overflow, a large portion of the nodes traversed during a query lookup are expected to have overflowing CLD fields.
In the implementations tested, all three mentioned fields were stored as four-byte integers, resulting in a space usage of 3 · 4n + n = 13n bytes including storage for the text itself. If the LCP and CLD fields are stored as overflowing bytes, the expected space usage is 4n + 3 · n = 7n bytes. Performance for an implementation where the CLD values were stored in bytes overflowing into a hash table is also given in some of the tests on query lookup in Section 4.7.

For the tests on tree traversal operations in Section 4.9, two additional fields were used. The "left" and "right" suffix link pointers were stored in 2 · 4n bytes, and the inverse suffix array used to create them was stored in 4n bytes. Abouelhoda et al. [1] describe how to store the suffix link pointers in an expected 2 · n bytes.
4 Experiments
Below follow experiments testing many substring index implementations on various types of text data and queries. The experiments were designed to emphasise the strengths and weaknesses of the various methods. Construction time, space usage and query performance were tested.
4.1 Test Data
Both "real world" and synthetic data were used in the tests. The largest tests from the Canterbury Corpus [3] and the tests from Manzini and Ferragina's test collection [30] were used, in addition to some artificial data generated by the author. The results for these tests are given in Section 4.3.

To underline the properties of the methods, other synthetic data was also used: space-separated words from a Zipf distribution (Section 4.4), uniform random data with varying alphabet size (Section 4.5), and first-order Markov data with a Zipf distribution on the transitions (Section 4.6).
4.2 Tested Implementations
The following implementations were tested:
• STa – The author's implementation of suffix trees using child arrays. See Section 2.4.
• STs – Using sibling lists. See Section 2.3.
• Kur – The suffix tree implementation by Kurtz [22], taken from MUMmer 3.15². It was compiled using the internal flag STREE_HUGE, which gives a maximum data size of 500 MB.
• wotde – The Write Only Top Down Eager³ suffix tree construction algorithm [11].
• DS – Deep-shallow suffix array construction algorithm⁴ [28]. Default construction parameters used.
• esa1[b] – Enhanced suffix array. The implementation is a translation of the pseudo-code from [1] into C++. Node fields stored in separate arrays. Initial suffix array built with deep-shallow. Suffix b denotes that CLD values were stored in bytes overflowing into a hash map.
• esa2[b] – Enhanced suffix array stored as an array of structs.
• BPR – The Bucket Pointer Refinement suffix array construction algorithm⁵ [40]. Default construction parameters used.
• LZ – LZ-index⁶ [33], implementation by Navarro and Arroyuelo.
• CCSA – Compressed compact suffix array⁷ [24], implementation by Mäkinen and González.
² http://mummer.sourceforge.net/
³ http://bibiserv.techfak.uni-bielefeld.de/wotd/
⁴ http://www.mfn.unipmn.it/~manzini/lightweight/
⁵ http://bibiserv.techfak.uni-bielefeld.de/bpr/
⁶ http://pizzachili.dcc.uchile.cl/indexes/LZ-index/
⁷ http://pizzachili.dcc.uchile.cl/indexes/Compressed_Compact_Suffix_Array/
• FM – FM-index⁸ [8], implementation (version 2) by Ferragina and Venturini.
• AFFM – Alphabet-Friendly FM-index⁹ [9], implementation (version 2) by González.
• RLFM – Run-Length FM-index¹⁰ [25], implementation by Mäkinen and González.
• SSA – Succinct suffix array¹¹ [26] (FM family), implementation (version 2) by Mäkinen and González.
All programs were compiled with gcc 4.1.2 with -O3 optimisation, and used glibc 2.3.6. They were run on an AMD Athlon 64 3500+ with 512 KB L2 cache, running Debian Sid with Linux 2.6.16. The kernel had the perfctr [37] patch, which makes hardware performance counters readable. The PAPI [36] interface was used to read the variable PAPI_L2_TCM, giving level 2 cache misses. The times given are all wall-clock timings, excluding the time for reading files into memory. The memory usage given is VmRSS (resident set size) and VmHWM (resident set size peak) from /proc/$pid/status (what is used by the tools top and ps). When running the tests, all initialisation of measurement variables and reading of data was done before each timing was started, and all related calculations were done after it was stopped.
4.3 General Tests
The following tests include data from the Canterbury Corpus [3] and Manzini and Ferragina's test collection [30]. In addition, the 35th Fibonacci string is used. Note that the odd numbered¹² Fibonacci strings (counting from 1) give the maximum number of nodes in suffix trees (2n + 1), and give the worst case in many non-linear suffix array construction algorithms [40]. The test a025 is 2^25 subsequent a's. Table 2 gives some statistics for the tests: length, alphabet size, LCPs, and first order empirical entropy [34].
Table 3 shows the construction speed for the tested methods, given as symbols indexed per second. The rationale for giving this instead of absolute time is easier comparison between the tests shown in the tables, and clearer differentiation between the best methods in the plots that follow. An "x" in the tables denotes that the test crashed, while "-" denotes that it took too long to complete. "N/A" denotes that the implementation was not designed to handle this alphabet size. Table 4 shows symbols indexed per L2 cache miss. It is expected that this is related to the construction performance.
The suffix trees with child arrays clearly perform better than the sibling list variants, except on the small DNA tests, where they are slightly slower, and on the Fibonacci string, which has an alphabet of size 2. The advantage comes from the improved locality, as can be read from Table 4. Kur and STs have similar behaviour, with the former being slightly faster. This is probably because the latter was originally developed to index dynamic sets of documents. Wotde has a varying performance, which seems dependent on the average
⁸ http://pizzachili.dcc.uchile.cl/indexes/FM-indexV2/
⁹ http://pizzachili.dcc.uchile.cl/indexes/Alphabet-Friendly_FM-Index/
¹⁰ http://pizzachili.dcc.uchile.cl/indexes/Run_Length_FM_index/
¹¹ http://pizzachili.dcc.uchile.cl/indexes/Succinct_Suffix_Array/
¹² In the even numbered Fibonacci strings, a is always followed by b.
APPENDIX A. OTHER PAPERS

Table 2: Test statistics. The first three come from the Canterbury Corpus [3], while the next 10 come from Manzini and Ferragina's test collection [30].

#   Name             Length     σ    Med. LCP  Avg. LCP  Max. LCP  Entr. (1)  Description
1   bible.txt        4047393    63   11        14        551       3.27       King James bible
2   E.coli           4638690    4    11        17        2815      1.98       Escherichia coli
3   world192.txt     2473400    94   13        23        559       3.66       CIA world fact book
4   chr22.dna        34553758   5    13        1979      199999    1.88       Human chromo. 22
5   etext99          105277340  146  12        1109      286352    3.57       Project Gutenberg
6   gcc-3.0.tar      86630400   150  21        8603      856970    3.80       gcc 3.0 source files
7   howto            39422105   197  12        268       70720     3.92       Linux HOWTO text
8   jdk13c           69728899   113  113       679       37334     3.54       HTML and Java
9   linux-2.4.5.tar  116254720  256  17        479       136035    3.90       Linux Kernel 2.4.5
10  rctail96         114711151  93   61        282       26597     3.47       Reuters news in XML
11  rfc              116421901  120  19        93        3445      3.40       RFC text files
12  sprot34.dat      109617186  66   32        89        7373      3.93       Swiss Prot database
13  w3c2             104201579  256  114       42300     990053    4.06       HTML from w3c.org
14  fib035           9227465    2    2306866   2435423   5702885   0.59       35th Fibonacci string
15  rand254          104857600  254  3         2.86      6         7.98       Uniform, |σ| = 254
16  a025             33554432   1    16777217  16777216  33554431  0.00       a repeated 2^25 times
Table 3: Symbols indexed per second in thousands.

    STa   STs   Kur    wotde  skew  DS     BPR    esa1   esa2  LZ     CCSA  FM    AFFM  RLFM  SSA
1   1948  1126  1475   1556   531   3935   3430   1897   1816  3201   1011  2525  630   2286  2127
2   1370  1046  1377   1256   518   3770   3374   1728   1658  4504   869   2365  783   2108  2251
3   2335  1263  1745   1857   570   4833   3241   1934   2056  3211   1129  2884  522   2622  2310
4   1212  932   1165   22     426   2900   2250   1236   1325  3753   730   1712  669   1825  1888
5   1332  632   700    127    309   1736   1292   938    996   1873   606   1430  341   1240  1200
6   2227  1044  1273   12     398   1300   1813   913    920   2457   630   1804  350   1122  1036
7   1677  707   791    468    391   2763   2128   1372   1401  2110   796   2006  332   1781  1659
8   3321  2282  3598   178    429   1251   1341   880    892   3553   579   1360  387   965   879
9   2049  883   1032   258    366   2520   2086   1348   1365  N/A    N/A   N/A   N/A   N/A   N/A
10  2302  1198  1447   287    369   1091   1039   741    767   2708   519   1216  371   855   801
11  1842  864   1005   533    356   2224   1463   1171   1216  2212   712   1769  410   1550  1433
12  2029  778   925    499    351   1904   1261   1029   1087  2445   658   1594  473   1356  1267
13  3052  1509  2138   -      403   1178   1484   825    853   N/A    N/A   N/A   N/A   N/A   N/A
14  4586  5346  12375  -      889   91     267    89     89    21400  73    502   73    76    76
15  606   30    32     1957   473   2470   1218   1020   1094  647    592   x     191   1158  1252
16  4547  5931  14767  -      4338  41708  22805  12904  9677  55966  3290  -     1012  9558  10168
Table 4: Symbols indexed per L2 cache miss.

    STa    STs     Kur     wotde    skew   DS     BPR     esa1   esa2   LZ    CCSA   FM    AFFM   RLFM   SSA
1   0.30   0.15    0.13    0.18     0.039  0.48   0.34    0.20   0.22   0.44  0.15   0.34  0.13   0.34   0.34
2   0.16   0.13    0.13    0.14     0.038  0.43   0.37    0.17   0.19   0.61  0.15   0.30  0.11   0.31   0.31
3   0.40   0.18    0.16    0.27     0.042  0.67   0.40    0.23   0.26   0.47  0.18   0.44  0.14   0.44   0.44
4   0.18   0.14    0.15    0.012    0.036  0.29   0.32    0.14   0.16   0.47  0.11   0.22  0.096  0.23   0.23
5   0.22   0.092   0.088   0.057    0.032  0.18   0.16    0.11   0.12   0.24  0.088  0.16  0.082  0.15   0.15
6   0.49   0.16    0.15    0.00087  0.036  0.19   0.15    0.13   0.14   0.34  0.10   0.23  0.11   0.19   0.19
7   0.28   0.091   0.082   0.11     0.033  0.29   0.24    0.15   0.17   0.29  0.11   0.23  0.11   0.22   0.22
8   1.2    0.56    0.54    0.031    0.037  0.17   0.093   0.12   0.12   0.43  0.085  0.17  0.090  0.14   0.14
9   0.44   0.14    0.13    0.068    0.035  0.28   0.19    0.16   0.17   N/A   N/A    N/A   N/A    N/A    N/A
10  0.50   0.20    0.19    0.023    0.036  0.11   0.069   0.083  0.088  0.33  0.067  0.14  0.067  0.10   0.10
11  0.36   0.13    0.13    0.068    0.036  0.24   0.13    0.14   0.15   0.30  0.10   0.21  0.10   0.19   0.19
12  0.39   0.11    0.11    0.052    0.035  0.20   0.11    0.13   0.13   0.31  0.093  0.19  0.093  0.17   0.17
13  0.97   0.26    0.26    -        0.036  0.16   0.12    0.11   0.12   N/A   N/A    N/A   N/A    N/A    N/A
14  11     10      18      -        0.063  0.015  0.0095  0.014  0.014  4.1   0.013  0.11  0.013  0.014  0.014
15  0.099  0.0031  0.0033  0.18     0.056  0.34   0.16    0.14   0.15   0.12  0.12   x     0.089  0.25   0.25
16  23     28      49      -        1.9    130    55      26     14     419   20     -     4.1    32     32
LCP (see Table 2). For tests larger than 5 MB, it is faster than the regular suffix trees only on the random data. The trees using sibling lists seem unsuited for random data with large alphabets. One might have expected the time and cache performance of the suffix trees to be directly proportional to the alphabet size, but it also depends on other properties of the data. A test where only the alphabet size is varied is given in Section 4.5.

The cost of building the enhanced suffix array is rather high. A reason for this is the way data is laid out in memory. The improvement of esa2 over esa1 does not show very well in this test on construction performance, as the data fields are built by separate algorithms. The tests in Section 4.7 show that esa2 is often significantly faster on query lookup. In general, the regular suffix tree with child arrays has faster construction than both the enhanced suffix array and wotde.

DS and BPR are faster than the linear time suffix trees on most of these tests. The exception is those with very high LCP.

All the compressed structures except LZ use a suffix array built with deep-shallow as a starting point for their construction. The build times for the compressed representations seem very affordable. The LZ index has fast construction, and is even faster than all non-compressed methods on many of the tests. For all the compressed indexes there is a strong relationship between time and cache performance. This can be seen when comparing, for each method, the results on the different tests in Tables 3 and 4.

Query performance is not shown in this test because it depends on too many parameters, such as the length of the query, the number of hits, and the data distribution. Query performance is evaluated in Section 4.7.

Table 5 shows the memory usage for the various implementations, in bytes per symbol after the construction was finished. The peak space usage is shown in Table 6. The memory usage of the application as seen externally is measured, which could give slightly inconsistent results because of space re-allocation. One method might use all its allocated memory, while another may have allocated more just before it finished.

STa uses 15-25% more memory than STs. Both methods use more memory on the DNA tests, probably because of a smaller average branching factor and many internal nodes. Apart from that, all non-compressed methods have a rather constant space usage. Both wotde and esa1/2 use slightly less space than the regular suffix trees. Among the construction algorithms for non-compressed structures, BPR is the only one using considerable extra space during construction. DS is much more space efficient, and has virtually no space overhead.

The FM index is a clear winner on space usage on most of the tests. The exception seems to be those with a large alphabet. The other indexes from the FM family, AFFM, RLFM and SSA, fare much better there. The LZ index uses more space than a suffix array on some of the smaller tests, but around half the space on the larger tests.

The working space used during construction varies for the compressed indexes, as seen in Table 6. CCSA and AFFM need around 8-9 bytes per symbol, while the rest of the FM family needs a little more than 6. The working space for the LZ index strongly depends on the properties of the text data, as the tree data structures built utilise its repetitions.
Table 5: Space usage in bytes per symbol indexed after construction.

    STa  STs  Kur  wotde  skew  DS   BPR  esa1  esa2  LZ   CCSA  FM    AFFM  RLFM  SSA
1   16   13   13   11     7.3   5.5  5.4  13    13    6.2  1.8   1.1   2.4   1.4   1.4
2   20   17   16   11     6.5   5.4  5.3  13    13    4.4  3.0   1.00  1.0   1.3   1.0
3   16   13   13   11     7.4   5.8  5.6  14    14    7.4  1.9   1.4   1.8   1.7   1.9
4   19   16   16   11     5.8   5.1  5.0  13    13    2.4  2.5   0.62  0.66  0.92  0.69
5   16   13   13   10     5.2   5.0  5.0  13    13    2.8  1.9   0.68  0.83  0.98  1.0
6   16   13   13   11     5.5   5.0  5.0  13    13    2.5  1.1   0.59  0.96  0.85  1.1
7   16   13   13   11     5.8   5.0  5.0  13    13    2.9  1.6   0.75  2.0   0.98  1.1
8   16   13   13   11     5.3   5.0  5.0  13    13    1.9  0.70  0.43  0.76  0.73  1.2
9   16   13   13   11     5.3   5.0  5.0  13    13    N/A  N/A   N/A   N/A   N/A   N/A
10  15   12   12   10     5.2   5.0  5.0  13    13    2.0  0.98  0.46  0.73  0.79  1.1
11  16   13   13   11     5.3   5.0  5.0  13    13    2.5  1.2   0.60  0.84  0.85  1.0
12  15   13   13   10     5.2   5.0  5.0  13    13    2.5  1.4   0.57  0.90  0.88  1.1
13  16   13   14   -      5.2   5.0  5.0  13    13    N/A  N/A   N/A   N/A   N/A   N/A
14  19   16   17   -      5.2   5.2  5.2  13    13    1.3  0.64  0.39  0.49  0.85  0.69
15  13   9.5  8.4  6.4    5.0   5.0  5.0  13    13    5.8  4.0   x     2.5   2.0   1.5
16  19   16   17   -      5.1   5.1  5.0  13    13    1.1  0.50  -     3.5   0.69  0.49
Table 6: Peak space usage during construction, in bytes per symbol.

    STa  STs  Kur  wotde  skew  DS   BPR  esa1  esa2  LZ   CCSA  FM   AFFM  RLFM  SSA
1   16   13   13   11     18    5.5  11   13    13    6.7  8.3   6.5  10    6.7   6.7
2   20   17   16   11     18    5.4  10   13    13    4.9  8.4   6.5  11    6.7   6.7
3   16   13   13   11     18    5.8  12   14    14    7.9  8.6   6.8  9.1   7.0   7.0
4   19   16   16   12     18    5.1  10   13    13    4.1  8.4   6.2  9.8   6.3   6.3
5   16   13   13   10     17    5.0  10   13    13    5.6  8.5   6.1  9.0   6.3   6.3
6   16   13   13   11     18    5.1  10   13    13    5.8  8.5   6.1  8.2   6.3   6.3
7   16   13   13   11     18    5.1  11   13    13    6.6  8.4   6.1  9.7   6.3   6.3
8   16   13   13   11     18    5.0  10   13    13    4.2  8.5   6.1  7.7   6.3   6.3
9   16   13   13   11     17    5.0  11   13    13    N/A  N/A   N/A  N/A   N/A   N/A
10  15   12   12   10     17    5.0  10   13    13    4.7  8.5   6.1  8.0   6.3   6.3
11  16   13   13   11     17    5.0  10   13    13    5.4  8.5   6.1  8.3   6.3   6.3
12  15   13   13   10     17    5.0  10   13    13    5.7  8.5   6.1  8.4   6.3   6.3
13  16   13   14   -      17    5.0  11   13    13    N/A  N/A   N/A  N/A   N/A   N/A
14  19   16   17   -      17    5.2  10   13    13    1.4  8.3   6.3  7.6   6.5   6.5
15  13   9.5  8.4  7.7    13    5.0  11   13    13    20   9.3   x    12    6.3   6.3
16  19   16   17   -      17    5.1  10   13    13    1.1  8.4   -    59    6.3   6.3
4.4 Test on Increasing Data Size
Figure 1 gives the results of indexing increasing amounts of space separated word data from a Zipf distribution (parameter s = 1.0). The figure shows symbols indexed per second and per L2 cache miss. Notice that these two quantities are closely related, both here and in the following tests.

The suffix tree STa is almost three times faster than STs, even though they logically perform the same steps. This is because of better locality, which can be read from the plot for cache performance. STa traverses a contiguous array to look up the child of a node, while STs traverses sibling nodes at "random" memory locations. DS and BPR here exhibit the most non-linear performance, suggesting that for larger data, linear time suffix array constructions would be faster. Wotde, which is also a non-linear method, is faster than STs, but slower than STa.
One would expect that the reason STa and STs show a slightly non-linear behaviour is that a smaller relative proportion of the tree fits in cache. In the construction algorithms the tree is traversed in a seemingly random pattern along a certain depth. The average depth is the expected longest common prefix of closest pairs among the suffixes added so far. For random data, this is log_σ n [4]. But the cache performance is more linear than the time performance for both STa and STs. The additional slowdown might be due to memory management overhead.

The LZ index shows great indexing performance. It has a more linear behaviour than DS, and is faster when the data size reaches 100 MB. For all the other compressed implementations, the time and cache misses for the initial construction of the suffix array by deep-shallow have been subtracted. This is done because this construction in many cases dominates the running time, and made it very hard to interpret the cost of building the compressed representations themselves in Sections 4.5 and 4.6. For FM, RLFM and SSA, the cost of the compression is less than the cost of the initial build. An interesting feature in the plot for cache performance is that FM, RLFM and SSA are nearly identical. This must be because they have similar access patterns to their data structures. Figure 3 shows the space usage for the compressed methods on this test. FM is twice as space efficient as the next method for 100 MB.

Figure 4 shows a similar test, but with uniform random data (σ = 20). DS and esa1/2 show approximately the same performance as in the last test, but the other methods do not. Note that BPR here seems to degrade less than DS, and might have been the fastest method with even more data. Wotde is now much faster, while the other suffix trees are slower. Random data gives a lower average LCP, which benefits wotde, but also a higher average out degree in the nodes, which is bad for the regular suffix trees. The worst effect is seen for the sibling lists, where the performance is halved from what was seen for Zipf data in Figure 1. All compressed structures perform slightly worse on the random data. This is probably because there are fewer regularities in the text, and the index structures grow larger.
[Plot: symbols indexed per second against data size in MiB; (a) STa, STs, wotde, skew, DS, esa2, BPR, LZ; (b) CCSA, FM, AFFM, RLFM, SSA.]
Figure 1: Indexing performance on increasing data size, Zipf word data. Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per L2 miss against data size in MiB; same series as Figure 1.]
Figure 2: Continuing Figure 1. Indexing performance on increasing data size, Zipf word data. Construction of initial suffix array not included for compressed indexes.

[Plot: bytes per symbol against data size in MiB; series: CCSA, FM, AFFM, RLFM, SSA, LZ.]
Figure 3: Space usage for compressed structures after construction, Zipf word data, increasing size.
4.5 Increasing Alphabet Size
Figure 6 shows performance when indexing 10 MB of uniformly random data with an increasing alphabet size.

DS here clearly outperforms the other methods for large alphabets, but has a rather peculiar behaviour. This suffix array construction algorithm combines two different sorting techniques, where one utilises the results from the other. The behaviour seen may be due to the individual performance of these, and the way they are combined. Wotde shows great performance as the alphabet increases, because of the decreasing average LCP. The opposite behaviour is seen for the other suffix trees, where an increasing number of children for internal nodes slows down the construction. For STs the performance decreases by a factor of 30 as the alphabet size goes from 2 to 254. The same effect is seen to a lesser degree for STa, with a drop factor of 2.

The construction times of all compressed indexes except CCSA seem strongly dependent on the alphabet size, but the number of cache misses is not. Seemingly, a larger alphabet results in more computation, but as the data accesses are already rather random, there are not more cache misses, even though the data structures are larger. Figure 8 shows the space usage in bytes per symbol. The FM index is extremely efficient for small alphabets, and is best of all methods up to an alphabet size of around 128. Remember that this very artificial test gives the worst case space usage for the compressed indexes, as the entropy of the data is maximal for the given alphabet size. The space usage of around
2 bytes per symbol with an alphabet size of 254 for the most space efficient structures is impressive.

[Plot: symbols indexed per second against data size in MiB; (a) STa, STs, wotde, skew, DS, esa2, BPR, LZ; (b) CCSA, FM, AFFM, RLFM, SSA.]
Figure 4: Indexing performance on increasing data size, random data (σ = 20). Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per L2 miss against data size in MiB; same series as Figure 4.]
Figure 5: Continuing Figure 4. Indexing performance on increasing data size, random data (σ = 20). Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per second against alphabet size (2 to 256); (a) STa, STs, wotde, skew, DS, esa2, BPR, LZ; (b) CCSA, FM, AFFM, RLFM, SSA.]
Figure 6: Indexing performance on increasing alphabet size, uniform data. Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per L2 miss against alphabet size (2 to 256); same series as Figure 6.]
Figure 7: Continuing Figure 6. Indexing performance on increasing alphabet size, uniform data. Construction of initial suffix array not included for compressed indexes.

[Plot: bytes per symbol against alphabet size (2 to 256).]
Figure 8: Space usage for compressed structures after construction, on increasing alphabet size.
4.6 Markov Data
Many of the methods tested have time and space efficiencies which are dependent on the randomness of the data. Less random data gives a higher average LCP, degrading the performance of some of the non-linear construction methods. Figure 9 shows results for indexing first order Markov data with a Zipfian transition distribution with varying parameter s. An alphabet size of 20 was used. In a general Zipf distribution, the probability of selecting element k is given as

    p(k; s, n) = k^(-s) / (Σ_{i=1}^{n} i^(-s))

Setting s = 0 gives a uniform distribution. For the data used, the average LCP approximately doubled each time the parameter s increased by 1, from 4.7 for s = 0 to 2160 for s = 10.
Wotde has the greatest dependence on the LCP, and is the only method showing a continuous drop in performance here. DS has a strong peak at s = 4, and a total breakdown at s = 7, where the average LCP is 324. The peak probably comes from the
distribution of work between the two methods deep-shallow combines, as was discussed in Section 4.5. The performances of the regular suffix trees increase as the data grows less random, due to lower average branching factors.

[Plot: symbols indexed per second against increasing regularity (s = 0 to 10); (a) STa, STs, wotde, skew, DS, esa2, BPR, LZ; (b) CCSA, FM, AFFM, RLFM, SSA.]
Figure 9: Indexing performance on first order Markov data with Zipfian transition distribution and increasing parameter s (σ = 20). Construction of initial suffix array not included for compressed indexes.

[Plot: symbols indexed per L2 miss against increasing regularity (s = 0 to 10); same series as Figure 9.]
Figure 10: Continuing Figure 9. Indexing performance on first order Markov data with Zipfian transition distribution and increasing parameter s (σ = 20). Construction of initial suffix array not included for compressed indexes.
[Plot: bytes per symbol against increasing regularity (s = 0 to 10); series: CCSA, FM, AFFM, RLFM, SSA, LZ.]
Figure 11: Space usage for compressed structures after construction on Markov data (σ = 20).
LZ shows an extreme performance on this data, which is very repetitive for large s. The curve for LZ continues rising to 35 million symbols per second at s = 10, with 70 symbols indexed per L2 miss. FM, RLFM and SSA also show a significant increase in performance with the regularity. Note that for these methods the bulk of the time used is the initial deep-shallow construction, which is subtracted here. To get more robust performance, a linear time suffix array construction algorithm could be used.

Figure 11 shows the space efficiency for the compressed indexes. As in the previous tests, the FM index is the most space efficient. CCSA and LZ also show very good trends. Although the space usage for LZ is low, it does not match the extreme time performance that was shown in Figure 9a. Various tradeoffs between space usage and query performance could be made in the implementation. All the compressed methods have a dependency on higher order entropy in their asymptotic space usage [34], but the implementations show this to a varying extent here.

For large parameters s, the data has LCPs higher than what you would see in any real world data of reasonable size. In these pure Markov chain strings, the LCP between almost all neighbouring suffixes is very high. The average LCP here should probably be
compared with the median LCP seen in Table 2. At s = 6, the average LCP is 172, and the median LCP is 159.
4.7 Query Lookup
As previously mentioned, many parameters influence query performance. One such parameter is the total data size. Figure 12 shows the number of queries per second and per L2 cache miss, on an increasing amount of Zipf word data (s = 1, σ = 20). Queries of length 30 were run, and only one hit was reported (to isolate lookup time). The performance drop for the suffix trees is due to the increasing number of nodes seen in the downward traversal of the tree, and the number of children considered in child lookup. The latter hits STs harder than STa. The decrease for the suffix array is due to the increasing number of jumps in the binary search. Wotde has the best performance in this test, followed by STa.
Esa1 performs rather poorly, and even worse than STs, which logically performs the same steps. This is related to the number of cache misses, which is many times as high. A sibling traversal is also performed in esa1, but it has bad locality for two reasons. The first is an implementational detail: the information for one "position" in the enhanced suffix array is spread over different arrays, giving unnecessary cache misses. This is implemented differently in esa2, which has performance similar to STs. The second reason is an inherently bad locality in the enhanced suffix array: for an internal "node" close to the root, the values read to traverse the child nodes are spread over a large area. This is the reason esa2 has much slower lookup than STa, and also a more degrading performance as the amount of data increases. The variants esa1b and esa2b have the CLD field stored in bytes, overflowing into a hash table (see Section 3). Even though the total overflow is negligible in terms of space, the query lookup performance is roughly halved, because many nodes in the upper part of the tree overflow.
The query lookup performance of the compressed indexes differs greatly. SSA is<br />
almost 20 times faster than FM on 100 MB of data, but 20 times slower than the fastest<br />
suffix tree. FM has poor time performance, but better cache performance than the other<br />
methods. The author does not know the implementation well enough to comment on this<br />
properly. Note the different scales on the y-axes for the compressed indexes. Searching in the<br />
compressed structures involves many recursive lookups for each logical step in the search.<br />
The values to be read are spread throughout the data structures, giving bad locality.<br />
Figure 14 shows query lookup on data with varying alphabet size. Queries <strong>of</strong> length<br />
30 were issued on 10 MB <strong>of</strong> uniform r<strong>and</strong>om data. Lookup performance degrades in the<br />
suffix tree variants <strong>and</strong> the enhanced suffix array when the alphabet size grows larger.<br />
This is because these methods traverse lists of children in trees, and the average lengths<br />
of these lists increase with the alphabet size. The effect hits STs, esa1 and esa2 worst, as<br />
they have poor locality in the child traversal.<br />
The performance of many of the compressed indexes depends strongly on the alphabet<br />
size. Only CCSA and FM have alphabet-independent asymptotic lookup times [34].<br />
CCSA shows an increase in performance as the alphabet size increases. This is because<br />
the average length <strong>of</strong> the match between the search pattern <strong>and</strong> the suffixes considered in<br />
Paper 7: On performance <strong>and</strong> cache effects in substring indexes<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 12: Query lookup on increasing data size, Zipf word data. (a) Queries per second<br />
for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b. (b) Queries per second for CCSA,<br />
FM, AFFM, RLFM, SSA and LZ. X-axis: data size in MiB.<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 13: Continuing Figure 12. Query lookup on increasing data size, Zipf word data.<br />
(a) Queries per L2 miss for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b. (b) Queries<br />
per L2 miss for CCSA, FM, AFFM, RLFM, SSA and LZ. X-axis: data size in MiB.<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 14: Query lookup on increasing alphabet size, uniform data. (a) Queries per second<br />
for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b. (b) Queries per second for STa,<br />
STs, wotde, skew, DS, esa2, BPR and LZ. X-axis: alphabet size (2 to 256).<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 15: Continuing Figure 14. Query lookup on increasing alphabet size, uniform data.<br />
(a) Queries per L2 miss for STa, STs, wotde, DS, esa1, esa1b, esa2 and esa2b. (b) Queries<br />
per L2 miss for STa, STs, wotde, skew, DS, esa2, BPR and LZ. X-axis: alphabet size.<br />
the binary search decreases, resulting in fewer character comparisons. The implementation<br />
tested does not use the trick of keeping track of how many symbols match at the left<br />
and right borders of the binary search to reduce the expected number of character comparisons.<br />
The methods AFFM, RLFM and SSA all have the same asymptotic lookup cost<br />
[34], but SSA is faster in practice, and has the best query performance of all compressed<br />
indexes for small alphabets.<br />
4.8 Reporting Hits<br />
In the previous tests on query performance a single hit was reported for each query, but<br />
for some applications the efficiency when listing large numbers <strong>of</strong> occurrences is more<br />
relevant. Figure 16 shows the performance when reporting an increasing number <strong>of</strong> hits<br />
from 100 MB <strong>of</strong> uniform data with an alphabet size <strong>of</strong> 20. All search functions were<br />
modified to cap the number <strong>of</strong> hits, <strong>and</strong> a varying number was requested. The lengths <strong>of</strong><br />
the queries were set such that the number <strong>of</strong> hits would be at least what was requested.<br />
Although suffix trees are asymptotically optimal here, both for sibling lists and child<br />
arrays, the suffix arrays have an advantage in practice. After finding the left and right<br />
borders of a match, the values between them are read sequentially, giving excellent spatial<br />
locality and performance. At its peak, DS is more than 15 times faster than STa. The<br />
enhanced suffix array is not included, as its performance is almost identical to that of the<br />
regular suffix array. Wotde is significantly faster than the regular suffix trees in this test,<br />
due to a more compact representation and better locality. The slight drop in performance<br />
seen for the fastest methods above 3000 hits is due to the overhead of the dynamic<br />
container used to hold the hits. This could easily be avoided for the suffix arrays, as the<br />
number of hits is known after finding the left and right borders.<br />
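The reporting pattern described above can be sketched as follows (toy code, names ours, not the benchmarked implementation): two binary searches locate the borders, after which the hits form a contiguous slice of the suffix array, so their number is known before any hit is materialised.<br />

```python
def borders(text, sa, pattern):
    """Return (l, r) such that sa[l:r] holds exactly the suffixes that
    start with pattern. r - l is the hit count, known up front."""
    def search(upper):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(pattern)]
            # Lower border: first prefix >= pattern.
            # Upper border: first prefix > pattern.
            if prefix < pattern or (upper and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return search(False), search(True)

def report_hits(text, sa, pattern):
    l, r = borders(text, sa, pattern)
    return sa[l:r]   # sequential read: the locality advantage noted above
```

The slice sa[l:r] is read sequentially, which is exactly the spatial-locality advantage the text attributes to suffix arrays.<br />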
The LZ index shows the best performance among the compressed indexes, and delivers<br />
hits nearly as fast as a suffix tree with sibling lists. The other compressed structures are<br />
3–5 orders of magnitude slower than the suffix array, but in many applications, thousands<br />
of hits per second is sufficient.<br />
4.9 Tree Operations<br />
Suffix trees can be used for many purposes other than substring search. They are used in<br />
bioinformatics to find many properties of strings; Gusfield [16] lists numerous applications.<br />
Because suffix trees are costly in space, Abouelhoda et al. [1] show how to replace<br />
suffix trees with enhanced suffix arrays in all algorithms doing bottom-up, top-down or<br />
suffix link traversal in the tree. Sadakane [39] has taken this one step further <strong>and</strong> shown<br />
how to use a succinct representation <strong>of</strong> a suffix tree on top <strong>of</strong> any compressed suffix array,<br />
supporting the same operations.<br />
Table 7 shows the performance <strong>of</strong> suffix trees, enhanced suffix arrays <strong>and</strong> a compressed<br />
suffix tree (CST ) on various tree operations. The compressed suffix tree implementation<br />
by Mäkinen [43] was used. The longest common substring (LCSS) between two strings<br />
is found by building a tree for the first string, <strong>and</strong> then traversing it matching suffixes <strong>of</strong><br />
the other string, following parent-to-child <strong>and</strong> suffix links in the tree. (The construction<br />
[Figure omitted; only the caption is recoverable.]<br />
Figure 16: Requesting an increasing number of hits, log10 scale on both axes. (a) Hits<br />
reported per second and (b) hits reported per L2 miss, for STa, STs, wotde, DS, LZ,<br />
CCSA, FM, AFFM, RLFM and SSA. X-axis: requested number of hits.<br />
Table 7: Tree operations. Showing time in seconds for construction, bottom–up <strong>and</strong><br />
top–down traverse, <strong>and</strong> longest common substring search.<br />
(a) 50 MB DNA data.<br />
STa STs esa1 esa2 CST<br />
Build 50 59 76 77 2604<br />
Top–down 14 16 1.9 2.5 16<br />
Bottom–up 9.7 9.3 9.8 9.9 23<br />
LCSS 24 33 4.0 3.7 313<br />
Memory 1586 1461 1313 1313 226<br />
Memory (peak) 1961 1837 1621 1621 364<br />
(b) 50 MB protein data.<br />
STa STs esa1 esa2 CST<br />
Build 42 137 75 76 4788<br />
Top–down 11 14 1.9 2.5 17<br />
Bottom–up 6.4 6.3 7.0 7.3 24<br />
LCSS 22 76 10 6.1 721<br />
Memory 1362 1236 1331 1364 251<br />
Memory (peak) 1737 1611 1514 1513 391<br />
time for the index is excluded.) On LCSS, esa2 is much faster than esa1, because many<br />
fields are read in each “node”, <strong>and</strong> these are stored close to each other in esa2. In general,<br />
esa2 is as fast or faster than the regular suffix trees on tree traversal.<br />
The compressed suffix tree is very competitive on top–down and bottom–up traversal,<br />
where the operations on nodes are performed in constant time, but it is around 100 times<br />
slower than esa2 on LCSS, where they are not [39].<br />
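For illustration, LCSS can also be computed with a simple dynamic program. This is not the linear-time suffix-link traversal used by the benchmarked implementations, only a compact O(|a|·|b|) sketch of what the operation computes.<br />

```python
def lcss(a, b):
    """Longest common substring of a and b via dynamic programming.
    The benchmarked implementations do this in O(|a|+|b|) with suffix
    links; this quadratic version only illustrates the problem."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)           # match lengths for previous row
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1    # extend the diagonal run
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]
```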
5 Conclusion<br />
It has been shown that performance depends strongly on locality even for data structures<br />
resident in primary memory. This should be considered when implementing indexes,<br />
as can be seen when comparing the performance of STa with STs, and esa2 with esa1.<br />
Which index structure should be chosen for which task depends on the relative importance<br />
of space usage, construction time, query lookup time and hit reporting.<br />
• Suffix trees with dynamic arrays can be as much as 20 times faster on construction<br />
than those with sibling lists for byte-sized alphabets, as seen in Figure 6a, and 10 times faster<br />
on query lookup, as seen in Figure 14. The array representation requires about 20%<br />
more space.<br />
• In applications where large numbers <strong>of</strong> hits must be reported, suffix arrays are<br />
strongly preferable to suffix trees, as they list hits 1–2 orders of magnitude faster.<br />
See Figure 16a.<br />
• For fast lookup for small numbers <strong>of</strong> hits, a suffix tree variant is the most effective<br />
index, if you have enough memory. See Figures 12a <strong>and</strong> 14a.<br />
• The lookup performance of the enhanced suffix array depends on the alphabet<br />
size, and a regular suffix array may be faster if the alphabet is sufficiently large.<br />
See Figure 14a.<br />
• The deep-shallow suffix array construction should in general be chosen over BPR,<br />
as it has a much lower space overhead. See Table 6.<br />
• Among the implementations tested here, the FM index is the most space efficient,<br />
as long as the alphabet size is not too large. See Table 5.<br />
• The LZ-index is fast on construction <strong>and</strong> listing hits. As it is more space efficient<br />
than a suffix array, it would be the structure <strong>of</strong> choice in many situations.<br />
• The wotdeager suffix tree is fast on lookup and reporting hits, but its construction<br />
easily breaks down for non-random data, as seen for some of the benchmarks in<br />
Table 3.<br />
• Tree traversal algorithms are faster with enhanced suffix arrays than suffix trees.<br />
See Table 7.<br />
In general, when designing the layout <strong>of</strong> data structures, it is important to consider the<br />
access patterns in construction <strong>and</strong> search to maximise locality. A few more computational<br />
steps during lookup will <strong>of</strong>ten be cheaper than a cache miss.<br />
Acknowledgements.<br />
The author would like to thank all the authors who have made their code publicly available,<br />
and Øystein Torbjørnsen, Magnus Lie Hetland and Tor Egge for helpful feedback<br />
on this article.<br />
References<br />
[1] M. I. Abouelhoda, S. Kurtz, <strong>and</strong> E. Ohlebusch. Replacing suffix trees with enhanced<br />
suffix arrays. J. <strong>of</strong> Discrete Algorithms, 2(1):53–86, 2004.<br />
[2] S. J. Bedathur <strong>and</strong> J. R. Haritsa. Engineering a fast online persistent suffix tree<br />
construction. In Proc. ICDE, page 720, 2004.<br />
[3] The canterbury corpus. http://corpus.canterbury.ac.nz/.<br />
[4] L. Devroye. Note on the average depth <strong>of</strong> tries. Computing, 28:367–371, 1982.<br />
[5] M. Farach. Optimal suffix tree construction with large alphabets. In Proc. FOCS,<br />
pages 137–143, 1997.<br />
[6] M. Farach-Colton, P. Ferragina, and S. M. Muthukrishnan. On the sorting-complexity<br />
of suffix tree construction. J. ACM, 47(6):987–1011, 2000.<br />
[7] P. Ferragina <strong>and</strong> G. Manzini. Opportunistic data structures with applications. In<br />
Proc. FOCS, page 390, 2000.<br />
[8] P. Ferragina <strong>and</strong> G. Manzini. Indexing compressed text. J. ACM, 52(4):552–581,<br />
2005.<br />
[9] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. An alphabet-friendly FM-index.<br />
In Proc. SPIRE, pages 150–160, 2004.<br />
[10] M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with O(1)<br />
worst case access time. J. ACM, 31(3):538–544, 1984.<br />
[11] R. Giegerich, S. Kurtz, <strong>and</strong> J. Stoye. Efficient implementation <strong>of</strong> lazy suffix trees.<br />
S<strong>of</strong>tware: Practice <strong>and</strong> Experience, 33:1035–1049, 2003.<br />
[12] S. Grabowski, V. Mäkinen, and G. Navarro. First Huffman, then Burrows-Wheeler:<br />
An alphabet-independent FM-index. In Proc. SPIRE, 2004.<br />
[13] N. <strong>Grimsmo</strong>. Dynamic indexes vs. static hierarchies for substring search. Master’s<br />
thesis, Norwegian University <strong>of</strong> Science <strong>and</strong> Technology, 2005.<br />
[14] R. Grossi, A. Gupta, <strong>and</strong> J. S. Vitter. High-order entropy-compressed text indexes.<br />
In Proc. SODA, pages 841–850, 2003.<br />
[15] R. Grossi <strong>and</strong> J. S. Vitter. Compressed suffix arrays <strong>and</strong> suffix trees with applications<br />
to text indexing <strong>and</strong> string matching (extended abstract). In Proc. STOC, pages<br />
397–406, 2000.<br />
[16] D. Gusfield. Algorithms on strings, trees, <strong>and</strong> sequences: <strong>Computer</strong> science <strong>and</strong><br />
computational biology. Cambridge University Press, 1997.<br />
[17] Intel Corporation. IA-32 Intel R○ Architecture Optimization Reference Manual, 2006.<br />
[18] J. Kärkkäinen <strong>and</strong> P. S<strong>and</strong>ers. Simple linear work suffix array construction. In Proc.<br />
ICALP, 2003.<br />
[19] J. Kärkkäinen <strong>and</strong> E. Ukkonen. Lempel-Ziv parsing <strong>and</strong> sublinear-size index structures<br />
for string matching. In Proc. WSP, pages 141–155, 1996.<br />
[20] D. K. Kim, J. S. Sim, H. Park, <strong>and</strong> K. Park. Linear-time construction <strong>of</strong> suffix<br />
arrays. In Proc. CPM, pages 186–199, 2003.<br />
[21] P. Ko <strong>and</strong> S. Aluru. Space efficient linear time construction <strong>of</strong> suffix arrays. In Proc.<br />
CPM, 2003.<br />
[22] S. Kurtz. Reducing the space requirement <strong>of</strong> suffix trees. S<strong>of</strong>tware: Practice <strong>and</strong><br />
Experience, 29(13):1149–1171, 1999.<br />
[23] V. Mäkinen. Compact suffix array. In CPM, pages 305–319, 2000.<br />
[24] V. Mäkinen <strong>and</strong> G. Navarro. Compressed compact suffix arrays. In CPM, pages<br />
420–433, 2004.<br />
[25] V. Mäkinen <strong>and</strong> G. Navarro. Run-length FM-index. In Proc. DIMACS, pages 17–19,<br />
2004.<br />
[26] V. Mäkinen <strong>and</strong> G. Navarro. Succinct suffix arrays based on run-length encoding.<br />
In Proc. CPM, 2005.<br />
[27] U. Manber <strong>and</strong> G. Myers. Suffix arrays: A new method for on-line string searches.<br />
SIAM J. on Computing, 22(5):935–948, 1993.<br />
[28] G. Manzini <strong>and</strong> P. Ferragina. Engineering a lightweight suffix array construction<br />
algorithm. Algorithmica, 40(1):33–50, 2004.<br />
[29] E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM,<br />
23(2):262–272, 1976.<br />
[30] Manzini <strong>and</strong> Ferragina’s test collection.<br />
http://www.mfn.unipmn.it/~manzini/lightweight/.<br />
[31] S. Muthukrishnan. Efficient algorithms for document retrieval problems. In Proc.<br />
SODA, pages 657–666, 2002.<br />
[32] G. Navarro. The LZ-index: A text index based on the Ziv-Lempel trie. Technical<br />
Report TR/DCC-2003-1, Dept. <strong>of</strong> <strong>Computer</strong> Science, Univ. <strong>of</strong> Chile, 2003.<br />
[33] G. Navarro. Indexing text using the Ziv-Lempel trie. J. of Discrete Algorithms,<br />
2(1):87–114, 2004.<br />
[34] G. Navarro <strong>and</strong> V. Mäkinen. Compressed full-text indexes. Technical Report<br />
TR/DCC-2006-6, Dept. <strong>of</strong> Comp. Sci., U. <strong>of</strong> Chile, 2006.<br />
[35] M. H. Overmars <strong>and</strong> J. van Leeuwen. Some principles for dynamizing decomposable<br />
searching problems. Technical Report RUU-CS-80-1, Rijksuniversitet Utrecht, 1980.<br />
[36] Performance Application Programming Interface.<br />
http://icl.cs.utk.edu/papi/.<br />
[37] Perfctr Hardware performance counters.<br />
http://user.it.uu.se/~mikpe/linux/perfctr/.<br />
[38] K. Sadakane. New text indexing functionalities <strong>of</strong> the compressed suffix arrays. J.<br />
<strong>of</strong> Algorithms, 48(2):294–313, 2003.<br />
[39] K. Sadakane. Compressed suffix trees with full functionality. Theory <strong>of</strong> Computing<br />
Systems (Online), 2007.<br />
[40] K. Schürmann <strong>and</strong> J. Stoye. An incomplex algorithm for fast suffix array construction.<br />
In Proc. ALENEX, 2005.<br />
[41] Y. Tian, S. Tata, A. Hankins, <strong>and</strong> M. Patel. Practical methods for constructing<br />
suffix trees. VLDB J., 14(3):281–299, 2005.<br />
[42] E. Ukkonen. On-line construction <strong>of</strong> suffix trees. Algorithmica, 14(5):249–260, 1995.<br />
[43] Niko Välimäki, Wolfgang Gerlach, Kashyap Dixit, <strong>and</strong> Veli Mäkinen. Engineering a<br />
compressed suffix tree implementation. In Proc. WEA, pages 217–228, 2007.<br />
[44] P. Weiner. Linear pattern matching algorithms. In Proc. SWAT, pages 1–11, 1973.<br />
[45] J. Ziv <strong>and</strong> A. Lempel. A universal algorithm for sequential data compression. IEEE<br />
Transactions on <strong>Information</strong> Theory, 23(3):337–343, 1977.<br />
Paper 8<br />
Truls A. Bjørklund, <strong>Nils</strong> <strong>Grimsmo</strong>, Johannes Gehrke <strong>and</strong> Øystein Torbjørnsen<br />
Inverted Indexes vs. Bitmap Indexes in Decision Support Systems<br />
Proceedings of the 18th ACM Conference on Information and Knowledge Management<br />
(CIKM 2009)<br />
Abstract Bitmap indexes are widely used in Decision Support Systems (DSSs) to improve<br />
query performance. In this paper, we evaluate the use <strong>of</strong> compressed inverted<br />
indexes with adapted query processing strategies from <strong>Information</strong> Retrieval as an alternative.<br />
In a thorough experimental evaluation on both synthetic data <strong>and</strong> data from the<br />
Star Schema Benchmark, we show that inverted indexes are more compact than bitmap<br />
indexes in almost all cases. This compactness combined with efficient query processing<br />
strategies results in inverted indexes outperforming bitmap indexes for most queries, <strong>of</strong>ten<br />
significantly.<br />
My role as an author<br />
This paper and the ideas it contains were mostly the work of Bjørklund. I took part in<br />
some brainstorming sessions, discussions about the implementation, and the writing process.<br />
A technical contribution of mine was to configure the FastBit system we compared against<br />
so that it did not crash for large queries.<br />
Retrospective view<br />
Bjørklund <strong>and</strong> I share supervisors, started on our PhDs roughly at the same time, <strong>and</strong> we<br />
were both given tasks related to indexing <strong>and</strong> search. Common denominators have been<br />
columnar storage and multi-way operators. In retrospect it could have been advantageous<br />
if we had been given even closer topics and had shared more of the implementation work<br />
in our research projects.<br />
Paper 8: Inverted Indexes vs. Bitmap Indexes in Decision Support Systems<br />
1 Introduction<br />
Decision Support Systems (DSSs) support queries over large amounts <strong>of</strong> structured data,<br />
<strong>and</strong> bitmap indexes are <strong>of</strong>ten used to improve the efficiency <strong>of</strong> important query classes<br />
involving selection predicates <strong>and</strong> joins [16, 17].<br />
Bitmap indexes were formerly also used in <strong>Information</strong> Retrieval (IR), but are today<br />
mainly replaced by inverted indexes. Part <strong>of</strong> the reason why inverted indexes gained<br />
popularity in IR was that they easily support integrating new fields required to support<br />
ranked queries. The switch from bitmap indexes to inverted indexes lead to a flood <strong>of</strong><br />
research on efficient inverted indexes [25, 30, 6, 13, 24, 3, 31, 29], <strong>and</strong> inverted indexes<br />
are now the preferred indexing method in search engines [30].<br />
In this paper, we are asking (<strong>and</strong> answering) the question: What are the trade-<strong>of</strong>fs<br />
<strong>of</strong> using inverted indexes in DSSs, <strong>and</strong> should they be considered a serious alternative<br />
to bitmap indexes? The main contributions <strong>of</strong> this paper are (1) the study <strong>of</strong> how to<br />
use <strong>and</strong> implement inverted indexes in DSSs, <strong>and</strong> (2) a thorough performance evaluation<br />
that compares inverted indexes <strong>and</strong> bitmap indexes in DSSs. In particular, we compare<br />
inverted indexes with FastBit, 1 a state-<strong>of</strong>-the-art bitmap query processing <strong>and</strong> indexing<br />
system based on WAH-compressed bitmap indexes [27].<br />
2 Background<br />
A st<strong>and</strong>ard bitmap index has one bitmap per distinct value for the indexed attribute, with<br />
1’s at positions for tuples with the represented value, <strong>and</strong> 0’s elsewhere. Bitmaps can be<br />
combined using bitwise operators to answer complex boolean queries. For attributes with<br />
few distinct values, bitmap indexes are relatively compact, but their space usage increases<br />
linearly with the cardinality. One approach to limit the space usage <strong>of</strong> bitmap indexes<br />
for high-cardinality attributes is compression. WAH [27] is one of several compression<br />
schemes that have been introduced. Although some schemes give more compact indexes, WAH<br />
supports efficient query processing. This combined with the fact that FastBit is openly<br />
available motivates the use <strong>of</strong> WAH-compressed bitmap indexes in the experiments in this<br />
paper.<br />
WAH-compression is a form <strong>of</strong> word-aligned run-length encoding for bitmap indexes,<br />
where consecutive words containing only 0’s or 1’s are stored as fill words, <strong>and</strong> other words<br />
are stored literally [26, 27]. WAH-compressed bitmaps for high cardinality attributes are<br />
relatively compact because most words contain only 0’s.<br />
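As a toy sketch of the idea (not FastBit's implementation, and without WAH compression), a bitmap index keeps one bitmap per distinct value and answers boolean predicates with bitwise operations; Python integers stand in for arbitrary-length bitmaps here, and all names are ours.<br />

```python
def build_bitmap_index(column):
    """One bitmap per distinct value; bit t is set iff tuple t has the value."""
    bitmaps = {}
    for tid, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << tid)
    return bitmaps

def tids(bitmap):
    """Decode a bitmap back to a sorted list of tuple identifiers."""
    out, tid = [], 0
    while bitmap:
        if bitmap & 1:
            out.append(tid)
        bitmap >>= 1
        tid += 1
    return out
```

A query such as "color = red AND size = L" is then answered as tids(color_index["red"] & size_index["L"]); OR and NOT map to the corresponding bitwise operators.<br />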
In IR, inverted indexes consist <strong>of</strong> a search structure for all searchable words called a<br />
dictionary, <strong>and</strong> lists <strong>of</strong> references to documents containing each searchable word, called<br />
inverted lists. An inverted index for an attribute in a DSS consists <strong>of</strong> a dictionary <strong>of</strong> the<br />
distinct values in the attribute, with pointers to inverted lists that reference tuples with<br />
the given value through tuple identifiers (TIDs). To reduce both space usage <strong>and</strong> the<br />
I/O requirements in query processing, the inverted lists are <strong>of</strong>ten compressed by storing<br />
the deltas between the sorted references [30]. This approach makes small values more<br />
1 http://sdm.lbl.gov/fastbit/<br />
239
APPENDIX A. OTHER PAPERS<br />
likely, <strong>and</strong> several compression schemes that represent small values compactly have been<br />
suggested. According to a recent study, PForDelta [31] is currently the most efficient<br />
method [29], and is therefore used in this paper. PForDelta stores deltas in a word-aligned<br />
version of bit packing, which also includes exceptions to enable storing larger<br />
values than the chosen number <strong>of</strong> bits allows [31].<br />
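The delta-compression principle can be sketched with a simpler scheme than PForDelta: encode the gaps between sorted TIDs as variable-length bytes. The function names are ours; PForDelta itself bit-packs the deltas with a fixed width plus exceptions, but the point is the same — sorting makes the deltas small.<br />

```python
def delta_varint_encode(tid_list):
    """Encode a sorted TID list as deltas in LEB128-style varints."""
    out, prev = bytearray(), 0
    for t in tid_list:
        d = t - prev          # small because the list is sorted
        prev = t
        while d >= 0x80:
            out.append((d & 0x7F) | 0x80)   # 7 payload bits, continuation bit
            d >>= 7
        out.append(d)
    return bytes(out)

def delta_varint_decode(data):
    tid_list, cur, d, shift = [], 0, 0, 0
    for byte in data:
        d |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            cur += d              # undo the delta encoding
            tid_list.append(cur)
            d, shift = 0, 0
    return tid_list
```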
Two overall query processing approaches exist in search engines. Document-at-atime<br />
strategies avoid materializing intermediate results by processing all inverted lists<br />
in a query in parallel [6, 24], <strong>and</strong> are well suited for boolean query processing. They<br />
can be combined with skipping, which is used in search engines to avoid reading <strong>and</strong><br />
decompressing parts <strong>of</strong> inverted lists that are not required to process a query [13]. We<br />
give a brief description <strong>of</strong> how we use these ideas in the query processing in this paper in<br />
the next section.<br />
3 Query Processing<br />
Recall that we use document-at-a-time strategies that avoid materializing intermediate<br />
results to process inverted index queries in this paper. We support three operators which<br />
can be combined to answer complex queries. They all support skipping to the next result<br />
with a given minimum TID value, in addition to st<strong>and</strong>ard Volcano-style iteration [9].<br />
The SCAN operator can iterate through an inverted list. To support skipping, each kth<br />
TID in each inverted list is stored in an external list. The external list is kept in memory<br />
during scans, <strong>and</strong> supports binary searches to find the correct part <strong>of</strong> the inverted list to<br />
process when skipping.<br />
The OR operator provides an iterator interface over the sorted merge <strong>of</strong> its multiple<br />
input iterators. The iterators are organized in a priority-queue based on a heap, which is<br />
maintained to make sure that the input with the smallest next TID is at the top. Skipping<br />
in the OR operator is based on a breadth-first search in the heap. A skip may not result<br />
in an actual skip for a given input iterator. If so, we know that neither <strong>of</strong> its children<br />
in the heap can do any skipping either, <strong>and</strong> we therefore avoid testing. After the search,<br />
we make sure that only the part <strong>of</strong> the heap involving iterators that actually skipped is<br />
maintained. This approach is reasonably efficient when actually performing skips in both<br />
large <strong>and</strong> small fractions <strong>of</strong> the set <strong>of</strong> input iterators.<br />
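Ignoring the skipping optimisation, the OR operator's sorted merge can be sketched as a standard heap-based k-way merge over materialised lists (a simplified illustration, not the paper's iterator-based implementation):<br />

```python
import heapq

def or_merge(lists):
    """Merge sorted TID lists into one sorted, duplicate-free stream,
    mirroring the OR operator's heap of input iterators."""
    heap = []
    for idx, lst in enumerate(lists):
        if lst:
            heapq.heappush(heap, (lst[0], idx, 0))  # (tid, list id, position)
    out = []
    while heap:
        tid, idx, pos = heapq.heappop(heap)
        if not out or out[-1] != tid:   # suppress duplicate TIDs
            out.append(tid)
        if pos + 1 < len(lists[idx]):   # advance this input and re-heap
            heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
    return out
```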
The AND operator expects that the input iterators are sorted in ascending order according<br />
to the expected number <strong>of</strong> returned results. To find the next result, we start with<br />
a c<strong>and</strong>idate from the iterator with the fewest number <strong>of</strong> expected results. We then try to<br />
skip to the c<strong>and</strong>idate in the other input iterators, re-starting with a new c<strong>and</strong>idate if the<br />
current c<strong>and</strong>idate is absent in one iterator. A c<strong>and</strong>idate found in all inputs is returned<br />
as a result. To support skipping, we start with the value to skip to as the c<strong>and</strong>idate <strong>and</strong><br />
proceed as in normal iteration.<br />
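A sketch of the AND strategy over materialised lists (simplified: binary search over whole lists stands in for the external skip lists described above, and all names are ours):<br />

```python
from bisect import bisect_left

def and_intersect(lists):
    """Intersect sorted TID lists, driving from the shortest list and
    skipping into the others with binary search, in the spirit of the
    AND operator."""
    lists = sorted(lists, key=len)      # fewest expected results first
    result = []
    for cand in lists[0]:
        ok = True
        for other in lists[1:]:
            pos = bisect_left(other, cand)   # skip to the first TID >= cand
            if pos == len(other) or other[pos] != cand:
                ok = False                   # candidate absent: restart
                break
        if ok:
            result.append(cand)
    return result
```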
[Figure omitted; only the caption is recoverable.]<br />
Figure 1: Index sizes. X-axis gives attribute cardinality (2 up to 10^7), y-axis size in MB.<br />
Two panels: uniform data and Zipf data with k = 1.5. Series: uncompressed attribute,<br />
FastBit, InvInd, and InvInd w/skip.<br />
4 Experiments<br />
To investigate the trade-<strong>of</strong>fs between inverted indexes <strong>and</strong> bitmaps, we experiment with<br />
FastBit <strong>and</strong> our inverted index solutions with <strong>and</strong> without support for skipping. We<br />
present results from experiments with synthetic data <strong>and</strong> data from the Star Schema<br />
Benchmark (SSB) [14].<br />
All experiments are run on a quad-core Intel Xeon CPU at 3.2 GHz with 16 GB main
memory. All indexes are stored on disk, but queries are run warm by performing 10
runs during one invocation of the system and reporting the average of the last 8, thus
measuring the in-memory query processing performance. We run FastBit version 1.0.5
(implemented in C++), with extra stack space to enable processing queries with many
operands. Our approaches are implemented in Java (version 1.6). We use additional
warm-up for our system to enable run-time optimizations in the Java virtual machine
that reduce variance between runs. Additional warm-up did not change the performance
of FastBit.
4.1 Synthetic Data<br />
To experiment with synthetic data, we generate two tables. Both tables have 10 million
tuples and 8 indexed attributes with maximum cardinalities ranging from 2 through all powers
of 10 up to 10 million. The attributes in the first table follow a uniform distribution, while
a Zipf distribution (with k = 1.5) is used in the other.
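A scaled-down sketch of this data generation (the function name and the use of Python's `random.choices` are ours; the paper does not specify its generator):

```python
import random

def make_table(num_tuples, cardinalities, zipf_k=None, seed=0):
    """Scaled-down sketch of the synthetic tables: one column per
    maximum cardinality, values drawn uniformly or from a Zipf
    distribution with exponent zipf_k."""
    rng = random.Random(seed)
    table = {}
    for card in cardinalities:
        if zipf_k is None:
            weights = [1.0] * card                        # uniform
        else:
            weights = [1.0 / rank ** zipf_k for rank in range(1, card + 1)]
        table[card] = rng.choices(range(card), weights=weights, k=num_tuples)
    return table

# 10,000 tuples instead of 10 million; cardinalities 2, 10 and 100.
uniform = make_table(10_000, [2, 10, 100])
zipf = make_table(10_000, [2, 10, 100], zipf_k=1.5)
```

With k = 1.5 the smallest ranks dominate, so for high maximum cardinalities many values never occur at all; this is the effect behind the smaller Zipf index sizes discussed in Section 4.1.1.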
APPENDIX A. OTHER PAPERS<br />
4.1.1 Index Size<br />
The sizes of the uncompressed attributes in the synthetic tables and their indexes are
shown in Figure 1. When using standard PForDelta compression on the attribute with
cardinality 2 in the table with uniform data, each value is represented with 4 bits in
the most compact index. The reason a lower number of bits results in a larger
index is that the implementation of PForDelta may introduce artificial extra exceptions
when using a small number of bits per value [31]. Bitmap indexes are known to be
compact when the cardinality is 2, and FastBit outperforms our approaches in this case.
PForDelta results in compact indexes for higher-cardinality attributes, and most of the
space usage for the highest-cardinality attribute comes from the dictionary (62 of
91 MB). The WAH-compressed bitmaps for the same attribute can in the worst case contain
nearly 3 computer words per tuple, resulting in a space usage of over 228 MB on a 64-bit
architecture. The actual results are significantly better, but compressed inverted indexes
are clearly more compact.
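The worst-case figure can be checked by direct arithmetic; this reconstruction of the bound assumes MB here means MiB (2^20 bytes):

```python
num_tuples = 10_000_000      # tuples in each synthetic table
words_per_tuple = 3          # worst-case WAH words per tuple
bytes_per_word = 8           # computer word on a 64-bit architecture

worst_case_mb = num_tuples * words_per_tuple * bytes_per_word / 2**20
print(f"{worst_case_mb:.1f} MB")  # 228.9 MB, i.e. "over 228 MB"
```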
Indexes for Zipf-distributed attributes are more compact than for uniformly distributed
attributes with the same maximum cardinality, because skewed distributions make it less
likely that the actual cardinality equals the maximum.
4.1.2 Query Processing
To experiment with query processing performance, we test four different query types,
all of which vary the attribute on which there is a single value predicate:
1. Query type SCAN: finds all tuples with value 0 for a varied attribute.
2. Query type skewed AND: finds all tuples having value 0 for the attribute with
cardinality 10, in addition to 0 in one other varied attribute.
3. Query type OR: finds all tuples with values in the lower half range for a varied
attribute.
4. Query type AND-OR: finds all tuples with value in the lower half range for the
attribute with cardinality 100,000, and value 0 for another varied attribute.
All queries compute the sum of the primary keys of the matching tuples, to ensure that
the output from the index is used to perform table look-ups. In the table with uniform
distributions, there were no tuples with value 0 for the highest-cardinality attribute, so
all single-value predicates on this attribute were changed to require the value 2.
The results are shown in Figure 2.
Compared to bitmaps, decompressed inverted lists are well suited for looking up other
attributes for the qualifying tuples, a factor contributing to faster scans for uniform data.
The difference in index size also seems to have an impact. All scans are relatively slow for
Zipfian data because we always search for the most common attribute value in a skewed
distribution, except for the highest-cardinality attribute, as noted above.
Skewed AND favors methods capable of taking advantage of the different densities of
the operands. Inverted indexes with skipping are therefore efficient for uniform data,
but introduce overhead for Zipfian data, because both operands are dense when the most
common values in skewed distributions are accessed. FastBit performs well on dense
Figure 2: Results from running queries on generated tables (uniform data in row (a), Zipf data in row (b)), showing query time in seconds for varying cardinalities. Panels: SCAN query (logscale), skewed AND, OR query, AND-OR query (logscale). Series: FastBit, InvInd, InvInd w/skip.
Figure 3: Size of indexes (MB) for the foreign key columns custkey, partkey and suppkey in SSB. Series: uncompressed attribute, FastBit, InvInd, InvInd w/skip.
operands, both because it can combine multiple logical TIDs using one CPU instruction,
and because it applies the operator before extracting the tuple references. Because no
input is smaller than the output of an AND operator, FastBit decodes fewer references
than the inverted indexes.
The multi-way OR operators in our solution scale better than FastBit with respect to
the number of inputs, for both tables.
The idea of skipping in OR operators is ideally suited for query type AND-OR, but it is
only useful when the other operands to the AND return data that enables reasonable skip
lengths, which occurs for high-cardinality attributes with uniform distributions.
4.2 Star Schema Benchmark<br />
Star schemas represent a best practice for organizing data in decision support systems,
and are characterized by a central fact table that references several smaller dimension
tables. Typical queries on such schemas involve joins of the fact table with relevant
dimension tables, called star joins. Bitmap indexes can be constructed over the foreign
keys in the fact table to speed up such joins, and are then called join indexes [16, 17].
We experiment with using inverted indexes as an alternative to bitmap indexes for this
purpose in the Star Schema Benchmark (SSB) [14]. We use Scale Factor 1 and precalculate
the foreign keys that match the queries in SSB; they are submitted as part of
the query to the tested systems. We also avoid calculating the exact answer, and rather
let all queries return the sum of an attribute of the fact table. This isolates the effects
of the indexes while making sure the returned results are suitable for further look-ups in
the fact table. There are four dimension tables in SSB, but we avoid constructing join
indexes for the Date table because FastBit is unable to process the queries involving all
tables without a very large stack.
4.2.1 Index Size<br />
Figure 3 shows the join index sizes in both systems. FastBit has significantly larger indexes
because the foreign keys have relatively high cardinalities. The attribute custkey is partly
sorted, resulting in longer runs in the WAH-compressed indexes, and the relative difference
between FastBit and inverted indexes is therefore smaller in that case.
Figure 4: Query processing time in seconds for SSB queries Q2.1, Q2.2, Q2.3, Q3.1, Q3.2, Q3.3/3.4, Q4.1/4.2 and Q4.3. Series: FastBit, InvInd, InvInd w/skip.
4.2.2 Join Processing<br />
The query processing results for SSB are given in Figure 4. Within each set of queries, the
predicates on the dimension tables become increasingly selective, making FastBit perform
better relative to inverted indexes because the OR operators that combine the tuples
representing each qualifying foreign key have fewer inputs. Query 2.3 has skewed input to
an AND operator, making skipping important for performance. The OR operator providing
dense input to the AND is over the attribute with the lowest cardinality, contributing to
the smaller performance difference between FastBit and inverted indexes for this query.
5 Related Work<br />
Several alternatives to the compression schemes discussed in this paper have been suggested,
both for bitmaps [4, 10, 5, 15, 21] and for inverted indexes [25, 20, 1]. Experiments
have shown that the query processing efficiency of WAH remains attractive, even though
there are approaches resulting in smaller indexes. WAH is known to result in smaller indexes
when the table is sorted on the indexed attribute [19, 12]. Due to space restrictions,
we do not experiment with sorted tables in this paper. Experiments with compression in
inverted indexes in IR have shown that PForDelta is currently the most efficient technique
[29], and further improvements to the technique have been suggested recently [28].
As an alternative to compression, there are several approaches that reduce the number
of bitmaps in the index [17, 7, 8, 11, 22]. Strategies for operating on many bitmaps by
processing two at a time have been explored for WAH-compressed bitmap indexes [26],
and a recent paper suggests using multi-way operators for bitmaps, but the idea is not
tested [12]. Query processing approaches for inverted indexes in IR have focused on
term-at-a-time strategies in addition to the document-at-a-time approach used in this paper
[6, 24, 18, 2, 3, 23].
6 Conclusions<br />
In this paper, we have evaluated the applicability of compressed inverted indexes as an
alternative to bitmap indexes in DSSs. Inverted indexes are generally significantly more
space efficient. The only case where WAH-compressed bitmaps are clearly more compact
is when the cardinality of the indexed attribute is very low. FastBit performs well on
simple queries with dense operands, but inverted indexes are better in other cases, often
significantly.
Acknowledgments: This material is based upon work supported by New York State
Science Technology and Academic Research under agreement number C050061, by Grant
NFR 162349, by the National Science Foundation under Grants 0534404 and 0627680,
and by the iAd Project funded by the Research Council of Norway. Any opinions, findings
and conclusions or recommendations expressed in this material are those of the authors
and do not necessarily reflect the views of NYSTAR, the National Science Foundation or
the Research Council of Norway.
References<br />
[1] V. N. Anh and A. Moffat. Index compression using fixed binary codewords. In Proc. ADC, 2004.
[2] V. N. Anh and A. Moffat. Simplified similarity scoring using term ranks. In Proc. SIGIR, 2005.
[3] V. N. Anh and A. Moffat. Pruned query evaluation using pre-computed impacts. In Proc. SIGIR, 2006.
[4] G. Antoshenkov. Byte-aligned bitmap compression. In Proc. DCC, 1995.
[5] T. Apaydin, G. Canahuate, H. Ferhatosmanoglu, and A. S. Tosun. Approximate encoding for direct access and query processing over compressed bitmaps. In Proc. VLDB, 2006.
[6] E. W. Brown. Fast evaluation of structured queries for information retrieval. In Proc. SIGIR, 1995.
[7] C.-Y. Chan and Y. E. Ioannidis. Bitmap index design and evaluation. SIGMOD Rec., 27(2), 1998.
[8] C.-Y. Chan and Y. E. Ioannidis. An efficient bitmap encoding scheme for selection queries. In Proc. SIGMOD, 1999.
[9] G. Graefe. Volcano: an extensible and parallel query evaluation system. IEEE Trans. on Knowl. and Data Eng., 6(1), 1994.
[10] T. Johnson. Performance measurements of compressed bitmap indices. In Proc. VLDB, 1999.
[11] N. Koudas. Space efficient bitmap indexing. In Proc. CIKM, 2000.
[12] D. Lemire, O. Kaser, and K. Aouiche. Sorting improves word-aligned bitmap indexes. CoRR, abs/0901.3751, 2009.
[13] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4), 1996.
[14] E. O'Neil, P. O'Neil, and X. Chen. The Star Schema Benchmark. http://www.cs.umb.edu/~poneil/StarSchemaB.PDF.
[15] E. O'Neil, P. O'Neil, and K. Wu. Bitmap index design choices and their performance implications. In Proc. IDEAS, 2007.
[16] P. O'Neil and G. Graefe. Multi-table joins through bitmapped join indices. SIGMOD Rec., 24(3), 1995.
[17] P. O'Neil and D. Quass. Improved query performance with variant indexes. In Proc. SIGMOD, 1997.
[18] M. Persin, J. Zobel, and R. Sacks-Davis. Filtered document retrieval with frequency-sorted indexes. J. Am. Soc. Inf. Sci., 47(10), 1996.
[19] A. Pinar, T. Tao, and H. Ferhatosmanoglu. Compressing bitmap indices by data reorganization. In Proc. ICDE, 2005.
[20] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.
[21] M. Stabno and R. Wrembel. RLH: Bitmap compression technique based on run-length and Huffman encoding. Information Systems, 2008.
[22] K. Stockinger, K. Wu, and A. Shoshani. Evaluation strategies for bitmap indices with binning. In Database and Expert Systems Applications, 2004.
[23] T. Strohman and W. B. Croft. Efficient document retrieval in main memory. In Proc. SIGIR, 2007.
[24] T. Strohman, H. Turtle, and W. B. Croft. Optimization strategies for complex queries. In Proc. SIGIR, 2005.
[25] I. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Academic Press, 1999.
[26] K. Wu, E. Otoo, and A. Shoshani. On the performance of bitmap indices for high cardinality attributes. In Proc. VLDB, 2004.
[27] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. ACM Trans. Database Syst., 31(1), 2006.
[28] H. Yan, S. Ding, and T. Suel. Inverted index compression and query processing with optimized document ordering. In Proc. WWW, 2009.
[29] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, 2008.
[30] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.
[31] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, 2006.
Paper 9<br />
Truls A. Bjørklund, Michaela Götz, Johannes Gehrke and Nils Grimsmo
Search Your Friends And Not Your Enemies
Submitted to Proceedings of the VLDB Endowment (VLDB 2011)
Abstract: More and more data is accumulated inside social networks. Keyword search
provides a simple interface for exploring this content. However, a lot of the content is
private, and a search system must enforce the privacy settings of the social network.
In this paper, we present a workload-aware keyword search system with access control.
We develop a range of cost models that vary in sophistication and accuracy. These
cost models provide input to an optimization algorithm that selects the ideal solution
for a given workload. With our cost models we find designs that outperform previous
approaches by up to a factor of 3. We also address the query processing strategy in
the system, and develop a novel union operator called HeapUnion that speeds up query
processing by a factor of between 1.1 and 2.3 compared to the best previous solution. We
believe that both our cost models and our novel union operator will be of independent
interest for future work.
My role as an author<br />
Bjørklund was the main contributor to this paper. I supported him during debugging
sessions for the cost models, offered a shoulder to cry on when they had to be modified,
and helped during the writing process.
Paper 9: Search Your Friends And Not Your Enemies<br />
1 Introduction<br />
More and more data is accumulated inside social networks, where users tweet, update their
status, chat, post photos, and comment on each other's lives. From a user's perspective, a
lot of her content is private and should not be accessible to everybody. To limit arbitrary
information flow, social networks enable users to adjust their privacy settings, e.g., to
ensure that only friends can see the content they have posted. Keyword search provides
a simple interface for the exploration of content in social networks. It is the challenge
of enabling search over the content while enforcing privacy settings that motivates the
research in this paper.
Search over collections of documents — without taking social networks into account
— is a well-studied problem [29], and a search query consisting of keywords is usually
answered using an inverted index [31]. An inverted index consists of a look-up structure
over all unique terms in the indexed document collection, and a posting list for each term.
The posting list contains the document identifiers (IDs) of all documents that contain the
term.
Search in a social network is more challenging because each user sees a unique subset
of the document collection. We view a social network as a directed graph where each node
represents a user and a directed edge represents a (one-way) friendship. We assume in this
paper that the social network implements the following privacy setting: a user has access
to the documents authored by herself and to the documents authored by her friends. We
rank the results of a query according to recency, with the most recent documents ranked
highest; then we retrieve the top-k results.
Figure 1(a) shows an example of a social network with four users, where User 4 is
friends with Users 1 and 3, and User 2 is friends with Users 1 and 4. All the users have
posted documents, and each document has an ID which is shown in its top right corner.
User 3 has posted Documents 2 and 5, and in our model she can search across Documents
2, 5, and 7.
In previous work we developed a conceptual framework for solutions to this problem
[8]. Since we need to both search and enforce access control, we have characterized
solutions along two axes: the index axis and the access axis. The index axis captures the
idea that instead of creating one single inverted index over all the content in the social
network, we may create several inverted indexes, each containing a subset of the content.
A set of inverted indexes and their content is called an index design. The access axis
mirrors the index axis and describes the meta-data used to filter out inaccessible results;
the meta-data is organized into author-lists. For the purposes of this paper, an author-list
contains the IDs of all documents authored by a set of users. An access design describes
a set of author-lists.
Previous work experimented with a few extreme points in this solution space; it showed
that two of the most promising solutions both use an index design with a single index
containing all users' documents, while the access designs of the two approaches differ.
The first approach is called the user design and has one author-list per user that contains
the document IDs posted by that particular user. The second approach is called the friends
design; it also has one author-list per user, but this author-list contains the documents
Figure 1: Social Network and Basic Designs. (a) Example social network with posted documents. (b) User design author-lists: User 1: 6 4 1; User 2: 3; User 3: 5 2; User 4: 7. (c) Friends design author-lists: User 1: 7 6 4 3 1; User 2: 7 6 4 3 1; User 3: 7 5 2; User 4: 7 6 5 4 2 1.
posted by the user and all of her friends. The author-lists for the user and friends designs
for our example from Figure 1(a) are shown in Figures 1(b) and 1(c), respectively. In
both of these designs, a keyword query from a user is processed in the single inverted
index. To enforce access control, the results from the index are intersected with a set
of author-lists covering all friends of the user. In the friends design, all friends of the
user are represented in the author-list for the user, whereas in the user design, we need
to calculate the union of the author-lists for all friends.
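The two basic designs can be sketched on the example of Figure 1. Note that the friend sets below are inferred from the author-lists shown in Figures 1(b) and 1(c), since the text only lists some of the friendships explicitly:

```python
# Friend sets F[u] (each user is included in her own set) and authored
# document IDs per user; a higher ID means more recently posted.
F = {1: {1, 2, 4}, 2: {1, 2, 4}, 3: {3, 4}, 4: {1, 3, 4}}
authored = {1: [1, 4, 6], 2: [3], 3: [2, 5], 4: [7]}

def user_design(authored):
    """User design: one author-list per user, her own documents only."""
    return {u: sorted(docs, reverse=True) for u, docs in authored.items()}

def friends_design(authored, F):
    """Friends design: one author-list per user, covering all her friends."""
    return {u: sorted((d for v in F[u] for d in authored[v]), reverse=True)
            for u in authored}

friends_design(authored, F)[4]  # -> [7, 6, 5, 4, 2, 1], as in Figure 1(c)
```

The trade-off described below is visible here: posting a document touches one list in the user design but up to |O_u| lists in the friends design, while querying touches one list in the friends design but up to |F_u| lists in the user design.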
Because of the promising performance of the user and friends designs in previous work [8],
we have chosen them as the basis for our solutions. Note that the two designs have very
different trade-offs. When a user posts a new document, only a single author-list must
be updated in the user design. In the friends design, however, the author-lists of all
users that are friends with the posting user must be updated. During queries, only one
author-list is accessed with the friends design, whereas one author-list for each friend of
the user must be accessed with the user design.
In this paper we bridge the gap between the two extremes to enable efficient search
in a social network with access control. We propose an intermediate strategy that combines
the best of both the friends design and the user design: low search costs and low
update costs. Our solution starts with the user design and then judiciously adds selected
additional author-lists to improve query performance.<br />
The remainder of the paper explains our solution in detail. In Section 2, we introduce
notation and define the problem we address. Then, we develop efficient query processing
strategies based on a novel union operator (Section 3). Furthermore, we develop a set of
cost models with various degrees of sophistication and accuracy, and use them as the basis
for workload optimization to find the ideal access design for a particular workload (Section 4).
In a thorough experimental evaluation, we demonstrate the efficiency of our new union
operator, validate the accuracy of our cost models, and compare access designs resulting
from workload optimization based on different cost models to previous work (Section 5).
Related work is discussed in Section 6, and we conclude in Section 7. We believe that both
our novel union operator and our cost models will be of independent interest for future
work.
2 Problem Definition<br />
In this section, we introduce the notation used to define the problem addressed in this
paper.
Search Data and Query Model. We view a social network as a directed graph
〈V, E〉, where each node u ∈ V represents a user. There is an edge 〈u, v〉 ∈ E if user v is
a friend of user u, denoted v ∈ F_u; alternatively, u is one of v's followers, denoted
u ∈ O_v. We always have u ∈ F_u and u ∈ O_u.
We consider workloads that consist of two different operations: posting new documents
and issuing queries. A new document, which we also refer to as an update, consists of a set
of terms. We call the user who posted document d the author of d, and we also
say that the user authored d. Each new document is assigned a unique document ID;
more recently posted documents have higher IDs. Let n_u denote the number of documents
authored by user u, and let N = ∑_u n_u denote the total number of documents in the
system.
A query submitted by a user u consists of a set of keywords. As mentioned in the
previous section, we assume that only documents authored by users in F_u are accessible
to u, and that the ranking is based on recency. Thus the results of a keyword query are
the k documents that (1) contain the query keywords, (2) are authored by a user in F_u,
and (3) have the k highest document IDs among the documents that satisfy (1) and (2).
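The definition above can be stated as a naive reference implementation (a sketch of the semantics, not the system's index-based evaluation; the toy documents are invented for illustration):

```python
def query_results(u, keywords, F, authored, doc_terms, k):
    """Reference semantics of a top-k query by user u: documents that
    contain all query keywords, are authored by a user in F[u], and
    have the k highest IDs (recency ranking)."""
    accessible = {d for v in F[u] for d in authored[v]}
    matches = [d for d in accessible if keywords <= doc_terms[d]]
    return sorted(matches, reverse=True)[:k]

# Toy data (invented for illustration):
F = {3: {3, 4}}
authored = {3: [2, 5], 4: [7]}
doc_terms = {2: {"a", "b"}, 5: {"b"}, 7: {"a", "c"}}

query_results(3, {"a"}, F, authored, doc_terms, k=1)  # -> [7]
```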
Beyond User and Friends Designs. The relative merits of the user and friends
designs motivate the solution in this paper. Recall from the previous section that in the
user design, updates are efficient because only u's author-list is updated when u posts
a document; queries, however, need to access the author-lists of all users in F_u. In the
friends design, queries are more efficient because only the author-list of u is accessed.
Updates, however, need to change the author-lists of all users in O_u.
In our approach, we start out with the user design. In addition, we add one author-list
l_u for each user u; l_u contains the IDs of all documents authored by a selected set of users
L_u ⊆ F_u. When user u submits a query, there is no need to access specific author-lists for
users in L_u, and queries therefore become more efficient as more users are represented in
L_u. On the other hand, representing more users in L_u leads to higher update costs. We
therefore determine the contents of L_u (and thus l_u) based on the workload characteristics.
System Model. Keyword search over collections where updates are interleaved with
queries is typically supported with a hierarchy of indexes [19, 10, 17, 21]. New data is
accumulated in a small updatable structure that also supports concurrent queries, while
the rest of the hierarchy consists of a set of read-only indexes. The read-only indexes
change through merges that form larger read-only indexes, resulting in a hierarchy where
the largest indexes contain the least recent documents.
An index hierarchy is well suited for search in social networks, especially when used
in combination with an access design that adapts to the workload. The time at which
indexes are merged represents an opportunity to modify the access design and adapt it to
the current workload, so that different indexes in the hierarchy potentially have different
access designs. In our system, we process top-k queries with recency as the ranking
criterion, and the largest indexes would therefore rarely be accessed. Our workload-aware
algorithms can be used to find suitable access designs in these cases.
In this paper, we focus on selecting an access design for an index with a stratified
workload, where all updates are processed before all queries. A stratified workload closely
approximates the workloads for the indexes at the various levels in the hierarchy, except that
the update costs can vary slightly based on the structure of the hierarchy [19, 10, 17, 21].
Recall that a system based on a hierarchy of indexes, where each index supports stratified
workloads, will still support interleaved workloads overall, because of the small updatable
part of the index. The updatable part supports interleaved workloads, but we
have verified experimentally that because it is small, its processing times for stratified
workloads approximate the processing times for interleaved workloads very well (within
single percentage points), because the constant costs in query processing dominate. The
problem we focus on in this paper is thus important for a system that supports an
interleaved workload of updates and queries, and we will address both query processing and
adaptation to workloads based on cost models.
3 Query processing

Our main-memory search system constructs indexes by accumulating batches of documents in an updatable structure where the lists are compressed using VByte [24]. The batches are combined in the end to form the complete index, where the lists are compressed using PForDelta [32, 30].
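VByte encodes each document-ID delta in 7-bit groups with a flag bit per byte. The following sketch illustrates one common VByte variant (the exact byte layout used by the system may differ, and the function names here are illustrative):

```python
def vbyte_encode(gaps):
    """Encode positive integers (document-ID deltas) with VByte:
    7 data bits per byte; the high bit marks the final byte of a value."""
    out = bytearray()
    for n in gaps:
        while n >= 128:
            out.append(n & 0x7F)   # low 7 bits, continuation implied
            n >>= 7
        out.append(n | 0x80)       # terminating byte
    return bytes(out)

def vbyte_decode(data):
    """Decode a VByte-encoded byte string back to the list of integers."""
    out, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:            # terminating byte of this value
            out.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return out
```

Small deltas fit in a single byte, while a list with few entries has large average deltas and therefore costs more bytes per posting, which is why delta size matters for update costs.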
The system answers queries by computing the intersection of a posting list with a union of author-lists.¹ Note that similar types of queries occur in several scenarios, e.g., in star joins in data warehouses [23]. We process the queries with the template shown in Figure 2, which is a combination of three operators (intersection, union and list iterator) that all support the following interface:

• Init(), initialize the operator.

¹ Notice that our presentation focuses on single-term queries. This is done to simplify the presentation and does not reflect a limitation of our system.
Paper 9: Search Your Friends And Not Your Enemies
Figure 2: Query Template (an intersection ⋂ of the posting list p_t with a union ⋃ of the author-lists a_1, a_2, ..., a_n).
• Current(), retrieve the current result.
• Next(), forward to and retrieve the next result.
• SkipTo(val), forward to and retrieve the next result with value ≤ val.
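In Python-like terms, the shared interface could be sketched as the following abstract base class (names mirror the text; the actual system is implemented in Java):

```python
from abc import ABC, abstractmethod

class Operator(ABC):
    """Common interface for the list iterator, union and intersection
    operators. Results are document IDs in descending order; None is
    used here as the end-of-results sentinel."""

    @abstractmethod
    def init(self):
        """Initialize the operator."""

    @abstractmethod
    def current(self):
        """Retrieve the current result."""

    @abstractmethod
    def next(self):
        """Forward to and retrieve the next result."""

    @abstractmethod
    def skip_to(self, val):
        """Forward to and retrieve the next result with value <= val
        (results are sorted descending, so skipping moves to smaller IDs)."""
```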
All results are returned sorted by descending document ID to facilitate efficient ranking on recency. The top-k ranked results are therefore found by calling Next() on the intersection operator k times. A standard intersection operator alternates between the inputs and skips forward to find a value that occurs in both. With such a solution, the implementation of the SkipTo(val) operation in the union operator is essential to the overall processing strategy. The union operator can for example merge all the inputs first and perform all operations on the merged list [23]. Another strategy is to perform all skips on all inputs and return the maximum result. This last strategy essentially calculates the intersection of the posting list and each single author-list and then returns the union of the results [23].
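A standard two-input intersection over this interface can be sketched as follows; this is a simplified stand-alone version with plain descending-sorted lists standing in for the operators, and the names are illustrative:

```python
class DescendingList:
    """Minimal list iterator over document IDs sorted in descending order."""
    def __init__(self, doc_ids):
        self.docs = sorted(doc_ids, reverse=True)
        self.pos = -1
    def next(self):
        self.pos += 1
        return self.current()
    def skip_to(self, val):
        # Forward to the next result with value <= val.
        self.pos = max(self.pos, 0)
        while self.pos < len(self.docs) and self.docs[self.pos] > val:
            self.pos += 1
        return self.current()
    def current(self):
        return self.docs[self.pos] if 0 <= self.pos < len(self.docs) else None

def intersect_top_k(a, b, k):
    """Alternate between the inputs, skipping forward until a value occurs
    in both; stop once k results (the top k by recency) are found."""
    results = []
    x, y = a.next(), b.next()
    while x is not None and y is not None and len(results) < k:
        if x == y:
            results.append(x)
            x, y = a.next(), b.next()
        elif x > y:
            x = a.skip_to(y)   # a is ahead (larger ID); skip it down to y
        else:
            y = b.skip_to(x)
    return results
```

For example, intersecting [9, 7, 5, 3, 1] with [8, 7, 3, 2] yields [7, 3], and with k = 1 only [7] is produced before the loop stops.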
Raman et al. have introduced a union operator called Lazy Merge, which is based on the idea that if the number of skip operations in the union is large compared to the length of an input, it would have been ideal to pre-merge this input into a list of intermediate results. Lazy Merge adaptively merges an input into the intermediate results when the number of skip operations exceeds the length of the input times a constant α. The strategy never uses more than twice the processing time of a solution that pre-merges the optimal set of inputs [23]. However, the approach is not well suited for our workloads for two reasons. First, we usually process top-k queries and therefore only need the first k results from the intersection. It is inefficient to merge a set of complete inputs when only a small fraction of the results is used during query processing. Second, we often have a large number of inputs to the union, which can lead to many merges in Lazy Merge. The analysis of Lazy Merge does not take the actual cost of merges into consideration, and this can result in far-from-optimal processing strategies with many inputs.

To address these issues, we develop a union operator called HeapUnion, which is described next. The implementation of the other operators is described in Appendix A.
3.1 HeapUnion

HeapUnion, our novel union operator, is designed to be efficient regardless of whether all or only a fraction of the results are actually needed, and to scale gracefully to a very large number of inputs. To achieve these goals, HeapUnion is based on a binary heap. The heap contains one entry for each input operator, and it is ordered based on the value obtained from calling Current() on each input operator (referred to as the current value for the input operator). The same overall strategy is commonly used in information retrieval during inverted index construction [29].

We support the standard operator interface by always having the input with the highest current value at the top of the heap, so that this value is also the current value for HeapUnion. The heap is initialized the first time Next() or SkipTo(val) is called. When the first call is a Next() (SkipTo(val), respectively), HeapUnion calls Next() (SkipTo(val)) on all inputs, and the heap is constructed using Floyd's algorithm [13]. Floyd's algorithm calls a sub-procedure, heapify(), which constructs a legal heap from an entry with two legal sub-heaps as children. The heapify() operation has worst-case complexity logarithmic in the size of the heap, and Floyd's algorithm runs in linear time [13]. We will also use these operations during heap maintenance.
After initialization, HeapUnion works as shown in the pseudocode in Algorithm 1. The Current() operation either returns the current value of the input operator at the top of the heap, or indicates that there are no more results if the heap is empty. The Next() operation forwards the input with the current value, and calls heapify() to ensure that the input with the new highest current value is at the top of the heap. The worst-case complexity of this operation is thus logarithmic in the number of input operators.

The SkipTo(val) operation is based on a breadth-first search (BFS) in the heap. When forwarding to a value val, only the inputs with a current value > val actually need to be forwarded. Because the heap is organized according to the current values of all inputs, we know that if a given input has a current value ≤ val, the same is true for all of its descendants. If we determine that no skip is necessary for a given input, we thus also know that no skip is required for any of its children in the heap, and there is no need to process the children in the BFS. Furthermore, if an input is not forwarded, we know that its position in the heap relative to its children will not change. We also take advantage of this observation by calling heapify() only for the inputs where an actual skip occurred, and use a complete run of Floyd's algorithm only in the worst case.

Because HeapUnion never pre-merges any inputs, its efficiency does not depend on the total number of results, only on the number of returned results. It is therefore well suited for top-k queries. Furthermore, the BFS in the heap during skip operations ensures scalability with the number of inputs. If only one input is forwarded during a skip operation, the processing cost is logarithmic in the number of inputs; if all inputs are forwarded, the total maintenance cost becomes linear in the number of inputs.
Algorithm 1 HeapUnion Operator

function Init():
    Allocate the heap
function Next():
    heap[0].Next()
    heapify(0)
    return Current()
function SkipTo(val):
    Perform a breadth-first search in the heap from the root:
    while the BFS queue is not empty:
        if the current input is forwarded by SkipTo(val):
            add it to a LIFO list of entries for heap reorganization
            add its children to the BFS queue
    Call heapify() for the forwarded inputs in the LIFO list
    return Current()
function Current():
    if size(heap) == 0:
        return Eof
    else:
        return heap[0].Current()
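A compact executable sketch of HeapUnion follows. It uses Python's heapq (a min-heap, so document IDs are negated to keep the highest current value at the top) rather than the paper's hand-rolled heap; the lazy skip loop, which stops as soon as the top entry is already ≤ val, captures the same pruning idea as the BFS in Algorithm 1, but is a simplification, not the paper's exact implementation:

```python
import heapq

class DescendingList:
    """Minimal input operator: document IDs sorted in descending order."""
    def __init__(self, doc_ids):
        self.docs = sorted(doc_ids, reverse=True)
        self.pos = -1
    def next(self):
        self.pos += 1
        return self.current()
    def skip_to(self, val):
        self.pos = max(self.pos, 0)
        while self.pos < len(self.docs) and self.docs[self.pos] > val:
            self.pos += 1
        return self.current()
    def current(self):
        return self.docs[self.pos] if 0 <= self.pos < len(self.docs) else None

class HeapUnion:
    """Union of many descending inputs via a heap keyed on current value."""
    def __init__(self, inputs):
        self.inputs = list(inputs)
        self.heap = []
        self.started = False
    def _build(self):
        # Floyd's algorithm (heapq.heapify) builds the heap in linear time;
        # exhausted inputs are dropped.  The index i breaks ties so that
        # iterators are never compared directly.
        self.heap = [(-it.current(), i, it)
                     for i, it in enumerate(self.inputs)
                     if it.current() is not None]
        heapq.heapify(self.heap)
        self.started = True
    def current(self):
        return -self.heap[0][0] if self.heap else None
    def _forward_top(self, advance):
        _, i, it = self.heap[0]
        if advance(it) is None:
            heapq.heappop(self.heap)          # input exhausted
        else:
            heapq.heapreplace(self.heap, (-it.current(), i, it))
    def next(self):
        if not self.started:
            for it in self.inputs:
                it.next()
            self._build()
        else:
            self._forward_top(lambda it: it.next())
        return self.current()
    def skip_to(self, val):
        if not self.started:
            for it in self.inputs:
                it.skip_to(val)
            self._build()
        else:
            # Only inputs with current value > val can need forwarding,
            # and they sit near the top of the heap; stop as soon as the
            # top is already <= val.
            while self.heap and -self.heap[0][0] > val:
                self._forward_top(lambda it: it.skip_to(val))
        return self.current()
```

Since the author-lists in our setting are disjoint, the sketch does not deduplicate results across inputs.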
4 Cost Models

The efficiency of our system for a particular workload depends on the contents of the additional author-lists, and selecting a good set of lists is therefore essential. For each user u, any subset of F_u can be included in L_u, which leads to 2^{∑_{u∈V} |F_u|} = 2^{|E|} possible designs. We use cost models to investigate this large space; the models serve as the basis for optimization algorithms that find the best access design for a stratified workload of updates and queries.

In related work, Silberstein et al. have developed a simple cost model to address the problem of finding the most recent events in users' event feeds [25]. It is straightforward to adapt their model to our problem, and we will refer to this model as Simple. Simple assumes that the cost of a search is linear in the number of accessed author-lists, and that the cost of constructing a list is linear in the number of document IDs in the list. It thus captures the intuition that including more users in the additional author-lists leads to lower search costs and higher update costs. A formal description of Simple is found in Figure 3. Silberstein et al. have shown that when using Simple as a basis for optimization, a globally optimal solution can be found by making local decisions for each friend pair. This implies that only ∑_{u∈V} |F_u| = |E| different designs have to be explored in order to
find the optimal one [25].
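Under Simple, the decision for each friend pair is independent: including friend v in L_u saves one author-list access per query by u, and costs one posting per document v authors. A sketch of the resulting local rule (the constants and the workload representation are illustrative, not the system's actual parameters):

```python
def simple_local_design(friend_doc_counts, queries_by_u,
                        c_list=1.0, c_update=1.0):
    """Pick L_u under the Simple model: friend v is included exactly when
    the modeled query saving outweighs the modeled update cost."""
    L_u = set()
    for v, docs_by_v in friend_doc_counts.items():
        saving = c_list * queries_by_u   # one fewer list access per query
        cost = c_update * docs_by_v      # v's documents appended to l_u
        if saving >= cost:
            L_u.add(v)
    return L_u
```

With friend_doc_counts = {"v": 10, "w": 1000} and 100 queries by u, only "v" is included; a prolific poster like "w" is left out because updating l_u would dominate.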
Unfortunately, Simple is not accurate enough for our application scenario. Figure 4 compares actual running times of our system to predictions from Simple. The tested workloads are described in Appendix B, but for now it suffices to know that a low limit indicates that the additional author-lists are almost empty, while a high limit indicates that the additional author-lists include most or all friends. The search estimates for Simple are clearly far from accurate, and as we will see in Section 5, this inaccuracy can lead to access designs that are up to 30% slower than designs from more accurate models. In this section, we introduce two more accurate models called Monotonic and Non-monotonic.
4.1 Monotonic

Monotonic is designed to be an accurate yet tractable cost model, where only a small number of access designs must be checked to find a globally optimal solution. The model for updates in Monotonic is the same as in Simple, but we use a different approach to model query costs. Monotonic has one cost model for each operator, and query costs are estimated by combining the models for all operators in the query. The cost model for an operator describes the cost of each method supported by the operator. For operators that have other operators as inputs, like HeapUnion and Intersection, the cost is calculated by combining the cost of operations within the operator with the cost of method calls on the inputs. To find the cost of the queries we use in this paper, we combine the operators according to the template in Figure 2, and calculate the cost of k Next() calls on the Intersection to retrieve the top-k results (assuming there are at least k).

Monotonic is described in Figure 3. Skip(s) is a model for SkipTo(val). If SkipTo(val) forwards the current value of an operator by Δv document IDs, we model the cost of the operation as Skip(r·Δv/N), where r is the number of results of the operator. The number of results for an operator is the number of times we can call Next() and retrieve a new result. Monotonic and Non-monotonic use the same models for List Iterator and Intersection, but they have different models for HeapUnion. We will now explain Monotonic's model for HeapUnion; models for the other operators are explained in Appendix C.²
HeapUnion. Let us assume that HeapUnion has m inputs k_1, ..., k_m. We use r_i to denote the number of results from input i, and define R = ∑_{i=1}^{m} r_i. We assume that the cost of initialization within the HeapUnion operator itself is negligible, and therefore model the cost of initialization as the sum of the initialization costs of all inputs.

Recall from Section 3.1 that the first call to either Next() or SkipTo(val) will involve construction of the heap; we therefore have two different cases in the models for Next() and SkipTo(val), depending on whether it is the first or a subsequent call. The cost of the first call to Next() includes the cost of calling Next() on all m inputs, and the cost of the heap construction using Floyd's algorithm. For heap construction, we model the cost of each call to heapify() as a constant, c_h. With Floyd's algorithm, heapify() is called for half the heap entries, and then recursively every time there is a reorganization. The average-case complexity of Floyd's algorithm is well known, and the number of relocations in the

² We determined values for the constants with microbenchmarks as described in Appendix E.
Update cost (|l_u| = n):
- Simple: c_update · n
- Monotonic: c_update · n
- Non-monotonic: c_1B · n if N/n < b_1; c_2B · n if b_1 ≤ N/n < b_2; c_3B · n otherwise

Query cost (user u):
- Simple: c_list · (|F_u| − |L_u| + 1)
- Monotonic and Non-monotonic: combined from the operator models below

List Iterator:
- Init(): c_einit for an empty list; c_init otherwise
- Next(): c_next
- Skip(s): c_c + (s/b) · c_d + scan(s) · c_sc if s ≤ b; Skip(b) + log(s/b) · c_g otherwise

Intersection:
- Init(): k_1.Init() + k_2.Init()
- Next(): k_1.Next() + (t − 1) · k_1.Skip(1 + r_1/r_2) + t · k_2.Skip(((t − 1)/t) · (1 + r_2/r_1) + r_2/(r_1 · t))
- Skip(s): not used

Monotonic HeapUnion:
- Init(): ∑_{i=1}^{m} k_i.Init()
- Next(): ∑_{i=1}^{m} k_i.Next() + (γ + 1/2) · m · c_h on the first call; ∑_{i=1}^{m} (r_i/R) · k_i.Next() + (γ + 1) · c_h otherwise
- Skip(s): ∑_{i=1}^{m} k_i.Skip(r_i · s/R) + (γ + 1/2) · m · c_h on the first call; ∑_{i=1}^{m} min(1, r_i · s/R) · k_i.Skip(max(1, r_i · s/R)) + (γ + 1) · m_s · c_h otherwise

Non-monotonic HeapUnion:
- Init(): ∑_{i=1}^{m} k_i.Init()
- Next(): ∑_{i=1}^{m} k_i.Next() + (γ + 1/2) · m · c_h on the first call; ∑_{i=1}^{m} (r_i/R) · (k_i.Next() + log(1 + p_i) · c_h) otherwise
- Skip(s): ∑_{i=1}^{m} k_i.Skip(r_i · s/R) + (γ + 1/2) · m · c_h on the first call; ∑_{i=1}^{m} min(1, r_i · s/R) · k_i.Skip(max(1, r_i · s/R)) + max(m_s · γ + min(m_s, m/2), m_s · (∑_{j=1}^{m} (r_j/R) · log(1 + p_j)) − h(⌊m_s⌋)) · c_h otherwise

Figure 3: Overview of Cost Estimates
heap is approximately γm = 0.744m [27]. Thus we model the cost of heap construction as (γ + 1/2) · m · c_h.

A first call to SkipTo(val) involves skipping in all inputs, in addition to heap construction. Given that s results are skipped in this operation, we simply assume that the number of entries skipped on input i is r_i · s/R, resulting in a cost of ∑_{i=1}^{m} k_i.Skip(r_i · s/R). The cost of heap construction is modeled as explained for Next() above.

Subsequent calls to Next() involve a call to Next() for the input at the top of the heap and a heap reorganization. We estimate the cost of the call to Next() for the input at the top as a weighted average over the inputs. The model for the cost of heap maintenance is simple: we assume that there will be γ relocations when a single operator is forwarded, leading to γ + 1 calls to heapify().

Subsequent calls to SkipTo(val) will potentially forward all inputs, and then reorganize the heap according to the new current values of the inputs. On average, each input will be forwarded past r_i · s/R entries. However, HeapUnion will ensure not to call SkipTo(val) for inputs that will not be forwarded. Therefore, when the average skip length is less than 1, we model the cost as r_i · s/R calls that skip 1 entry. To find the cost of heap maintenance we estimate the number of forwarded inputs as m_s = ∑_{i=1}^{m} min(s · r_i/R, 1). Assuming that there will be as many relocations in the heap as when constructing a heap with m_s entries, the cost of heap maintenance is (γ + 1) · m_s · c_h.
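As a numerical illustration of the heap-maintenance terms above, with γ ≈ 0.744 and c_h treated as an assumed unit cost:

```python
GAMMA = 0.744  # expected relocations per entry in Floyd's heap construction

def heap_build_cost(m, c_h=1.0):
    """First call: heapify() for half of the m entries plus ~GAMMA*m
    recursive reorganizations, i.e. (GAMMA + 1/2) * m * c_h."""
    return (GAMMA + 0.5) * m * c_h

def skip_maintenance_cost(s, results, c_h=1.0):
    """Subsequent SkipTo: m_s = sum_i min(s*r_i/R, 1) estimated forwarded
    inputs, each modeled as (GAMMA + 1) heapify() calls."""
    R = sum(results)
    m_s = sum(min(s * r / R, 1.0) for r in results)
    return (GAMMA + 1.0) * m_s * c_h
```

For example, building the heap over 10 inputs is modeled as 1.244 · 10 = 12.44 units, and a skip of 2 results over four equally long inputs forwards an estimated m_s = 2 of them.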
Optimization Algorithm. Although Monotonic is more complex than Simple, it still has the property that testing only |E| access designs is sufficient to find the optimal approach. We describe how this is done in two steps. The following lemma shows that we can find a globally optimal solution by choosing the contents of the additional author-list for each user individually; it follows directly from the definitions in Figure 3.

Lemma 1. In Monotonic, the costs and savings from including the users L_u in the additional author-list for u are not affected by the contents of l_v when v ≠ u.

Lemma 1 reduces the number of access designs to test in the optimization algorithm from 2^{∑_{u∈V} |F_u|} to ∑_{u∈V} 2^{|F_u|}.

Theorem 2. If Monotonic estimates that for a user u and a given workload, the performance is improved if v ∈ F_u is included in L_u, then Monotonic will predict that it also leads to a performance improvement to include user w in L_u if w ∈ F_u, 0 < n_w < n_v, and c_d − c_sc/(2b) ≥ c_g/ln 2.

The proof of Theorem 2 is found in Appendix D. We notice that in our case, c_d − c_sc/(2b) ≥ c_g/ln 2 translates to 330.6578 ≥ 114.0751, which holds with a significant margin. Theorem 2 implies that if we sort all friends of u based on the number of documents they post, the optimal contents of L_u is a prefix of this sorted list. We thus need to check only |E| designs in total to find the optimal solution. Furthermore, notice that there is no cost associated with including users who do not post documents in the additional author-lists, and it is therefore always beneficial to do so.
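Theorem 2 turns the per-user search into a prefix scan: sort u's friends by ascending posting count and evaluate only the |F_u| + 1 prefixes. A sketch, where cost_of stands for any modeled workload cost of a candidate L_u:

```python
def best_prefix_design(friend_doc_counts, cost_of):
    """Evaluate only prefixes of the friends sorted by ascending number of
    posted documents; return the prefix with the lowest modeled cost."""
    order = sorted(friend_doc_counts, key=friend_doc_counts.get)
    best, best_cost = set(), cost_of(set())
    for i in range(1, len(order) + 1):
        candidate = set(order[:i])
        cost = cost_of(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best
```

Summed over all users this evaluates ∑_{u∈V} (|F_u| + 1), i.e. on the order of |E| designs, instead of the exponential full space.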
Validation. Monotonic's predictions for our test workloads are presented in Figure 4. The search estimates are much more accurate than Simple's, but there is still room for improvement when limit is close to 0 for Workload 2. Inaccurate modeling of the cost
Figure 4: Accuracy of cost models for queries and updates in Workload 1 and Workload 2 (search-time and update-time panels plot processing time (s) against limit, comparing actual time with the predictions of Simple, Monotonic and Non-monotonic).
of heap maintenance is one factor that contributes to this error, and we will therefore improve this aspect in Non-monotonic.
4.2 Non-monotonic

Non-monotonic is designed to be more accurate than Monotonic, at the price of a less efficient optimization algorithm for finding the globally optimal access design. To achieve better accuracy, two aspects of Monotonic are changed: the model for heap maintenance is extended, and a slightly more advanced update model is used. The formal description of Non-monotonic is found in Figure 3.

The update model from Monotonic is extended by taking list compression into account. During accumulation, the lists are compressed using VByte, which implies that lists with few entries result in long deltas that use more storage space. The model assumes that the cost of updating a list depends on the number of bytes used by VByte to represent the average delta length.

The models for heap maintenance in subsequent calls to Next() and SkipTo(val) reflect
that the cost of heap maintenance often depends on the total number of inputs as well as on the number of forwarded inputs. Given that input i is at the top of the heap when Next() is called, let p_i denote the number of inputs that will have Current() values larger than input i after the call to Next(). We estimate p_i as ∑_{k=1}^{m} min(r_k/r_i, 1), and assume that the heap maintenance cost when input i is at the top of the heap is log(1 + p_i) · c_h. By calculating a weighted average over all inputs, we end up with the average cost of heap maintenance for a Next() shown in Figure 3.

The model for heap maintenance in SkipTo(val) is slightly more complex, and the maximum of two different estimates is used: (1) The first alternative is similar to the estimate in Monotonic, but incorporates that Floyd's algorithm will never call heapify() for more than half the inputs, which yields the estimate (γ · m_s + min(m_s, m/2)) · c_h. (2) The other alternative reflects that the cost can be logarithmic in the number of inputs when the number of forwarded inputs is low. We have already estimated the average number of calls to heapify() when only one input is forwarded in the model for Next(), denoted h_next in the following. We now assume that all forwarded inputs will lead to h_next heapify() operations, but compensate for the fact that many of the inputs are not at the top of the heap when heapify() is called. The compensation is achieved with the function h(m_s) in Figure 3, which returns the minimum possible total distance from the root to m_s entries in the heap.
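The p_i estimate and the resulting average Next() maintenance cost can be written out as follows (a base-2 logarithm is assumed here, and c_h is an assumed per-heapify() unit cost):

```python
import math

def expected_larger_inputs(results):
    """p_i = sum_k min(r_k / r_i, 1): the expected number of inputs whose
    Current() value exceeds that of input i after its Next() (the sum
    includes k = i itself, as in the model)."""
    return [sum(min(r_k / r_i, 1.0) for r_k in results) for r_i in results]

def next_maintenance_cost(results, c_h=1.0):
    """Weighted average over inputs of log(1 + p_i) * c_h, weighting input
    i by the probability r_i / R that it is at the top of the heap."""
    R = sum(results)
    p = expected_larger_inputs(results)
    return sum((r_i / R) * math.log2(1.0 + p_i) * c_h
               for r_i, p_i in zip(results, p))
```

For two inputs with result counts [2, 1], p is [1.5, 2.0]: the long input expects half an overtaking input plus itself, the short one expects both.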
Optimization Algorithm. Lemma 1 also holds for Non-monotonic. However, the proof of Theorem 2 does not hold due to the extensions in Non-monotonic. As a result, an optimization algorithm based on Non-monotonic must test ∑_{u∈V} 2^{|F_u|} access designs to find the optimal approach. In social networks, users have hundreds of friends, and it is therefore not feasible to test all possibilities. We thus choose to limit the solution space to the space explored by the optimization algorithm based on Monotonic. If Non-monotonic is actually more accurate than Monotonic, the resulting design should be at least as efficient.

Validation. Non-monotonic's predictions on our two test workloads are shown in Figure 4. The extended update model represents a slight improvement. For search costs, Non-monotonic is clearly more accurate than Monotonic for low limits in Workload 2. This reflects that the model for heap maintenance actually plays a significant role.
5 Experiments

To validate the efficiency of our system, we conduct a set of experiments. We compare our query processing strategy based on HeapUnion to Lazy Merge, and test how the solutions based on optimization with different cost models perform compared to the user design and the friends design. Our experiments are based on the system described in Section 3, which is implemented in Java. The experiments are run on a computer with an Intel Xeon 3.2 GHz CPU with 6 MB cache and 16 GB main memory. The computer runs RedHat Enterprise 5.3 and Java version 1.6.
5.1 Efficiency of HeapUnion

To test the efficiency of HeapUnion, we compare it to the alternative suggested by Raman et al. by exchanging the HeapUnion operator in the queries with Lazy Merge [23]. To do so, we have implemented a version of Lazy Merge that stores the intermediate results in an uncompressed list where skips are implemented with galloping search [6]. As explained in Section 3, the parameter α describes how eagerly Lazy Merge pre-merges inputs into the intermediate results. When α is set to 0, Lazy Merge pre-merges all inputs; when it is set to ∞, no inputs are pre-merged.

We use the workloads from Appendix B in the experiments, but limit the design to the user design because this provides a real test of the solutions' ability to process queries with many author-lists efficiently. We report the time spent processing the 100,000 queries for each of the workloads, and vary α between 0 and ∞. We have argued that one of the reasons why Lazy Merge is not ideal for our workloads is that we typically process top-k queries. To isolate this effect, we process the workloads both when returning only the top-100 results and when returning all results.
The results from the experiments are shown in Figure 5. Notice that HeapUnion does not depend on α, and its cost is therefore constant. When using Lazy Merge, the difference between the cost of top-100 queries and retrieving all results increases with α. This reflects the inadequacy of approaches that pre-merge when processing top-k queries. The processing time with HeapUnion is clearly dependent on the number of retrieved results, and HeapUnion is therefore an attractive solution for top-k queries, as expected.

Lazy Merge consistently performs best with α set to one of the extreme values. Poor performance in the other cases is mainly caused by the large number of merges resulting from many inputs with different lengths. Which extreme value of α performs best for a particular workload depends on the average length of the author-lists compared to the posting list. HeapUnion outperforms all configurations of Lazy Merge for both workloads, with a speed-up between 1.13 and 2.36, reflecting that the high number of inputs is processed efficiently regardless of workload characteristics.
5.2 Workload Optimization

We have tested the accuracy of the different cost models in Section 4, but the key success factor for a cost model is whether using it in an optimization algorithm leads to efficient designs. We therefore conduct a set of experiments to compare the access designs suggested by the different cost models and their associated optimization algorithms. To do so, we use the networks and documents from the workloads in Appendix B, and combine them with several different sets of queries. We first test workloads with different numbers of queries, generated by letting a random user search for a random term. We also experiment with workloads where the users who search follow a Zipfian distribution with exponent 1.5, to see how skew affects the efficiency of the different approaches. We compare the designs based on Simple, Monotonic and Non-monotonic to the user design and the friends design.
APPENDIX A. OTHER PAPERS
Figure 5: HeapUnion vs. Lazy Merge varying α. (Two panels, Workload 1 and Workload 2; series: Lazy Merge and HeapUnion, each for top-100 and top-∞; x-axis: α from 0 to ∞ on a logarithmic scale; y-axis: processing time in ns.)
The results of our experiments are shown in Figure 6, where the first line shows the results for workloads with uniformly distributed queries. To get a better view of the relative differences between the methods, the second line shows the performance of all approaches relative to the best of the user design and the friends design.

For the workloads with uniformly distributed queries, Simple often leads to designs that are slower than choosing the best of the user design and the friends design, due to the inaccuracies in its predictions. Compared to Monotonic and Non-monotonic, the designs from Simple are up to 30% slower. Monotonic and Non-monotonic generally lead to reasonable designs that are comparable to or faster than the basic approaches. However, for Workload 2, both approaches lead to sub-optimal designs when queries are frequent relative to updates. This reflects the inaccuracies in the estimates for low limits for Workload 2 in Figure 4. However, Non-monotonic clearly results in better designs than Monotonic in this case, so the additional complexity pays off.

The results from Workload 2 with Zipfian queries are shown in the third line in Figure 6, denoted Workload 2Z. We have omitted the results for Workload 1Z due to space constraints, but they are similar to those for Workload 2Z. The results show that our overall solution performs much better than the basic designs when there is skew, with a speed-up of up to 3.4. With skew, the optimization problem is simple because a few users submit almost all queries, and all cost models are able to reflect that these users should have additional author-lists with nearly all their friends represented. Non-monotonic is still slightly better than the others, but the difference is not significant.
6 Related Work

Due to its clear commercial value, there has recently been some interest in search in social networks. Search engines like Google and Bing already support search over public Twitter
Paper 9: Search Your Friends And Not Your Enemies
Figure 6: Performance for workloads with different fractions of documents vs. queries. (Five panels: processing time in ns for Workload 1 and Workload 2, relative performance for Workload 1 and Workload 2, and processing time for Workload 2Z; series: user design, friends design, Simple, Monotonic and Non-monotonic; x-axis: number of queries up to 4·10⁶.)
posts, and they plan to include private posts as well.³ Facebook allows users to search their friends' posts. However, the details of these commercial solutions have not been published.

The problem of supporting keyword search over content in a social network was described in [8]. The problem is related to access control both in information retrieval [9, 26] and for structured data [7]. The solutions we explore in this paper are related to the

³ http://www.networkworldme.com/v1/news.aspx?v=1&nid=3236&sec=netmanagement
solution explored by Silberstein et al. on how to support retrieving the most recent events in users' news feeds in a social network [25]. While our problem is more complex because we support keyword search, the procedure for solving it is along the same lines.

Cost models have been used to estimate the efficiency of processing strategies in both information retrieval and databases [29, 10, 11, 20]. There exist advanced cost models for evaluating different index construction and update strategies [10, 29]. For search queries, however, simple models are most commonly used, sometimes without verification on an actual system [11]. We use simple cost models for updates in this paper, but relatively advanced models are required to predict search performance accurately. Manegold et al. outline a generic strategy to estimate the memory access costs in databases [20]. The memory access costs in our advanced models can be considered part of the model for the list iterator. However, our focus is on estimating processing time, whereas Manegold et al. focus on memory access costs.

The problems of calculating unions and particularly intersections of lists have attracted a lot of attention, both through the introduction of new algorithms with theoretical analyses [18, 16, 1, 4] and through experimental investigations [5, 2]. Algorithms for single operations have also been combined into full query plans [23, 12]. Most algorithms assume that the input sets are uncompressed. Motivated by the ability to store more data and in some cases improve computational efficiency, work from the IR community relaxes this assumption by introducing synchronization points in the compressed lists to speed up random look-ups [14, 22]. In addition, there exists work on algorithms adapted to new hardware trends [28]. Our intersection operator is based on the ideas from Demaine et al. [16], and we use a strategy similar to the one introduced by Culpepper and Moffat for synchronization points in the lists, but we use a novel union operator. Another discriminating factor is that we attempt to estimate the exact running time of the queries, while most previous work focuses on either asymptotic complexity or experimental running times.
7 Conclusions

In this paper we have presented a system for keyword search in social networks with access control, designed to perform well for a wide range of workloads. We have addressed the query processing strategy and developed an efficient HeapUnion operator that yields a speed-up between 1.13 and 2.36 compared to previous solutions. We have also developed cost models for the system and used them as the basis for workload optimization. With an accurate cost model, workload optimization leads to performance comparable to or better than the best basic designs; the speed-up is up to 3.4 for workloads with skew.

We believe that the ideas discussed here have even wider applicability than search in social networks. First, queries similar to the ones discussed in this paper occur in several settings, such as star joins in data warehouses [23], and HeapUnion might be a practical and efficient solution in such application scenarios as well. Furthermore, workload optimization might be applicable to other problems. In particular, faceted
search is supported by storing meta-data about the documents, and several different storage schemes are possible [15]. Workload optimization could be used to find an optimal storage scheme for the meta-data, an interesting direction for future work.
References

[1] R. Baeza-Yates. A fast set intersection algorithm for sorted sequences. In Proc. CPM, 2004.

[2] R. Baeza-Yates and A. Salinger. Fast intersection algorithms for sorted sequences. Algorithms and Applications, 2010.

[3] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439), 1999.

[4] J. Barbay and C. Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Trans. Algorithms, 4(1), 2008.

[5] J. Barbay, A. López-Ortiz, T. Lu, and A. Salinger. An experimental investigation of set intersection algorithms for text searching. J. Exp. Algorithmics, 14, 2009.

[6] J. L. Bentley and A. C.-C. Yao. An almost optimal algorithm for unbounded searching. Information Processing Letters, 5(3), 1976.

[7] E. Bertino, S. Jajodia, and P. Samarati. Database security: research and practice. Inf. Syst., 20(7), 1995.

[8] T. A. Bjørklund, M. Götz, and J. Gehrke. Search in social networks with access control. In Proc. KEYS, 2010.

[9] S. Büttcher and C. L. A. Clarke. A security model for full-text file system search in multi-user environments. In Proc. FAST, 2005.

[10] S. Büttcher, C. L. A. Clarke, and B. Lushman. Hybrid index maintenance for growing text collections. In Proc. SIGIR, 2006.

[11] B. B. Cambazoglu, V. Plachouras, and R. Baeza-Yates. Quantifying performance and quality gains in distributed web search engines. In Proc. SIGIR, 2009.

[12] E. Chiniforooshan, A. Farzan, and M. Mirzazadeh. Worst case optimal union-intersection expression evaluation. In Proc. ICALP, 2005.

[13] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001.

[14] J. S. Culpepper and A. Moffat. Compact set representation for information retrieval. In Proc. SPIRE, 2007.

[15] D. Dash, J. Rao, N. Megiddo, A. Ailamaki, and G. Lohman. Dynamic faceted search for discovery-driven analysis. In Proc. CIKM, 2008.

[16] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Adaptive set intersections, unions, and differences. In Proc. SODA, 2000.

[17] S. Gurajada and S. Kumar P. On-line index maintenance using horizontal partitioning. In Proc. CIKM, 2009.

[18] F. K. Hwang and S. Lin. Optimal merging of 2 elements with n elements. Acta Informatica, 1(2), 1971.

[19] N. Lester, A. Moffat, and J. Zobel. Fast on-line index construction by geometric partitioning. In Proc. CIKM, 2005.

[20] S. Manegold, P. Boncz, and M. L. Kersten. Generic database cost models for hierarchical memory systems. In Proc. VLDB, 2002.

[21] G. Margaritis and S. V. Anastasiadis. Low-cost management of inverted files for online full-text search. In Proc. CIKM, 2009.

[22] A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Trans. Inf. Syst., 14(4), 1996.

[23] V. Raman, L. Qiao, W. Han, I. Narang, Y.-L. Chen, K.-H. Yang, and F.-L. Ling. Lazy, adaptive rid-list intersection, and its application to index anding. In Proc. SIGMOD, 2007.

[24] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In Proc. SIGIR, 2002.

[25] A. Silberstein, J. Terrace, B. F. Cooper, and R. Ramakrishnan. Feeding frenzy: Selectively materializing users' event feeds. In Proc. SIGMOD, 2010.

[26] A. Singh, M. Srivatsa, and L. Liu. Efficient and secure search of enterprise file systems. In Proc. ICWS, 2007.

[27] R. Sprugnoli. Recurrence relations on heaps. Algorithmica, 15(5), 1996.

[28] D. Tsirogiannis, S. Guha, and N. Koudas. Improving the performance of list intersection. Proc. VLDB Endow., 2(1), 2009.

[29] I. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes. Academic Press, 1999.

[30] J. Zhang, X. Long, and T. Suel. Performance of compressed inverted list caching in search engines. In Proc. WWW, 2008.

[31] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2), 2006.

[32] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Super-scalar RAM-CPU cache compression. In Proc. ICDE, 2006.
A Query Processing Operators
A.1 List Iterator

The List Iterator is used to iterate through posting lists and author-lists in our system. Recall from Section 3 that the lists are compressed using PForDelta, a form of delta compression commonly used to compress inverted indexes in search engines [32, 30]. With delta compression, an entry in a sorted list is stored as the delta from the previous entry. Small values are therefore likely, and compression is achieved by using a representation where small values require little space.
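As an illustration, the delta step can be sketched as follows. This is a minimal sketch of plain delta encoding and decoding only; the actual PForDelta codec additionally bit-packs the deltas and patches exceptional values.

```python
def delta_encode(sorted_ids):
    """Store each entry in a sorted list as the difference from its predecessor."""
    prev = 0
    deltas = []
    for x in sorted_ids:
        deltas.append(x - prev)  # small when entries are dense
        prev = x
    return deltas

def delta_decode(deltas):
    """Reconstruct the original sorted list by prefix summation."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out
```

Encoding [3, 7, 12] yields the deltas [3, 4, 5], which a bit-packed representation can store in far fewer bits than the raw document IDs.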
To achieve efficient decompression rates, we decompress batches of b = 128 values at a time [32]. With this in mind, the implementation of Init(), Current() and Next() is straightforward. During initialization, we allocate an array to store the batch being processed in an uncompressed format. A standard call to Next() will forward the current result to the next in the batch, and every bth call will trigger decompression of a new batch.

The use of delta compression complicates the implementation of SkipTo(val), because previous entries in the list must be decompressed to find the value of an entry; straightforward random look-ups are therefore impossible. We can implement SkipTo(val) by calling Next() until the returned value is ≥ val, but that is potentially inefficient when many entries are skipped. We enable more efficient random look-ups by storing the first value of each batch uncompressed in an auxiliary array [14]. A SkipTo(val) is processed by first checking whether the entry to skip to is located in the current batch. If not, we do a galloping search in the auxiliary array to find the correct batch and decompress it [6]. When the correct batch is stored in the intermediate array, we forward within that batch until the correct answer is found.
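A simplified sketch of this scheme follows. Python lists stand in for the compressed batches, the small batch size and class layout are illustrative, and a binary search over the auxiliary array stands in for the galloping search; the sketch also assumes forward-only skips, as in the paper.

```python
import bisect

class ListIterator:
    """Batched, delta-encoded sorted list with SkipTo via an auxiliary
    array of uncompressed batch-first values (a sketch, not the real codec)."""

    def __init__(self, values, b=128):
        self.batches, self.firsts = [], []
        for i in range(0, len(values), b):
            chunk = values[i:i + b]
            self.firsts.append(chunk[0])  # auxiliary array: first value per batch
            self.batches.append([chunk[j] - chunk[j - 1]
                                 for j in range(1, len(chunk))])
        self._decompress(0)

    def _decompress(self, i):
        # Rebuild batch i in the intermediate array by prefix summation.
        self.cur = i
        vals = [self.firsts[i]]
        for d in self.batches[i]:
            vals.append(vals[-1] + d)
        self.vals = vals

    def skip_to(self, val):
        # If val lies beyond the current batch, locate the right batch in the
        # auxiliary array (binary search here, galloping search in the paper)
        # and decompress it; then forward within that batch.
        if self.cur + 1 < len(self.firsts) and val >= self.firsts[self.cur + 1]:
            self._decompress(bisect.bisect_right(self.firsts, val) - 1)
        j = bisect.bisect_left(self.vals, val)
        if j == len(self.vals):              # answer starts the next batch
            if self.cur + 1 == len(self.firsts):
                return None                  # past the end of the list
            self._decompress(self.cur + 1)
            return self.vals[0]
        return self.vals[j]
```

For example, with the list [1, 5, 9, 20, 33, 47, 60, 88] and b = 3, skip_to(10) decompresses the second batch and returns 20.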
A.2 Intersection

The intersection operator sorts its inputs according to their expected number of results. The number of results for an operator is the number of times we can call Next() and retrieve a new result. Both Next() and SkipTo(val) begin with the input with the fewest results, and call Next() or SkipTo(val) on it, respectively. From then on, the two methods are similar. Both alternate between the inputs and try to skip forward to the last value returned from the other input. When the same value is found in both inputs, this value is part of the intersection and is therefore returned. The outlined processing strategy is similar to the one introduced by Demaine et al. [16].
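The alternation can be sketched as follows for the two-input case. This is a sketch of the idea, not the actual operator: the inputs are plain sorted Python lists, and bisect with a lower bound stands in for SkipTo on an iterator.

```python
import bisect

def intersect(a, b):
    """Intersect two sorted lists by alternately skipping each input forward
    to the last value seen in the other, starting from the smaller input."""
    if len(a) > len(b):
        a, b = b, a               # start with the input with fewest results
    out = []
    i = j = 0
    while i < len(a):
        v = a[i]
        j = bisect.bisect_left(b, v, j)   # skip b forward to >= v
        if j == len(b):
            break                          # b exhausted
        if b[j] == v:
            out.append(v)                  # same value in both inputs
            i += 1
        else:
            i = bisect.bisect_left(a, b[j], i)  # b overshot: skip a forward
    return out
```

On the inputs [1, 3, 7, 9] and [3, 4, 7, 10, 12], the operator alternates between the lists and emits [3, 7].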
B Test Workloads

Two workloads with different characteristics, denoted Workload 1 and Workload 2, are used to test the accuracy of our cost models and the efficiency of our system. Workload 1 has 10,000 users and Workload 2 has 100,000. In both networks, users have 100 friends each, and the friendships are generated with Barabási's preferential attachment model [3].
Documents in both workloads were obtained from a crawl of Twitter in February 2010. In Workload 1, we assign 1,500,000 documents to the users in a random process where there is a strong correlation between the number of documents posted and the number of followers, while the 2,500,000 documents in Workload 2 are assigned to users without such a correlation. Each workload also has 100,000 top-100 queries generated by choosing a random user who searches for a random term in the documents (except stop words).

We use a set of pre-generated designs to test the accuracy of the cost models. Motivated by the fact that search performance improves if more users are represented in the additional author-lists, and that it is less costly to include a user who posts infrequently than one who posts frequently, we generate a set of designs with the following simple strategy: we represent a user u in the additional author-lists of all her followers, O_u, if n_u < limit. By choosing a set of different values for limit, we obtain a range of designs in between empty additional author-lists and additional author-lists that include all friends.
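The design-generation strategy can be sketched as follows. The function and variable names are ours: `followers` maps each user u to her set of followers O_u, and `post_counts` maps u to n_u.

```python
def generate_design(followers, post_counts, limit):
    """Build one pre-generated design: user u is represented in the
    additional author-lists of all her followers O_u when n_u < limit."""
    additional = {u: set() for u in followers}   # follower -> users included
    for u, o_u in followers.items():
        if post_counts.get(u, 0) < limit:        # infrequent poster: cheap to include
            for f in o_u:
                additional[f].add(u)
    return additional
```

Sweeping limit from 0 to infinity then yields the range of designs between empty additional author-lists (limit = 0) and author-lists containing all friends.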
C Cost Models for Operators

Monotonic and Non-monotonic use the same models for the List Iterator and Intersection operators, and the underlying intuition for these models is presented here. A more formal description is given in Figure 3.
C.1 List Iterator

The List Iterator operator is initialized by using a look-up structure to find the list to process. Recall that we decompress lists in batches of b values at a time. An array for the b intermediate results is therefore allocated during initialization, and the first batch of values is decompressed. However, if the look-up indicates that the list is empty, neither intermediate array allocation nor decompression is required. In conclusion, we model the cost of Init() for list iterators with two constants, c_init and c_einit, depending on whether the list is empty or not.

Every bth call to Next() will trigger decompression of a new batch. However, we estimate the average cost of a call to Next() to be constant, c_next.

The model for the SkipTo(val) operation is slightly more complex and depends on the number of entries skipped, denoted s. As described in Appendix A.1, a galloping search is used to find the correct batch to decompress if many entries are skipped, and we therefore discriminate between the cases s ≤ b and s > b. If s ≤ b, we assume that there is some constant cost associated with a skip operation, c_c. It might be necessary to decompress a new batch of values to find the correct return value. We model the cost of decompressing a segment as a constant, c_d, and the probability that it happens as s/b. In addition, we have to forward within the correct batch of values until we find the value we seek. The number of scanned values is lower than s when we need to decompress a new batch, and the average number of entries scanned in our implementation is scan(s) = 1 + (2b − 1)s/(2b) − s²/(2b). We assume that scanning an entry also has a constant cost, c_sc. When s > b, the logarithmic cost of the galloping search is incorporated into the model as c_g log(s/b), and the rest of the cost is equal to Skip(b).
C.2 Intersection

For the intersection operator, we restrict our analysis to two inputs because that is what we use in this paper. The initialization within the Intersection operator is assumed to have negligible cost overall, and the cost of Init() is the sum of the initialization costs in the two inputs, k_1 and k_2. The inputs are assumed to have r_1 and r_2 results, respectively, with r_1 < r_2.

As explained in Appendix A.2, the Intersection operator works similarly for Next() and SkipTo(val) operations, but we focus on Next() in this paper because that is the method we will use. The processing of a Next() call will begin with a call to Next() on input k_1. Then, there will be a set of skips in each input until we find a value that occurs in both. We assume that the deltas between entries in the inputs are N/r_1 and N/r_2 document IDs, respectively. Furthermore, we assume that the inputs are independent, so that the deltas between results of the Intersection can be estimated as N²/(r_1 r_2). The amount skipped in one round of skips (one skip in each input) is estimated to be N/r_1 + N/r_2. We now calculate t, the expected number of rounds with skips, as:

t = 1 + (N²/(r_1 r_2) − N/r_1) / (N/r_1 + N/r_2) = 1 + (N − r_2)/(r_1 + r_2)

Because input 1 will be forwarded with a Next() call in the first round, there will be t − 1 skips in it with average length (N/r_1 + N/r_2)/(N/r_1) = 1 + r_1/r_2. Similarly, there will be t skips in input 2, but one of the skips will, according to our assumptions, arrive at the last value from input 1, which results in a shorter skip in the last round. The average skip length is therefore:

((t − 1)/t)(1 + r_2/r_1) + r_2/(r_1 t)
D Proof of Theorem 2

Proof. To see this, we will show that the additional costs from also including user w are never larger per document posted by w than the costs of including v per document posted by v, and that the savings during query processing from also including w are at least as large per posted document as for v.

The cost of updates to l_u is linear in the number of document IDs. The costs of including the documents from v and w in l_u are therefore c_update·n_v and c_update·n_w, respectively. Per document for each user, the update cost is thus the same, namely c_update.

We will now focus on queries. Recall that the queries are processed with the query template in Figure 2. The only things that will change in the query plans for search queries from user u when L_u changes are the inputs to HeapUnion. Notice that what HeapUnion returns for a particular call to SkipTo(val) or Next() will not change, and the processing cost for the intersection operator and the iterator for the posting list is thus unaffected. We therefore only need to show that for the methods for Monotonic HeapUnion in Figure 3, the savings from including w in addition to v in L_u are at least as large per posted document as when only including v, assuming that all inputs to the union are List Iterators. We will go through each of the methods in order.

Init(): When v is included in L_u, the cost will decrease by c_init if |L_u| > 0 before v was included, because the included list does not require initialization. Otherwise, the cost will remain constant because we have to initialize l_u instead. When w is also included in L_u, the cost will definitely decrease by c_init because we know that |L_u| > 0 before w was included. Because c_init/n_v < c_init/n_w when n_v > n_w, the savings during Init() are at least as large per entry for w as for v.

Next(), first call: From the formula in Figure 3, it is clear that when assuming that |l_u| > 0 before v is included, the savings from including v are c_next + (γ + 1/2)c_h (it is less if |l_u| = 0). The savings from including w in L_u are at least the same (or c_next + 2(γ + 1/2)c_h if adding w to L_u makes L_u = F_u, in which case no heap is required). By a similar argument as for Init(), the savings per n_w for w are at least as large as per n_v for v.
Next(), subsequent calls: Because all inputs to the union are List Iterators, the cost of the call to Next() for the input at the top of the heap remains the same when inputs are removed. There will be no savings from the heap maintenance costs either, unless including w makes L_u = F_u (when no heap is required). It is therefore straightforward to conclude that the savings from including w are at least as large per n_w as the savings from including v per n_v.
Skip(s), first call: We will first consider the cost of the calls to skip in the inputs that are processed in this operation. Notice that when including a new user in L_u, the resulting changes in skip costs are: 1) each skip in l_u will be longer, and the added length depends on the number of documents posted by the new user; and 2) the skips in the list that is included will no longer be necessary. In the following, we will refer to the cost of performing a skip of length s in a List Iterator as LI.Skip(s). Given that there are |l_u| entries in the additional author-list before v or w is included, the savings from including v are:

S_v = LI.Skip(s·n_v/R) + LI.Skip(s·|l_u|/R) − LI.Skip(s·(|l_u| + n_v)/R)

The savings from including w as well are:

S_w = LI.Skip(s·n_w/R) + LI.Skip(s·(|l_u| + n_v)/R) − LI.Skip(s·(|l_u| + n_v + n_w)/R)
We need to show that S_v/n_v ≤ S_w/n_w, or, in particular, that:

LI.Skip(s·n_w/R)/n_w − LI.Skip(s·n_v/R)/n_v
+ (LI.Skip(s·(|l_u| + n_v)/R) − LI.Skip(s·|l_u|/R))/n_v
− (LI.Skip(s·(|l_u| + n_v + n_w)/R) − LI.Skip(s·(|l_u| + n_v)/R))/n_w ≥ 0
We will now show that LI.Skip(s) is concave, a fact we will use to show the above. First, the second derivative of LI.Skip(s) when s < b is −c_sc/b, which is negative because c_sc and b are both positive constants. When s > b, the second derivative is −c_g/(s² ln 2), which is also negative. Furthermore, LI.Skip(s) is continuous at s = b. To show that it is concave, it thus remains to show that the derivative as s approaches b from below is not smaller than when s approaches b from above. By calculating the derivatives and finding the limits, we see that the following must hold:

c_d − c_sc/(2b) ≥ c_g/ln 2

And this holds by assumption in the theorem. We thus know that LI.Skip(s) is concave.
For a concave function f(x) and two points x_1 and x_2 with x_1 < x_2, the slope between the two points, (f(x_2) − f(x_1))/(x_2 − x_1), will never increase when either x_1, x_2 or both increase. We can therefore conclude that:

(LI.Skip(s·(|l_u| + n_v)/R) − LI.Skip(s·|l_u|/R))/n_v ≥ (LI.Skip(s·(|l_u| + n_v + n_w)/R) − LI.Skip(s·(|l_u| + n_v)/R))/n_w

It remains to show that LI.Skip(s·n_w/R)/n_w ≥ LI.Skip(s·n_v/R)/n_v. To do so, we return to f(x) as discussed above, and substitute x_1 = 0. As long as f(0) ≥ 0, the average slope between (0, 0) and (x_2, f(x_2)) is smaller than between (0, 0) and (x_3, f(x_3)) when x_3 < x_2, because f(x) is concave. Because LI.Skip(0) = c_c + c_sc is positive, as no costs are negative, we can conclude that LI.Skip(s·n_w/R)/n_w ≥ LI.Skip(s·n_v/R)/n_v.

We know that the theorem holds for the heap construction from the description of Next(), first call, above.
Skip(s), subsequent calls: The pro<strong>of</strong> for the savings in calls to SkipTo(val) for the<br />
inputs is mostly as for the first call to Skip(s) above. The only difference occurs when the<br />
skip length for an input is less than 1. We thus need to prove that this resulting function<br />
273
APPENDIX A. OTHER PAPERS<br />
is still concave, <strong>and</strong> that it is not negative for s = 0. Because it is linear when s < 1,<br />
the second derivative is 0. Furthermore, for the derivative not to increase at s = 1, we<br />
require that:

$$c_c + \frac{c_d}{b} + \left(1 + \frac{2b-1}{2b} - \frac{1}{2b}\right)c_{sc} \ge \frac{c_d}{b} + \left(\frac{2b-1}{2b} - \frac{2}{2b}\right)c_{sc}$$

This holds when:

$$c_c + \left(1 + \frac{1}{2b}\right)c_{sc} \ge 0$$
This holds because all costs are positive. Furthermore, when the skip length is 0, the cost<br />
is 0, so we reach the same conclusion as for Skip(s), first call, above.<br />
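The algebraic reduction above can be sanity-checked mechanically. In the sketch below, the functions simply restate the two sides of the derivative condition; the first constant triple uses the estimates from Table 1 and the second is an arbitrary example, since only positivity matters:

```python
def lhs(c_c, c_d, c_sc, b):
    # Left side: condition for the derivative not to increase at s = 1.
    return c_c + c_d / b + (1 + (2 * b - 1) / (2 * b) - 1 / (2 * b)) * c_sc

def rhs(c_d, c_sc, b):
    # Right side of the same condition.
    return c_d / b + ((2 * b - 1) / (2 * b) - 2 / (2 * b)) * c_sc

def simplified(c_c, c_sc, b):
    # The difference lhs - rhs reduces to c_c + (1 + 1/(2b)) * c_sc.
    return c_c + (1 + 1 / (2 * b)) * c_sc

for b in (1, 8, 128):
    for c_c, c_d, c_sc in [(24.8305, 330.6624, 1.1889), (1.0, 2.0, 3.0)]:
        diff = lhs(c_c, c_d, c_sc, b) - rhs(c_d, c_sc, b)
        assert abs(diff - simplified(c_c, c_sc, b)) < 1e-9
        assert diff >= 0  # holds because all costs are positive
```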
The model for the cost of heap maintenance in subsequent calls to SkipTo(val) is linear in the number of forwarded inputs. Hence, we proceed to show that the number of forwarded inputs decreases at least as much per n_w when w is included in the additional author-list as it does per n_v when only v is included. The reduction when including v in L_u is:
$$R_v = \min\left(\frac{s|l_u|}{R},\, 1\right) + \min\left(\frac{sn_v}{R},\, 1\right) - \min\left(\frac{s(|l_u| + n_v)}{R},\, 1\right)$$
And the reduction when also including w is:

$$R_w = \min\left(\frac{s(|l_u| + n_v)}{R},\, 1\right) + \min\left(\frac{sn_w}{R},\, 1\right) - \min\left(\frac{s(|l_u| + n_v + n_w)}{R},\, 1\right)$$
We need to show that $\frac{R_w}{n_w} \ge \frac{R_v}{n_v}$. It is clear that $\frac{\min(sn_v/R,\,1)}{n_v} \le \frac{\min(sn_w/R,\,1)}{n_w}$ when $n_w < n_v$, and it remains to show that:

$$\frac{\min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right) - \min\left(\frac{s(|l_u|+n_v+n_w)}{R},\, 1\right)}{n_w} - \frac{\min\left(\frac{s|l_u|}{R},\, 1\right) - \min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right)}{n_v} \ge 0$$
Assume first that $\frac{s(|l_u|+n_v+n_w)}{R} \le 1$. In that case, we see that the statement sums to 0 because $|l_u|, n_v, n_w \ge 0$. There are three more possible cases:
a) $\frac{s(|l_u|+n_v+n_w)}{R} > 1$ and $\frac{s(|l_u|+n_v)}{R} < 1$,

b) $\frac{s(|l_u|+n_v)}{R} \ge 1$ and $\frac{s|l_u|}{R} < 1$, and
Paper 9: Search Your Friends And Not Your Enemies<br />
Figure 7: (Left) update workload with x list entries (c_update = 1001.7939); (right) search workload with x accessed author-lists (c_list = 24111.2354).
c) $\frac{s|l_u|}{R} \ge 1$.
In case a), $-\min\left(\frac{s(|l_u|+n_v+n_w)}{R},\, 1\right)$, which is negative, is bounded below by $-1$, so we subtract less than when the statement summed to 0, and the statement is therefore positive. In case b), $\min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right)$ and $-\min\left(\frac{s(|l_u|+n_v+n_w)}{R},\, 1\right)$ cancel. Furthermore, $\min\left(\frac{s|l_u|}{R},\, 1\right) < \min\left(\frac{s(|l_u|+n_v)}{R},\, 1\right)$, making the statement positive. In case c), the result is 0.
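The inequality just established can also be spot-checked numerically. The sketch below draws random parameter settings with n_w < n_v (the ranges are arbitrary) and verifies that R_w per n_w is at least R_v per n_v:

```python
import random

def reduction(prefix, n, s, R):
    # Reduction in forwarded inputs when a list of n documents is
    # merged into an additional author-list that already covers
    # `prefix` entries, following the R_v / R_w expressions above.
    return (min(s * prefix / R, 1) + min(s * n / R, 1)
            - min(s * (prefix + n) / R, 1))

random.seed(0)
for _ in range(10_000):
    s = random.uniform(0.01, 10)
    R = random.uniform(1, 1000)
    l_u = random.uniform(0, 100)
    n_v = random.uniform(1, 100)
    n_w = random.uniform(0.01, n_v)  # n_w < n_v, as in the proof
    R_v = reduction(l_u, n_v, s, R)
    R_w = reduction(l_u + n_v, n_w, s, R)
    assert R_w / n_w >= R_v / n_v - 1e-9
```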
We have now shown that the costs of including w in addition to v in the additional author-list for u are exactly the same per document posted by w, n_w, as the costs of including only v are per n_v. Furthermore, we have shown that the savings per posted document are at least as large for all relevant functions in HeapUnion when also including w as when including only v in L_u. Because including v by assumption leads to a performance improvement, also including w thus leads to a further performance improvement.
E Microbenchmarks
This section contains an overview of the microbenchmarks used to determine the constants in the cost models in Section 4. All constants are summarized in Table 1.

The constants used in Simple are estimated from a complete workload. We process Workload 2 from Appendix B and measure update cost as a function of author-list length, and search cost as a function of the number of merged lists in the queries. Results and estimates are shown in Figure 7.

The costs of initialization and Next() operations in list iterators are estimated in an experiment using author-lists with deltas between entries ranging from 1 to 100,000. The results are shown in Figure 8. In a similar experiment using posting lists instead of author-lists, the estimate for c_init was higher. This is probably because we use a different look-up structure in the inverted index, and we therefore have two values for c_init.
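The estimation of c_init and c_next amounts to an ordinary least-squares fit of a line to measured times. The sketch below reproduces the idea on synthetic timings; the "true" constants and noise level are invented for illustration (chosen near the reported magnitudes), not re-measurements:

```python
import random

# Synthetic timings following the assumed linear model
# time(x) = c_init + x * c_next. The "true" constants below are
# invented for illustration, not re-measurements.
true_init, true_next = 1260.0, 15.0
random.seed(1)
xs = [10 ** k for k in range(6)]
ys = [true_init + x * true_next + random.gauss(0, 5) for x in xs]

# Ordinary least squares for a line y = a + b * x.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b_hat = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
a_hat = my - b_hat * mx

assert abs(b_hat - true_next) < 0.1   # recovered c_next
assert abs(a_hat - true_init) < 100   # recovered c_init
```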
Figure 8: Time spent on x next operations with different deltas. For author-lists (above), estimate with c_init = 1261.6450 and c_next = 15.0138. For posting-lists (below), estimate with c_init-posting = 1675.6671.
The cost of accessing an empty list is estimated with an experiment where we have a user who is friends with x users that do not post documents. We perform 100,000 queries on behalf of this user and record the time per search. The results are shown in Figure 9 along with the estimates.

The cost of skipping in the list iterator involves several constants. We first estimate the scan cost with an experiment that tests all scan lengths up to b = 128. The results and the resulting estimates are shown in Figure 9. The rest of the constants for skipping in list iterators are estimated from a set of experiments where skips of various lengths (1 to 10,000) are performed in 1000 lists. All results from these experiments are shown in Figure 10. We use these results to first estimate c_c and c_d by considering all skip lengths below b and subtracting the average scan cost (from the above experiment). The constant for long skips is estimated by subtracting the estimated cost of skipping b values from the longer skips. The results and resulting estimates from these experiments are shown in Figure 11.

The cost of a heapify() operation is estimated by running union queries over a set of iterators that simply return numbers at given intervals. We are thus able to predetermine the number of heapify() operations. To estimate the constants in construction, we
test a workload with 100,000 lists and accumulate 1,000 entries in each list. The deltas between the entries vary from 1 to 100,000, which results in different numbers of bytes used during accumulation. Results and estimates are shown in Figure 12.

Figure 9: Left: Time spent on x initializations of empty lists (c_einit = 543.7537). Right: Cost of scanning x entries in an array (c_sc = 1.1889).
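A heap-based union over synthetic interval iterators, with a predictable number of heap-maintenance operations, can be sketched as follows. This mimics the benchmark setup in spirit only; it is not the HeapUnion implementation from the paper, and Python heapq's sift operations stand in for heapify():

```python
import heapq

def union_count_ops(intervals, limit):
    # K-way merge over iterators that return multiples of fixed
    # intervals, counting heap-maintenance (sift) operations. This
    # mimics the benchmark setup only in spirit; it is not the
    # paper's HeapUnion, and duplicate values are kept.
    iters = [iter(range(step, limit, step)) for step in intervals]
    heap = []
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, i))
    heapq.heapify(heap)
    ops = 0
    out = []
    while heap:
        val, i = heap[0]          # peek at the smallest head
        out.append(val)
        nxt = next(iters[i], None)
        if nxt is None:
            heapq.heappop(heap)   # input exhausted
        else:
            heapq.heapreplace(heap, (nxt, i))
        ops += 1                  # one sift per produced value
    return out, ops

values, ops = union_count_ops([2, 3], 10)
assert values == sorted(values)
assert ops == len(values)  # maintenance count is predetermined
```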
Figure 10: Cost of SkipTo(val) in the list iterator, for 1 to 10,000 skipped entries.
Figure 11: Determination of skip constants for short skips (left: c_c = 24.8305 and c_d = 330.6624) and long skips (right: c_g = 79.0708).
Figure 12: Left: Time per heapify operation (c_h = 9.9073). Right: Microbenchmark for construction (c_1b = 990.3406, c_2b = 1082.8243 and c_3b = 1135.9749).
Constant         Estimate
c_update         1001.7939
c_list           24111.2354
c_init           1261.6450
c_init-posting   1675.6671
c_einit          543.7537
c_next           15.0138
c_sc             1.1889
c_c              24.8305
c_d              330.6624
c_g              79.0708
c_h              9.9073
c_1b             990.3406
c_2b             1082.8243
c_3b             1135.9749

Table 1: Estimates from microbenchmarks.