INDEXING VALUES<br />
Now what if we want to search for element values? Let's imagine that I ask<br />
you for books published in a certain year. In XPath, that can be expressed as<br />
/book[metadata/pubyear = 2016]. How do we resolve this efficiently?<br />
Thinking back to the paper-based approach, what you want to do is maintain a term<br />
list for each XML element value (or JSON property value). In other words, you can<br />
track a term list for documents having a pubyear element equal to 2016, as well as any<br />
other element name with any other value you find during indexing. Then, for any<br />
query asking for an element with a particular value, you can immediately resolve which<br />
documents have it directly from indexes. Intersect that value index with the parent-child<br />
structural indexes discussed above, and you've used several small indexes in cooperation<br />
to match a larger query. It works even when you don't know the schema or query in<br />
advance. This is how you build a database using the heart of a search engine.<br />
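The cooperation between indexes can be pictured with a small toy model. The sketch below is only an illustration, not MarkLogic's implementation: the names `docs`, `value_index`, `child_index`, and `books_published_in` are hypothetical, and plain Python sets stand in for term lists.

```python
from collections import defaultdict

# Toy documents: each is a list of (parent, element, text value) triples.
# Real documents are trees; this flat form is an assumption for brevity.
docs = {
    1: [("book", "metadata", None), ("metadata", "pubyear", "2016")],
    2: [("book", "metadata", None), ("metadata", "pubyear", "2015")],
    3: [("article", "metadata", None), ("metadata", "pubyear", "2016")],
}

value_index = defaultdict(set)  # (element, value) -> ids of matching documents
child_index = defaultdict(set)  # (parent, child)  -> ids of matching documents

# Indexing pass: record every parent-child pair and every element value seen.
for doc_id, triples in docs.items():
    for parent, element, value in triples:
        child_index[(parent, element)].add(doc_id)
        if value is not None:
            value_index[(element, value)].add(doc_id)

def books_published_in(year):
    """Resolve /book[metadata/pubyear = year] by intersecting term lists."""
    return (child_index[("book", "metadata")]
            & child_index[("metadata", "pubyear")]
            & value_index[("pubyear", str(year))])

matches = books_published_in(2016)
```

Document 3 has the right pubyear value but is not a book, so the structural term list removes it from the intersection. No schema was declared up front; the indexes simply record whatever names and values appear.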
Can an element-value index be stored efficiently? Yes, thanks to hashing. Instead of<br />
storing the full element name and value, you can hash the element name and value<br />
down to a succinct integer and use that as the term list lookup key. Then no matter how<br />
long the element name and value, it's actually a small entry in the index. MarkLogic<br />
uses hashes behind the scenes to store all term list keys, element-value or otherwise, for<br />
the sake of efficiency. The element-value index has proven to be so efficient and useful<br />
that it's always and automatically enabled within MarkLogic.<br />
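To see why hashing keeps the index compact, consider this minimal sketch. The text does not say which hash function MarkLogic uses, so `hashlib.blake2b` with an 8-byte digest is a stand-in chosen for illustration, and `term_key` is a hypothetical helper.

```python
import hashlib

def term_key(element, value):
    """Hash an element name and value down to a 64-bit integer key."""
    # The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    data = f"{element}\x00{value}".encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big")

# A short name/value pair and a very long one both yield a key of the same size.
short_key = term_key("pubyear", "2016")
long_key = term_key("a-very-long-element-name" * 10, "x" * 10_000)
```

Both keys fit in 64 bits regardless of input length, which is the property the paragraph above relies on. Distinct inputs can in principle collide on the same key; how a real engine deals with that is outside the scope of this sketch.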
In the above example, 2016 is queried as an integer. Does MarkLogic actually store<br />
the value as an integer? By default, no—it's stored as the textual representation of the<br />
integer value, the same as it appeared in the document, and the above query executes<br />
the same as if 2016 were surrounded by string quotes. Often this type of fuzziness is<br />
sufficient. For cases where data type encoding matters, it's possible to use a range index,<br />
which is discussed later.<br />
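The effect of storing values as text can be shown with another toy model. This is an assumption-laden sketch rather than MarkLogic's actual behavior in every case: `value_key` and `index` are hypothetical names, and the lookup simply converts the queried value to its string form.

```python
def value_key(element, value):
    # Values are keyed by their text, so 2016 and "2016" give the same key.
    return (element, str(value))

# Toy index: document 1 contains the text "2016", document 2 the text "2016.0".
index = {
    value_key("pubyear", "2016"): {1},
    value_key("pubyear", "2016.0"): {2},
}

hits_int = index.get(value_key("pubyear", 2016), set())
hits_str = index.get(value_key("pubyear", "2016"), set())
```

In this model the integer and string queries resolve to the same term list. A value written differently in the document, such as "2016.0", lands under a different key, which is the kind of case where data type encoding starts to matter.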