CHAPTER 2
CORE TOPICS
INDEXING TEXT AND STRUCTURE FOR FAST SEARCH
Now that we've covered what MarkLogic is, let's dig into how it works, starting with its
unique indexing model. MarkLogic indexes the words, phrases, structural elements,
and metadata in database documents so it can search those documents efficiently.
Collectively, this indexed information is known as the Universal Index. (MarkLogic uses
other types of indexes as well, such as range indexes, reverse indexes, and triple
indexes. We'll cover those later.)
INDEXING WORDS
Let's begin with a thought experiment. Imagine that I give you printouts of 10<br />
documents. I tell you that I'm going to provide you with a word, and you'll tell me<br />
which documents contain the word. What will you do to prepare? If you think like a<br />
search engine, you'll create a list of all words that appear across all the documents, and<br />
for each word, make a list of which documents contain that word. This is called an<br />
inverted index; inverted because instead of documents having words, it's words having<br />
document identifiers. Each entry in the inverted index is called a term list. A term is just<br />
a technical name for something like a word. Regardless of which word I give you, you<br />
can quickly give me the associated documents by finding the right term list. This is how<br />
MarkLogic resolves simple word queries.<br />
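
To make the thought experiment concrete, here is a minimal sketch of an inverted index in Python. It illustrates the general idea only and is not MarkLogic's implementation; the function name `build_inverted_index`, the sample documents, and the whitespace-based tokenization are all assumptions made for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it.

    `docs` is a dict of {doc_id: text}. Tokenization here is a naive
    lowercase whitespace split, chosen only to keep the sketch short.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking wins",
}

index = build_inverted_index(docs)

# A simple word query is a single term-list lookup.
print(sorted(index.get("quick", set())))
```

Each key in `index` plays the role of a term, and the set stored under it plays the role of that term's term list. Answering a single-word query never requires rereading the documents themselves.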
Now let's imagine that I ask you for documents that contain two different words. That's
not hard; you can use the same data structure. Find all document IDs with the first
word and all document IDs with the second, and intersect the lists (see Figure 1). The
result is the set of documents that contain both words.
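
The intersection step can be sketched in a few lines of Python. The term lists below are hypothetical, and `docs_with_all` is a name invented for this illustration rather than a MarkLogic function.

```python
# Hypothetical term lists: each term maps to the IDs of documents containing it.
term_lists = {
    "quick": {1, 3, 7},
    "fox": {1, 4, 7, 9},
}

def docs_with_all(term_lists, words):
    """Return the IDs of documents that contain every word in `words`."""
    lists = [term_lists.get(word, set()) for word in words]
    if not lists:
        return set()
    # Intersecting the term lists keeps only IDs present in all of them.
    return set.intersection(*lists)

print(sorted(docs_with_all(term_lists, ["quick", "fox"])))
```

A word that has no term list contributes an empty set, so the intersection is empty, which matches the intuition that no document can contain a word that was never indexed.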
It works the same with negative queries. If I ask for documents containing one word but
excluding those with another, you can use the indexes to find all document IDs with the
first word and all document IDs with the second word, then subtract the second list from
the first.
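
A negative query can be sketched the same way, using set difference in place of intersection. As before, the term lists are hypothetical and `docs_with_but_not` is an illustrative name, not part of any MarkLogic API.

```python
# Hypothetical term lists: each term maps to the IDs of documents containing it.
term_lists = {
    "quick": {1, 3, 7},
    "fox": {1, 4, 7, 9},
}

def docs_with_but_not(term_lists, include, exclude):
    """Return IDs of documents that contain `include` but not `exclude`."""
    included = term_lists.get(include, set())
    excluded = term_lists.get(exclude, set())
    # Set difference removes every ID that also appears in the excluded list.
    return included - excluded

print(sorted(docs_with_but_not(term_lists, "quick", "fox")))
```

If the excluded word has no term list, nothing is removed and the result is simply the term list of the included word.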