15.07.2016 Views

MARKLOGIC SERVER

Inside-MarkLogic-Server


CHAPTER 2

CORE TOPICS

INDEXING TEXT AND STRUCTURE FOR FAST SEARCH

Now that we've covered what MarkLogic is, let's dig into how it works, starting with its unique indexing model. MarkLogic indexes the words, phrases, structural elements, and metadata in database documents so it can search those documents efficiently. Collectively, this indexed information is known as the Universal Index. (MarkLogic uses other types of indexes as well: for example, range indexes, reverse indexes, and triple indexes. We'll cover those later.)

INDEXING WORDS

Let's begin with a thought experiment. Imagine that I give you printouts of 10 documents. I tell you that I'm going to provide you with a word, and you'll tell me which documents contain the word. What will you do to prepare? If you think like a search engine, you'll create a list of all words that appear across all the documents, and for each word, make a list of which documents contain that word. This is called an inverted index: inverted because instead of documents having words, words have document identifiers. Each entry in the inverted index is called a term list. A term is just a technical name for something like a word. Regardless of which word I give you, you can quickly give me the associated documents by finding the right term list. This is how MarkLogic resolves simple word queries.
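
The thought experiment can be sketched in a few lines of Python. This is only an illustration of the inverted index idea, not MarkLogic's implementation: the sample documents, the whitespace tokenization, and the names `build_inverted_index` and `term_list` are all assumptions made up for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization (lowercase, split on whitespace); an assumption
        # for this sketch, not how a real search engine tokenizes text.
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def term_list(index, word):
    """Return the term list for a word, or an empty set if it never appears."""
    return index.get(word.lower(), set())


# Hypothetical sample documents, keyed by document ID.
documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "a quick dog",
}

index = build_inverted_index(documents)
quick_docs = sorted(term_list(index, "quick"))  # documents 1 and 3
```

Answering a word query is then a single lookup of the right term list, no matter how many documents were indexed.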

Now let's imagine that I ask you for documents that contain two different words. That's not hard; you can use the same data structure. Find all document IDs with the first word and all document IDs with the second, and intersect the lists (see Figure 1). That results in the set of documents with both words.
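
As a rough sketch of that intersection, assume each term list is kept as a sorted list of document IDs; the two lists can then be walked in step. The sorted-list layout, the sample IDs, and the function name `intersect` are assumptions for illustration, not a description of MarkLogic's internals.

```python
def intersect(first, second):
    """Return the document IDs present in both sorted term lists."""
    i = j = 0
    result = []
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            # The ID is in both lists, so the document contains both words.
            result.append(first[i])
            i += 1
            j += 1
        elif first[i] < second[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return result


# Hypothetical term lists: word -> sorted document IDs.
term_lists = {
    "quick": [1, 3, 7, 9],
    "dog": [2, 3, 8, 9],
}

both = intersect(term_lists["quick"], term_lists["dog"])  # [3, 9]
```

Because both lists are sorted, each is scanned only once, so the cost grows with the length of the term lists rather than with the size of the documents.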

It works the same with negative queries. If I ask for documents containing one word but excluding those with another, you can use the indexes to find all document IDs with the first word and all document IDs with the second word, then subtract the second list from the first.
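
The list subtraction can be sketched the same way. Again, the sample term lists and the function name `subtract` are hypothetical, chosen only to illustrate the idea.

```python
def subtract(include, exclude):
    """Return the IDs in `include` that do not appear in `exclude`."""
    excluded = set(exclude)  # set membership tests are fast
    return [doc_id for doc_id in include if doc_id not in excluded]


# Hypothetical term lists: word -> sorted document IDs.
term_lists = {
    "quick": [1, 3, 7, 9],
    "dog": [2, 3, 8, 9],
}

# Documents containing "quick" but not "dog".
quick_not_dog = subtract(term_lists["quick"], term_lists["dog"])  # [1, 7]
```

Note that the order of the arguments matters here, unlike with intersection: swapping them would return the documents containing "dog" but not "quick".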
