15.07.2016 Views

MARKLOGIC SERVER

Inside-MarkLogic-Server


INDEXING LONGER PHRASES

What happens if instead of a two-word phrase I give you a three- or four-word phrase?
Again, what would you do on paper? You can choose to rely solely on filtering, you can
use position calculation, or you can create a term list entry for all three- and
four-word phrases.

It goes beyond the point of diminishing returns to try to maintain a term list for all
three- and four-word phrases, so that's not actually an option in MarkLogic. The fast
phrase searches option tracks only two-word phrases. However, the two-word phrase term
lists can still be of some help with longer phrases. You can use them to find documents
that have the first and second words together, the second and third words together, and
the third and fourth words together. Documents that satisfy those three constraints are
more likely candidates than documents that just have all four words at unknown locations.
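
As a rough illustration of the narrowing described above, the following Python sketch
builds a two-word-phrase term list over a toy corpus and intersects the entries for each
adjacent pair of a longer phrase. The function names, the dict-of-sets layout, and the
sample documents are assumptions made for illustration; they are not MarkLogic's internal
structures.

```python
from collections import defaultdict


def build_phrase_index(docs):
    """Map each adjacent word pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for first, second in zip(words, words[1:]):
            index[(first, second)].add(doc_id)
    return index


def candidates_for_phrase(index, phrase):
    """Intersect the term lists of every adjacent pair in the phrase.

    The result is only a candidate set: a document can contain every
    pair without containing the whole phrase contiguously.
    """
    words = phrase.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "the quick brown fox jumps",
    2: "fox quick the brown",               # all four words, none adjacent as needed
    3: "brown fox and a quick brown dog",   # two of the three pairs
}
index = build_phrase_index(docs)
print(candidates_for_phrase(index, "the quick brown fox"))
```

Document 2 contains all four words but none of the required pairs, and document 3 lacks
the first pair, so only document 1 survives the intersection.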

If word positions is enabled, will MarkLogic use fast phrase searches as well? Yes.
Because the position calculations require time and memory proportional to the number of
terms being examined, MarkLogic uses fast phrase searches to reduce the number of
documents, and thus terms, that need to be processed.

You'll notice that MarkLogic doesn't require monolithic indexes. Instead, it depends on
lots of smaller indexes, turned on and off depending on your needs, cooperating to resolve
queries. The usual routine is to look at the query, decide which indexes can help, and use
those indexes in cooperation to narrow the results down to a set of candidate documents.
MarkLogic can then optionally filter the documents to confirm that they actually match,
eliminating false positives. The more index options enabled, the tighter the candidate
result set can be. You may be able to tune your index settings so that false positives
never occur (or occur so rarely that they're tolerable), in which case you don't need
filtering.
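
The routine described above (resolve candidates from cooperating indexes, then optionally
filter) can be sketched in Python as follows. This is a minimal model under stated
assumptions: terms are plain strings, two-word phrases are stored as "word word" keys,
and `build_term_lists`, `index_candidates`, and `filter_phrase` are hypothetical helper
names, not MarkLogic functions.

```python
from collections import defaultdict


def build_term_lists(docs):
    """Index single words and two-word phrases as separate terms."""
    term_lists = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()
        for word in words:
            term_lists[word].add(doc_id)
        for first, second in zip(words, words[1:]):
            term_lists[first + " " + second].add(doc_id)
    return term_lists


def index_candidates(term_lists, terms):
    """Intersect the term lists for the given terms to get candidate ids."""
    sets = [term_lists.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def filter_phrase(docs, candidate_ids, phrase):
    """Open each candidate and keep only those that really hold the phrase."""
    target = phrase.lower().split()
    n = len(target)
    confirmed = set()
    for doc_id in candidate_ids:
        words = docs[doc_id].lower().split()
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            confirmed.add(doc_id)
    return confirmed


# Hypothetical sample documents.
docs = {
    1: "the quick brown fox jumps",
    2: "brown fox and the quick brown dog",
    3: "fox quick the brown",
}
term_lists = build_term_lists(docs)

# Word term lists alone: every document has all three words.
words_only = index_candidates(term_lists, ["quick", "brown", "fox"])
# Adding the two-word phrase term lists tightens the candidate set.
with_pairs = index_candidates(term_lists, ["quick brown", "brown fox"])
# Filtering removes the remaining false positive (document 2).
confirmed = filter_phrase(docs, with_pairs, "quick brown fox")
print(words_only, with_pairs, confirmed)
```

Document 2 contains both "quick brown" and "brown fox" but never the three words in a
row, so it passes the index stage and is rejected only by the filter.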

INDEXING STRUCTURE

Everything up to this point is pretty much standard search engine behavior. (Well,
except that traditional search engines, because they don't hold the source data, can't
actually do the filtering and always return the results unfiltered from their indexes.)
Let's now look at how MarkLogic goes beyond simple search to index document structure.

Say I ask you to find XML documents that have a given element within them. What would
you do? If you're MarkLogic, you'd create a term list for that element, the same way you
do for a word. You'd have a term list for each element, and no matter what element name
gets requested you can deliver a fast answer. (It works similarly for JSON documents. See
"Indexing JSON" below for details.)
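
To make the idea of a per-element term list concrete, here is a small Python sketch that
uses the standard library's `xml.etree.ElementTree` parser. The `build_element_index`
helper and the three sample documents are hypothetical, and the sketch ignores XML
namespaces; it only illustrates the lookup pattern, not how MarkLogic stores its indexes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_index(docs):
    """Map each element name to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # iter() walks the root element and all of its descendants.
        for element in root.iter():
            index[element.tag].add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    1: "<book><metadata><title>Inside MarkLogic Server</title></metadata></book>",
    2: "<article><title>Term lists</title></article>",
    3: "<note><body>No heading here</body></note>",
}
element_index = build_element_index(docs)
print(sorted(element_index["title"]))
```

Answering "which documents contain this element?" then becomes a single dictionary
lookup, regardless of which element name is requested.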

Of course, not many queries ask for documents containing just a named element. So
let's say the question is more complex. Let's try to find XML documents matching the
XPath /book/metadata/title and for each, return the title node. That is, we want
documents having a <book> root element, with a <metadata> child and a
