INDEXING LONGER PHRASES<br />
What happens if instead of a two-word phrase I give you a three- or four-word phrase?<br />
Again, what would you do on paper? You can choose to rely solely on filtering, you can<br />
use position calculation, or you can create a term list entry for all three- and<br />
four-word phrases.<br />
It goes beyond the point of diminishing returns to try to maintain a term list for all<br />
three- and four-word phrases, so that's not actually an option in MarkLogic. The fast<br />
phrase searches option only tracks two-word phrases. However, the two-word phrase term lists can<br />
still be of some help with longer phrases. You can use them to find documents that<br />
have the first and second words together, the second and third words together, and the<br />
third and fourth words together. Documents that satisfy those three constraints are more<br />
likely candidates than documents that just have all four words at unknown locations.<br />
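To make the idea concrete, here is a minimal sketch of narrowing a four-word phrase with nothing but two-word term lists. It is a toy model, not MarkLogic's implementation: the names `build_pair_index` and `phrase_candidates`, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_pair_index(docs):
    """Map each adjacent two-word pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.lower().split()  # naive tokenizer, assumed for the sketch
        for first, second in zip(words, words[1:]):
            index[(first, second)].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """Intersect the term lists of every adjacent word pair in the phrase."""
    words = phrase.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result


# Hypothetical sample documents.
docs = {
    1: "the quick brown fox jumps",
    2: "quick brown dogs and a brown fox jumps high",
    3: "a fox that jumps is quick and brown",
}
pair_index = build_pair_index(docs)
candidates = phrase_candidates(pair_index, "quick brown fox jumps")
print(candidates)
```

Document 3 has all four words but none of the required pairs, so it drops out. Document 2 survives even though it never contains the full phrase: each pair appears somewhere in it, just not in one run. The pair term lists produce likely candidates, not guaranteed matches.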
If the word positions option is enabled, will MarkLogic use fast phrase searches as well? Yes. Because<br />
position calculations require time and memory proportional to the number of terms<br />
being examined, MarkLogic uses fast phrase searches to reduce the number of documents,<br />
and thus terms, that need to be processed. You'll notice that MarkLogic doesn't require<br />
monolithic indexes. Instead, it depends on lots of smaller indexes, turned on and off<br />
depending on your needs, cooperating to resolve queries. The usual routine is to look at<br />
the query, decide what indexes can help, and use the indexes in cooperation to narrow<br />
the results down to a set of candidate documents. MarkLogic can then optionally<br />
filter the documents to confirm that they actually match, eliminating false<br />
positives. The more index options enabled, the tighter the candidate result set can be.<br />
You may be able to tune your index settings so that false positives never occur (or occur<br />
so rarely that they're tolerable), in which case, you don't need filtering.<br />
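The filtering step can be pictured with a small sketch like the one below. It assumes index resolution has already produced a candidate set, then opens each candidate and checks the real text. The function name `filter_candidates`, the whitespace tokenizer, and the sample data are hypothetical, and this is only an illustration of the idea, not how MarkLogic implements filtering.

```python
def filter_candidates(docs, candidate_ids, phrase):
    """Keep only candidates whose text contains the phrase as consecutive words."""
    target = phrase.lower().split()
    confirmed = []
    for doc_id in sorted(candidate_ids):
        words = docs[doc_id].lower().split()  # naive tokenizer, assumed
        # Slide a window the size of the phrase across the document's words.
        for start in range(len(words) - len(target) + 1):
            if words[start:start + len(target)] == target:
                confirmed.append(doc_id)
                break
    return confirmed


# Hypothetical documents and an assumed candidate set from the indexes.
docs = {
    1: "the quick brown fox jumps",
    2: "quick brown dogs and a brown fox jumps high",
}
candidate_ids = {1, 2}
confirmed = filter_candidates(docs, candidate_ids, "quick brown fox jumps")
print(confirmed)
```

Document 2 is a false positive here: it holds every adjacent word pair of the phrase but never the whole phrase in one run, so only the filter catches it. If your index settings never let such a document into the candidate set, the filter has nothing to remove and you can skip it.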
INDEXING STRUCTURE<br />
Everything up to this point is pretty much standard search engine behavior. (Well,<br />
except that traditional search engines, because they don't hold the source data, can't<br />
actually do the filtering and always return the results unfiltered from their indexes.) Let's<br />
now look at how MarkLogic goes beyond simple search to index document structure.<br />
Say I ask you to find XML documents that have a title element within them. What<br />
would you do? If you're MarkLogic, you'd create a term list for the element title, the<br />
same way you do for a word. You'd have a term list for each element, and no matter<br />
what element name gets requested you can deliver a fast answer. (It works similarly for<br />
JSON documents. See "Indexing JSON" below for details.)<br />
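As a rough sketch of what a term list per element name might look like, the following builds one with Python's standard `xml.etree.ElementTree` parser. The function name `build_element_index`, the document URIs, and the sample XML are assumptions for illustration only. MarkLogic's real term lists are built at load time and stored quite differently.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_element_index(xml_docs):
    """Map each element name to the set of document URIs that contain it."""
    index = defaultdict(set)
    for uri, xml_text in xml_docs.items():
        root = ET.fromstring(xml_text)
        # iter() walks the root element and every descendant element.
        for element in root.iter():
            index[element.tag].add(uri)
    return index


# Hypothetical sample documents.
xml_docs = {
    "a.xml": "<book><metadata><title>Dune</title></metadata></book>",
    "b.xml": "<article><title>Indexing</title></article>",
    "c.xml": "<memo><body>No heading here</body></memo>",
}
element_index = build_element_index(xml_docs)
with_title = sorted(element_index["title"])
print(with_title)
```

Answering "which documents have this element?" becomes a single lookup, no matter which element name is requested. The lookup says nothing about where the element sits in the document, which is why a plain element term list alone cannot answer a path question.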
Of course, not many queries ask for documents containing just a named element. So<br />
let's say the question is more complex. Let's try to find XML documents matching the<br />
XPath /book/metadata/title and for each, return the title node. That is, we want<br />
documents having a book root element, with a metadata child and a <br />