MARKLOGIC SERVER

If you have the fast phrase searches option enabled, MarkLogic will incorporate two-word terms into its inverted index. With this index enabled, MarkLogic can use the third approach listed above. This makes phrase queries very efficient at the cost of slightly larger indexes on disk and slightly slower performance during document ingestion, since extra entries must be added to the indexes.

If you have the word positions option enabled, MarkLogic will use position information to resolve the phrase, which is the second approach listed above. This index resolution isn't as efficient as fast phrase searches because it requires position-matching work, but word positions can also support proximity queries that look for words near each other but not necessarily next to each other. Word positions also enables "mild not" where you can, for example, search for Mexico but not if it appears in the phrase New Mexico.

FAST PHRASE SEARCHES
Term        Doc
a           1
a blue      1
blue        1, 2
blue car    2, 3
…           …

WORD POSITIONS
Term        Doc:Pos
a           1:1
blue        1:2
car         1:3, 2:3
red         2:2, 3:2
…           …

Figure 2: With fast phrase searches turned on, MarkLogic also indexes word pairs. The word positions setting indexes information about a term's location in the document.

If neither the fast phrase searches nor word positions index is enabled, then MarkLogic will use the simple word indexes as best it can and rely on filtering the results. Filtering is what MarkLogic calls the act of opening a document and checking to see if it's a true match. Having the indexes off and relying on filtering keeps the indexes smaller but will slow queries proportional to how many candidate documents aren't actually matching documents, that is, how many have to be read as candidates but are thrown away during filtering.

It's important to note that the query results will include the same documents in each case; it's mostly just a matter of performance tradeoffs.
Relevance order may differ slightly based on index settings. Relevance calculations have to be made using indexes before pulling documents off disk, so having more indexes available (as with fast phrase searches) enables MarkLogic to do more accurate relevance calculations. We'll talk more about relevance calculation later.
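
The three ways of resolving a two-word phrase described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MarkLogic's implementation: the corpus, the document IDs, and every function name here (`build_indexes`, `by_fast_phrase`, `by_positions`, `by_filtering`) are hypothetical, and real term lists are stored and intersected very differently.

```python
from collections import defaultdict

# Hypothetical toy corpus; IDs and text are invented for illustration.
DOCS = {
    1: "a blue car",
    2: "the red car is near the blue house",
    3: "my car is blue",
}


def build_indexes(docs):
    """Build word, two-word (fast phrase), and word-position indexes."""
    words = defaultdict(set)       # term -> {doc id}
    pairs = defaultdict(set)       # "w1 w2" -> {doc id}
    positions = defaultdict(dict)  # term -> {doc id: [1-based positions]}
    for doc_id, text in docs.items():
        tokens = text.split()
        for pos, tok in enumerate(tokens, start=1):
            words[tok].add(doc_id)
            positions[tok].setdefault(doc_id, []).append(pos)
        for first, second in zip(tokens, tokens[1:]):
            pairs[f"{first} {second}"].add(doc_id)
    return words, pairs, positions


def by_fast_phrase(pairs, w1, w2):
    """Two-word index: a single lookup answers the phrase query."""
    return set(pairs.get(f"{w1} {w2}", set()))


def by_positions(words, positions, w1, w2):
    """Word positions: keep docs where w2 sits right after w1."""
    candidates = words.get(w1, set()) & words.get(w2, set())
    return {
        d for d in candidates
        if any(p + 1 in positions[w2][d] for p in positions[w1][d])
    }


def by_filtering(docs, words, w1, w2):
    """Word indexes only: intersect, then open each candidate to confirm."""
    candidates = words.get(w1, set()) & words.get(w2, set())
    matches = {d for d in candidates if f" {w1} {w2} " in f" {docs[d]} "}
    return candidates, matches


if __name__ == "__main__":
    words, pairs, positions = build_indexes(DOCS)
    print(by_fast_phrase(pairs, "blue", "car"))
    print(by_positions(words, positions, "blue", "car"))
    print(by_filtering(DOCS, words, "blue", "car"))
```

All three functions agree on the matching documents for "blue car". What differs is the work done: the filtering path has to open three candidate documents and throws two of them away, while the two-word index answers with one lookup.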

INDEXING LONGER PHRASES

What happens if instead of a two-word phrase I give you a three- or four-word phrase? Again, what would you do on paper? You can choose to rely solely on filtering, you can use position calculation, or you can create a term list entry for all three- and four-word phrases.

It goes beyond the point of diminishing returns to try to maintain a term list for all three- and four-word phrases, so that's not actually an option in MarkLogic. The fast phrase searches option only tracks two-word phrases. However, the two-word phrase index can still be of some help with longer phrases. You can use it to find documents that have the first and second words together, the second and third words together, and the third and fourth words together. Documents that satisfy those three constraints are more likely candidates than documents that just have all four words at unknown locations.

If word positions is enabled, will MarkLogic use fast phrase searches as well? Yes. Because the position calculations require time and memory proportional to the number of terms being examined, MarkLogic uses fast phrase searches to reduce the number of documents, and thus terms, that need to be processed.

You'll notice that MarkLogic doesn't require monolithic indexes. Instead, it depends on lots of smaller indexes, turned on and off depending on your needs, cooperating to resolve queries. The usual routine is to look at the query, decide what indexes can help, and use the indexes in cooperation to narrow the results down to a set of candidate documents. MarkLogic can then optionally filter the documents to confirm that the documents actually match, eliminating false positives. The more index options enabled, the tighter the candidate result set can be. You may be able to tune your index settings so that false positives never occur (or occur so rarely that they're tolerable), in which case you don't need filtering.
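
As a rough sketch of how a two-word index can narrow the candidates for a four-word phrase, the Python below intersects the term lists for each adjacent word pair and then filters. The corpus and the helper names (`build_pair_index`, `phrase_candidates`, `filter_phrase`) are assumptions made for this example and are not MarkLogic APIs.

```python
from collections import defaultdict

# Hypothetical corpus, invented for illustration.
DOCS = {
    1: "the quick brown fox jumps",
    2: "quick brown dogs and brown fox cubs and the quick sloth",
    3: "the fox is quick and brown",
}


def build_pair_index(docs):
    """Map each adjacent word pair to the set of documents containing it."""
    pairs = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.split()
        for first, second in zip(tokens, tokens[1:]):
            pairs[f"{first} {second}"].add(doc_id)
    return pairs


def phrase_candidates(pairs, phrase):
    """Intersect the term lists of every adjacent pair in the phrase.

    Returns an empty set for phrases shorter than two words, since this
    sketch has no single-word index to fall back on.
    """
    tokens = phrase.split()
    result = None
    for first, second in zip(tokens, tokens[1:]):
        docs = pairs.get(f"{first} {second}", set())
        result = set(docs) if result is None else result & docs
    return result or set()


def filter_phrase(docs, candidates, phrase):
    """Open each candidate and confirm the whole phrase really occurs."""
    return {d for d in candidates if f" {phrase} " in f" {docs[d]} "}


if __name__ == "__main__":
    pairs = build_pair_index(DOCS)
    candidates = phrase_candidates(pairs, "the quick brown fox")
    print(candidates)
    print(filter_phrase(DOCS, candidates, "the quick brown fox"))
```

Document 3 contains all four words but none of the required pairs, so the pair index rules it out without reading it. Document 2 contains all three pairs at scattered locations, so it survives as a false positive until filtering (or a position check) removes it.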

INDEXING STRUCTURE

Everything up to this point is pretty much standard search engine behavior. (Well, except that traditional search engines, because they don't hold the source data, can't actually do the filtering and always return the results unfiltered from their indexes.) Let's now look at how MarkLogic goes beyond simple search to index document structure.

Say I ask you to find XML documents that have a given element within them. What would you do? If you're MarkLogic, you'd create a term list for that element, the same way you do for a word. You'd have a term list for each element, and no matter what element name gets requested you can deliver a fast answer. (It works similarly for JSON documents. See "Indexing JSON" below for details.)

Of course, not many queries ask for documents containing just a named element. So let's say the question is more complex. Let's try to find XML documents matching the XPath /book/metadata/title and for each, return the title node. That is, we want documents having a <book> root element, with a <metadata> child and a
Page 1 and 2: Inside MARKLOGIC SERVER Jason Hunte
