threshold needs to be reached among the forests to trigger rebalancing. This keeps<br />
rebalancing from occurring unnecessarily when there is a small difference in<br />
forest counts. 21<br />
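The threshold rule from footnote 21 can be sketched in a few lines of Java. This is an illustrative calculation, not MarkLogic's internal code; the class and method names are made up for the example.

```java
// Sketch of the rebalancing threshold from footnote 21:
// threshold = max(AVG / (20 * N), 10000), where AVG is the average
// fragment count across forests and N is the number of forests.
public class Rebalance {
    static long threshold(long avg, int forests) {
        return Math.max(avg / (20L * forests), 10_000L);
    }

    // Fragments move from forest A to forest B only when A's count is
    // above AVG + threshold and B's count is below AVG - threshold.
    static boolean shouldMove(long aCount, long bCount, long avg, int forests) {
        long t = threshold(avg, forests);
        return aCount > avg + t && bCount < avg - t;
    }

    public static void main(String[] args) {
        // With 4 forests averaging 1,000,000 fragments each,
        // the threshold is max(1000000/80, 10000) = 12,500.
        System.out.println("threshold = " + threshold(1_000_000, 4));
    }
}
```

With those numbers, rebalancing only kicks in once one forest exceeds 1,012,500 fragments while another sits below 987,500 — a small skew never triggers data movement.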
HADOOP<br />
MarkLogic leverages Apache Hadoop—specifically the MapReduce part of the<br />
Hadoop stack—to facilitate bulk processing of data. The Hadoop MapReduce engine<br />
has become a popular way to run Java-based, computationally intensive programs<br />
across a large number of nodes. It's called MapReduce because it breaks down all<br />
work into two tasks. The first is a user-written "map" task that takes in key-value<br />
pairs and outputs a series of intermediate key-value pairs. The second is a "reduce"<br />
task that takes in each key from the "map" phase along with all of the intermediate<br />
values generated for that key, then outputs a final series of values. This simple model makes it easy to parallelize<br />
the work, running the map and reduce tasks in parallel across machines, yet the model<br />
has proven to be robust enough to handle many complex workloads. MarkLogic uses<br />
MapReduce for bulk processing: for large-scale data ingestion, transformation, and<br />
export. (MarkLogic does not use Hadoop to run live queries or updates.)<br />
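The map/reduce model described above can be illustrated with a small in-memory word count in plain Java — no Hadoop dependency, and the class and method names here are invented for the sketch, not Hadoop's API.

```java
import java.util.*;
import java.util.stream.*;

// Minimal in-memory sketch of the map/reduce model: "map" emits
// intermediate key-value pairs, the framework groups values by key,
// and "reduce" folds each group into a final value.
public class MapReduceSketch {
    // "map" task: one input record -> intermediate (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .map(w -> Map.entry(w, 1));
    }

    // "reduce" task: a key plus all its intermediate values -> final count.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    static Map<String, Integer> run(List<String> input) {
        // Group intermediate pairs by key (the "shuffle" step),
        // then apply reduce to each group.
        Map<String, List<Integer>> grouped = input.stream()
            .flatMap(MapReduceSketch::map)
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        return grouped.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey,
                e -> reduce(e.getKey(), e.getValue())));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be")));
    }
}
```

Because each map call sees only one record and each reduce call sees only one key's values, both phases parallelize trivially — which is exactly what Hadoop exploits across a cluster.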
At a technical level, MarkLogic provides and supports a bi-directional connector for<br />
Hadoop. 22 This connector is open source, written in Java, and available separately<br />
from the main MarkLogic Server package. The connector includes logic that enables<br />
coordinated activity between a MarkLogic cluster and a Hadoop cluster. It ensures that<br />
all connectivity happens directly between machines, node-to-node, with no machine<br />
becoming a bottleneck.<br />
A MarkLogic Hadoop job typically follows one of three patterns:<br />
1. Read data from an external source, such as a filesystem or HDFS (the Hadoop Distributed File System), and push it into MarkLogic. This is your classic ETL (extract-transform-load) job. The data can be transformed (standardized, denormalized, deduplicated, reshaped) as much as necessary within Hadoop as part of the work.<br />
2. Read data from MarkLogic and output to an external destination, such as a filesystem or HDFS. This is your classic database export.<br />
3. Read data from MarkLogic and write the results back into MarkLogic. This is an efficient way to run a bulk transformation. For example, the MarkMail.org project wanted to "geo-tag" every email message with a latitude-longitude location based on IP addresses in the mail headers. The logic to determine the location based on IP was written in Java and executed in parallel against all messages with a Hadoop job running on a small Hadoop cluster.<br />
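The third pattern — a bulk read-transform-write-back job like MarkMail's geo-tagging — can be sketched in plain Java. This is only an illustration of the transform step: the lookup table, record shape, and names below are hypothetical, and no Hadoop or MarkLogic connector API appears here. In the real job, the equivalent lookup ran inside parallel Hadoop map tasks.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of pattern 3: read records, apply a per-record
// transform in parallel, and collect the rewritten results.
public class GeoTagSketch {
    // Hypothetical IP -> lat,long table (addresses from the
    // documentation ranges; a real job would use a geo-IP database).
    static final Map<String, String> ipToLocation = Map.of(
        "203.0.113.5", "37.77,-122.42",
        "198.51.100.7", "51.51,-0.13");

    // Transform one message: append a lat-long tag when the IP is known.
    static String geoTag(String message, String senderIp) {
        String loc = ipToLocation.getOrDefault(senderIp, "unknown");
        return message + " [geo=" + loc + "]";
    }

    public static void main(String[] args) {
        List<String[]> messages = List.of(
            new String[]{"msg-1", "203.0.113.5"},
            new String[]{"msg-2", "192.0.2.9"});
        // Parallel bulk transform, analogous to map tasks running
        // over all messages at once.
        List<String> tagged = messages.parallelStream()
            .map(m -> geoTag(m[0], m[1]))
            .collect(Collectors.toList());
        tagged.forEach(System.out::println);
    }
}
```

The design point is that each message is transformed independently, so the work shards cleanly across however many map tasks (or cores) are available.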
21 Suppose the average fragment count of the forests is AVG and the number of forests is N. A threshold is<br />
calculated as max(AVG/(20*N),10000). So, data is moved between forests A and B only when A's fragment<br />
count is greater than AVG + threshold and B's fragment count is smaller than AVG – threshold.<br />
22 MarkLogic supports only specific versions of Hadoop, and only certain operating systems. See the<br />
documentation for specifics.<br />