A MapReduce job reading data from MarkLogic employs some tricks to make the reading more efficient. It gets from each forest a manifest of what documents are present in the forest (optionally limiting it to documents matching an ad hoc query constraint, or to subdocument sections), divides up the documents into "splits," and assigns those splits to map jobs that can run in parallel across Hadoop nodes. Communication always happens between the Hadoop node running the map job and the machine hosting the forest. (Side note: In order for the Hadoop node to communicate with the MarkLogic node hosting the forest, the MarkLogic node has to have an open XDBC port, meaning it needs to be a combination E-node/D-node.)
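The split-generation step above can be sketched in miniature. This is an illustrative simulation, not the real connector API: the host names, URIs, and the `make_splits` helper are all invented for the example.

```python
# Illustrative sketch: divide each forest's document manifest into
# fixed-size "splits," each of which names the host holding that
# forest, so a map task reads directly from the right machine.

def make_splits(forest_manifests, max_split_size):
    """forest_manifests: {forest_host: [doc_uri, ...]}.
    Returns a list of (forest_host, [doc_uris]) splits."""
    splits = []
    for host, docs in forest_manifests.items():
        for start in range(0, len(docs), max_split_size):
            splits.append((host, docs[start:start + max_split_size]))
    return splits

# Hypothetical two-forest cluster: five documents across two hosts.
manifests = {
    "dnode1": ["/a.xml", "/b.xml", "/c.xml"],
    "dnode2": ["/d.xml", "/e.xml"],
}
splits = make_splits(manifests, max_split_size=2)
# Three splits result: two for dnode1, one for dnode2. Each split can
# be handed to a separate map task running in parallel.
```

Because every split carries the identity of its forest's host, the map tasks never need to route reads through an intermediary node.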
The connector code also internally uses the little-known "unordered" feature to pull the documents in the order they're stored on disk. The fn:unordered($sequence) function provides a hint to the optimizer that the order of the items in the sequence does not matter. MarkLogic uses that liberty to return the results in the order they're stored on disk (by increasing internal fragment ID), providing better performance on disks that optimize sequential reads.
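The effect of the unordered hint can be illustrated with a toy model. The fragment IDs below are invented; the point is only that retrieval in fragment-ID order turns scattered seeks into one sequential scan.

```python
# Illustrative sketch: "unordered" retrieval is free to return documents
# in on-disk order (increasing internal fragment ID) rather than in the
# order the caller listed them. Fragment IDs here are made up.

docs = {"/c.xml": 7, "/a.xml": 2, "/b.xml": 5}  # uri -> fragment ID

requested = ["/a.xml", "/c.xml", "/b.xml"]        # caller's order
unordered_scan = sorted(requested, key=docs.get)  # on-disk order
# unordered_scan == ["/a.xml", "/b.xml", "/c.xml"]
```

Reading in ascending fragment-ID order means the disk head (or read-ahead cache) moves in one direction through the forest's data, which is exactly the access pattern sequential-read optimizations reward.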
A MapReduce job writing data to MarkLogic also employs a trick to gain performance. If the reduce step inserts a document into the database and the connector deems it safe, the insert uses in-forest placement, so that communication happens directly with the host holding the forest.
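In-forest placement can be sketched as follows. The hashing scheme, host names, and `target_host` helper are stand-ins invented for this example, not MarkLogic's actual placement logic.

```python
# Illustrative sketch: with in-forest placement the writer computes the
# target forest itself and connects straight to that forest's host,
# rather than sending the insert to an arbitrary E-node that would
# forward it across the network.

import zlib

FORESTS = {0: "dnode1", 1: "dnode2", 2: "dnode3"}  # forest id -> host

def target_host(doc_uri):
    # Deterministic URI -> forest assignment (a stand-in for the real
    # placement decision the connector makes).
    forest_id = zlib.crc32(doc_uri.encode()) % len(FORESTS)
    return FORESTS[forest_id]

# A reduce task would open its connection to target_host(uri) and insert
# there, so the document never takes an extra inter-host hop.
```

The determinism matters: every writer computes the same forest for a given URI, so parallel reduce tasks can each talk to their own subset of hosts without coordination.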
As an administrator, whether you want to place a Hadoop process onto each MarkLogic node or keep them separate depends on your workload. Co-location reduces network traffic between MarkLogic and the MapReduce tasks but places a heavier computational and memory burden on the host.
The MarkLogic Content Pump (MLCP), discussed later, is a command-line program built on Hadoop for managing bulk import, export, and data copy tasks. MarkLogic can also leverage the Hadoop Distributed File System (HDFS) for storing database content. This is useful for tiered storage deployments, discussed next.
A World with No Hadoop
What about a world with no Hadoop? You could still do the same work, and people have, but for simplicity's sake they have usually loaded from one machine, driven a bulk transformation from one machine, or pulled down an export to one machine. This limits performance to that of the one client. Hadoop lets an arbitrarily large cluster of machines act as the client, talking in parallel to all of the MarkLogic nodes, and that speeds up data flow into and out of MarkLogic.