15.07.2016 Views

MARKLOGIC SERVER

Inside-MarkLogic-Server


stands. Merges tend to be CPU- and disk-intensive, and for this reason, you have control over when merges can happen via system administration.

Each forest has its own in-memory stand and set of on-disk stands. Loading and indexing content is a largely parallelizable activity, so splitting the loading effort across forests and potentially across machines in a cluster can help scale the ingestion work.

Document Compression

The TreeData file stores XML document data in a highly efficient manner. The tree structure of the document gets saved using a compact binary encoding. The text nodes are saved using a dictionary-based compression scheme. In this scheme, the text is tokenized (into words, white space, and punctuation) and each document constructs its own token dictionary, mapping numeric token IDs to token values. Instead of storing strings as sequences of characters, each string is stored as a sequence of numeric token IDs. The original string can be reconstructed using the dictionary as a lookup table. The tokens in the dictionary are placed in frequency order, whereby the most frequently occurring tokens will have the smallest token IDs. That's important because all numbers in the representation of the tree structure and the strings of text are encoded using a variable-length encoding, so smaller numbers take fewer bits than larger numbers. The numeric representation of a token is a unary nibble count followed by nibbles of data. (A nibble is half a byte.) The most frequently occurring token (usually a space) gets token ID 0, which takes only a single bit to represent in a string (the single bit 0). Tokens 1-16 take 6 bits (two bits for the count (10) and a single 4-bit nibble). Tokens 17-272 take 11 bits (three bits of count (110) and two 4-bit nibbles). And so on. Each compact token ID represents an arbitrarily long token. The end result is a highly compact serialization of XML, much smaller than the XML you see in a regular file.
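The scheme described above can be sketched in a few lines of Python. This is an illustration only, not MarkLogic's actual on-disk format: the function names, the regular expression used for tokenizing, and the exact offset arithmetic (each nibble count starting where the previous range ends, so that the ranges come out as 0, 1-16, 17-272, and so on) are assumptions chosen to match the bit counts given in the text. Bits are kept as a string of "0" and "1" characters for readability.

```python
import re
from collections import Counter


def tokenize(text):
    # Split into words, runs of white space, and single punctuation characters.
    # Every character lands in exactly one token, so joining the tokens
    # reproduces the original text.
    return re.findall(r"\w+|\s+|[^\w\s]", text)


def build_dictionary(tokens):
    # Most frequent token first, so it receives the smallest token ID.
    # The list index is the token ID.
    return [token for token, _ in Counter(tokens).most_common()]


def encode_id(token_id):
    # Unary nibble count (n ones, then a zero) followed by n 4-bit nibbles.
    nibbles, base, span = 0, 0, 1
    while token_id >= base + span:
        base += span      # first ID of the next range
        span *= 16        # each extra nibble covers 16 times more IDs
        nibbles += 1
    prefix = "1" * nibbles + "0"
    payload = format(token_id - base, "0{}b".format(4 * nibbles)) if nibbles else ""
    return prefix + payload


def decode_ids(bits):
    ids, i = [], 0
    while i < len(bits):
        nibbles = 0
        while bits[i] == "1":   # read the unary count
            nibbles += 1
            i += 1
        i += 1                  # skip the terminating zero
        base = sum(16 ** k for k in range(nibbles))
        value = int(bits[i:i + 4 * nibbles], 2) if nibbles else 0
        i += 4 * nibbles
        ids.append(base + value)
    return ids


def compress(text):
    tokens = tokenize(text)
    dictionary = build_dictionary(tokens)
    ids = {token: n for n, token in enumerate(dictionary)}
    return dictionary, "".join(encode_id(ids[t]) for t in tokens)


def decompress(dictionary, bits):
    return "".join(dictionary[n] for n in decode_ids(bits))
```

Calling `compress("the cat and the dog and the bird")` puts the space at token ID 0, so each of the seven spaces costs one bit, while "the" (ID 1) costs six bits no matter how long the word is. `decompress` rebuilds the original string from the dictionary and the bit string.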

MODIFYING DATA

What happens if you delete or change a document? If you delete a document, MarkLogic marks the document as deleted but does not immediately remove it from disk. The deleted document will be removed from query results based on its deletion markings, and the next merge of the stand holding the document will bypass the deleted document when writing the new stand, effectively removing it from disk.

If you change a document, MarkLogic marks the old version of the document as deleted in its current stand and creates a new version of the document in the in-memory stand. MarkLogic distinctly avoids modifying the document in place. If you consider how many term lists a single document change might affect, updates in place are an entirely inefficient proposition. So, instead, MarkLogic treats any changed document like a new document and treats the old version like a deleted document.
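The delete and update behavior described above can be modeled with a small sketch. The class and method names here (`Version`, `Stand`, `Forest`, `write`, `flush`, `merge`) are hypothetical and are not MarkLogic APIs. The model is also deliberately simplified: it tracks whole documents by URI with a boolean deletion flag, and it treats a merge as a single pass over all on-disk stands.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    uri: str
    content: str
    deleted: bool = False   # deletion marking; the data itself is left alone


@dataclass
class Stand:
    versions: list = field(default_factory=list)


class Forest:
    def __init__(self):
        self.in_memory = Stand()
        self.on_disk = []

    def _stands(self):
        return [self.in_memory, *self.on_disk]

    def delete(self, uri):
        # Mark every live version as deleted; nothing is removed yet.
        for stand in self._stands():
            for version in stand.versions:
                if version.uri == uri and not version.deleted:
                    version.deleted = True

    def write(self, uri, content):
        # A change is "delete the old version, add a new one":
        # the old version is never modified in place.
        self.delete(uri)
        self.in_memory.versions.append(Version(uri, content))

    def read(self, uri):
        # Queries skip anything carrying a deletion marking.
        for stand in self._stands():
            for version in stand.versions:
                if version.uri == uri and not version.deleted:
                    return version.content
        return None

    def flush(self):
        # Assumption for this sketch: the in-memory stand becomes an on-disk stand.
        self.on_disk.append(self.in_memory)
        self.in_memory = Stand()

    def merge(self):
        # Write a new stand that bypasses deleted versions,
        # which is the point where they actually disappear.
        live = [v for stand in self.on_disk for v in stand.versions if not v.deleted]
        self.on_disk = [Stand(live)]
```

In this model, writing `/a.xml` twice with a `flush()` in between leaves the first version in its on-disk stand, untouched but flagged as deleted, while `read("/a.xml")` returns only the second. The old version is dropped only when `merge()` writes the replacement stand.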

We simplified things a bit here. If you remember, fragments (not documents) are the basic unit of query, retrieval, and update. So if you have fragmentation rules enabled and make a change in a document that has fragments, MarkLogic will determine
