stands. Merges tend to be CPU- and disk-intensive, and for this reason, you have<br />
control over when merges can happen via system administration.<br />
Each forest has its own in-memory stand and set of on-disk stands. Loading and<br />
indexing content is a largely parallelizable activity, so splitting the loading effort across<br />
forests and potentially across machines in a cluster can help scale the ingestion work.<br />
Document Compression<br />
The TreeData file stores XML document data in a highly efficient manner. The tree structure of the<br />
document gets saved using a compact binary encoding. The text nodes are saved using a dictionary-based<br />
compression scheme. In this scheme, the text is tokenized (into words, white space, and punctuation) and each<br />
document constructs its own token dictionary, mapping numeric token IDs to token values. Instead of storing<br />
strings as sequences of characters, each string is stored as a sequence of numeric token IDs. The original<br />
string can be reconstructed using the dictionary as a lookup table. The tokens in the dictionary are placed<br />
in frequency order, whereby the most frequently occurring tokens will have the smallest token IDs. That's<br />
important because all numbers in the representation of the tree structure and the strings of text are encoded<br />
using a variable-length encoding, so smaller numbers take fewer bits than larger numbers. The numeric<br />
representation of a token is a unary nibble count followed by nibbles of data. (A nibble is half a byte.) The most<br />
frequently occurring token (usually a space) gets token ID 0, which takes only a single bit to represent in a string<br />
(the single bit 0). Tokens 1-16 take 6 bits (two bits for the count (10) and a single 4-bit nibble). Tokens 17-272<br />
take 11 bits (three bits of count (110) and two 4-bit nibbles). And so on. Each compact token ID represents an<br />
arbitrarily long token. The end result is a highly compact serialization of XML, much smaller than the XML<br />
you see in a regular file.<br />
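
The scheme described above can be sketched in a few lines of Python. This is only an illustration, not MarkLogic's actual on-disk format: the function names (tokenize, build_dictionary, encode_id, decode_ids) are hypothetical, the tokenizer is a simple regular expression, and the sketch assumes that the data nibbles hold the token ID minus the first ID of its size class, which is one layout consistent with the ranges quoted above (0, 1-16, 17-272).

```python
import re
from collections import Counter

def tokenize(text):
    # Split into words, whitespace runs, and single punctuation characters.
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def build_dictionary(tokens):
    # Frequency order: the most common token gets ID 0, the next ID 1, ...
    counts = Counter(tokens)
    return sorted(counts, key=lambda t: (-counts[t], t))  # list index = token ID

def encode_id(token_id):
    # Unary nibble count ("1" * n + "0") followed by n four-bit nibbles.
    nibbles, first = 0, 0  # `first` = smallest ID that needs `nibbles` nibbles
    while token_id >= first + 16 ** nibbles:
        first += 16 ** nibbles
        nibbles += 1
    prefix = "1" * nibbles + "0"
    if nibbles == 0:
        return prefix
    # Assumption: the payload is the offset of the ID within its size class.
    return prefix + format(token_id - first, "0{}b".format(4 * nibbles))

def decode_ids(bits):
    ids, pos = [], 0
    while pos < len(bits):
        nibbles = 0
        while bits[pos] == "1":      # read the unary count
            nibbles += 1
            pos += 1
        pos += 1                     # skip the terminating "0"
        first = (16 ** nibbles - 1) // 15
        width = 4 * nibbles
        value = int(bits[pos:pos + width], 2) if width else 0
        ids.append(first + value)
        pos += width
    return ids

def compress(text):
    tokens = tokenize(text)
    dictionary = build_dictionary(tokens)
    ids = {token: i for i, token in enumerate(dictionary)}
    return dictionary, "".join(encode_id(ids[t]) for t in tokens)

def decompress(dictionary, bits):
    return "".join(dictionary[i] for i in decode_ids(bits))

dictionary, bits = compress("the cat and the dog and the bird")
restored = decompress(dictionary, bits)
```

In this sketch, token ID 0 encodes as the single bit 0 and token ID 1 as the six bits 100000. Because the dictionary is built per document, the same word can receive different IDs in different documents.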
MODIFYING DATA<br />
What happens if you delete or change a document? If you delete a document,<br />
MarkLogic marks the document as deleted but does not immediately remove it from<br />
disk. The deleted document will be removed from query results based on its deletion<br />
markings, and the next merge of the stand holding the document will bypass the<br />
deleted document when writing the new stand, effectively removing it from disk.<br />
If you change a document, MarkLogic marks the old version of the document as<br />
deleted in its current stand and creates a new version of the document in the in-memory<br />
stand. MarkLogic distinctly avoids modifying the document in place. If you<br />
consider how many term lists a single document change might affect, updates in place<br />
are an entirely inefficient proposition. So, instead, MarkLogic treats any changed<br />
document like a new document and treats the old version like a deleted document.<br />
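
The delete-and-update behavior described above can be modeled with a toy forest. The sketch below is a simplification under stated assumptions: the class and method names (Version, Stand, Forest, flush, merge) are hypothetical and are not MarkLogic APIs, whole documents stand in for fragments, and a boolean flag stands in for the deletion markings.

```python
from dataclasses import dataclass, field

@dataclass
class Version:
    uri: str
    content: str
    deleted: bool = False   # deletion marking; the data itself stays in place

@dataclass
class Stand:
    versions: list = field(default_factory=list)

class Forest:
    def __init__(self):
        self.in_memory = Stand()
        self.on_disk = []

    def _stands(self):
        return self.on_disk + [self.in_memory]

    def delete(self, uri):
        # Mark every live version as deleted; nothing is removed yet.
        for stand in self._stands():
            for version in stand.versions:
                if version.uri == uri and not version.deleted:
                    version.deleted = True

    def update(self, uri, content):
        # A change is a delete of the old version plus an insert of a new one.
        self.delete(uri)
        self.in_memory.versions.append(Version(uri, content))

    def flush(self):
        # Hypothetical step: the in-memory stand becomes an on-disk stand.
        self.on_disk.append(self.in_memory)
        self.in_memory = Stand()

    def merge(self):
        # Writing the new stand bypasses versions that are marked deleted.
        live = [v for s in self.on_disk for v in s.versions if not v.deleted]
        self.on_disk = [Stand(live)]

    def query(self, uri):
        # Versions marked deleted are filtered out of results.
        for stand in self._stands():
            for version in stand.versions:
                if version.uri == uri and not version.deleted:
                    return version.content
        return None

forest = Forest()
forest.update("a.xml", "v1")
forest.flush()
forest.update("a.xml", "v2")   # the old version is only marked deleted
```

After the second update, the first on-disk stand still holds the old version with its deletion marking set; only a later merge drops it.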
We simplified things a bit here. If you remember, fragments (not documents) are the<br />
basic unit of query, retrieval, and update. So if you have fragmentation rules enabled<br />
and make a change in a document that has fragments, MarkLogic will determine<br />