DATABASES, FORESTS, AND STANDS<br />
MarkLogic organizes data hierarchically, with the database as the logical unit at the<br />
top. A system can have multiple databases, and generally will, but each query or update<br />
request typically executes against a particular database. (The exception is a superdatabase,<br />
which lets you query multiple databases as a single unit. That's covered later.)<br />
Installing MarkLogic creates several auxiliary databases by default: App-Services,<br />
Documents, Extensions, Fab, Last-Login, Meters, Modules, Schemas,<br />
Security, and Triggers.<br />
A database consists of one or more forests.[5] A forest is a collection of documents and<br />
is implemented as a physical directory on disk. In addition to the documents, a forest<br />
stores the indexes for those documents and a journal that tracks transaction operations<br />
as they occur. A single machine may manage several forests or, when it acts as an<br />
evaluator (an E-node) in a cluster, it might manage none. Forests can be queried in<br />
parallel, so placing more forests on a multi-core server will help with concurrency. The<br />
optimal number of forests depends on a variety of factors, not all of which are related<br />
to performance. On modern server hardware, six primary forests per server (with six<br />
replica forests for failover, as discussed later) is a good starting point. The size of each<br />
forest will vary depending on whether it's a high-performance or high-capacity environment.<br />
For high performance (e.g., a user-facing search site with facets), each forest might<br />
contain 8 million documents and 100 gigabytes on disk. For high capacity (e.g., an<br />
internal analytical application), each forest might contain 150 million documents and<br />
500 gigabytes on disk. In a clustered environment, you can have a set of servers, each<br />
managing its own set of forests, all unified into a single database.[6] For more about<br />
setting up forests, see "Performance: Understanding System Resources."<br />
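The sizing guidance above can be turned into a rough back-of-the-envelope calculation. The following is a minimal sketch, not a MarkLogic tool: the per-forest figures and the six-forests-per-server starting point come from the text, while the function names and the example workload (200 million documents, 2,000 gigabytes) are hypothetical and chosen only for illustration.

```python
import math

# Per-forest figures quoted in the text; treat them as starting points, not limits.
HIGH_PERFORMANCE = {"docs_per_forest": 8_000_000, "gb_per_forest": 100}
HIGH_CAPACITY = {"docs_per_forest": 150_000_000, "gb_per_forest": 500}


def forests_needed(total_docs, total_gb, docs_per_forest, gb_per_forest):
    """Estimate primary forests so that neither per-forest figure is exceeded."""
    by_docs = math.ceil(total_docs / docs_per_forest)
    by_size = math.ceil(total_gb / gb_per_forest)
    # Whichever constraint is tighter decides the count; always at least one forest.
    return max(by_docs, by_size, 1)


def servers_needed(primary_forests, forests_per_server=6):
    """Estimate servers, assuming six primary forests per server as in the text."""
    return math.ceil(primary_forests / forests_per_server)


# Hypothetical workload, used only to exercise the arithmetic.
total_docs, total_gb = 200_000_000, 2_000

perf_forests = forests_needed(total_docs, total_gb, **HIGH_PERFORMANCE)
cap_forests = forests_needed(total_docs, total_gb, **HIGH_CAPACITY)

print("high performance:", perf_forests, "forests on", servers_needed(perf_forests), "servers")
print("high capacity:", cap_forests, "forests on", servers_needed(cap_forests), "servers")
```

The estimate ignores replica forests and every non-performance factor the text alludes to, so it is best read as a first guess to refine against real measurements.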
Each forest is made up of stands. A stand (like a stand of trees) holds a subset of the<br />
forest data. Each forest has an in-memory stand and some number of on-disk stands<br />
that exist as physical subdirectories under the forest directory. Having multiple stands<br />
helps MarkLogic efficiently ingest data. A stand contains a set of compressed binary<br />
files with names like TreeData, IndexData, Frequencies, and Qualities. This is<br />
where the actual compressed XML data (in TreeData) and indexes (in IndexData)<br />
can be found.<br />
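To make the on-disk layout concrete, here is a minimal sketch that walks a forest directory and reports which of the stand files named above appear in each subdirectory. It relies only on what the text states (stands are subdirectories of the forest directory and hold files such as TreeData and IndexData). The function name, the stand directory names, and the throwaway demo directory are hypothetical, and a real forest directory may contain other entries that this sketch simply skips.

```python
import tempfile
from pathlib import Path

# File names mentioned in the text; a real stand may hold additional files.
STAND_FILES = ("TreeData", "IndexData", "Frequencies", "Qualities")


def list_stands(forest_dir):
    """Map each stand-like subdirectory to the known stand files it holds (name -> bytes)."""
    stands = {}
    for entry in sorted(Path(forest_dir).iterdir()):
        if not entry.is_dir():
            continue
        found = {
            name: (entry / name).stat().st_size
            for name in STAND_FILES
            if (entry / name).is_file()
        }
        if found:  # ignore subdirectories with none of the known stand files
            stands[entry.name] = found
    return stands


# Demo against a throwaway directory so the sketch runs without a MarkLogic install.
with tempfile.TemporaryDirectory() as tmp:
    for stand_name in ("stand-a", "stand-b"):  # hypothetical stand directory names
        stand = Path(tmp) / stand_name
        stand.mkdir()
        (stand / "TreeData").write_bytes(b"tree")
        (stand / "IndexData").write_bytes(b"index-bytes")
    demo = list_stands(tmp)

for stand_name, files in demo.items():
    print(stand_name, files)
```

Pointing `list_stands` at an actual forest directory would be a read-only inspection; the files themselves are compressed binary data and are not meant to be parsed this way.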
[5] Why are they called "forests"? Because the documents they contain are stored as tree structures.
[6] Is a forest like a relational database partition? Yes and no. Yes, because both hold a subset of data. No,
because you don't typically allocate data to a forest based on a particular aspect of the data. Forests aren't
about pre-optimizing a certain query. Instead, they allow a request to be run in parallel across many forests, on
many cores, and possibly many disks.