15.07.2016 Views

MARKLOGIC SERVER

Inside-MarkLogic-Server

Inside-MarkLogic-Server

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

DATABASES, FORESTS, AND STANDS<br />

MarkLogic organizes data hierarchically, with the database as the logical unit at the<br />

top. A system can have multiple databases, and generally will, but each query or update<br />

request typically executes against a particular database. (The exception is with a superdatabase,<br />

which lets you query multiple databases as a single unit. That's covered later.)<br />

Installing MarkLogic creates several auxiliary databases by default: App-Services,<br />

Documents, Extensions, Fab, Last-Login, Meters, Modules, Schemas,<br />

Security, and Triggers.<br />

A database consists of one or more forests. 5 A forest is a collection of documents and<br />

is implemented as a physical directory on disk. In addition to the documents, a forest<br />

stores the indexes for those documents and a journal that tracks transaction operations<br />

as they occur. A single machine may manage several forests, or when in a cluster<br />

(acting as an E-node, an evaluator), it might manage none. Forests can be queried in<br />

parallel, so placing more forests on a multi-core server will help with concurrency. The<br />

optimal number of forests depends on a variety of factors, not all of which are related<br />

to performance. On modern server hardware, six primary forests per server (with six<br />

replica forests for failover, as discussed later) is a good starting point. The size of each<br />

forest will vary depending on if it's a high-performance or high-capacity environment.<br />

For high performance (e.g., a user-facing search site with facets), each forest might<br />

contain 8 million documents and 100 gigabytes on disk. For high capacity (e.g., an<br />

internal analytical application), each forest might contain 150 million documents and<br />

500 gigabytes on disk. In a clustered environment, you can have a set of servers, each<br />

managing its own set of forests, all unified into a single database. 6 For more about<br />

setting up forests, see Performance: Understanding System Resources.<br />

Each forest is made up of stands. A stand (like a stand of trees) holds a subset of the<br />

forest data. Each forest has an in-memory stand and some number of on-disk stands<br />

that exist as physical subdirectories under the forest directory. Having multiple stands<br />

helps MarkLogic efficiently ingest data. A stand contains a set of compressed binary<br />

files with names like TreeData, IndexData, Frequencies, and Qualities. This is<br />

where the actual compressed XML data (in TreeData) and indexes (in IndexData)<br />

can be found.<br />

5 Why are they called "forests"? Because the documents they contain are stored as tree structures.<br />

6 Is a forest like a relational database partition? Yes and no. Yes because both hold a subset of data. No<br />

because you don't typically allocate data to a forest based on a particular aspect of the data. Forests aren't<br />

about pre-optimizing a certain query. They instead allow a request to be run in parallel across many forests, on<br />

many cores, and possibly many disks.<br />

34

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!