Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
storage media but still query those documents as a single unit. For queries to work, the<br />
super- and sub-databases must have the same configuration. Querying multiple data<br />
sources at once and then aggregating the results is also known as federated search.<br />
For example, let's say that you have a Database A of active documents on fast,<br />
expensive storage and a Database B of archival documents on slower, lower-cost<br />
storage. You can create a super-database and associate it with Database A and Database<br />
B. (The super-database itself can also contain forests of documents, although that is not<br />
recommended since those forests cannot be queried independently of the associated<br />
sub-databases.) With this arrangement, you can query just the high-value documents<br />
in Database A, just the archival documents in Database B, or the entire corpus by<br />
querying the super-database.<br />
Query<br />
“cat OR dog”<br />
SUPER DATABASE<br />
DATABASE A<br />
DATABASE B<br />
F<br />
F<br />
F F F<br />
F<br />
F<br />
F<br />
F<br />
F<br />
Figure 25: A super-database lets you query multiple databases as a single unit.<br />
There are limitations on super-databases. You cannot perform updates on documents<br />
that reside in a sub-database via a super-database, and you must perform the updates<br />
directly on the sub-database where the documents reside. You also cannot have a superdatabase<br />
with more than one level of sub-databases beneath it. This is to avoid circular<br />
references—i.e., two databases referencing one another.<br />
HDFS AND AMAZON S3<br />
A key to tiered storage is the ability to store less-active data on lower-cost storage media.<br />
Two low-cost options are Hadoop Distributed File System (HDFS) and Amazon S3<br />
(Simple Storage Service).<br />
HDFS is the storage part of Hadoop. Because HDFS is open source and runs on<br />
commodity hardware, it can store data less expensively than traditional shared storage<br />
solutions such as SAN (storage area network) and NAS (network-attached storage).<br />
Once you configure MarkLogic to use HDFS, you can specify HDFS as a storage<br />
directory for a database using the "hdfs:" prefix.<br />
Amazon S3 is cloud-based storage service offered through Amazon Web Services. S3<br />
customers can interact with documents stored on the service through various interfaces,<br />
117