03.04.2017 Views

The Data Lake Survival Guide



By employing schema-on-read you get value from the data as soon as possible. It does not impose any structure on the data because it does not change the structure that the data had when loaded. It can be particularly useful when dealing with semi-structured, poly-structured, and unstructured data. It was difficult, and often impossible, to model some of this data and ingest it into a data warehouse, but there is no problem getting it into the data lake.

In general, schema-on-read allows for all types of data and encourages a less rigid organization of data. It is easier to create two different views of the same data with schema-on-read. It does not prevent data modelling either; it simply defers the modelling until the data is read.
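The idea of deferring the schema can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular data lake product: the sample records, the field names, and the `read_with_schema` helper are all hypothetical. The raw lines are kept exactly as loaded, and two different views are produced from them at read time.

```python
import json

# Raw records are stored exactly as they arrived; nothing is imposed on load.
# (Hypothetical sample data, invented for illustration.)
RAW_LINES = [
    '{"customer_id": 1, "amount": "19.99", "note": "first order"}',
    '{"customer_id": 2, "amount": "5.00", "tags": ["gift"]}',
    '{"customer_id": 1, "amount": "7.50"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick the named fields and cast them.

    Fields missing from a raw record come back as None.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# View 1: a billing view that treats amounts as numbers.
billing_schema = {"customer_id": int, "amount": float}

# View 2: an annotation view over the same raw data.
notes_schema = {"customer_id": int, "note": str, "tags": list}

billing_view = list(read_with_schema(RAW_LINES, billing_schema))
notes_view = list(read_with_schema(RAW_LINES, notes_schema))

# Modelling decisions (types, which fields matter) were made only here, at read time.
total_for_customer_1 = sum(
    row["amount"] for row in billing_view if row["customer_id"] == 1
)
```

Neither view alters `RAW_LINES`, so a third view with a different schema could be added later without reloading anything.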

The Data Catalog and Master Data Management (MDM)

There needs to be a catalog for all the data in the lake. It is important, and will become increasingly important, for the data catalog to be as rich as possible, embodying as much “meaning” as possible. It may help here if we explain what we mean by “data catalog,” as the term is open to interpretation.

The data catalog for an operating system (such as Windows or Linux) is the file system. The metadata it stores for each file is incomplete, as its primary purpose is to enable programs to find and access files. The programs themselves will know the layout of the data and thus how to interpret the values in any given file. This arrangement is not ideal because the data cannot be shared with programs that do not understand the file structure. It is adequate only for data that never needs to be shared.
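A small Python sketch may make the limitation concrete. The file name, the record layout, and the values are hypothetical, chosen only for illustration. The file system can report where the file is and how big it is, but only a program that already knows the layout can interpret the bytes.

```python
import os
import struct
import tempfile

# The writing program knows the layout: a 4-byte integer followed by an
# 8-byte double. (Hypothetical layout, invented for illustration.)
RECORD_LAYOUT = "<id"  # little-endian: int32 customer_id, float64 amount

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "orders.dat")
    with open(path, "wb") as f:
        f.write(struct.pack(RECORD_LAYOUT, 42, 19.99))

    # What the file system "catalog" knows: location, size, timestamps.
    info = os.stat(path)
    file_size = info.st_size

    # What it does not know: how to interpret the bytes. Only a program
    # that shares RECORD_LAYOUT can recover the values.
    with open(path, "rb") as f:
        customer_id, amount = struct.unpack(RECORD_LAYOUT, f.read())
```

A second program handed only `orders.dat` and the `os.stat` result would see twelve opaque bytes, which is why this form of catalog suits only data that is never shared.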

The data catalog for a database is the schema. The schema defines the structure of all the physical data held in the data sets or tables of the database and seeks to provide a logical meaning to each data item by attaching a label to it (Customer_ID, Amount, Date_of_Birth, etc.). The amount of meaning that can be derived from the data catalog depends upon the specific database product. For relational databases the catalog roughly reflects the meaning that is captured in an ER Diagram, where the relationships between specific data entities are indicated. With NoSQL and document databases the situation is similar, although how the schema is used varies from product to product.
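As a rough illustration of a database schema acting as a catalog, the sketch below uses Python's built-in `sqlite3` module. The two tables are hypothetical; only the column labels Customer_ID, Amount, and Date_of_Birth come from the text. SQLite is used purely because it ships with Python, and other products expose their catalogs differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        Customer_ID   INTEGER PRIMARY KEY,
        Date_of_Birth TEXT
    );
    CREATE TABLE payment (
        Payment_ID  INTEGER PRIMARY KEY,
        Customer_ID INTEGER REFERENCES customer(Customer_ID),
        Amount      REAL
    );
""")

# The catalog lists the tables ...
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# ... the labelled, typed columns of each table (name, declared type) ...
payment_columns = [(row[1], row[2])
                   for row in conn.execute("PRAGMA table_info(payment)")]

# ... and the relationships between entities, which is roughly the
# information an ER diagram would show: (referenced table, from, to).
payment_links = [(row[2], row[3], row[4])
                 for row in conn.execute("PRAGMA foreign_key_list(payment)")]

conn.close()
```

Unlike the file system, this catalog lets a program that has never seen the data discover its structure, though the labels and links are still all the meaning it carries.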

With an RDF database (sometimes called a Semantic Database) the catalog can record a greater level of meaning. This is an area where Cambridge Semantics, with its Smart Data Lake solutions, currently excels. The simple fact is that semantic technology allows you to capture much more meaning for the data catalog than you can using ER modelling.
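To give a feel for what "more meaning" can look like, here is a plain-Python sketch of the subject-predicate-object triples on which RDF is built. It is an assumption-laden toy, not an RDF library and not a representation of any vendor's product; every name in it is hypothetical. The point is that class hierarchies live in the same store as the data, so a program can work out facts that were never stated directly.

```python
# Each fact is a (subject, predicate, object) triple, the basic unit of RDF.
# All names below are hypothetical and invented for illustration.
TRIPLES = {
    ("Customer", "subClassOf", "Person"),
    ("Person", "subClassOf", "Party"),
    ("alice", "type", "Customer"),
    ("alice", "Date_of_Birth", "1980-01-01"),
}


def objects(subject, predicate):
    """Return every object recorded for a subject/predicate pair."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def all_types(entity):
    """Follow subClassOf links so a Customer is also known to be a Person."""
    found = set()
    pending = list(objects(entity, "type"))
    while pending:
        cls = pending.pop()
        if cls not in found:
            found.add(cls)
            pending.extend(objects(cls, "subClassOf"))
    return found


# Only "alice is a Customer" was stated; the rest is derived from the catalog.
alice_types = all_types("alice")
```

An ER model would record that a customer table exists and how it joins to others; it would not, by itself, let a program conclude that `alice` is also a `Person` and a `Party`.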

A traditional data warehouse metadata catalog can be complex, but nothing like as complex as a data lake catalog, which may include many sources of unstructured data (document data, graph data) that may require ontologies (semantic structures) to accurately define the metadata.

The point of the data lake’s data catalog is to provide users (and programs) with a data self-service capability. On a simple level, if you compare the pre-data-lake world of data warehouses and data marts to the data lake world, two obvious facts emerge:
