The Data Lake Survival Guide
By employing schema-on-read you get value from the data as soon as possible. It does not impose any structure on the data because it does not change the structure that the data had when it was loaded. It can be particularly useful when dealing with semi-structured, poly-structured, and unstructured data. It was difficult, and often impossible, to model some of this data and ingest it into a data warehouse, but there is no problem getting it into the data lake.

In general, schema-on-read allows for all types of data and encourages a less rigid organization of data. It is easier to create two different views of the same data with schema-on-read. Also, it does not prevent data modelling; it just makes it necessary to defer it.
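The idea can be made concrete with a small sketch. Assume some semi-structured JSON records have landed in the lake exactly as produced (the records, field names, and `read_with_schema` helper below are illustrative, not from any particular product): structure is only projected onto them at read time, which is also what makes two different views of the same raw data cheap to define.

```python
import json

# Raw, semi-structured records as they might land in a data lake --
# no schema was imposed at load time (illustrative sample data).
raw_records = [
    '{"id": 1, "name": "Acme", "country": "US", "revenue": 1200}',
    '{"id": 2, "name": "Globex", "country": "DE"}',
    '{"id": 3, "name": "Initech", "revenue": 900, "notes": "prospect"}',
]

def read_with_schema(lines, fields, defaults=None):
    """Apply a schema at read time: project each record onto the
    requested fields, filling gaps with defaults."""
    defaults = defaults or {}
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f, defaults.get(f)) for f in fields}

# Two different "views" of the same raw data, both defined at read time.
sales_view = list(read_with_schema(raw_records, ["id", "name", "revenue"],
                                   {"revenue": 0}))
geo_view = list(read_with_schema(raw_records, ["id", "country"],
                                 {"country": "unknown"}))

print(sales_view[1])  # {'id': 2, 'name': 'Globex', 'revenue': 0}
print(geo_view[2])    # {'id': 3, 'country': 'unknown'}
```

Note that the deferred modelling is still modelling: each view is a schema, it is simply chosen by the reader rather than imposed by the loader.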
The Data Catalog and Master Data Management (MDM)
There needs to be a catalog for all the data in the lake. It is important, and will become increasingly important, for the data catalog to be as rich as possible, embodying as much “meaning” as possible. It may help here if we explain what we mean by “data catalog,” as the term is open to interpretation.
The data catalog for an operating system (such as Windows or Linux) is the file system. The metadata it stores for each file is incomplete, as its primary purpose is to enable programs to find and access files. The programs themselves will know the layout of the data and thus how to interpret the values in any given file. This arrangement is not ideal because the data cannot be shared with programs that do not understand the file structure. It is adequate only for data that never needs to be shared.
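A minimal sketch illustrates the limitation. The binary layout below (two little-endian 32-bit integers) is a hypothetical file format invented for this example: the file system can report the file's size and timestamps, but only a program that already knows the layout can interpret the bytes.

```python
import os
import struct
import tempfile

# Write a small binary file whose internal layout only the producing
# program knows: two little-endian 32-bit integers (hypothetical format).
path = os.path.join(tempfile.mkdtemp(), "readings.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<ii", 42, 7))

# The file system's "catalog" entry: size, timestamps, permissions --
# nothing about what the bytes mean.
info = os.stat(path)
print(info.st_size)  # 8 -- that is all the catalog can tell us

# Interpreting the contents requires out-of-band knowledge of the layout.
with open(path, "rb") as f:
    a, b = struct.unpack("<ii", f.read(8))
print(a, b)  # 42 7
```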
The data catalog for a database is the schema. The schema defines the structure of all the physical data held in the data sets or tables of the database and seeks to provide a logical meaning to each data item by attaching a label to it (Customer_ID, Amount, Date_of_Birth, etc.). The amount of meaning that can be derived from the data catalog depends upon the specific database product. For relational databases the catalog roughly reflects the meaning that is captured in an ER diagram, where the relationships between specific data entities are indicated. With NoSQL and document databases the situation is similar, although how the schema is used varies from product to product.
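The difference from a bare file is that the database engine itself can answer questions about its catalog. A small sketch using SQLite (chosen here purely for convenience; the table is illustrative) shows the schema being queried for each column's label and declared type:

```python
import sqlite3

# In-memory database with a small illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        Customer_ID   INTEGER PRIMARY KEY,
        Date_of_Birth TEXT,
        Amount        REAL
    )
""")

# The database's data catalog is its schema: the engine can report each
# data item's label and type, which a plain file system cannot.
columns = conn.execute("PRAGMA table_info(customer)").fetchall()
for cid, name, col_type, *_rest in columns:
    print(name, col_type)
# Customer_ID INTEGER
# Date_of_Birth TEXT
# Amount REAL
```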
With an RDF database (sometimes called a Semantic Database) the catalog can record a greater level of meaning. This is an area where Cambridge Semantics, with its Smart Data Lake solutions, currently excels. The simple fact is that semantic technology allows you to capture much more meaning for the data catalog than you can using ER modelling.
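The underlying idea of RDF can be sketched without any particular product: facts are stored as subject–predicate–object triples, so relationships are first-class data rather than being implied by table structure, and even the relationships themselves can be described. The triples and the `objects` helper below are an illustrative toy, not a real RDF vocabulary or store:

```python
# Facts as subject-predicate-object triples (illustrative toy data).
triples = {
    ("Acme", "is_a", "Customer"),
    ("Acme", "located_in", "Boston"),
    ("Boston", "is_a", "City"),
    # Even the predicate itself can be described -- meaning that is
    # awkward to capture in an ER model.
    ("located_in", "is_a", "SpatialRelation"),
}

def objects(subject, predicate):
    """Query the triple set: all objects for a subject/predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("Acme", "located_in"))  # {'Boston'}
print(objects("located_in", "is_a"))  # {'SpatialRelation'}
```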
A traditional data warehouse metadata catalog can be complex, but nothing like as complex as a data lake catalog, which may include many sources of unstructured data (document data, graph data) that may require ontologies (semantic structures) to accurately define the metadata.
The point of the data lake’s data catalog is to provide users (and programs) with a data self-service capability. On a simple level, if you compare the pre-data-lake world of data warehouses and data marts to the data lake world, two obvious facts emerge: