
The Data Lake Survival Guide

The Logical and The Physical

In our view, the two primary dynamics involved in establishing a data lake are:

1. To gradually migrate all the data that makes up the system of record to the data lake, where it becomes the golden copy of the data.

2. For the data lake to become the primary point of ingest for external data, so that governance processing is applied to data, both internal and external, as soon as possible after it enters the data lake (see the sketch below).
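To make the second dynamic concrete, the sketch below shows, in hypothetical Python, one simple form that governance processing at the point of ingest can take: stamping each incoming batch with provenance metadata and registering it in a catalog before it lands in the lake. The function, field names, and in-memory catalog are assumptions for illustration only, not a prescribed design.

    # Illustrative only: minimal governance at ingest, i.e. provenance tagging and catalog
    # registration applied as soon as a batch of data enters the lake. Names are hypothetical.
    import hashlib
    import json
    from datetime import datetime, timezone

    CATALOG = []  # stands in for a real metadata catalog / governance store

    def ingest(records: list, source: str, lake_zone: str) -> dict:
        """Attach provenance metadata to a batch and register it before it is stored."""
        payload = json.dumps(records, sort_keys=True).encode("utf-8")
        entry = {
            "source": source,                                  # where the data came from
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "record_count": len(records),
            "checksum": hashlib.sha256(payload).hexdigest(),   # supports later audit and lineage
            "zone": lake_zone,                                 # e.g. "raw", before any cleansing
        }
        CATALOG.append(entry)   # registration makes the batch discoverable and governable
        return entry

    print(ingest([{"id": 1, "amount": 9.99}], source="crm_export", lake_zone="raw"))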

We note here that the data lake concept, first proposed about five years ago, has gradually grown in sophistication; here we describe current thinking about what a data lake is and how to use it.

For companies building a data lake, it is important to think in terms of a "logical data lake" along the lines we described, and to acknowledge that its physical implementation may be far more involved than our diagrams suggest.

If the recent history of IT has taught us anything, it is that everything needs to scale.

Most companies have a series of transactional systems (the mission-critical systems) that currently constitute most if not all of the system of record. For the data lake to assume its role as the system of record, the data from such systems needs to be copied into the data lake.
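As a rough sketch of what such a copy might look like in practice, the following Python example batch-exports a single table from a relational source into the raw zone of a data lake as Parquet. The database, table name, lake path, and use of pandas/pyarrow are illustrative assumptions rather than a prescription; real migrations more often rely on change-data-capture or bulk-ingest tooling.

    # Illustrative only: batch copy of one transactional table into a data lake's raw zone.
    # Database, table, and lake paths are hypothetical; assumes pandas and pyarrow are installed.
    import sqlite3                        # stands in for any relational source system
    from datetime import date
    from pathlib import Path

    import pandas as pd

    def copy_table_to_lake(db_path: str, table: str, lake_root: str) -> Path:
        """Read a full table from the source system and land it as Parquet in the lake."""
        with sqlite3.connect(db_path) as conn:
            # The table name comes from trusted configuration, not user input.
            df = pd.read_sql_query(f"SELECT * FROM {table}", conn)
        # Partitioning by load date keeps each batch copy separate and auditable.
        target_dir = Path(lake_root) / "raw" / table / f"load_date={date.today()}"
        target_dir.mkdir(parents=True, exist_ok=True)
        target_file = target_dir / "part-0000.parquet"
        df.to_parquet(target_file, index=False)   # pyarrow writes the Parquet file
        return target_file

    if __name__ == "__main__":
        print(copy_table_to_lake("orders.db", "orders", "/data/lake"))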

Pre-assembled Data Lakes

For many companies the idea of commencing a strategic data lake project will make no commercial sense, particularly if their primary goal is, for example, only to do analytic exploration of a collection of data from various sources. Such a set of applications is unlikely to require all the governance activities we have discussed. In such circumstances the pragmatic goal will be to build the desired applications to a simpler target data lake architecture that omits some of the elements we have described.

This approach will be easier, and more likely to bring success, if a data lake platform capable of delivering a data lake "out of the box" is employed. As previously noted, vendors such as Cask, Unifi and Cambridge Semantics provide such capability. They deliver a flexible abstraction layer between the Apache Stack and the applications built on top of it. They also provide other components for managing, building and enriching data lake applications.

It is possible to think of such vendors as providing a data operating system for an expansible cluster onto which you can build one or more applications. It is also feasible to build many such "dedicated" data lakes with different applications on each. A company might, for example, build an event log "data lake" for IT operations usage, a real-time manufacturing data lake, a sales and marketing "data lake" and so on.

One of the beauties of the current Apache Stack is that, with the inclusion of the powerful communications components Kafka and NiFi, it is possible to establish loosely coupled data lakes that flow data from one to another. If the data is coherently managed, simply
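As a rough illustration of such loose coupling, the sketch below uses the kafka-python client to let one data lake publish curated events to a topic that another lake's ingest process consumes. The broker address, topic name, and message shape are assumptions made for the example; NiFi could play a comparable routing role.

    # Illustrative only: two data lakes loosely coupled through a shared Kafka topic.
    # Assumes the kafka-python client; broker, topic, and message fields are hypothetical.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    BROKERS = "broker1:9092"           # hypothetical Kafka cluster reachable by both lakes
    TOPIC = "ops-lake.curated-events"  # hypothetical topic exposed by the IT operations lake

    # Publishing side, e.g. the event-log data lake for IT operations.
    producer = KafkaProducer(
        bootstrap_servers=BROKERS,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, {"event": "disk_full", "host": "web-07"})
    producer.flush()

    # Consuming side, e.g. the sales-and-marketing lake's ingest process.
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKERS,
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        consumer_timeout_ms=5000,      # stop iterating when no new messages arrive
    )
    for message in consumer:
        print(message.value)           # hand off to this lake's own ingest and governance steps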

