The Data Lake Survival Guide
The Logical and The Physical
In our view, the two primary dynamics involved in establishing a data lake are:

1. To gradually migrate all the data that makes up the system of record to the data lake, where it becomes the golden copy of the data.
2. For the data lake to become the primary point of ingest for external data, applying governance processing to data, both internal and external, as soon as possible after it enters the data lake.
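The second dynamic, applying governance as data enters the lake, can be illustrated with a minimal ingest wrapper. This is a sketch only: the metadata fields, the required-field check, and the quarantine flag are illustrative assumptions, not features of any particular data lake product.

```python
import datetime

# Illustrative schema assumption: every incoming record should carry these fields.
REQUIRED_FIELDS = {"id", "payload"}

def govern_on_ingest(record: dict, source: str) -> dict:
    """Stamp provenance metadata on a record and validate it at the point of ingest.

    Records missing required fields are flagged for quarantine rather than
    silently dropped, so governance is applied as early as possible.
    """
    missing = REQUIRED_FIELDS - record.keys()
    return {
        "data": record,
        "provenance": {
            "source": source,
            "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
        "quarantined": bool(missing),
        "issues": sorted(missing),
    }

ok = govern_on_ingest({"id": 1, "payload": "..."}, source="crm")
bad = govern_on_ingest({"id": 2}, source="clickstream")
print(ok["quarantined"], bad["issues"])  # False ['payload']
```

The point of the sketch is placement, not the checks themselves: because governance runs at the single point of ingest, every downstream consumer sees records that already carry provenance and a validation verdict.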
We note here that the data lake concept, which was first proposed about five years ago, has gradually grown in sophistication, and thus here we are describing current thinking about what a data lake is and how to use it.

For companies building a data lake, it is important to think in terms of a “logical data lake” along the lines we described, and to acknowledge that its physical implementation may be far more involved than our diagrams suggest.

If the recent history of IT has taught us anything, it is that everything needs to scale. Most companies have a series of transactional systems (the mission-critical systems) that currently constitute most if not all of the system of record. For the data lake to assume its role as the system of record, the data from such systems needs to be copied into the data lake.
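Copying transactional data into the lake is typically done incrementally rather than as repeated full extracts. A minimal sketch of watermark-based incremental copy follows; the in-memory lists stand in for a source table and the lake’s raw zone, and the assumption that rows carry a monotonically increasing `updated_at` value is ours, not a general guarantee.

```python
def incremental_copy(source_rows, lake_rows, watermark):
    """Copy rows newer than the last watermark from a transactional source
    into the lake, returning the new watermark for the next run.

    Assumes each row carries a monotonically increasing 'updated_at' value
    (an illustrative assumption; real systems often use change data capture).
    """
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    lake_rows.extend(new_rows)
    # If nothing was new, the watermark is unchanged.
    return max((r["updated_at"] for r in new_rows), default=watermark)

source = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 30},
]
lake = []
wm = incremental_copy(source, lake, watermark=15)
print(len(lake), wm)  # 2 30
```

Each run picks up where the previous one stopped, so the lake gradually accumulates the full system of record without re-reading the source tables wholesale.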
Pre-assembled Data Lakes
For many companies the idea of commencing a strategic data lake project will make no commercial sense, particularly if their primary goal is, for example, only to do analytic exploration of a collection of data from various sources. Such a set of applications is unlikely to require all the governance activities we have discussed. In these circumstances the pragmatic goal will be to build the desired applications on a simpler target data lake architecture that omits some of the elements we have described.
This approach will be easier and more likely to succeed if a data lake platform capable of delivering a data lake “out of the box” is employed. As previously noted, vendors such as Cask, Unifi or Cambridge Semantics provide such capability. They deliver a flexible abstraction layer between the Apache Stack and the applications built on top of it. They also provide other components for managing, building and enriching data lake applications.
It is possible to think of such vendors as providing a data operating system for an expansible cluster onto which you can build one or more applications. It is also feasible to build many such “dedicated” data lakes with different applications on each. A company might, for example, build an event log “data lake” for IT operations usage, a real-time manufacturing data lake, a sales and marketing “data lake” and so on.
One of the beauties of the current Apache Stack is that, with the inclusion of the powerful communications components, Kafka and NiFi, it is possible to establish loosely coupled data lakes that flow data from one to another. If the data is coherently managed, simply