The Data Lake Survival Guide
The Logical and The Physical
In our view, the two primary dynamics involved in establishing a data lake are:

1. To gradually migrate all the data that makes up the system of record to the data lake, where it becomes the golden copy of the data.
2. For the data lake to become the primary point of ingest for external data, applying governance processing to data, both internal and external, as soon as possible after it enters the data lake.
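The second dynamic, applying governance as data enters the lake, can be illustrated with a minimal ingest wrapper. This is a sketch only: the metadata fields, the required-field check, and the quarantine flag are illustrative assumptions, not features of any particular data lake product.

```python
import datetime

# Illustrative schema assumption: every incoming record should carry these fields.
REQUIRED_FIELDS = {"id", "payload"}

def govern_on_ingest(record: dict, source: str) -> dict:
    """Stamp provenance metadata on a record and validate it at the point of ingest.

    Records missing required fields are flagged for quarantine rather than
    silently dropped, so governance is applied as early as possible.
    """
    missing = REQUIRED_FIELDS - record.keys()
    return {
        "data": record,
        "provenance": {
            "source": source,
            "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
        "quarantined": bool(missing),
        "issues": sorted(missing),
    }

ok = govern_on_ingest({"id": 1, "payload": "..."}, source="crm")
bad = govern_on_ingest({"id": 2}, source="clickstream")
print(ok["quarantined"], bad["issues"])  # False ['payload']
```

The point of the sketch is placement, not the checks themselves: because governance runs at the single point of ingest, every downstream consumer sees records that already carry provenance and a validation verdict.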
We note here that the data lake concept, which was first proposed about five years ago, has gradually grown in sophistication, and thus here we are describing current thinking about what a data lake is and how to use it.

For companies building a data lake, it is important to think in terms of a “logical data lake” along the lines we described, and to acknowledge that its physical implementation may be far more involved than our diagrams suggest.

If the recent history of IT has taught us anything, it is that everything needs to scale. Most companies have a series of transactional systems (the mission-critical systems) that currently constitute most if not all of the system of record. For the data lake to assume its role as the system of record, the data from such systems needs to be copied into the data lake.
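Copying transactional data into the lake is typically done incrementally rather than as repeated full extracts. A minimal sketch of watermark-based incremental copy follows; the in-memory lists stand in for a source table and the lake’s raw zone, and the assumption that rows carry a monotonically increasing `updated_at` value is ours, not a general guarantee.

```python
def incremental_copy(source_rows, lake_rows, watermark):
    """Copy rows newer than the last watermark from a transactional source
    into the lake, returning the new watermark for the next run.

    Assumes each row carries a monotonically increasing 'updated_at' value
    (an illustrative assumption; real systems often use change data capture).
    """
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    lake_rows.extend(new_rows)
    # If nothing was new, the watermark is unchanged.
    return max((r["updated_at"] for r in new_rows), default=watermark)

source = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 30},
]
lake = []
wm = incremental_copy(source, lake, watermark=15)
print(len(lake), wm)  # 2 30
```

Each run picks up where the previous one stopped, so the lake gradually accumulates the full system of record without re-reading the source tables wholesale.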
Pre-assembled Data Lakes
For many companies the idea of commencing a strategic data lake project will make no commercial sense, particularly if their primary goal is, for example, only to do analytic exploration of a collection of data from various sources. Such a set of applications is unlikely to require all the governance activities we have discussed. In these circumstances the pragmatic goal will be to build the desired applications on a simpler target data lake architecture that omits some of the elements we have described.
This approach will be easier and more likely to succeed if a data lake platform capable of delivering a data lake “out of the box” is employed. As previously noted, vendors such as Cask, Unifi or Cambridge Semantics provide such capability. They deliver a flexible abstraction layer between the Apache Stack and the applications built on top of it. They also provide other components for managing, building and enriching data lake applications.
It is possible to think of such vendors as providing a data operating system for an expansible cluster onto which you can build one or more applications. It is also feasible to build many such “dedicated” data lakes with different applications on each. A company might, for example, build an event log “data lake” for IT operations usage, a real-time manufacturing data lake, a sales and marketing “data lake” and so on.
One of the beauties of the current Apache Stack is that, with the inclusion of the powerful communications components, Kafka and NiFi, it is possible to establish loosely coupled data lakes that flow data from one to another. If the data is coherently managed, simply