The Data Lake Survival Guide
Data Lake Architecture
Let’s begin with the idea that the data lake is the system of record, and let’s be clear what we mean by this. The system of record is the system that records all the data used by the business. It also holds the golden copy of each data record.

For simplicity, the system of record should be thought of as a logical system. It may be possible to implement it on a single cluster of servers, but this is not a requirement and should not be a goal. In practice the whole configuration will probably involve multiple clusters, if for no other reason than to provide disaster recovery.

The system of record should also be the system where governance processes are applied to data. Clearly, data needs to be subject to governance wherever it is used within the organization. Some governance processes, particularly data security, need to be applied to data as soon as possible after its creation or capture. For that reason, the data lake will inevitably be the landing zone for external data brought into the business, so that governance processes can quickly and easily be applied to it. Data created within the organization should be passed to the data lake immediately after creation so that governance processes can be applied.
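As a minimal sketch of this idea, the snippet below shows a landing zone that applies one governance step, masking sensitive fields, the moment a record arrives. The field names, the hashing choice, and the `land_record` function are all illustrative assumptions, not a prescribed design.

```python
import hashlib
import json
from datetime import datetime, timezone

# Assumed names of sensitive fields; a real deployment would drive
# this from a governance catalog rather than a hard-coded set.
SENSITIVE_FIELDS = {"credit_card", "ssn"}

def mask(value: str) -> str:
    """Replace a sensitive value with a short one-way hash."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def land_record(record: dict, lake: list) -> dict:
    """Apply governance on ingest, then store the governed copy."""
    governed = dict(record)
    for field in SENSITIVE_FIELDS & governed.keys():
        governed[field] = mask(str(governed[field]))
    # stamp the record so the lake keeps an audit of when it landed
    governed["_landed_at"] = datetime.now(timezone.utc).isoformat()
    lake.append(governed)
    return governed

lake = []
rec = land_record({"user": "u42", "credit_card": "4111111111111111"}, lake)
print(json.dumps(rec, indent=2))
```

The point of the sketch is the ordering: the governed copy is the only copy that ever reaches the lake, so downstream consumers never see the raw sensitive value.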
It is best to think of the data lake as a store of event records. We can think of it in this way: events are atoms of data, and transactions (the traditional paradigm with its traditional data structures) are molecules of information. The analogy works reasonably well.
Consider, for example, a web site visit. A user lands on the site, clicks through a few pages, decides to buy something, enters credit card details, and clicks on the “confirm” button. In transactional terms we may think of this as a purchase: a molecule of data that’s quickly followed by a delivery transaction, another molecule of data to record.

But in reality it’s a stream of events, a series of atoms of data. Every user mouse action creates an event. The computer records these events and responds to each one immediately; it displays new web pages, expands text descriptions, or whatever. The purchase confirmation is just another event, distinguished only by the fact that it generates a cascade of other related events in the application or in other applications.
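The atoms-and-molecules analogy can be sketched in a few lines: the raw stream holds individual events, and the purchase “molecule” is something you derive from them. The event names and payload shapes here are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One 'atom' of data: a single user or system action."""
    ts: int
    user: str
    kind: str            # e.g. "page_view", "add_to_cart", "confirm"
    payload: dict = field(default_factory=dict)

def purchases(events):
    """Derive transaction 'molecules' from the raw event stream:
    a purchase is just the cart contents at the moment of 'confirm'."""
    cart = []
    for e in sorted(events, key=lambda e: e.ts):
        if e.kind == "add_to_cart":
            cart.append(e.payload["item"])
        elif e.kind == "confirm":
            yield {"user": e.user, "items": list(cart), "ts": e.ts}
            cart = []

stream = [
    Event(1, "u1", "page_view", {"page": "/home"}),
    Event(2, "u1", "add_to_cart", {"item": "book"}),
    Event(3, "u1", "page_view", {"page": "/checkout"}),
    Event(4, "u1", "confirm"),
]
print(list(purchases(stream)))  # one transaction built from four events
```

Storing the four atoms rather than only the derived molecule is exactly what makes the lake useful later: the same stream can be re-read to answer questions (abandoned carts, click paths) that the purchase record alone cannot.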
When you look at it like that, business systems consist of applications generating events and sending event information to other applications. There is nothing particularly special about web applications in this respect; all applications are like that. They respond to events and send messages or data to other applications. That’s been the nature of computing for decades.
When you scan all the servers and network devices in a data center you find them<br />
awash with log files that store data about events of every kind: network logs, message<br />
logs, system logs, application logs, API logs, security logs and so on. Collectively the<br />
logs provide an extensive audit trail of the activity of the data center, organized by<br />
timestamp. It happens at the application level, at the data level, and lower down at the hardware level. Hidden within this disparate set of data are details of anomalies, error conditions, hacker attacks, business transactions, and so on.
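The timestamp-organized audit trail described above amounts to a k-way merge of sorted streams. A minimal sketch, with entirely made-up log contents, using Python’s standard `heapq.merge`:

```python
import heapq

# Illustrative log entries as (timestamp, source, message) tuples;
# each log is already sorted by timestamp, as log files usually are.
network_log  = [(100, "net", "connection opened"), (140, "net", "connection closed")]
app_log      = [(110, "app", "order 123 created"), (130, "app", "order 123 paid")]
security_log = [(120, "sec", "login from new device")]

# heapq.merge interleaves the sorted inputs into one sorted stream,
# giving a single time-ordered audit trail across all sources.
audit_trail = list(heapq.merge(network_log, app_log, security_log))
for ts, source, message in audit_trail:
    print(f"{ts:>4} [{source}] {message}")
```

Because the merge only ever holds one entry per source in memory, the same approach scales from three lists to the thousands of log files a real data center produces.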