03.04.2017 Views

The Data Lake Survival Guide


Data Lake Architecture


Let’s begin with the idea that the data lake is the system of record. And let’s be clear what we mean by this. The system of record is the system that records all the data that is used by the business. It also holds the golden copy of each data record.

For simplicity, the system of record should be thought of as a logical system. It may be possible to implement it on a single cluster of servers, but this is not a requirement and should not be a goal. In practice the whole configuration will probably involve multiple clusters, if for no other reason than to provide disaster recovery.

The system of record should also be the system where governance processes are applied to data. Clearly, data needs to be subject to governance wherever it is used within the organization. Some governance processes, particularly data security, need to be applied to data as soon as possible after its creation or capture. For that reason, the data lake will inevitably be the landing zone for external data brought into the business, so governance processes can quickly and easily be applied to it. Data created within the organization should be passed to the data lake immediately after creation so that governance processes can be applied.
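
As an illustration of applying a security rule at the moment data lands, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `land` function, the list standing in for the lake, and the choice of card-number masking as the governance rule. The text does not prescribe any particular mechanism.

```python
import re

# Hypothetical rule: 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_NUMBER = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def mask_card_numbers(text):
    """Replace all but the last four digits of anything that looks like a card number."""
    def mask(match):
        digits = re.sub(r"\D", "", match.group())
        return "*" * (len(digits) - 4) + digits[-4:]
    return CARD_NUMBER.sub(mask, text)


def land(record, lake):
    """Apply the masking rule to every string field, then store the record."""
    governed = {
        key: mask_card_numbers(value) if isinstance(value, str) else value
        for key, value in record.items()
    }
    lake.append(governed)
    return governed


lake = []  # stand-in for the landing zone of the data lake
land({"user": "u1", "note": "card 4111 1111 1111 1111 accepted", "amount": 20}, lake)
```

The point of the sketch is only the ordering: the rule runs before the record is stored, so no ungoverned copy ever sits in the lake.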

It is best to think of the data lake as a store of event records. We can think of it in this way: events are atoms of data and transactions (the traditional paradigm with its traditional data structures) are molecules of information. The analogy works reasonably well.

Consider, for example, a web site visit. A user lands on the site, clicks through a few pages, decides to buy something, enters credit card details, and clicks on the “confirm” button. In transactional terms we may think of this as a purchase: a molecule of data. It is quickly followed by a delivery transaction, another molecule of data to record.

But in reality it’s a stream of events, a series of atoms of data. Every user mouse action creates an event. The computer records these events and responds to each one immediately; it displays new web pages or expands text descriptions or whatever. The purchase confirmation is just another event, distinguished only by the fact that it generates a cascade of other related events in the application or in other applications.
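
The atoms-and-molecules view of the web site visit can be sketched in a few lines of Python. The `Event` fields, the event kinds such as `"confirm_purchase"`, and the shape of the purchase record are all hypothetical names chosen for this illustration, not a schema taken from the guide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One atom of data: a single timestamped user action."""
    timestamp: datetime
    session_id: str
    kind: str          # e.g. "page_view", "click", "confirm_purchase"
    detail: str = ""


def purchases_from_events(events):
    """Build one purchase 'molecule' for each confirmation event in the stream."""
    purchases = []
    history = {}  # session_id -> events seen so far in that session
    for event in sorted(events, key=lambda e: e.timestamp):
        history.setdefault(event.session_id, []).append(event)
        if event.kind == "confirm_purchase":
            purchases.append({
                "session_id": event.session_id,
                "confirmed_at": event.timestamp,
                "item": event.detail,
                "event_count": len(history[event.session_id]),
            })
    return purchases


start = datetime(2017, 1, 1, 12, 0, 0)
visit = [
    Event(start, "s1", "page_view", "/home"),
    Event(start + timedelta(seconds=20), "s1", "click", "/products/42"),
    Event(start + timedelta(seconds=45), "s1", "enter_card_details"),
    Event(start + timedelta(seconds=60), "s1", "confirm_purchase", "product-42"),
]
purchases = purchases_from_events(visit)
```

Storing the events and deriving the purchase from them keeps the full stream available: the transaction can always be rebuilt from the events, but not the other way round.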

When you look at it like that, business systems consist of applications generating events and sending event information to other applications. There is nothing particularly special about web applications in this respect; all applications are like that. They respond to events and send messages or data to other applications. That’s been the nature of computing for decades.

When you scan all the servers and network devices in a data center you find them awash with log files that store data about events of every kind: network logs, message logs, system logs, application logs, API logs, security logs and so on. Collectively the logs provide an extensive audit trail of the activity of the data center, organized by time stamp. This logging happens at the application level, at the data level and lower down at the hardware level. Hidden within this disparate set of data are details of anomalies, error conditions, hacker attacks, business transactions and so on.
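
To make the idea of a collective, time-stamped audit trail concrete, here is a small Python sketch that merges several logs into one ordered stream. The log lines, their format (an ISO time stamp followed by a message) and the keyword filter are invented for the example; real logs vary widely in format.

```python
import heapq
from datetime import datetime


def parse(source, line):
    """Split 'TIMESTAMP message' into a (timestamp, source, message) tuple."""
    stamp, message = line.split(" ", 1)
    return (datetime.fromisoformat(stamp), source, message)


# Hypothetical log contents; each log is already in time order.
logs = {
    "network": [
        "2017-01-01T12:00:01 connection opened from 10.0.0.7",
        "2017-01-01T12:00:09 connection closed",
    ],
    "application": [
        "2017-01-01T12:00:03 order 1001 confirmed",
        "2017-01-01T12:00:05 payment gateway timeout",
    ],
    "security": [
        "2017-01-01T12:00:04 failed login for admin",
    ],
}

streams = [[parse(source, line) for line in lines] for source, lines in logs.items()]

# heapq.merge interleaves already-sorted inputs into one sorted stream.
audit_trail = list(heapq.merge(*streams))

# A naive keyword scan for entries worth a closer look.
suspicious = [entry for entry in audit_trail if "failed" in entry[2] or "timeout" in entry[2]]
```

Once the entries share one timeline, a business transaction (the confirmed order) and a possible attack (the failed login) sit side by side, which is what makes the combined trail more useful than any single log.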
