03.04.2017 Views

The Data Lake Survival Guide



We represent this reality, in Figure 7, by showing real-time<br />

applications running directly against the real-time data<br />

stream within the <strong>Data</strong> Bus and also accessing a disk-based<br />

data source. In practice this is likely to be achieved by a<br />

Lambda or Kappa architecture, which is far more involved<br />

than the diagram suggests.<br />

In practice it is far more likely, for latency reasons, that real-time<br />

apps will not be located anywhere near the data lake,<br />

but will instead be as close as possible to the data stream(s)<br />

that feed them. Nevertheless they will likely pass the data<br />

stream they process directly to the data lake. Since such<br />

applications run prior to any data governance it may be<br />

necessary to pass data to these applications if anything important is discovered during<br />

the governance processes.<br />

Governance Applications<br />

We need to distinguish between the governance processing that occurs on ingest and<br />

the governance processing that may occur later. One of the rules of governance itself<br />

may be to specify what processes<br />

have to take place before data is<br />

made available to data lake users.<br />
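
The idea that governance itself specifies which processes must complete before data is released can be sketched as a simple gate. This is a minimal illustration and not part of the guide: the step names, the `REQUIRED_BEFORE_RELEASE` policy, and both functions are hypothetical.

```python
# Hypothetical policy: governance steps that must finish before a dataset
# is made available to data lake users. Step names are illustrative only.
REQUIRED_BEFORE_RELEASE = {"security", "cleansing", "metadata"}


def is_releasable(completed_steps):
    """Return True only if every required governance step has completed."""
    return REQUIRED_BEFORE_RELEASE.issubset(completed_steps)


def missing_steps(completed_steps):
    """Return the required steps that are still outstanding, sorted by name."""
    return sorted(REQUIRED_BEFORE_RELEASE - set(completed_steps))
```

A dataset that has only passed the security step would be held back, and `missing_steps(["security"])` would report the cleansing and metadata steps as outstanding.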

It makes sense to do as much<br />

processing as possible while data is<br />

on the <strong>Data</strong> Bus (i.e. held in memory)<br />

since it can be accessed very<br />

quickly. <strong>The</strong> ideal would be to do<br />

all governance processing on ingest,<br />

but this may be impossible. Some<br />

data cleansing activity and some<br />

metadata discovery activity require<br />

human intervention and it may not<br />

be practical to do it on ingest. For<br />

some data, the chosen policy may<br />

be to implement schema-on-read so metadata gathering will occur after ingest.<br />

[Figure 8. Ingest Governance Apps: Data Security, Data Transforms, Data Aggregation, Metadata Management, Data Cleansing]<br />

There can be other competing dynamics. The desire may be to encrypt all data (or at least all data that is destined to be encrypted)<br />

on ingest. If data is to be stored in HDFS, this is doubly important, as it<br />

is a write-once file system. However, data cleansing will require data to be unencrypted.<br />
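
The tension between encrypting on ingest and cleansing unencrypted data suggests one possible ordering: cleanse first, then encrypt before the data is written. The sketch below assumes that ordering. The names `ingest`, `strip_whitespace` and `placeholder_encrypt` are hypothetical, and the "encryption" is a labeled stand-in, not real cryptography.

```python
def ingest(record, cleanse, encrypt, sensitive_fields):
    """Cleanse a record while it is still readable, then encrypt selected fields."""
    cleaned = cleanse(dict(record))  # cleansing needs plaintext values
    for field in sensitive_fields:
        if field in cleaned:
            cleaned[field] = encrypt(cleaned[field])
    return cleaned


def strip_whitespace(record):
    """Example cleansing step: trim stray whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def placeholder_encrypt(value):
    """NOT real cryptography: a stand-in that only marks a value as protected."""
    return "enc(" + str(value)[::-1] + ")"
```

In a real deployment the `encrypt` callable would be supplied by whatever security tooling the organization has chosen. The point here is only the order of the two steps.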

<strong>The</strong> data transform and aggregation activities shown in the diagram are not governance<br />

activities per se. From an efficiency perspective it will be better to perform data transformations,<br />

aggregations and other data calculations that are known to be required before data is written<br />

to disk. Of course, some will need to be done later simply because not all the data they<br />

require is in the data stream.<br />
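
As a rough illustration of aggregating while data is still in memory, the sketch below accumulates per-key totals as records pass through on their way to storage. The function name, the field names and the in-memory list standing in for a disk write are all assumptions made for this example.

```python
from collections import defaultdict


def aggregate_in_stream(records, key_field, value_field):
    """Pass records through unchanged while accumulating per-key totals in memory."""
    totals = defaultdict(float)
    passed = []
    for record in records:
        totals[record[key_field]] += record[value_field]
        passed.append(record)  # stand-in for writing the record to disk
    return passed, dict(totals)
```

Aggregations that depend on data not present in the stream cannot be computed this way and would have to run later against stored data.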

[Figure 7. Real-Time (diagram labels: Data Bus, Real-Time Apps, Data)]<br />
