The Data Lake Survival Guide
We represent this reality in Figure 7 by showing real-time applications running directly against the real-time data stream within the Data Bus and also accessing a disk-based data source. In practice this is likely to be achieved with a Lambda or Kappa architecture, which is far more involved than the diagram suggests.
In practice it is far more likely, for latency reasons, that real-time apps will not be located anywhere near the data lake, but will instead be as close as possible to the data stream(s) that feed them. Nevertheless, they will likely pass the data stream they process directly to the data lake. Since such applications run prior to any data governance, it may be necessary to pass data to these applications if anything important is discovered during the governance processes.
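The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RealTimeApp` class, the `lake_sink` callable, the alert threshold of 100, and the event fields `source` and `value` are all hypothetical and do not come from any particular streaming product. It shows an app that does its low-latency work first, forwards every event to the data lake regardless, and accepts findings from governance after the fact.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class RealTimeApp:
    """Hypothetical real-time app located next to its data stream."""

    lake_sink: Callable[[dict], None]  # forwards each event to the data lake
    blocked_sources: set = field(default_factory=set)  # filled by governance feedback
    alerts: list = field(default_factory=list)

    def on_governance_feedback(self, source: str) -> None:
        # Governance runs after this app, so its findings arrive later
        # and are passed back to the app here.
        self.blocked_sources.add(source)

    def process(self, events: Iterable[dict]) -> None:
        for event in events:
            # Low-latency work happens first, before any governance.
            # The threshold of 100 is an arbitrary illustrative value.
            trusted = event["source"] not in self.blocked_sources
            if trusted and event["value"] > 100:
                self.alerts.append(event)
            # Every event is passed on to the data lake, trusted or not.
            self.lake_sink(event)


# Usage: a plain list stands in for the data lake.
lake = []
app = RealTimeApp(lake_sink=lake.append)
app.process([{"source": "a", "value": 150}, {"source": "b", "value": 50}])
app.on_governance_feedback("a")  # governance later flags source "a"
app.process([{"source": "a", "value": 200}])
```

After the feedback call, events from source "a" still reach the lake but no longer raise alerts, which is one simple way the governance findings could influence an application that runs ahead of governance.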
Governance Applications<br />
We need to distinguish between the governance processing that occurs on ingest and the governance processing that may occur later. One of the rules of governance itself may be to specify what processes have to take place before data is made available to data lake users. It makes sense to do as much processing as possible while data is on the Data Bus (i.e. held in memory) since it can be accessed very quickly. The ideal would be to do all governance processing on ingest, but this may be impossible. Some data cleansing activity and some metadata discovery activity require human intervention, and it may not be practical to do them on ingest. For some data, the chosen policy may be to implement schema-on-read, so metadata gathering will occur after ingest. There can be other competing dynamics. The desire may be to encrypt all data (or at least all data that is destined to be encrypted) on ingest. If data is to be stored in HDFS, this is doubly important, as it is a write-once file system. However, data cleansing will require data to be unencrypted.
Figure 8. Ingest Governance Apps (components shown: Data Security, Data Transforms, Data Aggregation, Metadata Management, Data Cleansing)
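To make the schema-on-read policy mentioned above concrete, here is a minimal sketch, assuming records arrive as JSON text. The names `raw_store`, `ingest` and `read_with_schema` are hypothetical, and a Python list stands in for files in the lake. Nothing is parsed or validated on ingest; field names and types are applied only when the data is read.

```python
import json

raw_store = []  # stands in for raw files in the data lake


def ingest(line: str) -> None:
    # Schema-on-read: store the record exactly as received,
    # with no parsing, validation or metadata gathering on ingest.
    raw_store.append(line)


def read_with_schema(schema: dict) -> list:
    """Apply field names and type casts at read time.

    `schema` maps a field name to a callable that casts the raw value.
    Fields missing from a record are returned as None.
    """
    rows = []
    for line in raw_store:
        record = json.loads(line)
        rows.append(
            {
                name: cast(record[name]) if name in record else None
                for name, cast in schema.items()
            }
        )
    return rows


# Usage: two raw records, the second lacking a "temp" field.
ingest('{"id": "7", "temp": "21.5"}')
ingest('{"id": "8"}')
rows = read_with_schema({"id": int, "temp": float})
```

The trade-off is the one the text describes: ingest stays cheap, but the cost of discovering and applying metadata is deferred to every reader.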
The data transform and aggregation activities shown in the diagram are not governance activities per se. From an efficiency perspective it will be better to perform data transformations, aggregations and other data calculations that are known to be required before data is written to disk. Of course, some will need to be done later simply because not all the data they require is in the data stream.
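The ordering of ingest steps discussed in this section can be sketched as follows. This is a hedged illustration, not a prescribed design: the function `ingest_batch`, the record fields `region` and `amount_cents`, and the `encrypt` and `write` callables are all hypothetical. The point is only the sequence: cleansing runs on plaintext, known transformations and aggregations run while the data is still in memory, and encryption happens last, immediately before the write.

```python
from collections import defaultdict
from typing import Callable, Iterable


def ingest_batch(
    records: Iterable[dict],
    encrypt: Callable[[bytes], bytes],  # supplied by the caller (hypothetical)
    write: Callable[[bytes], None],     # supplied by the caller (hypothetical)
) -> dict:
    """Run in-memory ingest steps in order, then protect and persist."""
    totals = defaultdict(float)
    for rec in records:
        # 1. Cleansing needs unencrypted data, so it runs before encryption.
        region = rec["region"].strip().upper()
        # 2. A transformation known to be required: cents to currency units.
        amount = rec["amount_cents"] / 100
        # 3. Aggregation computed while the data is still in memory.
        totals[region] += amount
        # 4. Encrypt last, then write the protected bytes to storage.
        write(encrypt(f"{region},{amount:.2f}".encode()))
    return dict(totals)
```

In a real deployment `encrypt` would wrap a vetted encryption library and `write` would wrap the storage client; both are left as parameters here so the sketch makes no claim about any specific product. Aggregations that depend on data outside the stream would still have to run later, as the text notes.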
Figure 7. Real-Time (elements shown: Data Bus, Real-Time Apps, Data)