The Data Lake Survival Guide
If we now look at Figure 10, which illustrates the complete data lake, the only two
processes we have not yet discussed are data extracts and data life-cycle management.
While data lake processing can be fast, applications that require particularly high
performance will need data exported to a fast data engine or database. It will probably
be many years before data lake access speeds get close to those of a purpose-built
database.
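As a rough illustration of such an extract, the sketch below copies a small subset of lake data into an indexed database table so that a latency-sensitive application can query it directly. It uses Python's standard `sqlite3` module as a stand-in for the fast data engine; the sample rows, the `readings` table, and the `extract_to_database` function are all hypothetical names chosen for this example, not part of any particular data lake product.

```python
import sqlite3

# Hypothetical rows standing in for data already held in the lake.
lake_rows = [
    ("2024-01-01", "sensor-a", 20.5),
    ("2024-01-01", "sensor-b", 19.0),
    ("2024-01-02", "sensor-a", 21.5),
]


def extract_to_database(rows, conn):
    """Copy a subset of lake data into an indexed table for fast lookups."""
    conn.execute("CREATE TABLE readings (day TEXT, sensor TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    # The index is what the serving application relies on for quick access.
    conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
extract_to_database(lake_rows, conn)

row_count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
avg_sensor_a = conn.execute(
    "SELECT AVG(value) FROM readings WHERE sensor = ?", ("sensor-a",)
).fetchone()[0]
```

In practice the extract would run on a schedule and target whatever database the application already uses; the point is only that the application reads from the copy rather than from the lake.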
The truth is that the focus of data lake architecture is data ingest and data governance.
There are many processes, all of them important, competing for the same resources.
Figure 10. The Data Lake Complete
[Figure labels. Data sources: Servers, Desktops, Mobile, Network Devices, Embedded Chips,
RFID, IoT, The Cloud, OSes, VMs, Log Files, Sys Mgt Apps, ESBs, Web Services, SaaS,
Business Apps, Office Apps, BI Apps, Workflow, Data Streams, Social... Data lake
processes: Ingest, Data Governance, Data Lake Mgt, Data Security, Metadata Mgt, Data
Cleansing, Transform & Aggregate, Life Cycle Mgt, Archive, Extracts (to Databases, Data
Marts, Other Apps). Consumers: Real-Time Apps, Search & Query, BI, Visual'n & Analytics,
Other Apps.]
There will always be a limit to the capacity of the data lake, and governance processes
naturally take priority, so it will prove necessary to replicate data to other data lakes or
data marts to properly serve some applications or users.
As for data archiving, data life-cycle management can be regarded as an aspect of data
governance, and is best thought of as a background process. The exact rules for whether
and when data must be deleted may be influenced by business imperatives (regulation),
but may also be determined by storage costs. Ideally, archiving will be an automatic
process.
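One way to picture such an automatic background rule is the minimal sketch below. It assumes a policy with two thresholds: a cost-driven one that archives data nobody has touched for a while, and a regulation-driven one that deletes data past a retention limit. The `LifecyclePolicy` class, the `lifecycle_action` function, and the threshold values are hypothetical and chosen purely for illustration; real rules would come from the organization's own governance requirements.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LifecyclePolicy:
    """Hypothetical policy; thresholds are illustrative assumptions."""
    archive_after_idle_days: int      # cost-driven: move cold data to cheaper storage
    delete_after_days: Optional[int]  # regulation-driven: None means never delete


def lifecycle_action(created: date, last_accessed: date, today: date,
                     policy: LifecyclePolicy) -> str:
    """Return 'delete', 'archive', or 'keep' for one dataset."""
    age_days = (today - created).days
    idle_days = (today - last_accessed).days
    # Deletion is checked first, since a retention limit overrides archiving.
    if policy.delete_after_days is not None and age_days >= policy.delete_after_days:
        return "delete"
    if idle_days >= policy.archive_after_idle_days:
        return "archive"
    return "keep"


policy = LifecyclePolicy(archive_after_idle_days=90, delete_after_days=7 * 365)
today = date(2024, 1, 1)

recent = lifecycle_action(date(2023, 6, 1), date(2023, 12, 1), today, policy)
cold = lifecycle_action(date(2023, 1, 1), date(2023, 6, 1), today, policy)
expired = lifecycle_action(date(2015, 1, 1), date(2023, 12, 31), today, policy)
```

A scheduled job could apply a function like this to every dataset in the catalog and hand the resulting actions to the storage layer, which keeps the process in the background as the text suggests.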