The Data Lake Survival Guide
<strong>Data</strong> <strong>Lake</strong> Products<br />
<strong>The</strong> alternative to hiring technical staff who understand the nuts and bolts of the<br />
Apache Hadoop Stack is to buy technology from one of the Hadoop integration vendors,<br />
or, as they are also called, “data lake” vendors: Cask, Unifi, Cambridge Semantics et al.<br />
Such vendors provide a data lake capability “out of the box.” Many companies have<br />
found this a more productive way to build a data lake than building it from scratch.<br />
Consider the position of a company that wishes to build a data lake that uses, say, 20<br />
components of the Apache Stack. <strong>The</strong> issues they will inevitably face are as follows:<br />
• Technical skills: New staff with appropriate technical skills may need to be<br />
hired.<br />
• Operational Management: <strong>The</strong> server cluster will need to be monitored and<br />
managed, altering configurations, provisioning new hardware and tuning for<br />
performance as needed. <strong>The</strong> software environment for this may need to be built.<br />
• Upgrades: Upgrade management of the Apache Stack is a potential headache.<br />
<strong>The</strong> Apache Stack upgrades are not going to be as smooth as, for example, a<br />
Windows Server upgrade - simply because the Apache Stack is not under the<br />
close control that a vendor like Microsoft or IBM provides.<br />
• Standards and Integration Issues: <strong>The</strong> question here is how to align the Apache<br />
Stack with common data center standards for security, data life cycle<br />
management, etc.<br />
In reality, what the data lake vendors do is provide an abstraction layer between the<br />
Apache Stack and the applications built on top of it. If this is done well, then the user<br />
could, for example, migrate from Cloudera’s stack to MapR’s stack or even move the<br />
data lake applications into the cloud without concern for whether the application will<br />
function as expected.<br />
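As a minimal sketch of the abstraction-layer idea described above (not any vendor's actual API), the application below depends only on a hypothetical `StorageBackend` interface, so the underlying stack can be swapped without touching application code. The names `StorageBackend`, `InMemoryBackend`, `migrate` and `count_records` are assumptions made for illustration, and the in-memory class merely stands in for a real Hadoop distribution or cloud store.<br />

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable


class StorageBackend(ABC):
    """Hypothetical interface the abstraction layer exposes to applications."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        ...

    @abstractmethod
    def read(self, path: str) -> bytes:
        ...


class InMemoryBackend(StorageBackend):
    """Illustrative stand-in for one vendor stack or cloud store (a dict)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._files: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read(self, path: str) -> bytes:
        return self._files[path]


def migrate(source: StorageBackend, target: StorageBackend,
            paths: Iterable[str]) -> None:
    """Copy the listed files from one backend to another."""
    for path in paths:
        target.write(path, source.read(path))


def count_records(backend: StorageBackend, path: str) -> int:
    """An 'application': it sees only the interface, never the vendor stack."""
    return len(backend.read(path).splitlines())


# Two interchangeable backends; the labels are purely illustrative.
on_prem = InMemoryBackend("on-premises cluster")
cloud = InMemoryBackend("cloud store")

on_prem.write("/sales/2017.csv", b"id,amount\n1,10\n2,20\n")
migrate(on_prem, cloud, ["/sales/2017.csv"])

# The same application code runs unchanged against either backend.
lines_before = count_records(on_prem, "/sales/2017.csv")
lines_after = count_records(cloud, "/sales/2017.csv")
```

Because `count_records` never refers to a concrete backend, moving the data changes nothing for the application, which is the property the vendors' abstraction layer is meant to provide.<br />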
<strong>The</strong> <strong>Data</strong> Layer OS<br />
As we have introduced the concept of a data layer OS, it is worth explaining here why<br />
we do not currently believe that the Apache Stack has earned that description. If you<br />
ignore all the infrastructure software that is there to make applications possible, then<br />
we can think in terms of there being three types of applications:<br />
• OLTP applications: <strong>The</strong>se are the applications that process the events and<br />
transactions of the business.<br />
• Office applications: This category embraces communications activity from<br />
email to multimedia collaboration and includes personal applications from the<br />
word processor to graphics software. (We would even include development<br />
software here.)<br />
• BI and Analytics applications: <strong>The</strong>se are the applications that analyze the<br />
business and provide feedback.<br />