03.04.2017 Views

The Data Lake Survival Guide



Data Lake Products<br />

The alternative to acquiring technical staff to understand the nuts and bolts of the<br />

Apache Hadoop Stack is to buy technology from one of the Hadoop integration vendors<br />

or, as they are also called, “data lake” vendors: Cask, Unifi, Cambridge Semantics et al.<br />

Such vendors provide a data lake capability “out of the box.” Many companies have<br />

found this a more productive way to build a data lake than building it from scratch.<br />

Consider the position of a company that wishes to build a data lake that uses, say, 20<br />

components of the Apache Stack. The issues it will inevitably face are as follows:<br />

• Technical skills: New staff with appropriate technical skills may need to be<br />

hired.<br />

• Operational Management: The server cluster will need to be monitored and<br />

managed, which means altering configurations, provisioning new hardware and tuning for<br />

performance as needed. The software environment for this may need to be built.<br />

• Upgrades: Upgrade management of the Apache Stack is a potential headache.<br />

Apache Stack upgrades are not going to be as smooth as, for example, a<br />

Windows Server upgrade, simply because the Apache Stack is not under the<br />

close control that a vendor like Microsoft or IBM provides.<br />

• Standards and Integration Issues: The question here is how to align the Apache<br />

Stack with common data center standards for security, data life<br />

cycle management, etc.<br />

In reality what the data lake vendors do is provide an abstraction layer between the<br />

Apache Stack and the applications built on top of it. If this is done well then the user<br />

could, for example, migrate from Cloudera’s stack to MapR’s stack or even move the<br />

data lake applications into the cloud without concern for whether the application will<br />

function as expected.<br />
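
The abstraction-layer idea described above can be illustrated with a minimal sketch. It does not represent any vendor's actual product or API: the names StorageBackend, InMemoryBackend, DataLake and count_orders are hypothetical, and an in-memory dictionary stands in for a real Hadoop distribution or cloud store. The point is only that the application depends on the layer, not on the stack beneath it.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical interface that hides the underlying stack."""

    @abstractmethod
    def write(self, path, data):
        """Store raw bytes under the given path."""

    @abstractmethod
    def read(self, path):
        """Return the raw bytes stored under the given path."""


class InMemoryBackend(StorageBackend):
    """Stand-in for one distribution or cloud store (assumption: a dict)."""

    def __init__(self, name):
        self.name = name
        self._files = {}  # maps path -> bytes

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]


class DataLake:
    """The abstraction layer: applications talk only to this class."""

    def __init__(self, backend):
        self._backend = backend

    def ingest(self, dataset, records):
        payload = "\n".join(records).encode("utf-8")
        self._backend.write(f"/lake/{dataset}", payload)

    def load(self, dataset):
        raw = self._backend.read(f"/lake/{dataset}")
        return raw.decode("utf-8").split("\n")


def count_orders(lake):
    """Example application; it never references a concrete backend."""
    return len(lake.load("orders"))


def run_on(backend):
    """Run the same application code against whichever backend is supplied."""
    lake = DataLake(backend)
    lake.ingest("orders", ["o1,10.0", "o2,25.5", "o3,7.25"])
    return count_orders(lake)
```

Calling run_on(InMemoryBackend("distribution-a")) and run_on(InMemoryBackend("distribution-b")) exercises identical application code against two interchangeable backends, which is the kind of migration the paragraph above has in mind.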

The Data Layer OS<br />

Since we have introduced the concept of a data layer OS, it is worth explaining here why<br />

we do not currently believe that the Apache Stack has earned that description. If we<br />

ignore all the infrastructure software that is there to make applications possible, then<br />

we can think in terms of there being three types of applications:<br />

• OLTP applications: These are the applications that process the events and<br />

transactions of the business.<br />

• Office applications: This category embraces communications activity from<br />

email to multimedia collaboration and includes personal applications from the<br />

word processor to graphics software. (We would even include development<br />

software here.)<br />

• BI and Analytics applications: These are the applications that analyze the<br />

business and provide feedback.<br />
