03.04.2017 Views

The Data Lake Survival Guide



The Emergence of Data Governance

Data is not what it used to be. Let’s take this on board. The data we knew and loved
was pretty much static, sitting in databases or files fairly close to the applications that
used it. It wasn’t until the blooming of BI in the late 1990s that the industry felt obliged
to let data flow around. That desire gave birth to the data warehouse, into which data
flowed and from which data trickled into data marts.

Things began to change. Every year that passed witnessed an increase in the need
for data movement. With the advent of CEP (complex event processing) technology to
process streams of stock and commodity prices for trading banks, we got the first hints
of real-time data processing, and as time marched forward, such activity grew.

It was obvious to web businesses like Yahoo and Amazon that we lived in a real-time
world. The web logs that drove their businesses were the digital footprints of users
visiting their web sites. A transaction-based world was giving way to an event-based
world.

Had we thought about it at the time, we might have realized that it wasn’t just web
sites that lived and died by events. Network devices and operating systems and
databases and middleware and applications were all happily recording their events in
log files that squandered disk space until they were eventually deleted. The computer
networks of the world were already event oriented. They were born that way, but the
applications we built were not.

The first software vendor to notice this was Splunk. It detected and mined a thick
vein of gold in the log files that litter corporate networks. The advent of Splunk
was a boon to IT departments, which often needed to consult collections of log files to
identify the causes of application errors. Security teams also appreciated the
technology, as it helped them to hunt network intruders and vagrant viruses. That we
were entering an event-based world was transparently obvious to users of Splunk, since
all the data they gathered was event data.

But it was still not obvious to others, and even now with the advent of streaming
analytics, it is still not obvious to everyone.

The Dawn

You would be hard put to find any references to the term “data governance” before
the year 2000. In fact, you won’t find many references to it prior to 2005. And let’s be
clear about this: it was not that businesses did not care about the governance of data in
earlier years. It’s just that pretty much all corporate data was internal data generated
by the business, and most of it stayed put where it was born, or if it moved anywhere,
it made its way into a data warehouse.

Under those circumstances, governing the data was not such a pressing need. But, as
we have described, data came to move much more often, and the volume of
data exploded.
