The Data Lake Survival Guide
The Emergence of Data Governance
Data is not what it used to be. Let’s take this on board. The data we knew and loved
was pretty much static, sitting in databases or files fairly close to the applications that
used it. It wasn’t until the blooming of BI in the late 1990s that the industry felt obliged
to let data flow around. That desire gave birth to the data warehouse, into which data
flowed and from which data trickled into data marts.

Things began to change. Every year that passed witnessed an increase in the need
for data movement. With the advent of CEP (complex event processing) technology to
process data streams of stock and commodity prices for trading banks, we got the first
hints of real-time data processing, and as time marched forward, such activity grew larger.

It was obvious to web businesses like Yahoo and Amazon that we lived in a real-time
world. The web logs that drove their businesses were the digital footprints of users
visiting their web sites. A transaction-based world was giving way to an event-based
world.

Had we thought about it at the time, we might have realized that it wasn’t just web
sites that lived and died by events. Network devices and operating systems and
databases and middleware and applications were all happily recording their events in
log files that squandered disk space until they were eventually deleted. The computer
networks of the world were already event oriented. They were born that way, but the
applications we built were not.

The first software vendor to notice this was Splunk. It detected and mined a thick
vein of gold in the log files that litter corporate networks. The advent of Splunk
was a boon to IT departments that often needed to consult collections of log files to
identify the causes of application errors. Security teams were also appreciative of the
technology, as it helped them to hunt network intruders and vagrant viruses. That we
were entering an event-based world was transparently obvious to users of Splunk, since
all the data they gathered was event data.

But it was still not obvious to others, and even now with the advent of streaming
analytics, it is still not obvious to everyone.

The Dawn
You would be hard put to find any references to the term “data governance” before
the year 2000. In fact, you won’t find many references to it prior to 2005. And let’s be
clear about this: it was not that businesses did not care about the governance of data in
earlier years. It’s just that pretty much all corporate data was internal data generated
by the business, and most of it stayed put where it was born or, if it moved anywhere,
made its way into a data warehouse.

Under those circumstances, governing the data was not such a pressing need. But, as
we have described, data needed to move much more often, and the volume of
data exploded.