The Data Lake Survival Guide
Parallelism Playing Havoc
By 2010 we at The Bloor Group had begun to observe a significant increase in the speed of server software as a consequence of parallelism. Those were the early days of the Hadoop project, which had been inspired by Google’s and Yahoo’s use of scaled-out server networks.
Let’s think about this. Google and Yahoo were the first two Web 2.0 businesses. They had been forced to find their own solutions to big data problems. Searching the web was a completely new application; no developers had ever built software to solve a problem like that. First you send out spiders: you run software that accesses every website, including new websites, gathering information about new web pages and changes to old pages. After you harvest the data, you compress it to the digital limit and add it to your big data heap. That’s the relatively easy part of the problem. The hard part is updating the indexes on the huge data heap and letting the world pick at it. It was the first of those two hard problems, updating the indexes, that gave MapReduce its start in life.
The idea of “mapping” and “reducing” was not new. It is a relatively old technique that emerged from functional programming, a 1970s programming paradigm. MapReduce, as Google built it, was a development framework that scaled over grids of servers and ran in parallel. It provided a solution to the indexing problem.
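The pattern is easy to see in miniature. The sketch below counts words across a few documents in plain Python: a map step that emits (word, 1) pairs for each document independently, a shuffle that groups the pairs by word, and a reduce that totals each group. It illustrates the map/shuffle/reduce idea only; it is not Hadoop’s API, and the tiny corpus is invented for the example.

```python
from collections import defaultdict
from functools import reduce

# Toy corpus standing in for crawled web pages (hypothetical data).
documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map phase: turn each document into (word, 1) pairs. Each document is
# processed independently, which is why this step parallelizes so well.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all emitted counts by key (the word).
def shuffle(mapped):
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return groups

# Reduce phase: collapse each word's list of counts into a total.
def reduce_phase(groups):
    return {word: reduce(lambda a, b: a + b, counts)
            for word, counts in groups.items()}

counts = reduce_phase(shuffle(map(map_phase, documents)))
print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, ...}
```

Because the map step touches each document in isolation, a framework can farm those calls out to thousands of servers and merge the results afterward; that independence is what made MapReduce such a natural fit for indexing the web.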
It was research work by Doug Cutting and Mike Cafarella, later sponsored by Yahoo, that spawned Hadoop, with its Hadoop Distributed File System (HDFS) and MapReduce framework. Quite likely the project would not have taken flight if Yahoo had not decided to hand it to Apache as an open source project with Doug Cutting in the pilot’s seat. Soon Hadoop distribution companies sprouted up: Cloudera in 2008, MapR in 2009 and Hortonworks in 2011.
Under the auspices of The Apache Software Foundation (ASF), Hadoop acquired a coterie of complementary software components: Avro and Chukwa in 2009, HBase and Hive in 2010, then Pig and ZooKeeper in 2011. Soon the ASF, a nonprofit corporation devoted to open source software, acquired a destiny. It provided a well-honed process for incubating and delivering open source projects. It wasn’t long before it was supervising over 100 such projects, about one fifth of them Hadoop-related.
The commercial early adopters of Hadoop began to trickle in around 2012. The trickle soon became a stream, the stream became a river, and the river flowed into a lake. Hadoop and its growing band of open source jigsaw pieces were truly inexpensive. The software was free until you needed support, and if you cared to assemble a few old servers, you could prototype applications and experiment with parallel computing for almost nothing.
Before Hadoop danced into the data center, no scale-out file system existed. The assumption had always been that if you had data that needed to scale out in a big way, to hundreds of terabytes or petabytes or beyond, you needed to put it in a database or a data warehouse. Hadoop changed all that.