
The Data Lake Survival Guide

Parallelism Playing Havoc

By 2010, we at The Bloor Group had begun to observe a significant increase in the speed of server software as a consequence of parallelism. Those were the early days of the Hadoop project, which had been inspired by Google's and Yahoo's use of scaled-out server networks.

Let's think about this. Google and Yahoo were the first two Web 2.0 businesses. They had been forced to find their own solutions to big data problems. Searching the web was a completely new application; no developers had ever built software to solve a problem like that. First you send out spiders: software that accesses every website, including new websites, gathering information about new web pages and changes to old pages. After you harvest the data, you compress it to the digital limit and add it to your big data heap. That's the relatively easy part of the problem. The hard part is updating the indexes on the huge data heap and letting the world pick at it. It was the first of these two problems that gave MapReduce its start in life.

The idea of "mapping" and "reducing" was not new. It is a relatively old technique that emerged from functional programming, a 1970s programming paradigm. MapReduce, as invented by Google, was a development framework that was scalable over grids of servers and ran in a parallel manner. It provided a solution to the indexing problem.
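
To make the pattern concrete, here is a minimal sketch of map/reduce as it appears in functional programming, using the classic word-count example in Python. This illustrates the general technique only, not Google's framework; the helper names (map_phase, shuffle, reduce_phase) are our own for illustration.

    from collections import defaultdict
    from functools import reduce

    # Map phase: turn each document into (word, 1) pairs.
    def map_phase(documents):
        return [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the intermediate pairs by key (the word).
    def shuffle(pairs):
        grouped = defaultdict(list)
        for word, count in pairs:
            grouped[word].append(count)
        return grouped

    # Reduce phase: fold each group of counts into a total.
    def reduce_phase(grouped):
        return {word: reduce(lambda a, b: a + b, counts)
                for word, counts in grouped.items()}

    documents = ["the lake of data", "the data lake"]
    print(reduce_phase(shuffle(map_phase(documents))))
    # {'the': 2, 'lake': 2, 'of': 1, 'data': 2}

What a framework such as Google's added was distribution: each phase runs in parallel across a grid of servers, with the shuffle moving intermediate pairs between machines. That parallelism is what let the approach scale to web-sized index builds.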

It was research by Doug Cutting and Mike Cafarella, later backed by Yahoo, that spawned Hadoop, with its Hadoop Distributed File System (HDFS) and MapReduce framework. Quite likely the project would not have taken flight if Yahoo had not decided to throw it to Apache as an open source project with Doug Cutting in the pilot's seat. Soon Hadoop distribution companies sprouted up: Cloudera in 2008, MapR in 2009 and Hortonworks in 2011.

Under the auspices of The Apache Software Foundation (ASF), Hadoop acquired a coterie of complementary software components: Avro and Chukwa in 2009, HBase and Hive in 2010, then Pig and ZooKeeper in 2011. Soon ASF, a nonprofit corporation devoted to open source software, acquired a destiny. It provided a well-honed process for incubating the development and assisting the delivery of open source products. It wasn't long before it was supervising over 100 such projects, about one fifth of which were Hadoop related.

The commercial early adopters of Hadoop began to trickle in around 2012. The trickle soon became a stream, the stream became a river, and the river flowed into a lake. Hadoop and its growing band of open source jigsaw pieces were truly inexpensive. The software was free until you needed support, and if you cared to assemble a few old servers, you could prototype applications and experiment with parallel computing for almost nothing.

Before Hadoop danced into the data center, no scale-out file system of its kind existed there. The assumption had always been that if you had data that needed to scale out in a big way, to hundreds of terabytes, petabytes or beyond, you needed to put it in a database or a data warehouse. Hadoop changed all that.
