The Data Lake Survival Guide
Parallelism Playing Havoc
By 2010 we at The Bloor Group had begun to observe a significant increase in the speed of server software as a consequence of parallelism. Those were the early days of the Hadoop project, which had been inspired by Google’s and Yahoo’s use of scaled-out server networks.
Let’s think about this. Google and Yahoo were the first two Web 2.0 businesses. They had been forced to find their own solutions to big data problems. Searching the web was a completely new application; no developers had ever built software to solve a problem like that. First you send out spiders: you run software that accesses every website, including new websites, gathering information about new web pages and changes to old pages. After you harvest the data, you compress it to the digital limit and add it to your big data heap. That’s the relatively easy part of the problem. The hard part is updating the indexes on the huge data heap and letting the world pick at it. It was the first of those two hard problems, updating the indexes, that gave MapReduce its start in life.
The idea of “mapping” and “reducing” was not new. It is a relatively old technique that emerged from functional programming, a 1970s programming paradigm. MapReduce, as Google built it, was a development framework that scaled over grids of servers and ran in parallel. It provided a solution to the indexing problem.
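The pattern is easy to see in miniature. The sketch below counts words across a few documents in plain Python: a map step that emits (word, 1) pairs for each document independently, a shuffle that groups the pairs by word, and a reduce that totals each group. It illustrates the map/shuffle/reduce idea only; it is not Hadoop’s API, and the tiny corpus is invented for the example.

```python
from collections import defaultdict
from functools import reduce

# Toy corpus standing in for crawled web pages (hypothetical data).
documents = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# Map phase: turn each document into (word, 1) pairs. Each document is
# processed independently, which is why this step parallelizes so well.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle phase: group all emitted counts by key (the word).
def shuffle(mapped):
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return groups

# Reduce phase: collapse each word's list of counts into a total.
def reduce_phase(groups):
    return {word: reduce(lambda a, b: a + b, counts)
            for word, counts in groups.items()}

counts = reduce_phase(shuffle(map(map_phase, documents)))
print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, ...}
```

Because the map step touches each document in isolation, a framework can farm those calls out to thousands of servers and merge the results afterward; that independence is what made MapReduce such a natural fit for indexing the web.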
It was research work by Doug Cutting and Mike Cafarella, later sponsored by Yahoo, that spawned Hadoop, with its Hadoop Distributed File System (HDFS) and MapReduce framework. Quite likely the project would not have taken flight if Yahoo had not decided to hand it to Apache as an open source project with Doug Cutting in the pilot’s seat. Soon Hadoop distribution companies sprouted up: Cloudera in 2008, MapR in 2009 and Hortonworks in 2011.
Under the auspices of The Apache Software Foundation (ASF), Hadoop acquired a coterie of complementary software components: Avro and Chukwa in 2009, HBase and Hive in 2010, then Pig and ZooKeeper in 2011. Soon the ASF, a nonprofit corporation devoted to open source software, acquired a destiny. It provided a well-honed process for incubating and delivering open source projects. It wasn’t long before it was supervising over 100 such projects, about one fifth of them Hadoop-related.
The commercial early adopters of Hadoop began to trickle in around 2012. The trickle soon became a stream, the stream became a river, and the river flowed into a lake. Hadoop and its growing band of open source jigsaw pieces were truly inexpensive. The software was free until you needed support, and if you cared to assemble a few old servers, you could prototype applications and experiment with parallel computing for almost nothing.
Before Hadoop danced into the data center, no scale-out file system existed. The assumption had always been that if you had data that needed to scale out in a big way, to hundreds of terabytes or petabytes or beyond, you needed to put it in a database or a data warehouse. Hadoop changed all that.