The Data Lake Survival Guide

This collaborative approach to software development established a trend. Facebook, with its exabyte volumes of data, was built on open source from the ground up. Imitating Yahoo, it also had gifts to give Apache: Hive, the quasi-SQL capability, and the graph processing system Giraph. Its most recent open source contribution was Presto, the distributed SQL query engine, for which Teradata now offers commercial support. LinkedIn's open sourcing of Kafka (a persistent distributed message queue) and Voldemort (a distributed key-value store) can be added to the list of open source gifts - as can Parquet (a column-store capability developed in a collaboration between Twitter and Cloudera).

There is a huge difference between a new Hadoop component, whose development only recently started, and one that drops magically from the sky into the ecosystem, fully tested and implemented over years by a highly scaled web business. The first kind of component may take a long time to mature, while the other is born fully formed, like Athena of Greek myth.

The addition of Kafka to the Hadoop ecosystem was particularly important. Kafka provides a publish-subscribe capability that enables data to be streamed in a managed fashion from one application to another, or from one Hadoop or Spark cluster to another. In mid-2015, Kafka was joined by another important communications capability that goes by the name of NiFi, short for Niagara Files. The software was developed by the NSA over a period of 8 years. It had the goal of automating the flow of data between systems at scale, while dealing with the issues of failures, bottlenecks, security and compliance, and allowing changes to the data flow. Taken together, Kafka and NiFi provide a complete data flow solution that scales as far as the eye can see, and beyond.
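To make the publish-subscribe idea concrete, here is a toy in-memory sketch in Python. It is not Kafka's API: the class `ToyLog`, its methods and the topic and subscriber names are all hypothetical, chosen only to illustrate two properties described above, namely that messages persist in an ordered log and that each subscriber reads the stream at its own pace.

```python
from collections import defaultdict


class ToyLog:
    """Hypothetical in-memory stand-in for a persistent, topic-based message log."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> ordered list of messages
        self._offsets = defaultdict(int)   # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic and return its position in the log."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, topic, subscriber, max_messages=10):
        """Return this subscriber's next unread messages and advance its offset."""
        start = self._offsets[(topic, subscriber)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(topic, subscriber)] = start + len(batch)
        return batch


log = ToyLog()
for event in ["click", "view", "purchase"]:
    log.publish("web-events", event)

# Two independent subscribers read the same topic without affecting each other.
first = log.poll("web-events", "analytics", max_messages=2)
rest = log.poll("web-events", "analytics")
archive = log.poll("web-events", "archiver")
```

Because messages are retained rather than deleted on delivery, the "archiver" subscriber still sees all three events even though "analytics" has already consumed them. A real deployment would add durability, partitioning and replication, none of which this sketch attempts.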
If you consider the whole Apache Hadoop stack, the collection of open source software that emerged in the wake of Hadoop, it obviously constitutes a new environment for building and deploying parallel software applications. It is remarkably inexpensive, both because of its open source nature and the commodity hardware on which it runs. We sometimes think of this stack as constituting a data layer OS, a distributed operating system for data. It doesn’t quite qualify, but it is close and it is moving in that direction.

The Next Gen Stack

The hallmark of the Apache Hadoop stack, in all its glory, is that it is built for parallel operation on commodity hardware. If you have experience of it, no doubt you will be able to report that some of it is high quality and some of it is less so. However, all of it is under continuous development and will likely improve. It is not tightly integrated in the manner that one might expect were it provided by a single software vendor like Microsoft or Oracle.

Also, which components to choose and why can be confusing. This is certainly the case when it comes to streaming applications. Spark can build streaming applications of a kind, but in reality it processes micro-batches quickly, which technically is not streaming. Another component, Apache Storm, has been built for true data streaming. And yet another component, Apache Flink, does streaming and can also do micro-batch processing. There is competition within the Apache stack for streaming applications and that may remain so for a while.

A further potential point of confusion is the existence of three major Hadoop distributors: Cloudera, MapR and Hortonworks. There are many similarities between these distributors in that they all pursue a similar business model that emphasizes support revenues and all provide downloadable free versions of their distributions. Hortonworks is the “pure play,” while MapR and Cloudera provide “premium” distributions to their paying customers. Technically, MapR is the most distinct, providing what it calls a converged platform that includes the MapR file system, which is read/write (rather than HDFS, which is append only), MapR Streams (a capability similar to Kafka but more sophisticated) and MapR-DB, a NoSQL database. Cloudera also has some unique components, including Cloudera Manager, Cloudera Navigator Optimizer, Cloudera Search, the Kudu storage layer, which allows read/write, and its Impala database.
In contrast, Hortonworks tries to remain true to the Apache Hadoop stack. There are other distributions, too. Cloud vendors, such as Amazon and Microsoft, provide their own Hadoop distributions. A Hadoop distribution contains many components (Cloudera’s currently has 20, for example) and others can be added. It is up to the customer to determine which components are required, and that is to some extent application dependent.

Those who have never experimented with Hadoop may assume that it’s relatively simple to install and get running. However, it ain’t necessarily so. Aside from anything else, there’s a need to either hire experienced programmers or train existing ones in the use of a tribe of components: MapReduce, Hive, Pig, HBase, Spark and others. You are taking on a new and complex software environment, much more so than if you were implementing a Windows or Linux server. And then you are going to build applications to run on it.
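
The distinction drawn earlier between true streaming and micro-batch processing can be sketched in a few lines of plain Python. This is an illustration only and uses none of the Spark, Storm or Flink APIs: the functions `per_record` and `micro_batch` are hypothetical, and batches are cut by record count for simplicity, whereas a micro-batch engine would typically cut them by time interval.

```python
def per_record(events, handle):
    """Streaming style: handle each event on its own as soon as it arrives."""
    return [handle([event]) for event in events]


def micro_batch(events, handle, batch_size):
    """Micro-batch style: buffer events, then handle each small batch in one go."""
    results, buffer = [], []
    for event in events:
        buffer.append(event)
        if len(buffer) == batch_size:
            results.append(handle(buffer))
            buffer = []
    if buffer:  # flush the final, partially filled batch
        results.append(handle(buffer))
    return results


readings = [3, 5, 2, 8, 1]
streamed = per_record(readings, sum)                 # one result per event
batched = micro_batch(readings, sum, batch_size=2)   # one result per batch
```

With per-record handling a result is available after every event, while with micro-batching a result appears only once a batch has filled, which is the latency trade-off behind the remark that micro-batching is technically not streaming.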