03.04.2017 Views

The Data Lake Survival Guide

2o2JwuQ

2o2JwuQ

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>The</strong> <strong>Data</strong> <strong>Lake</strong> <strong>Survival</strong> <strong>Guide</strong><br />

This collaborative approach to software development established a trend. Facebook,<br />

with its exabyte volumes of data was built on open source from the ground up. Imitating<br />

Yahoo, it also had gifts to give Apache: Hive, the quasi SQL capability and the graph<br />

processing system Giraph. Its most recent donation was Presto, the distributed SQL<br />

query engine, for which Teradata now offers commercial support.<br />

Linked In’s open sourcing of Kafka (a persistent distributed message queue) and<br />

Voldemort (a distributed key-value store) can be added to the list of Apache gifts - as<br />

can Parquet (a column-store capability developed in a collaboration between Twitter<br />

and Cloudera).<br />

<strong>The</strong>re is a huge difference between a new Hadoop component, whose development<br />

only recently started, and one that drops magically from the sky into the ecosystem,<br />

fully tested and implemented over years by a highly scaled web business. <strong>The</strong> first kind<br />

of component may take a long time to mature, while the other is born fully formed, like<br />

Hercules of Greek myth.<br />

<strong>The</strong> addition of Kafka to the Hadoop ecosystem was particularly important. Kafka<br />

provides a publish-subscribe capability that enables data to be streamed in a managed<br />

fashion from one application to another or from one Hadoop or Spark cluster to another.<br />

In mid-2015, Kafka was joined by another important communications capability that<br />

goes by the name of NiFi, short for Niagra Files. <strong>The</strong> software was developed by the<br />

NSA over a period of 8 years. It had the goal of automating the flow of data between<br />

systems at scale, while dealing with the issues of failures, bottlenecks, security and<br />

compliance, and allowing changes to the data flow. Taken together, Kafka and NiFi<br />

provide a complete data flow solution that scales as far as the eye can see, and beyond.<br />

If you consider the whole Apache Hadoop stack, the collection of open source software<br />

that emerged in the wake of Hadoop, it obviously constitutes a new environment for<br />

building and deploying parallel software applications. It is remarkably inexpensive,<br />

both because of its open source nature and the commodity hardware on which it runs.<br />

We sometimes think of this stack as constituting a data layer OS, a distributed operating<br />

system for data.<br />

It doesn’t quite quality, but it is close and it is moving in that direction.<br />

14

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!