The Data Lake Survival Guide
This collaborative approach to software development established a trend. Facebook,<br />
with its exabyte volumes of data, was built on open source from the ground up. Imitating<br />
Yahoo, it also had gifts to give Apache: Hive, the quasi-SQL capability, and the graph<br />
processing system Giraph. Its most recent donation was Presto, the distributed SQL<br />
query engine, for which Teradata now offers commercial support.<br />
LinkedIn’s open sourcing of Kafka (a persistent distributed message queue) and<br />
Voldemort (a distributed key-value store) can be added to the list of Apache gifts, as<br />
can Parquet (a column-store capability developed in a collaboration between Twitter<br />
and Cloudera).<br />
There is a huge difference between a new Hadoop component, whose development<br />
only recently started, and one that drops magically from the sky into the ecosystem,<br />
fully tested and implemented over years by a highly scaled web business. The first kind<br />
of component may take a long time to mature, while the other is born fully formed, like<br />
Athena of Greek myth.<br />
The addition of Kafka to the Hadoop ecosystem was particularly important. Kafka<br />
provides a publish-subscribe capability that enables data to be streamed in a managed<br />
fashion from one application to another or from one Hadoop or Spark cluster to another.<br />
In mid-2015, Kafka was joined by another important communications capability that<br />
goes by the name of NiFi, short for Niagara Files. The software was developed by the<br />
NSA over a period of 8 years. It had the goal of automating the flow of data between<br />
systems at scale, while dealing with the issues of failures, bottlenecks, security and<br />
compliance, and allowing changes to the data flow. Taken together, Kafka and NiFi<br />
provide a complete data flow solution that scales as far as the eye can see, and beyond.<br />
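The publish-subscribe idea described above can be illustrated with a small sketch. This is a toy, in-memory stand-in written for this guide's discussion, not Kafka's actual API: the class `TinyLog`, its methods `publish` and `poll`, and the topic and subscriber names are all hypothetical. It assumes only the behavior stated in the text, namely that messages are kept in a persistent, ordered queue and that each subscribing application or cluster reads them at its own pace.

```python
from collections import defaultdict


class TinyLog:
    """Toy in-memory publish-subscribe log (hypothetical; not Kafka's API)."""

    def __init__(self):
        # topic -> append-only list of messages (stands in for a persistent log)
        self._topics = defaultdict(list)
        # (topic, subscriber) -> index of the next message that subscriber will read
        self._offsets = defaultdict(int)

    def publish(self, topic, message):
        """Append a message to the topic and return its position in the log."""
        log = self._topics[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, topic, subscriber, max_messages=10):
        """Return the next unread messages for this subscriber and advance its offset."""
        start = self._offsets[(topic, subscriber)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(topic, subscriber)] = start + len(batch)
        return batch


log = TinyLog()
log.publish("clicks", "page=/home")
log.publish("clicks", "page=/pricing")

# Two independent subscribers (for example, two clusters) each see every message.
first_batch_a = log.poll("clicks", "cluster-a")
first_batch_b = log.poll("clicks", "cluster-b")

# A later message is delivered only once to a subscriber that has caught up.
log.publish("clicks", "page=/signup")
second_batch_a = log.poll("clicks", "cluster-a")
```

Because messages stay in the log and each subscriber tracks its own position, a slow or temporarily offline consumer can catch up later without the publisher knowing about it. That decoupling is what makes this pattern suitable for streaming data between applications or clusters in a managed fashion.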
If you consider the whole Apache Hadoop stack, the collection of open source software<br />
that emerged in the wake of Hadoop, it obviously constitutes a new environment for<br />
building and deploying parallel software applications. It is remarkably inexpensive,<br />
both because of its open source nature and the commodity hardware on which it runs.<br />
We sometimes think of this stack as constituting a data layer OS, a distributed operating<br />
system for data.<br />
It doesn’t quite qualify, but it is close and it is moving in that direction.<br />