The Data Lake Survival Guide
Data Lake Architecture
Let’s begin with the idea that the data lake is the system of record, and let’s be clear what we mean by this. The system of record is the system that records all the data used by the business. It also holds the golden copy of each data record.

For simplicity, the system of record should be thought of as a logical system. It may be possible to implement it on a single cluster of servers, but this is not a requirement and should not be a goal. In practice the whole configuration will probably involve multiple clusters, if for no other reason than to provide disaster recovery.

The system of record should also be the system where governance processes are applied to data. Clearly, data needs to be subject to governance wherever it is used within the organization. Some governance processes, particularly data security, need to be applied to data as soon as possible after its creation or capture. For that reason, the data lake will inevitably be the landing zone for external data brought into the business, so that governance processes can quickly and easily be applied to it. Data created within the organization should be passed to the data lake immediately after creation so that governance processes can be applied.
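As a minimal sketch of this idea, the snippet below shows a landing zone that applies one governance step, masking sensitive fields, the moment a record arrives. The field names, the hashing choice, and the `land_record` function are all illustrative assumptions, not a prescribed design.

```python
import hashlib
import json
from datetime import datetime, timezone

# Assumed names of sensitive fields; a real deployment would drive
# this from a governance catalog rather than a hard-coded set.
SENSITIVE_FIELDS = {"credit_card", "ssn"}

def mask(value: str) -> str:
    """Replace a sensitive value with a short one-way hash."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def land_record(record: dict, lake: list) -> dict:
    """Apply governance on ingest, then store the governed copy."""
    governed = dict(record)
    for field in SENSITIVE_FIELDS & governed.keys():
        governed[field] = mask(str(governed[field]))
    # stamp the record so the lake keeps an audit of when it landed
    governed["_landed_at"] = datetime.now(timezone.utc).isoformat()
    lake.append(governed)
    return governed

lake = []
rec = land_record({"user": "u42", "credit_card": "4111111111111111"}, lake)
print(json.dumps(rec, indent=2))
```

The point of the sketch is the ordering: the governed copy is the only copy that ever reaches the lake, so downstream consumers never see the raw sensitive value.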
It is best to think of the data lake as a store of event records. We can think of it in this way: events are atoms of data, and transactions (the traditional paradigm with its traditional data structures) are molecules of information. The analogy works reasonably well.
Consider, for example, a web site visit. A user lands on the site, clicks through a few pages, decides to buy something, enters credit card details, and clicks on the “confirm” button. In transactional terms we may think of this as a purchase: a molecule of data that’s quickly followed by a delivery transaction, another molecule of data to record.

But in reality it’s a stream of events, a series of atoms of data. Every user mouse action creates an event. The computer records these events and responds to each one immediately; it displays new web pages, expands text descriptions, or whatever. The purchase confirmation is just another event, distinguished only by the fact that it generates a cascade of other related events in the application or in other applications.
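The atoms-and-molecules analogy can be sketched in a few lines: the raw stream holds individual events, and the purchase “molecule” is something you derive from them. The event names and payload shapes here are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """One 'atom' of data: a single user or system action."""
    ts: int
    user: str
    kind: str            # e.g. "page_view", "add_to_cart", "confirm"
    payload: dict = field(default_factory=dict)

def purchases(events):
    """Derive transaction 'molecules' from the raw event stream:
    a purchase is just the cart contents at the moment of 'confirm'."""
    cart = []
    for e in sorted(events, key=lambda e: e.ts):
        if e.kind == "add_to_cart":
            cart.append(e.payload["item"])
        elif e.kind == "confirm":
            yield {"user": e.user, "items": list(cart), "ts": e.ts}
            cart = []

stream = [
    Event(1, "u1", "page_view", {"page": "/home"}),
    Event(2, "u1", "add_to_cart", {"item": "book"}),
    Event(3, "u1", "page_view", {"page": "/checkout"}),
    Event(4, "u1", "confirm"),
]
print(list(purchases(stream)))  # one transaction built from four events
```

Storing the four atoms rather than only the derived molecule is exactly what makes the lake useful later: the same stream can be re-read to answer questions (abandoned carts, click paths) that the purchase record alone cannot.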
When you look at it like that, business systems consist of applications generating events and sending event information to other applications. There is nothing particularly special about web applications in this respect; all applications are like that. They respond to events and send messages or data to other applications. That’s been the nature of computing for decades.
When you scan all the servers and network devices in a data center you find them<br />
awash with log files that store data about events of every kind: network logs, message<br />
logs, system logs, application logs, API logs, security logs and so on. Collectively the<br />
logs provide an extensive audit trail of the activity of the data center, organized by<br />
timestamp. It happens at the application level, at the data level, and lower down at the hardware level. Hidden within this disparate set of data are details of anomalies, error conditions, hacker attacks, business transactions, and so on.
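The timestamp-organized audit trail described above amounts to a k-way merge of sorted streams. A minimal sketch, with entirely made-up log contents, using Python’s standard `heapq.merge`:

```python
import heapq

# Illustrative log entries as (timestamp, source, message) tuples;
# each log is already sorted by timestamp, as log files usually are.
network_log  = [(100, "net", "connection opened"), (140, "net", "connection closed")]
app_log      = [(110, "app", "order 123 created"), (130, "app", "order 123 paid")]
security_log = [(120, "sec", "login from new device")]

# heapq.merge interleaves the sorted inputs into one sorted stream,
# giving a single time-ordered audit trail across all sources.
audit_trail = list(heapq.merge(network_log, app_log, security_log))
for ts, source, message in audit_trail:
    print(f"{ts:>4} [{source}] {message}")
```

Because the merge only ever holds one entry per source in memory, the same approach scales from three lists to the thousands of log files a real data center produces.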