The Data Lake Survival Guide

We represent this reality in Figure 7 by showing real-time applications running directly against the real-time data stream within the Data Bus and also accessing a disk-based data source. In practice this is likely to be achieved with a Lambda or Kappa architecture, which is far more involved than the diagram suggests. It is also far more likely, for latency reasons, that real-time apps will not be located anywhere near the data lake, but will instead sit as close as possible to the data stream(s) that feed them. Nevertheless, they will likely pass the data stream they process directly to the data lake. Since such applications run prior to any data governance, it may be necessary to pass data to these applications if anything important is discovered during the governance processes.

Governance Applications

We need to distinguish between the governance processing that occurs on ingest and the governance processing that may occur later. One of the rules of governance may itself specify which processes have to take place before data is made available to data lake users. It makes sense to do as much processing as possible while data is on the Data Bus (i.e. held in memory), since it can be accessed very quickly there. The ideal would be to do all governance processing on ingest, but this may be impossible. Some data cleansing and some metadata discovery require human intervention, and it may not be practical to do them on ingest. For some data, the chosen policy may be to implement schema-on-read, so metadata gathering will occur after ingest.

Figure 8. Ingest Governance Apps (Data Security, Data Transforms, Data Aggregation, Metadata Management, Data Cleansing)

There can be other competing dynamics. The desire may be to encrypt all data (or at least all data that is destined to be encrypted) on ingest. If data is to be stored in HDFS, this is doubly important, as it is a write-once file system: files cannot be modified in place once written.
However, data cleansing will require data to be unencrypted, so any cleansing done on ingest has to run before encryption. The data transform and aggregation activities shown in Figure 8 are not governance activities per se. From an efficiency perspective, it is better to perform the data transformations, aggregations and other calculations that are known to be required before data is written to disk. Of course, some will need to be done later, simply because not all the data they require is in the data stream.

Figure 7. Real-Time (diagram labels: Data Bus, Real-Time Apps, Data)
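
The ordering constraint described under Governance Applications (cleanse and transform while data is still plaintext, encrypt last, then write) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not from the guide: the `IngestPipeline` class, the record fields, the `governance_status` tag, and in particular `toy_encrypt`, which is an insecure XOR stand-in for whatever real cipher an organization would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


def cleanse(record: Record) -> Record:
    """Trim whitespace and lower-case the email; needs plaintext values."""
    cleaned = {key: value.strip() for key, value in record.items()}
    if "email" in cleaned:
        cleaned["email"] = cleaned["email"].lower()
    return cleaned


def transform(record: Record) -> Record:
    """A transform known to be required up front: amount in whole cents."""
    out = dict(record)
    out["amount_cents"] = str(round(float(record["amount"]) * 100))
    return out


def toy_encrypt(value: str, key: int = 42) -> str:
    """Stand-in for a real cipher. XOR is NOT secure; illustration only."""
    return "".join(chr(ord(ch) ^ key) for ch in value)


@dataclass
class IngestPipeline:
    """Hypothetical ingest stage that runs while data is still in memory."""

    sensitive_fields: List[str]
    encrypt: Callable[[str], str] = toy_encrypt

    def process(self, record: Record) -> Record:
        # 1. Cleanse and transform first, while values are still readable.
        record = transform(cleanse(record))
        # 2. Flag records that automated cleansing could not complete, so
        #    that governance needing human input can happen after ingest.
        incomplete = any(value == "" for value in record.values())
        record["governance_status"] = (
            "pending_review" if incomplete else "ingest_complete"
        )
        # 3. Encrypt last, immediately before the write to write-once storage.
        return {
            key: self.encrypt(value) if key in self.sensitive_fields else value
            for key, value in record.items()
        }


pipeline = IngestPipeline(sensitive_fields=["email"])
stored = pipeline.process(
    {"customer": " C-1001 ", "email": " Ada@Example.COM ", "amount": "12.50"}
)
```

Reversing steps 1 and 3 would leave `cleanse` operating on ciphertext, which is the conflict the text describes. Records tagged `pending_review` stand for the cleansing work that cannot practically be finished on ingest.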
Data Lake Management

Figure 9 shows the processes that run on the data lake. The other ongoing activity, aside from data governance, is data lake management. This is the system management activity that monitors and responds to hardware and operating system events and manages all the applications that dip their toes in the data lake. The data lake is a computer grid that needs to be managed like any other network of hardware. The appropriate system management activities can be many and varied, including: server performance, availability monitoring, automated recovery, software management, security management, access management, user monitoring, application monitoring, capacity management, provisioning, network monitoring and scheduling.

There is nothing new in respect of system management here. However, it is important to recognize that some of the software employed will be traditional system management software and some is likely to be data lake specific (Hadoop management software like Ambari, Pepperdata for cluster tuning, etc.).

Data Lake Applications

Figure 9. Data Lake Applications and Processes (diagram labels: Data Governance, Data Lake Management, Data Bus, Data Lake, Search & Query, BI, Visualization & Analytics, Other Apps)

Little needs to be said about data lake applications beyond the fact that businesses almost always employ data lakes in the same way they used data warehouses: for BI and analytics applications. It is important to note that data self-service is more practical with data lakes than it ever was with data warehouses. An effective data self-service capability requires an effective search and query capability. This in turn requires the existence of a data catalog, which should be a natural result of intelligent metadata management. It will also require a search capability (after the fashion of Google search) that can pick out anything in the data lake.
A really comprehensive search capability will necessitate some kind of indexing activity as part of data ingest. A query capability will need to work in conjunction with the data catalog. There might be support for multiple query languages (SQL, XQuery, SPARQL, etc.).
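
As a rough illustration of how indexing on ingest supports search alongside a data catalog, the following Python sketch keeps a toy catalog and an inverted index that are both updated whenever a dataset is ingested. The `LakeIndex` class, its method names, and the dataset ids and metadata fields are assumptions made for this example; a real data lake would rely on a dedicated catalog and search engine.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class LakeIndex:
    """Toy data catalog plus inverted index, both populated at ingest time."""

    def __init__(self) -> None:
        # dataset id -> descriptive metadata (the "catalog")
        self.catalog: Dict[str, Dict[str, str]] = {}
        # token -> ids of the datasets whose text contains that token
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def ingest(self, dataset_id: str, metadata: Dict[str, str], text: str) -> None:
        """Register metadata and index the text as part of ingest."""
        self.catalog[dataset_id] = metadata
        for token in self._tokens(text):
            self._postings[token].add(dataset_id)

    def search(self, query: str) -> List[str]:
        """Return the sorted ids of datasets containing every query term."""
        terms = self._tokens(query)
        if not terms:
            return []
        hits = set.intersection(
            *(self._postings.get(term, set()) for term in terms)
        )
        return sorted(hits)

    def describe(self, dataset_id: str) -> Dict[str, str]:
        """Look up catalog metadata so a query tool knows what it is reading."""
        return self.catalog[dataset_id]

    @staticmethod
    def _tokens(text: str) -> Set[str]:
        # Lower-case alphanumeric runs; real engines do far more than this.
        return set(re.findall(r"[a-z0-9]+", text.lower()))


index = LakeIndex()
index.ingest(
    "sales_2016",
    {"owner": "finance", "format": "csv"},
    "customer order totals by region",
)
index.ingest(
    "web_logs",
    {"owner": "marketing", "format": "json"},
    "customer click stream events",
)
matches = index.search("customer region")
details = [index.describe(dataset_id) for dataset_id in matches]
```

The search step finds candidate datasets by keyword, and the catalog lookup then supplies the metadata a query tool needs to read them, which mirrors the division of labour between search and query described above.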