The Data Lake Survival Guide
We will discuss these processes one by one:
1. Assigning data provenance and lineage. Accurate data analytics depends on knowing precisely the provenance and lineage of the data. Before the dawn of the "big data age" this was rarely a problem, as the data either originated within the business or came from a traditional and reputable external source. As the number of data sources increases, the difficulty of establishing provenance and lineage will increase with it.
We expect the ability of data to self-identify for the sake of provenance to increase, although it may be many years before a general standard is agreed. For the sake of lineage and provenance, each event record would need to record the time of creation, the geolocation of creation, the ID of the creating device, the ID of the process or app that created the data, the ownership of the data, the metadata, and the identity of the data set or grouping it belongs to. To this we can add the details of derivation, if the data was derived in some way, which would allow lineage to be deduced.
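As an illustration, the fields listed above could be captured in a record structure along the following lines. This is a hypothetical sketch; the field names and formats are assumptions, not an agreed standard:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProvenanceRecord:
    """Provenance/lineage metadata attached to a single event record."""
    created_at: str                       # time of creation (e.g. ISO 8601, UTC)
    geolocation: str                      # where the record was created
    device_id: str                        # ID of the creating device
    process_id: str                       # ID of the process/app that created the data
    owner: str                            # ownership of the data
    dataset_id: str                       # identity of the data set or grouping
    metadata: dict = field(default_factory=dict)
    derived_from: Optional[list] = None   # IDs of source records, if derived

# Example: a derived record whose lineage can be traced back to "raw-0041"
rec = ProvenanceRecord(
    created_at="2017-06-01T12:00:00Z",
    geolocation="51.5074,-0.1278",
    device_id="sensor-0042",
    process_id="ingest-app-v1",
    owner="acme-corp",
    dataset_id="telemetry/2017-06",
    derived_from=["raw-0041"],
)
```

Because the derivation details (`derived_from`) point at the source records, lineage can be deduced by walking that chain backwards.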
Where such precise details are absent on ingest to the data lake, it should at least be possible to know and record where the data came from and how. As very few data records self-identify as described, some compromise is inevitable until standards emerge. With the advent of the Internet of Things, the need for self-defining data will increase.
2. Data security. The goal of data security is to prevent data theft or illicit data usage. Encryption is one primary dimension of this and access control is the other. Let us consider encryption first.
Encryption needs to be planned and, ideally, applied as data enters the lake. In a world of data movement, the security rules that are applied need to be distributable to wherever encrypted data is used. Ideally, you will encrypt data as soon as possible and decrypt it at the last moment, when it needs to be seen "in the clear."
The reason for this approach is twofold. First, it makes obvious sense to minimize the time that data is in the clear. Second, encryption and decryption make heavy use of CPU resources, so minimizing such activity reduces cost.
To implement this approach, format-preserving encryption (FPE) is necessary. The point about FPE is that it does not change the format of the data: a ciphertext has the same length and character set as the plaintext, so it can pass through format-sensitive systems; it simply disguises the data values. There are FPE standards, and vendors that specialize in their application.
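To make the format-preserving idea concrete, here is a toy sketch: a keyed Feistel network over digit strings, so that a 16-digit number encrypts to another 16-digit number. This is purely illustrative — it is not one of the standardized FPE schemes (such as NIST's FF1/FF3) and must not be used for real security:

```python
import hmac, hashlib

def _round(key: bytes, rnd: int, value: str, width: int) -> int:
    # Keyed round function: HMAC of (round number, half-block), reduced mod 10^width
    digest = hmac.new(key, f"{rnd}:{value}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (10 ** width)

def fpe_encrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Encrypt an even-length digit string to another digit string of the same length."""
    assert digits.isdigit() and len(digits) % 2 == 0
    half = len(digits) // 2
    left, right = digits[:half], digits[half:]
    for rnd in range(rounds):
        left, right = right, f"{(int(left) + _round(key, rnd, right, half)) % 10**half:0{half}d}"
    return left + right

def fpe_decrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Invert fpe_encrypt by running the Feistel rounds in reverse."""
    half = len(digits) // 2
    left, right = digits[:half], digits[half:]
    for rnd in reversed(range(rounds)):
        left, right = f"{(int(right) - _round(key, rnd, left, half)) % 10**half:0{half}d}", left
    return left + right

key = b"example-key"   # assumption: any secret key bytes
ct = fpe_encrypt(key, "4000123456789010")
print(len(ct), ct.isdigit())                        # 16 True: same format as the plaintext
print(fpe_decrypt(key, ct) == "4000123456789010")   # True
```

Because the ciphertext is still a 16-digit string, it can flow through schemas, sort keys, and validation rules that expect that format, which is exactly why FPE suits encrypt-early/decrypt-late pipelines.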
Data access controls require the existence of a reasonably comprehensive identity management system, with access rights associated with all data. Access rights may distinguish between the right to view the data and the right to process it using a particular program or process.
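The distinction between viewing rights and processing rights can be sketched as a simple access-control check. The identities, data-set names, and right names below are illustrative assumptions, not part of any particular product:

```python
# Access rights per (user, data set): "view" allows seeing values in the clear;
# "process:<program>" allows running that named program against the data.
ACL = {
    ("alice", "customer-data"): {"view", "process:reporting-app"},
    ("bob",   "customer-data"): {"process:reporting-app"},  # may process, never view
}

def can_view(user: str, dataset: str) -> bool:
    return "view" in ACL.get((user, dataset), set())

def can_process(user: str, dataset: str, program: str) -> bool:
    return f"process:{program}" in ACL.get((user, dataset), set())

print(can_view("alice", "customer-data"))                    # True
print(can_view("bob", "customer-data"))                      # False
print(can_process("bob", "customer-data", "reporting-app"))  # True
```

The "process but not view" pattern is useful in a data lake: a job can aggregate or score encrypted records on a user's behalf without that user ever being entitled to see individual values in the clear.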
3. Data compliance. Data compliance regulations are now common, and likely to become increasingly complicated with the passage of time. The EU is hoping to establish a General Data Protection Regulation (GDPR) for personal data that is implemented across the world, and has devoted considerable effort to formulating rules for it. GDPR