The Data Lake Survival Guide
We will discuss these processes one by one:
1. Assigning data provenance and lineage. Accurate data analytics depends on knowing precisely the provenance and lineage of the data. Before the dawn of the "big data age" this was rarely a problem, as the data either originated within the business or came from a traditional and reputable external source. As the number of data sources increases, the difficulty of establishing provenance and lineage will increase with it.
We expect the ability of data to self-identify for the sake of provenance to increase, although it may be many years before a general standard is agreed. For the sake of lineage and provenance, each event record would need to record the time of creation, the geolocation of creation, the ID of the creating device, the ID of the process or app that created the data, the ownership of the data, the metadata, and the identity of the data set or grouping it belongs to. To this we can add the details of derivation, if the data was derived in some way, which would allow lineage to be deduced.
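As an illustration, the fields listed above could be captured in a record structure along the following lines. This is a hypothetical sketch; the field names and formats are assumptions, not an agreed standard:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ProvenanceRecord:
    """Provenance/lineage metadata attached to a single event record."""
    created_at: str                       # time of creation (e.g. ISO 8601, UTC)
    geolocation: str                      # where the record was created
    device_id: str                        # ID of the creating device
    process_id: str                       # ID of the process/app that created the data
    owner: str                            # ownership of the data
    dataset_id: str                       # identity of the data set or grouping
    metadata: dict = field(default_factory=dict)
    derived_from: Optional[list] = None   # IDs of source records, if derived

# Example: a derived record whose lineage can be traced back to "raw-0041"
rec = ProvenanceRecord(
    created_at="2017-06-01T12:00:00Z",
    geolocation="51.5074,-0.1278",
    device_id="sensor-0042",
    process_id="ingest-app-v1",
    owner="acme-corp",
    dataset_id="telemetry/2017-06",
    derived_from=["raw-0041"],
)
```

Because the derivation details (`derived_from`) point at the source records, lineage can be deduced by walking that chain backwards.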
Where such precise details are absent on ingest to the data lake, it should at least be possible to know and record where the data came from and how. As very few data records self-identify as described, some compromise is inevitable until standards emerge. With the advent of the Internet of Things, the need for self-defining data will increase.
2. Data security. The goal of data security is to prevent data theft or illicit data usage. Encryption is one primary dimension of this and access control is the other. Let us consider encryption first.
Encryption needs to be planned and, ideally, applied as data enters the lake. In a world of data movement, the security rules that are applied need to be distributable to wherever encrypted data is used. Ideally, you will encrypt data as soon as possible and decrypt it at the last moment, when it needs to be seen "in the clear."
The reason for this approach is twofold. First, it makes obvious sense to minimize the time that data is in the clear. Second, encryption and decryption make heavy use of CPU resources, so minimizing such activity reduces cost.
To implement this approach, format-preserving encryption (FPE) is necessary. The point about FPE is that it does not change the format of the data: a ciphertext has the same length and character set as the plaintext, so it can pass through format-sensitive systems; it simply disguises the data values. There are FPE standards, and vendors that specialize in their application.
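To make the format-preserving idea concrete, here is a toy sketch: a keyed Feistel network over digit strings, so that a 16-digit number encrypts to another 16-digit number. This is purely illustrative — it is not one of the standardized FPE schemes (such as NIST's FF1/FF3) and must not be used for real security:

```python
import hmac, hashlib

def _round(key: bytes, rnd: int, value: str, width: int) -> int:
    # Keyed round function: HMAC of (round number, half-block), reduced mod 10^width
    digest = hmac.new(key, f"{rnd}:{value}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (10 ** width)

def fpe_encrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Encrypt an even-length digit string to another digit string of the same length."""
    assert digits.isdigit() and len(digits) % 2 == 0
    half = len(digits) // 2
    left, right = digits[:half], digits[half:]
    for rnd in range(rounds):
        left, right = right, f"{(int(left) + _round(key, rnd, right, half)) % 10**half:0{half}d}"
    return left + right

def fpe_decrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Invert fpe_encrypt by running the Feistel rounds in reverse."""
    half = len(digits) // 2
    left, right = digits[:half], digits[half:]
    for rnd in reversed(range(rounds)):
        left, right = f"{(int(right) - _round(key, rnd, left, half)) % 10**half:0{half}d}", left
    return left + right

key = b"example-key"   # assumption: any secret key bytes
ct = fpe_encrypt(key, "4000123456789010")
print(len(ct), ct.isdigit())                        # 16 True: same format as the plaintext
print(fpe_decrypt(key, ct) == "4000123456789010")   # True
```

Because the ciphertext is still a 16-digit string, it can flow through schemas, sort keys, and validation rules that expect that format, which is exactly why FPE suits encrypt-early/decrypt-late pipelines.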
Data access controls require the existence of a reasonably comprehensive identity management system, with access rights associated with all data. Access rights may distinguish between the right to view the data and the right to process it using a particular program or process.
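The distinction between viewing rights and processing rights can be sketched as a simple access-control check. The identities, data-set names, and right names below are illustrative assumptions, not part of any particular product:

```python
# Access rights per (user, data set): "view" allows seeing values in the clear;
# "process:<program>" allows running that named program against the data.
ACL = {
    ("alice", "customer-data"): {"view", "process:reporting-app"},
    ("bob",   "customer-data"): {"process:reporting-app"},  # may process, never view
}

def can_view(user: str, dataset: str) -> bool:
    return "view" in ACL.get((user, dataset), set())

def can_process(user: str, dataset: str, program: str) -> bool:
    return f"process:{program}" in ACL.get((user, dataset), set())

print(can_view("alice", "customer-data"))                    # True
print(can_view("bob", "customer-data"))                      # False
print(can_process("bob", "customer-data", "reporting-app"))  # True
```

The "process but not view" pattern is useful in a data lake: a job can aggregate or score encrypted records on a user's behalf without that user ever being entitled to see individual values in the clear.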
3. Data compliance. Data compliance regulations are now common, and likely to become increasingly complicated with the passage of time. The EU is hoping to establish a General Data Protection Regulation (GDPR) for personal data that is implemented across the world, and has devoted considerable effort to formulating rules for it. GDPR