
The Data Lake Survival Guide


We will discuss these processes one by one:

1. Assigning data provenance and lineage. Accurate data analytics depends on knowing the provenance and lineage of the data precisely. Prior to the dawn of the “big data age” this was rarely a problem, as the data either originated within the business or came from a traditional and reputable external source. As the number of data sources increases, the difficulty of establishing provenance and lineage increases with it.

We expect the ability of data to self-identify for the sake of provenance to increase, although it may be many years before a general standard is agreed. For the sake of lineage and provenance, each event record would need to record the time of creation, the geolocation of creation, the ID of the creating device, the ID of the process/app which created the data, the ownership of the data, the metadata, and the identity of the data set or grouping it belongs to. To this we can add the details of derivation if the data was derived in some way, which would allow lineage to be deduced (a sketch of such a record is given at the end of this item).

Where such precise details are absent on ingest to the data lake, it should at least be possible to know and record where the data came from and how. As very few data records self-identify in the way described, some compromise is inevitable until standards emerge. With the advent of the Internet of Things, the need for self-describing data will only increase.
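By way of illustration, the sketch below shows what such a self-describing event record might look like. It is written in Python purely for concreteness; the field names are our own and reflect no agreed standard, they simply mirror the provenance and lineage attributes listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple

@dataclass
class EventRecord:
    """One event plus the provenance/lineage attributes discussed above (names are illustrative)."""
    payload: Dict[str, object]          # the event data itself
    created_at: datetime                # time of creation
    geolocation: Tuple[float, float]    # where the record was created (lat, lon)
    device_id: str                      # ID of the creating device
    process_id: str                     # ID of the process/app which created the data
    owner: str                          # ownership of the data
    dataset_id: str                     # identity of the data set or grouping it belongs to
    metadata: Dict[str, str] = field(default_factory=dict)
    derived_from: Optional[List[str]] = None   # IDs of source records, allowing lineage to be deduced

record = EventRecord(
    payload={"temperature_c": 21.4},
    created_at=datetime.now(timezone.utc),
    geolocation=(51.5072, -0.1276),
    device_id="sensor-0042",
    process_id="ingest-agent-1.3",
    owner="facilities-dept",
    dataset_id="building-telemetry",
)
```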

2. Data security. The goal of data security is to prevent data theft or illicit data usage. Encryption is one primary dimension of this; access control is the other. Let us consider encryption first.

Encryption needs to be planned and, ideally, applied as data enters the lake. In a world of data movement, the security rules that are applied need to be distributable to wherever encrypted data is used. Ideally, you will encrypt data as soon as possible and decrypt it only at the last moment, when it needs to be seen “in the clear.”

The reason for this approach is twofold. First, it makes obvious sense to minimize the time that data is in the clear. Second, encryption and decryption make heavy use of CPU resources, so minimizing such activity reduces cost.

To implement this approach, format-preserving encryption (FPE) is necessary. The point of FPE is that it does not change the format of the data, such as its length and character set; it simply disguises the data values. There are FPE standards, and vendors that specialize in their application.
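To show why format preservation matters, here is a toy sketch in Python (standard library only) of a keyed Feistel construction over a digit string: the ciphertext has exactly the same length and character set as the plaintext, so downstream systems that expect, say, a 16-digit field continue to work. This illustrates the idea only and is not a vetted cipher; a production system would use an implementation of a published FPE standard such as NIST FF1 or FF3-1.

```python
import hmac
import hashlib

def _prf(key: bytes, data: str, round_no: int, width: int) -> int:
    """Keyed pseudo-random function for one Feistel round."""
    digest = hmac.new(key, f"{round_no}:{data}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (10 ** width)

def fpe_encrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Toy format-preserving encryption over a decimal string (length >= 2).
    The ciphertext has the same length and uses the same character set (0-9)."""
    a, b = digits[: len(digits) // 2], digits[len(digits) // 2 :]
    for r in range(rounds):
        if r % 2 == 0:
            a = str((int(a) + _prf(key, b, r, len(a))) % 10 ** len(a)).zfill(len(a))
        else:
            b = str((int(b) + _prf(key, a, r, len(b))) % 10 ** len(b)).zfill(len(b))
    return a + b

def fpe_decrypt(key: bytes, digits: str, rounds: int = 8) -> str:
    """Inverse of fpe_encrypt: run the rounds backwards, subtracting instead of adding."""
    a, b = digits[: len(digits) // 2], digits[len(digits) // 2 :]
    for r in reversed(range(rounds)):
        if r % 2 == 0:
            a = str((int(a) - _prf(key, b, r, len(a))) % 10 ** len(a)).zfill(len(a))
        else:
            b = str((int(b) - _prf(key, a, r, len(b))) % 10 ** len(b)).zfill(len(b))
    return a + b

card = "4539148803436467"                      # looks like a card number
token = fpe_encrypt(b"secret-key", card)       # still 16 digits, so existing formats survive
assert fpe_decrypt(b"secret-key", token) == card
```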

Data access controls require the existence of a reasonably comprehensive identity management system, with access rights associated with all data. Access rights may distinguish between the right to view the data and the right to process it using a particular program or process.
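A minimal sketch of such access rights follows, again in Python and with invented identity and process names. The point is simply that the rights attached to a data set can separate the permission to view data in the clear from the permission to run a particular process against it; a real deployment would resolve these checks against the organisation's identity management system.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class AccessPolicy:
    """Hypothetical access rights attached to a data set."""
    viewers: Set[str] = field(default_factory=set)                  # identities allowed to view the data in the clear
    processors: Dict[str, Set[str]] = field(default_factory=dict)   # identity -> processes it may run on the data

    def may_view(self, identity: str) -> bool:
        return identity in self.viewers

    def may_process(self, identity: str, process: str) -> bool:
        return process in self.processors.get(identity, set())

policy = AccessPolicy(
    viewers={"analyst.alice"},
    processors={"svc.reporting": {"monthly-aggregation"}},
)

assert policy.may_view("analyst.alice")
assert not policy.may_view("svc.reporting")          # the service may process the data but not view it
assert policy.may_process("svc.reporting", "monthly-aggregation")
```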

3. Data compliance. Data compliance regulations are now common, and they are likely to become increasingly complicated with the passage of time. The EU is hoping to establish a General Data Protection Regulation (GDPR) for personal data that is applied across the world, and it has devoted considerable effort to formulating rules to that end. GDPR