03.04.2017 Views

The Data Lake Survival Guide



is rare because of communications error-checking procedures (in-flight corruption is<br />

most likely hacking). It can be corrupted by any software that rewrites the data. It can<br />

be corrupted by database error (DBA error). To deal with such possibilities, some form<br />

of checksum-based integrity checking can be applied.<br />
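
As an illustration of checksum-based integrity checking, here is a minimal sketch in Python. It assumes SHA-256 as the checksum and that the checksum is recorded when the data is first written; the function names and the sample data are hypothetical and are not taken from the guide.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded_checksum: str) -> bool:
    """Compare a freshly computed checksum with the one recorded at ingest."""
    return sha256_of(data) == recorded_checksum


# Hypothetical example: record the checksum when the data enters the lake.
original = b"customer_id,name\n42,Ada Lovelace\n"
recorded = sha256_of(original)

# Later, some software rewrites the data and changes one value.
corrupted = original.replace(b"42", b"43")

print(is_intact(original, recorded))   # True
print(is_intact(corrupted, recorded))  # False
```

A check of this kind detects that data has changed, but it cannot say how or why. Locating the cause still depends on other records, such as an audit trail.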

7. Disambiguation. Disambiguation is an aspect of data cleansing and data reliability.<br />

It refers to the removal of ambiguities in the data. It is a process that would be unlikely<br />

to be applied when data enters the data lake, since it can be a time-consuming process.<br />

There is a particular problem with data ambiguity when it comes to people. There is no<br />

bullet-proof global identity system, so identity theft is a fact of life.<br />

There is no standard for people’s names. Sometimes just the first name and surname are<br />

asked for. Sometimes a middle initial or a middle name is asked for. Sometimes the full<br />

name. The name is a poor identifier. People can change their names legally, and many women<br />

change their names on marriage. People sometimes disguise their names deliberately<br />

for fraudulent reasons, but they may also do so legitimately (in an effort to anonymize<br />

their data). Some attributes change (address, telephone number, etc.) and usually do so<br />

without the information being easily gathered. To complicate the picture, the structure<br />

of the customer entity changes over time. For example, social media identities (on<br />

Twitter, etc.) emerged only recently.<br />

The consequence of this is that, for many businesses, cleansing customer data also<br />

means disambiguating the data. There are few software tools that are effective at<br />

disambiguation. Novetta and IBM both have this capability, but they are the only two<br />

software providers we know of that do.<br />
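
To make the naming problem concrete, here is a deliberately simple Python sketch of matching two customer records. It is not how Novetta or IBM approach disambiguation. It assumes each record carries a hypothetical `name` and `birth_date` field, and it uses the standard library's `difflib` for fuzzy comparison with an arbitrary threshold of 0.85.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, drop single-letter initials, sort tokens."""
    cleaned = "".join(
        ch if ch.isalpha() or ch.isspace() else " " for ch in name.lower()
    )
    tokens = [t for t in cleaned.split() if len(t) > 1]
    return " ".join(sorted(tokens))


def name_similarity(a: str, b: str) -> float:
    """Similarity of two normalized names, between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_person(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Treat records as one person only if a second attribute also agrees.

    The name alone is a poor identifier, so this sketch also requires the
    (hypothetical) birth_date field to match exactly.
    """
    if rec_a["birth_date"] != rec_b["birth_date"]:
        return False
    return name_similarity(rec_a["name"], rec_b["name"]) >= threshold


a = {"name": "Smith, John A.", "birth_date": "1980-02-01"}
b = {"name": "John Smith", "birth_date": "1980-02-01"}
print(likely_same_person(a, b))
```

A sketch like this handles differences in word order and middle initials, but not name changes or deliberately disguised names. Handling those requires linking on other attributes, which is part of why disambiguation is time-consuming.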

8. Audit trail of data usage. A record of who used what data and when needs to be<br />

maintained both for security purposes and for usage analytics. Usage analytics play a<br />

part in query optimization as well as in data life-cycle management.<br />
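
A record of who used what data and when can be sketched as an append-only list of access events. The Python below is a minimal in-memory illustration. The class and field names are hypothetical, and a real audit trail would need durable, tamper-resistant storage.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AccessEvent:
    user: str              # who
    dataset: str           # what data
    accessed_at: datetime  # when


class AuditTrail:
    """Append-only, in-memory record of data accesses."""

    def __init__(self) -> None:
        self._events: List[AccessEvent] = []

    def record(self, user: str, dataset: str,
               accessed_at: Optional[datetime] = None) -> AccessEvent:
        """Append one access event, timestamped now unless a time is given."""
        event = AccessEvent(user, dataset,
                            accessed_at or datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def accesses_by(self, user: str) -> List[AccessEvent]:
        """Security view: everything a given user has touched."""
        return [e for e in self._events if e.user == user]

    def usage_counts(self) -> Counter:
        """Usage-analytics view: number of accesses per dataset."""
        return Counter(e.dataset for e in self._events)

    def last_access(self, dataset: str) -> Optional[datetime]:
        """Most recent access time for a dataset, or None if never accessed."""
        times = [e.accessed_at for e in self._events if e.dataset == dataset]
        return max(times) if times else None


trail = AuditTrail()
trail.record("alice", "sales_2016")
trail.record("bob", "sales_2016")
trail.record("alice", "web_logs")
print(trail.usage_counts().most_common(1))  # [('sales_2016', 2)]
```

The same events serve both purposes named above: per-user queries support security reviews, while per-dataset counts and last-access times feed usage analytics.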

9. Data life-cycle management. Few companies have formally implemented a general<br />

data life-cycle strategy. More likely, they have a strategy for some data, such as<br />

data covered by compliance regulations, but not for all data. Others may have no strategy<br />

at all.<br />

With analytical applications in particular, managing data life cycles is<br />

important, because much of the data used in data exploration may eventually be<br />

discarded as worthless. There is no point in retaining such data, beyond recording that<br />

it was once explored. As data lakes are also used for archiving, the use of a data lake<br />

creates an opportunity to implement or tighten up the procedures around data life-cycle<br />

management. Life-cycle management can be thought of as the strategy for moving data<br />

to lower-cost locations as its usage diminishes. Deletion is one possible destination.<br />
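
The idea of moving data to cheaper locations as its usage diminishes can be expressed as a small policy function. The Python sketch below is illustrative only. The tier names, the idle-time thresholds, and the retention-hold flag are all assumptions, not recommendations from the guide.

```python
from datetime import date

# Hypothetical policy: (maximum idle days, storage tier), cheapest tier last.
TIERS = [
    (30, "hot"),        # used within the last month
    (365, "warm"),      # used within the last year
    (7 * 365, "cold"),  # archive storage
]


def choose_tier(last_accessed: date, today: date,
                under_retention_hold: bool = False) -> str:
    """Pick a storage destination from the time since the data was last used."""
    idle_days = (today - last_accessed).days
    for max_idle_days, tier in TIERS:
        if idle_days <= max_idle_days:
            return tier
    # Past every threshold: deletion is a possible destination, unless a
    # compliance rule (modeled here as a simple flag) requires keeping it.
    return "cold" if under_retention_hold else "delete"


today = date(2017, 4, 3)
print(choose_tier(date(2017, 3, 20), today))  # hot
print(choose_tier(date(2016, 10, 1), today))  # warm
print(choose_tier(date(2009, 1, 1), today))   # delete
```

The last-accessed date that drives a policy like this would come from the audit trail of data usage, which is one reason usage analytics plays a part in life-cycle management.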

Metadata and Schema-on-read<br />

The only aspect of governance that we have not yet discussed is metadata management.<br />

The metadata situation is complex, and thus we are devoting more time to it than to any<br />

other aspect of data governance. In overview, the situation is simple: since metadata<br />

determines the meaning of data, the natural preference is for metadata to be as complete<br />
