The Data Lake Survival Guide
is rare because of communications error-checking procedures (in-flight corruption is
most likely the result of hacking). Data can be corrupted by any software that rewrites it.
It can also be corrupted by database error (DBA error). To guard against such possibilities,
some form of checksum-based integrity check can be applied.
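As an illustration of the checksum idea above, the following is a minimal sketch, assuming a digest is computed when a record is written and stored alongside it. The function names `sha256_of` and `verify`, and the sample record, are hypothetical and chosen only for this example; SHA-256 is one reasonable choice of checksum, not one the guide prescribes.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_digest: str) -> bool:
    """Return True if the data still matches the digest recorded earlier."""
    return sha256_of(data) == expected_digest


# Hypothetical record; the digest is captured when the data is first stored.
record = b"customer_id=42,name=Ada Lovelace"
stored_digest = sha256_of(record)

# Any later rewrite that alters the bytes produces a different digest.
tampered = record.replace(b"42", b"43")
print(verify(record, stored_digest))
print(verify(tampered, stored_digest))
```

A check of this kind detects that data has changed; it does not say who changed it or why, which is where the audit trail comes in.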
7. Disambiguation. Disambiguation is an aspect of data cleansing and data reliability.
It refers to the removal of ambiguities in the data. It is unlikely to be applied
when data enters the data lake, since it can be a time-consuming process.
There is a particular problem with data ambiguity when it comes to people. There is no
bullet-proof global identity system, so identity theft is a fact of life.
There is no standard for people's names. Sometimes just the first name and surname are
asked for, sometimes a middle initial or a middle name, and sometimes the full
name. The name is a poor identifier. People can change their names legally, and often
change them on marriage. People sometimes disguise their names deliberately
for fraudulent reasons, but they may also do so legitimately (in an effort to anonymize
their data). Some attributes change (address, telephone number, etc.), usually
without the new information being easy to gather. To complicate the picture, the structure
of the customer entity changes over time. For example, social media identities (on
Twitter, etc.) emerged only recently.
The consequence is that, for many businesses, cleansing customer data also
means disambiguating it. Few software tools are effective at
disambiguation. Novetta and IBM offer this capability, but they are the only two
software providers we know of that do.
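To make the difficulty concrete, here is a deliberately simple sketch of one disambiguation step: deciding whether two customer records probably refer to the same person. It is an assumption-laden toy, not a description of how any commercial tool works. The record fields (`name`, `email`), the helper names, and the 0.85 similarity threshold are all hypothetical, and the matching uses only the Python standard library.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Strip accents, lowercase, drop punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    letters_only = re.sub(r"[^a-z\s]", " ", no_accents.lower())
    return " ".join(letters_only.split())


def likely_same_person(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Guess whether two records describe the same person.

    Assumption: an identical email is treated as decisive; otherwise the
    normalized names are compared for similarity against the threshold.
    """
    if a.get("email") and a.get("email") == b.get("email"):
        return True
    score = SequenceMatcher(
        None, normalize_name(a["name"]), normalize_name(b["name"])
    ).ratio()
    return score >= threshold


# A changed surname is still matched through the shared email address.
before = {"name": "Jane Smith", "email": "js@example.com"}
after = {"name": "Jane Jones", "email": "js@example.com"}
print(likely_same_person(before, after))
```

Even this small example shows why the name alone is a poor identifier: the match above succeeds only because a second attribute happens to be stable, and a similarity threshold will always produce some false matches and some misses.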
8. Audit trail of data usage. A record of who used what data and when needs to be
maintained both for security purposes and for usage analytics. Usage analytics play a
part in query optimization as well as in data life-cycle management.
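The "who used what data and when" record described above can be sketched as a small in-memory log. This is only an illustration: the `AccessEvent` and `AuditTrail` names are hypothetical, and a real audit trail would be written to durable, tamper-resistant storage rather than a Python list.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class AccessEvent:
    user: str              # who used the data
    dataset: str           # what data was used
    accessed_at: datetime  # when it was used


class AuditTrail:
    """Append-only log of data accesses (held in memory for illustration)."""

    def __init__(self) -> None:
        self._events: List[AccessEvent] = []

    def record(self, user: str, dataset: str,
               accessed_at: Optional[datetime] = None) -> AccessEvent:
        """Log one access, defaulting the timestamp to the current UTC time."""
        event = AccessEvent(user, dataset, accessed_at or datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def accesses_by(self, user: str) -> List[AccessEvent]:
        """Security view: everything a given user has touched."""
        return [e for e in self._events if e.user == user]

    def usage_counts(self) -> Counter:
        """Usage-analytics view: how often each dataset has been accessed."""
        return Counter(e.dataset for e in self._events)


trail = AuditTrail()
trail.record("alice", "sales_2023")
trail.record("bob", "sales_2023")
trail.record("alice", "web_logs")
print(trail.usage_counts().most_common(1))
```

The same log serves both purposes named in the text: `accesses_by` answers security questions, while `usage_counts` supplies the access frequencies that query optimization and life-cycle decisions can draw on.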
9. Data life-cycle management. Few companies have formally implemented a general
data life-cycle strategy. More commonly, they have a strategy for some data, such as
data covered by compliance regulations, but not for all data. Others may have no strategy
at all.
With analytical applications in particular, managing data life cycles is
important, because much of the data used in data exploration may eventually be
discarded as worthless. There is no point in retaining such data, beyond recording that
it was once explored. As data lakes are also used for archiving, the use of a data lake
creates an opportunity to implement or tighten up the procedures around data life-cycle
management. Life-cycle management can be thought of as the strategy for moving data
to lowest-cost locations as its usage diminishes. Deletion is one possible destination.
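The idea of moving data to cheaper locations as usage falls off can be sketched as a simple tiering rule. The tier names and the idle-time thresholds below are assumptions made for this example, not recommendations from the guide; in practice they would be set by cost, compliance, and access patterns.

```python
from datetime import date

# Hypothetical policy: (maximum idle days, storage tier), cheapest tier last.
TIERS = [
    (30, "hot"),           # accessed within the last month
    (365, "warm"),         # accessed within the last year
    (365 * 7, "archive"),  # accessed within the last seven years
]


def storage_tier(last_accessed: date, today: date) -> str:
    """Choose a storage tier from how long the data has gone unused.

    Data idle for longer than every threshold becomes a candidate for deletion.
    """
    idle_days = (today - last_accessed).days
    for max_idle_days, tier in TIERS:
        if idle_days <= max_idle_days:
            return tier
    return "delete"


today = date(2024, 1, 1)
for last_used in (date(2023, 12, 20), date(2023, 6, 1), date(2010, 1, 1)):
    print(last_used, storage_tier(last_used, today))
```

The last-accessed date that drives the rule is exactly the kind of information an audit trail of data usage provides.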
Metadata and Schema-on-read<br />
The only aspect of governance that we have not yet discussed is metadata management.
The metadata situation is complex, and thus we are devoting more time to it than
to any other aspect of data governance. In overview, the situation is simple: since metadata
determines the meaning of data, the natural preference is for metadata to be as complete