The Data Lake Survival Guide

... is rare because of communications error-checking procedures (in-flight corruption is most likely the result of hacking). Data can be corrupted by any software that rewrites it. It can also be corrupted by database error (DBA error). To deal with such possibilities, some form of checksum integrity checking can be applied.

7. Disambiguation. Disambiguation is an aspect of data cleansing and data reliability. It refers to the removal of ambiguities in the data. Because it can be time-consuming, it is unlikely to be applied when data enters the data lake.

There is a particular problem with data ambiguity when it comes to people. There is no bullet-proof global identity system, so identity theft is a fact of life. There is no standard for people’s names. Sometimes just the first name and surname are asked for, sometimes a middle initial or a middle name, sometimes the full name. The name is a poor identifier. People can change their names legally, and women change their names by marriage. People sometimes disguise their names deliberately for fraudulent reasons, but they may also do so legitimately (in an effort to anonymize their data). Some attributes change (address, telephone number, etc.), usually without the new information being easy to gather. To complicate the picture, the structure of the customer entity changes over time. For example, social media identities (on Twitter, etc.) were born only recently.

The consequence is that, for many businesses, cleansing customer data also means disambiguating it. There are few software tools that are effective at disambiguation. Novetta has this capability and so does IBM, but they are the only two software providers we know of that do.

8. Audit trail of data usage. A record of who used what data and when needs to be maintained, both for security purposes and for usage analytics.
Usage analytics play a part in query optimization as well as in data life-cycle management.

9. Data life-cycle management. Few companies have formally implemented a general data life-cycle strategy. More likely they have a strategy for some data, such as data covered by compliance regulations, but not for all data. Others may have no strategy at all. With analytical applications in particular, managing data life cycles is important, because much of the data used in data exploration may eventually be discarded as worthless. There is no point in retaining such data, beyond recording that it was once explored. As data lakes are also used for archive, the use of a data lake creates an opportunity to implement or tighten up the procedures around data life-cycle management. Life-cycle management can be thought of as the strategy for moving data to least-cost locations as its usage diminishes. Deletion is a possible destination in this.

Metadata and Schema-on-read

The only aspect of governance that we have not yet discussed is metadata management. The metadata situation is complex, and thus we are devoting more time to it than to other aspects of data governance. In overview, the situation is simple: since metadata determines the meaning of data, the natural preference is for metadata to be as complete as possible as soon as possible, so that the possibility of data being misinterpreted is minimized.

However, one of the loudly proclaimed benefits of the data lake is that “there is no need to model data before ingesting it.” This contrasts significantly with the data warehouse situation, where a great deal of data modelling effort is required before data is allowed into the warehouse.

The alternative to data modelling is called schema-on-read. With schema-on-read, the process that reads the data for the first time determines the metadata. The work involved varies according to data source. Some data, such as CSV files and XML files, define the metadata, and thus it’s possible to know the metadata as the data is read. Other data may not be so convenient. However, some data formats can be recognized from the data, and some metadata may be deducible from data values. This is how products like Waterline can automatically determine metadata values. In other circumstances human input may be required to determine metadata. Once the metadata is determined it can be stored in a metadata repository.

With schema-on-read the end goal is not to establish an RDBMS-type data model which defines the relationships (foreign key relationships) between data sets. For many BI and analytics applications such a model is not required if the user can specify the metadata. The schema-on-read approach means that any Master Data Management (MDM) process that maintains a master data model of all corporate data will probably need to be adjusted.

The assumption of data modelling is that by applying various rules and a little common sense you can provide a data model (usually an ER model) that is suited to all possible uses of the data. The truth is that in almost all circumstances you cannot. It will be imperfect, at best.
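To make the schema-on-read step described above more concrete, here is a minimal Python sketch of a reader that takes column names from a CSV header, deduces a type for each column from the data values, and records the result in a metadata repository. It is an illustration only, under stated assumptions: the sample data, the function names `infer_type` and `schema_on_read`, the four type labels and the dictionary standing in for a metadata repository are all hypothetical, and it does not describe how Waterline or any other product works.

```python
import csv
import io
from datetime import datetime


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    # Candidate types are tried from most to least restrictive.
    candidates = [
        ("integer", int),
        ("float", float),
        ("date", lambda v: datetime.strptime(v, "%Y-%m-%d")),
    ]
    present = [v for v in values if v]  # ignore empty or missing cells
    if not present:
        return "unknown"
    for name, parse in candidates:
        try:
            for v in present:
                parse(v)
            return name  # every value parsed as this type
        except ValueError:
            continue
    return "string"


def schema_on_read(text):
    """Read CSV text and return (rows, inferred schema)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    schema = {
        column: infer_type([row[column] for row in rows])
        for column in reader.fieldnames
    }
    return rows, schema


# Hypothetical sample data; nothing here comes from a real data source.
SAMPLE = """customer_id,name,signup_date,balance
1,Ann Lee,2015-03-02,10.50
2,Bo Chan,2016-11-30,
"""

# Stand-in for a metadata repository: data set name -> schema.
metadata_repository = {}
rows, metadata_repository["customers"] = schema_on_read(SAMPLE)
print(metadata_repository["customers"])
```

The header row supplies the column names, which is the sense in which a CSV file defines its own metadata, while the types have to be deduced from the values. Where deduction fails the sketch falls back to "string", which is the point at which human input might be needed.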
The reality of the situation is this:

• The time taken to model the data, either in the beginning or when new data sources are added or if errors are found in the model, constitutes a definite and possibly large cost to the business in respect of time to value.

• The modeler has to try to anticipate all new data sets that may appear later, so that the model does not require significant rework when new data sets are added. This can only be guessed at and rework is sometimes required.

• Data (in a data lake) is intended to be a shared asset and will be shared by groups of people with varying roles and differing interests, all of whom hope to get insights from the data. To model such data means trying to allow for every constituency in advance. Possibly this will result in a “lowest common denominator” schema that is an imperfect fit for anyone. This problem gets worse with more data sources, higher data volumes and more users.

• With schema-on-read, you’re not glued to a predetermined structure so you can present the data in a schema that fits reasonably well to the task that requests the data.
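The last point can be illustrated with a short Python sketch in which the same raw records are presented under two different schemas, each chosen by the task that reads the data. Everything in it is an assumption made for illustration: the event records, the field names, the two schemas (`FINANCE_SCHEMA`, `WEB_SCHEMA`) and the helper `read_with_schema` are hypothetical and do not describe any particular product.

```python
import json

# Hypothetical raw events, stored in the lake as-is (one JSON document per line).
RAW_EVENTS = """\
{"user": "u1", "amount": "19.99", "ts": "2016-05-01T10:00:00", "page": "/cart"}
{"user": "u2", "amount": "5.00", "ts": "2016-05-01T10:05:00", "page": "/home"}
"""

# Each schema maps an output field to (source field, converter).
FINANCE_SCHEMA = {
    "customer": ("user", str),
    "revenue": ("amount", float),
}
WEB_SCHEMA = {
    "visitor": ("user", str),
    "url": ("page", str),
    "day": ("ts", lambda ts: ts[:10]),  # keep only the YYYY-MM-DD part
}


def read_with_schema(raw_text, schema):
    """Apply a schema at read time; the stored data is never rewritten."""
    for line in raw_text.splitlines():
        record = json.loads(line)
        yield {
            field: convert(record[source])
            for field, (source, convert) in schema.items()
        }


finance_view = list(read_with_schema(RAW_EVENTS, FINANCE_SCHEMA))
web_view = list(read_with_schema(RAW_EVENTS, WEB_SCHEMA))
print(finance_view)
print(web_view)
```

Neither group of users has to accept a "lowest common denominator" structure here: each schema names and types only the fields its task needs, and adding a third schema later requires no change to the stored data.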