The Data Lake Survival Guide
By employing schema-on-read you get value from the data as soon as possible. It does not impose any structure on the data because it does not change the structure that the data had when it was loaded. It can be particularly useful when dealing with semi-structured, poly-structured, and unstructured data. It was difficult, and often impossible, to model some of this data and ingest it into a data warehouse, but there is no problem getting it into the data lake.

In general, schema-on-read allows for all types of data and encourages a less rigid organization of data. It is easier to create two different views of the same data with schema-on-read. Also, it does not prevent data modelling; it just makes it necessary to defer it.
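The idea can be made concrete with a small sketch. Assume some semi-structured JSON records have landed in the lake exactly as produced (the records, field names, and `read_with_schema` helper below are illustrative, not from any particular product): structure is only projected onto them at read time, which is also what makes two different views of the same raw data cheap to define.

```python
import json

# Raw, semi-structured records as they might land in a data lake --
# no schema was imposed at load time (illustrative sample data).
raw_records = [
    '{"id": 1, "name": "Acme", "country": "US", "revenue": 1200}',
    '{"id": 2, "name": "Globex", "country": "DE"}',
    '{"id": 3, "name": "Initech", "revenue": 900, "notes": "prospect"}',
]

def read_with_schema(lines, fields, defaults=None):
    """Apply a schema at read time: project each record onto the
    requested fields, filling gaps with defaults."""
    defaults = defaults or {}
    for line in lines:
        rec = json.loads(line)
        yield {f: rec.get(f, defaults.get(f)) for f in fields}

# Two different "views" of the same raw data, both defined at read time.
sales_view = list(read_with_schema(raw_records, ["id", "name", "revenue"],
                                   {"revenue": 0}))
geo_view = list(read_with_schema(raw_records, ["id", "country"],
                                 {"country": "unknown"}))

print(sales_view[1])  # {'id': 2, 'name': 'Globex', 'revenue': 0}
print(geo_view[2])    # {'id': 3, 'country': 'unknown'}
```

Note that the deferred modelling is still modelling: each view is a schema, it is simply chosen by the reader rather than imposed by the loader.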
The Data Catalog and Master Data Management (MDM)
There needs to be a catalog for all the data in the lake. It is important, and will become increasingly important, for the data catalog to be as rich as possible, embodying as much “meaning” as possible. It may help here if we explain what we mean by “data catalog,” as the term is open to interpretation.
The data catalog for an operating system (such as Windows or Linux) is the file system. The metadata it stores for each file is incomplete, as its primary purpose is to enable programs to find and access files. The programs themselves will know the layout of the data and thus how to interpret the values in any given file. This arrangement is not ideal because the data cannot be shared with programs that do not understand the file structure. It is adequate only for data that never needs to be shared.
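A minimal sketch illustrates the limitation. The binary layout below (two little-endian 32-bit integers) is a hypothetical file format invented for this example: the file system can report the file's size and timestamps, but only a program that already knows the layout can interpret the bytes.

```python
import os
import struct
import tempfile

# Write a small binary file whose internal layout only the producing
# program knows: two little-endian 32-bit integers (hypothetical format).
path = os.path.join(tempfile.mkdtemp(), "readings.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<ii", 42, 7))

# The file system's "catalog" entry: size, timestamps, permissions --
# nothing about what the bytes mean.
info = os.stat(path)
print(info.st_size)  # 8 -- that is all the catalog can tell us

# Interpreting the contents requires out-of-band knowledge of the layout.
with open(path, "rb") as f:
    a, b = struct.unpack("<ii", f.read(8))
print(a, b)  # 42 7
```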
The data catalog for a database is the schema. The schema defines the structure of all the physical data held in the data sets or tables of the database and seeks to provide a logical meaning to each data item by attaching a label to it (Customer_ID, Amount, Date_of_Birth, etc.). The amount of meaning that can be derived from the data catalog depends upon the specific database product. For relational databases the catalog roughly reflects the meaning that is captured in an ER diagram, where the relationships between specific data entities are indicated. With NoSQL and document databases the situation is similar, although how the schema is used varies from product to product.
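The difference from a bare file is that the database engine itself can answer questions about its catalog. A small sketch using SQLite (chosen here purely for convenience; the table is illustrative) shows the schema being queried for each column's label and declared type:

```python
import sqlite3

# In-memory database with a small illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        Customer_ID   INTEGER PRIMARY KEY,
        Date_of_Birth TEXT,
        Amount        REAL
    )
""")

# The database's data catalog is its schema: the engine can report each
# data item's label and type, which a plain file system cannot.
columns = conn.execute("PRAGMA table_info(customer)").fetchall()
for cid, name, col_type, *_rest in columns:
    print(name, col_type)
# Customer_ID INTEGER
# Date_of_Birth TEXT
# Amount REAL
```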
With an RDF database (sometimes called a Semantic Database) the catalog can record a greater level of meaning. This is an area where Cambridge Semantics, with its Smart Data Lake solutions, currently excels. The simple fact is that semantic technology allows you to capture much more meaning for the data catalog than you can using ER modelling.
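The underlying idea of RDF can be sketched without any particular product: facts are stored as subject–predicate–object triples, so relationships are first-class data rather than being implied by table structure, and even the relationships themselves can be described. The triples and the `objects` helper below are an illustrative toy, not a real RDF vocabulary or store:

```python
# Facts as subject-predicate-object triples (illustrative toy data).
triples = {
    ("Acme", "is_a", "Customer"),
    ("Acme", "located_in", "Boston"),
    ("Boston", "is_a", "City"),
    # Even the predicate itself can be described -- meaning that is
    # awkward to capture in an ER model.
    ("located_in", "is_a", "SpatialRelation"),
}

def objects(subject, predicate):
    """Query the triple set: all objects for a subject/predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("Acme", "located_in"))  # {'Boston'}
print(objects("located_in", "is_a"))  # {'SpatialRelation'}
```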
A traditional data warehouse metadata catalog can be complex, but nothing like as complex as a data lake catalog, which may include many sources of unstructured data (document data, graph data) that may require ontologies (semantic structures) to accurately define the metadata.
The point of the data lake’s data catalog is to provide users (and programs) with a data self-service capability. On a simple level, if you compare the pre-data-lake world of data warehouses and data marts to the data lake world, two obvious facts emerge: