The Data Lake Survival Guide
The Bloor Group
The Data Lake Survival Guide
The What, Why and How of the Data Lake
Robin Bloor, Ph.D. & Rebecca Jozwiak
RESEARCH REPORT
"Perhaps the truth depends on a walk around the lake.”<br />
~ Wallace Stevens<br />
RESEARCH REPORT
The Genesis of the Data Lake
In times past, when thinking about digital data, it made sense to segregate transactional data - the data captured in business applications, stored in database tables and presented by BI tools - from all other data: emails, web pages, images, video and so on. Nowadays we tend to refer to such "other data" as unstructured data. Of course it is not unstructured in the true sense of the word, but it does not have a convenient structure for storage in a relational database. It usually lives in files or more specialized databases, and until recently it was rarely analyzed.

Nevertheless it was analyzable, and software for deriving value from such data has crossed the chasm. Combine this with the reality that there is even more value in joining some of this data with regularly structured data for additional analytics, and you have a strong motive for re-imagining the idea of a data warehouse.

It was that analytical imperative, more than anything else, which gave rise to the original concept of a data lake: a data store for both species of data and, additionally, for data harvested from multiple sources external to the business, some of which was inevitably unstructured.
Data Flow Architectures
The now-aging data warehouse architecture ruled the data empire for two decades or more, and it will probably continue to play a role in the data lake architecture that supersedes it - but only as a supporting actor. Its longevity stands as a testament to its effectiveness. It was the first-generation data flow architecture and, as is the case with the data lake, its claim to fame was in providing data to BI and analytics applications. Figure 1 below provides a simplified conceptual illustration of this architecture.
Data flows from OLTP databases via Extract, Transform and Load (ETL) software to the data warehouse. Queries and other apps access the data warehouse, and further ETL software passes data into data marts against which other (BI and analytics) applications run. The data layer for business applications thus comprises transactional databases, a data warehouse and data marts consisting of subsets of data drawn from the data warehouse.

Figure 1. Conceptual Data Warehouse Architecture (data layer: OLTP Apps and OLTP DBMS feed the Data Warehouse via ETL; further ETL feeds the Data Marts; Query Apps access the warehouse and Data Mart Apps access the marts)
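The flow in Figure 1 can be sketched in miniature. The records and transformation below are invented for illustration; a real ETL pipeline would read from an OLTP database and write to a warehouse, but the extract/transform/load shape is the same.

```python
# A toy Extract-Transform-Load pass, sketched in plain Python.
# The "orders" rows and the warehouse layout are invented for illustration.

oltp_orders = [  # extract: rows as they sit in the operational database
    {"id": 1, "customer": "ACME ", "amount": "100.50"},
    {"id": 2, "customer": "acme", "amount": "20.00"},
]

def transform(row):
    # Clean and conform: trim and case-fold names, cast amounts to numbers.
    return {
        "id": row["id"],
        "customer": row["customer"].strip().upper(),
        "amount": float(row["amount"]),
    }

warehouse = [transform(r) for r in oltp_orders]  # load into the warehouse table
```

A data mart would then be a further ETL pass over `warehouse`, selecting the subset a particular BI application needs.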
Real implementations are much more complex, usually involving a data staging area where data is placed prior to ingestion into the data warehouse. This may be necessary for operational reasons, such as the data warehouse needing to limit data ingest to particular times. Alternatively, data may need to be cleaned or restructured before ingest. In some instances, because data took too long to flow through to a data mart, yet another database - called an Operational Data Store (ODS) - would be created to provide a more timely service to BI dashboards.

The need for such awkward manoeuvres might have been eliminated in time by the increasing power of hardware. However, this approach was constrained by other factors, all of which we will identify and discuss in detail later.
The Value of Data
Businesses do not remain static. Their processes change and evolve, their business models change and the markets they serve are gradually reshaped. Precisely how this happens varies, but generally we can think of there being a simple feedback loop which governs the process. We illustrate this in Figure 2.

The feedback loop has three steps:

• Plan (business process design and implementation)
• Run operational business processes
• Review operational business processes
There may be many manual elements in this; it is rarely driven by IT, although IT normally contributes. BI and analytics have the clear role of providing information either to assist operational business processes or to assist planning and change management activities.

Figure 2. Change (the feedback loop: Planning & Change Management, Operational Activity, and Business Intelligence & Analytics)

Conceptually, there is nothing new at all about this view of company behavior and the role of BI and analytics. What has drawn attention to Big Data and the BI and analytics applications it supports is that the technology parameters have shifted dramatically, as we shall discuss later in this report. If the Big Data opportunity is pursued, the efficacy of this corporate feedback loop will be improved. Organizations will be more "data driven" than before and their success will be more dependent on making effective use of technology for this purpose. This is why some businesses are now chanting the mantra, "data driven, data driven, data driven."
The triumph in 1997 of IBM's Deep Blue computer in a chess match with world chess champion Garry Kasparov, the later victory in 2011 by IBM's Watson computer system against three Jeopardy champions, and the recent (2016) victory by Google's AlphaGo against the world Go champion have demonstrated beyond argument that computer intelligence can now outstrip the most intelligent humans in well-defined contexts.
No doubt analytics technology, too, is far better at taking some kinds of decision than even the most skilled humans. Whether computer skills will soon usurp human skills in most aspects of running a business is thus worth considering. To explore this possibility, we need to define, examine and discuss the data pyramid.
The Data Pyramid

The data pyramid illustrated in Figure 3 below was first conceived in the 1990s as part of a philosophical and technological study of artificial intelligence at Bloor Research. Variations of this simple model have appeared occasionally since then. The fundamental point it makes is that data has to go through refinement before it becomes useful to people.
Figure 3. The Data Pyramid (from base upward: Data - signals, measurements, recordings, events, transactions, calculations, aggregations; Information - linked data, structured data, visualization, glossaries, schemas, ontologies; Knowledge - rules, policies, guidelines, procedures. New data enters at the base and is refined upward.)
The data pyramid has four layers, which we define precisely as follows:
Data
We define data to mean records of events or transactions from some source. A data record indicates a particular state of something, or possibly a change between two states. Such a record is a "data point." For practical IT purposes it needs to record its time of birth and other data items that identify the data's origin. While a data item of this kind may play a role in an operational business system, collections of such data, and their analysis, are required to create useful "business intelligence."
Information

For data to become information it requires context. This is a matter of making connections. A single customer record, for example, lacks context, which may be
provided by orders, contact records and so on. If you look at a set of customer data, it may yield information that is not part of any single record. As soon as you have multiple records you can calculate categories (gender, age, location, etc.). From a "business intelligence" perspective you are creating information with such activities. When you link customer data to all the orders placed by a customer, you can generate useful information about that customer's buying patterns.
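The data-to-information step described above can be shown concretely. The records below are invented for illustration: no single order record says much, but linking a customer's orders together yields a buying pattern.

```python
# Turning data into information: link a customer's orders together and
# summarize a buying pattern. All records are invented for illustration.
from collections import defaultdict

orders = [
    {"customer": "C1", "product": "widget", "amount": 30.0},
    {"customer": "C1", "product": "widget", "amount": 45.0},
    {"customer": "C2", "product": "gadget", "amount": 10.0},
]

spend_per_customer = defaultdict(float)
for order in orders:
    spend_per_customer[order["customer"]] += order["amount"]

# A single order record tells us little; the linked set is information:
# customer C1 is a repeat widget buyer with total spend 75.0.
```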
We define information to be: collections of data linked together for human consumption.

In general, BI products are software products that present information, possibly as reports or visually on dashboards. Some BI products are interactive, enabling the user to slice and dice the information. BI tools such as spreadsheets or OLAP tools can be thought of as user workbenches for the further analysis of information. The databases and data warehouses that feed such tools store information in a semi-refined form.
Knowledge

We define knowledge to be information that has been refined to the point where it is actionable.

Consider the BI tools that simply present information for decision support. The user knows their own context and consumes the information to take an action, such as resolving an insurance claim or approving a loan. The user has the knowledge of what to do, and the BI tool assists by providing information.

In the case of the BI tools that enable data exploration, the user has some idea of what they need to know, explores the information to create that knowledge and then applies it to their business context. As such, the knowledge of how to explore the data lives in the user. Such a user can accurately be described as a knowledge worker.
Knowledge can also be stored in computer systems. This is where we encounter rules-based systems and all the technology that is normally classified as AI. But knowledge manifests in computer systems in many other ways. The people who run a business create business processes to carry out particular activities. These are normally improved over time on the basis of acquired experience (feedback). They may even be fully automated - converted into software and implemented without the need for any human intervention. This is implemented knowledge.

Indeed, all software, no matter what it does, can be classified as implemented knowledge. Nevertheless, within any business there will also be other knowledge: rules, procedures, guidelines and policies that are not automated and are implemented by staff.
Data analytics, or Data Science as it has now been named, is the activity of trying to discover new knowledge from data by applying mathematical techniques to reveal previously unknown patterns. It is science of a kind, in the sense that the data scientist may formulate and then test hypotheses, although there are some brute-force techniques that can discover patterns without the need to hypothesize. It is not the only way to discover new knowledge, but it can be a very powerful and rewarding route.
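The formulate-and-test loop can be made concrete with a minimal example. The data and the hypothesis (that ad spend and sales move together) are invented for illustration; a simple Pearson correlation is one of the mathematical techniques a data scientist might reach for first.

```python
# A minimal sketch of the "formulate and test a hypothesis" loop.
# Hypothesis: ad spend and sales move together. Data is invented.

ad_spend = [10.0, 20.0, 30.0, 40.0]
sales    = [12.0, 24.0, 31.0, 45.0]

def pearson(xs, ys):
    # Pearson correlation coefficient: covariance over product of std devs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson(ad_spend, sales)  # a value near 1.0 supports the hypothesis
```

In practice the interesting cases are the failures: a low correlation sends the data scientist back to reformulate the hypothesis.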
Understanding

We define understanding to be the synthesis of both knowledge and experience. Currently this can reside only in people, and it is probably the case that it will only ever reside in people. Although computers may be able to model most human intellectual processes, they only ever execute directives. Higher human cognitive activities such as taking initiative, contemplation, pondering and abstract thought are beyond the remit of the machine.
Understanding is served by knowledge and information and, as such, analytics and BI systems can considerably enhance the capabilities of users at every level within an organization, especially those with a deep understanding of the activities and dynamics of a business. However, every business, no matter how well it exploits the opportunities that such technology presents, also has to cater for change. Human understanding is what drives and responds to change.
BI And Analytics As Business Processes

The creation of BI services and the generation of business insight through analytics are themselves business processes - or at least they should be. They are also the natural data lake applications. Hence the user constituency that is likely to benefit most from building a data lake is the community that creates and uses such capabilities.

The marketing hype surrounding Big Data, and the potential importance of data scientists to the success of the business, have tended to obscure the nature of this business process.
It can be viewed from two perspectives:

• R&D
• Software Development
Data science teams investigate aspects of the business statistically to discover useful and actionable knowledge. Such activity should properly be regarded as research into business processes (R&D). When new insights are discovered, the subsequent exploitation of those insights is clearly a software development process.

We illustrate this in Figure 4 below. The analytical exploration of data to generate knowledge or enhance existing knowledge is very similar to software development, in the sense that when new knowledge is discovered it is likely to be used to enhance computer systems. Where it differs from software development is that it is an R&D activity, and software development rarely is.
Once new knowledge is discovered and implemented, we get the situation depicted on the right-hand side of Figure 4. Business intelligence systems may be enhanced to improve decision making. This might simply be a matter of upgrading passive decision support, such as dashboard information or sending a specific alert to the user in a specific context, or it might lead to the upgrade of interactive decision support capabilities such as OLAP software or data visualization software like Tableau.
Figure 4. Analytics and BI, Development and Implementation (left, Analytics Development: a Data Scientist analytically explores data sets to produce New Knowledge; right, Analytics Implementation: New Knowledge feeds Passive Decision Support, Interactive Decision Support and Automation, each drawing on its own data set to serve users)
Alternatively, the knowledge may be automatically included in an operational system, improving it in some way. The illustration does not try to elaborate on how an analytic discovery is made operational, since this varies according to context.
The Data Lake Dynamic

The fundamental assumption of the data warehouse architecture was that there needed to be a very powerful query engine (database) at the center of the data flow. It thus suggested a centralized architecture where, first of all, data flowed to the data warehouse. It was then used in place, or it was distributed from there for use elsewhere.

The fatal flaw of this architecture was that it did not scale out well. However, this limitation did not become apparent until a whole series of forces came into play. They were:
• The need to analyze unstructured data, both external and internal. The need for this continues to grow.

• External data sources began to multiply. Particularly prominent in this was social media data, but it was by no means the only source. Until recently, selling or renting data was a niche activity, but this has ceased to be the case. An expanding amount of valuable data is now bought and sold publicly.

• Traditionally, analytics applications lived in "walled gardens" served by their own data mart. However, a data lake could serve as an analytics sandbox - a useful idea given the incursion of external data that data analysts wished to explore.
• Hadoop (and later Spark), with their Open Source software ecosystems, gained traction, and those ecosystems began growing apace. Such software is very low cost to adopt. There were significant economies available just for off-loading tired data from data warehouses to a data lake.

• The parallelism of these Open Source environments made it possible to run some analytic applications much faster than had previously been possible.

• The Hadoop/Spark ecosystems were further strengthened. New metadata capture and data cleansing products emerged. Security products appeared that strengthened the initially poor security capabilities of these environments.

• The concept of a data lake (or data hub) quickly gained acceptance as an architectural idea for data management.

• Cloud vendors, particularly Amazon with EMR and Microsoft with Azure, quickly identified the opportunity and built easily deployed data lake environments.

• The release of Open Source Kafka, in combination with Spark's micro-batch capability, caused some companies to experiment with near real-time applications, significantly expanding the data lake's areas of application towards real-time analytics. This move will no doubt continue as other open source streaming capabilities, such as Flink, mature.

• Kafka can now be regarded as a foundational data lake component (or, if you use MapR, then MapR Streams fulfils the same role). It was donated to Open Source by LinkedIn in a fully proven form. As a fast publish-subscribe capability it enables replication and disaster recovery configurations, as well as making it possible to treat multiple physical data lakes as a single logical data lake.

• The traditional data warehouse was never an ingest point in itself. Data was prepared in various ways, in a "staging area," before going to the data warehouse. In contrast, a data lake is capable of being a staging area, not just for corporate data but for all data, including unstructured data.

• The data lake provided a means of acquiring data prior to modeling the data (for inclusion in a database or data warehouse). Since a good deal of data could be processed immediately without such modeling, considerable time could be saved. For applications where "time to market" matters, the data lake delivers.

• A whole series of data governance issues began to raise their heads. Data governance had never been an important issue - in fact, it hadn't even been well defined - until gathering large collections of data into data lakes became a possibility. Then a whole series of issues, from data security through data lineage to data archiving, stumbled into the spotlight. How data should be governed has now become a pressing question.
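Kafka's role as a foundational component rests on the publish-subscribe pattern. The sketch below is a toy in-memory broker, not Kafka's API, written only to illustrate why that pattern matters for a data lake: a single published event can fan out to several independent consumers, such as a replica lake and a streaming analytics job.

```python
# A toy in-memory publish-subscribe broker. This is NOT Kafka's API -
# just an illustration of the pattern that lets one event stream feed
# several data lake consumers (replication, analytics, archive) at once.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(event)

broker = Broker()
replica, analytics = [], []
broker.subscribe("clicks", replica.append)    # e.g. feed a second data lake
broker.subscribe("clicks", analytics.append)  # e.g. feed a streaming job
broker.publish("clicks", {"user": "u1", "page": "/home"})
```

Because the producer never addresses consumers directly, adding a disaster-recovery replica or a new analytics consumer is just another subscription; this decoupling is what makes multiple physical lakes behave as one logical lake.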
Hardware Disruption

To explore the emerging reality of the data lake, we now need to consider how computer hardware technology has been changing and may continue to change. Technology change can be dramatic and, in respect of the data lake, it has been dramatically disruptive.
Moore's Law, which has remained true for CPU power over a span of more than 40 years, regularly delivers a doubling of CPU power roughly every 18 months. At the hardware level nothing else kept pace with this. DRAM (memory) managed to do so up to the year 2000, but faded after that. Disk was far worse, lagging terribly, but it is now being superseded by SSDs (solid state drives), which recently climbed onto the Moore's Law curve in the sense that their speed is roughly doubling every 18 months.
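It is worth seeing how quickly an 18-month doubling compounds. The small calculation below shows roughly a hundredfold gain per decade, which is why the adjustments described next were so far-reaching.

```python
# Compounding of Moore's Law: capability relative to a starting point,
# assuming a doubling every 18 months (1.5 years).
def moore_factor(years, doubling_period_years=1.5):
    return 2 ** (years / doubling_period_years)

per_decade = moore_factor(10)   # roughly 100x over ten years
per_40_years = moore_factor(40) # on the order of 100 million-fold
```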
Moore's Law was disruptive, but we gradually adjusted to its disruptive impact as it gifted us its regular increases in speed. It transformed PCs into throw-away items with a life span of 4 or 5 years. PCs were superseded by laptops, which in turn are giving way to tablets. It had a similar impact on some PC software, such as email, which went to live on the Internet, soon to be joined by personal applications and graphics applications.
Moore's Law had a distinctly different impact on server software. The database technology that was born in the 1980s and 1990s evolved to keep pace, and so did the ERP systems. On the server side, extra power simply made large databases and applications faster. For a while, much of the additional server power was simply squandered, with CPUs often idling. This created a market for virtual machines that enabled more applications per server.
It wasn't until about 2010 that server software applications set off in a new direction. In 2004 it ceased to be physically possible to increase CPU clock speeds to garner extra power, and thus CPUs, still capable of further miniaturization, became multicore. This trend gradually forced software development to take advantage of parallel processing.
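The shift multicore forced on software can be sketched simply: work must be expressed as independent chunks that can be executed concurrently. The thread-pool example below is illustrative only; CPU-bound Python work would typically use processes rather than threads because of the interpreter's global lock, but the decomposition pattern is the same.

```python
# The shape of parallel-friendly software: split the work into independent
# chunks, run them concurrently, combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    return sum(chunk)

data = list(range(1_000))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(chunk_sum, chunks))  # 499500, same as sum(data)
```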
Multicore and its tangled web

The hardware landscape used to be simple. There was CPU, memory, disk and network technology. CPU kept getting faster, as did memory and disk, even if they didn't really keep pace. The evolution of hardware was thus reasonably predictable, but that stalled in 2004. The chip vendors (Intel, AMD, IBM, ARM, etc.) were forced to add more cores to each CPU to make their previous chips obsolete, which is how the game is played.
Such a major change in direction forced the software world to adjust. It took time. Operating systems needed to adjust, compilers needed to adjust, development software needed to adjust and, most of all, the software developers needed to adjust. It might have been relatively plain sailing if that was the whole story. But it wasn't. There were also GPU chips (Graphical Processing Units), FPGAs (Field Programmable Gate Arrays) and SoCs (Systems on a Chip).
We thought of CPUs and GPUs as different beasts of burden with different loads on their backs: general-purpose processing and graphical processing. But GPUs are not confined to graphics; they are equally suited to high-performance computing tasks (numeric workloads), including analytic processing.
This led to the development of GPGPUs (general-purpose GPUs), which are far faster than CPUs for such tasks because they have many more processor cores. A few companies (Kinetica, BlazingDB and MapD) have exploited such hardware to build database servers that sport ultra-fast databases, proving that a great deal of a database's workload can run on GPGPUs.
An FPGA (Field Programmable Gate Array) is, as the name suggests, a chip that you can add logic to after it has been manufactured - that's what "field programmable" means. In order of speed, GPUs are faster than CPUs, which are faster than FPGAs. The virtue of the FPGA is that you can configure it to run a particular program that is frequently used. Once configured for such a program, it runs like a race horse. FPGAs contain arrays of programmable logic blocks along with a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together" in different configurations. This, along with a little memory, makes it possible to purpose-build a chip for a given application.
We know of two hardware vendors that have demonstrated the virtues of combining different types of processing unit in custom-designed hardware. Velocidata, with its Enterprise Streaming Compute Appliance (ESCA), combines the power of CPU, GPU and FPGA to dramatically accelerate stream processing, particularly for ETL applications. Ryft, with its Ryft One, combines a CPU with an array of FPGAs to provide a very fast search and query capability on large volumes of data.
There is a convergence in progress between the CPU and GPU. It began with Intel in 2010, with its line of HD Graphics chips, which Intel describes as a CPU with integrated graphics. In 2011 AMD created what it called an APU (Accelerated Processing Unit), which involved the same marriage of CPU and GPU on a single chip. Such chips can be used in PCs and similar devices, eliminating the need for a separate GPU.
Following its acquisition of Altera (an FPGA and System-on-a-Chip company), Intel recently announced Xeon chips with a built-in FPGA, as well as Stratix FPGAs and SoCs. The market for the Xeon with FPGA has yet to become clear, although it is not hard to imagine companies adding analytics logic to the FPGA portion of the chip to create specialist servers, in the same way that Velocidata added ETL logic and Ryft added search logic to FPGAs.
As regards the SoC, the big market is cell phones and tablets, although one can easily imagine that SoCs will also play a role in the Internet of Things, perhaps a very major one.
Processors are evolving in a variety of ways. The x86 chip dominated the industry on the desktop and eventually on the server, to the greater glory of Intel, but it is outgunned by the ARM chip in respect of numbers (tablets, cell phones and other devices). The problem that these and other chip vendors face is that soon it will no longer be possible to miniaturize circuits any further - so Moore's Law will cease to deliver its regular bounty. Thus the CPU vendors are exploring innovative alternatives to keep the plates spinning.
The persistence of memory

RAM (DRAM and SRAM) is fast volatile memory. It is extremely fast compared to traditional spinning disk (or "spinning rust," as it is sometimes called). The figure varies according to circumstance, but a rule of thumb is 100,000 times faster for random access - less so for serial access. The best current (2017) figures for RAM speed are 61,000 MB/sec (read) and 48,000 MB/sec (write).
Currently there are three disruptive memory technologies making their way to market. At the time of writing, software and hardware vendors are experimenting with them. They differ from RAM in three respects: they are slightly slower, non-volatile and about half the price. Two of them, from Intel (called 3D XPoint) and IBM (called PCM, or phase change memory), are fairly new. The third technology, HP's memristor, was announced way back in 2008 but has been slow to develop. Nevertheless, HP is partnering with SanDisk to develop what it calls Storage-Class Memory (SCM), offering memory with distinctly similar capabilities to Intel's 3D XPoint and IBM's PCM.
From the hardware perspective it doesn't matter which of these technologies dominates, since their primary characteristics are fairly similar. However, what isn't certain is what a standard server will look like once these technologies kick in, and we won't know until it happens.

It is worth mentioning, en passant, that the volume of RAM relative to the CPU on commodity servers continues to increase, and with the advent of these new memory technologies that trend will likely persist.
SSD: RIP Spinning Disk?

Solid state disk (SSD) is replacing spinning disk almost everywhere. Spinning disk is still cheaper (by a factor of 5), but SSD is faster in most contexts (and can be a great deal faster). SSD used to be limited in volume, but last year Seagate surprised the world with a 60 Tbyte SSD and erased that complaint.
SSDs can be accessed in a parallel manner, which accelerates read speeds considerably. To achieve this, SSDs "stripe" data across arrays so that when a read operation spans data across multiple arrays, the on-disk controller issues parallel reads to fetch the data. The latency of each fetch will be constant. To leverage this, you need to be careful about how and where you write the data. Aerospike, the high-performance database, uses SSDs in this way.
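The striping idea can be modeled in a few lines. This is a toy layout model, not a real controller: it only shows how round-robin placement lets a large read be split into per-channel fetches that could be issued in parallel.

```python
# A toy model of striping: data is split round-robin across "channels"
# so a large read can be served by all channels at once. Real SSD
# controllers do this in hardware; this only illustrates the layout.
STRIPE = 4    # bytes per stripe unit
CHANNELS = 4  # independent flash channels

def write_striped(data):
    channels = [bytearray() for _ in range(CHANNELS)]
    for i in range(0, len(data), STRIPE):
        channels[(i // STRIPE) % CHANNELS] += data[i:i + STRIPE]
    return channels

def read_striped(channels, length):
    # Each channel could be read in parallel; here we just reassemble
    # the stripes in order to recover the original bytes.
    out = bytearray()
    offsets = [0] * CHANNELS
    i = 0
    while len(out) < length:
        ch = i % CHANNELS
        out += channels[ch][offsets[ch]:offsets[ch] + STRIPE]
        offsets[ch] += STRIPE
        i += 1
    return bytes(out[:length])

payload = bytes(range(32))
assert read_striped(write_striped(payload), len(payload)) == payload
```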
SSDs are not much better than spinning disk in write-heavy applications, because they need to re-write entire blocks at a time (a read, followed by an erase, followed by a write). The more sophisticated drives organize write activity to minimize this.
The Typical Server

Not so long ago, a typical server was (primarily) a CPU with memory and disk storage. Consider, then, how much more complex it could become for the software engineers who need to make it dance:
• <strong>The</strong> CPU is multicore with up to 12 cores.<br />
• Alternatively it may have fewer cores but be integrated with GPU or FPGA<br />
capability.<br />
• <strong>The</strong> CPU has three layers of cache (level 1, 2 and 3) which can be exploited,<br />
e.g. for vector processing or data compression/decompression.<br />
• Configurable memory (DRAM) has grown to the terabyte level.<br />
• A new fast access memory capability is emerging that is significantly faster than<br />
current SSDs and a little slower than memory.<br />
• SSDs are swiftly replacing spinning disk, but software engineers need to know<br />
how to best use them.<br />
It may be a while before we know what a “standard server” is going to become.<br />
Nevertheless it cannot be in the industry’s commercial interest for there to be too much<br />
variance. A consensus will emerge; what it will be is difficult to predict.<br />
<strong>The</strong> Changing of <strong>The</strong> Guard<br />
A further disturbing variable in this picture is the growing impact of the cloud.<br />
Amazon currently dominates cloud computing with 31 percent market share, with<br />
Microsoft playing second fiddle at 9 percent. As far as we are aware, both companies<br />
have projects to design and build their own chips (almost certainly based on ARM<br />
designs). Given the economies of scale of the cloud operations of both companies, it<br />
makes sense for them to do this.<br />
<strong>The</strong> impact either company could have on the future architecture of commodity servers<br />
is impossible to predict, aside from the fact that they will doubtless build infrastructure<br />
that software can easily be migrated to. It is not beyond the bounds of possibility for either<br />
company (or Google for that matter) to enter the chip market.<br />
Parallelism Playing Havoc<br />
By 2010 we, at <strong>The</strong> Bloor Group, began to observe a significant increase in the speed of<br />
server software occurring as a consequence of parallelism. That was in the early days<br />
of the Hadoop project, which had been inspired by Google’s and Yahoo’s use of scaled-out<br />
server networks.<br />
Let’s think about this. Google and Yahoo were the first two Web 2.0 businesses. <strong>The</strong>y<br />
had been forced to find their own solutions to big data problems. Searching the web<br />
was a completely new application. No developers had ever built software to solve a<br />
problem like that. First you send out spiders: software that accesses every<br />
web site, including new web sites, gathering information about new web pages and<br />
changes to old pages. After you harvest the data, you compress it to the digital limit<br />
and add it to your big data heap. That’s the relatively easy part of the problem. The<br />
hard part is updating the indexes on the huge data heap and letting the world pick at<br />
it. It was the second of these two problems that gave MapReduce its start in life.<br />
<strong>The</strong> idea of “mapping” and “reducing” was not new. It is a relatively old technique that<br />
emerged from functional programming, a 1970s programming paradigm. MapReduce,<br />
as invented by Google, was a development framework that was scalable over grids of<br />
servers and ran in a parallel manner. It provided a solution to the indexing problem.<br />
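The essence of the model can be sketched with Python’s own map and reduce, the functional-programming ancestors of the framework. Each map call could run on a different server, and because the merge step is associative the partial results can be combined in any grouping across a cluster. A toy word count, not Google’s implementation:<br />

```python
from functools import reduce
from collections import Counter

def map_phase(document: str) -> Counter:
    """Map: turn one document into partial (word, count) pairs."""
    return Counter(document.lower().split())

def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce: merge two sets of partial counts. The operation is associative,
    so partial results can be combined in any order across the cluster."""
    return left + right

documents = ["the cat sat", "the dog sat", "the cat ran"]
word_counts = reduce(reduce_phase, map(map_phase, documents), Counter())
```

Counting words stands in here for the real job of building an inverted index, but the shape of the computation is the same: independent maps followed by a merge.<br />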
It was research by Doug Cutting and Mike Cafarella at Yahoo that spawned<br />
Hadoop, with its Hadoop Distributed File System (HDFS) and MapReduce framework.<br />
Quite likely the project would not have taken flight had Yahoo not decided to donate<br />
it to Apache as an open source project with Doug Cutting in the pilot’s seat. Soon<br />
Hadoop distribution companies sprouted up: Cloudera in 2008, MapR in 2009 and<br />
Hortonworks in 2011.<br />
Under the auspices of <strong>The</strong> Apache Software Foundation (ASF), Hadoop acquired a<br />
coterie of complementary software components: Avro and Chukwa in 2009, HBase<br />
and Hive in 2010, then Pig and ZooKeeper in 2011. Soon ASF - a nonprofit corporation<br />
devoted to open source software - acquired a destiny. It provided a well-honed process<br />
for incubating the development and assisting the delivery of open source products. It<br />
wasn’t long before it was supervising over 100 such projects, about one fifth of which<br />
were Hadoop related.<br />
<strong>The</strong> commercial early adopters of Hadoop began to trickle in around 2012. <strong>The</strong> trickle<br />
soon became a stream, and then the stream became a river, and the river flowed into a<br />
lake. Hadoop and its growing band of open source jigsaw pieces were truly inexpensive.<br />
It was free software until you needed support, and if you cared to assemble a few old<br />
servers, you could prototype applications and experiment with parallel computing for<br />
almost nothing.<br />
Before Hadoop danced into the data center with its scale-out file system, nothing<br />
comparable existed in the commercial mainstream. <strong>The</strong> assumption had always been that if you had data that<br />
needed to scale out in a big way – hundreds of terabytes or petabytes or beyond – you<br />
needed to put it in a database or a data warehouse. Hadoop changed all that.<br />
YARNing towards the data lake<br />
Until the release of YARN (Yet Another Resource Negotiator), Hadoop was truly<br />
limited. Its two main constraints were that it ran just one task at a time and that software<br />
development was tied to the use of MapReduce. YARN, released in late 2013, provided<br />
a scheduler, and in the same release the enforced use of MapReduce was removed.<br />
Once Hadoop had a scheduling capability it was possible for multiple applications to<br />
share HDFS data concurrently, making it far more appropriate for many applications.<br />
Given direct access to HDFS, applications could leverage Hadoop hardware resources<br />
in any way they chose. If you were watching closely you would have noticed a slew<br />
of commercial software vendors announcing compatible software when YARN was<br />
released. And of course more followed later.<br />
Mesos, another open source scheduler, built for data center workloads, then stepped<br />
into the frame, and soon after that the Myriad project was set up to enable Mesos and<br />
YARN to work together.<br />
Once process scheduling was possible, the idea of a data lake - as a kind of data hub -<br />
took off. With the reality of multiple concurrent tasks it was definitely possible to have<br />
one process or set of processes ingesting data, another process doing analytics and<br />
a third process carrying out ETL on the data.<br />
<strong>The</strong> Spark Phenomenon<br />
Hadoop disrupted the data warehouse world, then Spark disrupted Hadoop.<br />
Described as “lightning fast,” it seemed to emerge from nowhere in 2015. But of course<br />
it didn’t. Spark began life at UC Berkeley AMPLab in 2009 and was open sourced in<br />
2010 (under a BSD license). It became an Apache project in 2013, around the time that<br />
YARN was released.<br />
For a variety of reasons it turned out to be a far better development layer than<br />
MapReduce. It was an in-memory distributed platform comprising a collection of<br />
components that could accelerate batch analytics jobs, including machine learning<br />
applications, and could also handle interactive query and graph processing. It was<br />
not built to be a stream processing capability, but was capable of doing such work (via<br />
Spark Streaming).<br />
Spark was entirely independent of Hadoop. However, it was often deployed with<br />
Hadoop, with HDFS providing the file system. <strong>The</strong> major Hadoop distros quickly added<br />
Spark to their Hadoop bundles.<br />
<strong>The</strong> Hercules Effect<br />
Yahoo started the Apache revolution when it threw Hadoop into the open source<br />
pot. This was not an altruistic gesture but an act of self-interest. It had petabytes of data<br />
stored in Hadoop against which it ran many applications. Yahoo figured it would save<br />
time and money if it let external developers contribute to Hadoop’s development and<br />
shared the bounty.<br />
This collaborative approach to software development established a trend. Facebook,<br />
with its exabyte volumes of data, was built on open source from the ground up. Imitating<br />
Yahoo, it also had gifts to give Apache: Hive, the quasi-SQL capability, and the graph<br />
processing system Giraph. Its most recent donation was Presto, the distributed SQL<br />
query engine, for which Teradata now offers commercial support.<br />
LinkedIn’s open sourcing of Kafka (a persistent distributed message queue) and<br />
Voldemort (a distributed key-value store) can be added to the list of such gifts - as<br />
can Parquet (a column-store capability developed in a collaboration between Twitter<br />
and Cloudera).<br />
<strong>The</strong>re is a huge difference between a new Hadoop component, whose development<br />
only recently started, and one that drops magically from the sky into the ecosystem,<br />
fully tested and implemented over years by a highly scaled web business. <strong>The</strong> first kind<br />
of component may take a long time to mature, while the other is born fully formed, like<br />
Hercules of Greek myth.<br />
<strong>The</strong> addition of Kafka to the Hadoop ecosystem was particularly important. Kafka<br />
provides a publish-subscribe capability that enables data to be streamed in a managed<br />
fashion from one application to another or from one Hadoop or Spark cluster to another.<br />
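The publish-subscribe pattern that Kafka implements at scale can be sketched in miniature: topics are append-only logs, and each consumer tracks its own read offset. This toy broker is for illustration only and bears no relation to Kafka’s actual API:<br />

```python
from collections import defaultdict

class MiniBroker:
    """Toy publish-subscribe broker: each topic is an append-only log and each
    subscriber keeps its own read offset, as Kafka consumers do."""
    def __init__(self):
        self.logs = defaultdict(list)  # topic -> list of messages
        self.offsets = {}              # (topic, subscriber) -> next unread index

    def publish(self, topic: str, message) -> None:
        self.logs[topic].append(message)

    def subscribe(self, topic: str, subscriber: str) -> None:
        self.offsets[(topic, subscriber)] = 0  # start from the beginning

    def poll(self, topic: str, subscriber: str) -> list:
        """Return every message this subscriber has not yet seen."""
        start = self.offsets[(topic, subscriber)]
        self.offsets[(topic, subscriber)] = len(self.logs[topic])
        return self.logs[topic][start:]
```

Because the log is never mutated, any number of independent consumers can read the same stream at their own pace, which is what makes the pattern suitable for moving data between applications or clusters.<br />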
In mid-2015, Kafka was joined by another important communications capability that<br />
goes by the name of NiFi, short for Niagara Files. <strong>The</strong> software was developed by the<br />
NSA over a period of 8 years. It had the goal of automating the flow of data between<br />
systems at scale, while dealing with the issues of failures, bottlenecks, security and<br />
compliance, and allowing changes to the data flow. Taken together, Kafka and NiFi<br />
provide a complete data flow solution that scales as far as the eye can see, and beyond.<br />
If you consider the whole Apache Hadoop stack, the collection of open source software<br />
that emerged in the wake of Hadoop, it obviously constitutes a new environment for<br />
building and deploying parallel software applications. It is remarkably inexpensive,<br />
both because of its open source nature and the commodity hardware on which it runs.<br />
We sometimes think of this stack as constituting a data layer OS, a distributed operating<br />
system for data.<br />
It doesn’t quite qualify, but it is close and it is moving in that direction.<br />
<strong>The</strong> Next Gen Stack<br />
<strong>The</strong> hallmark of the Apache Hadoop stack, in all its glory, is that it is built for parallel<br />
operation on commodity hardware. If you have experience of it, no doubt you will be<br />
able to report that some of it is high quality and some of it is less so. However all of it<br />
is under continuous development and will likely improve. It is not tightly integrated<br />
in the manner that one might expect were it provided by a single software vendor like<br />
Microsoft or Oracle.<br />
Also, which components to choose, and why, can be confusing. This is certainly the<br />
case when it comes to streaming applications. Spark can build streaming applications<br />
of a kind, but in reality it processes micro-batches quickly, which technically is not<br />
streaming. Another component, Apache Storm, has been built for true data streaming.<br />
And yet another, Apache Flink, does streaming and can also do micro-batch<br />
processing. <strong>The</strong>re is competition within the Apache stack for streaming applications<br />
and that may remain so for a while.<br />
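The distinction can be made concrete with a sketch (all names here are invented): a true streaming engine handles each event the moment it arrives, while a micro-batch engine buffers events and processes them in small groups:<br />

```python
def process_per_event(stream, handle):
    """True streaming (Storm-style): handle each event on arrival."""
    for event in stream:
        handle([event])

def process_micro_batches(stream, handle, batch_size=3):
    """Micro-batching (Spark-style): buffer events, handle them in small groups."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:  # flush any final partial batch
        handle(batch)
```

Micro-batching trades latency (an event waits for its batch to fill) for throughput, which is why the choice between the two approaches depends on the application.<br />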
A further potential point of confusion is the existence of three major Hadoop<br />
distributors: Cloudera, MapR and Hortonworks. <strong>The</strong>re are many similarities between<br />
these distributors in that they all pursue a similar business model that emphasizes<br />
support revenues and all provide downloadable free versions of their distributions.<br />
Hortonworks is the “pure play,” while MapR and Cloudera provide “premium”<br />
distributions to their paying customers.<br />
Technically, MapR is the most distinct, providing what it calls a converged platform<br />
that includes the MapR file system, which is read/write (rather than HDFS, which is<br />
append only), MapR Streams (a capability similar to Kafka but more sophisticated) and<br />
MapR-DB, a NoSQL database. Cloudera also has some unique components, including<br />
Cloudera Manager, Cloudera Navigator Optimizer, Cloudera Search, the Kudu storage<br />
engine, which allows read/write, and its Impala database. In contrast, Hortonworks<br />
tries to remain true to the Apache Hadoop stack.<br />
<strong>The</strong>re are other distributions, too. Cloud vendors, such as Amazon and Microsoft,<br />
provide their own Hadoop distributions. A Hadoop distribution contains many<br />
components (Cloudera’s currently has 20 for example) and others can be added. It is<br />
up to the customer to determine which components are required and that is to some<br />
extent application dependent.<br />
Those who have never experimented with Hadoop may assume that it’s relatively<br />
simple to install and get running. However it ain’t necessarily so. Aside from anything<br />
else, there’s a need to either hire experienced programmers or train existing ones in<br />
the use of a tribe of components: MapReduce, Hive, Pig, HBase, Spark and others.<br />
You are taking on a new and complex software environment, much more so than if<br />
you were implementing a Windows or Linux server. And then you are going to build<br />
applications to run on it.<br />
<strong>Data</strong> <strong>Lake</strong> Products<br />
<strong>The</strong> alternative to acquiring technical staff to understand the nuts and bolts of the<br />
Apache Hadoop Stack is to buy technology from one of the Hadoop integration vendors<br />
or, as they are also called, “data lake” vendors: Cask, Unifi, Cambridge Semantics et al.<br />
Such vendors provide a data lake capability “out of the box.” Many companies have<br />
found this a more productive way to build a data lake than building it from scratch.<br />
Consider the position of a company that wishes to build a data lake that uses, say, 20<br />
components of the Apache Stack. <strong>The</strong> issues they will inevitably face are as follows:<br />
• Technical skills: New staff with appropriate technical skills may need to be<br />
hired.<br />
• Operational Management: <strong>The</strong> server cluster will need to be monitored and<br />
managed - altering configurations, provisioning new hardware and tuning for<br />
performance as needed. <strong>The</strong> software environment for this may need to be built.<br />
• Upgrades: Upgrade management of the Apache Stack is a potential headache.<br />
<strong>The</strong> Apache Stack upgrades are not going to be as smooth as, for example, a<br />
Windows Server upgrade - simply because the Apache Stack is not under the<br />
close control that a vendor like Microsoft or IBM provides.<br />
• Standards and Integration Issues: <strong>The</strong> question here is how to align the Apache<br />
Stack with common data center standards for security, data life<br />
cycle management, etc.<br />
In reality what the data lake vendors do is provide an abstraction layer between the<br />
Apache Stack and the applications built on top of it. If this is done well then the user<br />
could, for example, migrate from Cloudera’s stack to MapR’s stack or even move the<br />
data lake applications into the cloud without concern for whether the application will<br />
function as expected.<br />
<strong>The</strong> <strong>Data</strong> Layer OS<br />
Since we introduced the concept of a data layer OS, it is worth explaining here why<br />
we do not currently believe that the Apache Stack has earned that description. If you<br />
ignore all the infrastructure software that is there to make applications possible, then<br />
we can think in terms of there being three types of applications:<br />
• OLTP applications: <strong>The</strong>se are the applications that process the events and<br />
transactions of the business.<br />
• Office applications: This category embraces communications activity from<br />
email to multimedia collaboration and includes personal applications from the<br />
word processor to graphics software. (We would even include development<br />
software here).<br />
• BI and Analytics applications: <strong>The</strong>se are the applications that analyze the<br />
business and provide feedback.<br />
Right now the Apache Stack dominates the area of BI and analytics to the point where<br />
everything else is a sideshow. However the other two areas of application are currently<br />
conspicuous by their absence from the data lake. <strong>The</strong>re are several reasons why that<br />
is so, the main one being that BI and analytics applications can profit most from the<br />
parallelism that the Apache Stack provides. Transactional applications like ERP and<br />
CRM and office applications like email do not experience such a dramatic boost from<br />
parallel processing, because they do not process such large amounts of data.<br />
In recent years the IT industry has witnessed unprecedented software acceleration<br />
in BI and analytic applications. Prior to 2010, in the application areas where it<br />
made a difference, a performance increase of 10x took at least six years, but by 2010<br />
we began to witness much faster software acceleration than that. Nowadays we<br />
sometimes encounter projects that have accelerated a previous analytical process by<br />
1000x or even 10,000x.<br />
This is almost entirely due to the effective exploitation of parallelism in conjunction with<br />
in-memory processing. If you rarely have to go to disk for data and you can scale a<br />
workload out over many servers then 1000x is not that difficult to achieve, and with<br />
the Apache Stack and commodity servers, it can be achieved at remarkably low cost.<br />
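A back-of-envelope calculation shows why. The factors below are illustrative assumptions rather than measurements:<br />

```python
# All three factors are illustrative assumptions, not measurements.
servers = 100              # workload scaled out across a commodity cluster
memory_vs_disk = 50        # speed of in-memory access vs. fetching from disk
parallel_efficiency = 0.5  # coordination overhead eats part of the gain

# Multiplying the scale-out factor by the in-memory factor, then discounting
# for imperfect parallelism, still lands comfortably beyond 1000x.
speedup = servers * memory_vs_disk * parallel_efficiency
```

With these assumptions the estimate comes out at 2500x; even halving every factor leaves a three-figure acceleration, which is why such results are no longer surprising.<br />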
<strong>The</strong> Emergence of <strong>Data</strong> Governance<br />
<strong>Data</strong> is not what it used to be. Let’s take this on board. <strong>The</strong> data we knew and loved<br />
was pretty much static, sitting in databases or files fairly close to the applications that<br />
used it. It wasn’t until the blooming of BI in the late 1990s that the industry felt obliged<br />
to let data flow around. That desire gave birth to the data warehouse, into which data<br />
flowed and from which data trickled into data marts.<br />
Things began to change. Every year that passed witnessed an increase in the need<br />
for data movement. With the advent of complex event processing (CEP) technology to<br />
process data streams of stock and commodity prices for trading banks, we got the first<br />
hints of real-time data processing, and as time marched forward, such activity grew.<br />
It was obvious to web businesses like Yahoo and Amazon that we lived in a real-time<br />
world. <strong>The</strong> web logs that drove their businesses were the digital footprints of users<br />
visiting their web sites. A transaction-based world was giving way to an event-based<br />
world.<br />
Had we thought about it at the time, we might have realized that it wasn’t just web<br />
sites that lived and died by events. Network devices and operating systems and<br />
databases and middleware and applications were all happily recording their events in<br />
log files that squandered disk space until they were eventually deleted. <strong>The</strong> computer<br />
networks of the world were already event oriented. <strong>The</strong>y were born that way, but the<br />
applications we built were not.<br />
<strong>The</strong> first software vendor to notice this was Splunk. It detected and mined a thick<br />
vein of gold in the log files that litter the corporate networks. <strong>The</strong> advent of Splunk<br />
was a boon to IT departments that often needed to consult collections of log files to<br />
identify the causes of application error. Security teams were also appreciative of the<br />
technology as it helped them to hunt network intruders and vagrant viruses. That we<br />
were entering an event-based world was transparently obvious to users of Splunk, since<br />
all the data they gathered was event data.<br />
But it was still not obvious to others, and even now with the advent of streaming<br />
analytics, it is still not obvious to everyone.<br />
<strong>The</strong> Dawn<br />
You would be hard put to find any references to the term “data governance” before<br />
the year 2000. In fact you won’t find many references to it prior to 2005. And let’s be<br />
clear about this: it was not that businesses did not care about the governance of data in<br />
earlier years. It’s just that pretty much all corporate data was internal data generated<br />
by the business and most of it stayed put, where it was born, or if it moved anywhere<br />
it made its way into a data warehouse.<br />
Under those circumstances governing the data was not such a pressing need. But, as<br />
we have described, data found the need to move much more often and the volume of<br />
data exploded.<br />
Let us home in on one of the faults of our current use of data. It can be thought of as a<br />
fundamental problem that needs to be fixed:<br />
<strong>Data</strong> is not self-defining!<br />
It’s easy to understand why. Some data is created by programs which expect to be the<br />
only programs ever to use it. Within the program that uses it, the data is adequately<br />
defined for use and the program understands it perfectly. As there is no intention for<br />
the data to be used by any other program, there is no need to explain the meaning of<br />
the data when it is stored.<br />
At the dawn of IT, back in the punched card era, all data was treated in that way, and it<br />
was not until the advent of the database - a software technology built to enable data sharing<br />
- that anything changed. <strong>The</strong> data definitions in the database, the schema, applied to all<br />
the data but ceased to apply once the data was exported from the database. It was, at<br />
best, a halfway solution to the data definition problem.<br />
Now that the era of event processing has arrived it is possible to be more specific<br />
about how to make data self-defining. What follows is a suggested list of data items<br />
that could be added to every event record, along with brief definitions:<br />
• Date-Time: <strong>The</strong> date and time (GMT) that the data was created.<br />
• Geographic location: <strong>The</strong> geographical location of the device that created the<br />
data. For precision we might think here in terms of the map reference (latitude<br />
and longitude) and possibly a three-dimensional reference for the point within<br />
the building that occupies that map reference. <strong>Data</strong> created off-planet would<br />
probably use a reference point based on the sun.<br />
• Source device: <strong>The</strong> precise identity of the device that created the data (server,<br />
PC, mobile phone, etc.)<br />
• Device ID: Specific ID of the source device.<br />
• Source software: <strong>The</strong> precise identity of the software that created the data.<br />
• Derivation: An indication of whether the data was derived and if so, how.<br />
• Creator: <strong>The</strong> owner or enabler of the device and software that created the data.<br />
• Owner: <strong>The</strong> owner of the data, who may be different from the creator.<br />
• Permissions: Security permissions for usage of the data, enabling its use by<br />
specific programs.<br />
• Status: Whether this is the master copy of the data or a valid replica. (Replicas<br />
will be kept for back-up at the very least but may also be created for multiple<br />
concurrent usage.)<br />
• Metadata: Associated directly with each data value is a metadata tag or reference<br />
identifying the meaning of the data.<br />
• Master Audit Trail: Recording of who has used the data, when, where and how.<br />
• Archive Flag: <strong>The</strong> date when the data is scheduled for archive or deletion.<br />
This is not necessarily an exhaustive list. Note that the scheme described here assumes<br />
that data is never updated. Under such a regime, values that are<br />
known to be wrong would be corrected in a way analogous to how postings are<br />
reversed out in an accounting ledger.<br />
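To make the scheme concrete, here is one possible rendering of such a record in Python, together with the ledger-style correction just described. Every field name and value here is illustrative; no standard for self-defining data yet exists:<br />

```python
from datetime import datetime, timezone

def make_event(value, metadata_tag, device_id, owner):
    """Build a self-defining event record carrying its own provenance fields.
    The field names are our own rendering of the list above."""
    return {
        "date_time": datetime.now(timezone.utc).isoformat(),
        "geo_location": (30.2672, -97.7431),    # illustrative lat/long
        "device_id": device_id,
        "source_software": "sensor-agent/1.0",  # illustrative identifier
        "derivation": None,                     # raw data, not derived
        "owner": owner,
        "permissions": {"view": [owner]},
        "metadata": metadata_tag,
        "audit_trail": [],
        "value": value,
    }

def correct_event(log, index, corrected_value):
    """Never update in place: append a reversal of the bad record and then a
    corrected record, as postings are reversed out in an accounting ledger."""
    bad = log[index]
    log.append(dict(bad, derivation=("reversal", index)))
    log.append(dict(bad, value=corrected_value, derivation=("correction", index)))
```

Because the original record is never touched, the full history of the value - including the mistake - remains available for audit.<br />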
It is clear in perusing this collection of definition data that it constitutes a significant<br />
amount of data in itself. How such data is stored is a physical implementation issue -<br />
clearly there are many ways in which much of this data could be compressed and/<br />
or stored separately from the data values, depending on how it is being used or is intended<br />
to be used.<br />
<strong>The</strong> function of all this additional data is to establish indelibly:<br />
• <strong>The</strong> provenance of the data.<br />
• <strong>The</strong> ownership of the data and its allowed usage.<br />
• <strong>The</strong> history of its usage.<br />
• <strong>The</strong> schedule for its archive or deletion.<br />
• <strong>The</strong> meaning of the data.<br />
<strong>The</strong>se can also be regarded as five dimensions of data governance. We will discuss<br />
these one by one a little later. However before we do that, we need to define, in a<br />
general way, what we consider a data lake to be.<br />
A General Definition of <strong>The</strong> <strong>Data</strong> <strong>Lake</strong><br />
It is our view that the <strong>Data</strong> <strong>Lake</strong> can and should be the system of record of an<br />
organization. Specifically, that means that the data storage system the data lake<br />
embodies becomes the authoritative data source for all corporate information.<br />
We are speaking here of a “logical” data lake. For reasons of practicality it may be<br />
necessary to have multiple physical data lakes, in which case the system of record is<br />
constituted by all those physical data lakes taken together.<br />
Not only should the <strong>Data</strong> <strong>Lake</strong> be the system of record, it should also be the<br />
logical location for the implementation of data governance. <strong>Data</strong> governance rules<br />
and procedures need to have a natural point of enforcement, and the logical place for<br />
such enforcement is the system of record: the data lake.<br />
Figure 5 depicts the data lake in terms of the primary software components that are<br />
required: an ingest capability that can harvest both data streams and batches of data,<br />
the data storage capability, data governance so that data is governed on entry to the<br />
lake, data lake management so that the data lake environment is monitored and kept<br />
operational, and ETL capabilities for transporting data to other locations. In addition,<br />
we also have the applications that run on the lake.<br />
<strong>The</strong> data lake is a staging area for the entry of new data into the system of record,<br />
whether it was created within the organization or elsewhere. <strong>The</strong>re is an imperative for<br />
all governance processes to be applied to the data at that point of entry, if possible, and<br />
for data to be available for use once those processes have completed.<br />
Figure 5. <strong>Data</strong> <strong>Lake</strong> Overview<br />
We will discuss these processes one by one:<br />
1. Assigning data provenance and lineage. Accurate data analytics needs certainty<br />
about the provenance and lineage of the data. Prior to the dawn of the “big data<br />
age” this was rarely a problem, as the data either originated within the business or came<br />
from a traditional and reputable external source. As the number of sources of data<br />
increases, the difficulties of provenance/lineage will increase.<br />
We expect the ability of data to self-identify for the sake of provenance to increase,<br />
although it may be many years before a general standard is agreed. For the sake of<br />
lineage and provenance each event record would need to record the time of creation,<br />
geolocation of creation, ID of creating device, ID of the process/app which created the<br />
data, ownership of the data, the metadata, and the identity of the data set or grouping<br />
it belongs to. To this we can add the details of derivation if the data was derived in<br />
some way, which would allow lineage to be deduced.<br />
Where such precise details are absent on ingest to the data lake, it should be possible, at<br />
least, to know and record where the data came from and how. As very few data records<br />
self-identify as described, some compromise is inevitable until standards emerge. With<br />
the advent of the Internet of Things, the need for self-defining data will increase.<br />
2. <strong>Data</strong> security. <strong>The</strong> goal of data security is to prevent data theft or illicit data usage.<br />
Encryption is one primary dimension of this and access control is the other. Let us<br />
consider encryption first.<br />
Encryption needs to be planned and, ideally, applied as data enters the lake. In a<br />
world of data movement, security rules that are applied need to be distributable to<br />
wherever encrypted data is used. Ideally with encryption, you will encrypt data as<br />
soon as possible and decrypt it at the last moment, when it needs to be seen “in the<br />
clear.”<br />
<strong>The</strong> reason for this approach is twofold. First, it makes obvious sense to minimize the<br />
time that data is in the clear. Secondly, encryption and decryption make heavy use of<br />
CPU resources and hence minimizing such activity reduces cost.<br />
To implement this approach, format-preserving encryption (FPE) is necessary. <strong>The</strong><br />
point about FPE is that it does not change the characteristics of the data, such as format<br />
and sort order; it simply disguises the data values. <strong>The</strong>re are FPE standards and vendors<br />
that specialize in their application.<br />
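As a purely illustrative toy (this is NOT real FPE and is not cryptographically secure; production systems use standardized schemes such as NIST FF1, usually via a vendor library), a keyed digit substitution shows the format-preserving idea: the disguised value keeps the same length, separators and character class as the original.

```python
import hashlib

def _digit_map(key: str) -> dict:
    # Derive a digit permutation from a secret key (illustrative only).
    digest = hashlib.sha256(key.encode()).digest()
    digits = list("0123456789")
    # Fisher-Yates shuffle driven by key-derived bytes.
    for i in range(9, 0, -1):
        j = digest[i] % (i + 1)
        digits[i], digits[j] = digits[j], digits[i]
    return {str(d): digits[d] for d in range(10)}

def mask(value: str, key: str) -> str:
    """Disguise digits while preserving format (length, dashes, digit class)."""
    table = _digit_map(key)
    return "".join(table.get(ch, ch) for ch in value)

def unmask(masked: str, key: str) -> str:
    inverse = {v: k for k, v in _digit_map(key).items()}
    return "".join(inverse.get(ch, ch) for ch in masked)

token = mask("4111-1111-1111-1234", "secret-key")
assert len(token) == len("4111-1111-1111-1234") and token.count("-") == 3
assert unmask(token, "secret-key") == "4111-1111-1111-1234"
```

Because the masked value still "looks like" a card number, downstream code that validates formats or sorts fields keeps working without ever seeing the data in the clear.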
<strong>Data</strong> access controls require the existence of a reasonably comprehensive identity<br />
management system and access rights associated with all data. Access rights may<br />
distinguish between the right to view and the right to process the data using a particular<br />
program or process.<br />
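The view/process distinction can be sketched as a tiny access-control check (all names here are hypothetical illustrations, not any product's API):

```python
# Access rights per (user, dataset): "view" allows seeing values in the clear,
# "process" allows running data through approved programs only.
ACL = {
    ("alice", "customer-data"): {"view", "process"},
    ("bob",   "customer-data"): {"process"},   # may run jobs, never view raw values
}
ALLOWED_PROGRAMS = {("bob", "customer-data"): {"churn-model"}}

def can_view(user: str, dataset: str) -> bool:
    return "view" in ACL.get((user, dataset), set())

def can_process(user: str, dataset: str, program: str) -> bool:
    if "process" not in ACL.get((user, dataset), set()):
        return False
    allowed = ALLOWED_PROGRAMS.get((user, dataset))
    return allowed is None or program in allowed

assert can_view("alice", "customer-data")
assert not can_view("bob", "customer-data")
assert can_process("bob", "customer-data", "churn-model")
assert not can_process("bob", "customer-data", "ad-hoc-query")
```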
3. <strong>Data</strong> Compliance. <strong>Data</strong> compliance regulations are now common, and likely to<br />
become increasingly complicated with the passage of time. <strong>The</strong> EU is hoping to establish<br />
a General <strong>Data</strong> Protection Regulation (GDPR) for personal data that is implemented<br />
across the world and has devoted considerable effort to formulating rules for that. GDPR<br />
will become law within the EU and will likely take effect in many other countries. <strong>The</strong><br />
practical consequence is that businesses worldwide need to implement these rules. <strong>The</strong><br />
<strong>The</strong> EU legislation does not care where the data is held, so its jurisdiction is worldwide<br />
in respect of anyone living in the EU.<br />
You can think of such regulations as international whereas healthcare regulations<br />
(HIPAA in the US) and financial compliance rules tend to take effect at a national level.<br />
To this collection of compliance regulations you can add non-binding sector compliance<br />
initiatives and best practice rules that an individual business might establish. It should<br />
be obvious that applying such rules is difficult unless the data is self-defining to some<br />
degree.<br />
4. <strong>Data</strong> integrity. Once you eliminate data updates, the possibility of data corruption<br />
diminishes. Nevertheless it still exists and may never be entirely eliminated; software<br />
errors can certainly cause it. <strong>The</strong> need to specify whether data is a copy or the actual<br />
source is necessary if there is going to be any possibility of auditing the use of replicated<br />
data. Test data needs to know that it is test data. <strong>Data</strong> back-ups also need to know they<br />
are back-ups. Disaster recovery needs to restore all data to its original state. All of this<br />
applies to data in motion as well as data at rest.<br />
5. <strong>Data</strong> cleansing. Newly ingested data may be inaccurate, especially if we have no<br />
control over and little knowledge of how it was created. No data creation or collection<br />
processes are perfect. This is even true of sensor data, which we may think of as reliable,<br />
because it comes from an automated source; sensors are also capable of error. <strong>Data</strong> may<br />
also be corrupted by subsequent processes after creation.<br />
<strong>Data</strong> needs to be cleaned as soon as possible after ingest - where that’s feasible. <strong>The</strong>re<br />
are exceptional situations, for example where the requirement is to process a data<br />
stream as it is ingested. <strong>The</strong>re is no time available for data cleansing, so the streaming<br />
application must allow for possible data error. A real world parallel to this is the way<br />
that news is processed. Sometimes false news reports are aired because some editor<br />
thought that the news was too “valuable” not to air it immediately and the usual<br />
verification process was skipped. It is corrected later, if it turns out to be wrong. Any<br />
urgent processing of uncleaned data will normally allow for its poor quality and the<br />
possible need for later correction.<br />
<strong>Data</strong> cleansing standards are a natural part of data governance. However there is no<br />
silver bullet for cleaning data. <strong>The</strong>re are obvious tests that can be done such as checking<br />
for logically impossible values, and you can formulate rules to detect unlikely values.<br />
It is possible to cleanse some data automatically, and there are some well-designed<br />
tools for this from Trifacta, Paxata, Unifi and others. But no matter how effective the<br />
cleansing software, it requires human supervision and intervention. Cleansing can<br />
thus be a slow process.<br />
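The kinds of tests described — checks for logically impossible values and rules for unlikely ones — can be sketched as simple validators (the fields and thresholds here are invented for illustration):

```python
def check_record(rec: dict) -> list:
    """Return a list of data-quality flags; an empty list means the record passed."""
    flags = []
    # Logically impossible values: hard errors.
    if rec.get("age", 0) < 0:
        flags.append("impossible: negative age")
    if rec.get("percent_complete", 0) > 100:
        flags.append("impossible: percentage over 100")
    # Unlikely values: warnings that merit human review.
    if rec.get("age", 0) > 120:
        flags.append("unlikely: age over 120")
    return flags

assert check_record({"age": 34, "percent_complete": 80}) == []
assert "impossible: negative age" in check_record({"age": -5})
assert "unlikely: age over 120" in check_record({"age": 140})
```

Note the two tiers: impossible values can be rejected automatically, while merely unlikely ones are exactly where the human supervision mentioned above comes in.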
6. <strong>Data</strong> reliability. As far as it is possible to ensure, data needs to be accurate and<br />
also to be checked for accuracy regularly. <strong>Data</strong> values can be corrupted in various<br />
ways: a value can be changed by a hacker for fraudulent reasons or even sabotage. It can<br />
be overwritten at some point in its life. It can be corrupted “in flight,” although this<br />
is rare because of communications error-checking procedures (in-flight corruption is<br />
most likely due to hacking). It can be corrupted by any software that rewrites the data. It can<br />
be corrupted by database error (DBA error). To deal with such possibilities, some form<br />
of checksum integrity can be applied.<br />
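Checksum integrity of the kind mentioned can be sketched with a standard hash: store a digest when the record is written, recompute it on read, and any silent change is detected.

```python
import hashlib
import json

def checksum(record: dict) -> str:
    # Canonical serialization so logically identical records hash identically.
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

record = {"order_id": 1001, "amount": 250.0}
stored_digest = checksum(record)          # persisted alongside the record

# Later, on read: verify before trusting the values.
assert checksum(record) == stored_digest

tampered = {"order_id": 1001, "amount": 2500.0}
assert checksum(tampered) != stored_digest   # corruption or tampering detected
```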
7. Disambiguation. Disambiguation is an aspect of data cleansing and data reliability.<br />
It refers to the removal of ambiguities in the data. It is a process that would be unlikely<br />
to be applied when data enters the data lake, since it can be a time-consuming process.<br />
<strong>The</strong>re is a particular problem with data ambiguity when it comes to people. <strong>The</strong>re is no<br />
bullet-proof global identity system, so identity theft is a fact of life.<br />
<strong>The</strong>re is no standard for people’s names. Sometimes just the first name and surname is<br />
asked for. Sometimes a middle initial or a middle name is asked for. Sometimes the full<br />
name. <strong>The</strong> name is a poor identifier. People can change their names legally and women<br />
change their names by marriage. People sometimes disguise their names deliberately<br />
for fraudulent reasons. But they may do so legitimately (in an effort to anonymize<br />
their data). Some attributes change (address, telephone number, etc.) and usually do so<br />
without the information being easily gathered. To complicate the picture, the structure<br />
of the customer entity changes over time. For example, social media identities (on<br />
Twitter, etc.) were born only recently.<br />
<strong>The</strong> consequence of this is that for many businesses cleansing customer data also<br />
means disambiguating the data. <strong>The</strong>re are few software tools that are effective at<br />
disambiguation. Novetta has this capability, and IBM also, but they are the only two<br />
software providers we know of that do.<br />
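A toy flavor of the matching such tools perform (the similarity threshold is an arbitrary illustration; real entity-resolution products weigh far richer evidence than name similarity alone):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    # Normalize case and whitespace, then compare character sequences.
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def maybe_same_person(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag candidate matches for disambiguation review."""
    return name_similarity(a, b) >= threshold

assert maybe_same_person("Jon A. Smith", "Jon A Smith")
assert not maybe_same_person("Jon Smith", "Rebecca Jozwiak")
```

Even this crude measure shows why disambiguation is slow: a high score only nominates a candidate pair; confirming it requires corroborating attributes (address history, dates of birth, and so on) and often a human decision.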
8. Audit trail of data usage. A record of who used what data and when needs to be<br />
maintained both for security purposes and for usage analytics. Usage analytics play a<br />
part in query optimization as well as in data life-cycle management.<br />
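A minimal sketch of such an audit trail, feeding both the security view and the usage-analytics view:

```python
import time
from collections import Counter

audit_log = []

def record_access(user: str, dataset: str, action: str) -> None:
    """Append a who/what/when entry for every data access."""
    audit_log.append({"ts": time.time(), "user": user,
                      "dataset": dataset, "action": action})

record_access("alice", "customer-data", "query")
record_access("bob", "customer-data", "query")
record_access("alice", "sensor-logs", "export")

# Security view: who touched a given dataset?
users = {e["user"] for e in audit_log if e["dataset"] == "customer-data"}
assert users == {"alice", "bob"}

# Usage-analytics view: which datasets are hot? (feeds life-cycle decisions)
usage = Counter(e["dataset"] for e in audit_log)
assert usage["customer-data"] == 2
```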
9. <strong>Data</strong> life-cycle management. Few companies have formally implemented a general<br />
data life cycle strategy. More likely is that they have a strategy for some data, such as<br />
data covered by compliance regulations, but not all data. Others may have no strategy<br />
at all.<br />
With analytical applications in particular, the need to manage data life cycles is<br />
important, because much of the data used in data exploration may eventually be<br />
discarded as worthless. <strong>The</strong>re is no point in retaining the data, beyond recording that<br />
it was once explored. As data lakes are also used for archive, the use of a data lake<br />
creates an opportunity to implement or tighten up the procedures around data life cycle<br />
management. Life-cycle management can be thought of as the strategy for moving data<br />
to least cost locations as its usage diminishes. Deletion is a possible destination in this.<br />
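Seen that way, a life-cycle policy is just a rule mapping usage to a destination. A hypothetical sketch (the tier names and thresholds are invented for illustration):

```python
def choose_tier(days_since_last_access: int, under_compliance_hold: bool) -> str:
    """Map diminishing usage to a least-cost destination."""
    if under_compliance_hold:
        return "archive"            # must be retained regardless of usage
    if days_since_last_access <= 30:
        return "hot"                # fast storage, frequently queried
    if days_since_last_access <= 365:
        return "cold"               # cheaper storage, rarely queried
    return "delete"                 # deletion is a possible destination

assert choose_tier(5, False) == "hot"
assert choose_tier(200, False) == "cold"
assert choose_tier(1000, False) == "delete"
assert choose_tier(1000, True) == "archive"
```

In practice the `days_since_last_access` input would come from the usage analytics described under the audit trail above.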
Metadata and Schema-on-read<br />
<strong>The</strong> only aspect of governance that we have not yet discussed is metadata management.<br />
<strong>The</strong> metadata situation is complex and thus we are devoting more time to it than to<br />
other aspects of data governance. In overview, the situation is simple: since metadata<br />
determines the meaning of data, the natural preference is for metadata to be as complete<br />
as possible as soon as possible, so the possibility of data being misinterpreted is<br />
minimized.<br />
However, one of the loudly proclaimed benefits of the data lake is that “there is no need<br />
to model data before ingesting it.” This contrasts significantly with the data warehouse<br />
situation, where a great deal of data modelling effort is required before data is allowed<br />
into the warehouse. <strong>The</strong> alternative to data modelling is called schema-on-read.<br />
With schema-on-read, the process that reads the data for the first time determines the<br />
metadata. <strong>The</strong> work involved varies according to data source. Some data such as CSV<br />
files and XML files define the metadata, and thus it’s possible to know the metadata<br />
as the data is read. Other data may not be so convenient. However, some data formats<br />
can be recognized from the data and also some metadata may be deducible from data<br />
values. This is how products like Waterline can automatically determine metadata<br />
values. In other circumstances human input may be required to determine metadata.<br />
Once the metadata is determined it can be stored in a metadata repository.<br />
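The flavor of this automatic metadata discovery can be sketched: read a sample of raw values and deduce column types (a crude illustration of what products in this space do with far more sophistication):

```python
import csv
import io

def infer_type(values) -> str:
    """Deduce a column type from sample values (crude schema-on-read)."""
    def all_parse(cast):
        try:
            return all(cast(v) is not None for v in values)
        except ValueError:
            return False
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"

raw = "id,amount,city\n1,19.99,Austin\n2,5.00,London\n3,12.50,Berlin\n"
rows = list(csv.DictReader(io.StringIO(raw)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
assert schema == {"id": "integer", "amount": "float", "city": "string"}
```

Nothing here was modelled in advance: the schema is discovered when the data is first read, and the result is exactly what would be deposited in the metadata repository.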
With schema-on-read the end goal is not to establish an RDBMS-type data model<br />
which defines the relationships (foreign key relationships) between data sets. For many<br />
BI and analytics applications it is not required, provided the user can specify the metadata. <strong>The</strong><br />
schema-on-read approach means that any Master <strong>Data</strong> Management (MDM) process<br />
that maintains a master data model of all corporate data will probably need to be<br />
adjusted.<br />
<strong>The</strong> assumption of data modelling is that by applying various rules and a little<br />
common sense you can provide a data model (usually an ER model) that is suited to<br />
all possible uses of the data. <strong>The</strong> truth is that in almost all circumstances you cannot. It<br />
will be imperfect, at best.<br />
<strong>The</strong> reality of the situation is this:<br />
• <strong>The</strong> time taken to model the data, either in the beginning or when new data<br />
sources are added or if errors are found in the model, constitutes a definite and<br />
possibly large cost to the business in respect of time to value.<br />
• <strong>The</strong> modeler has to try to anticipate all new data sets that may appear later,<br />
so that the model does not require significant rework when new data sets are<br />
added. This can only be guessed at and rework is sometimes required.<br />
• <strong>Data</strong> (in a data lake) is intended to be a shared asset and will be shared by<br />
groups of people with varying roles and differing interests, all of whom hope<br />
to get insights from the data. To model such data means trying to allow for<br />
every constituency in advance. Possibly this will result in a “lowest common<br />
denominator” schema that is an imperfect fit for anyone. This problem gets<br />
worse with more data sources, higher data volumes and more users.<br />
• With schema-on-read, you’re not glued to a predetermined structure so you can<br />
present the data in a schema that fits reasonably well to the task that requests<br />
the data.<br />
By employing schema-on-read you get value from the data as soon as possible. It does<br />
not impose any structure on the data because it does not change the structure that the<br />
data had when loaded. It can be particularly useful when dealing with semi-structured,<br />
poly-structured, and unstructured data. It was difficult and often impossible to model<br />
some of this data and ingest it into a data warehouse, but there is no problem getting<br />
it into the data lake.<br />
In general, schema-on-read allows for all types of data and encourages a less rigid<br />
organization of data. It is easier to create two different views of the same data with<br />
schema-on-read. Also, it does not prevent data modelling, it just makes it necessary to<br />
defer it.<br />
<strong>The</strong> <strong>Data</strong> Catalog and Master <strong>Data</strong> Management (MDM)<br />
<strong>The</strong>re needs to be a catalog for all the data in the lake. It is important, and will become<br />
increasingly important, for the data catalog to be as rich as possible, embodying as much<br />
“meaning” as possible. It may help here if we explain what we mean by “data catalog,”<br />
as the term is open to interpretation.<br />
<strong>The</strong> data catalog for an operating system (such as Windows or Linux) is the file system.<br />
<strong>The</strong> metadata it stores for each file is incomplete, as its primary purpose is to enable<br />
programs to find and access files. <strong>The</strong> programs themselves will know the layout of<br />
the data and thus how to interpret the values in any given file. This arrangement is not<br />
ideal because the data cannot be shared with programs that do not understand the file<br />
structure. It is adequate only for data that never needs to be shared.<br />
<strong>The</strong> data catalog for a database is the schema. <strong>The</strong> schema defines the structure of all<br />
the physical data held in the data sets or tables of the database and seeks to provide<br />
a logical meaning to each data item by attaching a label to it (Customer_ID, Amount,<br />
Date_of_Birth, etc). <strong>The</strong> amount of meaning that can be derived from the data catalog<br />
depends upon the specific database product. For relational databases the catalog<br />
roughly reflects the meaning that is captured in an ER Diagram, where the relationships<br />
between specific data entities are indicated. With NoSQL and document databases the<br />
situation is similar, although how the schema is used varies from product to product.<br />
With an RDF database (sometimes called a Semantic <strong>Data</strong>base) the catalog can record<br />
a greater level of meaning. This is an area where Cambridge Semantics, with its Smart<br />
<strong>Data</strong> <strong>Lake</strong> solutions, currently excels. <strong>The</strong> simple fact is that semantic technology<br />
allows you to capture much more meaning for the data catalog than you can using ER<br />
modelling.<br />
A traditional data warehouse metadata catalog can be complex, but nothing like as<br />
complex as a data lake catalog, which may include many sources of unstructured<br />
data (document data, graph data) that may require ontologies (semantic structures) to<br />
accurately define the metadata.<br />
<strong>The</strong> point of the data lake’s data catalog is to provide users (and programs) with a<br />
data self-service capability. On a simple level, if you compare the pre-data-lake world<br />
of data warehouses and data marts to the data lake world, two obvious facts emerge:<br />
• <strong>The</strong> data lake can hold every kind of data (unstructured, semi-structured,<br />
structured) allowing it to provide a more comprehensive data service to users.<br />
• <strong>The</strong> data lake can scale out, almost indefinitely, to accommodate far more data<br />
than a data warehouse ever dreamed of doing.<br />
It is best to think of metadata enrichment as an ongoing process. <strong>The</strong> use of<br />
schema-on-read may mean that the data catalog is incomplete when a data set (or file) first<br />
enters the lake. However that will change as soon as the data is used. <strong>The</strong> store of<br />
metadata may be further enriched by usage, and formal modelling activity may be<br />
carried out to assist in that.<br />
<strong>The</strong> idea of Master <strong>Data</strong> Management is not dead, but it has morphed over the years.<br />
<strong>The</strong> original idea was to arrive at a “single version of the truth,” a holy grail that none<br />
of the knights of the round table in any organization ever managed to feast their eyes<br />
on. Nevertheless, there is sense in trying to link together all the data of an organization<br />
within a metadata driven model that is comprehensible to business users and enables<br />
data self-service. It should be possible to define the business meaning of everything<br />
that lives within the data lake. This is an area where we expect semantic technology to<br />
make an invaluable contribution.<br />
<strong>The</strong> Cloud Dynamic<br />
It could be argued that data lakes are cloud-neutral. Depending on circumstance, the<br />
cloud may prove attractive to companies involved in data lake projects. Certainly it is<br />
likely to be appropriate for building prototypes, with the idea of later migrating back<br />
into the data center.<br />
As regards data lakes, the economics of the cloud needs to be carefully considered.<br />
With most cloud vendors you will pay rent for all the data that lives in the cloud - and<br />
if a good deal of that data is never accessed, it will almost certainly be much less<br />
expensive to keep that data on premises.<br />
That’s the downside of the cloud, but the cloud still has its characteristic upside. It’s the<br />
go-to solution when there’s a need for instant extra capacity, whether for storing data<br />
or processing it.<br />
<strong>Data</strong> <strong>Lake</strong> Architecture<br />
Let’s begin with the idea that the data lake is the system of record. And let’s be clear<br />
what we mean by this. <strong>The</strong> system of record is the system that records all the data that<br />
is used by the business. It also holds the golden copy of each data record.<br />
For simplicity, the system of record should be thought of as a logical system. It may<br />
be possible to implement it on a single cluster of servers, but this is not a requirement<br />
and should not be a goal. In practice the whole configuration will probably involve<br />
multiple clusters, if for no other reason than to provide disaster recovery.<br />
<strong>The</strong> system of record should also be the system where governance processes are<br />
applied to data. Clearly, data needs to be subject to governance wherever it is used<br />
within the organization. Some governance processes, particularly data security, need<br />
to be applied to data as soon as possible after its creation or capture. For that reason,<br />
the data lake will inevitably be the landing zone for external data brought into the<br />
business, so governance processes can quickly and easily be applied to it. <strong>Data</strong> created<br />
within the organization should be passed to the data lake immediately after creation so<br />
that governance processes can be applied.<br />
It is best to think of the data lake as a store of event records. We can think of it in<br />
this way: events are atoms of data and transactions (the traditional paradigm with its<br />
traditional data structures) are molecules of information. <strong>The</strong> analogy works reasonably<br />
well.<br />
Consider for example a web site visit. A user lands on site, clicks through a few pages,<br />
decides to buy something, enters credit card details, and clicks on the “confirm” button.<br />
In transactional terms we may think of this as a purchase: a molecule of data, that’s<br />
quickly followed by a delivery transaction, another molecule of data to record.<br />
But in reality it’s a stream of events, a series of atoms of data. Every user mouse<br />
action creates an event. <strong>The</strong> computer records these events and responds to each one<br />
immediately; it displays new web pages or expands text descriptions or whatever.<br />
<strong>The</strong> purchase confirmation is just another event, distinguished only by the fact that it<br />
generates a cascade of other related events in the application or in other applications.<br />
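The atoms-and-molecules analogy can be made concrete: the purchase "molecule" is simply assembled from the event "atoms" in the click-stream (the event shapes here are invented for illustration):

```python
# A web-site visit as a stream of atomic events.
events = [
    {"type": "page_view", "page": "/home"},
    {"type": "page_view", "page": "/product/42"},
    {"type": "add_to_cart", "sku": "42", "price": 19.99},
    {"type": "enter_payment"},
    {"type": "confirm"},   # just another event, but it triggers the cascade
]

def to_transaction(events):
    """Assemble the transactional 'molecule' from its event 'atoms'."""
    if not any(e["type"] == "confirm" for e in events):
        return None                      # browsing session, no purchase
    items = [e for e in events if e["type"] == "add_to_cart"]
    return {"items": [e["sku"] for e in items],
            "total": sum(e["price"] for e in items)}

txn = to_transaction(events)
assert txn == {"items": ["42"], "total": 19.99}
```

Storing the events rather than just the derived transaction means the lake retains everything: the transaction can always be recomputed, but the abandoned-browse sessions (events with no `confirm`) are analyzable too.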
When you look at it like that, business systems consist of applications generating events<br />
and sending event information to other applications. <strong>The</strong>re is nothing particularly<br />
special about web applications in this respect; all applications are like that. <strong>The</strong>y<br />
respond to events and send messages or data to other applications. That’s been the<br />
nature of computing for decades.<br />
When you scan all the servers and network devices in a data center you find them<br />
awash with log files that store data about events of every kind: network logs, message<br />
logs, system logs, application logs, API logs, security logs and so on. Collectively the<br />
logs provide an extensive audit trail of the activity of the data center, organized by<br />
time stamp. It happens at the application level, at the data level and lower down at<br />
the hardware level. Hiding within this disparate set of data can be found details of<br />
anomalies, error conditions, hacker attacks, business transactions and so on.<br />
Clearly it is possible, by adding real-time logging to every business application on every<br />
device, to collect all the company’s data and dispatch it to the data lake - although it is<br />
no doubt overkill. Figure 6 depicts this possibility. <strong>Data</strong> is ingested either from static<br />
data sources (files and databases) on a scheduled basis, or is ingested directly from data<br />
streams.<br />
We show both Kafka and NiFi as possible components of an ingest solution. Kafka is pure<br />
publish/subscribe and thus can easily be used to gather data changes from multiple files<br />
(as publishers) and pass them to the <strong>Data</strong> Bus (the subscriber), probably a well-configured<br />
server. However, a more sophisticated set of capabilities can be created by integrating<br />
NiFi with Kafka. NiFi can completely automate data flows across thousands of systems<br />
and can be configured to handle the failure of any system or network component, queue<br />
management, data corruption, priorities, compliance and security. And it provides an<br />
excellent drag-and-drop interface that allows data flow configurations to be designed<br />
and enhanced. You can think of NiFi as ETL on steroids. We included an ingest application<br />
in the diagram to allow for any functionality or integration capability that could not be<br />
delivered by Kafka in conjunction with NiFi. <strong>The</strong> goal is to provide an in-memory stream<br />
of data that can be processed as it arrives. Given that, in theory at least, input may be<br />
required from any data source anywhere, the data access ability needed here is extensive.<br />
<strong>Data</strong> lake projects will most likely begin with fairly unsophisticated data acquisition<br />
from just a few sources. What we have described here is a kind of “worst case scenario”<br />
and how it would likely be handled.<br />
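The publish/subscribe pattern described for Kafka can be sketched in miniature in plain Python (an in-memory stand-in to show the shape of the pattern, not the Kafka API):

```python
from collections import defaultdict

class MiniBus:
    """In-memory publish/subscribe: many publishers, per-topic subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

bus = MiniBus()
landed = []
bus.subscribe("file-changes", landed.append)   # the data bus as subscriber

# Multiple sources (publishers) push their changes onto the same topic.
bus.publish("file-changes", {"file": "orders.log", "delta": "+3 rows"})
bus.publish("file-changes", {"file": "clicks.log", "delta": "+120 rows"})
assert len(landed) == 2
```

The decoupling is the point: publishers know nothing about who consumes the changes, so new subscribers (cleansing, metadata discovery, real-time apps) can be attached without touching the sources. Kafka adds durable, partitioned, replayable logs on top of this basic shape.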
Real-Time Applications<br />
<strong>The</strong> goal of software architecture is to satisfy application service levels, pure and<br />
simple. When we consider real-time applications, for example responding immediately<br />
to price changes in an automated market, there is simply no room for any avoidable<br />
latency.<br />
Figure 6. <strong>Data</strong> <strong>Lake</strong> Ingest<br />
We represent this reality, in Figure 7, by showing real-time applications running directly<br />
against the real-time data stream within the <strong>Data</strong> Bus and also accessing a disk-based<br />
data source. In practice this is likely to be achieved by a Lambda or Kappa architecture,<br />
which is far more involved than the diagram suggests.<br />
In practice it is far more likely, for latency reasons, that real-time apps will not be located<br />
anywhere near the data lake, but will instead be as close as possible to the data stream(s)<br />
that feed them. Nevertheless they will likely pass the data stream they process directly<br />
to the data lake. Since such applications run prior to any data governance, it may be<br />
necessary to pass information back to these applications if anything important is<br />
discovered during the governance processes.<br />
Governance Applications<br />
We need to distinguish between the governance processing that occurs on ingest and<br />
the governance processing that may occur later. One of the rules of governance itself<br />
may be to specify what processes have to take place before data is made available to<br />
data lake users.<br />
It makes sense to do as much processing as possible while data is on the <strong>Data</strong> Bus<br />
(i.e. held in memory) since it can be accessed very quickly. <strong>The</strong> ideal would be to do<br />
all governance processing on ingest, but this may be impossible. Some data cleansing<br />
activity and some metadata discovery activity requires human intervention and it may<br />
not be practical to do it on ingest. For some data, the chosen policy may be to implement<br />
schema-on-read, so metadata gathering will occur after ingest. <strong>The</strong>re can be other<br />
competing dynamics. <strong>The</strong> desire may be to encrypt all data (or at least all data that is<br />
destined to be encrypted) on ingest. If data is to be stored in HDFS, this is doubly<br />
important, as it is a write-once file system. However, data cleansing will require data to<br />
be unencrypted.<br />
Figure 8. Ingest Governance Apps<br />
<strong>The</strong> data transform and aggregation activities shown in the diagram are not governance<br />
activities per se. From an efficiency perspective it will be better to perform data transformations,<br />
aggregations and other data calculations that are known to be required before data is written<br />
to disk. Of course, some will need to be done later simply because not all the data they<br />
require is in the data stream.<br />
Figure 7. Real-Time<br />
<strong>Data</strong> <strong>Lake</strong> Management<br />
Figure 9 shows the processes that run on the data lake. <strong>The</strong> other on-going activity<br />
aside from data governance is data lake management. This is the system management<br />
activity that monitors and responds to hardware and operating system events and<br />
manages all the applications that dip their toes in the data lake.<br />
<strong>The</strong> data lake is a computer grid that needs to be managed like any other network of<br />
hardware. <strong>The</strong> appropriate system management activities can be many and varied,<br />
including: server performance, availability monitoring, automated recovery, software<br />
management, security management, access management, user monitoring, application<br />
monitoring, capacity management, provisioning, network monitoring and scheduling.<br />
<strong>The</strong>re is nothing new in respect of system management here. However, it is important to<br />
recognize that some of the software employed here will be traditional system management<br />
software and some is likely to be data lake specific (Hadoop management software like<br />
Ambari, Pepperdata for cluster tuning, etc.).<br />
<strong>Data</strong> <strong>Lake</strong> Applications<br />
Figure 9. <strong>Data</strong> <strong>Lake</strong> Applications and Processes<br />
Little needs to be said about data lake applications beyond the fact that businesses<br />
almost always employ data lakes in the same way they used data warehouses for BI<br />
and analytics applications. It is important to note that data self-service is more practical<br />
with data lakes than it ever was with data warehouses.<br />
An effective data self-service capability requires an effective search and query capability.<br />
This in turn requires the existence of a data catalog, which should be a natural result<br />
of intelligent metadata management. It will also require a search capability (after the<br />
fashion of Google search), which can pick out anything in the data lake.<br />
A really comprehensive search capability will necessitate some kind of indexing<br />
activity as part of data ingest. A query capability will need to work in conjunction with<br />
the data catalog. <strong>The</strong>re might be support for multiple query languages (SQL, XQuery,<br />
SparQL, etc).<br />
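The kind of indexing activity implied can be sketched as a tiny inverted index built at ingest time (a toy of what a full-text search engine does at scale):

```python
from collections import defaultdict

index = defaultdict(set)   # term -> set of document/file identifiers

def index_on_ingest(doc_id: str, text: str) -> None:
    """Index each term as the document lands in the lake."""
    for term in text.lower().split():
        index[term].add(doc_id)

def search(term: str) -> list:
    return sorted(index.get(term.lower(), set()))

index_on_ingest("report-1", "turbine vibration anomaly detected")
index_on_ingest("log-7", "network anomaly at 02:14")
assert search("anomaly") == ["log-7", "report-1"]
assert search("turbine") == ["report-1"]
```

Because the index is maintained as data arrives, the search side stays a cheap lookup; this is why the indexing work belongs in the ingest pipeline rather than at query time.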
If we now look at Figure 10, which illustrates the data lake complete, the only two<br />
processes we have not yet discussed are data extracts and data life-cycle management.<br />
While data lake processing can be fast, if particularly high performance is required<br />
for some applications, there will be a need to export data to a fast data engine or<br />
database. It will probably be many years before data lake data access speed gets close<br />
to a purpose-built database.<br />
<strong>The</strong> truth is that the focus of data lake architecture is data ingest and data governance.<br />
<strong>The</strong>re are many processes, all of them important, competing for the same resources.<br />
Figure 10. <strong>The</strong> <strong>Data</strong> <strong>Lake</strong> Complete<br />
<strong>The</strong>re will always be a limit to the capacity of the data lake, and governance processes<br />
naturally take priority, so it will prove necessary to replicate data to other data lakes or<br />
data marts to properly serve some applications or users.<br />
As regards data archive, data life-cycle management can be seen as an aspect of<br />
data governance. It is best thought of as a background process. <strong>The</strong> exact rules of<br />
whether and when data needs to be deleted may be influenced by business imperatives<br />
(regulation), but may also be determined by storage costs. Ideally, archive will be an<br />
automatic process.<br />
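A background life-cycle process of this kind might be sketched as follows. The retention periods and rule names are illustrative assumptions, not a standard; the essential idea is that a regulatory hold overrides cost-driven archiving and deletion.

```python
from datetime import date

# Illustrative policy: regulation wins over storage-cost rules.
RETENTION_DAYS = 365 * 7      # assumed regulation: retain for seven years
ARCHIVE_AFTER_DAYS = 90       # storage cost: demote cold data after 90 days

def lifecycle_action(dataset, today):
    """Decide what the background process should do with one dataset."""
    age = (today - dataset["last_access"]).days
    if dataset.get("regulatory_hold") and age < RETENTION_DAYS:
        return "retain"          # regulation forbids deletion or removal
    if age >= RETENTION_DAYS:
        return "delete"          # past any required retention period
    if age >= ARCHIVE_AFTER_DAYS:
        return "archive"         # cold: move to cheaper storage
    return "keep-online"

today = date(2017, 6, 1)
print(lifecycle_action({"last_access": date(2017, 5, 20)}, today))   # keep-online
print(lifecycle_action({"last_access": date(2016, 1, 1)}, today))    # archive
```

In practice such a routine would run periodically over the data catalog, which is another reason intelligent metadata management matters: the life-cycle process needs to know what each dataset is before it can apply the rules.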
Next Generation Architecture<br />
In our view, the data lake is more than an interesting software architecture that utilizes<br />
many inexpensive open source components; it is the foundation of the next generation<br />
of enterprise software. As such, we can think of there being three major architectural<br />
generations of software:<br />
1. Batch Computing. <strong>The</strong> centralized mainframe architecture, characterized by<br />
all applications running on one (or more) large mainframes usually in a batch<br />
manner.<br />
2. On-line Computing. This involves an arrangement of distributed servers and<br />
client devices. It is characterized by applications interacting with databases in<br />
a transactional way, supported by batch data flows between databases for data<br />
sharing.<br />
3. Real-Time Computing. This is based on event-based applications that employ<br />
parallel architectures for speed and can support real-time applications where<br />
data is processed as quickly as it arrives.<br />
If we look at software architectures in this manner, it is clear that the data lake belongs<br />
to a new generation of software that is built in a fundamentally different way to what<br />
came before.<br />
<strong>The</strong> event-based data lake serves as the system of record of the corporate software<br />
ecosystem and the point of implementation of data governance. <strong>Data</strong> flows into the<br />
lake from data streams or bulk data transfers that have their origin within or outside<br />
the organization. During this process, governance rules are applied to the data to ensure<br />
it is stored in an appropriate form and subject to appropriate security policies. Once<br />
in the data lake, with all the appropriate governance procedures applied, it becomes<br />
available for use. If any data is extracted for use outside the data lake, it will be a data<br />
copy. Ultimately data leaves the lake when it is archived or permanently deleted.<br />
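The governance-on-ingest flow described above can be sketched in outline as follows. The rule set and field names here are assumptions for illustration only; real governance would cover many more policies than masking and provenance tagging.

```python
import hashlib

PII_FIELDS = {"email", "ssn"}   # fields an (assumed) security policy says to mask

def govern_on_ingest(record, source):
    """Apply governance rules to a record as it enters the lake:
    mask sensitive fields and tag provenance before storage."""
    governed = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            # Mask PII with a one-way hash; it can still be joined on,
            # but the raw value never lands in the lake.
            governed[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            governed[field] = value
    governed["_source"] = source        # provenance metadata
    return governed

raw = {"customer": "Acme", "email": "bob@acme.com", "spend": 1200}
stored = govern_on_ingest(raw, source="crm-feed")
print(stored["_source"])                 # crm-feed
print(stored["email"] != raw["email"])   # True
```

The design choice worth noting is that governance runs once, at the boundary: everything downstream of ingest can then assume the data is already in an appropriate, policy-compliant form.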
So, the data lake is not a swamp into which any old data can be poured and lost from<br />
sight. It is the ingest point for the controlled collection and governance of corporate<br />
data. It is the system of record and the foundation for data life cycle management, from<br />
ingest to archive. It is an event management platform and could be viewed as a truly<br />
versatile data warehouse.<br />
As a data warehouse, it is distinctly different from the traditional variety as it can<br />
accept any data with any structure rather than just the so-called structured data that<br />
relational databases store. <strong>The</strong>re is no modelling required in designing and maintaining<br />
the lake, and a query service should be available to retrieve data from the lake – but it<br />
may not be a lightning-fast query service.<br />
While data warehouses were presided over by extremely powerful database<br />
technology with a purpose-built query optimizer, the data lake is likely to be devoid of<br />
such technology. If a fast query service is needed (as is likely) for some of the data<br />
in the lake, it will be provided by exporting that data to an appropriate database.<br />
<strong>The</strong> Logical and <strong>The</strong> Physical<br />
In our view, the two primary dynamics involved in establishing a data lake are:<br />
1. To gradually migrate all the data that makes up the system of record to the data<br />
lake where it becomes the golden copy of the data.<br />
2. For the data lake to become the primary point of ingest of external data, where<br />
governance processing is applied to data, both internal and external, as soon as<br />
possible after it enters the data lake.<br />
We note here that the data lake concept, which was first proposed about five years<br />
ago, has gradually grown in sophistication and thus here we are describing current<br />
thinking about what a data lake is and how to use it.<br />
For companies building a data lake, it is important to think in terms of a “logical data<br />
lake” along the lines we described, and to acknowledge that its physical implementation<br />
may be far more involved than our diagrams suggest.<br />
If the recent history of IT has taught us anything, it is that everything needs to scale.<br />
Most companies have a series of transactional systems (the mission critical systems)<br />
that currently constitute most if not all of the system of record. For the data lake to<br />
assume its role as the system of record, the data from such systems needs to be copied<br />
into the data lake.<br />
Pre-assembled <strong>Data</strong> <strong>Lake</strong>s<br />
For many companies the idea of commencing a strategic data lake project will make no<br />
commercial sense, particularly if their primary goal is, for example, only to do analytic<br />
exploration of a collection of data from various sources. Such a set of applications is<br />
unlikely to require all the governance activities we have discussed. In these circumstances,<br />
the pragmatic goal will be to build the desired applications to a simpler target data lake<br />
architecture that omits some of the elements we have described.<br />
This approach will be easier and more likely to bring success if a data lake platform is<br />
employed, which is capable of delivering a data lake “out of the box.” As previously<br />
noted, vendors such as Cask, Unifi or Cambridge Semantics provide such capability.<br />
<strong>The</strong>y deliver a flexible abstraction layer between the Apache Stack and the applications<br />
built on top of it. <strong>The</strong>y also provide other components for managing, building and<br />
enriching data lake applications.<br />
It is possible to think of such vendors as providing a data operating system for an<br />
expandable cluster onto which you can build one or more applications. It is also feasible<br />
to build many such "dedicated" data lakes with different applications on each. A<br />
company might, for example, build an event log "data lake" for IT operations usage, a<br />
real-time manufacturing data lake, a sales and marketing "data lake" and so on.<br />
One of the beauties of the current Apache Stack is that, with the inclusion of the powerful<br />
communications components Kafka and NiFi, it is possible to establish loosely coupled<br />
data lakes that flow data one to another. If the data is coherently managed, simply<br />
adding clusters like this, whether located in the cloud or on premises, will allow you to<br />
gradually grow the type of system of record we have discussed in this report.<br />
Figure 11. Multiple Physical <strong>Data</strong> <strong>Lake</strong>s in Overview<br />
<strong>The</strong> main point is that there can be multiple physical data lakes, each with ingest<br />
capabilities and governance processes, that constitute a logical data lake, as illustrated<br />
in Figure 11. While the diagram implies that all the physical data lakes are running<br />
roughly the same processes, this may not be the case. <strong>The</strong>re are different reasons for<br />
having multiple physical data lakes. Some might exist entirely for disaster recovery<br />
or as a reserve resource for unexpected processing demand or as a dedicated analyst<br />
sandbox for a specific group of users. Some may simply be established for time zone<br />
or geographical reasons.<br />
Having multiple physical data lakes will complicate the global approach to governance<br />
and establishing a system of record. However, with an intelligent deployment of Kafka<br />
(and possibly also NiFi) to manage the replication and export of data, ensuring that the<br />
physical data lakes correspond to a logical data lake is achievable.<br />
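One way to picture several physical data lakes behaving as one logical data lake is as a federated catalog. The sketch below is a toy model with invented names; the actual movement of data between lakes would be handled by Kafka or NiFi, not by code like this.

```python
class PhysicalLake:
    """One physical cluster with its own ingest and local datasets."""
    def __init__(self, name, region):
        self.name, self.region = name, region
        self.datasets = {}          # dataset name -> payload

    def ingest(self, dataset, payload):
        self.datasets[dataset] = payload

class LogicalLake:
    """A logical data lake: one catalog over several physical lakes.
    (Replication between lakes is assumed to happen elsewhere, e.g. via Kafka.)"""
    def __init__(self, lakes):
        self.lakes = lakes

    def locate(self, dataset):
        # Report which physical lake(s) hold a dataset.
        return [l.name for l in self.lakes if dataset in l.datasets]

    def read(self, dataset):
        # Serve a read from whichever physical lake has the data.
        for l in self.lakes:
            if dataset in l.datasets:
                return l.datasets[dataset]
        raise KeyError(dataset)

eu = PhysicalLake("eu-lake", "eu-west")
us = PhysicalLake("us-lake", "us-east")
eu.ingest("sales_2016", [("FR", 100), ("DE", 250)])
us.ingest("web_logs", ["GET /", "GET /pricing"])

logical = LogicalLake([eu, us])
print(logical.locate("sales_2016"))   # ['eu-lake']
print(logical.read("web_logs")[0])    # GET /
```

The key property is that applications address the logical lake; where a dataset physically resides (time zone, geography, disaster recovery copy) is an implementation detail hidden behind the shared catalog.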
<strong>The</strong> system of record is likely to be logical (i.e. spread physically across multiple<br />
systems and data lakes) for several reasons. One particular cause for this that we<br />
believe is worth discussing is Internet of Things (IoT) applications, where the source<br />
data is created and is likely to remain physically remote.<br />
<strong>The</strong> Internet of Things<br />
<strong>The</strong> IoT is currently in its infancy, although some IoT applications have existed for many<br />
years, particularly those involving mobile phones. <strong>The</strong> Uber and Lyft applications, for<br />
example, are complex internet of things applications.<br />
However, such applications are not what normally spring to mind when the IoT is<br />
mentioned. <strong>The</strong> general idea is that there will be some physical domain – a building<br />
or many buildings or a transport network or a pipeline or a chemical plant or a factory –<br />
and this domain will be peppered with sensors, controllers or even embedded CPUs in<br />
various locations that are, at the very minimum, recording information but may also be<br />
running local applications.<br />
Figure 12. IoT in Overview<br />
Figure 12 illustrates a typical scenario. Consider an example: let us say a car or a<br />
truck or an airplane engine, loaded with sensors. <strong>The</strong> data gathered locally needs<br />
to be marshaled in a local data depot, which may contain considerable amounts<br />
of data, maybe terabytes. Some of that data will probably be processed and used<br />
locally and there may be no need to send it to a central data hub. However, there<br />
are many instances of such an object (car, truck, etc.) and there can be no doubt that the<br />
data gathered for each object will have value in aggregation.<br />
So some of the IoT data will be collected in a central hub so that the aggregate data<br />
can be analyzed. <strong>The</strong> bulk of the data is thus distributed and will remain distributed. It<br />
may even be that some application at some point needs to access the complete collection<br />
of all the data. If so, it will be far more economical to run a distributed application across<br />
all the depots than to try to centralize the data.<br />
A data lake might be involved in IoT applications like this, but its scale-out capabilities<br />
may have little importance for such applications. <strong>The</strong> scale-out applications of the IoT<br />
will probably be handled via distributed processing.<br />
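The alternative to centralizing depot data – running the computation where the data lives and moving only small partial results – can be sketched in a map-reduce style. The depot names and readings below are invented for illustration.

```python
# Each depot holds raw sensor readings locally (terabytes, in reality).
depots = {
    "depot-a": [72.1, 73.4, 71.9, 74.0],
    "depot-b": [68.5, 69.2],
    "depot-c": [75.3, 74.8, 76.1],
}

def local_summary(readings):
    """Runs AT the depot: reduces raw data to a tiny partial result."""
    return {"count": len(readings), "total": sum(readings)}

def central_aggregate(partials):
    """Runs at the central hub: combines partial results only."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return total / count

# Only the small summaries cross the network, never the raw readings.
partials = [local_summary(r) for r in depots.values()]
print(round(central_aggregate(partials), 2))   # 72.81
```

Because each partial result is a few bytes regardless of how much raw data a depot holds, the network cost of the distributed approach stays flat as the fleet of objects grows.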
<strong>The</strong> System of Record in Summary<br />
We have positioned the data lake as being a next generation technology concept,<br />
founded on the use of parallel processing in combination with a whole series of new<br />
software components, the majority of which are Apache projects.<br />
In this new ecosystem, the system of record, which historically was regarded as the<br />
data of the primary transactional applications of the business, will reside (mainly) in<br />
the data lake, where the purifying processes of data governance will be applied to it<br />
on ingest.<br />
<strong>The</strong> system of record will no longer consist entirely of the transactions (or events) of<br />
the business. It will also include data from other sources, which the business uses to<br />
perform analytics and inform its users of important information on which decisions can<br />
be based. <strong>The</strong> system of record will be, as it always was, the golden copy of corporate<br />
data and the audit trail of the IT activities of the business.<br />
Thank You To Our Sponsors:<br />
About <strong>The</strong> Bloor Group<br />
<strong>The</strong> Bloor Group is a consulting, research and technology analysis firm that focuses on open<br />
research and the use of modern media to gather knowledge and disseminate it to IT users.<br />
Visit both www.BloorGroup.com and www.InsideAnalysis.com for more information. <strong>The</strong><br />
Bloor Group is the sole copyright holder of this publication.<br />
Austin, TX 78720 | 512-426-7725<br />