
The Bloor Group

The Data Lake Survival Guide
The What, Why and How of the Data Lake

Robin Bloor, Ph.D. & Rebecca Jozwiak

RESEARCH REPORT

"Perhaps the truth depends on a walk around the lake."
~ Wallace Stevens


The Genesis of the Data Lake

In times past, when thinking about digital data, it made sense to segregate transactional data - the data captured in business applications, stored in database tables and presented by BI tools - from all other data: emails, web pages, images, video and so on. Nowadays we tend to refer to such "other data" as unstructured data. Of course it is not unstructured in the true sense of the word, but it does not have a convenient structure for storage in a relational database. It usually lives in files or more specialized databases and, until recently, it was rarely analyzed.

Nevertheless it was analyzable, and software for deriving value from such data has crossed the chasm. Combine this with the reality that there is even more value in joining some of this data with regularly structured data for additional analytics, and you have a strong motive for re-imagining the idea of a data warehouse.

It was that analytical imperative, more than anything else, which gave rise to the original concept of a data lake: a data store for both species of data and, additionally, for data harvested from multiple sources external to the business, some of which was inevitably unstructured.

Data Flow Architectures

The now-aging data warehouse architecture ruled the data empire for two decades or more and will probably continue to play a role in the data lake architecture that supersedes it - but only as a supporting actor. Its longevity stands as a testament to its effectiveness. It was the first generation of data flow architecture and, as is the case with the data lake, its claim to fame was in providing data to BI and analytics applications. Figure 1 below provides a simplified conceptual illustration of this architecture.

Data flows from OLTP databases via Extract, Transform and Load (ETL) software to the data warehouse. Queries and other apps access the data warehouse, and more ETL software passes data into data marts against which other (BI and analytics) applications run.

[Figure 1. Conceptual Data Warehouse Architecture: in the data layer, OLTP Apps feed OLTP DBMSs; ETL moves data to the Data Warehouse; further ETL feeds the Data Marts; Query Apps access the warehouse and Data Mart Apps access the marts.]

The data layer for business applications thus comprises transactional databases, a data warehouse and data marts consisting of subsets of data drawn from the data warehouse.
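To make the flow concrete, here is a minimal sketch of a single ETL step in Python. It is illustrative only: the table names, column names and database files are our assumptions, not part of any particular product, and it presumes the source database already contains an orders table.

    import sqlite3  # stands in for any OLTP source and warehouse target

    def etl_step(source_db, warehouse_db):
        """Extract raw order rows, transform them, and load a summary table."""
        # Extract: pull transactional rows from the OLTP database.
        src = sqlite3.connect(source_db)
        rows = src.execute("SELECT customer_id, amount FROM orders").fetchall()
        src.close()

        # Transform: aggregate per customer (cleansing or restructuring
        # would also happen at this stage).
        totals = {}
        for customer_id, amount in rows:
            totals[customer_id] = totals.get(customer_id, 0.0) + amount

        # Load: write the refined rows into the warehouse schema.
        wh = sqlite3.connect(warehouse_db)
        wh.execute(
            "CREATE TABLE IF NOT EXISTS customer_totals "
            "(customer_id INTEGER PRIMARY KEY, total REAL)"
        )
        wh.executemany(
            "INSERT OR REPLACE INTO customer_totals VALUES (?, ?)",
            totals.items(),
        )
        wh.commit()
        wh.close()

    etl_step("oltp.db", "warehouse.db")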


Real implementations are much more complex, usually involving a data staging area where data is placed prior to ingestion into the data warehouse. This may be necessary for operational reasons, such as the data warehouse needing to limit data ingest to particular times. Alternatively, data may need to be cleaned or restructured before ingest. In some instances, because data took too long to flow through to a data mart, yet another database - called an Operational Data Store (ODS) - would be created to provide a more timely service to BI dashboards.

The need for such awkward maneuvers might have been eliminated in time by the increasing power of hardware. However, this approach was constrained by other factors, all of which we will identify and discuss in detail later.

The Value of Data

Businesses do not remain static. Their processes change and evolve, their business models change and the markets they serve are gradually reshaped. Precisely how this happens varies, but generally we can think of there being a simple feedback loop which governs the process. We illustrate this in Figure 2.

The feedback loop has three steps:

• Plan (business process design and implementation)
• Run operational business processes
• Review operational business processes

There may be many manual elements in this; it is rarely driven by IT, although IT normally contributes. BI and analytics have the clear role of providing information either to assist operational business processes or to assist planning and change management activities.

[Figure 2. Change: a feedback loop linking Planning & Change Management, Operational Activity and Business Intelligence & Analytics.]

Conceptually, there is nothing new at all about this view of company behavior and the role of BI and analytics. What has drawn attention to Big Data and the BI and analytics applications it supports is that the technology parameters have shifted dramatically, as we shall discuss later in this report. If the Big Data opportunity is pursued, the efficacy of this corporate feedback loop will be improved. Organizations will be more "data driven" than before and their success will be more dependent on making effective use of technology for this purpose. This is why some businesses are now chanting the mantra, "data driven, data driven, data driven."

The triumph in 1997 of IBM's Deep Blue computer in a chess match with world chess champion Garry Kasparov, the later victory in 2011 by IBM's Watson computer system against three Jeopardy champions, and the recent (2016) victory by Google's AI against the world Go champion have demonstrated beyond argument that computer intelligence can now outstrip the most intelligent humans in well-defined contexts.


No doubt analytics technology, too, is now far better at taking certain decisions than even the most skilled humans. Whether computer skills will soon usurp human skills in most aspects of running a business is thus worth considering. To explore this possibility, we need to define, examine and discuss the data pyramid.

The Data Pyramid

The data pyramid illustrated in Figure 3 below was first conceived in the 1990s as part of a philosophical and technological study of artificial intelligence at Bloor Research. Variations of this simple model have appeared occasionally since then. The fundamental point it makes is that data has to go through refinement before it becomes useful to people.

[Figure 3. The Data Pyramid: new data (signals, measurements, recordings, events, transactions, calculations, aggregations) is refined upwards into information (linked data, structured data, visualization, glossaries, schemas, ontologies) and then into knowledge (rules, policies, guidelines, procedures).]

The data pyramid has four layers, which we define precisely as follows:

Data

We define data to mean records of events or transactions from some source. A data record indicates a particular state of something or possibly a change between two states. Such a record is a "data point." For practical IT purposes it needs to record its time of birth and other data items that identify the data's origin. While a data item of this kind may play a role in an operational business system, collections of such data and their analysis are required to create useful "business intelligence."

Information

For data to become information it requires context. This is a matter of making connections. A single customer record, for example, lacks context, which may be provided by orders, contact records and so on. If you look at a set of customer data it may yield information that is not part of any single record. As soon as you have multiple records you can calculate categories (gender, age, location, etc.). From a "business intelligence" perspective you are creating information with such activities. When you link customer data to all the orders placed by a customer, you can generate useful information about that customer's buying patterns.
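As a minimal illustration of this linking step, the sketch below joins hypothetical customer and order tables with pandas; the column names and values are assumptions made up for the example.

    import pandas as pd

    # A customer record on its own carries little context.
    customers = pd.DataFrame({"customer_id": [1, 2], "region": ["East", "West"]})
    orders = pd.DataFrame({
        "customer_id": [1, 1, 2],
        "amount": [120.0, 80.0, 310.0],
    })

    # Linking the records creates information that no single record contains:
    # per-customer buying patterns.
    linked = orders.merge(customers, on="customer_id")
    patterns = linked.groupby(["customer_id", "region"])["amount"].agg(["count", "sum"])
    print(patterns)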

We define information to be: collections of data linked together for human consumption. In general, BI products are software products that present information, possibly as reports or visually on dashboards. Some BI products are interactive, enabling the user to slice and dice the information. BI tools such as spreadsheets or OLAP tools can be thought of as user workbenches for the further analysis of information. The databases and data warehouses that feed such tools store information in a semi-refined form.

Knowledge

We define knowledge to be information that has been refined to the point where it is actionable.

Consider the BI tools that simply present information for decision support. The user knows their own context and consumes the information to take an action, such as resolving an insurance claim or approving a loan. The user has the knowledge of what to do and the BI tool assists by providing information.

In the case of BI tools that enable data exploration, the user has some idea of what he or she needs to know, explores the information to create that knowledge and then applies it to their business context. As such, the knowledge of how to explore the data lives in the user. Such a user can accurately be described as a knowledge worker.

Knowledge can also be stored in computer systems. This is where we encounter rules-based systems and all the technology that is normally classified as AI. But knowledge manifests in computer systems in many other ways. The people who run a business create business processes to carry out particular activities. These are normally improved over time on the basis of acquired experience (feedback). They may even be fully automated - converted into software and implemented without the need for any human intervention. This is implemented knowledge.

Indeed, all software, no matter what it does, can be classified as implemented knowledge. Nevertheless, within any business there will also be other knowledge: rules, procedures, guidelines and policies that are not automated and are implemented by staff.

Data analytics, or Data Science as it has now been named, is the activity of trying to discover new knowledge from data by applying mathematical techniques to reveal previously unknown patterns. It is science of a kind, in the sense that the data scientist may formulate and then test hypotheses, although there are some brute-force techniques that can discover patterns without the need to hypothesize. It is not the only way to discover new knowledge, but it can be a very powerful and rewarding route.


Understanding

We define understanding to be the synthesis of both knowledge and experience. Currently this can reside only in people, and it is probably the case that it will only ever reside in people. Although computers may be able to model most human intellectual processes, they only ever execute directives. Higher human cognitive activities such as taking initiative, contemplation, pondering and abstract thought are beyond the remit of the machine.

Understanding is served by knowledge and information and, as such, analytics and BI systems can considerably enhance the capabilities of users at every level within an organization, especially those with a deep understanding of the activities and dynamics of a business. However, every business, no matter how well it exploits the opportunities that such technology presents, also has to cater for change. Human understanding is what drives and responds to change.

BI And Analytics As Business Processes

The creation of BI services and the generation of business insight through analytics are themselves business processes, or at least they should be. They are also the natural data lake applications. Hence the user constituency that is likely to benefit most from building a data lake is the community that creates and uses such capabilities.

The marketing hype surrounding Big Data and the potential importance of data scientists to the success of the business have tended to obscure the nature of this business process. It can be viewed from two perspectives:

• R&D
• Software Development

Data science teams investigate aspects of the business statistically to discover useful and actionable knowledge. Such activity should properly be regarded as research into business processes (R&D). When new insights are discovered, the subsequent exploitation of these insights is clearly a software development process.

We illustrate this in Figure 4 below. The analytical exploration of data to generate knowledge or enhance existing knowledge is very similar to software development, in the sense that when new knowledge is discovered it is likely to be used to enhance computer systems. Where it differs from software development is that it is an R&D activity, and software development rarely is.

Once new knowledge is discovered and implemented, we get the situation depicted on the right-hand side of Figure 4. Business intelligence systems may be enhanced to improve decision making. This might simply be a matter of upgrading passive decision support, such as dashboard information or sending a specific alert to the user in a specific context, or it might lead to the upgrade of interactive decision support capabilities such as OLAP software or data visualization software like Tableau.


[Figure 4. Analytics and BI, Development and Implementation. On the development side, a data scientist performs analytic exploration of data sets to produce new knowledge; on the implementation side, that knowledge feeds passive decision support, interactive decision support and automation for users, each drawing on its own data set.]

Alternatively, the knowledge may be automatically included in an operational system, improving it in some way. The illustration does not try to elaborate on how an analytic discovery is made operational, since this varies according to context.

The Data Lake Dynamic

The fundamental assumption of the data warehouse architecture was that there needed to be a very powerful query engine (database) at the center of the data flow. It thus suggested a centralized architecture where, first of all, data flowed to the data warehouse. It was then used in place or it was distributed from there for use elsewhere.

The fatal flaw of this architecture was that it did not scale out well. However, this limitation did not become apparent until a whole series of forces came into play. They were:

• The need to analyze unstructured data, both external and internal. The need for this continues to grow.

• External data sources began to multiply. Particularly prominent in this was social media data, but it was by no means the only source. Until recently, selling or renting data was a niche activity, but this has ceased to be the case. An expanding amount of valuable data is now bought and sold publicly.

• Traditionally, analytics applications lived in "walled gardens" served by their own data mart. However, a data lake could do service as an analytics sandbox - a useful idea given the incursion of external data that data analysts wished to explore.

• Hadoop (and later Spark) with their Open Source software ecosystems gained traction, and those ecosystems began growing apace. Such software is very low cost to adopt. There were significant economies available just for off-loading tired data from data warehouses to a data lake.

• The parallelism of these Open Source environments made it possible to run some analytic applications much faster than had previously been possible.

• The Hadoop/Spark ecosystems were further strengthened. New metadata capture and data cleansing products emerged. Security products appeared that strengthened the initially poor security capabilities of these environments.

• The concept of a data lake (or data hub) quickly gained acceptance as an architectural idea for data management.

• Cloud vendors, particularly Amazon with EMR and Microsoft with Azure, quickly identified the opportunity and built easily deployed data lake environments.

• The release of Open Source Kafka, in combination with Spark's micro-batch capability, caused some companies to experiment with near real-time applications, significantly expanding the data lake's areas of application towards real-time analytics. This move will no doubt continue as other open source streaming capabilities, such as Flink, mature.

• Kafka can now be regarded as a foundational data lake component (or, if you use MapR, then MapR Streams fulfils the same role). It was donated to Open Source by LinkedIn in a fully proven form. As a fast publish-subscribe capability it enables replication and disaster recovery configurations, as well as making it possible to treat multiple physical data lakes as a single logical data lake.

• The traditional data warehouse was never an ingest point in itself. Data was prepared in various ways, in a "staging area," before going to the data warehouse. In contrast, a data lake is capable of being a staging area, not just for corporate data but for all data, including unstructured data.

• The data lake provided a means of acquiring data prior to modeling the data (for inclusion in a database or data warehouse). Since a good deal of data could be processed immediately without such modeling, considerable time could be saved. For applications where "time to market" matters, the data lake delivers.

• A whole series of data governance issues began to raise their heads. Data governance had never been an important issue - in fact, it hadn't even been well defined - until gathering large collections of data into data lakes became a possibility. Then a whole series of issues, from data security through data lineage to data archive, stumbled into the spotlight. How data should be governed has now become a pressing question.


Hardware Disruption

To explore the emerging reality of the data lake, we now need to consider how computer hardware technology has been changing and may continue to change. Technology change can be dramatic, and in respect of the data lake it has been: dramatically disruptive.

Moore's Law, which has remained true for CPU power over a span of more than 40 years, regularly delivers a doubling of CPU power roughly every 18 months; compounded, that amounts to roughly a hundredfold increase per decade. At the hardware level nothing else kept pace with this. DRAM (memory) managed to do so up to the year 2000, but faded after that. Disk was far worse, lagging terribly, but it is now being superseded by SSDs (solid state drives), which recently climbed onto the Moore's Law curve in the sense that their speed is roughly doubling every 18 months.

Moore’s Law was disruptive, but we gradually adjusted to its disruptive impact, as<br />

it gifted us its regular increases in speed. It transformed PCs into throw-away with a<br />

life span of 4 or 5 years. PCs were superseded by laptops which in turn are giving way<br />

tablets. It had a similar impact on some PC software, such as email which went to live<br />

on the Internet, soon to be joined by personal applications and graphics applications.<br />

Moore’s Law had a distinctly different impact on server software. <strong>The</strong> databases<br />

technology that was born the 1980s and 1990s evolved to keep pace and so did the ERP<br />

systems. On the server side extra power simply made large databases and applications<br />

faster. For a while, much of the additional server power was simply squandered,<br />

with CPUs often idling. This created a market for virtual machines that enabled more<br />

applications per server.<br />

It wasn’t until about 2010 that server software applications set off in a new direction.<br />

In 2004 it ceased to become physically possible to increase CPU clock speeds to garner<br />

extra power and thus CPUs, still capable of further miniaturization, became multicore.<br />

This trend gradually forced software development to take advantage of parallel<br />

processing.<br />

Multicore and its tangled web

The hardware landscape used to be simple. There was CPU, memory, disk and network technology. CPU kept getting faster, as did memory and disk, even if they didn't really keep pace. The evolution of hardware was thus reasonably predictable, but that stalled in 2004. The chip vendors (Intel, AMD, IBM, ARM, etc.) were forced to add more cores to each CPU to make their previous chips obsolete, which is how the game is played.

Such a major change in direction forced the software world to adjust. It took time. Operating systems needed to adjust, compilers needed to adjust, development software needed to adjust and, most of all, software developers needed to adjust. It might have been relatively plain sailing if that was the whole story. But it wasn't. There were also GPU chips (Graphical Processing Units), FPGAs (Field Programmable Gate Arrays) and SoCs (Systems on a Chip).

We thought of CPUs and GPUs as different beasts of burden with different loads on their backs: general purpose processing and graphical processing. But GPUs are not confined to graphics; they are equally suited to high performance computing tasks (numeric workloads), including analytic processing.

This led to the development of GPGPUs (General Purpose GPUs), which are far faster than CPUs for such tasks, having many more processor cores. A few companies (Kinetica, BlazingDB and MapD) have exploited such hardware to build database servers that sport ultra-fast databases, proving that a great deal of a database's workload can run on GPGPUs.

An FPGA (Field Programmable Gate Array) is, as the name suggests, a chip that you can add logic to after it has been manufactured - that is what "field programmable" means. In order of speed, GPUs are faster than CPUs, which are faster than FPGAs. The virtue of the FPGA is that you can configure it to run a particular program that is frequently used. Once configured for such a program, it runs like a race horse. FPGAs contain arrays of programmable logic blocks along with a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together" in different configurations. This, along with a little memory, makes it possible to purpose-build a chip for a given application.

We know of two hardware vendors that have demonstrated the virtues of combining different types of processing unit in custom-designed hardware. Velocidata, with its Enterprise Streaming Compute Appliance (ESCA), combines the power of CPU, GPU and FPGA to dramatically accelerate stream processing, particularly for ETL applications. Ryft, with its Ryft One, combines a CPU with an array of FPGAs to provide a very fast search and query capability on large volumes of data.

There is a convergence in progress between the CPU and GPU. It began with Intel in 2010, with its line of HD Graphics chips. Intel describes these as a CPU with integrated graphics. In 2011 AMD created what it called an APU (Accelerated Processing Unit), which involved the same marriage of CPU and GPU on a single chip. Such chips can be used in PCs and similar devices, eliminating the need for a separate GPU.

Following its acquisition of Altera (an FPGA and System-on-a-Chip company), Intel recently announced Xeon chips with a built-in FPGA, as well as Stratix FPGAs and SoCs. The market for Xeon with FPGA has yet to become clear, although it is not hard to imagine companies adding analytics logic to the FPGA portion of the chip to create specialist servers, in the same way that Velocidata added ETL logic and Ryft added search logic to FPGAs.

As regards the SoC, the big market here is cell phones and tablets, although one can easily imagine that they will also play a role in the Internet of Things, perhaps a very major one.

Processors are evolving in a variety of ways. The x86 chip dominated the industry on the desktop and eventually on the server, to the greater glory of Intel, but it is outgunned by the ARM chip in respect of numbers (tablets, cell phones and other devices). The problem that these and other chip vendors face is that soon it will no longer be possible to miniaturize circuits any further - so Moore's Law will cease to deliver its regular bounty. Thus the CPU vendors are exploring innovative alternatives to keep the plates spinning.

The persistence of memory

RAM (DRAM and SRAM) is fast volatile memory. It is extremely fast compared to traditional spinning disk (or "spinning rust," as it is sometimes called). The figure varies according to circumstance, but a rule of thumb is 100,000 times faster for random access - less so for serial access. The best current (2017) figures for RAM speed are 61,000 MB/sec (read) and 48,000 MB/sec (write).

Currently there are three disruptive memory technologies making their way to market. At the time of writing, software and hardware vendors are experimenting with them. They differ from RAM in three respects: they are slightly slower, non-volatile and about half the price. Two of them, from Intel (called 3D XPoint) and IBM (called PCM, or phase change memory), are fairly new. The third technology, HP's memristor, was announced way back in 2008 but has been slow to develop. Nevertheless, HP is partnering with SanDisk to develop what it calls Storage-Class Memory (SCM), offering memory with distinctly similar capabilities to Intel's 3D XPoint and IBM's PCM.

From the hardware perspective it doesn't matter which of these technologies dominates, since their primary characteristics are fairly similar. However, what isn't certain is what a standard server will look like once these technologies kick in, and we won't know until it happens.

It is worth mentioning, en passant, that the volume of RAM relative to the CPU on commodity servers continues to increase, and with the advent of these new memory technologies that trend will likely persist.

SSD: RIP Spinning Disk?

Solid State Disk (SSD) is replacing spinning disk almost everywhere. Spinning disk is still cheaper (by a factor of 5), but SSD is faster in most contexts (and can be a great deal faster). SSD used to be limited in volume, but last year Seagate surprised the world with a 60 Tbyte SSD and erased that complaint.

SSDs can be accessed in a parallel manner, which accelerates read speeds considerably. To achieve this, SSDs "stripe" data into arrays so that when a read operation spans data across multiple arrays, the on-disk controller issues parallel reads to get the data. The latency of each fetch will be constant. To leverage this you need to be careful about how and where you write the data. Aerospike, the high-performance database, uses SSD in this way.
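A toy illustration of the striping idea in Python, not of any real controller: chunks are written round-robin across arrays, and a read spanning all arrays is issued as concurrent fetches.

    from concurrent.futures import ThreadPoolExecutor

    STRIPES = 4

    def write_striped(chunks, arrays):
        # Spread consecutive chunks across the arrays, round-robin.
        for i, chunk in enumerate(chunks):
            arrays[i % STRIPES].append(chunk)

    def read_striped(arrays):
        # The controller issues the per-array reads concurrently, so the
        # total latency approaches that of a single fetch.
        with ThreadPoolExecutor(max_workers=STRIPES) as pool:
            return list(pool.map(list, arrays))

    arrays = [[] for _ in range(STRIPES)]
    write_striped(["c0", "c1", "c2", "c3", "c4", "c5"], arrays)
    print(read_striped(arrays))  # chunks grouped by array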

SSDs are not much better than spinning disk in write-heavy applications, because they need to re-write entire blocks at a time (a read followed by an erase followed by a write). The more sophisticated drives organize write activity to minimize this.


The Typical Server

Not so long ago, a typical server was (primarily) a CPU with memory and disk storage. So consider how much more complex it could become for the software engineers who need to make it dance:

• The CPU is multicore, with up to 12 cores.

• Alternatively, it may have fewer cores but be integrated with GPU or FPGA capability.

• The CPU has three exploitable layers of storage for itself: level 1, 2 and 3 cache, which can be exploited, for example, for vector processing or data compression/decompression.

• Configurable memory (DRAM) has grown to the terabyte level.

• A new fast-access memory capability is emerging that is significantly faster than current SSDs and a little slower than memory.

• SSDs are swiftly replacing spinning disk, but software engineers need to know how best to use them.

It may be a while before we know what a "standard server" is going to become. Nevertheless, it cannot be in the industry's commercial interest for there to be too much variance. A consensus will emerge; what it will be is difficult to predict.

The Changing of The Guard

A further disturbing variable in this picture is the growing impact of the cloud. Amazon currently dominates cloud computing with 31 percent market share, with Microsoft playing second fiddle at 9 percent. As far as we are aware, both companies have projects to design and build their own chips (almost certainly based on ARM designs). Given the economies of scale of the cloud operations of both companies, it makes sense for them to do this.

The impact either company could have on the future architecture of commodity servers is impossible to predict, aside from the fact that they will doubtless build infrastructure that is easy to migrate software to. It is not beyond the bounds of possibility for either company (or Google, for that matter) to enter the chip market.


Parallelism Playing Havoc

By 2010 we, at The Bloor Group, began to observe a significant increase in the speed of server software occurring as a consequence of parallelism. That was in the early days of the Hadoop project, which had been inspired by Google and Yahoo's use of scaled-out server networks.

Let’s think about this. Google and Yahoo were the first two Web 2.0 businesses. <strong>The</strong>y<br />

had been forced to find their own solutions to big data problems. Searching the web<br />

was a completely new application. No developers had ever built software to solve a<br />

problem like that. First you send out spiders; you run software that accessed every<br />

web site, including new web sites, gathering information about new web pages and<br />

changes to old pages. After you harvest the data, you compress it to the digital limit<br />

and add it to your big data heap. That’s the relatively easy part of the problem. <strong>The</strong><br />

hard part is updating the indexes on the huge data heap and letting the world pick at<br />

it. It was the first of these two problems that gave MapReduce its start in life.<br />

The idea of "mapping" and "reducing" was not new. It is a relatively old technique that emerged from functional programming, a 1970s programming paradigm. MapReduce, as invented by Google, was a development framework that was scalable over grids of servers and ran in a parallel manner. It provided a solution to the indexing problem.
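To show the shape of the paradigm, here is a minimal map/reduce word count in plain Python. It mimics the classic textbook example rather than Google's actual framework; in a real deployment the map, shuffle and reduce phases run in parallel across a grid of servers.

    from collections import defaultdict

    # Map: emit (key, value) pairs from each input record independently.
    def map_phase(documents):
        for doc in documents:
            for word in doc.split():
                yield word, 1

    # Shuffle: group values by key (the framework does this across servers).
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    # Reduce: combine the values for each key into a result.
    def reduce_phase(groups):
        return {word: sum(counts) for word, counts in groups.items()}

    docs = ["the lake", "the data lake"]
    print(reduce_phase(shuffle(map_phase(docs))))
    # {'the': 2, 'lake': 2, 'data': 1}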

It was research activity by Doug Cutting and Mike Cafarella at Yahoo which spawned Hadoop, with its Hadoop Distributed File System (HDFS) and MapReduce framework. Quite likely the project would not have taken flight if Yahoo had not decided to throw it to Apache as an open source project with Doug Cutting in the pilot's seat. Soon Hadoop distribution companies sprouted up: Cloudera in 2008, MapR in 2009 and Hortonworks in 2011.

Under the auspices of The Apache Software Foundation (ASF), Hadoop acquired a coterie of complementary software components: Avro and Chukwa in 2009, HBase and Hive in 2010, then Pig and ZooKeeper in 2011. Soon ASF - a nonprofit corporation devoted to open source software - acquired a destiny. It provided a well-honed process for incubating the development and assisting the delivery of open source products. It wasn't long before it was supervising over 100 such projects, about one fifth of which were Hadoop related.

The commercial early adopters of Hadoop began to trickle in around 2012. The trickle soon became a stream, then the stream became a river, and the river flowed into a lake. Hadoop and its growing band of open source jigsaw pieces were truly inexpensive. The software was free until you needed support, and if you cared to assemble a few old servers, you could prototype applications and experiment with parallel computing for almost nothing.

Before Hadoop danced into the data center with its scale-out file system, no such scale-out file system existed. The assumption had always been that if you had data that needed to scale out in a big way - hundreds of terabytes or petabytes or beyond - you needed to put it in a database or a data warehouse. Hadoop changed all that.


YARNing towards the data lake

Until the release of YARN (Yet Another Resource Negotiator), Hadoop was truly limited. Its two main constraints were that it ran just one task at a time and that software development was tied to the use of MapReduce. YARN, released in late 2013, provided a scheduler, and in the same release the enforced use of MapReduce was removed.

Once Hadoop had a scheduling capability, it was possible for multiple applications to share HDFS data concurrently, making it far more appropriate for many applications. Given direct access to HDFS, applications could leverage Hadoop hardware resources in any way they chose. If you were watching closely, you would have noticed a slew of commercial software vendors announcing compatible software at the time of the announcement. And of course more followed later.

Mesos, another open source scheduling capability, built for data center scheduling, then stepped into the frame, and soon after that the Myriad project was set up to enable Mesos and YARN to work together.

Once process scheduling was possible, the idea of a data lake - as a kind of data hub - took off. With the reality of multiple concurrent tasks, it was definitely possible to have one process or set of processes ingesting data, another process doing analytics and a third process carrying out ETL on the data.

The Spark Phenomenon

Hadoop disrupted the data warehouse world; then Spark disrupted Hadoop. Described as "lightning fast," it seemed to emerge from nowhere in 2015. But of course it didn't. Spark began life at UC Berkeley's AMPLab in 2009 and was open sourced in 2010 (under a BSD license). It became an Apache project in 2013, about the time that YARN was released.

For a variety of reasons it turned out to be a far better development layer than MapReduce. It was an in-memory distributed platform comprising a collection of components that could accelerate batch analytics jobs, including machine learning applications, and could also handle interactive query and graph processing. It was not built to be a stream processing capability, but was capable of doing such work (via Spark Streaming).
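As a small illustration of the in-memory, parallel style Spark encourages, the PySpark sketch below counts words across a data set; the HDFS input path is an assumption made up for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

    # The operations below are distributed across the cluster; intermediate
    # results can be cached in memory rather than written to disk.
    lines = spark.sparkContext.textFile("hdfs:///data/lake/sample.txt")
    counts = (
        lines.flatMap(lambda line: line.split())   # split lines into words
             .map(lambda word: (word, 1))          # emit (word, 1) pairs
             .reduceByKey(lambda a, b: a + b)      # sum counts per word
    )
    counts.cache()  # keep in memory for repeated interactive queries
    print(counts.take(10))

    spark.stop()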

Spark was entirely independent of Hadoop. However, it was often implemented with Hadoop, with HDFS providing the file system. The major Hadoop distros quickly added Spark to their Hadoop bundles.

The Hercules Effect

Yahoo started the Apache revolution when it threw Hadoop into the open source pot. This was not an altruistic gesture but an act of self-interest. It had petabytes of data stored in Hadoop against which it ran many applications. Yahoo figured it would save time and money if it let external developers contribute to Hadoop's development and shared the bounty.


This collaborative approach to software development established a trend. Facebook, with its exabyte volumes of data, was built on open source from the ground up. Imitating Yahoo, it also had gifts to give Apache: Hive, the quasi-SQL capability, and the graph processing system Giraph. Its most recent donation was Presto, the distributed SQL query engine, for which Teradata now offers commercial support.

LinkedIn's open sourcing of Kafka (a persistent distributed message queue) and Voldemort (a distributed key-value store) can be added to the list of Apache gifts - as can Parquet (a column-store capability developed in a collaboration between Twitter and Cloudera).

There is a huge difference between a new Hadoop component, whose development only recently started, and one that drops magically from the sky into the ecosystem, fully tested and implemented over years by a highly scaled web business. The first kind of component may take a long time to mature, while the other is born fully formed, like Hercules of Greek myth.

The addition of Kafka to the Hadoop ecosystem was particularly important. Kafka provides a publish-subscribe capability that enables data to be streamed in a managed fashion from one application to another or from one Hadoop or Spark cluster to another.
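A minimal sketch of that publish-subscribe pattern, using the third-party kafka-python client; the broker address and topic name are assumptions made up for the example.

    from kafka import KafkaProducer, KafkaConsumer

    # Publisher: applications append event records to a named topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("lake-ingest", b'{"event": "page_view", "user": 42}')
    producer.flush()

    # Subscriber: any number of independent consumers can read the same
    # stream, which is what makes replication, disaster recovery and
    # cluster-to-cluster data flow possible.
    consumer = KafkaConsumer(
        "lake-ingest",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )
    for message in consumer:
        print(message.value)
        break  # one message is enough for the sketch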

In mid-2015, Kafka was joined by another important communications capability that goes by the name of NiFi, short for NiagaraFiles. The software was developed by the NSA over a period of 8 years, with the goal of automating the flow of data between systems at scale, while dealing with the issues of failures, bottlenecks, security and compliance, and allowing changes to the data flow. Taken together, Kafka and NiFi provide a complete data flow solution that scales as far as the eye can see, and beyond.

If you consider the whole Apache Hadoop stack, the collection of open source software that emerged in the wake of Hadoop, it obviously constitutes a new environment for building and deploying parallel software applications. It is remarkably inexpensive, both because of its open source nature and because of the commodity hardware on which it runs. We sometimes think of this stack as constituting a data layer OS, a distributed operating system for data.

It doesn't quite qualify, but it is close and it is moving in that direction.


The Next Gen Stack

The hallmark of the Apache Hadoop stack, in all its glory, is that it is built for parallel operation on commodity hardware. If you have experience of it, no doubt you will be able to report that some of it is high quality and some of it is less so. However, all of it is under continuous development and will likely improve. It is not tightly integrated in the manner that one might expect were it provided by a single software vendor like Microsoft or Oracle.

Also, which components to choose and why can be confusing. This is certainly the case when it comes to streaming applications. Spark can build streaming applications of a kind, but in reality it processes micro-batches quickly, which technically is not streaming. Another component, Apache Storm, has been built for true data streaming. And yet another component, Apache Flink, does streaming and can also do micro-batch processing. There is competition within the Apache stack for streaming applications, and that may remain so for a while.

A further potential point of confusion is the existence of three major Hadoop distributors: Cloudera, MapR and Hortonworks. There are many similarities between these distributors in that they all pursue a similar business model that emphasizes support revenues, and all provide downloadable free versions of their distributions. Hortonworks is the "pure play," while MapR and Cloudera provide "premium" distributions to their paying customers.

Technically, MapR is the most distinct, providing what it calls a converged platform that includes the MapR file system, which is read/write (rather than HDFS, which is append-only), MapR Streams (a capability similar to Kafka but more sophisticated) and MapR-DB, a NoSQL database. Cloudera also has some unique components, including Cloudera Manager, Cloudera Navigator Optimizer, Cloudera Search, the Kudu file system, which allows read/write, and its Impala database. In contrast, Hortonworks tries to remain true to the Apache Hadoop stack.

There are other distributions, too. Cloud vendors, such as Amazon and Microsoft, provide their own Hadoop distributions. A Hadoop distribution contains many components (Cloudera's currently has 20, for example) and others can be added. It is up to the customer to determine which components are required, and that is to some extent application dependent.

Those who have never experimented with Hadoop may assume that it's relatively simple to install and get running. However, it ain't necessarily so. Aside from anything else, there's a need to either hire experienced programmers or train existing ones in the use of a tribe of components: MapReduce, Hive, Pig, HBase, Spark and others. You are taking on a new and complex software environment, much more so than if you were implementing a Windows or Linux server. And then you are going to build applications to run on it.


Data Lake Products

The alternative to acquiring technical staff to understand the nuts and bolts of the Apache Hadoop Stack is to buy technology from one of the Hadoop integration vendors or, as they are also called, "data lake" vendors: Cask, Unifi, Cambridge Semantics et al. Such vendors provide a data lake capability "out of the box." Many companies have found this a more productive way to build a data lake than building it from scratch.

Consider the position of a company that wishes to build a data lake that uses, say, 20 components of the Apache Stack. The issues it will inevitably face are as follows:

• Technical skills: New staff with appropriate technical skills may need to be hired.

• Operational management: The server cluster will need to be monitored and managed - altering configurations, provisioning new hardware and tuning for performance as needed. The software environment for this may need to be built.

• Upgrades: Upgrade management of the Apache Stack is a potential headache. Apache Stack upgrades are not going to be as smooth as, for example, a Windows Server upgrade, simply because the Apache Stack is not under the close control that a vendor like Microsoft or IBM provides.

• Standards and integration issues: The question here is how to align the Apache Stack with common data center standards for security, data life cycle management, etc.

In reality, what the data lake vendors do is provide an abstraction layer between the Apache Stack and the applications built on top of it. If this is done well, the user could, for example, migrate from Cloudera's stack to MapR's stack, or even move the data lake applications into the cloud, without concern for whether the application will function as expected.

The Data Layer OS

As we introduced the concept of a data layer OS, it is worth explaining here why we do not currently believe that the Apache Stack has earned that description. If you ignore all the infrastructure software that is there to make applications possible, then we can think in terms of there being three types of application:

• OLTP applications: These are the applications that process the events and transactions of the business.

• Office applications: This category embraces communications activity from email to multimedia collaboration and includes personal applications from the word processor to graphics software. (We would even include development software here.)

• BI and analytics applications: These are the applications that analyze the business and provide feedback.


Right now the Apache Stack dominates the area of BI and analytics to the point where everything else is a sideshow. However, the other two areas of application are currently conspicuous by their absence from the data lake. There are several reasons why that is so, the main one being that BI and analytics applications can profit most from the parallelism that the Apache Stack provides. Transactional applications like ERP and CRM and office applications like email do not experience such a dramatic boost from parallel processing, because they do not process such large amounts of data.

In recent years the IT industry has witnessed unprecedented software acceleration in BI and analytic applications. Prior to 2010, in particular application areas where it made a difference, performance increases of 10x took six years at least, but by 2010 we began to witness much faster software accelerations than that. Nowadays we sometimes encounter projects that have accelerated a previous analytical process by 1000x or even 10,000x.

This is almost entirely due to the effective exploitation of parallelism in conjunction with in-memory processing. If you rarely have to go to disk for data and you can scale a workload out over many servers, then 1000x is not that difficult to achieve: if in-memory access alone buys, say, a 100x improvement and the workload parallelizes well across ten servers, the combined gain is roughly 1000x. With the Apache Stack and commodity servers, it can be achieved at remarkably low cost.


The Emergence of Data Governance

Data is not what it used to be. Let's take this on board. The data we knew and loved was pretty much static, sitting in databases or files fairly close to the applications that used it. It wasn't until the blooming of BI in the late 1990s that the industry felt obliged to let data flow around. That desire gave birth to the data warehouse, into which data flowed and from which data trickled into data marts.

Things began to change. Every year that passed witnessed an increase in the need for data movement. With the advent of CEP technology to process data streams of stock and commodity prices for trading banks, we got the first hints of real-time data processing, and as time marched forward, such activity grew larger.

It was obvious to web businesses like Yahoo and Amazon that we lived in a real-time world. The web logs that drove their businesses were the digital footprints of users visiting their web sites. A transaction-based world was giving way to an event-based world.

Had we thought about it at the time, we might have realized that it wasn't just web sites that lived and died by events. Network devices and operating systems and databases and middleware and applications were all happily recording their events in log files that squandered disk space until they were eventually deleted. The computer networks of the world were already event oriented. They were born that way, but the applications we built were not.

The first software vendor to notice this was Splunk. It detected and mined a thick vein of gold in the log files that litter corporate networks. The advent of Splunk was a boon to IT departments that often needed to consult collections of log files to identify the causes of application errors. Security teams were also appreciative of the technology, as it helped them to hunt network intruders and vagrant viruses. That we were entering an event-based world was transparently obvious to users of Splunk, since all the data they gathered was event data.

But it was still not obvious to others, and even now, with the advent of streaming analytics, it is still not obvious to everyone.

The Dawn

You would be hard put to find any references to the term "data governance" before the year 2000. In fact, you won't find many references to it prior to 2005. And let's be clear about this: it was not that businesses did not care about the governance of data in earlier years. It's just that pretty much all corporate data was internal data generated by the business, and most of it stayed put where it was born, or if it moved anywhere it made its way into a data warehouse.

Under those circumstances, governing the data was not such a pressing need. But, as we have described, data found the need to move much more often, and the volume of data exploded.


Let us home in on one of the faults of our current use of data. It can be thought of as a fundamental problem that needs to be fixed:

Data is not self-defining!

It's easy to understand why. Some data is created by programs which expect to be the only programs ever to use it. Within the program that uses it, the data is adequately defined for use and the program understands it perfectly. As there is no intention for the data to be used by any other program, there is no need to explain the meaning of the data when it is stored.

At the dawn of IT, back in the punched card era, all data was treated in that way, and it was not until the advent of the database - a software technology built to enable data sharing - that there was any change. The data definitions in the database, the schema, applied to all the data but ceased to apply once the data was exported from the database. It was, at best, a halfway solution to the data definition problem.

Now that the era of event processing has arrived, it is possible to be more specific about how to make data self-defining. What follows is a suggested list of data items that could be added to every event record, along with brief definitions (a sketch of such a record follows the list):

• Date-Time: The date and time (GMT) that the data was created.
• Geographic location: The geographical location of the device that created the data. For precision we might think here in terms of the map reference (latitude and longitude) and possibly a three-dimensional reference for the point within the building that occupies that map reference. Data created off-planet would probably use a reference point based on the sun.
• Source device: The kind of device that created the data (server, PC, mobile phone, etc.).
• Device ID: The specific ID of the source device.
• Source software: The precise identity of the software that created the data.
• Derivation: An indication of whether the data was derived and, if so, how.
• Creator: The owner or enabler of the device and software that created the data.
• Owner: The owner of the data, who may be different from the creator.
• Permissions: Security permissions for usage of the data, enabling its use by specific programs.
• Status: Whether this is the master copy of the data or a valid replica. (Replicas will be kept for back-up at the very least but may also be created for multiple concurrent usage.)
• Metadata: Associated directly with each data value, a metadata tag or reference identifying the meaning of the data.


• Master Audit Trail: A record of who has used the data, when, where and how.
• Archive Flag: The date when the data is scheduled for archive or deletion.

This is not necessarily an exhaustive list. Note that the scheme described here assumes that data is never updated. Under such a regime, data values known to be wrong would be corrected in a way analogous to how postings are reversed out in an accounting ledger.
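To make the scheme concrete, here is a minimal sketch, in Python, of an event record carrying these self-defining items. The field names and types are illustrative assumptions on our part, not a proposed standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)  # frozen: under this regime data is never updated
class SelfDefiningEvent:
    # Provenance
    created_at: datetime          # date and time (GMT/UTC) of creation
    latitude: float               # geographic location of the creating device
    longitude: float
    source_device: str            # kind of device (server, PC, mobile phone...)
    device_id: str                # specific ID of the source device
    source_software: str          # software that created the data
    # Ownership, usage and meaning
    derivation: Optional[str] = None      # how the data was derived, if at all
    creator: str = ""                     # owner/enabler of device and software
    owner: str = ""                       # owner of the data (may differ)
    permissions: tuple = ()               # programs permitted to use the data
    status: str = "master"                # master copy or replica
    metadata_tag: str = ""                # reference identifying the meaning
    archive_date: Optional[datetime] = None  # scheduled archive or deletion
    payload: dict = field(default_factory=dict)  # the data values themselves

event = SelfDefiningEvent(
    created_at=datetime.now(timezone.utc),
    latitude=30.2672, longitude=-97.7431,
    source_device="server", device_id="srv-0042",
    source_software="web-shop v3.1",
    owner="sales", metadata_tag="purchase.confirmed",
    payload={"order_id": "A-1001", "amount": 59.99},
)
```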

It is clear in perusing this collection of definition data that it constitutes a significant amount of data in itself. How such data is stored is a physical implementation issue - clearly there are many ways in which much of this data could be compressed and/or stored separately from the data values, depending on how it is being used or is intended to be used.

The function of all this additional data is to establish indelibly:

• The provenance of the data.
• The ownership of the data and its allowed usage.
• The history of its usage.
• The schedule for its archive or deletion.
• The meaning of the data.

These can also be regarded as five dimensions of data governance. We will discuss these one by one a little later. However before we do that, we need to define, in a general way, what we consider a data lake to be.



A General Definition of The Data Lake


It is our view that the Data Lake can and should be the system of record of an organization. Specifically, that means that the data storage system the data lake embodies becomes the authoritative data source for all corporate information.

We are speaking here of a "logical" data lake. For reasons of practicality it may be necessary to have multiple physical data lakes, in which case the system of record is constituted by all those physical data lakes taken together.

Not only should the Data Lake be the system of record, it should be the logical location for the implementation of data governance. Data governance rules and procedures need to have a natural point of enforcement, and the logical place for such enforcement is the system of record, the data lake.

Figure 5 depicts the data lake in terms of the primary software components that are required: an ingest capability that can harvest both data streams and batches of data, the data storage capability, data governance so that data is governed on entry to the lake, data lake management so that the data lake environment is monitored and kept operational, and ETL capabilities for transporting data to other locations. In addition, we also have the applications that run on the lake.

[Figure 5. Data Lake Overview: static data sources and data streams flow through ingest into the data lake, governed by data governance processes and data lake management; ETL passes data on to databases, data marts and other apps, while analytics and BI apps run against the lake.]

The data lake is a staging area for the entry of new data into the system of record, whether it was created within the organization or elsewhere. There is an imperative for all governance processes to be applied to the data at that point of entry, if possible, and for data to be available for use once those processes have completed.

Data Governance Processes


We will discuss these processes one by one:

1. Assigning data provenance and lineage. Accurate data analytics needs to be certain of the provenance and lineage of the data. Prior to the dawn of the "big data age" this was rarely a problem, as the data either originated within the business or came from a traditional and reputable external source. As the number of sources of data increases, the difficulties of provenance/lineage will increase.

We expect the ability of data to self-identify for the sake of provenance to increase, although it may be many years before a general standard is agreed. For the sake of lineage and provenance, each event record would need to record the time of creation, geolocation of creation, ID of the creating device, ID of the process/app which created the data, ownership of the data, the metadata, and the identity of the data set or grouping it belongs to. To this we can add the details of derivation if the data was derived in some way, which would allow lineage to be deduced.

Where such precise details are absent on ingest to the data lake, it should be possible, at least, to know and record where the data came from and how. As very few data records self-identify as described, some compromise is inevitable until standards emerge. With the advent of the Internet of Things, the need for self-defining data will increase.

2. Data security. The goal of data security is to prevent data theft or illicit data usage. Encryption is one primary dimension of this and access control is the other. Let us consider encryption first.

Encryption needs to be planned and, ideally, applied as data enters the lake. In a world of data movement, the security rules that are applied need to be distributable to wherever encrypted data is used. Ideally, you will encrypt data as soon as possible and decrypt it at the last moment, when it needs to be seen "in the clear."

The reason for this approach is twofold. First, it makes obvious sense to minimize the time that data is in the clear. Secondly, encryption and decryption make heavy use of CPU resources, and hence minimizing such activity reduces cost.

To implement this approach, format-preserving encryption (FPE) is necessary. The point about FPE is that it does not change the characteristics of the data, such as format and sort order; it simply disguises the data values. There are FPE standards and vendors that specialize in their application.
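To illustrate the format-preserving property, and only that property, here is a toy sketch in Python. It is emphatically not real FPE (production systems would use a standardized algorithm such as FF1 via a vendor library); it merely shows how an encrypted value can retain the length and character classes of the original, so existing schemas still accept it.

```python
import hashlib
import hmac

SECRET = b"demo-key"  # toy key; this scheme is NOT cryptographically secure

def toy_fpe(value: str, decrypt: bool = False) -> str:
    """Shift each digit by a keyed, position-dependent amount.
    Digits stay digits and length is preserved, so a 16-digit
    card number encrypts to another 16-digit card-shaped string."""
    out = []
    for i, ch in enumerate(value):
        if ch.isdigit():
            digest = hmac.new(SECRET, str(i).encode(), hashlib.sha256).digest()
            shift = digest[0] % 10
            if decrypt:
                shift = -shift
            out.append(str((int(ch) + shift) % 10))
        else:
            out.append(ch)  # separators pass through, preserving the format
    return "".join(out)

card = "4111-1111-1111-1111"
enc = toy_fpe(card)
assert len(enc) == len(card)
assert toy_fpe(enc, decrypt=True) == card
```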

Data access controls require the existence of a reasonably comprehensive identity management system and access rights associated with all data. Access rights may distinguish between the right to view and the right to process the data using a particular program or process.
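A minimal sketch of that view/process distinction, assuming a simple in-memory rights table (the identities, dataset names and actions are invented for illustration):

```python
# rights[(identity, dataset)] -> the set of actions granted
rights = {
    ("analyst_jane", "sales_events"): {"view"},
    ("etl_service", "sales_events"): {"view", "process"},
}

def authorized(identity: str, dataset: str, action: str) -> bool:
    """True only if this identity has been granted this action."""
    return action in rights.get((identity, dataset), set())

assert authorized("etl_service", "sales_events", "process")
assert not authorized("analyst_jane", "sales_events", "process")
```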

3. Data Compliance. Data compliance regulations are now common, and likely to become increasingly complicated with the passage of time. The EU is hoping to establish a General Data Protection Regulation (GDPR) for personal data that is implemented across the world and has devoted considerable effort to formulating rules for that. GDPR will become law within the EU and will likely take effect in many other countries. The practical consequence is that businesses worldwide need to implement these rules. The EU legislation does not care where the data is held, so its jurisdiction is worldwide in respect of anyone living in the EU.

You can think of such regulations as international, whereas healthcare regulations (HIPAA in the US) and financial compliance rules tend to take effect at a national level. To this collection of compliance regulations you can add non-binding sector compliance initiatives and best-practice rules that an individual business might establish. It should be obvious that applying such rules is difficult unless the data is self-defining to some degree.

4. Data integrity. Once you eliminate data updates, the possibility of data corruption diminishes. Nevertheless it still exists and may never be entirely eliminated; software errors can certainly cause it. Specifying whether data is a copy or the actual source is necessary if there is going to be any possibility of auditing the use of replicated data. Test data needs to know that it is test data. Data back-ups also need to know they are back-ups. Disaster recovery needs to restore all data to its original state. All of this applies to data in motion as well as data at rest.

5. Data cleansing. Newly ingested data may be inaccurate, especially if we have no control over and little knowledge of how it was created. No data creation or collection processes are perfect. This is even true of sensor data, which we may think of as reliable because it comes from an automated source; sensors are also capable of error. Data may also be corrupted by subsequent processes after creation.

Data needs to be cleaned as soon as possible after ingest - where that's feasible. There are exceptional situations, for example where the requirement is to process a data stream as it is ingested. There is no time available for data cleansing, so the streaming application must allow for possible data error. A real-world parallel to this is the way that news is processed. Sometimes false news reports are aired because some editor thought the news was too "valuable" not to air immediately, and the usual verification process was skipped. It is corrected later, if it turns out to be wrong. Any urgent processing of uncleansed data will normally allow for its poor quality and the possible need for later correction.

Data cleansing standards are a natural part of data governance. However, there is no silver bullet for cleaning data. There are obvious tests that can be done, such as checking for logically impossible values, and you can formulate rules to detect unlikely values. It is possible to cleanse some data automatically, and there are some well-designed tools for this from Trifacta, Paxata, Unifi and others. But no matter how effective the cleansing software, it requires human supervision and intervention. Cleansing can thus be a slow process.
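By way of illustration, here is a small Python sketch of such rule-based checks; the specific rules and field names are invented for the example:

```python
from datetime import date

def check_record(rec: dict) -> list:
    """Return a list of cleansing problems found in one record."""
    problems = []
    # Logically impossible values: hard errors
    if rec.get("age", 0) < 0:
        problems.append("impossible: negative age")
    if rec.get("birth_date") and rec["birth_date"] > date.today():
        problems.append("impossible: birth date in the future")
    # Unlikely values: flagged for human review, not auto-corrected
    if rec.get("age", 0) > 120:
        problems.append("unlikely: age over 120")
    if rec.get("order_amount", 0) > 1_000_000:
        problems.append("unlikely: very large order amount")
    return problems

print(check_record({"age": -5, "order_amount": 250}))
# -> ['impossible: negative age']
```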

6. Data reliability. As far as it is possible to ensure, data needs to be accurate and also to be checked for accuracy regularly. Data values can be corrupted in various ways: they can be changed by a hacker for fraud or even sabotage; they can be overwritten at some point in their life; they can be corrupted "in flight," although this is rare because of communications error-checking procedures (in-flight corruption is most likely hacking); they can be corrupted by any software that rewrites the data; and they can be corrupted by database (DBA) error. To deal with such possibilities, some form of checksum integrity can be applied.
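A minimal sketch of one such form of checksum integrity: a SHA-256 digest is computed when the record is written and re-verified on read, on replication, or during a scheduled audit sweep.

```python
import hashlib

def checksum(payload: bytes) -> str:
    """Digest stored alongside the record when it is written."""
    return hashlib.sha256(payload).hexdigest()

record = b'{"order_id": "A-1001", "amount": 59.99}'
stored_digest = checksum(record)

def verify(payload: bytes, expected: str) -> bool:
    """Any corruption - in flight, by rewrite, by DBA error - changes the digest."""
    return checksum(payload) == expected

assert verify(record, stored_digest)
assert not verify(record.replace(b"59.99", b"5999"), stored_digest)
```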

7. Disambiguation. Disambiguation is an aspect of data cleansing and data reliability. It refers to the removal of ambiguities in the data. It is a process that is unlikely to be applied when data enters the data lake, since it can be time consuming. There is a particular problem with data ambiguity when it comes to people. There is no bullet-proof global identity system, so identity theft is a fact of life.

There is no standard for people's names. Sometimes just the first name and surname are asked for; sometimes a middle initial or a middle name; sometimes the full name. The name is a poor identifier. People can change their names legally, and women change their names by marriage. People sometimes disguise their names deliberately for fraudulent reasons, but they may also do so legitimately (in an effort to anonymize their data). Some attributes change (address, telephone number, etc.) and usually do so without the information being easily gathered. To complicate the picture, the structure of the customer entity changes over time. For example, social media identities (on Twitter, etc.) were born only recently.

The consequence of this is that for many businesses cleansing customer data also means disambiguating the data. There are few software tools that are effective at disambiguation. Novetta has this capability and IBM also, but they are the only two software providers we know of with this capability.
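Name matching is only one small fragment of disambiguation, but it shows why the problem is hard. A deliberately naive sketch using Python's standard library (the names and any threshold you might apply are invented; real entity-resolution products combine many more signals):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity between two normalized names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for name in ["Jon Smith", "John Smith", "J. Smith", "Joan Smyth"]:
    print(name, round(name_similarity("John Smith", name), 2))
# High scores suggest, but cannot prove, the same person;
# a legally changed surname scores low despite being a true match.
```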

8. Audit trail of data usage. A record of who used what data and when needs to be maintained, both for security purposes and for usage analytics. Usage analytics play a part in query optimization as well as in data life-cycle management.

9. Data life-cycle management. Few companies have formally implemented a general data life-cycle strategy. More likely, they have a strategy for some data, such as data covered by compliance regulations, but not all data. Others may have no strategy at all.

With analytical applications in particular, the need to manage data life cycles is important, because much of the data used in data exploration may eventually be discarded as worthless. There is no point in retaining such data beyond recording that it was once explored. As data lakes are also used for archive, the use of a data lake creates an opportunity to implement or tighten up the procedures around data life-cycle management. Life-cycle management can be thought of as the strategy for moving data to least-cost locations as its usage diminishes. Deletion is a possible destination in this.
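A minimal sketch of such a least-cost-location policy; the tier names and thresholds are invented assumptions:

```python
from datetime import datetime, timedelta, timezone

def choose_tier(last_accessed: datetime, under_regulation: bool) -> str:
    """Pick a storage destination as usage diminishes."""
    age = datetime.now(timezone.utc) - last_accessed
    if age < timedelta(days=30):
        return "hot"       # fast, expensive storage
    if age < timedelta(days=365):
        return "warm"      # cheaper disk
    if under_regulation:
        return "archive"   # must be retained for compliance
    return "delete"        # worthless exploration residue

stale = datetime.now(timezone.utc) - timedelta(days=400)
print(choose_tier(stale, under_regulation=False))  # -> 'delete'
```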

Metadata and Schema-on-read

The only aspect of governance that we have not yet discussed is metadata management. The metadata situation is complex, and thus we are devoting more time to it than to the other aspects of data governance. In overview, the situation is simple: since metadata determines the meaning of data, the natural preference is for metadata to be as complete as possible as soon as possible, so that the possibility of data being misinterpreted is minimized.

However, one of the loudly proclaimed benefits of the data lake is that "there is no need to model data before ingesting it." This contrasts significantly with the data warehouse situation, where a great deal of data modelling effort is required before data is allowed into the warehouse. The alternative to data modelling is called schema-on-read.

With schema-on-read, the process that reads the data for the first time determines the metadata. The work involved varies according to data source. Some data, such as CSV files and XML files, defines its own metadata, and thus it's possible to know the metadata as the data is read. Other data may not be so convenient. However, some data formats can be recognized from the data, and some metadata may be deducible from data values. This is how products like Waterline can automatically determine metadata values. In other circumstances human input may be required to determine metadata. Once the metadata is determined it can be stored in a metadata repository.

With schema-on-read the end goal is not to establish an RDBMS-type data model which defines the relationships (foreign key relationships) between data sets. For many BI and analytics applications that is not required if the user can specify the metadata. The schema-on-read approach means that any Master Data Management (MDM) process that maintains a master data model of all corporate data will probably need to be adjusted.

The assumption of data modelling is that by applying various rules and a little common sense you can provide a data model (usually an ER model) that is suited to all possible uses of the data. The truth is that in almost all circumstances you cannot. It will be imperfect, at best.

The reality of the situation is this:

• The time taken to model the data - either in the beginning, or when new data sources are added, or if errors are found in the model - constitutes a definite and possibly large cost to the business in respect of time to value.

• The modeler has to try to anticipate all new data sets that may appear later, so that the model does not require significant rework when new data sets are added. This can only be guessed at, and rework is sometimes required.

• Data (in a data lake) is intended to be a shared asset and will be shared by groups of people with varying roles and differing interests, all of whom hope to get insights from the data. To model such data means trying to allow for every constituency in advance. Possibly this will result in a "lowest common denominator" schema that is an imperfect fit for anyone. This problem gets worse with more data sources, higher data volumes and more users.

• With schema-on-read, you're not glued to a predetermined structure, so you can present the data in a schema that fits reasonably well the task that requests the data.


By employing schema-on-read you get value from the data as soon as possible. It does not impose any structure on the data, because it does not change the structure that the data had when loaded. It can be particularly useful when dealing with semi-structured, poly-structured, and unstructured data. It was difficult and often impossible to model some of this data and ingest it into a data warehouse, but there is no problem getting it into the data lake.

In general, schema-on-read allows for all types of data and encourages a less rigid organization of data. It is easier to create two different views of the same data with schema-on-read. Also, it does not prevent data modelling; it just makes it necessary to defer it.
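A small sketch of the idea: the same raw, unmodelled file is read twice, and each consumer imposes its own schema at read time. The file contents and field meanings are invented for illustration.

```python
import csv
import io

# Raw data lands in the lake untyped and unmodelled.
raw = io.StringIO("2017-03-01,store-7,49.90\n2017-03-01,store-9,12.50\n")

# View 1: a finance user reads typed amounts keyed by store.
raw.seek(0)
finance_view = [{"store": row[1], "amount": float(row[2])}
                for row in csv.reader(raw)]

# View 2: an operations user only wants daily event counts.
raw.seek(0)
ops_view: dict = {}
for row in csv.reader(raw):
    ops_view[row[0]] = ops_view.get(row[0], 0) + 1

print(finance_view)  # [{'store': 'store-7', 'amount': 49.9}, ...]
print(ops_view)      # {'2017-03-01': 2}
```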

The Data Catalog and Master Data Management (MDM)

There needs to be a catalog for all the data in the lake. It is important, and will become increasingly important, for the data catalog to be as rich as possible, embodying as much "meaning" as possible. It may help here if we explain what we mean by "data catalog," as the term is open to interpretation.

The data catalog for an operating system (such as Windows or Linux) is the file system. The metadata it stores for each file is incomplete, as its primary purpose is to enable programs to find and access files. The programs themselves will know the layout of the data and thus how to interpret the values in any given file. This arrangement is not ideal because the data cannot be shared with programs that do not understand the file structure. It is adequate only for data that never needs to be shared.

The data catalog for a database is the schema. The schema defines the structure of all the physical data held in the data sets or tables of the database and seeks to provide a logical meaning to each data item by attaching a label to it (Customer_ID, Amount, Date_of_Birth, etc.). The amount of meaning that can be derived from the data catalog depends upon the specific database product. For relational databases the catalog roughly reflects the meaning that is captured in an ER diagram, where the relationships between specific data entities are indicated. With NoSQL and document databases the situation is similar, although how the schema is used varies from product to product.

With an RDF database (sometimes called a semantic database) the catalog can record a greater level of meaning. This is an area where Cambridge Semantics, with its Smart Data Lake solutions, currently excels. The simple fact is that semantic technology allows you to capture much more meaning for the data catalog than you can using ER modelling.

A traditional data warehouse metadata catalog can be complex, but nothing like as complex as a data lake catalog, which may include many sources of unstructured data (document data, graph data) that may require ontologies (semantic structures) to accurately define the metadata.

The point of the data lake's data catalog is to provide users (and programs) with a data self-service capability. On a simple level, if you compare the pre-data-lake world of data warehouses and data marts to the data lake world, two obvious facts emerge:

• The data lake can hold every kind of data (unstructured, semi-structured, structured), allowing it to provide a more comprehensive data service to users.

• The data lake can scale out, almost indefinitely, to accommodate far more data than a data warehouse ever dreamed of doing.

It is best to think of metadata enrichment as an ongoing process. The use of schema-on-read may mean that the data catalog is incomplete when a data set (or file) first enters the lake. However, that will change as soon as the data is used. The store of metadata may be further enriched by usage, and formal modelling activity may be carried out to assist in that.
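To make progressive enrichment concrete, here is a sketch of a catalog entry that starts nearly empty at ingest and gains meaning as the data is used. The structure and field names are our own illustration, not any product's format.

```python
# At ingest: little more than where the data came from and how.
entry = {
    "dataset": "clickstream/2017-03-01",
    "source": "web servers, via stream ingest",
    "ingested_at": "2017-03-01T02:00:00Z",
    "schema": None,             # schema-on-read: not yet determined
    "business_meaning": None,
}

# After first use: the reading process discovered the metadata.
entry["schema"] = ["timestamp", "session_id", "url", "referrer"]

# After usage and modelling activity: business meaning and lineage.
entry["business_meaning"] = "page views for the retail web site"
entry["used_by"] = ["funnel-analysis", "ad-attribution"]
```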

The idea of Master Data Management is not dead, but it has morphed over the years. The original idea was to arrive at a "single version of the truth," a holy grail that none of the knights of the round table in any organization ever managed to feast their eyes on. Nevertheless, there is sense in trying to link together all the data of an organization within a metadata-driven model that is comprehensible to business users and enables data self-service. It should be possible to define the business meaning of everything that lives within the data lake. This is an area where we expect semantic technology to make an invaluable contribution.

The Cloud Dynamic

It could be argued that data lakes are cloud-neutral. Depending on circumstance, the cloud may prove attractive to companies involved in data lake projects. Certainly it is likely to be appropriate for building prototypes, with the idea of later migrating back into the data center.

As regards data lakes, the economics of the cloud need to be carefully considered. With most cloud vendors you pay rent for all the data that lives in the cloud - and if a good deal of that data is never accessed, it will almost certainly be much less expensive to keep that data on premise.

That's the downside of the cloud, but it still has its characteristic upside: it's the go-to solution when there's a need for instant extra capacity, whether for storing data or processing it.



Data Lake Architecture


Let’s begin with the idea that the data lake is the system of record. And let’s be clear<br />

what we mean by this. <strong>The</strong> system of record is the system that records all the data that<br />

is used by the business. It also holds the golden copy of each data record.<br />

For simplification the system of record should be thought of as a logical system. It may<br />

be possible to implement it on a single cluster of servers, but this is not a requirement<br />

and should not be a goal. In practice the whole configuration will probably involve<br />

multiple clusters if, for no other reason, than to provide disaster recovery.<br />

<strong>The</strong> system of record should also be the system where governance processes are<br />

applied to data. Clearly, data needs to be subject to governance wherever it is used<br />

within the organization. Some governance processes, particularly data security, need<br />

to be applied to data as soon as possible after its creation or capture. For that reason,<br />

the data lake will inevitably be the landing zone for external data brought into the<br />

business, so governance processes can quickly and easily be applied it. <strong>Data</strong> created<br />

within the organization should be passed to the data lake immediately after creation so<br />

that governance processes can be applied.<br />

It is best to think of the data lake as a store of event records. We can think of it in this way: events are atoms of data and transactions (the traditional paradigm with its traditional data structures) are molecules of information. The analogy works reasonably well.

Consider, for example, a web site visit. A user lands on the site, clicks through a few pages, decides to buy something, enters credit card details, and clicks on the "confirm" button. In transactional terms we may think of this as a purchase: a molecule of data that's quickly followed by a delivery transaction, another molecule of data to record.

But in reality it's a stream of events, a series of atoms of data. Every user mouse action creates an event. The computer records these events and responds to each one immediately; it displays new web pages or expands text descriptions or whatever. The purchase confirmation is just another event, distinguished only by the fact that it generates a cascade of other related events in the application or in other applications.

When you look at it like that, business systems consist of applications generating events and sending event information to other applications. There is nothing particularly special about web applications in this respect; all applications are like that. They respond to events and send messages or data to other applications. That's been the nature of computing for decades.

When you scan all the servers and network devices in a data center you find them awash with log files that store data about events of every kind: network logs, message logs, system logs, application logs, API logs, security logs and so on. Collectively the logs provide an extensive audit trail of the activity of the data center, organized by time stamp. It happens at the application level, at the data level and lower down at the hardware level. Hiding within this disparate set of data can be found details of anomalies, error conditions, hacker attacks, business transactions and so on.


Clearly it is possible, by adding real-time logging to every business application on every device, to collect all the company's data and dispatch it to the data lake - although it is no doubt overkill. Figure 6 depicts this possibility. Data is ingested either from static data sources (files and databases) on a scheduled basis, or directly from data streams.

We show both Kafka and NiFi as possible components of an ingest solution. Kafka is pure publish-subscribe and thus can easily be used to gather data changes from multiple files (as publishers) and pass them to the Data Bus (the subscriber), probably a well-configured server. However, a more sophisticated set of capabilities can be created by integrating NiFi with Kafka. NiFi can completely automate data flows across thousands of systems and can be configured to handle the failure of any system or network component, queue management, data corruption, priorities, compliance and security. And it provides an excellent drag-and-drop interface that allows data flow configurations to be designed and enhanced. You can think of NiFi as ETL on steroids.

We included an ingest application in the diagram to allow for any functionality or integration capability that could not be delivered by Kafka in conjunction with NiFi.
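To give a flavor of the Kafka side of such an ingest path, here is a minimal producer sketch using the confluent-kafka Python client. The broker address, topic name and the log-following stub are placeholder assumptions, not a prescribed configuration.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "broker1:9092"})

def tail_log_lines():
    """Stand-in for a real log follower on the source system."""
    yield b'{"event": "page_view", "url": "/checkout"}'

# Each source publishes its events; the Data Bus subscribes to the topic.
for line in tail_log_lines():
    producer.produce("ingest.weblogs", value=line)

producer.flush()  # block until queued messages are delivered
```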

The goal is to provide an in-memory stream of data that can be processed as it arrives. Given that, in theory at least, input may be required from any data source anywhere, the data access capability needed here is extensive.

Data lake projects will most likely begin with fairly unsophisticated data acquisition from just a few sources. What we have described here is a kind of "worst case scenario" and how it would likely be handled.

[Figure 6. Data Lake Ingest: data from servers, desktops, mobile, network devices, embedded chips, RFID, IoT, the cloud, OSes, VMs, log files, systems management apps, ESBs, web services, SaaS, business apps, office apps, BI apps, workflow, data streams, social media and so on is ingested via Kafka and NiFi onto the Data Bus.]

Real Time Applications

The goal of software architecture is to satisfy application service levels, pure and simple. When we consider real-time applications, for example responding immediately to price changes in an automated market, there is simply no room for any avoidable latency.


We represent this reality in Figure 7 by showing real-time applications running directly against the real-time data stream within the Data Bus and also accessing a disk-based data source. In practice this is likely to be achieved by a Lambda or Kappa architecture, which is far more involved than the diagram suggests.

[Figure 7. Real-Time: real-time apps run against the Data Bus and against stored data.]

In practice it is far more likely, for latency reasons, that real-time apps will not be located anywhere near the data lake, but will instead be as close as possible to the data stream(s) that feed them. Nevertheless they will likely pass the data stream they process directly to the data lake. Since such applications run prior to any data governance, it may be necessary to pass data back to them if anything important is discovered during the governance processes.

Governance Applications

We need to distinguish between the governance processing that occurs on ingest and the governance processing that may occur later. One of the rules of governance itself may be to specify what processes have to take place before data is made available to data lake users.

It makes sense to do as much processing as possible while data is on the Data Bus (i.e. held in memory), since it can be accessed very quickly. The ideal would be to do all governance processing on ingest, but this may be impossible. Some data cleansing activity and some metadata discovery activity requires human intervention, and it may not be practical to do it on ingest. For some data the chosen policy may be to implement schema-on-read, so metadata gathering will occur after ingest.

[Figure 8. Ingest Governance Apps: data security, data transforms, data aggregation, metadata management and data cleansing running against the Data Bus.]

There can be other competing dynamics. The desire may be to encrypt all data (or at least all data that is destined to be encrypted) on ingest. If data is to be stored in HDFS this is doubly important, as HDFS is a write-once file system. However, data cleansing requires data to be unencrypted.

The data transform and aggregation activities shown in the diagram are not governance activities per se. From an efficiency perspective it is better to perform data transformations, aggregations and other data calculations that are known to be required before data is written to disk. Of course, some will need to be done later, simply because not all the data they require is in the data stream.


Data Lake Management

Figure 9 shows the processes that run on the data lake. The other ongoing activity, aside from data governance, is data lake management. This is the system management activity that monitors and responds to hardware and operating system events and manages all the applications that dip their toes in the data lake.

[Figure 9. Data Lake Applications and Processes: data governance and data lake management operate on the data lake, while search & query, BI, visualization & analytics and other apps run against the Data Bus and the lake.]

The data lake is a computer grid that needs to be managed like any other network of hardware. The appropriate system management activities can be many and varied, including: server performance, availability monitoring, automated recovery, software management, security management, access management, user monitoring, application monitoring, capacity management, provisioning, network monitoring and scheduling.

There is nothing new in respect of system management here. However, it is important to recognize that some of the software employed here will be traditional system management software and some is likely to be data lake specific (Hadoop management software like Ambari, Pepperdata for cluster tuning, etc.).

Data Lake Applications


Little needs to be said about data lake applications beyond the fact that businesses almost always employ data lakes in the same way they used data warehouses: for BI and analytics applications. It is important to note that data self-service is more practical with data lakes than it ever was with data warehouses.

An effective data self-service capability requires an effective search and query capability. This in turn requires the existence of a data catalog, which should be a natural result of intelligent metadata management. It will also require a search capability (after the fashion of Google search), which can pick out anything in the data lake.

A really comprehensive search capability will necessitate some kind of indexing activity as part of data ingest. A query capability will need to work in conjunction with the data catalog. There might be support for multiple query languages (SQL, XQuery, SPARQL, etc.).


If we now look at Figure 10, which illustrates the complete data lake, the only two processes we have not yet discussed are data extracts and data life-cycle management.

[Figure 10. The Data Lake Complete: ingest from all the source types feeds the data lake; data governance (data security, metadata management, data cleansing), transform & aggregate, archive, life-cycle management and data lake management operate on it; real-time apps, search & query, BI, visualization & analytics and other apps run against it; and extracts flow to databases, data marts and other apps.]

While data lake processing can be fast, if particularly high performance is required for some applications there will be a need to export data to a fast data engine or database. It will probably be many years before data lake data access speed gets close to that of a purpose-built database.

The truth is that the focus of data lake architecture is data ingest and data governance. There are many processes, all of them important, competing for the same resources.


There will always be a limit to the capacity of the data lake, and governance processes naturally take priority, so it will prove necessary to replicate data to other data lakes or data marts to properly serve some applications or users.

As regards data archive, data life-cycle management can be regarded as an aspect of data governance. It can best be thought of as a background process. The exact rules of whether and when data needs to be deleted may be influenced by business imperatives (regulation), but may also be determined by storage costs. Ideally, archive will be an automatic process.


Next Generation Architecture

In our view, the data lake is more than an interesting software architecture that utilizes many inexpensive open source components; it is the foundation of the next generation of enterprise software. As such we can think of there being three major architectural generations of software:

1. Batch Computing. The centralized mainframe architecture, characterized by all applications running on one (or more) large mainframes, usually in a batch manner.

2. On-line Computing. This involves an arrangement of distributed servers and client devices. It is characterized by applications interacting with databases in a transactional way, supported by batch data flows between databases for data sharing.

3. Real-Time Computing. This is based on event-based applications that employ parallel architectures for speed and can support real-time applications where data is processed as quickly as it arrives.

If we look at software architectures in this manner, it is clear that the data lake belongs to a new generation of software that is built in a fundamentally different way to what came before.

The event-based data lake serves as the system of record of the corporate software ecosystem and the point of implementation of data governance. Data flows into the lake from data streams or bulk data transfers that have their origin within or outside the organization. During this process, governance rules are applied to the data to ensure it is stored in an appropriate form and subject to appropriate security policies. Once in the data lake, with all the appropriate governance procedures applied, it becomes available for use. If any data is extracted for use outside the data lake, it will be a data copy. Ultimately data leaves the lake when it is archived or permanently deleted.

So, the data lake is not a swamp into which any old data can be poured and lost from sight. It is the ingest point for the controlled collection and governance of corporate data. It is the system of record and the foundation for data life-cycle management, from ingest to archive. It is an event management platform and could be viewed as a truly versatile data warehouse.

As a data warehouse, it is distinctly different from the traditional variety, as it can accept any data with any structure rather than just the so-called structured data that relational databases store. There is no modelling required in designing and maintaining the lake, and a query service should be available to retrieve data from the lake - but it may not be a lightning-fast query service.

While data warehouses were presided over by extremely powerful database technology with a purpose-built query optimizer, the data lake is likely to be devoid of such technology. If a fast query service is needed (as is likely) for some of the data in the lake, it will be provided by exporting that data to an appropriate database to deliver the needed query service.


The Logical and The Physical

In our view, the two primary dynamics involved in establishing a data lake are:

1. To gradually migrate all the data that makes up the system of record to the data lake, where it becomes the golden copy of the data.

2. For the data lake to become the primary point of ingest of external data, and for governance processing to be applied to data, both internal and external, as soon as possible after it enters the data lake.

We note here that the data lake concept, which was first proposed about five years ago, has gradually grown in sophistication; we are thus describing current thinking about what a data lake is and how to use it.

For companies building a data lake, it is important to think in terms of a "logical data lake" along the lines we have described, and to acknowledge that its physical implementation may be far more involved than our diagrams suggest.

If the recent history of IT has taught us anything, it is that everything needs to scale. Most companies have a series of transactional systems (the mission-critical systems) that currently constitute most if not all of the system of record. For the data lake to assume its role as the system of record, the data from such systems needs to be copied into the data lake.

Pre-assembled Data Lakes

For many companies the idea of commencing a strategic data lake project will make no commercial sense, particularly if their primary goal is, for example, only to do analytic exploration of a collection of data from various sources. Such a set of applications is unlikely to require all the governance activities we have discussed. In these circumstances the pragmatic goal will be to build the desired applications to a simpler target data lake architecture that omits some of the elements we have described.

This approach will be easier and more likely to bring success if a data lake platform is employed which is capable of delivering a data lake "out of the box." As previously noted, vendors such as Cask, Unifi or Cambridge Semantics provide such capability. They deliver a flexible abstraction layer between the Apache Stack and the applications built on top of it. They also provide other components for managing, building and enriching data lake applications.

It is possible to think of such vendors as providing a data operating system for an expansible cluster onto which you can build one or more applications. It is also feasible to build many such "dedicated" data lakes with different applications on each. A company might, for example, build an event log "data lake" for IT operations usage, a real-time manufacturing data lake, a sales and marketing "data lake" and so on.

One of the beauties of the current Apache Stack is that, with the inclusion of the powerful communications components Kafka and NiFi, it is possible to establish loosely coupled data lakes that flow data from one to another. If the data is coherently managed, simply adding clusters like this, whether located in the cloud or on premise, will allow you to gradually grow the type of system of record we have discussed in this report.

[Figure 11. Multiple Physical Data Lakes in Overview: several physical data lakes, each with its own ingest, data governance, data lake management and apps, replicate and export data to one another and to other destinations (databases, apps, etc.).]


The main point is that there can be multiple physical data lakes, each with ingest capabilities and governance processes, that together constitute a logical data lake, as illustrated in Figure 11. While the diagram implies that all the physical data lakes are running roughly the same processes, this may not be the case. There are different reasons for having multiple physical data lakes. Some might exist entirely for disaster recovery, or as a reserve resource for unexpected processing demand, or as a dedicated analyst sandbox for a specific group of users. Some may simply be established for time zone or geographical reasons.

Having multiple physical data lakes will complicate the global approach to governance and establishing a system of record. However, with an intelligent deployment of Kafka (and possibly also NiFi) to manage the replication and export of data, ensuring that the physical data lakes correspond to a logical data lake is achievable.

The system of record is likely to be logical (i.e. spread physically across multiple systems and data lakes) for several reasons. One particular cause that we believe is worth discussing is Internet of Things (IoT) applications, where the source data is created remotely and is likely to remain physically remote.

The Internet of Things

The IoT is currently in its infancy, although some IoT applications have existed for many years, particularly those involving mobile phones. The Uber and Lyft applications, for example, are complex Internet of Things applications.

However, such applications are not what normally springs to mind when the IoT is mentioned. The general idea is that there will be some physical domain - a building or many buildings, a transport network, a pipeline, a chemical plant or a factory - and this domain will be peppered with sensors, controllers or even embedded CPUs in various locations that are, at the very minimum, recording information but may also be running local applications.

[Figure 12. IoT in Overview: sensors, controllers and CPUs feed data into local depots, each with local processing; selected data flows on from the depots to a central hub for central processing.]

Figure 12 illustrates a typical scenario. Consider an example, let us say a car or a truck or an airplane engine, loaded with sensors. The data gathered locally needs to be marshaled in a local data depot, which may contain considerable amounts of data, maybe terabytes. Some of that data will probably be processed and used locally, and there may be no need to send it to a central data hub. However, there are many instances of such an object (car, truck, etc.), and there can be no doubt that the data gathered for each object will have value in aggregation.

So some of the IoT data will be collected in a central hub so that the aggregate data can be analyzed. The bulk of the data is thus distributed and will remain distributed. It may even be that some application at some point needs to access the complete collection of all the data. If so, it will be far more economic to run a distributed application across all the depots than to try to centralize the data.
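A sketch of the depot-side logic: raw readings stay local and only a small aggregate crosses the network to the hub. The reading format and the transport stub are invented for illustration.

```python
def summarize(readings: list) -> dict:
    """Reduce a depot's raw readings to the aggregate the hub needs."""
    temps = [r["temp"] for r in readings]
    return {
        "depot": "engine-17",
        "count": len(readings),
        "min": min(temps),
        "max": max(temps),
        "mean": sum(temps) / len(temps),
    }

raw = [{"temp": t} for t in (88.1, 90.4, 87.9, 91.2)]  # stays local
send_to_hub = print                                    # stand-in transport
send_to_hub(summarize(raw))  # only the aggregate leaves the depot
```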

A data lake might be involved in IoT applications like this, but its scale-out capabilities may have little importance for such applications. The scale-out applications of the IoT will probably be handled via distributed processing.

The System of Record in Summary

We have positioned the data lake as a next-generation technology concept, founded on the use of parallel processing in combination with a whole series of new software components, the majority of which are Apache projects.

In this new ecosystem, the system of record, which historically was regarded as the data of the primary transactional applications of the business, will reside (mainly) in the data lake, where the purifying processes of data governance will be applied to it on ingest.

The system of record will no longer consist entirely of the transactions (or events) of the business. It will also include data from other sources, which the business uses to perform analytics and inform its users of important information on which decisions can be based. The system of record will be, as it always was, the golden copy of corporate data and the audit trail of the IT activities of the business.


Thank You To Our Sponsors:

About The Bloor Group

The Bloor Group is a consulting, research and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both www.BloorGroup.com and www.InsideAnalysis.com for more information. The Bloor Group is the sole copyright holder of this publication.

Austin, TX 78720 | 512-426-7725
