The Data Lake Survival Guide
The Bloor Group
The Data Lake Survival Guide
The What, Why and How of the Data Lake
Robin Bloor, Ph.D. & Rebecca Jozwiak
RESEARCH REPORT
"Perhaps the truth depends on a walk around the lake.”<br />
~ Wallace Stevens<br />
RESEARCH REPORT
The Genesis of the Data Lake
In times past, when thinking about digital data, it made sense to segregate transactional data - the data captured in business applications, stored in database tables and presented by BI tools - from all other data: emails, web pages, images, video and so on. Nowadays we tend to refer to such "other data" as unstructured data. Of course it is not unstructured in the true sense of the word, but it does not have a convenient structure for storage in a relational database. It usually lives in files or more specialized databases, and until recently it was rarely analyzed.

Nevertheless it was analyzable, and software for deriving value from such data has crossed the chasm. Combine this with the reality that there is even more value in joining some of this data with regularly structured data for additional analytics, and you have a strong motive for re-imagining the idea of a data warehouse.

It was that analytical imperative, more than anything else, which gave rise to the original concept of a data lake: a data store for both species of data and, additionally, for data harvested from multiple sources external to the business, some of which was inevitably unstructured.
Data Flow Architectures
The now-aging data warehouse architecture ruled the data empire for two decades or more, and it will probably continue to play a role in the data lake architecture that supersedes it - but only as a supporting actor. Its longevity stands as a testament to its effectiveness. It was the first-generation data flow architecture and, as is the case with the data lake, its claim to fame was in providing data to BI and analytics applications. Figure 1 below provides a simplified conceptual illustration of this architecture.
Data flows from OLTP databases via Extract, Transform and Load (ETL) software to the data warehouse. Queries and other apps access the data warehouse, and further ETL software passes data into data marts against which other (BI and analytics) applications run. The data layer for business applications thus comprises transactional databases, a data warehouse and data marts consisting of subsets of data drawn from the data warehouse.

Figure 1. Conceptual Data Warehouse Architecture (data layer: OLTP Apps and OLTP DBMS feed the Data Warehouse via ETL; further ETL feeds the Data Marts; Query Apps access the warehouse and Data Mart Apps access the marts)
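The flow in Figure 1 can be sketched in miniature. The records and transformation below are invented for illustration; a real ETL pipeline would read from an OLTP database and write to a warehouse, but the extract/transform/load shape is the same.

```python
# A toy Extract-Transform-Load pass, sketched in plain Python.
# The "orders" rows and the warehouse layout are invented for illustration.

oltp_orders = [  # extract: rows as they sit in the operational database
    {"id": 1, "customer": "ACME ", "amount": "100.50"},
    {"id": 2, "customer": "acme", "amount": "20.00"},
]

def transform(row):
    # Clean and conform: trim and case-fold names, cast amounts to numbers.
    return {
        "id": row["id"],
        "customer": row["customer"].strip().upper(),
        "amount": float(row["amount"]),
    }

warehouse = [transform(r) for r in oltp_orders]  # load into the warehouse table
```

A data mart would then be a further ETL pass over `warehouse`, selecting the subset a particular BI application needs.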
Real implementations are much more complex, usually involving a data staging area where data is placed prior to ingestion into the data warehouse. This may be necessary for operational reasons, such as the data warehouse needing to limit data ingest to particular times. Alternatively, data may need to be cleaned or restructured before ingest. In some instances, because data took too long to flow through to a data mart, yet another database - called an Operational Data Store (ODS) - would be created to provide a more timely service to BI dashboards.

The need for such awkward manoeuvres might have been eliminated in time by the increasing power of hardware. However, this approach was constrained by other factors, all of which we will identify and discuss in detail later.
The Value of Data
Businesses do not remain static. Their processes change and evolve, their business models change and the markets they serve are gradually reshaped. Precisely how this happens varies, but generally we can think of there being a simple feedback loop which governs the process. We illustrate this in Figure 2.

The feedback loop has three steps:

• Plan (business process design and implementation)
• Run operational business processes
• Review operational business processes
There may be many manual elements in this; it is rarely driven by IT, although IT normally contributes. BI and analytics have the clear role of providing information either to assist operational business processes or to assist planning and change management activities.

Figure 2. Change (the feedback loop: Planning & Change Management, Operational Activity, and Business Intelligence & Analytics)

Conceptually, there is nothing new at all about this view of company behavior and the role of BI and analytics. What has drawn attention to Big Data and the BI and analytics applications it supports is that the technology parameters have shifted dramatically, as we shall discuss later in this report. If the Big Data opportunity is pursued, the efficacy of this corporate feedback loop will be improved. Organizations will be more "data driven" than before and their success will be more dependent on making effective use of technology for this purpose. This is why some businesses are now chanting the mantra, "data driven, data driven, data driven."
The triumph in 1997 of IBM's Deep Blue computer in a chess match with world chess champion Garry Kasparov, the later victory in 2011 by IBM's Watson computer system against three Jeopardy champions, and the recent (2016) victory by Google's AlphaGo against the world Go champion have demonstrated beyond argument that computer intelligence can now outstrip the most intelligent humans in well-defined contexts.
No doubt analytics technology, too, is far better at taking some kinds of decision than even the most skilled humans. Whether computer skills will soon usurp human skills in most aspects of running a business is thus worth considering. To explore this possibility, we need to define, examine and discuss the data pyramid.
The Data Pyramid

The data pyramid illustrated in Figure 3 below was first conceived in the 1990s as part of a philosophical and technological study of artificial intelligence at Bloor Research. Variations of this simple model have appeared occasionally since then. The fundamental point it makes is that data has to go through refinement before it becomes useful to people.
Figure 3. The Data Pyramid (from base upward: Data - signals, measurements, recordings, events, transactions, calculations, aggregations; Information - linked data, structured data, visualization, glossaries, schemas, ontologies; Knowledge - rules, policies, guidelines, procedures. New data enters at the base and is refined upward.)
The data pyramid has four layers, which we define precisely as follows:
Data
We define data to mean records of events or transactions from some source. A data record indicates a particular state of something, or possibly a change between two states. Such a record is a "data point." For practical IT purposes it needs to record its time of birth and other data items that identify the data's origin. While a data item of this kind may play a role in an operational business system, collections of such data, and their analysis, are required to create useful "business intelligence."
Information

For data to become information it requires context. This is a matter of making connections. A single customer record, for example, lacks context, which may be
provided by orders, contact records and so on. If you look at a set of customer data, it may yield information that is not part of any single record. As soon as you have multiple records you can calculate categories (gender, age, location, etc.). From a "business intelligence" perspective you are creating information with such activities. When you link customer data to all the orders placed by a customer, you can generate useful information about that customer's buying patterns.
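The data-to-information step described above can be shown concretely. The records below are invented for illustration: no single order record says much, but linking a customer's orders together yields a buying pattern.

```python
# Turning data into information: link a customer's orders together and
# summarize a buying pattern. All records are invented for illustration.
from collections import defaultdict

orders = [
    {"customer": "C1", "product": "widget", "amount": 30.0},
    {"customer": "C1", "product": "widget", "amount": 45.0},
    {"customer": "C2", "product": "gadget", "amount": 10.0},
]

spend_per_customer = defaultdict(float)
for order in orders:
    spend_per_customer[order["customer"]] += order["amount"]

# A single order record tells us little; the linked set is information:
# customer C1 is a repeat widget buyer with total spend 75.0.
```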
We define information to be: collections of data linked together for human consumption.

In general, BI products are software products that present information, possibly as reports or visually on dashboards. Some BI products are interactive, enabling the user to slice and dice the information. BI tools such as spreadsheets or OLAP tools can be thought of as user workbenches for the further analysis of information. The databases and data warehouses that feed such tools store information in a semi-refined form.
Knowledge

We define knowledge to be information that has been refined to the point where it is actionable.

Consider the BI tools that simply present information for decision support. The user knows their own context and consumes the information to take an action, such as resolving an insurance claim or approving a loan. The user has the knowledge of what to do, and the BI tool assists by providing information.

In the case of the BI tools that enable data exploration, the user has some idea of what they need to know, explores the information to create that knowledge and then applies it to their business context. As such, the knowledge of how to explore the data lives in the user. Such a user can accurately be described as a knowledge worker.
Knowledge can also be stored in computer systems. This is where we encounter rules-based systems and all the technology that is normally classified as AI. But knowledge manifests in computer systems in many other ways. The people who run a business create business processes to carry out particular activities. These are normally improved over time on the basis of acquired experience (feedback). They may even be fully automated - converted into software and implemented without the need for any human intervention. This is implemented knowledge.

Indeed, all software, no matter what it does, can be classified as implemented knowledge. Nevertheless, within any business there will also be other knowledge: rules, procedures, guidelines and policies that are not automated and are implemented by staff.
Data analytics, or Data Science as it has now been named, is the activity of trying to discover new knowledge from data by applying mathematical techniques to reveal previously unknown patterns. It is science of a kind, in the sense that the data scientist may formulate and then test hypotheses, although there are some brute-force techniques that can discover patterns without the need to hypothesize. It is not the only way to discover new knowledge, but it can be a very powerful and rewarding route.
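The formulate-and-test loop can be made concrete with a minimal example. The data and the hypothesis (that ad spend and sales move together) are invented for illustration; a simple Pearson correlation is one of the mathematical techniques a data scientist might reach for first.

```python
# A minimal sketch of the "formulate and test a hypothesis" loop.
# Hypothesis: ad spend and sales move together. Data is invented.

ad_spend = [10.0, 20.0, 30.0, 40.0]
sales    = [12.0, 24.0, 31.0, 45.0]

def pearson(xs, ys):
    # Pearson correlation coefficient: covariance over product of std devs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson(ad_spend, sales)  # a value near 1.0 supports the hypothesis
```

In practice the interesting cases are the failures: a low correlation sends the data scientist back to reformulate the hypothesis.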
Understanding

We define understanding to be the synthesis of both knowledge and experience. Currently this can reside only in people, and it is probably the case that it will only ever reside in people. Although computers may be able to model most human intellectual processes, they only ever execute directives. Higher human cognitive activities such as taking initiative, contemplation, pondering and abstract thought are beyond the remit of the machine.
Understanding is served by knowledge and information and, as such, analytics and BI systems can considerably enhance the capabilities of users at every level within an organization, especially those with a deep understanding of the activities and dynamics of a business. However, every business, no matter how well it exploits the opportunities that such technology presents, also has to cater for change. Human understanding is what drives and responds to change.
BI And Analytics As Business Processes

The creation of BI services and the generation of business insight through analytics are themselves business processes - or at least they should be. They are also the natural data lake applications. Hence the user constituency that is likely to benefit most from building a data lake is the community that creates and uses such capabilities.

The marketing hype surrounding Big Data, and the potential importance of data scientists to the success of the business, have tended to obscure the nature of this business process.
It can be viewed from two perspectives:

• R&D
• Software Development
Data science teams investigate aspects of the business statistically to discover useful and actionable knowledge. Such activity should properly be regarded as research into business processes (R&D). When new insights are discovered, the subsequent exploitation of those insights is clearly a software development process.

We illustrate this in Figure 4 below. The analytical exploration of data to generate knowledge or enhance existing knowledge is very similar to software development, in the sense that when new knowledge is discovered it is likely to be used to enhance computer systems. Where it differs from software development is that it is an R&D activity, and software development rarely is.
Once new knowledge is discovered and implemented, we get the situation depicted on the right-hand side of Figure 4. Business intelligence systems may be enhanced to improve decision making. This might simply be a matter of upgrading passive decision support, such as dashboard information or sending a specific alert to the user in a specific context, or it might lead to the upgrade of interactive decision support capabilities such as OLAP software or data visualization software like Tableau.
Figure 4. Analytics and BI, Development and Implementation (left, Analytics Development: a Data Scientist analytically explores data sets to produce New Knowledge; right, Analytics Implementation: New Knowledge feeds Passive Decision Support, Interactive Decision Support and Automation, each drawing on its own data set to serve users)
Alternatively, the knowledge may be automatically included in an operational system, improving it in some way. The illustration does not try to elaborate on how an analytic discovery is made operational, since this varies according to context.
The Data Lake Dynamic

The fundamental assumption of the data warehouse architecture was that there needed to be a very powerful query engine (database) at the center of the data flow. It thus suggested a centralized architecture where, first of all, data flowed to the data warehouse. It was then used in place, or it was distributed from there for use elsewhere.

The fatal flaw of this architecture was that it did not scale out well. However, this limitation did not become apparent until a whole series of forces came into play. They were:
• The need to analyze unstructured data, both external and internal. The need for this continues to grow.

• External data sources began to multiply. Particularly prominent in this was social media data, but it was by no means the only source. Until recently, selling or renting data was a niche activity, but this has ceased to be the case. An expanding amount of valuable data is now bought and sold publicly.

• Traditionally, analytics applications lived in "walled gardens" served by their own data mart. However, a data lake could serve as an analytics sandbox - a useful idea given the incursion of external data that data analysts wished to explore.
• Hadoop (and later Spark), with their Open Source software ecosystems, gained traction, and those ecosystems began growing apace. Such software is very low cost to adopt. There were significant economies available just for off-loading tired data from data warehouses to a data lake.

• The parallelism of these Open Source environments made it possible to run some analytic applications much faster than had previously been possible.

• The Hadoop/Spark ecosystems were further strengthened. New metadata capture and data cleansing products emerged. Security products appeared that strengthened the initially poor security capabilities of these environments.

• The concept of a data lake (or data hub) quickly gained acceptance as an architectural idea for data management.

• Cloud vendors, particularly Amazon with EMR and Microsoft with Azure, quickly identified the opportunity and built easily deployed data lake environments.

• The release of Open Source Kafka, in combination with Spark's micro-batch capability, caused some companies to experiment with near real-time applications, significantly expanding the data lake's areas of application towards real-time analytics. This move will no doubt continue as other open source streaming capabilities, such as Flink, mature.

• Kafka can now be regarded as a foundational data lake component (or, if you use MapR, then MapR Streams fulfils the same role). It was donated to Open Source by LinkedIn in a fully proven form. As a fast publish-subscribe capability it enables replication and disaster recovery configurations, as well as making it possible to treat multiple physical data lakes as a single logical data lake.

• The traditional data warehouse was never an ingest point in itself. Data was prepared in various ways, in a "staging area," before going to the data warehouse. In contrast, a data lake is capable of being a staging area, not just for corporate data but for all data, including unstructured data.

• The data lake provided a means of acquiring data prior to modeling the data (for inclusion in a database or data warehouse). Since a good deal of data could be processed immediately without such modeling, considerable time could be saved. For applications where "time to market" matters, the data lake delivers.

• A whole series of data governance issues began to raise their heads. Data governance had never been an important issue - in fact, it hadn't even been well defined - until gathering large collections of data into data lakes became a possibility. Then a whole series of issues, from data security through data lineage to data archiving, stumbled into the spotlight. How data should be governed has now become a pressing question.
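Kafka's role as a foundational component rests on the publish-subscribe pattern. The sketch below is a toy in-memory broker, not Kafka's API, written only to illustrate why that pattern matters for a data lake: a single published event can fan out to several independent consumers, such as a replica lake and a streaming analytics job.

```python
# A toy in-memory publish-subscribe broker. This is NOT Kafka's API -
# just an illustration of the pattern that lets one event stream feed
# several data lake consumers (replication, analytics, archive) at once.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(event)

broker = Broker()
replica, analytics = [], []
broker.subscribe("clicks", replica.append)    # e.g. feed a second data lake
broker.subscribe("clicks", analytics.append)  # e.g. feed a streaming job
broker.publish("clicks", {"user": "u1", "page": "/home"})
```

Because the producer never addresses consumers directly, adding a disaster-recovery replica or a new analytics consumer is just another subscription; this decoupling is what makes multiple physical lakes behave as one logical lake.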
Hardware Disruption

To explore the emerging reality of the data lake, we now need to consider how computer hardware technology has been changing and may continue to change. Technology change can be dramatic and, in respect of the data lake, it has been dramatically disruptive.
Moore's Law, which has remained true for CPU power over a span of more than 40 years, regularly delivers a doubling of CPU power roughly every 18 months. At the hardware level nothing else kept pace with this. DRAM (memory) managed to do so up to the year 2000, but faded after that. Disk was far worse, lagging terribly, but it is now being superseded by SSDs (solid state drives), which recently climbed onto the Moore's Law curve in the sense that their speed is roughly doubling every 18 months.
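It is worth seeing how quickly an 18-month doubling compounds. The small calculation below shows roughly a hundredfold gain per decade, which is why the adjustments described next were so far-reaching.

```python
# Compounding of Moore's Law: capability relative to a starting point,
# assuming a doubling every 18 months (1.5 years).
def moore_factor(years, doubling_period_years=1.5):
    return 2 ** (years / doubling_period_years)

per_decade = moore_factor(10)   # roughly 100x over ten years
per_40_years = moore_factor(40) # on the order of 100 million-fold
```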
Moore's Law was disruptive, but we gradually adjusted to its disruptive impact as it gifted us its regular increases in speed. It transformed PCs into throw-away items with a life span of 4 or 5 years. PCs were superseded by laptops, which in turn are giving way to tablets. It had a similar impact on some PC software, such as email, which went to live on the Internet, soon to be joined by personal applications and graphics applications.
Moore's Law had a distinctly different impact on server software. The database technology that was born in the 1980s and 1990s evolved to keep pace, and so did the ERP systems. On the server side, extra power simply made large databases and applications faster. For a while, much of the additional server power was simply squandered, with CPUs often idling. This created a market for virtual machines that enabled more applications per server.
It wasn't until about 2010 that server software applications set off in a new direction. In 2004 it ceased to be physically possible to increase CPU clock speeds to garner extra power, and thus CPUs, still capable of further miniaturization, became multicore. This trend gradually forced software development to take advantage of parallel processing.
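The shift multicore forced on software can be sketched simply: work must be expressed as independent chunks that can be executed concurrently. The thread-pool example below is illustrative only; CPU-bound Python work would typically use processes rather than threads because of the interpreter's global lock, but the decomposition pattern is the same.

```python
# The shape of parallel-friendly software: split the work into independent
# chunks, run them concurrently, combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk):
    return sum(chunk)

data = list(range(1_000))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(chunk_sum, chunks))  # 499500, same as sum(data)
```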
Multicore and its tangled web

The hardware landscape used to be simple. There was CPU, memory, disk and network technology. CPU kept getting faster, as did memory and disk, even if they didn't really keep pace. The evolution of hardware was thus reasonably predictable, but that stalled in 2004. The chip vendors (Intel, AMD, IBM, ARM, etc.) were forced to add more cores to each CPU to make their previous chips obsolete, which is how the game is played.
Such a major change in direction forced the software world to adjust. It took time. Operating systems needed to adjust, compilers needed to adjust, development software needed to adjust and, most of all, the software developers needed to adjust. It might have been relatively plain sailing if that was the whole story. But it wasn't. There were also GPU chips (Graphical Processing Units), FPGAs (Field Programmable Gate Arrays) and SoCs (Systems on a Chip).
We thought of CPUs and GPUs as different beasts of burden with different loads on their backs: general-purpose processing and graphical processing. But GPUs are not confined to graphics; they are equally suited to high-performance computing tasks (numeric workloads), including analytic processing.
This led to the development of GPGPUs (general-purpose GPUs), which are far faster than CPUs for such tasks because they have many more processor cores. A few companies (Kinetica, BlazingDB and MapD) have exploited such hardware to build database servers that sport ultra-fast databases, proving that a great deal of a database's workload can run on GPGPUs.
An FPGA (Field Programmable Gate Array) is, as the name suggests, a chip that you can add logic to after it has been manufactured - that's what "field programmable" means. In order of speed, GPUs are faster than CPUs, which are faster than FPGAs. The virtue of the FPGA is that you can configure it to run a particular program that is frequently used. Once configured for such a program, it runs like a race horse. FPGAs contain arrays of programmable logic blocks along with a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together" in different configurations. This, along with a little memory, makes it possible to purpose-build a chip for a given application.
We know of two hardware vendors that have demonstrated the virtues of combining different types of processing unit in custom-designed hardware. Velocidata, with its Enterprise Streaming Compute Appliance (ESCA), combines the power of CPU, GPU and FPGA to dramatically accelerate stream processing, particularly for ETL applications. Ryft, with its Ryft One, combines a CPU with an array of FPGAs to provide a very fast search and query capability on large volumes of data.
There is a convergence in progress between the CPU and GPU. It began with Intel in 2010, with its line of HD Graphics chips, which Intel describes as a CPU with integrated graphics. In 2011 AMD created what it called an APU (Accelerated Processing Unit), which involved the same marriage of CPU and GPU on a single chip. Such chips can be used in PCs and similar devices, eliminating the need for a separate GPU.
Following its acquisition of Altera (an FPGA and System-on-a-Chip company), Intel recently announced Xeon chips with a built-in FPGA, as well as Stratix FPGAs and SoCs. The market for the Xeon with FPGA has yet to become clear, although it is not hard to imagine companies adding analytics logic to the FPGA portion of the chip to create specialist servers, in the same way that Velocidata added ETL logic and Ryft added search logic to FPGAs.
As regards the SoC, the big market is cell phones and tablets, although one can easily imagine that SoCs will also play a role in the Internet of Things, perhaps a very major one.
Processors are evolving in a variety of ways. The x86 chip dominated the industry on the desktop and eventually on the server, to the greater glory of Intel, but it is outgunned by the ARM chip in respect of numbers (tablets, cell phones and other devices). The problem that these and other chip vendors face is that soon it will no longer be possible to miniaturize circuits any further - so Moore's Law will cease to deliver its regular bounty. Thus the CPU vendors are exploring innovative alternatives to keep the plates spinning.
The persistence of memory

RAM (DRAM and SRAM) is fast volatile memory. It is extremely fast compared to traditional spinning disk (or "spinning rust," as it is sometimes called). The figure varies according to circumstance, but a rule of thumb is 100,000 times faster for random access - less so for serial access. The best current (2017) figures for RAM speed are 61,000 MB/sec (read) and 48,000 MB/sec (write).
Currently there are three disruptive memory technologies making their way to market. At the time of writing, software and hardware vendors are experimenting with them. They differ from RAM in three respects: they are slightly slower, non-volatile and about half the price. Two of them, from Intel (called 3D XPoint) and IBM (called PCM, or phase change memory), are fairly new. The third technology, HP's memristor, was announced way back in 2008 but has been slow to develop. Nevertheless, HP is partnering with SanDisk to develop what it calls Storage-Class Memory (SCM), offering memory with distinctly similar capabilities to Intel's 3D XPoint and IBM's PCM.
From the hardware perspective it doesn't matter which of these technologies dominates, since their primary characteristics are fairly similar. However, what isn't certain is what a standard server will look like once these technologies kick in, and we won't know until it happens.

It is worth mentioning, en passant, that the volume of RAM relative to the CPU on commodity servers continues to increase, and with the advent of these new memory technologies that trend will likely persist.
SSD: RIP Spinning Disk?

Solid state disk (SSD) is replacing spinning disk almost everywhere. Spinning disk is still cheaper (by a factor of 5), but SSD is faster in most contexts (and can be a great deal faster). SSD used to be limited in volume, but last year Seagate surprised the world with a 60 Tbyte SSD and erased that complaint.
SSDs can be accessed in a parallel manner, which accelerates read speeds considerably. To achieve this, SSDs "stripe" data across arrays so that when a read operation spans data across multiple arrays, the on-disk controller issues parallel reads to fetch the data. The latency of each fetch will be constant. To leverage this, you need to be careful about how and where you write the data. Aerospike, the high-performance database, uses SSDs in this way.
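The striping idea can be modeled in a few lines. This is a toy layout model, not a real controller: it only shows how round-robin placement lets a large read be split into per-channel fetches that could be issued in parallel.

```python
# A toy model of striping: data is split round-robin across "channels"
# so a large read can be served by all channels at once. Real SSD
# controllers do this in hardware; this only illustrates the layout.
STRIPE = 4    # bytes per stripe unit
CHANNELS = 4  # independent flash channels

def write_striped(data):
    channels = [bytearray() for _ in range(CHANNELS)]
    for i in range(0, len(data), STRIPE):
        channels[(i // STRIPE) % CHANNELS] += data[i:i + STRIPE]
    return channels

def read_striped(channels, length):
    # Each channel could be read in parallel; here we just reassemble
    # the stripes in order to recover the original bytes.
    out = bytearray()
    offsets = [0] * CHANNELS
    i = 0
    while len(out) < length:
        ch = i % CHANNELS
        out += channels[ch][offsets[ch]:offsets[ch] + STRIPE]
        offsets[ch] += STRIPE
        i += 1
    return bytes(out[:length])

payload = bytes(range(32))
assert read_striped(write_striped(payload), len(payload)) == payload
```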
SSDs are not much better than spinning disk in write-heavy applications, because they need to re-write entire blocks at a time (a read, followed by an erase, followed by a write). The more sophisticated drives organize write activity to minimize this.
The Typical Server

Not so long ago, a typical server was (primarily) a CPU with memory and disk storage. Consider, then, how much more complex it could become for the software engineers who need to make it dance:
• <strong>The</strong> CPU is multicore with up to 12 cores.<br />
• Alternatively it may have fewer cores but be integrated with GPU or FPGA<br />
capability.<br />
• <strong>The</strong> CPU has three layers of cache (level 1, 2 and 3) which can be exploited,<br />
e.g. for vector processing or data compression/decompression.<br />
• Configurable memory (DRAM) has grown to the terabyte level.<br />
• A new fast access memory capability is emerging that is significantly faster than<br />
current SSDs and a little slower than memory.<br />
• SSDs are swiftly replacing spinning disk, but software engineers need to know<br />
how to best use them.<br />
It may be a while before we know what a “standard server” is going to become.<br />
Nevertheless it cannot be in the industry’s commercial interest for there to be too much<br />
variance. A consensus will emerge; what it will be is difficult to predict.<br />
<strong>The</strong> Changing of <strong>The</strong> Guard<br />
A further disturbing variable in this picture is the growing impact of the cloud.<br />
Amazon currently dominates cloud computing with 31 percent market share, with<br />
Microsoft playing second fiddle at 9 percent. As far as we are aware, both companies<br />
have projects to design and build their own chips (almost certainly based on ARM<br />
designs). Given the economies of scale of the cloud operations of both companies, it<br />
makes sense for them to do this.<br />
<strong>The</strong> impact either company could have on the future architecture of commodity servers<br />
is impossible to predict, aside from the fact that they will doubtless build infrastructure<br />
that software can easily be migrated to. It is not beyond the bounds of possibility for either<br />
company (or Google for that matter) to enter the chip market.<br />
Parallelism Playing Havoc<br />
By 2010 we, at <strong>The</strong> Bloor Group, began to observe a significant increase in the speed of<br />
server software occurring as a consequence of parallelism. That was in the early days<br />
of the Hadoop project, which had been inspired by Google’s and Yahoo’s use of scaled-out<br />
server networks.<br />
Let’s think about this. Google and Yahoo were the first two Web 2.0 businesses. <strong>The</strong>y<br />
had been forced to find their own solutions to big data problems. Searching the web<br />
was a completely new application. No developers had ever built software to solve a<br />
problem like that. First you send out spiders: software that accesses every<br />
web site, including new web sites, gathering information about new web pages and<br />
changes to old pages. After you harvest the data, you compress it to the digital limit<br />
and add it to your big data heap. That’s the relatively easy part of the problem. The<br />
hard part is updating the indexes on the huge data heap and letting the world pick at<br />
it. It was the second of these two problems that gave MapReduce its start in life.<br />
<strong>The</strong> idea of “mapping” and “reducing” was not new. It is a relatively old technique that<br />
emerged from functional programming, a 1970s programming paradigm. MapReduce,<br />
as invented by Google, was a development framework that was scalable over grids of<br />
servers and ran in a parallel manner. It provided a solution to the indexing problem.<br />
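The essence of the model can be sketched with Python’s own map and reduce, the functional-programming ancestors of the framework. Each map call could run on a different server, and because the merge step is associative the partial results can be combined in any grouping across a cluster. A toy word count, not Google’s implementation:<br />

```python
from functools import reduce
from collections import Counter

def map_phase(document: str) -> Counter:
    """Map: turn one document into partial (word, count) pairs."""
    return Counter(document.lower().split())

def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce: merge two sets of partial counts. The operation is associative,
    so partial results can be combined in any order across the cluster."""
    return left + right

documents = ["the cat sat", "the dog sat", "the cat ran"]
word_counts = reduce(reduce_phase, map(map_phase, documents), Counter())
```

Counting words stands in here for the real job of building an inverted index, but the shape of the computation is the same: independent maps followed by a merge.<br />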
It was research by Doug Cutting and Mike Cafarella at Yahoo that spawned<br />
Hadoop, with its Hadoop Distributed File System (HDFS) and MapReduce framework.<br />
Quite likely the project would not have taken flight had Yahoo not decided to donate<br />
it to Apache as an open source project with Doug Cutting in the pilot’s seat. Soon<br />
Hadoop distribution companies sprouted up: Cloudera in 2008, MapR in 2009 and<br />
Hortonworks in 2011.<br />
Under the auspices of <strong>The</strong> Apache Software Foundation (ASF), Hadoop acquired a<br />
coterie of complementary software components: Avro and Chukwa in 2009, HBase<br />
and Hive in 2010, then Pig and ZooKeeper in 2011. Soon ASF - a nonprofit corporation<br />
devoted to open source software - acquired a destiny. It provided a well-honed process<br />
for incubating the development and assisting the delivery of open source products. It<br />
wasn’t long before it was supervising over 100 such projects, about one fifth of which<br />
were Hadoop related.<br />
<strong>The</strong> commercial early adopters of Hadoop began to trickle in around 2012. <strong>The</strong> trickle<br />
soon became a stream, and then the stream became a river, and the river flowed into a<br />
lake. Hadoop and its growing band of open source jigsaw pieces were truly inexpensive.<br />
It was free software until you needed support, and if you cared to assemble a few old<br />
servers, you could prototype applications and experiment with parallel computing for<br />
almost nothing.<br />
Before Hadoop danced into the data center with its scale-out file system, nothing<br />
comparable existed in the commercial mainstream. <strong>The</strong> assumption had always been that if you had data that<br />
needed to scale out in a big way – hundreds of terabytes or petabytes or beyond – you<br />
needed to put it in a database or a data warehouse. Hadoop changed all that.<br />
YARNing towards the data lake<br />
Until the release of YARN (Yet Another Resource Negotiator), Hadoop was truly<br />
limited. Its two main constraints were that it ran just one task at a time and that software<br />
development was tied to the use of MapReduce. YARN, released in late 2013, provided<br />
a scheduler, and in the same release the enforced use of MapReduce was removed.<br />
Once Hadoop had a scheduling capability it was possible for multiple applications to<br />
share HDFS data concurrently, making it far more appropriate for many applications.<br />
Given direct access to HDFS, applications could leverage Hadoop hardware resources<br />
in any way they chose. If you were watching closely you would have noticed a slew<br />
of commercial software vendors announcing compatible software when YARN was<br />
released. And of course more followed later.<br />
Mesos, another open source scheduler, built for data center workloads, then stepped<br />
into the frame, and soon after that the Myriad project was set up to enable Mesos and<br />
YARN to work together.<br />
Once process scheduling was possible, the idea of a data lake - as a kind of data hub -<br />
took off. With the reality of multiple concurrent tasks it was definitely possible to have<br />
one process or set of processes ingesting data, another process doing analytics and<br />
a third process carrying out ETL on the data.<br />
<strong>The</strong> Spark Phenomenon<br />
Hadoop disrupted the data warehouse world, then Spark disrupted Hadoop.<br />
Described as “lightning fast,” it seemed to emerge from nowhere in 2015. But of course<br />
it didn’t. Spark began life at UC Berkeley AMPLab in 2009 and was open sourced in<br />
2010 (under a BSD license). It became an Apache project in 2013, around the time that<br />
YARN was released.<br />
For a variety of reasons it turned out to be a far better development layer than<br />
MapReduce. It was an in-memory distributed platform comprising a collection of<br />
components that could accelerate batch analytics jobs, including machine learning<br />
applications, and could also handle interactive query and graph processing. It was<br />
not built to be a stream processing capability, but was capable of doing such work (via<br />
Spark Streaming).<br />
Spark was entirely independent of Hadoop. However, it was often deployed with<br />
Hadoop, with HDFS providing the file system. <strong>The</strong> major Hadoop distros quickly added<br />
Spark to their Hadoop bundles.<br />
<strong>The</strong> Hercules Effect<br />
Yahoo started the Apache revolution when it threw Hadoop into the open source<br />
pot. This was not an altruistic gesture but an act of self-interest. It had petabytes of data<br />
stored in Hadoop against which it ran many applications. Yahoo figured it would save<br />
time and money if it let external developers contribute to Hadoop’s development and<br />
shared the bounty.<br />
This collaborative approach to software development established a trend. Facebook,<br />
with its exabyte volumes of data, was built on open source from the ground up. Imitating<br />
Yahoo, it also had gifts to give Apache: Hive, the quasi-SQL capability, and the graph<br />
processing system Giraph. Its most recent donation was Presto, the distributed SQL<br />
query engine, for which Teradata now offers commercial support.<br />
LinkedIn’s open sourcing of Kafka (a persistent distributed message queue) and<br />
Voldemort (a distributed key-value store) can be added to the list of such gifts - as<br />
can Parquet (a column-store capability developed in a collaboration between Twitter<br />
and Cloudera).<br />
<strong>The</strong>re is a huge difference between a new Hadoop component, whose development<br />
only recently started, and one that drops magically from the sky into the ecosystem,<br />
fully tested and implemented over years by a highly scaled web business. <strong>The</strong> first kind<br />
of component may take a long time to mature, while the other is born fully formed, like<br />
Hercules of Greek myth.<br />
<strong>The</strong> addition of Kafka to the Hadoop ecosystem was particularly important. Kafka<br />
provides a publish-subscribe capability that enables data to be streamed in a managed<br />
fashion from one application to another or from one Hadoop or Spark cluster to another.<br />
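The publish-subscribe pattern that Kafka implements at scale can be sketched in miniature: topics are append-only logs, and each consumer tracks its own read offset. This toy broker is for illustration only and bears no relation to Kafka’s actual API:<br />

```python
from collections import defaultdict

class MiniBroker:
    """Toy publish-subscribe broker: each topic is an append-only log and each
    subscriber keeps its own read offset, as Kafka consumers do."""
    def __init__(self):
        self.logs = defaultdict(list)  # topic -> list of messages
        self.offsets = {}              # (topic, subscriber) -> next unread index

    def publish(self, topic: str, message) -> None:
        self.logs[topic].append(message)

    def subscribe(self, topic: str, subscriber: str) -> None:
        self.offsets[(topic, subscriber)] = 0  # start from the beginning

    def poll(self, topic: str, subscriber: str) -> list:
        """Return every message this subscriber has not yet seen."""
        start = self.offsets[(topic, subscriber)]
        self.offsets[(topic, subscriber)] = len(self.logs[topic])
        return self.logs[topic][start:]
```

Because the log is never mutated, any number of independent consumers can read the same stream at their own pace, which is what makes the pattern suitable for moving data between applications or clusters.<br />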
In mid-2015, Kafka was joined by another important communications capability that<br />
goes by the name of NiFi, short for Niagara Files. <strong>The</strong> software was developed by the<br />
NSA over a period of 8 years. It had the goal of automating the flow of data between<br />
systems at scale, while dealing with the issues of failures, bottlenecks, security and<br />
compliance, and allowing changes to the data flow. Taken together, Kafka and NiFi<br />
provide a complete data flow solution that scales as far as the eye can see, and beyond.<br />
If you consider the whole Apache Hadoop stack, the collection of open source software<br />
that emerged in the wake of Hadoop, it obviously constitutes a new environment for<br />
building and deploying parallel software applications. It is remarkably inexpensive,<br />
both because of its open source nature and the commodity hardware on which it runs.<br />
We sometimes think of this stack as constituting a data layer OS, a distributed operating<br />
system for data.<br />
It doesn’t quite qualify, but it is close and it is moving in that direction.<br />
<strong>The</strong> Next Gen Stack<br />
<strong>The</strong> hallmark of the Apache Hadoop stack, in all its glory, is that it is built for parallel<br />
operation on commodity hardware. If you have experience of it, no doubt you will be<br />
able to report that some of it is high quality and some of it is less so. However all of it<br />
is under continuous development and will likely improve. It is not tightly integrated<br />
in the manner that one might expect were it provided by a single software vendor like<br />
Microsoft or Oracle.<br />
Also, which components to choose, and why, can be confusing. This is certainly the<br />
case when it comes to streaming applications. Spark can build streaming applications<br />
of a kind, but in reality it processes micro-batches quickly, which technically is not<br />
streaming. Another component, Apache Storm, has been built for true data streaming.<br />
And yet another, Apache Flink, does streaming and can also do micro-batch<br />
processing. <strong>The</strong>re is competition within the Apache stack for streaming applications<br />
and that may remain so for a while.<br />
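The distinction can be made concrete with a sketch (all names here are invented): a true streaming engine handles each event the moment it arrives, while a micro-batch engine buffers events and processes them in small groups:<br />

```python
def process_per_event(stream, handle):
    """True streaming (Storm-style): handle each event on arrival."""
    for event in stream:
        handle([event])

def process_micro_batches(stream, handle, batch_size=3):
    """Micro-batching (Spark-style): buffer events, handle them in small groups."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            handle(batch)
            batch = []
    if batch:  # flush any final partial batch
        handle(batch)
```

Micro-batching trades latency (an event waits for its batch to fill) for throughput, which is why the choice between the two approaches depends on the application.<br />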
A further potential point of confusion is the existence of three major Hadoop<br />
distributors: Cloudera, MapR and Hortonworks. <strong>The</strong>re are many similarities between<br />
these distributors in that they all pursue a similar business model that emphasizes<br />
support revenues and all provide downloadable free versions of their distributions.<br />
Hortonworks is the “pure play,” while MapR and Cloudera provide “premium”<br />
distributions to their paying customers.<br />
Technically, MapR is the most distinct, providing what it calls a converged platform<br />
that includes the MapR file system, which is read/write (rather than HDFS, which is<br />
append only), MapR Streams (a capability similar to Kafka but more sophisticated) and<br />
MapR-DB, a NoSQL database. Cloudera also has some unique components, including<br />
Cloudera Manager, Cloudera Navigator Optimizer, Cloudera Search, the Kudu storage<br />
engine, which allows read/write, and its Impala database. In contrast, Hortonworks<br />
tries to remain true to the Apache Hadoop stack.<br />
<strong>The</strong>re are other distributions, too. Cloud vendors, such as Amazon and Microsoft,<br />
provide their own Hadoop distributions. A Hadoop distribution contains many<br />
components (Cloudera’s currently has 20 for example) and others can be added. It is<br />
up to the customer to determine which components are required and that is to some<br />
extent application dependent.<br />
Those who have never experimented with Hadoop may assume that it’s relatively<br />
simple to install and get running. However it ain’t necessarily so. Aside from anything<br />
else, there’s a need to either hire experienced programmers or train existing ones in<br />
the use of a tribe of components: MapReduce, Hive, Pig, HBase, Spark and others.<br />
You are taking on a new and complex software environment, much more so than if<br />
you were implementing a Windows or Linux server. And then you are going to build<br />
applications to run on it.<br />
<strong>Data</strong> <strong>Lake</strong> Products<br />
<strong>The</strong> alternative to acquiring technical staff to understand the nuts and bolts of the<br />
Apache Hadoop Stack is to buy technology from one of the Hadoop integration vendors<br />
or, as they are also called, “data lake” vendors: Cask, Unifi, Cambridge Semantics et al.<br />
Such vendors provide a data lake capability “out of the box.” Many companies have<br />
found this a more productive way to build a data lake than building it from scratch.<br />
Consider the position of a company that wishes to build a data lake that uses, say, 20<br />
components of the Apache Stack. <strong>The</strong> issues they will inevitably face are as follows:<br />
• Technical skills: New staff with appropriate technical skills may need to be<br />
hired.<br />
• Operational Management: <strong>The</strong> server cluster will need to be monitored and<br />
managed - altering configurations, provisioning new hardware and tuning for<br />
performance as needed. <strong>The</strong> software environment for this may need to be built.<br />
• Upgrades: Upgrade management of the Apache Stack is a potential headache.<br />
<strong>The</strong> Apache Stack upgrades are not going to be as smooth as, for example, a<br />
Windows Server upgrade - simply because the Apache Stack is not under the<br />
close control that a vendor like Microsoft or IBM provides.<br />
• Standards and Integration Issues: <strong>The</strong> question here is how to align the Apache<br />
Stack with common data center standards for security, data life<br />
cycle management, etc.<br />
In reality what the data lake vendors do is provide an abstraction layer between the<br />
Apache Stack and the applications built on top of it. If this is done well then the user<br />
could, for example, migrate from Cloudera’s stack to MapR’s stack or even move the<br />
data lake applications into the cloud without concern for whether the application will<br />
function as expected.<br />
<strong>The</strong> <strong>Data</strong> Layer OS<br />
Since we introduced the concept of a data layer OS, it is worth explaining here why<br />
we do not currently believe that the Apache Stack has earned that description. If you<br />
ignore all the infrastructure software that is there to make applications possible, then<br />
we can think in terms of there being three types of applications:<br />
• OLTP applications: <strong>The</strong>se are the applications that process the events and<br />
transactions of the business.<br />
• Office applications: This category embraces communications activity from<br />
email to multimedia collaboration and includes personal applications from the<br />
word processor to graphics software. (We would even include development<br />
software here).<br />
• BI and Analytics applications: <strong>The</strong>se are the applications that analyze the<br />
business and provide feedback.<br />
Right now the Apache Stack dominates the area of BI and analytics to the point where<br />
everything else is a sideshow. However the other two areas of application are currently<br />
conspicuous by their absence from the data lake. <strong>The</strong>re are several reasons why that<br />
is so, the main one being that BI and analytics applications can profit most from the<br />
parallelism that the Apache Stack provides. Transactional applications like ERP and<br />
CRM and office applications like email do not experience such a dramatic boost from<br />
parallel processing, because they do not process such large amounts of data.<br />
In recent years the IT industry has witnessed unprecedented software acceleration<br />
in BI and analytic applications. Prior to 2010, in the application areas where it<br />
made a difference, a performance increase of 10x took at least six years, but by 2010<br />
we began to witness much faster software acceleration than that. Nowadays we<br />
sometimes encounter projects that have accelerated a previous analytical process by<br />
1000x or even 10,000x.<br />
This is almost entirely due to the effective exploitation of parallelism in conjunction with<br />
in-memory processing. If you rarely have to go to disk for data and you can scale a<br />
workload out over many servers then 1000x is not that difficult to achieve, and with<br />
the Apache Stack and commodity servers, it can be achieved at remarkably low cost.<br />
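A back-of-envelope calculation shows why. The factors below are illustrative assumptions rather than measurements:<br />

```python
# All three factors are illustrative assumptions, not measurements.
servers = 100              # workload scaled out across a commodity cluster
memory_vs_disk = 50        # speed of in-memory access vs. fetching from disk
parallel_efficiency = 0.5  # coordination overhead eats part of the gain

# Multiplying the scale-out factor by the in-memory factor, then discounting
# for imperfect parallelism, still lands comfortably beyond 1000x.
speedup = servers * memory_vs_disk * parallel_efficiency
```

With these assumptions the estimate comes out at 2500x; even halving every factor leaves a three-figure acceleration, which is why such results are no longer surprising.<br />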
<strong>The</strong> Emergence of <strong>Data</strong> Governance<br />
<strong>Data</strong> is not what it used to be. Let’s take this on board. <strong>The</strong> data we knew and loved<br />
was pretty much static, sitting in databases or files fairly close to the applications that<br />
used it. It wasn’t until the blooming of BI in the late 1990s that the industry felt obliged<br />
to let data flow around. That desire gave birth to the data warehouse, into which data<br />
flowed and from which data trickled into data marts.<br />
Things began to change. Every year that passed witnessed an increase in the need<br />
for data movement. With the advent of complex event processing (CEP) technology to<br />
process data streams of stock and commodity prices for trading banks, we got the first<br />
hints of real-time data processing, and as time marched forward, such activity grew.<br />
It was obvious to web businesses like Yahoo and Amazon that we lived in a real-time<br />
world. <strong>The</strong> web logs that drove their businesses were the digital footprints of users<br />
visiting their web sites. A transaction-based world was giving way to an event-based<br />
world.<br />
Had we thought about it at the time, we might have realized that it wasn’t just web<br />
sites that lived and died by events. Network devices and operating systems and<br />
databases and middleware and applications were all happily recording their events in<br />
log files that squandered disk space until they were eventually deleted. <strong>The</strong> computer<br />
networks of the world were already event oriented. <strong>The</strong>y were born that way, but the<br />
applications we built were not.<br />
<strong>The</strong> first software vendor to notice this was Splunk. It detected and mined a thick<br />
vein of gold in the log files that litter the corporate networks. <strong>The</strong> advent of Splunk<br />
was a boon to IT departments that often needed to consult collections of log files to<br />
identify the causes of application error. Security teams were also appreciative of the<br />
technology as it helped them to hunt network intruders and vagrant viruses. That we<br />
were entering an event-based world was transparently obvious to users of Splunk, since<br />
all the data they gathered was event data.<br />
But it was still not obvious to others, and even now with the advent of streaming<br />
analytics, it is still not obvious to everyone.<br />
<strong>The</strong> Dawn<br />
You would be hard put to find any references to the term “data governance” before<br />
the year 2000. In fact you won’t find many references to it prior to 2005. And let’s be<br />
clear about this: it was not that businesses did not care about the governance of data in<br />
earlier years. It’s just that pretty much all corporate data was internal data generated<br />
by the business and most of it stayed put, where it was born, or if it moved anywhere<br />
it made its way into a data warehouse.<br />
Under those circumstances governing the data was not such a pressing need. But, as<br />
we have described, data found the need to move much more often and the volume of<br />
data exploded.<br />
Let us home in on one of the faults of our current use of data. It can be thought of as a<br />
fundamental problem that needs to be fixed:<br />
<strong>Data</strong> is not self-defining!<br />
It’s easy to understand why. Some data is created by programs which expect to be the<br />
only programs ever to use it. Within the program that uses it, the data is adequately<br />
defined for use and the program understands it perfectly. As there is no intention for<br />
the data to be used by any other program, there is no need to explain the meaning of<br />
the data when it is stored.<br />
At the dawn of IT, back in the punched card era, all data was treated in that way, and it<br />
was not until the advent of the database - a software technology built to enable data sharing<br />
- that anything changed. <strong>The</strong> data definitions in the database, the schema, applied to all<br />
the data but ceased to apply once the data was exported from the database. It was, at<br />
best, a halfway solution to the data definition problem.<br />
Now that the era of event processing has arrived it is possible to be more specific<br />
about how to make data self-defining. What follows is a suggested list of data items<br />
that could be added to every event record, along with brief definitions:<br />
• Date-Time: <strong>The</strong> date and time (GMT) that the data was created.<br />
• Geographic location: <strong>The</strong> geographical location of the device that created the<br />
data. For precision we might think here in terms of the map reference (latitude<br />
and longitude) and possibly a three-dimensional reference for the point within<br />
the building that occupies that map reference. <strong>Data</strong> created off-planet would<br />
probably use a reference point based on the sun.<br />
• Source device: <strong>The</strong> precise identity of the device that created the data (server,<br />
PC, mobile phone, etc.)<br />
• Device ID: Specific ID of the source device.<br />
• Source software: <strong>The</strong> precise identity of the software that created the data.<br />
• Derivation: An indication of whether the data was derived and if so, how.<br />
• Creator: <strong>The</strong> owner or enabler of the device and software that created the data.<br />
• Owner: <strong>The</strong> owner of the data, who may be different from the creator.<br />
• Permissions: Security permissions for usage of the data, enabling its use by<br />
specific programs.<br />
• Status: Whether this is the master copy of the data or a valid replica. (Replicas<br />
will be kept for back-up at the very least but may also be created for multiple<br />
concurrent usage.)<br />
• Metadata: Associated directly with each data value is a metadata tag or reference<br />
identifying the meaning of the data.<br />
• Master Audit Trail: Recording of who has used the data, when, where and how.<br />
• Archive Flag: <strong>The</strong> date when the data is scheduled for archive or deletion.<br />
This is not necessarily an exhaustive list. Note that the scheme described here assumes<br />
that data is never updated. Under such a regime, values that are<br />
known to be wrong would be corrected in a way analogous to how postings are<br />
reversed out in an accounting ledger.<br />
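To make the scheme concrete, here is one possible rendering of such a record in Python, together with the ledger-style correction just described. Every field name and value here is illustrative; no standard for self-defining data yet exists:<br />

```python
from datetime import datetime, timezone

def make_event(value, metadata_tag, device_id, owner):
    """Build a self-defining event record carrying its own provenance fields.
    The field names are our own rendering of the list above."""
    return {
        "date_time": datetime.now(timezone.utc).isoformat(),
        "geo_location": (30.2672, -97.7431),    # illustrative lat/long
        "device_id": device_id,
        "source_software": "sensor-agent/1.0",  # illustrative identifier
        "derivation": None,                     # raw data, not derived
        "owner": owner,
        "permissions": {"view": [owner]},
        "metadata": metadata_tag,
        "audit_trail": [],
        "value": value,
    }

def correct_event(log, index, corrected_value):
    """Never update in place: append a reversal of the bad record and then a
    corrected record, as postings are reversed out in an accounting ledger."""
    bad = log[index]
    log.append(dict(bad, derivation=("reversal", index)))
    log.append(dict(bad, value=corrected_value, derivation=("correction", index)))
```

Because the original record is never touched, the full history of the value - including the mistake - remains available for audit.<br />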
It is clear in perusing this collection of definition data that it constitutes a significant<br />
amount of data in itself. How such data is stored is a physical implementation issue -<br />
clearly there are many ways in which much of this data could be compressed and/<br />
or stored separately from the data values, depending on how it is being used or is intended<br />
to be used.<br />
<strong>The</strong> function of all this additional data is to establish indelibly:<br />
• <strong>The</strong> provenance of the data.<br />
• <strong>The</strong> ownership of the data and its allowed usage.<br />
• <strong>The</strong> history of its usage.<br />
• <strong>The</strong> schedule for its archive or deletion.<br />
• <strong>The</strong> meaning of the data.<br />
<strong>The</strong>se can also be regarded as five dimensions of data governance. We will discuss<br />
these one by one a little later. However before we do that, we need to define, in a<br />
general way, what we consider a data lake to be.<br />
A General Definition of <strong>The</strong> <strong>Data</strong> <strong>Lake</strong><br />
It is our view that the <strong>Data</strong> <strong>Lake</strong> can and should be the system of record of an<br />
organization. Specifically, that means that the data storage system the data lake<br />
embodies becomes the authoritative data source for all corporate information.<br />
We are speaking here of a “logical” data lake. For reasons of practicality it may be<br />
necessary to have multiple physical data lakes, in which case the system of record is<br />
constituted by all those physical data lakes taken together.<br />
Not only should the <strong>Data</strong> <strong>Lake</strong> be the system of record, it should also be the<br />
logical location for the implementation of data governance. <strong>Data</strong> governance rules<br />
and procedures need to have a natural point of enforcement, and the logical place for<br />
such enforcement is the system of record: the data lake.<br />
Figure 5 depicts the data lake in terms of the primary software components that are<br />
required: an ingest capability that can harvest both data streams and batches of data,<br />
the data storage capability, data governance so that data is governed on entry to the<br />
lake, data lake management so that the data lake environment is monitored and kept<br />
operational, and ETL capabilities for transporting data to other locations. In addition,<br />
we also have the applications that run on the lake.<br />
<strong>The</strong> data lake is a staging area for the entry of new data into the system of record,<br />
whether it was created within the organization or elsewhere. <strong>The</strong>re is an imperative for<br />
all governance processes to be applied to the data at that point of entry, if possible, and<br />
for data to be available for use once those processes have completed.<br />
Figure 5. <strong>Data</strong> <strong>Lake</strong> Overview<br />
We will discuss these processes one by one:<br />
1. Assigning data provenance and lineage. Accurate data analytics needs certainty<br />
about the provenance and lineage of the data. Prior to the dawn of the “big data<br />
age” this was rarely a problem, as the data either originated within the business or came<br />
from a traditional and reputable external source. As the number of sources of data<br />
increases, the difficulties of provenance/lineage will increase.<br />
We expect the ability of data to self-identify for the sake of provenance to increase,<br />
although it may be many years before a general standard is agreed. For the sake of<br />
lineage and provenance each event record would need to record the time of creation,<br />
geolocation of creation, ID of creating device, ID of the process/app which created the<br />
data, ownership of the data, the metadata, and the identity of the data set or grouping<br />
it belongs to. To this we can add the details of derivation if the data was derived in<br />
some way, which would allow lineage to be deduced.<br />
Where such precise details are absent on ingest to the data lake, it should be possible, at<br />
least, to know and record where the data came from and how. As very few data records<br />
self-identify as described, some compromise is inevitable until standards emerge. With<br />
the advent of the Internet of Things, the need for self-defining data will increase.<br />
2. <strong>Data</strong> security. <strong>The</strong> goal of data security is to prevent data theft or illicit data usage.<br />
Encryption is one primary dimension of this and access control is the other. Let us<br />
consider encryption first.<br />
Encryption needs to be planned and, ideally, applied as data enters the lake. In a<br />
world of data movement, security rules that are applied need to be distributable to<br />
wherever encrypted data is used. Ideally with encryption, you will encrypt data as<br />
soon as possible and decrypt it at the last moment, when it needs to be seen “in the<br />
clear.”<br />
<strong>The</strong> reason for this approach is twofold. First, it makes obvious sense to minimize the<br />
time that data is in the clear. Secondly, encryption and decryption make heavy use of<br />
CPU resources and hence minimizing such activity reduces cost.<br />
To implement this approach, format-preserving encryption (FPE) is necessary. <strong>The</strong><br />
point about FPE is that it does not change the characteristics of the data, such as format<br />
and sort order; it simply disguises the data values. <strong>The</strong>re are FPE standards and vendors<br />
that specialize in their application.<br />
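As a purely illustrative toy (this is NOT real FPE and is not cryptographically secure; production systems use standardized schemes such as NIST FF1, usually via a vendor library), a keyed digit substitution shows the format-preserving idea: the disguised value keeps the same length, separators and character class as the original.

```python
import hashlib

def _digit_map(key: str) -> dict:
    # Derive a digit permutation from a secret key (illustrative only).
    digest = hashlib.sha256(key.encode()).digest()
    digits = list("0123456789")
    # Fisher-Yates shuffle driven by key-derived bytes.
    for i in range(9, 0, -1):
        j = digest[i] % (i + 1)
        digits[i], digits[j] = digits[j], digits[i]
    return {str(d): digits[d] for d in range(10)}

def mask(value: str, key: str) -> str:
    """Disguise digits while preserving format (length, dashes, digit class)."""
    table = _digit_map(key)
    return "".join(table.get(ch, ch) for ch in value)

def unmask(masked: str, key: str) -> str:
    inverse = {v: k for k, v in _digit_map(key).items()}
    return "".join(inverse.get(ch, ch) for ch in masked)

token = mask("4111-1111-1111-1234", "secret-key")
assert len(token) == len("4111-1111-1111-1234") and token.count("-") == 3
assert unmask(token, "secret-key") == "4111-1111-1111-1234"
```

Because the masked value still "looks like" a card number, downstream code that validates formats or sorts fields keeps working without ever seeing the data in the clear.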
<strong>Data</strong> access controls require the existence of a reasonably comprehensive identity<br />
management system and access rights associated with all data. Access rights may<br />
distinguish between the right to view and the right to process the data using a particular<br />
program or process.<br />
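The view/process distinction can be sketched as a tiny access-control check (all names here are hypothetical illustrations, not any product's API):

```python
# Access rights per (user, dataset): "view" allows seeing values in the clear,
# "process" allows running data through approved programs only.
ACL = {
    ("alice", "customer-data"): {"view", "process"},
    ("bob",   "customer-data"): {"process"},   # may run jobs, never view raw values
}
ALLOWED_PROGRAMS = {("bob", "customer-data"): {"churn-model"}}

def can_view(user: str, dataset: str) -> bool:
    return "view" in ACL.get((user, dataset), set())

def can_process(user: str, dataset: str, program: str) -> bool:
    if "process" not in ACL.get((user, dataset), set()):
        return False
    allowed = ALLOWED_PROGRAMS.get((user, dataset))
    return allowed is None or program in allowed

assert can_view("alice", "customer-data")
assert not can_view("bob", "customer-data")
assert can_process("bob", "customer-data", "churn-model")
assert not can_process("bob", "customer-data", "ad-hoc-query")
```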
3. <strong>Data</strong> Compliance. <strong>Data</strong> compliance regulations are now common, and likely to<br />
become increasingly complicated with the passage of time. <strong>The</strong> EU is hoping to establish<br />
a General <strong>Data</strong> Protection Regulation (GDPR) for personal data that is implemented<br />
across the world and has devoted considerable effort to formulating rules for that. GDPR<br />
will become law within the EU and will likely take effect in many other countries. <strong>The</strong><br />
practical consequence is that businesses worldwide need to implement these rules. <strong>The</strong><br />
<strong>The</strong> EU legislation does not care where the data is held, so its jurisdiction is worldwide<br />
in respect of anyone living in the EU.<br />
You can think of such regulations as international whereas healthcare regulations<br />
(HIPAA in the US) and financial compliance rules tend to take effect at a national level.<br />
To this collection of compliance regulations you can add non-binding sector compliance<br />
initiatives and best practice rules that an individual business might establish. It should<br />
be obvious that applying such rules is difficult unless the data is self-defining to some<br />
degree.<br />
4. <strong>Data</strong> integrity. Once you eliminate data updates, the possibility of data corruption<br />
diminishes. Nevertheless it still exists and may never be entirely eliminated; software<br />
errors can certainly cause it. <strong>The</strong> need to specify whether data is a copy or the actual<br />
source is necessary if there is going to be any possibility of auditing the use of replicated<br />
data. Test data needs to know that it is test data. <strong>Data</strong> back-ups also need to know they<br />
are back-ups. Disaster recovery needs to restore all data to its original state. All of this<br />
applies to data in motion as well as data at rest.<br />
5. <strong>Data</strong> cleansing. Newly ingested data may be inaccurate, especially if we have no<br />
control over and little knowledge of how it was created. No data creation or collection<br />
processes are perfect. This is even true of sensor data, which we may think of as reliable,<br />
because it comes from an automated source; sensors are also capable of error. <strong>Data</strong> may<br />
also be corrupted by subsequent processes after creation.<br />
<strong>Data</strong> needs to be cleaned as soon as possible after ingest - where that’s feasible. <strong>The</strong>re<br />
are exceptional situations, for example where the requirement is to process a data<br />
stream as it is ingested. <strong>The</strong>re is no time available for data cleansing, so the streaming<br />
application must allow for possible data error. A real world parallel to this is the way<br />
that news is processed. Sometimes false news reports are aired because some editor<br />
thought that the news was too “valuable” not to air it immediately and the usual<br />
verification process was skipped. It is corrected later, if it turns out to be wrong. Any<br />
urgent processing of uncleaned data will normally allow for its poor quality and the<br />
possible need for later correction.<br />
<strong>Data</strong> cleansing standards are a natural part of data governance. However there is no<br />
silver bullet for cleaning data. <strong>The</strong>re are obvious tests that can be done such as checking<br />
for logically impossible values, and you can formulate rules to detect unlikely values.<br />
It is possible to cleanse some data automatically, and there are some well-designed<br />
tools for this from Trifacta, Paxata, Unifi and others. But no matter how effective the<br />
cleansing software, it requires human supervision and intervention. Cleansing can<br />
thus be a slow process.<br />
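The kinds of tests described — checks for logically impossible values and rules for unlikely ones — can be sketched as simple validators (the fields and thresholds here are invented for illustration):

```python
def check_record(rec: dict) -> list:
    """Return a list of data-quality flags; an empty list means the record passed."""
    flags = []
    # Logically impossible values: hard errors.
    if rec.get("age", 0) < 0:
        flags.append("impossible: negative age")
    if rec.get("percent_complete", 0) > 100:
        flags.append("impossible: percentage over 100")
    # Unlikely values: warnings that merit human review.
    if rec.get("age", 0) > 120:
        flags.append("unlikely: age over 120")
    return flags

assert check_record({"age": 34, "percent_complete": 80}) == []
assert "impossible: negative age" in check_record({"age": -5})
assert "unlikely: age over 120" in check_record({"age": 140})
```

Note the two tiers: impossible values can be rejected automatically, while merely unlikely ones are exactly where the human supervision mentioned above comes in.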
6. <strong>Data</strong> reliability. As far as it is possible to ensure, data needs to be accurate and<br />
also to be checked for accuracy regularly. <strong>Data</strong> values can be corrupted in various<br />
ways: a value can be changed by a hacker for fraudulent reasons or even sabotage. It can<br />
be overwritten at some point in its life. It can be corrupted “in flight,” although this<br />
is rare because of communications error-checking procedures (in-flight corruption is<br />
most likely due to hacking). It can be corrupted by any software that rewrites the data. It can<br />
be corrupted by database error (DBA error). To deal with such possibilities, some form<br />
of checksum integrity can be applied.<br />
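Checksum integrity of the kind mentioned can be sketched with a standard hash: store a digest when the record is written, recompute it on read, and any silent change is detected.

```python
import hashlib
import json

def checksum(record: dict) -> str:
    # Canonical serialization so logically identical records hash identically.
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

record = {"order_id": 1001, "amount": 250.0}
stored_digest = checksum(record)          # persisted alongside the record

# Later, on read: verify before trusting the values.
assert checksum(record) == stored_digest

tampered = {"order_id": 1001, "amount": 2500.0}
assert checksum(tampered) != stored_digest   # corruption or tampering detected
```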
7. Disambiguation. Disambiguation is an aspect of data cleansing and data reliability.<br />
It refers to the removal of ambiguities in the data. It is a process that would be unlikely<br />
to be applied when data enters the data lake, since it can be a time-consuming process.<br />
<strong>The</strong>re is a particular problem with data ambiguity when it comes to people. <strong>The</strong>re is no<br />
bullet-proof global identity system, so identity theft is a fact of life.<br />
<strong>The</strong>re is no standard for people’s names. Sometimes just the first name and surname is<br />
asked for. Sometimes a middle initial or a middle name is asked for. Sometimes the full<br />
name. <strong>The</strong> name is a poor identifier. People can change their names legally and women<br />
change their names by marriage. People sometimes disguise their names deliberately<br />
for fraudulent reasons. But they may do so legitimately (in an effort to anonymize<br />
their data). Some attributes change (address, telephone number, etc.) and usually do so<br />
without the information being easily gathered. To complicate the picture, the structure<br />
of the customer entity changes over time. For example, social media identities (on<br />
Twitter, etc.) were born only recently.<br />
<strong>The</strong> consequence of this is that for many businesses cleansing customer data also<br />
means disambiguating the data. <strong>The</strong>re are few software tools that are effective at<br />
disambiguation. Novetta has this capability, and IBM also, but they are the only two<br />
software providers we know of that do.<br />
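A toy flavor of the matching such tools perform (the similarity threshold is an arbitrary illustration; real entity-resolution products weigh far richer evidence than name similarity alone):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    # Normalize case and whitespace, then compare character sequences.
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def maybe_same_person(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag candidate matches for disambiguation review."""
    return name_similarity(a, b) >= threshold

assert maybe_same_person("Jon A. Smith", "Jon A Smith")
assert not maybe_same_person("Jon Smith", "Rebecca Jozwiak")
```

Even this crude measure shows why disambiguation is slow: a high score only nominates a candidate pair; confirming it requires corroborating attributes (address history, dates of birth, and so on) and often a human decision.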
8. Audit trail of data usage. A record of who used what data and when needs to be<br />
maintained both for security purposes and for usage analytics. Usage analytics play a<br />
part in query optimization as well as in data life-cycle management.<br />
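A minimal sketch of such an audit trail, feeding both the security view and the usage-analytics view:

```python
import time
from collections import Counter

audit_log = []

def record_access(user: str, dataset: str, action: str) -> None:
    """Append a who/what/when entry for every data access."""
    audit_log.append({"ts": time.time(), "user": user,
                      "dataset": dataset, "action": action})

record_access("alice", "customer-data", "query")
record_access("bob", "customer-data", "query")
record_access("alice", "sensor-logs", "export")

# Security view: who touched a given dataset?
users = {e["user"] for e in audit_log if e["dataset"] == "customer-data"}
assert users == {"alice", "bob"}

# Usage-analytics view: which datasets are hot? (feeds life-cycle decisions)
usage = Counter(e["dataset"] for e in audit_log)
assert usage["customer-data"] == 2
```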
9. <strong>Data</strong> life-cycle management. Few companies have formally implemented a general<br />
data life cycle strategy. More likely is that they have a strategy for some data, such as<br />
data covered by compliance regulations, but not all data. Others may have no strategy<br />
at all.<br />
With analytical applications in particular, the need to manage data life cycles is<br />
important, because much of the data used in data exploration may eventually be<br />
discarded as worthless. <strong>The</strong>re is no point in retaining the data, beyond recording that<br />
it was once explored. As data lakes are also used for archive, the use of a data lake<br />
creates an opportunity to implement or tighten up the procedures around data life cycle<br />
management. Life-cycle management can be thought of as the strategy for moving data<br />
to least cost locations as its usage diminishes. Deletion is a possible destination in this.<br />
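Seen that way, a life-cycle policy is just a rule mapping usage to a destination. A hypothetical sketch (the tier names and thresholds are invented for illustration):

```python
def choose_tier(days_since_last_access: int, under_compliance_hold: bool) -> str:
    """Map diminishing usage to a least-cost destination."""
    if under_compliance_hold:
        return "archive"            # must be retained regardless of usage
    if days_since_last_access <= 30:
        return "hot"                # fast storage, frequently queried
    if days_since_last_access <= 365:
        return "cold"               # cheaper storage, rarely queried
    return "delete"                 # deletion is a possible destination

assert choose_tier(5, False) == "hot"
assert choose_tier(200, False) == "cold"
assert choose_tier(1000, False) == "delete"
assert choose_tier(1000, True) == "archive"
```

In practice the `days_since_last_access` input would come from the usage analytics described under the audit trail above.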
Metadata and Schema-on-read<br />
<strong>The</strong> only aspect of governance that we have not yet discussed is metadata management.<br />
<strong>The</strong> metadata situation is complex and thus we are devoting more time to it than to<br />
other aspects of data governance. In overview, the situation is simple: since metadata<br />
determines the meaning of data, the natural preference is for metadata to be as complete<br />
as possible as soon as possible, so the possibility of data being misinterpreted is<br />
minimized.<br />
However, one of the loudly proclaimed benefits of the data lake is that “there is no need<br />
to model data before ingesting it.” This contrasts significantly with the data warehouse<br />
situation, where a great deal of data modelling effort is required before data is allowed<br />
into the warehouse. <strong>The</strong> alternative to data modelling is called schema-on-read.<br />
With schema-on-read, the process that reads the data for the first time determines the<br />
metadata. <strong>The</strong> work involved varies according to data source. Some data such as CSV<br />
files and XML files define the metadata, and thus it’s possible to know the metadata<br />
as the data is read. Other data may not be so convenient. However, some data formats<br />
can be recognized from the data and also some metadata may be deducible from data<br />
values. This is how products like Waterline can automatically determine metadata<br />
values. In other circumstances human input may be required to determine metadata.<br />
Once the metadata is determined it can be stored in a metadata repository.<br />
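The flavor of this automatic metadata discovery can be sketched: read a sample of raw values and deduce column types (a crude illustration of what products in this space do with far more sophistication):

```python
import csv
import io

def infer_type(values) -> str:
    """Deduce a column type from sample values (crude schema-on-read)."""
    def all_parse(cast):
        try:
            return all(cast(v) is not None for v in values)
        except ValueError:
            return False
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"

raw = "id,amount,city\n1,19.99,Austin\n2,5.00,London\n3,12.50,Berlin\n"
rows = list(csv.DictReader(io.StringIO(raw)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
assert schema == {"id": "integer", "amount": "float", "city": "string"}
```

Nothing here was modelled in advance: the schema is discovered when the data is first read, and the result is exactly what would be deposited in the metadata repository.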
With schema-on-read the end goal is not to establish an RDBMS-type data model<br />
which defines the relationships (foreign key relationships) between data sets. For many<br />
BI and analytics applications it is not required, provided the user can specify the metadata. <strong>The</strong><br />
schema-on-read approach means that any Master <strong>Data</strong> Management (MDM) process<br />
that maintains a master data model of all corporate data will probably need to be<br />
adjusted.<br />
<strong>The</strong> assumption of data modelling is that by applying various rules and a little<br />
common sense you can provide a data model (usually an ER model) that is suited to<br />
all possible uses of the data. <strong>The</strong> truth is that in almost all circumstances you cannot. It<br />
will be imperfect, at best.<br />
<strong>The</strong> reality of the situation is this:<br />
• <strong>The</strong> time taken to model the data, either in the beginning or when new data<br />
sources are added or if errors are found in the model, constitutes a definite and<br />
possibly large cost to the business in respect of time to value.<br />
• <strong>The</strong> modeler has to try to anticipate all new data sets that may appear later,<br />
so that the model does not require significant rework when new data sets are<br />
added. This can only be guessed at and rework is sometimes required.<br />
• <strong>Data</strong> (in a data lake) is intended to be a shared asset and will be shared by<br />
groups of people with varying roles and differing interests, all of whom hope<br />
to get insights from the data. To model such data means trying to allow for<br />
every constituency in advance. Possibly this will result in a “lowest common<br />
denominator” schema that is an imperfect fit for anyone. This problem gets<br />
worse with more data sources, higher data volumes and more users.<br />
• With schema-on-read, you’re not glued to a predetermined structure so you can<br />
present the data in a schema that fits reasonably well to the task that requests<br />
the data.<br />
By employing schema-on-read you get value from the data as soon as possible. It does<br />
not impose any structure on the data because it does not change the structure that the<br />
data had when loaded. It can be particularly useful when dealing with semi-structured,<br />
poly-structured, and unstructured data. It was difficult and often impossible to model<br />
some of this data and ingest it into a data warehouse, but there is no problem getting<br />
it into the data lake.<br />
In general, schema-on-read allows for all types of data and encourages a less rigid<br />
organization of data. It is easier to create two different views of the same data with<br />
schema-on-read. Also, it does not prevent data modelling, it just makes it necessary to<br />
defer it.<br />
<strong>The</strong> <strong>Data</strong> Catalog and Master <strong>Data</strong> Management (MDM)<br />
<strong>The</strong>re needs to be a catalog for all the data in the lake. It is important, and will become<br />
increasingly important, for the data catalog to be as rich as possible, embodying as much<br />
“meaning” as possible. It may help here if we explain what we mean by “data catalog,”<br />
as the term is open to interpretation.<br />
<strong>The</strong> data catalog for an operating system (such as Windows or Linux) is the file system.<br />
<strong>The</strong> metadata it stores for each file is incomplete, as its primary purpose is to enable<br />
programs to find and access files. <strong>The</strong> programs themselves will know the layout of<br />
the data and thus how to interpret the values in any given file. This arrangement is not<br />
ideal because the data cannot be shared with programs that do not understand the file<br />
structure. It is adequate only for data that never needs to be shared.<br />
<strong>The</strong> data catalog for a database is the schema. <strong>The</strong> schema defines the structure of all<br />
the physical data held in the data sets or tables of the database and seeks to provide<br />
a logical meaning to each data item by attaching a label to it (Customer_ID, Amount,<br />
Date_of_Birth, etc). <strong>The</strong> amount of meaning that can be derived from the data catalog<br />
depends upon the specific database product. For relational databases the catalog<br />
roughly reflects the meaning that is captured in an ER Diagram, where the relationships<br />
between specific data entities are indicated. With NoSQL and document databases the<br />
situation is similar, although how the schema is used varies from product to product.<br />
With an RDF database (sometimes called a Semantic <strong>Data</strong>base) the catalog can record<br />
a greater level of meaning. This is an area where Cambridge Semantics, with its Smart<br />
<strong>Data</strong> <strong>Lake</strong> solutions, currently excels. <strong>The</strong> simple fact is that semantic technology<br />
allows you to capture much more meaning for the data catalog than you can using ER<br />
modelling.<br />
A traditional data warehouse metadata catalog can be complex, but nothing like as<br />
complex as a data lake catalog, which may include many sources of unstructured<br />
data (document data, graph data) that may require ontologies (semantic structures) to<br />
accurately define the metadata.<br />
<strong>The</strong> point of the data lake’s data catalog is to provide users (and programs) with a<br />
data self-service capability. On a simple level, if you compare the pre-data-lake world<br />
of data warehouses and data marts to the data lake world, two obvious facts emerge:<br />
• <strong>The</strong> data lake can hold every kind of data (unstructured, semi-structured,<br />
structured) allowing it to provide a more comprehensive data service to users.<br />
• <strong>The</strong> data lake can scale out, almost indefinitely, to accommodate far more data<br />
than a data warehouse ever dreamed of doing.<br />
It is best to think of metadata enrichment as an ongoing process. <strong>The</strong> use of<br />
schema-on-read may mean that the data catalog is incomplete when a data set (or file) first<br />
enters the lake. However that will change as soon as the data is used. <strong>The</strong> store of<br />
metadata may be further enriched by usage, and formal modelling activity may be<br />
carried out to assist in that.<br />
<strong>The</strong> idea of Master <strong>Data</strong> Management is not dead, but it has morphed over the years.<br />
<strong>The</strong> original idea was to arrive at a “single version of the truth,” a holy grail that none<br />
of the knights of the round table in any organization ever managed to feast their eyes<br />
on. Nevertheless, there is sense in trying to link together all the data of an organization<br />
within a metadata driven model that is comprehensible to business users and enables<br />
data self-service. It should be possible to define the business meaning of everything<br />
that lives within the data lake. This is an area where we expect semantic technology to<br />
make an invaluable contribution.<br />
<strong>The</strong> Cloud Dynamic<br />
It could be argued that data lakes are cloud-neutral. Depending on circumstance, the<br />
cloud may prove attractive to companies involved in data lake projects. Certainly it is<br />
likely to be appropriate for building prototypes, with the idea of later migrating back<br />
into the data center.<br />
As regards data lakes, the economics of the cloud needs to be carefully considered.<br />
With most cloud vendors you will pay rent for all the data that lives in the cloud - and<br />
if a good deal of that data is never accessed, it will almost certainly be much less<br />
expensive to keep that data on premises.<br />
That’s the downside of the cloud, but the cloud still has its characteristic upside. It’s the<br />
go-to solution when there’s a need for instant extra capacity, whether for storing data<br />
or processing it.<br />
<strong>Data</strong> <strong>Lake</strong> Architecture<br />
Let’s begin with the idea that the data lake is the system of record. And let’s be clear<br />
what we mean by this. <strong>The</strong> system of record is the system that records all the data that<br />
is used by the business. It also holds the golden copy of each data record.<br />
For simplicity, the system of record should be thought of as a logical system. It may<br />
be possible to implement it on a single cluster of servers, but this is not a requirement<br />
and should not be a goal. In practice the whole configuration will probably involve<br />
multiple clusters, if for no other reason than to provide disaster recovery.<br />
<strong>The</strong> system of record should also be the system where governance processes are<br />
applied to data. Clearly, data needs to be subject to governance wherever it is used<br />
within the organization. Some governance processes, particularly data security, need<br />
to be applied to data as soon as possible after its creation or capture. For that reason,<br />
the data lake will inevitably be the landing zone for external data brought into the<br />
business, so governance processes can quickly and easily be applied to it. <strong>Data</strong> created<br />
within the organization should be passed to the data lake immediately after creation so<br />
that governance processes can be applied.<br />
It is best to think of the data lake as a store of event records. We can think of it in<br />
this way: events are atoms of data and transactions (the traditional paradigm with its<br />
traditional data structures) are molecules of information. <strong>The</strong> analogy works reasonably<br />
well.<br />
Consider for example a web site visit. A user lands on site, clicks through a few pages,<br />
decides to buy something, enters credit card details, and clicks on the “confirm” button.<br />
In transactional terms we may think of this as a purchase: a molecule of data, that’s<br />
quickly followed by a delivery transaction, another molecule of data to record.<br />
But in reality it’s a stream of events, a series of atoms of data. Every user mouse<br />
action creates an event. <strong>The</strong> computer records these events and responds to each one<br />
immediately; it displays new web pages or expands text descriptions or whatever.<br />
<strong>The</strong> purchase confirmation is just another event, distinguished only by the fact that it<br />
generates a cascade of other related events in the application or in other applications.<br />
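The atoms-and-molecules analogy can be made concrete: the purchase "molecule" is simply assembled from the event "atoms" in the click-stream (the event shapes here are invented for illustration):

```python
# A web-site visit as a stream of atomic events.
events = [
    {"type": "page_view", "page": "/home"},
    {"type": "page_view", "page": "/product/42"},
    {"type": "add_to_cart", "sku": "42", "price": 19.99},
    {"type": "enter_payment"},
    {"type": "confirm"},   # just another event, but it triggers the cascade
]

def to_transaction(events):
    """Assemble the transactional 'molecule' from its event 'atoms'."""
    if not any(e["type"] == "confirm" for e in events):
        return None                      # browsing session, no purchase
    items = [e for e in events if e["type"] == "add_to_cart"]
    return {"items": [e["sku"] for e in items],
            "total": sum(e["price"] for e in items)}

txn = to_transaction(events)
assert txn == {"items": ["42"], "total": 19.99}
```

Storing the events rather than just the derived transaction means the lake retains everything: the transaction can always be recomputed, but the abandoned-browse sessions (events with no `confirm`) are analyzable too.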
When you look at it like that, business systems consist of applications generating events<br />
and sending event information to other applications. <strong>The</strong>re is nothing particularly<br />
special about web applications in this respect; all applications are like that. <strong>The</strong>y<br />
respond to events and send messages or data to other applications. That’s been the<br />
nature of computing for decades.<br />
When you scan all the servers and network devices in a data center you find them<br />
awash with log files that store data about events of every kind: network logs, message<br />
logs, system logs, application logs, API logs, security logs and so on. Collectively the<br />
logs provide an extensive audit trail of the activity of the data center, organized by<br />
time stamp. It happens at the application level, at the data level and lower down at<br />
the hardware level. Hiding within this disparate set of data can be found details of<br />
anomalies, error conditions, hacker attacks, business transactions and so on.<br />
Clearly it is possible, by adding real-time logging to every business application on every<br />
device, to collect all the company’s data and dispatch it to the data lake - although it is<br />
no doubt overkill. Figure 6 depicts this possibility. <strong>Data</strong> is ingested either from static<br />
data sources (files and databases) on a scheduled basis, or is ingested directly from data<br />
streams.<br />
We show both Kafka and NiFi as possible components of an ingest solution. Kafka is pure<br />
publish/subscribe and thus can easily be used to gather data changes from multiple files<br />
(as publishers) and pass them to the <strong>Data</strong> Bus (the subscriber), probably a well-configured<br />
server. However, a more sophisticated set of capabilities can be created by integrating<br />
NiFi with Kafka. NiFi can completely automate data flows across thousands of systems<br />
and can be configured to handle the failure of any system or network component, queue<br />
management, data corruption, priorities, compliance and security. And it provides an<br />
excellent drag-and-drop interface that allows data flow configurations to be designed<br />
and enhanced. You can think of NiFi as ETL on steroids. We included an ingest application<br />
in the diagram to allow for any functionality or integration capability that could not be<br />
delivered by Kafka in conjunction with NiFi. <strong>The</strong> goal is to provide an in-memory stream<br />
of data that can be processed as it arrives. Given that, in theory at least, input may be<br />
required from any data source anywhere, the data access ability needed here is extensive.<br />
<strong>Data</strong> lake projects will most likely begin with fairly unsophisticated data acquisition<br />
from just a few sources. What we have described here is a kind of “worst case scenario”<br />
and how it would likely be handled.<br />
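The publish/subscribe pattern described for Kafka can be sketched in miniature in plain Python (an in-memory stand-in to show the shape of the pattern, not the Kafka API):

```python
from collections import defaultdict

class MiniBus:
    """In-memory publish/subscribe: many publishers, per-topic subscribers."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

bus = MiniBus()
landed = []
bus.subscribe("file-changes", landed.append)   # the data bus as subscriber

# Multiple sources (publishers) push their changes onto the same topic.
bus.publish("file-changes", {"file": "orders.log", "delta": "+3 rows"})
bus.publish("file-changes", {"file": "clicks.log", "delta": "+120 rows"})
assert len(landed) == 2
```

The decoupling is the point: publishers know nothing about who consumes the changes, so new subscribers (cleansing, metadata discovery, real-time apps) can be attached without touching the sources. Kafka adds durable, partitioned, replayable logs on top of this basic shape.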
Real-Time Applications<br />
<strong>The</strong> goal of software architecture is to satisfy application service levels, pure and<br />
simple. When we consider real-time applications, for example responding immediately<br />
to price changes in an automated market, there is simply no room for any avoidable<br />
latency.<br />
Figure 6. <strong>Data</strong> <strong>Lake</strong> Ingest<br />
We represent this reality, in Figure 7, by showing real-time applications running directly<br />
against the real-time data stream within the <strong>Data</strong> Bus and also accessing a disk-based<br />
data source. In practice this is likely to be achieved by a Lambda or Kappa architecture,<br />
which is far more involved than the diagram suggests.<br />
In practice it is far more likely, for latency reasons, that real-time apps will not be located<br />
anywhere near the data lake, but will instead be as close as possible to the data stream(s)<br />
that feed them. Nevertheless they will likely pass the data stream they process directly<br />
to the data lake. Since such applications run prior to any data governance, it may be<br />
necessary to pass information back to these applications if anything important is<br />
discovered during the governance processes.<br />
Governance Applications<br />
We need to distinguish between the governance processing that occurs on ingest and<br />
the governance processing that may occur later. One of the rules of governance itself<br />
may be to specify what processes have to take place before data is made available to<br />
data lake users.<br />
It makes sense to do as much processing as possible while data is on the <strong>Data</strong> Bus<br />
(i.e. held in memory) since it can be accessed very quickly. <strong>The</strong> ideal would be to do<br />
all governance processing on ingest, but this may be impossible. Some data cleansing<br />
activity and some metadata discovery activity requires human intervention and it may<br />
not be practical to do it on ingest. For some data, the chosen policy may be to implement<br />
schema-on-read, so metadata gathering will occur after ingest. <strong>The</strong>re can be other<br />
competing dynamics. <strong>The</strong> desire may be to encrypt all data (or at least all data that is<br />
destined to be encrypted) on ingest. If data is to be stored in HDFS, this is doubly<br />
important, as it is a write-once file system. However, data cleansing will require data to<br />
be unencrypted.<br />
Figure 8. Ingest Governance Apps<br />
<strong>The</strong> data transform and aggregation activities shown in the diagram are not governance<br />
activities per se. From an efficiency perspective it will be better to perform data transformations,<br />
aggregations and other data calculations that are known to be required before data is written<br />
to disk. Of course, some will need to be done later simply because not all the data they<br />
require is in the data stream.<br />
Figure 7. Real-Time<br />
<strong>Data</strong> <strong>Lake</strong> Management<br />
Figure 9 shows the processes that run on the data lake. <strong>The</strong> other on-going activity<br />
aside from data governance is data lake management. This is the system management<br />
activity that monitors and responds to hardware and operating system events and<br />
manages all the applications that dip their toes in the data lake.<br />
<strong>The</strong> data lake is a computer grid that needs to be managed like any other network of<br />
hardware. <strong>The</strong> appropriate system management activities can be many and varied,<br />
including: server performance, availability monitoring, automated recovery, software<br />
management, security management, access management, user monitoring, application<br />
monitoring, capacity management, provisioning, network monitoring and scheduling.<br />
<strong>The</strong>re is nothing new in respect of system management here. However, it is important to<br />
recognize that some of the software employed here will be traditional system management<br />
software and some is likely to be data lake specific (Hadoop management software like<br />
Ambari, Pepperdata for cluster tuning, etc.).<br />
<strong>Data</strong> <strong>Lake</strong> Applications<br />
Figure 9. <strong>Data</strong> <strong>Lake</strong> Applications and Processes<br />
Little needs to be said about data lake applications beyond the fact that businesses<br />
almost always employ data lakes in the same way they used data warehouses for BI<br />
and analytics applications. It is important to note that data self-service is more practical<br />
with data lakes than it ever was with data warehouses.<br />
An effective data self-service capability requires an effective search and query capability.<br />
This in turn requires the existence of a data catalog, which should be a natural result<br />
of intelligent metadata management. It will also require a search capability (after the<br />
fashion of Google search), which can pick out anything in the data lake.<br />
A really comprehensive search capability will necessitate some kind of indexing<br />
activity as part of data ingest. A query capability will need to work in conjunction with<br />
the data catalog. <strong>The</strong>re might be support for multiple query languages (SQL, XQuery,<br />
SparQL, etc).<br />
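The kind of indexing activity implied can be sketched as a tiny inverted index built at ingest time (a toy of what a full-text search engine does at scale):

```python
from collections import defaultdict

index = defaultdict(set)   # term -> set of document/file identifiers

def index_on_ingest(doc_id: str, text: str) -> None:
    """Index each term as the document lands in the lake."""
    for term in text.lower().split():
        index[term].add(doc_id)

def search(term: str) -> list:
    return sorted(index.get(term.lower(), set()))

index_on_ingest("report-1", "turbine vibration anomaly detected")
index_on_ingest("log-7", "network anomaly at 02:14")
assert search("anomaly") == ["log-7", "report-1"]
assert search("turbine") == ["report-1"]
```

Because the index is maintained as data arrives, the search side stays a cheap lookup; this is why the indexing work belongs in the ingest pipeline rather than at query time.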
If we now look at Figure 10, which illustrates the data lake complete, the only two<br />
processes we have not yet discussed are data extracts and data life-cycle management.<br />
While data lake processing can be fast, if particularly high performance is required<br />
for some applications, there will be a need to export data to a fast data engine or<br />
database. It will probably be many years before data lake data access speed gets close<br />
to a purpose-built database.<br />
<strong>The</strong> truth is that the focus of data lake architecture is data ingest and data governance.<br />
<strong>The</strong>re are many processes, all of them important, competing for the same resources.<br />
Figure 10. <strong>The</strong> <strong>Data</strong> <strong>Lake</strong> Complete<br />
<strong>The</strong>re will always be a limit to the capacity of the data lake, and governance processes<br />
naturally take priority, so it will prove necessary to replicate data to other data lakes or<br />
data marts to properly serve some applications or users.<br />
As regards data archive, data life-cycle management can be seen as an aspect of<br />
data governance. It is best thought of as a background process. <strong>The</strong> exact rules of<br />
whether and when data needs to be deleted may be influenced by business imperatives<br />
(regulation), but may also be determined by storage costs. Ideally, archive will be an<br />
automatic process.<br />
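A background life-cycle process of this kind might be sketched as follows. The retention periods and rule names are illustrative assumptions, not a standard; the essential idea is that a regulatory hold overrides cost-driven archiving and deletion.

```python
from datetime import date

# Illustrative policy: regulation wins over storage-cost rules.
RETENTION_DAYS = 365 * 7      # assumed regulation: retain for seven years
ARCHIVE_AFTER_DAYS = 90       # storage cost: demote cold data after 90 days

def lifecycle_action(dataset, today):
    """Decide what the background process should do with one dataset."""
    age = (today - dataset["last_access"]).days
    if dataset.get("regulatory_hold") and age < RETENTION_DAYS:
        return "retain"          # regulation forbids deletion or removal
    if age >= RETENTION_DAYS:
        return "delete"          # past any required retention period
    if age >= ARCHIVE_AFTER_DAYS:
        return "archive"         # cold: move to cheaper storage
    return "keep-online"

today = date(2017, 6, 1)
print(lifecycle_action({"last_access": date(2017, 5, 20)}, today))   # keep-online
print(lifecycle_action({"last_access": date(2016, 1, 1)}, today))    # archive
```

In practice such a routine would run periodically over the data catalog, which is another reason intelligent metadata management matters: the life-cycle process needs to know what each dataset is before it can apply the rules.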
Next Generation Architecture<br />
In our view, the data lake is more than an interesting software architecture that utilizes<br />
many inexpensive open source components; it is the foundation of the next generation<br />
of enterprise software. As such, we can think of there being three major architectural<br />
generations of software:<br />
1. Batch Computing. <strong>The</strong> centralized mainframe architecture, characterized by<br />
all applications running on one (or more) large mainframes usually in a batch<br />
manner.<br />
2. On-line Computing. This involves an arrangement of distributed servers and<br />
client devices. It is characterized by applications interacting with databases in<br />
a transactional way, supported by batch data flows between databases for data<br />
sharing.<br />
3. Real-Time Computing. This is based on event-based applications that employ<br />
parallel architectures for speed and can support real-time applications where<br />
data is processed as quickly as it arrives.<br />
If we look at software architectures in this manner, it is clear that the data lake belongs<br />
to a new generation of software that is built in a fundamentally different way to what<br />
came before.<br />
<strong>The</strong> event-based data lake serves as the system of record of the corporate software<br />
ecosystem and the point of implementation of data governance. <strong>Data</strong> flows into the<br />
lake from data streams or bulk data transfers that have their origin within or outside<br />
the organization. During this process, governance rules are applied to the data to ensure<br />
it is stored in an appropriate form and subject to appropriate security policies. Once<br />
in the data lake, with all the appropriate governance procedures applied, it becomes<br />
available for use. If any data is extracted for use outside the data lake, it will be a data<br />
copy. Ultimately data leaves the lake when it is archived or permanently deleted.<br />
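The governance-on-ingest flow described above can be sketched in outline as follows. The rule set and field names here are assumptions for illustration only; real governance would cover many more policies than masking and provenance tagging.

```python
import hashlib

PII_FIELDS = {"email", "ssn"}   # fields an (assumed) security policy says to mask

def govern_on_ingest(record, source):
    """Apply governance rules to a record as it enters the lake:
    mask sensitive fields and tag provenance before storage."""
    governed = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            # Mask PII with a one-way hash; it can still be joined on,
            # but the raw value never lands in the lake.
            governed[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            governed[field] = value
    governed["_source"] = source        # provenance metadata
    return governed

raw = {"customer": "Acme", "email": "bob@acme.com", "spend": 1200}
stored = govern_on_ingest(raw, source="crm-feed")
print(stored["_source"])                 # crm-feed
print(stored["email"] != raw["email"])   # True
```

The design choice worth noting is that governance runs once, at the boundary: everything downstream of ingest can then assume the data is already in an appropriate, policy-compliant form.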
So, the data lake is not a swamp into which any old data can be poured and lost from<br />
sight. It is the ingest point for the controlled collection and governance of corporate<br />
data. It is the system of record and the foundation for data life cycle management, from<br />
ingest to archive. It is an event management platform and could be viewed as a truly<br />
versatile data warehouse.<br />
As a data warehouse, it is distinctly different from the traditional variety as it can<br />
accept any data with any structure rather than just the so-called structured data that<br />
relational databases store. <strong>The</strong>re is no modelling required in designing and maintaining<br />
the lake, and a query service should be available to retrieve data from the lake – but it<br />
may not be a lightning-fast query service.<br />
While data warehouses were presided over by extremely powerful database<br />
technology with a purpose-built query optimizer, the data lake is likely to be devoid of<br />
such technology. If a fast query service is needed (as is likely) for some of the data<br />
in the lake, it will be provided by exporting that data to an appropriate database.<br />
<strong>The</strong> Logical and <strong>The</strong> Physical<br />
In our view, the two primary dynamics involved in establishing a data lake are:<br />
1. To gradually migrate all the data that makes up the system of record to the data<br />
lake where it becomes the golden copy of the data.<br />
2. For the data lake to become the primary point of ingest of external data, where<br />
governance processing is applied to data, both internal and external, as soon as<br />
possible after it enters the data lake.<br />
We note here that the data lake concept, which was first proposed about five years<br />
ago, has gradually grown in sophistication and thus here we are describing current<br />
thinking about what a data lake is and how to use it.<br />
For companies building a data lake, it is important to think in terms of a “logical data<br />
lake” along the lines we described, and to acknowledge that its physical implementation<br />
may be far more involved than our diagrams suggest.<br />
If the recent history of IT has taught us anything, it is that everything needs to scale.<br />
Most companies have a series of transactional systems (the mission critical systems)<br />
that currently constitute most if not all of the system of record. For the data lake to<br />
assume its role as the system of record, the data from such systems needs to be copied<br />
into the data lake.<br />
Pre-assembled <strong>Data</strong> <strong>Lake</strong>s<br />
For many companies the idea of commencing a strategic data lake project will make no<br />
commercial sense, particularly if their primary goal is, for example, only to do analytic<br />
exploration of a collection of data from various sources. Such a set of applications is<br />
unlikely to require all the governance activities we have discussed. In these circumstances,<br />
the pragmatic goal will be to build the desired applications to a simpler target data lake<br />
architecture that omits some of the elements we have described.<br />
This approach will be easier and more likely to bring success if a data lake platform is<br />
employed, which is capable of delivering a data lake “out of the box.” As previously<br />
noted, vendors such as Cask, Unifi or Cambridge Semantics provide such capability.<br />
<strong>The</strong>y deliver a flexible abstraction layer between the Apache Stack and the applications<br />
built on top of it. <strong>The</strong>y also provide other components for managing, building and<br />
enriching data lake applications.<br />
It is possible to think of such vendors as providing a data operating system for an<br />
expandable cluster onto which you can build one or more applications. It is also feasible<br />
to build many such "dedicated" data lakes with different applications on each. A<br />
company might, for example, build an event log "data lake" for IT operations usage, a<br />
real-time manufacturing data lake, a sales and marketing "data lake" and so on.<br />
One of the beauties of the current Apache Stack is that, with the inclusion of the powerful<br />
communications components Kafka and NiFi, it is possible to establish loosely coupled<br />
data lakes that flow data one to another. If the data is coherently managed, simply<br />
adding clusters like this, whether located in the cloud or on premises, will allow you to<br />
gradually grow the type of system of record we have discussed in this report.<br />
Figure 11. Multiple Physical <strong>Data</strong> <strong>Lake</strong>s in Overview<br />
<strong>The</strong> main point is that there can be multiple physical data lakes, each with ingest<br />
capabilities and governance processes, that constitute a logical data lake, as illustrated<br />
in Figure 11. While the diagram implies that all the physical data lakes are running<br />
roughly the same processes, this may not be the case. <strong>The</strong>re are different reasons for<br />
having multiple physical data lakes. Some might exist entirely for disaster recovery<br />
or as a reserve resource for unexpected processing demand or as a dedicated analyst<br />
sandbox for a specific group of users. Some may simply be established for time zone<br />
or geographical reasons.<br />
Having multiple physical data lakes will complicate the global approach to governance<br />
and establishing a system of record. However, with an intelligent deployment of Kafka<br />
(and possibly also NiFi) to manage the replication and export of data, ensuring that the<br />
physical data lakes correspond to a logical data lake is achievable.<br />
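One way to picture several physical data lakes behaving as one logical data lake is as a federated catalog. The sketch below is a toy model with invented names; the actual movement of data between lakes would be handled by Kafka or NiFi, not by code like this.

```python
class PhysicalLake:
    """One physical cluster with its own ingest and local datasets."""
    def __init__(self, name, region):
        self.name, self.region = name, region
        self.datasets = {}          # dataset name -> payload

    def ingest(self, dataset, payload):
        self.datasets[dataset] = payload

class LogicalLake:
    """A logical data lake: one catalog over several physical lakes.
    (Replication between lakes is assumed to happen elsewhere, e.g. via Kafka.)"""
    def __init__(self, lakes):
        self.lakes = lakes

    def locate(self, dataset):
        # Report which physical lake(s) hold a dataset.
        return [l.name for l in self.lakes if dataset in l.datasets]

    def read(self, dataset):
        # Serve a read from whichever physical lake has the data.
        for l in self.lakes:
            if dataset in l.datasets:
                return l.datasets[dataset]
        raise KeyError(dataset)

eu = PhysicalLake("eu-lake", "eu-west")
us = PhysicalLake("us-lake", "us-east")
eu.ingest("sales_2016", [("FR", 100), ("DE", 250)])
us.ingest("web_logs", ["GET /", "GET /pricing"])

logical = LogicalLake([eu, us])
print(logical.locate("sales_2016"))   # ['eu-lake']
print(logical.read("web_logs")[0])    # GET /
```

The key property is that applications address the logical lake; where a dataset physically resides (time zone, geography, disaster recovery copy) is an implementation detail hidden behind the shared catalog.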
<strong>The</strong> system of record is likely to be logical (i.e. spread physically across multiple<br />
systems and data lakes) for several reasons. One particular cause for this that we<br />
believe is worth discussing is Internet of Things (IoT) applications, where the source<br />
data is created and is likely to remain physically remote.<br />
<strong>The</strong> Internet of Things<br />
<strong>The</strong> IoT is currently in its infancy, although some IoT applications have existed for many<br />
years, particularly those involving mobile phones. <strong>The</strong> Uber and Lyft applications, for<br />
example, are complex internet of things applications.<br />
However, such applications are not what normally spring to mind when the IoT is<br />
mentioned. <strong>The</strong> general idea is that there will be some physical domain – a building<br />
or many buildings or a transport network or a pipeline or a chemical plant or a factory –<br />
and this domain will be peppered with sensors, controllers or even embedded CPUs in<br />
various locations that are, at the very minimum, recording information but may also be<br />
running local applications.<br />
Figure 12. IoT in Overview<br />
Figure 12 illustrates a typical scenario. Consider an example: let us say a car or a<br />
truck or an airplane engine, loaded with sensors. <strong>The</strong> data gathered locally needs<br />
to be marshaled in a local data depot, which may contain considerable amounts<br />
of data, maybe terabytes. Some of that data will probably be processed and used<br />
locally and there may be no need to send it to a central data hub. However, there<br />
are many instances of such an object (car, truck, etc.) and there can be no doubt that the<br />
data gathered for each object will have value in aggregation.<br />
So some of the IoT data will be collected in a central hub so that the aggregate data<br />
can be analyzed. <strong>The</strong> bulk of the data is thus distributed and will remain distributed. It<br />
may even be that some application at some point needs to access the complete collection<br />
of all the data. If so, it will be far more economical to run a distributed application across<br />
all the depots than to try to centralize the data.<br />
A data lake might be involved in IoT applications like this, but its scale-out capabilities<br />
may have little importance for such applications. <strong>The</strong> scale-out applications of the IoT<br />
will probably be handled via distributed processing.<br />
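The alternative to centralizing depot data – running the computation where the data lives and moving only small partial results – can be sketched in a map-reduce style. The depot names and readings below are invented for illustration.

```python
# Each depot holds raw sensor readings locally (terabytes, in reality).
depots = {
    "depot-a": [72.1, 73.4, 71.9, 74.0],
    "depot-b": [68.5, 69.2],
    "depot-c": [75.3, 74.8, 76.1],
}

def local_summary(readings):
    """Runs AT the depot: reduces raw data to a tiny partial result."""
    return {"count": len(readings), "total": sum(readings)}

def central_aggregate(partials):
    """Runs at the central hub: combines partial results only."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return total / count

# Only the small summaries cross the network, never the raw readings.
partials = [local_summary(r) for r in depots.values()]
print(round(central_aggregate(partials), 2))   # 72.81
```

Because each partial result is a few bytes regardless of how much raw data a depot holds, the network cost of the distributed approach stays flat as the fleet of objects grows.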
<strong>The</strong> System of Record in Summary<br />
We have positioned the data lake as being a next generation technology concept,<br />
founded on the use of parallel processing in combination with a whole series of new<br />
software components, the majority of which are Apache projects.<br />
In this new ecosystem, the system of record, which historically was regarded as the<br />
data of the primary transactional applications of the business, will reside (mainly) in<br />
the data lake, where the purifying processes of data governance will be applied to it<br />
on ingest.<br />
<strong>The</strong> system of record will no longer consist entirely of the transactions (or events) of<br />
the business. It will also include data from other sources, which the business uses to<br />
perform analytics and inform its users of important information on which decisions can<br />
be based. <strong>The</strong> system of record will be, as it always was, the golden copy of corporate<br />
data and the audit trail of the IT activities of the business.<br />
Thank You To Our Sponsors:<br />
About <strong>The</strong> Bloor Group<br />
<strong>The</strong> Bloor Group is a consulting, research and technology analysis firm that focuses on open<br />
research and the use of modern media to gather knowledge and disseminate it to IT users.<br />
Visit both www.BloorGroup.com and www.InsideAnalysis.com for more information. <strong>The</strong><br />
Bloor Group is the sole copyright holder of this publication.<br />
Austin, TX 78720 | 512-426-7725<br />