03.04.2017 Views

The Data Lake Survival Guide



The main point is that there can be multiple physical data lakes, each with its own ingest capabilities and governance processes, that together constitute a logical data lake, as illustrated in Figure 11. While the diagram implies that all the physical data lakes are running roughly the same processes, this may not be the case. There are different reasons for having multiple physical data lakes. Some might exist entirely for disaster recovery, as a reserve resource for unexpected processing demand, or as a dedicated analyst sandbox for a specific group of users. Some may simply be established for time zone or geographical reasons.

Having multiple physical data lakes will complicate the global approach to governance and to establishing a system of record. However, with an intelligent deployment of Kafka (and possibly also NiFi) to manage the replication and export of data, it is achievable to ensure that the physical data lakes correspond to a single logical data lake.
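The replication idea can be sketched in a few lines. The following is a minimal, in-memory illustration only: `ReplicatedLog`, `PhysicalLake` and `is_consistent` are hypothetical names invented for this sketch, and the log is a toy stand-in for a replicated topic, not the Kafka API. It assumes every physical lake replays the same ordered log, so that the lakes agree once each has caught up.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicatedLog:
    """Toy append-only log (hypothetical stand-in for a replicated topic)."""
    records: list = field(default_factory=list)

    def append(self, record: dict) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the record just written

    def read_from(self, offset: int) -> list:
        return self.records[offset:]


@dataclass
class PhysicalLake:
    """One physical data lake that remembers how far it has read the log."""
    name: str
    offset: int = 0
    store: dict = field(default_factory=dict)

    def sync(self, log: ReplicatedLog) -> int:
        new_records = log.read_from(self.offset)
        for rec in new_records:
            self.store[rec["key"]] = rec["value"]  # last write wins
        self.offset += len(new_records)
        return len(new_records)


def is_consistent(lakes: list) -> bool:
    """The physical lakes act as one logical lake when their contents agree."""
    return all(lake.store == lakes[0].store for lake in lakes)


log = ReplicatedLog()
primary = PhysicalLake("primary")
dr_site = PhysicalLake("disaster-recovery")

log.append({"key": "order-1", "value": "created"})
log.append({"key": "order-1", "value": "shipped"})

primary.sync(log)
lagging = is_consistent([primary, dr_site])    # dr_site has not replayed yet
dr_site.sync(log)
converged = is_consistent([primary, dr_site])  # both have replayed the log
```

In this sketch the disaster recovery lake is allowed to lag behind the primary, and the lakes only correspond to one logical lake once both have consumed the log to the same offset. A real deployment would also have to deal with ordering across partitions, failures and governance metadata, none of which is modeled here.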

The system of record is likely to be logical (i.e. spread physically across multiple systems and data lakes) for several reasons. One cause that we believe is worth discussing is Internet of Things (IoT) applications, where the source data is created, and is likely to remain, physically remote.

The Internet of Things

The IoT is currently in its infancy, although some IoT applications have existed for many years, particularly those involving mobile phones. The Uber and Lyft applications, for example, are complex IoT applications.

However, such applications are not what normally springs to mind when the IoT is mentioned. The general idea is that there will be some physical domain (a building or many buildings, a transport network, a pipeline of a chemical plant, or a factory), and this domain will be peppered with sensors, controllers or even embedded CPUs in various locations that are, at the very minimum, recording information but may also be running local applications.

[Figure 12 diagram labels: Sensors, Controllers, CPUs; Data; Source; Depot; Proc.]

Figure 12 illustrates a typical scenario. Consider an example: a car, a truck or an airplane engine loaded with sensors. The data gathered locally needs to be marshaled in a local data depot, which may contain considerable amounts of data, perhaps terabytes. Some of that data will probably be processed and used locally, and there may be no need to send it to a central data hub. However, there

[Figure 12 diagram labels: Data; Central Hub; Central Proc.]

Figure 12. IoT in Overview<br />
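The depot-and-hub arrangement in Figure 12 can be illustrated with a small sketch. It is an assumption-laden toy, not a description of any particular product: `LocalDepot`, `CentralHub`, the `alert_threshold` parameter and the summary fields are all hypothetical names chosen for illustration. The sketch assumes that raw sensor readings stay in the local depot, that some processing (here, a simple threshold check) happens locally, and that only a compact summary is sent to the central hub.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class CentralHub:
    """Central data hub that receives summaries only (hypothetical)."""
    received: list = field(default_factory=list)

    def ingest(self, summary: dict) -> None:
        self.received.append(summary)


@dataclass
class LocalDepot:
    """Local data depot next to the sensors (hypothetical)."""
    source_id: str
    alert_threshold: float
    raw: list = field(default_factory=list)

    def record(self, reading: float) -> bool:
        """Keep the raw reading locally; return True if it needs a local alert."""
        self.raw.append(reading)
        return reading > self.alert_threshold

    def forward_summary(self, hub: CentralHub) -> dict:
        """Send an aggregate to the hub; the raw readings never leave the depot."""
        summary = {
            "source": self.source_id,
            "count": len(self.raw),
            "mean": mean(self.raw),
            "max": max(self.raw),
        }
        hub.ingest(summary)
        return summary


hub = CentralHub()
depot = LocalDepot(source_id="engine-1", alert_threshold=100.0)

# Local processing: each reading is checked against the threshold on arrival.
alerts = [depot.record(t) for t in (90.0, 95.0, 110.0)]

# Only one small summary record travels to the central hub.
summary = depot.forward_summary(hub)
```

The design choice shown is that the volume sent to the centre is independent of the volume held locally: three readings or three billion would both produce a single summary record, while the detailed data remains in the depot where it was created.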
