DM MANAGEMENT: DATA ARCHITECTURE<br />
Embracing open data architecture<br />
Matt Peachey, Vice President, International at Dremio, argues that open is the smart way<br />
forward for data management<br />
For decades, data has propelled<br />
business operations.<br />
Whether an organisation offers<br />
tangible goods or intangible services,<br />
crucial information about partners,<br />
workforce, processes, and clients forms<br />
the backbone of a company's wellbeing.<br />
At the heart of any computing system is<br />
data storage, therefore the selection of<br />
the appropriate solution to store it will<br />
significantly impact how efficiently an<br />
organisation's network and<br />
accompanying infrastructure cater to the<br />
business requirements.<br />
The primary expectation of a data<br />
storage system is to keep valuable<br />
data safe while allowing users and applications<br />
to retrieve it seamlessly and swiftly when<br />
required. However, with data volumes<br />
growing exponentially and rarely being<br />
deleted, businesses have responded by<br />
simply adding more storage capacity.<br />
The issue deepens where data warehouse<br />
vendors store data in a proprietary format.<br />
Data gets locked into the platform,<br />
making it difficult and costly to extract if<br />
and when a business wants to. Further,<br />
maintaining and troubleshooting issues<br />
often require teams with deep subject<br />
matter expertise in the ecosystem - an<br />
expensive outlay.<br />
Given the multitude of data storage<br />
alternatives and system setups,<br />
organisations can get dragged down a<br />
rabbit hole as they add ever more data to<br />
their systems - a very inefficient approach.<br />
Organisations must embrace open-source<br />
standards, technologies and formats to<br />
ensure fast and cost-effective analytics<br />
with the best engine for each workload.<br />
This provides the agility to innovate with<br />
the next wave of technology without<br />
draining resources or time.<br />
EVOLUTION OF DATA ARCHITECTURE<br />
Previously, companies depended on<br />
conventional databases or warehouses for<br />
their Business Intelligence (BI) demands.<br />
However, these systems presented certain<br />
difficulties. The typical data warehouse<br />
setup requires investing in expensive<br />
on-premises hardware, maintaining structured<br />
data in proprietary formats, and<br />
dependence on a centralised IT and data<br />
department for analysis. Other obstacles<br />
included technical interoperability, system<br />
orchestration, and, more critically, scalability.<br />
However, things changed in 2006 with the<br />
launch of Hadoop, built on the MapReduce<br />
paradigm and capable of processing<br />
enormous data sets in parallel across large<br />
clusters of commodity hardware. This<br />
framework facilitated handling vast datasets<br />
distributed over computer clusters, making<br />
it immensely appealing for businesses<br />
accumulating more data with each passing<br />
day. Still, databases like Teradata and Oracle<br />
encapsulated storage, computation, and<br />
data within a single, interconnected system,<br />
offering no separation of compute and<br />
storage components.<br />
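The MapReduce pattern the article describes can be sketched in a few lines of Python - a toy, single-process word count, not Hadoop itself, but the same map/shuffle/reduce shape that made parallel processing over clusters possible:<br />

```python
from collections import defaultdict
from itertools import chain

# Map phase: each input record is turned into (key, value) pairs.
# In a real cluster, many mappers run this on different machines.
def map_phase(record):
    for word in record.split():
        yield (word.lower(), 1)

# Shuffle: group all emitted values by key across mapper outputs.
def shuffle(mapped):
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate each key's values independently,
# which is what lets reducers run in parallel.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

records = ["data drives business", "open data wins"]
mapped = chain.from_iterable(map_phase(r) for r in records)
counts = reduce_phase(shuffle(mapped))
# counts["data"] == 2
```

Because the map and reduce steps are side-effect-free per key, the framework can distribute them over commodity machines and re-run failed tasks safely - the property that made Hadoop so appealing.<br />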
Between 2015 and 2020, however, the<br />
widespread usage of the public cloud<br />
altered this approach, enabling the<br />
separation of compute and storage. Cloud<br />
data vendors like AWS and Snowflake<br />
facilitated this separation in cloud<br />
warehouses, enhancing scalability and<br />
efficiency. Nevertheless, data still had to be<br />
ingested, loaded, and duplicated into a<br />
single proprietary system, which was<br />
attached to a solitary query engine.<br />
Employing multiple databases or data<br />
warehouses necessitated the storage of<br />
multiple data copies. Moreover, companies<br />
were still charged for transferring their data<br />
into and out of the proprietary system,<br />
which resulted in excessive costs.<br />
Enter modern, open data<br />
architecture, where data exists as an<br />
independent layer, with a clear division<br />
between data and compute.<br />
Data is stored in open-source file formats<br />
and table formats and accessed by<br />
decoupled and elastic compute engines.<br />
Consequently, different engines can access<br />
the same data in a loosely tied architecture.<br />
In these architectures, data is stored as its<br />
own independent tier, in open<br />
formats within the company's cloud account<br />
and made accessible to downstream<br />
consumers through various services.<br />
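The key idea - several independent engines reading the same open-format data with no ingestion or copying - can be illustrated with a deliberately simplified sketch. Here plain CSV stands in for real open formats such as Parquet files under an Iceberg or Delta table; the two "engines" are ordinary functions standing in for separate query services:<br />

```python
import csv
import io

# Shared data in an open, vendor-neutral format.
# (CSV is a stand-in here; production systems would use
# Parquet files governed by an open table format.)
CSV_DATA = """region,amount
emea,100
apac,250
emea,50
"""

# Engine A: a filter engine that scans the shared data directly.
def filter_engine(text, region):
    rows = csv.DictReader(io.StringIO(text))
    return [r for r in rows if r["region"] == region]

# Engine B: an aggregation engine reading the *same* bytes,
# with no copy, load, or proprietary ingestion step.
def sum_engine(text):
    rows = csv.DictReader(io.StringIO(text))
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0) + int(r["amount"])
    return totals

emea_rows = filter_engine(CSV_DATA, "emea")
totals = sum_engine(CSV_DATA)
```

Because the storage layer is just an openly documented format, adding a third engine later requires no migration - the loose coupling the article contrasts with single-engine proprietary warehouses.<br />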
This transformation parallels the shift in<br />
applications from monolithic architectures<br />
to microservices. A comparable transition<br />
is presently occurring in data analytics,<br />
with companies migrating from<br />
proprietary data warehouses and ceaseless<br />
ETL (Extract, Transform, Load) processes to<br />
open data architectures like cloud data<br />
lakes and lakehouses.<br />
SEPARATING COMPUTE AND STORAGE<br />
FOR EFFICIENCY<br />
Over the years, the industry has discussed<br />
the separation of compute from storage<br />
at length, chiefly because of the efficiency<br />
it brings, which yields several advantages.<br />
Firstly, raw storage costs fell so<br />
significantly that they practically<br />
disappeared from IT budget spreadsheets.<br />
Secondly, compute costs were billed<br />
separately, meaning customers paid<br />
only for what they used during data<br />
processing, which lowered overall expenses.<br />
November/December 2023 www.document-manager.com<br />
@DMMagAndAwards