ST Nov-Dec 2023
MANAGEMENT: DATA ARCHITECTURE<br />
EMBRACING OPEN DATA ARCHITECTURE<br />
MATT PEACHEY, VICE PRESIDENT, INTERNATIONAL AT DREMIO,<br />
ARGUES THAT OPEN IS THE SMART WAY FORWARD FOR DATA<br />
MANAGEMENT<br />
For decades, data has been<br />
propelling business operations. Whether<br />
an organisation offers tangible goods or<br />
intangible services, crucial information about<br />
partners, workforce, processes, and clients<br />
forms the backbone of a company's wellbeing.<br />
At the heart of any computing system is data<br />
storage, so the selection of an<br />
appropriate storage solution will significantly<br />
impact how efficiently an organisation's network<br />
and accompanying infrastructure cater to the<br />
business requirements.<br />
The primary expectation from a data storage<br />
system is to safely keep valuable data while<br />
allowing users and applications to retrieve it<br />
seamlessly and swiftly when required. However,<br />
with data volumes growing exponentially<br />
and old data rarely deleted, businesses have<br />
simply kept adding storage capacity.<br />
The issue deepens when data warehouse<br />
vendors store data in a proprietary format.<br />
Data gets locked into the platform, making it<br />
difficult and costly to extract if and when a<br />
business wants to. Further, maintaining and<br />
troubleshooting issues often require teams with<br />
deep subject matter expertise in the ecosystem -<br />
an expensive outlay.<br />
Given the multitude of data storage<br />
alternatives and system setups, organisations<br />
can get dragged down a rabbit hole whilst<br />
adding more data to their systems - a very<br />
inefficient approach. Organisations must<br />
embrace open-source standards, technologies<br />
and formats to ensure fast and cost-effective<br />
analytics with the best engine for each<br />
workload. This provides the agility to innovate<br />
with the next wave of technology without<br />
draining resources or time.<br />
EVOLUTION OF DATA ARCHITECTURE<br />
Previously, companies depended on<br />
conventional databases or warehouses for their<br />
Business Intelligence (BI) demands. However,<br />
these systems presented certain difficulties. The<br />
typical data warehouse setup required investing<br />
in expensive on-premises hardware,<br />
maintaining structured data in proprietary<br />
formats, and depending on a centralised IT<br />
and data department for analysis. Other<br />
obstacles included technical interoperability,<br />
system orchestration, and, more critically,<br />
scalability.<br />
However, things changed in 2006 with the<br />
launch of Hadoop, built on the MapReduce<br />
paradigm, capable of processing enormous<br />
data sets in parallel over large<br />
clusters of commodity hardware. This<br />
framework facilitated handling vast datasets<br />
distributed over computer clusters, making it<br />
immensely appealing for businesses<br />
accumulating more data with each passing<br />
day. Still, databases like Teradata and Oracle<br />
encapsulated storage, computation, and data<br />
within a single, interconnected system, offering<br />
no separation of compute and storage<br />
components.<br />
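The MapReduce paradigm that Hadoop popularised can be sketched in a few lines: a map phase emits key/value pairs from each input record, a shuffle groups the pairs by key, and a reduce phase aggregates each group independently - which is what makes the work parallelisable across a cluster. The function names below are illustrative, not Hadoop's actual API.<br />

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every record.
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently (hence parallelisable).
    return {key: sum(values) for key, values in groups.items()}

records = ["open data architecture", "open formats"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # → {'open': 2, 'data': 1, 'architecture': 1, 'formats': 1}
```

Because each reducer only ever sees the values for its own key, the reduce work can be spread over as many machines as there are keys - the property Hadoop exploited over commodity clusters.<br />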
Between 2015 and 2020, however, the<br />
widespread usage of the public cloud altered<br />
this approach, enabling the separation of<br />
compute and storage. Cloud data vendors like<br />
AWS and Snowflake facilitated this separation<br />
in cloud warehouses, enhancing scalability and<br />
efficiency. Nevertheless, data still had to be<br />
ingested, loaded, and duplicated into a single<br />
proprietary system, which was attached to a<br />
solitary query engine. Employing multiple<br />
databases or data warehouses necessitated the<br />
storage of multiple data copies. Moreover,<br />
companies were still charged for transferring<br />
their data into and out of the proprietary<br />
system, which resulted in excessive costs.<br />
Enter more contemporary and open data<br />
architecture, where data exists as an<br />
independent layer. Its hallmark is a<br />
clear division between data and compute. Data<br />
is stored in open-source file formats and table<br />
formats and accessed by decoupled and elastic<br />
compute engines. Consequently, different<br />
engines can access the same data in a loosely<br />
coupled architecture. In these architectures, data<br />
is stored as its own independent tier in<br />
open formats within the company's cloud<br />
account and made accessible to downstream<br />
consumers through various services.<br />
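As a toy illustration of that decoupling - using CSV as a stand-in for real open formats such as Parquet or Apache Iceberg, and an in-memory string in place of object storage - the sketch below writes the data once and lets two independent "engines" query the same bytes, with no ingestion step and no second copy. All names here are invented for the example.<br />

```python
import csv
import io
import statistics

# The "storage tier": data written once, in an open format any engine can parse.
open_format_data = io.StringIO()
writer = csv.writer(open_format_data)
writer.writerows([["region", "revenue"], ["EMEA", "120"], ["APAC", "80"]])

def engine_a_total(raw):
    # One compute engine: sums revenue straight from the shared data.
    rows = list(csv.reader(io.StringIO(raw)))[1:]  # skip header row
    return sum(int(r[1]) for r in rows)

def engine_b_mean(raw):
    # A different engine reads the *same* bytes - no duplicate copy, no ETL.
    rows = list(csv.reader(io.StringIO(raw)))[1:]
    return statistics.mean(int(r[1]) for r in rows)

shared = open_format_data.getvalue()
print(engine_a_total(shared), engine_b_mean(shared))
```

Swapping either engine for a different one requires no change to the storage tier - the property the article attributes to open file and table formats.<br />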
This transformation parallels the shift in<br />
applications from monolithic architectures to<br />
microservices. A comparable transition is<br />
presently occurring in data analytics, with<br />
companies migrating from proprietary data<br />
warehouses and ceaseless ETL (Extract,<br />
Transform, Load) processes to open data<br />
architectures like cloud data lakes and<br />
lakehouses.<br />
SEPARATING COMPUTE AND STORAGE<br />
FOR EFFICIENCY<br />
Over the years, the industry has discussed<br />
the separation of compute from storage at<br />
length, primarily because of the efficiency<br />
it brings. That separation delivered<br />
several advantages.<br />
Firstly, the reduction in raw storage costs was<br />
so significant that they practically disappeared<br />
from IT budget spreadsheets. Secondly,<br />
compute costs became segregated, leading to<br />
customers paying only for what they utilised<br />
during data processing, which lowered overall<br />
expenses. Lastly, the independent scalability of<br />
both storage and compute facilitated on-demand,<br />
elastic resource provisioning, adding<br />
flexibility to architecture designs.<br />
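Those advantages can be made concrete with some back-of-envelope arithmetic; the unit prices below are invented purely for illustration and are not real vendor pricing.<br />

```python
# Illustrative (invented) unit prices: cheap object storage, metered compute.
STORAGE_PER_TB_MONTH = 23.0   # USD per TB-month of object storage (assumed)
COMPUTE_PER_NODE_HOUR = 2.50  # USD per compute node-hour (assumed)

def coupled_monthly_cost(tb_stored, nodes):
    # Coupled systems: compute runs 24/7 because it is welded to the storage.
    return (tb_stored * STORAGE_PER_TB_MONTH
            + nodes * COMPUTE_PER_NODE_HOUR * 24 * 30)

def decoupled_monthly_cost(tb_stored, nodes, busy_hours):
    # Decoupled systems: pay for compute only while queries actually run.
    return (tb_stored * STORAGE_PER_TB_MONTH
            + nodes * COMPUTE_PER_NODE_HOUR * busy_hours)

coupled = coupled_monthly_cost(100, 8)           # 100 TB, 8 always-on nodes
decoupled = decoupled_monthly_cost(100, 8, 120)  # same cluster, busy 4 h/day
print(round(coupled), round(decoupled))  # → 16700 4700
```

With compute billed only for the roughly 120 busy hours in the month, the illustrative bill drops from about $16,700 to $4,700 even though the storage line is unchanged - the "pay only for what you use" effect described above.<br />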
However, these changes took time to<br />
materialise. Expensive Storage Area Networks<br />
(SANs) and less costly but often complex<br />
Network Attached Storage (NAS) systems have<br />
existed for quite a while. Both storage models<br />
were limited due to administrative and<br />
16 | STORAGE MAGAZINE | Nov/Dec 2023 | @STMagAndAwards | www.storagemagazine.co.uk