
MANAGEMENT: DATA ARCHITECTURE

EMBRACING OPEN DATA ARCHITECTURE

MATT PEACHEY, VICE PRESIDENT, INTERNATIONAL AT DREMIO, ARGUES THAT OPEN IS THE SMART WAY FORWARD FOR DATA MANAGEMENT

For the past few decades, data has propelled business operations. Whether an organisation offers tangible goods or intangible services, crucial information about partners, workforce, processes, and clients forms the backbone of a company's wellbeing. Data storage sits at the heart of any computing system, so the choice of storage solution significantly affects how efficiently an organisation's network and accompanying infrastructure serve the business requirements.

The primary expectation of a data storage system is to keep valuable data safe while allowing users and applications to retrieve it seamlessly and swiftly when required. However, with data volumes growing exponentially and rarely being deleted, businesses have responded by simply adding more storage capacity.

The issue deepens when data warehouse vendors store data in a proprietary format. Data gets locked into the platform, making it difficult and costly to extract if and when a business wants to. Further, maintaining and troubleshooting these systems often requires teams with deep subject matter expertise in the ecosystem - an expensive outlay.

Given the multitude of data storage alternatives and system setups, organisations can get dragged down a rabbit hole while adding ever more data to their systems - a very inefficient approach. Instead, organisations should embrace open-source standards, technologies and formats to ensure fast and cost-effective analytics with the best engine for each workload. This provides the agility to innovate with the next wave of technology without draining resources or time.

EVOLUTION OF DATA ARCHITECTURE

Previously, companies depended on conventional databases or warehouses for their Business Intelligence (BI) demands. However, these systems presented certain difficulties. The typical data warehouse setup required investing in expensive on-premises hardware, maintaining structured data in proprietary formats, and depending on a centralised IT and data department for analysis. Other obstacles included technical interoperability, system orchestration and, more critically, scalability.

However, things changed in 2006 with the launch of Hadoop, built on the MapReduce paradigm, which processes enormous data sets in parallel across large clusters of commodity hardware. This framework made it practical to handle vast datasets distributed over computer clusters, making it immensely appealing for businesses accumulating more data with each passing day. Still, databases like Teradata and Oracle encapsulated storage, computation, and data within a single, interconnected system, offering no separation of compute and storage components.
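To make the paradigm concrete, here is a minimal sketch of the map-shuffle-reduce flow as a word count in plain Python. It is illustrative only: it uses no Hadoop APIs, and the cluster-wide distribution and shuffle that Hadoop performs are simulated in-process.

```python
from collections import defaultdict

# Map phase: each "node" turns its chunk of input into (key, value) pairs.
def map_phase(chunk: str):
    for word in chunk.split():
        yield (word.lower(), 1)

# Shuffle: group intermediate pairs by key (Hadoop does this across the cluster).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: combine all values for a key into a single result.
def reduce_phase(key, values):
    return (key, sum(values))

if __name__ == "__main__":
    # Each string stands in for a data block processed on a separate node.
    chunks = ["data drives business", "open data beats closed data"]
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts)  # {'data': 3, 'drives': 1, 'business': 1, 'open': 1, 'beats': 1, 'closed': 1}
```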

Between 2015 and 2020, however, the widespread usage of the public cloud altered this approach, enabling the separation of compute and storage. Cloud data vendors like AWS and Snowflake facilitated this separation in cloud warehouses, enhancing scalability and efficiency. Nevertheless, data still had to be ingested, loaded, and duplicated into a single proprietary system, which was attached to a solitary query engine. Employing multiple databases or data warehouses necessitated storing multiple copies of the data. Moreover, companies were still charged for transferring their data into and out of the proprietary system, which resulted in excessive costs.

Enter more contemporary and open data architecture, where data exists as an independent layer. Its hallmark is a clear division between data and compute: data is stored in open-source file and table formats and accessed by decoupled, elastic compute engines, so different engines can work with the same data in a loosely coupled architecture. In these architectures, data is stored as its own independent tier in open formats within the company's cloud account and made accessible to downstream consumers through various services.
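A minimal sketch of that decoupling, assuming a local Parquet file standing in for cloud object storage: the data is written once in an open format and then read by two independent engines, pyarrow and DuckDB, with no ingestion or copying. The file name and columns are invented for illustration; a production lakehouse would typically add a table format such as Apache Iceberg on top.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Write the data once, in an open file format, to a storage location
# (a local path here; in practice an object store such as S3).
table = pa.table({"customer": ["acme", "globex", "initech"],
                  "spend": [1200, 340, 975]})
pq.write_table(table, "customers.parquet")

# Engine 1: pyarrow reads the file directly - no load step, no copy.
arrow_result = pq.read_table("customers.parquet")
print(arrow_result.num_rows)  # 3

# Engine 2: DuckDB queries the very same file with SQL.
sql_result = duckdb.sql(
    "SELECT customer, spend FROM 'customers.parquet' WHERE spend > 500"
).fetchall()
print(sql_result)  # [('acme', 1200), ('initech', 975)]
```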

This transformation parallels the shift in applications from monolithic architectures to microservices. A comparable transition is presently occurring in data analytics, with companies migrating from proprietary data warehouses and ceaseless ETL (Extract, Transform, Load) processes to open data architectures like cloud data lakes and lakehouses.

SEPARATING COMPUTE AND STORAGE FOR EFFICIENCY

Over the years, the industry has discussed the detachment of compute from storage at length, primarily because of its contribution to efficiency, and the separation has delivered several advantages.

Firstly, the reduction in raw storage costs was so significant that they practically disappeared from IT budget spreadsheets. Secondly, compute costs became segregated, so customers paid only for what they utilised during data processing, which lowered overall expenses. Lastly, the independent scalability of both storage and compute enabled on-demand, elastic resource provisioning, adding flexibility to architecture designs.
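As a back-of-the-envelope illustration of that billing model, the sketch below uses entirely hypothetical prices (real cloud pricing varies widely): storage is billed continuously at rest, while compute is billed only for the hours the engines actually run.

```python
# Hypothetical prices for illustration only; real cloud pricing varies.
STORAGE_PER_TB_MONTH = 20.0   # object storage, USD per TB per month
COMPUTE_PER_NODE_HOUR = 2.50  # on-demand query engine node, USD per hour

def monthly_cost(storage_tb: float, compute_node_hours: float) -> float:
    """With compute and storage decoupled, the two are billed independently."""
    return (storage_tb * STORAGE_PER_TB_MONTH
            + compute_node_hours * COMPUTE_PER_NODE_HOUR)

# 100 TB kept cheaply at rest; a 4-node cluster runs only 8 hours a day
# for 22 working days, instead of being provisioned around the clock.
print(monthly_cost(storage_tb=100, compute_node_hours=4 * 8 * 22))  # 3760.0
```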

However, these changes took time to materialise. Expensive Storage Area Networks (SANs) and less costly but often complex Network Attached Storage (NAS) systems have existed for quite a while. Both storage models were limited due to administrative and
