Big Data Frameworks


Presented by Cuelogic Technologies

THINK ABOUT IT

Implementation of Big Data infrastructure and technology can be seen in various industries like banking, retail, insurance, healthcare, media, etc.

Data management functions like storage, sorting, processing, and analysis for such colossal volumes cannot be handled by the existing database systems or technologies.

There are many frameworks in this space at present. Some of the popular ones are Spark, Hadoop, Hive, and Storm.

Some, like Presto, score high on the utility index, while frameworks like Flink have great potential.

Still others deserve a mention, such as Samza, Impala, and Apache Pig.

Some of these frameworks have been briefly discussed below.

Apache Hadoop

Hadoop is a Java-based platform created by Mike Cafarella and Doug Cutting.

This open-source framework provides batch data processing as well as data storage services across a group of hardware machines arranged in clusters.

Hadoop consists of multiple layers, such as HDFS and YARN, that work together to carry out data processing.

HDFS (Hadoop Distributed File System) is the storage layer that coordinates data replication and storage across the nodes of a cluster. In the event of a node failure, the data can still be made available for processing because replicas are held on other nodes.

YARN (Yet Another Resource Negotiator) is the layer responsible for resource management and job scheduling.

MapReduce is the software layer that functions as the batch processing engine.
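The batch model behind MapReduce can be illustrated with a word count. The following is a minimal single-process sketch in plain Python, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a real cluster the map and reduce steps would run on many nodes, with the framework handling the grouping in between.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data frameworks", "big data tools"])` counts "big" and "data" twice each and the remaining words once.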

Pros

Cost-effective solution, high throughput, multi-language support, compatibility with most emerging technologies in Big Data services, high scalability, fault tolerance, better suited for R&D, high availability through an excellent failure handling mechanism.

Cons

Vulnerability to security breaches, no in-memory computation and hence processing overheads, not suited for stream processing and real-time processing, issues in processing small files in large numbers.

Apache Spark

Spark is a batch processing framework with enhanced stream processing capabilities.

With full in-memory computation and processing optimisation, it promises a lightning-fast cluster computing system.

The Spark framework is composed of five layers.

HDFS and HBase: They form the first layer of data storage systems.

YARN and Mesos: They form the resource management layer.

Core engine: This forms the third layer.

Library: This forms the fourth layer, containing Spark SQL for SQL queries, Spark Streaming for stream processing, GraphX for processing graph data, SparkR utilities, and MLlib for machine learning algorithms.

API: The fifth layer contains application programming interfaces for languages such as Java or Scala.
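The benefit of in-memory computation can be sketched without any Spark API. The plain-Python example below is an illustration under stated assumptions: `CountingSource` is a hypothetical stand-in for a dataset on disk, and the two functions contrast re-reading that dataset on every pass with reading it once and reusing it from memory.

```python
class CountingSource:
    """Hypothetical stand-in for a dataset on disk; counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._records)


def iterate_from_disk(source, passes):
    """Re-read the source on every pass, as a disk-based batch job would."""
    total = 0
    for _ in range(passes):
        total += sum(source.read())
    return total


def iterate_in_memory(source, passes):
    """Read once, keep the records in memory, and reuse them on each pass."""
    cached = source.read()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total
```

Both functions return the same total, but the in-memory variant touches the source only once regardless of the number of passes, which is the kind of saving iterative workloads benefit from.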

Pros

Scalability, lightning-fast processing speeds through a reduced number of I/O operations to disk, fault tolerance, support for advanced analytics applications with superior AI implementation, and seamless integration with Hadoop.

Cons

Complexity of setup and implementation, limited language support, not a genuine streaming engine.

Pros

Ease of setup and operation, high scalability, good speed, fault tolerance, support for a wide range of languages.

Cons

Complex implementation, debugging issues, and not very learner-friendly.

Apache Flink

The Flink system contains multiple layers:

Deploy Layer

Runtime Layer

Library Layer

Pros

Low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, compatibility with Hadoop.

Cons

A few scalability issues.
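The "entry by entry processing" listed among the pros can be contrasted with a batched approach. The sketch below is plain Python, not Flink code, and both function names are hypothetical. It assumes a micro-batch design as the point of comparison: one function emits an updated result after every event, the other only after a buffer of events has filled.

```python
def per_entry(events):
    """Emit a running total after every single event."""
    total = 0
    for event in events:
        total += event
        yield total


def micro_batch(events, batch_size):
    """Buffer events and emit a running total once per batch."""
    total = 0
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            total += sum(batch)
            batch = []
            yield total
    if batch:  # flush the final partial batch
        total += sum(batch)
        yield total
```

For the events `[1, 2, 3, 4, 5]`, `per_entry` yields five results (1, 3, 6, 10, 15), while `micro_batch` with a batch size of 2 yields three (3, 10, 15). The final answer is the same, but the entry-by-entry version makes each intermediate result available sooner, which is where the lower latency comes from.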

Hive

Apache Hive, originally developed at Facebook, is an ETL (Extract/Transform/Load) and data warehousing system. It is built on top of Hadoop and HDFS.

The key components of the Hive Architecture include

Hive Clients

Hive Services

Hive Storage and Computing

The Hive engine converts SQL queries or requests into MapReduce task chains. The engine comprises:

Parser: It goes through the incoming SQL requests and sorts them.

Optimizer: It goes through the sorted requests and optimises them.

Executor: It sends tasks to the MapReduce framework.
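The parser, optimizer, and executor stages can be sketched as three small functions. This is a toy illustration in plain Python, not Hive's implementation: it accepts only one hypothetical query shape (`SELECT col, COUNT(*) FROM table GROUP BY col`), the plan is a simple dictionary, and the "optimisation" shown (pre-aggregating within each input split) is an assumption chosen to keep the example short.

```python
import re
from collections import Counter

# Only this one query shape is supported by the sketch.
QUERY_PATTERN = re.compile(
    r"SELECT\s+(\w+)\s*,\s*COUNT\(\*\)\s+FROM\s+(\w+)\s+GROUP\s+BY\s+(\w+)\s*;?\s*$",
    re.IGNORECASE,
)


def parse(sql):
    """Parser: turn the SQL text into a small plan dictionary."""
    match = QUERY_PATTERN.match(sql.strip())
    if not match or match.group(1).lower() != match.group(3).lower():
        raise ValueError("unsupported query for this sketch")
    return {"table": match.group(2), "group_by": match.group(1), "combine": False}


def optimize(plan):
    """Optimizer: ask for per-split pre-aggregation before the reduce step."""
    return {**plan, "combine": True}


def execute(plan, tables):
    """Executor: run the plan as a map step per split, then a reduce step."""
    column = plan["group_by"]
    partials = []
    for split in tables[plan["table"]]:  # each split is a list of row dicts
        keys = [row[column] for row in split]  # map: emit the grouping key
        if plan["combine"]:
            partials.append(Counter(keys))  # one partial count per split
        else:
            partials.extend(Counter([key]) for key in keys)
    result = Counter()
    for partial in partials:  # reduce: merge the partial counts
        result.update(partial)
    return dict(result)
```

With `tables = {"visits": [[{"country": "IN"}, {"country": "US"}], [{"country": "IN"}]]}`, running `execute(optimize(parse("SELECT country, COUNT(*) FROM visits GROUP BY country")), tables)` counts two rows for "IN" and one for "US". Skipping `optimize` gives the same answer with more intermediate records passed to the reduce step.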


Pros

Minimal query degradation even under an increased concurrent query workload, a query execution rate three times faster than Hive, ease in adding images and embedding links, highly user-friendly.

Cons

Reliability issues.

Impala

Impala is an open-source MPP (Massively Parallel Processing) query engine that runs on multiple systems in a Hadoop cluster.

It is written in C++ and Java.
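The MPP idea of running the same query on multiple systems can be sketched as scatter and gather. The example below is plain Python, not Impala code: threads stand in for cluster nodes, each list stands in for the data local to one node, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fragment(partition, predicate):
    """Work done by one node: scan its local partition, return a partial result."""
    matching = [row for row in partition if predicate(row)]
    return len(matching), sum(matching)


def run_query(partitions, predicate):
    """Coordinator: send the same fragment to every node, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda part: run_fragment(part, predicate), partitions))
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return {"count": count, "sum": total}
```

For example, `run_query([[1, 5, 9], [2, 6], [7, 8]], lambda x: x > 4)` scans the three partitions independently and merges the partial counts and sums into a single answer. Each node only reads its own data, so nothing has to be moved before the scan.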

Pros

Support for in-memory computation, so data is accessed directly on Hadoop nodes without being moved; smooth integration with BI tools like Tableau, ZoomData, etc.; support for a wide range of file formats.

Cons

No support for serialisation and deserialisation of data, inability to read custom binary files, table refresh needed for every record addition.

Contact Us

+1 347 374 8437

info@cuelogic.com

https://www.cuelogic.com/

Unit 610, 134 W 29th St,

New York, NY 10001

Content Source: Cuelogic Blog
