Big Data Frameworks
Implementation of Big Data infrastructure and technology can be seen in various industries like banking, retail, insurance, healthcare, media, etc. Big Data management functions like storage, sorting, processing and analysis for such colossal volumes cannot be handled by existing database systems or technologies. Frameworks come into the picture in such scenarios. Frameworks are toolsets that offer innovative, cost-effective solutions to the problems posed by Big Data processing, help in providing insights, incorporate metadata and aid decision making aligned to business needs.
Presented by Cuelogic Technologies
Many frameworks currently exist in this space. Some of the most popular are Spark, Hadoop, Hive and Storm. Some, like Presto, score high on utility, while frameworks like Flink show great potential. Still others deserve mention, such as Samza, Impala and Apache Pig.
Some of these frameworks are briefly discussed below.
Apache Hadoop
Hadoop is a Java-based platform created by Mike Cafarella and Doug Cutting.
This open-source framework provides batch data processing as well
as data storage services across a group of hardware machines
arranged in clusters.
Hadoop consists of multiple layers like HDFS and YARN that work
together to carry out data processing.
HDFS (Hadoop Distributed File System) is the storage layer that coordinates data replication and storage activities across the cluster's nodes. In the event of a node failure, data can still be made available for processing.
YARN (Yet Another Resource Negotiator) is the layer responsible
for resource management and job scheduling.
MapReduce is the software layer that functions as the batch
processing engine.
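The map-shuffle-reduce flow that MapReduce applies across a cluster can be illustrated with a single-process word count. This is a conceptual sketch in plain Python, not Hadoop's actual Java API; the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate (here, sum) the values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data frameworks", "big data processing"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
```

In real Hadoop, the map and reduce stages run as parallel tasks on different cluster nodes and the shuffle moves data over the network; the logical steps are the same.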
Pros
Include cost-effective solution, high throughput, multi-language support, compatibility with most emerging technologies in Big Data services, high scalability, fault tolerance, better suited for R&D, and high availability through an excellent failure-handling mechanism.
Cons
Include vulnerability to security breaches; does not perform in-memory computation, hence suffers processing overheads; not suited for stream processing and real-time processing; issues in processing small files in large numbers.
Apache Spark
Spark is a batch processing framework with enhanced stream processing capabilities.
With full in-memory computation and processing optimisation, it
promises a lightning fast cluster computing system.
Spark framework is composed of five layers.
HDFS and HBASE: They form the first layer of data storage
systems.
YARN and Mesos: They form the resource management layer.
Core engine: This forms the third layer.
Library: This forms the fourth layer, containing Spark SQL for SQL queries, Spark Streaming for stream processing, GraphX and SparkR utilities for processing graph data, and MLlib for machine learning algorithms.
The fifth layer contains the application program interfaces, such as Java or Scala.
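Spark's in-memory speed comes partly from lazy evaluation: transformations are only recorded, and nothing executes until an action is called, so the whole pipeline can run in one optimised pass. A toy stand-in in plain Python can illustrate the idea (the `MiniRDD` class is hypothetical, not Spark's actual API):

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: transformations are recorded lazily;
    nothing runs until an action is called."""
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    def map(self, fn):        # transformation: recorded, not executed
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):   # transformation: recorded, not executed
        return MiniRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):        # action: triggers evaluation of the pipeline
        result = iter(self._data)
        for kind, fn in self._ops:
            result = map(fn, result) if kind == "map" else filter(fn, result)
        return list(result)

pipeline = MiniRDD(range(6)).map(lambda x: x * x).filter(lambda x: x > 4)
# No work has happened yet; collect() runs the chained steps in one pass.
out = pipeline.collect()
```

In real Spark the same deferred plan is distributed across the cluster and kept in memory between stages, which is what avoids the repeated disk I/O of MapReduce.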
Pros
Include high scalability, lightning-fast processing speeds through a reduced number of I/O operations to disk, fault tolerance, support for advanced analytics applications with superior AI implementation, and seamless integration with Hadoop.
Cons
Include complexity of setup and implementation, language support limitations, and not being a genuine streaming engine.
Apache Storm
Storm is a distributed, open-source framework for real-time stream processing.
Pros
Include ease in setup and operation, high scalability, good speed, fault tolerance, and support for a wide range of languages.
Cons
Include complex implementation, debugging issues, and not being very learner-friendly.
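Storm structures a streaming job as a topology that wires spouts (tuple sources) to bolts (processing steps). The dataflow can be sketched with generators in plain Python; this is a conceptual illustration, not Storm's actual API, and the function names are hypothetical.

```python
def sentence_spout():
    """Spout: emits a stream of tuples (here, sentences)."""
    for sentence in ["storm processes streams", "one tuple at a time"]:
        yield sentence

def split_bolt(stream):
    """Bolt: splits each incoming sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: keeps a running count per word as tuples arrive."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Topology: spout -> split bolt -> count bolt
word_counts = count_bolt(split_bolt(sentence_spout()))
```

In a real Storm topology each spout and bolt runs as parallel tasks across the cluster, and tuples flow between them continuously rather than in a finite batch.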
Apache Flink
Flink is an open-source framework for both stream and batch data processing. The Flink system contains multiple layers:
Deploy Layer
Runtime Layer
Library Layer
Pros
Include low latency, high throughput, fault tolerance, entry-by-entry processing, ease of batch and stream data processing, and compatibility with Hadoop.
Cons
Include a few scalability issues.
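The entry-by-entry processing noted above is typically combined with windowing: events are processed one at a time, and an aggregate is emitted whenever a window closes. A minimal sketch of a tumbling count window in plain Python (a conceptual illustration, not Flink's DataStream API):

```python
def tumbling_count_window(stream, size):
    """Consume an event stream one entry at a time, grouping events into
    fixed-size windows and emitting one aggregate (a sum) per window."""
    window = []
    for event in stream:
        window.append(event)          # each entry is handled as it arrives
        if len(window) == size:       # window is full: emit and reset
            yield sum(window)
            window = []
    if window:                        # flush the final partial window
        yield sum(window)

sums = list(tumbling_count_window(iter([1, 2, 3, 4, 5, 6, 7]), size=3))
```

Flink additionally supports time-based and sliding windows, event-time semantics and fault-tolerant state, but the per-entry accumulate-then-emit pattern is the same.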
Hive
Apache Hive, designed by Facebook, is an ETL (Extract/Transform/Load) and data warehousing system. It is built on top of the Hadoop HDFS platform.
The key components of the Hive Architecture include
Hive Clients
Hive Services
Hive Storage and Computing
The Hive engine converts SQL queries or requests into chains of MapReduce tasks. The engine comprises:
Parser: It goes through the incoming SQL requests and sorts them
Optimizer: It goes through the sorted requests and optimises them
Executor: It sends the tasks to the MapReduce framework
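The Parser, Optimizer and Executor chain can be sketched as a three-stage pipeline. This is a deliberately simplified, hypothetical illustration in plain Python: it handles only a toy `SELECT column FROM table` shape, whereas Hive parses full HiveQL and compiles it into MapReduce jobs.

```python
def parse(query):
    """Parser: extract the column and table from a toy SELECT query."""
    tokens = query.strip().rstrip(";").split()
    return {"column": tokens[1], "table": tokens[3]}

def optimize(plan):
    """Optimizer: normalise identifiers (real Hive rewrites the whole
    logical plan, e.g. pushing down filters)."""
    return {key: value.lower() for key, value in plan.items()}

def execute(plan, tables):
    """Executor: run the plan over the table's rows (standing in for
    dispatching MapReduce tasks)."""
    return [row[plan["column"]] for row in tables[plan["table"]]]

tables = {"users": [{"name": "ada"}, {"name": "linus"}]}
names = execute(optimize(parse("SELECT name FROM Users;")), tables)
```

Each stage hands a transformed plan to the next, which mirrors how Hive moves a query from syntax tree to optimised plan to executable tasks.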
Presto
Presto is an open-source distributed SQL query engine.
Pros
Include minimal query degradation even in the event of increased concurrent query workload, a query execution rate roughly three times faster than Hive, and high user-friendliness.
Cons
Include reliability issues.
Impala
Impala is an open-source MPP (Massive Parallel Processing) query
engine that runs on multiple systems under a Hadoop cluster.
It has been written in C++ and Java.
Pros
Include support for in-memory computation, hence it accesses data directly from Hadoop nodes without data movement; smooth integration with BI tools like Tableau, ZoomData, etc.; and support for a wide range of file formats.
Cons
Include no support for serialisation and deserialisation of data, inability to read custom binary files, and the need for a table refresh for every record addition.
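An MPP engine like Impala answers one query by scattering work across many nodes, having each compute a partial result over its own slice of the data, then gathering the partials at a coordinator. A single-process sketch of that scatter/gather pattern in plain Python (the function names are hypothetical, not Impala's API):

```python
def scatter(rows, n_nodes):
    """Scatter: partition the table's rows across the cluster's nodes."""
    return [rows[i::n_nodes] for i in range(n_nodes)]

def node_partial_sum(partition, column):
    """Each node aggregates over only its own partition, in parallel."""
    return sum(row[column] for row in partition)

def gather(partials):
    """Coordinator: merge the per-node partials into the final answer."""
    return sum(partials)

rows = [{"amount": a} for a in [10, 20, 30, 40, 50]]
partials = [node_partial_sum(p, "amount") for p in scatter(rows, n_nodes=3)]
total = gather(partials)
```

Because each node reads and aggregates its local data, the engine avoids shipping raw rows across the network, which is the source of MPP systems' speed on large aggregations.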
Contact Us
+1 347 374 8437
info@cuelogic.com
https://www.cuelogic.com/
Unit 610, 134 W 29th St,
New York, NY 10001
Content Source: Cuelogic Blog